xpark.dataset.TextFlaggedWordRatio

class xpark.dataset.TextFlaggedWordRatio(asset: str = 'data_juicer/flagged_words', asset_label: str = 'en', custom_flagged_words_list: List[str] | None = None, tokenizer: Literal['cjk', 'space'] = 'cjk', use_words_aug: bool = False, words_aug_group_sizes: List[Annotated[int, Gt(gt=0)]] | None = None, words_aug_join_char: str = '')

Compute the ratio of flagged words in a text.

Tokenizes the input text and calculates the proportion of flagged words relative to the total word count. Supports CJK and space-based tokenization, as well as word-bag augmentation (combining adjacent tokens into new candidates).
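As a rough illustration of the quantity described above, the ratio can be sketched in plain Python for whitespace tokenization. This is not the library's implementation: the helper name flagged_word_ratio is hypothetical, and both exact-match comparison and the 0.0 result for empty input are assumptions.

```python
from typing import Iterable, List


def flagged_word_ratio(text: str, flagged_words: Iterable[str]) -> float:
    """Hypothetical helper: share of whitespace-separated tokens that are flagged."""
    flagged = set(flagged_words)
    tokens: List[str] = text.split()  # whitespace-only splitting, like tokenizer="space"
    if not tokens:
        return 0.0  # assumption: empty text is given a ratio of 0.0
    # Assumption: tokens are compared exactly, with no lowercasing or stripping.
    hits = sum(1 for token in tokens if token in flagged)
    return hits / len(tokens)


print(flagged_word_ratio("This is a bad text", ["bad"]))  # 0.2 (1 of 5 tokens)
print(flagged_word_ratio("Hello world", ["bad"]))         # 0.0
```

The real processor may normalize tokens or handle edge cases differently, so treat the numbers as an illustration of the formula only.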

Parameters:

asset – The asset to use for flagged words. Defaults to "data_juicer/flagged_words". Available assets: ["data_juicer/flagged_words"].
asset_label – The label of the asset to use. Defaults to "en". Supported labels: ar, ca, cs, da, de, en, eo, es, eu, fa, fi, fil, fr, fr-CA-u-sd-caqc, ha, hi, hu, id, it, ja, kab, kn, ko, ml, mr, nl, no, pl, pt, ru, sv, ta, te, th, tlh, tr, vi, zh. Pass "all" to use the aggregated flagged words across all languages.
custom_flagged_words_list – A user-supplied list of flagged words. When this list is non-empty, the asset and asset_label parameters are ignored.
tokenizer – Tokenization strategy. Supports "cjk" (mixed CJK + whitespace splitting) and "space" (whitespace-only splitting). Defaults to "cjk". Support for "jieba" may be added in the future.
use_words_aug – Whether to enable word-bag augmentation. When enabled, adjacent tokens are joined using words_aug_join_char for each window size in words_aug_group_sizes and added to the candidate set. Defaults to False.
words_aug_group_sizes – Window sizes used for word-bag augmentation; each size must be greater than 0. The signature default is None, and [2] is used when no value is given.
words_aug_join_char – Character used to join adjacent tokens during augmentation. Defaults to "".
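The word-bag augmentation controlled by use_words_aug, words_aug_group_sizes, and words_aug_join_char can be pictured as a sliding window over the token list. The following is a minimal sketch under that reading, not the library's code: the helper name augment_words is hypothetical, and how the augmented candidates enter the ratio's numerator and denominator is not stated here.

```python
from typing import List, Sequence


def augment_words(
    tokens: Sequence[str],
    group_sizes: Sequence[int] = (2,),
    join_char: str = "",
) -> List[str]:
    """Hypothetical helper: original tokens plus joined windows of adjacent tokens."""
    candidates = list(tokens)
    for size in group_sizes:
        # Slide a window of `size` adjacent tokens across the sequence.
        for start in range(len(tokens) - size + 1):
            candidates.append(join_char.join(tokens[start:start + size]))
    return candidates


print(augment_words(["bad", "word", "here"], group_sizes=[2], join_char=""))
# ['bad', 'word', 'here', 'badword', 'wordhere']
```

With an empty join character, adjacent tokens are concatenated directly, which is useful for languages where a flagged term can be split across tokens. A window size larger than the token count adds no candidates.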

Examples

from xpark.dataset.expressions import col
from xpark.dataset import from_items
from xpark.dataset.processors.text_flagged_word_ratio import TextFlaggedWordRatio

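# Build a two-row dataset; the values are read back through the "item" column.
ds = from_items(["This is a bad text", "Hello world"])
# Add a "flagged_ratio" column computed from each row's text.
ds = ds.with_column(
    "flagged_ratio",
    TextFlaggedWordRatio(
        # A non-empty custom list takes precedence over asset and asset_label.
        custom_flagged_words_list=["bad"],
        tokenizer="space",  # whitespace-only splitting
    )
    .options(num_workers={"CPU": 1})
    .with_column(col("item")),
)
print(ds.take(2))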

Methods

`__call__`(texts)	Call self as a function.
`options`(**kwargs)
`with_column`(texts)

__call__(texts: pa.ChunkedArray) → pa.Array
Call self as a function.

options(**kwargs: Unpack[ExprUDFOptions]) → Self

with_column(texts: pa.ChunkedArray) → pa.Array