xpark.dataset.TextFlaggedWordRatio
- class xpark.dataset.TextFlaggedWordRatio(asset: str = 'data_juicer/flagged_words', asset_label: str = 'en', custom_flagged_words_list: List[str] | None = None, tokenizer: Literal['cjk', 'space'] = 'cjk', use_words_aug: bool = False, words_aug_group_sizes: List[Annotated[int, Gt(gt=0)]] | None = None, words_aug_join_char: str = '')
Compute the ratio of flagged words in a text.
Tokenizes the input text and calculates the proportion of flagged words relative to the total word count. Supports CJK and space-based tokenization, as well as word-bag augmentation (combining adjacent tokens into new candidates).
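The ratio described above can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: the helper name `flagged_word_ratio` is hypothetical, and the choices to split on whitespace only, to skip any case or punctuation normalization, and to return 0.0 for empty input are assumptions made for illustration.

```python
from typing import Iterable


def flagged_word_ratio(text: str, flagged_words: Iterable[str]) -> float:
    """Illustrative only: fraction of whitespace tokens found in flagged_words."""
    flagged = set(flagged_words)
    tokens = text.split()  # assumption: plain whitespace splitting, no normalization
    if not tokens:
        return 0.0  # assumption: empty input yields 0.0 rather than an error
    hits = sum(1 for token in tokens if token in flagged)
    return hits / len(tokens)


print(flagged_word_ratio("This is a bad text", ["bad"]))  # 0.2 (1 of 5 tokens)
print(flagged_word_ratio("Hello world", ["bad"]))  # 0.0
```

The values shown are what this sketch returns; the real processor may normalize tokens differently and so produce different ratios.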
- Parameters:
  - asset – The asset to use for flagged words. Defaults to "data_juicer/flagged_words". Available assets: ["data_juicer/flagged_words"].
  - asset_label – The label of the asset to use. Defaults to "en". Supported labels: ar, ca, cs, da, de, en, eo, es, eu, fa, fi, fil, fr, fr-CA-u-sd-caqc, ha, hi, hu, id, it, ja, kab, kn, ko, ml, mr, nl, no, pl, pt, ru, sv, ta, te, th, tlh, tr, vi, zh. Pass "all" to use the aggregated flagged words across all languages.
  - custom_flagged_words_list – A user-supplied list of flagged words. When this list is non-empty, the asset and asset_label parameters are ignored.
  - tokenizer – Tokenization strategy. Supports "cjk" (mixed CJK + whitespace splitting) and "space" (whitespace-only splitting). Defaults to "cjk". Support for "jieba" may be added in the future.
  - use_words_aug – Whether to enable word-bag augmentation. When enabled, adjacent tokens are joined using words_aug_join_char for each window size in words_aug_group_sizes and added to the candidate set. Defaults to False.
  - words_aug_group_sizes – Window sizes used for word-bag augmentation. Defaults to [2].
  - words_aug_join_char – Character used to join adjacent tokens during augmentation. Defaults to "".
Examples
    from xpark.dataset.expressions import col
    from xpark.dataset import from_items
    from xpark.dataset.processors.text_flagged_word_ratio import TextFlaggedWordRatio

    # Build a small dataset from two strings.
    ds = from_items(["This is a bad text", "Hello world"])

    # Add a "flagged_ratio" column computed from the "item" column,
    # using a custom flagged-word list and whitespace tokenization.
    ds = ds.with_column(
        "flagged_ratio",
        TextFlaggedWordRatio(
            custom_flagged_words_list=["bad"],
            tokenizer="space",
        )
        .options(num_workers={"CPU": 1})
        .with_column(col("item")),
    )
    print(ds.take(2))
Methods
  __call__(texts) – Call self as a function.
  options(**kwargs)
  with_column(texts)
- __call__(texts: pa.ChunkedArray) -> pa.Array
Call self as a function.
- options(**kwargs: Unpack[ExprUDFOptions]) -> Self
- with_column(texts: pa.ChunkedArray) -> pa.Array