xpark.dataset.TextLanguageScore
- class xpark.dataset.TextLanguageScore(_local_model: str = 'fasttext/lid.176.bin', lang: str = 'en')
Language score operator based on a fasttext model.
For each input text, returns the probability that the text belongs to the specified lang.
- Parameters:
_local_model – fasttext model name. Default is "fasttext/lid.176.bin". Available models: ["fasttext/lid.176.bin", "fasttext/lid.176.ftz"].
lang – Language code supported by fasttext. For details, see https://fasttext.cc/docs/en/language-identification.html. You can also refer to the ISO 639 standard, e.g. en, zh.
Examples
from xpark.dataset.expressions import col
from xpark.dataset import from_items
from xpark.dataset.processors.text_language_detector import TextLanguageScore

# Build a small dataset with one English and one Chinese text.
ds = from_items(["Hello world", "今天天气很好"])
# Add an "en_score" column holding the probability that each text is English.
ds = ds.with_column("en_score", TextLanguageScore(lang="en").options(num_workers={"CPU": 1}).with_column(col("item")))
print(ds.take(2))
Methods
__call__(batch) – Score each text in the batch for the target language.
options(**kwargs)
with_column(batch) – Score each text in the batch for the target language.
- __call__(batch: pa.ChunkedArray) → pa.Array
Score each text in the batch for the target language.
- Parameters:
batch – A PyArrow ChunkedArray of string values.
- Returns:
A PyArrow float32 Array where each element is the probability that the corresponding text belongs to self.lang (0.0 if that language is not among the top-10 predictions).
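The "0.0 if not in top-10" rule can be illustrated with a minimal sketch. It assumes the operator looks up the target language among fasttext's top-k predictions, whose labels carry the "__label__" prefix. The helper lang_score and the sample predictions are hypothetical and are not part of xpark. The real operator returns a PyArrow float32 Array rather than plain Python floats.

```python
from typing import Sequence

# fasttext prefixes every predicted label with this string, e.g. "__label__en".
LABEL_PREFIX = "__label__"


def lang_score(labels: Sequence[str], probs: Sequence[float], lang: str) -> float:
    """Hypothetical helper: probability of `lang` among the top-k predictions, else 0.0."""
    target = LABEL_PREFIX + lang
    for label, prob in zip(labels, probs):
        if label == target:
            return float(prob)
    # The target language did not appear among the top-k labels.
    return 0.0


# Made-up top-k output for one text, in the (labels, probabilities) shape
# that fasttext's model.predict(text, k=10) returns.
labels = ("__label__en", "__label__de", "__label__nl")
probs = (0.93, 0.04, 0.01)

en_score = lang_score(labels, probs, "en")
zh_score = lang_score(labels, probs, "zh")
print(en_score, zh_score)
```

Under this assumption, a score of 0.0 means only that the language fell outside the top-k list, not that the model assigned it exactly zero probability.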
- options(**kwargs: Unpack[ExprUDFOptions]) → Self
- with_column(batch: pa.ChunkedArray) → pa.Array
Score each text in the batch for the target language.
- Parameters:
batch – A PyArrow ChunkedArray of string values.
- Returns:
A PyArrow float32 Array where each element is the probability that the corresponding text belongs to self.lang (0.0 if that language is not among the top-10 predictions).