xpark.dataset.TextLanguageScore

class xpark.dataset.TextLanguageScore(_local_model: str = 'fasttext/lid.176.bin', lang: str = 'en')

Language score operator based on a fastText language identification model.

For each input text, returns the probability that the text is written in the language given by lang.

Parameters:

_local_model – Name of the fastText model. Default is "fasttext/lid.176.bin". Available models: "fasttext/lid.176.bin", "fasttext/lid.176.ftz".
lang – Language code supported by fastText, e.g. en, zh. Default is "en". For the list of supported languages, see https://fasttext.cc/docs/en/language-identification.html. You can also refer to the ISO 639 standard.

Examples

from xpark.dataset.expressions import col
from xpark.dataset import from_items
from xpark.dataset.processors.text_language_detector import TextLanguageScore

ds = from_items(["Hello world", "今天天气很好"])
ds = ds.with_column("en_score", TextLanguageScore(lang="en").options(num_workers={"CPU": 1}).with_column(col("item")))
print(ds.take(2))

Methods

`__call__`(batch)	Score each text in the batch for the target language.
`options`(**kwargs)
`with_column`(batch)	Score each text in the batch for the target language.

__call__(batch: pa.ChunkedArray) → pa.Array

Score each text in the batch for the target language.

Parameters: batch – A PyArrow ChunkedArray of string values.
Returns: A PyArrow float32 Array where each element is the probability that the corresponding text belongs to self.lang (0.0 if self.lang is not among the top 10 predicted languages).
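
The return rule above (the target language's probability, or 0.0 when it is missing from the top predictions) can be illustrated with a minimal sketch. This is not the operator's actual implementation: score_for_lang and the sample labels and probabilities are hypothetical, and the only assumption about fastText is its usual "__label__" prefix on predicted labels.

```python
from typing import Sequence

# fastText prefixes predicted labels with "__label__" (e.g. "__label__en").
LABEL_PREFIX = "__label__"


def score_for_lang(labels: Sequence[str], probs: Sequence[float], lang: str) -> float:
    """Return the probability assigned to `lang`, or 0.0 if it is not among the labels.

    Hypothetical helper: `labels` and `probs` stand in for the top-k
    predictions a language identification model returns for one text.
    """
    target = LABEL_PREFIX + lang
    for label, prob in zip(labels, probs):
        if label == target:
            return float(prob)
    # The target language was not among the top-k predictions.
    return 0.0


# Made-up top-3 predictions for a single English sentence.
labels = ("__label__en", "__label__de", "__label__nl")
probs = (0.91, 0.05, 0.02)

en_score = score_for_lang(labels, probs, "en")  # language present in the predictions
zh_score = score_for_lang(labels, probs, "zh")  # language absent, falls back to 0.0
print(en_score, zh_score)
```

A score of 0.0 therefore means "not among the top predictions" rather than a measured probability of exactly zero, which is worth keeping in mind when thresholding on the score.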

options(**kwargs: Unpack[ExprUDFOptions]) → Self

with_column(batch: pa.ChunkedArray) → pa.Array

Score each text in the batch for the target language.

Parameters: batch – A PyArrow ChunkedArray of string values.
Returns: A PyArrow float32 Array where each element is the probability that the corresponding text belongs to self.lang (0.0 if self.lang is not among the top 10 predicted languages).