xpark.dataset.TextLanguageScore

class xpark.dataset.TextLanguageScore(_local_model: str = 'fasttext/lid.176.bin', lang: str = 'en')

Language score operator based on a fastText language identification model.

For each input text, returns the probability that the text is written in the language specified by lang.

Parameters:
  • _local_model – fastText model name. Default is "fasttext/lid.176.bin". Available models: "fasttext/lid.176.bin", "fasttext/lid.176.ftz".

  • lang – Language code supported by fastText, e.g. en or zh. Default is "en". For the list of supported codes, see https://fasttext.cc/docs/en/language-identification.html; the ISO 639 standard is also a useful reference.

Examples

from xpark.dataset.expressions import col
from xpark.dataset import from_items
from xpark.dataset.processors.text_language_detector import TextLanguageScore

ds = from_items(["Hello world", "今天天气很好"])
ds = ds.with_column("en_score", TextLanguageScore(lang="en").options(num_workers={"CPU": 1}).with_column(col("item")))
print(ds.take(2))
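
A common follow-up is to keep only the texts whose score clears a threshold. The sketch below shows that step in plain Python. It assumes the rows come back as a list of dicts keyed by column name, which this page does not confirm. The keep_language helper, the 0.5 threshold, and the score values are all hypothetical and are not real model output.

```python
# Hypothetical rows: the real structure returned by ds.take() may differ,
# and these scores are illustrative values, not actual model output.
rows = [
    {"item": "Hello world", "en_score": 0.95},
    {"item": "今天天气很好", "en_score": 0.0},
]


def keep_language(rows, score_key, threshold=0.5):
    """Return the rows whose language score is at least `threshold`."""
    return [row for row in rows if row[score_key] >= threshold]


# Keep only the rows that look English under the assumed threshold.
english_rows = keep_language(rows, "en_score", threshold=0.5)
print([row["item"] for row in english_rows])
```

The right threshold depends on the data. Short or mixed-language texts tend to receive lower scores, so it is worth inspecting a sample before fixing a cutoff.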

Methods

  • __call__(batch) – Score each text in the batch for the target language.

  • options(**kwargs)

  • with_column(batch) – Score each text in the batch for the target language.

__call__(batch: pa.ChunkedArray) → pa.Array

Score each text in the batch for the target language.

Parameters:

batch – A PyArrow ChunkedArray of string values.

Returns:

A PyArrow float32 Array in which each element is the probability that the corresponding text belongs to self.lang. The value is 0.0 if self.lang is not among the model's top 10 predicted languages for that text.
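
The return rule can be illustrated with a minimal sketch of the per-text lookup. This is not the operator's actual implementation. It assumes the model's top predictions arrive as parallel sequences of labels and probabilities, and that labels use fastText's default "__label__" prefix. The score_from_topk helper and the sample predictions are hypothetical.

```python
from typing import Sequence

# fastText prefixes each predicted label with "__label__" by default.
LABEL_PREFIX = "__label__"


def score_from_topk(labels: Sequence[str], probs: Sequence[float], lang: str) -> float:
    """Return the probability of `lang` among the top-k predictions, else 0.0."""
    target = LABEL_PREFIX + lang
    for label, prob in zip(labels, probs):
        if label == target:
            return float(prob)
    # The target language did not appear in the top-k predictions.
    return 0.0


# Hypothetical top-k predictions for a single text (illustrative values only).
labels = ("__label__en", "__label__de", "__label__nl")
probs = (0.93, 0.04, 0.01)

en_score = score_from_topk(labels, probs, "en")
zh_score = score_from_topk(labels, probs, "zh")
print(en_score, zh_score)
```

Under this rule, a score of 0.0 means the language fell outside the top predictions. It does not mean the model assigned it exactly zero probability.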

options(**kwargs: Unpack[ExprUDFOptions]) → Self

with_column(batch: pa.ChunkedArray) → pa.Array

Score each text in the batch for the target language.

Parameters:

batch – A PyArrow ChunkedArray of string values.

Returns:

A PyArrow float32 Array in which each element is the probability that the corresponding text belongs to self.lang. The value is 0.0 if self.lang is not among the model's top 10 predicted languages for that text.