xpark.dataset.TextLanguageDetector

class xpark.dataset.TextLanguageDetector(_local_model: str = 'fasttext/lid.176.bin')

Language detection operator based on a fasttext model.

Identifies the language of each input text and returns the top-1 (highest-scoring) language code. The codes are those supported by the fastText language identification model; for details, see https://fasttext.cc/docs/en/language-identification.html. You can also refer to the ISO 639 standard for the codes, e.g. en, zh.

Parameters: _local_model – Name of the fastText model. Defaults to "fasttext/lid.176.bin". Available models: {AVAILABLE_MODELS}.

Examples

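from xpark.dataset.expressions import col
from xpark.dataset import from_items
from xpark.dataset.processors.text_language_detector import TextLanguageDetector

# Each row is a dict so that the "text" column referenced below exists.
ds = from_items([{"text": "Hello world"}, {"text": "今天阳光明媚"}])
# Run the detector on the "text" column and store the result in a new "lang" column.
ds = ds.with_column("lang", TextLanguageDetector().options(num_workers={"CPU": 1})(col("text")))
print(ds.take(2))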
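
For orientation, the following is a minimal sketch of what "returns the top-1 language" can look like at the fastText level. It is not the operator's actual implementation. fastText language identification models return labels prefixed with `__label__` (for example `__label__en`) together with scores. The helpers `label_to_code`, `top1_language`, and `detect_with_fasttext` are hypothetical names introduced here for illustration, and the model path passed to `detect_with_fasttext` is assumed to point at a locally downloaded `lid.176.bin`.

```python
from typing import Sequence

FASTTEXT_LABEL_PREFIX = "__label__"  # prefix fastText puts on every label


def label_to_code(label: str) -> str:
    """Strip the fastText label prefix, e.g. '__label__en' -> 'en'."""
    if label.startswith(FASTTEXT_LABEL_PREFIX):
        return label[len(FASTTEXT_LABEL_PREFIX):]
    return label


def top1_language(labels: Sequence[str], scores: Sequence[float]) -> str:
    """Return the code of the highest-scoring label (hypothetical helper)."""
    best = max(range(len(labels)), key=lambda i: scores[i])
    return label_to_code(labels[best])


def detect_with_fasttext(model_path: str, text: str) -> str:
    """Assumed usage of the fasttext package; needs a local model file."""
    import fasttext  # imported lazily so the helpers above work without it

    model = fasttext.load_model(model_path)
    # predict() handles one line at a time, so newlines are flattened first.
    labels, scores = model.predict(text.replace("\n", " "), k=1)
    return top1_language(labels, scores)


# Stand-in prediction output, so this runs without a model file.
print(top1_language(["__label__zh", "__label__en"], [0.07, 0.91]))
```

The lazy import keeps the label-handling helpers usable in environments where the `fasttext` package or the model file is not available.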

Methods

`__call__`(batch)	Call self as a function.
`options`(**kwargs)
`with_column`(batch)

__call__(batch: pa.ChunkedArray) → pa.Array

Call self as a function.

options(**kwargs: Unpack[ExprUDFOptions]) → Self

with_column(batch: pa.ChunkedArray) → pa.Array