xpark.dataset.TextLanguageDetector
class xpark.dataset.TextLanguageDetector(_local_model: str = 'fasttext/lid.176.bin')
Language detection operator based on a fastText model.
Identifies the language of each input text and returns the top-1 language as a language code supported by fastText. For details, see https://fasttext.cc/docs/en/language-identification.html. The codes can also be looked up in the ISO 639 standard, e.g. en, zh.
Parameters:
_local_model – fastText model name. Default is "fasttext/lid.176.bin". Available models: {AVAILABLE_MODELS}.
Examples
from xpark.dataset.expressions import col
from xpark.dataset import from_items
from xpark.dataset.processors.text_language_detector import TextLanguageDetector

ds = from_items(["Hello world", "今天阳光明媚"])
ds = ds.with_column("lang", TextLanguageDetector().options(num_workers={"CPU": 1})(col("text")))
print(ds.take(2))
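For readers who want to see what a top-1 prediction looks like at the fastText level, the following is a minimal sketch. It is not xpark's implementation. It assumes only that the model object exposes fastText's `predict(text, k=1)` method, which returns a tuple of labels such as `__label__en` together with their probabilities. The helper names `label_to_code` and `detect_top1` are hypothetical and exist only for this illustration.

```python
from typing import Any, Iterable, List

# fastText prefixes every predicted label with this string.
FASTTEXT_LABEL_PREFIX = "__label__"


def label_to_code(label: str) -> str:
    """Turn a fastText label such as '__label__en' into the bare code 'en'."""
    if label.startswith(FASTTEXT_LABEL_PREFIX):
        return label[len(FASTTEXT_LABEL_PREFIX):]
    return label


def detect_top1(model: Any, texts: Iterable[str]) -> List[str]:
    """Return the top-1 language code for each text (hypothetical helper).

    `model` is assumed to behave like a loaded fastText model, i.e. it has
    a predict(text, k=...) method returning (labels, probabilities).
    """
    codes = []
    for text in texts:
        # fastText's predict() rejects input containing newlines, so
        # flatten each text to a single line first.
        labels, _probs = model.predict(text.replace("\n", " "), k=1)
        codes.append(label_to_code(labels[0]))
    return codes


# With the real library, and assuming lid.176.bin has been downloaded:
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   detect_top1(model, ["Hello world", "今天阳光明媚"])
```

Keeping the label parsing separate from the model call makes the sketch easy to check with a stand-in model before the real model file is downloaded.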
Methods
__call__(batch) – Call self as a function.
options(**kwargs)
with_column(batch)

__call__(batch: pa.ChunkedArray) → pa.Array
Call self as a function.
options(**kwargs: Unpack[ExprUDFOptions]) → Self
with_column(batch: pa.ChunkedArray) → pa.Array