xpark.dataset.TextLanguageDetector

class xpark.dataset.TextLanguageDetector(_local_model: str = 'fasttext/lid.176.bin')

Language detection operator based on a fastText model.

Identifies the language of each input text and returns the top-1 (most probable) language code. The supported codes are those of the fastText language identification model; for details, see https://fasttext.cc/docs/en/language-identification.html. You can also refer to the ISO 639 standard for the codes, e.g. en, zh.
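
The operator's internals are not shown on this page, so the following is only a minimal sketch of what "top-1 language" means for fastText-style output. It assumes the model yields labels of the form `__label__<code>` together with one probability per label. `top1_language` and `LABEL_PREFIX` are hypothetical helper names used for illustration and are not part of xpark.

```python
from typing import Sequence

# Prefix that fastText puts in front of every predicted label.
LABEL_PREFIX = "__label__"


def top1_language(labels: Sequence[str], probs: Sequence[float]) -> str:
    """Return the language code of the highest-probability label."""
    if not labels or len(labels) != len(probs):
        raise ValueError("labels and probs must be non-empty and of equal length")
    # Index of the largest probability, i.e. the top-1 prediction.
    best = max(range(len(labels)), key=lambda i: probs[i])
    label = labels[best]
    # Strip the prefix so that "__label__en" becomes "en".
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return label


print(top1_language(["__label__fr", "__label__en"], [0.1, 0.9]))  # en
```

Keeping only the best label matches the description above, where a single language code is returned per input text.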

Parameters:

_local_model – Name of the fastText model to use. Default is "fasttext/lid.176.bin". Available models: {AVAILABLE_MODELS}.

Examples

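from xpark.dataset import from_items
from xpark.dataset.expressions import col
from xpark.dataset.processors.text_language_detector import TextLanguageDetector

# Each row is a dict so that the text lands in a column named "text",
# which col("text") refers to below (assumes from_items accepts dict rows).
ds = from_items([{"text": "Hello world"}, {"text": "今天阳光明媚"}])
# Add a "lang" column holding the top-1 language code detected for each text.
ds = ds.with_column("lang", TextLanguageDetector().options(num_workers={"CPU": 1})(col("text")))
print(ds.take(2))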

Methods

__call__(batch)

Call self as a function.

options(**kwargs)

with_column(batch)

__call__(batch: pa.ChunkedArray) → pa.Array

Call self as a function.

options(**kwargs: Unpack[ExprUDFOptions]) → Self

with_column(batch: pa.ChunkedArray) → pa.Array