xpark.dataset.TextPerplexity

class xpark.dataset.TextPerplexity(_local_model: str = 'Qwen/Qwen2.5-0.5B', /, *, max_length: int | None = None)

Computes the perplexity of text using a language model to evaluate fluency and naturalness.

Perplexity measures how well a language model predicts a given text. Lower perplexity means the model finds the text more predictable, which usually indicates natural, fluent writing; higher perplexity means the text is harder to predict (e.g., noisy, garbled, or low-quality content).
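
As a rough illustration of the metric itself (not of how TextPerplexity is implemented internally), perplexity is commonly defined as the exponential of the average negative log-likelihood per token. The sketch below assumes per-token natural-log probabilities are already available; the function name `perplexity_from_logprobs` and the sample probabilities are hypothetical.

```python
import math


def perplexity_from_logprobs(token_logprobs):
    """Return exp(-mean(log p)) for a sequence of natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("at least one token log-probability is required")
    # Average negative log-likelihood per token.
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Hypothetical values: a model that assigns fairly high probability to each token ...
fluent = [math.log(0.5), math.log(0.25), math.log(0.5)]
# ... versus one that finds every token surprising.
noisy = [math.log(0.01), math.log(0.01), math.log(0.01)]

print(perplexity_from_logprobs(fluent))
print(perplexity_from_logprobs(noisy))
```

With these made-up inputs the second sequence scores far higher than the first, matching the interpretation above: the less probable each token is under the model, the larger the perplexity.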

Parameters:

_local_model – Name of the language model used for perplexity computation. Positional-only. Available models: ['Qwen/Qwen2.5-0.5B', 'Qwen/Qwen3.5-0.8B', 'Qwen/Qwen3.5-2B', 'Qwen/Qwen3.5-4B']. Defaults to 'Qwen/Qwen2.5-0.5B'.
max_length – Maximum token length for input truncation. Text exceeding this length will be truncated. Defaults to None, which uses the model’s maximum supported length.
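
The effect of max_length can be sketched on a plain list of token ids. This is only a sketch under stated assumptions: it assumes truncation keeps the leading tokens, which the parameter description does not specify, and both the helper `truncate_tokens` and the `model_max_length` value are hypothetical rather than part of xpark.

```python
def truncate_tokens(token_ids, max_length=None, model_max_length=8):
    """Keep at most `max_length` tokens, falling back to the model limit when None.

    Assumption: tokens are dropped from the end of the sequence.
    """
    limit = model_max_length if max_length is None else max_length
    return token_ids[:limit]


token_ids = list(range(10))  # stand-in for tokenizer output

print(truncate_tokens(token_ids, max_length=4))  # explicit limit
print(truncate_tokens(token_ids))                # None: hypothetical model limit of 8
```

Under this assumption, tokens beyond the limit do not contribute to the score, so the perplexity of a very long text reflects only the part that was kept.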

Examples

from xpark.dataset.expressions import col
from xpark.dataset import TextPerplexity, from_items

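# Build a two-row dataset: one fluent sentence and one noisy string.
ds = from_items([
    "The quick brown fox jumps over the lazy dog.",
    "asdf qwer zxcv random noise text 1234",
])
# Add a "perplexity" column computed from the "item" column.
ds = ds.with_column(
    "perplexity",
    TextPerplexity()
    .options(num_workers={"CPU": 4}, batch_size=8)
    .with_column(col("item")),
)
# Materialize and print both rows.
print(ds.take(2))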

Methods

`__call__`(texts)	Call self as a function.
`options`(**kwargs)
`with_column`(texts)

__call__(texts: pa.ChunkedArray) → pa.Array: Call self as a function.

options(**kwargs: Unpack[ExprUDFOptions]) → Self

with_column(texts: pa.ChunkedArray) → pa.Array