xpark.dataset.TextClassify

class xpark.dataset.TextClassify(labels: list[str | dict[str, str]], /, *, base_url: str, model: str, api_key: str = 'NOT_SET', max_qps: int | None = None, max_concurrency: int | None = None, max_retries: int = 0, multi_label: bool = False, fallback_response: str | list[str] | None = 'NOT_SET', cascade: CascadeConfig | None = None, hint: str | list[str] | None = None, **kwargs: Any)

The TextClassify processor assigns each input text the single label that best matches its content, or a list of matching labels when multi_label is True.

Parameters:

labels –
The labels to classify into. Accepts two formats:
- list[str]: plain label names, e.g. ["science", "sport"]
- list[dict]: dicts with "label" (required) and "description" (optional), e.g. [{"label": "science", "description": "natural science and research"}]
Descriptions are injected into the prompt to guide the model when label names alone are ambiguous.
base_url – The base URL of the LLM server.
model – The request model name.
api_key – The request API key.
max_qps – The maximum query-per-second rate for remote LLM requests.
max_concurrency – The maximum number of in-flight remote LLM requests allowed concurrently.
max_retries – The maximum number of retries per request when a request fails. Retries use exponential backoff and stop once this maximum is reached.
multi_label – If True, the processor will return a list of labels that match the text content.
fallback_response – The response value to return when the LLM request fails. If set to None, the exception will be raised instead.
cascade – Optional CascadeConfig for cascade mode. See CascadeConfig for details.
hint – Optional extra instructions or constraints to guide the model (e.g. domain-specific rules, output language, label tie-breaking policy). Accepts either a single string or a list of strings, where each item is one hint written in plain text. Passing a list is recommended — use one string per hint. Do not include output-format rules in the hint, as they are injected automatically.
**kwargs –
Keyword arguments to pass to the openai.AsyncClient.chat.completions.create API.
- logprobs: If True, return a pa.StructArray containing both the prediction and per-token logprobs instead of a plain prediction string. Maps directly to the OpenAI logprobs parameter.
- top_logprobs: Number of most likely tokens to return at each position. Maps directly to the OpenAI top_logprobs parameter. Only meaningful when logprobs is True.
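The interaction between max_retries and fallback_response described above can be illustrated with a small standalone sketch. This is not xpark's implementation: the helper name call_with_retries, the base_delay value, the doubling factor, and the injectable sleep argument are all assumptions made for illustration. Only the overall behavior (retry with exponential backoff up to max_retries, then return the fallback or raise when the fallback is None) comes from the parameter descriptions.

```python
import time
from typing import Callable, List, Optional, Union


def call_with_retries(
    fn: Callable[[], str],
    max_retries: int = 0,
    fallback_response: Optional[Union[str, List[str]]] = None,
    base_delay: float = 0.5,  # assumed value; the real schedule is not documented
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
):
    """Call fn, retrying failed calls with exponential backoff.

    Hypothetical helper: makes 1 initial attempt plus up to max_retries retries.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                # Retries exhausted: mirror the documented rule that a
                # fallback of None means the exception is raised.
                if fallback_response is None:
                    raise
                return fallback_response
            # Delay doubles on every retry: base, 2*base, 4*base, ...
            sleep(base_delay * (2 ** attempt))
```

With max_retries=0 the function makes exactly one attempt, which matches the documented default of no retries.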

Examples

import os

from xpark.dataset.expressions import col
from xpark.dataset import TextClassify, from_items

ds = from_items(
    [
        "The research team discovered a new exoplanet orbiting a nearby star.",
        "Manchester United secured a dramatic victory in the final minutes of the match.",
        "The government introduced new policies to reduce carbon emissions over the next decade.",
    ]
)

# Plain labels
ds = ds.with_column(
    "class",
    TextClassify(
        ["science", "sport", "politics"],
        model="deepseek-v3-0324",
        base_url=os.getenv("LLM_ENDPOINT"),
        api_key=os.getenv("LLM_API_KEY"),
    )
    .options(num_workers={"IO": 1})
    .with_column(col("item")),
)

# Labels with descriptions
ds = ds.with_column(
    "class",
    TextClassify(
        [
            {"label": "science", "description": "natural science, research, and technology"},
            {"label": "sport", "description": "sports events and athletic competitions"},
            {"label": "politics", "description": "government policies and political affairs"},
        ],
        model="deepseek-v3-0324",
        base_url=os.getenv("LLM_ENDPOINT"),
        api_key=os.getenv("LLM_API_KEY"),
    )
    .options(num_workers={"IO": 1})
    .with_column(col("item")),
)

# Cascade mode: proxy model first, then forward uncertain samples to base model
import math
from xpark.dataset.utils import CascadeConfig, elementwise_cascade

@elementwise_cascade
def cascade_fn(text: str, logprobs: list[dict] | None) -> bool:
    if not logprobs:
        return True
    prob = math.exp(logprobs[0]["logprob"]) * 100
    return prob < 95.0  # Forward if confidence < 95%

ds = ds.with_column(
    "class",
    TextClassify(
        ["science", "sport", "politics"],
        model="deepseek-v3-0324",
        base_url=os.getenv("LLM_ENDPOINT"),
        api_key=os.getenv("LLM_API_KEY"),
        cascade=CascadeConfig(
            proxy_model="Qwen2.5-3B-Instruct",
            proxy_base_url="http://local-vllm:8000/v1",
            cascade_factory=lambda: cascade_fn,
        ),
    )
    .options(num_workers={"IO": 1})
    .with_column(col("item")),
)
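The forwarding rule inside cascade_fn can be checked without an LLM endpoint. The sketch below restates the same decision as a plain function, without the elementwise_cascade decorator. It assumes, as the example above does, that each logprobs entry is a dict with a "logprob" key and that the first token's probability is an adequate confidence proxy. The name should_forward and the threshold argument are hypothetical.

```python
import math
from typing import List, Optional


def should_forward(logprobs: Optional[List[dict]], threshold: float = 0.95) -> bool:
    """Return True when the sample should be forwarded to the base model."""
    if not logprobs:
        # No logprobs available: confidence cannot be judged, so forward.
        return True
    # A log-probability converts to a probability via exp().
    confidence = math.exp(logprobs[0]["logprob"])
    return confidence < threshold


# A first-token logprob of -0.01 is about 99% probability: keep the proxy answer.
print(should_forward([{"logprob": -0.01}]))
# A first-token logprob of -0.5 is about 61% probability: forward to the base model.
print(should_forward([{"logprob": -0.5}]))
```

Raising the threshold forwards more samples to the base model, trading cost for accuracy. The right value depends on how well the proxy model's logprobs are calibrated.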

Methods

`__call__`(texts)	Call self as a function.
`options`(**kwargs)
`with_column`(texts)

__call__(texts: pa.ChunkedArray) → pa.Array | pa.StructArray: Call self as a function.

options(**kwargs: Unpack[ExprUDFOptions]) → Self

with_column(texts: pa.ChunkedArray) → pa.Array | pa.StructArray