xpark.dataset.TextClassify
- class xpark.dataset.TextClassify(labels: list[str | dict[str, str]], /, *, base_url: str, model: str, api_key: str = 'NOT_SET', max_qps: int | None = None, max_retries: int = 0, multi_label: bool = False, fallback_response: str | list[str] | None = 'NOT_SET', **kwargs: dict[str, Any])
The TextClassify processor assigns each text the single label that best matches its content, or a list of matching labels when multi_label is True.
- Parameters:
labels –
The labels to classify into. Accepts two formats:
- list[str]: plain label names, e.g. ["science", "sport"]
- list[dict]: dicts with "label" (required) and "description" (optional), e.g. [{"label": "science", "description": "natural science and research"}]
Descriptions are injected into the prompt to guide the model when label names alone are ambiguous.
base_url – The base URL of the LLM server.
model – The request model name.
api_key – The request API key.
max_qps – The maximum number of requests per second.
max_retries – The maximum number of retries per request when a request fails. Retries use exponential backoff, up to this maximum.
fallback_response – The response value to return when the LLM request fails. If set to None, the exception will be raised instead.
multi_label – If True, the processor will return a list of labels that match the text content.
**kwargs – Keyword arguments to pass to the openai.AsyncClient.chat.completions.create API.
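The two accepted formats for labels can be illustrated with a small normalization step. The following is a minimal sketch, not xpark's actual implementation: the helpers normalize_labels and render_label_block are hypothetical names introduced here, and the exact prompt wording used by TextClassify is not documented in this section.

```python
def normalize_labels(labels):
    """Convert both accepted label formats into (label, description) pairs.

    Hypothetical helper for illustration only; not part of xpark.
    """
    normalized = []
    for entry in labels:
        if isinstance(entry, str):
            # Plain label name: no description available.
            normalized.append((entry, None))
        elif isinstance(entry, dict):
            # Dict format: "label" is required, "description" is optional.
            if "label" not in entry:
                raise ValueError(f"label dict is missing the 'label' key: {entry!r}")
            normalized.append((entry["label"], entry.get("description")))
        else:
            raise TypeError(f"unsupported label entry: {entry!r}")
    return normalized


def render_label_block(labels):
    """Render labels as prompt lines, appending descriptions where given."""
    lines = []
    for name, description in normalize_labels(labels):
        lines.append(f"- {name}: {description}" if description else f"- {name}")
    return "\n".join(lines)


# Mixing both formats in one list, as the type hint list[str | dict[str, str]] allows.
label_block = render_label_block(
    ["science", {"label": "sport", "description": "sports events and athletic competitions"}]
)
print(label_block)
```

Keeping the description next to the label name in the rendered text is one plausible way to give the model the extra context that the descriptions are meant to provide when names alone are ambiguous.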
Examples
import os

from xpark.dataset.expressions import col
from xpark.dataset import TextClassify, from_items

ds = from_items(
    [
        "The research team discovered a new exoplanet orbiting a nearby star.",
        "Manchester United secured a dramatic victory in the final minutes of the match.",
        "The government introduced new policies to reduce carbon emissions over the next decade.",
    ]
)

# Plain labels
ds = ds.with_column(
    "class",
    TextClassify(
        ["science", "sport", "politics"],
        model="deepseek-v3-0324",
        base_url=os.getenv("LLM_ENDPOINT"),
        api_key=os.getenv("LLM_API_KEY"),
    )
    .options(num_workers={"IO": 1})
    .with_column(col("item")),
)

# Labels with descriptions
ds = ds.with_column(
    "class",
    TextClassify(
        [
            {"label": "science", "description": "natural science, research, and technology"},
            {"label": "sport", "description": "sports events and athletic competitions"},
            {"label": "politics", "description": "government policies and political affairs"},
        ],
        model="deepseek-v3-0324",
        base_url=os.getenv("LLM_ENDPOINT"),
        api_key=os.getenv("LLM_API_KEY"),
    )
    .options(num_workers={"IO": 1})
    .with_column(col("item")),
)
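The interaction between max_retries and fallback_response described in the parameter list can be sketched independently of xpark. This is a minimal illustration under stated assumptions: call_with_retries, make_flaky_request, and base_delay are hypothetical names, the backoff schedule (doubling from base_delay) is assumed, and the behavior of the default 'NOT_SET' value of fallback_response is not modeled because this section does not describe it.

```python
import asyncio


async def call_with_retries(request, *, max_retries=0, fallback_response=None, base_delay=0.01):
    """Await request(), retrying with exponential backoff on failure.

    Hypothetical helper for illustration only; not part of xpark.
    """
    attempt = 0
    while True:
        try:
            return await request()
        except Exception:
            if attempt >= max_retries:
                # Retries exhausted: None means re-raise, anything else is returned.
                if fallback_response is None:
                    raise
                return fallback_response
            # Assumed schedule: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * (2 ** attempt))
            attempt += 1


def make_flaky_request(failures, result):
    """Build a coroutine function that fails `failures` times, then returns `result`."""
    state = {"remaining": failures}

    async def request():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ConnectionError("simulated request failure")
        return result

    return request


# Two failures are absorbed by two retries, so the real result comes back.
recovered = asyncio.run(call_with_retries(make_flaky_request(2, "science"), max_retries=2))

# One retry is not enough for five failures, so the fallback value is returned.
fell_back = asyncio.run(
    call_with_retries(make_flaky_request(5, "science"), max_retries=1, fallback_response="unknown")
)
```

With fallback_response left as None in this sketch, the final exception propagates to the caller, which mirrors the documented behavior of raising instead of returning a substitute value.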
Methods
__call__(texts): Call self as a function.
options(**kwargs)
with_column(texts)
- __call__(texts: pa.ChunkedArray) → pa.Array
Call self as a function.
- options(**kwargs: Unpack[ExprUDFOptions]) → Self
- with_column(texts: pa.ChunkedArray) → pa.Array