xpark.dataset.TextChunking

class xpark.dataset.TextChunking(*, strategy: Literal['fast'] = 'fast', chunk_size: int = DEFAULT_CHUNK_SIZE_BYTES, delimiters: str = '\n.?', pattern: str | None = None, prefix: bool = False, consecutive: bool = False, forward_fallback: bool = False)

class xpark.dataset.TextChunking(*, strategy: Literal['recursive'], chunk_size: int = DEFAULT_MAX_CHUNK_TOKENS, tokenizer: str | None = None, rules: chonkie.RecursiveRules | None = None, min_characters_per_chunk: int = 12, recipe: str | None = None, lang: str = 'en')

Unified text chunking operator with multiple strategies.

Supports two chunking strategies via the strategy parameter:

"fast" (default): SIMD-accelerated byte-based chunking (100+ GB/s throughput). No tokenization overhead. Best for high-throughput pipelines.

Warning

Splits on raw bytes, suitable for ASCII / Latin text. Multi-byte scripts (Chinese, Japanese, Korean, etc.) may be cut mid-character; use "recursive" instead for non-ASCII text.

"recursive": Recursive token-based chunking with hierarchical splitting. Uses multi-level rules (RecursiveRules) for better structure preservation. Supports loading pre-configured recipes via the recipe parameter; see https://huggingface.co/datasets/chonkie-ai/recipes for available recipes.

Warning

Tokenizers other than the chonkie built-ins ("character" / "word" / "byte" / "row") require fetching assets from the public internet on first use: tiktoken encoding names download BPE files from openaipublic.blob.core.windows.net, and arbitrary HuggingFace repo ids download from huggingface.co. In offline / intranet environments without such access these tokenizers may be unusable until xpark’s unified asset management ships.
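
The mid-character hazard described in the first warning above can be reproduced with plain Python, independently of xpark. The sketch below cuts the UTF-8 encoding of a string at a fixed byte offset and checks whether each half still decodes. The helpers split_at_byte and is_valid_utf8 are hypothetical names introduced for illustration and are not part of xpark.

```python
def split_at_byte(text: str, offset: int) -> tuple[bytes, bytes]:
    """Cut the UTF-8 encoding of `text` at a raw byte offset."""
    data = text.encode("utf-8")
    return data[:offset], data[offset:]


def is_valid_utf8(chunk: bytes) -> bool:
    """Return True if `chunk` decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ASCII: one byte per character, so every byte offset is a character boundary.
head, tail = split_at_byte("Hello world.", 5)
print(is_valid_utf8(head), is_valid_utf8(tail))  # True True

# CJK: each of these characters takes 3 bytes in UTF-8, so offset 4 falls
# inside the second character and neither half decodes on its own.
head, tail = split_at_byte("你好世界", 4)
print(is_valid_utf8(head), is_valid_utf8(tail))  # False False
```

A cut that happens to land on a multiple of 3 bytes would be safe for this particular string, but a byte-based splitter has no way to guarantee that in general.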

Parameters:

strategy (Literal["fast", "recursive"]) – Chunking strategy to use. Defaults to "fast".
chunk_size (int | None) – Target chunk size. For "fast" strategy this is measured in bytes (default 4096). For "recursive" strategy this is measured in tokens (default 1024). If None, the strategy-specific default is used.
delimiters (str) – [fast only] Single-byte delimiter characters to split on. Each character in the string is treated as an individual delimiter. Defaults to "\n.?".
pattern (str | None) – [fast only] Multi-byte pattern to split on. If set, overrides delimiters. Defaults to None.
prefix (bool) – [fast only] If True, keep the delimiter/pattern at the start of the next chunk instead of the end of the current chunk. Defaults to False.
consecutive (bool) – [fast only] If True, split at the start of consecutive delimiter runs instead of the middle. Defaults to False.
forward_fallback (bool) – [fast only] If True, search forward for a delimiter when none is found in the backward search window. Defaults to False.
tokenizer (str | TokenizerProtocol | None) – [recursive only] Tokenizer to use for token counting. Accepts a tiktoken encoding name (e.g. "cl100k_base", "o200k_base"), a chonkie built-in name ("character" / "word" / "byte" / "row"), or a pre-constructed tokenizer instance. Only the chonkie built-ins are guaranteed to work without internet access; see the warning above. Defaults to "cl100k_base".
rules (RecursiveRules | None) – [recursive only] Hierarchical splitting rules that define multi-level delimiters for recursive chunking. If None, the default RecursiveRules() is used, which splits by paragraphs → sentences → punctuation → whitespace → characters. Defaults to None.
min_characters_per_chunk (int) – [recursive only] Minimum number of characters per chunk. Chunks shorter than this will be merged with adjacent chunks. Defaults to 12.
recipe (str | None) – [recursive only] Name of a pre-configured recipe to load via RecursiveChunker.from_recipe(), for example "default". When set, rules is ignored and the recipe’s built-in rules are used instead. See https://huggingface.co/datasets/chonkie-ai/recipes for available recipes. Defaults to None.
lang (str) – [recursive only] Language code for recipe loading (e.g. "en", "zh"). Only used when recipe is set. Defaults to "en".
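
To make the [fast only] parameters above more concrete, here is a minimal pure-Python sketch of one way chunk_size, delimiters, prefix, and forward_fallback could interact. It is an assumption-laden illustration, not xpark's SIMD implementation: the function name sketch_fast_chunks is hypothetical, pattern and consecutive are omitted, and the real operator may choose split points differently (for example, at the tail of the input).

```python
def sketch_fast_chunks(
    data: bytes,
    chunk_size: int,
    delimiters: bytes = b"\n.?",
    prefix: bool = False,
    forward_fallback: bool = False,
) -> list[bytes]:
    """Illustrative byte-based chunker (hypothetical, not xpark's code)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunks = []
    start, n = 0, len(data)
    while n - start > chunk_size:
        window_end = start + chunk_size
        # With prefix=True the chunk may begin with a delimiter carried over
        # from the previous split, so skip index `start` to avoid empty chunks.
        lowest = start + 1 if prefix else start
        cut = -1
        # Backward search: last delimiter byte inside the current window.
        for i in range(window_end - 1, lowest - 1, -1):
            if data[i] in delimiters:
                cut = i
                break
        if cut == -1 and forward_fallback:
            # Forward search: first delimiter byte after the window.
            for j in range(window_end, n):
                if data[j] in delimiters:
                    cut = j
                    break
            if cut == -1:
                break  # no delimiter left; the rest becomes one final chunk
        if cut == -1:
            end = window_end  # hard split at the size limit
        else:
            # prefix=False keeps the delimiter at the end of this chunk;
            # prefix=True pushes it to the start of the next one.
            end = cut if prefix else cut + 1
        chunks.append(data[start:end])
        start = end
    if start < n:
        chunks.append(data[start:])
    return chunks


text = b"The quick brown fox. Jumps over the lazy dog. Hello world."

print(sketch_fast_chunks(text, 20, b"."))
# [b'The quick brown fox.', b' Jumps over the lazy', b' dog. Hello world.']

print(sketch_fast_chunks(text, 20, b".", forward_fallback=True))
# [b'The quick brown fox.', b' Jumps over the lazy dog.', b' Hello world.']
```

In the first call the second window contains no delimiter, so the sketch falls back to a hard split at 20 bytes. With forward_fallback=True it instead extends the chunk to the next delimiter, which exceeds the target size but keeps the sentence intact.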

Returns:

A list of text chunks (pa.list_(pa.string())) per input row.

Examples

from xpark.dataset.expressions import col
from xpark.dataset import TextChunking, from_items

text = "The quick brown fox. Jumps over the lazy dog. Hello world."

# Fast chunking (default) - byte-based, no tokenization
ds = from_items([text])
ds = ds.with_column(
    "chunks_fast",
    TextChunking(strategy="fast", chunk_size=20, delimiters=". \n")
    .options(num_workers={"CPU": 4}, batch_size=32)
    .with_column(col("item")),
)

# Recursive token chunking with custom parameters
ds = ds.with_column(
    "chunks_recursive",
    TextChunking(
        strategy="recursive",
        chunk_size=8,
        min_characters_per_chunk=1,
        tokenizer="cl100k_base",
    )
    .options(num_workers={"CPU": 4}, batch_size=32)
    .with_column(col("item")),
)

# Recursive chunking with pre-configured recipe
ds = ds.with_column(
    "chunks_with_recipe",
    TextChunking(
        strategy="recursive",
        recipe="default",
        lang="en",
        chunk_size=8,
        min_characters_per_chunk=1,
    )
    .options(num_workers={"CPU": 4}, batch_size=32)
    .with_column(col("item")),
)

print(ds.take_all())

Methods

__call__(texts) – Call self as a function.
options(**kwargs)
with_column(texts)

__call__(texts: pa.ChunkedArray) → pa.Array
Call self as a function.

options(**kwargs: Unpack[ExprUDFOptions]) → Self

with_column(texts: pa.ChunkedArray) → pa.Array