xpark.dataset.TextChunking
- class xpark.dataset.TextChunking(*, strategy: Literal['fast'] = 'fast', chunk_size: int = DEFAULT_CHUNK_SIZE_BYTES, delimiters: str = '\n.?', pattern: str | None = None, prefix: bool = False, consecutive: bool = False, forward_fallback: bool = False)
- class xpark.dataset.TextChunking(*, strategy: Literal['recursive'], chunk_size: int = DEFAULT_MAX_CHUNK_TOKENS, tokenizer: str | None = None, rules: chonkie.RecursiveRules | None = None, min_characters_per_chunk: int = 12, recipe: str | None = None, lang: str = 'en')
Unified text chunking operator with multiple strategies.
Supports two chunking strategies via the strategy parameter:
- "fast" (default): SIMD-accelerated byte-based chunking (100+ GB/s throughput). No tokenization overhead. Best for high-throughput pipelines.
- "recursive": Recursive token-based chunking with hierarchical splitting. Uses multi-level rules (RecursiveRules) for better structure preservation. Supports loading pre-configured recipes via the recipe parameter. See https://huggingface.co/datasets/chonkie-ai/recipes for available recipes.
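To make the idea of hierarchical splitting concrete, here is a minimal pure-Python sketch. It is not the xpark or chonkie implementation: the names SEPARATORS, count_tokens, split_keep, and recursive_chunk are hypothetical, only three levels are modelled (paragraphs, sentences, whitespace), and tokens are approximated by whitespace-separated words rather than a real tokenizer.

```python
# Illustrative sketch only; all names here are hypothetical.
# Assumption: "tokens" are whitespace-separated words, not tiktoken tokens.

SEPARATORS = ["\n\n", ". ", " "]  # paragraphs, then sentences, then words


def count_tokens(text: str) -> int:
    """Approximate the token count by counting whitespace-separated words."""
    return len(text.split())


def split_keep(text: str, sep: str) -> list[str]:
    """Split on sep, keeping sep attached to the end of each piece."""
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
    return [p for p in pieces if p]


def recursive_chunk(text: str, chunk_size: int, level: int = 0) -> list[str]:
    """Split text with coarse separators first, recursing into finer ones."""
    if count_tokens(text) <= chunk_size or level == len(SEPARATORS):
        return [text]

    # Split at the current level and recurse into each piece.
    pieces: list[str] = []
    for piece in split_keep(text, SEPARATORS[level]):
        pieces.extend(recursive_chunk(piece, chunk_size, level + 1))

    # Greedily merge adjacent pieces while they still fit in chunk_size.
    merged: list[str] = []
    current = ""
    for piece in pieces:
        if current and count_tokens(current + piece) > chunk_size:
            merged.append(current)
            current = piece
        else:
            current += piece
    if current:
        merged.append(current)
    return merged


if __name__ == "__main__":
    sample = "The quick brown fox. Jumps over the lazy dog. Hello world."
    for chunk in recursive_chunk(sample, chunk_size=5):
        print(repr(chunk))
```

The sketch tries the coarsest separator first and only falls back to finer ones for pieces that are still too large, which is why sentence boundaries tend to survive when the size budget allows it.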
- Parameters:
  - strategy (Literal["fast", "recursive"]) – Chunking strategy to use. Defaults to "fast".
  - chunk_size (int | None) – Target chunk size. For the "fast" strategy this is measured in bytes (default 4096). For the "recursive" strategy this is measured in tokens (default 1024). If None, the strategy-specific default is used.

  Fast strategy parameters
  - delimiters (str) – Single-byte delimiter characters to split on. Each character in the string is treated as an individual delimiter. Defaults to "\n.?".
  - pattern (str | None) – Multi-byte pattern to split on. If set, overrides delimiters. Defaults to None.
  - prefix (bool) – If True, keep the delimiter/pattern at the start of the next chunk instead of the end of the current chunk. Defaults to False.
  - consecutive (bool) – If True, split at the start of consecutive delimiter runs instead of the middle. Defaults to False.
  - forward_fallback (bool) – If True, search forward for a delimiter when none is found in the backward search window. Defaults to False.

  Recursive strategy parameters
  - tokenizer (str | None) – Tokenizer to use for token counting. Can be a tiktoken encoding name (e.g. "cl100k_base", "o200k_base"). Defaults to "cl100k_base".
  - rules (RecursiveRules | None) – Hierarchical splitting rules that define multi-level delimiters for recursive chunking. If None, the default RecursiveRules() is used, which splits by paragraphs → sentences → punctuation → whitespace → characters. Defaults to None.
  - min_characters_per_chunk (int) – Minimum number of characters per chunk. Chunks shorter than this will be merged with adjacent chunks. Defaults to 12.
  - recipe (str | None) – Name of a pre-configured recipe to load via RecursiveChunker.from_recipe(). When set, rules is ignored and the recipe's built-in rules are used instead. Common recipes include "default". See https://huggingface.co/datasets/chonkie-ai/recipes for available recipes. Defaults to None.
  - lang (str) – Language code for recipe loading (e.g. "en", "zh"). Only used when recipe is set. Defaults to "en".
Returns a list of text chunks (pa.list_(pa.string())) per input row.

Examples
from xpark.dataset.expressions import col
from xpark.dataset import TextChunking, from_items

text = "The quick brown fox. Jumps over the lazy dog. Hello world."

# Fast chunking (default) - byte-based, no tokenization
ds = from_items([text])
ds = ds.with_column(
    "chunks_fast",
    TextChunking(strategy="fast", chunk_size=20, delimiters=". \n")
    .options(num_workers={"CPU": 4}, batch_size=32)
    .with_column(col("item")),
)

# Recursive token chunking with custom parameters
ds = ds.with_column(
    "chunks_recursive",
    TextChunking(
        strategy="recursive",
        chunk_size=8,
        min_characters_per_chunk=1,
        tokenizer="cl100k_base",
    )
    .options(num_workers={"CPU": 4}, batch_size=32)
    .with_column(col("item")),
)

# Recursive chunking with pre-configured recipe
ds = ds.with_column(
    "chunks_with_recipe",
    TextChunking(
        strategy="recursive",
        recipe="default",
        lang="en",
        chunk_size=8,
        min_characters_per_chunk=1,
    )
    .options(num_workers={"CPU": 4}, batch_size=32)
    .with_column(col("item")),
)

print(ds.take_all())
# Output: [{"item": "The quick brown fox. Jumps over the lazy dog. Hello world.",
#           "chunks_fast": ["The quick brown fox.", " Jumps over the ", "lazy dog. Hello ", "world."],
#           "chunks_recursive": ["The quick brown fox. ", "Jumps over the lazy dog. ", "Hello world."],
#           "chunks_with_recipe": ["The quick brown fox. ", "Jumps over the lazy dog. ", "Hello world."]}]
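The interaction between chunk_size, delimiters, and prefix in the "fast" strategy can be illustrated with a small pure-Python sketch. This is an assumption about the general technique (take a window of chunk_size bytes, then search backward for the last delimiter), not the library's SIMD implementation; the function name chunk_bytes is hypothetical, and pattern, consecutive, and forward_fallback are not modelled.

```python
# Illustrative sketch only; chunk_bytes is a hypothetical helper,
# not part of xpark. It models a backward delimiter search per window.


def chunk_bytes(
    data: bytes,
    chunk_size: int,
    delimiters: bytes = b"\n.?",
    prefix: bool = False,
) -> list[bytes]:
    """Split data into chunks of at most chunk_size bytes at delimiters."""
    chunks: list[bytes] = []
    start = 0
    n = len(data)
    while start < n:
        end = min(start + chunk_size, n)
        if end == n:
            # The remainder fits in one window: emit it and stop.
            chunks.append(data[start:])
            break

        # With prefix=True the chunk may itself start with a delimiter,
        # so skip the first byte to guarantee forward progress.
        lowest = start + 1 if prefix else start
        cut = -1
        for i in range(end - 1, lowest - 1, -1):
            if data[i] in delimiters:
                cut = i
                break

        if cut == -1:
            split = end      # no delimiter in the window: hard cut
        elif prefix:
            split = cut      # delimiter starts the next chunk
        else:
            split = cut + 1  # delimiter ends the current chunk
        chunks.append(data[start:split])
        start = split
    return chunks


if __name__ == "__main__":
    text = b"The quick brown fox. Jumps over the lazy dog. Hello world."
    print(chunk_bytes(text, chunk_size=20, delimiters=b". \n"))
    print(chunk_bytes(b"a.b.c", chunk_size=3, delimiters=b".", prefix=True))
```

For the sample sentence with chunk_size=20 and delimiters ". \n", this sketch yields the same four pieces as the chunks_fast output shown in the example above, though agreement on other inputs is not guaranteed.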
Methods
__call__(texts) – Call self as a function.
options(**kwargs)
with_column(texts)

- __call__(texts: pa.ChunkedArray) -> pa.Array
  Call self as a function.
- options(**kwargs: Unpack[ExprUDFOptions]) -> Self
- with_column(texts: pa.ChunkedArray) -> pa.Array