xpark.dataset.TextChunking

class xpark.dataset.TextChunking(*, strategy: Literal['fast'] = 'fast', chunk_size: int = DEFAULT_CHUNK_SIZE_BYTES, delimiters: str = '\n.?', pattern: str | None = None, prefix: bool = False, consecutive: bool = False, forward_fallback: bool = False)
class xpark.dataset.TextChunking(*, strategy: Literal['recursive'], chunk_size: int = DEFAULT_MAX_CHUNK_TOKENS, tokenizer: str | None = None, rules: chonkie.RecursiveRules | None = None, min_characters_per_chunk: int = 12, recipe: str | None = None, lang: str = 'en')

Unified text chunking operator with multiple strategies.

Supports two chunking strategies via the strategy parameter:

  • "fast" (default): SIMD-accelerated byte-based chunking (100+ GB/s throughput). No tokenization overhead. Best for high-throughput pipelines.

  • "recursive": Recursive token-based chunking with hierarchical splitting. Uses multi-level rules (RecursiveRules) to preserve document structure better than flat splitting. Pre-configured recipes can be loaded via the recipe parameter; see https://huggingface.co/datasets/chonkie-ai/recipes for the available recipes.

Parameters:
  • strategy (Literal["fast", "recursive"]) – Chunking strategy to use. Defaults to "fast".

  • chunk_size (int | None) – Target chunk size. For "fast" strategy this is measured in bytes (default 4096). For "recursive" strategy this is measured in tokens (default 1024). If None, the strategy-specific default is used.

  • Fast strategy parameters:

  • delimiters (str) – Single-byte delimiter characters to split on. Each character in the string is treated as an individual delimiter. Defaults to "\n.?".

  • pattern (str | None) – Multi-byte pattern to split on. If set, overrides delimiters. Defaults to None.

  • prefix (bool) – If True, keep the delimiter/pattern at the start of the next chunk instead of the end of the current chunk. Defaults to False.

  • consecutive (bool) – If True, split at the start of consecutive delimiter runs instead of the middle. Defaults to False.

  • forward_fallback (bool) – If True, search forward for a delimiter when none is found in the backward search window. Defaults to False.

  • Recursive strategy parameters:

  • tokenizer (str | None) – Tokenizer to use for token counting. Can be a tiktoken encoding name (e.g. "cl100k_base", "o200k_base"). Defaults to None, in which case "cl100k_base" is used.

  • rules (RecursiveRules | None) – Hierarchical splitting rules that define multi-level delimiters for recursive chunking. If None, the default RecursiveRules() is used, which splits by paragraphs → sentences → punctuation → whitespace → characters. Defaults to None.

  • min_characters_per_chunk (int) – Minimum number of characters per chunk. Chunks shorter than this will be merged with adjacent chunks. Defaults to 12.

  • recipe (str | None) – Name of a pre-configured recipe to load via RecursiveChunker.from_recipe(). When set, rules is ignored and the recipe’s built-in rules are used instead. Common recipes include "default". See https://huggingface.co/datasets/chonkie-ai/recipes for available recipes. Defaults to None.

  • lang (str) – Language code for recipe loading (e.g. "en", "zh"). Only used when recipe is set. Defaults to "en".

Returns a list of text chunks (pa.list_(pa.string())) per input row.
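To make the "fast" parameters concrete, the following is a minimal pure-Python sketch of byte-window chunking with a backward delimiter search. It is not xpark's SIMD implementation; the function name chunk_bytes and its exact cut rules are assumptions made for illustration, and the pattern and consecutive options are not modelled.

```python
def chunk_bytes(text, chunk_size, delimiters="\n.?", prefix=False,
                forward_fallback=False):
    """Illustrative byte-based chunker (hypothetical, not the xpark code)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1 byte")
    data = text.encode("utf-8")
    # Each character of `delimiters` is treated as one single-byte delimiter.
    delims = set(delimiters.encode("utf-8"))
    chunks, start, n = [], 0, len(data)
    while start < n:
        end = start + chunk_size
        if end >= n:
            # The remainder fits in one window: emit it and stop.
            chunks.append(data[start:])
            break
        cut = None
        # Backward search: last delimiter inside the current window.
        for i in range(end - 1, start - 1, -1):
            if data[i] in delims:
                # prefix=False keeps the delimiter at the end of this chunk,
                # prefix=True moves it to the start of the next chunk.
                candidate = i if prefix else i + 1
                if candidate > start:
                    cut = candidate
                    break
        if cut is None and forward_fallback:
            # Forward search: first delimiter after the window.
            for i in range(end, n):
                if data[i] in delims:
                    cut = i if prefix else i + 1
                    break
        if cut is None:
            # Assumption: with no usable delimiter, cut hard at the window end.
            cut = end
        chunks.append(data[start:cut])
        start = cut
    # A hard cut may split a multi-byte character; this sketch just replaces it.
    return [c.decode("utf-8", errors="replace") for c in chunks]


text = "The quick brown fox. Jumps over the lazy dog. Hello world."
print(chunk_bytes(text, chunk_size=20, delimiters=". \n"))
# ['The quick brown fox.', ' Jumps over the ', 'lazy dog. Hello ', 'world.']
```

For this input the sketch yields the same four chunks as the chunks_fast output documented in the Examples section, which suggests the model is a reasonable mental picture, though it says nothing about how the library behaves on other inputs.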

Examples

from xpark.dataset.expressions import col
from xpark.dataset import TextChunking, from_items

text = "The quick brown fox. Jumps over the lazy dog. Hello world."

# Fast chunking (default) - byte-based, no tokenization
ds = from_items([text])
ds = ds.with_column(
    "chunks_fast",
    TextChunking(strategy="fast", chunk_size=20, delimiters=". \n")
    .options(num_workers={"CPU": 4}, batch_size=32)
    .with_column(col("item")),
)

# Recursive token chunking with custom parameters
ds = ds.with_column(
    "chunks_recursive",
    TextChunking(
        strategy="recursive",
        chunk_size=8,
        min_characters_per_chunk=1,
        tokenizer="cl100k_base",
    )
    .options(num_workers={"CPU": 4}, batch_size=32)
    .with_column(col("item")),
)

# Recursive chunking with pre-configured recipe
ds = ds.with_column(
    "chunks_with_recipe",
    TextChunking(
        strategy="recursive",
        recipe="default",
        lang="en",
        chunk_size=8,
        min_characters_per_chunk=1,
    )
    .options(num_workers={"CPU": 4}, batch_size=32)
    .with_column(col("item")),
)

print(ds.take_all())
# Output: [{"item": "The quick brown fox. Jumps over the lazy dog. Hello world.",
#           "chunks_fast": ["The quick brown fox.", " Jumps over the ", "lazy dog. Hello ", "world."],
#           "chunks_recursive": ["The quick brown fox. ", "Jumps over the lazy dog. ", "Hello world."],
#           "chunks_with_recipe": ["The quick brown fox. ", "Jumps over the lazy dog. ", "Hello world."]}]
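The hierarchical splitting behind the "recursive" strategy can be pictured with a small sketch: split on the coarsest rule first, descend to finer rules only for pieces that are still too large, then merge neighbours back up to the size limit. Everything below is an assumption for illustration: the LEVELS list stands in for RecursiveRules, count_tokens counts whitespace-separated words rather than tiktoken tokens, and min_characters_per_chunk is not modelled. It is not the chonkie or xpark implementation.

```python
# Hypothetical rule levels: paragraphs, then sentences, then words.
LEVELS = ["\n\n", ". ", " "]


def count_tokens(text):
    # Stand-in token counter: whitespace-separated words, not real tokens.
    return len(text.split())


def split_keep(text, sep):
    # Split on `sep`, keeping the separator attached to the preceding piece.
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
    return [p for p in pieces if p]


def merge(pieces, chunk_size):
    # Greedily join adjacent pieces while the result stays within chunk_size.
    merged, current = [], ""
    for piece in pieces:
        if current and count_tokens(current + piece) > chunk_size:
            merged.append(current)
            current = piece
        else:
            current += piece
    if current:
        merged.append(current)
    return merged


def recursive_chunk(text, chunk_size, level=0):
    if count_tokens(text) <= chunk_size:
        return [text]
    if level == len(LEVELS):
        # No finer rule left in this sketch: emit the oversized piece as-is.
        return [text]
    pieces = []
    for part in split_keep(text, LEVELS[level]):
        pieces.extend(recursive_chunk(part, chunk_size, level + 1))
    return merge(pieces, chunk_size)


text = "The quick brown fox. Jumps over the lazy dog. Hello world."
print(recursive_chunk(text, chunk_size=5))
# ['The quick brown fox. ', 'Jumps over the lazy dog. ', 'Hello world.']
```

Because the sketch counts words instead of tokens, its chunk_size is not comparable to the chunk_size=8 used in the example above; the point is only the split-then-merge shape of the algorithm.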

Methods

__call__(texts)

Call self as a function.

options(**kwargs)

with_column(texts)

__call__(texts: pa.ChunkedArray) → pa.Array

Call self as a function.

options(**kwargs: Unpack[ExprUDFOptions]) → Self

with_column(texts: pa.ChunkedArray) → pa.Array