Dataset API#

UDF#

xpark.dataset.udf(*, return_dtype[, ...])

Decorator to convert a UDF into an expression-compatible function.

Dataset#

xpark.dataset.Dataset(ray_dataset)

Construct a Dataset from a ray.data.Dataset.

Dataset Context#

xpark.dataset.DatasetContext(...)

Grouped Dataset#

xpark.dataset.GroupedData(ray_grouped_data)

Represents a grouped dataset created by calling Dataset.groupby().

Read API#

`xpark.dataset.from_arrow`(tables, *[, ...])	Create a `Dataset` from a list of PyArrow tables.
`xpark.dataset.from_blocks`(blocks)	Create a `Dataset` from a list of blocks.
`xpark.dataset.from_huggingface`(dataset[, ...])	Read a Hugging Face Dataset into a Xpark Datasetset.
`xpark.dataset.from_items`(items, *[, ...])	Create a `Dataset` from a list of local Python objects.
`xpark.dataset.from_range`(n, *[, ...])	Creates a `Dataset` from a range of integers [0..n).
`xpark.dataset.from_range_tensor`(n, *[, ...])	Creates a `Dataset` tensors of the provided shape from range [0...n].
`xpark.dataset.from_pandas`(dfs[, ...])	Create a `Dataset` from a list of pandas dataframes.
`xpark.dataset.from_numpy`(ndarrays)	Creates a `Dataset` from a list of NumPy ndarrays.
`xpark.dataset.read_json`(paths, *[, lines, ...])	Creates a `Dataset` from JSON and JSONL files.
`xpark.dataset.read_audio`(paths, *[, ...])	Creates a `Dataset` from audio files.
`xpark.dataset.read_video`(paths, *[, ...])	Creates a `Dataset` from video files.
`xpark.dataset.read_image`(paths, *[, ...])	Creates a `Dataset` from image files.
`xpark.dataset.read_parquet`(paths, *, ...)	Creates a `Dataset` from parquet files.
`xpark.dataset.read_iceberg`(*, table_identifier)	Create a `Dataset` from an Iceberg table.
`xpark.dataset.read_lance`(uri, *[, version, ...])	Create a `Dataset` from a Lance Dataset.
`xpark.dataset.read_lerobot`(path, *[, ...])	Creates a `Dataset` from a LeRobot format dataset.
`xpark.dataset.read_files`(paths, *[, ...])	Create a `Dataset` from binary files of arbitrary contents.

Consuming API#

`xpark.dataset.Dataset.iter_batches`(*[, ...])	Return an iterable over batches of data.
`xpark.dataset.Dataset.show`([limit])	Print up to the given number of rows from the `Dataset`.
`xpark.dataset.Dataset.take`([limit])	Return up to `limit` rows from the `Dataset`.
`xpark.dataset.Dataset.take_all`([limit])	Return all of the rows in this `Dataset`.
`xpark.dataset.Dataset.take_batch`([...])	Return up to `batch_size` rows from the `Dataset` in a batch.

I/O and Conversion API#

`xpark.dataset.Dataset.to_pandas`([limit])	Convert this `Dataset` to a single pandas DataFrame.
`xpark.dataset.Dataset.write_json`(path, *[, ...])	Writes the `Dataset` to JSON and JSONL files.
`xpark.dataset.Dataset.write_csv`(path, *[, ...])	Writes the `Dataset` to CSV files.
`xpark.dataset.Dataset.write_numpy`(path, *, ...)	Writes a column of the `Dataset` to .npy files.
`xpark.dataset.Dataset.write_parquet`(path, *)	Writes the `Dataset` to parquet files under the provided `path`.
`xpark.dataset.Dataset.write_iceberg`(...[, ...])	Writes the `Dataset` to an Iceberg table.
`xpark.dataset.Dataset.write_lance`(path, *[, ...])	Write the dataset to a Lance dataset.

Expressions#

`xpark.dataset.expressions.ExprUDFOptions`
`xpark.dataset.expressions.udf`(*, return_dtype)	Decorator to convert a UDF into an expression-compatible function.
`xpark.dataset.expressions.star`()	References all input columns from the input.
`xpark.dataset.expressions.col`(name)	Reference an existing column by name.
`xpark.dataset.expressions.lit`(value)	Create a literal expression from a constant value.
`xpark.dataset.expressions.download`(...[, ...])	Create a download expression that downloads content from URIs.

Expression namespaces#

These namespace classes provide specialized operations for list, string, and struct columns. You access them through properties on expressions: .list, .str, and .struct.

The following example shows how to use the string namespace to transform text columns:

from xpark.dataset import from_items
from xpark.dataset.expressions import col

# Create a dataset with a text column
ds = from_items([
    {"name": "alice"},
    {"name": "bob"},
    {"name": "charlie"}
])

# Use the string namespace to uppercase the names
ds = ds.with_column("upper_name", col("name").str.upper())
ds.show()

{'name': 'alice', 'upper_name': 'ALICE'}
{'name': 'bob', 'upper_name': 'BOB'}
{'name': 'charlie', 'upper_name': 'CHARLIE'}

The following example demonstrates using the list namespace to work with array columns:

from xpark.dataset import from_items
from xpark.dataset.expressions import col

# Create a dataset with list columns
ds = from_items([
    {"scores": [85, 90, 78]},
    {"scores": [92, 88]},
    {"scores": [76, 82, 88, 91]}
])

# Use the list namespace to get the length of each list
ds = ds.with_column("num_scores", col("scores").list.len())
ds.show()

{'scores': [85, 90, 78], 'num_scores': 3}
{'scores': [92, 88], 'num_scores': 2}
{'scores': [76, 82, 88, 91], 'num_scores': 4}

The following example shows how to use the struct namespace to access nested fields:

from xpark.dataset import from_items
from xpark.dataset.expressions import col

# Create a dataset with struct columns
ds = from_items([
    {"user": {"name": "alice", "age": 25}},
    {"user": {"name": "bob", "age": 30}},
    {"user": {"name": "charlie", "age": 35}}
])

# Use the struct namespace to extract a specific field
ds = ds.with_column("user_name", col("user").struct.field("name"))
ds.show()

{'user': {'name': 'alice', 'age': 25}, 'user_name': 'alice'}
{'user': {'name': 'bob', 'age': 30}, 'user_name': 'bob'}
{'user': {'name': 'charlie', 'age': 35}, 'user_name': 'charlie'}

class xpark.dataset.namespace_expressions.list_namespace._ListNamespace(_expr: Expr)[source]#

Namespace for list operations on expression columns.

This namespace provides methods for operating on list-typed columns using PyArrow compute functions.

Example

>>> from xpark.dataset.expressions import col
>>> # Get length of list column
>>> expr = col("item").list.len()
>>> # Get first item using method
>>> expr = col("item").list.get(0)
>>> # Get first item using indexing
>>> expr = col("item").list[0]
>>> # Slice list
>>> expr = col("item").list[1:3]

flatten() → UDFExpr[source]#: Flatten one level of nesting for each list value.

get(index: int) → PyArrowComputeUDFExpr[source]#

Get element at the specified index from each list.

Parameters:: index – The index of the element to retrieve. Negative indices are supported.
Returns:: Expression that extracts the element at the given index.

len() → PyArrowComputeUDFExpr[source]#: Get the length of each list.

slice(start: int | None = None, stop: int | None = None, step: int | None = None) → PyArrowComputeUDFExpr[source]#

Slice each list.

Parameters:

start – Start index (inclusive). Defaults to 0.
stop – Stop index (exclusive). Defaults to list length.
step – Step size. Defaults to 1.

Returns:

Expression that extracts a slice from each list.

sort(order: Literal['ascending', 'descending'] = 'ascending', null_placement: Literal['at_start', 'at_end'] = 'at_end') → UDFExpr[source]#

Sort the elements within each (nested) list.

Parameters:

order – Sorting order, must be "ascending" or "descending".
null_placement – Placement for null values, "at_start" or "at_end".

Returns:

UDFExpr providing the sorted lists.

Example

>>> from ray.data.expressions import col
>>> # [[3,1],[2,None]] -> [[1,3],[2,None]]
>>> expr = col("items").list.sort()

class xpark.dataset.namespace_expressions.string_namespace._StringNamespace(_expr: Expr)[source]#

Namespace for string operations on expression columns.

This namespace provides methods for operating on string-typed columns using PyArrow compute functions.

Example

>>> from xpark.dataset.expressions import col
>>> # Convert to uppercase
>>> expr = col("name").str.upper()
>>> # Get string length
>>> expr = col("name").str.len()
>>> # Check if string starts with a prefix
>>> expr = col("name").str.starts_with("A")

alpha_count() → UDFExpr[source]#

Count the number of alphabetic characters in each text.

Uses str.isalpha() to identify characters that are letters (including Unicode alphabetic characters), excluding digits and other symbols.

Examples

from xpark.dataset import from_items
from xpark.dataset.expressions import col

ds = from_items(["Hello, world! 123", "abc"])
ds = ds.with_column(
    "alpha_count",
    col("text").str.alpha_count(),
)
print(ds.take_all())

alpha_number_count() → UDFExpr[source]#

Count the number of alphanumeric characters in each text.

Uses str.isalnum() to identify characters that are either letters or digits (including Unicode alphanumeric characters).

Examples

from xpark.dataset import from_items
from xpark.dataset.expressions import col

ds = from_items(["Hello, world! 123", "abc"])
ds = ds.with_column(
    "alpha_number_count",
    col("text").str.alpha_number_count(),
)
print(ds.take_all())

avg_line_length() → UDFExpr[source]#

Compute the average line length for each text.

Splits each text by newlines and returns the mean character count across all lines.

Examples

from xpark.dataset import from_items
from xpark.dataset.expressions import col

ds = from_items(["Hello\nworld", "This is a test"])
ds = ds.with_column(
    "avg_line_length",
    col("text").str.avg_line_length(),
)
print(ds.take_all())

byte_len() → PyArrowComputeUDFExpr[source]#: Get the length of each string in bytes.

capitalize() → PyArrowComputeUDFExpr[source]#: Capitalize the first character of each string.

center(width: int, padding: str = ' ', *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Center strings in a field of given width.

contains(pattern: str, *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Check if strings contain a substring.

count(pattern: str, *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Count occurrences of a substring.

count_regex(pattern: str, *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Count occurrences matching a regex pattern.

ends_with(pattern: str, *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Check if strings end with a pattern.

extract(pattern: str, *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Extract a substring matching a regex pattern.

find(pattern: str, *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Find the first occurrence of a substring.

find_regex(pattern: str, *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Find the first occurrence matching a regex pattern.

html_clean(separator: str = '\n') → UDFExpr[source]#

Extract plain text from HTML content.

Parses the HTML string and returns the visible text, with tags stripped. Multiple text nodes are joined using the specified separator.

Parameters:: separator – The string used to join text nodes extracted from the HTML. Defaults to “n”.

Examples

from xpark.dataset import from_items
from xpark.dataset.expressions import col

ds = from_items(["<p>Hello</p><p>World</p>", "<b>foo</b> bar"])
ds = ds.with_column(
    "clean_text",
    col("text").str.html_clean(separator="\n"),
)
print(ds.take_all())

is_alnum() → PyArrowComputeUDFExpr[source]#: Check if strings contain only alphanumeric characters.

is_alpha() → PyArrowComputeUDFExpr[source]#: Check if strings contain only alphabetic characters.

is_ascii() → PyArrowComputeUDFExpr[source]#: Check if strings contain only ASCII characters.

is_decimal() → PyArrowComputeUDFExpr[source]#: Check if strings contain only decimal characters.

is_digit() → PyArrowComputeUDFExpr[source]#: Check if strings contain only digits.

is_lower() → PyArrowComputeUDFExpr[source]#: Check if strings are lowercase.

is_numeric() → PyArrowComputeUDFExpr[source]#: Check if strings contain only numeric characters.

is_printable() → PyArrowComputeUDFExpr[source]#: Check if strings contain only printable characters.

is_space() → PyArrowComputeUDFExpr[source]#: Check if strings contain only whitespace.

is_title() → PyArrowComputeUDFExpr[source]#: Check if strings are title-cased.

is_upper() → PyArrowComputeUDFExpr[source]#: Check if strings are uppercase.

len() → PyArrowComputeUDFExpr[source]#: Get the length of each string in characters.

lower() → PyArrowComputeUDFExpr[source]#: Convert strings to lowercase.

lpad(width: int, padding: str = ' ', *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#

Right-align strings by padding with a given character while respecting width.

If the string is longer than the specified width, it remains intact (no truncation occurs).

lstrip(characters: str | None = None) → PyArrowComputeUDFExpr[source]#

Remove leading whitespace or specified characters.

Parameters:: characters – Characters to remove. If None, removes whitespace.
Returns:: Expression that strips characters from the left.

match(pattern: str, *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Match strings against a SQL LIKE pattern.

match_regex(pattern: str, *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Check if strings match a regex pattern.

max_line_length() → UDFExpr[source]#

Compute the maximum line length for each text.

Splits each text by newlines and returns the length of the longest line.

Examples

from xpark.dataset import from_items
from xpark.dataset.expressions import col

ds = from_items(["Hello\nworld", "This is a test"])
ds = ds.with_column(
    "max_line_length",
    col("text").str.max_line_length(),
)
print(ds.take_all())

pad(width: int, fillchar: str = ' ', side: Literal['left', 'right', 'both'] = 'right') → PyArrowComputeUDFExpr[source]#

Pad strings to a specified width.

Parameters:

width – Target width.
fillchar – Character to use for padding.
side – “left”, “right”, or “both” for padding side.

Returns:

Expression that pads strings to the given width.

repeat(n: int, *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Repeat each string n times.

replace(pattern: str, replacement: str, *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Replace occurrences of a substring.

replace_regex(pattern: str, replacement: str, *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Replace occurrences matching a regex pattern.

replace_slice(start: int, stop: int, replacement: str, *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Replace a slice with a string.

reverse() → PyArrowComputeUDFExpr[source]#: Reverse each string.

rpad(width: int, padding: str = ' ', *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#

Left-align strings by padding with a given character while respecting width.

If the string is longer than the specified width, it remains intact (no truncation occurs).

rstrip(characters: str | None = None) → PyArrowComputeUDFExpr[source]#

Remove trailing whitespace or specified characters.

Parameters:: characters – Characters to remove. If None, removes whitespace.
Returns:: Expression that strips characters from the right.

slice(*args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Slice strings by codeunit indices.

special_word_count() → UDFExpr[source]#

Count the number of special characters in each text.

Iterates over each character in the text and counts those that appear in the predefined SPECIAL_CHARACTERS set.

Examples

from xpark.dataset import from_items
from xpark.dataset.expressions import col

ds = from_items(["Hello, world!", "No specials here"])
ds = ds.with_column(
    "special_word_count",
    col("text").str.special_word_count(),
)
print(ds.take_all())

split(pattern: str, *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Split strings by a pattern.

split_regex(pattern: str, *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Split strings by a regex pattern.

split_whitespace(*args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Split strings on whitespace.

starts_with(pattern: str, *args: Any, **kwargs: Any) → PyArrowComputeUDFExpr[source]#: Check if strings start with a pattern.

strip(characters: str | None = None) → PyArrowComputeUDFExpr[source]#

Remove leading and trailing whitespace or specified characters.

Parameters:: characters – Characters to remove. If None, removes whitespace.
Returns:: Expression that strips characters from both ends.

swapcase() → PyArrowComputeUDFExpr[source]#: Swap the case of each character.

title() → PyArrowComputeUDFExpr[source]#: Convert strings to title case.

upper() → PyArrowComputeUDFExpr[source]#: Convert strings to uppercase.

word_count(tokenizer: str = 'cjk') → UDFExpr[source]#

Count words in texts using the specified tokenizer.

Parameters:: tokenizer – The tokenizer type to use for word segmentation. Defaults to “cjk” for Chinese-Japanese-Korean text processing.

Examples

from xpark.dataset import from_items
from xpark.dataset.expressions import col

ds = from_items(["Hello world", "This is a test"])
ds = ds.with_column(
    "word_count",
    col("text").str.word_count(tokenizer="cjk"),
)
print(ds.take_all())

class xpark.dataset.namespace_expressions.struct_namespace._StructNamespace(_expr: Expr)[source]#

Namespace for struct operations on expression columns.

This namespace provides methods for operating on struct-typed columns using PyArrow compute functions.

Example

>>> from xpark.dataset.expressions import col
>>> # Access a field using method
>>> expr = col("user_record").struct.field("age")
>>> # Access a field using bracket notation
>>> expr = col("user_record").struct["age"]
>>> # Access nested field
>>> expr = col("user_record").struct["address"].struct["city"]

field(field_name: str) → PyArrowComputeUDFExpr[source]#

Extract a field from a struct.

Parameters:: field_name – The name of the field to extract.
Returns:: UDFExpr that extracts the specified field from each struct.

class xpark.dataset.namespace_expressions.datetime_namespace._DatetimeNamespace(_expr: Expr)[source]#

Namespace for datetime operations on expression columns.

This namespace provides methods for operating on datetime-typed columns using PyArrow compute functions.

Example

>>> from xpark.dataset.expressions import col
>>> # Extract year component
>>> expr = col("datetime").dt.year
>>> # Extract month component
>>> expr = col("datetime").dt.month
>>> # Extract day component
>>> expr = col("datetime").dt.day
>>> # Extract hour component
>>> expr = col("datetime").dt.hour
>>> # Extract minute component
>>> expr = col("datetime").dt.minute
>>> # Extract second component
>>> expr = col("datetime").dt.second

ceil(unit: TemporalUnit) → PyArrowComputeUDFExpr[source]#: Ceil timestamps to the next multiple of the given unit.

day() → PyArrowComputeUDFExpr[source]#: Extract day component.

floor(unit: TemporalUnit) → PyArrowComputeUDFExpr[source]#: Floor timestamps to the previous multiple of the given unit.

hour() → PyArrowComputeUDFExpr[source]#: Extract hour component.

minute() → PyArrowComputeUDFExpr[source]#: Extract minute component.

month() → PyArrowComputeUDFExpr[source]#: Extract month component.

round(unit: TemporalUnit) → PyArrowComputeUDFExpr[source]#: Round timestamps to the nearest multiple of the given unit.

second() → PyArrowComputeUDFExpr[source]#: Extract second component.

strftime(fmt: str) → PyArrowComputeUDFExpr[source]#: Format timestamps with a strftime pattern.

year() → PyArrowComputeUDFExpr[source]#: Extract year component.

class xpark.dataset.namespace_expressions.array_namespace._ArrayNamespace(_expr: Expr)[source]#

Namespace for array operations on expression columns.

Example

>>> from xpark.dataset.expressions import col
>>> # Convert fixed-size lists to variable-length lists
>>> expr = col("features").arr.to_list()

to_list() → UDFExpr[source]#: Convert FixedSizeList columns into variable-length lists.

class xpark.dataset.namespace_expressions.map_namespace._MapNamespace(_expr: Expr)[source]#

Namespace for map operations on expression columns.

This namespace provides methods for operating on map-typed columns (including MapArrays and ListArrays of Structs) using PyArrow UDFs.

Example

>>> from xpark.dataset.expressions import col
>>> # Get keys from map column
>>> expr = col("headers").map.keys()
>>> # Get values from map column
>>> expr = col("headers").map.values()

keys() → UDFExpr[source]#

Returns a list expression containing the keys of the map.

Example

>>> from ray.data.expressions import col
>>> # Get keys from map column
>>> expr = col("headers").map.keys()

Returns:: A list expression containing the keys.

values() → UDFExpr[source]#

Returns a list expression containing the values of the map.

Example

>>> from ray.data.expressions import col
>>> # Get values from map column
>>> expr = col("headers").map.values()

Returns:: A list expression containing the values.

Aggregation#

`xpark.dataset.aggregate.AggregateFnV2`(name, ...)	Provides an interface to implement efficient aggregations to be applied to the dataset.
`xpark.dataset.aggregate.Count`([on, ...])	Defines count aggregation.
`xpark.dataset.aggregate.Sum`([on, ...])	Defines sum aggregation.
`xpark.dataset.aggregate.Mean`([on, ...])	Defines mean (average) aggregation.
`xpark.dataset.aggregate.Max`(on, ...)	Defines max aggregation.
`xpark.dataset.aggregate.Min`(on, ...)	Defines min aggregation.
`xpark.dataset.aggregate.Std`([on, ddof, ...])	Defines standard deviation aggregation.
`xpark.dataset.aggregate.AbsMax`(on, ...)	Defines absolute max aggregation.
`xpark.dataset.aggregate.Quantile`([on, q, ...])	Defines Quantile aggregation.
`xpark.dataset.aggregate.Unique`([on, ...])	Defines unique aggregation.
`xpark.dataset.aggregate.MissingValuePercentage`(on)	Calculates the percentage of null values in a column.
`xpark.dataset.aggregate.ZeroPercentage`(on[, ...])	Calculates the percentage of zero values in a numeric column.