Dataset API#

UDF#

xpark.dataset.udf(*, return_dtype[, ...])

Decorator to convert a UDF into an expression-compatible function.

Dataset#

xpark.dataset.Dataset(ray_dataset)

Construct a Dataset from a ray.data.Dataset.

Dataset Context#

Grouped Dataset#

xpark.dataset.GroupedData(ray_grouped_data)

Represents a grouped dataset created by calling Dataset.groupby().

Read API#

xpark.dataset.from_arrow(tables, *[, ...])

Create a Dataset from a list of PyArrow tables.

xpark.dataset.from_blocks(blocks)

Create a Dataset from a list of blocks.

xpark.dataset.from_huggingface(dataset[, ...])

Read a Hugging Face Dataset into a Xpark Datasetset.

xpark.dataset.from_items(items, *[, ...])

Create a Dataset from a list of local Python objects.

xpark.dataset.from_range(n, *[, ...])

Creates a Dataset from a range of integers [0..n).

xpark.dataset.from_range_tensor(n, *[, ...])

Creates a Dataset tensors of the provided shape from range [0...n].

xpark.dataset.from_pandas(dfs[, ...])

Create a Dataset from a list of pandas dataframes.

xpark.dataset.from_numpy(ndarrays)

Creates a Dataset from a list of NumPy ndarrays.

xpark.dataset.read_json(paths, *[, lines, ...])

Creates a Dataset from JSON and JSONL files.

xpark.dataset.read_audio(paths, *[, ...])

Creates a Dataset from audio files.

xpark.dataset.read_video(paths, *[, ...])

Creates a Dataset from video files.

xpark.dataset.read_image(paths, *[, ...])

Creates a Dataset from image files.

xpark.dataset.read_parquet(paths, *, ...)

Creates a Dataset from parquet files.

xpark.dataset.read_iceberg(*, table_identifier)

Create a Dataset from an Iceberg table.

xpark.dataset.read_lance(uri, *[, version, ...])

Create a Dataset from a Lance Dataset.

xpark.dataset.read_lerobot(path, *[, ...])

Creates a Dataset from a LeRobot format dataset.

xpark.dataset.read_files(paths, *[, ...])

Create a Dataset from binary files of arbitrary contents.

Consuming API#

xpark.dataset.Dataset.iter_batches(*[, ...])

Return an iterable over batches of data.

xpark.dataset.Dataset.show([limit])

Print up to the given number of rows from the Dataset.

xpark.dataset.Dataset.take([limit])

Return up to limit rows from the Dataset.

xpark.dataset.Dataset.take_all([limit])

Return all of the rows in this Dataset.

xpark.dataset.Dataset.take_batch([...])

Return up to batch_size rows from the Dataset in a batch.

I/O and Conversion API#

xpark.dataset.Dataset.to_pandas([limit])

Convert this Dataset to a single pandas DataFrame.

xpark.dataset.Dataset.write_json(path, *[, ...])

Writes the Dataset to JSON and JSONL files.

xpark.dataset.Dataset.write_csv(path, *[, ...])

Writes the Dataset to CSV files.

xpark.dataset.Dataset.write_numpy(path, *, ...)

Writes a column of the Dataset to .npy files.

xpark.dataset.Dataset.write_parquet(path, *)

Writes the Dataset to parquet files under the provided path.

xpark.dataset.Dataset.write_iceberg(...[, ...])

Writes the Dataset to an Iceberg table.

xpark.dataset.Dataset.write_lance(path, *[, ...])

Write the dataset to a Lance dataset.

Expressions#

xpark.dataset.expressions.ExprUDFOptions

xpark.dataset.expressions.udf(*, return_dtype)

Decorator to convert a UDF into an expression-compatible function.

xpark.dataset.expressions.star()

References all input columns from the input.

xpark.dataset.expressions.col(name)

Reference an existing column by name.

xpark.dataset.expressions.lit(value)

Create a literal expression from a constant value.

xpark.dataset.expressions.download(...[, ...])

Create a download expression that downloads content from URIs.

Expression namespaces#

These namespace classes provide specialized operations for list, string, and struct columns. You access them through properties on expressions: .list, .str, and .struct.

The following example shows how to use the string namespace to transform text columns:

from xpark.dataset import from_items
from xpark.dataset.expressions import col

# Create a dataset with a text column
ds = from_items([
    {"name": "alice"},
    {"name": "bob"},
    {"name": "charlie"}
])

# Use the string namespace to uppercase the names
ds = ds.with_column("upper_name", col("name").str.upper())
ds.show()
{'name': 'alice', 'upper_name': 'ALICE'}
{'name': 'bob', 'upper_name': 'BOB'}
{'name': 'charlie', 'upper_name': 'CHARLIE'}

The following example demonstrates using the list namespace to work with array columns:

from xpark.dataset import from_items
from xpark.dataset.expressions import col

# Create a dataset with list columns
ds = from_items([
    {"scores": [85, 90, 78]},
    {"scores": [92, 88]},
    {"scores": [76, 82, 88, 91]}
])

# Use the list namespace to get the length of each list
ds = ds.with_column("num_scores", col("scores").list.len())
ds.show()
{'scores': [85, 90, 78], 'num_scores': 3}
{'scores': [92, 88], 'num_scores': 2}
{'scores': [76, 82, 88, 91], 'num_scores': 4}

The following example shows how to use the struct namespace to access nested fields:

from xpark.dataset import from_items
from xpark.dataset.expressions import col

# Create a dataset with struct columns
ds = from_items([
    {"user": {"name": "alice", "age": 25}},
    {"user": {"name": "bob", "age": 30}},
    {"user": {"name": "charlie", "age": 35}}
])

# Use the struct namespace to extract a specific field
ds = ds.with_column("user_name", col("user").struct.field("name"))
ds.show()
{'user': {'name': 'alice', 'age': 25}, 'user_name': 'alice'}
{'user': {'name': 'bob', 'age': 30}, 'user_name': 'bob'}
{'user': {'name': 'charlie', 'age': 35}, 'user_name': 'charlie'}
class xpark.dataset.namespace_expressions.list_namespace._ListNamespace(_expr: Expr)[source]#

Namespace for list operations on expression columns.

This namespace provides methods for operating on list-typed columns using PyArrow compute functions.

Example

>>> from xpark.dataset.expressions import col
>>> # Get length of list column
>>> expr = col("item").list.len()
>>> # Get first item using method
>>> expr = col("item").list.get(0)
>>> # Get first item using indexing
>>> expr = col("item").list[0]
>>> # Slice list
>>> expr = col("item").list[1:3]
flatten() UDFExpr[source]#

Flatten one level of nesting for each list value.

get(index: int) UDFExpr[source]#

Get element at the specified index from each list.

Parameters:

index – The index of the element to retrieve. Negative indices are supported.

Returns:

UDFExpr that extracts the element at the given index.

len() UDFExpr[source]#

Get the length of each list.

slice(start: int | None = None, stop: int | None = None, step: int | None = None) UDFExpr[source]#

Slice each list.

Parameters:
  • start – Start index (inclusive). Defaults to 0.

  • stop – Stop index (exclusive). Defaults to list length.

  • step – Step size. Defaults to 1.

Returns:

UDFExpr that extracts a slice from each list.

sort(order: Literal['ascending', 'descending'] = 'ascending', null_placement: Literal['at_start', 'at_end'] = 'at_end') UDFExpr[source]#

Sort the elements within each (nested) list.

Parameters:
  • order – Sorting order, must be "ascending" or "descending".

  • null_placement – Placement for null values, "at_start" or "at_end".

Returns:

UDFExpr providing the sorted lists.

Example

>>> from ray.data.expressions import col
>>> # [[3,1],[2,None]] -> [[1,3],[2,None]]
>>> expr = col("items").list.sort()
class xpark.dataset.namespace_expressions.string_namespace._StringNamespace(_expr: Expr)[source]#

Namespace for string operations on expression columns.

This namespace provides methods for operating on string-typed columns using PyArrow compute functions.

Example

>>> from xpark.dataset.expressions import col
>>> # Convert to uppercase
>>> expr = col("name").str.upper()
>>> # Get string length
>>> expr = col("name").str.len()
>>> # Check if string starts with a prefix
>>> expr = col("name").str.starts_with("A")
alpha_count() UDFExpr[source]#

Count the number of alphabetic characters in each text.

Uses str.isalpha() to identify characters that are letters (including Unicode alphabetic characters), excluding digits and other symbols.

Examples

from xpark.dataset import from_items
from xpark.dataset.expressions import col

ds = from_items(["Hello, world! 123", "abc"])
ds = ds.with_column(
    "alpha_count",
    col("text").str.alpha_count(),
)
print(ds.take_all())
alpha_number_count() UDFExpr[source]#

Count the number of alphanumeric characters in each text.

Uses str.isalnum() to identify characters that are either letters or digits (including Unicode alphanumeric characters).

Examples

from xpark.dataset import from_items
from xpark.dataset.expressions import col

ds = from_items(["Hello, world! 123", "abc"])
ds = ds.with_column(
    "alpha_number_count",
    col("text").str.alpha_number_count(),
)
print(ds.take_all())
avg_line_length() UDFExpr[source]#

Compute the average line length for each text.

Splits each text by newlines and returns the mean character count across all lines.

Examples

from xpark.dataset import from_items
from xpark.dataset.expressions import col

ds = from_items(["Hello\nworld", "This is a test"])
ds = ds.with_column(
    "avg_line_length",
    col("text").str.avg_line_length(),
)
print(ds.take_all())
byte_len() UDFExpr[source]#

Get the length of each string in bytes.

capitalize() UDFExpr[source]#

Capitalize the first character of each string.

center(width: int, padding: str = ' ', *args: Any, **kwargs: Any) UDFExpr[source]#

Center strings in a field of given width.

contains(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Check if strings contain a substring.

count(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Count occurrences of a substring.

count_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Count occurrences matching a regex pattern.

ends_with(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Check if strings end with a pattern.

extract(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Extract a substring matching a regex pattern.

find(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Find the first occurrence of a substring.

find_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Find the first occurrence matching a regex pattern.

is_alnum() UDFExpr[source]#

Check if strings contain only alphanumeric characters.

is_alpha() UDFExpr[source]#

Check if strings contain only alphabetic characters.

is_ascii() UDFExpr[source]#

Check if strings contain only ASCII characters.

is_decimal() UDFExpr[source]#

Check if strings contain only decimal characters.

is_digit() UDFExpr[source]#

Check if strings contain only digits.

is_lower() UDFExpr[source]#

Check if strings are lowercase.

is_numeric() UDFExpr[source]#

Check if strings contain only numeric characters.

is_printable() UDFExpr[source]#

Check if strings contain only printable characters.

is_space() UDFExpr[source]#

Check if strings contain only whitespace.

is_title() UDFExpr[source]#

Check if strings are title-cased.

is_upper() UDFExpr[source]#

Check if strings are uppercase.

len() UDFExpr[source]#

Get the length of each string in characters.

lower() UDFExpr[source]#

Convert strings to lowercase.

lpad(width: int, padding: str = ' ', *args: Any, **kwargs: Any) UDFExpr[source]#

Right-align strings by padding with a given character while respecting width.

If the string is longer than the specified width, it remains intact (no truncation occurs).

lstrip(characters: str | None = None) UDFExpr[source]#

Remove leading whitespace or specified characters.

Parameters:

characters – Characters to remove. If None, removes whitespace.

Returns:

UDFExpr that strips characters from the left.

match(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Match strings against a SQL LIKE pattern.

match_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Check if strings match a regex pattern.

max_line_length() UDFExpr[source]#

Compute the maximum line length for each text.

Splits each text by newlines and returns the length of the longest line.

Examples

from xpark.dataset import from_items
from xpark.dataset.expressions import col

ds = from_items(["Hello\nworld", "This is a test"])
ds = ds.with_column(
    "max_line_length",
    col("text").str.max_line_length(),
)
print(ds.take_all())
pad(width: int, fillchar: str = ' ', side: Literal['left', 'right', 'both'] = 'right') UDFExpr[source]#

Pad strings to a specified width.

Parameters:
  • width – Target width.

  • fillchar – Character to use for padding.

  • side – “left”, “right”, or “both” for padding side.

Returns:

UDFExpr that pads strings.

repeat(n: int, *args: Any, **kwargs: Any) UDFExpr[source]#

Repeat each string n times.

replace(pattern: str, replacement: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Replace occurrences of a substring.

replace_regex(pattern: str, replacement: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Replace occurrences matching a regex pattern.

replace_slice(start: int, stop: int, replacement: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Replace a slice with a string.

reverse() UDFExpr[source]#

Reverse each string.

rpad(width: int, padding: str = ' ', *args: Any, **kwargs: Any) UDFExpr[source]#

Left-align strings by padding with a given character while respecting width.

If the string is longer than the specified width, it remains intact (no truncation occurs).

rstrip(characters: str | None = None) UDFExpr[source]#

Remove trailing whitespace or specified characters.

Parameters:

characters – Characters to remove. If None, removes whitespace.

Returns:

UDFExpr that strips characters from the right.

slice(*args: Any, **kwargs: Any) UDFExpr[source]#

Slice strings by codeunit indices.

special_word_count() UDFExpr[source]#

Count the number of special characters in each text.

Iterates over each character in the text and counts those that appear in the predefined SPECIAL_CHARACTERS set.

Examples

from xpark.dataset import from_items
from xpark.dataset.expressions import col

ds = from_items(["Hello, world!", "No specials here"])
ds = ds.with_column(
    "special_word_count",
    col("text").str.special_word_count(),
)
print(ds.take_all())
split(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Split strings by a pattern.

split_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Split strings by a regex pattern.

split_whitespace(*args: Any, **kwargs: Any) UDFExpr[source]#

Split strings on whitespace.

starts_with(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Check if strings start with a pattern.

strip(characters: str | None = None) UDFExpr[source]#

Remove leading and trailing whitespace or specified characters.

Parameters:

characters – Characters to remove. If None, removes whitespace.

Returns:

UDFExpr that strips characters from both ends.

swapcase() UDFExpr[source]#

Swap the case of each character.

title() UDFExpr[source]#

Convert strings to title case.

upper() UDFExpr[source]#

Convert strings to uppercase.

word_count(tokenizer: str = 'cjk') UDFExpr[source]#

Count words in texts using the specified tokenizer.

Parameters:

tokenizer – The tokenizer type to use for word segmentation. Defaults to “cjk” for Chinese-Japanese-Korean text processing.

Examples

from xpark.dataset import from_items
from xpark.dataset.expressions import col

ds = from_items(["Hello world", "This is a test"])
ds = ds.with_column(
    "word_count",
    col("text").str.word_count(tokenizer="cjk"),
)
print(ds.take_all())
class xpark.dataset.namespace_expressions.struct_namespace._StructNamespace(_expr: Expr)[source]#

Namespace for struct operations on expression columns.

This namespace provides methods for operating on struct-typed columns using PyArrow compute functions.

Example

>>> from xpark.dataset.expressions import col
>>> # Access a field using method
>>> expr = col("user_record").struct.field("age")
>>> # Access a field using bracket notation
>>> expr = col("user_record").struct["age"]
>>> # Access nested field
>>> expr = col("user_record").struct["address"].struct["city"]
field(field_name: str) UDFExpr[source]#

Extract a field from a struct.

Parameters:

field_name – The name of the field to extract.

Returns:

UDFExpr that extracts the specified field from each struct.

class xpark.dataset.namespace_expressions.datetime_namespace._DatetimeNamespace(_expr: Expr)[source]#

Namespace for datetime operations on expression columns.

This namespace provides methods for operating on datetime-typed columns using PyArrow compute functions.

Example

>>> from xpark.dataset.expressions import col
>>> # Extract year component
>>> expr = col("datetime").dt.year
>>> # Extract month component
>>> expr = col("datetime").dt.month
>>> # Extract day component
>>> expr = col("datetime").dt.day
>>> # Extract hour component
>>> expr = col("datetime").dt.hour
>>> # Extract minute component
>>> expr = col("datetime").dt.minute
>>> # Extract second component
>>> expr = col("datetime").dt.second
ceil(unit: TemporalUnit) UDFExpr[source]#

Ceil timestamps to the next multiple of the given unit.

day() UDFExpr[source]#

Extract day component.

floor(unit: TemporalUnit) UDFExpr[source]#

Floor timestamps to the previous multiple of the given unit.

hour() UDFExpr[source]#

Extract hour component.

minute() UDFExpr[source]#

Extract minute component.

month() UDFExpr[source]#

Extract month component.

round(unit: TemporalUnit) UDFExpr[source]#

Round timestamps to the nearest multiple of the given unit.

second() UDFExpr[source]#

Extract second component.

strftime(fmt: str) UDFExpr[source]#

Format timestamps with a strftime pattern.

year() UDFExpr[source]#

Extract year component.

class xpark.dataset.namespace_expressions.array_namespace._ArrayNamespace(_expr: Expr)[source]#

Namespace for array operations on expression columns.

Example

>>> from xpark.dataset.expressions import col
>>> # Convert fixed-size lists to variable-length lists
>>> expr = col("features").arr.to_list()
to_list() UDFExpr[source]#

Convert FixedSizeList columns into variable-length lists.

Aggregation#

xpark.dataset.aggregate.AggregateFnV2(name, ...)

Provides an interface to implement efficient aggregations to be applied to the dataset.

xpark.dataset.aggregate.Count([on, ...])

Defines count aggregation.

xpark.dataset.aggregate.Sum([on, ...])

Defines sum aggregation.

xpark.dataset.aggregate.Mean([on, ...])

Defines mean (average) aggregation.

xpark.dataset.aggregate.Max(on, ...)

Defines max aggregation.

xpark.dataset.aggregate.Min(on, ...)

Defines min aggregation.

xpark.dataset.aggregate.Std([on, ddof, ...])

Defines standard deviation aggregation.

xpark.dataset.aggregate.AbsMax(on, ...)

Defines absolute max aggregation.

xpark.dataset.aggregate.Quantile([on, q, ...])

Defines Quantile aggregation.

xpark.dataset.aggregate.Unique([on, ...])

Defines unique aggregation.

xpark.dataset.aggregate.MissingValuePercentage(on)

Calculates the percentage of null values in a column.

xpark.dataset.aggregate.ZeroPercentage(on[, ...])

Calculates the percentage of zero values in a numeric column.