Dataset API#

UDF#

xpark.dataset.udf(*, return_dtype[, ...])

Decorator to convert a UDF into an expression-compatible function.

Dataset#

xpark.dataset.Dataset(ray_dataset)

Construct a Dataset from a ray.data.Dataset.

Dataset Context#

Grouped Dataset#

xpark.dataset.GroupedData(ray_grouped_data)

Represents a grouped dataset created by calling Dataset.groupby().

Read API#

xpark.dataset.from_arrow(tables, *[, ...])

Create a Dataset from a list of PyArrow tables.

xpark.dataset.from_blocks(blocks)

Create a Dataset from a list of blocks.

xpark.dataset.from_huggingface(dataset[, ...])

Read a Hugging Face Dataset into an Xpark Dataset.

xpark.dataset.from_items(items, *[, ...])

Create a Dataset from a list of local Python objects.

xpark.dataset.from_range(n, *[, ...])

Creates a Dataset from a range of integers [0..n).

xpark.dataset.from_range_tensor(n, *[, ...])

Creates a Dataset of tensors of the provided shape from the range [0...n].

xpark.dataset.from_pandas(dfs[, ...])

Create a Dataset from a list of pandas DataFrames.

xpark.dataset.from_numpy(ndarrays)

Creates a Dataset from a list of NumPy ndarrays.

xpark.dataset.read_json(paths, *[, lines, ...])

Creates a Dataset from JSON and JSONL files.

xpark.dataset.read_audio(paths, *[, ...])

Creates a Dataset from audio files.

xpark.dataset.read_video(paths, *[, ...])

Creates a Dataset from video files.

xpark.dataset.read_image(paths, *[, ...])

Creates a Dataset from image files.

xpark.dataset.read_parquet(paths, *, ...)

Creates a Dataset from parquet files.

xpark.dataset.read_iceberg(*, table_identifier)

Create a Dataset from an Iceberg table.

xpark.dataset.read_lance(uri, *[, version, ...])

Create a Dataset from a Lance Dataset.

xpark.dataset.read_lerobot(path, *[, ...])

Creates a Dataset from a LeRobot format dataset.

xpark.dataset.read_files(paths, *[, ...])

Create a Dataset from binary files of arbitrary contents.

Expressions#

xpark.dataset.expressions.ExprUDFOptions

xpark.dataset.expressions.udf(*, return_dtype)

Decorator to convert a UDF into an expression-compatible function.

xpark.dataset.expressions.star()

Reference all columns from the input.

xpark.dataset.expressions.col(name)

Reference an existing column by name.

xpark.dataset.expressions.lit(value)

Create a literal expression from a constant value.

xpark.dataset.expressions.download(...)

Create a download expression that downloads content from URIs.

Expression namespaces#

These namespace classes provide specialized operations for list, string, struct, and datetime columns. You access them through properties on expressions: .list, .str, .struct, and .dt.

The following example shows how to use the string namespace to transform text columns:

from xpark.dataset import from_items
from xpark.dataset.expressions import col

# Create a dataset with a text column
ds = from_items([
    {"name": "alice"},
    {"name": "bob"},
    {"name": "charlie"}
])

# Use the string namespace to uppercase the names
ds = ds.with_column("upper_name", col("name").str.upper())
ds.show()
{'name': 'alice', 'upper_name': 'ALICE'}
{'name': 'bob', 'upper_name': 'BOB'}
{'name': 'charlie', 'upper_name': 'CHARLIE'}

The following example demonstrates using the list namespace to work with array columns:

from xpark.dataset import from_items
from xpark.dataset.expressions import col

# Create a dataset with list columns
ds = from_items([
    {"scores": [85, 90, 78]},
    {"scores": [92, 88]},
    {"scores": [76, 82, 88, 91]}
])

# Use the list namespace to get the length of each list
ds = ds.with_column("num_scores", col("scores").list.len())
ds.show()
{'scores': [85, 90, 78], 'num_scores': 3}
{'scores': [92, 88], 'num_scores': 2}
{'scores': [76, 82, 88, 91], 'num_scores': 4}

The following example shows how to use the struct namespace to access nested fields:

from xpark.dataset import from_items
from xpark.dataset.expressions import col

# Create a dataset with struct columns
ds = from_items([
    {"user": {"name": "alice", "age": 25}},
    {"user": {"name": "bob", "age": 30}},
    {"user": {"name": "charlie", "age": 35}}
])

# Use the struct namespace to extract a specific field
ds = ds.with_column("user_name", col("user").struct.field("name"))
ds.show()
{'user': {'name': 'alice', 'age': 25}, 'user_name': 'alice'}
{'user': {'name': 'bob', 'age': 30}, 'user_name': 'bob'}
{'user': {'name': 'charlie', 'age': 35}, 'user_name': 'charlie'}
class xpark.dataset.namespace_expressions.list_namespace._ListNamespace(_expr: Expr)[source]#

Namespace for list operations on expression columns.

This namespace provides methods for operating on list-typed columns using PyArrow compute functions.

Example

>>> from xpark.dataset.expressions import col
>>> # Get length of list column
>>> expr = col("items").list.len()
>>> # Get first item using method
>>> expr = col("items").list.get(0)
>>> # Get first item using indexing
>>> expr = col("items").list[0]
>>> # Slice list
>>> expr = col("items").list[1:3]
get(index: int) UDFExpr[source]#

Get element at the specified index from each list.

Parameters:

index – The index of the element to retrieve. Negative indices are supported.

Returns:

UDFExpr that extracts the element at the given index.

len() UDFExpr[source]#

Get the length of each list.

slice(start: int | None = None, stop: int | None = None, step: int | None = None) UDFExpr[source]#

Slice each list.

Parameters:
  • start – Start index (inclusive). Defaults to 0.

  • stop – Stop index (exclusive). Defaults to list length.

  • step – Step size. Defaults to 1.

Returns:

UDFExpr that extracts a slice from each list.

class xpark.dataset.namespace_expressions.string_namespace._StringNamespace(_expr: Expr)[source]#

Namespace for string operations on expression columns.

This namespace provides methods for operating on string-typed columns using PyArrow compute functions.

Example

>>> from xpark.dataset.expressions import col
>>> # Convert to uppercase
>>> expr = col("name").str.upper()
>>> # Get string length
>>> expr = col("name").str.len()
>>> # Check if string starts with a prefix
>>> expr = col("name").str.starts_with("A")
byte_len() UDFExpr[source]#

Get the length of each string in bytes.

capitalize() UDFExpr[source]#

Capitalize the first character of each string.

center(width: int, padding: str = ' ', *args: Any, **kwargs: Any) UDFExpr[source]#

Center strings in a field of given width.

contains(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Check if strings contain a substring.

count(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Count occurrences of a substring.

count_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Count occurrences matching a regex pattern.

ends_with(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Check if strings end with a pattern.

extract(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Extract a substring matching a regex pattern.

find(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Find the first occurrence of a substring.

find_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Find the first occurrence matching a regex pattern.

is_alnum() UDFExpr[source]#

Check if strings contain only alphanumeric characters.

is_alpha() UDFExpr[source]#

Check if strings contain only alphabetic characters.

is_ascii() UDFExpr[source]#

Check if strings contain only ASCII characters.

is_decimal() UDFExpr[source]#

Check if strings contain only decimal characters.

is_digit() UDFExpr[source]#

Check if strings contain only digits.

is_lower() UDFExpr[source]#

Check if strings are lowercase.

is_numeric() UDFExpr[source]#

Check if strings contain only numeric characters.

is_printable() UDFExpr[source]#

Check if strings contain only printable characters.

is_space() UDFExpr[source]#

Check if strings contain only whitespace.

is_title() UDFExpr[source]#

Check if strings are title-cased.

is_upper() UDFExpr[source]#

Check if strings are uppercase.

len() UDFExpr[source]#

Get the length of each string in characters.

lower() UDFExpr[source]#

Convert strings to lowercase.

lstrip(characters: str | None = None) UDFExpr[source]#

Remove leading whitespace or specified characters.

Parameters:

characters – Characters to remove. If None, removes whitespace.

Returns:

UDFExpr that strips characters from the left.

match(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Match strings against a SQL LIKE pattern.

match_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Check if strings match a regex pattern.

pad(width: int, fillchar: str = ' ', side: Literal['left', 'right', 'both'] = 'right') UDFExpr[source]#

Pad strings to a specified width.

Parameters:
  • width – Target width.

  • fillchar – Character to use for padding.

  • side – “left”, “right”, or “both” for padding side.

Returns:

UDFExpr that pads strings.

repeat(n: int, *args: Any, **kwargs: Any) UDFExpr[source]#

Repeat each string n times.

replace(pattern: str, replacement: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Replace occurrences of a substring.

replace_regex(pattern: str, replacement: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Replace occurrences matching a regex pattern.

replace_slice(start: int, stop: int, replacement: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Replace a slice with a string.

reverse() UDFExpr[source]#

Reverse each string.

rstrip(characters: str | None = None) UDFExpr[source]#

Remove trailing whitespace or specified characters.

Parameters:

characters – Characters to remove. If None, removes whitespace.

Returns:

UDFExpr that strips characters from the right.

slice(*args: Any, **kwargs: Any) UDFExpr[source]#

Slice strings by codeunit indices.

split(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Split strings by a pattern.

split_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Split strings by a regex pattern.

split_whitespace(*args: Any, **kwargs: Any) UDFExpr[source]#

Split strings on whitespace.

starts_with(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#

Check if strings start with a pattern.

strip(characters: str | None = None) UDFExpr[source]#

Remove leading and trailing whitespace or specified characters.

Parameters:

characters – Characters to remove. If None, removes whitespace.

Returns:

UDFExpr that strips characters from both ends.

swapcase() UDFExpr[source]#

Swap the case of each character.

title() UDFExpr[source]#

Convert strings to title case.

upper() UDFExpr[source]#

Convert strings to uppercase.

word_count(tokenizer: str = 'cjk') UDFExpr[source]#

Count words in texts using the specified tokenizer.

Parameters:

tokenizer – The tokenizer type to use for word segmentation. Defaults to “cjk” for Chinese-Japanese-Korean text processing.

Examples

from xpark.dataset import from_items
from xpark.dataset.expressions import col

ds = from_items([{"text": "Hello world"}, {"text": "This is a test"}])
ds = ds.with_column(
    "word_count",
    col("text").str.word_count(tokenizer="cjk"),
)
print(ds.take_all())
class xpark.dataset.namespace_expressions.struct_namespace._StructNamespace(_expr: Expr)[source]#

Namespace for struct operations on expression columns.

This namespace provides methods for operating on struct-typed columns using PyArrow compute functions.

Example

>>> from xpark.dataset.expressions import col
>>> # Access a field using method
>>> expr = col("user_record").struct.field("age")
>>> # Access a field using bracket notation
>>> expr = col("user_record").struct["age"]
>>> # Access nested field
>>> expr = col("user_record").struct["address"].struct["city"]
field(field_name: str) UDFExpr[source]#

Extract a field from a struct.

Parameters:

field_name – The name of the field to extract.

Returns:

UDFExpr that extracts the specified field from each struct.

class xpark.dataset.namespace_expressions.datetime_namespace._DatetimeNamespace(_expr: Expr)[source]#

Namespace for datetime operations on expression columns.

This namespace provides methods for operating on datetime-typed columns using PyArrow compute functions.

Example

>>> from xpark.dataset.expressions import col
>>> # Extract year component
>>> expr = col("datetime").dt.year()
>>> # Extract month component
>>> expr = col("datetime").dt.month()
>>> # Extract day component
>>> expr = col("datetime").dt.day()
>>> # Extract hour component
>>> expr = col("datetime").dt.hour()
>>> # Extract minute component
>>> expr = col("datetime").dt.minute()
>>> # Extract second component
>>> expr = col("datetime").dt.second()
ceil(unit: TemporalUnit) UDFExpr[source]#

Ceil timestamps to the next multiple of the given unit.

day() UDFExpr[source]#

Extract day component.

floor(unit: TemporalUnit) UDFExpr[source]#

Floor timestamps to the previous multiple of the given unit.

hour() UDFExpr[source]#

Extract hour component.

minute() UDFExpr[source]#

Extract minute component.

month() UDFExpr[source]#

Extract month component.

round(unit: TemporalUnit) UDFExpr[source]#

Round timestamps to the nearest multiple of the given unit.

second() UDFExpr[source]#

Extract second component.

strftime(fmt: str) UDFExpr[source]#

Format timestamps with a strftime pattern.

year() UDFExpr[source]#

Extract year component.

Aggregation#

xpark.dataset.aggregate.AggregateFnV2(name, ...)

Provides an interface to implement efficient aggregations to be applied to the dataset.

xpark.dataset.aggregate.Count([on, ...])

Defines count aggregation.

xpark.dataset.aggregate.Sum([on, ...])

Defines sum aggregation.

xpark.dataset.aggregate.Mean([on, ...])

Defines mean (average) aggregation.

xpark.dataset.aggregate.Max(on, ...)

Defines max aggregation.

xpark.dataset.aggregate.Min(on, ...)

Defines min aggregation.

xpark.dataset.aggregate.Std([on, ddof, ...])

Defines standard deviation aggregation.

xpark.dataset.aggregate.AbsMax(on, ...)

Defines absolute max aggregation.

xpark.dataset.aggregate.Quantile([on, q, ...])

Defines quantile aggregation.

xpark.dataset.aggregate.Unique([on, ...])

Defines unique aggregation.

xpark.dataset.aggregate.MissingValuePercentage(on)

Calculates the percentage of null values in a column.

xpark.dataset.aggregate.ZeroPercentage(on[, ...])

Calculates the percentage of zero values in a numeric column.