Dataset API#
UDF#
Decorator to convert a UDF into an expression-compatible function.
Dataset#
Construct a Dataset.
Dataset Context#
Grouped Dataset#
Represents a grouped dataset created by a group-by operation on a dataset.
Read API#
Read a Hugging Face Dataset into a Xpark Dataset.
Expressions#
Decorator to convert a UDF into an expression-compatible function.
References all input columns from the input.
Reference an existing column by name.
Create a literal expression from a constant value.
Create a download expression that downloads content from URIs.
Expression namespaces#
These namespace classes provide specialized operations for list, string, struct, and datetime columns.
You access them through properties on expressions: .list, .str, .struct, and .dt.
The following example shows how to use the string namespace to transform text columns:
from xpark.dataset import from_items
from xpark.dataset.expressions import col
# Create a dataset with a text column
ds = from_items([
    {"name": "alice"},
    {"name": "bob"},
    {"name": "charlie"},
])
# Use the string namespace to uppercase the names
ds = ds.with_column("upper_name", col("name").str.upper())
ds.show()
{'name': 'alice', 'upper_name': 'ALICE'}
{'name': 'bob', 'upper_name': 'BOB'}
{'name': 'charlie', 'upper_name': 'CHARLIE'}
The following example demonstrates using the list namespace to work with array columns:
from xpark.dataset import from_items
from xpark.dataset.expressions import col
# Create a dataset with list columns
ds = from_items([
    {"scores": [85, 90, 78]},
    {"scores": [92, 88]},
    {"scores": [76, 82, 88, 91]},
])
# Use the list namespace to get the length of each list
ds = ds.with_column("num_scores", col("scores").list.len())
ds.show()
{'scores': [85, 90, 78], 'num_scores': 3}
{'scores': [92, 88], 'num_scores': 2}
{'scores': [76, 82, 88, 91], 'num_scores': 4}
The following example shows how to use the struct namespace to access nested fields:
from xpark.dataset import from_items
from xpark.dataset.expressions import col
# Create a dataset with struct columns
ds = from_items([
    {"user": {"name": "alice", "age": 25}},
    {"user": {"name": "bob", "age": 30}},
    {"user": {"name": "charlie", "age": 35}},
])
# Use the struct namespace to extract a specific field
ds = ds.with_column("user_name", col("user").struct.field("name"))
ds.show()
{'user': {'name': 'alice', 'age': 25}, 'user_name': 'alice'}
{'user': {'name': 'bob', 'age': 30}, 'user_name': 'bob'}
{'user': {'name': 'charlie', 'age': 35}, 'user_name': 'charlie'}
- class xpark.dataset.namespace_expressions.list_namespace._ListNamespace(_expr: Expr)[source]#
Namespace for list operations on expression columns.
This namespace provides methods for operating on list-typed columns using PyArrow compute functions.
Example
>>> from xpark.dataset.expressions import col
>>> # Get length of list column
>>> expr = col("items").list.len()
>>> # Get first item using method
>>> expr = col("items").list.get(0)
>>> # Get first item using indexing
>>> expr = col("items").list[0]
>>> # Slice list
>>> expr = col("items").list[1:3]
- get(index: int) UDFExpr[source]#
Get element at the specified index from each list.
- Parameters:
index – The index of the element to retrieve. Negative indices are supported.
- Returns:
UDFExpr that extracts the element at the given index.
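As a minimal sketch (not the xpark implementation), the per-row semantics of list.get, including negative-index support, can be illustrated in plain Python:

```python
# Illustrative sketch: extract the element at `index` from each
# list-valued row; negative indices count from the end, as in Python.
def list_get(rows, index):
    return [row[index] for row in rows]

scores = [[85, 90, 78], [92, 88], [76, 82, 88, 91]]
print(list_get(scores, 0))   # first element of each list
print(list_get(scores, -1))  # last element of each list
```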
- slice(start: int | None = None, stop: int | None = None, step: int | None = None) UDFExpr[source]#
Slice each list.
- Parameters:
start – Start index (inclusive). Defaults to 0.
stop – Stop index (exclusive). Defaults to list length.
step – Step size. Defaults to 1.
- Returns:
UDFExpr that extracts a slice from each list.
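The slice parameters follow Python slicing conventions with the defaults documented above; a minimal sketch (not the xpark implementation) of the per-row behavior:

```python
# Illustrative sketch: apply slice(start, stop, step) to each list-valued
# row, with None meaning "use the default" for that parameter.
def list_slice(rows, start=None, stop=None, step=None):
    return [row[slice(start, stop, step)] for row in rows]

scores = [[85, 90, 78], [76, 82, 88, 91]]
print(list_slice(scores, 1, 3))    # elements at indices 1 and 2
print(list_slice(scores, step=2))  # every other element
```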
- class xpark.dataset.namespace_expressions.string_namespace._StringNamespace(_expr: Expr)[source]#
Namespace for string operations on expression columns.
This namespace provides methods for operating on string-typed columns using PyArrow compute functions.
Example
>>> from xpark.dataset.expressions import col
>>> # Convert to uppercase
>>> expr = col("name").str.upper()
>>> # Get string length
>>> expr = col("name").str.len()
>>> # Check if string starts with a prefix
>>> expr = col("name").str.starts_with("A")
- center(width: int, padding: str = ' ', *args: Any, **kwargs: Any) UDFExpr[source]#
Center strings in a field of given width.
- contains(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Check if strings contain a substring.
- count_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Count occurrences matching a regex pattern.
- ends_with(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Check if strings end with a pattern.
- extract(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Extract a substring matching a regex pattern.
- find(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Find the first occurrence of a substring.
- find_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Find the first occurrence matching a regex pattern.
- lstrip(characters: str | None = None) UDFExpr[source]#
Remove leading whitespace or specified characters.
- Parameters:
characters – Characters to remove. If None, removes whitespace.
- Returns:
UDFExpr that strips characters from the left.
- match(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Match strings against a SQL LIKE pattern.
- match_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Check if strings match a regex pattern.
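The distinction between the two: match uses SQL LIKE patterns ("%" matches any run of characters, "_" matches exactly one), while match_regex takes a regular expression directly. A hedged pure-Python sketch of LIKE semantics (not the xpark implementation):

```python
import re

# Illustrative sketch: translate a SQL LIKE pattern to an anchored regex,
# then match each string in full.
def like_to_regex(pattern: str) -> str:
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # "%" matches any run of characters
        elif ch == "_":
            parts.append(".")    # "_" matches exactly one character
        else:
            parts.append(re.escape(ch))
    return "^" + "".join(parts) + "$"

def sql_like_match(strings, pattern):
    rx = re.compile(like_to_regex(pattern))
    return [bool(rx.match(s)) for s in strings]

names = ["alice", "alina", "bob"]
print(sql_like_match(names, "ali%"))   # LIKE prefix match
print(sql_like_match(names, "al_ce"))  # "_" matches one character
```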
- pad(width: int, fillchar: str = ' ', side: Literal['left', 'right', 'both'] = 'right') UDFExpr[source]#
Pad strings to a specified width.
- Parameters:
width – Target width.
fillchar – Character to use for padding.
side – “left”, “right”, or “both” for padding side.
- Returns:
UDFExpr that pads strings.
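Assuming the three padding sides behave like Python's built-in string methods (an assumption, not confirmed by the signature alone), the semantics can be sketched as:

```python
# Illustrative sketch: "left" pads on the left (right-justifies), "right"
# pads on the right (left-justifies), "both" centers within the width.
def pad(strings, width, fillchar=" ", side="right"):
    fn = {"left": str.rjust, "right": str.ljust, "both": str.center}[side]
    return [fn(s, width, fillchar) for s in strings]

print(pad(["ab", "abcd"], 6, "*", side="left"))
print(pad(["ab"], 6, "*", side="both"))
```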
- replace(pattern: str, replacement: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Replace occurrences of a substring.
- replace_regex(pattern: str, replacement: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Replace occurrences matching a regex pattern.
- replace_slice(start: int, stop: int, replacement: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Replace a slice with a string.
- rstrip(characters: str | None = None) UDFExpr[source]#
Remove trailing whitespace or specified characters.
- Parameters:
characters – Characters to remove. If None, removes whitespace.
- Returns:
UDFExpr that strips characters from the right.
- split_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Split strings by a regex pattern.
- starts_with(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Check if strings start with a pattern.
- strip(characters: str | None = None) UDFExpr[source]#
Remove leading and trailing whitespace or specified characters.
- Parameters:
characters – Characters to remove. If None, removes whitespace.
- Returns:
UDFExpr that strips characters from both ends.
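The strip family mirrors Python's str.strip, str.lstrip, and str.rstrip, with None meaning "strip whitespace"; a minimal per-row sketch (not the xpark implementation):

```python
# Illustrative sketch: per-row behavior of strip / lstrip with an optional
# explicit character set.
values = ["  xx-data-xx  "]
print([s.strip() for s in values])       # whitespace from both ends
print([s.strip(" x-") for s in values])  # strip any of " ", "x", "-"
print([s.lstrip(" x") for s in values])  # left side only
```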
- word_count(tokenizer: str = 'cjk') UDFExpr[source]#
Count words in texts using the specified tokenizer.
- Parameters:
tokenizer – The tokenizer type to use for word segmentation. Defaults to “cjk” for Chinese-Japanese-Korean text processing.
Examples
from xpark.dataset import from_items
from xpark.dataset.expressions import col

ds = from_items([{"text": "Hello world"}, {"text": "This is a test"}])
ds = ds.with_column(
    "word_count",
    col("text").str.word_count(tokenizer="cjk"),
)
print(ds.take_all())
- class xpark.dataset.namespace_expressions.struct_namespace._StructNamespace(_expr: Expr)[source]#
Namespace for struct operations on expression columns.
This namespace provides methods for operating on struct-typed columns using PyArrow compute functions.
Example
>>> from xpark.dataset.expressions import col
>>> # Access a field using method
>>> expr = col("user_record").struct.field("age")
>>> # Access a field using bracket notation
>>> expr = col("user_record").struct["age"]
>>> # Access nested field
>>> expr = col("user_record").struct["address"].struct["city"]
- class xpark.dataset.namespace_expressions.datetime_namespace._DatetimeNamespace(_expr: Expr)[source]#
Namespace for datetime operations on expression columns.
This namespace provides methods for operating on datetime-typed columns using PyArrow compute functions.
Example
>>> from xpark.dataset.expressions import col
>>> # Extract year component
>>> expr = col("datetime").dt.year
>>> # Extract month component
>>> expr = col("datetime").dt.month
>>> # Extract day component
>>> expr = col("datetime").dt.day
>>> # Extract hour component
>>> expr = col("datetime").dt.hour
>>> # Extract minute component
>>> expr = col("datetime").dt.minute
>>> # Extract second component
>>> expr = col("datetime").dt.second
- floor(unit: TemporalUnit) UDFExpr[source]#
Floor timestamps to the previous multiple of the given unit.
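Flooring rounds a timestamp down to the previous whole multiple of the unit; a minimal sketch using the standard library (the unit names "day" and "hour" are assumptions here, see TemporalUnit for the supported values):

```python
from datetime import datetime

# Illustrative sketch (not the xpark implementation): floor a timestamp
# to the start of its day or hour by zeroing the smaller components.
def floor_ts(ts: datetime, unit: str) -> datetime:
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")

ts = datetime(2024, 5, 17, 13, 45, 30)
print(floor_ts(ts, "hour"))  # 2024-05-17 13:00:00
print(floor_ts(ts, "day"))   # 2024-05-17 00:00:00
```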
Aggregation#
Provides an interface to implement efficient aggregations to be applied to the dataset.
Defines count aggregation.
Defines sum aggregation.
Defines mean (average) aggregation.
Defines max aggregation.
Defines min aggregation.
Defines standard deviation aggregation.
Defines absolute max aggregation.
Defines quantile aggregation.
Defines unique aggregation.
Calculates the percentage of null values in a column.
Calculates the percentage of zero values in a numeric column.
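The null-percentage aggregation, for example, reduces a column to the share of missing values; a minimal sketch of the semantics (not the xpark implementation):

```python
# Illustrative sketch: percentage of None values in a column.
def null_percentage(column):
    return 100.0 * sum(v is None for v in column) / len(column)

print(null_percentage([1, None, 3, None]))  # 50.0
```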