Dataset API#
UDF#
Decorator to convert a UDF into an expression-compatible function.
Dataset#
Construct a Dataset.
Dataset Context#
Grouped Dataset#
Represents a grouped dataset created by a group-by operation on a dataset.
Read API#
Read a Hugging Face Dataset into a Xpark Dataset.
Expressions#
Decorator to convert a UDF into an expression-compatible function.
References all input columns from the input.
Reference an existing column by name.
Create a literal expression from a constant value.
Create a download expression that downloads content from URIs.
Expression namespaces#
These namespace classes provide specialized operations for list, string, struct, and datetime columns.
You access them through properties on expressions: .list, .str, .struct, and .dt.
The following example shows how to use the string namespace to transform text columns:
from xpark.dataset import from_items
from xpark.dataset.expressions import col
# Create a dataset with a text column
ds = from_items([
    {"name": "alice"},
    {"name": "bob"},
    {"name": "charlie"},
])
# Use the string namespace to uppercase the names
ds = ds.with_column("upper_name", col("name").str.upper())
ds.show()
{'name': 'alice', 'upper_name': 'ALICE'}
{'name': 'bob', 'upper_name': 'BOB'}
{'name': 'charlie', 'upper_name': 'CHARLIE'}
The following example demonstrates using the list namespace to work with array columns:
from xpark.dataset import from_items
from xpark.dataset.expressions import col
# Create a dataset with list columns
ds = from_items([
    {"scores": [85, 90, 78]},
    {"scores": [92, 88]},
    {"scores": [76, 82, 88, 91]},
])
# Use the list namespace to get the length of each list
ds = ds.with_column("num_scores", col("scores").list.len())
ds.show()
{'scores': [85, 90, 78], 'num_scores': 3}
{'scores': [92, 88], 'num_scores': 2}
{'scores': [76, 82, 88, 91], 'num_scores': 4}
The following example shows how to use the struct namespace to access nested fields:
from xpark.dataset import from_items
from xpark.dataset.expressions import col
# Create a dataset with struct columns
ds = from_items([
    {"user": {"name": "alice", "age": 25}},
    {"user": {"name": "bob", "age": 30}},
    {"user": {"name": "charlie", "age": 35}},
])
# Use the struct namespace to extract a specific field
ds = ds.with_column("user_name", col("user").struct.field("name"))
ds.show()
{'user': {'name': 'alice', 'age': 25}, 'user_name': 'alice'}
{'user': {'name': 'bob', 'age': 30}, 'user_name': 'bob'}
{'user': {'name': 'charlie', 'age': 35}, 'user_name': 'charlie'}
- class xpark.dataset.namespace_expressions.list_namespace._ListNamespace(_expr: Expr)[source]#
Namespace for list operations on expression columns.
This namespace provides methods for operating on list-typed columns using PyArrow compute functions.
Example
>>> from xpark.dataset.expressions import col
>>> # Get length of list column
>>> expr = col("items").list.len()
>>> # Get first item using method
>>> expr = col("items").list.get(0)
>>> # Get first item using indexing
>>> expr = col("items").list[0]
>>> # Slice list
>>> expr = col("items").list[1:3]
- get(index: int) UDFExpr[source]#
Get element at the specified index from each list.
- Parameters:
index – The index of the element to retrieve. Negative indices are supported.
- Returns:
UDFExpr that extracts the element at the given index.
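As a minimal sketch (not the xpark implementation), the per-row semantics of list.get, including negative-index support, can be illustrated in plain Python:

```python
# Illustrative sketch: extract the element at `index` from each
# list-valued row; negative indices count from the end, as in Python.
def list_get(rows, index):
    return [row[index] for row in rows]

scores = [[85, 90, 78], [92, 88], [76, 82, 88, 91]]
print(list_get(scores, 0))   # first element of each list
print(list_get(scores, -1))  # last element of each list
```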
- slice(start: int | None = None, stop: int | None = None, step: int | None = None) UDFExpr[source]#
Slice each list.
- Parameters:
start – Start index (inclusive). Defaults to 0.
stop – Stop index (exclusive). Defaults to list length.
step – Step size. Defaults to 1.
- Returns:
UDFExpr that extracts a slice from each list.
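The slice parameters follow Python slicing conventions with the defaults documented above; a minimal sketch (not the xpark implementation) of the per-row behavior:

```python
# Illustrative sketch: apply slice(start, stop, step) to each list-valued
# row, with None meaning "use the default" for that parameter.
def list_slice(rows, start=None, stop=None, step=None):
    return [row[slice(start, stop, step)] for row in rows]

scores = [[85, 90, 78], [76, 82, 88, 91]]
print(list_slice(scores, 1, 3))    # elements at indices 1 and 2
print(list_slice(scores, step=2))  # every other element
```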
- class xpark.dataset.namespace_expressions.string_namespace._StringNamespace(_expr: Expr)[source]#
Namespace for string operations on expression columns.
This namespace provides methods for operating on string-typed columns using PyArrow compute functions.
Example
>>> from xpark.dataset.expressions import col
>>> # Convert to uppercase
>>> expr = col("name").str.upper()
>>> # Get string length
>>> expr = col("name").str.len()
>>> # Check if string starts with a prefix
>>> expr = col("name").str.starts_with("A")
- center(width: int, padding: str = ' ', *args: Any, **kwargs: Any) UDFExpr[source]#
Center strings in a field of given width.
- contains(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Check if strings contain a substring.
- count_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Count occurrences matching a regex pattern.
- ends_with(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Check if strings end with a pattern.
- extract(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Extract a substring matching a regex pattern.
- find(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Find the first occurrence of a substring.
- find_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Find the first occurrence matching a regex pattern.
- lstrip(characters: str | None = None) UDFExpr[source]#
Remove leading whitespace or specified characters.
- Parameters:
characters – Characters to remove. If None, removes whitespace.
- Returns:
UDFExpr that strips characters from the left.
- match(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Match strings against a SQL LIKE pattern.
- match_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Check if strings match a regex pattern.
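The distinction between the two: match uses SQL LIKE patterns ("%" matches any run of characters, "_" matches exactly one), while match_regex takes a regular expression directly. A hedged pure-Python sketch of LIKE semantics (not the xpark implementation):

```python
import re

# Illustrative sketch: translate a SQL LIKE pattern to an anchored regex,
# then match each string in full.
def like_to_regex(pattern: str) -> str:
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # "%" matches any run of characters
        elif ch == "_":
            parts.append(".")    # "_" matches exactly one character
        else:
            parts.append(re.escape(ch))
    return "^" + "".join(parts) + "$"

def sql_like_match(strings, pattern):
    rx = re.compile(like_to_regex(pattern))
    return [bool(rx.match(s)) for s in strings]

names = ["alice", "alina", "bob"]
print(sql_like_match(names, "ali%"))   # LIKE prefix match
print(sql_like_match(names, "al_ce"))  # "_" matches one character
```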
- pad(width: int, fillchar: str = ' ', side: Literal['left', 'right', 'both'] = 'right') UDFExpr[source]#
Pad strings to a specified width.
- Parameters:
width – Target width.
fillchar – Character to use for padding.
side – “left”, “right”, or “both” for padding side.
- Returns:
UDFExpr that pads strings.
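Assuming the three padding sides behave like Python's built-in string methods (an assumption, not confirmed by the signature alone), the semantics can be sketched as:

```python
# Illustrative sketch: "left" pads on the left (right-justifies), "right"
# pads on the right (left-justifies), "both" centers within the width.
def pad(strings, width, fillchar=" ", side="right"):
    fn = {"left": str.rjust, "right": str.ljust, "both": str.center}[side]
    return [fn(s, width, fillchar) for s in strings]

print(pad(["ab", "abcd"], 6, "*", side="left"))
print(pad(["ab"], 6, "*", side="both"))
```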
- replace(pattern: str, replacement: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Replace occurrences of a substring.
- replace_regex(pattern: str, replacement: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Replace occurrences matching a regex pattern.
- replace_slice(start: int, stop: int, replacement: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Replace a slice with a string.
- rstrip(characters: str | None = None) UDFExpr[source]#
Remove trailing whitespace or specified characters.
- Parameters:
characters – Characters to remove. If None, removes whitespace.
- Returns:
UDFExpr that strips characters from the right.
- split_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Split strings by a regex pattern.
- starts_with(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Check if strings start with a pattern.
- strip(characters: str | None = None) UDFExpr[source]#
Remove leading and trailing whitespace or specified characters.
- Parameters:
characters – Characters to remove. If None, removes whitespace.
- Returns:
UDFExpr that strips characters from both ends.
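The strip family mirrors Python's str.strip, str.lstrip, and str.rstrip, with None meaning "strip whitespace"; a minimal per-row sketch (not the xpark implementation):

```python
# Illustrative sketch: per-row behavior of strip / lstrip with an optional
# explicit character set.
values = ["  xx-data-xx  "]
print([s.strip() for s in values])       # whitespace from both ends
print([s.strip(" x-") for s in values])  # strip any of " ", "x", "-"
print([s.lstrip(" x") for s in values])  # left side only
```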
- word_count(tokenizer: str = 'cjk') UDFExpr[source]#
Count words in texts using the specified tokenizer.
- Parameters:
tokenizer – The tokenizer type to use for word segmentation. Defaults to “cjk” for Chinese-Japanese-Korean text processing.
Examples
from xpark.dataset import from_items
from xpark.dataset.expressions import col

ds = from_items([{"text": "Hello world"}, {"text": "This is a test"}])
ds = ds.with_column(
    "word_count",
    col("text").str.word_count(tokenizer="cjk"),
)
print(ds.take_all())
- class xpark.dataset.namespace_expressions.struct_namespace._StructNamespace(_expr: Expr)[source]#
Namespace for struct operations on expression columns.
This namespace provides methods for operating on struct-typed columns using PyArrow compute functions.
Example
>>> from xpark.dataset.expressions import col
>>> # Access a field using method
>>> expr = col("user_record").struct.field("age")
>>> # Access a field using bracket notation
>>> expr = col("user_record").struct["age"]
>>> # Access nested field
>>> expr = col("user_record").struct["address"].struct["city"]
- class xpark.dataset.namespace_expressions.datetime_namespace._DatetimeNamespace(_expr: Expr)[source]#
Namespace for datetime operations on expression columns.
This namespace provides methods for operating on datetime-typed columns using PyArrow compute functions.
Example
>>> from xpark.dataset.expressions import col
>>> # Extract year component
>>> expr = col("datetime").dt.year
>>> # Extract month component
>>> expr = col("datetime").dt.month
>>> # Extract day component
>>> expr = col("datetime").dt.day
>>> # Extract hour component
>>> expr = col("datetime").dt.hour
>>> # Extract minute component
>>> expr = col("datetime").dt.minute
>>> # Extract second component
>>> expr = col("datetime").dt.second
- floor(unit: TemporalUnit) UDFExpr[source]#
Floor timestamps to the previous multiple of the given unit.
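Flooring rounds a timestamp down to the previous whole multiple of the unit; a minimal sketch using the standard library (the unit names "day" and "hour" are assumptions here, see TemporalUnit for the supported values):

```python
from datetime import datetime

# Illustrative sketch (not the xpark implementation): floor a timestamp
# to the start of its day or hour by zeroing the smaller components.
def floor_ts(ts: datetime, unit: str) -> datetime:
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")

ts = datetime(2024, 5, 17, 13, 45, 30)
print(floor_ts(ts, "hour"))  # 2024-05-17 13:00:00
print(floor_ts(ts, "day"))   # 2024-05-17 00:00:00
```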
Aggregation#
Provides an interface to implement efficient aggregations to be applied to the dataset.
Defines count aggregation.
Defines sum aggregation.
Defines mean (average) aggregation.
Defines max aggregation.
Defines min aggregation.
Defines standard deviation aggregation.
Defines absolute max aggregation.
Defines quantile aggregation.
Defines unique aggregation.
Calculates the percentage of null values in a column.
Calculates the percentage of zero values in a numeric column.
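The null-percentage aggregation, for example, reduces a column to the share of missing values; a minimal sketch of the semantics (not the xpark implementation):

```python
# Illustrative sketch: percentage of None values in a column.
def null_percentage(column):
    return 100.0 * sum(v is None for v in column) / len(column)

print(null_percentage([1, None, 3, None]))  # 50.0
```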