Dataset API#
UDF#
|
Decorator to convert a UDF into an expression-compatible function. |
Dataset#
|
Construct a |
Dataset Context#
Grouped Dataset#
|
Represents a grouped dataset created by calling |
Read API#
|
Create a |
|
Create a |
|
Read a Hugging Face Dataset into an Xpark Dataset. |
|
Create a |
|
Creates a |
|
Creates a |
|
Create a |
|
Creates a |
|
Creates a |
|
Creates a |
|
Creates a |
|
Creates a |
|
Creates a |
|
Create a |
|
Create a |
|
Creates a |
|
Create a |
Consuming API#
|
Return an iterable over batches of data. |
|
Print up to the given number of rows from the |
|
Return up to |
|
Return all of the rows in this |
|
Return up to |
I/O and Conversion API#
|
Convert this |
|
Writes the |
|
Writes the |
|
Writes a column of the |
|
Writes the |
|
Writes the |
|
Write the dataset to a Lance dataset. |
Expressions#
|
Decorator to convert a UDF into an expression-compatible function. |
References all input columns from the input. |
|
Reference an existing column by name. |
|
Create a literal expression from a constant value. |
|
|
Create a download expression that downloads content from URIs. |
Expression namespaces#
These namespace classes provide specialized operations for list, string, struct, datetime, and array columns.
You access them through properties on expressions: .list, .str, .struct, .dt, and .arr.
The following example shows how to use the string namespace to transform text columns:
from xpark.dataset import from_items
from xpark.dataset.expressions import col
# Create a dataset with a text column
ds = from_items([
{"name": "alice"},
{"name": "bob"},
{"name": "charlie"}
])
# Use the string namespace to uppercase the names
ds = ds.with_column("upper_name", col("name").str.upper())
ds.show()
{'name': 'alice', 'upper_name': 'ALICE'}
{'name': 'bob', 'upper_name': 'BOB'}
{'name': 'charlie', 'upper_name': 'CHARLIE'}
The following example demonstrates using the list namespace to work with array columns:
from xpark.dataset import from_items
from xpark.dataset.expressions import col
# Create a dataset with list columns
ds = from_items([
{"scores": [85, 90, 78]},
{"scores": [92, 88]},
{"scores": [76, 82, 88, 91]}
])
# Use the list namespace to get the length of each list
ds = ds.with_column("num_scores", col("scores").list.len())
ds.show()
{'scores': [85, 90, 78], 'num_scores': 3}
{'scores': [92, 88], 'num_scores': 2}
{'scores': [76, 82, 88, 91], 'num_scores': 4}
The following example shows how to use the struct namespace to access nested fields:
from xpark.dataset import from_items
from xpark.dataset.expressions import col
# Create a dataset with struct columns
ds = from_items([
{"user": {"name": "alice", "age": 25}},
{"user": {"name": "bob", "age": 30}},
{"user": {"name": "charlie", "age": 35}}
])
# Use the struct namespace to extract a specific field
ds = ds.with_column("user_name", col("user").struct.field("name"))
ds.show()
{'user': {'name': 'alice', 'age': 25}, 'user_name': 'alice'}
{'user': {'name': 'bob', 'age': 30}, 'user_name': 'bob'}
{'user': {'name': 'charlie', 'age': 35}, 'user_name': 'charlie'}
- class xpark.dataset.namespace_expressions.list_namespace._ListNamespace(_expr: Expr)[source]#
Namespace for list operations on expression columns.
This namespace provides methods for operating on list-typed columns using PyArrow compute functions.
Example
>>> from xpark.dataset.expressions import col
>>> # Get length of list column
>>> expr = col("item").list.len()
>>> # Get first item using method
>>> expr = col("item").list.get(0)
>>> # Get first item using indexing
>>> expr = col("item").list[0]
>>> # Slice list
>>> expr = col("item").list[1:3]
- get(index: int) UDFExpr[source]#
Get element at the specified index from each list.
- Parameters:
index – The index of the element to retrieve. Negative indices are supported.
- Returns:
UDFExpr that extracts the element at the given index.
- slice(start: int | None = None, stop: int | None = None, step: int | None = None) UDFExpr[source]#
Slice each list.
- Parameters:
start – Start index (inclusive). Defaults to 0.
stop – Stop index (exclusive). Defaults to list length.
step – Step size. Defaults to 1.
- Returns:
UDFExpr that extracts a slice from each list.
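The indexing conventions described for get and slice mirror ordinary Python indexing. The following is a minimal pure-Python sketch of the per-row behavior, not Xpark's implementation (which is documented as using PyArrow compute functions). The helper names `get_each` and `slice_each` are hypothetical, and the sketch does not cover what happens when an index is out of range.

```python
from typing import Any, Optional


def get_each(rows: list, index: int) -> list:
    """Pick one element from every list; negative indices count from the end."""
    return [row[index] for row in rows]


def slice_each(
    rows: list,
    start: Optional[int] = None,
    stop: Optional[int] = None,
    step: Optional[int] = None,
) -> list:
    """Slice every list; None falls back to 0, the list length, and 1."""
    return [row[start:stop:step] for row in rows]


rows = [[85, 90, 78], [92, 88], [76, 82, 88, 91]]
first = get_each(rows, 0)
last = get_each(rows, -1)
middle = slice_each(rows, 1, 3)
everything = slice_each(rows)
```

Here `middle` corresponds to what `col("scores").list[1:3]` is described as selecting, under the assumptions above.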
- sort(order: Literal['ascending', 'descending'] = 'ascending', null_placement: Literal['at_start', 'at_end'] = 'at_end') UDFExpr[source]#
Sort the elements within each (nested) list.
- Parameters:
order – Sorting order, must be "ascending" or "descending".
null_placement – Placement for null values, "at_start" or "at_end".
- Returns:
UDFExpr providing the sorted lists.
Example
>>> from xpark.dataset.expressions import col
>>> # [[3,1],[2,None]] -> [[1,3],[2,None]]
>>> expr = col("items").list.sort()
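The interaction between order and null_placement can be illustrated without the library. This is a pure-Python sketch of the documented behavior, with `sort_each` as a hypothetical helper name. It assumes nulls are placed independently of the sort direction, which matches the example above but is not confirmed for every combination.

```python
def sort_each(rows, order="ascending", null_placement="at_end"):
    """Sort the non-null values of each list, then attach the nulls."""
    result = []
    for row in rows:
        values = sorted(
            (v for v in row if v is not None),
            reverse=(order == "descending"),
        )
        nulls = [None] * (len(row) - len(values))
        if null_placement == "at_start":
            result.append(nulls + values)
        else:
            result.append(values + nulls)
    return result


rows = [[3, 1], [2, None]]
default_sorted = sort_each(rows)
reversed_nulls_first = sort_each(rows, "descending", "at_start")
```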
- class xpark.dataset.namespace_expressions.string_namespace._StringNamespace(_expr: Expr)[source]#
Namespace for string operations on expression columns.
This namespace provides methods for operating on string-typed columns using PyArrow compute functions.
Example
>>> from xpark.dataset.expressions import col
>>> # Convert to uppercase
>>> expr = col("name").str.upper()
>>> # Get string length
>>> expr = col("name").str.len()
>>> # Check if string starts with a prefix
>>> expr = col("name").str.starts_with("A")
- alpha_count() UDFExpr[source]#
Count the number of alphabetic characters in each text.
Uses str.isalpha() to identify characters that are letters (including Unicode alphabetic characters), excluding digits and other symbols.
Examples
from xpark.dataset import from_items
from xpark.dataset.expressions import col
# Each item is a dict so that the rows have a "text" column
ds = from_items([{"text": "Hello, world! 123"}, {"text": "abc"}])
ds = ds.with_column(
    "alpha_count",
    col("text").str.alpha_count(),
)
print(ds.take_all())
- alpha_number_count() UDFExpr[source]#
Count the number of alphanumeric characters in each text.
Uses str.isalnum() to identify characters that are either letters or digits (including Unicode alphanumeric characters).
Examples
from xpark.dataset import from_items
from xpark.dataset.expressions import col
# Each item is a dict so that the rows have a "text" column
ds = from_items([{"text": "Hello, world! 123"}, {"text": "abc"}])
ds = ds.with_column(
    "alpha_number_count",
    col("text").str.alpha_number_count(),
)
print(ds.take_all())
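Because both counters are described in terms of Python's own str.isalpha() and str.isalnum(), their per-row results can be reproduced with plain Python. The sketch below is an illustration of those two predicates, not the library code. The function names simply reuse the method names for readability.

```python
def alpha_count(text: str) -> int:
    """Count characters for which str.isalpha() is true."""
    return sum(ch.isalpha() for ch in text)


def alpha_number_count(text: str) -> int:
    """Count characters for which str.isalnum() is true."""
    return sum(ch.isalnum() for ch in text)


samples = ["Hello, world! 123", "abc"]
alpha = [alpha_count(s) for s in samples]
alnum = [alpha_number_count(s) for s in samples]
```

For "Hello, world! 123" the ten letters count as alphabetic, and the three digits are added only in the alphanumeric count. Punctuation and spaces are excluded from both.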
- avg_line_length() UDFExpr[source]#
Compute the average line length for each text.
Splits each text by newlines and returns the mean character count across all lines.
Examples
from xpark.dataset import from_items
from xpark.dataset.expressions import col
# Each item is a dict so that the rows have a "text" column
ds = from_items([{"text": "Hello\nworld"}, {"text": "This is a test"}])
ds = ds.with_column(
    "avg_line_length",
    col("text").str.avg_line_length(),
)
print(ds.take_all())
- center(width: int, padding: str = ' ', *args: Any, **kwargs: Any) UDFExpr[source]#
Center strings in a field of given width.
- contains(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Check if strings contain a substring.
- count_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Count occurrences matching a regex pattern.
- ends_with(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Check if strings end with a pattern.
- extract(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Extract a substring matching a regex pattern.
- find(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Find the first occurrence of a substring.
- find_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Find the first occurrence matching a regex pattern.
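The regex-based counting and searching described above can be approximated with Python's re module. This is only a sketch of the semantics: the helper names reuse the method names, the convention of returning -1 when nothing matches is an assumption borrowed from PyArrow's find functions, and Python's regex dialect may differ from the one Xpark uses internally.

```python
import re


def count_regex(text: str, pattern: str) -> int:
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))


def find_regex(text: str, pattern: str) -> int:
    """Return the start index of the first match, or -1 (assumed convention)."""
    match = re.search(pattern, text)
    return match.start() if match else -1


digit_runs = count_regex("a1b22c333", r"\d+")
first_digit = find_regex("abc123", r"\d")
no_digit = find_regex("abc", r"\d")
```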
- lpad(width: int, padding: str = ' ', *args: Any, **kwargs: Any) UDFExpr[source]#
Right-align strings by padding with a given character while respecting width.
If the string is longer than the specified width, it remains intact (no truncation occurs).
- lstrip(characters: str | None = None) UDFExpr[source]#
Remove leading whitespace or specified characters.
- Parameters:
characters – Characters to remove. If None, removes whitespace.
- Returns:
UDFExpr that strips characters from the left.
- match(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Match strings against a SQL LIKE pattern.
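In SQL LIKE patterns, % matches any run of characters (including none) and _ matches exactly one character. Everything else is literal. A minimal sketch of that matching rule follows, written by translating the pattern into a Python regex. `like_to_regex` and `like_match` are hypothetical helpers, and escape sequences (for matching a literal % or _) are deliberately left out.

```python
import re


def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into an equivalent regex string."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # any run of characters, possibly empty
        elif ch == "_":
            parts.append(".")  # exactly one character
        else:
            parts.append(re.escape(ch))  # literal character
    return "".join(parts)


def like_match(text: str, pattern: str) -> bool:
    """Return True if the whole text matches the LIKE pattern."""
    return re.fullmatch(like_to_regex(pattern), text, flags=re.DOTALL) is not None


names = ["alice", "bob", "charlie"]
starts_with_a = [like_match(n, "a%") for n in names]
three_letters_ob = [like_match(n, "_ob") for n in names]
```

Note that the match is anchored to the whole string, so `"a%"` behaves like a prefix test rather than a substring search.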
- match_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Check if strings match a regex pattern.
- max_line_length() UDFExpr[source]#
Compute the maximum line length for each text.
Splits each text by newlines and returns the length of the longest line.
Examples
from xpark.dataset import from_items
from xpark.dataset.expressions import col
# Each item is a dict so that the rows have a "text" column
ds = from_items([{"text": "Hello\nworld"}, {"text": "This is a test"}])
ds = ds.with_column(
    "max_line_length",
    col("text").str.max_line_length(),
)
print(ds.take_all())
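The two line-length metrics in this namespace, avg_line_length and max_line_length, are both described as splitting the text by newlines first. The sketch below reproduces that description in plain Python. It assumes the split is on "\n" only, so a trailing newline contributes an empty final line. Whether the library treats trailing newlines the same way is not stated.

```python
def line_lengths(text: str) -> list:
    """Character count of every line after splitting on newlines."""
    return [len(line) for line in text.split("\n")]


def avg_line_length(text: str) -> float:
    """Mean character count across all lines."""
    lengths = line_lengths(text)
    return sum(lengths) / len(lengths)


def max_line_length(text: str) -> int:
    """Length of the longest line."""
    return max(line_lengths(text))


samples = ["Hello\nworld", "This is a test", "ab\ncdef"]
averages = [avg_line_length(s) for s in samples]
maximums = [max_line_length(s) for s in samples]
```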
- pad(width: int, fillchar: str = ' ', side: Literal['left', 'right', 'both'] = 'right') UDFExpr[source]#
Pad strings to a specified width.
- Parameters:
width – Target width.
fillchar – Character to use for padding.
side – “left”, “right”, or “both” for padding side.
- Returns:
UDFExpr that pads strings.
- replace(pattern: str, replacement: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Replace occurrences of a substring.
- replace_regex(pattern: str, replacement: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Replace occurrences matching a regex pattern.
- replace_slice(start: int, stop: int, replacement: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Replace a slice with a string.
- rpad(width: int, padding: str = ' ', *args: Any, **kwargs: Any) UDFExpr[source]#
Left-align strings by padding with a given character while respecting width.
If the string is longer than the specified width, it remains intact (no truncation occurs).
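The padding behavior described for lpad and rpad, including the no-truncation rule, matches Python's str.rjust and str.ljust. The following sketch uses those built-ins as stand-ins. It is an illustration of the documented semantics, assuming a single-character padding value, and not the library's implementation.

```python
def lpad(text: str, width: int, padding: str = " ") -> str:
    """Pad on the left so the text ends up right-aligned."""
    return text.rjust(width, padding)


def rpad(text: str, width: int, padding: str = " ") -> str:
    """Pad on the right so the text ends up left-aligned."""
    return text.ljust(width, padding)


padded_left = lpad("7", 3, "0")
padded_right = rpad("ab", 4, ".")
unchanged = lpad("toolong", 3)  # longer than width, so nothing is cut off
```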
- rstrip(characters: str | None = None) UDFExpr[source]#
Remove trailing whitespace or specified characters.
- Parameters:
characters – Characters to remove. If None, removes whitespace.
- Returns:
UDFExpr that strips characters from the right.
- special_word_count() UDFExpr[source]#
Count the number of special characters in each text.
Iterates over each character in the text and counts those that appear in the predefined SPECIAL_CHARACTERS set.
Examples
from xpark.dataset import from_items
from xpark.dataset.expressions import col
# Each item is a dict so that the rows have a "text" column
ds = from_items([{"text": "Hello, world!"}, {"text": "No specials here"}])
ds = ds.with_column(
    "special_word_count",
    col("text").str.special_word_count(),
)
print(ds.take_all())
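The counting rule itself is a simple membership test per character. The contents of the library's SPECIAL_CHARACTERS set are not listed in this reference, so the sketch below substitutes Python's string.punctuation purely as a placeholder. The real set may contain different characters, and real counts may differ accordingly.

```python
import string

# Placeholder only: the actual SPECIAL_CHARACTERS set is not documented here.
ASSUMED_SPECIAL_CHARACTERS = set(string.punctuation)


def special_char_count(text: str, specials=ASSUMED_SPECIAL_CHARACTERS) -> int:
    """Count the characters of text that belong to the given set."""
    return sum(ch in specials for ch in text)


counts = [special_char_count(s) for s in ["Hello, world!", "No specials here"]]
```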
- split_regex(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Split strings by a regex pattern.
- starts_with(pattern: str, *args: Any, **kwargs: Any) UDFExpr[source]#
Check if strings start with a pattern.
- strip(characters: str | None = None) UDFExpr[source]#
Remove leading and trailing whitespace or specified characters.
- Parameters:
characters – Characters to remove. If None, removes whitespace.
- Returns:
UDFExpr that strips characters from both ends.
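The three stripping methods follow the same pattern as Python's str.lstrip, str.rstrip, and str.strip. The sketch below uses those built-ins to show the difference between the default (whitespace) and an explicit character argument. It assumes, as in Python, that the argument is treated as a set of individual characters rather than as a prefix or suffix string.

```python
def strip_each(values, characters=None, side="both"):
    """Strip each string on the chosen side; None means whitespace."""
    method = {"left": str.lstrip, "right": str.rstrip, "both": str.strip}[side]
    return [method(v, characters) for v in values]


values = ["  hi  ", "xxhixy"]
both_default = strip_each(values)
both_xy = strip_each(values, "xy")
left_xy = strip_each(values, "xy", side="left")
```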
- word_count(tokenizer: str = 'cjk') UDFExpr[source]#
Count words in texts using the specified tokenizer.
- Parameters:
tokenizer – The tokenizer type to use for word segmentation. Defaults to “cjk” for Chinese-Japanese-Korean text processing.
Examples
from xpark.dataset import from_items
from xpark.dataset.expressions import col
# Each item is a dict so that the rows have a "text" column
ds = from_items([{"text": "Hello world"}, {"text": "This is a test"}])
ds = ds.with_column(
    "word_count",
    col("text").str.word_count(tokenizer="cjk"),
)
print(ds.take_all())
- class xpark.dataset.namespace_expressions.struct_namespace._StructNamespace(_expr: Expr)[source]#
Namespace for struct operations on expression columns.
This namespace provides methods for operating on struct-typed columns using PyArrow compute functions.
Example
>>> from xpark.dataset.expressions import col
>>> # Access a field using method
>>> expr = col("user_record").struct.field("age")
>>> # Access a field using bracket notation
>>> expr = col("user_record").struct["age"]
>>> # Access nested field
>>> expr = col("user_record").struct["address"].struct["city"]
- class xpark.dataset.namespace_expressions.datetime_namespace._DatetimeNamespace(_expr: Expr)[source]#
Namespace for datetime operations on expression columns.
This namespace provides methods for operating on datetime-typed columns using PyArrow compute functions.
Example
>>> from xpark.dataset.expressions import col
>>> # Extract year component
>>> expr = col("datetime").dt.year
>>> # Extract month component
>>> expr = col("datetime").dt.month
>>> # Extract day component
>>> expr = col("datetime").dt.day
>>> # Extract hour component
>>> expr = col("datetime").dt.hour
>>> # Extract minute component
>>> expr = col("datetime").dt.minute
>>> # Extract second component
>>> expr = col("datetime").dt.second
- floor(unit: TemporalUnit) UDFExpr[source]#
Floor timestamps to the previous multiple of the given unit.
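Flooring a timestamp means discarding everything finer than the chosen unit. The sketch below shows the idea with the standard datetime module. The unit strings "day", "hour", and "minute" are illustrative assumptions, since the accepted TemporalUnit values are not listed in this reference, and `floor_timestamp` is a hypothetical helper.

```python
from datetime import datetime


def floor_timestamp(ts: datetime, unit: str) -> datetime:
    """Round ts down to the start of the given unit (assumed unit names)."""
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if unit == "minute":
        return ts.replace(second=0, microsecond=0)
    raise ValueError(f"unsupported unit in this sketch: {unit!r}")


ts = datetime(2024, 5, 17, 13, 45, 30, 123456)
floored_hour = floor_timestamp(ts, "hour")
floored_day = floor_timestamp(ts, "day")
```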
- class xpark.dataset.namespace_expressions.array_namespace._ArrayNamespace(_expr: Expr)[source]#
Namespace for array operations on expression columns.
Example
>>> from xpark.dataset.expressions import col
>>> # Convert fixed-size lists to variable-length lists
>>> expr = col("features").arr.to_list()
Aggregation#
|
Provides an interface to implement efficient aggregations to be applied to the dataset. |
|
Defines count aggregation. |
|
Defines sum aggregation. |
|
Defines mean (average) aggregation. |
|
Defines max aggregation. |
|
Defines min aggregation. |
|
Defines standard deviation aggregation. |
|
Defines absolute max aggregation. |
|
Defines Quantile aggregation. |
|
Defines unique aggregation. |
|
Calculates the percentage of null values in a column. |
|
Calculates the percentage of zero values in a numeric column. |
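To make a few of these aggregations concrete, here is a pure-Python sketch of what the null-percentage, zero-percentage, and absolute-max summaries compute over a single column. The function names are hypothetical. The sketch also assumes that percentages are on a 0 to 100 scale and that nulls are skipped when counting zeros and taking the absolute max. None of these details are confirmed by the summaries above.

```python
def missing_value_percentage(values):
    """Share of None entries, as a percentage of all entries."""
    return 100.0 * sum(v is None for v in values) / len(values)


def zero_percentage(values):
    """Share of zeros among the non-null entries (assumed null handling)."""
    non_null = [v for v in values if v is not None]
    return 100.0 * sum(v == 0 for v in non_null) / len(non_null)


def abs_max(values):
    """Largest absolute value among the non-null entries."""
    return max(abs(v) for v in values if v is not None)


column = [0, 2, None, 0, -5]
null_pct = missing_value_percentage(column)
zero_pct = zero_percentage(column)
largest = abs_max(column)
```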