.. _reference_index:

=====================
Dataset API
=====================


UDF
---
.. autosummary::
    :toctree: generated/

    xpark.dataset.udf


Dataset
-------
.. autosummary::
    :toctree: generated/

    xpark.dataset.Dataset


Dataset Context
---------------
.. autosummary::
    :toctree: generated/

    xpark.dataset.DatasetContext


Grouped Dataset
---------------
.. autosummary::
    :toctree: generated/

    xpark.dataset.GroupedData


.. _read-api:

Read API
--------
.. autosummary::
    :toctree: generated/

    xpark.dataset.from_arrow
    xpark.dataset.from_blocks
    xpark.dataset.from_huggingface
    xpark.dataset.from_items
    xpark.dataset.from_range
    xpark.dataset.from_range_tensor
    xpark.dataset.from_pandas
    xpark.dataset.from_numpy
    xpark.dataset.read_json
    xpark.dataset.read_audio
    xpark.dataset.read_video
    xpark.dataset.read_image
    xpark.dataset.read_parquet
    xpark.dataset.read_iceberg
    xpark.dataset.read_lance
    xpark.dataset.read_lerobot
    xpark.dataset.read_files


Expressions
-----------
.. autosummary::
    :toctree: generated/

    xpark.dataset.expressions.ExprUDFOptions
    xpark.dataset.expressions.udf
    xpark.dataset.expressions.star
    xpark.dataset.expressions.col
    xpark.dataset.expressions.lit
    xpark.dataset.expressions.download


Expression Namespaces
---------------------

These namespace classes provide specialized operations for list, string, struct, and datetime columns.
You access them through properties on an expression, such as ``.list``, ``.str``, and ``.struct``.

The following example shows how to use the string namespace to transform text columns:

.. testcode::

    from xpark.dataset import from_items
    from xpark.dataset.expressions import col

    # Create a dataset with a text column
    ds = from_items([
        {"name": "alice"},
        {"name": "bob"},
        {"name": "charlie"}
    ])

    # Use the string namespace to uppercase the names
    ds = ds.with_column("upper_name", col("name").str.upper())
    ds.show()

.. testoutput::

    {'name': 'alice', 'upper_name': 'ALICE'}
    {'name': 'bob', 'upper_name': 'BOB'}
    {'name': 'charlie', 'upper_name': 'CHARLIE'}

The following example shows how to use the list namespace to work with array columns:

.. testcode::

    from xpark.dataset import from_items
    from xpark.dataset.expressions import col

    # Create a dataset with a list column
    ds = from_items([
        {"scores": [85, 90, 78]},
        {"scores": [92, 88]},
        {"scores": [76, 82, 88, 91]}
    ])

    # Use the list namespace to get the length of each list
    ds = ds.with_column("num_scores", col("scores").list.len())
    ds.show()

.. testoutput::

    {'scores': [85, 90, 78], 'num_scores': 3}
    {'scores': [92, 88], 'num_scores': 2}
    {'scores': [76, 82, 88, 91], 'num_scores': 4}

The following example shows how to use the struct namespace to access nested fields:

.. testcode::

    from xpark.dataset import from_items
    from xpark.dataset.expressions import col

    # Create a dataset with a struct column
    ds = from_items([
        {"user": {"name": "alice", "age": 25}},
        {"user": {"name": "bob", "age": 30}},
        {"user": {"name": "charlie", "age": 35}}
    ])

    # Use the struct namespace to extract a specific field
    ds = ds.with_column("user_name", col("user").struct.field("name"))
    ds.show()

.. testoutput::

    {'user': {'name': 'alice', 'age': 25}, 'user_name': 'alice'}
    {'user': {'name': 'bob', 'age': 30}, 'user_name': 'bob'}
    {'user': {'name': 'charlie', 'age': 35}, 'user_name': 'charlie'}
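
The three examples above follow one pattern: a property on the expression returns a small
namespace object, and each namespace method returns a new expression. The pure-Python sketch
below illustrates that pattern only. It is not xpark's implementation: ``Expr``, ``evaluate``,
and the namespace classes here are hypothetical stand-ins, and storing the parent expression in
``_expr`` is an assumption suggested by the ``_expr`` member excluded in the class listings.

.. code-block:: python

    from typing import Any, Callable, Dict

    Row = Dict[str, Any]


    class Expr:
        """Hypothetical expression: wraps a function from a row to a value."""

        def __init__(self, fn: Callable[[Row], Any]) -> None:
            self._fn = fn

        def evaluate(self, row: Row) -> Any:
            return self._fn(row)

        # Each property hands back a namespace bound to this expression.
        @property
        def str(self) -> "StringNamespace":
            return StringNamespace(self)

        @property
        def list(self) -> "ListNamespace":
            return ListNamespace(self)

        @property
        def struct(self) -> "StructNamespace":
            return StructNamespace(self)


    class _Namespace:
        """Holds the parent expression (assumed attribute name: _expr)."""

        def __init__(self, expr: Expr) -> None:
            self._expr = expr


    class StringNamespace(_Namespace):
        def upper(self) -> Expr:
            return Expr(lambda row: self._expr.evaluate(row).upper())


    class ListNamespace(_Namespace):
        def len(self) -> Expr:
            return Expr(lambda row: len(self._expr.evaluate(row)))


    class StructNamespace(_Namespace):
        def field(self, name: str) -> Expr:
            return Expr(lambda row: self._expr.evaluate(row)[name])


    def col(name: str) -> Expr:
        """Hypothetical column reference: looks the column up in a row."""
        return Expr(lambda row: row[name])


    row = {"name": "alice", "scores": [85, 90, 78], "user": {"name": "bob", "age": 30}}

    upper_name = col("name").str.upper().evaluate(row)
    num_scores = col("scores").list.len().evaluate(row)
    # Namespace methods return expressions, so calls can chain across namespaces.
    chained = col("user").struct.field("name").str.upper().evaluate(row)

Because every namespace method returns an expression in this sketch, calls chain across
namespaces, as in the last line. Whether a particular chain is valid in xpark depends on the
column types involved.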

.. autoclass:: xpark.dataset.namespace_expressions.list_namespace._ListNamespace
    :members:
    :inherited-members:
    :exclude-members: _expr

.. autoclass:: xpark.dataset.namespace_expressions.string_namespace._StringNamespace
    :members:
    :inherited-members:
    :exclude-members: _expr

.. autoclass:: xpark.dataset.namespace_expressions.struct_namespace._StructNamespace
    :members:
    :inherited-members:
    :exclude-members: _expr

.. autoclass:: xpark.dataset.namespace_expressions.datetime_namespace._DatetimeNamespace
    :members:
    :inherited-members:
    :exclude-members: _expr

Aggregation
-----------
.. autosummary::
    :toctree: generated/

    xpark.dataset.aggregate.AggregateFnV2
    xpark.dataset.aggregate.Count
    xpark.dataset.aggregate.Sum
    xpark.dataset.aggregate.Mean
    xpark.dataset.aggregate.Max
    xpark.dataset.aggregate.Min
    xpark.dataset.aggregate.Std
    xpark.dataset.aggregate.AbsMax
    xpark.dataset.aggregate.Quantile
    xpark.dataset.aggregate.Unique
    xpark.dataset.aggregate.MissingValuePercentage
    xpark.dataset.aggregate.ZeroPercentage