xpark.dataset.read_lerobot#
- xpark.dataset.read_lerobot(path: str, *, episodes: list[int] | None = None, columns: list[str] | None = None, include_video_paths: bool = True, decode_video: bool = False, parallelism: int = -1, num_cpus: float | None = None, num_gpus: float | None = None, memory: float | None = None, ray_remote_args: dict[str, Any] | None = None, concurrency: int | None = None, override_num_blocks: int | None = None) → Dataset[source]#
Creates a Dataset from a LeRobot format dataset.

LeRobot is a framework for robot learning that stores datasets in a specific format: Parquet files for tabular data and MP4 videos for visual observations. This function reads LeRobot datasets from either the HuggingFace Hub or local storage.
- Dataset Structure:
LeRobot datasets are structured as:
dataset_root/
├── meta/
│   ├── info.json        # Dataset metadata (fps, features, etc.)
│   ├── stats.json       # Statistics for normalization
│   ├── episodes.jsonl   # Episode metadata
│   └── tasks.jsonl      # Task definitions
├── data/
│   └── chunk-000/
│       ├── episode_000000.parquet
│       └── ...
└── videos/
    └── observation.images.top/
        └── chunk-000/
            ├── episode_000000.mp4
            └── ...
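The meta files above are plain JSON and JSONL, so they can be inspected without any LeRobot tooling. A minimal sketch of reading them (the field names `fps`, `episode_index`, and `length` are assumptions for illustration, not guaranteed by this API; the example builds a tiny mock `meta/` directory so it is self-contained):

```python
import json
import tempfile
from pathlib import Path

# Build a tiny mock of the meta/ directory shown above (assumed field names).
root = Path(tempfile.mkdtemp())
meta = root / "meta"
meta.mkdir()
(meta / "info.json").write_text(json.dumps({"fps": 10, "features": ["action"]}))
with open(meta / "episodes.jsonl", "w") as f:
    for i, length in enumerate([52, 48]):
        f.write(json.dumps({"episode_index": i, "length": length}) + "\n")

# info.json holds dataset-level metadata.
info = json.loads((meta / "info.json").read_text())
print(info["fps"])  # 10

# episodes.jsonl holds one JSON object per line, one per episode.
episodes = [json.loads(line) for line in (meta / "episodes.jsonl").open()]
print(len(episodes))                        # 2
print(sum(e["length"] for e in episodes))   # 100
```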
Examples
Read a dataset from HuggingFace Hub:
>>> from xpark.dataset import read_lerobot
>>> ds = read_lerobot("hf://datasets/lerobot/pusht")
>>> ds.schema()
Column                      Type
------                      ----
observation.state           list<float32>
action                      list<float32>
episode_index               int64
frame_index                 int64
timestamp                   float32
observation.images.top_path string
Read specific episodes:
>>> ds = read_lerobot("hf://datasets/lerobot/pusht", episodes=[0, 1, 2])
Read specific columns:
>>> ds = read_lerobot(
...     "hf://datasets/lerobot/pusht",
...     columns=["observation.state", "action", "episode_index"]
... )
Read from local path:
>>> ds = read_lerobot("/path/to/local/dataset")
Read with video decoding:
>>> ds = read_lerobot("hf://datasets/lerobot/pusht", decode_video=True)
Read from COS:
>>> ds = read_lerobot("cos://bucket/path/to/dataset")
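Under the chunked layout, an episode index maps deterministically to its Parquet and MP4 files, which is what makes episode-level filtering cheap. A sketch of that mapping (the chunk size of 1000 and the zero-padding widths are assumptions inferred from the directory tree above, not part of this API):

```python
def episode_files(episode_index: int,
                  video_key: str = "observation.images.top",
                  chunk_size: int = 1000) -> tuple[str, str]:
    """Map an episode index to its data/video paths (assumed convention)."""
    chunk = episode_index // chunk_size
    parquet = f"data/chunk-{chunk:03d}/episode_{episode_index:06d}.parquet"
    video = f"videos/{video_key}/chunk-{chunk:03d}/episode_{episode_index:06d}.mp4"
    return parquet, video

print(episode_files(0))
# ('data/chunk-000/episode_000000.parquet',
#  'videos/observation.images.top/chunk-000/episode_000000.mp4')
```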
- Parameters:
path –
Dataset path. Can be:
- Local filesystem path (e.g., “/path/to/dataset”)
- Remote path with protocol prefix:
HuggingFace: “hf://datasets/lerobot/pusht”
COS: “cos://bucket/path”
S3: “s3://bucket/path”
episodes – List of episode indices to load. If None, loads all episodes. This is useful for loading a subset of data for debugging or validation splits.
columns – List of column names to read. If None, reads all columns. Common columns include “observation.state”, “action”, “episode_index”, “frame_index”, “timestamp”.
include_video_paths – If True, adds columns with video file paths for each video key in the dataset (e.g., “observation.images.top_path”). These paths can be used with video decoding processors.
decode_video – If True, decode video frames into tensors (numpy arrays). Requires PyAV and PyTorch. When enabled, video columns will contain decoded frame data instead of path/timestamp references.
parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files in the dataset. Defaults to -1, which automatically determines the optimal parallelism.
num_cpus – The number of CPUs to reserve for each parallel read task.
num_gpus – The number of GPUs to reserve for each parallel read task.
memory – The heap memory in bytes to reserve for each parallel read task.
ray_remote_args – Additional kwargs passed to ray.remote() for the read tasks.
concurrency – The maximum number of concurrent read tasks.
override_num_blocks – Override the number of output blocks.
- Returns:
Dataset containing the LeRobot dataset records.
- Raises:
FileNotFoundError – If the dataset cannot be found locally and cannot be downloaded from HuggingFace Hub.
ImportError – If huggingface_hub is not installed when trying to download a dataset from HuggingFace Hub.
See also
LeRobot documentation: huggingface/lerobot