xpark.dataset.read_lerobot#
- xpark.dataset.read_lerobot(path: str, *, episodes: list[int] | None = None, columns: list[str] | None = None, include_video_paths: bool = True, decode_video: bool = False, parallelism: int = -1, num_cpus: float | None = None, num_gpus: float | None = None, memory: float | None = None, ray_remote_args: dict[str, Any] | None = None, concurrency: int | None = None, override_num_blocks: int | None = None) → Dataset[source]#
Creates a Dataset from a LeRobot format dataset.

LeRobot is a framework for robot learning that stores datasets in a specific format: Parquet files for tabular data and MP4 videos for visual observations. This function reads LeRobot datasets from either the HuggingFace Hub or local storage.
- Dataset Structure:
LeRobot datasets are structured as:
dataset_root/
├── meta/
│   ├── info.json        # Dataset metadata (fps, features, etc.)
│   ├── stats.json       # Statistics for normalization
│   ├── episodes.jsonl   # Episode metadata
│   └── tasks.jsonl      # Task definitions
├── data/
│   └── chunk-000/
│       ├── episode_000000.parquet
│       └── ...
└── videos/
    └── observation.images.top/
        └── chunk-000/
            ├── episode_000000.mp4
            └── ...
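The meta files above are plain JSON and JSONL, so they can be inspected without any LeRobot tooling. A minimal sketch of reading them (the field names `fps`, `episode_index`, and `length` are assumptions for illustration, not guaranteed by this API; the example builds a tiny mock `meta/` directory so it is self-contained):

```python
import json
import tempfile
from pathlib import Path

# Build a tiny mock of the meta/ directory shown above (assumed field names).
root = Path(tempfile.mkdtemp())
meta = root / "meta"
meta.mkdir()
(meta / "info.json").write_text(json.dumps({"fps": 10, "features": ["action"]}))
with open(meta / "episodes.jsonl", "w") as f:
    for i, length in enumerate([52, 48]):
        f.write(json.dumps({"episode_index": i, "length": length}) + "\n")

# info.json holds dataset-level metadata.
info = json.loads((meta / "info.json").read_text())
print(info["fps"])  # 10

# episodes.jsonl holds one JSON object per line, one per episode.
episodes = [json.loads(line) for line in (meta / "episodes.jsonl").open()]
print(len(episodes))                        # 2
print(sum(e["length"] for e in episodes))   # 100
```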
Examples
Read a dataset from HuggingFace Hub:
>>> from xpark.dataset import read_lerobot
>>> ds = read_lerobot("hf://datasets/lerobot/pusht")
>>> ds.schema()
Column                      Type
------                      ----
observation.state           list<float32>
action                      list<float32>
episode_index               int64
frame_index                 int64
timestamp                   float32
observation.images.top_path string
Read specific episodes:
>>> ds = read_lerobot("hf://datasets/lerobot/pusht", episodes=[0, 1, 2])
Read specific columns:
>>> ds = read_lerobot(
...     "hf://datasets/lerobot/pusht",
...     columns=["observation.state", "action", "episode_index"]
... )
Read from local path:
>>> ds = read_lerobot("/path/to/local/dataset")
Read with video decoding:
>>> ds = read_lerobot("hf://datasets/lerobot/pusht", decode_video=True)
Read from COS:
>>> ds = read_lerobot("cos://bucket/path/to/dataset")
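Under the chunked layout, an episode index maps deterministically to its Parquet and MP4 files, which is what makes episode-level filtering cheap. A sketch of that mapping (the chunk size of 1000 and the zero-padding widths are assumptions inferred from the directory tree above, not part of this API):

```python
def episode_files(episode_index: int,
                  video_key: str = "observation.images.top",
                  chunk_size: int = 1000) -> tuple[str, str]:
    """Map an episode index to its data/video paths (assumed convention)."""
    chunk = episode_index // chunk_size
    parquet = f"data/chunk-{chunk:03d}/episode_{episode_index:06d}.parquet"
    video = f"videos/{video_key}/chunk-{chunk:03d}/episode_{episode_index:06d}.mp4"
    return parquet, video

print(episode_files(0))
# ('data/chunk-000/episode_000000.parquet',
#  'videos/observation.images.top/chunk-000/episode_000000.mp4')
```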
- Parameters:
path –
Dataset path. Can be:
- Local filesystem path (e.g., “/path/to/dataset”)
- Remote path with protocol prefix:
HuggingFace: “hf://datasets/lerobot/pusht”
COS: “cos://bucket/path”
S3: “s3://bucket/path”
episodes – List of episode indices to load. If None, loads all episodes. This is useful for loading a subset of data for debugging or validation splits.
columns – List of column names to read. If None, reads all columns. Common columns include “observation.state”, “action”, “episode_index”, “frame_index”, “timestamp”.
include_video_paths – If True, adds columns with video file paths for each video key in the dataset (e.g., “observation.images.top_path”). These paths can be used with video decoding processors.
decode_video – If True, decode video frames into tensors (numpy arrays). Requires PyAV and PyTorch. When enabled, video columns will contain decoded frame data instead of path/timestamp references.
parallelism – The requested parallelism of the read. Parallelism may be limited by the number of files in the dataset. Defaults to -1, which automatically determines the optimal parallelism.
num_cpus – The number of CPUs to reserve for each parallel read task.
num_gpus – The number of GPUs to reserve for each parallel read task.
memory – The heap memory in bytes to reserve for each parallel read task.
ray_remote_args – Additional kwargs passed to ray.remote() for the read tasks.
concurrency – The maximum number of concurrent read tasks.
override_num_blocks – Override the number of output blocks.
- Returns:
Dataset containing the LeRobot dataset records.
- Raises:
FileNotFoundError – If the dataset cannot be found locally and cannot be downloaded from HuggingFace Hub.
ImportError – If huggingface_hub is not installed when trying to download a dataset from HuggingFace Hub.
See also
LeRobot documentation: huggingface/lerobot