Welcome to Xpark!

Xpark is a multimodal AI data processing platform for building and running the data workflows behind AI applications. It covers data handling and transformation, and integrates with AI workflows such as the model-based embedding step shown below.

Processing Multimodal Data with Xpark

from xpark.dataset import TextEmbedding, from_items
from xpark.dataset.expressions import col

# Build a small in-memory dataset from two text items.
ds = from_items(
   [
      "what is the advantage of using the GPU rendering options in Android?",
      "Blank video when converting uncompressed AVI files with ffmpeg",
   ]
)
# Add an "embedding" column computed from the "item" column.
ds = ds.with_column(
   "embedding",
   TextEmbedding(
      # Local embedding model.
      "Qwen/Qwen3-Embedding-0.6B",
   )
   .options(num_workers={"CPU": 1})
   .with_column(col("item")),
)

# Collect all rows of the dataset.
output = ds.take_all()
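
A common follow-up to an embedding step is to compare vectors by cosine similarity. The sketch below is plain Python and does not use Xpark. The row layout returned by `take_all()` is not described here, so `emb_a` and `emb_b` are made-up placeholder vectors rather than real model output, and `cosine_similarity` is a hypothetical helper, not part of the Xpark API.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for two rows' embeddings (assumption:
# each embedding can be read as a flat sequence of floats).
emb_a = [0.1, 0.3, 0.5]
emb_b = [0.2, 0.1, 0.4]
score = cosine_similarity(emb_a, emb_b)
```

Values close to 1 indicate vectors pointing in nearly the same direction; values near 0 indicate unrelated directions.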

Cache Model (Required)

Before using any AI processors, you must cache the required models locally.

# Cache test models
python /path/to/xpark/dataset/scripts/cache_models.py -g test

# Cache all models
python /path/to/xpark/dataset/scripts/cache_models.py -g all

By default, models are cached under ~/.cache/xpark. On a distributed Ray cluster, it is recommended to place the model cache on a distributed cloud disk so that every node can read the same cached models.
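
Because caching is a prerequisite, it can help to check the cache directory before starting a job. The following is a minimal sketch using only the Python standard library. `cache_is_populated` is a hypothetical helper, not part of Xpark, and since the internal layout of the cache is not documented here it only detects a missing or empty directory.

```python
from pathlib import Path

# Default cache location stated in this guide.
DEFAULT_CACHE_DIR = Path("~/.cache/xpark").expanduser()


def cache_is_populated(cache_dir: Path = DEFAULT_CACHE_DIR) -> bool:
    """Return True if cache_dir exists and contains at least one entry.

    Hypothetical helper: it does not know which files Xpark expects,
    so it cannot confirm that a specific model is present.
    """
    return cache_dir.is_dir() and any(cache_dir.iterdir())


if __name__ == "__main__":
    if cache_is_populated():
        print(f"Model cache found at {DEFAULT_CACHE_DIR}")
    else:
        print(f"No cached models at {DEFAULT_CACHE_DIR}; run cache_models.py first")
```

When the cache lives on a shared disk instead of the default path, pass that directory explicitly, for example `cache_is_populated(Path("/mnt/shared/xpark"))`, where the mount point is only an illustrative placeholder.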

Note

Model caching is currently a manual step. In the future, the EMR product will integrate it into the management system, so that models can be cached through configuration.

Next Steps