Video Classification by Speech Content
======================================

Background
----------

This pipeline demonstrates how to implement a video classification algorithm based on speech content
by chaining multiple Xpark operators.

The workflow consists of four steps, executed in order:

1. **Read Videos**: Use ``from_items`` to load a list of video file paths.
2. **Extract Audio**: Use the ``VideoCompute.extract_audio`` operator to extract audio binary data from videos.
3. **Speech to Text**: Use the ``SpeechToText`` operator to transcribe audio into text (supports local models on CPU/GPU or remote HTTP interfaces).
4. **Text Classification**: Use the ``TextClassify`` operator to classify the transcribed text with an LLM, outputting one of the predefined labels.


Video Classification
--------------------

The following example shows how to chain Xpark operators to perform batch video content classification:

.. code-block:: python

    import os

    from xpark.dataset import (
        SpeechToText,
        TextClassify,
        VideoCompute,
        from_items,
    )
    from xpark.dataset.expressions import col

    # List of video file paths to classify (supports local paths, COS, S3, HTTP, etc.)
    video_paths = [
        "/path/to/video1.mp4",
        "/path/to/video2.mp4",
        "/path/to/video3.mp4",
    ]

    # Step 1: Build the video dataset
    ds = from_items([{"video": path} for path in video_paths])

    # Step 2: Extract audio from videos (Video → Audio)
    # VideoCompute.extract_audio returns audio binary data (bytes)
    ds = ds.with_column(
        "audio",
        VideoCompute.extract_audio(col("video"), codec="mp3"),
    )

    # Step 3: Transcribe audio to text (Audio → Text)
    # SpeechToText supports both local models (CPU/GPU) and remote HTTP interfaces
    # Local model mode (using a faster-whisper model):
    ds = ds.with_column(
        "text",
        SpeechToText(
            # Local transcription model
            "Systran/faster-whisper-tiny",
        )
        .options(num_workers={"CPU": 1}, batch_size=4)
        .with_column(col("audio")),
    )

    # Step 4: Classify videos based on text content (Text → Category)
    # TextClassify uses an LLM to assign each text to one of the predefined labels
    ds = ds.with_column(
        "category",
        TextClassify(
            # Predefined classification labels
            ["Sports", "Entertainment", "Technology", "Education", "Lifestyle"],
            model="deepseek-v3-0324",
            base_url=os.getenv("LKE_ENDPOINT"),
            api_key=os.getenv("LKEAP_API_KEY"),
        )
        .options(num_workers={"IO": 1}, batch_size=4)
        .with_column(col("text")),
    )

    # Output classification results
    results = ds.select_columns(["video", "text", "category"]).take_all()
    for row in results:
        print(f"Video: {row['video']}")
        print(f"Transcript: {row['text'][:100]}...")
        print(f"Category: {row['category']}")
        print("-" * 50)

    # Results can also be written to a Parquet file
    # ds.select_columns(["video", "text", "category"]).write_parquet("/path/to/output/")

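The example above reads the LLM endpoint and API key with ``os.getenv``, which returns ``None`` when a variable is unset. The sketch below shows one way to fail fast before the pipeline starts. ``require_env`` is a hypothetical helper introduced here for illustration and is not part of Xpark.

.. code-block:: python

    import os


    def require_env(*names: str) -> dict[str, str]:
        """Return the named environment variables, raising if any is unset or empty.

        Hypothetical helper for illustration; not an Xpark API.
        """
        values = {name: os.environ.get(name, "") for name in names}
        missing = [name for name, value in values.items() if not value]
        if missing:
            raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
        return values


    # Usage (before building the dataset):
    # settings = require_env("LKE_ENDPOINT", "LKEAP_API_KEY")
    # base_url, api_key = settings["LKE_ENDPOINT"], settings["LKEAP_API_KEY"]

Checking the variables up front surfaces a configuration problem immediately, rather than after audio extraction and transcription have already run.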

Example Output
--------------

.. code-block:: text

    Video: /path/to/video1.mp4
    Transcript: Today's match was incredibly exciting. The home team completed a stunning comeback...
    Category: Sports
    --------------------------------------------------
    Video: /path/to/video2.mp4
    Transcript: A new study shows that artificial intelligence has made a major breakthrough in...
    Category: Technology
    --------------------------------------------------
    Video: /path/to/video3.mp4
    Transcript: In this episode, we will show you how to make a simple and delicious home-cooked...
    Category: Lifestyle
    --------------------------------------------------

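Once the rows have been collected, it is often useful to tally the categories and to flag any row whose category falls outside the predefined labels. The following is a minimal sketch, assuming each row returned by ``take_all()`` behaves like a ``dict`` keyed by column name, as the printing loop in the example suggests. ``summarize`` and ``preview`` are hypothetical helpers, not Xpark APIs, and the sample rows simply mirror the example output.

.. code-block:: python

    from collections import Counter

    LABELS = ["Sports", "Entertainment", "Technology", "Education", "Lifestyle"]


    def summarize(rows, labels=LABELS):
        """Count rows per category and list videos whose category is not a known label."""
        counts = Counter(row.get("category") for row in rows)
        unexpected = [row["video"] for row in rows if row.get("category") not in labels]
        return counts, unexpected


    def preview(text, limit=100):
        """Return a short transcript preview, tolerating missing or empty text."""
        text = text or ""
        return text if len(text) <= limit else text[:limit] + "..."


    # Sample rows mirroring the example output (transcripts shortened here).
    sample_rows = [
        {"video": "/path/to/video1.mp4", "text": "Today's match was incredibly exciting.", "category": "Sports"},
        {"video": "/path/to/video2.mp4", "text": "A new study shows that artificial intelligence", "category": "Technology"},
        {"video": "/path/to/video3.mp4", "text": "In this episode, we will show you how to make", "category": "Lifestyle"},
    ]

    counts, unexpected = summarize(sample_rows)
    for label, count in counts.most_common():
        print(f"{label}: {count}")
    print(f"Videos with unexpected categories: {unexpected}")

``preview`` can stand in for the ``row['text'][:100]`` slice when a transcript might be empty, for example for a video with no speech.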

Operator Reference
------------------

- :py:class:`xpark.dataset.VideoCompute`: A collection of video processing operators. The ``extract_audio`` method can extract audio from specified time ranges and reads videos from COS, S3, HTTP, and other storage backends.
- :py:class:`xpark.dataset.SpeechToText`: Speech-to-text operator that supports local models (Whisper, FasterWhisper, etc.) and remote OpenAI-compatible interfaces.
- :py:class:`xpark.dataset.TextClassify`: Text classification operator that uses an LLM to assign text to predefined labels and works with any OpenAI-compatible LLM service.


Further Notes
-------------

As multimodal large language models continue to evolve, Xpark will provide interfaces for more modalities.
In addition to speech, support for classifying videos by their image and music content will be added,
which should further improve the accuracy and applicability of video classification.