Video Classify By Speech Content#
Background#
This pipeline demonstrates how to implement a video classification algorithm based on speech content by chaining multiple Xpark operators.
To classify a video, the pipeline executes the following steps in order:
1. **Read Videos**: Use `from_items` to load a list of video file paths.
2. **Extract Audio**: Use the `VideoCompute.extract_audio` operator to extract audio binary data from the videos.
3. **Speech to Text**: Use the `SpeechToText` operator to transcribe audio into text (supports local models on CPU/GPU or remote HTTP interfaces).
4. **Text Classification**: Use the `TextClassify` operator to classify the text content with an LLM, outputting one of a set of predefined labels.
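Before diving into the Xpark operators, the four-stage data flow (video → audio → text → category) can be sketched with plain Python functions. The three helpers below (`extract_audio`, `transcribe`, `classify`) are hypothetical stand-ins for the Xpark operators, not the real API; they only illustrate how each step's output feeds the next.

```python
# Minimal sketch of the four-stage flow: video -> audio -> text -> category.
# The three helpers are hypothetical stand-ins for the Xpark operators.

def extract_audio(video_path: str) -> bytes:
    # Stand-in for VideoCompute.extract_audio: returns audio bytes.
    return f"audio-of-{video_path}".encode()

def transcribe(audio: bytes) -> str:
    # Stand-in for SpeechToText: returns a transcript string.
    return f"transcript of {audio.decode()}"

def classify(text: str, labels: list[str]) -> str:
    # Stand-in for TextClassify: picks one of the predefined labels.
    return labels[0]

labels = ["Sports", "Entertainment", "Technology", "Education", "Lifestyle"]
rows = [{"video": p} for p in ["/path/to/video1.mp4", "/path/to/video2.mp4"]]

# Each stage adds one column to every row, exactly as with_column does below.
for row in rows:
    row["audio"] = extract_audio(row["video"])
    row["text"] = transcribe(row["audio"])
    row["category"] = classify(row["text"], labels)

print([(r["video"], r["category"]) for r in rows])
```

In the real pipeline, each stage is an Xpark operator applied via `with_column`, and batching and parallelism are handled by the framework rather than a Python loop.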
Video Classify#
The following example shows how to chain Xpark operators to perform batch video content classification:
import os
from xpark.dataset import (
    SpeechToText,
    TextClassify,
    VideoCompute,
    from_items,
)
from xpark.dataset.expressions import col
# List of video file paths to classify (supports local paths, COS, S3, HTTP, etc.)
video_paths = [
    "/path/to/video1.mp4",
    "/path/to/video2.mp4",
    "/path/to/video3.mp4",
]
# Step 1: Build the video dataset
ds = from_items([{"video": path} for path in video_paths])
# Step 2: Extract audio from videos (Video → Audio)
# VideoCompute.extract_audio returns audio binary data (bytes)
ds = ds.with_column(
    "audio",
    VideoCompute.extract_audio(col("video"), codec="mp3"),
)
# Step 3: Transcribe audio to text (Audio → Text)
# SpeechToText supports both local models (CPU/GPU) and remote HTTP interfaces
# Local model mode (using Faster-Whisper):
ds = ds.with_column(
    "text",
    SpeechToText(
        # Local transcription model
        "Systran/faster-whisper-tiny",
    )
    .options(num_workers={"CPU": 1}, batch_size=4)
    .with_column(col("audio")),
)
# Step 4: Classify videos based on text content (Text → Category)
# TextClassify uses an LLM to assign text to one of the predefined labels
ds = ds.with_column(
    "category",
    TextClassify(
        # Predefined classification labels
        ["Sports", "Entertainment", "Technology", "Education", "Lifestyle"],
        model="deepseek-v3-0324",
        base_url=os.getenv("LKE_ENDPOINT"),
        api_key=os.getenv("LKEAP_API_KEY"),
    )
    .options(num_workers={"IO": 1}, batch_size=4)
    .with_column(col("text")),
)
# Output classification results
results = ds.select_columns(["video", "text", "category"]).take_all()
for row in results:
    print(f"Video: {row['video']}")
    print(f"Transcript: {row['text'][:100]}...")
    print(f"Category: {row['category']}")
    print("-" * 50)
# Results can also be written to a Parquet file
# ds.select_columns(["video", "text", "category"]).write_parquet("/path/to/output/")
Example Output#
Video: /path/to/video1.mp4
Transcript: Today's match was incredibly exciting. The home team completed a stunning comeback...
Category: Sports
--------------------------------------------------
Video: /path/to/video2.mp4
Transcript: A new study shows that artificial intelligence has made a major breakthrough in...
Category: Technology
--------------------------------------------------
Video: /path/to/video3.mp4
Transcript: In this episode, we will show you how to make a simple and delicious home-cooked...
Category: Lifestyle
--------------------------------------------------
Operator Reference#
- `xpark.dataset.VideoCompute` — A collection of video-processing operators. The `extract_audio` method supports extracting audio from a specified time range, with support for COS, S3, HTTP, and other storage backends.
- `xpark.dataset.SpeechToText` — Speech-to-text operator that supports local models (Whisper, Faster-Whisper, etc.) and remote OpenAI-compatible interfaces.
- `xpark.dataset.TextClassify` — Text-classification operator that uses an LLM to assign text to one of a set of predefined labels, supporting any OpenAI-compatible LLM service.
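Since `TextClassify` targets OpenAI-compatible services, a classification request boils down to a constrained chat prompt. The sketch below shows one plausible way to assemble such a request; the prompt wording and helper name are assumptions for illustration, not Xpark's internal implementation.

```python
def build_classify_messages(text: str, labels: list[str]) -> list[dict]:
    # Constrain the model to answer with exactly one of the predefined labels.
    label_list = ", ".join(labels)
    return [
        {
            "role": "system",
            "content": (
                f"Classify the user's text into exactly one of: {label_list}. "
                "Reply with the label only."
            ),
        },
        {"role": "user", "content": text},
    ]

messages = build_classify_messages(
    "Today's match was incredibly exciting.",
    ["Sports", "Entertainment", "Technology", "Education", "Lifestyle"],
)
print(messages[0]["content"])
```

These messages could then be sent to any OpenAI-compatible `/chat/completions` endpoint (for example via the `openai` client, with `base_url` and `api_key` configured as in the pipeline above).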
Further Notes#
As multimodal large language models continue to evolve, Xpark will provide interfaces for more modalities. In addition to audio, support for classifying video content based on visual (image) and music information will be added, further improving the accuracy and applicability of video classification.