xpark.dataset.SpeechToText#
- class xpark.dataset.SpeechToText(_local_model: str | None = None, /, *, base_url: str | None = None, model: str | None = None, api_key: str = 'NOT_SET', max_qps: int | None = None, max_retries: int = 0, **kwargs: dict[str, Any])[source]#
Speech-to-text processor for CPU, GPU, and remote HTTP requests.
- Parameters:
_local_model – The speech-to-text model name for CPU or GPU inference. Available models: ['openai/whisper-large-v3', 'Systran/faster-whisper-large-v3', 'nvidia/parakeet-tdt-0.6b-v3', 'OpenVINO/whisper-large-v3-int8-ov']
base_url – The base URL of the remote transcription server (an OpenAI-compatible API).
model – The request model name.
api_key – The request API key.
batch_rows – The number of rows to send per request.
max_qps – The maximum number of requests per second.
max_retries – The maximum number of retries per request in the event of failure. Retries use exponential backoff, up to this maximum number of attempts.
**kwargs – Keyword arguments to pass to the openai.AsyncClient.audio.transcriptions.create API.
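The retry-with-exponential-backoff behavior described for max_retries can be sketched generically. This is an illustrative standalone sketch, not xpark's actual implementation; the helper name `with_retries` and the delay values are assumptions:

```python
import time


def with_retries(fn, max_retries: int = 3, base_delay: float = 0.5):
    """Call fn(), retrying on failure with exponential backoff.

    Sketch only: delays grow as base_delay * 2**attempt, with up to
    max_retries additional attempts after the first failure.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: propagate the last error
            time.sleep(base_delay * 2 ** attempt)


# Example: a flaky callable that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = with_retries(flaky, max_retries=5, base_delay=0.01)
```

With max_retries=0 (the default above in the class signature), a failed request raises immediately without any retry.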
Examples
from xpark.dataset.expressions import col
from xpark.dataset import SpeechToText, from_items

ds = from_items(["multilingual.mp3"])
ds = ds.with_column(
    "text",
    SpeechToText(
        # Local transcription model.
        "openai/whisper-large-v3",
        # For remote transcription requests.
        base_url="http://127.0.0.1:9997/v1",
        model="whisper1",
    )
    # One IO worker for HTTP requests, 10 CPU workers for local transcription.
    .options(num_workers={"CPU": 10, "IO": 1})
    .with_column(col("item")),
)
print(ds.take(2))
Methods
__call__(audio_array)
    Transcribe the audio array to a text array.
options(**kwargs)
with_column(audio_array)
    Transcribe the audio array to a text array.
- __call__(audio_array: pa.ChunkedArray) pa.Array#
Transcribe the audio array to a text array.
- Parameters:
audio_array –
The audio array. Each element is one of:
- str: either the filename of a local audio file or a public URL from which to download the audio file. The file is read at the correct sampling rate to obtain the waveform using ffmpeg, which must be installed on the system.
- bytes: the content of an audio file, interpreted by ffmpeg in the same way.
- np.ndarray of shape (n,) and dtype np.float32 or np.float64: raw audio at the correct sampling rate (for example, 16 kHz).
- Returns:
The transcribed text array.
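To pass raw audio (the np.ndarray input type), the array must be one-dimensional, float32 or float64, and already at the model's sampling rate. A minimal NumPy sketch building a one-second 16 kHz float32 waveform (the 440 Hz tone is purely illustrative; real usage would supply decoded speech samples):

```python
import numpy as np

SAMPLE_RATE = 16_000  # 16 kHz, a common sampling rate for speech models

# One second of a 440 Hz sine tone as a 1-D float32 array of shape (n,).
t = np.arange(SAMPLE_RATE, dtype=np.float32) / SAMPLE_RATE
waveform = (0.5 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)
```

An array like this, wrapped in the dataset column, satisfies the third accepted input type without any ffmpeg decoding step.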
- options(**kwargs: Unpack[ExprUDFOptions]) Self#
- with_column(audio_array: pa.ChunkedArray) pa.Array#
Transcribe the audio array to a text array.
- Parameters:
audio_array –
The audio array. Each element is one of:
- str: either the filename of a local audio file or a public URL from which to download the audio file. The file is read at the correct sampling rate to obtain the waveform using ffmpeg, which must be installed on the system.
- bytes: the content of an audio file, interpreted by ffmpeg in the same way.
- np.ndarray of shape (n,) and dtype np.float32 or np.float64: raw audio at the correct sampling rate (for example, 16 kHz).
- Returns:
The transcribed text array.
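The bytes input type expects the raw content of an audio file, exactly as it would appear on disk. A self-contained sketch producing such bytes with the standard-library wave module (a synthetic silent WAV clip; in real usage you would read an actual recording, e.g. with pathlib's read_bytes()):

```python
import io
import wave

SAMPLE_RATE = 16_000

# Build an in-memory WAV file: mono, 16-bit PCM, 0.5 seconds of silence.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)           # mono
    wav.setsampwidth(2)           # 16-bit samples (2 bytes each)
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(b"\x00\x00" * (SAMPLE_RATE // 2))

# These bytes are what would be passed as one element of the audio array;
# ffmpeg interprets them the same way as a file read from disk.
audio_bytes = buf.getvalue()
```

Any container format that ffmpeg can decode (MP3, FLAC, OGG, etc.) works the same way; WAV is used here only because it can be generated without third-party dependencies.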