xpark.dataset.AudioCompute#

class xpark.dataset.AudioCompute(*args, **kwargs)[source]#

Note

Do not construct this class; use its static methods directly instead.

Methods

channels(audios)

The number of channels in the sound file.

duration(audios)

The duration of the sound file.

format(audios)

The major format of the sound file.

frames(audios)

The number of frames in the sound file.

load(audios, *, sr, mono, offset, duration, ...)

Load an audio file as a floating point time series.

samplerate(audios)

The sample rate of the sound file.

split_by_duration(audios[, sample_rate, ...])

Split audio by duration.

subtype(audios)

The subtype of data in the sound file.

static channels(audios: ChunkedArray) Expr#

The number of channels in the sound file.

static duration(audios: ChunkedArray) Expr#

The duration of the sound file.

static format(audios: ChunkedArray) Expr#

The major format of the sound file.

static frames(audios: ChunkedArray) Expr#

The number of frames in the sound file.
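The metadata accessors above can be chained in a single pipeline. A minimal sketch, assuming a local sample.wav and the from_items/col API shown in the load examples; no expected output is given because it depends on the file:

>>> from xpark.dataset import AudioCompute, from_items
>>> from xpark.dataset.expressions import col
>>>
>>> ds = from_items(["sample.wav"])
>>> ds = ds.with_column("n_channels", AudioCompute.channels(col("item")))
>>> ds = ds.with_column("dur_s", AudioCompute.duration(col("item")))
>>> ds = ds.with_column("n_frames", AudioCompute.frames(col("item")))
>>> ds.show()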

static load(audios: ChunkedArray, *, sr: float | None = 16000, mono: bool = True, offset: float = 0.0, duration: float | None = None, dtype: DTypeLike = numpy.float32, res_type: str = 'soxr_hq') Expr#

Load an audio file as a floating point time series.

Audio will be automatically resampled to the given rate (default sr=16000).

To preserve the native sampling rate of the file, use sr=None.

Parameters:
  • audios (string, pathlib.Path, http URL or audio bytes) –

    path to the input audio.

    Any codec supported by soundfile or audioread will work.

    Audio bytes must be the complete bytes of the audio file, not an ndarray.

  • sr (number > 0 [scalar]) –

    target sampling rate

    None uses the native sampling rate

  • mono (bool) – convert signal to mono

  • offset (float) – start reading after this time (in seconds)

  • duration (float) – only load up to this much audio (in seconds)

  • dtype (numeric type) – data type of y

  • res_type (str) –

    resample type (see note)

    Note

    By default, this uses soxr’s high-quality mode (‘HQ’).

    For alternative resampling modes, see resample

    Note

    audioread may truncate the precision of the audio data to 16 bits.

    See ioformats for alternate loading methods.

Returns:

  • y (np.ndarray [shape=(n,) or (…, n)]) – audio time series. Multi-channel is supported.

  • sr (number > 0 [scalar]) – sampling rate of y

Return type:

A PyArrow StructArray containing these two fields.

Examples

>>> # Load audio from path
>>> from xpark.dataset import AudioCompute, from_items
>>> from xpark.dataset.expressions import col
>>>
>>> items = ["sample.wav", "cos://my_bucket/sample.wav", "http://127.0.0.1:12345/sample.wav"]
>>> from_items(items).with_column("audio_data", AudioCompute.load(col("item"), sr=None)).show()
{'item': 'sample.wav', 'audio_data': {'y': array([-7.4705182e-05, -5.2042997e-05,  5.3031382e-04, ...,
    -1.2249170e-02, -7.8290682e-03,  0.0000000e+00], shape=(77286,), dtype=float32), 'sr': 22050}}
...
>>> # Load audio from bytes
>>> from xpark.dataset import AudioCompute, from_items
>>> from xpark.dataset.expressions import col
>>>
>>> with open("sample.wav", "rb") as f:
...     items = [f.read()]
>>> from_items(items).with_column("audio_data", AudioCompute.load(col("item"), sr=None)).show()
{'item': b'RIFFD...', 'audio_data': {'y': array([-7.4705182e-05, -5.2042997e-05,  5.3031382e-04, ...,
    -1.2249170e-02, -7.8290682e-03,  0.0000000e+00], shape=(77286,), dtype=float32), 'sr': 22050}}
>>> # Load and resample the audio to the default 16000 sample rate
>>> from xpark.dataset import AudioCompute, from_items
>>> from xpark.dataset.expressions import col
>>>
>>> from_items(["sample.wav"]).with_column("audio_data", AudioCompute.load(col("item"))).show()
{'item': 'sample.wav', 'audio_data': {'y': array([-6.1035156e-05,  9.1552734e-05,  1.0681152e-03, ...,
    -2.1972656e-03, -1.1383057e-02, -8.8195801e-03], shape=(56080,), dtype=float32), 'sr': 16000}}
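The keyword parameters can also be combined. A hedged sketch (not from the examples above) that loads a two-second excerpt starting at 0.5 s while preserving the native sample rate; the file name and parameter values are illustrative:

>>> from xpark.dataset import AudioCompute, from_items
>>> from xpark.dataset.expressions import col
>>>
>>> expr = AudioCompute.load(col("item"), sr=None, mono=True, offset=0.5, duration=2.0)
>>> from_items(["sample.wav"]).with_column("audio_data", expr).show()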
static samplerate(audios: ChunkedArray) Expr#

The sample rate of the sound file.

static split_by_duration(audios: ChunkedArray, sample_rate: int | ChunkedArray | None = None, max_audio_clip_s: int = 30, overlap_chunk_second: int = 1, min_energy_split_window_size: int = 1600) Expr#

Split audio by duration.

Parameters:
  • audios (string, pathlib.Path, http URL, audio bytes, or ndarray) –

    path to the input audio.

    Any codec supported by soundfile or audioread will work.

    Audio bytes must be the complete bytes of the audio file, not an ndarray. If the input audio is an ndarray, the sample rate must be specified.

  • sample_rate (int, optional) – The sample rate of the input audio. Required when the input is an ndarray; ignored for other input types.

  • max_audio_clip_s (int, optional) – Maximum duration in seconds for a single audio clip without chunking. Audio longer than this will be split into smaller chunks if allow_audio_chunking evaluates to True; otherwise it will be rejected.

  • overlap_chunk_second (int, optional) – Overlap duration in seconds between consecutive audio chunks when splitting long audio.

  • min_energy_split_window_size (int, optional) – Window size in samples for finding low-energy (quiet) regions to split audio chunks. The algorithm looks for the quietest moment within this window to minimize cutting through speech. Default 1600 samples ≈ 100ms at 16kHz. If None, no chunking will be done.

Return type:

An ArrowVariableShapedTensorArray; each element is one chunk of the audio in ndarray format.
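Example

A usage sketch, assuming the same from_items/col API as in the load examples and a local sample.wav; the parameter values are illustrative:

>>> from xpark.dataset import AudioCompute, from_items
>>> from xpark.dataset.expressions import col
>>>
>>> # Split each file into chunks of at most 30 s with 1 s of overlap,
>>> # cutting at the quietest point inside each 1600-sample search window
>>> from_items(["sample.wav"]).with_column(
...     "chunks",
...     AudioCompute.split_by_duration(
...         col("item"), max_audio_clip_s=30, overlap_chunk_second=1
...     ),
... ).show()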

static subtype(audios: ChunkedArray) Expr#

The subtype of data in the sound file.