In Lhotse, we represent the data using a small number of Python classes, enhanced with methods for common data manipulation tasks, which can be stored as JSON or JSONL manifests. For most audio corpora, we need two types of manifests to fully describe them: a recording manifest and a supervision manifest.
.. autoclass:: lhotse.audio.Recording
   :no-members:
   :no-special-members:
   :noindex:

.. autoclass:: lhotse.audio.RecordingSet
   :no-members:
   :no-special-members:
   :noindex:

.. autoclass:: lhotse.supervision.SupervisionSegment
   :no-members:
   :no-special-members:
   :noindex:

.. autoclass:: lhotse.supervision.SupervisionSet
   :no-members:
   :no-special-members:
   :noindex:
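To make the manifest format concrete, here is a stdlib-only sketch of what serialized entries look like in a compressed JSONL manifest. The field names mirror the attributes of the classes above, but the values (IDs, paths, durations) are made up for illustration:

```python
import gzip
import json
import os
import tempfile

# One JSON object per Recording; field names follow the attributes of
# lhotse.audio.Recording (values here are invented for illustration).
recording = {
    "id": "utt-001",
    "sources": [{"type": "file", "channels": [0], "source": "audio/utt-001.wav"}],
    "sampling_rate": 16000,
    "num_samples": 32000,
    "duration": 2.0,
}

# One JSON object per SupervisionSegment, linked to the recording
# through its recording_id.
supervision = {
    "id": "utt-001-seg-0",
    "recording_id": "utt-001",
    "start": 0.0,
    "duration": 2.0,
    "channel": 0,
    "text": "hello world",
}

# A JSONL manifest is plain text with one JSON object per line,
# optionally gzip-compressed (hence the .jsonl.gz convention).
path = os.path.join(tempfile.gettempdir(), "recordings.jsonl.gz")
with gzip.open(path, "wt") as f:
    f.write(json.dumps(recording) + "\n")

with gzip.open(path, "rt") as f:
    loaded = [json.loads(line) for line in f]
```

In practice you would not write these dicts by hand -- the classes above serialize themselves via their ``to_file()``-style helpers -- but this is the on-disk shape the rest of this section refers to.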
We provide a number of standard data preparation recipes. By that, we mean a pairing of a Python function and a CLI tool that create the manifests given a corpus directory.
===============================  ===================================
Corpus name                      Function
===============================  ===================================
Grid Audio-Visual Speech Corpus  :func:`lhotse.recipes.prepare_grid`
===============================  ===================================
.. hint::

   **Python data preparation recipes.** Each corpus has a dedicated Python file in ``lhotse/recipes``, which you can use as the basis for your own recipe.
.. hint::

   **(Optional) downloading utility.** For publicly available corpora that can be freely downloaded, we usually define a function called ``download_<corpus-name>()``.
.. hint::

   **Data preparation Python entry-point.** Each data preparation recipe should expose a single function called ``prepare_<corpus-name>`` that produces dicts like: ``{'recordings': <RecordingSet>, 'supervisions': <SupervisionSet>}``.
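The entry-point contract above can be sketched as follows. This is a hypothetical recipe for an invented corpus layout (paired ``<utt>.wav`` / ``<utt>.txt`` files), with plain dicts standing in for the real :class:`~lhotse.audio.RecordingSet` and :class:`~lhotse.supervision.SupervisionSet` objects:

```python
import wave
from pathlib import Path


def prepare_mycorpus(corpus_dir):
    """Sketch of a prepare_<corpus-name>() entry-point.

    Assumes a hypothetical layout of paired <utt>.wav / <utt>.txt files;
    plain dicts stand in for RecordingSet / SupervisionSet.
    """
    corpus_dir = Path(corpus_dir)
    recordings, supervisions = [], []
    for wav_path in sorted(corpus_dir.glob("*.wav")):
        utt_id = wav_path.stem
        # Read the duration from the WAV header (stdlib-only stand-in
        # for Recording.from_file()).
        with wave.open(str(wav_path)) as w:
            duration = w.getnframes() / w.getframerate()
        recordings.append({
            "id": utt_id,
            "duration": duration,
            "sources": [{"type": "file", "channels": [0], "source": str(wav_path)}],
        })
        # The transcript sits in a sibling .txt file in this invented layout.
        text = wav_path.with_suffix(".txt").read_text().strip()
        supervisions.append({
            "id": f"{utt_id}-seg",
            "recording_id": utt_id,
            "start": 0.0,
            "duration": duration,
            "text": text,
        })
    # The contract: a dict with 'recordings' and 'supervisions' keys.
    return {"recordings": recordings, "supervisions": supervisions}
```

A real recipe would construct the Lhotse classes instead of dicts, but the overall shape -- scan the corpus directory, build one recording entry and its supervision entries per utterance, return both under the two keys -- is the same.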
.. hint::

   **CLI recipe wrappers.** We provide a command-line interface that wraps the ``download`` and ``prepare`` functions -- see ``lhotse/bin/modes/recipes`` for examples of how to do it.
.. hint::

   **Pre-defined train/dev/test splits.** When a corpus defines a standard split (e.g. train/dev/test), we return a dict with the following structure: ``{'train': {'recordings': <RecordingSet>, 'supervisions': <SupervisionSet>}, 'dev': ...}``
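A typical way to consume that nested structure is to iterate over the splits. Placeholder strings stand in for the real manifest objects in this sketch:

```python
# Stand-in for the dict returned by a prepare_<corpus-name>() function
# when the corpus defines a standard split (placeholders instead of
# real RecordingSet / SupervisionSet objects).
manifests = {
    "train": {"recordings": "<RecordingSet>", "supervisions": "<SupervisionSet>"},
    "dev":   {"recordings": "<RecordingSet>", "supervisions": "<SupervisionSet>"},
    "test":  {"recordings": "<RecordingSet>", "supervisions": "<SupervisionSet>"},
}

# Typical consumption pattern: pick the parts apart per split.
seen = []
for split, parts in manifests.items():
    recordings = parts["recordings"]
    supervisions = parts["supervisions"]
    seen.append(split)
```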
.. hint::

   **Manifest naming convention.** The default naming convention is ``<corpus-name>_<manifest-type>_<split>.jsonl.gz``, i.e., we save the manifests in a compressed JSONL file. Here, ``<manifest-type>`` can be ``recordings``, ``supervisions``, etc., and ``<split>`` can be ``train``, ``dev``, ``test``, etc. In case the corpus has no such split defined, we can use ``all`` as the default. Other information, e.g., mic type, language, etc., may be included in ``<corpus-name>``. Some examples are: ``cmu-indic_recordings_all.jsonl.gz``, ``ami-ihm_supervisions_dev.jsonl.gz``, ``mtedx-english_recordings_train.jsonl.gz``.
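The naming convention above is simple enough to capture in a tiny helper. This function is illustrative only -- it is not part of Lhotse's API -- but it reproduces the example filenames from the text:

```python
def manifest_filename(corpus_name: str, manifest_type: str, split: str = "all") -> str:
    """Build a manifest filename following the
    <corpus-name>_<manifest-type>_<split>.jsonl.gz convention.

    An illustrative helper, not part of Lhotse's API; 'all' is used
    when the corpus defines no split.
    """
    return f"{corpus_name}_{manifest_type}_{split}.jsonl.gz"


# Reproducing the examples from the text:
name_a = manifest_filename("cmu-indic", "recordings")            # no split -> 'all'
name_b = manifest_filename("ami-ihm", "supervisions", "dev")
name_c = manifest_filename("mtedx-english", "recordings", "train")
```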
.. hint::

   **Isolated utterance corpora.** Some corpora (like LibriSpeech) come with pre-segmented recordings. In these cases, the :class:`~lhotse.supervision.SupervisionSegment` will exactly match the :class:`~lhotse.audio.Recording` duration (and there will likely be exactly one segment corresponding to each recording).
.. hint::

   **Conversational corpora.** Corpora with longer recordings (e.g. conversational, like Switchboard) should have exactly one :class:`~lhotse.audio.Recording` object corresponding to a single conversation/session, spanning its whole duration. Each speech segment in that recording should be represented as a :class:`~lhotse.supervision.SupervisionSegment` with the same ``recording_id`` value.
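The conversational layout looks like this in serialized form -- one recording per session, several segments pointing back to it. Plain dicts stand in for the real classes, and the IDs and timestamps are invented; the final loop shows the kind of sanity check a recipe author might run:

```python
# One Recording per conversation (plain-dict stand-in; values invented).
recording = {"id": "sw-0001", "duration": 300.0}

# Several SupervisionSegments, all tied to the session via recording_id.
segments = [
    {"id": "sw-0001-seg0", "recording_id": "sw-0001", "start": 3.2,  "duration": 4.5, "speaker": "A"},
    {"id": "sw-0001-seg1", "recording_id": "sw-0001", "start": 8.1,  "duration": 2.7, "speaker": "B"},
    {"id": "sw-0001-seg2", "recording_id": "sw-0001", "start": 11.0, "duration": 6.3, "speaker": "A"},
]

# Sanity check: every segment references the session recording and
# lies entirely within its duration.
for seg in segments:
    assert seg["recording_id"] == recording["id"]
    assert seg["start"] + seg["duration"] <= recording["duration"]
```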
.. hint::

   **Multi-channel corpora.** Corpora with multiple channels for each session (e.g. AMI) should have a single :class:`~lhotse.audio.Recording` with multiple :class:`~lhotse.audio.AudioSource` objects -- each corresponding to a separate channel. Remember to make the :class:`~lhotse.supervision.SupervisionSegment` objects correspond to the right channels!
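In serialized form, the multi-channel layout pairs one source per channel inside a single recording, and each supervision names the channel it belongs to. As before, plain dicts with invented values stand in for the real classes:

```python
# A single multi-channel Recording (plain-dict sketch, invented values):
# one AudioSource per channel, mirroring the AMI-style layout above.
recording = {
    "id": "meeting-01",
    "sources": [
        {"type": "file", "channels": [0], "source": "audio/meeting-01-ch0.wav"},
        {"type": "file", "channels": [1], "source": "audio/meeting-01-ch1.wav"},
    ],
    "sampling_rate": 16000,
    "duration": 1800.0,
}

# Each supervision must name the channel its speaker was captured on.
supervision = {
    "id": "meeting-01-seg0",
    "recording_id": "meeting-01",
    "start": 12.5,
    "duration": 3.0,
    "channel": 1,  # must be one of the channels present in the sources
}

# Sanity check: the supervision's channel exists among the sources.
available = {ch for src in recording["sources"] for ch in src["channels"]}
assert supervision["channel"] in available
```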