Dataset Processing

Build custom dataset

A dataset contains two subsets, a training subset and a testing subset, each stored in a separate folder, e.g., "train/" and "test/".

In the train/ folder, each audio recording is paired with an annotation file. Here are the requirements for the audio recording and the corresponding annotation file:

Audio recording:

  • It is a ".wav" file. The file name can be arbitrary but must end with ".wav", e.g., "rec_0001.wav".
  • It has only one channel (mono sound); a sketch for down-mixing a multi-channel recording follows this list.
  • Its sampling rate and length (duration) can be arbitrary.
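
As a minimal sketch (assuming librosa and soundfile are installed; the input file name is hypothetical), a multi-channel recording can be converted into a mono ".wav" file like this:

import librosa
import soundfile as sf

# Load with the native sampling rate (sr=None) and down-mix to one channel.
audio, sr = librosa.load("raw_recordings/rec_0001.flac", sr=None, mono=True)
sf.write("train/rec_0001.wav", audio, sr)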

Annotation file:

There are two options for the format of the annotation file:

Option 1:

  • The annotation file is a .csv file, and its name should be the same as that of the corresponding audio file except for the extension. So given an audio file named "rec_0001.wav", the corresponding annotation file should be "rec_0001.csv".

  • This CSV file contains three columns: "onset", "offset", and "cluster" (the first, unnamed column is the row index). Onset and offset are given in seconds (s). The CSV file looks like the one below; a sketch for writing such a file follows the file tree further down:

   onset      offset     cluster
0  20.953844  21.033750  call0
1  21.648063  21.741531  call0
2  21.850719  21.938219  call0
3  21.952313  21.999469  call1
4  22.013906  22.060031  call0
  • The file structure of the dataset looks like below:
data/marmoset/
├── test
│   ├── marmoset_pair1_animal1_animal1out_0.csv
│   ├── marmoset_pair1_animal1_animal1out_0.wav
│   ├── marmoset_pair2_animal1_animal1out_0.csv
│   ├── marmoset_pair2_animal1_animal1out_0.wav
└── train
    ├── marmoset_pair4_animal1_together_A_0.csv
    ├── marmoset_pair4_animal1_together_A_0.wav
    ├── marmoset_pair4_animal1_together_B_0.csv
    ├── marmoset_pair4_animal1_together_B_0.wav
    ├── marmoset_pair5_animal1_animal1out_0.csv
    └── marmoset_pair5_animal1_animal1out_0.wav
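
As a minimal sketch (assuming pandas is installed; the output path is hypothetical, with values taken from the example above), such an annotation CSV can be written like this:

import pandas as pd

# One row per annotated segment; onset and offset are in seconds.
annotation = pd.DataFrame({
    "onset":   [20.953844, 21.648063, 21.850719, 21.952313, 22.013906],
    "offset":  [21.033750, 21.741531, 21.938219, 21.999469, 22.060031],
    "cluster": ["call0", "call0", "call0", "call1", "call0"],
})
# pandas writes the row index by default, which matches the layout shown above.
annotation.to_csv("train/rec_0001.csv")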

Option 2:

  • The annotation file is a .json file, and its name should be the same as that of the corresponding audio file except for the extension. So given an audio file named "rec_0001.wav", the corresponding annotation file should be "rec_0001.json".

  • The JSON file contains the following keys:

    • "onset": a list of the starting times (in seconds) of the segments in the audio, ordered chronologically.
    • "offset": a list of the ending times (in seconds) of the segments in the audio.
    • "cluster": a list of the segment types (plain text) of the segments in the audio.

    The following keys are optional. You can add these keys and values if you want finer control over the segmentation parameters. If these parameters are not given, the model will use default values derived from the median segment length.

    • "species": The species in the audio, e.g., "zebra_finch". In this paper, we experinmented on five species: "zebra_finch", "bengalese_finch", "mouse", "marmoset", "human". Adding new species is possible. When adding new species, go to the load_model() function in model.py, add a new pair of species_name:species_token to the species_codebook variable. E.g., "meerkat":"<|meerkat|>".
    • "sr": The sampling rate that is used to load the audio. The audio file will be resampled to the sampling rate specified by sr, regardless of the native sampling rate of the audio file.
    • "min_frequency": the minimum frequency when computing the Log Melspectrogram. Frequency components below min_frequency will not be included in the input spectrogram.
    • "spec_time_step": Spectrogram Time Resolution. By default, one single input spectrogram of WhisperSeg contains 1000 columns. 'spec_time_step' represents the time difference between two adjacent columns in the spectrogram. It is equal to FFT_hop_size / sampling_rate: $\frac{L_\text{hop}}{f_s}$ .
    • "min_segment_length": The minimum allowed length of predicted segments. The predicted segments whose length is below 'min_segment_length' will be discarded.
    • "tolerance": When computing the $F1_\text{seg}$ score, we need to check if the both the absolute difference between the predicted onset and the ground-truth onset and the absolute difference between the predicted and ground-truth offsets are below a tolerance (in second). We choose tolerance 0.2 s for human and 0.01s for animals.
    • "time_per_frame_for_scoring": The time bin size (in second) used when computing the $F1_\text{frame}$ score. We set time_per_frame_for_scoring to 0.001 for all datasets.
    • "eps": The threshold $\epsilon_\text{vote}$ during the multi-trial majority voting when processing long audio files

    Recommended values of sr, min_frequency, spec_time_step, min_segment_length, time_per_frame_for_scoring, and eps are available in config/segment_config.json.

  • The test/ folder contains the test set and has the same structure as the training set. Here is an example file structure:

data/marmoset/
├── test
│   ├── marmoset_pair1_animal1_animal1out_0.json
│   ├── marmoset_pair1_animal1_animal1out_0.wav
│   ├── marmoset_pair2_animal1_animal1out_0.json
│   ├── marmoset_pair2_animal1_animal1out_0.wav
└── train
    ├── marmoset_pair4_animal1_together_A_0.json
    ├── marmoset_pair4_animal1_together_A_0.wav
    ├── marmoset_pair4_animal1_together_B_0.json
    ├── marmoset_pair4_animal1_together_B_0.wav
    ├── marmoset_pair5_animal1_animal1out_0.json
    └── marmoset_pair5_animal1_animal1out_0.wav
  • Here is what an annotation file looks like (take "marmoset_pair4_animal1_together_B_0.json" as an example; a sketch for producing such a file follows the example):
{
  "onset": [
    0.1979075547210413,
    10.623481169473052,
    15.8850552318886,
    24.79427063612889,
    28.810797420332847,
    38.2856537289058,
    48.63584878094048,
    58.04026121482411,
    64.63873157548687,
    64.7831555952348,
    68.39671202322779,
    78.3215963108371,
    88.56905355060303,
    98.87149718874277,
    100.79554794271394,
    111.51698102894191,
    115.66329432038287,
    125.8880986515378,
    126.01743089368824,
    136.24306110239945,
    146.5839366300272,
    156.9141451247167
  ],
  "offset": [
    0.6325124716552182,
    10.732488727627924,
    16.165301587301883,
    24.99787319137772,
    29.080698832384087,
    38.65104624298783,
    48.960403628118,
    58.40517025116287,
    64.71413591707028,
    64.94158551347277,
    68.73825700016414,
    78.5864520979826,
    88.86396659303114,
    99.44553737426918,
    101.51028706999728,
    111.81912465550818,
    115.90700986006686,
    125.97332390185488,
    126.26139656595478,
    136.58120794909837,
    146.90520472896583,
    157.0802205816326
  ],
  "cluster": [
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr",
    "marmoset_tr"
  ],
  "species": "marmoset",
  "sr": 48000,
  "min_frequency": 0,
  "spec_time_step": 0.0025,
  "min_segment_length": 0.01,
  "tolerance": 0.01,
  "time_per_frame_for_scoring": 0.001,
  "eps": 0.02
}
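
As a minimal sketch (assuming only the Python standard library; the values are abbreviated from the example above), such an annotation file can be produced like this:

import json

annotation = {
    # One entry per segment; onset and offset are in seconds.
    "onset":   [0.1979075547210413, 10.623481169473052, 15.8850552318886],
    "offset":  [0.6325124716552182, 10.732488727627924, 16.165301587301883],
    "cluster": ["marmoset_tr", "marmoset_tr", "marmoset_tr"],
    # Optional keys; omit them to fall back to defaults derived from the
    # median segment length (see the list of optional keys above). The
    # species must exist in the species_codebook variable in model.py.
    "species": "marmoset",
    "sr": 48000,
    "min_frequency": 0,
    "spec_time_step": 0.0025,
    "min_segment_length": 0.01,
    "tolerance": 0.01,
    "time_per_frame_for_scoring": 0.001,
    "eps": 0.02,
}

with open("train/marmoset_pair4_animal1_together_B_0.json", "w") as f:
    json.dump(annotation, f, indent=2)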

The choice of the optional parameters is described in README/Illustration-of-segmentation-parameters and README/Segmentation-examples. For further details, please refer to our paper.

Note: All audio files in the training and test sets need to be fully annotated (with respect to the target vocal segments).

Download the dataset used in the paper

Note: Run the following commands in the main folder of the repository (where README.md and the .py files are located).

from huggingface_hub import snapshot_download

Multi-Species VAD dataset

This dataset is the union of the VAD datasets for five different species (zebra finch, Bengalese finch, marmoset, mouse, and human) used in the paper.

snapshot_download('nccratliri/vad-multi-species', local_dir = "data/multi-species", repo_type="dataset" )

Zebra finch

For the zebra-finch-only dataset, the training set contains training examples for both adults and juveniles. For testing, we divide the test set into adult and juvenile subsets and report the test performance separately.

snapshot_download('nccratliri/vad-zebra-finch', local_dir = "data/zebra-finch", repo_type="dataset" )

Bengalese finch

snapshot_download('nccratliri/vad-bengalese-finch', local_dir = "data/bengalese-finch", repo_type="dataset" )

Marmoset

snapshot_download('nccratliri/vad-marmoset', local_dir = "data/marmoset", repo_type="dataset" )

Mouse

snapshot_download('nccratliri/vad-mouse', local_dir = "data/mouse", repo_type="dataset" )

Human-AVA-Speech

snapshot_download('nccratliri/vad-human-ava-speech', local_dir = "data/human-ava-speech", repo_type="dataset" )
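
To fetch all five single-species datasets in one go, a minimal loop over the commands above (same repository names and local directories) could look like this:

from huggingface_hub import snapshot_download

# Repository name -> local directory, as listed above.
datasets = {
    "nccratliri/vad-zebra-finch": "data/zebra-finch",
    "nccratliri/vad-bengalese-finch": "data/bengalese-finch",
    "nccratliri/vad-marmoset": "data/marmoset",
    "nccratliri/vad-mouse": "data/mouse",
    "nccratliri/vad-human-ava-speech": "data/human-ava-speech",
}
for repo_id, local_dir in datasets.items():
    snapshot_download(repo_id, local_dir=local_dir, repo_type="dataset")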