[egs] Speeding up i-vector training in voxceleb v1 recipe #2421
Conversation
Maybe a better solution (to the problem of very short segments in the i-vector extractor training data) might be to include a mechanism that pools features across segments (based on a reco2seg or reco2utt file) before extracting i-vector stats. But I think the proposed solution in this PR is fine for now, as it massively reduces training time without impacting performance negatively.
I think your proposed solution is fine! I'll update my BNF PR. One solution that I explored when I created the first version of the recipe is to concatenate individual utts into recordings (the code below would only work for voxceleb2, as it splits utt_ids using substrings):
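The original snippet did not survive extraction. As a hedged reconstruction of the idea only, here is a minimal sketch that groups utterance IDs into recordings by splitting on substrings, assuming a hypothetical VoxCeleb2-style ID format `<speaker>-<video>-<segment>` (e.g. `id00012-21Uxsk56VDQ-00001`); the actual IDs and splitting logic in the recipe may differ:

```python
# Illustrative sketch, NOT the original snippet from this comment.
# Assumes utt_ids shaped like "<speaker>-<video>-<segment>".
from collections import defaultdict

def build_reco2utt(utt_ids):
    """Map each recording ID (speaker-video) to its utterance IDs."""
    reco2utt = defaultdict(list)
    for utt in sorted(utt_ids):
        reco = utt.rsplit("-", 1)[0]  # drop the trailing segment index
        reco2utt[reco].append(utt)
    return dict(reco2utt)
```

With a mapping like this in hand, the per-utterance features could then be concatenated per recording before accumulating i-vector stats.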
Yes, I was thinking something along those lines. We could create a version of concat-feats that takes as input a feats archive (keyed on utterance ID) and a "reco2utt" file, concatenates the utterances that belong to the same recording ID, and returns an archive of features where the key is the recording ID. But, if we do this, I think it should be in a separate PR that handles just that issue.
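The proposed tool does not exist in this PR; as a sketch of the intended behaviour (with hypothetical in-memory dicts standing in for Kaldi archives), pooling would amount to concatenating each recording's per-utterance feature matrices along the frame axis:

```python
# Hypothetical sketch of the proposed concat-feats variant; a real
# implementation would read/write Kaldi archives rather than dicts.
import numpy as np

def pool_feats_by_reco(feats, reco2utt):
    """feats: dict utt_id -> (num_frames, feat_dim) array.
    reco2utt: dict reco_id -> list of utt_ids in that recording.
    Returns dict reco_id -> single concatenated feature matrix."""
    pooled = {}
    for reco, utts in reco2utt.items():
        # Stack along the time (frame) axis; feat_dim must match.
        pooled[reco] = np.concatenate([feats[u] for u in utts], axis=0)
    return pooled
```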
@danpovey pointed out that this script https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/utils/data/combine_short_segments.sh exists, and may help with this issue. Something to try later.
I looked into that script some time ago in a different context. While it prefers to combine segments from the same speaker, it can use segments from other speakers to satisfy the minimum target segment length. Empirically, this may not make much of a difference, but I think it's not ideal. choose_utts_to_combine.py, the script that does the hard work, could be modified to be strict about only combining segments from the same speaker (specifically, the part after line 256 could be made optional). In the case of VoxCeleb, I think combining all utterances that come from a single video into one recording makes the most sense, although I'd have to look into how wide the range of durations is. For VoxCeleb2 train, the number of videos per speaker ranges from 6 to 91.
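The "strict" variant discussed above can be sketched as follows. This is illustrative only, and not how choose_utts_to_combine.py actually works; it just shows the never-mix-speakers constraint, under which a leftover short group stays within its speaker rather than being padded with another speaker's segments:

```python
# Illustrative sketch of a strictly speaker-internal segment combiner.
from collections import defaultdict

def combine_within_speaker(utt2dur, utt2spk, min_dur):
    """Greedily group a speaker's utterances until each group reaches
    min_dur seconds; groups never cross speaker boundaries."""
    spk2utts = defaultdict(list)
    for utt in sorted(utt2dur):
        spk2utts[utt2spk[utt]].append(utt)
    groups = []
    for utts in spk2utts.values():
        cur, cur_dur = [], 0.0
        for utt in utts:
            cur.append(utt)
            cur_dur += utt2dur[utt]
            if cur_dur >= min_dur:
                groups.append(cur)
                cur, cur_dur = [], 0.0
        if cur:  # leftover below min_dur stays within its speaker
            groups.append(cur)
    return groups
```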
[egs] Speeding up i-vector training in voxceleb v1 recipe (#2421)

* [egs]: updating the voxceleb recipe so that it uses more of the available data, and uses a better performing wideband MFCC config
* [egs]: fixing comment error in mfcc.conf
* [egs] updating voxceleb/v1/run.sh results
* [egs] changing url to download voxceleb1 test set from, updating READMEs
* [egs] fixing comment in voxceleb/v2/run.sh
* [egs] adding check that ffmpeg exists in voxceleb2 data prep
* [egs] subsampling the i-vector training data in voxceleb/v2, otherwise it takes an extremely long time to train
The VoxCeleb training data consists of a very large number of short recordings (over 1.2 million). As a result, i-vector extractor training takes an extremely long time to finish. Also, if the recordings are very short, I believe they are harmful to the i-vector extractor.
In this PR, we train the i-vector extractor on just the longest 100,000 recordings. This reduces training time by over 90%, and slightly improves performance, from 5.53% EER to 5.419% EER.
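The selection step can be sketched as follows. This is a hedged illustration, not the PR's actual script; it assumes per-utterance lengths in the shape of a Kaldi utt2num_frames mapping (`<utt_id> <num_frames>`), from which the resulting ID set would be used to subset the training data directory:

```python
# Illustrative sketch: keep only the N longest recordings for
# i-vector extractor training, ranked by frame count.
def longest_n(utt2num_frames, n):
    """Return the IDs of the n longest utterances."""
    ranked = sorted(utt2num_frames, key=utt2num_frames.get, reverse=True)
    return set(ranked[:n])
```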
FYI, @entn-at