
[egs] Speeding up i-vector training in voxceleb v1 recipe #2421

Merged (8 commits, May 14, 2018)

Conversation

david-ryan-snyder
Contributor

The VoxCeleb training data consists of a large number of short recordings (over 1.2 million). As a result, i-vector extractor training takes an extremely long time to finish. Also, I believe very short recordings are harmful to the i-vector extractor.

In this PR, we train the i-vector extractor on only the 100,000 longest recordings. This reduces training time by over 90% and slightly improves performance, from 5.53% to 5.419% EER.
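The subsetting step can be sketched with standard tools: sort a Kaldi `utt2num_frames` file by length and keep the top N utterance ids. This is a minimal sketch on toy data (the real recipe would then pass the list to Kaldi's `utils/subset_data_dir.sh`; the `toy_train` directory and the ids below are illustrative, and `n` is set to 2 here instead of 100,000):

```shell
# Build a toy utt2num_frames file (format: <utt-id> <num-frames>).
mkdir -p toy_train
cat > toy_train/utt2num_frames <<'EOF'
utt1 300
utt2 1500
utt3 80
utt4 900
EOF

n=2  # keep the n longest recordings (100,000 in the actual PR)

# Sort by frame count, descending; keep the top n; emit a sorted id list.
sort -k2,2 -nr toy_train/utt2num_frames | head -n "$n" | \
  awk '{print $1}' | sort > toy_train/uttlist
cat toy_train/uttlist
```

The resulting `uttlist` contains `utt2` and `utt4`, the two longest toy recordings.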

FYI, @entn-at

@david-ryan-snyder
Contributor Author

david-ryan-snyder commented May 14, 2018

Maybe a better solution (to the problem of very short segments in the i-vector extractor training data) might be to include a mechanism that pools features across segments (based on a reco2seg or reco2utt file) before extracting i-vector stats.

But I think the proposed solution in this PR is fine for now, as it massively reduces training time without impacting performance negatively.

@entn-at
Contributor

entn-at commented May 14, 2018

I think your proposed solution is fine! I'll update my BNF PR.

One solution I explored when I created the first version of the recipe is to concatenate individual utterances into recordings (the code below only works for VoxCeleb2, since it splits utterance ids by substring):

for name in voxceleb2_train voxceleb2_test; do
    mkdir -p data/${name}_concat
    # discard segment portion of uttIds
    awk '{print substr($1,1,19), $2}' < data/${name}/utt2spk | sort | uniq > data/${name}_concat/utt2spk
    # update spk2utt file
    utils/utt2spk_to_spk2utt.pl < data/${name}_concat/utt2spk > data/${name}_concat/spk2utt
    # concatenate features
    awk '{print $1, substr($1,1,19)}' < data/${name}/utt2spk | utils/utt2spk_to_spk2utt.pl | \
      utils/apply_map.pl -f 2- data/${name}/feats.scp | \
      awk '{if (NF<=2){print;} else { $1 = $1 " concat-feats --print-args=false"; $NF = $NF " - |"; print; }}' > data/${name}_concat/feats.scp
done
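The key step above is the 19-character prefix trick, which can be checked on toy data (an assumption carried over from the snippet: VoxCeleb2 utterance ids look like `id00012-AbCdEfGhIjK-00001`, i.e. speaker-video-segment, so the first 19 characters identify a speaker+video pair):

```shell
# Toy utt2spk with three segments from two videos of one speaker.
mkdir -p toy
cat > toy/utt2spk <<'EOF'
id00012-AbCdEfGhIjK-00001 id00012
id00012-AbCdEfGhIjK-00002 id00012
id00012-ZzYyXxWwVvU-00001 id00012
EOF

# Collapsing each utterance id to its first 19 characters merges the segments
# of one video into a single "recording" entry, as in the loop above.
awk '{print substr($1,1,19), $2}' toy/utt2spk | sort -u > toy/utt2spk_concat
cat toy/utt2spk_concat
```

The three segments collapse to two entries, one per video.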

@david-ryan-snyder
Contributor Author

Yes, I was thinking along those lines. We could create a version of concat-feats that takes as input a feature archive (keyed on utterance ID) and a "reco2utt" file, concatenates the utterances belonging to the same recording ID, and returns an archive of features keyed on the recording ID.
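The input/output contract of such a tool can be emulated in shell, to make the idea concrete: given a `reco2utt` file and a `feats.scp` keyed on utterance id, produce a `feats.scp` keyed on recording id, chaining multi-utterance recordings through a `concat-feats` pipe. This is only a sketch of the proposed behavior, not an existing Kaldi binary; the file names and toy ids are assumptions.

```shell
# Toy reco2utt (format: <reco-id> <utt-id> <utt-id> ...) and feats.scp.
mkdir -p demo
cat > demo/reco2utt <<'EOF'
rec1 utt1a utt1b
rec2 utt2a
EOF
cat > demo/feats.scp <<'EOF'
utt1a ark:f.ark:0
utt1b ark:f.ark:50
utt2a ark:f.ark:120
EOF

# First pass loads the per-utterance feature entries; second pass walks
# reco2utt and, for recordings with more than one utterance, emits a piped
# concat-feats command as the scp entry.
awk 'NR==FNR {feat[$1]=$2; next}
     {if (NF==2) print $1, feat[$2];
      else {printf "%s concat-feats --print-args=false", $1;
            for (i=2; i<=NF; i++) printf " %s", feat[$i];
            print " - |"}}' demo/feats.scp demo/reco2utt > demo/feats_reco.scp
cat demo/feats_reco.scp
```

`rec1` gets a piped-command entry concatenating its two utterances, while `rec2` keeps its single entry unchanged.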

But, if we do this, I think it should be in a separate PR that handles just that issue.

@danpovey danpovey merged commit bce4336 into kaldi-asr:master May 14, 2018
@david-ryan-snyder
Contributor Author

@danpovey pointed out that this script https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/utils/data/combine_short_segments.sh exists, and may help with this issue. Something to try later.

@entn-at
Contributor

entn-at commented May 15, 2018

I looked into that script some time ago in a different context. While it prefers to combine segments from the same speaker, it can use segments from other speakers to satisfy the minimum target segment length. Empirically, this may not make much of a difference, but I think it's not ideal. choose_utts_to_combine.py, the script that does the hard work, could be modified to be strict about only combining segments from the same speaker (specifically, the part after line 256 could be made optional).

In the case of VoxCeleb, I think combining all utterances that come from a single video into one recording makes the most sense, although I'd have to look into how wide the range of durations is. For VoxCeleb2 train, the number of videos per speaker ranges from 6 to 91.

dpriver pushed a commit to dpriver/kaldi that referenced this pull request Sep 13, 2018
[egs] Speeding up i-vector training in voxceleb v1 recipe (#2421)

* [egs]: updating the voxceleb recipe so that it uses more of the available data, and uses a better performing wideband MFCC config

* [egs]: fixing comment error in mfcc.conf

* [egs] updating voxceleb/v1/run.sh results

* [egs] changing url to download voxceleb1 test set from, updating READMEs

* [egs] fixing comment in voxceleb/v2/run.sh

* [egs] adding check that ffmpeg exists in voxceleb2 data prep

* [egs] subsampling the i-vector training data in voxceleb/v2, otherwise it takes an extremely long time to train
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018