[egs] Add recipes for CN-Celeb #3758

Merged
merged 16 commits on Dec 14, 2019
[egs] add v2
csltstu committed Dec 7, 2019
commit 324396d4c500279e921cfb0996ec60763d5abf75
3 changes: 2 additions & 1 deletion egs/cnceleb/README.txt
@@ -6,4 +6,5 @@ CSLT@Tsinghua http://cslt.riit.tsinghua.edu.cn/~data/CN-Celeb/

 The subdirectories "v1" and so on are different speaker recognition
 recipes. The recipe in v1 demonstrates a standard approach using a
-full-covariance GMM-UBM, iVectors, and a PLDA backend.
+full-covariance GMM-UBM, iVectors, and a PLDA backend. The example
+in v2 demonstrates DNN speaker embeddings with a PLDA backend.
14 changes: 14 additions & 0 deletions egs/cnceleb/v2/README.txt
@@ -0,0 +1,14 @@
This recipe replaces i-vectors used in the v1 recipe with embeddings extracted
from a deep neural network. In the scripts, we refer to these embeddings as
"x-vectors." The recipe in local/nnet3/xvector/tuning/run_xvector_1a.sh is
closely based on the following paper:

@inproceedings{snyder2018xvector,
  title={X-vectors: Robust DNN Embeddings for Speaker Recognition},
  author={Snyder, D. and Garcia-Romero, D. and Sell, G. and Povey, D. and Khudanpur, S.},
  booktitle={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2018},
  organization={IEEE},
  url={http://www.danielpovey.com/files/2018_icassp_xvectors.pdf}
}

15 changes: 15 additions & 0 deletions egs/cnceleb/v2/cmd.sh
@@ -0,0 +1,15 @@
# You can change cmd.sh depending on what type of queue you are using.
# If you have no queueing system and want to run on a local machine, you
# can change all instances of 'queue.pl' to 'run.pl' (but be careful and run
# commands one by one: most recipes will exhaust the memory on your
# machine). queue.pl works with GridEngine (qsub). slurm.pl works
# with Slurm. Different queues are configured differently, with different
# queue names and different ways of specifying things like memory;
# to account for these differences you can create and edit the file
# conf/queue.conf to match your queue's configuration. Search for
# conf/queue.conf in http://kaldi-asr.org/doc/queue.html for more information,
# or search for the string 'default_config' in utils/queue.pl or utils/slurm.pl.

export train_cmd="queue.pl --mem 4G"
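# If you have no grid, a minimal local-machine alternative (an illustrative
# sketch, not part of this commit) is to switch to run.pl:
# export train_cmd="run.pl"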


7 changes: 7 additions & 0 deletions egs/cnceleb/v2/conf/mfcc.conf
@@ -0,0 +1,7 @@
--sample-frequency=16000
--frame-length=25 # the default is 25
--low-freq=20 # the default is 20
--high-freq=7600 # the default is zero, meaning use the Nyquist frequency (8k in this case).
--num-mel-bins=30
--num-ceps=30
--snip-edges=false
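These options can be sanity-checked outside the recipe by calling the
underlying Kaldi binary directly. A minimal sketch, assuming data/train/wav.scp
already exists (the recipe itself invokes this binary through
steps/make_mfcc.sh):

# Dump the first few 30-dimensional MFCC frames as text for inspection.
compute-mfcc-feats --config=conf/mfcc.conf scp:data/train/wav.scp ark,t:- | head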
4 changes: 4 additions & 0 deletions egs/cnceleb/v2/conf/vad.conf
@@ -0,0 +1,4 @@
--vad-energy-threshold=5.5
--vad-energy-mean-scale=0.5
--vad-proportion-threshold=0.12
--vad-frames-context=2
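The energy-based VAD can likewise be exercised standalone. A minimal sketch,
assuming MFCC features have already been computed into data/train/feats.scp
(run.sh does this in stage 1):

# Print per-frame 0/1 speech decisions for the first few utterances.
compute-vad --config=conf/vad.conf scp:data/train/feats.scp ark,t:- | head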
1 change: 1 addition & 0 deletions egs/cnceleb/v2/local
5 changes: 5 additions & 0 deletions egs/cnceleb/v2/path.sh
@@ -0,0 +1,5 @@
export KALDI_ROOT=`pwd`/../../..
export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$KALDI_ROOT/tools/sph2pipe_v2.5:$PWD:$PATH
[ ! -f $KALDI_ROOT/tools/config/common_path.sh ] && echo >&2 "The standard file $KALDI_ROOT/tools/config/common_path.sh is not present -> Exit!" && exit 1
. $KALDI_ROOT/tools/config/common_path.sh
export LC_ALL=C
152 changes: 152 additions & 0 deletions egs/cnceleb/v2/run.sh
@@ -0,0 +1,152 @@
#!/bin/bash
# Copyright 2017 Johns Hopkins University (Author: Daniel Povey)
# 2017 Johns Hopkins University (Author: Daniel Garcia-Romero)
# 2018 Ewald Enzinger
# 2018 David Snyder
# 2019 Tsinghua University (Author: Jiawen Kang and Lantian Li)
# Apache 2.0.
#
# This is an x-vector-based recipe for the CN-Celeb database.
# It is based on "X-vectors: Robust DNN Embeddings for Speaker Recognition"
# by Snyder et al. The recipe uses CN-Celeb/dev for training the x-vector
# DNN and the PLDA backend, and CN-Celeb/eval for evaluation. The results
# are reported in terms of EER and minDCF, and appear inline in the
# comments below.

. ./cmd.sh
. ./path.sh
set -e
mfccdir=`pwd`/mfcc
vaddir=`pwd`/mfcc

cnceleb_root=/export/corpora/CN-Celeb
nnet_dir=exp/xvector_nnet_1a
eval_trials_core=data/eval_test/trials/trials.lst

stage=0

if [ $stage -le 0 ]; then
  # Prepare the CN-Celeb dataset. This script prepares both the development
  # and the evaluation data directories.
  local/make_cnceleb.sh $cnceleb_root data
fi
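# For orientation (a note added here, not part of the original commit):
# local/make_cnceleb.sh is expected to produce standard Kaldi data
# directories data/train, data/eval_enroll, and data/eval_test, each
# containing at least:
#   wav.scp  (utterance-id -> audio path)
#   utt2spk  (utterance-id -> speaker-id)
#   spk2utt  (speaker-id -> list of utterance-ids)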

if [ $stage -le 1 ]; then
  # Make MFCCs and compute the energy-based VAD for each dataset.
  for name in train eval_enroll eval_test; do
    steps/make_mfcc.sh --write-utt2num-frames true --mfcc-config conf/mfcc.conf --nj 20 --cmd "$train_cmd" \
      data/${name} exp/make_mfcc $mfccdir
    utils/fix_data_dir.sh data/${name}
    sid/compute_vad_decision.sh --nj 20 --cmd "$train_cmd" \
      data/${name} exp/make_vad $vaddir
    utils/fix_data_dir.sh data/${name}
  done
fi

if [ $stage -le 3 ]; then
  # Note that over one-third of the utterances in our training set are shorter
  # than 2 seconds, and such short utterances are harmful for DNN x-vector
  # training. Therefore, to improve DNN training, we combine short utterances
  # from the same speaker until each combined segment is at least 5 seconds
  # long.
  utils/data/combine_short_segments.sh --speaker-only true \
    data/train 5 data/train_comb
  # Compute the energy-based VAD for train_comb.
  sid/compute_vad_decision.sh --nj 20 --cmd "$train_cmd" \
    data/train_comb exp/make_vad $vaddir
  utils/fix_data_dir.sh data/train_comb
fi
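# Optional sanity check (illustrative; not in the original commit): write
# data/train_comb/utt2dur and report the shortest combined segment, which
# should now be at least 5 seconds.
# utils/data/get_utt2dur.sh data/train_comb
# sort -k2 -n data/train_comb/utt2dur | head -n 1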

# Now we prepare the features to generate examples for x-vector training.
if [ $stage -le 4 ]; then
  # This script applies CMVN and removes nonspeech frames. Note that this is
  # somewhat wasteful, as it roughly doubles the amount of training data on
  # disk. After creating the training examples, this data can be removed.
  local/nnet3/xvector/prepare_feats_for_egs.sh --nj 20 --cmd "$train_cmd" \
    data/train_comb data/train_comb_no_sil exp/train_comb_no_sil
  utils/fix_data_dir.sh data/train_comb_no_sil
fi

if [ $stage -le 5 ]; then
  # Now, we need to remove features that are too short after removing silence
  # frames. We want at least 4s (400 frames) per utterance.
  min_len=400
  mv data/train_comb_no_sil/utt2num_frames data/train_comb_no_sil/utt2num_frames.bak
  awk -v min_len=${min_len} '$2 > min_len {print $1, $2}' data/train_comb_no_sil/utt2num_frames.bak > data/train_comb_no_sil/utt2num_frames
  utils/filter_scp.pl data/train_comb_no_sil/utt2num_frames data/train_comb_no_sil/utt2spk > data/train_comb_no_sil/utt2spk.new
  mv data/train_comb_no_sil/utt2spk.new data/train_comb_no_sil/utt2spk
  utils/fix_data_dir.sh data/train_comb_no_sil

  # We also want several utterances per speaker. Now we'll throw out speakers
  # with fewer than 8 utterances.
  min_num_utts=8
  awk '{print $1, NF-1}' data/train_comb_no_sil/spk2utt > data/train_comb_no_sil/spk2num
  awk -v min_num_utts=${min_num_utts} '$2 >= min_num_utts {print $1, $2}' data/train_comb_no_sil/spk2num | utils/filter_scp.pl - data/train_comb_no_sil/spk2utt > data/train_comb_no_sil/spk2utt.new
  mv data/train_comb_no_sil/spk2utt.new data/train_comb_no_sil/spk2utt
  utils/spk2utt_to_utt2spk.pl data/train_comb_no_sil/spk2utt > data/train_comb_no_sil/utt2spk

  utils/filter_scp.pl data/train_comb_no_sil/utt2spk data/train_comb_no_sil/utt2num_frames > data/train_comb_no_sil/utt2num_frames.new
  mv data/train_comb_no_sil/utt2num_frames.new data/train_comb_no_sil/utt2num_frames

  # Now we're ready to create training examples.
  utils/fix_data_dir.sh data/train_comb_no_sil
fi

# Stages 6 through 8 are handled in run_xvector.sh.
local/nnet3/xvector/run_xvector.sh --stage $stage --train-stage -1 \
  --data data/train_comb_no_sil --nnet-dir $nnet_dir \
  --egs-dir $nnet_dir/egs

if [ $stage -le 9 ]; then
  # These x-vectors will be used for mean-subtraction, LDA, and PLDA training.
  sid/nnet3/xvector/extract_xvectors.sh --cmd "$train_cmd --mem 4G" --nj 20 \
    $nnet_dir data/train_comb \
    $nnet_dir/xvectors_train_comb

  # Extract x-vectors for the eval sets.
  for name in eval_enroll eval_test; do
    sid/nnet3/xvector/extract_xvectors.sh --cmd "$train_cmd --mem 4G" --nj 20 \
      $nnet_dir data/$name \
      $nnet_dir/xvectors_$name
  done
fi
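# Optional sanity check (illustrative; not in the original commit): print one
# extracted x-vector as text; its dimension should match the DNN's embedding
# layer.
# copy-vector scp:$nnet_dir/xvectors_train_comb/xvector.scp ark,t:- | head -n 1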

if [ $stage -le 10 ]; then
  # Compute the mean.vec used for centering.
  $train_cmd $nnet_dir/xvectors_train_comb/log/compute_mean.log \
    ivector-mean scp:$nnet_dir/xvectors_train_comb/xvector.scp \
    $nnet_dir/xvectors_train_comb/mean.vec || exit 1;

  # Use LDA to decrease the dimensionality prior to PLDA.
  lda_dim=128
  $train_cmd $nnet_dir/xvectors_train_comb/log/lda.log \
    ivector-compute-lda --total-covariance-factor=0.0 --dim=$lda_dim \
    "ark:ivector-subtract-global-mean scp:$nnet_dir/xvectors_train_comb/xvector.scp ark:- |" \
    ark:data/train_comb/utt2spk $nnet_dir/xvectors_train_comb/transform.mat || exit 1;

  # Train the PLDA model.
  $train_cmd $nnet_dir/xvectors_train_comb/log/plda.log \
    ivector-compute-plda ark:data/train_comb/spk2utt \
    "ark:ivector-subtract-global-mean scp:$nnet_dir/xvectors_train_comb/xvector.scp ark:- | transform-vec $nnet_dir/xvectors_train_comb/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
    $nnet_dir/xvectors_train_comb/plda || exit 1;
fi
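# Optional sanity check (illustrative; not in the original commit): the same
# centering, LDA, and length-normalization chain used during scoring can be
# run standalone to inspect one transformed test x-vector:
# ivector-subtract-global-mean $nnet_dir/xvectors_train_comb/mean.vec \
#   scp:$nnet_dir/xvectors_eval_test/xvector.scp ark:- | \
#   transform-vec $nnet_dir/xvectors_train_comb/transform.mat ark:- ark:- | \
#   ivector-normalize-length ark:- ark,t:- | head -n 1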

if [ $stage -le 11 ]; then
  # Compute PLDA scores for the CN-Celeb eval core trials.
  $train_cmd $nnet_dir/scores/log/cnceleb_eval_scoring.log \
    ivector-plda-scoring --normalize-length=true \
    --num-utts=ark:$nnet_dir/xvectors_eval_enroll/num_utts.ark \
    "ivector-copy-plda --smoothing=0.0 $nnet_dir/xvectors_train_comb/plda - |" \
    "ark:ivector-mean ark:data/eval_enroll/spk2utt scp:$nnet_dir/xvectors_eval_enroll/xvector.scp ark:- | ivector-subtract-global-mean $nnet_dir/xvectors_train_comb/mean.vec ark:- ark:- | transform-vec $nnet_dir/xvectors_train_comb/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
    "ark:ivector-subtract-global-mean $nnet_dir/xvectors_train_comb/mean.vec scp:$nnet_dir/xvectors_eval_test/xvector.scp ark:- | transform-vec $nnet_dir/xvectors_train_comb/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
    "cat '$eval_trials_core' | cut -d\  --fields=1,2 |" $nnet_dir/scores/cnceleb_eval_scores || exit 1;

  # CN-Celeb Eval Core:
  # EER: 14.70%
  # minDCF(p-target=0.01): 0.6814
  # minDCF(p-target=0.001): 0.7979
Contributor
It's quite unexpected that a well-designed x-vector system would produce worse results than a traditional i-vector system.

I suggest changing this PR so that it only includes the v1 recipe for now. Postpone including the v2 recipe until you know what's wrong with the x-vector system or until you have a better understanding of how to train these kinds of systems. I haven't had time to look at everything that could be improved in the recipe, but I notice that the PR doesn't even include data augmentation in the x-vector training recipe, which we know is an essential step for achieving good performance.

Contributor

In addition to augmentation, you might wish to include some other free, wideband resource, like VoxCeleb, in the DNN training data. My guess is that this will substantially improve the x-vector system performance, even though it introduces a language mismatch. You'll still want to use the in-domain data for training the PLDA model, though.

Alternatively, if you feel the recipe should consist of nothing but the CN-Celeb dev and eval datasets, then I would again consider removing the v2 recipe, but retain the v1 recipe. If you do choose to use only this data, then it could be the case that the v2 recipe is simply not appropriate for it.

Contributor Author

Agree! I have removed the v2 recipe.

Actually, we tried our best to improve the performance of the x-vector system when using only CN-Celeb dev for x-vector training. In our experiments, data augmentation similar to that in egs/VoxCeleb did not help CN-Celeb training, while in-domain PLDA training with an x-vector extractor trained on VoxCeleb was effective. Related experimental results can be seen at https://arxiv.org/abs/1911.01799.

We would like the recipe to consist only of the CN-Celeb dev and eval datasets, so we accept your suggestion to remove the v2 recipe.

Contributor

Thanks! You could always decide to add an x-vector recipe later... and then I'd suggest doing whatever produces the best performance, which, as you said, appears to be training the DNN on VoxCeleb (or perhaps VoxCeleb + CN-Celeb?) and the backend on CN-Celeb.

Contributor Author

Thanks! I will try it later to get the best performance from the x-vector system.

  echo -e "\nCN-Celeb Eval Core:";
  eer=$(paste $eval_trials_core $nnet_dir/scores/cnceleb_eval_scores | awk '{print $6, $3}' | compute-eer - 2>/dev/null)
  mindcf1=`sid/compute_min_dcf.py --p-target 0.01 $nnet_dir/scores/cnceleb_eval_scores $eval_trials_core 2>/dev/null`
  mindcf2=`sid/compute_min_dcf.py --p-target 0.001 $nnet_dir/scores/cnceleb_eval_scores $eval_trials_core 2>/dev/null`
  echo "EER: $eer%"
  echo "minDCF(p-target=0.01): $mindcf1"
  echo "minDCF(p-target=0.001): $mindcf2"
fi
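For reference, the trials list consumed above ($eval_trials_core) is assumed
to follow Kaldi's usual three-column format of enrollment speaker, test
utterance, and target/nontarget label; the IDs below are purely hypothetical:

enroll-spk-001 test-utt-00001 target
enroll-spk-001 test-utt-00002 nontarget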
1 change: 1 addition & 0 deletions egs/cnceleb/v2/sid
1 change: 1 addition & 0 deletions egs/cnceleb/v2/steps
1 change: 1 addition & 0 deletions egs/cnceleb/v2/utils