[egs] Opensat20 large vocabulary (#4367)
huangruizhe committed Dec 26, 2020
1 parent 8709640 commit 958f0b0
Showing 76 changed files with 5,510 additions and 2 deletions.
7 changes: 5 additions & 2 deletions egs/librispeech/s5/local/lm/install_festival.sh
@@ -35,8 +35,11 @@ fi

if [ "$stage" -le 2 ]; then
echo "Untarring the downloaded files..."
for f in `ls ./*.tar.*`; do
tar xf $f;
for f in `ls ./*.tar.gz`; do
tar -xzf $f;
done
for f in `ls ./*.tar.bz2`; do
tar -xf $f;
done
fi

49 changes: 49 additions & 0 deletions egs/opensat20/README.md
@@ -0,0 +1,49 @@
### OpenSAT 2020 recipe (https://sat.nist.gov/opensat20#tab_overview)

This is a Kaldi-based setup for the SAFE-T data (https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2020-safe-t-corpus.pdf).
The aim of this setup is to provide example scripts for using out-of-domain data in a
low-resource condition. It provides three recipes (run.sh, run_shared.sh, run_finetune.sh),
described as follows:

`Target data acoustic model (run.sh)`: We trained an ASR system using the 40 hours of SAFE-T
training data and evaluated it on the OpenSAT20 Dev audio. We used a CNN-TDNN-F
architecture: 6 CNN layers followed by 9 TDNN-F layers, each with 1024 neurons and a
bottleneck factorization to 128 dimensions with stride 3. Speed perturbation is used as
augmentation to increase the data size by a factor of 3; in addition, online spectral
augmentation makes each mini-batch unique and increases the robustness of the model.
With this setup we can quickly train an ASR system and get a baseline WER. The
resulting WER is available in the RESULTS file.
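
For reference, here is a minimal xconfig sketch of that topology. Only the
6-CNN/9-TDNN-F shape, the 1024-dim layers, the 128-dim bottlenecks, and
time-stride 3 come from the description above; the layer names, filter counts,
and output details are illustrative assumptions, not the exact contents of run.sh:

```
input dim=40 name=input
# six CNN layers (the filter counts here are assumptions)
conv-relu-batchnorm-layer name=cnn1 height-in=40 height-out=40 time-offsets=-1,0,1 height-offsets=-1,0,1 num-filters-out=48
conv-relu-batchnorm-layer name=cnn2 height-in=40 height-out=40 time-offsets=-1,0,1 height-offsets=-1,0,1 num-filters-out=48
# ... cnn3-cnn6 follow the same pattern ...
# nine TDNN-F layers: 1024 neurons, 128-dim bottleneck, stride 3
tdnnf-layer name=tdnnf7 dim=1024 bottleneck-dim=128 time-stride=3
# ... tdnnf8-tdnnf15 are identical ...
output-layer name=output include-log-softmax=false dim=$num_targets
```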

`Fine-tuning (run_finetune.sh)`: To increase the amount of training data for the acoustic
model, we used AMI and ICSI speech data with speed perturbation as data augmentation.
We used a CNN-TDNN-F architecture: 6 CNN layers followed by 9 TDNN-F layers, each with
1536 neurons and a bottleneck factorization to 160 dimensions with stride 3. Speed
perturbation and online spectral augmentation are used with this setup as well.
After training the acoustic model on the AMI and ICSI datasets, fine-tuning is performed
on the SAFE-T dataset with a lower learning rate. The resulting WER is available in the RESULTS file.
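
A hedged sketch of the fine-tuning invocation; the learning rates and directory
names below are illustrative assumptions, and the authoritative flags are in
run_finetune.sh:

```
# continue training the AMI+ICSI chain model on SAFE-T with a lower learning rate
steps/nnet3/chain/train.py \
  --trainer.input-model exp/chain_all/cnn_tdnn_all/final.mdl \
  --trainer.num-epochs 2 \
  --trainer.optimization.initial-effective-lrate 0.00025 \
  --trainer.optimization.final-effective-lrate 0.000025 \
  --feat-dir data/safe_t_train_hires \
  --tree-dir exp/chain_all/tree_sp \
  --lat-dir exp/chain_all/safe_t_train_lats \
  --dir exp/chain_finetune/cnn_tdnn_finetune
```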

`Shared (run_shared.sh)`: AMI and ICSI data are added to the SAFE-T data and a single
acoustic model is trained on all three datasets. As in the fine-tuning setup, the same
CNN-TDNN-F architecture is used with speed perturbation and spectral augmentation. We are
adding a script for other augmentations to this setup. The resulting WER is available in the RESULTS file.
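
The data pooling itself can be done with the standard Kaldi helper (the
directory names below are hypothetical):

```
# pool the three training sets into a single data directory
utils/combine_data.sh data/train_all \
  data/safe_t_train data/ami_train data/icsi_train
```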

#### Data
We use the SAFE-T data released with the following paper:
```
@inproceedings{delgado2020safe,
title={The SAFE-T Corpus: A New Resource for Simulated Public Safety Communications},
author={Delgado, Dana and Walker, Kevin and Strassel, Stephanie and Jones, Karen Sp{\"a}rck and Caruso, Christopher and Graff, David},
booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
pages={6450--6457},
year={2020}
}
```

SAFE-T (Speech Analysis for Emergency Response Technology) has 131 hours (labelled and unlabelled)
of single-channel 48 kHz training data. Most of the speakers are native English speakers. The
participants play the board game Flash Point: Fire Rescue. The recordings have no overlap and
little reverberation, but significant noise. The noise is artificial and the SNR varies with
time, ranging from 0–14 dB (noise levels of roughly 70–85 dB). The noises are cars, ambulances,
rain, or similar sounds. There are 87 speakers in total.
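
Note that the feature configs in conf/ assume 16 kHz input, so the 48 kHz
recordings have to be downsampled first; a one-line sketch with hypothetical
file names:

```
# downsample a 48 kHz recording to the 16 kHz expected by conf/mfcc.conf
sox recording_48k.wav -r 16000 recording_16k.wav
```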

16 changes: 16 additions & 0 deletions egs/opensat20/s5/RESULTS
@@ -0,0 +1,16 @@
CNN-TDNN-F setup result for the SAFE-T development set with only SAFE-T data
# %WER 13.36 [ 2606 / 19507, 217 ins, 1110 del, 1279 sub ] exp/chain_a/cnn_tdnn_1a_spec/decode_safe_t_dev1/wer_8_0.0

CNN-TDNN-F setup result for the SAFE-T development set with shared data (AMI+ICSI+SAFE-T)
(run1)
# %WER 12.61 [ 2460 / 19507, 245 ins, 1119 del, 1096 sub ] exp/chain_all/cnn_tdnn_all/decode_safe_t_dev1/wer_9_0.5
# %WER 11.70 [ 2283 / 19507, 228 ins, 964 del, 1091 sub ] exp/chain_finetune/cnn_tdnn_finetune_shared_ep2/decode_safe_t_dev1_finetune_tl/wer_8_0.5

(run2)
# %WER 12.85 [ 2507 / 19507, 254 ins, 1107 del, 1146 sub ] exp/chain_all/cnn_tdnn_all/decode_safe_t_dev1/wer_8_1.0
# %WER 11.86 [ 2313 / 19507, 298 ins, 895 del, 1120 sub ] exp/chain_finetune/cnn_tdnn_finetune_shared_ep2/decode_safe_t_dev1_finetune_tl/wer_8_0.0

CNN-TDNN-F setup result for the SAFE-T development set with only out-of-domain data (AMI+ICSI), then fine-tuned with SAFE-T data
# %WER 38.17 [ 7445 / 19507, 512 ins, 5278 del, 1655 sub ] exp/chain_train_icsiami/cnn_tdnn_train_icsiami/decode_safe_t_dev1_train_tl/wer_7_0.0
# %WER 12.20 [ 2379 / 19507, 248 ins, 1000 del, 1131 sub ] exp/chain_finetune/cnn_tdnn_finetune_ep2/decode_safe_t_dev1_finetune_tl/wer_9_0.0
# %WER 11.83 [ 2308 / 19507, 225 ins, 977 del, 1106 sub ] exp/chain_finetune/cnn_tdnn_finetune_ep3/decode_safe_t_dev1_finetune_tl/wer_8_0.5
20 changes: 20 additions & 0 deletions egs/opensat20/s5/cmd.sh
@@ -0,0 +1,20 @@
# You can change cmd.sh depending on what type of queue you are using.
# If you have no queueing system and want to run on a local machine, you
# can change all instances of 'queue.pl' to 'run.pl' (but be careful, and run
# commands one by one: most recipes will exhaust the memory on your
# machine). queue.pl works with GridEngine (qsub); slurm.pl works
# with Slurm. Different queues are configured differently, with different
# queue names and different ways of specifying things like memory;
# to account for these differences you can create and edit the file
# conf/queue.conf to match your queue's configuration. Search for
# conf/queue.conf in http://kaldi-asr.org/doc/queue.html for more information,
# or search for the string 'default_config' in utils/queue.pl or utils/slurm.pl.

export train_cmd="queue.pl --mem 1G"
export decode_cmd="queue.pl --mem 2G"
if [[ "$(hostname -f)" == "*.fit.vutbr.cz" ]]; then
queue_conf=$HOME/queue_conf/default.conf # see example /homes/kazi/iveselyk/queue_conf/default.conf,
export train_cmd="queue.pl --config $queue_conf --mem 2G --matylda 0.2"
export decode_cmd="queue.pl --config $queue_conf --mem 3G --matylda 0.1"
export cuda_cmd="queue.pl --config $queue_conf --gpu 1 --mem 10G --tmp 40G"
fi
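
# For a single machine with no queueing system, a run.pl setup would look
# like this (a sketch, not the default in this recipe):
# export train_cmd=run.pl
# export decode_cmd=run.pl
# export cuda_cmd="run.pl --gpu 1"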
11 changes: 11 additions & 0 deletions egs/opensat20/s5/conf/local.conf
@@ -0,0 +1,11 @@
use_pitch=false
use_ivector=true
# the language whose LDA-MLLT transform is used to train the global i-vector extractor
lda_mllt_lang=safet
# lang_list is a space-separated language list used for multilingual training
lang_list=(safet icsiami)
# lang2weight is a comma-separated list of weights, one per language, used to
# scale each example's output w.r.t. its input language during training
lang2weight="0.7,0.3"
# The language list used for decoding.
decode_lang_list=(safet)
2 changes: 2 additions & 0 deletions egs/opensat20/s5/conf/mfcc.conf
@@ -0,0 +1,2 @@
--use-energy=false # only non-default option.
--sample-frequency=16000 #
10 changes: 10 additions & 0 deletions egs/opensat20/s5/conf/mfcc_hires.conf
@@ -0,0 +1,10 @@
# config for high-resolution MFCC features, intended for neural network training.
# Note: we keep all cepstra, so it has the same info as filterbank features,
# but MFCC is more easily compressible (because less correlated) which is why
# we prefer this method.
--use-energy=false # use average of log energy, not energy.
--sample-frequency=16000
--num-mel-bins=40
--num-ceps=40
--low-freq=40
--high-freq=-400
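# usage sketch (hypothetical data dir): this config is normally consumed by
# steps/make_mfcc.sh --mfcc-config conf/mfcc_hires.conf data/train_hires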
1 change: 1 addition & 0 deletions egs/opensat20/s5/conf/online_cmvn.conf
@@ -0,0 +1 @@
# configuration file for apply-cmvn-online, used in the script ../local/online/run_online_decoding_nnet2.sh
2 changes: 2 additions & 0 deletions egs/opensat20/s5/conf/vad.conf
@@ -0,0 +1,2 @@
--vad-energy-threshold=30
--vad-energy-mean-scale=0.5
11 changes: 11 additions & 0 deletions egs/opensat20/s5/local.conf
@@ -0,0 +1,11 @@
use_pitch=false
use_ivector=true
# the language whose LDA-MLLT transform is used to train the global i-vector extractor
lda_mllt_lang=safet
# lang_list is a space-separated language list used for multilingual training
lang_list=(safet icsiami)
# lang2weight is a comma-separated list of weights, one per language, used to
# scale each example's output w.r.t. its input language during training
lang2weight="0.7,0.3"
# The language list used for decoding.
decode_lang_list=(safet)
109 changes: 109 additions & 0 deletions egs/opensat20/s5/local/AMI/ami_ihm_data_prep.sh
@@ -0,0 +1,109 @@
#!/usr/bin/env bash

# Copyright 2014 University of Edinburgh (Author: Pawel Swietojanski)
# 2016 Johns Hopkins University (Author: Daniel Povey)
# AMI Corpus training data preparation
# Apache 2.0

# Note: this is called by ../run.sh.

# To be run from one directory above this script.

. ./path.sh

# check the arguments
if [ $# -ne 1 ]; then
echo "Usage: $0 /path/to/AMI"
echo "e.g. $0 /foo/bar/AMI"
exit 1;
fi

AMI_DIR=$1

SEGS=data/local/AMI_annotations/train.txt
dir=data/local/AMI_ihm/train
odir=data/AMI/train_orig
mkdir -p $dir

# Audio data directory check
if [ ! -d $AMI_DIR ]; then
echo "Error: $AMI_DIR directory does not exists."
exit 1;
fi

# And transcripts check
if [ ! -f $SEGS ]; then
echo "Error: File $SEGS no found (run ami_text_prep.sh)."
exit 1;
fi


# find headset wav audio files only
find $AMI_DIR -iname '*.Headset-*.wav' | sort > $dir/wav.flist
n=`cat $dir/wav.flist | wc -l`
echo "In total, $n headset files were found."
[ $n -ne 687 ] && \
echo "Warning: expected 687 (168 mtgs x 4 mics + 3 mtgs x 5 mics) data files, found $n"

# (1a) Transcriptions preparation
# here we start with normalised transcriptions, the utt ids follow the convention
# AMI_MEETING_CHAN_SPK_STIME_ETIME
# AMI_ES2011a_H00_FEE041_0003415_0003484
# we use uniq as some (rare) entries are doubled in transcripts

awk '{meeting=$1; channel=$2; speaker=$3; stime=$4; etime=$5;
printf("AMI_%s_%s_%s_%07.0f_%07.0f", meeting, channel, speaker, int(100*stime+0.5), int(100*etime+0.5));
for(i=6;i<=NF;i++) printf(" %s", $i); printf "\n"}' $SEGS | sort | uniq > $dir/text

# (1b) Make segment files from transcript

awk '{
segment=$1;
split(segment,S,"[_]");
audioname=S[1]"_"S[2]"_"S[3]; startf=S[5]; endf=S[6];
print segment " " audioname " " startf*10/1000 " " endf*10/1000 " "
}' < $dir/text > $dir/segments

# (1c) Make wav.scp file.

sed -e 's?.*/??' -e 's?.wav??' $dir/wav.flist | \
perl -ne 'split; $_ =~ m/(.*)\..*\-([0-9])/; print "AMI_$1_H0$2\n"' | \
paste - $dir/wav.flist > $dir/wav1.scp

# Keep only the train part of the waves
awk '{print $2}' $dir/segments | sort -u | join - $dir/wav1.scp > $dir/wav2.scp

# Replace the path with an appropriate sox command that selects a single channel only
awk '{print $1" sox -c 1 -t wavpcm -e signed-integer "$2" -t wavpcm - |"}' $dir/wav2.scp > $dir/wav.scp

# (1d) reco2file_and_channel
cat $dir/wav.scp \
| perl -ane '$_ =~ m:^(\S+)(H0[0-4])\s+.*\/([IETB].*)\.wav.*$: || die "bad label $_";
print "$1$2 $3 A\n"; ' > $dir/reco2file_and_channel || exit 1;


# In this data-prep phase we adapt to the session and speaker [later on we may
# split into shorter pieces]., We use the 0th, 1st and 3rd underscore-separated
# fields of the utterance-id as the speaker-id,
# e.g. 'AMI_EN2001a_IHM_FEO065_0090130_0090775' becomes 'AMI_EN2001a_FEO065'.
awk '{print $1}' $dir/segments | \
perl -ane 'chop; @A = split("_", $_); $spkid = join("_", @A[0,1,3]); print "$_ $spkid\n";' \
>$dir/utt2spk || exit 1;


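# Note: the mapping below overwrites the utt2spk written just above; its
# speaker-ids also keep the channel field, e.g. 'AMI_ES2011a_H00_FEE041'.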
awk '{print $1}' $dir/segments | \
perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "bad label $_";
print "$1$2$3 $1$2\n";' > $dir/utt2spk || exit 1;

utils/utt2spk_to_spk2utt.pl <$dir/utt2spk >$dir/spk2utt || exit 1;

# Copy stuff into its final location
mkdir -p $odir
for f in spk2utt utt2spk wav.scp text segments reco2file_and_channel; do
cp $dir/$f $odir/$f || exit 1;
done

utils/fix_data_dir.sh $odir
utils/validate_data_dir.sh --no-feats $odir || exit 1;

echo AMI IHM data preparation succeeded.
120 changes: 120 additions & 0 deletions egs/opensat20/s5/local/AMI/ami_ihm_scoring_data_prep.sh
@@ -0,0 +1,120 @@
#!/usr/bin/env bash


# Copyright 2014 University of Edinburgh (Author: Pawel Swietojanski)
# 2016 Johns Hopkins University (Author: Daniel Povey)
# AMI Corpus dev/eval data preparation
# Apache 2.0

# Note: this is called by ../run.sh.

. ./path.sh

# check the arguments
if [ $# != 2 ]; then
echo "Usage: $0 /path/to/AMI (dev|eval)"
exit 1;
fi

AMI_DIR=$1
SET=$2
SEGS=data/local/AMI_annotations/$SET.txt

dir=data/local/AMI_ihm/$SET
odir=data/AMI/${SET}_orig
mkdir -p $dir

# Audio data directory check
if [ ! -d $AMI_DIR ]; then
echo "Error: run.sh requires a directory argument"
exit 1;
fi

# And transcripts check
if [ ! -f $SEGS ]; then
echo "Error: File $SEGS no found (run ami_text_prep.sh)."
exit 1;
fi

# Find headset wav audio files only; here we again get all the
# files in the corpus and filter only the specific sessions
# while building segments.

find $AMI_DIR -iname '*.Headset-*.wav' | sort > $dir/wav.flist
n=`cat $dir/wav.flist | wc -l`
echo "In total, $n headset files were found."
[ $n -ne 687 ] && \
echo "Warning: expected 687 (168 mtgs x 4 mics + 3 mtgs x 5 mics) data files, found $n"

# (1a) Transcriptions preparation
# here we start with normalised transcriptions, the utt ids follow the convention
# AMI_MEETING_CHAN_SPK_STIME_ETIME
# AMI_ES2011a_H00_FEE041_0003415_0003484

awk '{meeting=$1; channel=$2; speaker=$3; stime=$4; etime=$5;
printf("AMI_%s_%s_%s_%07.0f_%07.0f", meeting, channel, speaker, int(100*stime+0.5), int(100*etime+0.5));
for(i=6;i<=NF;i++) printf(" %s", $i); printf "\n"}' $SEGS | sort | uniq > $dir/text

# (1b) Make segment files from transcript
#segments file format is: utt-id side-id start-time end-time, e.g.:
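# AMI_ES2011a_H00_FEE041_0003415_0003484 AMI_ES2011a_H00 34.15 34.84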

awk '{
segment=$1;
split(segment,S,"[_]");
audioname=S[1]"_"S[2]"_"S[3]; startf=S[5]; endf=S[6];
print segment " " audioname " " startf*10/1000 " " endf*10/1000 " "
}' < $dir/text > $dir/segments

#prepare wav.scp
sed -e 's?.*/??' -e 's?.wav??' $dir/wav.flist | \
perl -ne 'split; $_ =~ m/(.*)\..*\-([0-9])/; print "AMI_$1_H0$2\n"' | \
paste - $dir/wav.flist > $dir/wav1.scp

# Keep only the $SET part of the waves
awk '{print $2}' $dir/segments | sort -u | join - $dir/wav1.scp > $dir/wav2.scp

# Replace the path with an appropriate sox command that selects a single channel only
awk '{print $1" sox -c 1 -t wavpcm -e signed-integer "$2" -t wavpcm - |"}' $dir/wav2.scp > $dir/wav.scp

# (1d) reco2file_and_channel
cat $dir/wav.scp \
| perl -ane '$_ =~ m:^(\S+)(H0[0-4])\s+.*\/([IETB].*)\.wav.*$: || die "bad label $_";
print "$1$2 $3 A\n"; ' > $dir/reco2file_and_channel || exit 1;

awk '{print $1}' $dir/segments | \
perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "segments: bad label $_";
print "$1$2$3 $1$2\n";' > $dir/utt2spk || exit 1;

sort -k 2 $dir/utt2spk | utils/utt2spk_to_spk2utt.pl > $dir/spk2utt || exit 1;

# Check and correct cases where segment timings for a given speaker overlap
# (important for simultaneous asclite scoring to proceed).
# There is actually only one such case in the dev set with automatic segmentations.
join $dir/utt2spk $dir/segments | \
perl -ne '{BEGIN{$pu=""; $pt=0.0;} split;
if ($pu eq $_[1] && $pt > $_[3]) {
print "s/^$_[0] $_[2] $_[3] $_[4]\$/$_[0] $_[2] $pt $_[4]/;\n"
}
$pu=$_[1]; $pt=$_[4];
}' > $dir/segments_to_fix

if [ -s $dir/segments_to_fix ]; then
echo "$0. Applying following fixes to segments"
cat $dir/segments_to_fix
perl -i -pf $dir/segments_to_fix $dir/segments
fi

# Copy stuff into its final locations
mkdir -p $odir
for f in spk2utt utt2spk wav.scp text segments reco2file_and_channel; do
cp $dir/$f $odir/$f || exit 1;
done

#Produce STMs for sclite scoring
local/AMI/convert2stm.pl $dir > $odir/stm
cp local/english.glm $odir/glm

utils/validate_data_dir.sh --no-feats $odir || exit 1;

echo AMI $SET set data preparation succeeded.

1 change: 1 addition & 0 deletions egs/opensat20/s5/local/AMI/ami_split_README.txt
1 change: 1 addition & 0 deletions egs/opensat20/s5/local/AMI/ami_split_dev.orig
1 change: 1 addition & 0 deletions egs/opensat20/s5/local/AMI/ami_split_eval.orig
1 change: 1 addition & 0 deletions egs/opensat20/s5/local/AMI/ami_split_segments.pl
1 change: 1 addition & 0 deletions egs/opensat20/s5/local/AMI/ami_split_train.orig