[egs] Opensat20 large vocabulary (#4367)
huangruizhe committed Dec 26, 2020
1 parent 8709640 commit 958f0b0
Showing 76 changed files with 5,510 additions and 2 deletions.
7 changes: 5 additions & 2 deletions egs/librispeech/s5/local/lm/install_festival.sh
@@ -35,8 +35,11 @@ fi

if [ "$stage" -le 2 ]; then
echo "Untarring the downloaded files..."
for f in `ls ./*.tar.*`; do
tar xf $f;
for f in `ls ./*.tar.gz`; do
tar -xzf $f;
done
for f in `ls ./*.tar.bz2`; do
tar -xf $f;
done
fi

49 changes: 49 additions & 0 deletions egs/opensat20/README.md
@@ -0,0 +1,49 @@
### OpenSAT 2020 recipe (https://sat.nist.gov/opensat20#tab_overview)

This is a Kaldi-based setup for the SAFE-T data (https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2020-safe-t-corpus.pdf).
The aim of this setup is to provide example scripts for using out-of-domain data in a
low-resource condition. It provides three recipes (run.sh, run_shared.sh, run_finetune.sh),
described as follows:

`Target data acoustic model (run.sh)`: We trained an ASR system using the 40 hours of SAFE-T
training data and evaluated it on the OpenSAT20 Dev audio. We used a CNN-TDNN-F
architecture: 6 CNN layers followed by 9 TDNN-F layers, each with 1024 neurons and a
bottleneck factorization to 128 dimensions with stride 3. Speed perturbation is used as
augmentation to increase the data size by a factor of 3; in addition, online spectral
augmentation makes each mini-batch unique and increases the robustness of the model.
With this setup we can quickly train an ASR system and get a baseline WER. The
resulting WER is available in the RESULTS file.
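
For reference, here is a minimal xconfig sketch of that topology. Only the
6-CNN/9-TDNN-F shape, the 1024-dim layers, the 128-dim bottlenecks, and
time-stride 3 come from the description above; the layer names, filter counts,
and output details are illustrative assumptions, not the exact contents of run.sh:

```
input dim=40 name=input
# six CNN layers (the filter counts here are assumptions)
conv-relu-batchnorm-layer name=cnn1 height-in=40 height-out=40 time-offsets=-1,0,1 height-offsets=-1,0,1 num-filters-out=48
conv-relu-batchnorm-layer name=cnn2 height-in=40 height-out=40 time-offsets=-1,0,1 height-offsets=-1,0,1 num-filters-out=48
# ... cnn3-cnn6 follow the same pattern ...
# nine TDNN-F layers: 1024 neurons, 128-dim bottleneck, stride 3
tdnnf-layer name=tdnnf7 dim=1024 bottleneck-dim=128 time-stride=3
# ... tdnnf8-tdnnf15 are identical ...
output-layer name=output include-log-softmax=false dim=$num_targets
```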

`Fine-tuning (run_finetune.sh)`: To increase the amount of training data for the acoustic
model, we used AMI and ICSI speech data with speed perturbation as data augmentation.
We used a CNN-TDNN-F architecture: 6 CNN layers followed by 9 TDNN-F layers, each with
1536 neurons and a bottleneck factorization to 160 dimensions with stride 3. Speed
perturbation and online spectral augmentation are used with this setup as well.
After training the acoustic model on the AMI and ICSI datasets, fine-tuning is performed
on the SAFE-T dataset with a lower learning rate. The resulting WER is available in the RESULTS file.
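
A hedged sketch of the fine-tuning invocation; the learning rates and directory
names below are illustrative assumptions, and the authoritative flags are in
run_finetune.sh:

```
# continue training the AMI+ICSI chain model on SAFE-T with a lower learning rate
steps/nnet3/chain/train.py \
  --trainer.input-model exp/chain_all/cnn_tdnn_all/final.mdl \
  --trainer.num-epochs 2 \
  --trainer.optimization.initial-effective-lrate 0.00025 \
  --trainer.optimization.final-effective-lrate 0.000025 \
  --feat-dir data/safe_t_train_hires \
  --tree-dir exp/chain_all/tree_sp \
  --lat-dir exp/chain_all/safe_t_train_lats \
  --dir exp/chain_finetune/cnn_tdnn_finetune
```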

`Shared (run_shared.sh)`: AMI and ICSI data are added to the SAFE-T data and a single
acoustic model is trained on all three datasets. As in the fine-tuning setup, the same
CNN-TDNN-F architecture is used with speed perturbation and spectral augmentation. We are
adding a script for other augmentations to this setup. The resulting WER is available in the RESULTS file.
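
The data pooling itself can be done with the standard Kaldi helper (the
directory names below are hypothetical):

```
# pool the three training sets into a single data directory
utils/combine_data.sh data/train_all \
  data/safe_t_train data/ami_train data/icsi_train
```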

#### Data
We use the SAFE-T data released with the following paper:
```
@inproceedings{delgado2020safe,
title={The SAFE-T Corpus: A New Resource for Simulated Public Safety Communications},
author={Delgado, Dana and Walker, Kevin and Strassel, Stephanie and Jones, Karen Sp{\"a}rck and Caruso, Christopher and Graff, David},
booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
pages={6450--6457},
year={2020}
}
```

SAFE-T (Speech Analysis for Emergency Response Technology) has 131 hours (labelled and unlabelled)
of single-channel 48 kHz training data. Most of the speakers are native English speakers. The
participants play the board game Flash Point: Fire Rescue. The recordings have no overlap and
little reverberation, but significant noise. The noise is artificial and the SNR varies with
time, ranging from 0–14 dB (noise levels of roughly 70–85 dB). The noises are cars, ambulances,
rain, or similar sounds. There are 87 speakers in total.
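
Note that the feature configs in conf/ assume 16 kHz input, so the 48 kHz
recordings have to be downsampled first; a one-line sketch with hypothetical
file names:

```
# downsample a 48 kHz recording to the 16 kHz expected by conf/mfcc.conf
sox recording_48k.wav -r 16000 recording_16k.wav
```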

16 changes: 16 additions & 0 deletions egs/opensat20/s5/RESULTS
@@ -0,0 +1,16 @@
CNN-TDNN-F setup result for the SAFE-T development set with only SAFE-T data
# %WER 13.36 [ 2606 / 19507, 217 ins, 1110 del, 1279 sub ] exp/chain_a/cnn_tdnn_1a_spec/decode_safe_t_dev1/wer_8_0.0

CNN-TDNN-F setup result for the SAFE-T development set with shared data (AMI+ICSI+SAFE-T)
(run1)
# %WER 12.61 [ 2460 / 19507, 245 ins, 1119 del, 1096 sub ] exp/chain_all/cnn_tdnn_all/decode_safe_t_dev1/wer_9_0.5
# %WER 11.70 [ 2283 / 19507, 228 ins, 964 del, 1091 sub ] exp/chain_finetune/cnn_tdnn_finetune_shared_ep2/decode_safe_t_dev1_finetune_tl/wer_8_0.5

(run2)
# %WER 12.85 [ 2507 / 19507, 254 ins, 1107 del, 1146 sub ] exp/chain_all/cnn_tdnn_all/decode_safe_t_dev1/wer_8_1.0
# %WER 11.86 [ 2313 / 19507, 298 ins, 895 del, 1120 sub ] exp/chain_finetune/cnn_tdnn_finetune_shared_ep2/decode_safe_t_dev1_finetune_tl/wer_8_0.0

CNN-TDNN-F setup result for the SAFE-T development set with only out-of-domain data (AMI+ICSI), then fine-tuned with SAFE-T data
# %WER 38.17 [ 7445 / 19507, 512 ins, 5278 del, 1655 sub ] exp/chain_train_icsiami/cnn_tdnn_train_icsiami/decode_safe_t_dev1_train_tl/wer_7_0.0
# %WER 12.20 [ 2379 / 19507, 248 ins, 1000 del, 1131 sub ] exp/chain_finetune/cnn_tdnn_finetune_ep2/decode_safe_t_dev1_finetune_tl/wer_9_0.0
# %WER 11.83 [ 2308 / 19507, 225 ins, 977 del, 1106 sub ] exp/chain_finetune/cnn_tdnn_finetune_ep3/decode_safe_t_dev1_finetune_tl/wer_8_0.5
20 changes: 20 additions & 0 deletions egs/opensat20/s5/cmd.sh
@@ -0,0 +1,20 @@
# You can change cmd.sh depending on what type of queue you are using.
# If you have no queueing system and want to run on a local machine, you
# can change all instances of 'queue.pl' to 'run.pl' (but be careful, and run
# commands one by one: most recipes will exhaust the memory on your
# machine). queue.pl works with GridEngine (qsub); slurm.pl works
# with Slurm. Different queues are configured differently, with different
# queue names and different ways of specifying things like memory;
# to account for these differences you can create and edit the file
# conf/queue.conf to match your queue's configuration. Search for
# conf/queue.conf in http://kaldi-asr.org/doc/queue.html for more information,
# or search for the string 'default_config' in utils/queue.pl or utils/slurm.pl.

export train_cmd="queue.pl --mem 1G"
export decode_cmd="queue.pl --mem 2G"
if [[ "$(hostname -f)" == "*.fit.vutbr.cz" ]]; then
queue_conf=$HOME/queue_conf/default.conf # see example /homes/kazi/iveselyk/queue_conf/default.conf,
export train_cmd="queue.pl --config $queue_conf --mem 2G --matylda 0.2"
export decode_cmd="queue.pl --config $queue_conf --mem 3G --matylda 0.1"
export cuda_cmd="queue.pl --config $queue_conf --gpu 1 --mem 10G --tmp 40G"
fi
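
# For a single machine with no queueing system, a run.pl setup would look
# like this (a sketch, not the default in this recipe):
# export train_cmd=run.pl
# export decode_cmd=run.pl
# export cuda_cmd="run.pl --gpu 1"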
11 changes: 11 additions & 0 deletions egs/opensat20/s5/conf/local.conf
@@ -0,0 +1,11 @@
use_pitch=false
use_ivector=true
# the language whose LDA-MLLT transform is used to train the global i-vector extractor
lda_mllt_lang=safet
# lang_list is a space-separated language list used for multilingual training
lang_list=(safet icsiami)
# lang2weight is a comma-separated list of weights, one per language, used to
# scale each example's output w.r.t. its input language during training
lang2weight="0.7,0.3"
# The language list used for decoding.
decode_lang_list=(safet)
2 changes: 2 additions & 0 deletions egs/opensat20/s5/conf/mfcc.conf
@@ -0,0 +1,2 @@
--use-energy=false # only non-default option.
--sample-frequency=16000 #
10 changes: 10 additions & 0 deletions egs/opensat20/s5/conf/mfcc_hires.conf
@@ -0,0 +1,10 @@
# config for high-resolution MFCC features, intended for neural network training.
# Note: we keep all cepstra, so it has the same info as filterbank features,
# but MFCC is more easily compressible (because less correlated) which is why
# we prefer this method.
--use-energy=false # use average of log energy, not energy.
--sample-frequency=16000
--num-mel-bins=40
--num-ceps=40
--low-freq=40
--high-freq=-400
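# usage sketch (hypothetical data dir): this config is normally consumed by
# steps/make_mfcc.sh --mfcc-config conf/mfcc_hires.conf data/train_hires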
1 change: 1 addition & 0 deletions egs/opensat20/s5/conf/online_cmvn.conf
@@ -0,0 +1 @@
# configuration file for apply-cmvn-online, used in the script ../local/online/run_online_decoding_nnet2.sh
2 changes: 2 additions & 0 deletions egs/opensat20/s5/conf/vad.conf
@@ -0,0 +1,2 @@
--vad-energy-threshold=30
--vad-energy-mean-scale=0.5
11 changes: 11 additions & 0 deletions egs/opensat20/s5/local.conf
@@ -0,0 +1,11 @@
use_pitch=false
use_ivector=true
# the language whose LDA-MLLT transform is used to train the global i-vector extractor
lda_mllt_lang=safet
# lang_list is a space-separated language list used for multilingual training
lang_list=(safet icsiami)
# lang2weight is a comma-separated list of weights, one per language, used to
# scale each example's output w.r.t. its input language during training
lang2weight="0.7,0.3"
# The language list used for decoding.
decode_lang_list=(safet)
109 changes: 109 additions & 0 deletions egs/opensat20/s5/local/AMI/ami_ihm_data_prep.sh
@@ -0,0 +1,109 @@
#!/usr/bin/env bash

# Copyright 2014 University of Edinburgh (Author: Pawel Swietojanski)
# 2016 Johns Hopkins University (Author: Daniel Povey)
# AMI Corpus training data preparation
# Apache 2.0

# Note: this is called by ../run.sh.

# To be run from one directory above this script.

. ./path.sh

# check the arguments
if [ $# -ne 1 ]; then
echo "Usage: $0 /path/to/AMI"
echo "e.g. $0 /foo/bar/AMI"
exit 1;
fi

AMI_DIR=$1

SEGS=data/local/AMI_annotations/train.txt
dir=data/local/AMI_ihm/train
odir=data/AMI/train_orig
mkdir -p $dir

# Audio data directory check
if [ ! -d $AMI_DIR ]; then
echo "Error: $AMI_DIR directory does not exists."
exit 1;
fi

# And transcripts check
if [ ! -f $SEGS ]; then
echo "Error: File $SEGS no found (run ami_text_prep.sh)."
exit 1;
fi


# find headset wav audio files only
find $AMI_DIR -iname '*.Headset-*.wav' | sort > $dir/wav.flist
n=`cat $dir/wav.flist | wc -l`
echo "In total, $n headset files were found."
[ $n -ne 687 ] && \
echo "Warning: expected 687 (168 mtgs x 4 mics + 3 mtgs x 5 mics) data files, found $n"

# (1a) Transcriptions preparation
# here we start with normalised transcriptions, the utt ids follow the convention
# AMI_MEETING_CHAN_SPK_STIME_ETIME
# AMI_ES2011a_H00_FEE041_0003415_0003484
# we use uniq as some (rare) entries are doubled in transcripts

awk '{meeting=$1; channel=$2; speaker=$3; stime=$4; etime=$5;
printf("AMI_%s_%s_%s_%07.0f_%07.0f", meeting, channel, speaker, int(100*stime+0.5), int(100*etime+0.5));
for(i=6;i<=NF;i++) printf(" %s", $i); printf "\n"}' $SEGS | sort | uniq > $dir/text

# (1b) Make segment files from transcript

awk '{
segment=$1;
split(segment,S,"[_]");
audioname=S[1]"_"S[2]"_"S[3]; startf=S[5]; endf=S[6];
print segment " " audioname " " startf*10/1000 " " endf*10/1000 " "
}' < $dir/text > $dir/segments

# (1c) Make wav.scp file.

sed -e 's?.*/??' -e 's?.wav??' $dir/wav.flist | \
perl -ne 'split; $_ =~ m/(.*)\..*\-([0-9])/; print "AMI_$1_H0$2\n"' | \
paste - $dir/wav.flist > $dir/wav1.scp

# Keep only the train part of the waves
awk '{print $2}' $dir/segments | sort -u | join - $dir/wav1.scp > $dir/wav2.scp

# Replace the path with an appropriate sox command that selects a single channel only
awk '{print $1" sox -c 1 -t wavpcm -e signed-integer "$2" -t wavpcm - |"}' $dir/wav2.scp > $dir/wav.scp

# (1d) reco2file_and_channel
cat $dir/wav.scp \
| perl -ane '$_ =~ m:^(\S+)(H0[0-4])\s+.*\/([IETB].*)\.wav.*$: || die "bad label $_";
print "$1$2 $3 A\n"; ' > $dir/reco2file_and_channel || exit 1;


# In this data-prep phase we adapt to the session and speaker [later on we may
# split into shorter pieces]., We use the 0th, 1st and 3rd underscore-separated
# fields of the utterance-id as the speaker-id,
# e.g. 'AMI_EN2001a_IHM_FEO065_0090130_0090775' becomes 'AMI_EN2001a_FEO065'.
awk '{print $1}' $dir/segments | \
perl -ane 'chop; @A = split("_", $_); $spkid = join("_", @A[0,1,3]); print "$_ $spkid\n";' \
>$dir/utt2spk || exit 1;


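# Note: the mapping below overwrites the utt2spk written just above; its
# speaker-ids also keep the channel field, e.g. 'AMI_ES2011a_H00_FEE041'.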
awk '{print $1}' $dir/segments | \
perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "bad label $_";
print "$1$2$3 $1$2\n";' > $dir/utt2spk || exit 1;

utils/utt2spk_to_spk2utt.pl <$dir/utt2spk >$dir/spk2utt || exit 1;

# Copy stuff into its final location
mkdir -p $odir
for f in spk2utt utt2spk wav.scp text segments reco2file_and_channel; do
cp $dir/$f $odir/$f || exit 1;
done

utils/fix_data_dir.sh $odir
utils/validate_data_dir.sh --no-feats $odir || exit 1;

echo AMI IHM data preparation succeeded.
120 changes: 120 additions & 0 deletions egs/opensat20/s5/local/AMI/ami_ihm_scoring_data_prep.sh
@@ -0,0 +1,120 @@
#!/usr/bin/env bash


# Copyright 2014 University of Edinburgh (Author: Pawel Swietojanski)
# 2016 Johns Hopkins University (Author: Daniel Povey)
# AMI Corpus dev/eval data preparation
# Apache 2.0

# Note: this is called by ../run.sh.

. ./path.sh

# check the arguments
if [ $# != 2 ]; then
echo "Usage: $0 /path/to/AMI (dev|eval)"
exit 1;
fi

AMI_DIR=$1
SET=$2
SEGS=data/local/AMI_annotations/$SET.txt

dir=data/local/AMI_ihm/$SET
odir=data/AMI/${SET}_orig
mkdir -p $dir

# Audio data directory check
if [ ! -d $AMI_DIR ]; then
echo "Error: run.sh requires a directory argument"
exit 1;
fi

# And transcripts check
if [ ! -f $SEGS ]; then
echo "Error: File $SEGS no found (run ami_text_prep.sh)."
exit 1;
fi

# Find headset wav audio files only; here we again get all the
# files in the corpus and filter only the specific sessions
# while building segments.

find $AMI_DIR -iname '*.Headset-*.wav' | sort > $dir/wav.flist
n=`cat $dir/wav.flist | wc -l`
echo "In total, $n headset files were found."
[ $n -ne 687 ] && \
echo "Warning: expected 687 (168 mtgs x 4 mics + 3 mtgs x 5 mics) data files, found $n"

# (1a) Transcriptions preparation
# here we start with normalised transcriptions, the utt ids follow the convention
# AMI_MEETING_CHAN_SPK_STIME_ETIME
# AMI_ES2011a_H00_FEE041_0003415_0003484

awk '{meeting=$1; channel=$2; speaker=$3; stime=$4; etime=$5;
printf("AMI_%s_%s_%s_%07.0f_%07.0f", meeting, channel, speaker, int(100*stime+0.5), int(100*etime+0.5));
for(i=6;i<=NF;i++) printf(" %s", $i); printf "\n"}' $SEGS | sort | uniq > $dir/text

# (1b) Make segment files from transcript
#segments file format is: utt-id side-id start-time end-time, e.g.:
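# AMI_ES2011a_H00_FEE041_0003415_0003484 AMI_ES2011a_H00 34.15 34.84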

awk '{
segment=$1;
split(segment,S,"[_]");
audioname=S[1]"_"S[2]"_"S[3]; startf=S[5]; endf=S[6];
print segment " " audioname " " startf*10/1000 " " endf*10/1000 " "
}' < $dir/text > $dir/segments

#prepare wav.scp
sed -e 's?.*/??' -e 's?.wav??' $dir/wav.flist | \
perl -ne 'split; $_ =~ m/(.*)\..*\-([0-9])/; print "AMI_$1_H0$2\n"' | \
paste - $dir/wav.flist > $dir/wav1.scp

# Keep only the $SET part of the waves
awk '{print $2}' $dir/segments | sort -u | join - $dir/wav1.scp > $dir/wav2.scp

# Replace the path with an appropriate sox command that selects a single channel only
awk '{print $1" sox -c 1 -t wavpcm -e signed-integer "$2" -t wavpcm - |"}' $dir/wav2.scp > $dir/wav.scp

# (1d) reco2file_and_channel
cat $dir/wav.scp \
| perl -ane '$_ =~ m:^(\S+)(H0[0-4])\s+.*\/([IETB].*)\.wav.*$: || die "bad label $_";
print "$1$2 $3 A\n"; ' > $dir/reco2file_and_channel || exit 1;

awk '{print $1}' $dir/segments | \
perl -ane '$_ =~ m:^(\S+)([FM][A-Z]{0,2}[0-9]{3}[A-Z]*)(\S+)$: || die "segments: bad label $_";
print "$1$2$3 $1$2\n";' > $dir/utt2spk || exit 1;

sort -k 2 $dir/utt2spk | utils/utt2spk_to_spk2utt.pl > $dir/spk2utt || exit 1;

# Check and correct cases where segment timings for a given speaker overlap
# (important for simultaneous asclite scoring to proceed).
# There is actually only one such case in the dev set with automatic segmentations.
join $dir/utt2spk $dir/segments | \
perl -ne '{BEGIN{$pu=""; $pt=0.0;} split;
if ($pu eq $_[1] && $pt > $_[3]) {
print "s/^$_[0] $_[2] $_[3] $_[4]\$/$_[0] $_[2] $pt $_[4]/;\n"
}
$pu=$_[1]; $pt=$_[4];
}' > $dir/segments_to_fix

if [ -s $dir/segments_to_fix ]; then
echo "$0. Applying following fixes to segments"
cat $dir/segments_to_fix
perl -i -pf $dir/segments_to_fix $dir/segments
fi

# Copy stuff into its final locations
mkdir -p $odir
for f in spk2utt utt2spk wav.scp text segments reco2file_and_channel; do
cp $dir/$f $odir/$f || exit 1;
done

#Produce STMs for sclite scoring
local/AMI/convert2stm.pl $dir > $odir/stm
cp local/english.glm $odir/glm

utils/validate_data_dir.sh --no-feats $odir || exit 1;

echo AMI $SET set data preparation succeeded.

1 change: 1 addition & 0 deletions egs/opensat20/s5/local/AMI/ami_split_README.txt
1 change: 1 addition & 0 deletions egs/opensat20/s5/local/AMI/ami_split_dev.orig
1 change: 1 addition & 0 deletions egs/opensat20/s5/local/AMI/ami_split_eval.orig
1 change: 1 addition & 0 deletions egs/opensat20/s5/local/AMI/ami_split_segments.pl
1 change: 1 addition & 0 deletions egs/opensat20/s5/local/AMI/ami_split_train.orig