add aishell recipe #1742
Conversation
@keli78, can you please try this on our grid and see if it runs?
See if you can use the data in /export/a05/xna/openslr_resources_33 to
avoid re-downloading it from OpenSLR.
…On Wed, Jul 5, 2017 at 11:52 PM, Xingyu Na ***@***.***> wrote:
Add recipe for the AIShell corpus, which was recently added to
http://www.openslr.org/33/
You can view, comment on, or merge this pull request online at:
#1742
Commit Summary
- add aishell recipe
File Changes
- *A* egs/aishell/README.txt
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-0> (9)
- *A* egs/aishell/s5/RESULTS
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-1> (8)
- *A* egs/aishell/s5/cmd.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-2> (15)
- *A* egs/aishell/s5/conf/decode.config
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-3> (5)
- *A* egs/aishell/s5/conf/mfcc.conf
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-4> (2)
- *A* egs/aishell/s5/conf/mfcc_hires.conf
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-5> (10)
- *A* egs/aishell/s5/conf/online_cmvn.conf
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-6> (1)
- *A* egs/aishell/s5/conf/pitch.conf
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-7> (1)
- *A* egs/aishell/s5/local/aishell_data_prep.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-8> (57)
- *A* egs/aishell/s5/local/aishell_format_data.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-9> (60)
- *A* egs/aishell/s5/local/aishell_prepare_dict.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-10> (48)
- *A* egs/aishell/s5/local/aishell_train_lms.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-11> (88)
- *A* egs/aishell/s5/local/chain/run_tdnn.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-12> (186)
- *A* egs/aishell/s5/local/download_and_untar.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-13> (105)
- *A* egs/aishell/s5/local/nnet3/run_ivector_common.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-14> (145)
- *A* egs/aishell/s5/local/nnet3/run_tdnn.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-15> (104)
- *A* egs/aishell/s5/local/score.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-16> (8)
- *A* egs/aishell/s5/local/wer_hyp_filter
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-17> (19)
- *A* egs/aishell/s5/local/wer_output_filter
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-18> (25)
- *A* egs/aishell/s5/local/wer_ref_filter
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-19> (19)
- *A* egs/aishell/s5/path.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-20> (6)
- *A* egs/aishell/s5/run.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-21> (144)
- *A* egs/aishell/s5/steps
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-22> (1)
- *A* egs/aishell/s5/utils
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-23> (1)
Patch Links:
- https://github.com/kaldi-asr/kaldi/pull/1742.patch
- https://github.com/kaldi-asr/kaldi/pull/1742.diff
Sure, will check it soon.
I was able to avoid downloading the data, but I got two errors when running run.sh so far:
(Force-pushed from 30771fc to e4418cc.)
Thanks. I've committed the fix.
2017-07-06 13:58 GMT+08:00 Ke Li <notifications@github.com>:
… I was able to avoid downloading the data, but I got two errors when
running run.sh so far:
1. The first line in path.sh is not correct; I got the error "The
standard file ../../tools/config/common_path.sh is not present ->
Exit!"
2. With that fixed, I got another one, shown below:
local/aishell_train_lms.sh: line 47: get_word_map.pl: command not found
I guess the problem is that the way the script checks whether kaldi_lm is
installed is not correct.
After I fixed this issue, it runs OK so far.
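A hedged sketch of a more direct kaldi_lm check: probe for the specific tool the script actually calls (get_word_map.pl) rather than only testing for a directory. The KALDI_ROOT default below is an assumption for illustration, not taken from the recipe.

```shell
# Sketch only: check for the tool actually invoked, extending PATH from the
# usual kaldi_lm location first. The KALDI_ROOT default is an assumption.
KALDI_ROOT=${KALDI_ROOT:-../../..}
if [ -d "$KALDI_ROOT/tools/kaldi_lm" ]; then
  PATH=$PATH:$KALDI_ROOT/tools/kaldi_lm
fi
if command -v get_word_map.pl >/dev/null 2>&1; then
  kaldi_lm_status=found
else
  kaldi_lm_status=missing
  echo "get_word_map.pl not found; install kaldi_lm under \$KALDI_ROOT/tools first." >&2
fi
echo "kaldi_lm_status=$kaldi_lm_status"
```

Checking for the executable on PATH fails loudly at the right place, instead of the script dying mid-run with "command not found".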
--
Xingyu Na
n=`cat $train_dir/wav.flist $dev_dir/wav.flist $test_dir/wav.flist | wc -l`
[ $n -ne 141925 ] && \
  echo Warning: expected 141925 data files, found $n
I think this is not particularly effective: it runs the same find three times over the same directory. How about running it once to get a single list of files, comparing the number of lines, and only after that using grep on that file to get the partial file lists?
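A minimal sketch of that single-find approach. It builds a tiny throwaway directory tree as a stand-in for the real AIShell wav directory (which would have 141925 files), since the corpus itself isn't available here:

```shell
# Sketch of the suggestion: run find once, then split the list with grep.
# The directory tree below is a throwaway stand-in for the AIShell wav dir.
audio_dir=$(mktemp -d)
mkdir -p "$audio_dir"/train/S0002 "$audio_dir"/dev/S0724 "$audio_dir"/test/S0764
touch "$audio_dir"/train/S0002/a.wav "$audio_dir"/dev/S0724/b.wav \
      "$audio_dir"/test/S0764/c.wav "$audio_dir"/test/S0764/d.wav

# One find over the whole corpus...
find "$audio_dir" -iname "*.wav" > wav.flist.all
n=$(wc -l < wav.flist.all)
expected=4   # would be 141925 for the real corpus
if [ "$n" -ne "$expected" ]; then
  echo "Warning: expected $expected data files, found $n"
fi

# ...then derive the per-subset lists without re-running find.
grep "/train/" wav.flist.all > train_wav.flist
grep "/dev/"   wav.flist.all > dev_wav.flist
grep "/test/"  wav.flist.all > test_wav.flist
```

One directory walk instead of three, and the count check happens before any per-subset work.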
fsttablecompose data/lang/L_disambig.fst data/lang_test/G.fst | \
  fstisstochastic || echo LG is not stochastic

echo "$0: AISHELL data formatting succeeded"
There is a script format_lm.sh in the utils dir; perhaps you could call that one?
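For illustration, a hedged sketch of what that call might look like. The LM and lexicon paths are assumptions about this recipe's layout, not verified, and the guard keeps the snippet harmless outside a Kaldi egs directory:

```shell
# Sketch only: utils/format_lm.sh builds a test lang dir (including G.fst)
# from a lang dir plus a gzipped ARPA LM. Paths below are assumed, not
# taken from the recipe.
if [ -x utils/format_lm.sh ]; then
  utils/format_lm.sh data/lang data/local/lm/3gram-mincount/lm_unpruned.gz \
    data/local/dict/lexicon.txt data/lang_test
  format_lm_status=ran
else
  format_lm_status="skipped: utils/format_lm.sh not found (not in a Kaldi egs dir)"
fi
echo "$format_lm_status"
```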
cat $dir/word_map | awk '{print $1}' | cat - <(echo "<s>"; echo "</s>" ) > $sdir/wordlist

ngram-count -text $sdir/train -order 3 -limit-vocab -vocab $sdir/wordlist -unk \
Are you testing whether SRILM is installed?
This is just an example of using SRILM; a normal run of this script will exit before the SRILM part.
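Even for an optional section, an explicit guard is cheap. A sketch that tests for the tool actually used (ngram-count) rather than assuming an install location:

```shell
# Sketch of an explicit SRILM check: probe for ngram-count on PATH before
# the optional SRILM example runs.
if command -v ngram-count >/dev/null 2>&1; then
  have_srilm=yes
else
  have_srilm=no
  echo "ngram-count not on PATH; skipping the SRILM example." >&2
fi
echo "have_srilm=$have_srilm"
```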
#!/bin/bash

# This script is modified based on swbd/s5c/local/nnet3/run_ivector_common.sh
I think @danpovey was saying the mini_librispeech script is more recent and should be preferable.
# Train a system just for its LDA+MLLT transform. We use --num-iters 13
# because after we get the transform (12th iter is the last), any further
# training is pointless.
steps/train_lda_mllt.sh --cmd "$train_cmd" --num-iters 13 \
Can you try replacing this with PCA transform training? The parameters are the same (if you decide not to go for the mini_librispeech ivector-common script).
Yenda noticed some good things.
Try to take the mini-librispeech example-- the ivector script is simpler.
And you should be using PCA not LDA for the ivector script, the
mini-librispeech example will do this.
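A sketch of the PCA alternative, modeled on mini_librispeech's run_ivector_common.sh. The data/exp paths and option values here are illustrative assumptions for this recipe, and the guard keeps the snippet harmless outside a Kaldi egs directory:

```shell
# Sketch only: PCA transform training in place of the LDA+MLLT system.
# Paths and option values are assumptions, not taken from this recipe.
if [ -x steps/online/nnet2/get_pca_transform.sh ]; then
  steps/online/nnet2/get_pca_transform.sh --cmd "$train_cmd" \
    --splice-opts "--left-context=3 --right-context=3" \
    --max-utts 10000 --subsample 2 \
    data/train_hires_nopitch exp/nnet3/pca_transform
  pca_status=ran
else
  pca_status="skipped (not in a Kaldi egs dir)"
fi
echo "$pca_status"
```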
Thanks. I’ve made changes accordingly. Commit history is cleaned.
X.
@keli78, can you please check if this still runs?
Sure.
It's at the monophone training stage and no problems have occurred so far.
--egs.dir "$common_egs_dir" \
--egs.stage $get_egs_stage \
--egs.opts "--frames-overlap-per-eg 0" \
--egs.chunk-width $frames_per_eg \
I just noticed some very small issues with the nnet3+chain recipe.
Can you please rename this from 7h to 1a, and put it in the tuning/ directory, and create a soft link?
Also, can you make frames_per_eg a comma-separated list, like 150,110,90?
Also I think you'll get better results if you change renorm to batchnorm.
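A sketch of the two suggested tweaks. The layer names follow nnet3 xconfig conventions; the dim value is a placeholder, not taken from this recipe:

```shell
# Comma-separated chunk widths, passed unchanged via --egs.chunk-width;
# nnet3-chain training then mixes example lengths.
frames_per_eg=150,110,90
echo "--egs.chunk-width $frames_per_eg"

# And renorm -> batchnorm in the xconfig, e.g.:
#   before: relu-renorm-layer    name=tdnn1 dim=625
#   after:  relu-batchnorm-layer name=tdnn1 dim=625
# (dim=625 is a placeholder value.)
```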
@danpovey results updated.
* 'master' of https://github.com/kaldi-asr/kaldi: (36 commits)
  [scripts] Fix convert_nnet2_to_nnet3.py (kaldi-asr#1774)
  [egs] Add missing make_corpus_subset.sh in babel_multilang example (kaldi-asr#1766)
  [egs] Graphemic lexicon updates / fixes in babel/s5d recipe and hub4_spanish recipe (kaldi-asr#1740)
  [egs] update hkust results (kaldi-asr#1772)
  [egs] Update AMI chain experiments RE dropout, decay-time and proportional-shrink (kaldi-asr#1732)
  [egs] Fixes to the aishell (Mandarin) recipe (kaldi-asr#1770)
  [egs] Add recipe for aishell data (free Mandarin corpus, 170 hours total) (kaldi-asr#1742)
  [src] Change to arpa-reading code to accept blank lines with whitespace (kaldi-asr#1752)
  [scripts] For nnet3 training, add option to disable the model-combination (kaldi-asr#1757)
  [scripts] minor bugfix to nnet1 alignment script when creating lattices (kaldi-asr#1764)
  [src] Add support for row/column ranges when reading GeneralMatrix (kaldi-asr#1761)
  [src] Change name of option --norm-mean->--norm-means for consistency, thanks: 415198468@qq.com
  [egs] swbd/s5c, added 5 layer (b)lstm recipes (kaldi-asr#1759)
  [scripts] Fix bug in segment_long_utterances.sh (kaldi-asr#1758)
  [src] Fix indexing error in nnet1::Convolutional2DComponent (kaldi-asr#1755)
  [src] Fix usage message of program (thanks:jubang0219@gmail.com)
  [egs] some small updates to scripts (installing beamformit; segmentation example)
  [egs] Small fix to ami/s5b/local/chain/compare_wer_general.sh (kaldi-asr#1751)
  [build] Add configuration check for incompatible g++ compilers when CUDA is enabled. (kaldi-asr#1749)
  [egs] Update Librispeech nnet3 TDNN recipe (old one did not run) (kaldi-asr#1727)
  ...
Add recipe for the AIShell corpus, which was recently added to http://www.openslr.org/33/