add aishell recipe #1742
Conversation
@keli78, can you please try this on our grid and see if it runs?
See if you can use the data in /export/a05/xna/openslr_resources_33 to
avoid re-downloading it from OpenSLR.
…On Wed, Jul 5, 2017 at 11:52 PM, Xingyu Na ***@***.***> wrote:
Add recipe for the AIShell corpus, which was recently added to
http://www.openslr.org/33/
You can view, comment on, or merge this pull request online at:
#1742
Commit Summary
- add aishell recipe
File Changes
- *A* egs/aishell/README.txt
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-0> (9)
- *A* egs/aishell/s5/RESULTS
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-1> (8)
- *A* egs/aishell/s5/cmd.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-2> (15)
- *A* egs/aishell/s5/conf/decode.config
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-3> (5)
- *A* egs/aishell/s5/conf/mfcc.conf
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-4> (2)
- *A* egs/aishell/s5/conf/mfcc_hires.conf
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-5> (10)
- *A* egs/aishell/s5/conf/online_cmvn.conf
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-6> (1)
- *A* egs/aishell/s5/conf/pitch.conf
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-7> (1)
- *A* egs/aishell/s5/local/aishell_data_prep.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-8> (57)
- *A* egs/aishell/s5/local/aishell_format_data.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-9> (60)
- *A* egs/aishell/s5/local/aishell_prepare_dict.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-10> (48)
- *A* egs/aishell/s5/local/aishell_train_lms.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-11> (88)
- *A* egs/aishell/s5/local/chain/run_tdnn.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-12> (186)
- *A* egs/aishell/s5/local/download_and_untar.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-13> (105)
- *A* egs/aishell/s5/local/nnet3/run_ivector_common.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-14> (145)
- *A* egs/aishell/s5/local/nnet3/run_tdnn.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-15> (104)
- *A* egs/aishell/s5/local/score.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-16> (8)
- *A* egs/aishell/s5/local/wer_hyp_filter
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-17> (19)
- *A* egs/aishell/s5/local/wer_output_filter
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-18> (25)
- *A* egs/aishell/s5/local/wer_ref_filter
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-19> (19)
- *A* egs/aishell/s5/path.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-20> (6)
- *A* egs/aishell/s5/run.sh
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-21> (144)
- *A* egs/aishell/s5/steps
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-22> (1)
- *A* egs/aishell/s5/utils
<https://github.com/kaldi-asr/kaldi/pull/1742/files#diff-23> (1)
Patch Links:
- https://github.com/kaldi-asr/kaldi/pull/1742.patch
- https://github.com/kaldi-asr/kaldi/pull/1742.diff
Sure, will check it soon.
I was able to avoid downloading the data, but I got two errors when running run.sh so far:
(Force-pushed from 30771fc to e4418cc.)
Thanks. I've committed the fix.
2017-07-06 13:58 GMT+08:00 Ke Li <notifications@github.com>:
… I was able to avoid downloading the data, but I got two errors when
running run.sh so far:
1. The first line in path.sh is not correct; I got the error "The
standard file ../../tools/config/common_path.sh is not present ->
Exit!"
2. With that fixed, I got another one, shown below:
local/aishell_train_lms.sh: line 47: get_word_map.pl: command not found
I guess the problem is that the way the script checks whether kaldi_lm is
installed is not correct.
After I fixed this issue, it runs OK so far.
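A hedged sketch of a more direct kaldi_lm check: probe for the specific tool the script actually calls (get_word_map.pl) rather than only testing for a directory. The KALDI_ROOT default below is an assumption for illustration, not taken from the recipe.

```shell
# Sketch only: check for the tool actually invoked, extending PATH from the
# usual kaldi_lm location first. The KALDI_ROOT default is an assumption.
KALDI_ROOT=${KALDI_ROOT:-../../..}
if [ -d "$KALDI_ROOT/tools/kaldi_lm" ]; then
  PATH=$PATH:$KALDI_ROOT/tools/kaldi_lm
fi
if command -v get_word_map.pl >/dev/null 2>&1; then
  kaldi_lm_status=found
else
  kaldi_lm_status=missing
  echo "get_word_map.pl not found; install kaldi_lm under \$KALDI_ROOT/tools first." >&2
fi
echo "kaldi_lm_status=$kaldi_lm_status"
```

Checking for the executable on PATH fails loudly at the right place, instead of the script dying mid-run with "command not found".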
--
Xingyu Na
n=`cat $train_dir/wav.flist $dev_dir/wav.flist $test_dir/wav.flist | wc -l`
[ $n -ne 141925 ] && \
  echo Warning: expected 141925 data files, found $n
I think this is not particularly effective: it runs the same find three times over the same directory. How about running it once to get a single list of files, comparing the number of lines, and only after that using grep on that file to get the partial file lists?
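A minimal sketch of that single-find approach. It builds a tiny throwaway directory tree as a stand-in for the real AIShell wav directory (which would have 141925 files), since the corpus itself isn't available here:

```shell
# Sketch of the suggestion: run find once, then split the list with grep.
# The directory tree below is a throwaway stand-in for the AIShell wav dir.
audio_dir=$(mktemp -d)
mkdir -p "$audio_dir"/train/S0002 "$audio_dir"/dev/S0724 "$audio_dir"/test/S0764
touch "$audio_dir"/train/S0002/a.wav "$audio_dir"/dev/S0724/b.wav \
      "$audio_dir"/test/S0764/c.wav "$audio_dir"/test/S0764/d.wav

# One find over the whole corpus...
find "$audio_dir" -iname "*.wav" > wav.flist.all
n=$(wc -l < wav.flist.all)
expected=4   # would be 141925 for the real corpus
if [ "$n" -ne "$expected" ]; then
  echo "Warning: expected $expected data files, found $n"
fi

# ...then derive the per-subset lists without re-running find.
grep "/train/" wav.flist.all > train_wav.flist
grep "/dev/"   wav.flist.all > dev_wav.flist
grep "/test/"  wav.flist.all > test_wav.flist
```

One directory walk instead of three, and the count check happens before any per-subset work.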
fsttablecompose data/lang/L_disambig.fst data/lang_test/G.fst | \
  fstisstochastic || echo LG is not stochastic

echo "$0: AISHELL data formatting succeeded"
There is a script format_lm.sh in the utils dir; perhaps you could call that one?
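For illustration, a hedged sketch of what that call might look like. The LM and lexicon paths are assumptions about this recipe's layout, not verified, and the guard keeps the snippet harmless outside a Kaldi egs directory:

```shell
# Sketch only: utils/format_lm.sh builds a test lang dir (including G.fst)
# from a lang dir plus a gzipped ARPA LM. Paths below are assumed, not
# taken from the recipe.
if [ -x utils/format_lm.sh ]; then
  utils/format_lm.sh data/lang data/local/lm/3gram-mincount/lm_unpruned.gz \
    data/local/dict/lexicon.txt data/lang_test
  format_lm_status=ran
else
  format_lm_status="skipped: utils/format_lm.sh not found (not in a Kaldi egs dir)"
fi
echo "$format_lm_status"
```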
cat $dir/word_map | awk '{print $1}' | cat - <(echo "<s>"; echo "</s>" ) > $sdir/wordlist

ngram-count -text $sdir/train -order 3 -limit-vocab -vocab $sdir/wordlist -unk \
Are you testing whether SRILM is installed?
This is just an example of using SRILM; a normal run of this script will exit before the SRILM part.
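Even for an optional section, an explicit guard is cheap. A sketch that tests for the tool actually used (ngram-count) rather than assuming an install location:

```shell
# Sketch of an explicit SRILM check: probe for ngram-count on PATH before
# the optional SRILM example runs.
if command -v ngram-count >/dev/null 2>&1; then
  have_srilm=yes
else
  have_srilm=no
  echo "ngram-count not on PATH; skipping the SRILM example." >&2
fi
echo "have_srilm=$have_srilm"
```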
#!/bin/bash

# This script is modified based on swbd/s5c/local/nnet3/run_ivector_common.sh
I think @danpovey was saying the mini_librispeech script is more recent and should be preferable.
# Train a system just for its LDA+MLLT transform. We use --num-iters 13
# because after we get the transform (12th iter is the last), any further
# training is pointless.
steps/train_lda_mllt.sh --cmd "$train_cmd" --num-iters 13 \
Can you try replacing this with PCA transform training? The parameters are the same (if you decide not to go for the mini_librispeech ivector-common script).
Yenda noticed some good things.
Try to take the mini-librispeech example-- the ivector script is simpler.
And you should be using PCA not LDA for the ivector script, the
mini-librispeech example will do this.
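A sketch of the PCA alternative, modeled on mini_librispeech's run_ivector_common.sh. The data/exp paths and option values here are illustrative assumptions for this recipe, and the guard keeps the snippet harmless outside a Kaldi egs directory:

```shell
# Sketch only: PCA transform training in place of the LDA+MLLT system.
# Paths and option values are assumptions, not taken from this recipe.
if [ -x steps/online/nnet2/get_pca_transform.sh ]; then
  steps/online/nnet2/get_pca_transform.sh --cmd "$train_cmd" \
    --splice-opts "--left-context=3 --right-context=3" \
    --max-utts 10000 --subsample 2 \
    data/train_hires_nopitch exp/nnet3/pca_transform
  pca_status=ran
else
  pca_status="skipped (not in a Kaldi egs dir)"
fi
echo "$pca_status"
```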
Thanks. I’ve made changes accordingly. Commit history is cleaned.
X.
@keli78, can you please check if this still runs?
Sure.
It's at the monophone training stage and no problems have occurred so far.
--egs.dir "$common_egs_dir" \
--egs.stage $get_egs_stage \
--egs.opts "--frames-overlap-per-eg 0" \
--egs.chunk-width $frames_per_eg \
I just noticed some very small issues with the nnet3+chain recipe.
Can you please rename this from 7h to 1a, and put it in the tuning/ directory, and create a soft link?
Also, can you make frames_per_eg a comma-separated list, like 150,110,90?
Also I think you'll get better results if you change renorm to batchnorm.
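A sketch of the two suggested tweaks. The layer names follow nnet3 xconfig conventions; the dim value is a placeholder, not taken from this recipe:

```shell
# Comma-separated chunk widths, passed unchanged via --egs.chunk-width;
# nnet3-chain training then mixes example lengths.
frames_per_eg=150,110,90
echo "--egs.chunk-width $frames_per_eg"

# And renorm -> batchnorm in the xconfig, e.g.:
#   before: relu-renorm-layer    name=tdnn1 dim=625
#   after:  relu-batchnorm-layer name=tdnn1 dim=625
# (dim=625 is a placeholder value.)
```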
@danpovey results updated.
* 'master' of https://github.com/kaldi-asr/kaldi: (36 commits)
  [scripts] Fix convert_nnet2_to_nnet3.py (kaldi-asr#1774)
  [egs] Add missing make_corpus_subset.sh in babel_multilang example (kaldi-asr#1766)
  [egs] Graphemic lexicon updates / fixes in babel/s5d recipe and hub4_spanish recipe (kaldi-asr#1740)
  [egs] update hkust results (kaldi-asr#1772)
  [egs] Update AMI chain experiments RE dropout, decay-time and proportional-shrink (kaldi-asr#1732)
  [egs] Fixes to the aishell (Mandarin) recipe (kaldi-asr#1770)
  [egs] Add recipe for aishell data (free Mandarin corpus, 170 hours total) (kaldi-asr#1742)
  [src] Change to arpa-reading code to accept blank lines with whitespace (kaldi-asr#1752)
  [scripts] For nnet3 training, add option to disable the model-combination (kaldi-asr#1757)
  [scripts] minor bugfix to nnet1 alignment script when creating lattices (kaldi-asr#1764)
  [src] Add support for row/column ranges when reading GeneralMatrix (kaldi-asr#1761)
  [src] Change name of option --norm-mean->--norm-means for consistency, thanks: 415198468@qq.com
  [egs] swbd/s5c, added 5 layer (b)lstm recipes (kaldi-asr#1759)
  [scripts] Fix bug in segment_long_utterances.sh (kaldi-asr#1758)
  [src] Fix indexing error in nnet1::Convolutional2DComponent (kaldi-asr#1755)
  [src] Fix usage message of program (thanks:jubang0219@gmail.com)
  [egs] some small updates to scripts (installing beamformit; segmentation example)
  [egs] Small fix to ami/s5b/local/chain/compare_wer_general.sh (kaldi-asr#1751)
  [build] Add configuration check for incompatible g++ compilers when CUDA is enabled. (kaldi-asr#1749)
  [egs] Update Librispeech nnet3 TDNN recipe (old one did not run) (kaldi-asr#1727)
  ...
Add recipe for the AIShell corpus, which was recently added to http://www.openslr.org/33/