WIP: lexicon learning #976
Conversation
# decode
if [ $stage -le 3 ]; then
  cp -rT data/lang_learned data/lang_learned_rescore
  cp data/lang_nosp/G.fst data/lang_learned/
Unless the words.txt are exactly the same, copying the graphs isn't right.
At least check that they are the same with cmp, and die if not.
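A minimal sketch of that guard, using the paths from the diff above:

    # refuse to copy G.fst unless the two words.txt files match exactly
    cmp -s data/lang_nosp/words.txt data/lang_learned/words.txt || \
      { echo "$0: words.txt differ between lang dirs; not copying G.fst" >&2; exit 1; }
    cp data/lang_nosp/G.fst data/lang_learned/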
Force-pushed from 7fbabf9 to 723f9a8
Can you please show us where on the grid this has been run?
Here: /export/a11/xzhang/tedlium/s5_r2/exp/tri3_lex_0.4_work (I added the affix "_0.4" for tuning purposes.) Sure. @jtrmal, I'd like to explain the whole thing to you tomorrow face to face if you want.
OK, I'll have a look.
Samuel, do these scripts include yesterday's changes/fixes? I remember there was a script that was crashing.
@jtrmal Nope. That's Dan's debug_lexicon script, which I used to generate alternative pronunciations, and that problem only happens for Pashto. I'll suggest changes after I figure out the problem. You can start with my scripts, which are more relevant to lexicon learning: local/run_learn_lex.sh, steps/dict/learn_lexicon.sh, steps/dict/select_prons_bayesian.py
import imp
import math
import numpy
Samuel, I see a bunch of modules you are not actually using here and in the other python scripts. Please remove those. Also, as a general principle, if you are importing something outside the default python libs, you should be a little bit more helpful about what is wrong, i.e.:

    try:
        import numpy
    except ImportError:
        print("Error: you don't seem to have numpy installed, or the install is not functional")
        raise

BTW, please install flake8 and run your script through that program.
Please avoid using numpy for stuff like this that's not math-intensive.
I didn't forget about this; I want to run the scripts. The thing is, the main script is huge, so when I talked with Dan we were leaning towards splitting it into smaller chunks (or isolating some chunks of the code into separate scripts), but I want to get a feel for it first. BTW, @xiaohui-zhang, you can still address the other things I suggested.
Thanks @jtrmal. Yep, I'm addressing the issues you found.
Force-pushed from 6ee9947 to e9e1103
Addressed the issues @jtrmal mentioned earlier.
. cmd.sh
. path.sh
I think these should be written as
. ./cmd.sh
. ./path.sh
remove_tags=true
only_words=true
icu_transform="Any-Lower"
cmd=run.pl
I think it's OK to remove the whole icu transform mess for the kaldi-wide scripts and perhaps just rely on "tr", or make it the user's job?
@danpovey any opinion? I used that to correctly normalize the casing of the lexicon (because, for example, in Turkish the lowercase of I is ı, not i) and it handled utf-8 encoding well. But I'm not sure if/how this is important for kaldi in general.
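For illustration, ICU's uconv tool shows why lowercasing is locale-sensitive (a sketch; the "tr-Lower" transform assumes an ICU build that includes the Turkish casing rules):

    echo "KALDI" | uconv -f utf-8 -t utf-8 -x "Any-Lower"   # -> kaldi
    echo "KALDI" | uconv -f utf-8 -t utf-8 -x "tr-Lower"    # -> kaldı (Turkish dotless ı)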
I just talked with Dan, and let's go with ignoring the normalization issues and making it the user's job. The same goes for apply_g2p.sh; that will simplify the g2p scripts considerably.
[ -f ./path.sh ] && . ./path.sh; # source the path.
. parse_options.sh || exit 1;
Might be better to write
. utils/parse_options.sh
@jtrmal and @xiaohui-zhang, can you please remind me what the next TODOs are to get this in shape?
I want to run the scripts on graphemic babel just to see how the scripts behave.
It'll be great if someone can run it on tedlium first, since English is easier to sanity-check.
@xiaohui-zhang, I was asking you to remove the lexicon case-normalization for g2p training, as after discussion with @danpovey it seemed it should be the user's responsibility. I don't seem to be able to find that comment now, but can you please address that?
I.e. please remove all the normalization; just assume the lexicon/wordlist given to the g2p scripts is already normalized. That will simplify the scripts greatly.
Samuel, any progress on implementing the changes?
. utils/parse_options.sh # accept options
Either use
set -e -o pipefail
or make sure all calls end with
|| exit 1
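A minimal sketch of the first style (a hypothetical script skeleton, assuming it runs inside an egs/*/s5 directory):

    #!/usr/bin/env bash
    set -e -o pipefail   # abort on any error, including failures inside pipelines
    . ./path.sh
    . utils/parse_options.sh
    # from here on, a failing call aborts the script without needing "|| exit 1"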
# Begin configuration section.
iters=5
stage=5
stage should be set to 0
I played with the scripts and I think steps/learn_lexicon.sh should really be re-designed.
Sorry, I've been struggling with the g2p toolkit thing recently. I am addressing these issues now.
I agree with @jtrmal that we should definitely take train_g2p out of the recipe. However, I doubt whether we should take apply_g2p out of the recipe. The reason is that, when running phone decoding (debug_lexicon), we prefer a lexicon which covers most words in the acoustic training data; otherwise too many OOVs will make phone decoding very slow. More importantly, in many cases we want to include pronunciation candidates from g2p in the pronunciation selection procedure. But I think we can make apply_g2p optional here, assuming the user-provided reference lexicon already covers most words in the acoustic training data and the user doesn't care about any pron from g2p during pronunciation selection. So I think I can add an option like "use_g2p_for_oov" and change the script accordingly. @danpovey what do you think?
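A hedged sketch of how that option could gate the g2p stage inside learn_lexicon.sh (the option name comes from the comment above; the output dir $dir/g2p_oov is an assumption):

    use_g2p_for_oov=true   # set false if the reference lexicon already covers the training vocab

    . utils/parse_options.sh

    if $use_g2p_for_oov; then
      # get pronunciation candidates for OOV words from a pre-trained g2p model
      steps/dict/apply_g2p.sh --var-counts 4 $data/oov_train.txt \
        $g2p_mdl_dir $dir/g2p_oov || exit 1;
    fi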
Usage messages always need to make it easy to tell whether something is an input or an output parameter (e.g. does the script write to, or read, the file?)
Sure. I noted each output file as "Output file containing XXX".
Thanks for @danpovey's detailed suggestions. I've addressed them in my last commit. Please review when you have time :-)
steps/align_fmllr.sh --nj $nj --cmd "$train_cmd" \
  $data $dir/lang_expanded_target_vocab $src_mdl $alidir || exit 1;
# Train another SAT system on the given data and put it in ${src_mdl}_${affix}, |
this would give uninformative errors if num_leaves and num_gauss were not set. Either set defaults, or check that the user has set them if retrain_src_mdl == true.
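A minimal sketch of such a check (variable names follow the comment; where exactly it goes in the script is an assumption):

    if $retrain_src_mdl; then
      # fail early with a clear message rather than a cryptic downstream error
      [ -z "$num_leaves" ] && echo "$0: set --num-leaves when --retrain-src-mdl is true" >&2 && exit 1;
      [ -z "$num_gauss" ] && echo "$0: set --num-gauss when --retrain-src-mdl is true" >&2 && exit 1;
    fi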
mkdir -p $dir/dict_expanded_target_vocab
cp $ref_dict/{extra_questions.txt,optional_silence.txt,nonsilence_phones.txt,silence_phones.txt} \
  $dir/dict_expanded_target_vocab 2>/dev/null
rm $dir/dict_expanded_target_vocab/lexiconp.txt 2>/dev/null
Remove lexicon.txt too, prepare_lang.sh automatically creates it from lexiconp.txt. Do this wherever you remove lexiconp.txt.
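I.e., a sketch of the suggested change to the snippet above:

    # whenever lexiconp.txt is removed, remove lexicon.txt as well;
    # prepare_lang.sh will regenerate it from the new lexiconp.txt
    rm -f $dir/dict_expanded_target_vocab/{lexicon.txt,lexiconp.txt}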
echo " <target-vocab> the vocabulary we want the final learned lexicon to cover." | ||
echo " <data> acoustic training data we use to get alternative" | ||
echo " pronunciations and collet acoustic evidence." | ||
echo " <src-mdl> the acoustic model based on which we re-train an SAT model" |
"acoustic model" or src-mdl implies a file, but I think that is not what you mean. Please rename to srcdir and say that it's a directory containing GMM-based models.
echo " <ref-dict> the dir which contains the reference lexicon (most probably hand-derived)" | ||
echo " we want to expand/improve, and nonsilence_phones.txt,.etc which we need " | ||
echo " for building new dict dirs." | ||
echo " <target-vocab> the vocabulary we want the final learned lexicon to cover." |
mention the format, e.g. one word per line. [or is it a symbol table format, like words.txt?]
if [ $stage -le 1 ]; then
  steps/dict/learn_lexicon.sh --g2p-pron-candidates "$data/lexicon_oov_g2p.txt" \
    --min-prob $min_prob --variants-prob-mass $variants_prob_mass \
    --variants-prob-mass2 $variants_prob_mass2 \
this variable name does not exist any more (and this means you have not tested the script after your latest changes).
  $data $dir/lang_combined_iter1 $mdl $dir/lats || exit 1;

# Get arc level information from the lattice.
$cmd JOB=1:$nj $dir/lats/log/get_arc_info.JOB.log \
I think the names lats_pass1 and lats_pass2 would be better than lats and lats2.
if [ $stage -le 3 ]; then
  echo "$0: Generate pronunciation candidates from phone-level decoding on acoustic training data.."
  mdl=$src_mdl && $retrain_src_mdl && mdl=${src_mdl}_${affix}
  steps/cleanup/debug_lexicon.sh --nj $nj --cmd "$decode_cmd" $data $dir/lang_expanded_train \
Something that I think you should investigate at some point (not necessarily right now) is whether you can improve this recipe by using a higher-order phone language model in steps/cleanup/debug_lexicon.sh. Currently it uses an un-smoothed bigram (adding unigrams causes a blowup), but you could train a triphone LM using make_phone_lm.py with the same inputs (alignments); that also does not smooth below bigram. You'd have to make sure the disambiguation symbol was taken care of... make_phone_lm.py uses a disambiguation symbol. You could probably use #0; just make sure it's in phones.txt and L_disambig has the self-loop to consume it.
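A hedged sketch of that experiment; ali-to-phones is a standard Kaldi binary, but the make_phone_lm.py option names below are assumptions to check against the script's usage message, and $disambig_id is a hypothetical variable holding the integer id of #0:

    # dump phone sequences from the training alignments
    ali-to-phones $srcdir/final.mdl "ark:gunzip -c $srcdir/ali.*.gz|" ark,t:- | \
      cut -d' ' -f2- > $dir/phone_seqs.int
    # train a triphone-order phone LM over those sequences (option names are assumptions)
    utils/lang/make_phone_lm.py --ngram-order=3 --phone-disambig-symbol=$disambig_id \
      < $dir/phone_seqs.int > $dir/phone_lm.fst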
Sure, I've added it to my TODO list.
else:
    raise Exception("Unknown value {0} for --{1}".format(values, self.dest))

def GetArgs():
We are moving the python to a pep8-compliant style (e.g. lower_case function names), and validation with flake8, but you could do that later.
OK, thanks.
words.add(entry[0])
print("{0} {1}".format(entry[0], entry[1]),
      file = args.out_lexicon_handle)
print ("Before pruning, the total num. pronunciations is: {}".format(len(lexicon)))
I think it's best to print things like this to stderr, as a general rule. That way you can redirect the output to /dev/stdout and things will work.
break
if not all_ref_prons_accepted or flags[0] == '1' or flags[2] == '1':
    words_to_edit.append((word, counts[word]))
if count > thr:
Please use threshold, not thr; the Google style guide is against abbreviated names.
Guys, I probably won't have time for this until mid-December.
  $data/train_vocab.txt | sort > $data/oov_train.txt || exit 1;
steps/dict/apply_g2p.sh --var-counts 4 $data/oov_train.txt \
  $g2p_mdl_dir exp/g2p/oov_lex_train || exit 1;
cat exp/g2p/oov_lex_train/lexicon.lex | awk '{print $1,$3}' | \
I changed the awk command back to cut -f1,3, because the columns of lexicon.lex are separated by tabs while the phones are separated by spaces:
hello <tab> 0.8 <tab> h a l o
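For example (the output path is an assumption, taken from the --g2p-pron-candidates option earlier in the recipe):

    # cut splits on tabs only, so the space-separated phones in field 3 stay intact;
    # awk '{print $1,$3}' splits on any whitespace and would keep only the first phone
    cut -f1,3 exp/g2p/oov_lex_train/lexicon.lex > $data/lexicon_oov_g2p.txt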
So you think this is ready to merge?
Yep, as long as you don't have other suggestions.
@xiaohui-zhang Do you have results to commit too?
I'm asking for some big changes before we commit this, but Samuel, you ...
Since it looks like the alternative version we're looking into may be becoming a bigger and longer-term project, I'm considering committing this as a stopgap measure. Samuel, are there any small fixes that you need to apply to this PR?
@danpovey good to merge now
Hello @vince62s, sorry for the late reply. We've been working hard on improving WERs and it's hard to get consistent improvements, partly because the pronunciations we improved don't always appear in the dev/test sets. Also, my baseline was out-of-date. So I didn't bother putting numbers in RESULTS. I'll let you know when we have consistent WER improvements. Thanks!
@xiaohui-zhang I didn't try the scripts nor read them carefully, because I am not sure I fully understand what this does in the case of tedlium. I would appreciate some clarification. In the case of Tedlium, unless I am wrong, the training data do not have OOVs, since all words from the training data were included in the dictionary. So, if this is the case, am I right in saying that this script step will do nothing? From another general standpoint, if a user does the job on his side (using G2P to generate prons for OOV words in the training data), which I guess is what the tedlium team did when building their dict, will the script do some kind of redundant thing / detect some lexicon errors that were introduced by G2P in the first place? Thanks for your help.
Hi @vince62s, basically there are OOVs, because the vocab of data/local/dict/lexicon.txt was truncated to 200k. Also, this recipe is not showing how to deal with OOVs with lexicon learning. What this recipe does is debug the given reference lexicon (data/local/dict/lexicon.txt) so that the pronunciations agree more with the acoustics. The output lexicon is put in data/local/dict_learned_nosp. (Note that its vocab is the same as the reference lexicon's, because "target-vocab" in learn_lexicon.sh was set to ref-vocab, the vocab of the reference lexicon. The recipe also outputs a lexicon-edit file, which records the changes to the reference lexicon.)
The main script is egs/wsj/s5/steps/dict/learn_lexicon.sh.