WIP: lexicon learning #976

Merged
merged 9 commits into from Dec 1, 2016
Conversation

xiaohui-zhang
Contributor

The main script is egs/wsj/s5/steps/dict/learn_lexicon.sh.

# decode
if [ $stage -le 3 ]; then
cp -rT data/lang_learned data/lang_learned_rescore
cp data/lang_nosp/G.fst data/lang_learned/
Contributor

Unless the words.txt files are exactly the same, copying the graphs isn't right.

Contributor

At least check that they are the same with cmp, and die if not.
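For instance, a minimal sketch of that guard (reusing the paths from the snippet above) could be:

# Refuse to copy G.fst unless the word symbol tables match exactly.
if ! cmp -s data/lang_nosp/words.txt data/lang_learned/words.txt; then
  echo "$0: words.txt differs between data/lang_nosp and data/lang_learned; not copying G.fst" >&2
  exit 1
fi
cp data/lang_nosp/G.fst data/lang_learned/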

@danpovey
Contributor

danpovey commented Oct 6, 2016

Can you please show us where on the grid this has been run?
@jtrmal, if you have time it would be good if you could look over this.

@xiaohui-zhang
Contributor Author

Here: /export/a11/xzhang/tedlium/s5_r2/exp/tri3_lex_0.4_work (I added the affix "_0.4" for tuning purposes.) Sure, @jtrmal, I'd like to explain the whole thing to you tomorrow face to face if you want.

@jtrmal
Contributor

jtrmal commented Oct 6, 2016

OK, I'll have a look.
y.


@jtrmal
Contributor

jtrmal commented Oct 7, 2016

Samuel, do these scripts include yesterday's changes/fixes? I remember there was a script that was crashing.

@xiaohui-zhang
Contributor Author

@jtrmal Nope. That's Dan's debug_lexicon script, which I used to generate alternative pronunciations, and that problem only happens for Pashto. I'll suggest changes after I figure out the problem. You can start with my scripts, which are more relevant to lexicon learning: local/run_learn_lex.sh, steps/dict/learn_lexicon.sh, steps/dict/select_prons_bayesian.py

import imp
import math
import numpy

Contributor

@jtrmal jtrmal Oct 7, 2016

Samuel, I see a bunch of modules you are not actually using here and in the other Python scripts.
Please remove those. Also, as a general principle, if you are importing something outside the default Python libs, you should be a little more helpful about what is wrong.

i.e.

try:
  import numpy
except ImportError:
  print("Error: you don't seem to have numpy installed, OR the install is not functional")
  raise

btw, please install flake8 and run your script through that program

@danpovey
Contributor

danpovey commented Oct 7, 2016

Please avoid using numpy for stuff like this that's not math intensive...
it's not always installed.


@jtrmal
Contributor

jtrmal commented Oct 11, 2016

I didn't forget about this; I want to run the scripts. The thing is, the main script is huge, so when I talked with Dan we were leaning towards splitting it into smaller chunks (or isolating some chunks of the code into separate scripts), but I want to get a feel for it first.

BTW, @xiaohui-zhang, you can still address the other things I suggested.

@xiaohui-zhang
Contributor Author

Thanks @jtrmal. Yep, I'm addressing the issues you found.


@xiaohui-zhang
Contributor Author

Addressed the issues @jtrmal mentioned earlier.


. cmd.sh
. path.sh

Contributor

I think these should be written as

. ./cmd.sh
. ./path.sh

remove_tags=true
only_words=true
icu_transform="Any-Lower"
cmd=run.pl
Contributor

I think it's OK to remove the whole icu transform mess for the kaldi-wide scripts and perhaps just rely on "tr", or make it the user's job?
@danpovey any opinion? I used that to correctly normalize the casing of the lexicon (because, for example, in Turkish "i" is not the lowercase of "I") and it was handling UTF-8 encoding well. But I'm not sure if/how this is important for Kaldi in general.

Contributor

I just talked with Dan; let's go with ignoring the normalization issues and make it the user's job. The same goes for apply_g2p.sh.
That will simplify the g2p scripts considerably.


[ -f ./path.sh ] && . ./path.sh; # source the path.
. parse_options.sh || exit 1;

Contributor

Might be better to write

. utils/parse_options.sh

@danpovey
Contributor

@jtrmal and @xiaohui-zhang, can you please remind me what the next TODOs are to get this in shape?

@jtrmal
Contributor

jtrmal commented Oct 17, 2016

I want to run the scripts on graphemic Babel just to see how the scripts behave...
y.


@xiaohui-zhang
Contributor Author

It'll be great if someone can run it on tedlium first, since English is understandable and it's easier to figure out potential problems in my recipe...


Contributor

@jtrmal jtrmal left a comment

@xiaohui-zhang, I was asking you to remove the lexicon case-normalization for g2p training, as after discussion with @danpovey it seemed it should be the user's responsibility. I can't seem to find that comment now, but can you please address it?

I.e., please remove all the normalization and just assume the lexicon/wordlist given to the g2p scripts is already normalized. That will simplify the scripts greatly.

@jtrmal
Contributor

jtrmal commented Oct 20, 2016

Samuel, any progress on implementing the changes?



. utils/parse_options.sh # accept options

Contributor

Either use

set -e -o pipefail

or make sure all calls end with

|| exit 1
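For illustration, a minimal sketch of the two idioms (the commands are placeholders, not taken from this PR):

#!/usr/bin/env bash
# Option 1: fail fast, including on errors inside pipelines.
set -e -o pipefail
cat some_input.txt | sort > sorted.txt   # any failure here aborts the script

# Option 2: without set -e, terminate each critical call explicitly.
cat some_input.txt | sort > sorted.txt || exit 1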


# Begin configuration section.
iters=5
stage=5
Contributor

stage should be set to 0

@jtrmal
Contributor

jtrmal commented Oct 20, 2016

I played with the scripts and I think steps/learn_lexicon.sh should really be re-designed.
Mainly, the train_g2p and apply_g2p steps should be taken out.
The reason is that in many recipes you will already have a g2p trained (especially given the fact that we will shift the responsibility for lexicon normalization for g2p to the user), or you actually do not need the g2p at all -- for graphemic systems you can generate pronunciations right away (albeit badly), and in some cases you might decide you don't want to mess with g2p at all and want to use only the prons found in the lattice.

@xiaohui-zhang
Contributor Author

Sorry, I've been struggling with the g2p toolkit recently. I'm addressing these issues now.

@xiaohui-zhang
Contributor Author

I agree with @jtrmal that we should definitely take train_g2p out of the recipe. However, I doubt whether we should take apply_g2p out as well. The reason is that, when running phone decoding (debug_lexicon), we prefer a lexicon which covers most words in the acoustic training data; otherwise too many OOVs will make phone decoding very slow. More importantly, in many cases we want to include pronunciation candidates from g2p in the pronunciation selection procedure. But I think we can make apply_g2p optional here, assuming the user-provided reference lexicon already covers most words in the acoustic training data and the user doesn't care about any pron from g2p during pronunciation selection. So I think I can add an option like "use_g2p_for_oov" and change the script accordingly. @danpovey what do you think?
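A rough sketch of how such an option could look inside learn_lexicon.sh (the option name use_g2p_for_oov and the surrounding variable names are hypothetical, modeled on the apply_g2p.sh call used elsewhere in this PR):

use_g2p_for_oov=true   # hypothetical option name

if $use_g2p_for_oov; then
  # Generate pronunciation candidates for OOV words with a pre-trained g2p model.
  steps/dict/apply_g2p.sh --var-counts 4 $dir/oov_train.txt \
    $g2p_mdl_dir $dir/g2p_oov || exit 1;
  g2p_prons=$dir/g2p_oov/lexicon.lex
else
  # Skip g2p: pronunciation selection only sees prons from the reference
  # lexicon and from phone decoding.
  g2p_prons=/dev/null
fi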

@danpovey
Contributor

danpovey commented Nov 9, 2016

Usage messages always need to make it easy to tell whether something is an input or an output parameter (e.g. does the script write to, or read, the file?)

@xiaohui-zhang
Contributor Author

Sure. I noted each output file as "Output file containing XXX".
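For example, the usage lines could follow a convention like this (argument names borrowed from the usage text quoted later in this thread; the wording is only illustrative):

echo "  <ref-dict>    Input: dir containing the reference lexicon we want to expand/improve."
echo "  <dest-dict>   Output: dir where the learned lexicon will be written."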


@xiaohui-zhang
Contributor Author

Thanks for @danpovey's detailed suggestions. I've addressed them in my last commit. Please review when you have time :-)

steps/align_fmllr.sh --nj $nj --cmd "$train_cmd" \
$data $dir/lang_expanded_target_vocab $src_mdl $alidir || exit 1;

# Train another SAT system on the given data and put it in ${src_mdl}_${affix},
Contributor

this would give uninformative errors if num_leaves and num_gauss were not set. Either set defaults, or check that the user has set them if retrain_src_mdl == true.
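A minimal sketch of that check, assuming the option names num_leaves, num_gauss, and retrain_src_mdl mentioned above:

if $retrain_src_mdl; then
  if [ -z "$num_leaves" ] || [ -z "$num_gauss" ]; then
    echo "$0: --num-leaves and --num-gauss must be set when --retrain-src-mdl is true" >&2
    exit 1
  fi
fi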

mkdir -p $dir/dict_expanded_target_vocab
cp $ref_dict/{extra_questions.txt,optional_silence.txt,nonsilence_phones.txt,silence_phones.txt} \
$dir/dict_expanded_target_vocab 2>/dev/null
rm $dir/dict_expanded_target_vocab/lexiconp.txt 2>/dev/null
Contributor

Remove lexicon.txt too, prepare_lang.sh automatically creates it from lexiconp.txt. Do this wherever you remove lexiconp.txt.
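A sketch of the suggested cleanup, using the paths from the snippet above (assuming prepare_lang.sh regenerates whichever of the pair it needs, as noted):

# Remove both files so a stale lexicon cannot shadow the newly learned one.
rm -f $dir/dict_expanded_target_vocab/lexicon.txt \
      $dir/dict_expanded_target_vocab/lexiconp.txt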

echo " <target-vocab> the vocabulary we want the final learned lexicon to cover."
echo " <data> acoustic training data we use to get alternative"
echo " pronunciations and collet acoustic evidence."
echo " <src-mdl> the acoustic model based on which we re-train an SAT model"
Contributor

"acoustic model" or src-mdl implies a file, but I think that is not what you mean. Please rename to srcdir and say that it's a directory containing GMM-based models.

echo " <ref-dict> the dir which contains the reference lexicon (most probably hand-derived)"
echo " we want to expand/improve, and nonsilence_phones.txt,.etc which we need "
echo " for building new dict dirs."
echo " <target-vocab> the vocabulary we want the final learned lexicon to cover."
Contributor

mention the format, e.g. one word per line. [or is it a symbol table format, like words.txt?]

if [ $stage -le 1 ]; then
steps/dict/learn_lexicon.sh --g2p-pron-candidates "$data/lexicon_oov_g2p.txt" \
--min-prob $min_prob --variants-prob-mass $variants_prob_mass \
--variants-prob-mass2 $variants_prob_mass2 \
Contributor

this variable name does not exist any more (and this means you have not tested the script after your latest changes).

$data $dir/lang_combined_iter1 $mdl $dir/lats || exit 1;

# Get arc level information from the lattice.
$cmd JOB=1:$nj $dir/lats/log/get_arc_info.JOB.log \
Contributor

I think the names lats_pass1 and lats_pass2 would be better than lats and lats2.

if [ $stage -le 3 ]; then
echo "$0: Generate pronunciation candidates from phone-level decoding on acoustic training data.."
mdl=$src_mdl && $retrain_src_mdl && mdl=${src_mdl}_${affix}
steps/cleanup/debug_lexicon.sh --nj $nj --cmd "$decode_cmd" $data $dir/lang_expanded_train \
Contributor

Something that I think you should investigate at some point (not necessarily right now), is whether you can improve this recipe by using a higher-order phone language model in steps/cleanup/debug_lexicon.sh. Currently it uses an un-smoothed bigram (adding unigrams causes a blowup), but you could train a triphone LM using make_phone_lm.py using the same inputs (alignments); that also does not smooth below bigram. You'd have to make sure the disambiguation symbol was taken care of... make_phone_lm.py uses a disambiguation symbol. You could probably use #0, just make sure it's in phones.txt and L_disambig has the self-loop to consume it.

Contributor Author

Sure, I've added it to my TODO list.

else:
raise Exception("Unknown value {0} for --{1}".format(values, self.dest))

def GetArgs():
Contributor

We are moving the python to a pep8-compliant style (e.g. lower_case function names), and validation with flake8, but you could do that later.

Contributor Author

OK, thanks.

words.add(entry[0])
print("{0} {1}".format(entry[0], entry[1]),
file = args.out_lexicon_handle)
print ("Before pruning, the total num. pronunciations is: {}".format(len(lexicon)))
Contributor

I think it's best to print things like this to stderr, as a general rule. That way you can redirect the output to /dev/stdout and things will work.

break
if not all_ref_prons_accepted or flags[0] == '1' or flags[2] == '1':
words_to_edit.append((word, counts[word]))
if count > thr:
Contributor

Please use threshold, not thr -- the Google style guide is against abbreviated names.

@jtrmal
Contributor

jtrmal commented Nov 16, 2016

Guys, I probably won't have time for this until mid-December (until the LORELEI dry run finishes).
I tried the scripts on Babel data (Pashto) a couple of weeks back and it actually crashed. Unfortunately I don't have the old system anymore and I don't know what the error was.
I was wondering if @xiaohui-zhang could try to run the scripts and see if he gets the error, and if yes, what the reason for it is. It might just be that I was using older scripts and the bug has been fixed since then. I'm not saying he has to actually improve anything -- just to see if there is some robustness issue.
Thanks,
y.


$data/train_vocab.txt | sort > $data/oov_train.txt || exit 1;
steps/dict/apply_g2p.sh --var-counts 4 $data/oov_train.txt \
$g2p_mdl_dir exp/g2p/oov_lex_train || exit 1;
cat exp/g2p/oov_lex_train/lexicon.lex | awk '{print $1,$3}' | \
Contributor Author

I changed the awk command back to cut -f1,3 because the columns of lexicon.lex are separated by tabs, and the phones within a pronunciation are separated by spaces:
hello \tab 0.8 \tab h a l o
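For illustration, the tab-aware extraction looks like this (file path taken from the snippet above):

# lexicon.lex columns: word<TAB>probability<TAB>pronunciation (phones space-separated).
# cut splits on tabs, so the multi-phone third field stays intact, whereas
# awk '{print $1,$3}' splits on whitespace and would keep only the first phone.
cut -f 1,3 exp/g2p/oov_lex_train/lexicon.lex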

@xiaohui-zhang
Contributor Author

@jtrmal Thanks. I'll make sure this pipeline works. After you finish the LORELEI dry run, the priority for getting your help might be experiments related to the new g2p toolkit.

@danpovey just addressed all your comments (and tested the whole thing)

@danpovey
Contributor

So you think this is ready to merge?


@xiaohui-zhang
Contributor Author

Yep, as long as you don't have other suggestions.

@vince62s
Contributor

@xiaohui-zhang Do you have results to commit too?

@danpovey
Contributor

I'm asking for some big changes before we commit this, but Samuel, you might as well put the results on this PR.
Basically, after looking at the lexicon-edits it was making, I think the algorithm should be changed to use a different objective function that's more related to likelihood...


@danpovey
Contributor

danpovey commented Dec 1, 2016

Since it looks like the alternative version we're looking into may be becoming a bigger and more long-term project, I'm considering committing this as a stopgap measure. Samuel, are there any small fixes that you need to apply to this PR?

@xiaohui-zhang
Contributor Author

@danpovey good to merge now

@xiaohui-zhang
Contributor Author

Hello @vince62s, sorry for the late reply. We've been working hard on improving WERs and it's hard to get consistent improvements, partly because the pronunciations we improved don't always appear in the dev/test sets. Also, my baseline was out of date, so I didn't bother putting numbers in RESULTS. I'll let you know when we have consistent WER improvements. Thanks!

@danpovey danpovey merged commit 03e6b92 into kaldi-asr:master Dec 1, 2016
@vince62s
Contributor

vince62s commented Dec 2, 2016

@xiaohui-zhang I didn't try the scripts, nor read them carefully, because I'm not sure I fully understand what this does in the case of tedlium. I would appreciate some clarification.

In the case of Tedlium, unless I am wrong, the training data does not have OOVs, since all words from the training data were included in the dictionary. So, if this is the case, am I right in saying that this script step will do nothing?

From another, more general standpoint: if a user does the job on his side (using G2P to generate prons for OOV words in the training data), which I guess is what the tedlium team did when they built their dict, will the script do some kind of redundant thing / detect some lexicon errors that were introduced by G2P in the first place?

Thanks for your help.

@xiaohui-zhang
Contributor Author

Hi @vince62s, basically there are OOVs because the vocab of data/local/dict/lexicon.txt was truncated to 200k. This recipe is not about showing how to deal with OOVs with lexicon learning. What this recipe does is debug the given reference lexicon (data/local/dict/lexicon.txt) so that the pronunciations agree more with the acoustics. The output lexicon will be put in data/local/dict_learned_nosp. Note that its vocab is the same as the reference lexicon's, because "target-vocab" in learn_lexicon.sh was set to the ref-vocab (the vocab of the reference lexicon). The recipe also outputs a lexicon-edit file, which records the changes made to the reference lexicon.
