[src,egs,scripts] Merging RNNLM-related changes which were in wrong branch #2092

Merged Dec 21, 2017 (123 commits)

Changes from 1 commit

Commits (123)
b51a4fc
Adding some sampling code
danpovey Jun 17, 2017
aeffb8b
[src] Adding some RNNLM-related sampling utilities.
danpovey Jun 17, 2017
0bf3bc6
[build] Modify Makefiles w.r.t. RNNLM stuff
danpovey Jun 17, 2017
87bc825
[src] Fix sampler-test.cc (fixing test failure)
danpovey Jun 17, 2017
d3d1351
[build] link cusparse lib and handle it with CuDevice (#1699)
kangshiyin Jun 20, 2017
9e1f742
[src] Add arpa-reading code for RNNLM (#1701)
keli78 Jun 26, 2017
de4faf9
[scripts] RNNLM data-preparation (#1707) (#1717)
wantee Jul 3, 2017
2210e63
make </s> case-sensitive in rnnlm (#1738)
wantee Jul 5, 2017
59521f4
[rnnlm,scripts] add a --unigram-scale option to rnnlm/choose_features…
keli78 Jul 5, 2017
3c07383
[src] Some drafts of RNNLM-related code.
danpovey Jul 3, 2017
4fe8118
Fix small formatting issues.
danpovey Jul 3, 2017
63c50a7
[src] Adding and refactoring RNNLM related code
danpovey Jul 7, 2017
dc3449c
[scripts] add some documentations for rnnlm scripts (#1743)
wantee Jul 7, 2017
570d97f
add test code for ArpaSampling; fix errors in arpa-sampling.cc and rn…
Jul 10, 2017
614dd45
Merge pull request #1753 from keli78/arpa-testing
danpovey Jul 10, 2017
06eee20
[src] remove unused declarations
danpovey Jul 15, 2017
96daad1
[egs] add RNNLM data preparation script for PTB (#1771)
keli78 Jul 20, 2017
f4b9d93
[src] some partial work towards RNNLM training.
danpovey Jul 23, 2017
e6920de
Merge remote-tracking branch 'upstream/master' into rnnlm
danpovey Jul 23, 2017
f2f43ba
[src] Fixing various compilation errors, finishing more RNNLM trainin…
danpovey Jul 23, 2017
ee461f7
Merge branch 'rnnlm' of https://github.com/kaldi-asr/kaldi into rnnlm
danpovey Jul 23, 2017
fdf87a0
[src] add default args for arpa sampling test (#1768)
keli78 Jul 23, 2017
9b94c4d
[src] Adding more declarations of needed sparse-matrix functions
danpovey Jul 24, 2017
5be79bb
[src] Add new CuMatrix::AddToRows() overload; test; minor fixes (#1775)
hhadian Jul 25, 2017
b66e23a
[src] Some fixes re CuArray + Add CuMatrixBase::AddToElements + test …
hhadian Jul 26, 2017
0b7b691
[src] change to table-reading code to make Value() non-const.
danpovey Jul 25, 2017
ab68664
[src] Further progress on RNNLM
danpovey Jul 26, 2017
e2434c2
[src] Further progress on RNNLM code
danpovey Jul 27, 2017
15553de
[src] Add rnnlm::ReadSparseWordFeatures (#1778)
hhadian Jul 27, 2017
4d171cb
[src] Add CuVectorBase::CopyElements() and VecMatVec() + tests (#1780)
hhadian Jul 29, 2017
4b1bbef
[src] Further progress on RNNLM code
danpovey Jul 29, 2017
75f3972
[src] Minor fix to test code (#1781)
hhadian Jul 30, 2017
e89d031
[src] Change CuSparseMatrix to use CSR storage format; implement more…
kangshiyin Jul 30, 2017
a65fa45
Merge remote-tracking branch 'upstream/master' into rnnlm
danpovey Aug 1, 2017
5bf3b3e
[src] Implement CuMatrixBase::AddSmat() (#1782)
kangshiyin Aug 1, 2017
f1b0298
[src] Further progress on rnnlm
danpovey Aug 1, 2017
72a8dc2
[src] Implementing some missing functions for RNNLM training
danpovey Aug 1, 2017
6d9af27
[src] Add CuMatrixBase::AddMatSmat() and unit test (#1789)
kangshiyin Aug 2, 2017
e30ea06
[src] Further progress on RNNLM code
danpovey Aug 3, 2017
bd31845
[src] clarify documentation
danpovey Aug 3, 2017
c2dd613
[src] Add CuMatrix::AddSmatMat() and unit test (#1791)
kangshiyin Aug 3, 2017
6f91119
[src] Add RnnlmEmbeddingTrainer::PrintStats() (#1792)
hhadian Aug 3, 2017
62c9273
[src] Various fixes and more progress for RNNLM
danpovey Aug 5, 2017
79a7701
[src] Fix compilation errors in test code
danpovey Aug 5, 2017
c03503b
[src] Add RnnlmExample::Read,Write + some functions in rnnlm-test-uti…
hhadian Aug 5, 2017
0cbec10
[src] Add more testing code for RNNLM
danpovey Aug 5, 2017
386eb7f
[src] Use AddSmat in GeneralMatrix (#1798)
kangshiyin Aug 6, 2017
e3fbfa0
[src] RNNLM-related script changes; code fixes
danpovey Aug 7, 2017
cd821f9
[src] fix to compile error
danpovey Aug 7, 2017
d93d0c1
[src] CUDA kernel for ApplyExpSpecial (#1801)
kangshiyin Aug 7, 2017
d661a1e
[src] Add a simple implementation for EstimateAndWriteLanguageModel (…
hhadian Aug 7, 2017
cd77241
[src,scripts,egs] Further RNNLM progress
danpovey Aug 8, 2017
ec578d3
[src] Various code fixes
danpovey Aug 8, 2017
8099228
[src] Various fixes to problems encountered while debugging RNNLM code
danpovey Aug 10, 2017
560b3db
[src] fix options-related bug in rnnlm-train.cc
danpovey Aug 10, 2017
7e67d70
[src] Various RNNLM-related fixes; add mutex for memory management code.
danpovey Aug 11, 2017
dd5125d
Merge remote-tracking branch 'upstream/master' into rnnlm
danpovey Aug 11, 2017
57b6de6
[scripts] Add 'constant' feature for RNNLM word representation, to ha…
danpovey Aug 11, 2017
079955f
[scripts] updates to RNNLM feature validation
danpovey Aug 12, 2017
27851a2
[src] various RNNLM-related fixes and optimizations.
danpovey Aug 12, 2017
d56c28b
[src] simple interpolated Kneser-Ney LM for testing purposes (#1812)
wantee Aug 12, 2017
1a4ca00
[scripts] add scales for rnnlm features (#1816)
wantee Aug 14, 2017
42c41d1
[scripts] add show_word_features.py (#1815)
wantee Aug 14, 2017
4a0c54b
[scripts,egs] remove the setting of PYTHONIOENCODING in prepare_rnnlm…
wantee Aug 14, 2017
377b786
[src] Bug-fix and test-code changes in cudamatrix
danpovey Aug 14, 2017
99113da
[src] Bug-fix and improvements to stability for RNNLM code
danpovey Aug 14, 2017
b65c12c
[scripts,egs] various bug-fixes.
danpovey Aug 14, 2017
daa12ec
[src] fix to compile error in test code
danpovey Aug 14, 2017
8a1144f
[src,build] fix Makefile; make some sampling code faster.
danpovey Aug 14, 2017
abc1c85
[src,scripts,egs] rnnlm: initialize_matrix, translate python to perl …
sas91 Aug 16, 2017
3c618c5
[src] Add inbuilt tool to estimate LM optimized for RNNLM importance …
danpovey Aug 18, 2017
0ce198d
[src,scripts,egs,build] Enable RNNLM lattice rescoring with Tensorflo…
hainan-xv Aug 11, 2017
cc0b0c4
[scripts] Documentation fix in xconfig scripts
danpovey Aug 12, 2017
8541a21
[scripts] Fix to script usage message (thanks: @yzmyyff)
danpovey Aug 14, 2017
63750ba
[build] fix compilation problem of tfrnnlm and tfrnnlmbin (#1822)
hainan-xv Aug 15, 2017
c927fc7
[scripts,src] Check that symbol '#0' is not in the vocab of the ARPA …
xiaohui-zhang Aug 15, 2017
5f9e4d9
[src] Inconsequential bug-fixes to problems found when compiling with…
danpovey Aug 15, 2017
3bd7fea
[src] Bug-fixes to backoff model for sampling
danpovey Aug 19, 2017
f6f16ab
[src] enable multi-threading for sampling for RNNLM training
danpovey Aug 20, 2017
4286e2c
[src,egs] Enable bypass of ARPA format for RNNLM sampling-language-mo…
danpovey Aug 20, 2017
bc387e6
[src] Optimizations to sparse-matrix functions: AddMatSmat, AddSmat …
sas91 Aug 23, 2017
9fc22b1
[src,scripts] adding more scripts and binaries
danpovey Aug 24, 2017
5fd3c46
[src,egs] various fixes
danpovey Aug 24, 2017
f0476c6
[scripts] modify prepare_split_data.py to include dev data
danpovey Aug 24, 2017
4437008
[src,scripts,egs] Finishing scripts and fixing bugs in RNNLM setup
danpovey Aug 26, 2017
0e1105f
[egs] Update RNNLM results
danpovey Aug 26, 2017
a4cbaa0
[scripts,egs] Changing how unigram feature is printed and how max-fea…
danpovey Aug 27, 2017
5a0c848
Merge remote-tracking branch 'upstream/master' into rnnlm
danpovey Aug 28, 2017
ce5f369
[scripts,egs] Add example for RNNLM training for WSJ (no results yet)…
danpovey Aug 29, 2017
8c67fb9
[src,scripts] Solve speed issue with RNNLM sampling.
danpovey Aug 29, 2017
e095047
[scripts] Fixes to RNNLM training script
danpovey Aug 29, 2017
1f7872c
[egs] Add objf results for WSJ training
danpovey Aug 29, 2017
2689112
[src] Changes to Classify{R,W}filename to allow some spaces. thanks:…
danpovey Aug 31, 2017
143d256
[src] Add function to get max memory of nnet3 computation; cosmetic s…
danpovey Sep 10, 2017
041fd87
Merge remote-tracking branch 'upstream/master' into rnnlm
danpovey Sep 12, 2017
1b0a32f
[scripts] add special_symbol_opts (allows RNNLM setup to use differen…
wantee Sep 17, 2017
1841b3c
[scripts] Fix get_embedding_dim.py RE left-context and right-context …
danpovey Sep 22, 2017
d8284dc
[src,scripts,egs] Fast lattice rescoring based on pruned composition …
danpovey Sep 23, 2017
60db284
Merge remote-tracking branch 'upstream/master' into rnnlm
danpovey Sep 23, 2017
817272e
[src] Use correct error messages for errors from CuSparse (#1908)
selaselah Sep 25, 2017
0ffc8b5
[egs] Add example script for RNNLM training on Swbd (#1907)
keli78 Sep 26, 2017
0c78131
[src] Make copy constructor of NnetComputer explicit
danpovey Sep 27, 2017
d0f36ca
[src] fix bug in pruned composition (thanks: @hainan-xv)
danpovey Sep 28, 2017
27aa514
[scripts] Make sure all rnnlm scripts use encoding=utf-8 with open
danpovey Sep 29, 2017
402f531
Merge remote-tracking branch 'upstream/master' into rnnlm
danpovey Oct 14, 2017
693840b
[src] Add scales and constant values to Descriptors in nnet3 (#1884)
danpovey Oct 17, 2017
e87e303
[scripts] Fix a learning rate decay bug in rnnlm setup (#1944)
keli78 Oct 17, 2017
b51874c
[src] Removed unnecessary kLinearInParameters and kLinearInInput flag…
mmaciej2 Oct 23, 2017
764483b
Merge remote-tracking branch 'upstream/master' into rnnlm
danpovey Oct 23, 2017
629b885
Merge remote-tracking branch 'upstream/master' into rnnlm
danpovey Oct 25, 2017
b858223
[src,scripts,egs] Add l2 regularization for RNNLMs; fixes RE test-mod…
keli78 Nov 6, 2017
6127c32
[egs] Update RNNLM recipes with L2 regularization (#2005)
keli78 Nov 9, 2017
3348b3e
[src] Workaround for compiler issue (thanks: @francoishernandez)
danpovey Nov 20, 2017
652050a
[src,scripts,egs] nnet3-rnnlm lattice rescoring draft (#1906)
hainan-xv Nov 23, 2017
8537641
Merge remote-tracking branch 'upstream/master' into rnnlm
danpovey Nov 23, 2017
131cdd4
[build] Update version file, first commit of kaldi 5.3.
danpovey Nov 23, 2017
edec255
[build] Update src/doc/get_version_info.sh (for building documentation)
danpovey Nov 23, 2017
347d181
[doc] Update version-related documentation.
danpovey Nov 23, 2017
dc7bed5
[src] Some fixes to testing code
danpovey Nov 23, 2017
4ac4051
[egs] Minor change to comment
danpovey Nov 23, 2017
f2d2305
[src] Fix minor bug in rnnlm-compute-state.cc RE dimension checking (…
hainan-xv Nov 29, 2017
4a24b4b
[src] Add (faster) pruned composition for RNNLM rescoring (#2059)
hainan-xv Dec 13, 2017
a4aa18e
merge rnnlm with latest master
hainan-xv Dec 20, 2017
[src,build] fix Makefile; make some sampling code faster.
danpovey committed Aug 14, 2017
commit 8a1144f17514691290622b8d3b6329ca87d3a980
71 changes: 71 additions & 0 deletions egs/ptb/s5/local/rnnlm/train_rnnlm_sampling.sh
@@ -0,0 +1,71 @@
#!/usr/bin/bash


# this will eventually be totally refactored and moved into steps/.

dir=exp/rnnlm_data_prep
vocab=data/vocab/words.txt
embedding_dim=600

# work out the number of splits.
ns=$(rnnlm/get_num_splits.sh 200000 data/text $dir/data_weights.txt)
vocab_size=$(tail -n 1 $vocab |awk '{print $NF + 1}')

# split the data into pieces that individual jobs will train on.
# rnnlm/split_data.sh data/text $ns


rnnlm/prepare_split_data.py --vocab-file=$vocab --data-weights-file=$dir/data_weights.txt \
--num-splits=$ns data/text $dir/text

. ./path.sh

# cat >$dir/config <<EOF
# input-node name=input dim=$embedding_dim
# component name=affine1 type=NaturalGradientAffineComponent input-dim=$embedding_dim output-dim=$embedding_dim
# component-node input=input name=affine1 component=affine1
# output-node input=affine1 name=output
# EOF

mkdir -p $dir/configs
cat >$dir/configs/network.xconfig <<EOF
input dim=$embedding_dim name=input
relu-renorm-layer name=tdnn1 dim=512 input=Append(0, IfDefined(-1))
relu-renorm-layer name=tdnn2 dim=512 input=Append(0, IfDefined(-2))
relu-renorm-layer name=tdnn3 dim=512 input=Append(0, IfDefined(-2))
output-layer name=output include-log-softmax=false dim=$embedding_dim
EOF

steps/nnet3/xconfig_to_configs.py --xconfig-file $dir/configs/network.xconfig --config-dir $dir/configs/


# note: this is way too slow; we need to speed it up somehow.
# I'm not sure if I want to have a dependency on numpy just for this, though.
# Maybe we can rewrite it in perl.
rnnlm/initialize_matrix.py --num-rows=$vocab_size --num-cols=$embedding_dim \
--first-column=1.0 > $dir/embedding.0.mat

nnet3-init $dir/configs/final.config - | nnet3-copy --learning-rate=0.0001 - $dir/0.rnnlm


rnnlm-train --use-gpu=no --read-rnnlm=$dir/0.rnnlm --write-rnnlm=$dir/1.rnnlm --read-embedding=$dir/embedding.0.mat \
--write-embedding=$dir/embedding.1.mat "ark:rnnlm-get-egs --vocab-size=$vocab_size $dir/text/1.txt ark,t:- |"

# or with GPU:
rnnlm-train --rnnlm.max-param-change=0.5 --embedding.max-param-change=0.5 \
--use-gpu=yes --read-rnnlm=$dir/0.rnnlm --write-rnnlm=$dir/1.rnnlm --read-embedding=$dir/embedding.0.mat \
--write-embedding=$dir/embedding.1.mat 'ark:for n in 1 2 3 4 5 6; do cat exp/rnnlm_data_prep/text/*.txt; done | rnnlm-get-egs --vocab-size=10003 - ark,t:- |'


# just a note on the unigram entropy of PTB training set:
# awk '{for (n=1;n<=NF;n++) { count[$n]++; } count["</s>"]++; } END{ tot_count=0; tot_entropy=0.0; for(k in count) tot_count += count[k]; for (k in count) { p = count[k]*1.0/tot_count; tot_entropy += p*log(p); } print "entropy is " -tot_entropy; }' <data/text/ptb.txt
# 6.52933

# .. and entropy of bigrams:
# awk '{hist="<s>"; for (n=1;n<=NF;n++) { count[hist,$n]++; hist=$n; } count[hist,"</s>"]++; } END{ tot_count=0; tot_entropy=0.0; for(k in count) tot_count += count[k]; for (k in count) { p = count[k]*1.0/tot_count; tot_entropy += p*log(p); } print "entropy is " -tot_entropy; }' <data/text/ptb.txt
# 10.7482
# In information-theory terms, H(X) = H(Y) = 6.5293 and H(X,Y) = 10.7482, so H(Y | X) = 10.7482 - 6.5293 = ***4.2189***,
# which is the entropy of the next symbol given the preceding symbol. This bounds the expected training
# objective achievable with just a single word of context.
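A quick cross-check of the arithmetic in the comment above: awk's log() is the natural logarithm, so these entropies are in nats, and the conditional entropy follows from the chain rule,

    H(Y | X) = H(X, Y) - H(X) = 10.7482 - 6.5293 ≈ 4.219 nats per word,

which is the quantity the comment quotes as the bound on the training objective.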
6 changes: 3 additions & 3 deletions src/Makefile
@@ -150,9 +150,9 @@ $(EXT_SUBDIRS) : mklibdir ext_depend
# this is necessary for correct parallel compilation
#1)The tools depend on all the libraries

bin fstbin gmmbin fgmmbin sgmm2bin featbin nnetbin nnet2bin nnet3bin chainbin latbin ivectorbin lmbin kwsbin online2bin: \
bin fstbin gmmbin fgmmbin sgmm2bin featbin nnetbin nnet2bin nnet3bin chainbin latbin ivectorbin lmbin kwsbin online2bin rnnlmbin: \
base matrix util feat tree gmm transform sgmm2 fstext hmm \
lm decoder lat cudamatrix nnet nnet2 nnet3 ivector chain kws online2
lm decoder lat cudamatrix nnet nnet2 nnet3 ivector chain kws online2 rnnlm

#2)The libraries have inter-dependencies
base: base/.depend.mk
@@ -172,7 +172,7 @@ cudamatrix: base util matrix
nnet: base util hmm tree matrix cudamatrix
nnet2: base util matrix lat gmm hmm tree transform cudamatrix
nnet3: base util matrix lat gmm hmm tree transform cudamatrix chain fstext
rnnlm: base util matrix cudamatrix nnet3
rnnlm: base util matrix cudamatrix nnet3 lm hmm
chain: lat hmm tree fstext matrix cudamatrix util base
ivector: base util matrix transform tree gmm
#3)Dependencies for optional parts of Kaldi
41 changes: 37 additions & 4 deletions src/rnnlm/sampler.cc
@@ -300,6 +300,27 @@ void Sampler::SampleWords(
SampleFromIntervals(intervals, sample);
}



// This hacked version of std::priority_queue allows us to extract all elements
// of the priority queue into a supplied vector in an efficient way.  It relies
// on the fact that std::priority_queue stores the underlying container as a
// protected member 'c'.  The only way to do this using the public interface
// of std::priority_queue is to repeatedly pop() elements from the queue, but
// that is too slow, and it actually had an impact on the speed of the
// application.
template <typename T>
class hacked_priority_queue: public std::priority_queue<T> {
public:
void append_all_elements(std::vector<T> *output) const {
output->insert(output->end(), this->c.begin(), this->c.end());
}
// we have to redeclare the constructor.
template <typename InputIter> hacked_priority_queue(
InputIter begin, const InputIter end): std::priority_queue<T>(begin, end) { }
};


// static
void Sampler::NormalizeIntervals(int32 num_words_to_sample,
double total_p,
@@ -324,7 +345,7 @@ void Sampler::NormalizeIntervals(int32 num_words_to_sample,
// current_alpha = (num_words_to_sample - num_ones) / total_remaining_p.
// As we update 'num_ones' and 'total_remaining_p', we will continue
// to update current_alpha, and it will keep getting larger.
std::priority_queue<Interval> queue(intervals->begin(), intervals->end());
hacked_priority_queue<Interval> queue(intervals->begin(), intervals->end());

// clear 'intervals'; we'll use the space to store the intervals that will
// have a prob of exactly 1.0, and eventually we'll add the rest.
@@ -376,15 +397,27 @@ }
}
}
}
// it's not that efficient to use the top() function of the queue to remove
// elements, but there doesn't seem to be an efficient way to get
// all the elements at once without nasty hacks. Hopefully this won't dominate.
#if 0
// The following code is a bit slow but has the advantage of not assuming
// anything about the internals of class std::priority_queue.
while (!queue.empty()) {
Interval top = queue.top();
top.prob *= current_alpha;
queue.pop();
intervals->push_back(top);
}
#else
{ // This code is faster but relies on the fact that priority_queue
// has a protected member 'c' which is the underlying container.
size_t cur_size = intervals->size();
queue.append_all_elements(intervals);
// the next loop scales the 'prob' members of the elements we just
// added to 'intervals', by current_alpha.
std::vector<Interval>::iterator iter = intervals->begin() + cur_size,
end = intervals->end();
for (; iter != end; ++iter) iter->prob *= current_alpha;
}
#endif

if (GetVerboseLevel() >= 2) {
double tot_prob = 0.0;
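For readers skimming the diff above, here is a minimal, self-contained sketch (not part of the commit) of the trick that NormalizeIntervals() now uses: draining a std::priority_queue in one pass through its protected member 'c' instead of via repeated pop() calls. It uses plain ints rather than the Interval struct from sampler.h, and all names in it are illustrative only.

#include <cassert>
#include <iostream>
#include <queue>
#include <vector>

// Exposes the protected underlying container 'c' of std::priority_queue so that
// the remaining elements can be copied out in one O(n) pass.  Note that the
// copied-out elements come in heap order, not sorted order.
template <typename T>
class hacked_priority_queue : public std::priority_queue<T> {
 public:
  void append_all_elements(std::vector<T> *output) const {
    output->insert(output->end(), this->c.begin(), this->c.end());
  }
  // Re-declare the range constructor; constructors are not inherited by default.
  template <typename InputIter>
  hacked_priority_queue(InputIter begin, InputIter end)
      : std::priority_queue<T>(begin, end) { }
};

int main() {
  std::vector<int> v = {5, 1, 9, 3, 7};
  hacked_priority_queue<int> queue(v.begin(), v.end());

  std::vector<int> out;
  // Pop the largest elements one at a time, as NormalizeIntervals() does while
  // its scaling condition still holds...
  out.push_back(queue.top());  // 9
  queue.pop();
  out.push_back(queue.top());  // 7
  queue.pop();

  // ...then drain whatever remains in a single bulk copy (heap order, which is
  // fine when every remaining element just gets scaled by the same factor).
  queue.append_all_elements(&out);

  assert(out.size() == v.size());
  for (int x : out) std::cout << x << ' ';
  std::cout << std::endl;
  return 0;
}

The bulk copy is what makes the #else branch above cheaper than the code under #if 0: repeatedly calling pop() costs O(n log n) because the heap is restored after every removal, while appending the container 'c' directly is a single O(n) copy. That is sufficient here, since the remaining elements only need their 'prob' members scaled by current_alpha, not sorted order.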