
Added multithreaded version of chain-generic-numerator to remove CPU bottleneck #3766

Merged · 9 commits · Jan 7, 2020

Conversation

akshaysubr (Contributor)

This PR adds multi-threading, using C++11 threads, to the numerator graph's forward-backward algorithm, removing a serial CPU bottleneck when training e2e chain models.

It adds a command-line option --multithreaded-numerator to toggle between single- and multi-threaded execution. The new implementation was tested against the old one to verify correctness. The overall speedup is 5-10x for the forward-backward algorithm alone and 1.3-1.5x for a full training loop.

This PR also adds the ability to turn on mixed-precision compute using tensor cores, plus some NVTX markers in the code that are activated by the compile-time flag -DUSE_NVTX.

workers[thread].join();
// Reduce thread values to a single value
partial_loglike += partial_loglike_mt[thread];
ok = ok && ok_mt[thread];
Contributor

Does this actually create threads here, or does it use some kind of thread pool? Creating threads seems kind of heavyweight.

Contributor Author

Yes, threads are created within ForwardBackward(...). The number of threads is set to the available hardware concurrency, and the total number of sequences is split across those threads. There is some overhead to creating threads this way, but since the GenericNumeratorComputation object is created once per iteration and ForwardBackward is called only once during its lifetime, this design seemed reasonable.
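
A minimal sketch of that pattern, assuming C++11 threads. Apart from the partial_loglike_mt/ok_mt names, which appear in the diff excerpt above, everything here is illustrative rather than the PR's actual code:

#include <algorithm>
#include <thread>
#include <vector>

double ForwardBackwardSketch(int num_sequences) {
  int num_threads =
      static_cast<int>(std::max(1u, std::thread::hardware_concurrency()));
  std::vector<std::thread> workers;
  std::vector<double> partial_loglike_mt(num_threads, 0.0);
  std::vector<char> ok_mt(num_threads, 0);  // not vector<bool>: each thread writes its own element

  for (int t = 0; t < num_threads; ++t) {
    // Each worker handles a contiguous slice of the sequences.
    int begin = num_sequences * t / num_threads;
    int end = num_sequences * (t + 1) / num_threads;
    workers.emplace_back([=, &partial_loglike_mt, &ok_mt]() {
      double sum = 0.0;
      for (int s = begin; s < end; ++s) {
        sum += 0.0;  // placeholder: per-sequence forward-backward goes here
      }
      partial_loglike_mt[t] = sum;
      ok_mt[t] = 1;
    });
  }

  // Join and reduce thread values to a single value, as in the excerpt above.
  double partial_loglike = 0.0;
  bool ok = true;
  for (int t = 0; t < num_threads; ++t) {
    workers[t].join();
    partial_loglike += partial_loglike_mt[t];
    ok = ok && (ok_mt[t] != 0);
  }
  return ok ? partial_loglike : 0.0;
}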

Contributor

I'm a bit worried about setting these things automatically -- hardware concurrency is one thing, and user limits placed on resources (e.g. in SGE/Slurm) are another. This reminds me of the OMP mentality that causes issues in many similar environments.
So I think it would be a nice thing to have, but I don't think it should be the default.

Contributor Author

So, make this also opt-in? Or set a better default value?

@@ -53,6 +52,9 @@ int main(int argc, char *argv[]) {
"yes|no|optional|wait, only has effect if compiled with CUDA");

opts.Register(&po);
#if HAVE_CUDA==1
CuDevice::RegisterDeviceOptions(&po);
Contributor

If your changes require configuration to be effective, I think I would rather you set the defaults to be fast.

Contributor Author

Okay, the multithreaded flag currently defaults to False; I will change that to True. But do you also want the tensor-core flag to default to True, given that it has an impact on numerical roundoff?
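
For context, a hedged sketch (a fragment, like the hunk above) of how such flags are typically registered with Kaldi's ParseOptions; the option names and defaults below are illustrative, not the PR's exact code:

int32 numerator_threads = 1;    // illustrative default, per the discussion
bool use_tensor_cores = false;  // illustrative: opt-in, per the discussion below
po.Register("numerator-threads", &numerator_threads,
            "Number of threads for the numerator forward-backward "
            "(0 means use full hardware concurrency).");
po.Register("use-tensor-cores", &use_tensor_cores,
            "If true, allow cuBLAS to use tensor cores "
            "(half-precision math; may affect roundoff).");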

Contributor

see my comment above

@danpovey (Contributor)

Hm, it might need some debugging if we set the tensor-core thing to true by default. I'm thinking the training might be affected, or might be more likely to fail various assertions. Tricky decision. Does it cause a crash on older hardware if you set that flag to true? @luitjens or @hugovbraun, do you have any opinions on this?

@hugovbraun (Contributor)

It won't crash on older hardware (the tensor-core flag will be ignored by cuBLAS). However, since it can possibly damage accuracy (the math is done in half precision), I think an opt-in would be better.
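
A minimal sketch of what such an opt-in looks like at the cuBLAS level, assuming the CUDA 9/10-era math-mode API; the function name here is hypothetical, not from the PR:

#include <cublas_v2.h>

// Leave full-precision math as the default; enable tensor cores only on request.
// On hardware without tensor cores, CUBLAS_TENSOR_OP_MATH is silently ignored.
void SetTensorCoreMathMode(cublasHandle_t handle, bool use_tensor_cores) {
  cublasSetMathMode(handle, use_tensor_cores ? CUBLAS_TENSOR_OP_MATH
                                             : CUBLAS_DEFAULT_MATH);
}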

@luitjens (Contributor)

I agree with Hugo that we should make tensor cores opt-in.

@jtrmal (Contributor) commented Dec 12, 2019 via email

@akshaysubr (Contributor Author)

Changed the command-line argument from a multithreading flag to a thread count. Setting the thread count to 0 uses full hardware concurrency; the default is 1 thread.

@danpovey (Contributor)

I'm sorry, I seem to have neglected this PR. (Reminder: please ping me if I drop the ball.)
I am thinking you could make the thread count default to the minimum of 4 and the full hardware concurrency, if that's possible? The reason I think it's OK to default to more than 1 is that the numerator computation doesn't dominate anyway.

@akshaysubr (Contributor Author)

@danpovey I think that is reasonable. I've made the change.
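
A hedged sketch of that policy; the function name and the negative-value sentinel for "unset" are illustrative, not the PR's actual code. 0 requests full hardware concurrency, an unset value falls back to min(4, hardware concurrency), and a zero return from hardware_concurrency() is handled, since the standard allows it:

#include <algorithm>
#include <thread>

int ResolveNumeratorThreads(int requested) {
  unsigned hw = std::thread::hardware_concurrency();
  if (hw == 0) hw = 1;  // hardware_concurrency() may return 0 when unknown
  if (requested > 0) return requested;               // explicit user choice
  if (requested == 0) return static_cast<int>(hw);   // 0 -> full concurrency
  return static_cast<int>(std::min(4u, hw));         // unset -> min(4, hw)
}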

@@ -231,6 +234,7 @@ void DenominatorComputation::Beta(int32 t) {
}

BaseFloat DenominatorComputation::Forward() {
NVTX_RANGE(__func__);
Contributor

is this debug code?

Contributor Author

No, this is instrumentation for the profiler. The NVTX_RANGE macro is defined when compiling with -DUSE_NVTX; otherwise it expands to nothing.
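
A sketch of how such a macro is typically defined (not necessarily Kaldi's exact definition): with -DUSE_NVTX it opens an NVTX range for the enclosing scope via a small RAII helper, so NVTX_RANGE(__func__) in the diff above marks the whole function body in the profiler timeline; without the flag it compiles away:

#ifdef USE_NVTX
#include <nvToolsExt.h>
class NvtxRange {
 public:
  explicit NvtxRange(const char *name) { nvtxRangePushA(name); }  // open range
  ~NvtxRange() { nvtxRangePop(); }  // close range on scope exit
};
#define NVTX_RANGE(name) NvtxRange nvtx_range_guard(name)
#else
#define NVTX_RANGE(name)
#endif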

@danpovey (Contributor) commented Jan 7, 2020

Great, thanks! Merging.

@danpovey merged commit 8e2bbd2 into kaldi-asr:master on Jan 7, 2020