Added multithreaded version of chain-generic-numerator to remove CPU bottleneck #3766
Conversation
    workers[thread].join();
    // Reduce thread values to a single value
    partial_loglike += partial_loglike_mt[thread];
    ok = ok && ok_mt[thread];
Does this actually create threads here, or does it use some kind of thread pool? Creating threads seems like it might be kind of heavyweight.
Yes, threads are created within ForwardBackward(...). The number of threads is set to the available hardware concurrency, and the total number of sequences is split across these threads. There is an overhead to creating threads this way, but since the GenericNumeratorComputation object is created once per iteration and ForwardBackward is called only once during its lifetime, this design seemed reasonable.
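(For illustration, a minimal sketch of this launch-and-join pattern, splitting sequences across hardware_concurrency() threads; the function and variable names here are illustrative, not the exact ones in chain-generic-numerator.cc.)

#include <algorithm>
#include <thread>
#include <vector>

// Sketch: each worker processes a contiguous block of sequences; all workers
// are joined before any per-thread results are reduced.
void ProcessAllSequences(int num_sequences) {
  int nthreads = std::max(1u, std::thread::hardware_concurrency());
  int per_thread = (num_sequences + nthreads - 1) / nthreads;
  std::vector<std::thread> workers;
  for (int t = 0; t < nthreads; ++t) {
    int begin = t * per_thread,
        end = std::min(num_sequences, begin + per_thread);
    workers.emplace_back([begin, end]() {
      for (int seq = begin; seq < end; ++seq) {
        // ... run forward-backward for this sequence ...
      }
    });
  }
  for (auto &w : workers) w.join();  // join before reducing per-thread values
}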
I'm a bit worried about setting these things automatically -- hardware concurrency is one thing, and user limits placed on resources (e.g. in SGE/Slurm) are another. This reminds me of the OMP mentality that causes issues in many similar environments.
So I think it would be a nice thing to have, but I don't think it should be the default.
So, make this also opt-in? Or set a better default value?
@@ -53,6 +52,9 @@ int main(int argc, char *argv[]) {
                "yes|no|optional|wait, only has effect if compiled with CUDA");

    opts.Register(&po);
#if HAVE_CUDA==1
    CuDevice::RegisterDeviceOptions(&po);
If your changes require configuration to be effective, I think I would rather you set the defaults to be fast.
Okay, the multithreaded flag is currently set to False; I will change that to True. But do you also want the tensor core flag to default to True, given that it has an impact on numerical roundoff?
see my comment above
Hm, it might need some debugging if we set the tensor-core thing to true by default.
It won't crash on older hardware (the tensor core flag will be ignored by cuBLAS). However, since it can possibly damage accuracy (the math is done in half precision), I think an opt-in would be better.
I agree with Hugo that we should opt in to tensor cores.
Yes, that would be my preference (make the number of threads user-configurable).
In src/chain/chain-generic-numerator.cc (#3766 (comment)):
      if (GetVerboseLevel() >= 1)
        ok_mt[thread] = ok_mt[thread] && CheckValues(seq, probs, alpha[thread], beta[thread], derivs);
    }
    return;
  };

  std::vector<std::thread> workers(nthreads);
  for (int thread = 0; thread < nthreads; ++thread)
    // Launch all threads
    workers[thread] = std::thread(thread_lambda, thread, num_sequences, num_sequences_per_thread);
  for (int thread = 0; thread < nthreads; ++thread) {
    // Join threads back in
    workers[thread].join();
    // Reduce thread values to a single value
    partial_loglike += partial_loglike_mt[thread];
    ok = ok && ok_mt[thread];
Changed the command line argument from a multithreading flag to a thread count. Setting the thread count to 0 uses full hardware concurrency, and the default is to use 1 thread.
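(A rough sketch of how such a thread-count option can be interpreted; the function and parameter names below are illustrative, not necessarily the actual ones used in the PR.)

#include <thread>

// Sketch: 0 means "use all available hardware concurrency"; otherwise use the
// requested count. A default of 1 keeps the old single-threaded behaviour.
int ResolveNumThreads(int requested /* defaults to 1 */) {
  if (requested == 0) {
    unsigned hw = std::thread::hardware_concurrency();
    return hw > 0 ? static_cast<int>(hw) : 1;  // hardware_concurrency() may return 0
  }
  return requested;
}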
I'm sorry, I seem to have neglected this PR. (Reminder: please ping me if I drop the ball.)
@danpovey I think that is reasonable. I've made the change.
@@ -231,6 +234,7 @@ void DenominatorComputation::Beta(int32 t) {
}

BaseFloat DenominatorComputation::Forward() {
  NVTX_RANGE(__func__);
is this debug code?
No, this is instrumentation for the profiler. The NVTX_RANGE macro is defined when compiled with -DUSE_NVTX; otherwise it expands to nothing.
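(For context, a minimal sketch of how such a compile-time NVTX range macro is typically implemented; the class name and exact header below are illustrative, and the actual macro in the PR may differ.)

#ifdef USE_NVTX
#include <nvToolsExt.h>  // NVIDIA Tools Extension API

// RAII helper: pushes a named range on construction and pops it on
// destruction, so the range covers the enclosing scope in the profiler timeline.
class NvtxRange {
 public:
  explicit NvtxRange(const char *name) { nvtxRangePushA(name); }
  ~NvtxRange() { nvtxRangePop(); }
};
#define NVTX_RANGE(name) NvtxRange nvtx_scoped_range_(name)
#else
#define NVTX_RANGE(name)  // compiles to nothing without -DUSE_NVTX
#endif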
Great, thanks! Merging.
* [src] fix fstmakecontextfst duplicated disambig symbols (#3811)
* [src] Fix wrong error message format in make_lexicon_fst.py
* [egs] Update librispeech TDNN-F recipe (#3813)
* [src] Enable multiple threads for chain-generic-numerator to remove CPU bottleneck (#3766)

Co-authored-by: Tommy Tao <o_otaotao@hotmail.com>
Co-authored-by: o-alexandre-felipe <o.alexandre.felipe@gmail.com>
Co-authored-by: Akshay Subramaniam <6964110+akshaysubr@users.noreply.github.com>
Multi-threading using C++11 threads to optimize the numerator graph's forward-backward algorithm and remove a serial CPU bottleneck while training e2e chain models. Added a command line option --multithreaded-numerator to toggle between single and multi-threaded execution. The new implementation was tested and compared against the old implementation to ensure correctness. Overall speedup is 5-10x for just the forward-backward algorithm and 1.3-1.5x for a full training loop. This PR also adds the ability to turn on mixed precision compute using tensor cores, and adds some NVTX markers in the code that are activated by a compile-time flag -DUSE_NVTX.
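(For reference, opting in to tensor-core math in cuBLAS of that era was typically done by switching the handle's math mode; a hedged sketch of the generic cuBLAS API follows, not necessarily the exact call site or option name used in this PR.)

#include <cublas_v2.h>

// Sketch: opt in to tensor-core (mixed precision) math on a cuBLAS handle.
// CUBLAS_TENSOR_OP_MATH lets cuBLAS use half-precision tensor-core kernels
// where applicable, which is faster but changes numerical roundoff.
void EnableTensorCores(cublasHandle_t handle, bool use_tensor_cores) {
  cublasSetMathMode(handle,
                    use_tensor_cores ? CUBLAS_TENSOR_OP_MATH
                                     : CUBLAS_DEFAULT_MATH);
}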