
Added multithreaded version of chain-generic-numerator to remove CPU bottleneck #3766

Merged · 9 commits · Jan 7, 2020

Conversation

akshaysubr (Contributor)

This PR adds multi-threading, using C++11 threads, to the numerator graph's forward-backward algorithm, removing a serial CPU bottleneck when training e2e chain models.

It adds a command-line option --multithreaded-numerator to toggle between single- and multi-threaded execution. The new implementation was tested against the old one to verify correctness. The overall speedup is 5-10x for the forward-backward algorithm alone and 1.3-1.5x for a full training loop.

This PR also adds the ability to turn on mixed-precision compute using tensor cores, plus some NVTX markers in the code that are activated by the compile-time flag -DUSE_NVTX.

workers[thread].join();
// Reduce thread values to a single value
partial_loglike += partial_loglike_mt[thread];
ok = ok && ok_mt[thread];
Contributor

Does this actually create threads here, or does it use some kind of thread pool? Creating threads seems kind of heavyweight.

Contributor Author

Yes, threads are created within ForwardBackward(...). The number of threads is set to the available hardware concurrency, and the total number of sequences is split across those threads. There is some overhead to creating threads this way, but since the GenericNumeratorComputation object is created once per iteration and ForwardBackward is called only once during its lifetime, this design seemed reasonable.
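
A minimal sketch of that pattern, assuming C++11 threads. Apart from the partial_loglike_mt/ok_mt names, which appear in the diff excerpt above, everything here is illustrative rather than the PR's actual code:

#include <algorithm>
#include <thread>
#include <vector>

double ForwardBackwardSketch(int num_sequences) {
  int num_threads =
      static_cast<int>(std::max(1u, std::thread::hardware_concurrency()));
  std::vector<std::thread> workers;
  std::vector<double> partial_loglike_mt(num_threads, 0.0);
  std::vector<char> ok_mt(num_threads, 0);  // not vector<bool>: each thread writes its own element

  for (int t = 0; t < num_threads; ++t) {
    // Each worker handles a contiguous slice of the sequences.
    int begin = num_sequences * t / num_threads;
    int end = num_sequences * (t + 1) / num_threads;
    workers.emplace_back([=, &partial_loglike_mt, &ok_mt]() {
      double sum = 0.0;
      for (int s = begin; s < end; ++s) {
        sum += 0.0;  // placeholder: per-sequence forward-backward goes here
      }
      partial_loglike_mt[t] = sum;
      ok_mt[t] = 1;
    });
  }

  // Join and reduce thread values to a single value, as in the excerpt above.
  double partial_loglike = 0.0;
  bool ok = true;
  for (int t = 0; t < num_threads; ++t) {
    workers[t].join();
    partial_loglike += partial_loglike_mt[t];
    ok = ok && (ok_mt[t] != 0);
  }
  return ok ? partial_loglike : 0.0;
}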

Contributor

I'm a bit worried about setting these things automatically -- hardware concurrency is one thing, and user limits placed on resources (e.g. in SGE/Slurm) are another. This reminds me of the OMP mentality that causes issues in many similar environments.
So I think it would be a nice thing to have, but I don't think it should be the default.

Contributor Author

So, make this also opt-in? Or set a better default value?

@@ -53,6 +52,9 @@ int main(int argc, char *argv[]) {
"yes|no|optional|wait, only has effect if compiled with CUDA");

opts.Register(&po);
#if HAVE_CUDA==1
CuDevice::RegisterDeviceOptions(&po);
Contributor

If your changes require configuration to be effective, I think I would rather you set the defaults to be fast.

Contributor Author

Okay, the multithreaded flag currently defaults to False; I will change that to True. But do you also want the tensor-core flag to default to True, given that it has an impact on numerical roundoff?
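
For context, a hedged sketch (a fragment, like the hunk above) of how such flags are typically registered with Kaldi's ParseOptions; the option names and defaults below are illustrative, not the PR's exact code:

int32 numerator_threads = 1;    // illustrative default, per the discussion
bool use_tensor_cores = false;  // illustrative: opt-in, per the discussion below
po.Register("numerator-threads", &numerator_threads,
            "Number of threads for the numerator forward-backward "
            "(0 means use full hardware concurrency).");
po.Register("use-tensor-cores", &use_tensor_cores,
            "If true, allow cuBLAS to use tensor cores "
            "(half-precision math; may affect roundoff).");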

Contributor

see my comment above

@danpovey (Contributor)

Hm, it might need some debugging if we set the tensor-core thing to true by default. I'm thinking the training might be affected, or might be more likely to fail various assertions. Tricky decision. Does it cause a crash on older hardware if you set that flag to true? @luitjens or @hugovbraun, do you have any opinions on this?

@hugovbraun (Contributor)

It won't crash on older hardware (the tensor-core flag will be ignored by cuBLAS). However, since it can possibly damage accuracy (the math is done in half precision), I think an opt-in would be better.
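
A minimal sketch of what such an opt-in looks like at the cuBLAS level, assuming the CUDA 9/10-era math-mode API; the function name here is hypothetical, not from the PR:

#include <cublas_v2.h>

// Leave full-precision math as the default; enable tensor cores only on request.
// On hardware without tensor cores, CUBLAS_TENSOR_OP_MATH is silently ignored.
void SetTensorCoreMathMode(cublasHandle_t handle, bool use_tensor_cores) {
  cublasSetMathMode(handle, use_tensor_cores ? CUBLAS_TENSOR_OP_MATH
                                             : CUBLAS_DEFAULT_MATH);
}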

@luitjens (Contributor)

I agree with Hugo that we should make tensor cores opt-in.

@jtrmal (Contributor) commented Dec 12, 2019 via email

@akshaysubr (Contributor Author)

Changed the command-line argument from a multithreading flag to a thread count. Setting the thread count to 0 uses full hardware concurrency; the default is 1 thread.

@danpovey (Contributor)

I'm sorry, I seem to have neglected this PR. (Reminder: please ping me if I drop the ball.)
I am thinking you could make the thread count default to the minimum of 4 and the full hardware concurrency, if that's possible? The reason I think it's OK to default to more than 1 is that the numerator computation doesn't dominate anyway.

@akshaysubr (Contributor Author)

@danpovey I think that is reasonable. I've made the change.
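
A hedged sketch of that policy; the function name and the negative-value sentinel for "unset" are illustrative, not the PR's actual code. 0 requests full hardware concurrency, an unset value falls back to min(4, hardware concurrency), and a zero return from hardware_concurrency() is handled, since the standard allows it:

#include <algorithm>
#include <thread>

int ResolveNumeratorThreads(int requested) {
  unsigned hw = std::thread::hardware_concurrency();
  if (hw == 0) hw = 1;  // hardware_concurrency() may return 0 when unknown
  if (requested > 0) return requested;               // explicit user choice
  if (requested == 0) return static_cast<int>(hw);   // 0 -> full concurrency
  return static_cast<int>(std::min(4u, hw));         // unset -> min(4, hw)
}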

@@ -231,6 +234,7 @@ void DenominatorComputation::Beta(int32 t) {
}

BaseFloat DenominatorComputation::Forward() {
NVTX_RANGE(__func__);
Contributor

is this debug code?

Contributor Author

No, this is instrumentation for the profiler. The NVTX_RANGE macro is defined when compiling with -DUSE_NVTX; otherwise it expands to nothing.
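
A sketch of how such a macro is typically defined (not necessarily Kaldi's exact definition): with -DUSE_NVTX it opens an NVTX range for the enclosing scope via a small RAII helper, so NVTX_RANGE(__func__) in the diff above marks the whole function body in the profiler timeline; without the flag it compiles away:

#ifdef USE_NVTX
#include <nvToolsExt.h>
class NvtxRange {
 public:
  explicit NvtxRange(const char *name) { nvtxRangePushA(name); }  // open range
  ~NvtxRange() { nvtxRangePop(); }  // close range on scope exit
};
#define NVTX_RANGE(name) NvtxRange nvtx_range_guard(name)
#else
#define NVTX_RANGE(name)
#endif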

@danpovey (Contributor) commented Jan 7, 2020

Great, thanks! Merging.

@danpovey merged commit 8e2bbd2 into kaldi-asr:master on Jan 7, 2020