
LT-LM recipe for librispeech #4590

Merged: 10 commits, Jan 21, 2022

Conversation

@medbar (Contributor) commented Jul 13, 2021

Code for LT-LM training on the LibriSpeech corpus.
Paper: "LT-LM: a novel non-autoregressive language model for single-shot lattice rescoring".
The model is trained using the fairseq framework.

@kkm000 self-requested a review July 24, 2021 00:41
@kkm000 self-assigned this Jul 24, 2021
@kkm000 (Contributor) commented Jul 24, 2021

@medbar, thanks for this new recipe, a really interesting approach (thank you, this is extremely intriguing), to the point that I want to run and play with it myself, since I'm interested in one-shot rescoring. Please do not be overwhelmed by the sheer number of review comments: the code is large indeed, and there will be a lot of changes requested. We generally follow the Google code style with a couple of exceptions. I understand that comments like "add a space here" and "remove a space here" can be annoying, and C++ reviews are especially persnickety, but this is what keeps the quality of the code base high, so please do not be put off by them. It's all part of keeping our codebase accessible to future collaborators.

If you are comfortable with clang-format, you can just run it on the C++ code with --style=Google. Alternatively, you may ask cpplint.py for help (it is linked from the Google style doc), but it won't reformat the code for you, only point out some violations. For Python, we have never imposed any readability rules, except for a sane line length of around 100 characters; otherwise GitHub diffs turn into a mess. For Bash, we normally use a base indent of 2. I mention code formatting because that's where most change requests originate.

Oh, and feel free to ignore the failing CodeFactor check. You can skim through its suggestions, but more often than not they make no real difference to readability. Sometimes it finds important pessimizations to get rid of, but they drown in its noise.

Do you think the one C++ binary you're adding would be beneficial for general use? Maybe just put it into latbin/, if it's not extremely specialized? I don't think we have recipes with their own binaries; generally, all things C++ go under src/. I would not worry too much about code bloat: we're already bloated to 15 GB of binaries in a static build, so it's a drop in the sea anyway.

About the torch training part: does it use Hydra to orchestrate? Does it benefit from NCCL, i.e. is it faster on a multi-GPU node, or would many single-GPU nodes work just as well, in your experience? Which GPU would be best? If fairseq uses lower-precision tensor cores in training, the T4 would be king; in an NCCL scenario, V100s can be connected into a large collective; and if it scales well to multiple GPU nodes over a fast fabric network, P100s are likely the best bang for the buck.

In the paper (at the very start of section 5) a 50GB text dataset is mentioned, but without any citation or reference. I only skimmed the code, so I might have easily missed where it comes from. Can you give me a pointer, please?

I'll get to the review in the next couple of days.

@medbar (Contributor, Author) commented Jul 26, 2021

@kkm000, thank you for your answer. I formatted the C++ code with clang-format and fixed most of the issues.

Do you think the one C++ binary you're adding would be beneficial for general use? Maybe just put it into latbin/, if it's not extremely specialized? I don't think we have recipes with their own binaries; generally, all things C++ go under src/. I would not worry too much about code bloat: we're already bloated to 15 GB of binaries in a static build, so it's a drop in the sea anyway.

latgen-faster-mapped-fake-am can be used only with a FAM (Fake Acoustic Model) generated by fairseq_ltlm/recipes/genlats/generate_fake_am_model.py. The FAM is an .ark file with a single matrix of shape (#pdf_id, #pdf_id). latgen-faster-mapped-fake-am takes an alignment as input and selects logits from the FAM for each frame, so right now it is quite a specialized tool. Maybe it would be better to use some dedicated Kaldi object to store the FAM; if you know of an appropriate one, could you please point me to it?
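
To make the data flow concrete, here is a toy NumPy sketch of the per-frame lookup (made-up sizes and scores, not the actual C++ or recipe code):

```python
import numpy as np

# Toy illustration (not the actual recipe code) of how the FAM is consumed:
# the FAM is one (num_pdfs x num_pdfs) matrix, and for every frame the
# aligned pdf-id selects a row of that matrix, which then plays the role of
# the frame's acoustic logits during lattice generation.

num_pdfs = 6                                 # toy size; real models have thousands
rng = np.random.default_rng(0)

fam = rng.normal(loc=-10.0, scale=1.0, size=(num_pdfs, num_pdfs))
np.fill_diagonal(fam, 0.0)                   # toy scores: make the aligned pdf stand out

alignment = np.array([0, 0, 3, 3, 3, 5, 2])  # one pdf-id per frame

fake_logits = fam[alignment]                 # shape: (num_frames, num_pdfs)
print(fake_logits.shape)                     # -> (7, 6)
```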

About the torch training part: does it use Hydra to orchestrate? Does it benefit from NCCL, i.e. is it faster on a multi-GPU node, or would many single-GPU nodes work just as well, in your experience? Which GPU would be best? If fairseq uses lower-precision tensor cores in training, the T4 would be king; in an NCCL scenario, V100s can be connected into a large collective; and if it scales well to multiple GPU nodes over a fast fabric network, P100s are likely the best bang for the buck.

In the paper, single-GPU training was used, but it took about two weeks, so I decided to use a multi-node setup in the new experiments. This recipe trains the LT-LM on a Slurm cluster using Distributed Data Parallel with NCCL as the backend. With six 2080 Ti GPUs (two nodes, 3 GPUs per node), the model trained in 4 days. The main bottleneck of this setup is memory usage: I had to use the fairseq --update_freq option to increase the batch size artificially. So I think GPUs with a large amount of memory are better suited for this task.
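
For intuition, the effective batch size with that option works out roughly like this (a sketch with made-up numbers, not the recipe's actual settings):

```python
# Toy arithmetic (made-up numbers, not the actual recipe configuration)
# showing why gradient accumulation (fairseq's update_freq) helps when
# per-GPU memory is the bottleneck: the effective batch per optimizer
# step grows without growing the per-GPU batch.

def effective_batch(per_gpu_batch: int, num_gpus: int, update_freq: int) -> int:
    """Examples contributing to a single optimizer step."""
    return per_gpu_batch * num_gpus * update_freq

# e.g. 2 nodes x 3 GPUs = 6 GPUs, a small per-GPU batch, gradients
# accumulated over 4 forward/backward passes before each update.
print(effective_batch(per_gpu_batch=8, num_gpus=6, update_freq=4))  # -> 192
```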

In the paper (at the very start of section 5) a 50GB text dataset is mentioned, but without any citation or reference. I only skimmed the code, so I might have easily missed where it comes from. Can you give me a pointer, please?

We use an extra 5 GB (not 50 GB) text dataset which is included in the LibriSpeech corpus. In egs/librispeech/s5 it is data/local/lm/librispeech-lm-norm.train.txt. All paths required for training are described in fairseq_ltlm/recipes/config.sh.

Looking forward to your review.

@kkm000 (Contributor) commented Aug 4, 2021

@medbar, thanks for your detailed comments. I think you are doing it all correctly w.r.t. Kaldi data structures, but I'll give it a second brain pass when I have a breather. I'm really sorry about sitting on your recipe review for this long. As usually happens, unexpected (ha!) emergencies at work.

stale bot commented Oct 4, 2021

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

The stale bot added the "stale (Stale bot on the loose)" label on Oct 4, 2021
@kkm000 (Contributor) left a comment

Please add the missing file header; otherwise LGTM!

I'm so sorry about dragging this on. My bad. Let's fix and commit that ASAP.

@@ -0,0 +1,197 @@


Please add a standard header. Borrow one from any file in bin/ and put in your name, your institution, or both. The file ./COPYING has instructions if you aren't sure. LGTM otherwise.

@GNroy commented Jan 21, 2022

Hello to all!
Any chance this PR will end up being merged?

The stale bot removed the "stale (Stale bot on the loose)" label on Jan 21, 2022
@kkm000 merged commit 4609ea1 into kaldi-asr:master on Jan 21, 2022
@kkm000 (Contributor) commented Jan 21, 2022

Sorry for dragging this thing on and on; it dropped off my radar. Merged. Thank you, @medbar, for both your contribution and your patience!

@GNroy

Any chance this PR will end up being merged?

My estimate is 100%, give or take a few. :^)
