
LT-LM recipe for librispeech #4590

Merged: 10 commits, Jan 21, 2022

Conversation

@medbar (Contributor) commented Jul 13, 2021

Code for LT-LM training on the LibriSpeech corpus.
Paper: "LT-LM: a novel non-autoregressive language model for single-shot lattice rescoring".
The model is trained using the fairseq framework.

@kkm000 self-requested a review July 24, 2021 00:41
@kkm000 self-assigned this Jul 24, 2021
@kkm000 (Contributor) commented Jul 24, 2021

@medbar, thanks for this new recipe, a really interesting approach (thank you, this is extremely intriguing), to the point that I want to run and play with it myself, since I'm interested in one-shot rescoring. Please do not be overwhelmed by the sheer number of review comments: the code is large indeed, and there will be a lot of changes requested. We generally follow the Google code style with a couple of exceptions. I understand that comments like "add a space here" and "remove a space here" can be annoying, and C++ reviews are especially persnickety, but this is what keeps the quality of the code base high, so please do not be put off by them. It's all part of keeping our codebase accessible to future collaborators.

If you are comfortable with clang-format, you can just run it on the C++ code with --style=Google. Alternatively, you may ask cpplint.py for help (it is linked from the Google style doc), but it won't reformat the code for you, only point out some violations. For Python, we have never imposed any readability rules, except for a sane line length of around 100 characters; otherwise GitHub diffs turn into a mess. For Bash, we normally use a base indent of 2. I mention code formatting because that's where most change requests originate.

Oh, and feel free to ignore the failing CodeFactor check. You can skim through its suggestions, but more often than not they make no real difference to readability. Sometimes it finds important pessimizations to get rid of, but they drown in its noise.

Do you think the one C++ binary you're adding would be beneficial for general use? Maybe just put it into latbin/, if it's not extremely specialized? I don't think we have recipes with their own binaries; generally, all things C++ go under src/. I would not worry too much about code bloat: we're already bloated to 15 GB of binaries in a static build, so it's a drop in the sea anyway.

About the torch training part: does it use Hydra to orchestrate? Does it benefit from NCCL, i.e. is it faster on a multi-GPU node, or would many single-GPU nodes work just as well, in your experience? Which GPU would be best? If fairseq uses lower-precision tensor cores in training, the T4 would be king; in an NCCL scenario, V100s can be connected into a large collective; and if it scales well to multiple GPU nodes over a fast fabric network, P100s are likely the best bang for the buck.

In the paper (at the very start of section 5) a 50GB text dataset is mentioned, but without any citation or reference. I only skimmed the code, so I might have easily missed where it comes from. Can you give me a pointer, please?

I'll get to the review in the next couple of days.

@medbar (Contributor, Author) commented Jul 26, 2021

@kkm000, thank you for your answer. I formatted the C++ code with clang-format and fixed most of the issues.

Do you think the one C++ binary you're adding would be beneficial for general use? Maybe just put it into latbin/, if it's not extremely specialized? I don't think we have recipes with their own binaries; generally, all things C++ go under src/. I would not worry too much about code bloat: we're already bloated to 15 GB of binaries in a static build, so it's a drop in the sea anyway.

latgen-faster-mapped-fake-am can be used only with a FAM (Fake Acoustic Model) generated by fairseq_ltlm/recipes/genlats/generate_fake_am_model.py. The FAM is an .ark file with a single matrix of shape (#pdf_id, #pdf_id). latgen-faster-mapped-fake-am takes an alignment as input and selects logits from the FAM for each frame, so right now it is quite a specialized tool. Maybe it would be better to use some dedicated Kaldi object to store the FAM; if you know of an appropriate one, could you please point me to it?
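
To make the data flow concrete, here is a toy NumPy sketch of the per-frame lookup (made-up sizes and scores, not the actual C++ or recipe code):

```python
import numpy as np

# Toy illustration (not the actual recipe code) of how the FAM is consumed:
# the FAM is one (num_pdfs x num_pdfs) matrix, and for every frame the
# aligned pdf-id selects a row of that matrix, which then plays the role of
# the frame's acoustic logits during lattice generation.

num_pdfs = 6                                 # toy size; real models have thousands
rng = np.random.default_rng(0)

fam = rng.normal(loc=-10.0, scale=1.0, size=(num_pdfs, num_pdfs))
np.fill_diagonal(fam, 0.0)                   # toy scores: make the aligned pdf stand out

alignment = np.array([0, 0, 3, 3, 3, 5, 2])  # one pdf-id per frame

fake_logits = fam[alignment]                 # shape: (num_frames, num_pdfs)
print(fake_logits.shape)                     # -> (7, 6)
```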

About the torch training part: does it use Hydra to orchestrate? Does it benefit from NCCL, i.e. is it faster on a multi-GPU node, or would many single-GPU nodes work just as well, in your experience? Which GPU would be best? If fairseq uses lower-precision tensor cores in training, the T4 would be king; in an NCCL scenario, V100s can be connected into a large collective; and if it scales well to multiple GPU nodes over a fast fabric network, P100s are likely the best bang for the buck.

In the paper, single-GPU training was used, but it took about two weeks, so I decided to use a multi-node setup in the new experiments. This recipe trains the LT-LM on a Slurm cluster using Distributed Data Parallel with NCCL as the backend. With six 2080 Ti GPUs (two nodes, 3 GPUs per node), the model trained in 4 days. The main bottleneck of this setup is memory usage: I had to use the fairseq --update_freq option to increase the batch size artificially. So I think GPUs with a large amount of memory are better suited for this task.
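
For intuition, the effective batch size with that option works out roughly like this (a sketch with made-up numbers, not the recipe's actual settings):

```python
# Toy arithmetic (made-up numbers, not the actual recipe configuration)
# showing why gradient accumulation (fairseq's update_freq) helps when
# per-GPU memory is the bottleneck: the effective batch per optimizer
# step grows without growing the per-GPU batch.

def effective_batch(per_gpu_batch: int, num_gpus: int, update_freq: int) -> int:
    """Examples contributing to a single optimizer step."""
    return per_gpu_batch * num_gpus * update_freq

# e.g. 2 nodes x 3 GPUs = 6 GPUs, a small per-GPU batch, gradients
# accumulated over 4 forward/backward passes before each update.
print(effective_batch(per_gpu_batch=8, num_gpus=6, update_freq=4))  # -> 192
```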

In the paper (at the very start of section 5) a 50GB text dataset is mentioned, but without any citation or reference. I only skimmed the code, so I might have easily missed where it comes from. Can you give me a pointer, please?

We use an extra 5 GB (not 50 GB) text dataset which is included in the LibriSpeech corpus. In egs/librispeech/s5 it is data/local/lm/librispeech-lm-norm.train.txt. All paths required for training are described in fairseq_ltlm/recipes/config.sh.

Looking forward to your review.

@kkm000 (Contributor) commented Aug 4, 2021

@medbar, thanks for your detailed comments. I think you are doing it all correctly w.r.t. Kaldi data structures, but I'll give it a second brain pass when I have a breather. I'm really sorry about sitting on your recipe review for this long. As usually happens, unexpected (ha!) emergencies at work.

stale bot commented Oct 4, 2021

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

The stale bot added the "stale (Stale bot on the loose)" label on Oct 4, 2021
@kkm000 (Contributor) left a comment

Please add the missing file header; otherwise LGTM!

I'm so sorry about dragging this on. My bad. Let's fix and commit that ASAP.

@@ -0,0 +1,197 @@


Please add a standard header. Borrow one from any file in bin/ and put in your name, your institution, or both. The file ./COPYING has instructions if you aren't sure. LGTM otherwise.

@GNroy commented Jan 21, 2022

Hello to all!
Any chance this PR will end up being merged?

The stale bot removed the "stale (Stale bot on the loose)" label on Jan 21, 2022
@kkm000 merged commit 4609ea1 into kaldi-asr:master on Jan 21, 2022
@kkm000 (Contributor) commented Jan 21, 2022

Sorry for dragging this thing on and on; it dropped off my radar. Merged. Thank you, @medbar, for both your contribution and your patience!

@GNroy

Any chance this PR will end up being merged?

My estimate is 100%, give or take a few. :^)
