add model scaling section #15119
Conversation
We could definitely make a TF/Keras section! I believe Keras doesn't support gradient accumulation or gradient checkpointing out of the box, but mixed precision, Adafactor, etc. are all very doable.
Thanks a lot for working on this, this is a great addition!
The way you show the outputs is inconsistent with the rest of the doc. It should either use the doctest syntax:

    >>> instruction
    result

or two separate code blocks.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Hi @sgugger, I integrated your feedback; let me know if this is what you had in mind.
Thanks for working on it, @lvwerra. We need more content like this.
That said, I think a better solution would be to rewrite the existing documents rather than add a new one that has more than 50% overlap with the existing doc, which will make things more confusing for the user and more difficult for us to maintain, as many things will be duplicated if this PR is merged as is.
Indeed, the original performance doc has grown into scalability+performance, and the parallelism doc is no longer just about parallelism; it has actually become the scalability doc, which is specifically there to deal with OOM.
I think we need 2 docs:
- scalability
- performance
and move/combine this PR's doc with the scalability sections of the current performance doc, removing the huge overlap this PR proposes,
and leave in the performance doc only the software/hardware bits that talk about speeding things up, with minimal discussion of memory usage.
Bottom line, IMHO it'd be good to brainstorm how to do that in the most harmonious way before creating a new document.
docs/source/scaling.mdx
Outdated
That looks good: the GPU memory is not occupied, as we would expect before loading any models. If that's not the case on your machine, make sure to stop all processes that are using GPU memory.
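For illustration, a minimal sketch of how such a memory check could look using `pynvml` (an assumption here; the guide defines its own helper):

```py
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def print_gpu_utilization():
    # Query the first GPU's used memory via NVML (assumed single-GPU setup).
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used // 1024**2} MB.")

print_gpu_utilization()
```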
At this point I typically do `torch.ones((1, 1)).to(device_id)`, because CUDA kernels take up 1-2GB on each card.
Not all of a card's memory is available to the user, but once the kernels are loaded the user knows how much memory is really available.
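A minimal sketch of that warm-up, assuming a single GPU at index 0:

```py
import torch

# Allocating a tiny tensor forces PyTorch to load its CUDA kernels,
# which by themselves occupy 1-2GB of device memory.
device_id = 0  # assumed single-GPU setup
torch.ones((1, 1)).to(device_id)

# Memory readings taken from here on (e.g. via nvidia-smi or pynvml)
# reflect what is actually left for the model and optimizer states.
```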
docs/source/scaling.mdx
Outdated
Instead of aggregating optimizer states like Adafactor, 8-bit Adam keeps the full state and quantizes it. Quantization means that it stores the state with lower precision and dequantizes it only for the optimization step. This is similar to the idea behind FP16 training, where using variables with lower precision saves memory.

In contrast to the previous approaches, this one is not integrated into the [`Trainer`] as a simple flag. We need to install the 8-bit optimizer and then pass it as a custom optimizer to the [`Trainer`]. Follow the installation guide in the GitHub [repo](https://github.com/facebookresearch/bitsandbytes) to install the `bitsandbytes` library that implements the 8-bit Adam optimizer.
Please see this discussion: #14819
BNB needs either a stable embedding or to be configured to run the embedding in fp32; otherwise the proposed recipe is unstable, according to Tim.
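For context, a hedged sketch of what passing 8-bit Adam as a custom optimizer could look like, including an fp32-embedding override in the spirit of the point above; `model` and `training_args` are assumed to already exist, and the exact bitsandbytes recipe should be verified against the linked discussion:

```py
import torch
import bitsandbytes as bnb
from transformers import Trainer

# Keep the optimizer state for embedding weights in fp32 to avoid the
# instability mentioned above (assumed recipe; verify against #14819).
manager = bnb.optim.GlobalOptimManager.get_instance()
for module in model.modules():
    if isinstance(module, torch.nn.Embedding):
        manager.register_module_override(module, "weight", {"optim_bits": 32})

# 8-bit Adam over the model's parameters, handed to the Trainer
# via its `optimizers=(optimizer, lr_scheduler)` argument.
adam_bnb_optim = bnb.optim.Adam8bit(model.parameters(), lr=training_args.learning_rate)
trainer = Trainer(model=model, args=training_args, optimizers=(adam_bnb_optim, None))
```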
In the case of documentation, I don't think repetition is a bad thing, since users rarely look at every page in a row. Making sure the docs are all linked to each other for further reading is important. The two doc pages reach different target audiences IMO and present the techniques in different ways, so I think having both of them is good.
Repetition is a bad thing in this situation, IMHO:
But the big question is this: why make things more complicated and messy when the whole performance/scalability documentation is due for a revamp anyway? It's far from great at the moment. I propose to discuss how we best serve both users and ourselves by improving the performance/scalability doc structure/layout. I'm not proposing to drop anything from the current PR, but to integrate rather than create a diverging doc, and I'd be happy to work together with @lvwerra to make it happen.
Update: @lvwerra and I will discuss a revamp/integration on Fri.
Hi there, I updated the doc and this is ready for another review:
I discussed with @stas00 that we could start the revamp by merging this PR and then move things around as we start adding the other documents in #15213.
Looks great, @lvwerra
I just wonder whether for now we should append this to the end of `performance.mdx`? Since we don't intend to have a `scaling.mdx` in the revamped docs, at least not in the initial layout.
Otherwise, if people start linking to `scaling.mdx` and we then wipe it out, we will need to deal with additional redirects or end up with broken links.
Or, at the very least, we should put a big fat disclaimer that this doc is temporary and will be removed soon: do not link to it.
Good point, I merged. @sgugger, are you happy with the current state being merged?
LGTM, thanks for iterating on this!
```
GPU memory occupied: 8681 MB.
```
We can see that the memory footprint was dramatically reduced, at the cost of being only slightly slower than the vanilla run. Of course, this would change as you increase the number of accumulation steps. In general, you want to max out the GPU usage as much as possible, and in our case the batch size of 4 was already pretty close to the GPU's limit. If we wanted to train with a batch size of 64, we should not use `per_device_train_batch_size=1` and `gradient_accumulation_steps=64` but instead `per_device_train_batch_size=4` and `gradient_accumulation_steps=16`, which gives the same effective batch size while making better use of the available GPU resources.
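A minimal sketch of the suggested configuration, reusing the guide's `default_args`:

```py
from transformers import TrainingArguments

# Effective batch size 64 = 4 (per-device) x 16 (accumulation steps),
# with the per-device batch maxing out the GPU as discussed above.
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    **default_args,
)
```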
Very nice remark!
docs/source/performance.mdx
Outdated
Instead of keeping the rolling average for each element in the weight matrices, Adafactor only stores aggregated information (row- and column-wise sums of the rolling averages), which reduces the footprint considerably. One downside of Adafactor is that in some instances convergence can be slower than Adam's, so some experimentation is advised here. We can use Adafactor simply by setting `adafactor=True`:

```py
training_args = TrainingArguments(per_device_train_batch_size=4, adafactor=True, **default_args)
```
I think this way is deprecated and that one has to pass `optim="adafactor"` nowadays. @stas00 can confirm.
Indeed - fixed.
Thanks, @lvwerra
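For reference, a sketch of the non-deprecated form, with the same `default_args` assumption as the snippet above:

```py
from transformers import TrainingArguments

training_args = TrainingArguments(per_device_train_batch_size=4, optim="adafactor", **default_args)
```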
Does it make sense? If not, and we need to sync on the big picture, let's discuss that first.
Hi @stas00, sure, I'll revert the headers; I was not aware of that limitation. Regarding the order of the content: I know that we will change that document radically in later PRs, but I thought since this is gonna be public and maybe here for a few weeks, it would be nicer to have the practical guide at the beginning and the theoretical part at the end. If you feel strongly about this I can change it back.
In fact, I asked for support for more than 3 levels when the new design was added, but nothing came out of that, i.e. the menu is already missing a bunch of entries where there is deeper nesting, like the deepspeed doc. :(
Sure, that works. I think what I will do instead in the new PR is to move this original document out to a new document. I'm just concerned about broken links for any pre-existing links to this URL.
Thanks a lot for working on this important part of the docs!
* add model scaling section
* Apply suggestions from code review (Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>)
* integrate reviewer feedback
* initialize GPU properly
* add note about BnB optimizer
* move doc from `scaling.mdx` to `performance.mdx`
* integrate reviewer feedback
* revert section levels

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
What does this PR do?
This PR adds a practical scaling guide to the documentation. When starting to work with large transformer models, one of the most common issues is running out of GPU memory. There are several easy-to-implement strategies to counteract this. This guide shows how to use gradient accumulation, gradient checkpointing, mixed precision training, and the choice of optimizer to decrease the memory footprint. It shows how to do this with both the `Trainer` and `accelerate` in a hands-on fashion.
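As a rough sketch of what the guide covers, these techniques can be combined in a single `TrainingArguments` call; the flag values below are assumptions for illustration, not an excerpt from the PR:

```py
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=1,   # small per-device batch to fit in memory
    gradient_accumulation_steps=4,   # accumulate to a larger effective batch
    gradient_checkpointing=True,     # trade compute for activation memory
    fp16=True,                       # mixed precision training
    optim="adafactor",               # memory-efficient optimizer choice
    **default_args,                  # shared args, assumed as in the guide
)
```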
What does this PR not do?
This PR does not replace the in-depth explanations in "Model Parallelism" and "Performance and Scalability: How To Fit a Bigger Model and Train It Faster". Rather, the proposed guide should be an easy entry point for new users, whereas those other sections are for more advanced users.
TensorFlow?
If this guide resonates well, it would make sense to extend it with a TensorFlow/Keras section (similar to the `accelerate` section). Maybe TensorFlow boy (aka @Rocketknight1) has some ideas?