
add model scaling section #15119

Merged: 8 commits merged into master from scaling-guide on Feb 9, 2022

Conversation

@lvwerra (Member) commented Jan 12, 2022

What does this PR do?

This PR adds a practical scaling guide to the documentation. When starting to work with large transformer models, one of the most common issues is running out of GPU memory. There are several easy-to-implement strategies to counteract this. This guide shows how to use gradient accumulation, gradient checkpointing, mixed precision training, and the choice of optimizer to decrease the memory footprint, and it demonstrates this with both the Trainer and accelerate in a hands-on fashion.
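For a flavor of the Trainer-side levers, here is a minimal sketch (the argument values are illustrative and assume a CUDA GPU; the guide uses its own benchmark setup and measures each technique separately):

```py
from transformers import TrainingArguments

# Illustrative values only; the choice of optimizer (Adafactor, 8-bit Adam)
# is covered separately in the guide.
training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # gradient accumulation: effective batch size of 4
    gradient_checkpointing=True,    # recompute activations in the backward pass to save memory
    fp16=True,                      # mixed precision training
)
```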

What does this PR not do?

This PR does not replace the in-depth explanations in "Model Parallelism" and "Performance and Scalability: How To Fit a Bigger Model and Train It Faster". Rather, the proposed guide should be an easy entry point for new users, whereas those other sections are for more advanced users.

TensorFlow?

If this guide resonates well, it would make sense to extend it with a TensorFlow/Keras section (similar to the accelerate section). Maybe TensorFlow boy (aka @Rocketknight1) has some ideas?

@Rocketknight1 (Member):

We could definitely make a TF/Keras section! I believe Keras doesn't support gradient accumulation or gradient checkpointing out of the box, but mixed precision, Adafactor, etc. are all very doable.
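For reference, mixed precision in Keras is indeed a one-liner global policy, so a TF section could start from something like this sketch (not part of this PR):

```py
import tensorflow as tf

# Keras mixed precision: compute in float16 while keeping variables in float32
# for numerical stability.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
```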

@sgugger (Collaborator) left a comment:

Thanks a lot for working on this, this is a great addition!

The way you show the outputs is inconsistent with the rest of the doc. It should either use the doctest syntax:

```
>>> instruction
result
```

or two separate code blocks.

Review threads on docs/source/_toctree.yml and docs/source/scaling.mdx (resolved).
lvwerra and others added 2 commits January 12, 2022 15:35
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@lvwerra (Member, Author) commented Jan 12, 2022

Hi @sgugger I integrated your feedback, let me know if this is what you had in mind.

@stas00 (Contributor) left a comment:

Thanks for working on it, @lvwerra. We need more content like this.

That said, I think a better solution would be to rewrite the existing documents rather than add a new one that has more than 50% overlap with the existing doc. That would make things more confusing for the user and more difficult to maintain for us, as many things will be duplicated if this PR is merged as is.

Indeed, the original performance doc has grown into scalability+performance, and the parallelism doc is no longer just about parallelism; it has effectively become the scalability doc that is specifically there to deal with OOM.

I think we need 2 docs:

  1. scalability
  2. performance

and move/combine this PR's doc with the scalability sections of the current performance doc, removing the huge overlap that this PR proposes,

and for the performance doc, leave only the software/hardware bits that talk about speeding things up, with minimal discussion of memory usage.

Bottom line, IMHO it'd be good to brainstorm how to do that in the most harmonious way before creating a new document.


> That looks good: the GPU memory is not occupied, as we would expect before we load any models. If that's not the case on your machine, make sure to stop all processes that are using GPU memory.

@stas00 (Contributor):

At this point I typically do `torch.ones((1, 1)).to(device_id)` because CUDA kernels take up 1-2 GB on each card.

https://github.com/stas00/ipyexperiments/blob/39c9b454e89e53b74c74dcb579c12ecf3d2161b9/ipyexperiments/utils/mem.py#L21-L28

Not all of the card's memory is available to the user, but once the kernels are loaded the user knows how much memory is really available.
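A minimal sketch of that warm-up followed by a memory readout (assuming pynvml is installed and a single GPU at index 0; the guide defines its own helper for this):

```py
import torch
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def print_gpu_utilization(device_id=0):
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(device_id)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used // 1024**2} MB.")

# Put a tiny tensor on the GPU first so the CUDA context and kernels are
# already resident; otherwise their 1-2 GB get attributed to whatever loads next.
torch.ones((1, 1)).to("cuda:0")
print_gpu_utilization()
```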

> Instead of aggregating optimizer states like Adafactor, 8-bit Adam keeps the full state and quantizes it. Quantization means that it stores the state with lower precision and dequantizes it only for the optimization. This is similar to the idea behind FP16 training, where using variables with lower precision saves memory.
>
> In contrast to the previous approaches, this one is not integrated into the [`Trainer`] as a simple flag. We need to install the 8-bit optimizer and then pass it as a custom optimizer to the [`Trainer`]. Follow the installation guide in the GitHub [repo](https://github.com/facebookresearch/bitsandbytes) to install the `bitsandbytes` library that implements the 8-bit Adam optimizer.

@stas00 (Contributor) commented Jan 12, 2022:

Please see this discussion #14819

BNB needs either a stable embedding or to be configured to run the embedding in fp32.

Otherwise, the proposed recipe is unstable, according to Tim.
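To make the thread concrete, a hedged sketch of what passing 8-bit Adam to the Trainer with an fp32 override for the embeddings might look like (`model`, `train_dataset`, and `default_args` are assumed to come from the guide; bnb API names as in the bitsandbytes repo):

```py
import torch
import bitsandbytes as bnb
from transformers import Trainer, TrainingArguments

# Keep the optimizer state of embedding weights in 32-bit to avoid the
# instability mentioned above; register overrides before creating the optimizer.
manager = bnb.optim.GlobalOptimManager.get_instance()
for module in model.modules():
    if isinstance(module, torch.nn.Embedding):
        manager.register_module_override(module, "weight", {"optim_bits": 32})

# 8-bit Adam keeps the full Adam state but stores it quantized.
adam_bnb_optim = bnb.optim.Adam8bit(model.parameters(), lr=3e-5)

training_args = TrainingArguments(per_device_train_batch_size=4, **default_args)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    optimizers=(adam_bnb_optim, None),  # custom optimizer; Trainer creates the scheduler
)
```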

@sgugger (Collaborator) commented Jan 12, 2022

In the case of documentation, I don't think repetition is a bad thing, since users rarely look at every page in a row. Making sure the docs are all linked to each other for further reading is important. The two doc pages reach different target audiences IMO and present the techniques in different ways, so I think having both of them is good.

@stas00 (Contributor) commented Jan 12, 2022

Repetition is a bad thing in this situation, IMHO:

  1. as it'd make it more difficult to point users to the right doc. For example, @LysandreJik has been pointing users to the performance doc when they have difficulties getting their model to run memory-wise. Now you will have 2 competing docs: which one should he point users to? It shouldn't take a maintainer more than a few secs to find where to point the user, without needing to reread each doc and contemplate whether it is A or B.
  2. how do you maintain 2 overlapping documents?
  3. some users do read docs and now they will need to read 2 docs with confusing overlaps

But the big question is this: why make things more complicated and messy when the whole performance/scalability documentation is due for a revamp anyway? It's far from great at the moment. I propose we discuss how to best serve both users and ourselves by improving the performance/scalability doc structure/layout. I'm not proposing to drop anything from the current PR, but to integrate it rather than creating a diverging doc.

and I'd be happy to work together with @lvwerra to make it happen.

@stas00 (Contributor) commented Jan 12, 2022

update: @lvwerra and I will discuss a revamp/integration on Fri.

@lvwerra (Member, Author) commented Jan 31, 2022

Hi there, I updated the doc and this is ready for another review:

  • used @stas00's suggestion to initialize the measurement by loading a small tensor onto the GPU
  • added a note about the embedding issue with the BnB optimizer

I discussed with @stas00 that we could start the revamp by merging this PR and then move things around as we start adding the other documents in #15213.

@stas00 (Contributor) left a comment:

Looks great, @lvwerra

I just wonder whether, for now, we should append this to the end of performance.mdx, since we don't intend to have a scaling.mdx in the revamped docs - at least not in the initial layout.

Otherwise, if people start linking to scaling.mdx and we then wipe it out, we will need to deal with additional redirects or end up with broken links.

Or, at the very least, we should put a big fat disclaimer that this doc is temporary, will be removed soon, and should not be linked to.

@lvwerra (Member, Author) commented Feb 7, 2022

Good point, I merged scaling.mdx and performance.mdx. The document starts with the new practical guide, and I added the existing material to a section called "Further discussions". There is still a bit of redundancy between the two parts, but I think we can refactor it when we flesh out the other parts.

@sgugger are you happy with the current state to be merged?

@sgugger (Collaborator) left a comment:
LGTM, thanks for iterating on this!

Review threads on docs/source/_toctree.yml and docs/source/performance.mdx (resolved).
> ```
> GPU memory occupied: 8681 MB.
> ```
>
> We can see that the memory footprint was dramatically reduced at the cost of being only slightly slower than the vanilla run. Of course, this would change as you increase the number of accumulation steps. In general you want to max out the GPU usage as much as possible, so in our case the batch size of 4 was already pretty close to the GPU's limit. If we wanted to train with a batch size of 64 we should not use `per_device_train_batch_size=1` and `gradient_accumulation_steps=64`, but instead `per_device_train_batch_size=4` and `gradient_accumulation_steps=16`, which has the same effective batch size while making better use of the available GPU resources.
@sgugger (Collaborator):

Very nice remark!
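For concreteness, the two configurations the excerpt compares would look like this (a sketch; `default_args` is assumed to be defined in the guide):

```py
from transformers import TrainingArguments

# Both give an effective batch size of 64 (4 * 16 == 1 * 64),
# but the first keeps the GPU much better utilized per step.
preferred_args = TrainingArguments(per_device_train_batch_size=4, gradient_accumulation_steps=16, **default_args)
slower_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=64, **default_args)
```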

docs/source/performance.mdx, comment on lines 266 to 270:
> Instead of keeping the rolling average for each element in the weight matrices, Adafactor only stores aggregated information (row- and column-wise sums of the rolling averages), which reduces the footprint considerably. One downside of Adafactor is that in some instances convergence can be slower than Adam's, so some experimentation is advised here. We can use Adafactor simply by setting `adafactor=True`:
>
> ```py
> training_args = TrainingArguments(per_device_train_batch_size=4, adafactor=True, **default_args)
> ```
@sgugger (Collaborator):

I think this way is deprecated and that one has to pass `optim="adafactor"` nowadays. @stas00 can confirm.

@lvwerra (Member, Author):

Indeed - fixed.
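Presumably the fixed line looks roughly like this (a sketch; `default_args` is assumed from the guide):

```py
from transformers import TrainingArguments

# `optim="adafactor"` replaces the deprecated `adafactor=True` flag.
training_args = TrainingArguments(per_device_train_batch_size=4, optim="adafactor", **default_args)
```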

@lvwerra (Member, Author) commented Feb 8, 2022

Thanks @sgugger, I integrated your feedback! @stas00 do you want to have a last look before merging?

@stas00 (Contributor) commented Feb 8, 2022

Thanks, @lvwerra

  1. Please revert all the header level changes in the existing doc - the menu on the right only handles `##` and `###`, and your proposed change will practically eliminate most of the menu entries since you added an additional level.

  2. We discussed adding this PR's content at the end of the existing document, as at the moment it is the extra and not the main document as it was laid out. 95% of this PR's material will be spread out over the new documents, and the current performance doc will remain here as a reference that the more specific docs will point to for in-depth information (e.g. for the mixed precision section, we aren't going to repeat the full discussion about why and how in each sub-document; we will point to the performance reference doc instead). We may rename the main reference doc and make performance.mdx the entry point instead.

Does this make sense? If not, and we need to sync on the big picture, let's discuss that first.

@lvwerra (Member, Author) commented Feb 8, 2022

Hi @stas00

Sure, I'll revert the headers, I was not aware of that limitation.

Regarding the order of the content: I know that we will change that document radically in later PRs but I thought since this is gonna be public and maybe here for a few weeks it would be nicer to have the practical guide at the beginning and the theoretical part at the end. If you feel strongly about this I can change it back.

@stas00 (Contributor) commented Feb 8, 2022

> Hi @stas00
>
> Sure, I'll revert the headers, I was not aware of that limitation.

In fact, I asked for support for more than 3 levels when the new design was added, but nothing came out of that. I.e. the menu is already missing a bunch of entries where there is deeper nesting, like the deepspeed doc. :(

> Regarding the order of the content: I know that we will change that document radically in later PRs but I thought since this is gonna be public and maybe here for a few weeks it would be nicer to have the practical guide at the beginning and the theoretical part at the end. If you feel strongly about this I can change it back.

Sure, that works. I think what I will do instead in the new PR is to move this original document out to a new document. I'm just concerned about broken links from any pre-existing links to this URL.

@lvwerra merged commit d923f76 into master Feb 9, 2022
@lvwerra deleted the scaling-guide branch February 9, 2022 14:27
@LysandreJik (Member):
Thanks a lot for working on this important part of the docs!

stevhliu pushed a commit to stevhliu/transformers that referenced this pull request Feb 18, 2022
* add model scaling section

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* integrate reviewer feedback

* initialize GPU properly

* add note about BnB optimizer

* move doc from `scaling.mdx` to `performance.mdx`

* integrate reviewer feedback

* revert section levels

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
ManuelFay pushed a commit to ManuelFay/transformers that referenced this pull request Mar 31, 2022