add model scaling section #15119
Conversation
We could definitely make a TF/Keras section! I believe Keras doesn't support gradient accumulation or gradient checkpointing out of the box, but mixed precision, Adafactor, etc. are all very doable.
Thanks a lot for working on this, this is a great addition!
The way you show the outputs is inconsistent with the rest of the doc. It should either use the doctest syntax:

    >>> instruction
    result

or two separate code blocks.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Hi @sgugger, I integrated your feedback; let me know if this is what you had in mind.
Thanks for working on it, @lvwerra. We need more content like this.
That said, I think a better solution would be to rewrite the existing documents rather than add a new one that has more than 50% overlap with the existing doc, which will make things more confusing for the user and more difficult for us to maintain, as many things will be duplicated if this PR is merged as is.
Indeed, the original performance doc has grown into scalability+performance, and the parallelism doc is no longer just about parallelism; it has actually become the scalability doc, which is specifically there to deal with OOM.
I think we need 2 docs:
- scalability
- performance
and move/combine this PR's doc with the scalability sections of the current performance doc, removing the huge overlap this PR proposes,
and leave in the performance doc only the software/hardware bits that talk about speeding things up, with minimal discussion of memory usage.
Bottom line, IMHO it'd be good to brainstorm how to do that in the most harmonious way before creating a new document.
docs/source/scaling.mdx
Outdated
That looks good: the GPU memory is not occupied, as we would expect before loading any models. If that's not the case on your machine, make sure to stop all processes that are using GPU memory.
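For illustration, a minimal sketch of how such a memory check could look using `pynvml` (an assumption here; the guide defines its own helper):

```py
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def print_gpu_utilization():
    # Query the first GPU's used memory via NVML (assumed single-GPU setup).
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used // 1024**2} MB.")

print_gpu_utilization()
```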
At this point I typically do `torch.ones((1, 1)).to(device_id)`, because CUDA kernels take up 1-2GB on each card.
Not all of a card's memory is available to the user, but once the kernels are loaded the user knows how much memory is really available.
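A minimal sketch of that warm-up, assuming a single GPU at index 0:

```py
import torch

# Allocating a tiny tensor forces PyTorch to load its CUDA kernels,
# which by themselves occupy 1-2GB of device memory.
device_id = 0  # assumed single-GPU setup
torch.ones((1, 1)).to(device_id)

# Memory readings taken from here on (e.g. via nvidia-smi or pynvml)
# reflect what is actually left for the model and optimizer states.
```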
docs/source/scaling.mdx
Outdated
Instead of aggregating optimizer states like Adafactor, 8-bit Adam keeps the full state and quantizes it. Quantization means that it stores the state with lower precision and dequantizes it only for the optimization step. This is similar to the idea behind FP16 training, where using variables with lower precision saves memory.

In contrast to the previous approaches, this one is not integrated into the [`Trainer`] as a simple flag. We need to install the 8-bit optimizer and then pass it as a custom optimizer to the [`Trainer`]. Follow the installation guide in the GitHub [repo](https://github.com/facebookresearch/bitsandbytes) to install the `bitsandbytes` library that implements the 8-bit Adam optimizer.
Please see this discussion: #14819
BNB needs either a stable embedding or to be configured to run the embedding in fp32; otherwise the proposed recipe is unstable, according to Tim.
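For context, a hedged sketch of what passing 8-bit Adam as a custom optimizer could look like, including an fp32-embedding override in the spirit of the point above; `model` and `training_args` are assumed to already exist, and the exact bitsandbytes recipe should be verified against the linked discussion:

```py
import torch
import bitsandbytes as bnb
from transformers import Trainer

# Keep the optimizer state for embedding weights in fp32 to avoid the
# instability mentioned above (assumed recipe; verify against #14819).
manager = bnb.optim.GlobalOptimManager.get_instance()
for module in model.modules():
    if isinstance(module, torch.nn.Embedding):
        manager.register_module_override(module, "weight", {"optim_bits": 32})

# 8-bit Adam over the model's parameters, handed to the Trainer
# via its `optimizers=(optimizer, lr_scheduler)` argument.
adam_bnb_optim = bnb.optim.Adam8bit(model.parameters(), lr=training_args.learning_rate)
trainer = Trainer(model=model, args=training_args, optimizers=(adam_bnb_optim, None))
```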
In the case of documentation, I don't think repetition is a bad thing, since users rarely look at every page in a row. Making sure the docs are all linked to each other for further reading is important. The two doc pages reach different target audiences IMO and present the techniques in different ways, so I think having both of them is good.
Repetition is a bad thing in this situation, IMHO:
But the big question is this: why make things more complicated and messy when the whole performance/scalability documentation is due for a revamp anyway? It's far from great at the moment. I propose to discuss how we best serve both users and ourselves by improving the performance/scalability doc structure/layout. I'm not proposing to drop anything from the current PR, but to integrate rather than create a diverging doc, and I'd be happy to work together with @lvwerra to make it happen.
Update: @lvwerra and I will discuss a revamp/integration on Fri.
Hi there, I updated the doc and this is ready for another review:
I discussed with @stas00 that we could start the revamp by merging this PR and then move things around as we start adding the other documents in #15213.
Looks great, @lvwerra
I just wonder whether for now we should append this to the end of `performance.mdx`? Since we don't intend to have a `scaling.mdx` in the revamped docs, at least not in the initial layout.
Otherwise, if people start linking to `scaling.mdx` and we then wipe it out, we will need to deal with additional redirects or end up with broken links.
Or, at the very least, we should put a big fat disclaimer that this doc is temporary and will be removed soon: do not link to it.
Good point, I merged. @sgugger, are you happy with the current state being merged?
LGTM, thanks for iterating on this!
```
GPU memory occupied: 8681 MB.
```
We can see that the memory footprint was dramatically reduced, at the cost of being only slightly slower than the vanilla run. Of course, this would change as you increase the number of accumulation steps. In general, you want to max out the GPU usage as much as possible, and in our case the batch size of 4 was already pretty close to the GPU's limit. If we wanted to train with a batch size of 64, we should not use `per_device_train_batch_size=1` and `gradient_accumulation_steps=64` but instead `per_device_train_batch_size=4` and `gradient_accumulation_steps=16`, which gives the same effective batch size while making better use of the available GPU resources.
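A minimal sketch of the suggested configuration, reusing the guide's `default_args`:

```py
from transformers import TrainingArguments

# Effective batch size 64 = 4 (per-device) x 16 (accumulation steps),
# with the per-device batch maxing out the GPU as discussed above.
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    **default_args,
)
```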
Very nice remark!
docs/source/performance.mdx
Outdated
Instead of keeping the rolling average for each element in the weight matrices, Adafactor only stores aggregated information (row- and column-wise sums of the rolling averages), which reduces the footprint considerably. One downside of Adafactor is that in some instances convergence can be slower than Adam's, so some experimentation is advised here. We can use Adafactor simply by setting `adafactor=True`:

```py
training_args = TrainingArguments(per_device_train_batch_size=4, adafactor=True, **default_args)
```
I think this way is deprecated and that one has to pass `optim="adafactor"` nowadays. @stas00 can confirm.
Indeed - fixed.
Thanks, @lvwerra
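For reference, a sketch of the non-deprecated form, with the same `default_args` assumption as the snippet above:

```py
from transformers import TrainingArguments

training_args = TrainingArguments(per_device_train_batch_size=4, optim="adafactor", **default_args)
```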
Does it make sense? If not, and we need to sync on the big picture, let's discuss that first.
Hi @stas00, sure, I'll revert the headers; I was not aware of that limitation. Regarding the order of the content: I know that we will change that document radically in later PRs, but I thought since this is gonna be public and maybe here for a few weeks, it would be nicer to have the practical guide at the beginning and the theoretical part at the end. If you feel strongly about this I can change it back.
In fact, I asked for support for more than 3 levels when the new design was added, but nothing came out of that, i.e. the menu is already missing a bunch of entries where there is deeper nesting, like the deepspeed doc. :(
Sure, that works. I think what I will do instead in the new PR is to move this original document out to a new document. I'm just concerned about broken links for any pre-existing links to this URL.
Thanks a lot for working on this important part of the docs!
* add model scaling section
* Apply suggestions from code review (Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>)
* integrate reviewer feedback
* initialize GPU properly
* add note about BnB optimizer
* move doc from `scaling.mdx` to `performance.mdx`
* integrate reviewer feedback
* revert section levels

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
What does this PR do?
This PR adds a practical scaling guide to the documentation. When starting to work with large transformer models, one of the most common issues is running out of GPU memory. There are several easy-to-implement strategies to counteract this. This guide shows how to use gradient accumulation, gradient checkpointing, mixed precision training, and the choice of optimizer to decrease the memory footprint. It shows how to do this with both the `Trainer` and `accelerate` in a hands-on fashion.
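As a rough sketch of what the guide covers, these techniques can be combined in a single `TrainingArguments` call; the flag values below are assumptions for illustration, not an excerpt from the PR:

```py
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=1,   # small per-device batch to fit in memory
    gradient_accumulation_steps=4,   # accumulate to a larger effective batch
    gradient_checkpointing=True,     # trade compute for activation memory
    fp16=True,                       # mixed precision training
    optim="adafactor",               # memory-efficient optimizer choice
    **default_args,                  # shared args, assumed as in the guide
)
```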
What does this PR not do?
This PR does not replace the in-depth explanations in "Model Parallelism" and "Performance and Scalability: How To Fit a Bigger Model and Train It Faster". Rather, the proposed guide should be an easy entry point for new users, whereas those other sections are for more advanced users.
TensorFlow?
If this guide resonates well, it would make sense to extend it with a TensorFlow/Keras section (similar to the `accelerate` section). Maybe TensorFlow boy (aka @Rocketknight1) has some ideas?