Use latest stable PyTorch/DeepSpeed for Push & Scheduled CI #17417
Conversation
Some comments
.github/workflows/self-push.yml
Outdated
- name: Install dependencies
  run: |
    apt -y update && apt install -y libaio-dev
    pip install --upgrade pip
    rm -rf ~/.cache/torch_extensions/ # shared between conflicting builds
    pip install .[testing,deepspeed,fairscale]
This step isn't needed: everything is already installed in the image huggingface/transformers-pytorch-deepspeed-latest-gpu, except fairscale (which we don't test at the moment).
Nice!
- name: Re-compile DeepSpeed
  working-directory: /workspace
  run: |
    pip install deepspeed # installs the deps correctly
    rm -rf DeepSpeed
    git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build
    DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
scheduled CI should use stable DeepSpeed
Sounds good
RUN git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build && \
    DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
scheduled CI should use stable DeepSpeed
RUN python3 -m pip uninstall -y torch torchvision torchaudio
RUN python3 -m pip install --no-cache-dir -U torch torchvision torchaudio
Won't the second command, which includes `-U`, take care of the uninstall automatically?
You caught me! Fixed.
There are some peculiar circumstances where an explicit uninstall is needed.
But I think you are also missing the CUDA information there: by default it will install the cuda-10.2 build. You most likely want:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
It's time to move to cuda-11.x.
And of course `-U`.
And why not use the cache?
For all the options, please see: https://pytorch.org/get-started/locally/
Yeah, I had some doubts regarding the CUDA version. Thank you for pointing this out.
I will also need to change some other Dockerfiles, for example transformers-all-latest-gpu. On it.
Regarding `--no-cache-dir`, I just copied it from other places. People on StackOverflow say it can reduce the Docker image size.
@LysandreJik might have other insights here?
If you don't mind redownloading the same packages all the time, then it works. Perhaps downloading the larger Docker image is more expensive.
`--no-cache-dir` is there to have a leaner image. It's already quite heavy with frameworks such as PyTorch, which end up weighing 1GB+; skipping the cache ensures the image download is as fast as it can be.
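To make the image-size point concrete, here is a minimal Dockerfile sketch (the base image tag is only an assumption for illustration):

```dockerfile
# Hypothetical base image, for illustration only
FROM nvidia/cuda:11.3.1-runtime-ubuntu20.04

# Without --no-cache-dir, pip keeps the downloaded wheels under
# ~/.cache/pip inside this layer, inflating the final image by
# roughly the size of the wheels (1GB+ for torch alone).
RUN python3 -m pip install --no-cache-dir -U torch torchvision torchaudio \
    --extra-index-url https://download.pytorch.org/whl/cu113
```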
The documentation is not available anymore as the PR was closed or merged.
Regarding:
> Here we only upgrade
I am not sure if we need to upgrade all 3 modules at the same time. The PyTorch installation instructions always install these 3 at the same time, so I guess yes?
Yes, usually it's easiest to always handle all 3 packages as a single unit, to avoid incompatibility conflicts down the road. I first tried to "optimize" and only install the other 2 when needed, but later I switched to always installing the 3 together. The other 2 are tiny compared to the main package.
OK, I will try to get it done. It's a bit tricky, as here we are using an argument
RUN [ ${#TORCH_VISION} -gt 0 ] && VERSION='torchvision=='$TORCH_VISION'.*' || VERSION='torchvision'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/cu113
RUN [ ${#TORCH_AUDIO} -gt 0 ] && VERSION='torchaudio=='$TORCH_AUDIO'.*' || VERSION='torchaudio'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/cu113
Somewhat ugly here, but it's just to install the CUDA 11.x builds.
(In another PR regarding nightly CI, this part will need some rework.)
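The conditional can be exercised on its own; here is a small sketch of the same `${#VAR}` length test used in those `RUN` lines (the version value is made up):

```shell
# If TORCH_VISION is non-empty, pin torchvision to that version;
# otherwise install whatever the index considers latest.
TORCH_VISION="0.12"
[ ${#TORCH_VISION} -gt 0 ] && VERSION='torchvision=='$TORCH_VISION'.*' || VERSION='torchvision'
echo "$VERSION"    # torchvision==0.12.*

# With the build arg left empty, the pin is dropped.
TORCH_VISION=""
[ ${#TORCH_VISION} -gt 0 ] && VERSION='torchvision=='$TORCH_VISION'.*' || VERSION='torchvision'
echo "$VERSION"    # torchvision
```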
OK, looks good to me. A bit complex, but it seems to get the work done.
LGTM, can't wait to see it run!
[Just FYI] An update: I am running the tests before merging. So far only a subset of tests has run. I got some issues with the PyTorch pipelines (single GPU) and the Torch CUDA tests (multi GPU). I will try to re-run them later, and will also wait for the scheduled CIs this weekend in order to compare.
Feel free to also spawn dummy machines to help you out if it helps get this merged quicker.
I am going to merge now.
For the versions, let's discuss in #17586. With the latest stable PyTorch/DeepSpeed I observed:
First Run
Second Run
With the previous setting (PyTorch 1.9 + DeepSpeed recompiled)
The very first deepspeed test using the deepspeed JIT install will have the overhead of building deepspeed, which takes about 1 min, depending on the hardware. This doesn't happen if deepspeed was prebuilt before the tests were run. Is it possible that this test happens to be the first one to run?
(This is for the stable release of the package.) Here is the job run page. The tests are run in the following order:
Dumb question: Is the stable release of
You can see that it was indeed building deepspeed during that test's run, see: https://github.com/huggingface/transformers/runs/6761224960?check_suite_focus=true#step:6:334, so we need either a longer timeout or to always prebuild deepspeed.
Hope the following makes the whole situation loud and clear.
What is being built:
How is it being built:
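As a sketch of the two modes being contrasted (the `DS_BUILD_*` flags are taken from the build commands earlier in this thread; using them against the PyPI package is an assumption, treat this as illustrative rather than a definitive recipe):

```shell
# JIT mode: only Python sources are installed; the CUDA/C++ ops are
# compiled lazily the first time a test touches them (the ~1 min
# overhead mentioned above).
pip install deepspeed

# Prebuilt mode: compile the selected ops up front (e.g. at Docker
# image-build time) so no test pays the compilation cost at runtime.
DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 pip install deepspeed
```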
Thank you @stas00, thankfully I have a better understanding of the terminology now! I will pre-build
…ace#17417) * update versions Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
What does this PR do?
(As there are a few PRs waiting for review, feel free to postpone the review of this PR if things get confusing at the moment.)
Currently:
This PR fixes it by using the latest stable PyTorch + DeepSpeed for both the push and scheduled CIs.
(Let me run a dummy test before merge 🙏 )