Use latest stable PyTorch/DeepSpeed for Push & Scheduled CI #17417

Merged
10 commits merged into main from fix_pytorch_version_in_ci on Jun 7, 2022

Conversation

ydshieh (Collaborator) commented May 25, 2022

What does this PR do?

(As there are a few PRs waiting for review, feel free to postpone the review for this PR if things get confusing at this moment)

Currently:

  • scheduled CI uses latest stable PyTorch (OK) + nightly DeepSpeed (Not OK)
  • push CI uses PyTorch 1.9 (Not OK) + latest stable DeepSpeed (OK)

This PR fixes this by using the latest stable PyTorch and DeepSpeed for both the push and the scheduled CI.

(Let me run a dummy test before merge 🙏 )

ydshieh (Collaborator, Author) left a comment:

Some comments

Comment on lines 266 to 264
- name: Install dependencies
run: |
apt -y update && apt install -y libaio-dev
pip install --upgrade pip
rm -rf ~/.cache/torch_extensions/ # shared between conflicting builds
pip install .[testing,deepspeed,fairscale]

ydshieh (Collaborator, Author):

No need for this step anymore -> everything is already installed in the image huggingface/transformers-pytorch-deepspeed-latest-gpu, except fairscale (which we don't test at the moment).
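
For context, a minimal sketch of what the job can look like once it relies on the prebuilt image instead of installing dependencies in a step. The job id, runner labels, and container options below are illustrative assumptions, not taken from the actual workflow file, and the usual checkout/update steps are omitted:

run_tests_torch_deepspeed_gpu:                      # hypothetical job id
  runs-on: [self-hosted, docker-gpu]                # illustrative runner labels
  container:
    image: huggingface/transformers-pytorch-deepspeed-latest-gpu
    options: --gpus all                             # illustrative container options
  steps:
    # no "Install dependencies" step needed: PyTorch, DeepSpeed, libaio-dev etc. are baked into the image
    - name: Run DeepSpeed tests
      run: python3 -m pytest -v tests/deepspeed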

Member:

Nice!

Comment on lines -271 to -316
- name: Re-compile DeepSpeed
working-directory: /workspace
run: |
pip install deepspeed # installs the deps correctly
rm -rf DeepSpeed
git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build
DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

ydshieh (Collaborator, Author):

scheduled CI should use stable DeepSpeed

Member:

Sounds good

Comment on lines -14 to -15
RUN git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build && \
DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
ydshieh (Collaborator, Author):

scheduled CI should use stable DeepSpeed

Comment on lines 15 to 16
RUN python3 -m pip uninstall -y torch torchvision torchaudio
RUN python3 -m pip install --no-cache-dir -U torch torchvision torchaudio
Contributor:

won't the second command that includes -U take care of uninstall automatically?

ydshieh (Collaborator, Author), May 25, 2022:

You caught me! Fixed.

Contributor:

there are some peculiar circumstances where an explicit uninstall is needed.

stas00 (Contributor), May 25, 2022:

but I think you are also missing the CUDA information there - it will install the CUDA 10.2 version by default

you most likely want:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

it's time to move to CUDA 11.x

and of course -U - and why not use the cache?

Contributor:

for all the options please see: https://pytorch.org/get-started/locally/

ydshieh (Collaborator, Author):

Yeah, I had some doubts regarding the CUDA version. Thank you for pointing this out.
I will need to change some other Dockerfiles too, for example transformers-all-latest-gpu. On it.

ydshieh (Collaborator, Author):

Regarding --no-cache-dir, I just copied it from other places. People on Stack Overflow say it can reduce the Docker image size.
@LysandreJik might have other insights here?

Contributor:

If you don't mind redownloading the same packages all the time, then it works. Though perhaps downloading the larger Docker image is more expensive.

Member:

--no-cache-dir is there to get a leaner image. The image is already quite heavy with frameworks such as PyTorch, which end up weighing 1GB+; skipping the cache ensures the download is as fast as it can be.
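
For illustration, a minimal example of the flags being discussed (the package list is just an example):

RUN python3 -m pip install --no-cache-dir -U torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
# --no-cache-dir: don't keep downloaded wheels in pip's cache, so they don't end up in the image layer
# -U: upgrade to the latest available versions even if the packages are already installed
# --extra-index-url .../cu113: pull the CUDA 11.3 builds instead of the default CUDA 10.2 ones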

HuggingFaceDocBuilderDev commented May 25, 2022

The documentation is not available anymore as the PR was closed or merged.

ydshieh (Collaborator, Author) commented May 25, 2022

@stas00

Regarding transformers-pytorch-gpu:

RUN [ ${#PYTORCH} -gt 0 ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; python3 -m pip install --no-cache-dir -U $VERSION

Here we only upgrade torch, but not torchvision and torchaudio. Those two are installed in a previous step:

RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-torch,testing]

I am not sure if we need to upgrade all 3 packages at the same time. The PyTorch installation instructions always install these 3 together, so I guess yes??

stas00 (Contributor) commented May 25, 2022

Yes, it's usually easiest to always handle all 3 packages as a single unit to avoid incompatibility conflicts down the road.

I first tried to "optimize" and only install the other 2 when needed, but later I switched to always installing the 3 together. The other 2 are tiny compared to the main package.

ydshieh (Collaborator, Author) commented May 25, 2022

Yes, it's usually easiest to always handle all 3 packages as a single unit to avoid incompatibility conflicts down the road.

I first tried to "optimize" and only install the other 2 when needed, but later I switched to always installing the 3 together. The other 2 are tiny compared to the main package.

OK, I will try to get it done. It's not quite straightforward here, since we use a $PYTORCH argument so that we can specify the torch version
(although we only use the default value, i.e. the latest stable version, for now).
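
As a rough sketch (an assumption, not necessarily what the final Dockerfile will look like), the $PYTORCH build argument could be kept while installing the three packages in a single RUN against the CUDA 11.3 index:

ARG PYTORCH=""
# if $PYTORCH is set, pin torch to that version, otherwise take the latest; torchvision/torchaudio are resolved alongside it
RUN [ ${#PYTORCH} -gt 0 ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; \
    python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio \
    --extra-index-url https://download.pytorch.org/whl/cu113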

Comment on lines +20 to +21
RUN [ ${#TORCH_VISION} -gt 0 ] && VERSION='torchvision=='TORCH_VISION'.*' || VERSION='torchvision'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/cu113
RUN [ ${#TORCH_AUDIO} -gt 0 ] && VERSION='torchaudio=='TORCH_AUDIO'.*' || VERSION='torchaudio'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/cu113
ydshieh (Collaborator, Author):

Somewhat ugly here, but it's just to get the CUDA 11.x builds installed.

(In another PR regarding the nightly CI, this part will need some rework.)

Member:

OK, looks good to me. A bit complex, but it seems to get the job done.

LysandreJik (Member) left a comment:

LGTM, can't wait to see it run!


ydshieh force-pushed the fix_pytorch_version_in_ci branch 3 times, most recently from 6473fe3 to e3aeb8e, on June 3, 2022 at 15:03
ydshieh (Collaborator, Author) commented Jun 3, 2022

[Just FYI] an update:

I am running the tests before merging. So far only a subset of the tests has been run, and I hit some issues:

PyTorch pipelines (single-GPU):
https://github.com/huggingface/transformers/runs/6728867682?check_suite_focus=true

Torch CUDA tests (multi-GPU):
https://github.com/huggingface/transformers/runs/6728868779?check_suite_focus=true

I will try to re-run them later, and will also wait for the scheduled CI runs this weekend in order to compare.

LysandreJik (Member):
Feel free to also spawn dummy machines to help you out, if that helps get this merged quicker.

ydshieh force-pushed the fix_pytorch_version_in_ci branch from 412ee29 to 2b6352c on June 6, 2022 at 18:45
ydshieh (Collaborator, Author) commented Jun 7, 2022

I am going to merge now.


@stas00

  • intel_extension_for_pytorch is added
    • the version is still derived with the same approach as before: $(python3 -c "from torch import version; print(version.__version__.split('+')[0])") - see the sketch below
    • I had to remove the pip uninstall -y that was in your original patch - the whole line wasn't working.
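
For illustration, a rough shell sketch of the version-derivation approach mentioned above (the exact pin format for intel_extension_for_pytorch is an assumption here; the real version handling is what #17586 is about):

# derive the installed torch version, dropping any local suffix such as +cu113
VERSION=$(python3 -c "from torch import version; print(version.__version__.split('+')[0])")
# pin intel_extension_for_pytorch to a matching version (assumed pin format)
python3 -m pip install --no-cache-dir "intel_extension_for_pytorch==${VERSION}"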

For the versions, let's discuss in #17586


@stas00

I observed that with the latest stable PyTorch/DeepSpeed, test_can_resume_training_normal_zero2_fp16 takes quite a long time to run the first time (on the push CI, it will time out):

First Run

63.46s call     tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_normal_zero2_fp16
10.97s call     tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_normal_zero3_fp16

Second Run

18.17s call     tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_normal_zero2_fp16
11.01s call     tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_normal_zero3_fp16

With the previous setting (PyTorch 1.9 + recompiled DeepSpeed)

6.16s call     tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_normal_zero2_fp16
2.83s call     tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_normal_zero3_fp16
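
(For reference, these are the per-test timings pytest reports with its --durations flag; one way to reproduce the comparison locally - assuming a GPU machine with the deepspeed test dependencies installed - would be something like:)

python3 -m pytest -v --durations=0 tests/deepspeed/test_deepspeed.py -k "test_can_resume_training_normal"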

ydshieh merged commit 9aa230a into main on Jun 7, 2022
ydshieh deleted the fix_pytorch_version_in_ci branch on June 7, 2022 at 09:53
stas00 (Contributor) commented Jun 7, 2022

I observed that with the latest stable PyTorch/DeepSpeed, test_can_resume_training_normal_zero2_fp16 takes quite a long time to run the first time (on the push CI, it will time out).

The very first deepspeed test under a deepspeed JIT install will have the overhead of building deepspeed, which takes about 1 min - depending on the hardware.

This doesn't happen if deepspeed was prebuilt before tests were run.

Is it possible that this test happens to be the first one to run?

ydshieh (Collaborator, Author) commented Jun 7, 2022

@stas00

(This is for the stable release of DeepSpeed + PyTorch)

Here is the job run page
https://github.com/huggingface/transformers/runs/6761224960?check_suite_focus=true

The tests are run in the following order:

# The following 3 are OK
tests/deepspeed/test_deepspeed.py::CoreIntegrationDeepSpeed::test_init_zero3_fp16 
tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_errors_zero2_fp16 
tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_errors_zero3_fp16 

# This one timed out
tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_normal_zero2_fp16

# This one is OK
tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_normal_zero3_fp16 

tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_config_object 
...

Dumb question: is the stable release of DeepSpeed == pre-built?

stas00 (Contributor) commented Jun 7, 2022

You can see that it was indeed building deepspeed during that test's run, see: https://github.com/huggingface/transformers/runs/6761224960?check_suite_focus=true#step:6:334

So we need to either have a longer timeout or always prebuild deepspeed.

Dumb question: is the stable release of DeepSpeed == pre-built?

Hope the following makes the whole situation loud and clear:

What is being built:

  • stable release install: pip install deepspeed==0.6.5
  • bleed/master install: pip install git+https://github.com/microsoft/DeepSpeed (or git clone ...; pip install -e .)

How is it being built:

  • pip install deepspeed -> JIT build: this will build deepspeed the first time it's used; pip just installs the source files here.
  • DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 pip install -e . --global-option="build_ext" --global-option="-j8" -> prebuilding: deepspeed is compiled at install time, so by the time it's first used it's already built and ready (see the sketch below).
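
To make this concrete, a minimal shell sketch of prebuilding the stable release (the version pin is just the example from above); DeepSpeed's ds_report utility then shows which ops were prebuilt and which would still be JIT-compiled on first use:

# DS_BUILD_AIO=1 needs libaio-dev (apt -y update && apt install -y libaio-dev)
# prebuild the CPU Adam / AIO / utils ops at install time instead of JIT-building them on first use
DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 \
    pip install deepspeed==0.6.5 --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

# report which ops are installed (prebuilt) vs. merely compatible (JIT-built on first use)
ds_report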

ydshieh (Collaborator, Author) commented Jun 7, 2022

Thank you @stas00, I have a much better understanding of the terminology now!

I will pre-build DeepSpeed so it will indeed be ready to be speedy!

ydshieh mentioned this pull request on Jun 8, 2022