Use latest stable PyTorch/DeepSpeed for Push & Scheduled CI #17417

Merged
10 commits merged into main from fix_pytorch_version_in_ci on Jun 7, 2022

Conversation

ydshieh (Collaborator) commented May 25, 2022

What does this PR do?

(As there are a few PRs waiting for review, feel free to postpone the review for this PR if things get confusing at this moment)

Currently:

  • scheduled CI uses latest stable PyTorch (OK) + nightly DeepSpeed (Not OK)
  • push CI uses PyTorch 1.9 (Not OK) + latest stable DeepSpeed (OK)

This PR fixes this by using the latest stable PyTorch and DeepSpeed for both the push and the scheduled CI.

(Let me run a dummy test before merge 🙏 )

ydshieh (Collaborator, Author) left a comment:

Some comments

Comment on lines 266 to 264
- name: Install dependencies
run: |
apt -y update && apt install -y libaio-dev
pip install --upgrade pip
rm -rf ~/.cache/torch_extensions/ # shared between conflicting builds
pip install .[testing,deepspeed,fairscale]

ydshieh (Collaborator, Author):

No need for this step anymore -> everything is already installed in the image huggingface/transformers-pytorch-deepspeed-latest-gpu, except fairscale (which we don't test at the moment).
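
For context, a minimal sketch of what the job can look like once it relies on the prebuilt image instead of installing dependencies in a step. The job id, runner labels, and container options below are illustrative assumptions, not taken from the actual workflow file, and the usual checkout/update steps are omitted:

run_tests_torch_deepspeed_gpu:                      # hypothetical job id
  runs-on: [self-hosted, docker-gpu]                # illustrative runner labels
  container:
    image: huggingface/transformers-pytorch-deepspeed-latest-gpu
    options: --gpus all                             # illustrative container options
  steps:
    # no "Install dependencies" step needed: PyTorch, DeepSpeed, libaio-dev etc. are baked into the image
    - name: Run DeepSpeed tests
      run: python3 -m pytest -v tests/deepspeed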

Member:

Nice!

Comment on lines -271 to -316
- name: Re-compile DeepSpeed
working-directory: /workspace
run: |
pip install deepspeed # installs the deps correctly
rm -rf DeepSpeed
git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build
DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

ydshieh (Collaborator, Author):

scheduled CI should use stable DeepSpeed

Member:

Sounds good

Comment on lines -14 to -15
RUN git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build && \
DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
ydshieh (Collaborator, Author):

scheduled CI should use stable DeepSpeed

Comment on lines 15 to 16
RUN python3 -m pip uninstall -y torch torchvision torchaudio
RUN python3 -m pip install --no-cache-dir -U torch torchvision torchaudio
Contributor:

won't the second command that includes -U take care of uninstall automatically?

ydshieh (Collaborator, Author), May 25, 2022:

You caught me! Fixed.

Contributor:

there are some peculiar circumstances where an explicit uninstall is needed.

stas00 (Contributor), May 25, 2022:

but I think you are also missing the CUDA information there - it will install the CUDA 10.2 version by default

you most likely want:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

it's time to move to CUDA 11.x

and of course -U - and why not use the cache?

Contributor:

for all the options please see: https://pytorch.org/get-started/locally/

ydshieh (Collaborator, Author):

Yeah, I had some doubts regarding the CUDA version. Thank you for pointing this out.
I will need to change some other Dockerfiles too, for example transformers-all-latest-gpu. On it.

ydshieh (Collaborator, Author):

Regarding --no-cache-dir, I just copied it from other places. People on Stack Overflow say it can reduce the Docker image size.
@LysandreJik might have other insights here?

Contributor:

If you don't mind redownloading the same packages all the time, then it works. Though perhaps downloading the larger Docker image is more expensive.

Member:

--no-cache-dir is there to get a leaner image. The image is already quite heavy with frameworks such as PyTorch, which end up weighing 1GB+; skipping the cache ensures the download is as fast as it can be.
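
For illustration, a minimal example of the flags being discussed (the package list is just an example):

RUN python3 -m pip install --no-cache-dir -U torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
# --no-cache-dir: don't keep downloaded wheels in pip's cache, so they don't end up in the image layer
# -U: upgrade to the latest available versions even if the packages are already installed
# --extra-index-url .../cu113: pull the CUDA 11.3 builds instead of the default CUDA 10.2 ones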

HuggingFaceDocBuilderDev commented May 25, 2022

The documentation is not available anymore as the PR was closed or merged.

ydshieh (Collaborator, Author) commented May 25, 2022

@stas00

Regarding transformers-pytorch-gpu:

RUN [ ${#PYTORCH} -gt 0 ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; python3 -m pip install --no-cache-dir -U $VERSION

Here we only upgrade torch, but not torchvision and torchaudio. Those two are installed in a previous step:

RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-torch,testing]

I am not sure if we need to upgrade all 3 packages at the same time. The PyTorch installation instructions always install these 3 together, so I guess yes??

stas00 (Contributor) commented May 25, 2022

Yes, it's usually easiest to always handle all 3 packages as a single unit to avoid incompatibility conflicts down the road.

I first tried to "optimize" and only install the other 2 when needed, but later I switched to always installing the 3 together. The other 2 are tiny compared to the main package.

ydshieh (Collaborator, Author) commented May 25, 2022

Yes, it's usually easiest to always handle all 3 packages as a single unit to avoid incompatibility conflicts down the road.

I first tried to "optimize" and only install the other 2 when needed, but later I switched to always installing the 3 together. The other 2 are tiny compared to the main package.

OK, I will try to get it done. It's not quite straightforward here, since we use a $PYTORCH argument so that we can specify the torch version
(although we only use the default value, i.e. the latest stable version, for now).
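
As a rough sketch (an assumption, not necessarily what the final Dockerfile will look like), the $PYTORCH build argument could be kept while installing the three packages in a single RUN against the CUDA 11.3 index:

ARG PYTORCH=""
# if $PYTORCH is set, pin torch to that version, otherwise take the latest; torchvision/torchaudio are resolved alongside it
RUN [ ${#PYTORCH} -gt 0 ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; \
    python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio \
    --extra-index-url https://download.pytorch.org/whl/cu113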

Comment on lines +20 to +21
RUN [ ${#TORCH_VISION} -gt 0 ] && VERSION='torchvision=='TORCH_VISION'.*' || VERSION='torchvision'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/cu113
RUN [ ${#TORCH_AUDIO} -gt 0 ] && VERSION='torchaudio=='TORCH_AUDIO'.*' || VERSION='torchaudio'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/cu113
ydshieh (Collaborator, Author):

Somewhat ugly here, but it's just to get the CUDA 11.x builds installed.

(In another PR regarding the nightly CI, this part will need some rework.)

Member:

OK, looks good to me. A bit complex, but it seems to get the job done.

LysandreJik (Member) left a comment:

LGTM, can't wait to see it run!


ydshieh force-pushed the fix_pytorch_version_in_ci branch 3 times, most recently from 6473fe3 to e3aeb8e, on June 3, 2022 at 15:03
ydshieh (Collaborator, Author) commented Jun 3, 2022

[Just FYI] an update:

I am running the tests before merging. So far only a subset of the tests has been run, and I hit some issues:

PyTorch pipelines (single-GPU):
https://github.com/huggingface/transformers/runs/6728867682?check_suite_focus=true

Torch CUDA tests (multi-GPU):
https://github.com/huggingface/transformers/runs/6728868779?check_suite_focus=true

I will try to re-run them later, and will also wait for the scheduled CI runs this weekend in order to compare.

LysandreJik (Member):
Feel free to also spawn dummy machines to help you out, if that helps get this merged quicker.

ydshieh force-pushed the fix_pytorch_version_in_ci branch from 412ee29 to 2b6352c on June 6, 2022 at 18:45
ydshieh (Collaborator, Author) commented Jun 7, 2022

I am going to merge now.


@stas00

  • intel_extension_for_pytorch is added
    • the version is still derived with the same approach as before: $(python3 -c "from torch import version; print(version.__version__.split('+')[0])") - see the sketch below
    • I had to remove the pip uninstall -y that was in your original patch - the whole line wasn't working.
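
For illustration, a rough shell sketch of the version-derivation approach mentioned above (the exact pin format for intel_extension_for_pytorch is an assumption here; the real version handling is what #17586 is about):

# derive the installed torch version, dropping any local suffix such as +cu113
VERSION=$(python3 -c "from torch import version; print(version.__version__.split('+')[0])")
# pin intel_extension_for_pytorch to a matching version (assumed pin format)
python3 -m pip install --no-cache-dir "intel_extension_for_pytorch==${VERSION}"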

For the versions, let's discuss in #17586


@stas00

I observed that with the latest stable PyTorch/DeepSpeed, test_can_resume_training_normal_zero2_fp16 takes quite a long time to run the first time (on the push CI, it will time out):

First Run

63.46s call     tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_normal_zero2_fp16
10.97s call     tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_normal_zero3_fp16

Second Run

18.17s call     tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_normal_zero2_fp16
11.01s call     tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_normal_zero3_fp16

With the previous setting (PyTorch 1.9 + recompiled DeepSpeed)

6.16s call     tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_normal_zero2_fp16
2.83s call     tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_normal_zero3_fp16
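
(For reference, these are the per-test timings pytest reports with its --durations flag; one way to reproduce the comparison locally - assuming a GPU machine with the deepspeed test dependencies installed - would be something like:)

python3 -m pytest -v --durations=0 tests/deepspeed/test_deepspeed.py -k "test_can_resume_training_normal"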

ydshieh merged commit 9aa230a into main on Jun 7, 2022
ydshieh deleted the fix_pytorch_version_in_ci branch on June 7, 2022 at 09:53
stas00 (Contributor) commented Jun 7, 2022

I observed that with the latest stable PyTorch/DeepSpeed, test_can_resume_training_normal_zero2_fp16 takes quite a long time to run the first time (on the push CI, it will time out).

The very first deepspeed test under a deepspeed JIT install will have the overhead of building deepspeed, which takes about 1 min - depending on the hardware.

This doesn't happen if deepspeed was prebuilt before tests were run.

Is it possible that this test happens to be the first one to run?

ydshieh (Collaborator, Author) commented Jun 7, 2022

@stas00

(This is for the stable release of DeepSpeed + PyTorch)

Here is the job run page
https://github.com/huggingface/transformers/runs/6761224960?check_suite_focus=true

The tests are run in the following order:

# The following 3 are OK
tests/deepspeed/test_deepspeed.py::CoreIntegrationDeepSpeed::test_init_zero3_fp16 
tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_errors_zero2_fp16 
tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_errors_zero3_fp16 

# This one timed out
tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_normal_zero2_fp16

# This one is OK
tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_can_resume_training_normal_zero3_fp16 

tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_config_object 
...

Dumb question: is the stable release of DeepSpeed == pre-built?

stas00 (Contributor) commented Jun 7, 2022

You can see that it was indeed building deepspeed during that test's run, see: https://github.com/huggingface/transformers/runs/6761224960?check_suite_focus=true#step:6:334

So we need to either have a longer timeout or always prebuild deepspeed.

Dumb question: is the stable release of DeepSpeed == pre-built?

Hope the following makes the whole situation loud and clear:

What is being built:

  • stable release install: pip install deepspeed==0.6.5
  • bleed/master install: pip install git+https://github.com/microsoft/DeepSpeed (or git clone ...; pip install -e .)

How is it being built:

  • pip install deepspeed -> JIT build: this will build deepspeed the first time it's used; pip just installs the source files here.
  • DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 pip install -e . --global-option="build_ext" --global-option="-j8" -> prebuilding: deepspeed is compiled at install time, so by the time it's first used it's already built and ready (see the sketch below).
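
To make this concrete, a minimal shell sketch of prebuilding the stable release (the version pin is just the example from above); DeepSpeed's ds_report utility then shows which ops were prebuilt and which would still be JIT-compiled on first use:

# DS_BUILD_AIO=1 needs libaio-dev (apt -y update && apt install -y libaio-dev)
# prebuild the CPU Adam / AIO / utils ops at install time instead of JIT-building them on first use
DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 \
    pip install deepspeed==0.6.5 --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

# report which ops are installed (prebuilt) vs. merely compatible (JIT-built on first use)
ds_report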

ydshieh (Collaborator, Author) commented Jun 7, 2022

Thank you @stas00, I have a much better understanding of the terminology now!

I will pre-build DeepSpeed so it will indeed be ready to be speedy!

ydshieh mentioned this pull request on Jun 8, 2022