
[Deepspeed Inference] HF Integration #14426

Draft: stas00 wants to merge 10 commits into main
Conversation

stas00 (Contributor) commented Nov 17, 2021

This PR is working on an integration of DeepSpeed Inference, which implements Tensor Parallelism. This is different from DeepSpeed ZeRO inference.

This is a very early draft.
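For context, the core API being integrated looks roughly like this (a minimal sketch with illustrative values, not the final Trainer integration):

# Sketch of DeepSpeed Inference with tensor parallelism (illustrative values).
# Unlike ZeRO inference, which partitions states and gathers weights on the fly,
# init_inference shards the layers' weights across the GPUs once, up front.
# Launch with the deepspeed launcher, e.g.: deepspeed --num_gpus=2 script.py
import torch
import deepspeed
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model = deepspeed.init_inference(
    model,
    mp_size=2,                        # tensor-parallel degree, one shard per GPU
    dtype=torch.half,
    replace_with_kernel_inject=True,  # use DeepSpeed's fused inference kernels
)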

To try:

cd transformers
export BS=16; rm -r output_dir; PYTHONPATH=src USE_TF=0 \
deepspeed --num_gpus=2 examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 \
--evaluation_strategy=steps --do_eval --label_smoothing 0.1 \
--learning_rate 3e-5 --logging_first_step --logging_steps 500 \
--max_source_length 128 --max_target_length 128 --overwrite_output_dir \
--per_device_eval_batch_size $BS --predict_with_generate --sortish_sampler \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config ro-en \
--source_prefix 'translate English to Romanian: ' --val_max_target_length 128 \
--warmup_steps 50 --max_eval_samples 50 --deepspeed_inference --skip_memory_metrics 0

and it currently hangs with --num_gpus > 1: one GPU finishes processing while the other is stuck preparing inputs, so the synchronization of the GPUs still needs to be figured out.
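My current guess at a direction (a sketch of an assumption, not code from this PR): under tensor parallelism every rank has to execute the same forward passes, so the eval data must not be sharded across ranks the way it is for data parallelism:

# Hypothetical sketch: with tensor parallelism, all ranks must run the same
# forward passes, so each rank should iterate over the FULL eval set in the
# same order. If a distributed sampler shards the data, the rank with fewer
# batches stops joining the collectives and the other rank hangs.
# `eval_dataset` and `batch_size` stand in for the Trainer's actual values.
from torch.utils.data import DataLoader, SequentialSampler

eval_dataloader = DataLoader(
    eval_dataset,
    sampler=SequentialSampler(eval_dataset),  # identical batches on every rank
    batch_size=batch_size,
)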

huggingface deleted a comment from the github-actions bot (Dec 17, 2021)
stas00 added the WIP label (Dec 17, 2021)
src/transformers/deepspeed.py (outdated)

@@ -33,6 +33,54 @@
logger = logging.get_logger(__name__)


inference_custom_map = dict(
    electra=dict(ElectraLayer=("output.dense",)),  # trailing comma makes this a 1-tuple, not a str
hyunwoongko (Contributor):
@stas00 This will only parallelize the output.dense layer; the other parts will be duplicated on all GPUs, resulting in memory inefficiency.

hyunwoongko (Contributor):

To parallelize all parts, all layer information must be provided. This would be similar to the policy of the existing DeepSpeed Inference, and not very different from the policy I used in Parallelformers.

hyunwoongko (Contributor):

@RezaYazdaniAminabadi Am I right? Or do you have a different view?

stas00 (Contributor, Author):

As the PR says, this is very early. Basically all I did was convert an example that Reza gave me so that it's integrated into the HF Trainer. So I'm treating it as a black box for now and waiting for Reza to complete the project before trying to understand how it works.

But I trust Reza will be happy to answer your question.

RezaYazdaniAminabadi (Contributor):

@hyunwoongko, this only shows which linear layers would require an all_reduce, so it is not going to use the same policy as when injecting the kernels. You can find more detail on how the other layers are partitioned in the replace_module function in DeepSpeed. Basically, the policy here just shows which parts need to be partitioned horizontally, whereas the rest are partitioned vertically. Does it make sense?
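To make that concrete, here is a minimal sketch of such a policy for a T5 model (the layer names follow the DeepSpeed examples and are illustrative, not this PR's final mapping):

# Sketch of an injection policy (illustrative, not this PR's mapping).
# The tuple names the linear layers whose outputs need an all_reduce:
# their weights are partitioned horizontally (along the input dimension),
# so each rank computes a partial sum; every other linear in the block
# is partitioned vertically (along the output dimension) and needs no reduce.
import torch
import deepspeed
from transformers import AutoModelForSeq2SeqLM
from transformers.models.t5.modeling_t5 import T5Block

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.half,
    injection_policy={T5Block: ("SelfAttention.o", "EncDecAttention.o", "DenseReluDense.wo")},
)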

stas00 (Contributor, Author):

Thank you for the explanatory notes, @RezaYazdaniAminabadi - I have added them to the file, so this is covered.

HuggingFaceDocBuilderDev (bot):

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Labels: Inference, WIP
4 participants