[Setup] Enable TORCH_CUDA_ARCH_LIST for selecting target GPUs #1074

Merged
merged 5 commits into main from fix-setup on Sep 26, 2023

Conversation

WoosukKwon
Collaborator

Fixes #1070

The TORCH_CUDA_ARCH_LIST environment variable is the standard way to specify the target GPU architectures when building a PyTorch project. This PR enables the use of the variable in our setup.py. This will be especially useful for those who build vLLM images for specific GPUs.
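
For readers unfamiliar with the variable, here is a minimal sketch of how a setup.py can honor it; the helper name parse_arch_list and the exact architecture set are illustrative assumptions, not the merged vLLM code:

import os

# Hypothetical sketch of honoring TORCH_CUDA_ARCH_LIST in a setup.py.
SUPPORTED_ARCHS = {"7.0", "7.5", "8.0", "8.6", "8.9", "9.0"}  # assumed set

def parse_arch_list(env_value):
    # Accept space- or semicolon-separated entries such as "8.0 8.6+PTX".
    archs = set()
    for token in env_value.replace(";", " ").split():
        arch = token[:-len("+PTX")] if token.endswith("+PTX") else token
        if arch not in SUPPORTED_ARCHS:
            raise ValueError(f"Unsupported CUDA architecture: {arch}")
        archs.add(arch)
    return archs

compute_capabilities = set()
env_arch_list = os.environ.get("TORCH_CUDA_ARCH_LIST")
if env_arch_list:
    compute_capabilities = parse_arch_list(env_arch_list)
# If the variable is unset, fall back to detecting the GPUs on the build
# machine (not shown here).

With a scheme like this, something along the lines of TORCH_CUDA_ARCH_LIST="8.0 8.6" pip install -e . limits the build to the requested targets.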

@WoosukKwon
Collaborator Author

@zhuohan123 This PR is also ready for review.

@WoosukKwon WoosukKwon mentioned this pull request Sep 18, 2023
@v1nc3nt27

v1nc3nt27 commented Sep 19, 2023

When I build this branch on an A10, it just stops after the initial request (shortened):

INFO 09-19 16:02:13 async_llm_engine.py:328] Received request cmpl-833d8e3bd8b84aa9a7a4675abd62a19b: prompt: ... 29914, 25580, 29962].

If I use branch merge_quant, it works fine.

All I do to test this on both this branch and the merge_quant branch is:

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout fix-setup
pip install -e .
pip install fschat[webui,model_worker]
python3 -m vllm.entrypoints.openai.api_server --model abhinavkulkarni/meta-llama-Llama-2-13b-chat-hf-w4-g128-awq --quantization awq --max-num-batched-tokens 4096 --host 0.0.0.0 --port 8080

Is there a way to increase logging and find out what is happening? In the case where it doesn't work, GPU memory stays at around 18GB; in the other case it goes up to 23GB, so I don't think much is happening. There is no error coming back from the server to the client sending the request; it just stays open.

UPDATE:

Launching the default API server via python -m vllm.entrypoints.api_server --model abhinavkulkarni/meta-llama-Llama-2-13b-chat-hf-w4-g128-awq --quantization awq --max-num-batched-tokens 4096 --host 0.0.0.0 --port 8080 and using the example curl command from the docs also returns no result, but the logs show that the prompt token ids are None:

INFO 09-19 16:18:56 async_llm_engine.py:372] Received request 49d601e1afbe4075987bea7877cb06ef: prompt: 'San Francisco is a', sampling params: SamplingParams(n=4, best_of=4, presence_penalty=0.0, frequency_penalty=0.0, temperature=0, top_p=1.0, top_k=-1, use_beam_search=True, length_penalty=1.0, early_stopping=False, stop=[], ignore_eos=False, max_tokens=7, logprobs=None), prompt token ids: None.

@WoosukKwon
Collaborator Author

@v1nc3nt27 What is the merge_quant branch? Our repo doesn't have a branch with that name.

@v1nc3nt27

@WoosukKwon Sorry, it was this branch: #1032

# based on the NVCC CUDA version.
compute_capabilities = set(SUPPORTED_ARCHS)
if nvcc_cuda_version < Version("11.1"):
compute_capabilities.remove("8.6")
Member

discard does not raise an error if the element is not present in the set. Similarly for the remove below.

Suggested change
compute_capabilities.remove("8.6")
compute_capabilities.discard("8.6")
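
For reference, a quick illustration of the behavioral difference (plain Python, independent of this PR):

caps = {"7.0", "7.5", "8.0"}
caps.discard("8.6")   # no-op: "8.6" is absent, no error is raised
caps.remove("8.0")    # removes the element as usual
caps.remove("8.6")    # raises KeyError because "8.6" is not in the set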

Member

Also, we might need to remove *+PTX.

Collaborator Author

L80 and below are executed when compute_capabilities is empty. In this case, we add all SUPPORTED_ARCHS and remove some of them based on the user's CUDA version. So I think using remove is more appropriate than discard, and we don't need to remove *+PTX because SUPPORTED_ARCHS does not include it.
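
In other words, the fallback path looks roughly like the following sketch, based on the snippet above; nvcc_cuda_version is assumed to be a packaging.version.Version parsed from nvcc --version, and the exact version checks in the merged code may differ:

from packaging.version import Version  # Version is used the same way in the snippet above

if not compute_capabilities:
    # Nothing was requested via TORCH_CUDA_ARCH_LIST and no GPU was detected,
    # so target every supported architecture the installed toolkit can compile for.
    compute_capabilities = set(SUPPORTED_ARCHS)
    if nvcc_cuda_version < Version("11.1"):
        compute_capabilities.remove("8.6")
    if nvcc_cuda_version < Version("11.8"):
        compute_capabilities.remove("8.9")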

Comment on lines +99 to +103
# CUDA 11.8 is required to generate the code targeting compute capability 8.9.
# However, GPUs with compute capability 8.9 can also run the code generated by
# the previous versions of CUDA 11 and targeting compute capability 8.0.
# Therefore, if CUDA 11.8 is not available, we target compute capability 8.0
# instead of 8.9.
Member

Actually, should we also print a warning for this?

Collaborator Author

Sounds good! Added.
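
The added warning presumably looks something like this sketch; the exact message and placement in the merged setup.py may differ:

import warnings
from packaging.version import Version

if nvcc_cuda_version < Version("11.8") and "8.9" in compute_capabilities:
    # CUDA 11.8+ is required to generate code for compute capability 8.9;
    # fall back to 8.0, which 8.9 GPUs can still run.
    warnings.warn(
        "CUDA 11.8 or higher is required to target compute capability 8.9. "
        "Targeting compute capability 8.0 instead.")
    compute_capabilities.discard("8.9")
    compute_capabilities.add("8.0")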

@WoosukKwon WoosukKwon left a comment

@zhuohan123 Addressed your comments. PTAL.

@zhuohan123 zhuohan123 left a comment

LGTM! Thanks for the contribution!

@WoosukKwon WoosukKwon merged commit a425bd9 into main Sep 26, 2023
2 checks passed
@WoosukKwon WoosukKwon deleted the fix-setup branch September 26, 2023 17:21
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
sjchoi1 pushed a commit to casys-kaist-internal/vllm that referenced this pull request May 7, 2024