[Setup] Enable TORCH_CUDA_ARCH_LIST for selecting target GPUs #1074

Merged
merged 5 commits into main from fix-setup on Sep 26, 2023

Conversation

WoosukKwon
Collaborator

Fixes #1070

The TORCH_CUDA_ARCH_LIST environment variable is the standard way to specify the target GPU architectures when building a PyTorch project. This PR enables the use of the variable in our setup.py. This will be especially useful for those who build vLLM images for specific GPUs.
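
For readers unfamiliar with the variable, here is a minimal sketch of how a setup.py can honor it; the helper name parse_arch_list and the exact architecture set are illustrative assumptions, not the merged vLLM code:

import os

# Hypothetical sketch of honoring TORCH_CUDA_ARCH_LIST in a setup.py.
SUPPORTED_ARCHS = {"7.0", "7.5", "8.0", "8.6", "8.9", "9.0"}  # assumed set

def parse_arch_list(env_value):
    # Accept space- or semicolon-separated entries such as "8.0 8.6+PTX".
    archs = set()
    for token in env_value.replace(";", " ").split():
        arch = token[:-len("+PTX")] if token.endswith("+PTX") else token
        if arch not in SUPPORTED_ARCHS:
            raise ValueError(f"Unsupported CUDA architecture: {arch}")
        archs.add(arch)
    return archs

compute_capabilities = set()
env_arch_list = os.environ.get("TORCH_CUDA_ARCH_LIST")
if env_arch_list:
    compute_capabilities = parse_arch_list(env_arch_list)
# If the variable is unset, fall back to detecting the GPUs on the build
# machine (not shown here).

With a scheme like this, something along the lines of TORCH_CUDA_ARCH_LIST="8.0 8.6" pip install -e . limits the build to the requested targets.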

@WoosukKwon
Collaborator Author

@zhuohan123 This PR is also ready for review.

@WoosukKwon WoosukKwon mentioned this pull request Sep 18, 2023
@v1nc3nt27

v1nc3nt27 commented Sep 19, 2023

When I build this branch on an A10, it just stops after the initial request (shortened):

INFO 09-19 16:02:13 async_llm_engine.py:328] Received request cmpl-833d8e3bd8b84aa9a7a4675abd62a19b: prompt: ... 29914, 25580, 29962].

If I use branch merge_quant, it works fine.

All I do to test this on both this branch and the merge_quant branch is:

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout fix-setup
pip install -e .
pip install fschat[webui,model_worker]
python3 -m vllm.entrypoints.openai.api_server --model abhinavkulkarni/meta-llama-Llama-2-13b-chat-hf-w4-g128-awq --quantization awq --max-num-batched-tokens 4096 --host 0.0.0.0 --port 8080

Is there a way to increase logging and find out what is happening? In the case where it doesn't work, GPU memory stays at around 18GB; in the other case it goes up to 23GB, so I don't think much is happening. There is no error coming back from the server to the client sending the request; it just stays open.

UPDATE:

Launching the default API server via python -m vllm.entrypoints.api_server --model abhinavkulkarni/meta-llama-Llama-2-13b-chat-hf-w4-g128-awq --quantization awq --max-num-batched-tokens 4096 --host 0.0.0.0 --port 8080 and using the example curl command from the docs also returns no result, but the logs show that the prompt token ids are None:

INFO 09-19 16:18:56 async_llm_engine.py:372] Received request 49d601e1afbe4075987bea7877cb06ef: prompt: 'San Francisco is a', sampling params: SamplingParams(n=4, best_of=4, presence_penalty=0.0, frequency_penalty=0.0, temperature=0, top_p=1.0, top_k=-1, use_beam_search=True, length_penalty=1.0, early_stopping=False, stop=[], ignore_eos=False, max_tokens=7, logprobs=None), prompt token ids: None.

@WoosukKwon
Collaborator Author

@v1nc3nt27 What is the merge_quant branch? Our repo doesn't have a branch with that name.

@v1nc3nt27

@WoosukKwon Sorry, it was this branch: #1032

# based on the NVCC CUDA version.
compute_capabilities = set(SUPPORTED_ARCHS)
if nvcc_cuda_version < Version("11.1"):
compute_capabilities.remove("8.6")
Member

discard does not raise an error if the element is not present in the set. Similarly for the remove below.

Suggested change
compute_capabilities.remove("8.6")
compute_capabilities.discard("8.6")
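
For reference, a quick illustration of the behavioral difference (plain Python, independent of this PR):

caps = {"7.0", "7.5", "8.0"}
caps.discard("8.6")   # no-op: "8.6" is absent, no error is raised
caps.remove("8.0")    # removes the element as usual
caps.remove("8.6")    # raises KeyError because "8.6" is not in the set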

Member

Also, we might need to remove *+PTX.

Collaborator Author

L80 and below are executed when compute_capabilities is empty. In this case, we add all SUPPORTED_ARCHS and remove some of them based on the user's CUDA version. So I think using remove is more appropriate than discard, and we don't need to remove *+PTX because SUPPORTED_ARCHS does not include it.
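
In other words, the fallback path looks roughly like the following sketch, based on the snippet above; nvcc_cuda_version is assumed to be a packaging.version.Version parsed from nvcc --version, and the exact version checks in the merged code may differ:

from packaging.version import Version  # Version is used the same way in the snippet above

if not compute_capabilities:
    # Nothing was requested via TORCH_CUDA_ARCH_LIST and no GPU was detected,
    # so target every supported architecture the installed toolkit can compile for.
    compute_capabilities = set(SUPPORTED_ARCHS)
    if nvcc_cuda_version < Version("11.1"):
        compute_capabilities.remove("8.6")
    if nvcc_cuda_version < Version("11.8"):
        compute_capabilities.remove("8.9")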

Comment on lines +99 to +103
# CUDA 11.8 is required to generate the code targeting compute capability 8.9.
# However, GPUs with compute capability 8.9 can also run the code generated by
# the previous versions of CUDA 11 and targeting compute capability 8.0.
# Therefore, if CUDA 11.8 is not available, we target compute capability 8.0
# instead of 8.9.
Member

Actually, should we also print a warning for this?

Collaborator Author

Sounds good! Added.
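
The added warning presumably looks something like this sketch; the exact message and placement in the merged setup.py may differ:

import warnings
from packaging.version import Version

if nvcc_cuda_version < Version("11.8") and "8.9" in compute_capabilities:
    # CUDA 11.8+ is required to generate code for compute capability 8.9;
    # fall back to 8.0, which 8.9 GPUs can still run.
    warnings.warn(
        "CUDA 11.8 or higher is required to target compute capability 8.9. "
        "Targeting compute capability 8.0 instead.")
    compute_capabilities.discard("8.9")
    compute_capabilities.add("8.0")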

@WoosukKwon WoosukKwon left a comment

@zhuohan123 Addressed your comments. PTAL.

@zhuohan123 zhuohan123 left a comment

LGTM! Thanks for the contribution!

@WoosukKwon WoosukKwon merged commit a425bd9 into main Sep 26, 2023
2 checks passed
@WoosukKwon WoosukKwon deleted the fix-setup branch September 26, 2023 17:21
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
sjchoi1 pushed a commit to casys-kaist-internal/vllm that referenced this pull request May 7, 2024