
[Misc] Support quantization of MllamaForCausalLM #8822

Merged 1 commit into main on Sep 25, 2024
Conversation

@mgoin (Collaborator) commented on Sep 25, 2024

Tested using https://huggingface.co/mgoin/Llama-3.2-11B-Vision-Instruct-FP8-Dynamic with only the language_model modules quantized. The multi_modal_projector and vision_model structures were ignored.
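
For reference, a checkpoint along these lines could be produced with llm-compressor roughly as sketched below. This is a minimal sketch, not the exact script used for this PR; the specific ignore patterns and regex syntax are illustrative assumptions.

```python
# Hypothetical sketch: FP8-dynamic quantization of only the language_model of
# Llama 3.2 Vision with llm-compressor, skipping the vision tower and projector.
# The ignore patterns below are illustrative assumptions, not copied from this PR.
from transformers import AutoProcessor, MllamaForConditionalGeneration
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(model_id)

# FP8 dynamic (weights + activations) on Linear layers; no calibration data needed.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:.*multi_modal_projector.*", "re:.*vision_model.*"],
)

oneshot(model=model, recipe=recipe)

save_dir = "Llama-3.2-11B-Vision-Instruct-FP8-Dynamic"
model.save_pretrained(save_dir, save_compressed=True)
processor.save_pretrained(save_dir)
```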

Here is the output from examples/offline_inference_vision_language.py with the model replaced by the checkpoint above:

python examples/offline_inference_vision_language.py -m mllama
/home/mgoin/code/vllm/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
WARNING 09-25 20:57:40 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 09-25 20:57:40 config.py:389] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 09-25 20:57:40 llm_engine.py:226] Initializing an LLM engine (vdev) with config: model='/home/mgoin/code/llm-compressor/examples/quantization_w8a8_fp8/Llama-3.2-11B-Vision-Instruct-FP8-Dynamic', speculative_config=None, tokenizer='/home/mgoin/code/llm-compressor/examples/quantization_w8a8_fp8/Llama-3.2-11B-Vision-Instruct-FP8-Dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/mgoin/code/llm-compressor/examples/quantization_w8a8_fp8/Llama-3.2-11B-Vision-Instruct-FP8-Dynamic, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 09-25 20:57:40 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 09-25 20:57:40 selector.py:116] Using XFormers backend.
/home/mgoin/venvs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/mgoin/venvs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 09-25 20:57:42 model_runner.py:1014] Starting to load model /home/mgoin/code/llm-compressor/examples/quantization_w8a8_fp8/Llama-3.2-11B-Vision-Instruct-FP8-Dynamic...
INFO 09-25 20:57:42 selector.py:116] Using XFormers backend.
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:01,  1.44it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.40it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  1.71it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  1.62it/s]

INFO 09-25 20:57:44 model_runner.py:1025] Loading model weights took 11.7887 GB
INFO 09-25 20:57:44 enc_dec_model_runner.py:297] Starting profile run for multi-modal models.
WARNING 09-25 20:57:44 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
INFO 09-25 20:57:51 gpu_executor.py:122] # GPU blocks: 12546, # CPU blocks: 1638
WARNING 09-25 20:57:53 preprocess.py:86] Falling back on <BOS> for decoder start token id because decoder start token id is not available.
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  2.42it/s, est. speed input: 24.20 toks/s, output: 154.88 toks/s]
 The image shows a cherry blossom tree in full bloom, with the Tokyo Tower in the background. The cherry blossoms are pink and white, and they are in full bloom, covering the entire frame of the image. The branches of the tree are bare, and the flowers are scattered throughout the branches. In the background,
 The image shows a tall white tower with a round top, surrounded by pink cherry blossom trees. The tower is in the center of the image and is framed by branches of pink cherry blossoms. The sky is blue and clear. The overall atmosphere suggests a springtime scene, with the cherry blossoms in full bloom and
 The image shows a white tower, possibly a skyscraper or a tower, with pink cherry blossoms in the foreground. The tower is tall and slender, with a rounded top and a series of windows running along its length. The cherry blossoms are in full bloom, with delicate pink petals and green leaves. The background
 The image shows a white tower framed by pink cherry blossoms. The tower is tall and slender, with a rounded top and a series of windows running up its length. It is set against a bright blue sky, suggesting that it is either early morning or late afternoon. The cherry blossoms are in full bloom, with
[rank0]:[W925 20:57:55.317346400 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
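
The quantized checkpoint can also be loaded directly through vLLM's Python API. The following is a minimal sketch: the image path and question are placeholders, and max_model_len is reduced to avoid the 131072-token context warning shown in the log above.

```python
# Minimal sketch: run the FP8-dynamic Mllama checkpoint through vLLM's LLM API.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(
    model="mgoin/Llama-3.2-11B-Vision-Instruct-FP8-Dynamic",
    max_model_len=4096,   # reduced from 131072 to keep profiling/KV cache small
    max_num_seqs=4,
    enforce_eager=True,
)

# Mllama prompt format used by the offline inference example.
prompt = "<|image|><|begin_of_text|>What is the content of this image?"
image = Image.open("cherry_blossom.jpg")  # placeholder local image

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```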


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@ywang96 (Member) left a comment


Thanks for this patch!

@comaniac comaniac enabled auto-merge (squash) September 25, 2024 21:14
@comaniac comaniac added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Sep 25, 2024
@simon-mo simon-mo merged commit 7193774 into main Sep 25, 2024
31 of 39 checks passed
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request Sep 27, 2024