
[Misc] Support quantization of MllamaForCausalLM #8822

Merged 1 commit into main on Sep 25, 2024
Conversation

@mgoin (Collaborator) commented on Sep 25, 2024

Tested using https://huggingface.co/mgoin/Llama-3.2-11B-Vision-Instruct-FP8-Dynamic with only the language_model modules quantized. The multi_modal_projector and vision_model structures were ignored.
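
For reference, a checkpoint along these lines could be produced with llm-compressor roughly as sketched below. This is a minimal sketch, not the exact script used for this PR; the specific ignore patterns and regex syntax are illustrative assumptions.

```python
# Hypothetical sketch: FP8-dynamic quantization of only the language_model of
# Llama 3.2 Vision with llm-compressor, skipping the vision tower and projector.
# The ignore patterns below are illustrative assumptions, not copied from this PR.
from transformers import AutoProcessor, MllamaForConditionalGeneration
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(model_id)

# FP8 dynamic (weights + activations) on Linear layers; no calibration data needed.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:.*multi_modal_projector.*", "re:.*vision_model.*"],
)

oneshot(model=model, recipe=recipe)

save_dir = "Llama-3.2-11B-Vision-Instruct-FP8-Dynamic"
model.save_pretrained(save_dir, save_compressed=True)
processor.save_pretrained(save_dir)
```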

Here is the output from examples/offline_inference_vision_language.py with the model replaced by the checkpoint above:

python examples/offline_inference_vision_language.py -m mllama
/home/mgoin/code/vllm/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
WARNING 09-25 20:57:40 arg_utils.py:940] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
WARNING 09-25 20:57:40 config.py:389] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 09-25 20:57:40 llm_engine.py:226] Initializing an LLM engine (vdev) with config: model='/home/mgoin/code/llm-compressor/examples/quantization_w8a8_fp8/Llama-3.2-11B-Vision-Instruct-FP8-Dynamic', speculative_config=None, tokenizer='/home/mgoin/code/llm-compressor/examples/quantization_w8a8_fp8/Llama-3.2-11B-Vision-Instruct-FP8-Dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/mgoin/code/llm-compressor/examples/quantization_w8a8_fp8/Llama-3.2-11B-Vision-Instruct-FP8-Dynamic, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 09-25 20:57:40 enc_dec_model_runner.py:140] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 09-25 20:57:40 selector.py:116] Using XFormers backend.
/home/mgoin/venvs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/mgoin/venvs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 09-25 20:57:42 model_runner.py:1014] Starting to load model /home/mgoin/code/llm-compressor/examples/quantization_w8a8_fp8/Llama-3.2-11B-Vision-Instruct-FP8-Dynamic...
INFO 09-25 20:57:42 selector.py:116] Using XFormers backend.
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:01,  1.44it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.40it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  1.71it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  1.62it/s]

INFO 09-25 20:57:44 model_runner.py:1025] Loading model weights took 11.7887 GB
INFO 09-25 20:57:44 enc_dec_model_runner.py:297] Starting profile run for multi-modal models.
WARNING 09-25 20:57:44 registry.py:238] Expected at least 8192 dummy encoder tokens for profiling, but found 6404 tokens instead.
INFO 09-25 20:57:51 gpu_executor.py:122] # GPU blocks: 12546, # CPU blocks: 1638
WARNING 09-25 20:57:53 preprocess.py:86] Falling back on <BOS> for decoder start token id because decoder start token id is not available.
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  2.42it/s, est. speed input: 24.20 toks/s, output: 154.88 toks/s]
 The image shows a cherry blossom tree in full bloom, with the Tokyo Tower in the background. The cherry blossoms are pink and white, and they are in full bloom, covering the entire frame of the image. The branches of the tree are bare, and the flowers are scattered throughout the branches. In the background,
 The image shows a tall white tower with a round top, surrounded by pink cherry blossom trees. The tower is in the center of the image and is framed by branches of pink cherry blossoms. The sky is blue and clear. The overall atmosphere suggests a springtime scene, with the cherry blossoms in full bloom and
 The image shows a white tower, possibly a skyscraper or a tower, with pink cherry blossoms in the foreground. The tower is tall and slender, with a rounded top and a series of windows running along its length. The cherry blossoms are in full bloom, with delicate pink petals and green leaves. The background
 The image shows a white tower framed by pink cherry blossoms. The tower is tall and slender, with a rounded top and a series of windows running up its length. It is set against a bright blue sky, suggesting that it is either early morning or late afternoon. The cherry blossoms are in full bloom, with
[rank0]:[W925 20:57:55.317346400 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
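
The quantized checkpoint can also be loaded directly through vLLM's Python API. The following is a minimal sketch: the image path and question are placeholders, and max_model_len is reduced to avoid the 131072-token context warning shown in the log above.

```python
# Minimal sketch: run the FP8-dynamic Mllama checkpoint through vLLM's LLM API.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(
    model="mgoin/Llama-3.2-11B-Vision-Instruct-FP8-Dynamic",
    max_model_len=4096,   # reduced from 131072 to keep profiling/KV cache small
    max_num_seqs=4,
    enforce_eager=True,
)

# Mllama prompt format used by the offline inference example.
prompt = "<|image|><|begin_of_text|>What is the content of this image?"
image = Image.open("cherry_blossom.jpg")  # placeholder local image

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```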


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@ywang96 (Member) left a comment


Thanks for this patch!

@comaniac comaniac enabled auto-merge (squash) September 25, 2024 21:14
@comaniac comaniac added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Sep 25, 2024
@simon-mo simon-mo merged commit 7193774 into main Sep 25, 2024
31 of 39 checks passed
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request Sep 27, 2024