[Model][VLM] Add Qwen2-VL model support #7905
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI, as it is required for merging (or just use auto-merge).
Add Qwen2-VL support in chat_utils.py.
…ties in a single batch.
Thanks for implementing this (and sorry for the delayed response)! Since this PR not only introduces a new modality (video) but also involves the first model to accept multiple modalities (excluding text), I would like to merge #7559 first to verify that vLLM can handle video inputs properly. In the meantime, can you fix the CI failures?
@fyabc Thank you for contributing to vLLM! I took a brief look and left a first round of review. Please take a look.
As @DarkLight1337 mentioned, we might want to wait for #7559 to be merged first: since we're going to have a model that supports a mix of modalities, we want to be careful with API changes.
vllm/worker/model_runner.py
# special processing for mrope position deltas.
if self.runner.model_is_mrope:
    image_grid_thw = mm_kwargs.get("image_grid_thw", None)
    video_grid_thw = mm_kwargs.get("video_grid_thw", None)
    assert image_grid_thw is not None or video_grid_thw is not None, \
        "mrope embedding type requires multi-modal input mapper returns 'image_grid_thw' or 'video_grid_thw'."

    hf_config = self.runner.model_config.hf_config

    from vllm.model_executor.layers.rotary_embedding import MRotaryEmbedding

    inter_data.mrope_input_positions = [None] * inter_data.n_seqs
    for seq_idx in range(inter_data.n_seqs):
        seq_data = seq_group_metadata.seq_data[
            inter_data.seq_ids[seq_idx]]
        token_ids = seq_data.get_token_ids()

        mrope_input_positions, mrope_position_delta = \
            MRotaryEmbedding.get_input_positions(
                token_ids,
                image_grid_thw=image_grid_thw,
                video_grid_thw=video_grid_thw,
                image_token_id=hf_config.image_token_id,
                video_token_id=hf_config.video_token_id,
                vision_start_token_id=hf_config.vision_start_token_id,
                vision_end_token_id=hf_config.vision_end_token_id,
                spatial_merge_size=hf_config.vision_config.spatial_merge_size,
                context_len=inter_data.context_lens[seq_idx],
            )

        seq_data.mrope_position_delta = mrope_position_delta
        inter_data.mrope_input_positions[seq_idx] = mrope_input_positions
I'm okay with us doing this at the model runner level, and honestly I'm not sure if there's a better place to apply mrope. What's your thought on this? @WoosukKwon
Can you merge from
Hi @DarkLight1337 @ywang96, I have updated this PR based on your review comments; please check it again.
# Conflicts:
#   vllm/worker/model_runner.py
@fyabc Hi, can this patch support multiple images in one prompt, as follows:
Hi @DragonFive, you can pass multiple images into a single prompt like this:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

See the "Multi image inference" section of our README for more details.
Sorry, I can't reproduce the stuck results; I have only encountered a slow and unstable OpenAI service.
@PancakeAwesome Hi, I found If you use
If you skip So if your input image
@syngokhan Sorry, I have no idea why the OpenAI API server is stuck... Could it be a bug in the current OpenAI API server implementation for vision models? Have you tried other VLMs?
I had this problem when I tried Pixtral as well, and I tried it on other machines too. But in the end, isn't the main way to use vLLM's api_server.py to receive multiple requests and get responses through the OpenAI client? I mean, how can I get responses for different requests without using OpenAI? Do you have any advice? Note: maybe I need to try the requests method.
If it's happening to other VLMs as well, can you open a separate issue so we can better investigate it?
When will the vLLM OpenAI server support Qwen2-VL video inference?
See #7558
Unrecognized keys in
@AlexanderChen1989 Please make sure you have installed this particular version of transformers.
I solved the problem with the normal classic multiple-request method; when I send a request like this, the model does not seem to have any problems and returns a response. But the OpenAI client currently runs into a bottleneck and causes problems after sending continuous requests. This is what I am currently observing.
Same error with the transformers above (4.45.0.dev0) and pip install vllm -U (vllm-0.6.1.post2): the assertion at scaling_factor = 1 fails with an AssertionError.
You should install vLLM first before the dev version of transformers (make sure you are installing from the commit hash, btw). If you install vLLM afterwards, it may overwrite the existing version of transformers that you have installed.
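A quick sanity check of the final environment (just an illustrative snippet, not part of this PR):

import transformers
import vllm

# If vLLM was installed after the dev transformers build, pip may have
# replaced it with a stable transformers release, which lacks Qwen2-VL.
print(transformers.__version__)  # expect a 4.45.0.dev0-style version here
print(vllm.__version__)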
Not working even with a fresh install of both, in your order:

!pip uninstall transformers -y
Can you run
Qwen2VLConfig { ... }

It is the same, I think!
This config is outdated. Compare it to the official one here. Notice the difference in
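One way to compare is to load the local checkpoint's config and look at the rope-scaling entry. This is just a sketch (it assumes the dev transformers build that knows the qwen2_vl model type, and the local path is a placeholder):

from transformers import AutoConfig

# Compare the locally saved config against the official Qwen2-VL-7B-Instruct one.
local = AutoConfig.from_pretrained("/path/to/local/Qwen2-VL-7B-Instruct")
official = AutoConfig.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
print(local.rope_scaling)
print(official.rope_scaling)  # the official config carries an mrope_section entry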
I got it. I applied the new config manually and copied over preprocessor.json, and it is working now. Thanks.
Ahhhh, I could run vLLM with Qwen2-VL-7B as:

CUDA_VISIBLE_DEVICES=3 python -m vllm.entrypoints.openai.api_server --limit-mm-per-prompt image=30 --host 0.0.0.0 --port 9999 --served-model-name EraX-VL-V1 --model ./EraX-VL-7B

I used the code from the Qwen2-VL repo (https://github.com/QwenLM/Qwen2-VL).

ERROR:

Any hint please. Thanks.
try this: "type": "image_url", |
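For example (a sketch against the server started with the command above; the host, port, and served model name come from that command, and the image URL is a placeholder):

from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible server started above.
client = OpenAI(base_url="http://localhost:9999/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="EraX-VL-V1",  # must match --served-model-name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/image.jpg"}},  # placeholder URL
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(response.choices[0].message.content)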
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Hi, I am running this and got the same error as #8281. Could someone help me with this? Thank you! Unrecognized keys in
See my comment above: #7905 (comment)
This version of transformers will raise the following error:
The current version of vLLM requires
@fyabc Hi, I've noticed that in the Qwen2-VL chat template there is no '\n' after <|vision_end|>, but there is one when launched through the vLLM API server. This seems to be a bug.
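A quick way to check what the chat template actually renders (a sketch using the Hugging Face processor and the qwen-vl-utils-style message format; the file path is a placeholder):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/image.jpg"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
# repr() makes it easy to see whether a '\n' follows <|vision_end|>.
print(repr(processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)))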
This PR adds support for the Qwen2-VL model.
FIX #8139
FIX #8281
Requirements
This PR requires transformers with this PR merged and this bugfix PR merged (you can install it via pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830).
NOTE: The current latest transformers release has a bug, so for now you should install the development version as above. For transformers>=4.45, please install vLLM from source.
Optional Requirements
You can install qwen-vl-utils to preprocess multimodal content correctly (qwen-vl-utils is not a part of this PR).
Example Usage
Notes
Here are some important notes about this PR:
Qwen2-VL uses rotary embedding with multimodal sections (mrope); see vllm/model_executor/layers/rotary_embedding.py for more details. This rotary embedding requires the input positions to be a tensor of shape (3, seq_len), instead of (seq_len,) as in the common case.
This PR adds a _mrope_position_delta attribute (with type Optional[int]) to vllm.sequence.SequenceData; this attribute is used to compute mrope_input_positions in each decoding step. (If reviewers have a better solution, please comment in this PR.)
This PR changes model_runner.py to compute the mrope_input_positions when the model uses mrope. Other model runners should also follow this logic; I think this can be done in another PR (I will add this part if reviewers think it needs to be implemented in this PR).
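To illustrate the idea with a conceptual sketch (not the actual vLLM code): during prefill the three mrope sections produce a positions tensor of shape (3, seq_len), and during decoding the position of each new text token can be recovered from the stored delta alone:

import torch

def next_mrope_position(context_len: int, mrope_position_delta: int) -> torch.Tensor:
    # After prefill, all three mrope sections (temporal, height, width) advance
    # together for text tokens, offset by the delta computed from the
    # multimodal prefill positions.
    pos = context_len + mrope_position_delta
    return torch.full((3, 1), pos, dtype=torch.long)  # shape (3, 1): one new token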
Qwen2-VL uses flash-attn==2.6.1 (instead of vllm-flash-attn==2.6.1) to compute vision attention (see the commented line 36 in vllm/model_executor/models/qwen2_vl.py). The current vllm-flash-attn version will output NaN logits values, and I am still debugging this bug. This PR adds an xformers backend as a fallback implementation of Qwen2VisionAttention, so there is no need to add flash-attn to the project requirements file.
Qwen2-VL supports both image and video inputs. To support this feature, we add a video multimodal plugin (see vllm/multimodal/video.py for more details).
The OpenAI-compatible server (vllm.entrypoints.openai.api_server) uses a model-independent multimodal data fetcher (e.g. vllm.multimodal.utils.async_get_and_parse_image), so the vision smart-resizing logic in qwen-vl-utils cannot be applied for now. I think it's good to create another PR to fix it later.
Multiple modalities support details
Since Qwen2-VL supports two modalities (images and videos), we have to handle some special cases, as below:
So I removed the same-key check in the vllm.multimodal.base.MultiModalInputs.batch() method, since different samples may return different modality keys.
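To illustrate the resulting behavior with a simplified sketch (not the actual MultiModalInputs.batch() implementation): samples exposing different modality keys are merged over the union of keys instead of asserting that every sample has the same keys:

from collections import defaultdict
from typing import Any, Dict, List

def batch_multimodal_kwargs(samples: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    # One sample may carry only image keys and another only video keys;
    # collect values per key over the union of keys.
    batched: Dict[str, List[Any]] = defaultdict(list)
    for sample in samples:
        for key, value in sample.items():
            batched[key].append(value)
    return dict(batched)

# batch_multimodal_kwargs([{"image_grid_thw": t1}, {"video_grid_thw": t2}])
# -> {"image_grid_thw": [t1], "video_grid_thw": [t2]}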