[Model][VLM] Add Qwen2-VL model support #7905

fyabc · 2024-08-27T09:54:13Z

This PR adds support for the Qwen2-VL model.

Requirements

This PR requires a transformers build with this PR and this bugfix PR merged (you can install it via pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830).
NOTE: The current latest transformers release has a bug, so for now you should install the development version as shown above.
For transformers>=4.45, please install vLLM from source.

Optional Requirements

When constructing LLM inputs, we recommend using our helper package qwen-vl-utils to preprocess multimodal content correctly (qwen-vl-utils is not a part of this PR).

Example Usage

from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = 'Qwen/Qwen2-VL-7B-Instruct'
IMAGE_PATH = '/path/to/image.jpg'
VIDEO_PATH = '/path/to/video.mp4'

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={'image': 10, 'video': 10},
)

sampling_params = SamplingParams(
    temperature=0.1, top_p=0.001, repetition_penalty=1.05, max_tokens=256,
    stop_token_ids=[],
)

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': [
        {
            'type': 'image',
            'image': IMAGE_PATH,

            # min_pixels & max_pixels are optional
            'max_pixels': 12845056,
        },

        # You can also pass one or more videos:
        # {
        #     'type': 'video',
        #     'video': VIDEO_PATH,
        # }

        {
            'type': 'text',
            'text': 'What does this diagram illustrate?',
        },
    ]},
]

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)

mm_data = {}
if image_inputs is not None:
    mm_data['image'] = image_inputs
if video_inputs is not None:
    mm_data['video'] = video_inputs

llm_inputs = {
    'prompt': prompt,
    'multi_modal_data': mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)

Notes

Here are some important notes about this PR:

Qwen2-VL uses rotary embedding with multimodal sections (mrope) (see vllm/model_executor/layers/rotary_embedding.py for more details). This rotary embedding requires the input positions to be a tensor of shape (3, seq_len) (instead of (seq_len,) in the common case).
1. To support this feature, we add a new _mrope_position_delta attribute (of type Optional[int]) to vllm.sequence.SequenceData; this attribute is used to compute mrope_input_positions in each decoding step. (If reviewers have a better solution, please comment on this PR.)
2. We also change model_runner.py to compute mrope_input_positions when the model uses mrope. Other model runners should follow the same logic; I think this can be done in another PR (I will add it here if reviewers think it needs to be implemented in this PR).
Qwen2-VL uses flash-attn==2.6.1 (instead of vllm-flash-attn==2.6.1) to compute vision attention (see the commented line 36 in vllm/model_executor/models/qwen2_vl.py). The current vllm-flash-attn version outputs NaN logits, and I am still debugging this.
1. UPDATE 2024.09.06: Added an xformers backend as a fallback implementation of Qwen2VisionAttention, so there is no need to add flash-attn to the project requirements file.
Qwen2-VL supports both image and video inputs. To support this feature, we add a video multimodal plugin (see vllm/multimodal/video.py for more details).
OpenAI-compatible server
1. Currently, vllm.entrypoints.openai.api_server uses a model-independent multimodal data fetcher (e.g. vllm.multimodal.utils.async_get_and_parse_image), so the smart vision resizing logic in qwen-vl-utils cannot be applied for now. I think it would be good to create another PR to fix this later.
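
To illustrate the (3, seq_len) position layout mentioned in the mrope note above, here is a minimal pure-Python sketch. It is not the code in rotary_embedding.py: the function name mrope_positions and the segment format are hypothetical, and the vision grid sizes are assumed to be already divided by the spatial merge size.

```python
from typing import List, Tuple, Union

# Hypothetical segment format: ("text", n_tokens) or ("vision", (t, h, w)).
Segment = Tuple[str, Union[int, Tuple[int, int, int]]]


def mrope_positions(segments: List[Segment]) -> Tuple[List[List[int]], int]:
    """Return ([t_pos, h_pos, w_pos], position_delta) for one prompt."""
    t_pos: List[int] = []
    h_pos: List[int] = []
    w_pos: List[int] = []
    next_pos = 0  # first unused position index

    for kind, spec in segments:
        if kind == "text":
            # Text tokens advance all three axes together.
            for i in range(spec):
                t_pos.append(next_pos + i)
                h_pos.append(next_pos + i)
                w_pos.append(next_pos + i)
            next_pos += spec
        elif kind == "vision":
            # Vision tokens get separate temporal / height / width indices.
            t, h, w = spec
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        t_pos.append(next_pos + ti)
                        h_pos.append(next_pos + hi)
                        w_pos.append(next_pos + wi)
            # The following text resumes after the largest index used.
            next_pos += max(t, h, w)
        else:
            raise ValueError(f"unknown segment kind: {kind}")

    seq_len = len(t_pos)
    # Offset between the next position index and the token count.
    delta = next_pos - seq_len
    return [t_pos, h_pos, w_pos], delta


positions, delta = mrope_positions(
    [("text", 3), ("vision", (1, 2, 2)), ("text", 2)]
)
```

In this sketch, the next decoded token would sit at position seq_len + delta on all three axes, which is why a per-sequence delta needs to be stored between decoding steps.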

Multiple modalities support details

Since Qwen2-VL supports two modalities (images and videos), we need to handle some special cases, as shown below:

# 1. A batch with two samples, sample 1 contains images, sample 2 contains videos
llm.generate([
    {
        "prompt": "XXX",
        "multi_modal_data": {
            "image": ...
        }
    },
    {
        "prompt": "XXX",
        "multi_modal_data": {
            "video": ...
        }
    }
])

# 2. A single sample with both images and videos
llm.generate([
    {
        "prompt": "XXX",
        "multi_modal_data": {
            "image": ...,
            "video": ...
        }
    }
])

So I removed the same-key check in the vllm.multimodal.base.MultiModalInputs.batch() method, since different samples may return different modality keys.
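
A minimal sketch of batching without the same-key check might look like the following. This is not the actual MultiModalInputs.batch() implementation; batch_by_modality is a hypothetical name, and plain Python values stand in for tensors.

```python
from collections import defaultdict
from typing import Any, Dict, List


def batch_by_modality(samples: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Group per-sample multimodal inputs by key.

    Samples are allowed to carry different key sets, so a batch can mix
    image-only, video-only, and image+video samples.
    """
    batched: Dict[str, List[Any]] = defaultdict(list)
    for sample in samples:
        for key, value in sample.items():
            batched[key].append(value)
    return dict(batched)


batch = batch_by_modality([
    {"image": "img_a"},                    # sample with images only
    {"video": "vid_b"},                    # sample with videos only
    {"image": "img_c", "video": "vid_c"},  # sample with both
])
```

Each key in the result only collects values from the samples that actually provided that modality.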

github-actions · 2024-08-27T09:54:27Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

Comment /ready on the PR
Add ready label to the PR
Enable auto-merge.

🚀

Add Qwen2-VL support in chat_utils.py.

…ties in a single batch.

DarkLight1337 · 2024-08-29T03:19:19Z

Thanks for implementing this (and sorry for the delayed response)! Since this PR not only introduces a new modality (video) but also involves the first model to accept multiple modalities (excluding text), I would like to merge #7559 first to verify that vLLM can handle video inputs properly.

In the meantime, can you fix the CI failures?

fyabc · 2024-08-29T04:01:21Z

Hi @DarkLight1337, these mypy errors do not seem to belong to this PR. Should I also fix them?

ywang96

@fyabc Thank you for contributing to vLLM! I took a brief look and left a first round of review. Please take a look.

As @DarkLight1337 mentioned, we might want to wait for #7559 to be merged first: since we're going to have a model that supports a mix of modalities, we want to be careful with API changes.

vllm/model_executor/models/qwen2_vl.py

ywang96 · 2024-08-29T04:02:05Z

vllm/worker/model_runner.py

+        # special processing for mrope position deltas.
+        if self.runner.model_is_mrope:
+            image_grid_thw = mm_kwargs.get("image_grid_thw", None)
+            video_grid_thw = mm_kwargs.get("video_grid_thw", None)
+            assert image_grid_thw is not None or video_grid_thw is not None, \
+                "mrope embedding type requires multi-modal input mapper returns 'image_grid_thw' or 'video_grid_thw'."
+
+            hf_config = self.runner.model_config.hf_config
+
+            from vllm.model_executor.layers.rotary_embedding import MRotaryEmbedding
+
+            inter_data.mrope_input_positions = [None] * inter_data.n_seqs
+            for seq_idx in range(inter_data.n_seqs):
+                seq_data = seq_group_metadata.seq_data[
+                    inter_data.seq_ids[seq_idx]]
+                token_ids = seq_data.get_token_ids()
+
+                mrope_input_positions, mrope_position_delta = MRotaryEmbedding.get_input_positions(
+                    token_ids,
+                    image_grid_thw=image_grid_thw,
+                    video_grid_thw=video_grid_thw,
+                    image_token_id=hf_config.image_token_id,
+                    video_token_id=hf_config.video_token_id,
+                    vision_start_token_id=hf_config.vision_start_token_id,
+                    vision_end_token_id=hf_config.vision_end_token_id,
+                    spatial_merge_size=hf_config.vision_config.
+                    spatial_merge_size,
+                    context_len=inter_data.context_lens[seq_idx],
+                )
+
+                seq_data.mrope_position_delta = mrope_position_delta
+                inter_data.mrope_input_positions[
+                    seq_idx] = mrope_input_positions


I'm okay with us doing this at the model runner level, and I'm honestly not sure if there's a better place to apply mrope. What are your thoughts on this? @WoosukKwon

DarkLight1337 · 2024-08-29T04:03:09Z

Can you merge from main first? It fixes some of the mypy errors which might apply here.

fyabc · 2024-08-29T08:08:49Z

Hi @DarkLight1337 @ywang96, I have updated this PR based on your review comments. Please check it again.
I have also added some notes about multiple modalities to the PR overview.

…mplementation.

# Conflicts: # vllm/worker/model_runner.py

DragonFive · 2024-08-30T07:49:35Z

@fyabc Hi, can this patch support multiple images in one prompt, like the following:

Compute the value of the expression in the image below <image_1>\nby using the emoji equations in the following images <image_2> <image_3> <image_4> <image_5> Only answer specific numerical values.

fyabc · 2024-08-30T11:28:34Z

Hi @DragonFive , you can pass multiple images into a single prompt like this:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

See "Multi image inference" section of our README for more details.
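
For an arbitrary number of images, the message list above could be built programmatically. This is a small sketch; build_multi_image_messages is a hypothetical helper, not part of qwen-vl-utils or vLLM.

```python
from typing import Any, Dict, List


def build_multi_image_messages(image_uris: List[str], question: str) -> List[Dict[str, Any]]:
    """Build one user message with several image parts followed by the text."""
    content: List[Dict[str, Any]] = [
        {"type": "image", "image": uri} for uri in image_uris
    ]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


messages = build_multi_image_messages(
    ["file:///path/to/image1.jpg", "file:///path/to/image2.jpg"],
    "Identify the similarities between these images.",
)
```

The image limit in limit_mm_per_prompt (as set in the PR description example) presumably has to be at least the number of images passed here.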

PancakeAwesome · 2024-09-13T07:32:39Z

@fyabc could you elaborate on why using qwen-vl-utils or not could significantly affect the processing time inside vLLM?

@PancakeAwesome Can you provide your offline inference code snippet with and without qwen-vl-utils, and print the image objects passed to LLM (and also report the time of the llm.generate() call)?

What do you think about the server side not receiving a response regardless of the time and the request getting stuck? I shared a few examples of these above, but if you have any additional requests, I can try them too.

Sorry, I can't reproduce the stuck requests; I have only seen the OpenAI service being slow and unstable.

PancakeAwesome · 2024-09-13T07:43:36Z

from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = 'Qwen__Qwen2-VL-7B-Instruct'

llm = LLM(
    model=MODEL_PATH,
    max_model_len=4096,
    limit_mm_per_prompt={'image': 1, 'video': 1},
    gpu_memory_utilization=0.9,
    enforce_eager=True,
    dtype='bfloat16',
    trust_remote_code=True,
)

IMAGE_PATH = "test6.jpg"
VIDEO_PATH = '1.mp4'
question = "hi"
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.8,
    max_tokens=100,
    frequency_penalty=1.5,
    top_k=50,
    best_of=3,
    repetition_penalty=1.,
    stop_token_ids=[],
)

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': [
        {
            'type': 'image',
            'image': IMAGE_PATH,

            # min_pixels & max_pixels are optional
            # 'max_pixels': 12845056,
            'max_pixels': 1500000,
        },

        # You can also pass one or more videos:
        # {
        #     'type': 'video',
        #     'video': VIDEO_PATH,
        # },

        {
            'type': 'text',
            'text': question,
        },
    ]},
]

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

# image_inputs, video_inputs = process_vision_info(messages)  # with qwen-vl-utils
image_inputs = Image.open(IMAGE_PATH)  # without qwen-vl-utils
video_inputs = None

mm_data = {}
if image_inputs is not None:
    mm_data['image'] = image_inputs
if video_inputs is not None:
    mm_data['video'] = video_inputs

llm_inputs = {
    'prompt': prompt,
    'multi_modal_data': mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
# print(outputs)
print(generated_text)

fyabc · 2024-09-13T08:09:58Z

@PancakeAwesome Hi, I found 'max_pixels': 1500000 in the message of your code snippet.

If you use qwen-vl-utils, the code flow should be:

1) raw PIL image
2) => qwen-vl-utils (resize by max_pixels=1500000)
3) => llm.generate()
4) => input mapper (resize by min_pixels=3136, max_pixels=12845056; values are read from the model's preprocessor_config.json)
5) => model.forward()

If you skip qwen-vl-utils (the OpenAI API server always skips it for now), step 2 will not be applied.

So if your input image test6.jpg is larger than 1500000 pixels and you skip qwen-vl-utils, vLLM will receive a larger input and run slower.
You can remove the 'max_pixels': 1500000 entry in your code and try again.
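
The effect of skipping step 2 can be approximated with a short sketch. This is a simplified stand-in for the resizing logic, not the qwen-vl-utils or vLLM code: cap_to_max_pixels is a hypothetical name, the rounding factor of 28 is an assumption, the min_pixels branch is omitted for brevity, and the 3000x4000 image size is made up for illustration.

```python
import math


def cap_to_max_pixels(height: int, width: int, max_pixels: int, factor: int = 28):
    """Round both sides to a multiple of `factor`, then shrink the image
    (keeping the aspect ratio) if it exceeds `max_pixels`."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        scale = math.sqrt(height * width / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    return h, w


raw = (3000, 4000)  # hypothetical raw image size (height, width)

# With qwen-vl-utils: step 2 caps at 1500000, then step 4 caps at 12845056.
step2 = cap_to_max_pixels(*raw, max_pixels=1_500_000)
with_utils = cap_to_max_pixels(*step2, max_pixels=12_845_056)

# Without qwen-vl-utils: only step 4 is applied.
without_utils = cap_to_max_pixels(*raw, max_pixels=12_845_056)

print(with_utils, with_utils[0] * with_utils[1])
print(without_utils, without_utils[0] * without_utils[1])
```

Under these assumptions the image reaches the model at roughly 1.45 million pixels with qwen-vl-utils and at about 12 million pixels without it, which is consistent with the slower run.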

fyabc · 2024-09-13T08:12:09Z

@syngokhan Sorry, I have no idea why the OpenAI API server gets stuck. Could it be a bug in the current OpenAI API server implementation for vision models? Have you tried other VLMs?

syngokhan · 2024-09-13T13:55:48Z

I had the same problem when I tried Pixtral, and I also tried it on other machines. But isn't using vLLM's api_server.py with the OpenAI client an important way to serve multiple requests and get responses? I mean, how else can I send different requests and get responses without using the OpenAI client? Do you have any advice?

Note: Maybe I need to try the requests method.

DarkLight1337 · 2024-09-14T09:28:06Z

If it's happening to other VLMs as well, can you open a separate issue so we can better investigate it?

PancakeAwesome · 2024-09-14T10:36:30Z

When will the vLLM OpenAI server be ready to support Qwen2-VL video inference?
@fyabc @DarkLight1337

DarkLight1337 · 2024-09-14T10:37:45Z

See #7558

AlexanderChen1989 · 2024-09-16T06:32:14Z

Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
Traceback (most recent call last):
  File "/workspace/lite/test1.py", line 10, in <module>
    llm = LLM(
  File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 178, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 547, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 844, in create_engine_config
    model_config = self.create_model_config()
  File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 782, in create_model_config
    return ModelConfig(
  File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/config.py", line 227, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/config.py", line 1747, in _get_and_verify_max_len
    assert "factor" in rope_scaling
AssertionError

ywang96 · 2024-09-16T06:40:22Z

@AlexanderChen1989 please make sure you have installed this particular version of transformers: pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830

syngokhan · 2024-09-16T12:27:29Z

I worked around the problem with the plain, classic way of sending multiple requests: when I send requests like this, the model does not seem to have any problems and returns responses. With the OpenAI client, however, it currently runs into a bottleneck and causes problems after continuous requests are sent. This is what I am observing at the moment.

thusinh1969 · 2024-09-17T04:57:43Z

I get the same error with the above transformers version (4.45.0.dev0) and pip install vllm -U (vllm-0.6.1.post2):

1745 scaling_factor = 1
1746 else:
-> 1747 assert "factor" in rope_scaling
1748 scaling_factor = rope_scaling["factor"]
1749 if rope_type == "yarn":

AssertionError:

DarkLight1337 · 2024-09-17T04:59:54Z

You should install vLLM first before the dev version of transformers (make sure you are installing from the commit hash btw). If you install vLLM afterwards, it may overwrite the existing version of transformers that you have installed.

thusinh1969 · 2024-09-17T05:10:45Z

Still not working, even with a fresh install of both in the order you suggested:

!pip uninstall transformers -y
!pip uninstall vllm -y # always install vllm first
!pip install vllm -U
!pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830

   1745     scaling_factor = 1
   1746 else:
-> 1747     assert "factor" in rope_scaling
   1748     scaling_factor = rope_scaling["factor"]
   1749 if rope_type == "yarn":

AssertionError:

DarkLight1337 · 2024-09-17T05:15:36Z

Can you run pip list and see which version is actually installed for both libraries? Separately, can you run AutoConfig.from_pretrained(...) with the HF repo you are using to ensure that the config is up to date?
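
As an alternative to scanning the pip list output, the installed versions can be read from Python's standard library. This is a small sketch; installed_versions is a hypothetical helper name.

```python
from importlib import metadata
from typing import Dict, Optional, Sequence


def installed_versions(
    packages: Sequence[str] = ("vllm", "transformers"),
) -> Dict[str, Optional[str]]:
    """Return the installed version of each package, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_versions())
```

Running this in the same environment (or notebook kernel) that launches vLLM shows whether a later install replaced the pinned transformers commit.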

thusinh1969 · 2024-09-17T05:23:06Z

Qwen2VLConfig {
"_name_or_path": "./docker/EraX-VL-7B/EraX-VL-7B",
"architectures": [
"Qwen2VLForConditionalGeneration"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 3584,
"image_token_id": 151655,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2_vl",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": {
"mrope_section": [
16,
24,
24
],
"rope_type": "default",
"type": "default"
},
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.45.0.dev0",
"use_cache": true,
"use_sliding_window": false,
"video_token_id": 151656,
"vision_config": {
"in_chans": 3,
"model_type": "qwen2_vl",
"spatial_patch_size": 14
},
"vision_end_token_id": 151653,
"vision_start_token_id": 151652,
"vision_token_id": 151654,
"vocab_size": 152064
}

It is the same, I think!
Steve

DarkLight1337 · 2024-09-17T05:32:13Z

This config is outdated. Compare it to the official one here. Notice the difference in rope_scaling.
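
A quick check for this situation could look like the sketch below. It assumes that the up-to-date Qwen2-VL config marks the section with "type": "mrope" (this value is an assumption, it is not stated in this thread), and it mirrors the assert "factor" in rope_scaling from the traceback above. rope_scaling_looks_outdated is a hypothetical helper, not a vLLM or transformers function.

```python
from typing import Any, Dict, Optional


def rope_scaling_looks_outdated(rope_scaling: Optional[Dict[str, Any]]) -> bool:
    """Heuristic: an mrope_section without an 'mrope' type and without a
    'factor' key is the combination that hits the assertion shown above."""
    if not rope_scaling:
        return False
    rope_type = rope_scaling.get("type", rope_scaling.get("rope_type"))
    if rope_type == "mrope":  # assumed value in the up-to-date config
        return False
    return "mrope_section" in rope_scaling and "factor" not in rope_scaling


# The rope_scaling block from the config posted above.
posted = {"mrope_section": [16, 24, 24], "rope_type": "default", "type": "default"}
print(rope_scaling_looks_outdated(posted))
```

The dictionary to check can be taken from the rope_scaling field of the object returned by AutoConfig.from_pretrained(...) or directly from config.json.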

thusinh1969 · 2024-09-17T05:55:57Z

I got it. I applied the new config manually and copied over preprocessor_config.json, and it is working now.

Thanks,
Steve

thusinh1969 · 2024-09-17T06:27:23Z

I was able to launch vLLM with Qwen2-VL-7B as follows:

CUDA_VISIBLE_DEVICES=3 python -m vllm.entrypoints.openai.api_server --limit-mm-per-prompt image=30 --host 0.0.0.0 --port 9999 --served-model-name EraX-VL-V1 --model ./EraX-VL-7B

I then used the client code from the Qwen2-VL repository (https://github.com/QwenLM/Qwen2-VL):

import cv2
import matplotlib.pyplot as plt
from PIL import Image

import uuid, base64

# Prepare base64 image
test_image1 = './samples/bill-1.png'

with open(test_image1, "rb") as f:
    encoded_image = base64.b64encode(f.read())

encoded_image_text = encoded_image.decode('utf-8')
base64_qwen = f"data:image;base64,{encoded_image_text}"

# Run
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:9999/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

prompt = "What is the content of the image?"

chat_response = client.chat.completions.create(
    model="EraX-VL-V1",
    temperature=0.2,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": base64_qwen,
                },
                {
                    "type": "text", 
                    "text": prompt
                },
            ],
        },
    ],
)

ERROR:

   1038         err.response.read()
   1040     log.debug("Re-raising status error")
-> 1041     raise self._make_status_error_from_response(err.response) from None
   1043 return self._process_response(
   1044     cast_to=cast_to,
   1045     options=options,
   (...)
   1049     retries_taken=options.get_max_retries(self.max_retries) - retries,
   1050 )

BadRequestError: Error code: 400 - {'object': 'error', 'message': 'Unknown part type: image', 'type': 'BadRequestError', 'param': None, 'code': 400}

Any hint please.

Thanks,
Steve

5Elza5 · 2024-09-17T07:59:01Z

Try this:

{
    "type": "image_url",
    "image_url": {
        "url": f"data:image/jpeg;base64,{base64_image}"
    }
}

From here: https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images
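For reference, here is a minimal sketch of how the user message could be assembled in the OpenAI `image_url` format instead of the `{"type": "image", ...}` part that the server rejected. The helper names `to_data_url` and `build_user_content` are hypothetical, and the `image/png` MIME type is an assumption based on the `.png` sample file in the question.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def build_user_content(image_bytes: bytes, prompt: str) -> list:
    """Build the `content` list of one user message in the OpenAI chat format."""
    return [
        # The image goes in an "image_url" part, not an "image" part.
        {"type": "image_url", "image_url": {"url": to_data_url(image_bytes)}},
        {"type": "text", "text": prompt},
    ]
```

The returned list would then be passed as the `content` of the `"role": "user"` message in the `client.chat.completions.create(...)` call from the question, with the image bytes read from the file in `"rb"` mode.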

yuanjietu · 2024-09-20T17:25:06Z

Hi, I am running this and got the same error as #8281. Could someone help me with this? Thank you!

Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}

AssertionError Traceback (most recent call last)
/tmp/ipykernel_32014/2600037648.py in <module>
9 del config.rope_scaling['mrope_section']
10
---> 11 llm = LLM(
12 model=MODEL_PATH,
13 limit_mm_per_prompt={'image': 10, 'video': 10},

~/.local/lib/python3.10/site-packages/vllm/entrypoints/llm.py in __init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, **kwargs)
176 **kwargs,
177 )
--> 178 self.llm_engine = LLMEngine.from_engine_args(
179 engine_args, usage_context=UsageContext.LLM_CLASS)
180 self.request_counter = Counter()

~/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py in from_engine_args(cls, engine_args, usage_context, stat_loggers)
545 """Creates an LLM engine from the engine arguments."""
546 # Create the engine configs.
--> 547 engine_config = engine_args.create_engine_config()
548 executor_class = cls._get_executor_cls(engine_config)
549 # Create the LLM engine.

~/.local/lib/python3.10/site-packages/vllm/engine/arg_utils.py in create_engine_config(self)
842
843 device_config = DeviceConfig(device=self.device)
--> 844 model_config = self.create_model_config()
845
846 if model_config.is_multimodal_model:

~/.local/lib/python3.10/site-packages/vllm/engine/arg_utils.py in create_model_config(self)
780
781 def create_model_config(self) -> ModelConfig:
--> 782 return ModelConfig(
783 model=self.model,
784 tokenizer=self.tokenizer,

~/.local/lib/python3.10/site-packages/vllm/config.py in __init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, dtype, seed, revision, code_revision, rope_scaling, rope_theta, tokenizer_revision, max_model_len, spec_target_max_model_len, quantization, quantization_param_path, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, max_logprobs, disable_sliding_window, skip_tokenizer_init, served_model_name, limit_mm_per_prompt, use_async_output_proc, override_neuron_config, config_format)
225 self.disable_sliding_window = True
226
--> 227 self.max_model_len = _get_and_verify_max_len(
228 hf_config=self.hf_text_config,
229 max_model_len=max_model_len,

~/.local/lib/python3.10/site-packages/vllm/config.py in _get_and_verify_max_len(hf_config, max_model_len, disable_sliding_window, sliding_window_len, spec_target_max_model_len)
1745 scaling_factor = 1
1746 else:
-> 1747 assert "factor" in rope_scaling
1748 scaling_factor = rope_scaling["factor"]
1749 if rope_type == "yarn":

AssertionError:

DarkLight1337 · 2024-09-20T17:32:13Z

See my comment above: #7905 (comment)

exceedzhang · 2024-09-22T10:04:03Z

huggingface/transformers#33401

YuanLiuuuuuu · 2024-09-26T15:39:35Z

Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
Traceback (most recent call last):
  File "/workspace/lite/test1.py", line 10, in <module>
    llm = LLM(
  File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 178, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 547, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 844, in create_engine_config
    model_config = self.create_model_config()
  File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 782, in create_model_config
    return ModelConfig(
  File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/config.py", line 227, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/config.py", line 1747, in _get_and_verify_max_len
    assert "factor" in rope_scaling
AssertionError

@AlexanderChen1989 please make sure you have installed this particular version of transformers: pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830

This version of transformers raises the following error:

ModuleNotFoundError: No module named 'transformers.models.mllama'

DarkLight1337 · 2024-09-26T15:45:13Z

The current version of vLLM requires transformers>=4.45. Qwen2-VL has only just been made compatible with transformers>=4.45 in vLLM, so you'll have to install vLLM from source.
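As a rough illustration of that version boundary, the sketch below checks whether the installed transformers package is at least 4.45. The function names are hypothetical, and the parser is a simplification that only compares the leading numeric components and ignores pre-release suffixes such as `.dev0`.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Return the leading numeric components, e.g. '4.45.0.dev0' -> (4, 45, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(version: str, minimum: tuple = (4, 45)) -> bool:
    """True if `version` is at least `minimum` (pre-release tags are ignored)."""
    return version_tuple(version) >= minimum


def installed_transformers_ok() -> bool:
    """Check the transformers version installed in the current environment."""
    try:
        return meets_minimum(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return False
```

If `installed_transformers_ok()` returns True, the advice above applies: install vLLM from source rather than using the pinned transformers commit.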

chenzhengda · 2024-09-27T03:05:37Z

@fyabc Hi, I've noticed that in the Qwen2 VL chat template, there is no '\n' after <|vision_end|>, but there is one when launched through the vllm API server. This seems to be a bug.

fyabc added 5 commits August 23, 2024 16:59

Add support to Qwen2-VL.

0a648b2

Merge branch 'refs/heads/main' into add_qwen2_vl_new

320df57

Reformat

7f96df8

Merge branch 'refs/heads/main' into add_qwen2_vl_new

fbf2b8b

Update transformers link.

bcaff4f

fyabc added 4 commits August 27, 2024 18:55

Bugfix of mrope_input_positions in model_runner.py.

f2185bf

Rename pixel_values_video to pixel_values_videos in qwen2_vl.py.

60448cb

Add Qwen2-VL support in chat_utils.py.

Fix the bug of MultiModalInputs.batch() when passing different modalities in a single batch.

71a77b1

Fix the bug when running OpenAI-compatible API server.

60c4cbd

DarkLight1337 mentioned this pull request Aug 27, 2024

[RFC]: Multi-modality Support Refactoring #4194

Open

88 tasks

ywang96 reviewed Aug 29, 2024

fyabc added 4 commits August 29, 2024 12:08

Merge branch 'refs/heads/main' into add_qwen2_vl_new

e29ff54

Refactor qwen2_vl.py based on review comments.

ddb7138

reformat

14fe12a

reformat

89def23

fyabc added 5 commits August 29, 2024 16:20

Fix the bug of model_is_mrope in model_runner.py.

e721e60

fix type hints in qwen2_vl.py

d66d167

Update mm input processors according to new MultiModalInput.batch() implementation.

acd85ed

Merge branch 'refs/heads/main' into add_qwen2_vl_new

8d762c6

# Conflicts: # vllm/worker/model_runner.py

Fix SamplerOutput.

87ba5ed

chensiye-csy mentioned this pull request Aug 30, 2024

Is there a plan to support vLLM and run quantized models on it? QwenLM/Qwen2-VL#20

Closed

Fix bug of quantization.

cda300a

Bugfix of type hints in qwen2_vl.py.

da03a3f

DarkLight1337 mentioned this pull request Sep 14, 2024

[Usage]: Model Qwen2VLForConditionalGeneration does not support LoRA, but LoRA is enabled. #8484

Closed

1 task

fyabc mentioned this pull request Sep 18, 2024

vLLM -0.61: 'Unknown part type: image' when run QWen2-VL-7B with vLLM QwenLM/Qwen2-VL#213

Closed

Jeffwan pushed a commit to aibrix/vllm that referenced this pull request Sep 19, 2024

[Model][VLM] Add Qwen2-VL model support (vllm-project#7905)

ccffc9f

Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

niaoyu mentioned this pull request Sep 23, 2024

Please add a version that is able to run with 2/4/8 tensor parallel QwenLM/Qwen2-VL#231

Closed
