[Core] Support serving encoder/decoder models #7258

Merged: 35 commits merged into vllm-project:main on Aug 9, 2024

Conversation

DarkLight1337 (Member) commented on Aug 7, 2024:

This PR cleans up the parsing logic introduced in #4942 to pass mypy, and updates the async engine to follow the same code structure, enabling encoder/decoder models to be served using the OpenAI-compatible server. I have added a basic test accordingly.
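
For example, once the server is up with an encoder/decoder model loaded, it can be queried like any other model. The snippet below is only an illustrative sketch: the model name, port, and prompt are assumptions, not taken from this PR.

    # Hypothetical usage: querying an encoder/decoder model (e.g. BART) through
    # vLLM's OpenAI-compatible server, assumed to be running on localhost:8000.
    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    async def main() -> None:
        completion = await client.completions.create(
            model="facebook/bart-base",  # assumed model; any served enc/dec model works
            prompt="The rain in Spain",
            max_tokens=5,
            temperature=0.0,
        )
        print(completion.choices[0].text)

    asyncio.run(main())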

Other related changes:

  • Moved parsing-related code from vllm.inputs.data to a new module vllm.inputs.parse. To declutter import statements, these functions now have to be imported from vllm.inputs.parse explicitly rather than from vllm.inputs.
  • Most (but not all) of the warnings spammed by test_bart.py due to mismatched text have been fixed by padding the vLLM output string with BOS/EOS tokens.
  • Added an is_list_of helper to vllm.utils (ported from #7126). This avoids the need for type casts in parse_and_batch_prompt.
  • Replaced all existing uses of TypeGuard with the newly-introduced TypeIs construct, and bumped the minimum versions of mypy and typing_extensions accordingly. (A sketch of such a helper is shown after this list.)
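
For illustration, here is a minimal sketch of such a helper built on typing_extensions.TypeIs. It is an assumption-laden sketch rather than the exact code added to vllm.utils.

    # Minimal sketch of an is_list_of-style helper (illustrative only).
    # Assumes typing_extensions >= 4.10, which provides TypeIs (PEP 742).
    from typing import Any, List, Type, TypeVar
    from typing_extensions import TypeIs

    T = TypeVar("T")

    def is_list_of(value: Any, typ: Type[T]) -> TypeIs[List[T]]:
        # True only if value is a list and every element is an instance of typ.
        return isinstance(value, list) and all(isinstance(v, typ) for v in value)

Unlike TypeGuard, TypeIs also narrows the type in the negative branch of such a check, which is what lets mypy verify parse_and_batch_prompt without explicit casts.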

cc @afeldman-nm

github-actions bot commented on Aug 7, 2024:

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

DarkLight1337 (Member, Author) left a review comment:

Some explanations

    logit_bias: Dict[int, float],
    token_ids: List[int],
    logits: torch.Tensor,
) -> torch.Tensor:
DarkLight1337 (Member, Author) commented:

This error appeared after updating mypy so I had to fix it in this PR.

) -> List[Tuple[PromptInputs, PromptInputs]]:
    return [(enc_dec_prompt['encoder_prompt'],
             enc_dec_prompt['decoder_prompt'])
            for enc_dec_prompt in enc_dec_prompts]
DarkLight1337 (Member, Author) commented:

Since vllm.utils is imported by vllm.inputs, but these functions themselves require vllm.inputs, keeping them in vllm.utils would create a circular import; I have moved them to vllm.inputs.parse instead.

raise ValueError("prompt must be a string, array of strings, "
"array of tokens, or array of token arrays")


DarkLight1337 (Member, Author) commented:

These have been moved to vllm.inputs.parse.
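
For context, the error message quoted above enumerates the four prompt forms the parsing helpers accept (mirroring the OpenAI Completions API's prompt parameter). The values below are made-up illustrations, not code from the PR:

    prompt_str = "Hello, my name is"                       # a string
    prompt_strs = ["Hello, my name is", "The capital of"]  # an array of strings
    prompt_tokens = [1, 2, 3]                              # an array of tokens
    prompt_token_arrays = [[1, 2, 3], [4, 5, 6]]           # an array of token arrays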


raise ValueError(f"Invalid prompt {prompt}")


DarkLight1337 (Member, Author) commented on Aug 7, 2024:

This is no longer required as I have moved the parsing logic inside the engine class itself.

@DarkLight1337 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Aug 7, 2024.
@DarkLight1337 changed the title from "[Core] Refactor encoder/decoder prompt parsing logic" to "[Core] Support encoder/decoder prompt in AsyncLLMEngine" on Aug 7, 2024.
@DarkLight1337 changed the title from "[Core] Support encoder/decoder prompt in AsyncLLMEngine" to "[Core] Support serving encoder/decoder models" on Aug 7, 2024.
afeldman-nm (Contributor) left a review comment:

Hi @DarkLight1337, these improvements are great. I just had a few questions/suggestions. Overall LGTM.

decoder_prompt, decoder_prompt_ids, decoder_mm_data = decoder_comps

if encoder_mm_data is not None or decoder_mm_data is not None:
    raise ValueError("Multi-modal data is not supported for "
afeldman-nm (Contributor) commented:

Nit: currently the encoder/decoder infrastructure doesn't support multi-modal models (fixing this is a near-term goal.)

Perhaps "Multi-modal inputs are not supported for "
"encoder-decoder models"

?

DarkLight1337 (Member, Author) commented:

Technically, the existing multi-modal models use a vision encoder, so I wanted to make it clear that "encoder" here refers to the language encoder.

afeldman-nm (Contributor) commented on Aug 7, 2024:

The point I am about to make is not critical, so don't block merging on addressing it. Just for context: the limitation is really that encoder/decoder models with cross-attention do not currently support multi-modal inputs.

So for example, a model like Llava has an encoder but no cross-attention, and vLLM obviously supports Llava (and did so even before encoder/decoder model support was introduced).

However, a vision model setup like the one shown in the diagram below (taken from a blog post) requires cross-attention between ViT and RoBERTa in order to decode a description of the image. Even though this is a vision model, it would not currently be supported by vLLM, because it needs both multi-modal support and encoder/decoder cross-attention, which is a currently-unsupported combination (for example, EncoderDecoderModelRunner has an assert which checks that multi-modality is not enabled).

[Diagram from the referenced blog post: a ViT image encoder feeding a RoBERTa decoder via cross-attention to generate an image description.]

afeldman-nm (Contributor) commented on Aug 7, 2024:

Of course, this inability to combine multi-modal with encoder/decoder will need to be addressed by the time we add the Whisper model to vLLM, which is a near-term goal.

DarkLight1337 (Member, Author) commented on Aug 8, 2024:

Thanks for providing more context; I see what you mean by encoder-decoder in the multi-modal context now. Yeah, we will need to address this soon.

DarkLight1337 (Member, Author) commented:

I'll update the error to say "multi-modal" encoder-decoder models are not supported yet.

ywang96 (Member) commented:

Just saw this comment. Yes, like @afeldman-nm said, this is something we should start thinking about, since Llama 3.1 multi-modal has exactly the same inference pattern as what Andrew described here.

afeldman-nm (Contributor) commented on Aug 9, 2024:

@ywang96 FYI I just released an RFC giving an overview of next-steps for vLLM's encoder/decoder support, and referenced this discussion in the section on multimodality support: #7366 (comment)

Feel free to leave any feedback that you have

vllm/inputs/data.py: an outdated review comment (resolved, not shown here).
Comment on lines +44 to +49
completion = await client.completions.create(
    model=model_name,
    prompt=[0, 0, 0, 0, 0],
    max_tokens=5,
    temperature=0.0,
)
afeldman-nm (Contributor) commented on Aug 7, 2024:

@DarkLight1337

It is great that we support this now. Does it make sense maybe to have two different test prompts, one being a singleton encoder prompt, the other being an explicit encoder/decoder prompt? This would help confirm that the prompt processing pipeline works as it should for the async scenario (not that I can think of a reason it wouldn't.)

An example prompt could be:

{
    'encoder_prompt': 'The rain in spain',
    'decoder_prompt': [0, 0, 0, 0, 0],
}

DarkLight1337 (Member, Author) commented on Aug 7, 2024:

I don't think this input format is specified in the OpenAI API, so users shouldn't run into problems unless they use the async engine directly. Since this PR is pretty big already, let's leave this for future work.

afeldman-nm (Contributor) commented:

Okay then LGTM.

@@ -334,17 +438,19 @@ async def add_request_async(
    trace_headers: Optional[Mapping[str, str]] = None,
    prompt_adapter_request: Optional[PromptAdapterRequest] = None,
) -> None:
    """Async version of :meth:`add_request`."""
Sponsor Collaborator commented:

This should not change in this PR

DarkLight1337 (Member, Author) commented:

Sorry, I don't quite get what you mean by this. Could you elaborate?

Sponsor Collaborator commented:

I don't think this function needs changes in this PR - just a nit

DarkLight1337 (Member, Author) commented:

I see - this is just to keep the order of arguments consistent with the new ordering of parameters in process_model_inputs_async (which has been updated alongside process_model_inputs).

robertgshaw2-neuralmagic (Sponsor Collaborator) commented:

I think this looks good.

Separately, @DarkLight1337 (for follow-up): perhaps we should create a new class called AsyncLLMEngineEncoderDecoder rather than branching inside AsyncLLMEngine. WDYT?

DarkLight1337 (Member, Author) replied:

I think that it would be best to keep the same class structure as the base LLMEngine. If the goal is to slim down the engine classes, perhaps we could factor out both the sync and async versions of the parsing code.

    request_id=request_id,
)

encoder_comps, decoder_comps = await asyncio.gather(
Collaborator commented:

Does the order matter here?

DarkLight1337 (Member, Author) commented:

Yes, the results come back in the same order as the awaitables passed in; that's just how asyncio.gather works.
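
To illustrate the ordering guarantee with a generic example (not code from this PR): asyncio.gather returns its results in the order the awaitables were passed, regardless of which one finishes first, so unpacking into encoder and decoder components as above is safe.

    import asyncio

    async def tagged_sleep(tag: str, delay: float) -> str:
        # Completion order intentionally differs from argument order.
        await asyncio.sleep(delay)
        return tag

    async def main() -> None:
        # "encoder" is passed first, so its result comes back first,
        # even though the "decoder" coroutine finishes earlier.
        encoder_out, decoder_out = await asyncio.gather(
            tagged_sleep("encoder", 0.2),
            tagged_sleep("decoder", 0.1),
        )
        assert (encoder_out, decoder_out) == ("encoder", "decoder")

    asyncio.run(main())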

@DarkLight1337 merged commit 7eb4a51 into vllm-project:main on Aug 9, 2024
68 checks passed
@DarkLight1337 deleted the inputs-parser branch on August 9, 2024 at 02:42
sfc-gh-mkeralapura pushed a commit to sfc-gh-mkeralapura/vllm that referenced this pull request Aug 12, 2024
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
fialhocoelho pushed a commit to opendatahub-io/vllm that referenced this pull request Aug 22, 2024
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)
Projects: none yet
5 participants