[core] Multi Step Scheduling #7000

SolitaryThinker · 2024-07-31T18:04:33Z

Adds initial multi step scheduling support to vLLM.
RFC: #6854

Current Status:

8/16: Initial support for chunked prefill thanks to @varun-sundar-rabindranath

8/14: Ready for another round of reviews! ~~please review #7452~~
8/8: multi-node working
8/6: PP+TP working; PP+ray fixed; ~~a few single GPU perf regressions (easy fix)~~
8/2 PP works with MP; Ready for initial pass on design
8/1 - PP is very close to working. We do get the desired interleaving of steps between microbatches which is great!
7/31 - Current branch is in very rough shape after getting the RFC design working. Will clean up after adding TP/PP support as there may be some refactors needed. However single GPU is ready for initial testing

Cmd:
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B --swap-space 16 --disable-log-requests --use-v2-block-manager --tensor-parallel-size 1 --worker-use-ray --pipeline-parallel-size 1 --gpu-memory-utilization 0.90 --num-scheduler-steps 8

Benchmark (8/16)
See: #7528
CP_1: Force Single Step: We force single step whenever a batch contains prefill requests. This may work well for offline batching, but not for online serving, because new requests keep arriving.

CP_2: Ignore Prefill (WIP): We ignore prefill requests from the second step onward, so prefill requests do nothing for the remaining (k-1) steps. This may work better for online serving.
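
The two chunked-prefill policies above can be pictured with a toy scheduling decision. This is only a minimal sketch: `Request`, `steps_for_batch_cp1`, and `active_requests_cp2` are hypothetical names made up for illustration and are not part of vLLM.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """Hypothetical stand-in for a scheduled request."""
    request_id: str
    is_prefill: bool


def steps_for_batch_cp1(batch: List[Request], num_scheduler_steps: int) -> int:
    """CP_1: fall back to one step if any request in the batch is a prefill."""
    if any(r.is_prefill for r in batch):
        return 1
    return num_scheduler_steps


def active_requests_cp2(batch: List[Request], step: int) -> List[Request]:
    """CP_2: every request runs in step 0; prefill requests idle afterwards."""
    if step == 0:
        return list(batch)
    return [r for r in batch if not r.is_prefill]
```

Under CP_1 a steady stream of new prefills keeps pushing the batch back to single-step, which is the online-serving concern raised above; CP_2 keeps all k steps but leaves prefill requests idle after the first one.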

| Single GPU | Baseline (Req/s) | Baseline+CP (Req/s) | MS-8 (Req/s) | MS-8+CP_1 (Req/s) |
| --- | --- | --- | --- | --- |
| A10G 8B Llama | 6.21 | - | 6.63 | - |
| H100 8B Llama | 25.96 | 27.82 | 44.44 | 31.4 |
| H100 30B Llama | 10.38 | 11.01 | 14.27 | 12.31 |

| PP=2 | Baseline (Req/s) | MS-4 (Req/s) | MS-8 (Req/s) | MS-12 (Req/s) | MS-16 (Req/s) |
| --- | --- | --- | --- | --- | --- |
| ~~A10G 8B Llama (microbatch=128)~~ | 8.98 | - | 9.99 | - | - |
| ~~H100 8B Llama~~ | 23 | - | 31 | - | - |
| ~~H100 70B Llama~~ | 3.09 | 3.13 | 3.13 | - | - |

| TP=2 | Baseline (Req/s) | MS-4 (Req/s) | MS-8 (Req/s) | MS-12 (Req/s) | MS-16 (Req/s) |
| --- | --- | --- | --- | --- | --- |
| ~~A10G 8B Llama~~ | 6.11 | - | 7.02 | - | - |

| TP=2, PP=2 | Baseline (Req/s) | MS-4 (Req/s) | MS-8 (Req/s) | MS-12 (Req/s) | MS-16 (Req/s) |
| --- | --- | --- | --- | --- | --- |
| ~~A10G 8B Llama (microbatch=128)~~ | 5.99 | - | 7.15 | - | - |
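
For a quick read of the single-GPU numbers, the relative gain of MS-8 over the baseline can be computed directly. The snippet below is just arithmetic on the figures already listed in the single-GPU table; the variable names are illustrative.

```python
# Single-GPU throughput in req/s, copied from the table: (baseline, MS-8).
results = {
    "A10G 8B Llama": (6.21, 6.63),
    "H100 8B Llama": (25.96, 44.44),
    "H100 30B Llama": (10.38, 14.27),
}

# Speedup = MS-8 throughput divided by baseline throughput.
speedups = {name: round(ms8 / base, 2) for name, (base, ms8) in results.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x")
```

This gives roughly 1.07x on A10G 8B, 1.71x on H100 8B, and 1.37x on H100 30B.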

TODO:
Milestone 1: POC

Milestone 2: Mergeable

Clean up data structures
use num_scheduler_steps
Add tests
Tests passing
Clean up model_runner.py, perhaps multi_step_model_runner.py?
Not a blocker, but [Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace #6971 will improve perf and is currently included in this PR.

Follow up work: Tracking Issue #7528

add logprob support in _pythonize_sampler_output
support chunked-prefill
support guided decoding
- add flag to enforce synchronous pythonization (for logit processors and guided decoding)
support spec-decode
support prefix caching
remove num_steps argument https://github.com/vllm-project/vllm/pull/7000/files#r1718684239

github-actions · 2024-07-31T18:04:45Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

Comment /ready on the PR
Add ready label to the PR
Enable auto-merge.

🚀

rkooo567 · 2024-08-01T02:39:26Z

QQ: do you plan to split PRs to smaller pieces?

SolitaryThinker · 2024-08-01T17:25:46Z

@rkooo567 If there are splits that make sense I will definitely do that. Currently working on a small part here #6971

benchmarks/backend_request_func.py

vllm/worker/multi_step_model_runner.py

vllm/worker/multi_step_worker.py

comaniac

The first batch of comments.

vllm/config.py

vllm/engine/arg_utils.py

vllm/engine/async_llm_engine.py

vllm/model_executor/layers/sampler.py

vllm/sequence.py

vllm/worker/multi_step_model_runner.py

vllm/worker/model_runner_base.py

SolitaryThinker · 2024-08-09T18:37:08Z

@zhuohan123 @rkooo567 @Yard1 @comaniac @alexm-neuralmagic rebased and ready for review

vllm/worker/multi_step_model_runner.py

SolitaryThinker · 2024-08-10T06:16:21Z

Working on a smaller PR that contains parts of this.

vllm/worker/worker_base.py

vllm/worker/worker.py

zhuohan123

First round of questions. Will add more tmrw.

vllm/config.py

vllm/sequence.py

vllm/worker/worker.py

vllm/worker/multi_step_worker.py

zhuohan123

Second batch of reviews

vllm/worker/multi_step_model_runner.py

zhuohan123 · 2024-08-12T04:49:55Z

vllm/worker/multi_step_model_runner.py

+        Execute the model for a single step and update multi-step
+        metadata
+        """
+        assert num_steps == 1, "MultiStepModelRunner only supports num_steps=1"


Does this assert mean the MultiStepModelRunner can only be run with one step? Can you elaborate on this?

MultiStepModelRunner only takes a single step internally before returning to AsyncLLMEngine, because the multi-step is done implicitly using stateful model inputs and SequenceGroup states.

Thanks for the explanation!

This is a bit confusing tho. IIRC, this was introduced by me for multi-step draft model runner? We should remove this argument and use stateful model inputs as the unified representation. Also cc @alexm-neuralmagic

yeah let's remove this argument

Let's do this in a follow up PR as it will involve spec decode as well. Will add to TODO tracker
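
The "single step per call, state carried across calls" idea discussed in this thread can be sketched as follows. This is a toy model under stated assumptions, not the PR's implementation: `ToyStatefulInput`, `ToyRunner`, and `engine_loop` are hypothetical names, and the "model" just records the step index.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyStatefulInput:
    """Hypothetical state carried across calls; not vLLM's real class."""
    num_steps: int
    current_step: int = 0
    cached_outputs: List[int] = field(default_factory=list)


class ToyRunner:
    def execute_model(self, state: ToyStatefulInput,
                      num_steps: int = 1) -> ToyStatefulInput:
        # Mirrors the assert quoted above: each call advances exactly one step.
        assert num_steps == 1, "ToyRunner only supports num_steps=1"
        state.cached_outputs.append(state.current_step)  # pretend output
        state.current_step += 1
        return state


def engine_loop(num_scheduler_steps: int) -> List[int]:
    """The caller, not the runner, drives the multi-step loop."""
    state = ToyStatefulInput(num_steps=num_scheduler_steps)
    runner = ToyRunner()
    while state.current_step < state.num_steps:
        runner.execute_model(state)
    return state.cached_outputs
```

In this picture the `num_steps` argument carries no information, since the number of steps lives in the state object, which is why the reviewers suggest removing it.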

vllm/worker/multi_step_model_runner.py

vllm/worker/worker_base.py

vllm/engine/async_llm_engine.py

comaniac

LGTM for me. Leave to @zhuohan123

zhuohan123

Thanks for the hard work! In general LGTM. Please see my comments.

zhuohan123 · 2024-08-17T07:04:36Z

vllm/engine/arg_utils.py

+            if not self.use_v2_block_manager:
+                raise ValueError("BlockSpaceManagerV2 is required for "
+                                 "multi-step (--num-scheduler-steps > 1)")


Can we auto-correct to v2 block manager and print a warning here?
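
One way to read this suggestion is sketched below. It is only an illustration: `ToyEngineArgs` is a hypothetical stand-in for the real argument class, and the warning text is made up.

```python
import warnings
from dataclasses import dataclass


@dataclass
class ToyEngineArgs:
    """Hypothetical subset of engine arguments, for illustration only."""
    num_scheduler_steps: int = 1
    use_v2_block_manager: bool = False

    def __post_init__(self) -> None:
        # Instead of raising ValueError, switch the setting and tell the user.
        if self.num_scheduler_steps > 1 and not self.use_v2_block_manager:
            warnings.warn(
                "Multi-step (--num-scheduler-steps > 1) requires "
                "BlockSpaceManagerV2; enabling it automatically.")
            self.use_v2_block_manager = True
```

The trade-off is that silently changing a user-supplied setting can surprise people, so the warning should say clearly what was changed.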

vllm/engine/async_llm_engine.py

zhuohan123 · 2024-08-17T07:21:34Z

vllm/sequence.py

@@ -997,7 +996,7 @@ class SamplerOutput:

    # On-device tensor containing the sampled token ids.
    sampled_token_ids: Optional[torch.Tensor] = None
-    sampled_token_ids_numpy: Optional[numpy.ndarray] = None
+    sampled_token_ids_cpu: Optional[torch.Tensor] = None


Add a comment to explain why we need this variable?

zhuohan123 · 2024-08-17T07:39:48Z

vllm/worker/multi_step_model_runner.py

+class MutableModelInputForGPUWithMultiStepMetadata(BroadcastableModelInput):
+    # actual frozen model input dataclass passed to _base_model_runner
+    frozen_model_input: Optional[ModelInputForGPUWithSamplingMetadata] = None
+
+    # list of model outputs for each step, may not be all pythonized
+    outputs: List[ModelOutput] = field(default_factory=list)


If outputs is a part of this data structure, calling this class MutableModelInput seems confusing?

Also the current class name is probably a bit too long. Maybe something like ModelRequests? Feel free to use any other name that makes more sense here.

Renamed to StatefulModelInputs. The outputs field is really a cache that is needed for the next step, so it is renamed to cached_outputs.
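
The role of the cached outputs ("may not be all pythonized", needed for the next step) can be illustrated with a small sketch. All names here are hypothetical, and plain lists stand in for on-device tensors.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CachedOutput:
    """One step's result; `raw` stands in for an on-device tensor."""
    raw: List[int]
    pythonized: Optional[List[int]] = None


@dataclass
class ToyStatefulModelInput:
    cached_outputs: List[CachedOutput] = field(default_factory=list)

    def add_output(self, raw: List[int]) -> None:
        self.cached_outputs.append(CachedOutput(raw=raw))

    def last_raw(self) -> List[int]:
        # The next step only needs the most recent raw result.
        return self.cached_outputs[-1].raw

    def pythonize_pending(self) -> int:
        """Convert outputs not yet converted; return how many were converted."""
        converted = 0
        for out in self.cached_outputs:
            if out.pythonized is None:
                out.pythonized = list(out.raw)
                converted += 1
        return converted
```

Deferring the conversion lets several steps run back to back while the Python-side objects are built later, which is one way to understand why the structure behaves like a cache.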

zhuohan123 · 2024-08-17T07:45:44Z

vllm/worker/multi_step_model_runner.py

+        # Update GPU tensors
+        ops.advance_step(
+            num_seqs=num_seqs,
+            num_queries=num_queries,
+            block_size=self.block_size,
+            input_tokens=frozen_model_input.input_tokens,
+            sampled_token_ids=model_input.outputs[-1].sampled_token_ids,
+            input_positions=frozen_model_input.input_positions,
+            seq_lens=attn_metadata.seq_lens_tensor,
+            slot_mapping=attn_metadata.slot_mapping,
+            block_tables=attn_metadata.block_tables)


Not a review comment, just a question: Is this op attention-backend specific?

cc @WoosukKwon

yes, eventually we will move this into attention backends API, but it may involve some refactoring to do cleanly. See some initial work done here: #7571
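
To make the question concrete, here is a pure-Python reference of what an in-place "advance step" over decode inputs might do. It is a sketch under assumptions: the parameter names follow the call quoted above, but the body is an illustrative guess at paged-attention bookkeeping and not the actual kernel.

```python
from typing import List


def advance_step_reference(num_seqs: int, num_queries: int, block_size: int,
                           input_tokens: List[int],
                           sampled_token_ids: List[int],
                           input_positions: List[int],
                           seq_lens: List[int],
                           slot_mapping: List[int],
                           block_tables: List[List[int]]) -> None:
    """Update per-sequence decode inputs in place for the next step."""
    assert num_queries <= num_seqs  # assumption: extra entries are padding
    for i in range(num_queries):
        # The token sampled in the last step becomes the next input token.
        input_tokens[i] = sampled_token_ids[i]
        # The new token sits at the position equal to the old length.
        next_pos = seq_lens[i]
        seq_lens[i] += 1
        input_positions[i] = next_pos
        # Map the logical position to a physical KV-cache slot.
        block = block_tables[i][next_pos // block_size]
        slot_mapping[i] = block * block_size + next_pos % block_size
```

The slot computation depends on how the backend lays out its KV cache, which is why this kind of update is tied to the attention backend.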

vllm/worker/multi_step_model_runner.py

zhuohan123 · 2024-08-17T07:57:59Z

vllm/worker/multi_step_worker.py

+            model_input.last_sampled_token_ids = (
+                execute_model_req.last_sampled_token_ids.cuda())
+            model_input.add_sampler_output(
+                SamplerOutput(outputs=[], sampled_token_ids=None),
+                model_input.last_sampled_token_ids)


If we are not the last pipeline stage, why would we need to know last_sampled_token_ids? We are not running the sampler if we are not the last pipeline stage right?

Yes, however non-last PP stages need to use last_sampled_token_ids to perform in-place advance_step on GPU. And this is where we append to model_inputs so that every rank sees a consistent sampled_token_ids for the last step

zhuohan123 · 2024-08-17T08:04:54Z

Also before merge, can you please verify the throughput (tokens/sec) gain in the following settings to make sure the PR is good performance-wise:

ShareGPT + Llama 8B + 1x H100/A100
ShareGPT + Llama 70B + 8x H100/A100

Also, can you note which dataset you used in your original benchmark? Thanks!

afeldman-nm · 2024-08-18T17:45:06Z

tests/multi_step/test_correctness.py

+                                       server_cli_args: List[str]):
+
+    outputs = None
+    with RemoteOpenAIServer(model_name, server_cli_args) as server:


@SolitaryThinker no need to block on this feedback - but if you have time - I would propose adding an example/offline_inference_multi_step.py example which instantiates an engine instance with multi-step enabled. Similar in structure to example/offline_inference.py.

An example of why this is useful - as part of the logprobs workstream, I am trying to step through the multi-step model runner with the python debugger & examine the output logprobs. I am using your multi_step/test_correctness.py in order to set up a server with multi-step enabled.

However, multi_step/test_correctness.py is an end-to-end client/server test & it is not straightforward (although technically doable) to step through the server code with the debugger because the server is in another process.

I will get around this by writing a short script which sets up an engine instance with multi-step enabled.

However, for someone else who is approaching this code for the first time, it could be helpful to have an example file (or unit test) which just sets up an engine instance with multi-step enabled and invokes inference using LLM.generate(). This could be a good way to facilitate quick debugging & also gives insight into how the server works.

Here is the offline_inference_multi_step.py script I wrote for myself to facilitate debugging, if you would like to use it.

'''
Example of setting up LLM with multi-step enabled.
In actuality, async engine would be a more sensible choice
for a real use-case. However this example is useful
for demonstration & debugging of multi-step code.
'''

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(
    model="JackFram/llama-160m",
    swap_space=16,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
    num_scheduler_steps=8,
    use_v2_block_manager=True,
)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

vllm/worker/multi_step_model_runner.py

Yard1 · 2024-08-19T17:39:54Z

vllm/worker/multi_step_model_runner.py

+            output[0].sampled_token_ids = None
+            output[0].sampled_token_probs = None
+            output[0].logprobs = None


I do wonder if there's a more generic way of doing this. If this data structure gets modified somewhere else it will not be reflected here. Maybe a loop where we check the device if the object is a tensor?

These are optionals and only set if include_gpu_probs_tensor is set in the sampler.
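
The more generic approach suggested here could look roughly like the sketch below. It assumes the output object is a dataclass; `ToySamplerOutput`, `FakeGpuTensor`, and `clear_device_fields` are hypothetical names, and the tensor check is passed in as a predicate so the example runs without torch.

```python
from dataclasses import dataclass, fields
from typing import Any, Callable, List, Optional


class FakeGpuTensor:
    """Stand-in for a device tensor so the sketch runs without torch."""


@dataclass
class ToySamplerOutput:
    outputs: list
    sampled_token_ids: Optional[Any] = None
    sampled_token_probs: Optional[Any] = None
    logprobs: Optional[Any] = None


def clear_device_fields(obj: Any,
                        is_device_tensor: Callable[[Any], bool]) -> List[str]:
    """Set every field holding a device tensor to None; return their names."""
    cleared = []
    for f in fields(obj):
        if is_device_tensor(getattr(obj, f.name)):
            setattr(obj, f.name, None)
            cleared.append(f.name)
    return cleared
```

With real tensors the predicate could be something like `lambda v: isinstance(v, torch.Tensor) and v.is_cuda`, so fields added to the data structure later are picked up without touching this code.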

remove some redundant test cases set v2 blockmananger and fix rebase
Update vllm/engine/async_llm_engine.py Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Update vllm/engine/async_llm_engine.py Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Update vllm/worker/multi_step_model_runner.py Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
add comment typo rename to StatefulModelInput renamed outputs to cached_outputs
Update vllm/worker/multi_step_model_runner.py Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>

SolitaryThinker · 2024-08-19T18:05:20Z

Also before merge, can you please verify the throughput (tokens/sec) gain in the following settings to make sure the PR is good performance-wise:

ShareGPT + Llama 8B + 1x H100/A100
ShareGPT + Llama 70B + 8x H100/A100
Also, can you add what are the dataset you are using in your original benchmark? Thanks!

@zhuohan123
I'm using ShareGPT for all the numbers, benchmarked with the benchmark_serving.py script.
See below for single GPU numbers.

zhuohan123

LGTM! Thanks for the hard work! Please make sure to keep track of the TODOs we discussed in this PR.

WoosukKwon · 2024-08-20T09:07:37Z

[rank0]: File "/data/woosuk/workspace/vllm/vllm/engine/output_processor/multi_step.py", line 88, in process_outputs
[rank0]: assert valid_samples

@SolitaryThinker Huge thanks for the PR! QQ: I got the above error when running benchmark scripts with num_scheduler_steps > 1. Is this expected?

jiqing-feng · 2024-08-21T08:06:19Z

Hi @WoosukKwon. I see spec decode also has a class named MultiStepWorker. Is there any relation to the MultiStepWorker from vllm/worker/multi_step_worker.py in this PR?

comaniac reviewed Aug 2, 2024

benchmarks/backend_request_func.py (outdated, resolved)

alexm-neuralmagic reviewed Aug 5, 2024

vllm/worker/multi_step_model_runner.py (outdated, resolved)

vllm/worker/multi_step_model_runner.py (outdated, resolved)

vllm/worker/multi_step_worker.py (outdated, resolved)

comaniac reviewed Aug 5, 2024

jon-chuang reviewed Aug 5, 2024

vllm/worker/multi_step_model_runner.py (outdated, resolved)

jon-chuang reviewed Aug 5, 2024

vllm/worker/multi_step_model_runner.py (resolved)

SolitaryThinker force-pushed the multi-step branch from e1f1f19 to d820deb Compare August 8, 2024 00:35

SolitaryThinker requested a review from comaniac August 8, 2024 03:23

jon-chuang mentioned this pull request Aug 8, 2024

[RFC]: Multi-Step Scheduling #6854

Open

SolitaryThinker force-pushed the multi-step branch from 40d5e5f to 5789d64 Compare August 9, 2024 05:19

SolitaryThinker marked this pull request as ready for review August 9, 2024 05:36

SolitaryThinker requested review from alexm-neuralmagic, richardliaw and jon-chuang August 9, 2024 05:36

jon-chuang reviewed Aug 9, 2024

vllm/worker/multi_step_model_runner.py (outdated, resolved)

Yard1 reviewed Aug 9, 2024

vllm/worker/model_runner_base.py (outdated, resolved)

SolitaryThinker force-pushed the multi-step branch from 7771e1c to c1b0e0a Compare August 9, 2024 18:35

SolitaryThinker changed the title ~~[WIP] [core] Multi Step Scheduling~~ [core] Multi Step Scheduling Aug 9, 2024

jon-chuang reviewed Aug 9, 2024

vllm/worker/multi_step_model_runner.py (outdated, resolved)

SolitaryThinker force-pushed the multi-step branch from 119b4de to bda8e68 Compare August 9, 2024 22:02

SolitaryThinker mentioned this pull request Aug 10, 2024

[core] [2/N] refactor worker_base input preparation for multi-step #7387

Merged

SolitaryThinker commented Aug 10, 2024

vllm/worker/worker_base.py (outdated, resolved)

SolitaryThinker commented Aug 10, 2024

vllm/worker/worker.py (outdated, resolved)

zhuohan123 reviewed Aug 11, 2024

vllm/config.py (outdated, resolved)

vllm/sequence.py (outdated, resolved)

vllm/sequence.py (outdated, resolved)

vllm/worker/worker.py (outdated, resolved)

vllm/worker/multi_step_worker.py (outdated, resolved)

zhuohan123 reviewed Aug 12, 2024

SolitaryThinker force-pushed the multi-step branch from 32072f7 to 67c1907 Compare August 12, 2024 07:41

varun-sundar-rabindranath mentioned this pull request Aug 16, 2024

Varun/multi step chunked prefill #7563

Closed

SolitaryThinker force-pushed the multi-step branch from 3351973 to 32b03b7 Compare August 16, 2024 21:56

SolitaryThinker requested a review from comaniac August 16, 2024 22:21

comaniac approved these changes Aug 16, 2024

SolitaryThinker force-pushed the multi-step branch from b799db0 to 8ed980b Compare August 17, 2024 06:20

zhuohan123 reviewed Aug 17, 2024

afeldman-nm reviewed Aug 18, 2024

SolitaryThinker force-pushed the multi-step branch 2 times, most recently from 79e9b54 to 2a07b6c Compare August 19, 2024 02:56

SolitaryThinker requested a review from zhuohan123 August 19, 2024 04:45

SolitaryThinker force-pushed the multi-step branch from 2a07b6c to 9f4cf17 Compare August 19, 2024 04:50

afeldman-nm reviewed Aug 19, 2024

vllm/worker/multi_step_model_runner.py (outdated, resolved)

afeldman-nm mentioned this pull request Aug 19, 2024

[RFC]: Encoder/decoder models & feature compatibility #7366

Open

Yard1 reviewed Aug 19, 2024

SolitaryThinker and others added 2 commits August 19, 2024 10:55

format

5fac4a1

SolitaryThinker force-pushed the multi-step branch from a14fbab to 5fac4a1 Compare August 19, 2024 17:55

zhuohan123 approved these changes Aug 19, 2024

Yard1 approved these changes Aug 19, 2024

Yard1 merged commit 47b65a5 into vllm-project:main Aug 19, 2024
65 checks passed

SolitaryThinker mentioned this pull request Aug 19, 2024

[Core] Logprobs support in Multi-step #7652

Merged

SolitaryThinker deleted the multi-step branch August 19, 2024 22:16

zifeitong pushed a commit to zifeitong/vllm that referenced this pull request Aug 20, 2024

[core] Multi Step Scheduling (vllm-project#7000)

d44acb9

Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>

fialhocoelho pushed a commit to opendatahub-io/vllm that referenced this pull request Aug 22, 2024

[core] Multi Step Scheduling (vllm-project#7000)

d3cf5a4

Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>

omrishiv pushed a commit to omrishiv/vllm that referenced this pull request Aug 26, 2024

[core] Multi Step Scheduling (vllm-project#7000)

320bb2b

Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>

youkaichao mentioned this pull request Sep 4, 2024

[Performance]: The impact of CPU on vLLM performance is significant. #8147

Open
