Pull from head #1

Merged: 124 commits, May 29, 2024

Changes from 1 commit

Commits (124)
230c4b3
[CI/Test] fix swap test for multi gpu (#4689)
youkaichao May 8, 2024
89579a2
[Misc] Use vllm-flash-attn instead of flash-attn (#4686)
WoosukKwon May 8, 2024
f942efb
[Dynamic Spec Decoding] Auto-disable by the running queue size (#4592)
comaniac May 8, 2024
8b9241b
[Speculative decoding] [Bugfix] Fix overallocation in ngram + spec lo…
cadedaniel May 8, 2024
e288df0
[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (…
alexm-neuralmagic May 9, 2024
16bc0a0
[Frontend] add tok/s speed metric to llm class when using tqdm (#4400)
MahmoudAshraf97 May 9, 2024
f12b20d
[Frontend] Move async logic outside of constructor (#4674)
DarkLight1337 May 9, 2024
190bc83
[Misc] Remove unnecessary ModelRunner imports (#4703)
WoosukKwon May 9, 2024
0ee535b
[Misc] Set block size at initialization & Fix test_model_runner (#4705)
WoosukKwon May 9, 2024
ff5abcd
[ROCm] Add support for Punica kernels on AMD GPUs (#3140)
kliuae May 9, 2024
a3c1245
[Bugfix] Fix CLI arguments in OpenAI server docs (#4709)
DarkLight1337 May 9, 2024
cea6443
[Bugfix] Update grafana.json (#4711)
robertgshaw2-neuralmagic May 9, 2024
be0c518
[Bugfix] Add logs for all model dtype casting (#4717)
mgoin May 9, 2024
ebce310
[Model] Snowflake arctic model implementation (#4652)
sfc-gh-hazhang May 9, 2024
379da6d
[Kernel] [FP8] Improve FP8 linear layer performance (#4691)
pcmoritz May 9, 2024
c833101
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535)
comaniac May 10, 2024
208b71b
[Core][Distributed] refactor pynccl (#4591)
youkaichao May 10, 2024
e965d46
[Misc] Keep only one implementation of the create_dummy_prompt functi…
AllenDou May 10, 2024
51d4094
chunked-prefill-doc-syntax (#4603)
simon-mo May 10, 2024
64b77df
[Core]fix type annotation for `swap_blocks` (#4726)
jikunshang May 10, 2024
dac6a3f
[Misc] Apply a couple g++ cleanups (#4719)
stevegrubb May 10, 2024
6a0f617
[Core] Fix circular reference which leaked llm instance in local dev …
rkooo567 May 10, 2024
706588a
[Bugfix] Fix CLI arguments in OpenAI server docs (#4729)
AllenDou May 10, 2024
2e7796f
[Speculative decoding] CUDA graph support (#4295)
heeju-kim2 May 10, 2024
fcc2994
[CI] Nits for bad initialization of SeqGroup in testing (#4748)
robertgshaw2-neuralmagic May 10, 2024
4e12131
[Core][Test] fix function name typo in custom allreduce (#4750)
youkaichao May 10, 2024
e254497
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734)
CatherineSue May 11, 2024
6eaccb7
[Model] Add support for IBM Granite Code models (#4636)
yikangshen May 12, 2024
a709e87
[CI/Build] Tweak Marlin Nondeterminism Issues (#4713)
robertgshaw2-neuralmagic May 13, 2024
a7be4d0
[CORE] Improvement in ranks code (#4718)
SwapnilDreams100 May 13, 2024
702bee4
[Core][Distributed] refactor custom allreduce to support multiple tp …
youkaichao May 13, 2024
350f9e1
[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425)
DarkLight1337 May 13, 2024
e7c46b9
[Scheduler] Warning upon preemption and Swapping (#4647)
rkooo567 May 13, 2024
0fca3cd
[Misc] Enhance attention selector (#4751)
WoosukKwon May 13, 2024
8bc68e1
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, u…
sangstar May 13, 2024
ce532ff
[Speculative decoding] Improve n-gram efficiency (#4724)
comaniac May 13, 2024
1356df5
[Kernel] Use flash-attn for decoding (#3648)
skrider May 13, 2024
33d3914
[Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793)
pcmoritz May 13, 2024
ac1fbf7
[Doc] Shorten README by removing supported model list (#4796)
zhuohan123 May 13, 2024
4bfa7e7
[Doc] Add API reference for offline inference (#4710)
DarkLight1337 May 14, 2024
c579b75
[Doc] Add meetups to the doc (#4798)
zhuohan123 May 14, 2024
ccb63a8
[Core][Hash][Automatic Prefix caching] Accelerating the hashing funct…
KuntaiDu May 14, 2024
dc72402
[Bugfix][Doc] Fix CI failure in docs (#4804)
DarkLight1337 May 14, 2024
676a999
[Core] Add MultiprocessingGPUExecutor (#4539)
njhill May 14, 2024
29bc01b
Add 4th meetup announcement to readme (#4817)
simon-mo May 14, 2024
8a7cc25
Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820)
rkooo567 May 15, 2024
65bf2ac
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill …
rkooo567 May 15, 2024
e9cdd2b
[CI/Build] Further decouple HuggingFace implementation from ours duri…
DarkLight1337 May 15, 2024
a5675d3
[Bugfix] Properly set distributed_executor_backend in ParallelConfig …
zifeitong May 15, 2024
361c461
[Doc] Highlight the fourth meetup in the README (#4842)
zhuohan123 May 15, 2024
fc0d9df
[Frontend] Re-enable custom roles in Chat Completions API (#4758)
DarkLight1337 May 15, 2024
52f8107
[Frontend] Support OpenAI batch file format (#4794)
wuisawesome May 15, 2024
30e7543
[Core] Implement sharded state loader (#4690)
aurickq May 16, 2024
973617a
[Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840)
comaniac May 16, 2024
5c34257
Add marlin unit tests and marlin benchmark script (#4815)
alexm-neuralmagic May 16, 2024
99caa49
[Kernel] add bfloat16 support for gptq marlin kernel (#4788)
jinzhen-lin May 16, 2024
dbc0754
[docs] Fix typo in examples filename openi -> openai (#4864)
wuisawesome May 16, 2024
5e0391c
[Frontend] Separate OpenAI Batch Runner usage from API Server (#4851)
wuisawesome May 16, 2024
9216b9c
[Bugfix] Bypass authorization API token for preflight requests (#4862)
dulacp May 16, 2024
6979ade
Add GPTQ Marlin 2:4 sparse structured support (#4790)
alexm-neuralmagic May 16, 2024
f09edd8
Add JSON output support for benchmark_latency and benchmark_throughpu…
simon-mo May 16, 2024
b5853f9
[ROCm][AMD][Bugfix] adding a missing triton autotune config (#4845)
hongxiayang May 16, 2024
e081880
[Core][Distributed] remove graph mode function (#4818)
youkaichao May 16, 2024
10fa9ee
[Misc] remove old comments (#4866)
youkaichao May 16, 2024
8435b20
[Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850)
Silencioo May 16, 2024
2060e93
[Kernel] Add w8a8 CUTLASS kernels (#4749)
tlrmchlsmth May 16, 2024
9a31a81
[Bugfix] Fix FP8 KV cache support (#4869)
WoosukKwon May 16, 2024
8e7fb5d
Support to serve vLLM on Kubernetes with LWS (#4829)
kerthcet May 16, 2024
0150a10
[Frontend] OpenAI API server: Do not add bos token by default when en…
bofenghuang May 17, 2024
2614812
[Build/CI] Extending the set of AMD tests with Regression, Basic Corr…
Alexei-V-Ivanov-AMD May 17, 2024
33e0823
[Bugfix] fix rope error when load models with different dtypes (#4835)
jinzhen-lin May 17, 2024
48d5985
Sync huggingface modifications of qwen Moe model (#4774)
eigen2017 May 17, 2024
c5711ef
[Doc] Update Ray Data distributed offline inference example (#4871)
Yard1 May 17, 2024
86b45ae
[Bugfix] Relax tiktoken to >= 0.6.0 (#4890)
mgoin May 17, 2024
c0724fc
[ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if…
alexeykondrat May 18, 2024
2e9a222
[Lora] Support long context lora (#4787)
rkooo567 May 18, 2024
f68470e
[Bugfix][Model] Add base class for vision-language models (#4809)
DarkLight1337 May 19, 2024
27ce854
[Kernel] Add marlin_24 unit tests (#4901)
alexm-neuralmagic May 19, 2024
b57e6c5
[Kernel] Add flash-attn back (#4907)
WoosukKwon May 20, 2024
6287537
[Model] LLaVA model refactor (#4910)
DarkLight1337 May 20, 2024
da5a0b5
Remove marlin warning (#4918)
alexm-neuralmagic May 20, 2024
546a97e
[Misc]: allow user to specify port in distributed setting (#4914)
ZwwWayne May 20, 2024
943e72c
[Build/CI] Enabling AMD Entrypoints Test (#4834)
Alexei-V-Ivanov-AMD May 20, 2024
f0eecee
[Bugfix] Fix dummy weight for fp8 (#4916)
mzusman May 20, 2024
1937e29
[Core] Sharded State Loader download from HF (#4889)
aurickq May 20, 2024
c3af447
[Doc]Add documentation to benchmarking script when running TGI (#4920)
KuntaiDu May 20, 2024
65ae8c2
[Core] Fix scheduler considering "no LoRA" as "LoRA" (#4897)
Yard1 May 21, 2024
d130b57
[Model] add rope_scaling support for qwen2 (#4930)
hzhwcmhf May 21, 2024
f12c3b5
[Model] Add Phi-2 LoRA support (#4886)
Isotr0py May 21, 2024
e941f88
[Docs] Add acknowledgment for sponsors (#4925)
simon-mo May 21, 2024
757b62c
[CI/Build] Codespell ignore `build/` directory (#4945)
mgoin May 21, 2024
14772ee
[Bugfix] Fix flag name for `max_seq_len_to_capture` (#4935)
kerthcet May 21, 2024
99eff67
[Bugfix][Kernel] Add head size check for attention backend selection …
Isotr0py May 21, 2024
9b9a10d
[Frontend] Dynamic RoPE scaling (#4638)
sasha0552 May 22, 2024
5f6d10c
[CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#…
mgoin May 22, 2024
c74c913
[misc] remove comments that were supposed to be removed (#4977)
rkooo567 May 22, 2024
8674f98
[Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954)
tlrmchlsmth May 22, 2024
a3a73ab
[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893)
comaniac May 22, 2024
97b0300
[Model] LoRA gptbigcode implementation (#3949)
raywanb May 22, 2024
eb6d3c2
[Core] Eliminate parallel worker per-step task scheduling overhead (#…
njhill May 22, 2024
a36de68
[Minor] Fix small typo in llama.py: QKVParallelLinear -> Quantization…
pcmoritz May 22, 2024
ee3eea0
[Misc] Take user preference in attention selector (#4960)
comaniac May 22, 2024
6066253
Marlin 24 prefill performance improvement (about 25% better on averag…
alexm-neuralmagic May 23, 2024
2ba80be
[Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is n…
LetianLee May 23, 2024
5eda2ea
[Core][1/N] Support send/recv in PyNCCL Groups (#4988)
andoorve May 23, 2024
a124232
[Kernel] Initial Activation Quantization Support (#4525)
dsikka May 23, 2024
e3470f8
[Core]: Option To Use Prompt Token Ids Inside Logits Processor (#4985)
kezouke May 23, 2024
6a50f4c
[Doc] add ccache guide in doc (#5012)
youkaichao May 23, 2024
9197709
[Bugfix] Fix Mistral v0.3 Weight Loading (#5005)
robertgshaw2-neuralmagic May 24, 2024
e64fde4
[Core][Bugfix]: fix prefix caching for blockv2 (#4764)
leiwen83 May 24, 2024
8e192ff
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3…
linxihui May 25, 2024
325c119
[Misc] add logging level env var (#5045)
youkaichao May 25, 2024
d5a1697
[Dynamic Spec Decoding] Minor fix for disabling speculative decoding …
LiuXiaoxuanPKU May 25, 2024
f17a1a8
[Misc] Make Serving Benchmark More User-friendly (#5044)
ywang96 May 25, 2024
1102bef
[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846)
zhuohan123 May 27, 2024
fbdb7b3
[Core] Allow AQLM on Pascal (#5058)
sasha0552 May 27, 2024
890aa93
[Model] Add support for falcon-11B (#5069)
Isotr0py May 27, 2024
d4f3985
[Core] Sliding window for block manager v2 (#4545)
mmoskal May 28, 2024
9ba4155
[BugFix] Fix Embedding Models with TP>1 (#5075)
robertgshaw2-neuralmagic May 28, 2024
dd8de11
[Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X (#4951)
divakar-amd May 28, 2024
290f4ad
[Docs] Add Dropbox as sponsors (#5089)
simon-mo May 28, 2024
5ae5ed1
[Core] Consolidate prompt arguments to LLM engines (#4328)
DarkLight1337 May 28, 2024
dfba529
[Bugfix] Remove the last EOS token unless explicitly specified (#5077)
jsato8094 May 29, 2024
616e600
[Misc] add gpu_memory_utilization arg (#5079)
pandyamarut May 29, 2024
[Core][Bugfix]: fix prefix caching for blockv2 (vllm-project#4764)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
leiwen83 and wenlei03 committed May 24, 2024
commit e64fde4b013cb8bb2321f59ba78aca50b02071cb
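
This commit fixes prefix caching for block manager v2. Per the diff below, promoting a mutable block whose content hash is already cached now releases the duplicate physical block back into the hashless allocator and increments the cached block's refcount, and eviction asserts that the evicted block's refcount is zero before dropping its _cached_blocks entry. The following snippet is a minimal sketch of the promotion scenario the new test exercises; it assumes the PrefixCachingBlockAllocator API used in the diff (allocate_immutable, allocate_mutable, append_token_ids) and an import path inferred from vllm/core/block/prefix_caching_block.py, both of which may differ across vLLM versions.

# Minimal sketch (not part of this PR) of the promotion scenario tested below.
# The import path and API are inferred from the diff and may vary by version.
from vllm.core.block.prefix_caching_block import PrefixCachingBlockAllocator

block_size = 16
allocator = PrefixCachingBlockAllocator(num_blocks=3, block_size=block_size)
token_ids = list(range(block_size))

# The first immutable block owns the content hash for token_ids.
first = allocator.allocate_immutable(prev_block=None, token_ids=token_ids)

# Fill a mutable block with the same tokens; appending the final token
# promotes it to an immutable block with the same content hash.
mutable = allocator.allocate_mutable(prev_block=None)
for tok in token_ids:
    mutable.append_token_ids([tok])

# With the fix, the promoted block is deduplicated onto first's physical
# block and the spare physical block returns to the hashless allocator.
assert mutable.block_id == first.block_id
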
117 changes: 117 additions & 0 deletions tests/core/block/test_prefix_caching_block.py
@@ -410,6 +410,123 @@ def test_get_common_computed_block_ids(num_blocks: int, block_size: int,

assert (len(res) == zero_point_blocks)

# Test case asserting that when a mutable block is promoted to an
# immutable block whose content already exists, the promoted block is
# freed back into the hashless allocator while the first immutable
# block's refcount is increased.
@staticmethod
@pytest.mark.parametrize("num_blocks", [3])
@pytest.mark.parametrize("block_size", [16])
@pytest.mark.parametrize("seed", list(range(10)))
def test_alloc_promotion(num_blocks: int, block_size: int, seed: int):
random.seed(seed)

allocator = PrefixCachingBlockAllocator(num_blocks=num_blocks,
block_size=block_size)
token_ids = list(range(block_size))

block = allocator.allocate_immutable(prev_block=None,
token_ids=token_ids)

assert allocator._refcounter.get(block.block_id) == 1
m = allocator.allocate_mutable(prev_block=None)

block_id = m.block_id
for i in range(block_size):
m.append_token_ids([i])
# After the block is promoted from mutable to immutable, if a block
# with the same content hash already exists, the promoted block shall
# be released into the hashless_allocator and the first immutable
# block's refcount shall be increased by 1.
assert m.block_id == block.block_id
assert block_id in allocator._hashless_allocator._free_block_indices
assert allocator._refcounter.get(block.block_id) == 2

# Test case mixing eviction and allocation to make sure
# they work as expected.
@staticmethod
@pytest.mark.parametrize("num_blocks", [3])
@pytest.mark.parametrize("block_size", [16])
@pytest.mark.parametrize("seed", list(range(10)))
def test_eviction_alloc_mixed(num_blocks: int, block_size: int, seed: int):
random.seed(seed)

all_blocks_list = [i for i in range(num_blocks)]
zero_ref = {i: 0 for i in range(num_blocks)}
allocator = PrefixCachingBlockAllocator(num_blocks=num_blocks,
block_size=block_size)
token_ids = list(range(num_blocks * block_size))

# Now we have num_blocks free blocks in the hashless allocator;
# the internal tracking structures _blocks, _cached_blocks and the
# evictor are empty, and every block's refcount shall be 0.
assert list(allocator._hashless_allocator._free_block_indices
) == all_blocks_list
assert len(allocator._blocks.keys()) == 0
assert len(allocator._cached_blocks.values()) == 0
assert len(allocator.evictor.free_table.keys()) == 0
assert allocator._refcounter._refcounts == zero_ref

# Allocate immutable chains, each consisting of a single block.
new_block = []
for i in range(num_blocks):
block = allocator.allocate_immutable(
prev_block=None,
token_ids=token_ids[block_size * i:block_size * (i + 1)])
new_block.append(block)

# Free all blocks; now all blocks shall be in the evictor,
# no tracking data shall be left in _blocks,
# all blocks shall be tracked in _cached_blocks,
# and all blocks' refcounts shall be zero.
for block in new_block:
allocator.free(block)

assert len(allocator._blocks.keys()) == 0
assert len(allocator._hashless_allocator._free_block_indices) == 0
assert list(allocator._cached_blocks.values()) == all_blocks_list
assert list(allocator.evictor.free_table.keys()) == all_blocks_list
assert allocator._refcounter._refcounts == zero_ref

# Allocate a mutable block; the first block shall be evicted,
# its content hash set to None, and its refcount set to 1.
mutable = allocator.allocate_mutable(prev_block=None)

assert mutable.block_id == 0
assert mutable.content_hash is None
assert 0 in allocator._blocks
assert allocator._refcounter.get(0) == 1
assert 0 not in allocator._cached_blocks
assert 0 not in allocator.evictor

# Since this mutable block has no hash yet, it shall be released into
# the hashless allocator.
allocator.free(mutable)

assert len(allocator._blocks.keys()) == 0
assert allocator._refcounter._refcounts == zero_ref
assert 0 not in allocator._cached_blocks
assert 0 not in allocator.evictor
assert 0 in allocator._hashless_allocator._free_block_indices

# When allocating an immutable block with the first block_size tokens,
# we shall get a free block from the hashless allocator, leaving no
# blocks in it.
block = allocator.allocate_immutable(prev_block=None,
token_ids=token_ids[:block_size])

assert block.block_id == 0
assert len(allocator._hashless_allocator._free_block_indices) == 0
assert 0 in allocator._blocks
assert 0 in allocator._cached_blocks.values()
assert allocator._refcounter.get(0) == 1
assert 0 not in allocator.evictor

# Allocate a mutable block again; it shall be popped from the evictor.
mutable = allocator.allocate_mutable(prev_block=None)
assert len(allocator._hashless_allocator._free_block_indices) == 0
assert mutable.block_id not in allocator.evictor.free_table
assert allocator._refcounter.get(mutable.block_id) == 1

# Test case where two last accessed times are equal
@staticmethod
@pytest.mark.parametrize("num_blocks", [1024])
41 changes: 24 additions & 17 deletions vllm/core/block/prefix_caching_block.py
@@ -160,21 +160,17 @@ def allocate_mutable(self,
# If the evictor has blocks available for eviction, evict a block
# and return it.
if self.evictor.num_blocks > 0:
# Here we get an evicted block, which is only added to the
# evictor if its refcount is 0. Since its content is about to
# change, we need to remove it from _cached_blocks' tracking list.
block_id, content_hash_to_evict = self.evictor.evict()

# Here we may have a scenario where several blocks have
# the same content hash, but because the later block came
# through the mutable-to-immutable path, its physical
# block was added into the evictor.
# However, in this case we shall not pop from _cached_blocks,
# as the same content is still used by others, which means
# we need to check the refcount before deciding to pop the entry.

_block_id = self._cached_blocks[content_hash_to_evict]
refcount = self._refcounter.get(_block_id)
if refcount == 1:
self._cached_blocks.pop(content_hash_to_evict)
assert _block_id == block_id
assert self._refcounter.get(_block_id) == 0
assert _block_id == block_id

self._cached_blocks.pop(content_hash_to_evict)

self._refcounter.incr(block_id)

@@ -199,7 +195,11 @@

def _incr_refcount_cached_block(self, block: Block,
block_id: BlockId) -> None:
# since block is already computed, mark it
# _incr_refcount_cached_block is now reached from two places:
# allocate_immutable and promote_to_immutable_block, whenever a
# _cached_blocks hash key is hit.
# In both cases it means an already computed block exists that is
# now shared with this block.
block.computed = True

refcount = self._refcounter.incr(block_id)
@@ -228,13 +228,19 @@ def _free_block_id_for_block(self, block_id: BlockId,
block: Block) -> None:
assert isinstance(block, PrefixCachingBlock)

if block.content_hash is None:
# If we come from promote_to_immutable_block, block.content_hash is
# never None. However, we need to release the block with the same
# content so that the physical block can be reused.
if block.block_id != block_id or block.content_hash is None:
refcount = self._refcounter.get(block_id)
# We have a fork case where the block could have more than one ref,
# so we cannot free it from tracking if the refcount is larger than 1.
if refcount <= 1:
assert block.block_id is not None
assert block.block_id is not None
refcount = self._refcounter.get(block.block_id)
if refcount == 1:
del self._blocks[block.block_id]

return self._hashless_allocator.free(block)

refcount = self._refcounter.decr(block_id)
@@ -317,7 +323,8 @@ def promote_to_immutable_block(self, block: Block) -> BlockId:
if block.content_hash not in self._cached_blocks:
self._cached_blocks[block.content_hash] = block.block_id
else:
self._free_block_id_for_block(block.block_id, block)
self._free_block_id_for_block(
self._cached_blocks[block.content_hash], block)
self._incr_refcount_cached_block(
block, self._cached_blocks[block.content_hash])

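
To make the refcount bookkeeping above easier to follow, below is a self-contained toy model (illustrative only, not vLLM code): a content-hash-to-physical-block map with refcounts, where promoting a block whose content already exists releases the duplicate physical block back to the free pool and increments the existing block's refcount, and the hash mapping is only dropped once the refcount reaches zero. The real allocator additionally parks zero-refcount blocks in an evictor so their contents can still be reused, which the toy model omits.

# Toy model of the bookkeeping the fix is about; not vLLM's implementation.
from typing import Dict, List, Optional


class ToyPrefixCache:
    def __init__(self, num_blocks: int) -> None:
        self.free_blocks: List[int] = list(range(num_blocks))  # "hashless" pool
        self.cached: Dict[int, int] = {}    # content_hash -> block_id
        self.refcount: Dict[int, int] = {}  # block_id -> refcount

    def allocate(self) -> int:
        """Take a physical block from the free pool (a mutable block)."""
        block_id = self.free_blocks.pop(0)
        self.refcount[block_id] = 1
        return block_id

    def promote(self, block_id: int, content_hash: int) -> int:
        """Promote a filled mutable block to immutable, deduplicating by hash."""
        if content_hash not in self.cached:
            self.cached[content_hash] = block_id
            return block_id
        # Duplicate content: return our physical block to the pool and share
        # the already-cached block instead.
        existing = self.cached[content_hash]
        del self.refcount[block_id]
        self.free_blocks.append(block_id)
        self.refcount[existing] += 1
        return existing

    def free(self, block_id: int, content_hash: Optional[int]) -> None:
        """Drop one reference; recycle the block once its refcount hits zero."""
        self.refcount[block_id] -= 1
        if self.refcount[block_id] > 0:
            return
        del self.refcount[block_id]
        if content_hash is not None:
            # vLLM would move the block to an evictor here; the toy model
            # simply forgets the mapping and frees the block.
            del self.cached[content_hash]
        self.free_blocks.append(block_id)


cache = ToyPrefixCache(num_blocks=3)
a = cache.allocate()
shared = cache.promote(a, content_hash=42)       # first block with this content
b = cache.allocate()
also_shared = cache.promote(b, content_hash=42)  # same content: dedup onto a
assert also_shared == shared
assert cache.refcount[shared] == 2
assert b in cache.free_blocks                    # spare block is back in the pool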