Pull from head #1

Merged · 124 commits · May 29, 2024

Commits
230c4b3
[CI/Test] fix swap test for multi gpu (#4689)
youkaichao May 8, 2024
89579a2
[Misc] Use vllm-flash-attn instead of flash-attn (#4686)
WoosukKwon May 8, 2024
f942efb
[Dynamic Spec Decoding] Auto-disable by the running queue size (#4592)
comaniac May 8, 2024
8b9241b
[Speculative decoding] [Bugfix] Fix overallocation in ngram + spec lo…
cadedaniel May 8, 2024
e288df0
[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (…
alexm-neuralmagic May 9, 2024
16bc0a0
[Frontend] add tok/s speed metric to llm class when using tqdm (#4400)
MahmoudAshraf97 May 9, 2024
f12b20d
[Frontend] Move async logic outside of constructor (#4674)
DarkLight1337 May 9, 2024
190bc83
[Misc] Remove unnecessary ModelRunner imports (#4703)
WoosukKwon May 9, 2024
0ee535b
[Misc] Set block size at initialization & Fix test_model_runner (#4705)
WoosukKwon May 9, 2024
ff5abcd
[ROCm] Add support for Punica kernels on AMD GPUs (#3140)
kliuae May 9, 2024
a3c1245
[Bugfix] Fix CLI arguments in OpenAI server docs (#4709)
DarkLight1337 May 9, 2024
cea6443
[Bugfix] Update grafana.json (#4711)
robertgshaw2-neuralmagic May 9, 2024
be0c518
[Bugfix] Add logs for all model dtype casting (#4717)
mgoin May 9, 2024
ebce310
[Model] Snowflake arctic model implementation (#4652)
sfc-gh-hazhang May 9, 2024
379da6d
[Kernel] [FP8] Improve FP8 linear layer performance (#4691)
pcmoritz May 9, 2024
c833101
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535)
comaniac May 10, 2024
208b71b
[Core][Distributed] refactor pynccl (#4591)
youkaichao May 10, 2024
e965d46
[Misc] Keep only one implementation of the create_dummy_prompt functi…
AllenDou May 10, 2024
51d4094
chunked-prefill-doc-syntax (#4603)
simon-mo May 10, 2024
64b77df
[Core]fix type annotation for `swap_blocks` (#4726)
jikunshang May 10, 2024
dac6a3f
[Misc] Apply a couple g++ cleanups (#4719)
stevegrubb May 10, 2024
6a0f617
[Core] Fix circular reference which leaked llm instance in local dev …
rkooo567 May 10, 2024
706588a
[Bugfix] Fix CLI arguments in OpenAI server docs (#4729)
AllenDou May 10, 2024
2e7796f
[Speculative decoding] CUDA graph support (#4295)
heeju-kim2 May 10, 2024
fcc2994
[CI] Nits for bad initialization of SeqGroup in testing (#4748)
robertgshaw2-neuralmagic May 10, 2024
4e12131
[Core][Test] fix function name typo in custom allreduce (#4750)
youkaichao May 10, 2024
e254497
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734)
CatherineSue May 11, 2024
6eaccb7
[Model] Add support for IBM Granite Code models (#4636)
yikangshen May 12, 2024
a709e87
[CI/Build] Tweak Marlin Nondeterminism Issues (#4713)
robertgshaw2-neuralmagic May 13, 2024
a7be4d0
[CORE] Improvement in ranks code (#4718)
SwapnilDreams100 May 13, 2024
702bee4
[Core][Distributed] refactor custom allreduce to support multiple tp …
youkaichao May 13, 2024
350f9e1
[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425)
DarkLight1337 May 13, 2024
e7c46b9
[Scheduler] Warning upon preemption and Swapping (#4647)
rkooo567 May 13, 2024
0fca3cd
[Misc] Enhance attention selector (#4751)
WoosukKwon May 13, 2024
8bc68e1
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, u…
sangstar May 13, 2024
ce532ff
[Speculative decoding] Improve n-gram efficiency (#4724)
comaniac May 13, 2024
1356df5
[Kernel] Use flash-attn for decoding (#3648)
skrider May 13, 2024
33d3914
[Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793)
pcmoritz May 13, 2024
ac1fbf7
[Doc] Shorten README by removing supported model list (#4796)
zhuohan123 May 13, 2024
4bfa7e7
[Doc] Add API reference for offline inference (#4710)
DarkLight1337 May 14, 2024
c579b75
[Doc] Add meetups to the doc (#4798)
zhuohan123 May 14, 2024
ccb63a8
[Core][Hash][Automatic Prefix caching] Accelerating the hashing funct…
KuntaiDu May 14, 2024
dc72402
[Bugfix][Doc] Fix CI failure in docs (#4804)
DarkLight1337 May 14, 2024
676a999
[Core] Add MultiprocessingGPUExecutor (#4539)
njhill May 14, 2024
29bc01b
Add 4th meetup announcement to readme (#4817)
simon-mo May 14, 2024
8a7cc25
Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820)
rkooo567 May 15, 2024
65bf2ac
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill …
rkooo567 May 15, 2024
e9cdd2b
[CI/Build] Further decouple HuggingFace implementation from ours duri…
DarkLight1337 May 15, 2024
a5675d3
[Bugfix] Properly set distributed_executor_backend in ParallelConfig …
zifeitong May 15, 2024
361c461
[Doc] Highlight the fourth meetup in the README (#4842)
zhuohan123 May 15, 2024
fc0d9df
[Frontend] Re-enable custom roles in Chat Completions API (#4758)
DarkLight1337 May 15, 2024
52f8107
[Frontend] Support OpenAI batch file format (#4794)
wuisawesome May 15, 2024
30e7543
[Core] Implement sharded state loader (#4690)
aurickq May 16, 2024
973617a
[Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840)
comaniac May 16, 2024
5c34257
Add marlin unit tests and marlin benchmark script (#4815)
alexm-neuralmagic May 16, 2024
99caa49
[Kernel] add bfloat16 support for gptq marlin kernel (#4788)
jinzhen-lin May 16, 2024
dbc0754
[docs] Fix typo in examples filename openi -> openai (#4864)
wuisawesome May 16, 2024
5e0391c
[Frontend] Separate OpenAI Batch Runner usage from API Server (#4851)
wuisawesome May 16, 2024
9216b9c
[Bugfix] Bypass authorization API token for preflight requests (#4862)
dulacp May 16, 2024
6979ade
Add GPTQ Marlin 2:4 sparse structured support (#4790)
alexm-neuralmagic May 16, 2024
f09edd8
Add JSON output support for benchmark_latency and benchmark_throughpu…
simon-mo May 16, 2024
b5853f9
[ROCm][AMD][Bugfix] adding a missing triton autotune config (#4845)
hongxiayang May 16, 2024
e081880
[Core][Distributed] remove graph mode function (#4818)
youkaichao May 16, 2024
10fa9ee
[Misc] remove old comments (#4866)
youkaichao May 16, 2024
8435b20
[Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850)
Silencioo May 16, 2024
2060e93
[Kernel] Add w8a8 CUTLASS kernels (#4749)
tlrmchlsmth May 16, 2024
9a31a81
[Bugfix] Fix FP8 KV cache support (#4869)
WoosukKwon May 16, 2024
8e7fb5d
Support to serve vLLM on Kubernetes with LWS (#4829)
kerthcet May 16, 2024
0150a10
[Frontend] OpenAI API server: Do not add bos token by default when en…
bofenghuang May 17, 2024
2614812
[Build/CI] Extending the set of AMD tests with Regression, Basic Corr…
Alexei-V-Ivanov-AMD May 17, 2024
33e0823
[Bugfix] fix rope error when load models with different dtypes (#4835)
jinzhen-lin May 17, 2024
48d5985
Sync huggingface modifications of qwen Moe model (#4774)
eigen2017 May 17, 2024
c5711ef
[Doc] Update Ray Data distributed offline inference example (#4871)
Yard1 May 17, 2024
86b45ae
[Bugfix] Relax tiktoken to >= 0.6.0 (#4890)
mgoin May 17, 2024
c0724fc
[ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if…
alexeykondrat May 18, 2024
2e9a222
[Lora] Support long context lora (#4787)
rkooo567 May 18, 2024
f68470e
[Bugfix][Model] Add base class for vision-language models (#4809)
DarkLight1337 May 19, 2024
27ce854
[Kernel] Add marlin_24 unit tests (#4901)
alexm-neuralmagic May 19, 2024
b57e6c5
[Kernel] Add flash-attn back (#4907)
WoosukKwon May 20, 2024
6287537
[Model] LLaVA model refactor (#4910)
DarkLight1337 May 20, 2024
da5a0b5
Remove marlin warning (#4918)
alexm-neuralmagic May 20, 2024
546a97e
[Misc]: allow user to specify port in distributed setting (#4914)
ZwwWayne May 20, 2024
943e72c
[Build/CI] Enabling AMD Entrypoints Test (#4834)
Alexei-V-Ivanov-AMD May 20, 2024
f0eecee
[Bugfix] Fix dummy weight for fp8 (#4916)
mzusman May 20, 2024
1937e29
[Core] Sharded State Loader download from HF (#4889)
aurickq May 20, 2024
c3af447
[Doc]Add documentation to benchmarking script when running TGI (#4920)
KuntaiDu May 20, 2024
65ae8c2
[Core] Fix scheduler considering "no LoRA" as "LoRA" (#4897)
Yard1 May 21, 2024
d130b57
[Model] add rope_scaling support for qwen2 (#4930)
hzhwcmhf May 21, 2024
f12c3b5
[Model] Add Phi-2 LoRA support (#4886)
Isotr0py May 21, 2024
e941f88
[Docs] Add acknowledgment for sponsors (#4925)
simon-mo May 21, 2024
757b62c
[CI/Build] Codespell ignore `build/` directory (#4945)
mgoin May 21, 2024
14772ee
[Bugfix] Fix flag name for `max_seq_len_to_capture` (#4935)
kerthcet May 21, 2024
99eff67
[Bugfix][Kernel] Add head size check for attention backend selection …
Isotr0py May 21, 2024
9b9a10d
[Frontend] Dynamic RoPE scaling (#4638)
sasha0552 May 22, 2024
5f6d10c
[CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#…
mgoin May 22, 2024
c74c913
[misc] remove comments that were supposed to be removed (#4977)
rkooo567 May 22, 2024
8674f98
[Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954)
tlrmchlsmth May 22, 2024
a3a73ab
[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893)
comaniac May 22, 2024
97b0300
[Model] LoRA gptbigcode implementation (#3949)
raywanb May 22, 2024
eb6d3c2
[Core] Eliminate parallel worker per-step task scheduling overhead (#…
njhill May 22, 2024
a36de68
[Minor] Fix small typo in llama.py: QKVParallelLinear -> Quantization…
pcmoritz May 22, 2024
ee3eea0
[Misc] Take user preference in attention selector (#4960)
comaniac May 22, 2024
6066253
Marlin 24 prefill performance improvement (about 25% better on averag…
alexm-neuralmagic May 23, 2024
2ba80be
[Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is n…
LetianLee May 23, 2024
5eda2ea
[Core][1/N] Support send/recv in PyNCCL Groups (#4988)
andoorve May 23, 2024
a124232
[Kernel] Initial Activation Quantization Support (#4525)
dsikka May 23, 2024
e3470f8
[Core]: Option To Use Prompt Token Ids Inside Logits Processor (#4985)
kezouke May 23, 2024
6a50f4c
[Doc] add ccache guide in doc (#5012)
youkaichao May 23, 2024
9197709
[Bugfix] Fix Mistral v0.3 Weight Loading (#5005)
robertgshaw2-neuralmagic May 24, 2024
e64fde4
[Core][Bugfix]: fix prefix caching for blockv2 (#4764)
leiwen83 May 24, 2024
8e192ff
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3…
linxihui May 25, 2024
325c119
[Misc] add logging level env var (#5045)
youkaichao May 25, 2024
d5a1697
[Dynamic Spec Decoding] Minor fix for disabling speculative decoding …
LiuXiaoxuanPKU May 25, 2024
f17a1a8
[Misc] Make Serving Benchmark More User-friendly (#5044)
ywang96 May 25, 2024
1102bef
[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846)
zhuohan123 May 27, 2024
fbdb7b3
[Core] Allow AQLM on Pascal (#5058)
sasha0552 May 27, 2024
890aa93
[Model] Add support for falcon-11B (#5069)
Isotr0py May 27, 2024
d4f3985
[Core] Sliding window for block manager v2 (#4545)
mmoskal May 28, 2024
9ba4155
[BugFix] Fix Embedding Models with TP>1 (#5075)
robertgshaw2-neuralmagic May 28, 2024
dd8de11
[Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X (#4951)
divakar-amd May 28, 2024
290f4ad
[Docs] Add Dropbox as sponsors (#5089)
simon-mo May 28, 2024
5ae5ed1
[Core] Consolidate prompt arguments to LLM engines (#4328)
DarkLight1337 May 28, 2024
dfba529
[Bugfix] Remove the last EOS token unless explicitly specified (#5077)
jsato8094 May 29, 2024
616e600
[Misc] add gpu_memory_utilization arg (#5079)
pandyamarut May 29, 2024
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (vllm-project#4797)
Alexei-V-Ivanov-AMD committed May 17, 2024
commit 26148120b3c05704409a425d017f0a51fca3b7cc
11 changes: 6 additions & 5 deletions .buildkite/run-amd-test.sh
@@ -1,4 +1,4 @@
-# This script build the ROCm docker image and runs test inside it.
+# This script runs test inside the corresponding ROCm docker container.
 set -ex
 
 # Print ROCm version
@@ -19,15 +19,16 @@ done
 
 echo "--- Building container"
 sha=$(git rev-parse --short HEAD)
-container_name=rocm_${sha}
+image_name=rocm_${sha}
+container_name=rocm_${sha}_$(tr -dc A-Za-z0-9 < /dev/urandom | head -c 10; echo)
 docker build \
-        -t ${container_name} \
+        -t ${image_name} \
         -f Dockerfile.rocm \
         --progress plain \
         .
 
 remove_docker_container() {
-    docker rm -f ${container_name} || docker image rm -f ${container_name} || true
+    docker rm -f ${container_name} || docker image rm -f ${image_name} || true
 }
 trap remove_docker_container EXIT
 
@@ -39,6 +40,6 @@ docker run \
        --rm \
        -e HF_TOKEN \
        --name ${container_name} \
-       ${container_name} \
+       ${image_name} \
        /bin/bash -c "${@}"

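The script change above splits the Docker image tag from the container name and appends a random suffix to the latter, so concurrent CI jobs building the same commit on one host reuse the image but never collide on container names. A minimal Python sketch of the same naming idea (a hypothetical helper, not part of the PR):

import secrets

def docker_names(git_sha: str) -> tuple[str, str]:
    """Return (image_name, container_name) for one CI run.

    The image tag is stable per commit so the built image can be
    reused, while each container gets a random suffix so two jobs
    running the same commit on one host cannot collide.
    """
    image_name = f"rocm_{git_sha}"
    container_name = f"{image_name}_{secrets.token_hex(5)}"  # 10 hex chars
    return image_name, container_name

image, container = docker_names("2614812")
print(image)      # rocm_2614812
print(container)  # e.g. rocm_2614812_9f1c0a2b4d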
18 changes: 15 additions & 3 deletions .buildkite/test-pipeline.yaml
@@ -5,13 +5,16 @@
 
 steps:
 - label: Regression Test
+  mirror_hardwares: [amd]
   command: pytest -v -s test_regression.py
   working_dir: "/vllm-workspace/tests" # optional
 
 - label: AsyncEngine Test
+  #mirror_hardwares: [amd]
   command: pytest -v -s async_engine
 
 - label: Basic Correctness Test
+  mirror_hardwares: [amd]
   commands:
   - VLLM_ATTENTION_BACKEND=XFORMERS pytest -v -s basic_correctness/test_basic_correctness.py
   - VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s basic_correctness/test_basic_correctness.py
@@ -24,14 +27,15 @@ steps:
   command: pytest -v -s core
 
 - label: Distributed Comm Ops Test
+  #mirror_hardwares: [amd]
   command: pytest -v -s distributed/test_comm_ops.py
   working_dir: "/vllm-workspace/tests"
   num_gpus: 2
 
 - label: Distributed Tests
+  mirror_hardwares: [amd]
   working_dir: "/vllm-workspace/tests"
   num_gpus: 2
-  mirror_hardwares: [amd]
   commands:
   - pytest -v -s distributed/test_pynccl_library.py
   - TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
@@ -45,16 +49,18 @@ steps:
   - pytest -v -s spec_decode/e2e/test_integration_dist.py
 
 - label: Distributed Tests (Multiple Groups)
+  #mirror_hardwares: [amd]
   working_dir: "/vllm-workspace/tests"
   num_gpus: 4
   commands:
   - pytest -v -s distributed/test_pynccl.py
 
 - label: Engine Test
-  #mirror_hardwares: [amd]
+  mirror_hardwares: [amd]
   command: pytest -v -s engine tokenization test_sequence.py test_config.py test_logger.py
 
 - label: Entrypoints Test
+  #mirror_hardwares: [amd]
   commands:
   # these tests have to be separated, because each one will allocate all posible GPU memory
   - pytest -v -s entrypoints --ignore=entrypoints/test_server_oot_registration.py
@@ -74,6 +80,7 @@ steps:
   - python3 tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
 
 - label: Kernels Test %N
+  #mirror_hardwares: [amd]
   command: pytest -v -s kernels --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
   parallelism: 4
 
@@ -84,7 +91,7 @@ steps:
   - pytest -v -s models --ignore=models/test_llava.py
 
 - label: Llava Test
-  #mirror_hardwares: [amd]
+  mirror_hardwares: [amd]
   commands:
   - bash ../.buildkite/download-images.sh
   - pytest -v -s models/test_llava.py
@@ -95,6 +102,7 @@ steps:
   - pytest -v -s prefix_caching
 
 - label: Samplers Test
+  #mirror_hardwares: [amd]
   command: pytest -v -s samplers
 
 - label: LogitsProcessor Test
@@ -110,16 +118,20 @@ steps:
   command: pytest -v -s spec_decode
 
 - label: LoRA Test %N
+  #mirror_hardwares: [amd]
   command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
   parallelism: 4
 
 - label: Tensorizer Test
+  #mirror_hardwares: [amd]
   command: apt-get install curl libsodium23 && pytest -v -s tensorizer_loader
 
 - label: Metrics Test
+  mirror_hardwares: [amd]
   command: pytest -v -s metrics
 
 - label: Quantization Test
+  #mirror_hardwares: [amd]
   command: pytest -v -s quantization
 
 - label: Benchmarks
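In this pipeline, an uncommented mirror_hardwares: [amd] marks a step that should also run on AMD hardware, while the commented-out entries appear to flag steps considered but not yet enabled. The actual expansion is done by the Jinja template (.buildkite/test-template.j2, below); the Python sketch here is only a rough, hypothetical illustration of that mirroring logic, not the template itself:

from typing import Any

def expand_steps(steps: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Hypothetical illustration: emit the default step, plus one
    mirrored copy per hardware listed in `mirror_hardwares`."""
    expanded: list[dict[str, Any]] = []
    for step in steps:
        expanded.append(step)  # default (CUDA) execution always runs
        for hw in step.get("mirror_hardwares", []):
            expanded.append(dict(step, label=f"{hw.upper()}: {step['label']}"))
    return expanded

steps = [
    {"label": "Regression Test", "mirror_hardwares": ["amd"],
     "command": "pytest -v -s test_regression.py"},
    {"label": "AsyncEngine Test", "command": "pytest -v -s async_engine"},
]
print([s["label"] for s in expand_steps(steps)])
# ['Regression Test', 'AMD: Regression Test', 'AsyncEngine Test']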
3 changes: 1 addition & 2 deletions .buildkite/test-template.j2
@@ -3,9 +3,8 @@
 {% set default_working_dir = "/vllm-workspace/tests" %}
 
 steps:
-
- label: ":docker: build image"
-  commands:
+  commands:
  - "docker build --build-arg max_jobs=16 --tag {{ docker_image }} --target test --progress plain ."
  - "docker push {{ docker_image }}"
  env:

(The commands: pair differs only in whitespace, which this rendering cannot show; together with the removed blank line after steps:, that accounts for the 1 addition and 2 deletions.)
6 changes: 5 additions & 1 deletion tests/engine/test_stop_reason.py
@@ -32,6 +32,7 @@ def test_stop_reason(vllm_model, example_prompts):
     # test stop token
     outputs = llm.generate(example_prompts,
                            sampling_params=SamplingParams(
+                               ignore_eos=True,
                                seed=SEED,
                                max_tokens=MAX_TOKENS,
                                stop_token_ids=[stop_token_id]))
@@ -43,7 +44,10 @@ def test_stop_reason(vllm_model, example_prompts):
     # test stop string
     outputs = llm.generate(example_prompts,
                            sampling_params=SamplingParams(
-                               seed=SEED, max_tokens=MAX_TOKENS, stop="."))
+                               ignore_eos=True,
+                               seed=SEED,
+                               max_tokens=MAX_TOKENS,
+                               stop="."))
     for output in outputs:
         output = output.outputs[0]
         assert output.finish_reason == "stop"
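The ignore_eos=True additions harden both assertions: without it, the model could emit its EOS token before the stop token or stop string appears, ending generation with a different finish_reason and failing the test spuriously. A minimal sketch of the pattern, assuming vLLM's offline API (the model and prompt are illustrative, not the test's fixtures):

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(
    ignore_eos=True,   # EOS can no longer end generation early...
    seed=42,
    max_tokens=1024,
    stop=".",          # ...so only the stop string should trigger "stop"
)
outputs = llm.generate(["The capital of France is"], params)
completion = outputs[0].outputs[0]
assert completion.finish_reason == "stop"
print(repr(completion.text))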
10 changes: 1 addition & 9 deletions vllm/config.py
@@ -1060,7 +1060,7 @@ def get_image_input_enum_type(
     "bfloat16": torch.bfloat16,
 }
 
-_ROCM_NOT_SUPPORTED_DTYPE = ["float", "float32"]
+_ROCM_NOT_SUPPORTED_DTYPE: List[str] = [] #
 
 
 def _get_and_verify_dtype(
@@ -1092,14 +1092,6 @@ def _get_and_verify_dtype(
     else:
         raise ValueError(f"Unknown dtype: {dtype}")
 
-    if is_hip() and torch_dtype == torch.float32:
-        rocm_supported_dtypes = [
-            k for k, v in _STR_DTYPE_TO_TORCH_DTYPE.items()
-            if (k not in _ROCM_NOT_SUPPORTED_DTYPE)
-        ]
-        raise ValueError(f"dtype '{dtype}' is not supported in ROCm. "
-                         f"Supported dtypes are {rocm_supported_dtypes}")
-
     # Verify the dtype.
     if torch_dtype != config_dtype:
         if torch_dtype == torch.float32:
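Deleting this guard lifts vLLM's blanket rejection of float32 on ROCm, so requesting fp32 no longer fails at config time on AMD GPUs. A condensed before/after sketch (simplified; the real _get_and_verify_dtype also reconciles the requested dtype against the model config):

import torch

_STR_DTYPE_TO_TORCH_DTYPE = {
    "half": torch.float16,
    "float16": torch.float16,
    "float": torch.float32,
    "float32": torch.float32,
    "bfloat16": torch.bfloat16,
}

def resolve_dtype(dtype: str, on_rocm: bool, legacy: bool = False) -> torch.dtype:
    """Condensed sketch of the dtype check around this commit."""
    torch_dtype = _STR_DTYPE_TO_TORCH_DTYPE[dtype.lower()]
    if legacy and on_rocm and torch_dtype == torch.float32:
        # Pre-commit behavior: fp32 was rejected outright on ROCm.
        raise ValueError(f"dtype '{dtype}' is not supported in ROCm.")
    return torch_dtype  # post-commit: fp32 resolves on ROCm like anywhere else

assert resolve_dtype("float32", on_rocm=True) is torch.float32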