
Pull from head #1

Merged
merged 124 commits on May 29, 2024
124 commits
230c4b3
[CI/Test] fix swap test for multi gpu (#4689)
youkaichao May 8, 2024
89579a2
[Misc] Use vllm-flash-attn instead of flash-attn (#4686)
WoosukKwon May 8, 2024
f942efb
[Dynamic Spec Decoding] Auto-disable by the running queue size (#4592)
comaniac May 8, 2024
8b9241b
[Speculative decoding] [Bugfix] Fix overallocation in ngram + spec lo…
cadedaniel May 8, 2024
e288df0
[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (…
alexm-neuralmagic May 9, 2024
16bc0a0
[Frontend] add tok/s speed metric to llm class when using tqdm (#4400)
MahmoudAshraf97 May 9, 2024
f12b20d
[Frontend] Move async logic outside of constructor (#4674)
DarkLight1337 May 9, 2024
190bc83
[Misc] Remove unnecessary ModelRunner imports (#4703)
WoosukKwon May 9, 2024
0ee535b
[Misc] Set block size at initialization & Fix test_model_runner (#4705)
WoosukKwon May 9, 2024
ff5abcd
[ROCm] Add support for Punica kernels on AMD GPUs (#3140)
kliuae May 9, 2024
a3c1245
[Bugfix] Fix CLI arguments in OpenAI server docs (#4709)
DarkLight1337 May 9, 2024
cea6443
[Bugfix] Update grafana.json (#4711)
robertgshaw2-neuralmagic May 9, 2024
be0c518
[Bugfix] Add logs for all model dtype casting (#4717)
mgoin May 9, 2024
ebce310
[Model] Snowflake arctic model implementation (#4652)
sfc-gh-hazhang May 9, 2024
379da6d
[Kernel] [FP8] Improve FP8 linear layer performance (#4691)
pcmoritz May 9, 2024
c833101
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535)
comaniac May 10, 2024
208b71b
[Core][Distributed] refactor pynccl (#4591)
youkaichao May 10, 2024
e965d46
[Misc] Keep only one implementation of the create_dummy_prompt functi…
AllenDou May 10, 2024
51d4094
chunked-prefill-doc-syntax (#4603)
simon-mo May 10, 2024
64b77df
[Core]fix type annotation for `swap_blocks` (#4726)
jikunshang May 10, 2024
dac6a3f
[Misc] Apply a couple g++ cleanups (#4719)
stevegrubb May 10, 2024
6a0f617
[Core] Fix circular reference which leaked llm instance in local dev …
rkooo567 May 10, 2024
706588a
[Bugfix] Fix CLI arguments in OpenAI server docs (#4729)
AllenDou May 10, 2024
2e7796f
[Speculative decoding] CUDA graph support (#4295)
heeju-kim2 May 10, 2024
fcc2994
[CI] Nits for bad initialization of SeqGroup in testing (#4748)
robertgshaw2-neuralmagic May 10, 2024
4e12131
[Core][Test] fix function name typo in custom allreduce (#4750)
youkaichao May 10, 2024
e254497
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734)
CatherineSue May 11, 2024
6eaccb7
[Model] Add support for IBM Granite Code models (#4636)
yikangshen May 12, 2024
a709e87
[CI/Build] Tweak Marlin Nondeterminism Issues (#4713)
robertgshaw2-neuralmagic May 13, 2024
a7be4d0
[CORE] Improvement in ranks code (#4718)
SwapnilDreams100 May 13, 2024
702bee4
[Core][Distributed] refactor custom allreduce to support multiple tp …
youkaichao May 13, 2024
350f9e1
[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425)
DarkLight1337 May 13, 2024
e7c46b9
[Scheduler] Warning upon preemption and Swapping (#4647)
rkooo567 May 13, 2024
0fca3cd
[Misc] Enhance attention selector (#4751)
WoosukKwon May 13, 2024
8bc68e1
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, u…
sangstar May 13, 2024
ce532ff
[Speculative decoding] Improve n-gram efficiency (#4724)
comaniac May 13, 2024
1356df5
[Kernel] Use flash-attn for decoding (#3648)
skrider May 13, 2024
33d3914
[Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793)
pcmoritz May 13, 2024
ac1fbf7
[Doc] Shorten README by removing supported model list (#4796)
zhuohan123 May 13, 2024
4bfa7e7
[Doc] Add API reference for offline inference (#4710)
DarkLight1337 May 14, 2024
c579b75
[Doc] Add meetups to the doc (#4798)
zhuohan123 May 14, 2024
ccb63a8
[Core][Hash][Automatic Prefix caching] Accelerating the hashing funct…
KuntaiDu May 14, 2024
dc72402
[Bugfix][Doc] Fix CI failure in docs (#4804)
DarkLight1337 May 14, 2024
676a999
[Core] Add MultiprocessingGPUExecutor (#4539)
njhill May 14, 2024
29bc01b
Add 4th meetup announcement to readme (#4817)
simon-mo May 14, 2024
8a7cc25
Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820)
rkooo567 May 15, 2024
65bf2ac
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill …
rkooo567 May 15, 2024
e9cdd2b
[CI/Build] Further decouple HuggingFace implementation from ours duri…
DarkLight1337 May 15, 2024
a5675d3
[Bugfix] Properly set distributed_executor_backend in ParallelConfig …
zifeitong May 15, 2024
361c461
[Doc] Highlight the fourth meetup in the README (#4842)
zhuohan123 May 15, 2024
fc0d9df
[Frontend] Re-enable custom roles in Chat Completions API (#4758)
DarkLight1337 May 15, 2024
52f8107
[Frontend] Support OpenAI batch file format (#4794)
wuisawesome May 15, 2024
30e7543
[Core] Implement sharded state loader (#4690)
aurickq May 16, 2024
973617a
[Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840)
comaniac May 16, 2024
5c34257
Add marlin unit tests and marlin benchmark script (#4815)
alexm-neuralmagic May 16, 2024
99caa49
[Kernel] add bfloat16 support for gptq marlin kernel (#4788)
jinzhen-lin May 16, 2024
dbc0754
[docs] Fix typo in examples filename openi -> openai (#4864)
wuisawesome May 16, 2024
5e0391c
[Frontend] Separate OpenAI Batch Runner usage from API Server (#4851)
wuisawesome May 16, 2024
9216b9c
[Bugfix] Bypass authorization API token for preflight requests (#4862)
dulacp May 16, 2024
6979ade
Add GPTQ Marlin 2:4 sparse structured support (#4790)
alexm-neuralmagic May 16, 2024
f09edd8
Add JSON output support for benchmark_latency and benchmark_throughpu…
simon-mo May 16, 2024
b5853f9
[ROCm][AMD][Bugfix] adding a missing triton autotune config (#4845)
hongxiayang May 16, 2024
e081880
[Core][Distributed] remove graph mode function (#4818)
youkaichao May 16, 2024
10fa9ee
[Misc] remove old comments (#4866)
youkaichao May 16, 2024
8435b20
[Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850)
Silencioo May 16, 2024
2060e93
[Kernel] Add w8a8 CUTLASS kernels (#4749)
tlrmchlsmth May 16, 2024
9a31a81
[Bugfix] Fix FP8 KV cache support (#4869)
WoosukKwon May 16, 2024
8e7fb5d
Support to serve vLLM on Kubernetes with LWS (#4829)
kerthcet May 16, 2024
0150a10
[Frontend] OpenAI API server: Do not add bos token by default when en…
bofenghuang May 17, 2024
2614812
[Build/CI] Extending the set of AMD tests with Regression, Basic Corr…
Alexei-V-Ivanov-AMD May 17, 2024
33e0823
[Bugfix] fix rope error when load models with different dtypes (#4835)
jinzhen-lin May 17, 2024
48d5985
Sync huggingface modifications of qwen Moe model (#4774)
eigen2017 May 17, 2024
c5711ef
[Doc] Update Ray Data distributed offline inference example (#4871)
Yard1 May 17, 2024
86b45ae
[Bugfix] Relax tiktoken to >= 0.6.0 (#4890)
mgoin May 17, 2024
c0724fc
[ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if…
alexeykondrat May 18, 2024
2e9a222
[Lora] Support long context lora (#4787)
rkooo567 May 18, 2024
f68470e
[Bugfix][Model] Add base class for vision-language models (#4809)
DarkLight1337 May 19, 2024
27ce854
[Kernel] Add marlin_24 unit tests (#4901)
alexm-neuralmagic May 19, 2024
b57e6c5
[Kernel] Add flash-attn back (#4907)
WoosukKwon May 20, 2024
6287537
[Model] LLaVA model refactor (#4910)
DarkLight1337 May 20, 2024
da5a0b5
Remove marlin warning (#4918)
alexm-neuralmagic May 20, 2024
546a97e
[Misc]: allow user to specify port in distributed setting (#4914)
ZwwWayne May 20, 2024
943e72c
[Build/CI] Enabling AMD Entrypoints Test (#4834)
Alexei-V-Ivanov-AMD May 20, 2024
f0eecee
[Bugfix] Fix dummy weight for fp8 (#4916)
mzusman May 20, 2024
1937e29
[Core] Sharded State Loader download from HF (#4889)
aurickq May 20, 2024
c3af447
[Doc]Add documentation to benchmarking script when running TGI (#4920)
KuntaiDu May 20, 2024
65ae8c2
[Core] Fix scheduler considering "no LoRA" as "LoRA" (#4897)
Yard1 May 21, 2024
d130b57
[Model] add rope_scaling support for qwen2 (#4930)
hzhwcmhf May 21, 2024
f12c3b5
[Model] Add Phi-2 LoRA support (#4886)
Isotr0py May 21, 2024
e941f88
[Docs] Add acknowledgment for sponsors (#4925)
simon-mo May 21, 2024
757b62c
[CI/Build] Codespell ignore `build/` directory (#4945)
mgoin May 21, 2024
14772ee
[Bugfix] Fix flag name for `max_seq_len_to_capture` (#4935)
kerthcet May 21, 2024
99eff67
[Bugfix][Kernel] Add head size check for attention backend selection …
Isotr0py May 21, 2024
9b9a10d
[Frontend] Dynamic RoPE scaling (#4638)
sasha0552 May 22, 2024
5f6d10c
[CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#…
mgoin May 22, 2024
c74c913
[misc] remove comments that were supposed to be removed (#4977)
rkooo567 May 22, 2024
8674f98
[Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954)
tlrmchlsmth May 22, 2024
a3a73ab
[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893)
comaniac May 22, 2024
97b0300
[Model] LoRA gptbigcode implementation (#3949)
raywanb May 22, 2024
eb6d3c2
[Core] Eliminate parallel worker per-step task scheduling overhead (#…
njhill May 22, 2024
a36de68
[Minor] Fix small typo in llama.py: QKVParallelLinear -> Quantization…
pcmoritz May 22, 2024
ee3eea0
[Misc] Take user preference in attention selector (#4960)
comaniac May 22, 2024
6066253
Marlin 24 prefill performance improvement (about 25% better on averag…
alexm-neuralmagic May 23, 2024
2ba80be
[Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is n…
LetianLee May 23, 2024
5eda2ea
[Core][1/N] Support send/recv in PyNCCL Groups (#4988)
andoorve May 23, 2024
a124232
[Kernel] Initial Activation Quantization Support (#4525)
dsikka May 23, 2024
e3470f8
[Core]: Option To Use Prompt Token Ids Inside Logits Processor (#4985)
kezouke May 23, 2024
6a50f4c
[Doc] add ccache guide in doc (#5012)
youkaichao May 23, 2024
9197709
[Bugfix] Fix Mistral v0.3 Weight Loading (#5005)
robertgshaw2-neuralmagic May 24, 2024
e64fde4
[Core][Bugfix]: fix prefix caching for blockv2 (#4764)
leiwen83 May 24, 2024
8e192ff
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3…
linxihui May 25, 2024
325c119
[Misc] add logging level env var (#5045)
youkaichao May 25, 2024
d5a1697
[Dynamic Spec Decoding] Minor fix for disabling speculative decoding …
LiuXiaoxuanPKU May 25, 2024
f17a1a8
[Misc] Make Serving Benchmark More User-friendly (#5044)
ywang96 May 25, 2024
1102bef
[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846)
zhuohan123 May 27, 2024
fbdb7b3
[Core] Allow AQLM on Pascal (#5058)
sasha0552 May 27, 2024
890aa93
[Model] Add support for falcon-11B (#5069)
Isotr0py May 27, 2024
d4f3985
[Core] Sliding window for block manager v2 (#4545)
mmoskal May 28, 2024
9ba4155
[BugFix] Fix Embedding Models with TP>1 (#5075)
robertgshaw2-neuralmagic May 28, 2024
dd8de11
[Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X (#4951)
divakar-amd May 28, 2024
290f4ad
[Docs] Add Dropbox as sponsors (#5089)
simon-mo May 28, 2024
5ae5ed1
[Core] Consolidate prompt arguments to LLM engines (#4328)
DarkLight1337 May 28, 2024
dfba529
[Bugfix] Remove the last EOS token unless explicitly specified (#5077)
jsato8094 May 29, 2024
616e600
[Misc] add gpu_memory_utilization arg (#5079)
pandyamarut May 29, 2024
[Doc] Add API reference for offline inference (vllm-project#4710)
DarkLight1337 committed May 14, 2024
commit 4bfa7e7f75eb5b1a397c93aeea1dea1afa867b2a
8 changes: 7 additions & 1 deletion docs/source/index.rst
@@ -67,6 +67,13 @@ Documentation
getting_started/quickstart
getting_started/examples/examples_index

.. toctree::
:maxdepth: 1
:caption: Offline Inference

offline_inference/llm
offline_inference/sampling_params

.. toctree::
:maxdepth: 1
:caption: Serving
@@ -101,7 +108,6 @@ Documentation
:maxdepth: 2
:caption: Developer Documentation

dev/sampling_params
dev/engine/engine_index
dev/kernel/paged_attention
dev/dockerfile/dockerfile
6 changes: 6 additions & 0 deletions docs/source/offline_inference/llm.rst
@@ -0,0 +1,6 @@
LLM Class
==========

.. autoclass:: vllm.LLM
:members:
:show-inheritance:
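The new page documents the `vllm.LLM` class via `autoclass`. As a hedged sketch of the offline-inference API it covers (model name and sampling values here are only illustrative, and the heavy model-loading path is gated behind an environment variable since it needs vLLM installed plus a model download):

```python
import os

# Illustrative sketch of the offline-inference API documented by the new
# offline_inference/llm page. Everything model-specific here is a placeholder.
prompts = ["Hello, my name is", "The capital of France is"]

def run_offline(prompts):
    # Requires `pip install vllm`; downloads the model on first use.
    from vllm import LLM, SamplingParams
    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(temperature=0.8, max_tokens=32)
    # generate() returns one RequestOutput per prompt.
    return [out.outputs[0].text for out in llm.generate(prompts, params)]

if os.environ.get("RUN_VLLM_DEMO"):
    for text in run_offline(prompts):
        print(text)
```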
docs/source/offline_inference/sampling_params.rst (moved from docs/source/dev/sampling_params.rst)
@@ -1,5 +1,5 @@
Sampling Params
===============
Sampling Parameters
===================

.. autoclass:: vllm.SamplingParams
:members:
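The retitled Sampling Parameters page documents `vllm.SamplingParams` via `autoclass`. The sketch below mirrors a few of its commonly used fields in a plain dict; the values are arbitrary examples, and constructing the real object requires vLLM:

```python
# A sketch of common knobs documented on the Sampling Parameters page.
# The values below are arbitrary examples, not defaults.
settings = {
    "temperature": 0.8,  # higher -> more random sampling
    "top_p": 0.95,       # nucleus-sampling probability cutoff
    "max_tokens": 64,    # cap on generated tokens
    "stop": ["\n\n"],    # stop strings that end generation
}

# With vLLM installed this becomes:
#     from vllm import SamplingParams
#     params = SamplingParams(**settings)
print(sorted(settings))
```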
4 changes: 2 additions & 2 deletions docs/source/serving/openai_compatible_server.md
@@ -48,7 +48,7 @@ completion = client.chat.completions.create(
```

### Extra Parameters for Chat API
The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.
The following [sampling parameters (click through to see documentation)](../offline_inference/sampling_params.rst) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
@@ -65,7 +65,7 @@ The following extra parameters are supported:
```
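These extra sampling parameters reach vLLM's OpenAI-compatible server through the client's `extra_body` field. A hedged sketch of such a request payload (the model name and parameter values are placeholders):

```python
# Sketch of a Chat Completions request carrying vLLM-specific extras.
# Standard OpenAI fields go at the top level; vLLM extensions such as
# top_k and repetition_penalty travel in extra_body.
request = {
    "model": "facebook/opt-125m",
    "messages": [{"role": "user", "content": "Say hello"}],
    "extra_body": {"top_k": 20, "repetition_penalty": 1.1},
}

# With the `openai` package and a running vLLM server, this becomes roughly:
#     client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#     client.chat.completions.create(**request)
print(sorted(request))
```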

### Extra Parameters for Completions API
The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.
The following [sampling parameters (click through to see documentation)](../offline_inference/sampling_params.rst) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python