Commit dcda03b
Write README and front page of doc (vllm-project#147)
WoosukKwon authored Jun 18, 2023
1 parent bf5f121 commit dcda03b
Showing 9 changed files with 65 additions and 60 deletions.
90 changes: 39 additions & 51 deletions README.md
@@ -1,66 +1,54 @@
-# vLLM
+# vLLM: Easy, Fast, and Cheap LLM Serving for Everyone
 
-## Build from source
+| [**Documentation**](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/) | [**Blog**]() |
 
-```bash
-pip install -r requirements.txt
-pip install -e . # This may take several minutes.
-```
+vLLM is a fast and easy-to-use library for LLM inference and serving.
 
-## Test simple server
+## Latest News 🔥
 
-```bash
-# Single-GPU inference.
-python examples/simple_server.py # --model <your_model>
+- [2023/06] We officially released vLLM! vLLM has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid April. Check out our [blog post]().
 
-# Multi-GPU inference (e.g., 2 GPUs).
-ray start --head
-python examples/simple_server.py -tp 2 # --model <your_model>
-```
+## Getting Started
 
-The detailed arguments for `simple_server.py` can be found by:
-```bash
-python examples/simple_server.py --help
-```
+Visit our [documentation](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/) to get started.
+- [Installation](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/getting_started/installation.html): `pip install vllm`
+- [Quickstart](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/getting_started/quickstart.html)
+- [Supported Models](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/models/supported_models.html)
 
-## FastAPI server
+## Key Features
 
-To start the server:
-```bash
-ray start --head
-python -m vllm.entrypoints.fastapi_server # --model <your_model>
-```
+vLLM comes with many powerful features that include:
 
-To test the server:
-```bash
-python test_cli_client.py
-```
+- State-of-the-art performance in serving throughput
+- Efficient management of attention key and value memory with **PagedAttention**
+- Seamless integration with popular HuggingFace models
+- Dynamic batching of incoming requests
+- Optimized CUDA kernels
+- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
+- Tensor parallelism support for distributed inference
+- Streaming outputs
+- OpenAI-compatible API server
 
-## Gradio web server
+## Performance
 
-Install the following additional dependencies:
-```bash
-pip install gradio
-```
+vLLM outperforms HuggingFace Transformers (HF) by up to 24x and Text Generation Inference (TGI) by up to 3.5x, in terms of throughput.
+For details, check out our [blog post]().
 
-Start the server:
-```bash
-python -m vllm.http_frontend.fastapi_frontend
-# At another terminal
-python -m vllm.http_frontend.gradio_webserver
-```
+<p align="center">
+  <img src="./assets/figures/perf_a10g_n1.png" width="45%">
+  <img src="./assets/figures/perf_a100_n1.png" width="45%">
+  <br>
+  <em> Serving throughput when each request asks for 1 output completion. </em>
+</p>
 
-## Load LLaMA weights
+<p align="center">
+  <img src="./assets/figures/perf_a10g_n3.png" width="45%">
+  <img src="./assets/figures/perf_a100_n3.png" width="45%">
+  <br>
+  <em> Serving throughput when each request asks for 3 output completions. </em>
+</p>
 
-Since LLaMA weight is not fully public, we cannot directly download the LLaMA weights from huggingface. Therefore, you need to follow the following process to load the LLaMA weights.
+## Contributing
 
-1. Converting LLaMA weights to huggingface format with [this script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py).
-```bash
-python src/transformers/models/llama/convert_llama_weights_to_hf.py \
-    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path/llama-7b
-```
-2. For all the commands above, specify the model with `--model /output/path/llama-7b` to load the model. For example:
-```bash
-python simple_server.py --model /output/path/llama-7b
-python -m vllm.http_frontend.fastapi_frontend --model /output/path/llama-7b
-```
+We welcome and value any contributions and collaborations.
+Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.
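
Editor's note: the Getting Started section added above boils down to installing the package and calling the offline-inference API described in the Quickstart. Below is a minimal sketch, assuming the `vllm.LLM` and `SamplingParams` entry points from the documentation; the model checkpoint is only a placeholder.

```python
# Minimal offline-inference sketch, assuming the `vllm.LLM` / `SamplingParams`
# API referenced by the Quickstart; the model name is a placeholder.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")     # any supported HuggingFace model
outputs = llm.generate(prompts, params)  # requests are batched internally

for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```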
Binary file added assets/figures/perf_a100_n1.png
Binary file added assets/figures/perf_a100_n3.png
Binary file added assets/figures/perf_a10g_n1.png
Binary file added assets/figures/perf_a10g_n3.png
13 changes: 8 additions & 5 deletions docs/source/getting_started/installation.rst
@@ -3,17 +3,20 @@
 Installation
 ============
 
-vLLM is a Python library that includes some C++ and CUDA code.
-vLLM can run on systems that meet the following requirements:
+vLLM is a Python library that also contains some C++ and CUDA code.
+This additional code requires compilation on the user's machine.
 
+Requirements
+------------
+
 * OS: Linux
 * Python: 3.8 or higher
 * CUDA: 11.0 -- 11.8
-* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, etc.)
+* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, etc.)
 
 .. note::
     As of now, vLLM does not support CUDA 12.
-    If you are using Hopper or Lovelace GPUs, please use CUDA 11.8.
+    If you are using Hopper or Lovelace GPUs, please use CUDA 11.8 instead of CUDA 12.
 
 .. tip::
     If you have trouble installing vLLM, we recommend using the NVIDIA PyTorch Docker image.
@@ -45,7 +48,7 @@ You can install vLLM using pip:
 Build from source
 -----------------
 
-You can also build and install vLLM from source.
+You can also build and install vLLM from source:
 
 .. code-block:: console
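
Editor's note: a quick way to check the requirements listed in this file (CUDA 11.0--11.8 and compute capability 7.0 or higher) is to query PyTorch, which vLLM depends on. This is an illustrative check under that assumption, not part of the docs.

```python
# Hedged sanity check for the requirements above; assumes PyTorch is already
# installed (it is pulled in as a vLLM dependency).
import torch

print("CUDA version PyTorch was built with:", torch.version.cuda)  # expect 11.x, not 12
major, minor = torch.cuda.get_device_capability(0)                 # e.g., (8, 0) on A100
print(f"GPU 0 compute capability: {major}.{minor}")                # needs to be >= 7.0
```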
16 changes: 15 additions & 1 deletion docs/source/index.rst
@@ -1,7 +1,21 @@
 Welcome to vLLM!
 ================
 
-vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLM).
+**vLLM** is a fast and easy-to-use library for LLM inference and serving.
+Its core features include:
+
+- State-of-the-art performance in serving throughput
+- Efficient management of attention key and value memory with **PagedAttention**
+- Seamless integration with popular HuggingFace models
+- Dynamic batching of incoming requests
+- Optimized CUDA kernels
+- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
+- Tensor parallelism support for distributed inference
+- Streaming outputs
+- OpenAI-compatible API server
+
+For more information, please refer to our `blog post <>`_.
+
 
 Documentation
 -------------
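
Editor's note: the feature list added here mentions *parallel sampling* and *beam search*. A hedged sketch of how those decoding modes are typically selected through `SamplingParams`; the parameter names (`n`, `use_beam_search`) follow the early documentation and may differ between versions.

```python
# Illustrative only: parallel sampling vs. beam search via SamplingParams.
# Parameter names follow the early vLLM docs; exact names may vary by version.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Parallel sampling: n independent stochastic samples per prompt.
parallel = SamplingParams(n=3, temperature=0.8, top_p=0.95)

# Beam search: deterministic search over n beams (temperature set to 0).
beam = SamplingParams(n=3, use_beam_search=True, temperature=0.0)

for params in (parallel, beam):
    for request_output in llm.generate(["The future of AI is"], params):
        for candidate in request_output.outputs:
            print(candidate.text)
```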
4 changes: 2 additions & 2 deletions docs/source/models/supported_models.rst
@@ -3,7 +3,7 @@
 Supported Models
 ================
 
-vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://github.com/huggingface/transformers>`_.
+vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://huggingface.co/models>`_.
 The following is the list of model architectures that are currently supported by vLLM.
 Alongside each architecture, we include some popular models that use it.
 
@@ -18,7 +18,7 @@ Alongside each architecture, we include some popular models that use it.
   * - :code:`GPTNeoXForCausalLM`
     - GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
   * - :code:`LlamaForCausalLM`
-    - LLaMA, Vicuna, Alpaca, Koala
+    - LLaMA, Vicuna, Alpaca, Koala, Guanaco
   * - :code:`OPTForCausalLM`
     - OPT, OPT-IML
 
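
Editor's note: support is keyed off the model's architecture, but in practice a model is selected by its HuggingFace Hub name. The checkpoints below are merely illustrative examples of the listed architectures, not an official list.

```python
# Example checkpoints for the architectures in the table; the names are
# illustrative, and any HF model with a supported architecture should work.
from vllm import LLM, SamplingParams

llm = LLM(model="EleutherAI/pythia-1.4b")           # GPTNeoXForCausalLM
# llm = LLM(model="facebook/opt-1.3b")              # OPTForCausalLM
# llm = LLM(model="openlm-research/open_llama_7b")  # LlamaForCausalLM

outputs = llm.generate(["Deep learning is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```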
2 changes: 1 addition & 1 deletion setup.py
@@ -165,7 +165,7 @@ def get_requirements() -> List[str]:
         "Topic :: Scientific/Engineering :: Artificial Intelligence",
     ],
     packages=setuptools.find_packages(
-        exclude=("benchmarks", "csrc", "docs", "examples", "tests")),
+        exclude=("assets", "benchmarks", "csrc", "docs", "examples", "tests")),
     python_requires=">=3.8",
     install_requires=get_requirements(),
     ext_modules=ext_modules,
