GitHub - del-zhenwu/lmdeploy: LMDeploy is a toolkit for compressing, deploying, and serving LLM


Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:

Efficient Inference Engine (TurboMind): Based on FasterTransformer, we have implemented an efficient inference engine, TurboMind, which supports inference of LLaMA and its variant models on NVIDIA GPUs.
Interactive Inference Mode: By caching the attention k/v across the rounds of a multi-round dialogue, the engine remembers the dialogue history and avoids reprocessing earlier rounds.
Multi-GPU Model Deployment and Quantization: We provide comprehensive model deployment and quantization support, validated at different model scales.
Persistent Batch Inference: Further optimization of model execution efficiency.
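
To make the interactive inference idea concrete, here is a minimal toy sketch of per-session k/v caching. It is not TurboMind's implementation: the `SessionCache` class, its methods, and the stand-in "projection" are all hypothetical names invented for illustration. The point is only that each round computes k/v for the new tokens while the history is reused from the cache.

```python
from typing import Dict, List, Tuple


class SessionCache:
    """Toy per-session cache: one (key, value) entry per processed token."""

    def __init__(self) -> None:
        self._cache: Dict[int, List[Tuple[int, int]]] = {}
        self.tokens_processed = 0  # how many tokens were actually "computed"

    def _compute_kv(self, token: int) -> Tuple[int, int]:
        # Stand-in for the attention key/value projection of a single token.
        self.tokens_processed += 1
        return (token * 2, token * 3)

    def step(self, session_id: int, new_tokens: List[int]) -> int:
        """Compute k/v for the new tokens only; return the cached context length."""
        history = self._cache.setdefault(session_id, [])
        history.extend(self._compute_kv(t) for t in new_tokens)
        return len(history)


cache = SessionCache()
cache.step(session_id=1, new_tokens=[11, 12, 13])            # round 1: 3 tokens computed
context_len = cache.step(session_id=1, new_tokens=[14, 15])  # round 2: 2 tokens computed
print(context_len, cache.tokens_processed)
```

Without the cache, the second round would have to reprocess all five tokens of the conversation; with it, only the two new tokens are computed while the context still covers the full history.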

Performance

As shown in the figure below, we have compared the token generation speed of TurboMind with facebookresearch/llama, HuggingFace Transformers, and DeepSpeed on the 7B model.

Target Device: NVIDIA A100(80G)

Metrics: Throughput (token/s)

Test Data: The number of input tokens is 1, and the number of generated tokens is 2048

The throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and up to 2.3x higher than HuggingFace Transformers.
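
For readers who want to reproduce this kind of measurement, the sketch below shows how a tokens/s figure and a relative speedup are usually derived from a timed run. The function names and every number in it are hypothetical placeholders, not values from the benchmark above; the benchmark's batch size is not stated in this document.

```python
def throughput(batch_size: int, generated_tokens: int, elapsed_s: float) -> float:
    """Tokens generated per second, summed over all sequences in the batch."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return batch_size * generated_tokens / elapsed_s


def speedup(candidate_tps: float, baseline_tps: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return candidate_tps / baseline_tps


# Hypothetical run: 32 concurrent sequences, 2048 generated tokens each, 32 seconds.
tps = throughput(batch_size=32, generated_tokens=2048, elapsed_s=32.0)
print(tps)  # 2048.0
```

A statement such as "up to 2.3x" then corresponds to `speedup(candidate_tps, baseline_tps)` evaluating to about 2.3 for the best case measured.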

Quick Start

Installation

Below are quick steps for installation:

conda create -n lmdeploy python=3.10
conda activate lmdeploy
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
pip install -e .

Deploy InternLM

Get InternLM model

# 1. Download InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-7b /path/to/internlm-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-7b hf

Inference by TurboMind

docker run --gpus all --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest \
    python3 -m lmdeploy.turbomind.chat internlm /workspace

When inferring with FP16 precision, the InternLM-7B model requires at least 22.7G of GPU memory on TurboMind. NVIDIA cards such as the 3090, V100, and A100 are recommended.
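
As a rough sanity check on that figure, the arithmetic below estimates how much of the memory goes to the weights alone. It assumes about 7e9 parameters for a "7B" model, 2 bytes per FP16 parameter, and that the quoted 22.7G can be read as GiB; none of these assumptions is confirmed by the text above.

```python
GIB = 1024 ** 3


def fp16_weight_gib(num_params: float) -> float:
    """Memory for the weights alone at 2 bytes per FP16 parameter, in GiB."""
    return num_params * 2 / GIB


# Assumption: a "7B" model has roughly 7e9 parameters.
weights_gib = fp16_weight_gib(7e9)

# Assumption: the quoted 22.7G is read as GiB; the remainder is everything
# that is not weights.
other_gib = 22.7 - weights_gib

print(f"weights: {weights_gib:.1f} GiB, everything else: {other_gib:.1f} GiB")
```

Under these assumptions the weights account for a little over half of the total. The remainder would plausibly cover the k/v cache, activations, and workspace buffers, though the source does not give a breakdown.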

Serving

Launch inference server by:

bash workspace/service_docker_up.sh

Then, you can communicate with the inference server from the command line:

python3 -m lmdeploy.serve.client {server_ip_address}:33337 internlm

or through the web UI:

python3 -m lmdeploy.app {server_ip_address}:33337 internlm

For the deployment of other supported models, such as LLaMA and Vicuna, you can find the guide here.

Inference with PyTorch

Single GPU

python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0

Tensor Parallel with DeepSpeed

deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
    $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
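
To illustrate what tensor parallelism means in the command above, here is a toy sketch of column-wise sharding for a single linear layer. It is plain Python on lists and is not how DeepSpeed implements it; the helper names are hypothetical. Each "GPU" holds a slice of the weight columns, computes its partial output, and the slices are concatenated.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: List[float], w: Matrix) -> List[float]:
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]


def split_columns(w: Matrix, num_shards: int) -> List[Matrix]:
    """Split w into num_shards matrices, each holding a contiguous block of columns."""
    cols = len(w[0])
    if cols % num_shards != 0:
        raise ValueError("column count must be divisible by num_shards")
    step = cols // num_shards
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(num_shards)]


def tensor_parallel_matmul(x: List[float], w: Matrix, num_shards: int) -> List[float]:
    """Each shard computes its slice of the output; the slices are then concatenated."""
    partial = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    return [value for part in partial for value in part]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
result = tensor_parallel_matmul(x, w, num_shards=2)
print(result)  # [11.0, 14.0, 17.0, 20.0]
```

The sharded result matches the unsharded `matmul(x, w)`, which is the property that lets each GPU store only a fraction of the weights.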

Quantization

In fp16 mode, int8 quantization of the kv_cache can be enabled so that a single card can serve more users. First run the quantization script; it stores the quantization parameters in the workspace/triton_models/weights directory generated by deploy.py.

# --symmetry: whether to use symmetric or asymmetric quantization
# --offload: whether to offload some modules to CPU to save GPU memory
# --num_tp: the number of GPUs used for tensor parallelism
python3 -m lmdeploy.lite.apis.kv_qparams \
  --model $HF_MODEL \
  --output_dir $DEPLOY_WEIGHT_DIR \
  --symmetry True \
  --offload False \
  --num_tp 1
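
The difference between symmetric and asymmetric quantization, which the --symmetry option selects, can be sketched in a few lines. This is a generic int8 illustration in pure Python, not the code that kv_qparams runs; the function names are hypothetical. Symmetric quantization fixes the zero point at 0 and scales by the largest absolute value, while asymmetric quantization maps the actual [min, max] range onto the full int8 range.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to int8 with the zero point fixed at 0."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def quantize_asymmetric(values: List[float]) -> Tuple[List[int], float, int]:
    """Map the [min, max] range of the floats onto the int8 range [-128, 127]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q: List[int], scale: float, zero_point: int = 0) -> List[float]:
    """Recover approximate floats from int8 codes."""
    return [(x - zero_point) * scale for x in q]


values = [-1.0, -0.5, 0.0, 0.25, 2.0]
q_sym, s_sym = quantize_symmetric(values)
q_asym, s_asym, zp = quantize_asymmetric(values)
print(q_sym, s_sym)
print(q_asym, s_asym, zp)
```

For skewed data like the sample above, the asymmetric variant uses a smaller scale (3/255 instead of 2/127), so its rounding error per value is lower, at the cost of storing a zero point.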

Then adjust workspace/triton_models/weights/config.ini:

Set use_context_fmha to 0, which turns it off.
Set quant_policy to 4. This parameter defaults to 0, which means quantization is not enabled.

The quantization test results are available here.
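
If you prefer to script the two config.ini changes rather than edit the file by hand, a minimal sketch with Python's standard configparser could look like the following. The section name `llama` and the sample file contents are assumptions made for illustration; check the actual section name in your generated config.ini. Note that configparser drops comments when it rewrites a file.

```python
import configparser
import io


def enable_kv_int8(config_text: str, section: str = "llama") -> str:
    """Return config text with use_context_fmha=0 and quant_policy=4 in `section`.

    The default section name is an assumption, not taken from the README.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section(section):
        raise KeyError(f"section [{section}] not found")
    parser.set(section, "use_context_fmha", "0")
    parser.set(section, "quant_policy", "4")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical sample; a real config.ini contains more keys.
sample = """[llama]
model_name = internlm
use_context_fmha = 1
quant_policy = 0
"""
updated = enable_kv_int8(sample)
print(updated)
```

To apply it to a real file, read workspace/triton_models/weights/config.ini into a string, pass it through `enable_kv_int8`, and write the result back, keeping a backup of the original.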

Contributing

We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guideline.

Acknowledgement

FasterTransformer

License

This project is released under the Apache 2.0 license.
