Commit 512f90a: update gifs

JustinLin610 committed Aug 16, 2023
1 parent 4957c33 commit 512f90a

Showing 6 changed files with 99 additions and 39 deletions.
63 changes: 42 additions & 21 deletions README.md
@@ -27,7 +27,6 @@ Qwen-7B is the 7B-parameter version of the large language model series, Qwen (ab

The following sections include information that you might find helpful. Specifically, we advise you to read the FAQ section before you open an issue.


## News

* 2023.8.3 We release both Qwen-7B and Qwen-7B-Chat on ModelScope and Hugging Face. We also provide a technical memo for more details about the model, including training details and model performance.
@@ -250,11 +249,11 @@ Note: The GPU memory usage profiling in the above table is performed on single A

We measured the average inference speed of generating 2K tokens under BF16 precision and under the Int8 and NF4 quantization levels, respectively.

| Quantization Level      | Inference Speed with flash_attn (tokens/s) | Inference Speed w/o flash_attn (tokens/s) |
| ----------------------- | :----------------------------------------: | :---------------------------------------: |
| BF16 (no quantization)  | 30.06                                       | 27.55                                      |
| Int8 (bnb)              | 7.94                                        | 7.86                                       |
| NF4 (bnb)               | 21.43                                       | 20.37                                      |
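
The Int8 and NF4 rows refer to bitsandbytes ("bnb") quantization. As a rough sketch of what that means (this is not necessarily the exact loading code used for the profiling), a model can be loaded with NF4 quantization via transformers' `BitsAndBytesConfig`:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization via bitsandbytes; load_in_8bit=True would correspond to the Int8 row.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # Qwen ships custom modeling code
).eval()
```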

Specifically, the profiling setting is to generate 2048 new tokens from a single context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the 2048 generated tokens.
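
The script linked at the end of this section is the authoritative reference; purely as an illustrative sketch under the same setting (1 context token, 2048 new tokens), such a measurement could look like this:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True
).eval()

# Start from a single context token, then generate exactly 2048 new tokens.
input_ids = tokenizer("你好", return_tensors="pt").input_ids[:, :1].to(model.device)
torch.cuda.synchronize()
start = time.time()
output = model.generate(input_ids, max_new_tokens=2048, min_new_tokens=2048, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start
print(f"{(output.shape[1] - input_ids.shape[1]) / elapsed:.2f} tokens/s")
```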

@@ -265,30 +264,23 @@ We also profile the peak GPU memory usage for encoding 2048 tokens as context (a
When using flash attention, the memory usage is:

| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| ------------------ | :---------------------------------: | :-----------------------------------: |
| BF16               | 18.11GB                             | 23.52GB                                |
| Int8               | 12.17GB                             | 17.60GB                                |
| NF4                | 9.52GB                              | 14.93GB                                |

When not using flash attention, the memory usage is:

| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| ------------------ | :---------------------------------: | :-----------------------------------: |
| BF16               | 18.11GB                             | 24.40GB                                |
| Int8               | 12.18GB                             | 18.47GB                                |
| NF4                | 9.52GB                              | 15.81GB                                |
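
Peak figures like these can be read from PyTorch's CUDA memory counters; a minimal sketch (again, the script linked below is the authoritative reference):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the encoding or generation workload here ...
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f}GB")
```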

The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).

## Demo

### Web UI

@@ -304,16 +296,40 @@ Then run the command below and click on the generated link:
python web_demo.py
```

<p align="center">
<br>
<img src="assets/web_demo.gif" width="600" />
<br>
</p>

### CLI Demo

We provide a CLI demo example in `cli_demo.py`, which supports streaming output for generation. Users can interact with Qwen-7B-Chat by entering prompts, and the model streams its outputs back. Run the command below:

```
python cli_demo.py
```
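
`cli_demo.py` is the supported entry point; the sketch below is not its actual implementation, but illustrates the underlying idea of streaming generation using transformers' `TextStreamer`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True
).eval()

inputs = tokenizer("Tell me about large language models.", return_tensors="pt").to(model.device)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
# Tokens are printed to stdout as they are generated.
model.generate(**inputs, max_new_tokens=256, streamer=streamer)
```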

<p align="center">
<br>
<img src="assets/cli_demo.gif" width="600" />
<br>
</p>

## API

We provide a method to deploy a local API based on the OpenAI API (thanks to @hanpenggit). Before you start, install the required packages:

```bash
pip install fastapi uvicorn openai pydantic sse_starlette
```

Then run the command to deploy your API:

```bash
python openai_api.py
```

You can change the arguments, e.g., `-c` for the checkpoint name or path, `--cpu-only` for CPU deployment, etc. If you have trouble launching the API deployment, updating the packages to their latest versions will likely solve it.
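
For example (the checkpoint name shown here is only illustrative):

```bash
# Serve a specific checkpoint; drop --cpu-only to keep the model on GPU.
python openai_api.py -c Qwen/Qwen-7B-Chat --cpu-only
```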

Using the API is also simple. See the example below:
@@ -345,6 +361,11 @@ response = openai.ChatCompletion.create(
print(response.choices[0].message.content)
```
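
If you prefer raw HTTP, an equivalent `curl` call might look like the sketch below; the host, port, and model name are assumptions about your deployment (the defaults of `openai_api.py` may differ), so adjust them accordingly:

```bash
# Hypothetical local deployment; adjust host/port/model to your setup.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen-7B-Chat",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```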

<p align="center">
<br>
<img src="assets/openai_api.gif" width="600" />
<br>
</p>

## Tool Usage

40 changes: 30 additions & 10 deletions README_CN.md
@@ -280,19 +280,10 @@ model = AutoModelForCausalLM.from_pretrained(
| Int8 | 12.18GB | 18.47GB |
| NF4 | 9.52GB | 15.81GB |


The speed and memory profiling above can be reproduced with [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).

## Demo

### Web UI

We provide a Web UI demo for users (thanks to @wysaid for the support). Before you start, make sure the following packages are installed:
@@ -307,16 +298,41 @@ pip install -r requirements_web_demo.txt
python web_demo.py
```

<p align="center">
<br>
<img src="assets/web_demo.gif" width="600" />
<br>
</p>


### Interactive Demo

We provide a simple interactive demo example; see `cli_demo.py`. The model supports streaming output, so users can interact with Qwen-7B-Chat by typing prompts and receive the results as a stream. Run the following command:

```
python cli_demo.py
```

<p align="center">
<br>
<img src="assets/cli_demo.gif" width="600" />
<br>
</p>

## API

We provide a method to deploy a local API in the OpenAI API format (thanks to @hanpenggit). Before you start, install the required packages:

```bash
pip install fastapi uvicorn openai pydantic sse_starlette
```

Then run the following command to deploy your local API:

```bash
python openai_api.py
```

You can also modify the arguments, e.g., `-c` to change the model name or path, or `--cpu-only` for CPU deployment. If the deployment runs into problems, updating the packages above usually resolves most of them.
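
For example (the local checkpoint path is a hypothetical placeholder):

```bash
# Serve a local checkpoint on CPU.
python openai_api.py -c /path/to/Qwen-7B-Chat --cpu-only
```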

Using the API is just as simple. See the example below:
@@ -348,6 +364,11 @@ response = openai.ChatCompletion.create(
print(response.choices[0].message.content)
```

<p align="center">
<br>
<img src="assets/openai_api.gif" width="600" />
<br>
</p>

## Tool Usage

@@ -405,7 +426,6 @@ For how to write and use prompts for ReAct Prompting, please refer to [the ReAct

If you run into problems, please consult the [FAQ](FAQ_zh.md) and existing issues first, and open a new issue only if those do not resolve the problem.


## License Agreement

Researchers and developers are free to use Qwen-7B and Qwen-7B-Chat and to build on top of them. We also allow commercial use; see [LICENSE](LICENSE) for details. For commercial use, please fill out the [questionnaire](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.
35 changes: 27 additions & 8 deletions README_JA.md
@@ -285,14 +285,6 @@ When not using flash attention, the memory usage is as follows

## Demo

### Web UI

We provide code for building a Web UI demo (thanks to @wysaid). Before you start, make sure the following packages are installed:
@@ -307,7 +299,28 @@ pip install -r requirements_web_demo.txt
python web_demo.py
```

<p align="center">
<br>
<img src="assets/web_demo.gif" width="600" />
<br>
</p>

### CLI Demo

We provide a CLI demo example in `cli_demo.py`. Users can interact with Qwen-7B-Chat by entering prompts, and the model returns its outputs in streaming mode. Run the following command:

```
python cli_demo.py
```

<p align="center">
<br>
<img src="assets/cli_demo.gif" width="600" />
<br>
</p>

## API

We provide a method to deploy a local API based on the OpenAI API (thanks to @hanpenggit). Before you start, install the required packages:

```bash
pip install fastapi uvicorn openai pydantic sse_starlette
```

@@ -351,6 +364,12 @@ response = openai.ChatCompletion.create(
print(response.choices[0].message.content)
```

<p align="center">
<br>
<img src="assets/openai_api.gif" width="600" />
<br>
</p>

## Tool Usage

Qwen-7B-Chat is optimized for tool use, including APIs, databases, and models, so that users can build their own Qwen-7B-based LangChain agents and code interpreters. In our evaluation [benchmark](eval/EVALUATION.md) for tool-use capability, Qwen-7B achieves stable performance.
Binary file added assets/cli_demo.gif
Binary file added assets/openai_api.gif
Binary file added assets/web_demo.gif
