Optimize the Falcon block for inference #500

Merged: mryab merged 18 commits from optim_falcon into main on Sep 4, 2023

Conversation

@mryab (Member) commented on Sep 3, 2023

This PR attempts to optimize the inference of Falcon models in the single-token setup by removing most of the Python overhead and making several assumptions about the setup. Specifically:

  1. Layer normalization, QKV projection (with splitting), and rotary embeddings are executed through CUDA graphs, which removes most of the overhead from small kernel launches.
  2. If no sin/cos tensors are cached by the rotary embedding layer, we cache them for 8192 tokens (INFERENCE_MAX_LENGTH) during the first forward pass. In general, it should be beneficial to always run a max-length sequence before starting a block, but this is a question for another PR.

The PR also adds a small test to ensure that the results (without quantization) of the block before and after the optimization indeed match.
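
For illustration, here is a minimal sketch of the CUDA-graph technique described above. It is not the actual Petals code: the module names, shapes, and the `small_ops` helper are made up, and rotary embeddings are only indicated by a comment. The idea is to capture the cheap per-token ops once and then replay them with a single launch.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the per-token ops mentioned above
# (layer norm + fused QKV projection with splitting); not the real block.
hidden_size = 1024
ln = nn.LayerNorm(hidden_size).cuda().bfloat16()
qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=False).cuda().bfloat16()

def small_ops(hidden_states):
    qkv = qkv_proj(ln(hidden_states))
    q, k, v = qkv.chunk(3, dim=-1)  # split the fused QKV; rotary embeddings would follow here
    return q, k, v

# CUDA graphs replay into fixed memory, so inputs/outputs must live in static buffers.
static_input = torch.zeros(1, 1, hidden_size, device="cuda", dtype=torch.bfloat16)

# Warm up on a side stream before capture, as recommended in the PyTorch docs.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream), torch.no_grad():
    for _ in range(3):
        small_ops(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

# Capture the ops once...
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_q, static_k, static_v = small_ops(static_input)

# ...and replay them for every new token with a single kernel launch.
def run_captured(hidden_states):
    static_input.copy_(hidden_states)
    graph.replay()
    return static_q, static_k, static_v
```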

@borzunov (Collaborator) commented on Sep 3, 2023

Benchmarks: this PR gives roughly +40% to single-token inference speed.

Model: Falcon-40B
GPU: NVIDIA RTX 6000 Ada Generation

main @ d40eb6c, 3 runs

Sep 04 11:52:03.072 [INFO] Inference throughput: 756.6 tokens/sec per block (1 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
Sep 04 11:52:13.657 [INFO] Forward pass throughput: 61338.3 tokens/sec per block (1024 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)

Sep 04 11:53:08.637 [INFO] Inference throughput: 776.9 tokens/sec per block (1 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
Sep 04 11:53:19.217 [INFO] Forward pass throughput: 61292.5 tokens/sec per block (1024 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)

Sep 04 11:54:06.825 [INFO] Inference throughput: 759.0 tokens/sec per block (1 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
Sep 04 11:54:17.416 [INFO] Forward pass throughput: 61322.3 tokens/sec per block (1024 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)

optim_falcon @ 52baffb, 3 runs

Sep 04 11:48:32.613 [INFO] Inference throughput: 1044.6 tokens/sec per block (1 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)                                               
Sep 04 11:48:43.189 [INFO] Forward pass throughput: 62396.0 tokens/sec per block (1024 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)  

Sep 04 11:49:31.860 [INFO] Inference throughput: 1075.5 tokens/sec per block (1 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
Sep 04 11:49:42.453 [INFO] Forward pass throughput: 61365.4 tokens/sec per block (1024 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)

Sep 04 11:50:28.453 [INFO] Inference throughput: 1068.0 tokens/sec per block (1 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
Sep 04 11:50:39.046 [INFO] Forward pass throughput: 61758.3 tokens/sec per block (1024 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
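
Averaging the three runs: roughly 764 tokens/sec on `main` vs roughly 1063 tokens/sec on `optim_falcon` for single-token inference, i.e. about a +39% speedup, while forward-pass throughput at 1024 tokens/batch is essentially unchanged (~61.3k vs ~61.8k tokens/sec).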

@borzunov mentioned this pull request on Sep 3, 2023
Base automatically changed from falcon-new to main on September 3, 2023 21:45
@borzunov added a commit that referenced this pull request on Sep 3, 2023
This PR adds:

- Support for models based on `transformers.FalconModel` (the in-library format for Falcon). Tested on Falcon-40B.
- CI tests for Falcon-RW-1B.
- `--throughput dry_run` option to evaluate throughput and exit right away (implemented by @mryab).

Limitations:

- Backward pass support is broken for now, will be fixed in #500.

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
@mryab marked this pull request as ready for review on September 3, 2023 23:11
@mryab changed the title from "[WIP] Optimize Falcon block for inference" to "Optimize the Falcon block for inference" on Sep 4, 2023
@mryab requested a review from @borzunov on September 4, 2023 12:35
@mryab merged commit 1ebd88a into main on Sep 4, 2023
11 checks passed
@mryab deleted the optim_falcon branch on September 4, 2023 12:38
d-popov pushed a commit to d-popov/petals-ai that referenced this pull request on Sep 6, 2023
This PR attempts to optimize the inference of Falcon models in the single-token setup by removing most of the Python overhead and making several assumptions about the setup. Specifically:

* Layer normalization, QKV projection (with splitting), and rotary embeddings are executed through CUDA graphs, which removes most of the overhead from small kernel launches.
* If no sin/cos tensors are cached by the rotary embedding layer, we cache them for 8192 tokens (INFERENCE_MAX_LENGTH) during the first forward pass. In general, it should be beneficial to always run a max-length sequence before starting a block, but this is a question for another PR.

The PR also adds a small test to ensure that the results (without quantization) of the block before and after the optimization indeed match.

Lastly, the pull request makes the backward pass work (as discussed in bigscience-workshop#499) by making cached sin/cos for RotaryEmbedding into buffers and disabling the inference mode during their creation.
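
For illustration, a minimal sketch of the buffer trick mentioned above, assuming a standalone rotary-embedding module; the class and constant names are illustrative, not the actual Petals/transformers implementation. The sin/cos cache is created with inference mode disabled and registered as buffers so it can participate in the backward pass.

```python
import torch
from torch import nn

INFERENCE_MAX_LENGTH = 8192  # matches the constant referenced in the PR description

class CachedRotaryEmbedding(nn.Module):
    """Illustrative rotary-embedding cache; not the actual Petals code."""

    def __init__(self, head_dim: int, max_len: int = INFERENCE_MAX_LENGTH, base: float = 10000.0):
        super().__init__()
        # Build the cache with inference mode disabled so the resulting tensors
        # are ordinary tensors (usable in backward), even if the module is
        # constructed inside an inference_mode() context.
        with torch.inference_mode(False):
            inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
            positions = torch.arange(max_len).float()
            freqs = torch.outer(positions, inv_freq)
            emb = torch.cat([freqs, freqs], dim=-1)
            # Buffers move with the module across devices/dtypes and are not inference tensors.
            self.register_buffer("cos_cached", emb.cos(), persistent=False)
            self.register_buffer("sin_cached", emb.sin(), persistent=False)

    def forward(self, seq_len: int):
        # Slicing the cached buffers avoids recomputing sin/cos for every token.
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]
```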