Optimize the Falcon block for inference #500

Merged: mryab merged 18 commits from optim_falcon into main on Sep 4, 2023

Conversation

@mryab (Member) commented on Sep 3, 2023

This PR attempts to optimize the inference of Falcon models in the single-token setup by removing most of the Python overhead and making several assumptions about the setup. Specifically:

  1. Layer normalization, QKV projection (with splitting), and rotary embeddings are executed through CUDA graphs, which removes most of the overhead from small kernel launches.
  2. If no sin/cos tensors are cached by the rotary embedding layer, we cache them for 8192 tokens (INFERENCE_MAX_LENGTH) during the first forward pass. In general, it should be beneficial to always run a max-length sequence before starting a block, but this is a question for another PR.

The PR also adds a small test to ensure that the results (without quantization) of the block before and after the optimization indeed match.
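
For illustration, here is a minimal sketch of the CUDA-graph technique described above. It is not the actual Petals code: the module names, shapes, and the `small_ops` helper are made up, and rotary embeddings are only indicated by a comment. The idea is to capture the cheap per-token ops once and then replay them with a single launch.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the per-token ops mentioned above
# (layer norm + fused QKV projection with splitting); not the real block.
hidden_size = 1024
ln = nn.LayerNorm(hidden_size).cuda().bfloat16()
qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=False).cuda().bfloat16()

def small_ops(hidden_states):
    qkv = qkv_proj(ln(hidden_states))
    q, k, v = qkv.chunk(3, dim=-1)  # split the fused QKV; rotary embeddings would follow here
    return q, k, v

# CUDA graphs replay into fixed memory, so inputs/outputs must live in static buffers.
static_input = torch.zeros(1, 1, hidden_size, device="cuda", dtype=torch.bfloat16)

# Warm up on a side stream before capture, as recommended in the PyTorch docs.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream), torch.no_grad():
    for _ in range(3):
        small_ops(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

# Capture the ops once...
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_q, static_k, static_v = small_ops(static_input)

# ...and replay them for every new token with a single kernel launch.
def run_captured(hidden_states):
    static_input.copy_(hidden_states)
    graph.replay()
    return static_q, static_k, static_v
```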

@borzunov (Collaborator) commented on Sep 3, 2023

Benchmarks: this PR gives roughly +40% to single-token inference speed.

Model: Falcon-40B
GPU: NVIDIA RTX 6000 Ada Generation

main @ d40eb6c, 3 runs

Sep 04 11:52:03.072 [INFO] Inference throughput: 756.6 tokens/sec per block (1 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
Sep 04 11:52:13.657 [INFO] Forward pass throughput: 61338.3 tokens/sec per block (1024 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)

Sep 04 11:53:08.637 [INFO] Inference throughput: 776.9 tokens/sec per block (1 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
Sep 04 11:53:19.217 [INFO] Forward pass throughput: 61292.5 tokens/sec per block (1024 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)

Sep 04 11:54:06.825 [INFO] Inference throughput: 759.0 tokens/sec per block (1 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
Sep 04 11:54:17.416 [INFO] Forward pass throughput: 61322.3 tokens/sec per block (1024 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)

optim_falcon @ 52baffb, 3 runs

Sep 04 11:48:32.613 [INFO] Inference throughput: 1044.6 tokens/sec per block (1 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)                                               
Sep 04 11:48:43.189 [INFO] Forward pass throughput: 62396.0 tokens/sec per block (1024 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)  

Sep 04 11:49:31.860 [INFO] Inference throughput: 1075.5 tokens/sec per block (1 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
Sep 04 11:49:42.453 [INFO] Forward pass throughput: 61365.4 tokens/sec per block (1024 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)

Sep 04 11:50:28.453 [INFO] Inference throughput: 1068.0 tokens/sec per block (1 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
Sep 04 11:50:39.046 [INFO] Forward pass throughput: 61758.3 tokens/sec per block (1024 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
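
Averaging the three runs: roughly 764 tokens/sec on `main` vs roughly 1063 tokens/sec on `optim_falcon` for single-token inference, i.e. about a +39% speedup, while forward-pass throughput at 1024 tokens/batch is essentially unchanged (~61.3k vs ~61.8k tokens/sec).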

@borzunov mentioned this pull request on Sep 3, 2023
Base automatically changed from falcon-new to main on September 3, 2023 21:45
@borzunov added a commit that referenced this pull request on Sep 3, 2023
This PR adds:

- Support for models based on `transformers.FalconModel` (the in-library format for Falcon). Tested on Falcon-40B.
- CI tests for Falcon-RW-1B.
- `--throughput dry_run` option to evaluate throughput and exit right away (implemented by @mryab).

Limitations:

- Backward pass support is broken for now, will be fixed in #500.

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
@mryab marked this pull request as ready for review on September 3, 2023 23:11
@mryab changed the title from "[WIP] Optimize Falcon block for inference" to "Optimize the Falcon block for inference" on Sep 4, 2023
@mryab requested a review from @borzunov on September 4, 2023 12:35
@mryab merged commit 1ebd88a into main on Sep 4, 2023
11 checks passed
@mryab deleted the optim_falcon branch on September 4, 2023 12:38
d-popov pushed a commit to d-popov/petals-ai that referenced this pull request on Sep 6, 2023
This PR attempts to optimize the inference of Falcon models in the single-token setup by removing most of the Python overhead and making several assumptions about the setup. Specifically:

* Layer normalization, QKV projection (with splitting), and rotary embeddings are executed through CUDA graphs, which removes most of the overhead from small kernel launches.
* If no sin/cos tensors are cached by the rotary embedding layer, we cache them for 8192 tokens (INFERENCE_MAX_LENGTH) during the first forward pass. In general, it should be beneficial to always run a max-length sequence before starting a block, but this is a question for another PR.

The PR also adds a small test to ensure that the results (without quantization) of the block before and after the optimization indeed match.

Lastly, the pull request makes the backward pass work (as discussed in bigscience-workshop#499) by making cached sin/cos for RotaryEmbedding into buffers and disabling the inference mode during their creation.
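
For illustration, a minimal sketch of the buffer trick mentioned above, assuming a standalone rotary-embedding module; the class and constant names are illustrative, not the actual Petals/transformers implementation. The sin/cos cache is created with inference mode disabled and registered as buffers so it can participate in the backward pass.

```python
import torch
from torch import nn

INFERENCE_MAX_LENGTH = 8192  # matches the constant referenced in the PR description

class CachedRotaryEmbedding(nn.Module):
    """Illustrative rotary-embedding cache; not the actual Petals code."""

    def __init__(self, head_dim: int, max_len: int = INFERENCE_MAX_LENGTH, base: float = 10000.0):
        super().__init__()
        # Build the cache with inference mode disabled so the resulting tensors
        # are ordinary tensors (usable in backward), even if the module is
        # constructed inside an inference_mode() context.
        with torch.inference_mode(False):
            inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
            positions = torch.arange(max_len).float()
            freqs = torch.outer(positions, inv_freq)
            emb = torch.cat([freqs, freqs], dim=-1)
            # Buffers move with the module across devices/dtypes and are not inference tensors.
            self.register_buffer("cos_cached", emb.cos(), persistent=False)
            self.register_buffer("sin_cached", emb.sin(), persistent=False)

    def forward(self, seq_len: int):
        # Slicing the cached buffers avoids recomputing sin/cos for every token.
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]
```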