Add Falcon support #499 (Merged)

Conversation
borzunov force-pushed the falcon-new branch 7 times, most recently from bcbd950 to 2137876 on September 3, 2023 at 10:46.
Falcon-40B benchmarks

These were measured before 4537c77, which slows down inference by 1-2% (but is necessary to make MQA models work properly with the rest of Petals).

H100 (80 GB), A100 (80 GB), RTX A6000 Ada (48 GB): [benchmark tables omitted]
borzunov force-pushed the falcon-new branch 2 times, most recently from a8cb6dc to 72033d1 on September 3, 2023 at 14:34.
borzunov force-pushed the falcon-new branch 4 times, most recently from 872f80d to 2747255 on September 3, 2023 at 19:22.
mryab added a commit that referenced this pull request on Sep 4, 2023:
This PR attempts to optimize the inference of Falcon models in the single-token setup by reducing the majority of Python overhead and making several assumptions about the setup. Specifically:

* Layer normalization, QKV projection (with splitting), and rotary embeddings are executed through CUDA graphs, which removes most of the overhead related to small kernel launches.
* If no sin/cos tensors are cached by the rotary embedding layer, we cache them for 8192 tokens (INFERENCE_MAX_LENGTH) during the first forward pass. In general, it should be beneficial to always run a max-length sequence before starting a block, but this is a question for another PR.

The PR also adds a small test to ensure that the results of the block (without quantization) before and after the optimization indeed match. Lastly, the pull request makes the backward pass work (as discussed in #499) by making the cached sin/cos for RotaryEmbedding into buffers and disabling inference mode during their creation.
d-popov pushed a commit to d-popov/petals-ai that referenced this pull request on Sep 6, 2023:
This PR adds:

- Support for models based on `transformers.FalconModel` (the in-library format for Falcon). Tested on Falcon-40B.
- CI tests for Falcon-RW-1B.
- `--throughput dry_run` option to evaluate throughput and exit right away (implemented by @mryab).

Limitations:

- Backward pass support is broken for now, will be fixed in bigscience-workshop#500.

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
d-popov pushed a commit to d-popov/petals-ai that referenced this pull request on Sep 6, 2023.