Add Falcon support #499 (Merged)

Conversation
borzunov force-pushed the falcon-new branch 7 times, most recently from bcbd950 to 2137876 on September 3, 2023 at 10:46.
Falcon-40B benchmarks

These were measured before 4537c77, which slows down inference by 1-2% (but is necessary to make MQA models work properly with the rest of Petals).

H100 (80 GB), A100 (80 GB), RTX A6000 Ada (48 GB): [benchmark tables omitted]
borzunov force-pushed the falcon-new branch 2 times, most recently from a8cb6dc to 72033d1 on September 3, 2023 at 14:34.
borzunov force-pushed the falcon-new branch 4 times, most recently from 872f80d to 2747255 on September 3, 2023 at 19:22.
mryab added a commit that referenced this pull request on Sep 4, 2023:
This PR attempts to optimize the inference of Falcon models in the single-token setup by reducing the majority of Python overhead and making several assumptions about the setup. Specifically:

* Layer normalization, QKV projection (with splitting), and rotary embeddings are executed through CUDA graphs, which removes most of the overhead related to small kernel launches.
* If no sin/cos tensors are cached by the rotary embedding layer, we cache them for 8192 tokens (INFERENCE_MAX_LENGTH) during the first forward pass. In general, it should be beneficial to always run a max-length sequence before starting a block, but this is a question for another PR.

The PR also adds a small test to ensure that the results of the block (without quantization) before and after the optimization indeed match. Lastly, the pull request makes the backward pass work (as discussed in #499) by making the cached sin/cos for RotaryEmbedding into buffers and disabling inference mode during their creation.
d-popov pushed a commit to d-popov/petals-ai that referenced this pull request on Sep 6, 2023:
This PR adds:

- Support for models based on `transformers.FalconModel` (the in-library format for Falcon). Tested on Falcon-40B.
- CI tests for Falcon-RW-1B.
- `--throughput dry_run` option to evaluate throughput and exit right away (implemented by @mryab).

Limitations:

- Backward pass support is broken for now, will be fixed in bigscience-workshop#500.

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
d-popov pushed a commit to d-popov/petals-ai that referenced this pull request on Sep 6, 2023.