
New optimized kernels #365

Merged · 8 commits · Feb 24, 2024

Conversation

@casper-hansen (Owner) commented Feb 24, 2024

New kernels from casper-hansen/AutoAWQ_kernels#12 / mit-han-lab/llm-awq#142 that scale better. They match the ExLlamaV2 kernels at batch size 1 and decode up to 64% faster at larger batch sizes. They are also more memory efficient, saving gigabytes of VRAM as batch size and sequence length grow. Since they decode much faster than the previous GEMM kernels, they should become the new preferred format, although switching requires requantization.

Note: The AutoAWQ_kernels PR slightly modified the kernels for Windows compatibility and to fix some small issues, such as float-to-half and half-to-float conversions.

Benchmarks

GPU: NVIDIA GeForce RTX 4090
Model: Mistral 7B Instruct

New kernel

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---:|---:|---:|---:|---:|:---|
| 1 | 32 | 32 | 153.108 | 181.792 | 4.47 GB (18.90%) |
| 1 | 64 | 64 | 4986.08 | 184.787 | 4.48 GB (18.96%) |
| 1 | 128 | 128 | 6445.57 | 183.334 | 4.49 GB (19.00%) |
| 1 | 256 | 256 | 7903.41 | 178.064 | 4.51 GB (19.08%) |
| 1 | 512 | 512 | 8918.64 | 168.751 | 4.55 GB (19.24%) |
| 1 | 1024 | 1024 | 7338.22 | 152.792 | 4.63 GB (19.56%) |
| 1 | 2048 | 2048 | 5690.87 | 128.262 | 5.63 GB (23.80%) |
| 1 | 4096 | 4096 | 3843.33 | 129.923 | 9.82 GB (41.54%) |
| 8 | 32 | 32 | 990 | 1324 | 4.50 GB (19.02%) |
| 8 | 64 | 64 | 9628 | 1323 | 4.54 GB (19.18%) |
| 8 | 128 | 128 | 9656 | 1309 | 4.60 GB (19.45%) |
| 8 | 256 | 256 | 9036 | 1265 | 4.72 GB (19.97%) |
| 8 | 512 | 512 | 7992 | 1191 | 5.31 GB (22.45%) |
| 8 | 1024 | 1024 | 6922 | 1070 | 7.92 GB (33.49%) |
| 8 | 2048 | 2048 | 5520 | 892 | 16.90 GB (71.44%) |
| 16 | 32 | 32 | 1997.92 | 2591.88 | 4.53 GB (19.14%) |
| 16 | 64 | 64 | 9916.44 | 2583.3 | 4.60 GB (19.44%) |
| 16 | 128 | 128 | 9446.14 | 2547.31 | 4.72 GB (19.96%) |
| 16 | 256 | 256 | 8663.37 | 2426.64 | 5.22 GB (22.08%) |
| 16 | 512 | 512 | 7967.01 | 2221.12 | 6.65 GB (28.13%) |
| 16 | 1024 | 1024 | 6924.81 | 1897.71 | 11.86 GB (50.14%) |
| 32 | 32 | 32 | 3353.53 | 4248.61 | 4.59 GB (19.40%) |
| 32 | 64 | 64 | 9677.23 | 4154.19 | 4.72 GB (19.95%) |
| 32 | 128 | 128 | 9109.03 | 4044.9 | 5.22 GB (22.06%) |
| 32 | 256 | 256 | 8595.1 | 3770.58 | 6.49 GB (27.42%) |
| 32 | 512 | 512 | 7937.57 | 3275.92 | 9.34 GB (39.50%) |

ExLlamaV2

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---:|---:|---:|---:|---:|:---|
| 1 | 32 | 32 | 151.36 | 186.248 | 4.53 GB (19.15%) |
| 1 | 64 | 64 | 1005.16 | 186.704 | 4.54 GB (19.21%) |
| 1 | 128 | 128 | 3447 | 185.016 | 4.56 GB (19.27%) |
| 1 | 256 | 256 | 6128.1 | 179.998 | 4.58 GB (19.38%) |
| 1 | 512 | 512 | 8355.09 | 170.445 | 4.64 GB (19.60%) |
| 1 | 1024 | 1024 | 8192.38 | 154.214 | 4.74 GB (20.05%) |
| 1 | 2048 | 2048 | 6539.28 | 129.506 | 5.77 GB (24.41%) |
| 1 | 4096 | 4096 | 4280.31 | 130.978 | 10.08 GB (42.61%) |
| 8 | 32 | 32 | 1037 | 1154 | 4.57 GB (19.32%) |
| 8 | 64 | 64 | 8909 | 1150 | 4.62 GB (19.54%) |
| 8 | 128 | 128 | 11235 | 1141 | 4.71 GB (19.93%) |
| 8 | 256 | 256 | 11426 | 1109 | 4.90 GB (20.71%) |
| 8 | 512 | 512 | 10215 | 1050 | 5.56 GB (23.52%) |
| 8 | 1024 | 1024 | 8716 | 953 | 8.39 GB (35.48%) |
| 8 | 2048 | 2048 | 6627 | 808 | 17.80 GB (75.28%) |
| 16 | 32 | 32 | 1960.73 | 2048.13 | 4.61 GB (19.51%) |
| 16 | 64 | 64 | 11457.77 | 2023.67 | 4.71 GB (19.92%) |
| 16 | 128 | 128 | 12099.81 | 1992.66 | 4.89 GB (20.69%) |
| 16 | 256 | 256 | 11357.77 | 1917.56 | 5.47 GB (23.14%) |
| 16 | 512 | 512 | 10438.64 | 1785.0 | 7.12 GB (30.12%) |
| 16 | 1024 | 1024 | 8807.95 | 1565.22 | 12.77 GB (53.98%) |
| 32 | 32 | 32 | 3370.3 | 2278.66 | 4.70 GB (19.89%) |
| 32 | 64 | 64 | 12461.55 | 2259.29 | 4.89 GB (20.68%) |
| 32 | 128 | 128 | 12169.56 | 2226.39 | 5.47 GB (23.13%) |
| 32 | 256 | 256 | 11562.06 | 2150.82 | 6.96 GB (29.41%) |
| 32 | 512 | 512 | 10565.37 | 1995.27 | 10.25 GB (43.34%) |
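As a sanity check on the headline claims, the decode speedup and memory savings can be computed directly from the two tables above. The numbers below are copied from the batch size 32, prefill/decode length 512 rows:

```python
# Figures taken from the benchmark tables above
# (batch size 32, prefill/decode length 512).
new_decode_tps = 3275.92      # new kernel, decode tokens/s
exllama_decode_tps = 1995.27  # ExLlamaV2, decode tokens/s

speedup = new_decode_tps / exllama_decode_tps - 1.0
print(f"decode speedup at batch size 32: {speedup:.0%}")  # ~64%

new_vram_gb = 9.34       # new kernel, VRAM
exllama_vram_gb = 10.25  # ExLlamaV2, VRAM
saved_gb = exllama_vram_gb - new_vram_gb
print(f"VRAM saved: {saved_gb:.2f} GB")
```

At larger workloads the absolute savings grow further, e.g. 0.90 GB at batch size 8 with 2048/2048 (17.80 GB vs 16.90 GB).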
