Marlin symmetric quantization and inference #320

IlyasMoutawwakil · 2024-01-25T10:47:39Z

Done together with @casper-hansen 🤗
Experimental, still needs cleanup.

IlyasMoutawwakil · 2024-01-25T11:13:51Z

Perplexity results

Symmetric AWQ Marlin model:

user@hf-dgx-01:/workspace/opt-bench$ python examples/eval.py --model_path vicuna-7b-v1.5-awq-marlin
Perplexity 7.138: 100%|██████████| 166/166 [00:54<00:00,  3.04it/s]
Perplexity: 7.138

Zero Point AWQ model:

user@hf-dgx-01:/workspace/opt-bench$ python examples/eval.py --model_path vicuna-7b-v1.5-awq
Perplexity 7.013: 100%|██████████| 166/166 [01:37<00:00,  1.71it/s]
Perplexity: 7.013
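
For readers less familiar with the two formats being compared, here is a rough illustration. It is a generic round-to-nearest sketch, not the quantizer from this PR: it quantizes one small group of weights to 4 bits both ways, symmetric (one scale per group, no stored offset) and zero-point (a scale plus an integer offset per group). The function names and the sample weights are made up for the example.

```python
from typing import List, Tuple


def quantize_symmetric(group: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric: signed integers around zero, one scale per group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    max_abs = max(abs(w) for w in group)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in group]
    return q, scale


def quantize_zero_point(group: List[float], bits: int = 4) -> Tuple[List[int], float, int]:
    """Zero-point: unsigned integers, a scale and an integer offset per group."""
    qmax = 2 ** bits - 1  # 15 for 4 bits
    lo, hi = min(group), max(group)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = max(0, min(qmax, round(-lo / scale)))
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in group]
    return q, scale, zero


def dequantize_symmetric(q: List[int], scale: float) -> List[float]:
    return [scale * v for v in q]


def dequantize_zero_point(q: List[int], scale: float, zero: int) -> List[float]:
    return [scale * (v - zero) for v in q]


def max_error(original: List[float], restored: List[float]) -> float:
    return max(abs(a - b) for a, b in zip(original, restored))


# A deliberately skewed group: the positive range is much wider than the negative one.
weights = [-0.3, 0.07, 0.42, 1.2]

q_sym, s_sym = quantize_symmetric(weights)
q_zp, s_zp, zero = quantize_zero_point(weights)

err_sym = max_error(weights, dequantize_symmetric(q_sym, s_sym))
err_zp = max_error(weights, dequantize_zero_point(q_zp, s_zp, zero))
print(f"symmetric max error:  {err_sym:.4f}")
print(f"zero-point max error: {err_zp:.4f}")
```

On a skewed group like this one, the symmetric grid wastes part of its range on values that never occur, so its reconstruction error is larger. That is one plausible reason for the slightly higher perplexity of the symmetric model, although the sketch does not prove it for the models measured here.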

IlyasMoutawwakil · 2024-01-29T09:36:14Z

Updated with the new Marlin kernel.

Perf Bench

Batch Size = 1

GEMM

GPU: NVIDIA A100-SXM4-80GB
Model: IlyasMoutawwakil/vicuna-7b-v1.5-awq-gemm
Version: GEMM
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)     |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:------------------|
|            1 |               32 |              32 |            277.835 |           93.0104 | 4.55 GB (5.75%)   |
|            1 |               64 |              64 |           1749.38  |           93.7882 | 4.57 GB (5.78%)   |
|            1 |              128 |             128 |           2058.66  |           93.2005 | 4.61 GB (5.82%)   |
|            1 |              256 |             256 |           2349.42  |           93.2399 | 4.67 GB (5.90%)   |
|            1 |              512 |             512 |           2546.7   |           92.5384 | 4.80 GB (6.06%)   |
|            1 |             1024 |            1024 |           5635.29  |           92.893  | 5.06 GB (6.40%)   |
|            1 |             2048 |            2048 |           5343.16  |           76.1673 | 6.15 GB (7.77%)   |
|            1 |             4096 |            4096 |           4555.36  |           77.43   | 11.15 GB (14.08%) |
|            1 |             8192 |            8192 |           3133.58  |           57.5579 | 28.68 GB (36.23%) |

Marlin

GPU: NVIDIA A100-SXM4-80GB
Model: IlyasMoutawwakil/vicuna-7b-v1.5-awq-marlin
Version: Marlin
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)     |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:------------------|
|            1 |               32 |              32 |            277.605 |          102.415  | 4.52 GB (5.71%)   |
|            1 |               64 |              64 |           2707.15  |          103.143  | 4.55 GB (5.74%)   |
|            1 |              128 |             128 |           5128.98  |          102.101  | 4.58 GB (5.78%)   |
|            1 |              256 |             256 |           6342.32  |          104.245  | 4.64 GB (5.86%)   |
|            1 |              512 |             512 |           6664.59  |          103.435  | 4.77 GB (6.03%)   |
|            1 |             1024 |            1024 |           6254.59  |          104.344  | 5.04 GB (6.36%)   |
|            1 |             2048 |            2048 |           5427.11  |          103.412  | 6.13 GB (7.75%)   |
|            1 |             4096 |            4096 |           4450.35  |          105.25   | 11.12 GB (14.05%) |
|            1 |             8192 |            8192 |           2955.77  |           72.1539 | 28.65 GB (36.20%) |

ExllamaV2

GPU: NVIDIA A100-SXM4-80GB
Model: IlyasMoutawwakil/vicuna-7b-v1.5-awq-gemm
Version: ExllamaV2
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)     |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:------------------|
|            1 |               32 |              32 |             291.52 |          126.095  | 4.55 GB (5.75%)   |
|            1 |               64 |              64 |            1623.86 |          126.239  | 4.57 GB (5.78%)   |
|            1 |              128 |             128 |            4237.64 |          125.642  | 4.61 GB (5.82%)   |
|            1 |              256 |             256 |            6727.2  |          125.932  | 4.67 GB (5.90%)   |
|            1 |              512 |             512 |            8173.39 |          125.267  | 4.80 GB (6.06%)   |
|            1 |             1024 |            1024 |            8178.4  |          124.575  | 5.06 GB (6.40%)   |
|            1 |             2048 |            2048 |            6965.1  |          106.812  | 6.34 GB (8.01%)   |
|            1 |             4096 |            4096 |            5054.08 |          109.529  | 11.43 GB (14.44%) |
|            1 |             8192 |            8192 |            3230.48 |           73.2361 | 29.15 GB (36.82%) |
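
To make the batch size 1 tables easier to compare, the small script below computes decode speedups relative to GEMM. The numbers are copied from the three tables above for a few lengths; the helper name `speedup_over_gemm` is made up for the example.

```python
# Decode tokens/s at batch size 1, copied from the tables above
# (keys are the prefill/decode length).
decode_tps = {
    "GEMM": {32: 93.0104, 2048: 76.1673, 8192: 57.5579},
    "Marlin": {32: 102.415, 2048: 103.412, 8192: 72.1539},
    "ExllamaV2": {32: 126.095, 2048: 106.812, 8192: 73.2361},
}


def speedup_over_gemm(version: str, length: int) -> float:
    """Ratio of a version's decode throughput to GEMM's at the same length."""
    return decode_tps[version][length] / decode_tps["GEMM"][length]


for version in ("Marlin", "ExllamaV2"):
    for length in (32, 2048, 8192):
        ratio = speedup_over_gemm(version, length)
        print(f"{version:>9} @ {length:>4}: {ratio:.2f}x GEMM")
```

Read this way, Marlin's decode advantage over GEMM is around 10% at length 32 and grows at longer lengths, while ExllamaV2 stays ahead of both at batch size 1.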

Batch Size = 8

GEMM

GPU: NVIDIA A100-SXM4-80GB
Model: IlyasMoutawwakil/vicuna-7b-v1.5-awq-gemm
Version: GEMM
|   Batch Size |   Prefill Length |   Decode Length | Prefill tokens/s   | Decode tokens/s    | Memory (VRAM)     |
|-------------:|-----------------:|----------------:|:-------------------|:-------------------|:------------------|
|            8 |               32 |              32 | 1346.3070769585704 | 708.0039668305446  | 4.66 GB (5.88%)   |
|            8 |               64 |              64 | 2919.1490129870836 | 711.0797660422141  | 4.79 GB (6.05%)   |
|            8 |              128 |             128 | 5629.3036596650445 | 707.4964050013705  | 5.04 GB (6.37%)   |
|            8 |              256 |             256 | 7791.613649124367  | 708.4374630521071  | 5.56 GB (7.02%)   |
|            8 |              512 |             512 | 8755.697456241565  | 716.3168883290994  | 6.63 GB (8.38%)   |
|            8 |             1024 |            1024 | 8376.709636556277  | 659.15788232983    | 10.84 GB (13.69%) |
|            8 |             2048 |            2048 | 6869.208920003594  | 521.6471612461911  | 23.00 GB (29.06%) |
|            8 |             4096 |            4096 | 4909.631004412986  | 335.84994344853817 | 62.33 GB (78.75%) |
|            8 |             8192 |            8192 | OOM                | OOM                | 74.05 GB (93.56%) |

Marlin

GPU: NVIDIA A100-SXM4-80GB
Model: IlyasMoutawwakil/vicuna-7b-v1.5-awq-marlin
Version: Marlin
|   Batch Size |   Prefill Length |   Decode Length | Prefill tokens/s   | Decode tokens/s   | Memory (VRAM)     |
|-------------:|-----------------:|----------------:|:-------------------|:------------------|:------------------|
|            8 |               32 |              32 | 2187.0295157029404 | 819.9005986560782 | 4.63 GB (5.85%)   |
|            8 |               64 |              64 | 7313.37104870623   | 817.403946406821  | 4.76 GB (6.01%)   |
|            8 |              128 |             128 | 7714.85411899108   | 802.8912710566616 | 5.01 GB (6.33%)   |
|            8 |              256 |             256 | 8115.586666906011  | 811.3558371215785 | 5.53 GB (6.98%)   |
|            8 |              512 |             512 | 7940.376095680874  | 810.7481093096866 | 6.61 GB (8.35%)   |
|            8 |             1024 |            1024 | 7536.497977676355  | 812.9086900695303 | 10.81 GB (13.66%) |
|            8 |             2048 |            2048 | 6143.37585686755   | 697.8004408767624 | 22.97 GB (29.03%) |
|            8 |             4096 |            4096 | 4438.606568948759  | 401.2056340722673 | 62.31 GB (78.72%) |
|            8 |             8192 |            8192 | OOM                | OOM               | 74.03 GB (93.53%) |

ExllamaV2

GPU: NVIDIA A100-SXM4-80GB
Model: IlyasMoutawwakil/vicuna-7b-v1.5-awq-gemm
Version: ExllamaV2
|   Batch Size |   Prefill Length |   Decode Length | Prefill tokens/s   | Decode tokens/s   | Memory (VRAM)     |
|-------------:|-----------------:|----------------:|:-------------------|:------------------|:------------------|
|            8 |               32 |              32 | 2170.5566653526994 | 855.5439061703213 | 4.66 GB (5.88%)   |
|            8 |               64 |              64 | 8809.358083134719  | 855.2168217153052 | 4.79 GB (6.05%)   |
|            8 |              128 |             128 | 9950.853059881654  | 843.669717389118  | 5.04 GB (6.37%)   |
|            8 |              256 |             256 | 10298.249394568622 | 805.7640419758422 | 5.56 GB (7.02%)   |
|            8 |              512 |             512 | 10324.639166137811 | 737.9628318194814 | 6.91 GB (8.73%)   |
|            8 |             1024 |            1024 | 9187.918802352617  | 640.7063451146627 | 11.30 GB (14.28%) |
|            8 |             2048 |            2048 | 7291.295603279198  | 508.1310214280306 | 23.84 GB (30.12%) |
|            8 |             4096 |            4096 | 5026.979565560173  | 332.1365984993962 | 63.93 GB (80.77%) |
|            8 |             8192 |            8192 | OOM                | OOM               | 77.15 GB (97.47%) |
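
Another way to read these tables is to ask how well decode throughput scales from batch size 1 to batch size 8. The sketch below divides the batch 8 figure by the batch 1 figure at the same length. Values are copied from the tables above, rounded to two decimals for batch 8, and the helper name `batch_scaling` is made up for the example.

```python
# Decode tokens/s copied from the tables above: {version: {length: (bs1, bs8)}}.
decode_tps = {
    "GEMM": {32: (93.0104, 708.00), 1024: (92.893, 659.16)},
    "Marlin": {32: (102.415, 819.90), 1024: (104.344, 812.91)},
    "ExllamaV2": {32: (126.095, 855.54), 1024: (124.575, 640.71)},
}


def batch_scaling(version: str, length: int) -> float:
    """Throughput at batch size 8 divided by throughput at batch size 1."""
    bs1, bs8 = decode_tps[version][length]
    return bs8 / bs1


for version in decode_tps:
    for length in (32, 1024):
        print(f"{version:>9} @ {length:>4}: {batch_scaling(version, length):.2f}x (ideal 8x)")
```

With these numbers Marlin stays close to the ideal 8x at both lengths, whereas ExllamaV2 scales less well, which is why Marlin overtakes it on decode at batch size 8 for longer sequences.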

vince62s · 2024-01-29T10:34:37Z

Nice work. Could the GPU architecture affect these results?

IlyasMoutawwakil · 2024-01-29T10:44:24Z

@vince62s I'd say "definitely" based on the fact that the kernel has many PTX assembly blocks and a hard constraint on architecture, from the kernel's repo https://github.com/IST-DASLab/marlin

> NVIDIA GPU with compute capability >= 8.0 (Ampere or Ada, Marlin is not yet optimized for Hopper)
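
A guard for that requirement could look roughly like the sketch below. It only encodes the rule quoted from the Marlin README; the function names and the wording of the messages are hypothetical and not part of this PR.

```python
from typing import Tuple

# Minimum compute capability taken from the README line quoted above.
MIN_CAPABILITY: Tuple[int, int] = (8, 0)


def meets_marlin_requirement(capability: Tuple[int, int]) -> bool:
    """True if (major, minor) is at least compute capability 8.0."""
    return tuple(capability) >= MIN_CAPABILITY


def describe_support(capability: Tuple[int, int]) -> str:
    major, _minor = capability
    if not meets_marlin_requirement(capability):
        return "unsupported (needs compute capability >= 8.0)"
    if major >= 9:
        # The README says Marlin is not yet optimized for Hopper (9.x).
        return "supported, but not yet optimized for this architecture"
    return "supported"


# Well-known capabilities: V100 = 7.0, A100 = 8.0, RTX 4090 = 8.9, H100 = 9.0.
for name, cap in [("V100", (7, 0)), ("A100", (8, 0)), ("RTX 4090", (8, 9)), ("H100", (9, 0))]:
    print(f"{name}: {describe_support(cap)}")
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed straight to these helpers.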

vince62s · 2024-01-29T10:50:34Z

> @vince62s I'd say "definitely" based on the fact that the kernel has many PTX assembly blocks and a hard constraint on architecture, from the kernel's repo https://github.com/IST-DASLab/marlin
>
> NVIDIA GPU with compute capability >= 8.0 (Ampere or Ada, Marlin is not yet optimized for Hopper)

Looking forward to seeing numbers at even higher batch sizes (32/64), which might be reasonable for sequence lengths of 1024/2048, once Marlin is optimized for Hopper.

casper-hansen · 2024-02-03T13:37:19Z

Looks good to me! Fixed a small bug with the workspace after the latest update to Marlin. Nice to have a refactor of the Quantizer as well.

IlyasMoutawwakil · 2024-02-04T10:00:00Z

@casper-hansen awesome! apologies for not cleaning up the PR myself 😅 thanks for taking care of it 🙏
btw the workspaces can be created the same way as in exllamav2, i.e. one workspace per device instead of per linear layer.
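
The per-device idea can be sketched as a small cache keyed by device. This is an illustration only, not the code in this PR or in exllamav2: `WorkspaceCache` and `allocate_workspace` are hypothetical names, and a `bytearray` stands in for the real GPU buffer. It also assumes the shared buffer is sized for the largest layer and that layers on one device run one at a time.

```python
from typing import Callable, Dict, Generic, Hashable, List, TypeVar

T = TypeVar("T")


class WorkspaceCache(Generic[T]):
    """Hands out one shared scratch buffer per device (hypothetical helper)."""

    def __init__(self, allocate: Callable[[Hashable], T]) -> None:
        self._allocate = allocate
        self._by_device: Dict[Hashable, T] = {}

    def get(self, device: Hashable) -> T:
        # Allocate lazily, the first time a layer on this device asks.
        if device not in self._by_device:
            self._by_device[device] = self._allocate(device)
        return self._by_device[device]


allocations: List[str] = []


def allocate_workspace(device: str) -> bytearray:
    # Stand-in for a real allocation such as a zero-filled tensor on `device`.
    allocations.append(device)
    return bytearray(1024)


cache = WorkspaceCache(allocate_workspace)

# 64 hypothetical linear layers spread over two devices.
layer_devices = ["cuda:0"] * 32 + ["cuda:1"] * 32
workspaces = [cache.get(device) for device in layer_devices]

print(len(allocations))  # 2 allocations instead of 64
```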

jeromeku · 2024-04-02T14:20:50Z

@IlyasMoutawwakil

Great work adapting Marlin to AWQ.

I'm currently looking to do the same, that is, adapt optimized inference kernels to different quantization formats.

Roughly, what are the major changes that need to be made to adapt a quantization format in order to use a kernel such as Marlin? Specifically, how do the quantized weights, scales, and zeros need to be preprocessed in order to conform to the required layout for Marlin, AWQ specific GEMV / GEMM, exLlama, etc.?

E.g., starting from 4-bit quantized weights packed [0, 1, 2, 3, 4, 5, 6, 7] as an int, what permutation, shuffling, or packing needs to be done in order to use each of these kernels?
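
To make the question concrete, the sketch below packs eight 4-bit values into one 32-bit word, first in sequential order and then under a permutation. It does not reproduce the layout of Marlin, the AWQ GEMM/GEMV kernels, or exLlama. The interleaved order is only an example of the kind of reordering a kernel may expect, and real kernels may also reorder whole tiles of weights and scales, which this sketch does not attempt. The function names are made up for the example.

```python
from typing import List, Sequence


def pack_int4(values: Sequence[int], order: Sequence[int] = range(8)) -> int:
    """Pack eight 4-bit values into a 32-bit word; nibble i holds values[order[i]]."""
    assert len(values) == 8 and sorted(order) == list(range(8))
    word = 0
    for slot, src in enumerate(order):
        value = values[src]
        assert 0 <= value < 16, "each value must fit in 4 bits"
        word |= value << (4 * slot)
    return word


def unpack_int4(word: int, order: Sequence[int] = range(8)) -> List[int]:
    """Inverse of pack_int4 for the same order."""
    values = [0] * 8
    for slot, src in enumerate(order):
        values[src] = (word >> (4 * slot)) & 0xF
    return values


weights = [0, 1, 2, 3, 4, 5, 6, 7]

# Sequential packing: value i lands in nibble i.
sequential = pack_int4(weights)
print(hex(sequential))  # 0x76543210

# Example interleaved order (an assumption for illustration only).
interleaved_order = [0, 2, 4, 6, 1, 3, 5, 7]
interleaved = pack_int4(weights, interleaved_order)
print(hex(interleaved))  # 0x75316420

# Converting between layouts is unpack with one order, repack with another.
assert unpack_int4(interleaved, interleaved_order) == weights
```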

IlyasMoutawwakil added 2 commits January 23, 2024 18:11

added marlin layer

f5de276

awq+marlin symmetric quantization and inference

c83fef2

IlyasMoutawwakil and others added 4 commits January 25, 2024 12:54

post init marlin

802b5c2

Merge branch 'main' into marlin-support

b5d0abb

layers fusion + merge

689e652

assertions

a342547

casper-hansen added 2 commits February 3, 2024 14:11

Merge branch 'main' into marlin-support

df0999d

Implement max_par

edcef4f

casper-hansen merged commit 34085ed into main Feb 3, 2024

casper-hansen deleted the marlin-support branch February 12, 2024 14:21
