Mlas int4 int8 with avx2/512 #20687

liqunfu · 2024-05-15T17:01:08Z

Description

model: phi-3-mini-4k-instruct
avx2 symmetric

blklen	updated prompt tps	baseline prompt tps	prompt tps change%	updated token gen tps	baseline token gen tps	token gen change%
16	49.5	70.0	-29.2%	9.6	10.8	-34.2%
32	76.8	52.4	9.7%	15.2	14.6	4.1%
64	78.2	71.4	9.5%	16.6	16.3	1.8%
128	72.9	70.6	3.2%	17.1	16.8	1.7%
256	83.7	63.6	31.6%	18.1	17.4	4%

avx2 asymmetric

blklen	updated prompt tps	baseline prompt tps	prompt tps change%	updated token gen tps	baseline token gen tps	token gen change%
16	50.7	61.5	-17.5%	9.6	9.2	4.3%
32	77.4	52.4	47.7%	14.6	13.9	5.0%
64	78.7	63.0	24.9%	16.2	15.9	1.8%
128	80.0	61.9	29.2%	17.2	16.9	1.7%
256	81.5	63.3	28.7%	17.9	17.3	3.4%

avx2vnni symmetric

blklen	updated prompt tps	baseline prompt tps	prompt tps change%	updated token gen tps	baseline token gen tps	token gen change%
16	82.9	117.0	-29.0%	15.9	19.3	-17.6%
32	133.0	100.4	32.4%	26.1	24.5	6.5%
64	166.9	118.8	40.4%	28.3	27.1	4.4%
128	165.9	119.6	38.7%	29.3	28.5	2.8%
256	165.2	119.6	38.1%	30.2	29.0	4.1%

avx2vnni asymmetric

blklen	updated prompt tps	baseline prompt tps	prompt tps change%	updated token gen tps	baseline token gen tps	token gen change%
16	80.2	118.9	-32.5%	15.1	16.7	-9.5%
32	130.7	99.7	31.0%	25.0	23.8	5.0%
64	168.7	124.9	35.0%	27.3	26.8	1.8%
128	169.6	123.8	36.9%	29.2	27.9	4.6%
256	175.0	125.7	39.0%	30.0	29.7	1.0%

avx512 symmetric

blklen	updated prompt tps	baseline prompt tps	prompt tps change%	updated token gen tps	baseline token gen tps	token gen change%
16	135.2	156.5	-13.6%	25.5	23.8	7.1%
32	150.0	159.5	-5.9%	34.9	29.6	17.9%
64	167.5	157.5	6.3%	39.7	34.4	15.4%
128	177.8	158.0	12.5%	40.3	35.4	13.8%
256	182.6	157.3	16.0%	41.7	37.7	10.6%

avx512 asymmetric

blklen	updated prompt tps	baseline prompt tps	prompt tps change%	updated token gen tps	baseline token gen tps	token gen change%
16	136.1	151.4	-10.1%	26.1	19.9	31.1%
32	150.0	157.8	-4.9%	34.3	29.3	17.0%
64	165.7	156.6	5.8%	38.7	30.7	26.0%
128	180.4	156.6	15.1%	40.2	34.7	15.8%
256	181.3	158.0	14.7%	41.6	36.6	13.6%

avx512vnni symmetric

blklen	updated prompt tps	baseline prompt tps	prompt tps change%	updated token gen tps	baseline token gen tps	token gen change%
16	143.4	155.4	-7.7%	25.6	23.3	9.8%
32	159.2	157.0	1.4%	34.1	29.8	14.4%
64	182.0	159.5	14.1%	38.4	34.8	10.3%
128	221.2	160.8	37.5%	41.0	36.4	12.6%
256	250.5	162.4	54.2%	41.6	37.7	10.3%

avx512vnni asymmetric

blklen	updated prompt tps	baseline prompt tps	prompt tps change%	updated token gen tps	baseline token gen tps	token gen change%
16	142.5	152.3	-6.4%	26.3	19.7	33.5%
32	158.2	155.0	2.0%	34.3	29.2	17.4%
64	184.1	156.6	17.5%	38.3	30.9	23.9%
128	215.8	156.1	17.5%	41.3	35.0	17.9%
256	249.2	155.9	59.8%	41.1	36.3	13.2%

4-bit GEMM implementation with AVX using tiles.

The tile size is 2blk by 4. When the remaining size is smaller than a tile, it reduces to 1blk by 4, then 2blk by 1, and lastly 1blk by 1.
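The tile fallback can be sketched as follows. This is only an illustration, not the MLAS code: `TileCall` and `PlanTiles` are hypothetical names, and the sketch assumes the first tile dimension (2) runs over output rows and the second (4) over output columns, which the description does not state explicitly.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record of one kernel call: which output region it covers.
struct TileCall {
    std::size_t row;   // first output row covered
    std::size_t col;   // first output column covered
    std::size_t rows;  // 2 or 1
    std::size_t cols;  // 4 or 1
};

// Split an M x N output into 2x4 tiles, falling back to 2x1, 1x4 and 1x1
// along the edges where a full tile does not fit.
// Assumption: "2" is the row count and "4" the column count of a tile.
std::vector<TileCall> PlanTiles(std::size_t M, std::size_t N) {
    constexpr std::size_t kTileRows = 2;
    constexpr std::size_t kTileCols = 4;
    const std::size_t full_rows = M - M % kTileRows;  // rows covered by 2-row tiles
    const std::size_t full_cols = N - N % kTileCols;  // columns covered by 4-column tiles

    std::vector<TileCall> calls;
    std::size_t r = 0;
    while (r < M) {
        const std::size_t h = (r < full_rows) ? kTileRows : 1;
        std::size_t c = 0;
        while (c < N) {
            const std::size_t w = (c < full_cols) ? kTileCols : 1;
            calls.push_back({r, c, h, w});
            c += w;
        }
        r += h;
    }
    return calls;
}
```

For a 3 x 5 output this yields four calls: one 2x4 tile, one 2x1 tile for the leftover column, one 1x4 tile for the leftover row, and one 1x1 tile for the corner.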
Within the internal kernel, weights and activations are loaded based on the SIMD register width and the block length (blklen):
avx2: 256-bit register, 64 weights and activations are loaded.
blklen16: 4 blks are computed by the internal kernel
blklen32: 2 blks are computed by the internal kernel
blklen64: 1 blk is computed by the internal kernel
blklen128: 1 blk is computed by running the internal kernel 2 times
blklen256: 1 blk is computed by running the internal kernel 4 times

avx512: 512-bit register, 128 weights and activations are loaded.
blklen16: 8 blks are computed by the internal kernel
blklen32: 4 blks are computed by the internal kernel
blklen64: 2 blks are computed by the internal kernel
blklen128: 1 blk is computed by the internal kernel
blklen256: 1 blk is computed by running the internal kernel 2 times
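The relationship between register width, block length, and the number of blocks per kernel step is plain arithmetic. A minimal sketch, assuming 4 bits per weight and power-of-two block lengths from 16 to 256 (`KernelStep` and `StepFor` are hypothetical names, not MLAS identifiers; requires C++14 or later):

```cpp
// Hypothetical summary of how one block length maps onto one register load.
struct KernelStep {
    int blocks_per_load;  // whole blocks covered by one register load of weights
    int loads_per_block;  // register loads needed to cover one block
};

// register_bits: 256 for avx2, 512 for avx512.
// blk_len: quantization block length (assumed to be 16, 32, 64, 128 or 256).
constexpr KernelStep StepFor(int register_bits, int blk_len) {
    const int weights_per_load = register_bits / 4;  // 4 bits per weight
    return (weights_per_load >= blk_len)
               ? KernelStep{weights_per_load / blk_len, 1}
               : KernelStep{1, blk_len / weights_per_load};
}

// Compile-time checks against two of the cases listed in the description.
static_assert(StepFor(256, 32).blocks_per_load == 2, "avx2, blklen32");
static_assert(StepFor(512, 16).blocks_per_load == 8, "avx512, blklen16");
```

With these assumptions, a 256-bit register and block length 256 give one block spread over four loads, and a 512-bit register with block length 256 gives one block spread over two loads.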

blksum is precomputed during prepacking.
The computation is reformulated as:
Sum1(scale_a * scale_b * Sum_blk(a_i * b_i)) + Sum2(blksum_a * blksum_b)
Sum_blk is over one blk
Sum1 is over all blks for one output
Sum2 is over all blks for one output
Sum2 is computed with sgemm in the current implementation. Further improvement is possible.
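The reformulation above can be checked with a scalar sketch. The description does not define blksum_a and blksum_b, so this example assumes one plausible reading: activations are dequantized as `scale_a * q_a`, weights as `scale_b * (q_b - zero_point)`, `blksum_a = scale_a * Sum(q_a)`, and `blksum_b = -scale_b * zero_point`. All type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical quantized blocks (assumed layout, for illustration only).
struct QuantBlockA { float scale; std::vector<std::int8_t> q; };   // a = scale * q
struct QuantBlockB { float scale; std::uint8_t zero_point;
                     std::vector<std::uint8_t> q; };               // b = scale * (q - zero_point)

// Straightforward dot product on dequantized values.
float DotReference(const std::vector<QuantBlockA>& A, const std::vector<QuantBlockB>& B) {
    assert(A.size() == B.size());
    float acc = 0.0f;
    for (std::size_t k = 0; k < A.size(); ++k) {
        for (std::size_t i = 0; i < A[k].q.size(); ++i) {
            const float a = A[k].scale * static_cast<float>(A[k].q[i]);
            const float b = B[k].scale *
                static_cast<float>(static_cast<int>(B[k].q[i]) - static_cast<int>(B[k].zero_point));
            acc += a * b;
        }
    }
    return acc;
}

// Reformulated: integer dot product per block plus a block-sum correction.
float DotReformed(const std::vector<QuantBlockA>& A, const std::vector<QuantBlockB>& B) {
    assert(A.size() == B.size());
    float sum1 = 0.0f;  // Sum1(scale_a * scale_b * Sum_blk(a_i * b_i))
    float sum2 = 0.0f;  // Sum2(blksum_a * blksum_b)
    for (std::size_t k = 0; k < A.size(); ++k) {
        std::int32_t int_dot = 0;
        std::int32_t a_sum = 0;
        for (std::size_t i = 0; i < A[k].q.size(); ++i) {
            int_dot += static_cast<std::int32_t>(A[k].q[i]) * static_cast<std::int32_t>(B[k].q[i]);
            a_sum += A[k].q[i];
        }
        const float blksum_a = A[k].scale * static_cast<float>(a_sum);             // assumed definition
        const float blksum_b = -B[k].scale * static_cast<float>(B[k].zero_point);  // assumed definition
        sum1 += A[k].scale * B[k].scale * static_cast<float>(int_dot);
        sum2 += blksum_a * blksum_b;
    }
    return sum1 + sum2;
}
```

Under these assumptions the zero-point subtraction leaves the inner loop, so the per-block kernel only needs the integer products, and the correction term becomes a small float matrix product over block sums, which matches the description's note that it is computed with sgemm.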

…6/K:4096/Threads:1/Symmetric:1/ComputeType:4/real_time_mean 784668090 ns; SQNBITGEMM<4>/BlkLen:64/M:2048/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4/real_time_mean 754939430 ns Signed-off-by: Liqun Fu <liqfu@microsoft.com>

github-advanced-security

PREfast found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

liqunfu added 8 commits May 2, 2024 20:00

quick adapt llama.cpp to experiment performance. Only works with blkl…

293f121

…en32, symmetric1 hasBias0 Int8 Signed-off-by: Liqun Fu <liqfu@microsoft.com>

fire

04c2e56

Signed-off-by: Liqun Fu <liqfu@microsoft.com>

tile 2x4 SQNBITGEMM<4>/BlkLen:32/M:2048/N:4096/K:4096/Threads:1/Symme…

cdfda6f

…tric:1/ComputeType:4/real_time_mean 1542487160 ns 1539062500 ns Signed-off-by: Liqun Fu <liqfu@microsoft.com>

use one_16_epi16 and accumulate_2blk_dot: SQNBITGEMM<4>/BlkLen:32/M:2…

92dad97

…048/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4/real_time_mean 1434872720 ns Signed-off-by: Liqun Fu <liqfu@microsoft.com>

apply to M1, BQuant layout pack block (subblk) larger than blklen: SQ…

5418e9c

…NBITGEMM<4>/BlkLen:32/M:2048/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4/real_time_mean 1265060620 ns 1265625000 ns Signed-off-by: Liqun Fu <liqfu@microsoft.com>

use new AQuant layout (not work if total M is not RangeCountM): SQNBI…

0401f72

…TGEMM<4>/BlkLen:32/M:2048/N:4096/K:4096/Threads:1/Symmetric:1/ComputeType:4/real_time_mean 1214042220 ns Signed-off-by: Liqun Fu <liqfu@microsoft.com>

blklen16

f2c33af

Signed-off-by: Liqun Fu <liqfu@microsoft.com>

liqunfu requested a review from a team as a code owner May 15, 2024 17:01

liqunfu requested review from edgchen1, yufenglee and chenfucn May 15, 2024 17:01

liqunfu marked this pull request as draft May 15, 2024 17:03

edgchen1 reviewed May 20, 2024

cmake/onnxruntime_mlas.cmake (outdated, resolved)

impl avx512: SQNBITGEMM<4>/BlkLen:32/M:2048/N:4096/K:4096/Threads:1/S…

0ca24f4

…ymmetric:1/ComputeType:4/real_time_mean 664029830 ns Signed-off-by: liqunfu <liqun.fu@microsoft.com>

liqunfu changed the title ~~Mlas int4 int8 with avx2~~ Mlas int4 int8 with avx2/512 May 26, 2024

liqunfu and others added 8 commits June 1, 2024 02:33

matmul_nbit & fix alignment for sgemm

7f89d5f

Signed-off-by: Liqun Fu <liqfu@microsoft.com>

merge main

ed0e666

Signed-off-by: Liqun Fu <liqfu@microsoft.com>

fix mlas benchmark not using multi threads

35d02a6

Signed-off-by: Liqun Fu <liqfu@microsoft.com>

profiling

b9493ad

Signed-off-by: Liqun Fu <liqfu@microsoft.com>

Merge branch 'liqun/mlas-q4-tile-avx' of https://github.com/microsoft…

c443eb5

…/onnxruntime into liqun/mlas-q4-tile-avx

sgemm after sq4bit for avx2

ac66951

Signed-off-by: Liqun Fu <liqfu@microsoft.com>

avx512

42a1305

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

layout to follow compute, M1 separate with M > 1

740031a

Signed-off-by: Liqun Fu <liqfu@microsoft.com>

github-advanced-security bot found potential problems Jun 28, 2024

liqunfu added 3 commits June 28, 2024 22:48

make avx512 run

1a6031e

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

Merge branch 'main' into liqun/mlas-q4-tile-avx

283fd2d

avx512 blklen64 pass

d035939

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

github-advanced-security bot found potential problems Jul 4, 2024

onnxruntime/core/mlas/lib/sqnbitgemm.cpp (fixed)

onnxruntime/core/mlas/lib/sqnbitgemm.cpp (fixed)

pass avx512 blklen32

f329d2d

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

liqunfu added 5 commits July 30, 2024 20:13

unused zp, etc.

f77cffd

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

unused zp, etc.

a6fd378

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

remove test code changes

c875e5c

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

remove test code changes

3b56710

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

lint

746562f

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

liqunfu marked this pull request as ready for review July 30, 2024 21:37

liqunfu added 2 commits July 30, 2024 22:52

lint

52fc7fa

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

code name

0933a6b

Signed-off-by: liqunfu <liqun.fu@microsoft.com>

edgchen1 reviewed Jul 31, 2024
