
Enable AVX NE CONVERT for FP16 to FP32 cast #21183

Merged · 1 commit merged into microsoft:main on Sep 10, 2024

Conversation

eralmual
Contributor

Description

Implements a new cast assembly kernel that uses AVX_NE_CONVERT instructions to accelerate casting from FP16 to FP32. Adds CPUID checks to determine whether the ISA is supported.
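For illustration, here is a minimal sketch of such a runtime check, assuming AVX-NE-CONVERT is enumerated in CPUID leaf 7, sub-leaf 1, EDX bit 5 (this is just the general shape, not the actual MLAS code):

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>
#endif

// Sketch of a runtime check for AVX-NE-CONVERT (illustrative only). A
// production check should also confirm that CPUID leaf 7 reports at least
// one sub-leaf before querying sub-leaf 1.
bool HasAvxNeConvert() {
#if defined(_MSC_VER)
    int regs[4];
    __cpuidex(regs, 7, 1);            // leaf 7, sub-leaf 1
    uint32_t edx = static_cast<uint32_t>(regs[3]);
#else
    unsigned int eax, ebx, ecx, edx;
    __cpuid_count(7, 1, eax, ebx, ecx, edx);
#endif
    return (edx & (1u << 5)) != 0;    // EDX bit 5 = AVX-NE-CONVERT
}
```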

Motivation and Context

Currently, FP16 models executed on systems that lack complete FP16 operator support run every node in single precision, so the original FP16 weights have to be cast to FP32 before the model can run. This change accelerates that cast with upconvert instructions and thereby improves performance.
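For reference, this is what the cast computes per element; a portable scalar version (a generic sketch, not this PR's kernel or its fallback) looks like this:

```cpp
#include <cstdint>
#include <cstring>

// Generic scalar FP16 -> FP32 upconvert for reference (illustrative only).
// Handles Inf/NaN, normals, signed zeros, and subnormals.
float Fp16ToFp32(uint16_t h) {
    uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13);           // Inf / NaN
    } else if (exp != 0) {
        bits = sign | ((exp + 112u) << 23) | (mant << 13);  // rebias 15 -> 127
    } else if (mant == 0) {
        bits = sign;                                        // signed zero
    } else {
        // Subnormal: normalize the mantissa, adjusting the exponent.
        uint32_t shift = 0;
        while ((mant & 0x400u) == 0) { mant <<= 1; ++shift; }
        bits = sign | ((113u - shift) << 23) | ((mant & 0x3FFu) << 13);
    }
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```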

@eralmual eralmual requested a review from a team as a code owner June 26, 2024 18:46
@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@tianleiwu
Contributor

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

@tianleiwu
Contributor

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 3 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).


@yufenglee
Member

I think the QNN CI pipeline build failure is because it uses MSVC 14.36, which doesn't support the vcvtneeph2ps instruction yet. The other Windows CI pipelines use 14.40.

@snnn, any idea why the QNN CI pipeline doesn't use the same MSVC version?

@eralmual
Contributor Author

Hi @yufenglee @tianleiwu! Do you have any other feedback on the PR?

@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@tianleiwu
Contributor

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

@tianleiwu
Contributor

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 10 pipeline(s).


Azure Pipelines successfully started running 3 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@tianleiwu
Contributor

tianleiwu commented Jul 12, 2024

@eralmual, some build pipelines failed; the build needs to be fixed.
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1431587&view=logs&j=9d16baec-2ed2-55b0-74fb-c50315f92eff&t=39997a68-8fc6-587d-198c-e5d495a0b19a&l=1126
GCC 11.4 build error:
/onnxruntime_src/onnxruntime/core/mlas/lib/x86_64/cvtfp16a.S:44: Error: no such instruction: `vcvtneeph2ps ymm0,ymmword PTR [rdi]'

Could you add conditional compilation to make sure cvtfp16a.S is not compiled when the compiler does not support vcvtneeph2ps?

@eralmual
Contributor Author

@tianleiwu @yufenglee, since the new and the old .asm implementations are now in the same file (per the request to fuse both implementations into one file), gating that file behind a compiler check would lock out both versions. Do you want me to separate the two functions again so we can apply the check without affecting the old version?

@tianleiwu
Contributor

tianleiwu commented Jul 20, 2024

@eralmual, the solution is either to separate the new code into its own file and only compile that file when the compiler supports it, or to add an #if macro check in the .asm source file to conditionally compile the relevant code block. The macro can check the compiler name and version (like #ifdef _MSC_VER), or test whether a custom build flag (like USE_AVX_NE_CONVERT) is defined.

From the pipeline builds, it seems the instruction is only supported by the compiler on Windows. Did you try building on Linux?
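A sketch of the second option (the flag name is the hypothetical USE_AVX_NE_CONVERT from above; .S files are fed through the C preprocessor, so the same #if works there):

```cpp
// Hypothetical gate: the flag is defined by the build system only after it
// has verified that the toolchain accepts AVX-NE-CONVERT instructions.
#if defined(USE_AVX_NE_CONVERT)
    // New path: AVX-NE-CONVERT kernel (e.g. vcvtneeph2ps) goes here.
#else
    // Old path: the pre-existing conversion kernel stays untouched.
#endif
```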

@eralmual eralmual force-pushed the fp162fp32 branch 2 times, most recently from b1325e0 to 9a30cb2 on July 26, 2024 15:55
@eralmual
Contributor Author

Hi @tianleiwu, could you run the pipelines again, please?

@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@tianleiwu
Contributor

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

@tianleiwu
Contributor

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 3 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).

@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline

@tianleiwu
Contributor

/azp run Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline

@tianleiwu
Contributor

/azp run Big Models,Linux Android Emulator QNN CI Pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 5 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).


@eralmual
Contributor Author

eralmual commented Sep 4, 2024

I don't have access to a macOS system to debug the error, so I excluded Apple platforms from using the kernels. I will share performance data so the PR can be merged.

@eralmual eralmual force-pushed the fp162fp32 branch 3 times, most recently from a6d7b7b to 289d92f on September 4, 2024 20:52
* Developed x86 and amd64 assembly kernels using AVX NE CONVERT.
* Developed an x86 assembly kernel using SSE instructions.
* Added a fallback implementation for the FP16 to FP32 cast.
* Added a runtime check to determine whether the CPU supports the ISA required by the kernel.
* Added kernel dispatching logic in platform.cpp (see the sketch below).
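A minimal sketch of that dispatch pattern (names are illustrative, not the actual MLAS symbols):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical kernel signatures; the AVX variant would be the assembly
// kernel, the fallback the portable C++ one.
using CastF16ToF32Fn = void (*)(const uint16_t* src, float* dst, size_t n);

void CastF16ToF32Fallback(const uint16_t* src, float* dst, size_t n);
void CastF16ToF32AvxNeConvert(const uint16_t* src, float* dst, size_t n);
bool HasAvxNeConvert();  // runtime CPUID check, sketched earlier

// Selected once at startup and stored in the platform's dispatch table.
CastF16ToF32Fn SelectCastF16ToF32Kernel() {
    return HasAvxNeConvert() ? CastF16ToF32AvxNeConvert
                             : CastF16ToF32Fallback;
}
```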
@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@tianleiwu
Contributor

/azp run Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline

@tianleiwu
Contributor

/azp run Big Models,Linux Android Emulator QNN CI Pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 9 pipeline(s).


Azure Pipelines successfully started running 5 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@tianleiwu
Contributor

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline


Azure Pipelines successfully started running 3 pipeline(s).

@yufenglee
Member

:shipit:

@yufenglee yufenglee merged commit 7489bfe into microsoft:main Sep 10, 2024
84 checks passed
@yufenglee
Member

Thanks Erick for your contribution!!!

@eralmual eralmual deleted the fp162fp32 branch September 10, 2024 15:49
@eralmual eralmual restored the fp162fp32 branch September 10, 2024 19:41
yufenglee pushed a commit that referenced this pull request Sep 16, 2024
### Description
Added checks to convert partial vectors in the early stages of the FP16 to FP32 cast using the AVX NE CONVERT ISA.



### Motivation and Context
Avoids storing data outside of the output buffer; these checks were missing from the [original PR](#21183). The fix prevents memory corruption when the output buffer has a size in [n*16 + 1, n*16 + 7] with 0 < n.
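The general shape of such a fix (hypothetical helper names, not the PR's exact code): the wide kernel only runs on full 16-element groups, and a scalar loop finishes the remainder so no store can land past dst + n.

```cpp
#include <cstddef>
#include <cstdint>

void Convert16F16ToF32(const uint16_t* src, float* dst);  // vector kernel (hypothetical)
float Fp16ToFp32(uint16_t h);                             // scalar conversion (hypothetical)

// Convert n FP16 values; the tail loop guarantees writes stay within dst[0..n).
void CastF16ToF32(const uint16_t* src, float* dst, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        Convert16F16ToF32(src + i, dst + i);   // full 16-element groups only
    }
    for (; i < n; ++i) {
        dst[i] = Fp16ToFp32(src[i]);           // remainder, one at a time
    }
}
```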
@eralmual eralmual deleted the fp162fp32 branch September 17, 2024 18:00