
Enable AVX NE CONVERT for FP16 to FP32 cast #21183

Merged · 1 commit merged into microsoft:main on Sep 10, 2024

Conversation

eralmual
Contributor

Description

Implements a new cast assembly kernel that uses AVX_NE_CONVERT instructions to accelerate casting from FP16 to FP32. Adds CPUID checks to determine whether the ISA is supported.
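For illustration, here is a minimal sketch of such a runtime check, assuming AVX-NE-CONVERT is enumerated in CPUID leaf 7, sub-leaf 1, EDX bit 5 (this is just the general shape, not the actual MLAS code):

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>
#endif

// Sketch of a runtime check for AVX-NE-CONVERT (illustrative only). A
// production check should also confirm that CPUID leaf 7 reports at least
// one sub-leaf before querying sub-leaf 1.
bool HasAvxNeConvert() {
#if defined(_MSC_VER)
    int regs[4];
    __cpuidex(regs, 7, 1);            // leaf 7, sub-leaf 1
    uint32_t edx = static_cast<uint32_t>(regs[3]);
#else
    unsigned int eax, ebx, ecx, edx;
    __cpuid_count(7, 1, eax, ebx, ecx, edx);
#endif
    return (edx & (1u << 5)) != 0;    // EDX bit 5 = AVX-NE-CONVERT
}
```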

Motivation and Context

Currently, FP16 models executed on systems that lack complete FP16 operator support run every node in single precision, so the original FP16 weights have to be cast to FP32 before the model can run. This change accelerates that cast with upconvert instructions and thereby improves performance.
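For reference, this is what the cast computes per element; a portable scalar version (a generic sketch, not this PR's kernel or its fallback) looks like this:

```cpp
#include <cstdint>
#include <cstring>

// Generic scalar FP16 -> FP32 upconvert for reference (illustrative only).
// Handles Inf/NaN, normals, signed zeros, and subnormals.
float Fp16ToFp32(uint16_t h) {
    uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13);           // Inf / NaN
    } else if (exp != 0) {
        bits = sign | ((exp + 112u) << 23) | (mant << 13);  // rebias 15 -> 127
    } else if (mant == 0) {
        bits = sign;                                        // signed zero
    } else {
        // Subnormal: normalize the mantissa, adjusting the exponent.
        uint32_t shift = 0;
        while ((mant & 0x400u) == 0) { mant <<= 1; ++shift; }
        bits = sign | ((113u - shift) << 23) | ((mant & 0x3FFu) << 13);
    }
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```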

@eralmual eralmual requested a review from a team as a code owner June 26, 2024 18:46
@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@tianleiwu
Contributor

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

@tianleiwu
Contributor

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 3 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).


@yufenglee
Member

I think the QNN CI pipeline build failure is because it uses MSVC 14.36, which doesn't support the vcvtneeph2ps instruction yet. The other Windows CI pipelines use 14.40.

@snnn, any idea why the QNN CI pipeline doesn't use the same MSVC version?

@eralmual
Contributor Author

Hi @yufenglee @tianleiwu! Do you have any other feedback on the PR?

@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@tianleiwu
Contributor

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

@tianleiwu
Contributor

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 10 pipeline(s).


Azure Pipelines successfully started running 3 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@tianleiwu
Contributor

tianleiwu commented Jul 12, 2024

@eralmual, some build pipelines failed; the build needs to be fixed.
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1431587&view=logs&j=9d16baec-2ed2-55b0-74fb-c50315f92eff&t=39997a68-8fc6-587d-198c-e5d495a0b19a&l=1126
GCC 11.4 build error:
/onnxruntime_src/onnxruntime/core/mlas/lib/x86_64/cvtfp16a.S:44: Error: no such instruction: `vcvtneeph2ps ymm0,ymmword PTR [rdi]'

Could you add conditional compilation to make sure cvtfp16a.S is not compiled when the compiler does not support vcvtneeph2ps?

@eralmual
Contributor Author

@tianleiwu @yufenglee, since the new and the old .asm implementations are now in the same file (per the request to fuse both implementations into one file), gating that file behind a compiler check would lock out both versions. Do you want me to separate the two functions again so we can apply the check without affecting the old version?

@tianleiwu
Contributor

tianleiwu commented Jul 20, 2024

@eralmual, the solution is either to separate the new code into its own file and only compile that file when the compiler supports it, or to add an #if macro check in the .asm source file to conditionally compile the relevant code block. The macro can check the compiler name and version (like #ifdef _MSC_VER), or test whether a custom build flag (like USE_AVX_NE_CONVERT) is defined.

From the pipeline builds, it seems the instruction is only supported by the compiler on Windows. Did you try building on Linux?
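A sketch of the second option (the flag name is the hypothetical USE_AVX_NE_CONVERT from above; .S files are fed through the C preprocessor, so the same #if works there):

```cpp
// Hypothetical gate: the flag is defined by the build system only after it
// has verified that the toolchain accepts AVX-NE-CONVERT instructions.
#if defined(USE_AVX_NE_CONVERT)
    // New path: AVX-NE-CONVERT kernel (e.g. vcvtneeph2ps) goes here.
#else
    // Old path: the pre-existing conversion kernel stays untouched.
#endif
```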

@eralmual eralmual force-pushed the fp162fp32 branch 2 times, most recently from b1325e0 to 9a30cb2 on July 26, 2024 15:55
@eralmual
Contributor Author

Hi @tianleiwu, could you run the pipelines again, please?

@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@tianleiwu
Contributor

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

@tianleiwu
Contributor

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 3 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).

@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline

@tianleiwu
Contributor

/azp run Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline

@tianleiwu
Contributor

/azp run Big Models,Linux Android Emulator QNN CI Pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 5 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).


@eralmual
Contributor Author

eralmual commented Sep 4, 2024

I don't have access to a macOS system to debug the error, so I excluded Apple platforms from using the kernels. I will share performance data so the PR can be merged.

@eralmual eralmual force-pushed the fp162fp32 branch 3 times, most recently from a6d7b7b to 289d92f on September 4, 2024 20:52
* Developed x86 and amd64 assembly kernels using AVX NE CONVERT.
* Developed an x86 assembly kernel using SSE instructions.
* Added a fallback implementation for the FP16 to FP32 cast.
* Added a runtime check to determine whether the CPU supports the ISA required by the kernel.
* Added kernel dispatching logic in platform.cpp (see the sketch below).
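A minimal sketch of that dispatch pattern (names are illustrative, not the actual MLAS symbols):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical kernel signatures; the AVX variant would be the assembly
// kernel, the fallback the portable C++ one.
using CastF16ToF32Fn = void (*)(const uint16_t* src, float* dst, size_t n);

void CastF16ToF32Fallback(const uint16_t* src, float* dst, size_t n);
void CastF16ToF32AvxNeConvert(const uint16_t* src, float* dst, size_t n);
bool HasAvxNeConvert();  // runtime CPUID check, sketched earlier

// Selected once at startup and stored in the platform's dispatch table.
CastF16ToF32Fn SelectCastF16ToF32Kernel() {
    return HasAvxNeConvert() ? CastF16ToF32AvxNeConvert
                             : CastF16ToF32Fallback;
}
```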
@tianleiwu
Contributor

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@tianleiwu
Contributor

/azp run Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline

@tianleiwu
Contributor

/azp run Big Models,Linux Android Emulator QNN CI Pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 9 pipeline(s).


Azure Pipelines successfully started running 5 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@tianleiwu
Contributor

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline


Azure Pipelines successfully started running 3 pipeline(s).

@yufenglee
Member

:shipit:

@yufenglee yufenglee merged commit 7489bfe into microsoft:main Sep 10, 2024
84 checks passed
@yufenglee
Member

Thanks Erick for your contribution!!!

@eralmual eralmual deleted the fp162fp32 branch September 10, 2024 15:49
@eralmual eralmual restored the fp162fp32 branch September 10, 2024 19:41
yufenglee pushed a commit that referenced this pull request Sep 16, 2024
### Description
Added checks to convert partial vectors in the early stages of the FP16 to FP32 cast using the AVX NE CONVERT ISA.



### Motivation and Context
Avoids storing data outside of the output buffer; these checks were missing from the [original PR](#21183). The fix prevents memory corruption when the output buffer has a size in [n*16 + 1, n*16 + 7] with 0 < n.
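The general shape of such a fix (hypothetical helper names, not the PR's exact code): the wide kernel only runs on full 16-element groups, and a scalar loop finishes the remainder so no store can land past dst + n.

```cpp
#include <cstddef>
#include <cstdint>

void Convert16F16ToF32(const uint16_t* src, float* dst);  // vector kernel (hypothetical)
float Fp16ToFp32(uint16_t h);                             // scalar conversion (hypothetical)

// Convert n FP16 values; the tail loop guarantees writes stay within dst[0..n).
void CastF16ToF32(const uint16_t* src, float* dst, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        Convert16F16ToF32(src + i, dst + i);   // full 16-element groups only
    }
    for (; i < n; ++i) {
        dst[i] = Fp16ToFp32(src[i]);           // remainder, one at a time
    }
}
```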
@eralmual eralmual deleted the fp162fp32 branch September 17, 2024 18:00