
[mono] Add Vector128 Sum intrinsic for amd64 #75142

Merged · matouskozak merged 10 commits into main from amd64_sum_intrinsics on Sep 22, 2022
Conversation

@matouskozak (Member) commented Sep 6, 2022

Add support for the following Vector128 APIs:

  • Sum: byte and sbyte element types are not supported yet. For i64 the lowering emits a plain instruction sequence rather than a dedicated intrinsic, but the generated assembly is still significantly smaller than without it. A rough C sketch of the f32 lowering is shown below.
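
A rough C sketch of what the f32 case lowers to (using SSE3 intrinsics; the function name sum_f32x4 is illustrative, not code from this PR):

#include <immintrin.h>

/* Sum the four float lanes of v: SSE3 haddps does pairwise adds,
   so applying it twice collapses all lanes into lane 0. */
static float sum_f32x4 (__m128 v)
{
	v = _mm_hadd_ps (v, v);    /* (a0+a1, a2+a3, a0+a1, a2+a3) */
	v = _mm_hadd_ps (v, v);    /* every lane = a0+a1+a2+a3     */
	return _mm_cvtss_f32 (v);  /* extract lane 0               */
}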

@dotnet deleted a comment from azure-pipelines bot Sep 8, 2022
@jandupej (Member) commented Sep 8, 2022

I'm nitpicking here. For f32, this horizontal sum boils down to:

haddps xmm0, xmm0       ; ICL (p01 2p5) lat=6, thr=1/2  ; Zen3 lat=6 thr=1/2
haddps xmm0, xmm0       ; ICL (p01 2p5) lat=6, thr=1/2  ; Zen3 lat=6 thr=1/2

The haddps instruction has a latency of 6 both on ICL/TGL and Zen3. This could be slightly improved by eliminating the first haddps:

xorps xmm1, xmm1        ; ICL, Zen3 - dependency-breaker (probably lat=0)
movhlps xmm1, xmm0      ; ICL (p5) lat=1, thr=1         ; Zen3 lat=1, thr=2
addps xmm0, xmm1        ; ICL (p01) lat=4, thr=2        ; Zen3 lat=3, thr=2
haddps xmm0, xmm0       ; ICL (p01 2p5) lat=6, thr=1/2  ; Zen3 lat=6 thr=1/2

The resulting code is longer, but has a lower total latency and puts less pressure on Intel's port 5.

Still, horizontal add probably won't be executed in an inner loop, so saving 1-2 clocks of latency is not significant. And this would probably have to be measured, too.
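
In intrinsics form, the proposed alternative sequence would look roughly like this (a sketch under the same SSE3 assumption; sum_f32x4_alt is an illustrative name, not code from this PR):

static float sum_f32x4_alt (__m128 v)
{
	__m128 hi = _mm_movehl_ps (_mm_setzero_ps (), v); /* (a2, a3, 0, 0)           */
	v = _mm_add_ps (v, hi);                           /* (a0+a2, a1+a3, a2, a3)   */
	v = _mm_hadd_ps (v, v);                           /* lane 0 = (a0+a2)+(a1+a3) */
	return _mm_cvtss_f32 (v);
}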

@matouskozak marked this pull request as ready for review September 9, 2022 06:21
@tannergooding (Member) commented Sep 9, 2022

> The resulting code is longer, but has a lower total latency and puts less pressure on Intel's port 5.

I expect the longer code will have an overall net-negative impact in loops since it takes up 2x the space, creates a 3-instruction dependency chain, and likewise consumes additional micro-ops in the decoder.

We also have to be careful because this can be non-deterministic. For floating-point, (a + b) + c != a + (b + c), so computing a[0] + a[1] + a[2] + a[3] for the scalar path, (a[0] + a[1]) + (a[2] + a[3]) for 2x hadd, or (a[0] + a[2]) + (a[1] + a[3]) for shuffle, add, hadd may all produce different results.
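
A small standalone C example of that associativity point (the input values are hypothetical, chosen only to make the groupings diverge):

#include <stdio.h>

int main (void)
{
	float a [4] = { 1e8f, 1.0f, -1e8f, 1.0f };

	float scalar   = ((a [0] + a [1]) + a [2]) + a [3]; /* left-to-right scalar sum */
	float two_hadd = (a [0] + a [1]) + (a [2] + a [3]); /* 2x hadd grouping         */
	float shuf_add = (a [0] + a [2]) + (a [1] + a [3]); /* shuffle, add, hadd order */

	/* With IEEE 754 single precision this prints "1 0 2". */
	printf ("%g %g %g\n", scalar, two_hadd, shuf_add);
	return 0;
}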

@dotnet deleted a comment from azure-pipelines bot Sep 13, 2022
@dotnet deleted a comment from azure-pipelines bot Sep 19, 2022
@matouskozak (Member, Author)

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@@ -545,6 +551,92 @@ emit_sum_vector (MonoCompile *cfg, MonoType *vector_type, MonoTypeEnum element_t
}
#endif

#ifdef TARGET_AMD64
static int type_to_extract_op (MonoTypeEnum type);
static const int fast_log2 [] = { 1, 0, 1, -1, 2, -1, -1, -1, 3 };
Review comment (Member):

This is not simply calculating log2. It seems you've assigned -1 to the positions you consider illegal element counts. If that's the case, element counts 0 and 1 should be -1 as well.

Reply from @matouskozak (Member, Author):

You are right, -1 should mark the illegal inputs, and in this case 0 and 1 are illegal as well.
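
For illustration, the table with that fix applied would look something like this (a sketch of the intent, not necessarily the exact committed line):

/* Maps a vector element count (2, 4 or 8) to its log2; -1 marks element
   counts that are not handled here, including 0 and 1 per the review above. */
static const int fast_log2 [] = { -1, -1, 1, -1, 2, -1, -1, -1, 3 };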

@matouskozak (Member, Author)

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@matouskozak merged commit 15b2520 into main Sep 22, 2022
@matouskozak deleted the amd64_sum_intrinsics branch September 22, 2022 05:46
@ghost locked as resolved and limited conversation to collaborators Oct 22, 2022