
[mono] Add Vector128 Sum intrinsic for amd64 #75142

Merged · matouskozak merged 10 commits into main from amd64_sum_intrinsics on Sep 22, 2022
Conversation

@matouskozak (Member) commented Sep 6, 2022

Add support for the following Vector128 APIs:

  • Sum: byte and sbyte element types are not supported yet. For i64 the lowering emits a plain instruction sequence rather than a dedicated intrinsic, but the generated assembly is still significantly smaller than without it. A rough C sketch of the f32 lowering is shown below.
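
A rough C sketch of what the f32 case lowers to (using SSE3 intrinsics; the function name sum_f32x4 is illustrative, not code from this PR):

#include <immintrin.h>

/* Sum the four float lanes of v: SSE3 haddps does pairwise adds,
   so applying it twice collapses all lanes into lane 0. */
static float sum_f32x4 (__m128 v)
{
	v = _mm_hadd_ps (v, v);    /* (a0+a1, a2+a3, a0+a1, a2+a3) */
	v = _mm_hadd_ps (v, v);    /* every lane = a0+a1+a2+a3     */
	return _mm_cvtss_f32 (v);  /* extract lane 0               */
}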

@dotnet deleted a comment from azure-pipelines bot Sep 8, 2022
@jandupej (Member) commented Sep 8, 2022

I'm nitpicking here. For f32, this horizontal sum boils down to:

haddps xmm0, xmm0       ; ICL (p01 2p5) lat=6, thr=1/2  ; Zen3 lat=6 thr=1/2
haddps xmm0, xmm0       ; ICL (p01 2p5) lat=6, thr=1/2  ; Zen3 lat=6 thr=1/2

The haddps instruction has a latency of 6 both on ICL/TGL and Zen3. This could be slightly improved by eliminating the first haddps:

xorps xmm1, xmm1        ; ICL, Zen3 - dependency-breaker (probably lat=0)
movhlps xmm1, xmm0      ; ICL (p5) lat=1, thr=1         ; Zen3 lat=1, thr=2
addps xmm0, xmm1        ; ICL (p01) lat=4, thr=2        ; Zen3 lat=3, thr=2
haddps xmm0, xmm0       ; ICL (p01 2p5) lat=6, thr=1/2  ; Zen3 lat=6 thr=1/2

The resulting code is longer, but has a lower total latency and puts less pressure on Intel's port 5.

Still, horizontal add probably won't be executed in an inner loop, so saving 1-2 clocks of latency is not significant. And this would probably have to be measured, too.
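
In intrinsics form, the proposed alternative sequence would look roughly like this (a sketch under the same SSE3 assumption; sum_f32x4_alt is an illustrative name, not code from this PR):

static float sum_f32x4_alt (__m128 v)
{
	__m128 hi = _mm_movehl_ps (_mm_setzero_ps (), v); /* (a2, a3, 0, 0)           */
	v = _mm_add_ps (v, hi);                           /* (a0+a2, a1+a3, a2, a3)   */
	v = _mm_hadd_ps (v, v);                           /* lane 0 = (a0+a2)+(a1+a3) */
	return _mm_cvtss_f32 (v);
}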

@matouskozak marked this pull request as ready for review September 9, 2022 06:21
@tannergooding (Member) commented Sep 9, 2022

> The resulting code is longer, but has a lower total latency and puts less pressure on Intel's port 5.

I expect the longer code will have an overall net-negative impact in loops since it takes up 2x the space, creates a 3-instruction dependency chain, and likewise consumes additional micro-ops in the decoder.

We also have to be careful because this can be non-deterministic. For floating-point, (a + b) + c != a + (b + c), so computing a[0] + a[1] + a[2] + a[3] for the scalar path, (a[0] + a[1]) + (a[2] + a[3]) for 2x hadd, or (a[0] + a[2]) + (a[1] + a[3]) for shuffle, add, hadd may all produce different results.
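
A small standalone C example of that associativity point (the input values are hypothetical, chosen only to make the groupings diverge):

#include <stdio.h>

int main (void)
{
	float a [4] = { 1e8f, 1.0f, -1e8f, 1.0f };

	float scalar   = ((a [0] + a [1]) + a [2]) + a [3]; /* left-to-right scalar sum */
	float two_hadd = (a [0] + a [1]) + (a [2] + a [3]); /* 2x hadd grouping         */
	float shuf_add = (a [0] + a [2]) + (a [1] + a [3]); /* shuffle, add, hadd order */

	/* With IEEE 754 single precision this prints "1 0 2". */
	printf ("%g %g %g\n", scalar, two_hadd, shuf_add);
	return 0;
}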

@dotnet deleted a comment from azure-pipelines bot Sep 13, 2022
@dotnet deleted a comment from azure-pipelines bot Sep 19, 2022
@matouskozak (Member, Author)

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@@ -545,6 +551,92 @@ emit_sum_vector (MonoCompile *cfg, MonoType *vector_type, MonoTypeEnum element_t
}
#endif

#ifdef TARGET_AMD64
static int type_to_extract_op (MonoTypeEnum type);
static const int fast_log2 [] = { 1, 0, 1, -1, 2, -1, -1, -1, 3 };
Review comment (Member):

This is not simply calculating log2. It seems you've assigned -1 to the positions you consider illegal element counts. If that's the case, element counts 0 and 1 should be -1 as well.

Reply from @matouskozak (Member, Author):

You are right, -1 should mark the illegal inputs, and in this case 0 and 1 are illegal as well.
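
For illustration, the table with that fix applied would look something like this (a sketch of the intent, not necessarily the exact committed line):

/* Maps a vector element count (2, 4 or 8) to its log2; -1 marks element
   counts that are not handled here, including 0 and 1 per the review above. */
static const int fast_log2 [] = { -1, -1, 1, -1, 2, -1, -1, -1, 3 };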

@matouskozak (Member, Author)

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@matouskozak merged commit 15b2520 into main Sep 22, 2022
@matouskozak deleted the amd64_sum_intrinsics branch September 22, 2022 05:46
@ghost locked as resolved and limited conversation to collaborators Oct 22, 2022