
[wasm] Add Vector128 and PackedSimd support to the jiterpreter; add PackedSimd to the interpreter #82773

Merged
4 commits merged into main from wasm-jiterpreter-simd on May 7, 2023

Conversation

**@kg** (Contributor) commented Feb 28, 2023

This updates the jiterpreter to implement the SIMD intrinsics that were added to the interp, which removes the need for most traces using those opcodes to bail out. It operates in one of two modes:

  1. Semi-native SIMD using WASM v128 opcodes. If code generation fails while this mode is active, it is automatically switched off.
  2. Emulated SIMD using the same C functions the interp uses to implement the opcodes, for equivalent behavior. These are called directly as wasm imports instead of via an indirect call plus table lookup as in the interp.
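
As an illustration of the emulated mode, each helper is just a plain function that operates on 16-byte vectors at given offsets in linear memory. A minimal JavaScript sketch of the idea (the function name and layout are made up for illustration; the real helpers are C functions in the interp, invoked from the trace as wasm imports):

```javascript
// Sketch of an emulated v128 binary op over linear memory. Hypothetical
// name and layout; the real helpers are interp C functions.
const memory = new ArrayBuffer(256);
const i32 = new Int32Array(memory);

// interp_simd_i32x4_add(dest, lhs, rhs): byte offsets, each pointing at
// a 16-byte (4 x i32) vector in linear memory.
function interp_simd_i32x4_add(dest, lhs, rhs) {
  for (let lane = 0; lane < 4; lane++) {
    // >> 2 converts a byte offset to an Int32Array index; | 0 wraps to i32.
    i32[(dest >> 2) + lane] =
      (i32[(lhs >> 2) + lane] + i32[(rhs >> 2) + lane]) | 0;
  }
}

// Usage: vectors at byte offsets 16 and 32, result written at offset 0.
i32.set([1, 2, 3, 4], 4);      // byte offset 16
i32.set([10, 20, 30, 40], 8);  // byte offset 32
interp_simd_i32x4_add(0, 16, 32);
```

Calling such helpers as direct imports avoids the indirect-call overhead the interp pays for the same operation.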

Not all of the SIMD opcodes are implemented directly, so even with v128 opcodes turned on, some of them will be dispatched to the interpreter's C helpers. I stuck to the ones I could be relatively confident I was implementing correctly, since the wasm spec's documentation for its vector opcodes is written in cipher.
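
For reference, every wasm SIMD instruction is encoded as the `0xFD` prefix byte followed by a LEB128-encoded opcode number, so emitting a v128 opcode directly boils down to appending a couple of bytes. A sketch (`appendSimd` is a made-up helper name; the opcode numbers themselves are defined by the wasm SIMD proposal, e.g. i8x16.splat is 15):

```javascript
// Unsigned LEB128 encoder: 7 bits per byte, high bit set on all but the last.
function encodeULeb128(value) {
  const bytes = [];
  do {
    let b = value & 0x7f;
    value >>>= 7;
    if (value !== 0) b |= 0x80; // continuation bit
    bytes.push(b);
  } while (value !== 0);
  return bytes;
}

// Emit one SIMD instruction: 0xFD prefix + LEB128(opcode).
// Hypothetical helper name for illustration.
function appendSimd(buffer, opcode) {
  buffer.push(0xfd, ...encodeULeb128(opcode));
}

const buf = [];
appendSimd(buf, 15);  // i8x16.splat encodes as [0xfd, 0x0f]
appendSimd(buf, 228); // opcodes >= 128 take a two-byte LEB128 encoding
```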

Along with this, I added interp opcodes for most of the PackedSimd API, with C interp intrinsics auto-generated for each one by wrapping clang's matching intrinsics.

This PR also updates genmintops to generate SIMD tables, and inlines trace entry logic because we saw a performance regression when it was outlined.

This PR also improves and unifies the imported function logic in jiterpreter-support so it's easier to work with.

@kg added labels NO-MERGE (The PR is not ready for merge yet; see discussion for detailed reasons), arch-wasm (WebAssembly architecture), area-Codegen-Jiterpreter-mono on Feb 28, 2023
**@ghost** commented Feb 28, 2023

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Issue Details

This updates the jiterpreter to implement some of the new SIMD opcodes that were added to the interp, which removes the need for most traces using those opcodes to bail out. It operates in one of two modes:

  1. Semi-native SIMD using WASM v128 opcodes. If code generation fails while this mode is active, it is automatically switched off.
  2. Emulated SIMD using the same C functions the interp uses to implement the opcodes, for equivalent behavior.

I probably need to add a dedicated feature-detection path that compiles a test wasm module containing v128 opcodes to sense the availability of SIMD, but the automated fallback may be good enough.
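
A feature-detection path along those lines can be as small as validating a hand-assembled module whose only function uses a v128 opcode; validation fails on engines without SIMD support. A sketch (bytes assembled by hand; i8x16.splat is `0xFD 0x0F`):

```javascript
// Hand-assembled wasm module: one function () -> () whose body is
// i32.const 0; i8x16.splat; drop. It only validates if the engine
// supports the SIMD proposal, so WebAssembly.validate doubles as
// a feature probe.
const simdProbe = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d,             // magic: "\0asm"
  0x01, 0x00, 0x00, 0x00,             // version 1
  0x01, 0x04, 0x01, 0x60, 0x00, 0x00, // type section: () -> ()
  0x03, 0x02, 0x01, 0x00,             // function section: one func, type 0
  0x0a, 0x09, 0x01,                   // code section, one body
  0x07, 0x00,                         // body size 7, no locals
  0x41, 0x00,                         // i32.const 0
  0xfd, 0x0f,                         // i8x16.splat (pops i32, pushes v128)
  0x1a,                               // drop
  0x0b,                               // end
]);

const hasSimd = WebAssembly.validate(simdProbe);
```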

Right now the C functions are invoked using call_indirect, so they're effectively virtual calls. They could be static if I overhauled jiterpreter-support to allow defining new imports during code generation. On the other hand, the interpreter also does a virtual call, so this is at least no worse.
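
The difference is visible at the opcode level: a static import call is `call` (0x10) with an immediate function index, while `call_indirect` (0x11) carries a type index plus a table index and consumes the callee index from the operand stack at runtime, which is where the extra table lookup comes from. A sketch of emitting both (helper names are made up; indices are kept below 128 so single-byte LEB128 suffices):

```javascript
// Emit a direct call: opcode 0x10 + function index.
// Single-byte indices only, for simplicity of this sketch.
function emitCall(buffer, funcIndex) {
  buffer.push(0x10, funcIndex);
}

// Emit an indirect call: opcode 0x11 + type index + table index.
// The callee function index is popped from the stack at runtime,
// hence the per-call table lookup.
function emitCallIndirect(buffer, typeIndex) {
  buffer.push(0x11, typeIndex, 0x00);
}

const direct = [];
emitCall(direct, 3);           // -> [0x10, 0x03]
const indirect = [];
emitCallIndirect(indirect, 1); // -> [0x11, 0x01, 0x00]
```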

Not all of the SIMD opcodes are implemented directly, so even with v128 opcodes turned on, some of them will be dispatched to the interpreter's C helpers. I stuck to the ones I could be relatively confident I was implementing correctly, since the wasm spec's documentation for its vector opcodes is written in cipher.

This PR also updates genmintops to generate SIMD tables.

There is some stray merge damage left in the PR that I need to figure out how to resolve, and I need to decide whether the v128 opcodes should be enabled by default when merging. I'm inclined to leave them switched off, so the baseline is just calling the interp C helpers, and we can see whether that's an improvement by itself.

Author: kg
Assignees: -
Labels:

NO-MERGE, arch-wasm, area-Codegen-Jiterpreter-mono

Milestone: -

**@kg** (Contributor, Author) commented Mar 2, 2023

browser-bench timings with this rebased on main:

| measurement | jiterp (v128) | jiterp (no v128) | no interp simd |
| --- | --- | --- | --- |
| Json, non-ASCII text serialize | 2.9236ms | 3.0078ms | 2.9899ms |
| Json, non-ASCII text deserialize | 5.5887ms | 5.7569ms | 5.5366ms |
| Json, small serialize | 0.1248ms | 0.1266ms | 0.1287ms |
| Json, small deserialize | 0.2142ms | 0.2190ms | 0.2100ms |
| Json, large serialize | 33.3725ms | 34.3400ms | 34.7770ms |
| Json, large deserialize | 56.4239ms | 58.3258ms | 55.8387ms |
| Span, Reverse bytes | 0.0210ms | 0.0736ms | 0.0892ms |
| Span, Reverse chars | 0.1310ms | 0.1310ms | 0.0914ms |
| Span, IndexOf bytes | 0.7161us | 0.8906us | 0.8870us |
| Span, IndexOf chars | 0.1101ms | 0.1329ms | 0.0840ms |
| Span, SequenceEqual bytes | 0.0325ms | 0.0248ms | 0.0889ms |
| Span, SequenceEqual chars | 0.0647ms | 0.0496ms | 0.1781ms |
| String, Normalize | 1.4029ms | 1.3966ms | 1.4207ms |
| String, Normalize ASCII | 0.1752ms | 0.1732ms | 0.1752ms |
| Vector, Create Vector128 | 0.0942us | 0.0928us | 0.0902us |
| Vector, Add 2 Vector128's | 0.1076us | 0.1062us | 0.1133us |
| Vector, Multiply 2 Vector128's | 0.1076us | 0.1065us | 0.1133us |
| Vector, Dot product int | 0.1004us | 0.0998us | 0.1069us |
| Vector, Dot product ulong | 0.0891us | 0.0878us | 0.0891us |
| Vector, Dot product float | 0.1117us | 0.1095us | 0.1109us |
| Vector, Dot product double | 0.0916us | 0.0909us | 0.0916us |
| Vector, Sum sbyte | 0.1411us | 0.1407us | 0.1417us |
| Vector, Sum short | 0.1100us | 0.1109us | 0.1111us |
| Vector, Sum uint | 0.0907us | 0.0898us | 0.0912us |
| Vector, Sum double | 0.0898us | 0.0893us | 0.0900us |
| Vector, Min float | 0.1532us | 0.1549us | 0.1592us |
| Vector, Max float | 0.1673us | 0.1636us | 0.1646us |
| Vector, Min double | 0.1197us | 0.1180us | 0.1197us |
| Vector, Max double | 0.1209us | 0.1194us | 0.1237us |
| Vector, Normalize float | 0.2620us | 0.2593us | 0.4157us |

My opinion is pretty mixed based on this. It looks like some workloads in our bench suite vectorize, and their performance improves (especially if we enable native use of wasm v128 opcodes). But a lot of them hold steady or even get worse, presumably cases where the interpreter's intrinsics support isn't broad enough, so claiming 'v128 is hardware accelerated' hurts us.

Not sure what the right approach is. Maybe if we had wider support for Vector128 at the interpreter intrinsic level we'd see more gains?

@BrzVlad what do you think? I think for now it makes sense to leave INTERP_ENABLE_SIMD guarded with #if !HOST_BROWSER && __GNUC__, and to land the jiterpreter support so we have a starting point and can consider enabling it later. The downside is that it could potentially bitrot in this state, since it would never run.

**@BrzVlad** (Member) commented Mar 3, 2023

@kg I don't think the way you're testing is correct, since the numbers make no sense; it's as if you're testing the same thing in every column. What does 'no interp simd' mean? Are you enabling SIMD in the csproj? (WasmEnableSIMD; I think the linker trims out vectorized BCL code if it's not enabled.)

Also, I don't think we should bother with perf investigations of this until we get some reference numbers from the microbenchmark suite on the desktop interp.

**@kg** (Contributor, Author) commented Mar 3, 2023

> @kg I don't think the way you're testing is correct, since the numbers make no sense; it's as if you're testing the same thing in every column. What does 'no interp simd' mean? Are you enabling SIMD in the csproj? (WasmEnableSIMD; I think the linker trims out vectorized BCL code if it's not enabled.)

I enable interp SIMD for the first two columns by changing `#if !HOST_BROWSER && __GNUC__` to `#if __GNUC__`, which does cause interpreter SIMD opcodes to appear in various places as expected.

I'll retest by editing the browser-bench csproj to add that property; because SIMD had been enabled on main, I thought setting the constant in the interp would be sufficient. If the code is being trimmed out, that would explain not seeing much of a difference; the opcodes I'm seeing must be from code that escaped trimming.

**@kg** (Contributor, Author) commented Mar 3, 2023

I rebuilt the runtime with WasmEnableSIMD set to true, then rebuilt browser-bench with it set as well, then did a second rebuild of browser-bench with it set to false. The results are effectively the same across runs; I don't think that msbuild property works in non-AOT builds. I have trimming disabled too.

**@lewing** (Member) commented Mar 5, 2023

@kg I merged main; can you double-check the legacy_cwraps bit?

**@kg** (Contributor, Author) commented Mar 5, 2023

> @kg I merged main; can you double-check the legacy_cwraps bit?

Looks right

**@kg** (Contributor, Author) commented Mar 8, 2023

I still have no explanation for why this performs the way it does.

| measurement | time (this PR) | time (this+AOT) | time (main) |
| --- | --- | --- | --- |
| Json, non-ASCII text serialize | 3.8978ms | 0.2363ms | 4.1681ms |
| Json, non-ASCII text deserialize | 6.2145ms | 0.4287ms | 6.3337ms |
| Json, small serialize | 0.1357ms | 0.0126ms | 0.1324ms |
| Json, small deserialize | 0.2251ms | 0.0328ms | 0.2198ms |
| Json, large serialize | 36.5108ms | 3.5608ms | 35.9710ms |
| Json, large deserialize | 59.1910ms | 9.3782ms | 58.2045ms |
| Span, Reverse bytes | 0.0219ms | 0.0021ms | 0.1244ms |
| Span, Reverse chars | 0.1289ms | 0.0053ms | 0.4433ms |
| Span, IndexOf bytes | 0.7610us | 0.1028us | 0.0010ms |
| Span, IndexOf chars | 0.1153ms | 0.0028ms | 0.0864ms |
| Span, SequenceEqual bytes | 0.0400ms | 0.0022ms | 0.0295ms |
| Span, SequenceEqual chars | 0.0797ms | 0.0042ms | 0.0590ms |
| String, Normalize | 1.4766ms | 0.7871ms | 1.4346ms |
| String, Normalize ASCII | 0.1815ms | 0.0061ms | 0.1811ms |
| Vector, Create Vector128 | 0.0975us | 0.0449us | 0.0930us |
| Vector, Add 2 Vector128's | 0.1103us | 0.0479us | 0.1089us |
| Vector, Multiply 2 Vector128's | 0.1117us | 0.0453us | 0.1100us |
| Vector, Dot product int | 0.1060us | 0.0458us | 0.1036us |
| Vector, Dot product ulong | 0.0933us | 0.0453us | 0.0910us |
| Vector, Dot product float | 0.1162us | 0.0449us | 0.1152us |
| Vector, Dot product double | 0.0971us | 0.0447us | 0.0940us |
| Vector, Sum sbyte | 0.1494us | 0.0449us | 0.1449us |
| Vector, Sum short | 0.1165us | 0.0447us | 0.1150us |
| Vector, Sum uint | 0.0961us | 0.0446us | 0.0945us |
| Vector, Sum double | 0.0947us | 0.0447us | 0.0922us |
| Vector, Min float | 0.2321us | 0.0451us | 0.2200us |
| Vector, Max float | 0.2233us | 0.0447us | 0.1786us |
| Vector, Min double | 0.1850us | 0.0451us | 0.1859us |
| Vector, Max double | 0.1447us | 0.0450us | 0.1349us |
| Vector, Normalize float | 0.2761us | 0.0493us | 0.4228us |

A handful of things like Span.Reverse and Vector.Normalize are obviously improved, some like Span.IndexOf are worse, and some of these timing samples just make no sense, which makes me wonder whether browser-bench is broken on my hardware.

EDIT: Ah, the ones that look way off are a msec/usec disconnect between columns. I wish the harness didn't do that.

**@BrzVlad** (Member) commented Mar 8, 2023

I think it's best to investigate why "Vector, Add 2 Vector128's" is not improved, since it seems like the simplest and most egregious case. You could add verbose logging to see exactly what C#/jiterp trace code is being executed and whether it is correct. If my understanding is correct, you are not falling back to the interp's C vector methods but are emitting wasm SIMD opcodes directly? If that's the case, not seeing a perf improvement is absurd.

**@kg** (Contributor, Author) commented Apr 15, 2023

The custom conditional-select and bitwise equality/inequality opcodes on the jiterpreter side are broken somehow, so I've disabled them. I'm not sure what's going on, since tests pass, but unless they're disabled the beginning of the xharness output XML goes missing. This means the interpreter's C helpers get used instead, which isn't the end of the world.

@kg force-pushed the wasm-jiterpreter-simd branch 2 times, most recently from 8d8a71d to f97e897 (April 15, 2023 06:30)
**@kg** (Contributor, Author) commented Apr 15, 2023

This is now blocked on real failures in the vectorized IndexOfAny code, which didn't run before due to the lack of PackedSimd. I can't come up with a satisfactory explanation for why it's not working. (Incidentally, the set of tests that fail locally is different from the set that fails on CI, and the failing values differ, which makes me suspect some sort of memory corruption or uninitialized memory.)


```
[07:08:44] info: Starting:    System.Memory.Tests.dll
[07:10:23] fail: [FAIL] System.SpanTests.ReadOnlySpanTests.DefaultFilledIndexOfMany_Char
[07:10:23] info: Assert.Equal() Failure
[07:10:23] info: Expected: 0
[07:10:23] info: Actual:   -1
[07:10:23] info:    at System.SpanTests.ReadOnlySpanTests.DefaultFilledIndexOfMany_Char()
[07:10:23] info:    at System.Reflection.MethodInvoker.InterpretedInvoke(Object obj, IntPtr* args)
[07:10:23] info:    at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
[07:11:13] fail: [FAIL] System.SpanTests.SpanTests.IndexOfAny_LastIndexOfAny_AlgComplexity_Chars
[07:11:13] info: Assert.Equal() Failure
[07:11:13] info: Expected: 7168
[07:11:13] info: Actual:   -1
[07:11:13] info:    at System.SpanTests.SpanTests.RunLastIndexOfAnyAlgComplexityTest[Char](Char[] needle)
[07:11:13] info:    at System.SpanTests.SpanTests.RunIndexOfAnyLastIndexOfAnyAlgComplexityTest[Char]()
[07:11:13] info:    at System.SpanTests.SpanTests.IndexOfAny_LastIndexOfAny_AlgComplexity_Chars()
[07:11:13] info:    at System.Reflection.MethodInvoker.InterpretedInvoke(Object obj, IntPtr* args)
[07:11:13] info:    at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
[07:11:14] info: Finished:    System.Memory.Tests.dll
```

While I'm mildly suspicious of the vectorized code, since it was previously being guarded, I don't see anything wrong with it. It's possible we're now exercising some part of the interp v128 code that we weren't before, but the test doesn't seem to use those APIs heavily. I double- and triple-checked that I'm using the correct clang intrinsic for the relevant opcode, and even experimented with different intrinsics in case I misread something. It also fails with the jiterpreter disabled, so that isn't a factor.

**@kg** (Contributor, Author) commented Apr 15, 2023

Disabling the interp v128 intrinsics doesn't fix it either (I verified that doing so results in mostly-scalar code with wasm packedsimd opcodes scattered through it), which seems to indicate that the PackedSimd operation(s) are the problem.

**@kg** (Contributor, Author) commented Apr 17, 2023

Radek pointed out that this failure could be #84775. It could be that there's something wrong with the way the interp aligns v128 arguments or return values, or that my copy-paste produced an incorrect alignment.

kg added commits (most recent push May 5, 2023 17:13):

- Fix bugs in the jiterpreter-support import management implementation
- Enable interpreter SIMD for WASM
- Minor cleanups
- Fix merge
- Checkpoint PackedSimd
- Implement a few PackedSimd methods in the interp
- Add packedsimd ops to jiterpreter
- Move the interp simd opcode -> wasm opcode mapping into the C tables
- Add intrinsic ids for most of the remaining packedsimd methods
- Map most of the PackedSimd methods to intrinsics
- Update genmintops
- Fix merge damage
- Fix build
- Add more wasm opcodes
- Add more wasm opcodes
- Add bitmask intrinsics
- Add missing opcodes to transform-simd
- Use HOST_BROWSER instead of HOST_WASM to fix wasi build
- Implement the pack-n-elements vector instructions
- Disable bitselect because it's broken somehow
- Simplify vector packing
- Add browser-bench measurements for packing vectors
- Disable more opcodes
- Disable more opcodes
- Fix PackedSimd feature detection on non-wasm targets
- Maybe fix linux interp assertion
- Don't fail transform for unsupported PackedSimd methods on non-browser targets
- Fix i64 popcnt
- Add basic R4 v128 intrinsics (add/sub/div/mul)
- Re-enable more jiterpreter simd
@kg force-pushed the wasm-jiterpreter-simd branch from 8e00956 to 28bbc99 (May 6, 2023 00:36)
@kg removed the NO-MERGE label (May 6, 2023)
**@kg** (Contributor, Author) commented May 6, 2023

I added some basic R4 vector ops to the interp and changed some asserts so that it's possible to run the SIMD version of Pavel's raytracer. It completes in ~16k ms with these changes compared to ~31k ms before, nearly a 50% reduction.

The crash issues and test failures I was hitting before seem to have been at least partially caused by an arglist alignment issue that is now fixed on main, so hopefully tests will pass now.

**@kg** (Contributor, Author) commented May 6, 2023

The remaining issue on this PR seems to be that once we report PackedSimd support, writing the testResults.xml file ends up dropping a few characters at the start or end of the stream. It's not clear to me why this would happen.

**@vargaz** (Contributor) commented May 6, 2023

Can this be turned off by default so the PR can be merged?

**@kg** (Contributor, Author) commented May 6, 2023

> Can this be turned off by default so the PR can be merged?

I could probably comment out the PackedSimd support, but that makes this PR much less useful since the BCL uses it now. I'm fine with doing that, assuming none of the CI tests fail and the only remaining problem is the XML, since it means we're not merging tremendously broken code.

**@kg** (Contributor, Author) commented May 6, 2023

All the failures I saw were the test-results XML size problem, so I added runtime options to govern interp v128 and PackedSimd support and set them both to FALSE for browser.
