
[wasm] Add Vector128 and PackedSimd support to the jiterpreter; add PackedSimd to the interpreter #82773

Merged
4 commits merged into main from wasm-jiterpreter-simd on May 7, 2023

Conversation

**@kg** (Contributor) commented Feb 28, 2023

This updates the jiterpreter to implement the SIMD intrinsics that were added to the interp, which removes the need for most traces using those opcodes to bail out. It operates in one of two modes:

  1. Semi-native SIMD using WASM v128 opcodes. If code generation fails while this mode is active, it is automatically switched off.
  2. Emulated SIMD using the same C functions the interp uses to implement the opcodes, for equivalent behavior. These are called directly as wasm imports instead of via an indirect call plus table lookup as in the interp.
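
As an illustration of the emulated mode, each helper is just a plain function that operates on 16-byte vectors at given offsets in linear memory. A minimal JavaScript sketch of the idea (the function name and layout are made up for illustration; the real helpers are C functions in the interp, invoked from the trace as wasm imports):

```javascript
// Sketch of an emulated v128 binary op over linear memory. Hypothetical
// name and layout; the real helpers are interp C functions.
const memory = new ArrayBuffer(256);
const i32 = new Int32Array(memory);

// interp_simd_i32x4_add(dest, lhs, rhs): byte offsets, each pointing at
// a 16-byte (4 x i32) vector in linear memory.
function interp_simd_i32x4_add(dest, lhs, rhs) {
  for (let lane = 0; lane < 4; lane++) {
    // >> 2 converts a byte offset to an Int32Array index; | 0 wraps to i32.
    i32[(dest >> 2) + lane] =
      (i32[(lhs >> 2) + lane] + i32[(rhs >> 2) + lane]) | 0;
  }
}

// Usage: vectors at byte offsets 16 and 32, result written at offset 0.
i32.set([1, 2, 3, 4], 4);      // byte offset 16
i32.set([10, 20, 30, 40], 8);  // byte offset 32
interp_simd_i32x4_add(0, 16, 32);
```

Calling such helpers as direct imports avoids the indirect-call overhead the interp pays for the same operation.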

Not all of the SIMD opcodes are implemented directly, so even with v128 opcodes turned on, some of them will be dispatched to the interpreter's C helpers. I stuck to the ones I could be relatively confident I was implementing correctly, since the wasm spec's documentation for its vector opcodes is written in cipher.
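
For reference, every wasm SIMD instruction is encoded as the `0xFD` prefix byte followed by a LEB128-encoded opcode number, so emitting a v128 opcode directly boils down to appending a couple of bytes. A sketch (`appendSimd` is a made-up helper name; the opcode numbers themselves are defined by the wasm SIMD proposal, e.g. i8x16.splat is 15):

```javascript
// Unsigned LEB128 encoder: 7 bits per byte, high bit set on all but the last.
function encodeULeb128(value) {
  const bytes = [];
  do {
    let b = value & 0x7f;
    value >>>= 7;
    if (value !== 0) b |= 0x80; // continuation bit
    bytes.push(b);
  } while (value !== 0);
  return bytes;
}

// Emit one SIMD instruction: 0xFD prefix + LEB128(opcode).
// Hypothetical helper name for illustration.
function appendSimd(buffer, opcode) {
  buffer.push(0xfd, ...encodeULeb128(opcode));
}

const buf = [];
appendSimd(buf, 15);  // i8x16.splat encodes as [0xfd, 0x0f]
appendSimd(buf, 228); // opcodes >= 128 take a two-byte LEB128 encoding
```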

Along with this, I added interp opcodes for most of the PackedSimd API, with C interp intrinsics auto-generated for each one by wrapping clang's matching intrinsics.

This PR also updates genmintops to generate SIMD tables, and inlines trace entry logic because we saw a performance regression when it was outlined.

This PR also improves and unifies the imported function logic in jiterpreter-support so it's easier to work with.

@kg added labels NO-MERGE (The PR is not ready for merge yet; see discussion for detailed reasons), arch-wasm (WebAssembly architecture), area-Codegen-Jiterpreter-mono on Feb 28, 2023
**@ghost** commented Feb 28, 2023

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Issue Details

This updates the jiterpreter to implement some of the new SIMD opcodes that were added to the interp, which removes the need for most traces using those opcodes to bail out. It operates in one of two modes:

  1. Semi-native SIMD using WASM v128 opcodes. If code generation fails while this mode is active, it is automatically switched off.
  2. Emulated SIMD using the same C functions the interp uses to implement the opcodes, for equivalent behavior.

I probably need to add a dedicated feature-detection path that compiles a test wasm module containing v128 opcodes to sense the availability of SIMD, but the automated fallback may be good enough.
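
A feature-detection path along those lines can be as small as validating a hand-assembled module whose only function uses a v128 opcode; validation fails on engines without SIMD support. A sketch (bytes assembled by hand; i8x16.splat is `0xFD 0x0F`):

```javascript
// Hand-assembled wasm module: one function () -> () whose body is
// i32.const 0; i8x16.splat; drop. It only validates if the engine
// supports the SIMD proposal, so WebAssembly.validate doubles as
// a feature probe.
const simdProbe = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d,             // magic: "\0asm"
  0x01, 0x00, 0x00, 0x00,             // version 1
  0x01, 0x04, 0x01, 0x60, 0x00, 0x00, // type section: () -> ()
  0x03, 0x02, 0x01, 0x00,             // function section: one func, type 0
  0x0a, 0x09, 0x01,                   // code section, one body
  0x07, 0x00,                         // body size 7, no locals
  0x41, 0x00,                         // i32.const 0
  0xfd, 0x0f,                         // i8x16.splat (pops i32, pushes v128)
  0x1a,                               // drop
  0x0b,                               // end
]);

const hasSimd = WebAssembly.validate(simdProbe);
```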

Right now the C functions are invoked using call_indirect, so they're effectively virtual calls. They could be static if I overhauled jiterpreter-support to allow defining new imports during code generation. On the other hand, the interpreter also does a virtual call, so this is at least no worse.
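
The difference is visible at the opcode level: a static import call is `call` (0x10) with an immediate function index, while `call_indirect` (0x11) carries a type index plus a table index and consumes the callee index from the operand stack at runtime, which is where the extra table lookup comes from. A sketch of emitting both (helper names are made up; indices are kept below 128 so single-byte LEB128 suffices):

```javascript
// Emit a direct call: opcode 0x10 + function index.
// Single-byte indices only, for simplicity of this sketch.
function emitCall(buffer, funcIndex) {
  buffer.push(0x10, funcIndex);
}

// Emit an indirect call: opcode 0x11 + type index + table index.
// The callee function index is popped from the stack at runtime,
// hence the per-call table lookup.
function emitCallIndirect(buffer, typeIndex) {
  buffer.push(0x11, typeIndex, 0x00);
}

const direct = [];
emitCall(direct, 3);           // -> [0x10, 0x03]
const indirect = [];
emitCallIndirect(indirect, 1); // -> [0x11, 0x01, 0x00]
```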

Not all of the SIMD opcodes are implemented directly, so even with v128 opcodes turned on, some of them will be dispatched to the interpreter's C helpers. I stuck to the ones I could be relatively confident I was implementing correctly, since the wasm spec's documentation for its vector opcodes is written in cipher.

This PR also updates genmintops to generate SIMD tables.

There is some stray merge damage left in the PR that I need to figure out how to resolve, and I need to decide whether the v128 opcodes should be enabled by default when merging. I'm inclined to leave them switched off, so the baseline is just calling the interp C helpers, and we can see whether that's an improvement by itself.

Author: kg
Assignees: -
Labels:

NO-MERGE, arch-wasm, area-Codegen-Jiterpreter-mono

Milestone: -

**@kg** (Contributor, Author) commented Mar 2, 2023

browser-bench timings with this rebased on main:

| measurement | jiterp (v128) | jiterp (no v128) | no interp simd |
| --- | --- | --- | --- |
| Json, non-ASCII text serialize | 2.9236ms | 3.0078ms | 2.9899ms |
| Json, non-ASCII text deserialize | 5.5887ms | 5.7569ms | 5.5366ms |
| Json, small serialize | 0.1248ms | 0.1266ms | 0.1287ms |
| Json, small deserialize | 0.2142ms | 0.2190ms | 0.2100ms |
| Json, large serialize | 33.3725ms | 34.3400ms | 34.7770ms |
| Json, large deserialize | 56.4239ms | 58.3258ms | 55.8387ms |
| Span, Reverse bytes | 0.0210ms | 0.0736ms | 0.0892ms |
| Span, Reverse chars | 0.1310ms | 0.1310ms | 0.0914ms |
| Span, IndexOf bytes | 0.7161us | 0.8906us | 0.8870us |
| Span, IndexOf chars | 0.1101ms | 0.1329ms | 0.0840ms |
| Span, SequenceEqual bytes | 0.0325ms | 0.0248ms | 0.0889ms |
| Span, SequenceEqual chars | 0.0647ms | 0.0496ms | 0.1781ms |
| String, Normalize | 1.4029ms | 1.3966ms | 1.4207ms |
| String, Normalize ASCII | 0.1752ms | 0.1732ms | 0.1752ms |
| Vector, Create Vector128 | 0.0942us | 0.0928us | 0.0902us |
| Vector, Add 2 Vector128's | 0.1076us | 0.1062us | 0.1133us |
| Vector, Multiply 2 Vector128's | 0.1076us | 0.1065us | 0.1133us |
| Vector, Dot product int | 0.1004us | 0.0998us | 0.1069us |
| Vector, Dot product ulong | 0.0891us | 0.0878us | 0.0891us |
| Vector, Dot product float | 0.1117us | 0.1095us | 0.1109us |
| Vector, Dot product double | 0.0916us | 0.0909us | 0.0916us |
| Vector, Sum sbyte | 0.1411us | 0.1407us | 0.1417us |
| Vector, Sum short | 0.1100us | 0.1109us | 0.1111us |
| Vector, Sum uint | 0.0907us | 0.0898us | 0.0912us |
| Vector, Sum double | 0.0898us | 0.0893us | 0.0900us |
| Vector, Min float | 0.1532us | 0.1549us | 0.1592us |
| Vector, Max float | 0.1673us | 0.1636us | 0.1646us |
| Vector, Min double | 0.1197us | 0.1180us | 0.1197us |
| Vector, Max double | 0.1209us | 0.1194us | 0.1237us |
| Vector, Normalize float | 0.2620us | 0.2593us | 0.4157us |

My opinion is pretty mixed based on this. It looks like some workloads in our bench suite vectorize, and their performance improves (especially if we enable native use of wasm v128 opcodes). But a lot of them hold steady or even get worse, presumably cases where the interpreter's intrinsics support isn't broad enough, so claiming 'v128 is hardware accelerated' hurts us.

Not sure what the right approach is. Maybe if we had wider support for Vector128 at the interpreter intrinsic level we'd see more gains?

@BrzVlad what do you think? I think for now it makes sense to leave INTERP_ENABLE_SIMD guarded with #if !HOST_BROWSER && __GNUC__, and to land the jiterpreter support so we have a starting point and can consider enabling it later. The downside is that it could potentially bitrot in this state, since it would never run.

**@BrzVlad** (Member) commented Mar 3, 2023

@kg I don't think the way you're testing is correct, since the numbers make no sense; it's as if you're testing the same thing in every column. What does 'no interp simd' mean? Are you enabling SIMD in the csproj? (WasmEnableSIMD; I think the linker trims out vectorized BCL code if it's not enabled.)

Also, I don't think we should bother with perf investigations of this until we get some reference numbers from the microbenchmark suite on the desktop interp.

**@kg** (Contributor, Author) commented Mar 3, 2023

> @kg I don't think the way you're testing is correct, since the numbers make no sense; it's as if you're testing the same thing in every column. What does 'no interp simd' mean? Are you enabling SIMD in the csproj? (WasmEnableSIMD; I think the linker trims out vectorized BCL code if it's not enabled.)

I enable interp SIMD for the first two columns by changing `#if !HOST_BROWSER && __GNUC__` to `#if __GNUC__`, which does cause interpreter SIMD opcodes to appear in various places as expected.

I'll retest by editing the browser-bench csproj to add that property; because SIMD had been enabled on main, I thought setting the constant in the interp would be sufficient. If the code is being trimmed out, that would explain not seeing much of a difference; the opcodes I'm seeing must be from code that escaped trimming.

**@kg** (Contributor, Author) commented Mar 3, 2023

I rebuilt the runtime with WasmEnableSIMD set to true, then rebuilt browser-bench with it set as well, then did a second rebuild of browser-bench with it set to false. The results are effectively the same across runs; I don't think that msbuild property works in non-AOT builds. I have trimming disabled too.

**@lewing** (Member) commented Mar 5, 2023

@kg I merged main; can you double-check the legacy_cwraps bit?

**@kg** (Contributor, Author) commented Mar 5, 2023

> @kg I merged main; can you double-check the legacy_cwraps bit?

Looks right

**@kg** (Contributor, Author) commented Mar 8, 2023

I still have no explanation for why this performs the way it does.

| measurement | time (this PR) | time (this+AOT) | time (main) |
| --- | --- | --- | --- |
| Json, non-ASCII text serialize | 3.8978ms | 0.2363ms | 4.1681ms |
| Json, non-ASCII text deserialize | 6.2145ms | 0.4287ms | 6.3337ms |
| Json, small serialize | 0.1357ms | 0.0126ms | 0.1324ms |
| Json, small deserialize | 0.2251ms | 0.0328ms | 0.2198ms |
| Json, large serialize | 36.5108ms | 3.5608ms | 35.9710ms |
| Json, large deserialize | 59.1910ms | 9.3782ms | 58.2045ms |
| Span, Reverse bytes | 0.0219ms | 0.0021ms | 0.1244ms |
| Span, Reverse chars | 0.1289ms | 0.0053ms | 0.4433ms |
| Span, IndexOf bytes | 0.7610us | 0.1028us | 0.0010ms |
| Span, IndexOf chars | 0.1153ms | 0.0028ms | 0.0864ms |
| Span, SequenceEqual bytes | 0.0400ms | 0.0022ms | 0.0295ms |
| Span, SequenceEqual chars | 0.0797ms | 0.0042ms | 0.0590ms |
| String, Normalize | 1.4766ms | 0.7871ms | 1.4346ms |
| String, Normalize ASCII | 0.1815ms | 0.0061ms | 0.1811ms |
| Vector, Create Vector128 | 0.0975us | 0.0449us | 0.0930us |
| Vector, Add 2 Vector128's | 0.1103us | 0.0479us | 0.1089us |
| Vector, Multiply 2 Vector128's | 0.1117us | 0.0453us | 0.1100us |
| Vector, Dot product int | 0.1060us | 0.0458us | 0.1036us |
| Vector, Dot product ulong | 0.0933us | 0.0453us | 0.0910us |
| Vector, Dot product float | 0.1162us | 0.0449us | 0.1152us |
| Vector, Dot product double | 0.0971us | 0.0447us | 0.0940us |
| Vector, Sum sbyte | 0.1494us | 0.0449us | 0.1449us |
| Vector, Sum short | 0.1165us | 0.0447us | 0.1150us |
| Vector, Sum uint | 0.0961us | 0.0446us | 0.0945us |
| Vector, Sum double | 0.0947us | 0.0447us | 0.0922us |
| Vector, Min float | 0.2321us | 0.0451us | 0.2200us |
| Vector, Max float | 0.2233us | 0.0447us | 0.1786us |
| Vector, Min double | 0.1850us | 0.0451us | 0.1859us |
| Vector, Max double | 0.1447us | 0.0450us | 0.1349us |
| Vector, Normalize float | 0.2761us | 0.0493us | 0.4228us |

A handful of things like Span.Reverse and Vector.Normalize are obviously improved, some like Span.IndexOf are worse, and some of these timing samples just make no sense, which makes me wonder whether browser-bench is broken on my hardware.

EDIT: Ah, the ones that look way off are a msec/usec disconnect between columns. I wish the harness didn't do that.

**@BrzVlad** (Member) commented Mar 8, 2023

I think it's best to investigate why "Vector, Add 2 Vector128's" is not improved, since it seems like the simplest and most egregious case. You could add verbose logging to see exactly what C#/jiterp trace code is being executed and whether it is correct. If my understanding is correct, you are not falling back to the interp's C vector methods but are emitting wasm SIMD opcodes directly? If that's the case, not seeing a perf improvement is absurd.

**@kg** (Contributor, Author) commented Apr 15, 2023

The custom conditional-select and bitwise equality/inequality opcodes on the jiterpreter side are broken somehow, so I've disabled them. I'm not sure what's going on, since tests pass, but unless they're disabled the beginning of the xharness output XML goes missing. This means the interpreter's C helpers get used instead, which isn't the end of the world.

@kg force-pushed the wasm-jiterpreter-simd branch 2 times, most recently from 8d8a71d to f97e897 (April 15, 2023 06:30)
**@kg** (Contributor, Author) commented Apr 15, 2023

This is now blocked on real failures in the vectorized IndexOfAny code, which didn't run before due to the lack of PackedSimd. I can't come up with a satisfactory explanation for why it's not working. (Incidentally, the set of tests that fail locally is different from the set that fails on CI, and the failing values differ, which makes me suspect some sort of memory corruption or uninitialized memory.)


```
[07:08:44] info: Starting:    System.Memory.Tests.dll
[07:10:23] fail: [FAIL] System.SpanTests.ReadOnlySpanTests.DefaultFilledIndexOfMany_Char
[07:10:23] info: Assert.Equal() Failure
[07:10:23] info: Expected: 0
[07:10:23] info: Actual:   -1
[07:10:23] info:    at System.SpanTests.ReadOnlySpanTests.DefaultFilledIndexOfMany_Char()
[07:10:23] info:    at System.Reflection.MethodInvoker.InterpretedInvoke(Object obj, IntPtr* args)
[07:10:23] info:    at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
[07:11:13] fail: [FAIL] System.SpanTests.SpanTests.IndexOfAny_LastIndexOfAny_AlgComplexity_Chars
[07:11:13] info: Assert.Equal() Failure
[07:11:13] info: Expected: 7168
[07:11:13] info: Actual:   -1
[07:11:13] info:    at System.SpanTests.SpanTests.RunLastIndexOfAnyAlgComplexityTest[Char](Char[] needle)
[07:11:13] info:    at System.SpanTests.SpanTests.RunIndexOfAnyLastIndexOfAnyAlgComplexityTest[Char]()
[07:11:13] info:    at System.SpanTests.SpanTests.IndexOfAny_LastIndexOfAny_AlgComplexity_Chars()
[07:11:13] info:    at System.Reflection.MethodInvoker.InterpretedInvoke(Object obj, IntPtr* args)
[07:11:13] info:    at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
[07:11:14] info: Finished:    System.Memory.Tests.dll
```

While I'm mildly suspicious of the vectorized code, since it was previously being guarded, I don't see anything wrong with it. It's possible we're now exercising some part of the interp v128 code that we weren't before, but the test doesn't seem to use those APIs heavily. I double- and triple-checked that I'm using the correct clang intrinsic for the relevant opcode, and even experimented with different intrinsics in case I misread something. It also fails with the jiterpreter disabled, so that isn't a factor.

**@kg** (Contributor, Author) commented Apr 15, 2023

Disabling the interp v128 intrinsics doesn't fix it either (I verified that doing so results in mostly-scalar code with wasm packedsimd opcodes scattered through it), which seems to indicate that the PackedSimd operation(s) are the problem.

**@kg** (Contributor, Author) commented Apr 17, 2023

Radek pointed out that this failure could be #84775. It could be that there's something wrong with the way the interp aligns v128 arguments or return values, or that my copy-paste produced an incorrect alignment.

kg added commits (most recent push May 5, 2023 17:13):

- Fix bugs in the jiterpreter-support import management implementation
- Enable interpreter SIMD for WASM
- Minor cleanups
- Fix merge
- Checkpoint PackedSimd
- Implement a few PackedSimd methods in the interp
- Add packedsimd ops to jiterpreter
- Move the interp simd opcode -> wasm opcode mapping into the C tables
- Add intrinsic ids for most of the remaining packedsimd methods
- Map most of the PackedSimd methods to intrinsics
- Update genmintops
- Fix merge damage
- Fix build
- Add more wasm opcodes
- Add more wasm opcodes
- Add bitmask intrinsics
- Add missing opcodes to transform-simd
- Use HOST_BROWSER instead of HOST_WASM to fix wasi build
- Implement the pack-n-elements vector instructions
- Disable bitselect because it's broken somehow
- Simplify vector packing
- Add browser-bench measurements for packing vectors
- Disable more opcodes
- Disable more opcodes
- Fix PackedSimd feature detection on non-wasm targets
- Maybe fix linux interp assertion
- Don't fail transform for unsupported PackedSimd methods on non-browser targets
- Fix i64 popcnt
- Add basic R4 v128 intrinsics (add/sub/div/mul)
- Re-enable more jiterpreter simd
@kg force-pushed the wasm-jiterpreter-simd branch from 8e00956 to 28bbc99 (May 6, 2023 00:36)
@kg removed the NO-MERGE label (May 6, 2023)
**@kg** (Contributor, Author) commented May 6, 2023

I added some basic R4 vector ops to the interp and changed some asserts so that it's possible to run the SIMD version of Pavel's raytracer. It completes in ~16k ms with these changes compared to ~31k ms before, nearly a 50% reduction.

The crash issues and test failures I was hitting before seem to have been at least partially caused by an arglist alignment issue that is now fixed on main, so hopefully tests will pass now.

**@kg** (Contributor, Author) commented May 6, 2023

The remaining issue on this PR seems to be that once we report PackedSimd support, writing the testResults.xml file ends up dropping a few characters at the start or end of the stream. It's not clear to me why this would happen.

**@vargaz** (Contributor) commented May 6, 2023

Can this be turned off by default so the PR can be merged?

**@kg** (Contributor, Author) commented May 6, 2023

> Can this be turned off by default so the PR can be merged?

I could probably comment out the PackedSimd support, but that makes this PR much less useful since the BCL uses it now. I'm fine with doing that, assuming none of the CI tests fail and the only remaining problem is the XML, since it means we're not merging tremendously broken code.

**@kg** (Contributor, Author) commented May 6, 2023

All the failures I saw were the test-results XML size problem, so I added runtime options to govern interp v128 and PackedSimd support and set them both to FALSE for browser.
