Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Added Sve.CreateBreakPropagateMask #104704

Merged
merged 13 commits into from
Jul 17, 2024

Conversation

TIHan
Copy link
Contributor

@TIHan TIHan commented Jul 11, 2024

Contributes to #99957

Adds

  • CreateBreakPropagateMask

Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

1 similar comment
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

@TIHan TIHan marked this pull request as ready for review July 11, 2024 19:30
@TIHan
Copy link
Contributor Author

TIHan commented Jul 11, 2024

@dotnet/arm64-contrib @kunalspathak @tannergooding this is ready.
Ran stress tester 3 times and it all looks good:

===================Running default===================
------------------- {} -------------------
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_CreateBreakPropagateMask_byte() : 31
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_CreateBreakPropagateMask_ushort() : 31
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_CreateBreakPropagateMask_uint() : 31
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_CreateBreakPropagateMask_ulong() : 31
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_CreateBreakPropagateMask_sbyte() : 31
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_CreateBreakPropagateMask_short() : 31
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_CreateBreakPropagateMask_int() : 31
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_CreateBreakPropagateMask_long() : 31
===================Running jitstress===================
------------------- {'JitMinOpts': '1'} -------------------
------------------- {'JitStress': '1'} -------------------
------------------- {'JitStress': '2'} -------------------
------------------- {'JitStress': '1', 'TieredCompilation': '1'} -------------------
------------------- {'JitStress': '2', 'TieredCompilation': '1'} -------------------
------------------- {'TailcallStress': '1'} -------------------
------------------- {'ReadyToRun': '0'} -------------------
===================Running jitstressregs===================
------------------- {'JitStressRegs': '1'} -------------------
------------------- {'JitStressRegs': '2'} -------------------
------------------- {'JitStressRegs': '3'} -------------------
------------------- {'JitStressRegs': '4'} -------------------
------------------- {'JitStressRegs': '8'} -------------------
------------------- {'JitStressRegs': '0x10'} -------------------
------------------- {'JitStressRegs': '0x80'} -------------------
------------------- {'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStressRegs': '0x2000'} -------------------
===================Running jitstress2-jitstressregs===================
------------------- {'JitStress': '2', 'JitStressRegs': '1'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '2'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '3'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '4'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '8'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x10'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x80'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x2000'} -------------------

@tannergooding
Copy link
Member

Overall this looks good/correct to me.

There's a couple small nits, some callouts around pre-existing code that doesn't look to be quite correct, and then one real case that I'm not sure is correctly handled (as op3 should only ever be a contained zero for CreateBreakPropagateMask).

@TIHan
Copy link
Contributor Author

TIHan commented Jul 12, 2024

@tannergooding Did feedback.

Stress results:

===================Running default===================
------------------- {} -------------------
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_CreateBreakPropagateMask_byte() : 31
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_CreateBreakPropagateMask_ushort() : 31
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_CreateBreakPropagateMask_uint() : 31
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_CreateBreakPropagateMask_ulong() : 31
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_CreateBreakPropagateMask_sbyte() : 31
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_CreateBreakPropagateMask_short() : 31
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_CreateBreakPropagateMask_int() : 31
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_CreateBreakPropagateMask_long() : 31
===================Running jitstress===================
------------------- {'JitMinOpts': '1'} -------------------
------------------- {'JitStress': '1'} -------------------
------------------- {'JitStress': '2'} -------------------
------------------- {'JitStress': '1', 'TieredCompilation': '1'} -------------------
------------------- {'JitStress': '2', 'TieredCompilation': '1'} -------------------
------------------- {'TailcallStress': '1'} -------------------
------------------- {'ReadyToRun': '0'} -------------------
===================Running jitstressregs===================
------------------- {'JitStressRegs': '1'} -------------------
------------------- {'JitStressRegs': '2'} -------------------
------------------- {'JitStressRegs': '3'} -------------------
------------------- {'JitStressRegs': '4'} -------------------
------------------- {'JitStressRegs': '8'} -------------------
------------------- {'JitStressRegs': '0x10'} -------------------
------------------- {'JitStressRegs': '0x80'} -------------------
------------------- {'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStressRegs': '0x2000'} -------------------
===================Running jitstress2-jitstressregs===================
------------------- {'JitStress': '2', 'JitStressRegs': '1'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '2'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '3'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '4'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '8'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x10'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x80'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x2000'} -------------------

@kunalspathak
Copy link
Member

kunalspathak commented Jul 13, 2024

I might have asked this earlier somewhere, but BRKN operates on .B and returns .B, so shouldn't we just have one API:

Vector<byte> CreateBreakPropagateMask(Vector<byte> totalMask, Vector<byte> fromMask)

Currently with other overloads, the instruction will calculate the result in byte lanes, but we will (mis)interpret it as short, uint, etc.?

@tannergooding
Copy link
Member

so shouldn't we just have one API:

I elaborated on this the other place you asked (see #104502 (comment)). Having one for each type is correct and still fits into how predicates actually work.


For x64 we have a kmask register that is always 64-bits and we then take the lowest n-bits such as:

  • Vector512 - 64-bit mask, no ignored bits
  • Vector512 - 32-bit mask, ignored upper 32-bits
  • Vector256 - 32-bit mask, ignored upper 32-bits
  • Vector512 - 16-bit mask, ignored upper 48-bits
  • Vector256 - 16-bit mask, ignored upper 48-bits
  • Vector128 - 16-bit mask, ignored upper 48-bits
  • Vector512 - 8-bit mask, ignored upper 56-bits
  • Vector256 - 8-bit mask, ignored upper 56-bits
  • Vector128 - 8-bit mask, ignored upper 56-bits
  • Vector256 - 4-bit mask, ignored upper 60-bits
  • Vector128 - 4-bit mask, ignored upper 60-bits
  • Vector128 - 2-bit mask, ignored upper 62-bits

So for x64, a mask of Vector128<float> being 4-bits has to actually convert to get a valid Vector128<byte> mask which needs 16-bits; so there are conversion instructions and suffixes that say how many bits to read.


While for Arm64 we have a pg register that is always sizeof(Vector<T>)-bits (so if SVE registers are 128-bits, then its size is 16-bytes and therefore there are 16-bits in the pg register) we then take bits as:

  • Vector - take every bit
  • Vector - take every other bit
  • Vector - take every fourth bit
  • Vector - take every eighth bit

So there is no conversion required between these to get something "usable". There may, however, be some normalization consideration in some edge cases.

For an instruction like this the mask returned despite being .B is actually universal because it's simply indicating what bytes were valid and that cleanly maps to which elements end up as valid. It then works as expected when you're matching the relevant load types with the type you read the FFR as

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some questions.

src/coreclr/jit/hwintrinsic.cpp Show resolved Hide resolved
@@ -58,6 +58,7 @@ HARDWARE_INTRINSIC(Sve, CreateBreakAfterMask,
HARDWARE_INTRINSIC(Sve, CreateBreakAfterPropagateMask, -1, 3, true, {INS_sve_brkpa, INS_sve_brkpa, INS_sve_brkpa, INS_sve_brkpa, INS_sve_brkpa, INS_sve_brkpa, INS_sve_brkpa, INS_sve_brkpa, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_ExplicitMaskedOperation|HW_Flag_ReturnsPerElementMask|HW_Flag_SpecialCodeGen)
HARDWARE_INTRINSIC(Sve, CreateBreakBeforeMask, -1, 2, true, {INS_sve_brkb, INS_sve_brkb, INS_sve_brkb, INS_sve_brkb, INS_sve_brkb, INS_sve_brkb, INS_sve_brkb, INS_sve_brkb, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_ExplicitMaskedOperation|HW_Flag_ReturnsPerElementMask|HW_Flag_SpecialCodeGen)
HARDWARE_INTRINSIC(Sve, CreateBreakBeforePropagateMask, -1, 3, true, {INS_sve_brkpb, INS_sve_brkpb, INS_sve_brkpb, INS_sve_brkpb, INS_sve_brkpb, INS_sve_brkpb, INS_sve_brkpb, INS_sve_brkpb, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_ExplicitMaskedOperation|HW_Flag_ReturnsPerElementMask|HW_Flag_SpecialCodeGen)
HARDWARE_INTRINSIC(Sve, CreateBreakPropagateMask, -1, -1, false, {INS_sve_brkn, INS_sve_brkn, INS_sve_brkn, INS_sve_brkn, INS_sve_brkn, INS_sve_brkn, INS_sve_brkn, INS_sve_brkn, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_EmbeddedMaskedOperation|HW_Flag_ReturnsPerElementMask|HW_Flag_HasRMWSemantics)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should't this have HW_Flag_ExplicitMaskedOperation too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the above comment, I can't combine ExplicitMaskedOperation and EmbeddedMaskedOperation.

assert(intrin.op3->IsVectorZero());
tgtPrefUse = BuildUse(embOp2Node->Op(2));
srcCount += 1;
srcCount += BuildDelayFreeUses(embOp2Node->Op(1), embOp2Node->Op(2));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am little worried about the handling of this. Can you share the JitDump of one of the simpler repro scenario?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll get a JitDump of the LoadBasicScenario.

What specifically are you worried about?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we are not building refpositions in order. Basically it should be for Op(1), Op(2) and then Op(3), but here, we are doing Op(2), Op(1) and Op(3).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have fixed this properly and all tests are passing.

{
assert(!tgtPrefOp1);
srcCount += BuildDelayFreeUses(intrin.op1, intrin.op2, predMask);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, I just realized that in #104002 this should have been:

Suggested change
srcCount += BuildDelayFreeUses(intrin.op1, intrin.op2, predMask);
srcCount += BuildDelayFreeUses(intrin.op2, intrin.op1, predMask);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should actually be fine? intrin.op2 is the RMW node, and BuildDelayFreeUses' second parameter is for the RMW node.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, if you see the summary docs of BuildDelayFreeUses(), node is the node for which we want to create the RefPosition for, in this case intrin.op2 and rmwNode is the node for which we want to make sure that the register doesn't collide with, in this case, it is intrin.op1. See some examples of places where we call BuildDelayFreeUses().

//------------------------------------------------------------------------
// BuildDelayFreeUses: Build Use RefPositions for an operand that might be contained,
//                     and which may need to be marked delayRegFree
//
// Arguments:
//    node              - The node of interest
//    rmwNode           - The node that has RMW semantics (if applicable)
//    candidates        - The set of candidates for the uses
//    useRefPositionRef - If a use RefPosition is created, returns it. If none created, sets it to nullptr.
//
// REVIEW: useRefPositionRef is not consistently set. Also, sometimes this function creates multiple RefPositions
// but can only return one. Does it matter which one gets returned?
//
// Return Value:
//    The number of source registers used by the *parent* of this node.
//
int LinearScan::BuildDelayFreeUses(GenTree*         node,
                                   GenTree*         rmwNode,
                                   SingleTypeRegSet candidates,
                                   RefPosition**    useRefPositionRef)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I make that change, I get the assertion:

Errors:

Assert failure(PID 15364 [0x00003c04], Thread: 6432 [0x1920]): Assertion failed '!"removeListNode didn't find the node"' in 'JIT.HardwareIntrinsics.Arm._Sve.SimpleBinaryOpTest__Sve_CreateMaskForFirstActiveElement_sbyte:RunBasicScenario_UnsafeRead():this' during 'Linear scan register alloc' (IL size 110; hash 0xd48bee8e; FullOpts)

File: C:\work\runtime\src\coreclr\jit\lsrabuild.cpp:69
Image: C:\Users\dotnetperfuser\Desktop\work\windows.arm64.Checked\Tests\Core_Root\corerun.exe

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed it properly.

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@kunalspathak kunalspathak merged commit 5677d92 into dotnet:main Jul 17, 2024
166 of 167 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Aug 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants