Add support for folding core SIMD operations that produce TYP_MASK on newer hardware #104875
Conversation
/azp run runtime-coreclr jitstress-isas-x86

Azure Pipelines successfully started running 1 pipeline(s).
```
@@ -7369,6 +7345,67 @@ struct GenTreeVecCon : public GenTree
#endif
};

// GenTreeMskCon -- mask constant (GT_CNS_MSK)
//
struct GenTreeMskCon : public GenTree
```
Should this be under `FEATURE_MASKED_HW_INTRINSICS` too?
This is matching how `GenTreeVecCon` exists, which is likewise not under `FEATURE_SIMD`.

AFAIR this is because it makes checks around `GT_CNS_MSK` and `GT_CNS_VEC` much simpler, since there is a guaranteed ID for them; they just won't be encountered. If we did put it under an ifdef, then all the places that handle `GT_CNS_VEC`, including places like `OperIsConst`, would need to start being ifdef'd as well.

It might be worth doing this long term, but I'd rather investigate such cleanup in a separate PR.
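For illustration, a minimal sketch of why an always-defined opcode ID keeps the constant checks simple (the enum ordering here is hypothetical, not the actual gtlist.h):

```cpp
// Hypothetical, simplified opcode enum; the real ordering lives in gtlist.h.
enum genTreeOps
{
    GT_CNS_INT,
    GT_CNS_LNG,
    GT_CNS_DBL,
    GT_CNS_STR,
    GT_CNS_VEC,
    GT_CNS_MSK, // always defined, even when masked HW intrinsics are off
    GT_ADD,
    // ...
};

// With a guaranteed ID, a check like OperIsConst stays a simple,
// ifdef-free range check; GT_CNS_MSK just never appears at run time
// when the feature is off.
static bool OperIsConst(genTreeOps oper)
{
    return (oper >= GT_CNS_INT) && (oper <= GT_CNS_MSK);
}

// If GT_CNS_MSK only existed under the feature define, this check (and
// every similar one) would need the same ifdef:
//
// #if defined(FEATURE_MASKED_HW_INTRINSICS)
//     return (oper >= GT_CNS_INT) && (oper <= GT_CNS_MSK);
// #else
//     return (oper >= GT_CNS_INT) && (oper <= GT_CNS_VEC);
// #endif
```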
```
@@ -30407,6 +30598,8 @@ GenTree* Compiler::gtFoldExprHWIntrinsic(GenTreeHWIntrinsic* tree)
                uint32_t result = BitOperations::LeadingZeroCount(static_cast<uint64_t>(value));

                cnsNode->AsIntConCommon()->SetIconValue(static_cast<int32_t>(result));
                cnsNode->gtType = retType;
```
Why is this change needed?
Without it, the type of the node is incorrect. Arm64 expects a `TYP_INT` result but takes a `TYP_LONG` input.
So was it broken until now? Wondering why no test caught this earlier.
It's not been showing up as broken because of how 32-bit results work on most platforms. That is, the 32-bit and 64-bit registers are the same, and setting the lower 32 bits zeros the upper 32 bits. So even with the node mistyped, the codegen and consumption in most scenarios was correct.

This got caught because of the new assert I added that validates we didn't accidentally change the return type as part of adding the `TYP_MASK` folding (since that folding can take a SIMD input and need a MASK result).
So it was a bug in that the IR was technically wrong, but because of how implicit conversions and the general operation actually worked, things generally just worked as expected.
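To make the mismatch concrete, here is a minimal standalone sketch (illustrative types and names, not the JIT's actual code) of the shape of this fold: the input constant is 64-bit but the intrinsic's return type is 32-bit, so the folded node must be retyped, not just revalued:

```cpp
#include <cassert>
#include <cstdint>

enum var_types { TYP_INT, TYP_LONG }; // stand-ins for the JIT's type enum

struct ConstNode
{
    var_types type;
    int64_t   value;
};

// Fold LeadingZeroCount(cns): consumes a TYP_LONG input, produces TYP_INT.
void FoldLeadingZeroCount(ConstNode& cns, var_types retType)
{
    assert(cns.type == TYP_LONG);
    assert(retType == TYP_INT);

    // Count leading zero bits of the 64-bit input value.
    uint32_t result = 0;
    for (uint64_t bit = 1ull << 63;
         (bit != 0) && ((static_cast<uint64_t>(cns.value) & bit) == 0);
         bit >>= 1)
    {
        result++;
    }

    cns.value = static_cast<int32_t>(result);
    cns.type  = retType; // without this line the node stays TYP_LONG, which
                         // only "works" because 32-bit register writes zero
                         // the upper half on most platforms
}
```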
```
            GenTreeMskCon* mskCon = tree->AsMskCon();
            genSetRegToConst(mskCon->GetRegNum(), targetType, &mskCon->gtSimdMaskVal);
#else
            unreached();
```
At multiple places, for `GT_CNS_MSK` we have:

```cpp
#if defined(FEATURE_MASKED_HW_INTRINSICS)
    // logic
#else
    unreached();
#endif
```

Wondering if there is a single place where we can add this check (for example, during the creation of such a node) and then assume that we will never have `GT_CNS_MSK` when `FEATURE_MASKED_HW_INTRINSICS` is not enabled? I think this goes with my earlier comment about just having the definition of `GT_CNS_MSK` under `#ifdef FEATURE_MASKED_HW_INTRINSICS`.
Yes, that may be better overall. I'll investigate it in a follow-up PR as indicated above.
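A rough sketch of what that consolidation could look like (hypothetical helper name and shape; this illustrates the suggestion, not code from this PR):

```cpp
// Centralize the feature check at node creation. If this invariant holds,
// downstream consumers of GT_CNS_MSK could drop their "#else unreached();"
// arms and simply assume the node implies the feature.
GenTreeMskCon* Compiler::gtNewMskConNode(var_types type)
{
#if defined(FEATURE_MASKED_HW_INTRINSICS)
    GenTreeMskCon* mskCon = new (this, GT_CNS_MSK) GenTreeMskCon(type);
    return mskCon;
#else
    unreached(); // the single place that guards GT_CNS_MSK creation
#endif
}
```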
```
            case 2:
            {
                bitMask = 0x5555555555555555;
```
Will any of this have to change when we add "variable length vector" support?
When `TYP_SIMD` is added, we'll likely need to consider how we plan on supporting the up-to-2048-bit vectors and up-to-256-bit masks.

My guess is that we'll probably just say that we don't support over 512-bit SVE for simplicity, at least until physical hardware with 1024 or 2048 bits exists. When such hardware does exist, then we'd need to expand this to handle the additional bits.
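For context, my reading of where constants like the `0x5555...` above come from (an illustrative sketch, not the PR's code): an SVE-style predicate carries one bit per byte lane, so for N-byte elements only every Nth bit is significant, and a 64-bit mask covers at most a 512-bit vector:

```cpp
#include <cassert>
#include <cstdint>

// Build the "significant bits" mask for a given element size: one bit set
// every elemSizeBytes positions. Illustrative only.
uint64_t SignificantBitMask(unsigned elemSizeBytes)
{
    assert(elemSizeBytes != 0);

    uint64_t bitMask = 0;
    for (unsigned bit = 0; bit < 64; bit += elemSizeBytes)
    {
        bitMask |= (1ull << bit);
    }
    return bitMask;
}

// elemSizeBytes == 1 -> 0xFFFFFFFFFFFFFFFF
// elemSizeBytes == 2 -> 0x5555555555555555 (the "case 2" constant above)
// elemSizeBytes == 4 -> 0x1111111111111111
// elemSizeBytes == 8 -> 0x0101010101010101
//
// Vectors wider than 512 bits would need more than the 64 mask bits
// available here, which is why 1024/2048-bit SVE would force this logic
// to be expanded.
```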
Looks good overall. Added some questions.

jitstress failures are pre-existing; will take a look at them in a followup PR.
```
#if defined(TARGET_XARCH)
        tryHandle = op->OperIsHWIntrinsic();
#elif defined(TARGET_ARM64)
        if (op->OperIsHWIntrinsic() && op->OperIsHWIntrinsic(NI_Sve_CreateTrueMaskAll))
```
Should this condition be checking `op2->OperIsHWIntrinsic()` instead of `op->OperIsHWIntrinsic()`? Right now, it is possible for us to try to use `op2` as a `GT_HWINTRINSIC` node when it isn't one. The SVE tests are currently failing for me with the assertion `OperIs(GT_HWINTRINSIC)`, because of the line `GenTreeHWIntrinsic* cvtOp = op->AsHWIntrinsic();` below.
`OperIsHWIntrinsic` is always safe; it simply checks if the node type is `GT_HWINTRINSIC`.

It looks like there might be a missing nested check that `op2` is itself a HW intrinsic after we've confirmed the root op is `CreateTrueMaskAll`.

Really we should be normalizing `CreateTrueMaskAll` as `CNS_MSK` instead and simply not having it as part of `CvtVectorToMask` (it's an implementation detail of codegen that has no real impact on HIR), but that was a more in-depth change and so wasn't done in this PR.
Fixed in #104998.
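For reference, a hedged sketch of the shape of that fix (a fragment only; the actual change landed in #104998):

```cpp
// Confirm op is CreateTrueMaskAll, then separately confirm op2 is itself
// a HW intrinsic before reinterpreting it as one.
if (op->OperIsHWIntrinsic(NI_Sve_CreateTrueMaskAll) && op2->OperIsHWIntrinsic())
{
    GenTreeHWIntrinsic* cvtOp = op2->AsHWIntrinsic();
    // ...
}
```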
As per the title, this gives us the ability to fold nodes that produce `TYP_MASK`, giving it parity with operations that exist for NEON or AVX2.

To achieve this we introduce a new `GenTreeMskCon` type that can track mask constants. It was undesirable to reuse `GenTreeVecCon`, as there is quite a difference in what needs to be considered and how bits are interpreted for a `TYP_MASK` vs a `TYP_SIMD`.

As part of this, it also allows us to push lower-cost nodes to the right of SIMD comparisons, which can improve codegen in some cases.
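As a standalone illustration of what the fold buys (a conceptual model, not JIT code): a comparison of two constant vectors can now be evaluated at compile time into a constant mask, one bit per element, rather than emitting the vector compare at run time:

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    // Two "constant vector" operands of a SIMD greater-than comparison.
    const int32_t left[8]  = {1, 2, 3, 4, 5, 6, 7, 8};
    const int32_t right[8] = {8, 7, 6, 5, 4, 3, 2, 1};

    // The TYP_MASK-style result: bit i is set when left[i] > right[i].
    uint8_t mask = 0;
    for (int i = 0; i < 8; i++)
    {
        if (left[i] > right[i])
        {
            mask |= static_cast<uint8_t>(1u << i);
        }
    }

    printf("folded mask: 0x%02X\n", mask); // prints 0xF0
}
```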
There is a TP improvement for Linux x64/Arm64 (Clang) and Windows x86 (MSVC), and a TP regression for Windows and Linux x64 (MSVC). There is no change to TP for Windows/Linux Arm64. The regression is up to `+0.14%` (Windows x64 MSVC) and `+0.34%` (Linux x64 MSVC), but it is primarily "pay for play" in that the methods it impacts are all using SIMD mask nodes and so hit the new processing logic. Methods which aren't using SIMD mask nodes (such as on pre-AVX512 machines or Arm64 machines) don't incur any penalty here.