-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Sort out ARM load/store instruction size issues #20126
Conversation
7485a02
to
c7357d6
Compare
@dotnet-bot test Ubuntu arm Cross Checked crossgen_comparison Build and Test |
53e9635
to
0c859da
Compare
54d82b3
to
d300051
Compare
No FX diffs on ARM32. On ARM64 there are many diffs due to the G_M55349_IG12:
- ldrsw x0, [fp,#24] // [V03 loc1]
+ ldr w0, [fp,#24] // [V03 loc1]
tst w0, #255
beq G_M55349_IG13 a few due to byte/halfword instructions becoming EA_4BYTE instead of EA_8BYTE ldr x1, [fp,#104] // [V04 loc0]
- ldrsh x1, [x1]
+ ldrsh w1, [x1]
strh w1, [x0,#8]
b G_M14724_IG34 and some code size improvements due to the use of ldar in many more places: mov x0, x21
- ldrsw x22, [x0]
- dmb oshld
- ldrsw x0, [x19,#24]
+ ldar w22, [x0]
+ ldr w0, [x19,#24]
and w23, w22, w0
ldr x0, [x19,#8] These account for ~1.5 kbytes of diff
|
1ea8939
to
31703db
Compare
@BruceForstall Any idea what exactly failed in the arm build? Looks like all test passed but xunit exited with code 1?! |
@mikedn It failed to load a Xunit test wrapper (https://github.com/dotnet/coreclr/issues/20392)
|
e3146c7
to
e2c4939
Compare
@dotnet-bot test Ubuntu arm Cross Checked no_tiered_compilation_innerloop Build and Test Assembly load failure again... |
@dotnet-bot test Ubuntu arm Cross Checked no_tiered_compilation_innerloop Build and Test
Ain't that groovy? |
ARM64 FX minopts diff:
In minopts there are additional diffs due to the - ldrsw x2, [x0,#8]
- cmp x1, x2
+ ldr w2, [x0,#8]
+ cmp w1, w2 |
@dotnet-bot test Ubuntu arm Cross Checked no_tiered_compilation_innerloop Build and Test Assembly load failure again... |
70abe9d
to
a2f36d4
Compare
@dotnet-bot test this please |
@dotnet/jit-contrib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks Good
src/jit/emitarm.cpp
Outdated
// Notes: | ||
// Only IF_T1_C encoded instructions are supported. | ||
// | ||
/*static*/ int emitter::insUnscaleImm(instruction ins, insFormat fmt, int imm) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems a bit odd the have a parameter that is always required to be IF_T1_C
Do you believe that we will extend support to include more that one instruction format here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it's unlikely to add more instructions formats here.
AFAIR this is just a side effect of the way I refactored the code - first I added ins
+ fmt
and removed size
parameter, at that point it was still supporting more formats. Then I realized that all formats but IF_T1_C
had trivial scaling needs, similar to that of many other formats that did some kind of scaling without using insUnscaleImm
. So I removed everything but IF_T1_C
and added an assert to be sure the function doesn't get misused by accident.
I suppose I could go one step further - rename it to insUnscaleImm_IF_T1_C
and remove the format parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that it needs to be renamed, the extra parameter can just be removed.
We still have this assert which prevents unexpected callers from using this method:
default:
assert(!"Invalid IF_T1_C instruction");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, parameter removed.
[MethodImpl(MethodImplOptions.NoInlining)]
static int Test(Foo[] a, long i)
{
return a[(int)checked(i + 1)].x;
} generates F9400FA0 ldr x0, [fp,#24]
F9400BA1 ldr x1, [fp,#16]
52800022 mov w2, #1
93407C42 sxtw x2, x2
; generates a 64 bit value in x1
AB020021 adds x1, x1, x2
54000146 bvs G_M55886_IG04
B9400802 ldr w2, [x0,#8]
6B02003F cmp w1, w2
54000102 bhs G_M55886_IG05
52800622 mov w2, #49
; the 64 bit value in x1 is used as is, the cast to int was dropped
9B020020 madd x0, x1, x2, x0
91004000 add x0, x0, #16
B9400000 ldr w0, [x0] |
@dotnet/jit-contrib Take for 3.0 or hold off? |
I would be happy to take this for 3.0, as the code issue has been identified as an issue, and this is also some good code cleanup. I'd like to get @briansull to weigh in on it as well (i.e. is there more risk that I might think?) and the conflicts will need to be resolved. |
I am fine with taking this change for 3.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks Good
What should be done about the index widening issue in |
I missed that this was a bug, rather than a CQ issue. I think it's reasonable to do that in a separate PR. It's a pretty corner case, and it seems that although the bug was mentioned in #22528, it doesn't appear that a test case was ever constructed to demonstrate the issue. |
I thought that it was already fixed with: Fix genCodeForIndexAddr #22528 |
That fixed it for x64, not for arm64. |
Yes, #22528 was only for x64. At the time this PR was already opened and I was hoping to make the ARM64 fix here. But then it turned out that the ARM64 fix is more complex and then this PR didn't get much attention so I kept postponing doing it.
Yes, it seems difficult to construct a reasonable example that exhibits observable incorrect behavior. The little example above does show that incorrect code is being generated but it's already very contrived. In particular, I haven't been able to find any example that would result in an out of range access due to an incorrect range check. I'll fix conflicts later today. |
genCodeForArrIndex and genCodeForArrOffset emit a ldrsw on ARM64 but that's not necessary. Array lengths are always positive. Lower bounds are signed but then the subsequent subtract is anyway EA_4BYTE so sign extension isn't actually needed. Just use ldr on both ARM32 and ARM64.
genCodeForIndexAddr's range check generation is messed up: - It uses ldrsw to load array length on ARM64. Not needed, the length is positive. - It uses sxtw to sign etxend the array length if the index is native int. But it was already extended by the load. - It creates IND and LEA nodes out of thin air. Just call the appropiate emitter functions. - It always generates a TYP_I_IMPL cmp, even when the index is TYP_INT. Well, that's just bogus.
The scaling of immediate operands is a property of the instruction and its encoding. It doesnt' make sense to throw the instruction size (emitAttr) into the mix, that's a codegen/emitter specific concept. On ARM32 it's also almost meaningless, at least in the case of integer types - all instructions are really EA_4BYTE, even ldrb, ldrh etc. The ARM64 emitter already extracts the scaling factor from the instruction. It can't use the instruction size as on ARM64 the size is used to select between 32 bit and 64 bit instructions so it's never EA_1BYTE/EA_2BYTE.
ARM64's ins_Load returns INS_ldrsw for TYP_INT but there's nothing in the JIT type system that requires sign extending TYP_INT values on load. The fact that an indir node has TYP_INT doesn't even imply that the value to load is signed, it may be unsigned and indir nodes will never have type TYP_UINT nor have the GTF_UNSIGNED flag set. XARCH uses a mov (rather than movsxd, the equivalent of ldrsw) so it zero extends. There's no reason for ARM64 to behave differently and doing so makes it more difficult to share codegen logic between XARCH and ARM64 Other ARM64 compilers also use ldr rather than ldrsw. This requires patching up emitInsAdjustLoadStoreAttr so EA_4BYTE loads don't end up using EA_8BYTE, which ldrsw requires.
In particular, cleanup selection of acquire/release instructions. The existing code starts by selecting a "normal" instruction, only to throw it away and redo the type/size logic in the volatile case. And get it wrong in the process, it required that "targetType" be TYP_UINT or TYP_LONG to use ldar. But there are no TYP_UINT indirs. Also rename "targetType" to "type", using "target" is misleading. The real target type would be genActualType(tree->TypeGet()).
- Require EA_4BYTE for byte/halfword instructions on ARM32. - Remove emitInsAdjustLoadStoreAttr on ARM64. Getting the correct instruction size simply requires using emitActualTypeSize, that will provide the correct size on both ARM32 and ARM64. - Replace emitTypeSize with emitActualTypeSize as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks for your patience @mikedn!
One thing that it's worth pointing out is that Future cleanup of the emitter might consider packing the instruction and the attr in a struct (containing an unsigned bitfield for example) and making |
Merging, since the two failed legs are x64, and all the changes are arm/arm64 specific. |
* Avoid using ins_Load/ins_Store with constant type * Use ldr to load array lengths/lower bounds genCodeForArrIndex and genCodeForArrOffset emit a ldrsw on ARM64 but that's not necessary. Array lengths are always positive. Lower bounds are signed but then the subsequent subtract is anyway EA_4BYTE so sign extension isn't actually needed. Just use ldr on both ARM32 and ARM64. * Use ldr to load array length (part 2) genCodeForIndexAddr's range check generation is messed up: - It uses ldrsw to load array length on ARM64. Not needed, the length is positive. - It uses sxtw to sign etxend the array length if the index is native int. But it was already extended by the load. - It creates IND and LEA nodes out of thin air. Just call the appropiate emitter functions. - It always generates a TYP_I_IMPL cmp, even when the index is TYP_INT. Well, that's just bogus. * Stop the using the instruction size for immediate scaling on ARM32 The scaling of immediate operands is a property of the instruction and its encoding. It doesnt' make sense to throw the instruction size (emitAttr) into the mix, that's a codegen/emitter specific concept. On ARM32 it's also almost meaningless, at least in the case of integer types - all instructions are really EA_4BYTE, even ldrb, ldrh etc. The ARM64 emitter already extracts the scaling factor from the instruction. It can't use the instruction size as on ARM64 the size is used to select between 32 bit and 64 bit instructions so it's never EA_1BYTE/EA_2BYTE. * Stop using ldrsw for TYP_INT loads ARM64's ins_Load returns INS_ldrsw for TYP_INT but there's nothing in the JIT type system that requires sign extending TYP_INT values on load. The fact that an indir node has TYP_INT doesn't even imply that the value to load is signed, it may be unsigned and indir nodes will never have type TYP_UINT nor have the GTF_UNSIGNED flag set. XARCH uses a mov (rather than movsxd, the equivalent of ldrsw) so it zero extends. There's no reason for ARM64 to behave differently and doing so makes it more difficult to share codegen logic between XARCH and ARM64 Other ARM64 compilers also use ldr rather than ldrsw. This requires patching up emitInsAdjustLoadStoreAttr so EA_4BYTE loads don't end up using EA_8BYTE, which ldrsw requires. * Cleanup genCodeForIndir/genCodeForStoreInd In particular, cleanup selection of acquire/release instructions. The existing code starts by selecting a "normal" instruction, only to throw it away and redo the type/size logic in the volatile case. And get it wrong in the process, it required that "targetType" be TYP_UINT or TYP_LONG to use ldar. But there are no TYP_UINT indirs. Also rename "targetType" to "type", using "target" is misleading. The real target type would be genActualType(tree->TypeGet()). * Remove ARM32/64 load/store instruction size inconsistencies - Require EA_4BYTE for byte/halfword instructions on ARM32. - Remove emitInsAdjustLoadStoreAttr on ARM64. Getting the correct instruction size simply requires using emitActualTypeSize, that will provide the correct size on both ARM32 and ARM64. - Replace emitTypeSize with emitActualTypeSize as needed. * Remove unnecessary insUnscaleImm parameter Commit migrated from dotnet/coreclr@c6824d5
This fixes a number of inconsistencies in ARM 32/64's handling of loads (and to a lesser extent of stores as well).
For a start, ARM32 and 64 have a different approach to load instruction size (
emitAttr
). The ARM32 emitter wantsEA_1BYTE
/EA_2BYTE
for byte/halfword loads. On the other hand, the ARM64 emitter expectsEA_4BYTE
orEA_8BYTE
, to it the instruction size indicates the size of the destination register and not the size of the memory location (that is implied by the instruction anyway). It makes more sense for ARM32 to follow the ARM64 approach and requireEA_4BYTE
for small loads, at least because it avoids the need to write different code for ARM32 and ARM64. Besides, ARM32 doesn't really haveEA_1BYTE
/EA_2BYTE
instructions, they're all 4 byte instructions.The only reason why the ARM32 emitter cares about the load instruction size is immediate operand scaling. Of course, the scale factor can be deduce from the instruction itself and the size becomes is irrelevant. Asserting that it's
EA_4BYTE
just to enforce consistency.Then ARM64 itself handles load size in an unusual manner. Instead of simply using
emitActualTypeSize
when callingemitInsLoadStoreOp
, it usesemitTypeSize
and thenemitInsLoadStoreOp
fixes the size usingemitInsAdjustLoadStoreAttr
. And that not only promotesEA_1BYTE
/EA_2BYTE
toEA_4BYTE
, it also promotesEA_4BYTE
toEA_8BYTE
. Long story short - callingemitInsLoadStoreOp(INS_ldr, EA_4BYTE)
results in a 8 byteldr
being unexpectedly emitted!And the reason for
emitInsAdjustLoadStoreAttr
weird behavior is another ARM64 inconsistency - itsins_Load
returnsINS_ldrsw
forTYP_INT
. AndINS_ldrsw
requiresEA_8BYTE
soemitInsAdjustLoadStoreAttr
addresses that need. But the use ofINS_ldrsw
is actually bogus, it should be justINS_ldr
:TYP_INT
load sign extends the 32 bit value to 64 bitTYP_INT
does not imply that the loaded value is actually signedmov
, notmovsxd
forTYP_INT
loadsSo change
ins_Load
to returnINS_ldr
, notINS_ldrsw
, and get rid ofemitInsAdjustLoadStoreAttr
.Along the way also remove some
ins_Load(TYP_INT)
and similar that just make the code harder to read.And then fix
genCodeForIndexAddr
which tries to sign extend the array length unnecessarily and then always emits a 64 bit cmp, disregarding the index type.And fix the volatile load/store logic too, which got caught into this
TYP_INT
load confusion and failed to actually emitINS_ldar
, despite trying to do so.Oh well, I suppose it's kind of confusing. I'm just trying to finish my cast operand containment thing and it seems that I keep running into all kind of weird things...