Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Sort out ARM load/store instruction size issues #20126

Merged
merged 8 commits into from
May 30, 2019

Conversation

mikedn
Copy link

@mikedn mikedn commented Sep 25, 2018

This fixes a number of inconsistencies in ARM 32/64's handling of loads (and to a lesser extent of stores as well).

For a start, ARM32 and 64 have a different approach to load instruction size (emitAttr). The ARM32 emitter wants EA_1BYTE/EA_2BYTE for byte/halfword loads. On the other hand, the ARM64 emitter expects EA_4BYTE or EA_8BYTE, to it the instruction size indicates the size of the destination register and not the size of the memory location (that is implied by the instruction anyway). It makes more sense for ARM32 to follow the ARM64 approach and require EA_4BYTE for small loads, at least because it avoids the need to write different code for ARM32 and ARM64. Besides, ARM32 doesn't really have EA_1BYTE/EA_2BYTE instructions, they're all 4 byte instructions.

The only reason why the ARM32 emitter cares about the load instruction size is immediate operand scaling. Of course, the scale factor can be deduce from the instruction itself and the size becomes is irrelevant. Asserting that it's EA_4BYTE just to enforce consistency.

Then ARM64 itself handles load size in an unusual manner. Instead of simply using emitActualTypeSize when calling emitInsLoadStoreOp, it uses emitTypeSize and then emitInsLoadStoreOp fixes the size using emitInsAdjustLoadStoreAttr. And that not only promotes EA_1BYTE/EA_2BYTE to EA_4BYTE, it also promotes EA_4BYTE to EA_8BYTE. Long story short - calling emitInsLoadStoreOp(INS_ldr, EA_4BYTE) results in a 8 byte ldr being unexpectedly emitted!

And the reason for emitInsAdjustLoadStoreAttr weird behavior is another ARM64 inconsistency - its ins_Load returns INS_ldrsw for TYP_INT. And INS_ldrsw requires EA_8BYTE so emitInsAdjustLoadStoreAttr addresses that need. But the use of INS_ldrsw is actually bogus, it should be just INS_ldr:

  • the JIT's type model doesn't require that a TYP_INT load sign extends the 32 bit value to 64 bit
  • the use of TYP_INT does not imply that the loaded value is actually signed
  • XARCH emits mov, not movsxd for TYP_INT loads

So change ins_Load to return INS_ldr, not INS_ldrsw, and get rid of emitInsAdjustLoadStoreAttr.

Along the way also remove some ins_Load(TYP_INT) and similar that just make the code harder to read.

And then fix genCodeForIndexAddr which tries to sign extend the array length unnecessarily and then always emits a 64 bit cmp, disregarding the index type.

And fix the volatile load/store logic too, which got caught into this TYP_INT load confusion and failed to actually emit INS_ldar, despite trying to do so.

Oh well, I suppose it's kind of confusing. I'm just trying to finish my cast operand containment thing and it seems that I keep running into all kind of weird things...

@echesakov
Copy link

@dotnet-bot test Ubuntu arm Cross Checked crossgen_comparison Build and Test
@dotnet-bot test Ubuntu arm Cross Release crossgen_comparison Build and Test

@mikedn mikedn force-pushed the arm64-load-op branch 5 times, most recently from 53e9635 to 0c859da Compare October 10, 2018 05:22
@mikedn mikedn force-pushed the arm64-load-op branch 2 times, most recently from 54d82b3 to d300051 Compare October 12, 2018 17:37
@mikedn
Copy link
Author

mikedn commented Oct 12, 2018

No FX diffs on ARM32. On ARM64 there are many diffs due to the ldrsw -> ldr change

 G_M55349_IG12:
-            ldrsw   x0, [fp,#24]	// [V03 loc1]
+            ldr     w0, [fp,#24]	// [V03 loc1]
             tst     w0, #255
             beq     G_M55349_IG13

a few due to byte/halfword instructions becoming EA_4BYTE instead of EA_8BYTE

             ldr     x1, [fp,#104]	// [V04 loc0]
-            ldrsh   x1, [x1]
+            ldrsh   w1, [x1]
             strh    w1, [x0,#8]
             b       G_M14724_IG34

and some code size improvements due to the use of ldar in many more places:

             mov     x0, x21
-            ldrsw   x22, [x0]
-            dmb     oshld
-            ldrsw   x0, [x19,#24]
+            ldar    w22, [x0]
+            ldr     w0, [x19,#24]
             and     w23, w22, w0
             ldr     x0, [x19,#8]

These account for ~1.5 kbytes of diff

Total bytes of diff: -1504 (0.00% of base)
    diff is an improvement.
Top file improvements by size (bytes):
        -872 : System.Private.CoreLib.dasm (-0.01% of base)
        -496 : Microsoft.CodeAnalysis.CSharp.dasm (0.00% of base)
         -80 : System.Net.Http.dasm (-0.01% of base)
         -32 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.00% of base)
         -16 : System.Linq.Parallel.dasm (0.00% of base)
6 total files with size differences (6 improved, 0 regressed), 124 unchanged.
Top method improvements by size (bytes):
        -256 : System.Private.CoreLib.dasm - ConcurrentQueue`1:get_Count():int:this (4 methods)
         -48 : Microsoft.CodeAnalysis.CSharp.dasm - SymbolCompletionState:SpinWaitComplete(int,struct):this (4 methods)
         -48 : System.Net.Http.dasm - ConcurrentQueueSegment`1:TryDequeue(byref):bool:this (4 methods)
         -48 : System.Private.CoreLib.dasm - WorkStealingQueue:LocalPush(ref):this (2 methods)
         -32 : Microsoft.CodeAnalysis.CSharp.dasm - SourceMemberContainerTypeSymbol:ForceComplete(ref,struct):this (4 methods)
73 total methods with size differences (73 improved, 0 regressed), 113019 unchanged.
108 files had text diffs but not size diffs.
System.Private.Xml.dasm had 85832 diffs
Microsoft.CodeAnalysis.dasm had 41384 diffs
Microsoft.Diagnostics.Tracing.TraceEvent.dasm had 38596 diffs
System.Private.DataContractSerialization.dasm had 18556 diffs
System.Text.RegularExpressions.dasm had 17844 diffs

@mikedn mikedn force-pushed the arm64-load-op branch 2 times, most recently from 1ea8939 to 31703db Compare October 18, 2018 19:33
@mikedn
Copy link
Author

mikedn commented Oct 19, 2018

@BruceForstall Any idea what exactly failed in the arm build? Looks like all test passed but xunit exited with code 1?!

@echesakov
Copy link

@mikedn It failed to load a Xunit test wrapper (https://github.com/dotnet/coreclr/issues/20392)

13:58:25   System.IO.FileLoadException: Could not load file or assembly 'baseservices.varargs.XUnitWrapper, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null'. An operation is not legal in the current state. (Exception from HRESULT: 0x80131509 (COR_E_INVALIDOPERATION))

@mikedn mikedn force-pushed the arm64-load-op branch 3 times, most recently from e3146c7 to e2c4939 Compare October 24, 2018 05:47
@mikedn
Copy link
Author

mikedn commented Oct 24, 2018

@dotnet-bot test Ubuntu arm Cross Checked no_tiered_compilation_innerloop Build and Test

Assembly load failure again...

@mikedn mikedn changed the title [WIP] Sort out ARM load/store instruction size issues Sort out ARM load/store instruction size issues Oct 24, 2018
@mikedn
Copy link
Author

mikedn commented Oct 24, 2018

@dotnet-bot test Ubuntu arm Cross Checked no_tiered_compilation_innerloop Build and Test

Groovy script failed:
java.lang.NullPointerException: Cannot invoke method child() on null object
	at org.codehaus.groovy.runtime.NullObject.invokeMethod(NullObject.java:91)
	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:48)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
	at org.codehaus.groovy.runtime.callsite.NullCallSite.call(NullCallSite.java:35)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:155)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:157)
	at org.kohsuke.groovy.sandbox.impl.Checker$checkedCall$0.callStatic(Unknown Source)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:56)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:194)
	at Script1.run(Script1.groovy:10)

Ain't that groovy?

@mikedn
Copy link
Author

mikedn commented Oct 24, 2018

ARM64 FX minopts diff:

Total bytes of diff: -384 (0.00% of base)
    diff is an improvement.
Top file improvements by size (bytes):
        -224 : System.Net.NetworkInformation.dasm (-0.13% of base)
        -160 : System.Private.CoreLib.dasm (0.00% of base)
2 total files with size differences (2 improved, 0 regressed), 128 unchanged.
Top method improvements by size (bytes):
         -16 : System.Private.CoreLib.dasm - Volatile:Read(byref):int (4 methods)
         -16 : System.Private.CoreLib.dasm - SpinLock:ContinueTryEnter(int,byref):this (2 methods)
         -16 : System.Private.CoreLib.dasm - SpinLock:ContinueTryEnterWithThreadTracking(int,int,byref):this (2 methods)
         -16 : System.Private.CoreLib.dasm - SpinLock:ExitSlowPath(bool):this (2 methods)
         -16 : System.Private.CoreLib.dasm - SpinLock:get_IsHeld():bool:this (2 methods)
43 total methods with size differences (43 improved, 0 regressed), 113273 unchanged.
117 files had text diffs but not size diffs.
System.Private.Xml.dasm had 186632 diffs
Microsoft.CodeAnalysis.VisualBasic.dasm had 174484 diffs
Microsoft.CodeAnalysis.CSharp.dasm had 142040 diffs
Microsoft.CodeAnalysis.dasm had 72784 diffs
Microsoft.Diagnostics.Tracing.TraceEvent.dasm had 71612 diffs

In minopts there are additional diffs due to the genCodeForIndexAddr change:

-            ldrsw   x2, [x0,#8]
-            cmp     x1, x2
+            ldr     w2, [x0,#8]
+            cmp     w1, w2

@mikedn mikedn changed the title Sort out ARM load/store instruction size issues [WIP] Sort out ARM load/store instruction size issues Oct 24, 2018
@mikedn mikedn changed the title [WIP] Sort out ARM load/store instruction size issues Sort out ARM load/store instruction size issues Oct 24, 2018
@mikedn
Copy link
Author

mikedn commented Oct 27, 2018

@dotnet-bot test Ubuntu arm Cross Checked no_tiered_compilation_innerloop Build and Test
test Ubuntu16.04 arm64 Cross Checked no_tiered_compilation_innerloop Build and Test

Assembly load failure again...

@BruceForstall
Copy link
Member

@dotnet-bot test this please

@BruceForstall
Copy link
Member

@dotnet/jit-contrib

Copy link

@briansull briansull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Good

// Notes:
// Only IF_T1_C encoded instructions are supported.
//
/*static*/ int emitter::insUnscaleImm(instruction ins, insFormat fmt, int imm)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a bit odd the have a parameter that is always required to be IF_T1_C

Do you believe that we will extend support to include more that one instruction format here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it's unlikely to add more instructions formats here.

AFAIR this is just a side effect of the way I refactored the code - first I added ins + fmt and removed size parameter, at that point it was still supporting more formats. Then I realized that all formats but IF_T1_C had trivial scaling needs, similar to that of many other formats that did some kind of scaling without using insUnscaleImm. So I removed everything but IF_T1_C and added an assert to be sure the function doesn't get misused by accident.

I suppose I could go one step further - rename it to insUnscaleImm_IF_T1_C and remove the format parameter?

Copy link

@briansull briansull May 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that it needs to be renamed, the extra parameter can just be removed.
We still have this assert which prevents unexpected callers from using this method:

       default:
            assert(!"Invalid IF_T1_C instruction");

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, parameter removed.

@mikedn
Copy link
Author

mikedn commented Feb 18, 2019

genCodeForIndexAddr needs an extra fix, like the XARCH version it fails to widen the index:

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int Test(Foo[] a, long i)
    {
        return a[(int)checked(i + 1)].x;
    }

generates

        F9400FA0          ldr     x0, [fp,#24]
        F9400BA1          ldr     x1, [fp,#16]
        52800022          mov     w2, #1
        93407C42          sxtw    x2, x2
; generates a 64 bit value in x1
        AB020021          adds    x1, x1, x2
        54000146          bvs     G_M55886_IG04
        B9400802          ldr     w2, [x0,#8]
        6B02003F          cmp     w1, w2
        54000102          bhs     G_M55886_IG05
        52800622          mov     w2, #49
; the 64 bit value in x1 is used as is, the cast to int was dropped
        9B020020          madd    x0, x1, x2, x0
        91004000          add     x0, x0, #16
        B9400000          ldr     w0, [x0]

@RussKeldorph
Copy link

@dotnet/jit-contrib Take for 3.0 or hold off?

@CarolEidt
Copy link

I would be happy to take this for 3.0, as the code issue has been identified as an issue, and this is also some good code cleanup. I'd like to get @briansull to weigh in on it as well (i.e. is there more risk that I might think?) and the conflicts will need to be resolved.

@briansull
Copy link

I am fine with taking this change for 3.0
It does improve things quite a bit when selecting the proper load/store instructions for the ARM architectures.

Copy link

@briansull briansull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Good

@mikedn
Copy link
Author

mikedn commented May 28, 2019

What should be done about the index widening issue in genCodeForIndexAddr that I mentioned in a previous post? I've been looking at fixing that but I don't see a way to fix it without an extra temp register which means that this change will spread to lsraarm64.cpp as well. Perhaps that would be best done in a separate PR?

@CarolEidt
Copy link

What should be done about the index widening issue in genCodeForIndexAddr that I mentioned in a previous post? I've been looking at fixing that but I don't see a way to fix it without an extra temp register which means that this change will spread to lsraarm64.cpp as well. Perhaps that would be best done in a separate PR?

I missed that this was a bug, rather than a CQ issue. I think it's reasonable to do that in a separate PR. It's a pretty corner case, and it seems that although the bug was mentioned in #22528, it doesn't appear that a test case was ever constructed to demonstrate the issue.

@briansull
Copy link

I thought that it was already fixed with:

Fix genCodeForIndexAddr #22528

@CarolEidt
Copy link

I thought that it was already fixed with: Fix genCodeForIndexAddr #22528

That fixed it for x64, not for arm64.

@mikedn
Copy link
Author

mikedn commented May 29, 2019

Yes, #22528 was only for x64. At the time this PR was already opened and I was hoping to make the ARM64 fix here. But then it turned out that the ARM64 fix is more complex and then this PR didn't get much attention so I kept postponing doing it.

It's a pretty corner case, and it seems that although the bug was mentioned in #22528, it doesn't appear that a test case was ever constructed to demonstrate the issue.

Yes, it seems difficult to construct a reasonable example that exhibits observable incorrect behavior. The little example above does show that incorrect code is being generated but it's already very contrived. In particular, I haven't been able to find any example that would result in an out of range access due to an incorrect range check.

I'll fix conflicts later today.

genCodeForArrIndex and genCodeForArrOffset emit a ldrsw on ARM64 but that's not necessary. Array lengths are always positive. Lower bounds are signed but then the subsequent subtract is anyway EA_4BYTE so sign extension isn't actually needed. Just use ldr on both ARM32 and ARM64.
genCodeForIndexAddr's range check generation is messed up:
- It uses ldrsw to load array length on ARM64. Not needed, the length is positive.
- It uses sxtw to sign etxend the array length if the index is native int. But it was already extended by the load.
- It creates IND and LEA nodes out of thin air. Just call the appropiate emitter functions.
- It always generates a TYP_I_IMPL cmp, even when the index is TYP_INT. Well, that's just bogus.
The scaling of immediate operands is a property of the instruction and its encoding. It doesnt' make sense to throw the instruction size (emitAttr) into the mix, that's a codegen/emitter specific concept. On ARM32 it's also almost meaningless, at least in the case of integer types - all instructions are really EA_4BYTE, even ldrb, ldrh etc.

The ARM64 emitter already extracts the scaling factor from the instruction. It can't use the instruction size as on ARM64 the size is used to select between 32 bit and 64 bit instructions so it's never EA_1BYTE/EA_2BYTE.
ARM64's ins_Load returns INS_ldrsw for TYP_INT but there's nothing in the JIT type system that requires sign extending TYP_INT values on load. The fact that an indir node has TYP_INT doesn't even imply that the value to load is signed, it may be unsigned and indir nodes will never have type TYP_UINT nor have the GTF_UNSIGNED flag set.

XARCH uses a mov (rather than movsxd, the equivalent of ldrsw) so it zero extends. There's no reason for ARM64 to behave differently and doing so makes it more difficult to share codegen logic between XARCH and ARM64

Other ARM64 compilers also use ldr rather than ldrsw.

This requires patching up emitInsAdjustLoadStoreAttr so EA_4BYTE loads don't end up using EA_8BYTE, which ldrsw requires.
In particular, cleanup selection of acquire/release instructions. The existing code starts by selecting a "normal" instruction, only to throw it away and redo the type/size logic in the volatile case. And get it wrong in the process, it required that "targetType" be TYP_UINT or TYP_LONG to use ldar. But there are no TYP_UINT indirs.

Also rename "targetType" to "type", using "target" is misleading. The real target type would be genActualType(tree->TypeGet()).
- Require EA_4BYTE for byte/halfword instructions on ARM32.
- Remove emitInsAdjustLoadStoreAttr on ARM64. Getting the correct instruction size simply requires using emitActualTypeSize, that will provide the correct size on both ARM32 and ARM64.
- Replace emitTypeSize with emitActualTypeSize as needed.
Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks for your patience @mikedn!

@mikedn
Copy link
Author

mikedn commented May 29, 2019

One thing that it's worth pointing out is that ins_Load & co. are inherently error prone. They return an instruction for a given type but they do not also return an emitAttr for that type and the caller has to ensure that the proper emitAttr is used.

Future cleanup of the emitter might consider packing the instruction and the attr in a struct (containing an unsigned bitfield for example) and making ins_Load & co. return such a struct. Emitter functions will have one parameter less and the chance to use the wrong attr will be smaller.

@CarolEidt
Copy link

Merging, since the two failed legs are x64, and all the changes are arm/arm64 specific.

@CarolEidt CarolEidt merged commit c6824d5 into dotnet:master May 30, 2019
@mikedn mikedn deleted the arm64-load-op branch September 28, 2019 19:12
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Avoid using ins_Load/ins_Store with constant type

* Use ldr to load array lengths/lower bounds

genCodeForArrIndex and genCodeForArrOffset emit a ldrsw on ARM64 but that's not necessary. Array lengths are always positive. Lower bounds are signed but then the subsequent subtract is anyway EA_4BYTE so sign extension isn't actually needed. Just use ldr on both ARM32 and ARM64.

* Use ldr to load array length (part 2)

genCodeForIndexAddr's range check generation is messed up:
- It uses ldrsw to load array length on ARM64. Not needed, the length is positive.
- It uses sxtw to sign etxend the array length if the index is native int. But it was already extended by the load.
- It creates IND and LEA nodes out of thin air. Just call the appropiate emitter functions.
- It always generates a TYP_I_IMPL cmp, even when the index is TYP_INT. Well, that's just bogus.

* Stop the using the instruction size for immediate scaling on ARM32

The scaling of immediate operands is a property of the instruction and its encoding. It doesnt' make sense to throw the instruction size (emitAttr) into the mix, that's a codegen/emitter specific concept. On ARM32 it's also almost meaningless, at least in the case of integer types - all instructions are really EA_4BYTE, even ldrb, ldrh etc.

The ARM64 emitter already extracts the scaling factor from the instruction. It can't use the instruction size as on ARM64 the size is used to select between 32 bit and 64 bit instructions so it's never EA_1BYTE/EA_2BYTE.

* Stop using ldrsw for TYP_INT loads

ARM64's ins_Load returns INS_ldrsw for TYP_INT but there's nothing in the JIT type system that requires sign extending TYP_INT values on load. The fact that an indir node has TYP_INT doesn't even imply that the value to load is signed, it may be unsigned and indir nodes will never have type TYP_UINT nor have the GTF_UNSIGNED flag set.

XARCH uses a mov (rather than movsxd, the equivalent of ldrsw) so it zero extends. There's no reason for ARM64 to behave differently and doing so makes it more difficult to share codegen logic between XARCH and ARM64

Other ARM64 compilers also use ldr rather than ldrsw.

This requires patching up emitInsAdjustLoadStoreAttr so EA_4BYTE loads don't end up using EA_8BYTE, which ldrsw requires.

* Cleanup genCodeForIndir/genCodeForStoreInd

In particular, cleanup selection of acquire/release instructions. The existing code starts by selecting a "normal" instruction, only to throw it away and redo the type/size logic in the volatile case. And get it wrong in the process, it required that "targetType" be TYP_UINT or TYP_LONG to use ldar. But there are no TYP_UINT indirs.

Also rename "targetType" to "type", using "target" is misleading. The real target type would be genActualType(tree->TypeGet()).

* Remove ARM32/64 load/store instruction size inconsistencies

- Require EA_4BYTE for byte/halfword instructions on ARM32.
- Remove emitInsAdjustLoadStoreAttr on ARM64. Getting the correct instruction size simply requires using emitActualTypeSize, that will provide the correct size on both ARM32 and ARM64.
- Replace emitTypeSize with emitActualTypeSize as needed.

* Remove unnecessary insUnscaleImm parameter


Commit migrated from dotnet/coreclr@c6824d5
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants