
[execute]binary compile with -O0 run diff to gcc -O0 on aarch64 #50847

Closed
yansendao opened this issue Aug 17, 2021 · 9 comments
Labels
bugzilla Issues migrated from bugzilla c

Comments

@yansendao

Bugzilla Link 51505
Resolution FIXED
Resolved on Oct 11, 2021 20:29
Version 12.0
OS Linux
Blocks #51489
Attachments test-creduce.c
CC @aemerson,@DavidSpickett,@DougGregor,@zygoloid,@tstellar
Fixed by commit(s) 5f48c14 80f974e

Extended Description

yansendao@jvmtest-129:boole-issue2603$ gcc test-creduce.c -Wall && ./a.out
3
yansendao@jvmtest-129:boole-issue2603$ ~/software/llvm12/bin/clang test-creduce.c -Wall && ./a.out
test-creduce.c:7:10: warning: result of comparison of constant -1 with expression of type 'unsigned int' is always true [-Wtautological-constant-out-of-range-compare]
  if (~a > -1L)
      ~~ ^ ~~~
1 warning generated.
0
yansendao@jvmtest-129:boole-issue2603$ ~/software/llvm12/bin/clang -v
clang version 12.0.1
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/yansendao/software/llvm12/bin
Found candidate GCC installation: /usr/lib/gcc/aarch64-linux-gnu/7
Found candidate GCC installation: /usr/lib/gcc/aarch64-linux-gnu/7.5.0
Found candidate GCC installation: /usr/lib/gcc/aarch64-linux-gnu/8
Found candidate GCC installation: /usr/lib64/gcc/aarch64-linux-gnu/7
Found candidate GCC installation: /usr/lib64/gcc/aarch64-linux-gnu/7.5.0
Found candidate GCC installation: /usr/lib64/gcc/aarch64-linux-gnu/8
Selected GCC installation: /usr/lib64/gcc/aarch64-linux-gnu/7.5.0
Candidate multilib: .;@m64
Selected multilib: .;@m64
yansendao@jvmtest-129:boole-issue2603$ cat test-creduce.c
int printf(const char *, ...);
unsigned a;
long b;
short c, d;
long *e = &b;
short f() {
  if (~a > -1L)
    return 3;
  return d;
}
int main() {
  c = f();
  *e = c;
  printf("%d\n", (int)b);
  return 0;
}

@DavidSpickett
Collaborator

Thanks for the report; I can reproduce with the latest clang.

Here's a godbolt showing both: https://godbolt.org/z/nosMbnrbP

The key difference is the check that clang warns about: ~a > -1L is always true. clang generates code for that check at O0, while gcc does not.

    adrp    x8, a              //
    ldr     w8, [x8, :lo12:a]  // Load variable a into w8
    tbz     x8, #63, .LBB0_2   // If bit 63 of a is not set, branch to LBB0_2
    b       .LBB0_1            // Otherwise carry on

.LBB0_1:
    mov     w8, #3             // Return the 3 we expect
    strh    w8, [sp, #14]
    b       .LBB0_3
.LBB0_2:
    adrp    x8, d              //
    ldrh    w8, [x8, :lo12:d]  // Load variable d (which is zero initialised)
    strh    w8, [sp, #14]      // Store it to the stack
    b       .LBB0_3
.LBB0_3:
    ldrh    w0, [sp, #14]      // Load d back into w0 (the return register)
    add     sp, sp, #16
    ret

I need a bit more time to understand the warning. Certainly it seems like clang is proving itself wrong! (or just emitting incorrect code)

@DavidSpickett
Collaborator

Missed the "L" the first time I read the example. So of course this warning makes sense: sizeof(-1L) is 8 and sizeof(a) is 4, so the comparison happens at 64 bits.

@DavidSpickett
Collaborator

I don't think I fully understand the logic behind the warning: if it were just down to byte size, the comparison would always be false, not always true.

Anyway if we take what gcc does as correct, x86 and riscv seem to get it right with clang.

The AArch64 code has half of it right. It will always follow the branch ("is always true") but it's got the labels flipped.

    ldr     w8, [x8, :lo12:a] // Loads w8, which clears the top 32 bits of x8
    tbz     x8, #63, .LBB0_2  // Therefore this always takes the branch
    b       .LBB0_1

.LBB0_1:                      // But this is the label the tbz should branch to
    mov     w8, #3
    strh    w8, [sp, #14]
    b       .LBB0_3
.LBB0_2:                      // Not this one, which is the "return d"
    adrp    x8, d
    ldrh    w8, [x8, :lo12:d]
    strh    w8, [sp, #14]
    b       .LBB0_3

https://developer.arm.com/documentation/dui0802/a/A64-General-Instructions/TBZ

I confirmed manually by changing the asm from:

    tbz     x8, #63, .LBB0_2
    b       .LBB0_1

to:

    tbz     x8, #63, .LBB0_1
    b       .LBB0_2

This is a change somewhere in clang-12, because clang-11 does this:

    ldr     w9, [x8, :lo12:a] // Load a
    mvn     w9, w9            // Invert it
    mov     w8, w9            // Put it in w8 instead (which clears the top 32 bits of x8)
    cmp     x8, #0            // Compare x8 to 0
    b.lt    .LBB0_2           // Branch if the result was less than 0 (never taken, since the top bits of x8 are zero)
    mov     w8, #3            // Return 3 as we expect
    strh    w8, [sp, #14]
    b       .LBB0_3

I'll see if I can narrow it down.

@DavidSpickett
Collaborator

TLDR: We make a mistake when inverting a bit test because we sign extend a value instead of zero extending it. It only affects GlobalISel; I think I have a fix.

I've found where this goes wrong, and a potential fix. I need to consider if it's just fixing this one case and how to test more generally.

I'm using a smaller example in these tests:
int main() {
  unsigned a;
  if (~a > -1L)
    return 3;
  return 0;
}

The first thing to mention is that fastisel does not make this mistake
(either through intelligence or inaction):

    $ ./bin/clang /tmp/test.c -o - -S -O0 -mllvm --global-isel=false
    <...>
    mov     w9, #-1           // w9 = -1 (32 bit)
    eor     w8, w8, w9        // XOR a with -1
    ubfx    x8, x8, #0, #32   // Extract the bottom 32 bits of x8, so the top 32 are zeroes and the bottom 32 are the XOR result
    tbnz    x8, #63, .LBB0_2  // If bit 63 is not zero, branch to return 0 (bit 63 will always be zero)
    mov     w8, #3            // Return 3
    <...>

https://developer.arm.com/documentation/dui0802/a/A64-General-Instructions/TBNZ

The commit that changed the assembly output was:
commit 19dc9c9
Author: Jessica Paquette <jpaquette@apple.com>
Date:   Tue Oct 20 13:17:39 2020 -0700

    [AArch64][GlobalISel] Move imm adjustment for G_ICMP to post-legalizer lowering

However this in itself is not the problem. G_ICMP is lowered to instructions after post legalizer lowering.
This just meant that they took a new path.

First note that an earlier pass flips the labels, and with them the comparison changes from greater-than to less-than-or-equal.

Diffing the IR dump before and after AArch64PreLegalizerCombiner:

    20,22c20,22
    < %8:_(s1) = G_ICMP intpred(sgt), %6:_(s64), %7:_
    < G_BRCOND %8:_(s1), %bb.2
    < G_BR %bb.3
    ---
    > %8:_(s1) = G_ICMP intpred(sle), %6:_(s64), %7:_
    > G_BRCOND %8:_(s1), %bb.3
    > G_BR %bb.2

This changed (~a > -1L) to (~a <= -1L) and then flipped the labels so we still go to the right place.

Then another change happens to go from <= to < in AArch64PostLegalizerLowering.

Before:

    %7:_(s64) = G_CONSTANT i64 -1
    <...>
    %13:_(s32) = G_ICMP intpred(sle), %6:_(s64), %7:_

After:

    %14:_(s64) = G_CONSTANT i64 0
    %13:_(s32) = G_ICMP intpred(slt), %6:_(s64), %14:_

(~a <= -1) becomes (~a < 0). These are equivalent, so we don't have to change the branch destinations.

Then we get to converting the G_ICMP into instructions.

*** IR Dump Before InstructionSelect (instruction-select) ***:

<...>
bb.1.entry:
successors: %bb.2, %bb.3

%9:gpr(s32) = G_CONSTANT i32 3
%0:gpr(p0) = G_FRAME_INDEX %stack.0.retval
%2:gpr(s32) = G_CONSTANT i32 0
G_STORE %2:gpr(s32), %0:gpr(p0) :: (store (s32) into %ir.retval)
%1:gpr(p0) = G_FRAME_INDEX %stack.1.a
%3:gpr(s32) = G_LOAD %1:gpr(p0) :: (dereferenceable load (s32) from %ir.a)
%4:gpr(s32) = G_CONSTANT i32 -1 // get a 32 bit -1
%5:gpr(s32) = G_XOR %3:gpr, %4:gpr // xor a with it, for a 32 bit result
%6:gpr(s64) = G_ZEXT %5:gpr(s32) // zero extend the result (top 32 bits will be 0) to 64 bit
%14:gpr(s64) = G_CONSTANT i64 0 // 64 bit 0 to compare to
%13:gpr(s32) = G_ICMP intpred(slt), %6:gpr(s64), %14:gpr // xor and zext'd a is less than 0? It never will be because the sign bit will never be set.
%8:gpr(s1) = G_TRUNC %13:gpr(s32) // truncate the result into a single bit
G_BRCOND %8:gpr(s1), %bb.3 // if the bit was set branch to return 0
G_BR %bb.2 // but because the top 32 bits will be zero, we never take that branch and instead return 3 always

bb.2.if.then: // return 3

bb.3.if.end: // return 0

So that all looks good. (minus the G_TRUNC maybe but whatever)

After selection:

*** IR Dump After InstructionSelect (instruction-select) ***:

<...>
bb.1.entry:
successors: %bb.2, %bb.3

STRWui $wzr, %stack.0.retval, 0 :: (store (s32) into %ir.retval) // write 0 to retval
%3:gpr32 = LDRWui %stack.1.a, 0 :: (dereferenceable load (s32) from %ir.a) // load a into a w (32 bit) register
%21:gpr64all = SUBREG_TO_REG 0, %3:gpr32, %subreg.sub_32 // copy a from the 32 bit w to the same 64 bit x register (which zero extends)
%20:gpr64 = COPY %21:gpr64all // copy to another 64 bit x register
TBZX %20:gpr64, 63, %bb.3 // if bit 63 of that x register is zero, branch to bb.3 (return 0)
B %bb.2 // otherwise branch and return 3

bb.2.if.then: // return 3

bb.3.if.end: // return 0

Note that the XOR has been removed (folded?) into the TBZX. This process is where the mistake happens.

This is done in AArch64InstructionSelector::emitTestBit and in particular we call getTestBitReg during that.

/// Emit a TB(N)Z instruction which tests \p Bit in \p TestReg.
/// \p IsNegative is true if the test should be "not zero".
/// This will also optimize the test bit instruction when possible.
MachineInstr *emitTestBit(Register TestReg, uint64_t Bit, bool IsNegative,

Later we call another function to optimize the test bit.

// Attempt to optimize the test bit by walking over instructions.
TestReg = getTestBitReg(TestReg, Bit, IsNegative, MRI);

This function walks the instructions we have. So it sees a ZEXT, an XOR, then emits the TBZX.
(not sure why the ZEXT turns up first but it does and doesn't seem to break anything)

The idea is that if you XOR some value with a known constant that has bit N set, then test bit N of the result, you can invert the test and remove the XOR, since XORing with a 1 always flips the bit.

Our comparison here sets IsNegative to true initially, as we're checking whether the sign bit of the result is set (is the result less than 0).
We see the XOR with -1 (a 32-bit -1) and grab that value to check whether bit 63 is set in it.

This is the mistake, I think. We get the value by doing:
    if (VRegAndVal)
      C = VRegAndVal->Value.getSExtValue();

This means the 32-bit -1 we XORed with becomes 64 bits of 1s, so bit 63 appears set in what we assume was the XOR value, and we (incorrectly) conclude we can remove the XOR and invert the bit test.

Except that what we really wanted to get was <32 bits of 0s><32 bits of 1s>. If I do:

    --- a/llvm/lib/Target/AArch64/GISel/AArch64InstructionSelector.cpp
    +++ b/llvm/lib/Target/AArch64/GISel/AArch64InstructionSelector.cpp
    @@ -1343,7 +1343,7 @@ static Register getTestBitReg(Register Reg, uint64_t &Bit, bool &Invert,
           VRegAndVal = getConstantVRegValWithLookThrough(ConstantReg, MRI);
         }
         if (VRegAndVal)
    -      C = VRegAndVal->Value.getSExtValue();
    +      C = VRegAndVal->Value.getZExtValue();
         break;

Then we get the expected codegen. It doesn't break any llvm tests, which is a little odd, but maybe the 32 to 64 bit promotion case hasn't been well explored yet. There may be other variations of this, for example other situations where something is always false. I'm sure we can root them out.

This is what you get with that change applied:

*** IR Dump After InstructionSelect (instruction-select) ***:

<...>
bb.1.entry:
successors: %bb.2, %bb.3

STRWui $wzr, %stack.0.retval, 0 :: (store (s32) into %ir.retval)
%3:gpr32 = LDRWui %stack.1.a, 0 :: (dereferenceable load (s32) from %ir.a) // load a (32 bit)
%21:gpr64all = SUBREG_TO_REG 0, %3:gpr32, %subreg.sub_32 // zero extend to 64 bit
%20:gpr64 = COPY %21:gpr64all
TBNZX %20:gpr64, 63, %bb.3 // if the sign bit IS set (NZ = not zero) go to return 0
B %bb.2 // otherwise go to return 3

bb.2.if.then: // return 3

bb.3.if.end: // return 0

Since we zero extend a, bit 63 can never be set so we will always go to bb.2 and return 3 as expected.

Why does this not happen at -O1? Simply because GlobalISel isn't enabled at O1 (see AArch64TargetMachine::AArch64TargetMachine).

If you do manually enable it, enough optimisation has happened earlier that there's no G_ICMP to deal with anyway.

*** IR Dump Before InstructionSelect (instruction-select) ***:

Machine code for function main: IsSSA, TracksLiveness, Legalized, RegBankSelected

bb.1.entry:
%0:gpr(s32) = G_CONSTANT i32 3
$w0 = COPY %0:gpr(s32)
RET_ReallyLR implicit $w0

End machine code for function main.

*** IR Dump After InstructionSelect (instruction-select) ***:

Machine code for function main: IsSSA, TracksLiveness, Legalized, RegBankSelected, Selected

bb.1.entry:
%0:gpr32 = MOVi32imm 3
$w0 = COPY %0:gpr32
RET_ReallyLR implicit $w0

End machine code for function main.

@DavidSpickett
Collaborator

Fixed by https://reviews.llvm.org/D108755.

@aemerson
Contributor

Tom, can we get this fix into the release?

https://reviews.llvm.org/rG5f48c144c58f6d23e850a1978a6fe05887103b17

@aemerson
Contributor

Re-opening to track inclusion into the release branch.

@tstellar
Collaborator

Merged: 80f974e

@tstellar
Collaborator

mentioned in issue #51489

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021
This issue was closed.