This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Optimize ToString() for byte, ushort, uint and ulong #27056

Merged
merged 5 commits into dotnet:master on Oct 9, 2019

Conversation

EgorBo
Member

@EgorBo EgorBo commented Oct 6, 2019

E.g. byte.ToString() currently calls Number.FormatUInt32(m_value, null, null), which for a plain ToString() just redirects to UInt32ToDecStr, so it's better to call UInt32ToDecStr directly.
The same applies to ushort, uint and ulong.
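The shape of the change, roughly (a simplified sketch; the actual PR applies the same pattern to each of the types listed above):

// Before (simplified): goes through the general formatting entry point.
public override string ToString()
{
    return Number.FormatUInt32(m_value, null, null); // the null spans still have to be materialized
}

// After (simplified): calls the default-format helper directly.
public override string ToString()
{
    return Number.UInt32ToDecStr(m_value, digits: -1); // -1 means "no digit count specified"
}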

Example:

static string Test(byte b) => b.ToString();

Was:

G_M64984_IG01:
       sub      rsp, 56
       xor      rax, rax
       mov      qword ptr [rsp+28H], rax
       mov      dword ptr [rsp+40H], ecx
G_M64984_IG02:
       movzx    rcx, byte  ptr [rsp+40H]
       xor      rdx, rdx
       xor      r8d, r8d
       lea      rax, bword ptr [rsp+28H]
       mov      bword ptr [rax], rdx
       mov      dword ptr [rax+8], r8d
       lea      rdx, bword ptr [rsp+28H]
       xor      r8, r8
       call     System.Number:FormatUInt32(int,struct,ref):ref
       nop      
G_M64984_IG03:
       add      rsp, 56
       ret      
; Total bytes of code: 56

Now:

G_M64982_IG01:
       sub      rsp, 40
       mov      dword ptr [rsp+30H], ecx
G_M64982_IG02:
       movzx    rcx, byte  ptr [rsp+30H]
       mov      edx, -1
       call     System.Number:UInt32ToDecStr(int,int):ref
       nop      
G_M64982_IG03:
       add      rsp, 40
       ret      
; Total bytes of code: 29

(because passing null as a Span/ReadOnlySpan argument is not cheap)

What else can be improved:

  1. UInt32ToDecStr looks like a bit of overkill for byte (which formats to just a 1-, 2- or 3-character string)
  2. I'd add a branchless internal Math.MaxFast for various BCL scenarios like this (a rough sketch follows this list).
  3. <const>.ToString() can be replaced with string literals in the JIT for positive constants (I have a prototype)
  4. A redundant bounds check
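For item 2, something along these lines is what I have in mind (just a sketch; Math.MaxFast is a name I'm making up here, nothing like it exists yet):

// Branchless max for ints, assuming x - y does not overflow.
internal static int MaxFast(int x, int y)
{
    int diff = x - y;
    // (diff >> 31) is all ones when x < y and zero otherwise,
    // so the subtraction picks y in the first case and x in the second.
    return x - (diff & (diff >> 31));
}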

@jkotas
Member

jkotas commented Oct 7, 2019

A lot of tests are failing with crashes like:

Assert failure(PID 4428 [0x0000114c], Thread: 3308 [0x0cec]): !CREATE_CHECK_STRING(!"Detected use of a corrupted OBJECTREF. Possible GC hole.")
    File: f:\workspace.10\_work\1\s\src\vm\object.cpp Line: 674
    Image: C:\dotnetbuild\work\95055a0c-c2eb-4d47-ac6c-86721a860e6b\Payload\dotnet.exe

Looks like this change has a buffer overrun that is corrupting the GC heap.

@jkotas
Member

jkotas commented Oct 7, 2019

add a branchless internal Math.MaxFast for various BCL scenarios like this.

Unlikely to make any difference. This specific case is likely to be well predicted in real-world workloads, and the Max operation is just a drop in the bucket in the overall cost of this method.

<const>.ToString() can be replaced with string literals in JIT for positive constants (I have a prototype)

Do you have any evidence that this pattern is frequent enough in real world code to make this optimization worth it?

Redundant bounds check

There are tens of thousands of similar redundant bounds checks in CoreCLR/CoreFX. We are fine with paying for them to get a more secure system, unless they affect performance significantly.

@stephentoub
Member

stephentoub commented Oct 7, 2019

<const>.ToString() can be replaced with string literals in JIT for positive constants

If the JIT could do it, why couldn't Roslyn do it?

@EgorBo
Member Author

EgorBo commented Oct 7, 2019

@jkotas I agree, those are mostly nits.

Do you have any evidence that this pattern is frequent enough in real world code to make this optimization worth it?

Yeah, I've built a quick prototype to log occurrences (inserted after ValnumCSE); I'll share the results once I finish local infrastructure for jit-diff (and add some popular projects/libs to inspect).

If the JIT could do it, why couldn't Roslyn do it?

It can, I guess (for known positive constants), but the JIT can find more opportunities after inlining, constant propagation/folding, etc.
E.g. Roslyn won't notice the opportunity here:

string Test1() => Test2(42);
string Test2(int val) => val.ToString();

@EgorBo
Member Author

EgorBo commented Oct 7, 2019

public const uint Timeout = 60;
static string TimeoutStr() => "Timeout: " + Timeout;

Btw, just noticed Roslyn-master is now using string.Concat(string, string) for this scenario instead of string.Concat(object, object), nice! (and it should be more friendly to my prototype 🙂)
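Roughly, my reading of the lowering difference (I haven't diffed the actual Roslyn output, so treat this as an assumption):

// Older Roslyn: boxes the uint and calls the object-based overload.
static string TimeoutStr_Old() => string.Concat((object)"Timeout: ", (object)Timeout);

// Roslyn master: calls ToString() up front and uses the string overload, no boxing.
static string TimeoutStr_New() => string.Concat("Timeout: ", Timeout.ToString());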

Timeout.ToString() is eventually represented as:

\--*  CALL      ref    System.Number.UInt32ToDecStr $100
   +--*  CNS_INT   int    60 $42
   \--*  CNS_INT   int    -1 $41

(after optVNAssertionPropCurStmt)

@jkotas jkotas merged commit 020639b into dotnet:master Oct 9, 2019
@danmoseley
Member

cc @tannergooding

@tannergooding
Member

tannergooding commented Oct 9, 2019

E.g. byte.ToString() currently calls Number.FormatUInt32(m_value, null, null), which for a plain ToString() just redirects to UInt32ToDecStr, so it's better to call UInt32ToDecStr directly.

This was literally just:

        override string ToString()
        {
            return Number.FormatUInt32(m_value, null, null);
        }

        static unsafe string FormatUInt32(uint value, ReadOnlySpan<char> format, IFormatProvider? provider)
        {
            // Fast path for default format
            if (format.Length == 0)
            {
                return UInt32ToDecStr(value, digits: -1);
            }

            // more code

        }

Wouldn't it have been better for the JIT to understand that the length of (ReadOnlySpan<char>)null is 0 and that it can, therefore, just directly emit a call to UInt32ToDecStr(value, digits: -1)?

byte.ToString() is just a direct method call, which means we should get an inline (or, worst case, an unconditional jump to the target method).

The parameters passed in (modulo m_value itself) are all constant.

So this seems like a perfect opportunity for the JIT to do simple dead code elimination...
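For what it's worth, the null argument really is just an empty span. A minimal check (nothing here beyond the public span API):

ReadOnlySpan<char> format = null;        // the null literal converts to an empty (default) span
Console.WriteLine(format.Length == 0);   // prints True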

@jkotas
Member

jkotas commented Oct 9, 2019

Wouldn't it have been better for the JIT to understand that the length of (ReadOnlySpan<char>)null is 0

FormatUInt32 in your example is a big method. The JIT would need to do partial speculative inlining to perform the optimization you are suggesting...

@tannergooding
Member

Doing the inlining analysis seems feasible here, especially for AOT or Tier 1, since the root method is only a single method call (no additional logic).

The pattern where Method(arg) just calls InternalMethod(arg, otherArgs) is fairly common and seems like a more general win...

@tannergooding
Member

To clarify, I was referring to just the regular full inlining analysis: basically having some logic that says if (Tier 1 && MethodOnlyForwardsToAnotherMethod) { DoInline }

At the point we are trying for Tier 1, we know that the method is "hot" and that this might be worthwhile. The code pattern where multiple overloads forward to a single central implementation is particularly common and frequently involves passing constants for the remaining args, so this would likely introduce more opportunities to produce better codegen for these kinds of methods (namely through dead code elimination).

Would it be worth logging an issue for this, possibly as just a venue to investigate?

@jkotas
Member

jkotas commented Oct 9, 2019

I do not understand what you mean. FormatUInt32 does not just forward to another method.

FormatUInt32 is a big method that does a lot of stuff. It can be reduced to just forwarding to another method only after inlining and optimizations are applied. It is very hard for the inliner to see that. The RyuJIT inliner is in no position to do analysis like this today.

The C/C++ compilers try to do optimizations like this, but they are very far from getting it right most of the time.

@tannergooding
Member

I do not understand what you mean. FormatUInt32 does not just forward to another method.

I wasn't referring to FormatUInt32. I was referring to byte.ToString(), which itself just forwards to FormatUInt32.

Based on the code from the original post:

static string Test(byte b) => b.ToString();

Compiles down to:

G_M64984_IG01:
       sub      rsp, 56
       xor      rax, rax
       mov      qword ptr [rsp+28H], rax
       mov      dword ptr [rsp+40H], ecx
G_M64984_IG02:
       movzx    rcx, byte  ptr [rsp+40H]
       xor      rdx, rdx
       xor      r8d, r8d
       lea      rax, bword ptr [rsp+28H]
       mov      bword ptr [rax], rdx
       mov      dword ptr [rax+8], r8d
       lea      rdx, bword ptr [rsp+28H]
       xor      r8, r8
       call     System.Number:FormatUInt32(int,struct,ref):ref
       nop      
G_M64984_IG03:
       add      rsp, 56
       ret      
; Total bytes of code: 56

So, the JIT sees that byte.ToString() is small and therefore just inlines it (which is the forward to FormatUInt32).

However, it may be beneficial to keep the call to byte.ToString() and have its call to FormatUInt32 be inlined instead. Having FormatUInt32 inlined there would, worst case, produce code similar to what we already produce for FormatUInt32, but would likely open up opportunities for dead code elimination and other optimizations due to the additional constant args used in the call to FormatUInt32.

This is based on the premise that the byte.ToString() => Number.FormatUInt32(m_value, null, null) pattern is common throughout the framework, and that methods like Number.FormatUInt32 are generally the ones holding all the logic and the "fast path" vs "slow path" checks. So I believe this would allow code targeting the fast path (due to the constants being passed in) to avoid those fast-vs-slow checks entirely.

@tannergooding
Member

tannergooding commented Oct 9, 2019

So, today you have effectively:

static string Test(byte b) => b.ToString();

byte.ToString() => Number.FormatUInt32(m_value, null, null);

Number.FormatUInt32(uint value, ReadOnlySpan<char> format, IFormatProvider? provider)
{
    if (format.Length == 0)
    {
        return UInt32ToDecStr(value, digits: -1);
    }

    // slow path code
}

The JIT optimizes to:

static string Test(byte b) => Number.FormatUInt32(b, null, null);

Number.FormatUInt32(uint value, ReadOnlySpan<char> format, IFormatProvider? provider)
{
  if (check)
  {
    fast path
  }
  else
  {
      slow path
  }
}

Instead you optimize to:

static string Test(byte b) => b.ToString();

byte.ToString()
{
    ReadOnlySpan<char> format = null;
    IFormatProvider? provider = null;

    if (format.Length == 0)
    {
        return UInt32ToDecStr(value, digits: -1);
    }

    // slow path code
}

The JIT, ideally, can now do dead code elimination and make ToString() small.

@EgorBo
Member Author

EgorBo commented Oct 9, 2019

@tannergooding I recently played with a similar inlining technique in Mono-LLVM; it's called partial inlining (already mentioned by Jan), e.g. https://reviews.llvm.org/D40477

static string Test(byte b) => b.ToString();

byte.ToString()
{
    ReadOnlySpan<char> format = null;
    IFormatProvider? provider = null;

    if (format.Length == 0)
    {
        return UInt32ToDecStr(value, digits: -1);
    }

    // slow path code
}

In your example, what is the slow path -- do you expect the inliner to extract it into a separate method? (That is what LLVM's partial inliner does, and it affects binary size.)

@EgorBo
Member Author

EgorBo commented Oct 9, 2019

Btw, in this PR I also wanted to optimize signed types like this:

// Int32.ToString()
public override string ToString()
{
    if (m_value >= 0)
        return Number.UInt32ToDecStr((uint)m_value, digits: -1);

    return ToStringSlowPath(m_value);

    // a static local function cannot capture m_value, so pass it in explicitly
    static string ToStringSlowPath(int value)
    {
        return Number.NegativeInt32ToDecStr(value, -1, NumberFormatInfo.GetInstance(null).NegativeSign);
    }
}

but I couldn't disable inlining for ToStringSlowPath 🙁 Is there a feature request for attributes on local functions? (It's not currently supported.)
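What I'd like to be able to write is roughly this (not valid C# today; attributes on local functions are exactly the missing feature, and the attribute would need using System.Runtime.CompilerServices):

    // Hypothetical, once attributes on local functions are allowed:
    [MethodImpl(MethodImplOptions.NoInlining)]
    static string ToStringSlowPath(int value)
    {
        return Number.NegativeInt32ToDecStr(value, -1, NumberFormatInfo.GetInstance(null).NegativeSign);
    }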

@tannergooding
Member

In your example, what is the slow path

The slow path is currently all the other code in the method.

do you expect the inliner to extract it into a separate method? (That is what LLVM's partial inliner does, and it affects binary size.)

This is where the last statement comes into play: "The JIT, ideally, can now do dead code elimination and make ToString() small."

Due to how the inliner works today, it would be inlined as-is (that is, the entire contents of Number.FormatUInt32 would become the body of byte.ToString()). But after inlining, the JIT still does dead code elimination, constant folding, and other optimizations. So, since format is null, we know that format.Length == 0 is always true; the JIT will see that return UInt32ToDecStr(value, digits: -1); is always hit and everything else (the slow path) is dead code.

So the JIT would ultimately end up compiling it down to:

static string Test(byte b) => b.ToString();

byte.ToString() => UInt32ToDecStr(m_value, digits: -1);

Is there a feature request for attributes on local functions? (It's not currently supported.)

Yes, and I believe it is targeting C# 9: dotnet/csharplang#1888

@AndyAyersMS
Member

It is very hard for the inliner to see that. The RyuJIT inliner is in no position to do analysis like this today. The C/C++ compilers try to do optimizations like this, but they are very far from getting it right most of the time.

Agree. It is not so easy for the compiler to see ahead of time that all this simplification can happen.

If you have a fast/slow path case like this, you might consider manually splitting the slow path into a separate method, so that the parent method is more likely inlineable.
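As a sketch of that pattern (names and bodies here are illustrative only, not the actual CoreLib code):

internal static string FormatUInt32(uint value, ReadOnlySpan<char> format, IFormatProvider? provider)
{
    // Tiny fast path: small enough that a caller passing the default format can inline it
    // and then fold the branch away entirely.
    if (format.Length == 0)
        return UInt32ToDecStr(value, digits: -1);

    return FormatUInt32Slow(value, format, provider);
}

[MethodImpl(MethodImplOptions.NoInlining)] // keep the big body out of the fast path; needs using System.Runtime.CompilerServices
private static string FormatUInt32Slow(uint value, ReadOnlySpan<char> format, IFormatProvider? provider)
{
    // ... the original slow-path formatting code would live here ...
    throw new NotImplementedException();
}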

@tannergooding
Member

@AndyAyersMS, for the sake of learning, could you elaborate a bit on why this is difficult to do?

My understanding of the inliner is that it only does some basic analysis of the IL size and instruction types today. So, for example, we can see that byte.ToString() contains little IL and that it ultimately is just "forwarding" to some method (but we can't know what method it is yet).

Also, my understanding of Tier 0 is that we don't do any cross-method inlining today. For R2R we do cross-method inlining for code in the same module (e.g. between two methods in S.P.Corelib).

So, as a Tier 1 (where we know the method is hot and further optimizations may be warranted) or R2R optimization, it seems viable to say we inline the call that byte.ToString() is making rather than inlining byte.ToString() itself. The premise is that methods which just forward generally don't contain the bulk of the logic and we will have a call regardless; inlining the call inside byte.ToString(), rather than inlining byte.ToString() itself, would give more opportunities for optimization, especially around eliminating dead code paths and constant folding, since it is a common pattern.

@jkotas
Member

jkotas commented Oct 9, 2019

You are basically saying that inlining a big method into a small method is sometimes profitable. The chances that it is going to be profitable are very small, something like 1 in 1000, even if you take additional facts like constant arguments into account. The analysis required to figure it out is expensive. It is not worth doing the expensive analysis for the small chance that it is going to help.

If we had a heavily optimizing compiler that could take a lot more time to optimize than RyuJIT, it might be something to look into - but that would likely be based on LLVM, and we would get whatever LLVM does for this for "free".

@tannergooding
Member

I see, I guess I wouldn't have thought the chances of this being profitable were something like 1 in 1000.

My expectation, given the coding patterns we use in CoreFX/CoreCLR and recommend others follow, is that this would have much higher profitability for methods which simply forward to another method (I would have different expectations for inlining a big method into a small method that has other additional logic rather than just forwarding). I would have also expected that, worst case, the codegen is no worse than the original big method and that we may have just spent some extra cycles trying to better optimize a known hot path.

Anyway, thanks for taking the time to explain 😄

@EgorBo
Member Author

EgorBo commented Oct 9, 2019

It'd be nice to collect some profile data in Tier 0 and save weights for branches/callees 🙂 so we can catch cases like this.

PS: Do you think it's a good idea to raise an API proposal for a G_UNLIKELY/@llvm.expect kind of API?
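Something along these lines is what I have in mind (a purely hypothetical API: neither the type nor the method exists in the BCL, and the names are made up):

// A branch-weight hint in the spirit of GCC's __builtin_expect / LLVM's @llvm.expect.
public static class BranchHint
{
    // A JIT intrinsic would treat the condition as "probably false" and lay out code accordingly;
    // without JIT support it is just an identity function.
    public static bool Unlikely(bool condition) => condition;
}

// Usage sketch:
//   if (BranchHint.Unlikely(format.Length != 0))
//       return FormatSlow(value, format, provider);   // FormatSlow is also hypothetical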

@AndyAyersMS
Member

@tannergooding if you turn this problem around you can perhaps see why it ends up being tricky. I'd paraphrase what you have above as "if you do this particular set of inlines, these optimizations fall out." But the trick is figuring out the potential impact of a set of inlines ahead of time without knowing initially where it might lead.

I say "set of inlines" because in this case the callee itself has calls that themselves must be inlined and/or properly summarized to get the right level of detail exposed to realize that optimizations can happen. For instance the Span constructor and format.Length are both calls, and realizing null passed to the span constructor implies Length returns zero is not obvious unless both are inlined, and the span struct is promoted, and we constant prop through the length field, and that this result feeds a branch, and optimizing that branch removes a lot of code.

@EgorBo There are already proposals around for programmer-provided frequency hints, e.g. #6024. But we have a lot of work to do in the JIT before we can really benefit from any of this; the internal accounting for profile data is not in great shape.

Review thread on the changed digit-formatting loop (the first three lines are the old code, the fourth is the replacement):

uint newValue = value / 10;
*(--bufferEnd) = (byte)(value - (newValue * 10) + '0');
value = newValue;
value = Math.DivRem(value, 10, out uint remainder);

The TODO comment here was incorrect. Did you verify the generated asm here? Before, this was a div by a constant (which gets transformed into a multiply by the constant's reciprocal); is that still the case?

Member Author


Once DivRem is inlined it becomes the same expression as before; this change is mostly cosmetic.
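Roughly, after the DivRem call is inlined the loop body reduces back to the same shape as before (paraphrasing, not the literal inlined code):

uint quotient = value / 10;                 // still a div by a constant, so the JIT
uint remainder = value - (quotient * 10);   // still turns it into a multiply by the reciprocal
*(--bufferEnd) = (byte)(remainder + '0');
value = quotient;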
