Optimize ToString() for byte, ushort, uint and ulong #27056
Conversation
(force-pushed from 1216e92 to 2d20884)
A lot of tests are failing with crashes like:

Looks like this change has a buffer overrun that is corrupting the GC heap.
Unlikely to make any difference. This specific case is likely to be well predicted in real-world workloads, and the Max operation is just a drop in the bucket in the overall cost of this method.
Do you have any evidence that this pattern is frequent enough in real world code to make this optimization worth it?
There are tens of thousands of similar redundant bounds checks in CoreCLR/CoreFX. We are fine with paying for them to get a more secure system, unless they affect performance significantly.
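A minimal illustration of the kind of bounds check being discussed (a hypothetical helper, not code from this PR): the JIT already elides the check when the index is provably in range, but many equivalent patterns still pay for it.

```csharp
using System;

static class BoundsCheckDemo
{
    // The bounds check on data[i] is removed by the JIT's range analysis,
    // because the loop condition proves i < data.Length. Many patterns
    // that are equivalent "on paper" are not recognized and keep the check.
    public static int Sum(int[] data)
    {
        int sum = 0;
        for (int i = 0; i < data.Length; i++)
        {
            sum += data[i];
        }
        return sum;
    }
}
```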
If the JIT could do it, why couldn't Roslyn do it?
@jkotas I agree those are mostly nits
Yeah, I've built a quick prototype to log occurrences (inserted it after ValNumCSE); I will share the results once I finish setting up local infrastructure for jit-diff (and add some popular projects/libs to inspect).
It can, I guess (for known positive constants), but the JIT can find more opportunities after inlining and constant propagation/folding, etc.:

```csharp
string Test1() => Test2(42);
string Test2(int val) => val.ToString();
```
Btw, just noticed Roslyn master is now using this pattern:

```csharp
public const uint Timeout = 60;
static string TimeoutStr() => "Timeout: " + Timeout;
```

`Timeout.ToString()` is eventually represented as (after optVNAssertionPropCurStmt):
This was literally just:

```csharp
override string ToString()
{
    return Number.FormatUInt32(m_value, null, null);
}

static unsafe string FormatUInt32(uint value, ReadOnlySpan<char> format, IFormatProvider? provider)
{
    // Fast path for default format
    if (format.Length == 0)
    {
        return UInt32ToDecStr(value, digits: -1);
    }
    // more code
}
```

Wouldn't it have been better for the JIT to understand that the length of `format` is zero here? The parameters passed in (modulo `value`) are constants. So this seems like a perfect opportunity for the JIT to do simple dead code elimination...
Doing inlining analysis seems like it would be feasible here, especially for AOT or Tier 1, since the root method is only a single method call (no additional logic). The pattern where multiple overloads forward to a single central implementation is common enough to be worth recognizing.
To clarify, I was referring to doing just the regular full inlining analysis: basically having some logic that recognizes this case. At the point we are trying for Tier 1, we know that the method is "hot" and that this might be worthwhile, and the code pattern where you have multiple overloads forwarding to a single central implementation is particularly common and frequently involves passing constants for the remaining args; so this will likely introduce more opportunities to produce better codegen for these kinds of methods (namely due to dead code elimination). Would it be worth logging an issue for this, possibly just as a venue to investigate?
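A sketch of the pattern being described (the names here are illustrative, not from the PR): several thin overloads forward constants to one central implementation, so inlining plus constant propagation would let the JIT drop the dead branches.

```csharp
using System;

static class ForwardingDemo
{
    // Thin overloads forwarding constants to a central implementation.
    public static string Format(uint value) => FormatCore(value, pad: false, width: 0);
    public static string Format(uint value, int width) => FormatCore(value, pad: true, width);

    private static string FormatCore(uint value, bool pad, int width)
    {
        string s = value.ToString();
        // If 'pad' is a propagated constant 'false', this whole branch
        // is dead and could be eliminated after inlining.
        if (pad && s.Length < width)
            s = s.PadLeft(width, '0');
        return s;
    }
}
```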
I do not understand what you mean.
The C/C++ compilers try to do optimizations like this, but they are very far from getting it right most of the time.
I wasn't referring to partial inlining. Based on the code from the original post:

```csharp
static string Test(byte b) => b.ToString();
```

compiles down to:

```asm
G_M64984_IG01:
       sub      rsp, 56
       xor      rax, rax
       mov      qword ptr [rsp+28H], rax
       mov      dword ptr [rsp+40H], ecx

G_M64984_IG02:
       movzx    rcx, byte ptr [rsp+40H]
       xor      rdx, rdx
       xor      r8d, r8d
       lea      rax, bword ptr [rsp+28H]
       mov      bword ptr [rax], rdx
       mov      dword ptr [rax+8], r8d
       lea      rdx, bword ptr [rsp+28H]
       xor      r8, r8
       call     System.Number:FormatUInt32(int,struct,ref):ref
       nop

G_M64984_IG03:
       add      rsp, 56
       ret

; Total bytes of code: 56
```

So, the JIT is seeing that `byte.ToString()` is worth inlining into `Test`, leaving a direct call to `Number.FormatUInt32`. However, it may be beneficial to instead keep the call inside `byte.ToString()`, where `format` is a known empty span. This is going based on the premise that the inliner could take the constant arguments into account.
So, today you have effectively:

```csharp
static string Test(byte b) => b.ToString();

byte.ToString() => Number.FormatUInt32(m_value, null, null);

Number.FormatUInt32(uint value, ReadOnlySpan<char> format, IFormatProvider? provider)
{
    if (format.Length == 0)
    {
        return UInt32ToDecStr(value, digits: -1);
    }
    // slow path code
}
```

The JIT optimizes to:

```csharp
static string Test(byte b) => Number.FormatUInt32(b, null, null);

Number.FormatUInt32(uint value, ReadOnlySpan<char> format, IFormatProvider? provider)
{
    if (check)
    {
        // fast path
    }
    else
    {
        // slow path
    }
}
```

Instead you optimize to:

```csharp
static string Test(byte b) => b.ToString();

byte.ToString()
{
    ReadOnlySpan<char> format = null;
    IFormatProvider? provider = null;
    if (format.Length == 0)
    {
        return UInt32ToDecStr(value, digits: -1);
    }
    // slow path code
}
```

The JIT, ideally, can now do dead code elimination and make `byte.ToString()` effectively `UInt32ToDecStr(m_value, digits: -1)`.
@tannergooding I recently played with a similar inline technique in Mono-LLVM; it's called Partial Inlining (already mentioned by Jan), e.g. https://reviews.llvm.org/D40477

```csharp
static string Test(byte b) => b.ToString();

byte.ToString()
{
    ReadOnlySpan<char> format = null;
    IFormatProvider? provider = null;
    if (format.Length == 0)
    {
        return UInt32ToDecStr(value, digits: -1);
    }
    // slow path code
}
```

In your example, what is the slow path? Do you expect the inliner to extract it into a separate method? (That's what LLVM's does, and it affects binary size.)
Btw, in this PR I also wanted to optimize signed types like this:

```csharp
// Int32.ToString()
public override string ToString()
{
    if (m_value >= 0)
        return Number.UInt32ToDecStr((uint)m_value, digits: -1);
    else
        return ToStringSlowPath();

    // not 'static': it captures m_value
    string ToStringSlowPath()
    {
        return Number.NegativeInt32ToDecStr(m_value, -1, NumberFormatInfo.GetInstance(null).NegativeSign);
    }
}
```

but couldn't disable inlining for ToStringSlowPath 🙁 Is there a feature request for attributes on local functions? (Currently it's not supported.)
Slow path is currently all the other code in the method. This is where the last statement comes into play. Due to how the inliner works today, it would be inlined as-is (that is, the entire contents of the method, slow path included). So the JIT would ultimately end up compiling it down to:

```csharp
static string Test(byte b) => b.ToString();

byte.ToString() => UInt32ToDecStr(m_value, digits: -1);
```

Yes, and I believe it is targeting C# 9: dotnet/csharplang#1888
Agree. It is not so easy for the compiler to see ahead of time that all this simplification can happen. If you have a fast/slow path case like this, you might consider manually splitting the slow path into a separate method, so that the parent method is more likely to be inlined.
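A sketch of the manual split being suggested (names are illustrative; the real `Int32.ToString` uses `Number.*` internals): keep the fast path in a tiny wrapper and move the slow path into a separate non-inlined method. Unlike a local function, a regular private method can carry `[MethodImpl(MethodImplOptions.NoInlining)]`.

```csharp
using System;
using System.Runtime.CompilerServices;

static class SplitDemo
{
    // Small fast-path wrapper: cheap for the JIT to inline at call sites.
    public static string ToDecString(int value)
    {
        if (value >= 0)
            return ((uint)value).ToString(); // fast path

        return ToDecStringSlow(value);
    }

    // Slow path kept out of the wrapper; NoInlining works here because
    // this is a regular private method rather than a local function.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static string ToDecStringSlow(int value) => value.ToString();
}
```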
@AndyAyersMS, for the sake of learning, could you elaborate a bit on why this is difficult to do? My understanding of the inliner is that it only does some basic analysis of the IL size and instruction types today; so, for example, we can see that a method is too large to be inlined by default.

Also, my understanding of Tier 0 is that we don't do any cross-method inlining today. For R2R we do cross-method inlining for code in the same module (e.g. between two methods in S.P.Corelib). So, as a Tier 1 (where we know the method is hot and further optimizations may be warranted) or R2R optimization, it seems like it would be viable to say we inline the call that forwards constants to the central implementation.
You are basically saying that inlining a big method into a small method is sometimes profitable. The chances that it is going to be profitable are very tiny, like 1 in 1000, even if you take additional facts like constant arguments into account. The analysis required to figure it out is expensive. It is not worth it to do the expensive analysis for the small chance that it is going to help. If we had a heavy optimizing compiler that could take a lot more time to optimize than RyuJIT, it might be something to look into, but that would likely be based on LLVM, and we would get whatever LLVM is doing for this for "free".
I see; I guess I wouldn't have thought the chances of this being profitable were something like 1 in 1000. My expectation, given the coding patterns we use in CoreFX/CoreCLR and recommend others follow, is that this would have much higher profitability for methods which simply forward to another method (I would have different expectations for inlining a big method into a small method that has other additional logic rather than just forwarding). I would have also expected that, in the worst case, the codegen is no worse than the original big method and that we may have just spent some extra cycles trying to better optimize a known hot path. Anyway, thanks for taking the time to explain 😄
It'd be nice to collect some profile data in Tier 0 and save weights for branches/callees 🙂 so we can catch such cases.

PS: Do you think it's a good idea to raise an API proposal for a G_UNLIKELY/@llvm.expect kind of API?
@tannergooding if you turn this problem around you can perhaps see why it ends up being tricky. I'd paraphrase what you have above as "if you do this particular set of inlines, these optimizations fall out." But the trick is figuring out the potential impact of a set of inlines ahead of time, without knowing initially where it might lead. I say "set of inlines" because in this case the callee itself has calls that themselves must be inlined and/or properly summarized to get the right level of detail exposed to realize that optimizations can happen.

@EgorBo There are already proposals for programmer-provided frequency hints, e.g. #6024. But we have a lot of work to do in the jit before we can really benefit from any of this; the internal accounting for profile data is not in great shape.
```diff
- uint newValue = value / 10;
- *(--bufferEnd) = (byte)(value - (newValue * 10) + '0');
- value = newValue;
+ value = Math.DivRem(value, 10, out uint remainder);
```
The TODO comment here was incorrect. Did you verify the generated asm here? Before, it was a div by a constant (which is transformed into a multiply by the constant's reciprocal); is that still the case?
Once DivRem is inlined, it becomes the same expression as before; this change is mostly cosmetic.
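To illustrate the point: the quotient/remainder pair is identical either way, so after `Math.DivRem` is inlined the JIT sees the same div-by-constant expression (which it strength-reduces to a multiply by the reciprocal).

```csharp
using System;

static class DivRemDemo
{
    // Manual base-10 digit extraction: one division, remainder recovered
    // via multiply-and-subtract.
    public static (uint q, uint r) Manual(uint value)
    {
        uint q = value / 10;
        return (q, value - q * 10);
    }

    // The same computation spelled via Math.DivRem (int overload).
    public static (int q, int r) ViaDivRem(int value)
    {
        int q = Math.DivRem(value, 10, out int r);
        return (q, r);
    }
}
```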
E.g. `byte.ToString()` currently calls `Number.FormatUInt32(m_value, null, null)`, a method which for `ToString()` just redirects to `UInt32ToDecStr`, so it's better to call it directly. Same for `ushort`, `uint` and `ulong`.

Example:

Was:

Now:

(because passing `null` to a `Span`/`ReadOnlySpan` argument is not cheap)

What else can be improved:

- `byte` (which is just a 1, 2 or 3 chars string)
- `Math.MaxFast` for various BCL scenarios like this
- `<const>.ToString()` can be replaced with string literals in JIT for positive constants (I have a prototype)
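A sketch of the `byte` idea from the list above (illustrative, not this PR's code): since a byte formats to at most 3 characters, all 256 results can be precomputed once, making `ToString` a single array load.

```csharp
using System;
using System.Linq;

static class ByteStringCache
{
    // All 256 possible byte strings, computed once at startup.
    // Hypothetical sketch of the "cache strings for byte" idea.
    private static readonly string[] s_cache =
        Enumerable.Range(0, 256).Select(i => i.ToString()).ToArray();

    public static string FastToString(byte value) => s_cache[value];
}
```

The trade-off is a one-time allocation of 256 small strings against avoiding a formatting call (and an allocation) on every use.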