Fix RegexOptions.Compiled|IgnoreCase perf when dynamic code isn't supported #107874

stephentoub · 2024-09-16T15:24:51Z

If a regex is created with RegexOptions.Compiled and RegexOptions.IgnoreCase, and it begins with a pattern that's a reasonably small number of alternating strings, it'll now end up using SearchValues<string> to find the next possible match location. However, the SearchValues<string> instance doesn't end up getting created if the interpreter is being used. So if the implementation falls back to the interpreter because compilation isn't available (dynamic code isn't supported), it won't use any optimizations to find the next starting location. That's a regression from when it would previously at least use a vectorized search to find one character class from the set of starting strings.

This fixes it to just always create the SearchValues<string>. This adds some overhead when using RegexOptions.Compiled, but it's typically just a few percentage points, and only applies in the cases where this SearchValues<string> optimization kicks in. At the moment, changing it to have perfect knowledge about whether it can avoid that creation is too invasive. This overhead also doesn't apply to the source generator.
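
For readers who want to see the shape of the scenario, here is a minimal sketch, not code from this PR. The pattern, the input string, and the variable names are hypothetical, and whether this particular pattern actually triggers the SearchValues<string> optimization internally is an assumption. It only uses public APIs (`Regex`, `SearchValues.Create`, `RuntimeFeature.IsDynamicCodeSupported`) and assumes .NET 9 with C# 12 or later.

```csharp
using System;
using System.Buffers;
using System.Runtime.CompilerServices;
using System.Text.RegularExpressions;

// Hypothetical input, chosen only for illustration.
const string Input = "The quick brown FOX jumps over the lazy Dog";

// A case-insensitive regex that begins with a small alternation of strings.
// RegexOptions.Compiled is requested; when dynamic code isn't supported,
// the implementation falls back to the interpreter instead.
var regex = new Regex("fox|dog|cat", RegexOptions.Compiled | RegexOptions.IgnoreCase);
Match match = regex.Match(Input);

// The same kind of multi-string, case-insensitive search through the public
// SearchValues<string> API (assumed here to be comparable to what Regex uses
// internally to find candidate start positions).
SearchValues<string> starts =
    SearchValues.Create(["fox", "dog", "cat"], StringComparison.OrdinalIgnoreCase);
int firstCandidate = Input.AsSpan().IndexOfAny(starts);

Console.WriteLine($"IsDynamicCodeSupported: {RuntimeFeature.IsDynamicCodeSupported}");
Console.WriteLine($"Regex match: '{match.Value}' at index {match.Index}");
Console.WriteLine($"SearchValues candidate index: {firstCandidate}");
```

Running the same program with dynamic code support on and off is one way to compare the two paths. The regex result should be identical either way, and only the time spent finding the next candidate position should differ.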

Contributes to #99553 (this should be backported to release/9.0)

dotnet-policy-service · 2024-09-16T15:25:23Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

MihaZupan

LGTM. Thank you for the clear description.

stephentoub · 2024-09-16T16:15:42Z

@BrzVlad, this makes the regression go away for the cited case of IsDynamicCodeSupported being false with coreclr (rather than regressing, .NET 8 to .NET 9 improves by almost 40% for me on the offending benchmark). Do you see the same for mono? I want to make sure there aren't two issues lurking.

stephentoub · 2024-09-16T16:37:09Z

/backport to release/9.0

github-actions · 2024-09-16T16:37:19Z

Started backporting to release/9.0: https://github.com/dotnet/runtime/actions/runs/10888162147

BrzVlad · 2024-09-16T18:07:52Z

@stephentoub Thanks for the quick fix. The huge regression I was observing indeed goes away. There seems to be a regression of only 1.75x compared to .NET 8 on the mono interpreter. 85% of the time is now spent running the method System.Buffers.AhoCorasick:IndexOfAnyCore<System.Buffers.StringSearchValuesHelper/CaseInsensitiveAsciiLetters, System.Buffers.AhoCorasick/NoFastScan> (System.ReadOnlySpan`1<char>). I think this is now consistent with your theory that SearchValues is slow on the mono interpreter. I assume that, in your tests, CoreCLR was hitting the vectorized path, while the non-vectorized path didn't get much love and is now slower on the interpreter (the interpreter has vectorized V128 but doesn't yet implement the SSE, AdvSimd, etc. APIs). I don't have a strong opinion on how relevant this smaller regression is, but if some quick fixes can be done to the non-vectorized path then it would be great.

stephentoub · 2024-09-16T20:02:40Z

> but if some quick fixes can be done to the non-vectorized path then it would be great.

I'd defer to @MihaZupan for that, but I suspect it's unlikely.

MihaZupan · 2024-09-16T20:31:48Z

@BrzVlad would you be able to test the performance with this patch applied as well?
https://github.com/dotnet/runtime/compare/main...MihaZupan:runtime:searchvalues-string-ahoNonVecAscii?w=1

If that doesn't help much, I think the next best thing would be reviving #92680, but that seems like a .NET 10 change to me.

BrzVlad · 2024-09-17T08:36:53Z

@MihaZupan That patch reduced the regression from 1.75x to 1.6x compared to .NET 8. I think it is fine to not attempt to fix this for .NET 9.

lewing · 2024-09-17T16:49:32Z

/ba-g the stuck leg is hitting dotnet/dnceng#3879 all failures are known

stephentoub requested review from BrzVlad and MihaZupan September 16, 2024 15:24

dotnet-issue-labeler bot added the area-System.Text.RegularExpressions label Sep 16, 2024

dotnet-policy-service bot assigned stephentoub Sep 16, 2024

MihaZupan approved these changes Sep 16, 2024

github-actions bot mentioned this pull request Sep 16, 2024

[release/9.0] Fix RegexOptions.Compiled|IgnoreCase perf when dynamic code isn't supported #107877

Merged

lewing merged commit 8ae3796 into dotnet:main Sep 17, 2024
80 of 85 checks passed
