
JIT: Visit blocks in RPO during LSRA #107927

Merged — 8 commits merged into dotnet:main on Sep 20, 2024

Conversation

@amanasifkhalid (Member) commented Sep 17, 2024

Part of #107749. LSRA currently does a lexical pass over the block list to build a visitation order. Since we intend to run block layout after LSRA with #107634, LSRA ideally shouldn't be sensitive to lexical ordering, and since the current logic tries to visit a block's predecessors before the block itself, it seems easier and faster to just use an RPO traversal.
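Purely as an illustration of the idea (this is not the JIT's actual code — the `Block` struct and the helper names below are made up for the sketch), a reverse post-order sequence over a flowgraph can be built with a post-order DFS followed by a reversal; RPO visits each block before its successors (ignoring back edges), which gives the "most predecessors before the block" property the old preds-first ordering was aiming for.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical, simplified flowgraph node; the real JIT uses BasicBlock/FlowEdge.
struct Block
{
    int                 id;
    std::vector<Block*> successors;
    bool                visited = false;
};

// Depth-first post-order walk starting from the entry block.
static void postOrderDFS(Block* block, std::vector<Block*>& postOrder)
{
    block->visited = true;
    for (Block* succ : block->successors)
    {
        if (!succ->visited)
        {
            postOrderDFS(succ, postOrder);
        }
    }
    postOrder.push_back(block);
}

// Build the block sequence: reverse post-order first, then any blocks the DFS
// never reached (e.g. throw helpers with no explicit flow edges into them).
static std::vector<Block*> buildRPOSequence(Block* entry, std::vector<Block*>& allBlocks)
{
    std::vector<Block*> sequence;
    postOrderDFS(entry, sequence);
    std::reverse(sequence.begin(), sequence.end());

    for (Block* block : allBlocks)
    {
        if (!block->visited)
        {
            sequence.push_back(block); // unreachable blocks go at the end
        }
    }
    return sequence;
}
```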

The dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Sep 17, 2024.
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@amanasifkhalid (Member Author)

/azp run runtime-coreclr outerloop, Fuzzlyn


Azure Pipelines successfully started running 2 pipeline(s).

@amanasifkhalid marked this pull request as ready for review September 18, 2024 01:45
@amanasifkhalid (Member Author)

cc @dotnet/jit-contrib, @AndyAyersMS @kunalspathak PTAL. Fuzzlyn failures are known or NaN false positives.

Note that LSRA previously had three ordering strategies: the default preds-first one, lexical order, and randomized order (which was never implemented). RPO looks like a viable replacement for the first one, and lexical order doesn't make much sense if we plan to move block layout later, so I removed the functionality for specifying LSRA's block order. Is it ok to remove this for now, and add it back in if/when we decide to implement a randomized order?

@amanasifkhalid (Member Author)

Diffs are large, though they show a net size improvement. Looking at the instructions retired per collection, the larger MinOpts TP regressions are concentrated in collections with relatively few MinOpts methods, so I think the TP impact isn't as bad as it looks.

@kunalspathak (Member)

Fuzzlyn linux/arm failures seem to expose some more issues caused by this change. We should address all of them before merging this PR.

@amanasifkhalid (Member Author)

Fuzzlyn linux/arm failures seem to expose some more issues caused by this change. We should address all of them before merging this PR.

The assertion varDsc->IsAlwaysAliveInMemory() || ((regSet.GetMaskVars() & regMask) == 0) looks like the one you just fixed, though it doesn't match the assert listed at the bottom, otherTargetInterval->registerType == TYP_DOUBLE -- the latter assert's message references a method that doesn't exist in the trimmed repro. I'm guessing this is a Fuzzlyn bug? I can't reproduce either assert with the provided repro; I'll try rerunning Fuzzlyn and see if we get anything actionable.

@amanasifkhalid (Member Author)

/azp run Fuzzlyn


Azure Pipelines successfully started running 1 pipeline(s).

@amanasifkhalid (Member Author)

@kunalspathak the Linux arm failure didn't repro in the last Fuzzlyn run, so if it is a bug, it doesn't readily repro. Aside from inner/outerloop tests, are there any other suites you'd like me to run? JitStress doesn't seem to do anything interesting for block ordering, though if it's still worthwhile to run LSRA stress modes, I can do that -- thanks!

@kunalspathak (Member)

@kunalspathak the Linux arm failure didn't repro in the last Fuzzlyn run, so if it is a bug, it doesn't readily repro. Aside from inner/outerloop tests, are there any other suites you'd like me to run? JitStress doesn't seem to do anything interesting for block ordering, though if it's still worthwhile to run LSRA stress modes, I can do that -- thanks!

Yes, since Fuzzlyn is randomized, it might not necessarily repro on every run, but we should take the failure that we saw in the previous run, see why it showed up with this change, and go from there. I would usually run the *jitstressregs* pipelines too.

@amanasifkhalid (Member Author)

/azp run runtime-coreclr jitstressregs, runtime-coreclr jitstressregs-x86, runtime-coreclr jitstress2-jitstressregs


Azure Pipelines successfully started running 3 pipeline(s).

@amanasifkhalid (Member Author)

Note that the diffs from the latest run look quite different because the collections got a bit messed up on x64: we're missing coreclr_tests and libraries_tests*, and diffs from smoke_tests.nativeaot are included multiple times.

@amanasifkhalid (Member Author)

I haven't been able to repro the Linux arm assert(s) using a crossjit, and based on the inconsistent error message in Fuzzlyn, I suspect the repro was wrong to begin with. jitstressregs pipelines are passing.

@kunalspathak (Member)

I haven't been able to repro the Linux arm assert(s) using a crossjit, and based on the inconsistent error message in Fuzzlyn

Did you try downloading the superpmi collection from the prior run and reproing on your pre-merge commit? If not, I will find some time to repro it with your changes.

@kunalspathak (Member)

You could also try a temporary change as part of your PR to increase the duration of Fuzzlyn to 4 hours and see if it catches anything.

@amanasifkhalid (Member Author)

Did you try downloading the superpmi collection from the prior run and reproing on your pre-merge commit?

Thanks for recommending this. I was able to hit both of the above asserts when replaying with a crossjit, though I can repro them with a mainline JIT, too. I've opened #108000 to track these asserts.

@amanasifkhalid (Member Author)

@kunalspathak were there any other pipelines you wanted me to run?

@kunalspathak (Member)

@kunalspathak were there any other pipelines you wanted me to run?

I think that should be it. If the failures are in main as well, then I think we should be good. I will review the PR and get back to you.

@kunalspathak (Member) left a comment

Compared to the current traversal, the only things the RPO-based traversal is not doing are:

  • Sequencing the blocks based on profile data
  • Relying on block weight / bbNum as a tie-breaker

Both of these can have a significant impact on the final code we produce, as we see in the diffs. While the TP improvement is great (around 3~5% in MinOpts for some platforms), the perfscore impact is mixed: I see about the same number of methods improved vs. regressed.


To get some confidence, I am tempted to look at the history of the benchmarks that get improved vs. regressed to make sure that this change is the way to go forward. @AndyAyersMS any thoughts?

Remove lexical and random order layout

I am ok with that.

// If the DFS didn't visit all blocks, add the unvisited ones to the end of blockSequence
if (dfsTree->GetPostOrderCount() < compiler->fgBBcount)
{
    // Unvisited blocks are more likely to be at the back of the list, so iterate backwards
Member

What are the examples where there will be left over unvisited blocks?

Member Author

Any blocks that don't have FlowEdges into them won't be visited by the RPO traversal. For every example I looked at, this happens when we have throw blocks without any explicit predecessors. For example:

BBnum BBid ref try hnd preds           weight   [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    [000..00D)-> BB06(0.5),BB02(0.5)     ( cond )                     i LIR hascall idxlen
BB02 [0008]  1       BB01                  0.25 [00D..???)-> BB07(0.01),BB03(0.99)   ( cond )                     LIR internal idxlen
BB03 [0010]  2       BB02,BB05             3.96 [00D..01C)-> BB05(0.5),BB04(0.5)     ( cond )                     i LIR loophead idxlen bwd
BB04 [0006]  1       BB03                  0.99 [???..???)-> BB05(1)                 (always)                     i LIR idxlen
BB05 [0004]  2       BB03,BB04             3.96 [01C..026)-> BB03(0.5),BB06(0.5)     ( cond )                     i LIR hascall idxlen bwd
BB06 [0019]  3       BB01,BB05,BB09        1    [026..031)                           (return)                     i LIR gcsafe newobj
BB07 [0011]  2       BB02,BB09             0.04 [00D..01C)-> BB09(0.5),BB08(0.5)     ( cond )                     i LIR loophead idxlen bwd
BB08 [0013]  1       BB07                  0.01 [???..???)-> BB09(1)                 (always)                     i LIR idxlen
BB09 [0014]  2       BB07,BB08             0.04 [01C..026)-> BB07(0.5),BB06(0.5)     ( cond )                     i LIR hascall idxlen bwd
BB10 [0020]  0                             0    [???..???)                           (throw )                     i LIR rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

BB10 would've been removed in previous phases if it weren't for the BBF_DONT_REMOVE flag set on it. StackLevelSetter, which runs right before LSRA, has some logic for removing throw blocks that are no longer needed, even if they're flagged with BBF_DONT_REMOVE. So if we still see these blocks by the time we get to LSRA, I guess they're needed?

Member

Edges to the throw helpers are added during codegen. So they will appear unreached but are needed.

Member

Since such blocks will be cold (I assume), I am wondering if we should add logic to LSRA to quickly assign registers in such blocks, similar to the way we do for MinOpts, to save some TP time. I am not sure whether that decision would add resolution moves to the hot blocks, or whether there is a way to add the resolution in the cold blocks instead.

@AndyAyersMS (Member)

To get some confidence, I am tempted to look at the history of the benchmarks that get improved vs. regressed to make sure that this change is the way to go forward. @AndyAyersMS any thoughts?

You mean setting up a lab experiment?

@kunalspathak (Member)

To get some confidence, I am tempted to look at the history of the benchmarks that get improved vs. regressed to make sure that this change is the way to go forward. @AndyAyersMS any thoughts?

You mean setting up a lab experiment?

Yes.

@amanasifkhalid (Member Author)

You mean setting up a lab experiment?

Since this change is contained to one PR, would it be easier to merge this, and then wait for the perf triage? Since JIT churn has been pretty low lately, it should be easy to spot regressions from this, and if we aren't happy with the results, we can just revert this PR.

@kunalspathak (Member) left a comment

Let's merge it as-is and monitor the perf issues.

@amanasifkhalid (Member Author)

@kunalspathak thanks for the review!

Let's merge it as-is and monitor the perf issues.

@AndyAyersMS is it alright if we go with this?

@AndyAyersMS (Member)

@kunalspathak thanks for the review!

Let's merge it as-is and monitor the perf issues.

@AndyAyersMS is it alright if we go with this?

Yes

@AndyAyersMS (Member) left a comment

I think this is another place where the loop-aware "RPO" would be useful.

@kunalspathak (Member) commented Sep 20, 2024

loop-aware "RPO" would be useful

What is loop-aware "RPO"?

@amanasifkhalid merged commit ddf8075 into dotnet:main Sep 20, 2024
157 of 160 checks passed
@amanasifkhalid deleted the lsra-rpo branch September 20, 2024 18:38
@amanasifkhalid (Member Author)

What is loop-aware "RPO"?

When visiting a block during an RPO traversal, we check if the block is a loop header, and if so, we visit the rest of the loop's body before visiting anything else -- value numbering currently does this, if you want to see what the implementation looks like. This visit ordering has the nice property of keeping loop bodies compact.
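This is not the JIT's implementation (value numbering's version lives in the compiler itself), but here is a hedged, self-contained C++ sketch of the idea: given a plain RPO and a hypothetical map from loop headers to their loops, emit the rest of a loop's body as soon as its header is reached, so loop bodies stay contiguous in the final sequence. The `Block`, `Loop`, and `loopAwareOrder` names are illustrative only.

```cpp
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical types for illustration; the JIT has its own loop representation.
struct Block;
struct Loop
{
    Block*                     header;
    std::unordered_set<Block*> members; // all blocks in the loop body, including the header
};

// Produce a "loop-aware" order from a plain RPO: when a loop header is reached,
// emit the remaining members of its loop (in RPO order) before any block outside it.
std::vector<Block*> loopAwareOrder(const std::vector<Block*>&               rpo,
                                   const std::unordered_map<Block*, Loop*>& headerToLoop)
{
    std::vector<Block*>        sequence;
    std::unordered_set<Block*> placed;

    for (Block* block : rpo)
    {
        if (placed.count(block) != 0)
        {
            continue; // already emitted as part of an earlier loop body
        }

        sequence.push_back(block);
        placed.insert(block);

        auto it = headerToLoop.find(block);
        if (it != headerToLoop.end())
        {
            // Keep the loop body compact: emit its remaining members now.
            // (A real implementation would also handle nested loops recursively.)
            for (Block* candidate : rpo)
            {
                if ((it->second->members.count(candidate) != 0) && (placed.count(candidate) == 0))
                {
                    sequence.push_back(candidate);
                    placed.insert(candidate);
                }
            }
        }
    }
    return sequence;
}
```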
