JIT: Abandon loop search if we reach the end of the bbNext chain #69503

AndyAyersMS · 2022-05-18T16:29:20Z

In #69323 the 6.0.4 jit caused an AV because it walked off the end of the
bbNext chain during optFindNaturalLoops.

Analysis of a customer-provided dump showed that MakeCompactAndFindExits
might fail to find an expected loop block and so walk the entire bbNext chain
and then fall off the end. Details from the dump suggested that this happened
because a prior call to MakeCompactAndFindExits had moved most but not all of
a loop's blocks later in bbNext order, leaving that loop's bottom block earlier
in the bbNext chain then it's top. This ordering was unexpected.

I cannot repro this failure. The customer was using PGO and it's likely that
earlier PGO-driven block reordering contributed to this problem by interleaving
the blocks from two loops. We can recover the root method PGO schema from the
dump, but applying this is insufficient to cause the problem. This method does
quite a bit of inlining so it's likely that some inlinee PGO data must also be
a contributing factor.

At any rate, we can guard against this case easily enough, and simply abandon
recognition of any loop where we fail to find an expected loop block during
the bbNext chain walk.

…t chain In dotnet#69323 the 6.0.4 jit caused an AV because it walked off the end of the bbNext chain during `optFindNaturalLoops`. Analysis of a customer-provided dump showed that `MakeCompactAndFindExits` might fail to find an expected loop block and so walk the entire bbNext chain and then fall off the end. Details from the dump suggested that this happened because a prior call to `MakeCompactAndFindExits` had moved most but not all of a loop's blocks later in bbNext order, leaving that loop's bottom block earlier in the bbNext chain then it's top. This ordering was unexpected. I cannot repro this failure. The customer was using PGO and it's likely that earlier PGO-driven block reordering contributed to this problem by interleaving the blocks from two loops. We can recover the root method PGO schema from the dump, but applying this is insufficient to cause the problem. This method does quite a bit of inlining so it's likely that some inlinee PGO data must also be a contributing factor. At any rate, we can guard against this case easily enough, and simply abandon recognition of any loop where we fail to find an expected loop block during the bbNext chain walk.

ghost · 2022-05-18T16:29:29Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

In #69323 the 6.0.4 jit caused an AV because it walked off the end of the
bbNext chain during optFindNaturalLoops.

Analysis of a customer-provided dump showed that MakeCompactAndFindExits
might fail to find an expected loop block and so walk the entire bbNext chain
and then fall off the end. Details from the dump suggested that this happened
because a prior call to MakeCompactAndFindExits had moved most but not all of
a loop's blocks later in bbNext order, leaving that loop's bottom block earlier
in the bbNext chain then it's top. This ordering was unexpected.

I cannot repro this failure. The customer was using PGO and it's likely that
earlier PGO-driven block reordering contributed to this problem by interleaving
the blocks from two loops. We can recover the root method PGO schema from the
dump, but applying this is insufficient to cause the problem. This method does
quite a bit of inlining so it's likely that some inlinee PGO data must also be
a contributing factor.

At any rate, we can guard against this case easily enough, and simply abandon
recognition of any loop where we fail to find an expected loop block during
the bbNext chain walk.

Author:	AndyAyersMS
Assignees:	AndyAyersMS
Labels:	`area-CodeGen-coreclr`
Milestone:	-

AndyAyersMS · 2022-05-18T16:30:07Z

@jakobbotsch PTAL
cc @dotnet/jit-contrib

AndyAyersMS · 2022-05-18T16:55:00Z

Some notes on repro attempts. Dump analysis indicated the jit AV happened while tier1 jitting of WorkerThreadStart from SPC.
I ported #58294 back to 6.0 and used it to apply random PGO counts to a PMI-driven jitting of this method (release SPC/runtime plus checked jit).

This requires a bit of ingenuity because random PGO only applies if there is underlying PGO data. So I first had to jit all methods via PMI at Tier0 with TieredPGO and write out the textual pgo data (all zeros for most methods), then read this text PGO data back in for the randomized tests.

I also tried editing the text PGO data to fill in the observed PGO counts from the dump, and confirmed those values were getting picked up, but this did not lead to the same behavior.

With this setup, randomized PGO runs turned up a few other asserts in other methods that I'll be opening issues for:

in some cases recognized loops are not properly nested
in some cases we see unexpected IR nodes cloning trees because of a flow optimization we do after inserting gc polls (I think this one is fixed in 7.0)

Because I was more interested in reproing the problem in WorkerThreadStart I added the ability to apply randomized PGO data just a specific method range (something I'll also PR shortly) to my local 6.0 JIT and ran 10,000 iterations without being able to repro that failure (the gc poll issue came up 6/10,000 times).

So it seems like there is some value in running randomized PGO with a much greater range of seed values. I am going to analyze how much actual diversity I see in those 10,000 instances but at a glance it appears we have indeed produced hundreds or thousands of different outputs. What I did above seems quite automatable.

We might also consider better SOS-level diagnostics, eg to extract the PGO data more readily or find all the PGO data read by the jit during a run (here the root method plus all 100 odd inlines). The dump we got was not a full dump and it seemed unlikely that I could locate all the relevant data. But abstractly it would be nice of we could create "text form" PGO from a dump for a method.

There was also a small mismatch in some of the IL offsets in the customer dump and my local repros; I need to check back with the customer to see if they're doing some form of IL modification. Their partial dump did not allow me to see the method IL but the overall method size was the same, so this mismatch is puzzling.

"script" for doing some of the above here: https://gist.github.com/AndyAyersMS/1a7187cd46f6023831d9d81927413aa2

AndyAyersMS · 2022-05-18T22:43:48Z

Failure seems to be #69190

AndyAyersMS · 2022-05-18T22:45:00Z

/backport to release/6.0

github-actions · 2022-05-18T22:45:13Z

Started backporting to release/6.0: https://github.com/dotnet/runtime/actions/runs/2348597178

AndyAyersMS · 2022-05-19T02:33:55Z

Finally got a repro using random pgo and was able to validate my guess as to how things go wrong and the fix.

Here there is a potential loop from 57...137. We try and compact and move the range 60...109. But then we fail to compact after that.

The next loop runs from 71...136, but we've moved the first part of the loop down past the seocnd, so that 136 now sits above 71. And when we try to compact that loop we walk off the end of the bbNext chain.

Relocated blocks [BB60..BB109] inserted after BB137
New Basic Block BB149 [0333] created.
Setting edge weights for BB59 -> BB149 to [0 .. 3.402823e+38]
Added an unconditional jump to BB60 after block BB59
Failed to compact potential loop from [BB57 to BB137]
Did not find expected loop block when walking from BB148 [possible loop was from top BB71 to bottom BB136]
Failed to compact potential loop from [BB71 to BB136]

kunalspathak · 2022-05-19T05:04:11Z

Is this one of the examples of too much relying on bbNum?

SOS-level diagnostics

I agree. Either that or at least a documentation on how to debug and extract info as much as possible will be helpful.

AndyAyersMS · 2022-05-19T15:31:16Z

Is this one of the examples of too much relying on bbNum?

Sort of ... we certainly need to be careful if we're making assumptions about bbNum telling us something about the bbNext chain ordering. There is a complicated-to-describe invariant in the logic for MakeCompactAndFindExits -- if we move things carefully and consistently I think it works out, but in cases like we see here where we only move some of the non-loop code out of a loop we can get in trouble.

ghost assigned AndyAyersMS May 18, 2022

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 18, 2022

jakobbotsch approved these changes May 18, 2022

View reviewed changes

AndyAyersMS merged commit 41491a3 into dotnet:main May 18, 2022

github-actions bot mentioned this pull request May 18, 2022

[release/6.0] JIT: Abandon loop search if we reach the end of the bbNext chain #69525

Merged

JulieLeeMSFT added this to the 7.0.0 milestone May 23, 2022

ghost locked as resolved and limited conversation to collaborators Jun 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT: Abandon loop search if we reach the end of the bbNext chain #69503

JIT: Abandon loop search if we reach the end of the bbNext chain #69503

AndyAyersMS commented May 18, 2022

ghost commented May 18, 2022

AndyAyersMS commented May 18, 2022

AndyAyersMS commented May 18, 2022 •

edited

Loading

AndyAyersMS commented May 18, 2022

AndyAyersMS commented May 18, 2022

github-actions bot commented May 18, 2022

AndyAyersMS commented May 19, 2022

kunalspathak commented May 19, 2022

AndyAyersMS commented May 19, 2022

JIT: Abandon loop search if we reach the end of the bbNext chain #69503

JIT: Abandon loop search if we reach the end of the bbNext chain #69503

Conversation

AndyAyersMS commented May 18, 2022

ghost commented May 18, 2022

AndyAyersMS commented May 18, 2022

AndyAyersMS commented May 18, 2022 • edited Loading

AndyAyersMS commented May 18, 2022

AndyAyersMS commented May 18, 2022

github-actions bot commented May 18, 2022

AndyAyersMS commented May 19, 2022

kunalspathak commented May 19, 2022

AndyAyersMS commented May 19, 2022

AndyAyersMS commented May 18, 2022 •

edited

Loading