Deadlock when using TryStartNoGCRegion and/or GC.Collect #84096
Tagging subscribers to this area: @dotnet/gc

Issue Details

Description

In Nethermind we have Prevent GC during NewPayload · Pull Request #5381, which apparently introduces a deadlock: the whole app just stops working (all threads seem to be stalled).

Reproduction Steps

To fully reproduce it, you'd need to start a new Nethermind node as explained in Running Nethermind & CL - Nethermind Docs:

1. Check out the performance/new-payload-no-gc branch.
2. Make sure lighthouse --version executes.
3. Create a jwtsecret file containing a random 64-character hex string.
4. Run ./lighthouse bn --network mainnet --execution-endpoint http://localhost:8551 --execution-jwt ~/ethereum/jwtsecret --checkpoint-sync-url https://mainnet.checkpoint.sigp.io --disable-deposit-contract-sync --datadir ~/ethereum/lighthouse
5. From src/Nethermind/Nethermind.Runner, run Nethermind: dotnet run -c Release -- --config mainnet --datadir "/root/ethereum/nethermind" --JsonRpc.JwtSecretFile="/root/ethereum/jwtsecret"

When the node becomes synced, it should eventually deadlock after a few minutes or hours.

Expected behavior

Application works :)

Actual behavior

Application deadlocks.

Regression?

No response

Known Workarounds

No response

Configuration

Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-67-generic x86_64)
AMD EPYC 7642 48-Core Processor (16C16T), 2299Mhz - 0Mhz, 64GB RAM
.NET 7.0.4

Other information

I've attached lldb to a deadlocked Nethermind.Runner process; please find attached my investigation containing thread backtrace all merged with clrstack for the managed parts (manually): stacks.txt. As you can find there, all GC threads are waiting on gc_t_join.join in the mark_phase, while thread 172 is waiting on wait_for_gc_done from the GC.Collect coming from ScheduleGC in:
{
if (_failCause == FailCause.None)
{
if (GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
{
try
{
System.GC.EndNoGCRegion();
_gcKeeper.ScheduleGC();
}
catch (InvalidOperationException)
{
if (_logger.IsDebug) _logger.Debug($"Failed to keep in NoGCRegion with Exception with {_size} bytes");
}
catch (Exception e)
{
if (_logger.IsError) _logger.Error($"{nameof(System.GC.EndNoGCRegion)} failed with exception.", e);
}
}
else if (_logger.IsDebug) _logger.Debug($"Failed to keep in NoGCRegion with {_size} bytes");
}
else if (_logger.IsDebug) _logger.Debug($"Failed to start NoGCRegion with {_size} bytes with cause {_failCause.FastToString()}");
}

Most of the other threads are just waiting, in a typical state, I'd say. In the file there is also the beginning of my synchronization data investigation, but I'm not sure what to do with the mutex 0x0000000a00000004 info and whether it is even a good direction. I still keep lldb attached to the deadlocked process; happy to investigate further if you'd drive me.
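For context, the Dispose shown above is the tear-down half of a NoGCRegion scope. A minimal sketch of how such a scope is typically entered (a hypothetical wrapper for illustration, not Nethermind's actual implementation) looks like this:

```csharp
using System;
using System.Runtime;

// Hypothetical NoGCRegion scope: start the region on construction,
// end it on Dispose (mirroring the Dispose method shown above).
public sealed class NoGCScope : IDisposable
{
    private readonly bool _started;

    public NoGCScope(long size)
    {
        // TryStartNoGCRegion throws InvalidOperationException if a region
        // is already in progress, and returns false if `size` bytes cannot
        // be guaranteed without a collection.
        _started = GC.TryStartNoGCRegion(size);
    }

    public void Dispose()
    {
        // Only end the region if we are still in NoGCRegion mode; the
        // runtime may have exited it on its own (e.g. the allocation
        // budget was exceeded).
        if (_started && GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
        {
            try { GC.EndNoGCRegion(); }
            catch (InvalidOperationException) { /* region already ended */ }
        }
    }
}
```

Note that the Dispose side must re-check GCSettings.LatencyMode, because the runtime can leave the no-GC region on its own between construction and disposal.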
I just took a quick look at the callstacks; the hang is actually due to the fact that one of the GC threads is not in the right place -
this means the other threads cannot finish their joins. This whole picture looks really wrong: this thread is not even 1 join off from the other threads; it seems to think there's no GC work to do while the other threads are already a few joins ahead (join to do ...). CC-ing @cshung.
Chatted with @cshung - this symptom (3 joins off) actually fits exactly the bug he fixed a while ago for NoGCRegion mode. He will be providing more details.
Root cause:

Background - understanding join

To understand the problem, we first need to understand join. Under server GC, we have multiple threads running. They need to synchronize, and we do that using joins. A join can be understood as a rendezvous point where all threads wait until they can all move forward. Join is implemented using a simple counting scheme: starting from the number of threads, each thread that reaches a join decreases the count by 1, and when the count reaches 0, the join can proceed. The join mechanism itself doesn't know whether the threads are actually joining at the same piece of code - all it does is count the number of threads reaching one of the join points. So if a thread runs into a different join point, the count will still decrease by 1 and the join will proceed, but obviously this is not intended. This is what happened with the stuck thread here.

If all threads experience the same sequence of join points, this is impossible. Therefore it must be the case that the bad thread that is stuck in the event wandered through a different code path that leads to additional join points.

What happened?

In the case of NoGCRegion, there is a possibility for a thread to wander into an alternative code path through the following sequence of events:
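The counting scheme described above can be sketched as a tiny rendezvous primitive (an illustrative model in C#, not the actual runtime code). Note that the counter has no notion of which join point a thread arrived at:

```csharp
using System.Threading;

// Illustrative model of a GC-style join: a single-use rendezvous
// implemented as a countdown. The last thread to arrive releases everyone.
public sealed class Join
{
    private int _remaining;
    private readonly ManualResetEventSlim _go = new ManualResetEventSlim(false);

    public Join(int threadCount) => _remaining = threadCount;

    // Called from any join point in the code. The join only counts
    // arrivals; it cannot tell whether all threads are at the *same*
    // join point.
    public void Arrive()
    {
        if (Interlocked.Decrement(ref _remaining) == 0)
            _go.Set();   // count reached zero: rendezvous complete
        else
            _go.Wait();  // wait for the remaining threads
    }
}
```

If one thread mistakenly calls Arrive() at a different join point than the others, the count still reaches zero and the rendezvous "succeeds", leaving that thread ahead in the code. Over a few GC phases the threads drift several joins apart and eventually deadlock, which matches the "3 joins off" symptom above.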
The Fix

The fix for that one is to make sure all threads experience the same sequence of join points. Our simple solution is to cache the pause mode value on all threads while we are still under GC suspension; that will make sure either all threads run into the additional join points, or none of them do.

Next Steps

As we progress, that WIP PR will eventually be merged and become part of .NET 8. Do you need the fix sooner or in earlier releases? I might be able to split the fix out and merge just that, or port it to earlier releases as needed.
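The idea of the fix can be shown in miniature (a sketch of the pattern, not the actual runtime change): instead of each thread reading a mutable shared setting at its own decision point, the value is captured once while the threads are still synchronized, so every thread takes the same branch and therefore sees the same sequence of join points:

```csharp
using System.Threading;

// Illustrative sketch of the caching pattern used by the fix.
public static class PauseModeExample
{
    // Shared mutable setting, analogous to the GC pause mode: another
    // thread (e.g. one calling EndNoGCRegion) may change it at any time.
    public static volatile int SharedPauseMode = 0;   // 0 = NoGCRegion

    // Buggy pattern: each thread reads the live value at its own decision
    // point. Threads reading at different times can take different
    // branches, and therefore meet different join points.
    public static bool TakeNoGCPathLive() => SharedPauseMode == 0;

    // Fixed pattern: capture the value once per thread while all threads
    // are still synchronized (in the real fix, while still under GC
    // suspension), then branch only on the cached copy.
    [ThreadStatic] public static int CachedPauseMode;
    public static void CachePauseMode() => CachedPauseMode = SharedPauseMode;
    public static bool TakeNoGCPathCached() => CachedPauseMode == 0;
}
```

Because the cached copy is taken while the threads are synchronized and the setting cannot change, every thread observes the same value and runs the same branch, so all of them hit the extra join points or none do.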
@cshung thank you for the detailed explanation. We would be happy to help test this if you can supply us with a working environment we could use (or maybe @kkokosa can just build it from source? I've never tried).

As for timing: this would help us optimize the critical path in our Ethereum client. Basically, a new block arrives every 12s, and we want to process it as quickly as possible, without pauses. Long spikes caused by GC in particular can cause Ethereum validators to miss out on attestations and lose the potential rewards that come from them, so there is a direct connection to economic value for our clients. Right now we fall back to ...

Summarizing, it would be nice if this could be included in some .NET 7 bugfix version. It would also help us promote .NET as one of the best technologies to be used in blockchain; we are mainly competing against Go, Rust, and Java on this front. We could potentially include a specialized runtime version with just this patch, since our release strategies are: single-file application deployment, Docker, and PPA. In all of them we can control the runtime. Although some clients can build from source, which would still result in this behaviour, so it would be additional complexity to properly support both.
I don't see any reason why we wouldn't be able to include this in a 7.0 servicing release. The easiest way we validate GC fixes these days is by using a private build of clrgc.dll, which you can invoke in one of two ways: by setting a runtime config property, or an environment variable.
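The exact settings were shown as screenshots in the original thread; for reference, the usual knobs for loading a standalone GC build are sketched below (assumption: names per the .NET standalone-GC loader, with clrgc.dll placed next to the runtime in the shared framework directory):

```shell
# Option 1: runtime config - in the app's runtimeconfig.json:
#   "configProperties": { "System.GC.Name": "clrgc.dll" }

# Option 2: environment variable, set before launching the process:
export DOTNET_GCName=clrgc.dll
```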
Is this a feasible option for you?
@Maoni0 yes, thank you. @kkokosa will prepare the dll and configuration, and @kamilchodola will test it. We will keep you updated on whether it fixes our issue.
@kkokosa, @LukaszRozmej, @benaadams The PR #84738 that backports the fix to .NET 7 is merged now. We can expect the fix to be included in the next servicing release. |