Deadlock when using TryStartNoGCRegion and/or GC.Collect #84096

Closed
kkokosa opened this issue Mar 29, 2023 · 8 comments · Fixed by #84189

@kkokosa
Contributor

kkokosa commented Mar 29, 2023

Description

In Nethermind we have Prevent GC during NewPayload · Pull Request #5381, which apparently introduces a deadlock: the whole app just stops working (all threads seem to be stalled).

Reproduction Steps

To fully reproduce it, you'd need to start a new Nethermind node as explained in Running Nethermind & CL - Nethermind Docs:

  • build Nethermind from source, from the performance/new-payload-no-gc branch
  • install a consensus client, e.g. Lighthouse
  • create a jwtsecret file containing a random 64-character hex string
  • run the consensus client: ./lighthouse bn --network mainnet --execution-endpoint http://localhost:8551 --execution-jwt ~/ethereum/jwtsecret --checkpoint-sync-url https://mainnet.checkpoint.sigp.io --disable-deposit-contract-sync --datadir ~/ethereum/lighthouse
  • run Nethermind from src/Nethermind/Nethermind.Runner: dotnet run -c Release -- --config mainnet --datadir "/root/ethereum/nethermind" --JsonRpc.JwtSecretFile="/root/ethereum/jwtsecret"
  • (unfortunately) wait until the node becomes synced (it will take a day or two)

When the node becomes synced, it should eventually deadlock after a few minutes or hours.

Expected behavior

Application works :)

Actual behavior

Application deadlocks.

Regression?

No response

Known Workarounds

No response

Configuration

Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-67-generic x86_64)
AMD EPYC 7642 48-Core Processor (16C16T), 2299Mhz - 0Mhz, 64GB RAM
.NET 7.0.4

Other information

I've attached lldb to a deadlocked Nethermind.Runner process; please find attached my investigation, containing the output of thread backtrace all manually merged with clrstack for the managed parts: stacks.txt

As you can find there, all GC threads are waiting on gc_t_join.join in the mark_phase, while thread 172 is waiting on wait_for_gc_done from the GC.Collect coming from ScheduleGC in:

public void Dispose()
{
    if (_failCause == FailCause.None)
    {
        if (GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
        {
            try
            {
                System.GC.EndNoGCRegion();
                _gcKeeper.ScheduleGC();
            }
            catch (InvalidOperationException)
            {
                if (_logger.IsDebug) _logger.Debug($"Failed to keep in NoGCRegion with Exception with {_size} bytes");
            }
            catch (Exception e)
            {
                if (_logger.IsError) _logger.Error($"{nameof(System.GC.EndNoGCRegion)} failed with exception.", e);
            }
        }
        else if (_logger.IsDebug) _logger.Debug($"Failed to keep in NoGCRegion with {_size} bytes");
    }
    else if (_logger.IsDebug) _logger.Debug($"Failed to start NoGCRegion with {_size} bytes with cause {_failCause.FastToString()}");
}
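
For context, a minimal sketch of the pattern around this Dispose (simplified, not our exact code; the guard type and the Task.Run-based ScheduleGC are just illustrations of a deferred GC.Collect, which is the call thread 172 is blocked in):

using System;
using System.Runtime;
using System.Threading.Tasks;

public sealed class NoGcRegionGuard : IDisposable
{
    private readonly bool _entered;

    public NoGcRegionGuard(long sizeBytes)
    {
        // Reserve a budget up front; allocations beyond it (or an explicit
        // collection) will end the region early.
        _entered = GC.TryStartNoGCRegion(sizeBytes);
    }

    public void Dispose()
    {
        if (_entered && GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
        {
            GC.EndNoGCRegion();
            // "ScheduleGC": run a collection off the critical path once the
            // region is over - this is the GC.Collect that ends up waiting
            // in wait_for_gc_done in the stacks.
            Task.Run(() => GC.Collect(2, GCCollectionMode.Forced, blocking: true));
        }
    }
}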

Most of the other threads are just waiting, in a typical state, I'd say.

In the file there is also the beginning of my synchronization data investigation, but I'm not sure what to do with the mutex 0x0000000a00000004 info, or whether it is even a good direction.

I still have lldb attached to the deadlocked process; happy to investigate further if you'd guide me.

@Maoni0
Member

Maoni0 commented Mar 30, 2023

I just took a quick look at the callstacks; the hang is actually due to the fact that one of the GC threads is not in the right place -

  thread #12, name = '.NET Server GC', stop reason = signal SIGSTOP
    ... [omitted]
    frame #9: 0x00007fb6a74f6ea2 libcoreclr.so`CLREventBase::WaitEx(this=<unavailable>, dwMilliseconds=4294967295, mode=<unavailable>, syncState=0x0000000000000000) at synch.cpp:466:20
    frame #10: 0x00007fb6a754d612 libcoreclr.so`SVR::gc_heap::gc_thread_function(this=0x0000556b996f6d10) at gc.cpp:6807:28
    frame #11: 0x00007fb6a754d596 libcoreclr.so`SVR::gc_heap::gc_thread_stub(arg=<unavailable>) at gc.cpp:34183:11
    frame #12: 0x00007fb6a745c4fe libcoreclr.so`(anonymous namespace)::CreateNonSuspendableThread(void (*)(void*), void*, char16_t const*)::$_1::__invoke(void*) [inlined] (anonymous namespace)::CreateNonSuspendableThread(this=<unavailable>, argument=<unavailable>)(void*), void*, char16_t const*)::$_1::operator()(void*) const at gcenv.ee.cpp:1476:13
    frame #13: 0x00007fb6a745c4bd libcoreclr.so`(anonymous namespace)::CreateNonSuspendableThread(argument=<unavailable>)(void*), void*, char16_t const*)::$_1::__invoke(void*) at gcenv.ee.cpp:1461:27
    frame #14: 0x00007fb6a7770a2e libcoreclr.so`CorUnix::CPalThread::ThreadEntry(pvParam=0x0000556b99720390) at thread.cpp:1829:16
    frame #15: 0x00007fb6a799eb43 libc.so.6`___lldb_unnamed_symbol3481 + 755
    frame #16: 0x00007fb6a7a30a00 libc.so.6`___lldb_unnamed_symbol3865 + 11

this means the other threads cannot finish their joins. this whole picture looks really wrong: this thread is not even just 1 join off from the other threads, it seems to think there's no GC work to do while the other threads are already a few joins ahead (the join to do joined_generation_to_condemn, the initial join in mark_phase, the join for card marking).

CC-ing @cshung.

@Maoni0
Member

Maoni0 commented Mar 30, 2023

chatted with @cshung - this symptom (3 joins off) actually fits exactly the bug he fixed a while ago for NoGCRegion mode. he will be providing more details.

@cshung
Member

cshung commented Mar 31, 2023

Root cause:

Background - understanding join

To understand the problem, we first need to understand joins. Under server GC, we have multiple threads running. They need to synchronize, and we do that using joins. A join can be understood as a rendezvous point where all threads must meet before any of them can go further. Join is implemented with a simple counting scheme: the count starts at the number of threads, each thread that reaches the join decrements it by 1, and when the count reaches 0 the join can proceed.

The join mechanism itself doesn't know whether the threads are actually joining at the same piece of code - all it does is count the number of threads reaching any join point. So if a thread runs into a different join point, the count still goes down by 1 and the join proceeds, but obviously this is not intended.

This is what happened in the stacks.txt you shared: all the threads are stuck in scan_dependent_handles, but one of them is waiting for an event to signal. That is not good, because all threads should always stop at the same join point.

If all threads experienced the same sequence of join points, this would be impossible. Therefore the bad thread that is stuck on the event must have wandered through a different code path that leads to additional join points.
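
As an illustration only, a single-use counting join could look like this in C# (the real gc_t_join in gc.cpp is restartable and far more involved; the point here is just that the count, not the call site, is what gets tracked):

using System.Threading;

public sealed class CountingJoin
{
    private int _remaining;
    private readonly ManualResetEventSlim _released = new ManualResetEventSlim(false);

    public CountingJoin(int threadCount) => _remaining = threadCount;

    // Each thread calls Join() when it reaches a rendezvous point. The join
    // only counts arrivals; it has no idea whether all callers are at the
    // *same* point in the code - which is exactly what breaks when one
    // thread takes a path with extra join points.
    public void Join()
    {
        if (Interlocked.Decrement(ref _remaining) == 0)
            _released.Set();    // last thread to arrive releases the rest
        else
            _released.Wait();   // everyone else blocks until then
    }
}

(.NET's System.Threading.Barrier does the same kind of bookkeeping.)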

What happened?

In the case of NoGCRegion, a thread can wander into an alternative code path through the following sequence of events:

  1. One thread triggered a GC during a NoGC region, effectively terminating the NoGCRegion.
  2. All the server GC threads work to perform a GC.
  3. One thread completed the work; at that point the pause mode was still pause_no_gc, so it entered the allocate_for_no_gc_after_gc code path and got stuck in the join for gc_join_after_commit_soh_no_gc.
  4. The heap 0 thread completed, and it resumed the runtime.
  5. Some managed thread might then call EndNoGCRegion, which sets the pause mode to something else.
  6. The other threads detect that and don't get into allocate_for_no_gc_after_gc, so the sequence of joins diverges from this point on.

The allocate_for_no_gc_after_gc code path has 3 joins in it, and by our count all the other threads are 3 joins ahead, which confirms the hypothesis that what you are seeing is this bug I found and fixed.

Fix

The fix is to make sure all threads experience the same sequence of join points. Our simple solution is to cache the pause mode value on all threads while we are still under GC suspension; that ensures either all threads run into allocate_for_no_gc_after_gc or none of them do, which guarantees the same-sequence constraint. The fix is currently part of the WIP PR #82045. That PR implements some other functionality as well; the only few lines fixing this bug are the code related to saved_settings_pause_mode.
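
Conceptually, the fix is analogous to the following sketch (illustrative C# names only; the real change is native code in gc.cpp keyed off saved_settings_pause_mode):

enum PauseMode { Interactive, NoGC }

static class GcThreadSketch
{
    // Shared, mutable setting that a managed EndNoGCRegion call can change
    // while the GC threads are still doing post-GC work.
    private static volatile PauseMode s_pauseMode = PauseMode.NoGC;

    // Before the fix: each thread re-reads the shared setting at branch time,
    // so threads that read it before vs. after EndNoGCRegion take different
    // paths and therefore see different join sequences.
    static void AfterGcWorkBuggy()
    {
        if (s_pauseMode == PauseMode.NoGC)
            AllocateForNoGcAfterGc();   // the path with the 3 extra joins
    }

    // After the fix: the value is captured once, while the runtime is still
    // suspended and cannot change it, and every thread branches on that
    // snapshot - so either all threads take the path or none do.
    static void AfterGcWorkFixed(PauseMode savedPauseMode)
    {
        if (savedPauseMode == PauseMode.NoGC)
            AllocateForNoGcAfterGc();
    }

    static void AllocateForNoGcAfterGc() { /* extra joins happen in here */ }
}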

Next Steps

As we progress, that WIP PR will eventually be merged and be part of .NET 8. Do you need that fix sooner or in earlier releases? I might be able to split that fix out and merge just that, or port it to earlier releases as needed.

@LukaszRozmej

LukaszRozmej commented Mar 31, 2023

@cshung thank you for the detailed explanation. We would be happy to help test this out if you can supply us with a working environment that we could use (or maybe @kkokosa can just build it from source? I've never tried).

As for timing, this would help us optimize the critical path in our Ethereum client. Basically, a new block arrives every 12 seconds and we want to process it as quickly as possible, without pauses. Long GC spikes in particular can cause Ethereum validators to miss out on attestations and lose the potential rewards that come with them, so there is a direct connection to economic value for our clients.

Right now we fall back to the GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency; setting and are seeing some improvement with it. We think TryStartNoGCRegion will yield even better results.
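
For reference, the two approaches look roughly like this (simplified; the 256 MB budget and ProcessNewPayload are just placeholders):

using System;
using System.Runtime;

// Fallback we use today: hint the GC to avoid long blocking collections
// during steady-state operation.
GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;

// What we would like to do around block processing instead: suppress GC
// entirely for the critical section, then end the region afterwards.
if (GC.TryStartNoGCRegion(256 * 1024 * 1024))   // budget is illustrative
{
    try
    {
        ProcessNewPayload();   // latency-critical work (placeholder method)
    }
    finally
    {
        if (GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
            GC.EndNoGCRegion();
    }
}

static void ProcessNewPayload() { /* process the incoming block */ }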

Summarizing, it would be nice if this could be included in some .NET 7 bugfix release. It would also help us promote .NET as one of the best technologies to use in blockchain; we are mainly competing against Go, Rust, and Java on this front.

We could potentially ship a specialized version with just this patch, as our release channels are single-file application deployment, Docker, and PPA, and in all of them we control the runtime. However, some clients build from source, which would still exhibit this behaviour, so properly supporting both would add complexity.

@Maoni0
Member

Maoni0 commented Mar 31, 2023

I don't see any reason why we wouldn't be able to include this in a 7.0 servicing release.

the easiest way we validate GC fixes these days is by using a private build of clrgc.dll, which you can enable in one of 2 ways -

by setting this runtime config -

{
   "runtimeOptions": {
      "configProperties": {
         "System.GC.Name": "clrgc.dll"
      }
   }
}

or this environment variable -

set DOTNET_GCName=clrgc.dll

is this a feasible option for you?

@LukaszRozmej

@Maoni0 yes, thank you. @kkokosa will prepare the dll and configuration, and @kamilchodola will test it. We will keep you updated on whether it fixes our issue.

@cshung cshung added this to the 8.0.0 milestone Mar 31, 2023
@cshung
Member

cshung commented Apr 26, 2023

@kkokosa, @LukaszRozmej, @benaadams

The PR #84738 that backports the fix to .NET 7 is merged now. We can expect the fix to be included in the next servicing release.
