Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock #54426

Merged
merged 1 commit into from
Jun 22, 2021
Merged

Fix deadlock #54426

merged 1 commit into from
Jun 22, 2021

Conversation

cshung
Copy link
Member

@cshung cshung commented Jun 18, 2021

What's wrong?

When I was prototyping the NoGC support for regions, I notice that my code ran into a deadlock when running under server GC. When I observe the threads in detail, it appears the threads are blocking on different joins.

Why does that happen?
The majority of the code requires that the server GC threads all start together, follow the same path (with respect to joins), and therefore they should all be before the same join and all march towards the next join.

The issue with this bug is that in a special case, a particular GC thread started earlier than the others and lead towards a different path.

In particular, in a normal case, we expect all threads to be waiting at the beginning of the gc_thread_function. Except for the heap 0 thread, which blocks on the ee_suspend_event, all other threads should be blocking on the gc_start_event.

However, in the case of minimal_gc_p == TRUE, the work to reset the gc_start_event is skipped. Therefore when one thread is done with the work, it proceeds to run the next iteration right away without being blocked in the gc_start_event, and that is bad because the heap 0 thread will be waiting regardless, so all threads are not running in locked steps.

The fix?
I make sure the last thread entered the initial join reset the gc_start_event before letting all threads return from garbage_collect. This will ensure all threads get blocked on the same waiting condition, just like it was for the minimal_gc_p == FALSE case after generation_to_condemn.

Note that this fix is independent of USE_REGIONS, meaning it might happen in previous releases. We might want to consider backport this fix to earlier LTS.

@ghost
Copy link

ghost commented Jun 18, 2021

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

What's wrong?

When I was prototyping the NoGC support for regions, I notice that my code ran into a deadlock when running under server GC. When I observe the threads in detail, it appears the threads are blocking on different joins.

Why does that happen?
The majority of the code requires that the server GC threads all start together, follow the same path (with respect to joins), and therefore they should all be before the same join and all march towards the next join.

The issue with this bug is that in a special case, a particular GC thread started earlier than the others and lead towards a different path.

In particular, in a normal case, we expect all threads to be waiting at the beginning of the gc_thread_function. Except for the heap 0 thread, which blocks on the ee_suspend_event, all other threads should be blocking on the gc_start_event.

However, in the case of minimal_gc_p == TRUE, the work to reset the gc_start_event is skipped. Therefore when one thread is done with the work, it proceeds to run the next iteration right away without being blocked in the gc_start_event, and that is bad because the heap 0 thread will be waiting regardless, so all threads are not running in locked steps.

The fix?
I make sure the last thread entered the initial join reset the gc_start_event before letting all threads return from garbage_collect. This will ensure all threads get blocked on the same waiting condition, just like it was for the minimal_gc_p == FALSE case after generation_to_condemn.

Note that this fix is independent of USE_REGIONS, meaning it might happen in previous releases. We might want to consider backport this fix to earlier LTS.

Author: cshung
Assignees: -
Labels:

area-GC-coreclr

Milestone: -

Copy link
Member

@Maoni0 Maoni0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@PeterSolMS
Copy link
Contributor

Ok, makes sense.

@cshung cshung merged commit 85aebc4 into dotnet:main Jun 22, 2021
@cshung cshung deleted the public/reset branch June 22, 2021 16:00
@ghost ghost locked as resolved and limited conversation to collaborators Jul 22, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants