reinitialize the sched lock on thread_atfork
#10889
We are seeing intermittent hangs in various tests at Stripe when using Ruby 3.3.1 that involve subprocesses: one process is stuck waiting for a child process to complete, while the child process is stuck waiting for `vm->ractor.sched.lock`. Inspecting the lock on x86-64 Linux says that the lock is being held by the parent process -- but the parent process's threads are all nowhere near the critical section that this lock is protecting, i.e. the parent process is not actually holding the lock. The tests in question are all using the `subprocess` gem, but we believe the scenario below is applicable to any use of `fork`, including much of `Open3` in the standard library.
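(For reference, the "inspecting the lock" step: on x86-64 Linux with glibc, a locked pthread mutex records its owner's TID in the `__data.__owner` field of the mutex, so a debugger can show who holds it. The exact expression below is an assumption about this build's struct layout, not a command from the original report:)

```
(gdb) print vm->ractor.sched.lock.__data.__owner
$1 = 12345
```

Here the printed TID belongs to a thread in the *parent* process, even though we are inspecting the child -- which is how one concludes that the child's copy of the lock is owned by a thread that does not exist in the child.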
We think what must have happened is:

1. A thread in the parent calls `fork` from Ruby.
2. A different thread in the parent is holding `vm->ractor.sched.lock` at the moment of the underlying `fork(2)`.
3. In the child process, the sole surviving thread tries to take `vm->ractor.sched.lock` and discovers that the parent process is holding onto the lock.
4. The thread that was holding the lock does not exist in the child, so nothing ever releases the child's copy of `vm->ractor.sched.lock`.
5. The child process blocks forever waiting on `vm->ractor.sched.lock` (see the standalone sketch below).
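To make the failure mode concrete, here is a minimal standalone C sketch of the same deadlock with no Ruby involved. It is illustrative only (the names are ours, not CRuby's); run it and both processes wedge exactly like the reported hang:

```c
/* fork() while another thread holds a mutex: the child inherits a
 * locked mutex whose owner does not exist in the child.
 * Compile with: cc -pthread repro.c */
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *hold_lock(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);   /* this thread owns the lock... */
    sleep(10);                   /* ...across the fork() below */
    pthread_mutex_unlock(&lock); /* unlocks only the parent's copy */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, hold_lock, NULL);
    sleep(1);                    /* let the helper thread grab the lock */

    pid_t pid = fork();
    if (pid == 0) {
        /* The child has exactly one thread; the helper thread that owns
         * the copied lock does not exist here, so nothing will ever
         * unlock it. This call blocks forever. */
        pthread_mutex_lock(&lock);
        printf("child: unreachable\n");
        _exit(0);
    }
    waitpid(pid, NULL, 0);       /* parent hangs here waiting on the child */
    pthread_join(t, NULL);
    return 0;
}
```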
To address the above problem, we re-initialize the mutex in `thread_atfork` so the child process starts with a clean slate. This is undefined behavior according to pthreads, but so is re-initializing the condition variables and barrier here. I don't think this makes things any worse. 😅

We've done some limited testing with this patch on our internal CI and it appears to make the hangs go away (none of the affected tests hang in a handful of runs, whereas at least one of them would fail with high probability on any given run before). We plan on doing wider testing with it starting next week.
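For readers unfamiliar with the pattern, this is roughly what "re-initialize a possibly-locked mutex in the atfork child handler" looks like in plain pthreads. This is a sketch, not the CRuby patch itself, and the names are illustrative:

```c
/* Illustrative sketch only -- not the CRuby patch. Shows the general
 * pattern that the fix applies to vm->ractor.sched.lock. */
#include <pthread.h>

static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER;

static void sched_lock_atfork_child(void)
{
    /* Undefined behavior per POSIX if the mutex was locked at fork
     * time, but it leaves the child with a usable, unlocked mutex. */
    pthread_mutex_init(&sched_lock, NULL);
}

static void install_atfork_handlers(void)
{
    /* prepare and parent handlers are not needed for this fix */
    pthread_atfork(NULL, NULL, sched_lock_atfork_child);
}
```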