
reinitialize the sched lock on thread_atfork #10889

Open · wants to merge 1 commit into master
Conversation

froydnj (Contributor) commented May 31, 2024

We are seeing intermittent hangs in various tests at Stripe when using Ruby 3.3.1 that involve subprocesses: one process is stuck waiting for a child process to complete, while the child process is stuck waiting for vm->ractor.sched.lock. Inspecting the lock on x86-64 Linux shows that it is held by the parent process -- yet the parent process's threads are all nowhere near the critical section this lock protects, i.e. the parent process is not actually holding the lock. The tests in question all use the subprocess gem, but we believe the scenario below applies to any use of fork, including much of Open3 in the standard library.

We think what must have happened is:

  1. Some thread T2 in the parent process calls fork from Ruby.
  2. Some other thread T1 in the parent process locks vm->ractor.sched.lock.
  3. T2 reaches the underlying fork(2) call while T1 still holds the lock.
  4. The child process starts with a copy of T2 only; T1 exists only in the parent process.
  5. The child process eventually tries to lock vm->ractor.sched.lock and finds it held: the child's copy of the lock state still records it as owned.
  6. T1 eventually unlocks vm->ractor.sched.lock, but only in the parent's copy of memory.
  7. The child process is stuck waiting forever, because no thread exists in it to unlock vm->ractor.sched.lock.

To address the above problem, we re-initialize the mutex in thread_atfork so the child process starts with a clean slate. This is undefined behavior according to pthreads, but so is re-initializing the condition variables and barrier here. I don't think this makes things any worse. 😅

We've done some limited testing with this patch on our internal CI and it appears to make the hangs go away (none of the affected tests hang in a handful of runs, whereas at least one of them would fail with high probability on any given run before). We plan on doing wider testing with it starting next week.

luke-gru (Contributor) commented Jun 1, 2024

Oddly enough, I opened an issue about this behavior too, but could not reproduce it reliably enough to have confidence in my change. Here's my ticket, now closed: https://bugs.ruby-lang.org/issues/19395

@ko1 ko1 requested review from ko1 and removed request for ko1 June 13, 2024 09:34