
reinitialize the sched lock on thread_atfork #10889

Open · wants to merge 1 commit into master
Conversation

froydnj (Contributor) commented May 31, 2024

We are seeing intermittent hangs in various tests at Stripe when using Ruby 3.3.1 that involve subprocesses: one process is stuck waiting for a child process to complete, while the child process is stuck waiting for vm->ractor.sched.lock. Inspecting the lock on x86-64 Linux shows that it is held by the parent process -- yet the parent process's threads are all nowhere near the critical section this lock protects, i.e. the parent process is not actually holding the lock. The tests in question all use the subprocess gem, but we believe the scenario below applies to any use of fork, including much of Open3 in the standard library.

We think what must have happened is:

  1. Some thread T2 in the parent process calls fork from Ruby.
  2. Some other thread T1 in the parent process locks vm->ractor.sched.lock.
  3. T2 reaches the underlying fork(2) call while T1 still holds the lock.
  4. The child process starts with a copy of T2 only; T1 exists only in the parent process.
  5. The child process eventually tries to lock vm->ractor.sched.lock and finds it held: the child's copy of the lock state still records it as owned.
  6. T1 eventually unlocks vm->ractor.sched.lock, but only in the parent's copy of memory.
  7. The child process is stuck waiting forever, because no thread exists in it to unlock vm->ractor.sched.lock.

To address the above problem, we re-initialize the mutex in thread_atfork so the child process starts with a clean slate. This is undefined behavior according to pthreads, but so is re-initializing the condition variables and barrier here. I don't think this makes things any worse. 😅

We've done some limited testing with this patch on our internal CI and it appears to make the hangs go away (none of the affected tests hang in a handful of runs, whereas at least one of them would fail with high probability on any given run before). We plan on doing wider testing with it starting next week.

luke-gru (Contributor) commented Jun 1, 2024

Oddly enough, I opened an issue about this behavior too, but could not reproduce it reliably enough to have confidence in my change. Here's my ticket, now closed: https://bugs.ruby-lang.org/issues/19395

@ko1 ko1 requested review from ko1 and removed request for ko1 June 13, 2024 09:34