
Move fork context management to rust #5521

Merged: 5 commits into pantsbuild:master from stuhood/fork-lock-in-rust on Aug 28, 2018

Conversation

@stuhood (Member) commented on Feb 26, 2018:

Problem

As described in #6356, we currently suspect that there are cases where resources within the engine are being used during a fork. The python-side fork_lock attempts to approximate a bunch of other locks which it would be more accurate to directly acquire instead.

Solution

Move "fork context" management to rust, and execute our double fork for DaemonPantsRunner inside the scheduler's fork context. This acquires all existing locks, which removes the need for a fork_lock that would approximate those locks. Also has the benefit that we can eagerly re-start the scheduler's CpuPool.

Result

It should be easier to add additional threads and locks on the rust side, without worrying that we have acquired the fork_lock in enough places.

A series of replays of our internal benchmarks no longer reproduce the hang described in #6356, so this likely fixes #6356.
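For a sense of what "executing the double fork inside the scheduler's fork context" means in practice, here is a minimal sketch with hypothetical names (not the engine's actual API): the Rust side holds its internal locks (and would shut down its pools) while a caller-supplied function runs, which is where the fork happens, and then eagerly restores the pools afterwards.

use std::sync::Mutex;

// Hypothetical sketch of a scheduler-owned "fork context": all engine locks are
// held while the callback (which performs the fork) runs, so no engine thread can
// be mid-operation at fork time.
struct Scheduler {
  // Stand-in for engine state that is normally guarded by internal locks.
  graph: Mutex<Vec<&'static str>>,
}

impl Scheduler {
  fn with_fork_context<F, T>(&self, f: F) -> T
  where
    F: FnOnce() -> T,
  {
    let _graph = self.graph.lock().unwrap();
    // A real implementation would also shut down thread pools here...
    let result = f();
    // ...and eagerly restart them here (e.g. the CpuPool mentioned above).
    result
  }
}

fn main() {
  let scheduler = Scheduler { graph: Mutex::new(vec!["roots"]) };
  let outcome = scheduler.with_fork_context(|| {
    // In the actual change, DaemonPantsRunner's double fork would run here.
    "forked while the engine was quiescent"
  });
  println!("{}", outcome);
}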


@illicitonion (Contributor) left a comment:

It's not immediately apparent to me how this leads to being able to run background threads, but explicit fine-grained locking is probably good regardless, and I'm sure it's an important step :)

@@ -401,6 +401,14 @@ def _kill(self, kill_sig):
    if self.pid:
      os.kill(self.pid, kill_sig)

  def _noop_fork_context(self, func):
Contributor: When is this the correct thing to use?

stuhood (Author): Inlined and moved this docstring into the daemonize pydoc.


  def visualize_to_dir(self):
    return self._native.visualize_to_dir

  def to_keys(self, subjects):
    return list(self._to_key(subject) for subject in subjects)

  def pre_fork(self):
    self._native.lib.scheduler_pre_fork(self._scheduler)
  def with_fork_context(self, func):
Contributor: Can you add a quick pydoc explaining what this is and how it should be used?

stuhood (Author): I'll refer to the rust docs on the topic (in lib.rs).


# Perform the double fork under the fork_context. Three outcomes are possible after the double
# fork: we're either the original process, the double-fork parent, or the double-fork child.
# These are represented by parent_or_child being None, True, or False, respectively.
Contributor: Maybe this could use a tuple(is_original, is_parent) or something more enum-y, rather than a tri-state boolean?

if parent_or_child:
  ...
else:
  ...

doesn't read fantastically...

stuhood (Author): I ache for actual enums... sigh.
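(For illustration only: the patch is Python 2, which is why a tri-state value is used at all. With real enums, the three outcomes would read something like the following Rust sketch, with hypothetical names.)

// Hypothetical sketch: the three double-fork outcomes as an enum rather than
// a None/True/False tri-state.
#[derive(Debug, PartialEq)]
enum ForkOutcome {
  Original,         // the process that called daemonize()
  DoubleForkParent, // the intermediate parent, which exits promptly
  DoubleForkChild,  // the daemonized child
}

fn next_step(outcome: ForkOutcome) -> &'static str {
  match outcome {
    ForkOutcome::Original => "continue the original run",
    ForkOutcome::DoubleForkParent => "exit so the child is re-parented",
    ForkOutcome::DoubleForkChild => "run as the daemon",
  }
}

fn main() {
  assert_eq!(next_step(ForkOutcome::DoubleForkChild), "run as the daemon");
}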

@@ -37,20 +37,16 @@ def _launch_thread(f):
  def _extend_lease(self):
    while 1:
      # Use the fork lock to ensure this thread isn't cloned via fork while holding the graph lock.
      with self.fork_lock:
        self._logger.debug('Extending leases')
Contributor: Can I have my logging back please? :)

stuhood (Author): Whoops. Yep.

///
/// Run a function while the pool is shut down, and restore the pool after it completes.
///
pub fn with_shutdown<F, T>(&self, f: F) -> T
Contributor: Could we do away with the lock entirely by making with_shutdown take &mut self?

(I can believe this is impractical, but it would be nice if possible...)

stuhood (Author): I don't think so, no... we have a reference to the pool via an Arc, and getting a mutable reference into that would require either cloning or something potentially panicky.
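(As background, a general Rust illustration rather than the engine's code: a &mut self method cannot be reached through a shared Arc unless the reference count is exactly one, which is why the &mut self signature is impractical here.)

use std::sync::Arc;

fn main() {
  let mut pool = Arc::new(String::from("cpu pool"));
  let other_owner = Arc::clone(&pool); // e.g. another subsystem also holds the pool

  // With more than one strong reference there is no safe `&mut` into the Arc, so a
  // `with_shutdown(&mut self)` would force either cloning the pool or unwrapping an
  // Option that can fail (the "potentially panicky" route).
  assert!(Arc::get_mut(&mut pool).is_none());

  drop(other_owner);
  // Only once we are the sole owner does a mutable borrow become possible.
  assert!(Arc::get_mut(&mut pool).is_some());
}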

@@ -190,6 +201,8 @@ pub fn unsafe_call(func: &Function, args: &[Value]) -> Value {
/////////////////////////////////////////////////////////////////////////////////////////

lazy_static! {
// NB: Unfortunately, it's not currently possible to merge these locks, because mutating
Contributor: Nice comment :)

@dotordogh (Contributor) left a comment:

Just out of curiosity, is running everything to see if the behavior remains the same the best way to test something like this?

@@ -448,32 +456,43 @@ def daemonize(self, pre_fork_opts=None, post_fork_parent_opts=None, post_fork_ch
    daemons. Having a disparate umask from pre-vs-post fork causes files written in each phase to
    differ in their permissions without good reason - in this case, we want to inherit the umask.
    """
    fork_context = fork_context or self._noop_fork_context

    def double_fork():
@dotordogh (Contributor) commented on Feb 28, 2018:

Is it worth explaining in this context why double forking is necessary?

stuhood (Author): It's explained in the comment above.

Contributor: I missed that! Sorry!

stuhood changed the title from "Move the fork context management to rust" to "Move fork context management to rust" on Mar 15, 2018
@stuhood (Author) commented on Mar 15, 2018:

> Just out of curiosity, is running everything to see if the behavior remains the same the best way to test something like this?

Yea, basically. Kris has added a lot of tests to cover daemon usecases, so we can be pretty confident that nothing is fundamentally broken.

@illicitonion (Contributor) left a comment:

Looks great :) Thanks!

@stuhood (Author) commented on Jul 13, 2018:

I'm going to hold onto this branch, but the direction we're headed in will no longer require forking.

@stuhood (Author) commented on Aug 22, 2018:

I believe that this is related to #6356, so I'm re-opening it in order to push a rebase and resume progress.

stuhood reopened this on Aug 22, 2018
@stuhood (Author) commented on Aug 22, 2018:

This rebased version of the patch has the same "shape" as the old version (with_shutdown methods on resources), although it now needs to deal with significantly more resources. With this many resources in play, with_shutdown "context managers" get a bit hairy, so I'd like to experiment with one other idea before landing this. But I've eagerly pushed it in order to get a CI run.

stuhood force-pushed the stuhood/fork-lock-in-rust branch 3 times, most recently from e04d1af to d7a4105 on August 22, 2018
stuhood added a commit that referenced this pull request on Aug 22, 2018
stuhood force-pushed the stuhood/fork-lock-in-rust branch 2 times, most recently from 2cd2b97 to 166c3c8 on August 27, 2018
@stuhood (Author) commented on Aug 28, 2018:

I didn't see a clear way to do "composition of a bunch of resettable objects" without running into object-safety issues, so I'm planning to land this as is.
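(To illustrate the object-safety constraint mentioned here, a hypothetical sketch rather than the engine's code: composing many resettable resources behind trait objects requires the trait's method to avoid generics, for example by taking a dynamic closure.)

// Hypothetical sketch: a generic method such as `fn with_reset<F, T>(&self, f: F) -> T`
// would make the trait non-object-safe, so it could not be boxed into a Vec of
// trait objects; an object-safe variant takes a dynamic closure instead.
trait WithReset {
  fn with_reset_dyn(&self, f: &mut dyn FnMut());
}

struct Pool;
struct Store;

impl WithReset for Pool {
  fn with_reset_dyn(&self, f: &mut dyn FnMut()) {
    // shut the pool down, run the callback, then restart the pool
    f();
  }
}

impl WithReset for Store {
  fn with_reset_dyn(&self, f: &mut dyn FnMut()) {
    f();
  }
}

fn main() {
  // Composing "a bunch of resettable objects" means collecting them as trait objects.
  let resources: Vec<Box<dyn WithReset>> = vec![Box::new(Pool), Box::new(Store)];
  for r in &resources {
    r.with_reset_dyn(&mut || println!("resource is down"));
  }
}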

    self.fs_pool.reset();

  fn with_shutdown(&self, f: &mut FnMut() -> ()) {
    // TODO: Although we have a Resettable<CpuPool>, we do not shut it down, because our caller
    // will (and attempting to shut things down twice guarantees a deadlock because Resettable is
stuhood (Author): Rather than deadlock, you'd actually panic: RwLock panics on reentrance.
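(A hypothetical Resettable-style wrapper, not the engine's actual implementation, to show where the reentrance hazard comes from: the value lives behind an RwLock, with_reset holds the write lock for the duration of the callback, so a nested reset from inside the callback would re-acquire a lock the same thread already holds, which std's RwLock documents as liable to panic or deadlock.)

use std::sync::RwLock;

// Hypothetical sketch of a Resettable-style wrapper around a lazily rebuilt value.
struct Resettable<T> {
  value: RwLock<Option<T>>,
  make: fn() -> T,
}

impl<T> Resettable<T> {
  fn new(make: fn() -> T) -> Resettable<T> {
    Resettable {
      value: RwLock::new(Some(make())),
      make,
    }
  }

  // Shut the value down, run `f` (e.g. a fork), then eagerly recreate the value.
  fn with_reset<F, O>(&self, f: F) -> O
  where
    F: FnOnce() -> O,
  {
    let mut guard = self.value.write().unwrap();
    *guard = None; // shut down
    // A nested `with_reset` (or any read) from inside `f` would re-enter the
    // RwLock on the same thread, which std's RwLock does not support.
    let out = f();
    *guard = Some((self.make)()); // restart
    out
  }
}

fn main() {
  let pool: Resettable<Vec<u32>> = Resettable::new(|| vec![1, 2, 3]);
  pool.with_reset(|| println!("the pool is down; safe to fork here"));
}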

@illicitonion (Contributor) left a comment:

Thanks!

stuhood merged commit ec931f8 into pantsbuild:master on Aug 28, 2018
stuhood deleted the stuhood/fork-lock-in-rust branch on August 28, 2018
stuhood added this to the 1.9.x milestone on Aug 28, 2018
stuhood pushed a commit that referenced this pull request on Aug 28, 2018
Successfully merging this pull request may close these issues.

Hang in lmdb::environment::Environment::create_db
4 participants