
Move fork context management to rust #5521

Merged: 5 commits into pantsbuild:master from stuhood/fork-lock-in-rust on Aug 28, 2018

Conversation

@stuhood (Member) commented on Feb 26, 2018:

Problem

As described in #6356, we currently suspect that there are cases where resources within the engine are being used during a fork. The python-side fork_lock attempts to approximate a bunch of other locks which it would be more accurate to directly acquire instead.

Solution

Move "fork context" management to rust, and execute our double fork for DaemonPantsRunner inside the scheduler's fork context. This acquires all existing locks, which removes the need for a fork_lock that would approximate those locks. Also has the benefit that we can eagerly re-start the scheduler's CpuPool.

Result

It should be easier to add additional threads and locks on the rust side, without worrying that we have acquired the fork_lock in enough places.

A series of replays of our internal benchmarks no longer reproduce the hang described in #6356, so this likely fixes #6356.
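For a sense of what "executing the double fork inside the scheduler's fork context" means in practice, here is a minimal sketch with hypothetical names (not the engine's actual API): the Rust side holds its internal locks (and would shut down its pools) while a caller-supplied function runs, which is where the fork happens, and then eagerly restores the pools afterwards.

use std::sync::Mutex;

// Hypothetical sketch of a scheduler-owned "fork context": all engine locks are
// held while the callback (which performs the fork) runs, so no engine thread can
// be mid-operation at fork time.
struct Scheduler {
  // Stand-in for engine state that is normally guarded by internal locks.
  graph: Mutex<Vec<&'static str>>,
}

impl Scheduler {
  fn with_fork_context<F, T>(&self, f: F) -> T
  where
    F: FnOnce() -> T,
  {
    let _graph = self.graph.lock().unwrap();
    // A real implementation would also shut down thread pools here...
    let result = f();
    // ...and eagerly restart them here (e.g. the CpuPool mentioned above).
    result
  }
}

fn main() {
  let scheduler = Scheduler { graph: Mutex::new(vec!["roots"]) };
  let outcome = scheduler.with_fork_context(|| {
    // In the actual change, DaemonPantsRunner's double fork would run here.
    "forked while the engine was quiescent"
  });
  println!("{}", outcome);
}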


@illicitonion (Contributor) left a comment:

It's not immediately apparent to me how this leads to being able to run background threads, but explicit fine-grained locking is probably good regardless, and I'm sure it's an important step :)

@@ -401,6 +401,14 @@ def _kill(self, kill_sig):
    if self.pid:
      os.kill(self.pid, kill_sig)

  def _noop_fork_context(self, func):
Contributor: When is this the correct thing to use?

stuhood (Author): Inlined and moved this docstring into the daemonize pydoc.


  def visualize_to_dir(self):
    return self._native.visualize_to_dir

  def to_keys(self, subjects):
    return list(self._to_key(subject) for subject in subjects)

  def pre_fork(self):
    self._native.lib.scheduler_pre_fork(self._scheduler)
  def with_fork_context(self, func):
Contributor: Can you add a quick pydoc explaining what this is and how it should be used?

stuhood (Author): I'll refer to the rust docs on the topic (in lib.rs).


# Perform the double fork under the fork_context. Three outcomes are possible after the double
# fork: we're either the original process, the double-fork parent, or the double-fork child.
# These are represented by parent_or_child being None, True, or False, respectively.
Contributor: Maybe this could use a tuple(is_original, is_parent) or something more enum-y, rather than a tri-state boolean?

if parent_or_child:
  ...
else:
  ...

doesn't read fantastically...

stuhood (Author): I ache for actual enums... sigh.
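(For illustration only: the patch is Python 2, which is why a tri-state value is used at all. With real enums, the three outcomes would read something like the following Rust sketch, with hypothetical names.)

// Hypothetical sketch: the three double-fork outcomes as an enum rather than
// a None/True/False tri-state.
#[derive(Debug, PartialEq)]
enum ForkOutcome {
  Original,         // the process that called daemonize()
  DoubleForkParent, // the intermediate parent, which exits promptly
  DoubleForkChild,  // the daemonized child
}

fn next_step(outcome: ForkOutcome) -> &'static str {
  match outcome {
    ForkOutcome::Original => "continue the original run",
    ForkOutcome::DoubleForkParent => "exit so the child is re-parented",
    ForkOutcome::DoubleForkChild => "run as the daemon",
  }
}

fn main() {
  assert_eq!(next_step(ForkOutcome::DoubleForkChild), "run as the daemon");
}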

@@ -37,20 +37,16 @@ def _launch_thread(f):
  def _extend_lease(self):
    while 1:
      # Use the fork lock to ensure this thread isn't cloned via fork while holding the graph lock.
      with self.fork_lock:
        self._logger.debug('Extending leases')
Contributor: Can I have my logging back please? :)

stuhood (Author): Whoops. Yep.

///
/// Run a function while the pool is shut down, and restore the pool after it completes.
///
pub fn with_shutdown<F, T>(&self, f: F) -> T
Contributor: Could we do away with the lock entirely by making with_shutdown take &mut self?

(I can believe this is impractical, but it would be nice if possible...)

stuhood (Author): I don't think so, no... we have a reference to the pool via an Arc, and getting a mutable reference into that would require either cloning or something potentially panicky.
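(As background, a general Rust illustration rather than the engine's code: a &mut self method cannot be reached through a shared Arc unless the reference count is exactly one, which is why the &mut self signature is impractical here.)

use std::sync::Arc;

fn main() {
  let mut pool = Arc::new(String::from("cpu pool"));
  let other_owner = Arc::clone(&pool); // e.g. another subsystem also holds the pool

  // With more than one strong reference there is no safe `&mut` into the Arc, so a
  // `with_shutdown(&mut self)` would force either cloning the pool or unwrapping an
  // Option that can fail (the "potentially panicky" route).
  assert!(Arc::get_mut(&mut pool).is_none());

  drop(other_owner);
  // Only once we are the sole owner does a mutable borrow become possible.
  assert!(Arc::get_mut(&mut pool).is_some());
}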

@@ -190,6 +201,8 @@ pub fn unsafe_call(func: &Function, args: &[Value]) -> Value {
/////////////////////////////////////////////////////////////////////////////////////////

lazy_static! {
// NB: Unfortunately, it's not currently possible to merge these locks, because mutating
Contributor: Nice comment :)

@dotordogh (Contributor) left a comment:

Just out of curiosity, is running everything to see if the behavior remains the same the best way to test something like this?

@@ -448,32 +456,43 @@ def daemonize(self, pre_fork_opts=None, post_fork_parent_opts=None, post_fork_ch
    daemons. Having a disparate umask from pre-vs-post fork causes files written in each phase to
    differ in their permissions without good reason - in this case, we want to inherit the umask.
    """
    fork_context = fork_context or self._noop_fork_context

    def double_fork():
@dotordogh (Contributor) commented on Feb 28, 2018:

Is it worth explaining in this context why double forking is necessary?

stuhood (Author): It's explained in the comment above.

Contributor: I missed that! Sorry!

stuhood changed the title from "Move the fork context management to rust" to "Move fork context management to rust" on Mar 15, 2018
@stuhood (Author) commented on Mar 15, 2018:

> Just out of curiosity, is running everything to see if the behavior remains the same the best way to test something like this?

Yea, basically. Kris has added a lot of tests to cover daemon usecases, so we can be pretty confident that nothing is fundamentally broken.

@illicitonion (Contributor) left a comment:

Looks great :) Thanks!

@stuhood (Author) commented on Jul 13, 2018:

I'm going to hold onto this branch, but the direction we're headed in will no longer require forking.

@stuhood (Author) commented on Aug 22, 2018:

I believe that this is related to #6356, so I'm re-opening it in order to push a rebase and resume progress.

stuhood reopened this on Aug 22, 2018
@stuhood (Author) commented on Aug 22, 2018:

This rebased version of the patch has the same "shape" as the old version (with_shutdown methods on resources), although it now needs to deal with significantly more resources. With this many resources in play, with_shutdown "context managers" get a bit hairy, so I'd like to experiment with one other idea before landing this. But I've eagerly pushed it in order to get a CI run.

stuhood force-pushed the stuhood/fork-lock-in-rust branch 3 times, most recently from e04d1af to d7a4105 on August 22, 2018
stuhood added a commit that referenced this pull request on Aug 22, 2018
stuhood force-pushed the stuhood/fork-lock-in-rust branch 2 times, most recently from 2cd2b97 to 166c3c8 on August 27, 2018
@stuhood (Author) commented on Aug 28, 2018:

I didn't see a clear way to do "composition of a bunch of resettable objects" without running into object-safety issues, so I'm planning to land this as is.
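(To illustrate the object-safety constraint mentioned here, a hypothetical sketch rather than the engine's code: composing many resettable resources behind trait objects requires the trait's method to avoid generics, for example by taking a dynamic closure.)

// Hypothetical sketch: a generic method such as `fn with_reset<F, T>(&self, f: F) -> T`
// would make the trait non-object-safe, so it could not be boxed into a Vec of
// trait objects; an object-safe variant takes a dynamic closure instead.
trait WithReset {
  fn with_reset_dyn(&self, f: &mut dyn FnMut());
}

struct Pool;
struct Store;

impl WithReset for Pool {
  fn with_reset_dyn(&self, f: &mut dyn FnMut()) {
    // shut the pool down, run the callback, then restart the pool
    f();
  }
}

impl WithReset for Store {
  fn with_reset_dyn(&self, f: &mut dyn FnMut()) {
    f();
  }
}

fn main() {
  // Composing "a bunch of resettable objects" means collecting them as trait objects.
  let resources: Vec<Box<dyn WithReset>> = vec![Box::new(Pool), Box::new(Store)];
  for r in &resources {
    r.with_reset_dyn(&mut || println!("resource is down"));
  }
}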

    self.fs_pool.reset();

  fn with_shutdown(&self, f: &mut FnMut() -> ()) {
    // TODO: Although we have a Resettable<CpuPool>, we do not shut it down, because our caller
    // will (and attempting to shut things down twice guarantees a deadlock because Resettable is
stuhood (Author): Rather than deadlock, you'd actually panic: RwLock panics on reentrance.
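(A hypothetical Resettable-style wrapper, not the engine's actual implementation, to show where the reentrance hazard comes from: the value lives behind an RwLock, with_reset holds the write lock for the duration of the callback, so a nested reset from inside the callback would re-acquire a lock the same thread already holds, which std's RwLock documents as liable to panic or deadlock.)

use std::sync::RwLock;

// Hypothetical sketch of a Resettable-style wrapper around a lazily rebuilt value.
struct Resettable<T> {
  value: RwLock<Option<T>>,
  make: fn() -> T,
}

impl<T> Resettable<T> {
  fn new(make: fn() -> T) -> Resettable<T> {
    Resettable {
      value: RwLock::new(Some(make())),
      make,
    }
  }

  // Shut the value down, run `f` (e.g. a fork), then eagerly recreate the value.
  fn with_reset<F, O>(&self, f: F) -> O
  where
    F: FnOnce() -> O,
  {
    let mut guard = self.value.write().unwrap();
    *guard = None; // shut down
    // A nested `with_reset` (or any read) from inside `f` would re-enter the
    // RwLock on the same thread, which std's RwLock does not support.
    let out = f();
    *guard = Some((self.make)()); // restart
    out
  }
}

fn main() {
  let pool: Resettable<Vec<u32>> = Resettable::new(|| vec![1, 2, 3]);
  pool.with_reset(|| println!("the pool is down; safe to fork here"));
}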

@illicitonion (Contributor) left a comment:

Thanks!

stuhood merged commit ec931f8 into pantsbuild:master on Aug 28, 2018
stuhood deleted the stuhood/fork-lock-in-rust branch on August 28, 2018
stuhood added this to the 1.9.x milestone on Aug 28, 2018
stuhood pushed a commit that referenced this pull request on Aug 28, 2018
Successfully merging this pull request may close these issues.

Hang in lmdb::environment::Environment::create_db
4 participants