Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for reusing Graph node values if their inputs haven't changed #6059

Merged
merged 8 commits into from
Jul 6, 2018

Conversation

stuhood
Copy link
Sponsor Member

@stuhood stuhood commented Jul 2, 2018

Problem

As described in #4558, we currently completely delete Nodes from the Graph when their inputs have changed.

One concrete case where this is problematic is that all Snapshots in the graph end up with a dependency on the scandir outputs of all of their parent directories, because we need to expand symlinks recursively from the root when consuming a Path (in order to see whether any path component on the way down is a symlink). This means that changes anywhere above a Snapshot invalidate that Snapshot, and changes at the root of the repo invalidate all Snapshots (although 99% of the syscalls they depend on are not invalidated, having no dependencies of their own).

But this case is just one of many cases affected by the current implementation: there are many other times where we re-compute more than we should due to the current Node invalidation strategy.

Solution

Implement node "dirtying", as described on #4558.

There are a few components to this work:

  • In addition to being Entry::cleared (which will force a Node to re-run), a Node may be Entry::dirtyed. A "dirty" Node is eligible to be "cleaned" if its dependencies have not changed since it was dirtied.
  • Each Node records a Generation value that acts as proxy for "my output has changed". The Node locally maintains this value, and when a Node re-runs for any reason (either due to being dirtied or cleared), it compares its new output value to its old output value to determine whether to increment the Generation.
  • Each Node records the Generation values of the dependencies that it used to run, at the point when it runs. When a dirtied Node is deciding whether to re-run, it compares the previous generation values of its dependencies to their current dependency values: if they are equal, then the Node can be "cleaned": ie, its previous value can be used without re-running it.

This patch also expands the testing of Graph to differentiate dirtying a Node from clearing it, and confirms that the correct Nodes re-run in each of those cases.

Result

Cleaning all Nodes involved in ./pants list :: after touching pants.ini completes 6 times faster than recomputing them from scratch (56 seconds vs 336 seconds in our repository). More gains are likely possible by implementing the performance improvement(s) described on #6013.

Fixes #4558 and fixes #4394.

@stuhood stuhood changed the title WIP: Add support for reusing Graph node values if their inputs haven't changed Add support for reusing Graph node values if their inputs haven't changed Jul 3, 2018
@stuhood
Copy link
Sponsor Member Author

stuhood commented Jul 3, 2018

This depends on #6061 and #6013, so the reviewable portion starts after the "Remove inaccurate __eq__ implementations." commit.

stuhood pushed a commit that referenced this pull request Jul 3, 2018
### Problem

Overriding `__eq__` on `datatype` violates structural equality, and the assumptions that people have about `datatype` instances. Moreover, it is very error prone in the presence of #6059, which will be using `__eq__` to determine whether objects have changed.

### Solution

Make it impossible to override `__eq__` accidentally on `datatype` by attaching a canary to the `__eq__` method definition on the generated class. Remove a few `__eq__` implementations that violated structural equality and were thus causing issues in #6059.

### Result

`datatypes` behave as expected more frequently, and bugs are avoided. There is no noticeable impact on performance in my testing (likely these were overridden back when _every_ object ended up memoized, rather than just `Key` instances).
@stuhood
Copy link
Sponsor Member Author

stuhood commented Jul 3, 2018

Now just depends on #6013: the reviewable portion begins after "Add an invalidation unit test to replace some integration-level tests...". Individual commits should be (relatively) useful.

Copy link
Contributor

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very nice!

/// generation is recorded on the consuming edge, and can later used to determine whether the
/// inputs to a node have changed.
///
/// Unlike the RunToken (which is incremented whenever a node re-runs), the Generation is only
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be tempted to name these RunGeneration and OutputGeneration or something similarly contrasting.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They have very different uses, and aren't really related to one another at that level... so would rather keep the existing separation.

NotStarted {
run_token: RunToken,
generation: Generation,
previous_result: Option<Result<N::Item, N::Error>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a shared-across-instances Arc<Mutex<Vec<Result<N::Item, N::Error>> to get more than n-1 re-use of generations, but unclear what usage patterns are actually going to be to know which is going to be more efficient. (If people flip up and back a lot, a Vec saves a lot of computation; if they tend to make linear changes, the Mutex and linear scan may be more costly)

(An Arc<Mutex<HashMap<Vec<Generation>, Result<N::Item, N::Error>>>> would come to mind, except for how we learnt about the minimum sizes of HashMaps)

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in Slack, I'd like to hold off on storing multiple values in this first edition. We can easily add it later if we're not seeing any unexpected memory pressure due to just holding one.

let run_token = run_token.next();
match entry_key {
&EntryKey::Valid(ref n) => {
let context2 = context.clone_for(entry_id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've generally been using context2 to mean "clone of context which I was required to make because of annoying move semantics"; possibly call this context_for_entry_id?

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

// NB: The unwrap here avoids a clone and a comparison: see the method docs.
(
generation,
previous_result.unwrap_or_else(|| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previous_result.expect("A node cannot be marked clean without a previous result.")?

@@ -152,7 +160,7 @@ impl fmt::Debug for Value {
}
}

#[derive(Debug, Clone)]
#[derive(Clone, Debug, Eq, PartialEq)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can also be Copy?

Copy link
Sponsor Member Author

@stuhood stuhood Jul 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throw contains a Value with a python exception in it, so it's good to be explicit about copies.

@@ -487,6 +508,7 @@ impl<N: Node> InnerGraph<N> {
self.nodes.get(node)
}

// TODO: Now that we never delete Entries, we should consider making this infalliable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're probably going to want to start deleting entries again at some point for memory reasons, just in a more principled/less aggressive way

run_token: RunToken,
generation: Generation,
start_time: Instant,
waiters: Vec<oneshot::Sender<Result<(N::Item, Generation), N::Error>>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this signature suggest that Errors are always a new generation, even if they're the same? Possibly this should be a oneshot::Sender<(Result<N::Item, N::Error>, Generation)> so that identical errors don't trigger downstream invalidations?

future::ok(()).to_boxed()
} else {
// The Node needs to (re-)run!
let context = context2.clone();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely don't re-use the name context here, as it's actually different from context in the enclosing scope

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

.map(move |res| (res, generation))
.to_boxed();
}
_ => {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment in here that we're falling through to the second match?

enum EntryState<N: Node> {
NotStarted(RunToken),
// A node that has either been explicitly cleared, or has not yet started Running for the first
// time. In this state there is no need for a dirty bit because the generation is either in its
Copy link
Sponsor Member Author

@stuhood stuhood Jul 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: the Generation is not incremented when a node is cleared: only the RunToken.

…a Node's result to a stored `previous_result`.
… Nodes ran, and having Nodes include a Context.id in their output.
… after we dirty a Node.

TODO: Storing these in a Vec on the state is potentially less efficient
than storing them on the edges in the graph, since that avoids a bunch of
extra allocation.
@stuhood
Copy link
Sponsor Member Author

stuhood commented Jul 3, 2018

This no longer has dependencies, and is definitely reviewable. Thanks!

@stuhood
Copy link
Sponsor Member Author

stuhood commented Jul 3, 2018

Note to self: two bugs to fix here.

  1. Files that do not exist do not always seem to be invalidated correctly (this is causing the wiki page target failure in travis)
  2. The invalidate_randomly test can occasionally experience nodes being concurrently invalidated. Likely this calls for the same kind of retry that we use in Scheduler::execute.

…gerly prune edges for cleared entries, and lazily prune edges for dirtied entries.
…duler already accounts for this, and it seems reasonable to not bake retry into the Graph.
@stuhood stuhood merged commit 14a3f94 into pantsbuild:master Jul 6, 2018
@stuhood stuhood deleted the stuhood/dirty-nodes branch July 6, 2018 22:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants