Add support for reusing Graph node values if their inputs haven't changed #6059

stuhood · 2018-07-02T03:12:33Z

Problem

As described in #4558, we currently completely delete Nodes from the Graph when their inputs have changed.

One concrete case where this is problematic is that all Snapshots in the graph end up with a dependency on the scandir outputs of all of their parent directories, because we need to expand symlinks recursively from the root when consuming a Path (in order to see whether any path component on the way down is a symlink). This means that changes anywhere above a Snapshot invalidate that Snapshot, and changes at the root of the repo invalidate all Snapshots (although 99% of the syscalls they depend on are not invalidated, having no dependencies of their own).
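To make the fan-out concrete, here is a minimal sketch of which directory listings a single path lookup would depend on when symlinks are expanded from the root. The helper name `scandir_dependencies` is hypothetical and is not part of the engine; it uses only `std::path`, and the empty path stands in for the repository root.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper (not engine code): returns every directory that would
/// need to be listed in order to check each component of `path` for symlinks,
/// ordered from the root downwards.
fn scandir_dependencies(path: &Path) -> Vec<PathBuf> {
    // `ancestors` yields the path itself first and then each parent, so skip
    // the path itself and reverse to put the root-most directory first.
    let mut dirs: Vec<PathBuf> = path
        .ancestors()
        .skip(1)
        .map(Path::to_path_buf)
        .collect();
    dirs.reverse();
    dirs
}
```

Every Snapshot below `src/rust` shares the first entries of this list, which is why a change near the root touches all of them.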

But this case is just one of many cases affected by the current implementation: there are many other times where we re-compute more than we should due to the current Node invalidation strategy.

Solution

Implement node "dirtying", as described on #4558.

There are a few components to this work:

- In addition to being cleared via Entry::clear (which forces a Node to re-run), a Node may be dirtied via Entry::dirty. A "dirty" Node is eligible to be "cleaned" if its dependencies have not changed since it was dirtied.
- Each Node records a Generation value that acts as a proxy for "my output has changed". The Node maintains this value locally: when a Node re-runs for any reason (either due to being dirtied or cleared), it compares its new output value to its old output value to determine whether to increment the Generation.
- Each Node records the Generation values of the dependencies that it used, at the point when it ran. When a dirtied Node is deciding whether to re-run, it compares the recorded Generation values of its dependencies to their current Generation values: if they are equal, the Node can be "cleaned", i.e. its previous value can be used without re-running it.
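The three components above can be sketched roughly as follows. This is a simplified illustration rather than the code in this patch: `EntrySketch`, `complete` and `can_clean` are hypothetical names, generations are plain `u32` values, and the real graph stores this state per entry alongside much more bookkeeping.

```rust
/// Hypothetical stand-in for a node's bookkeeping (not the engine's Entry).
#[derive(Debug)]
struct EntrySketch<T> {
    /// Incremented only when the output value actually changes.
    generation: u32,
    /// The most recent output, kept so that a dirtied node can be cleaned.
    previous_result: Option<T>,
    /// Generations of the dependencies, as observed during the last run.
    dep_generations: Vec<u32>,
}

impl<T: PartialEq> EntrySketch<T> {
    fn new() -> Self {
        EntrySketch {
            generation: 0,
            previous_result: None,
            dep_generations: Vec::new(),
        }
    }

    /// Record the result of a (re-)run, bumping the generation only if the
    /// new output differs from the previous one.
    fn complete(&mut self, output: T, dep_generations: Vec<u32>) {
        if self.previous_result.as_ref() != Some(&output) {
            self.generation += 1;
        }
        self.previous_result = Some(output);
        self.dep_generations = dep_generations;
    }

    /// A dirtied node can be cleaned if it has a previous result and none of
    /// its dependencies' generations have moved since it last ran.
    fn can_clean(&self, current_dep_generations: &[u32]) -> bool {
        self.previous_result.is_some()
            && self.dep_generations.as_slice() == current_dep_generations
    }
}
```

Note that a re-run which produces an equal output leaves the generation alone, so consumers of that node can themselves be cleaned.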

This patch also expands the testing of Graph to differentiate dirtying a Node from clearing it, and confirms that the correct Nodes re-run in each of those cases.

Result

Cleaning all Nodes involved in ./pants list :: after touching pants.ini completes 6 times faster than recomputing them from scratch (56 seconds vs 336 seconds in our repository). More gains are likely possible by implementing the performance improvement(s) described on #6013.

Fixes #4558 and fixes #4394.

stuhood · 2018-07-03T01:00:20Z

This depends on #6061 and #6013, so the reviewable portion starts after the "Remove inaccurate __eq__ implementations." commit.

### Problem

Overriding `__eq__` on `datatype` violates structural equality, and the assumptions that people have about `datatype` instances. Moreover, it is very error-prone in the presence of #6059, which will be using `__eq__` to determine whether objects have changed.

### Solution

Make it impossible to override `__eq__` accidentally on `datatype` by attaching a canary to the `__eq__` method definition on the generated class. Remove a few `__eq__` implementations that violated structural equality and were thus causing issues in #6059.

### Result

`datatypes` behave as expected more frequently, and bugs are avoided. There is no noticeable impact on performance in my testing (likely these were overridden back when _every_ object ended up memoized, rather than just `Key` instances).

stuhood · 2018-07-03T01:52:08Z

Now just depends on #6013: the reviewable portion begins after "Add an invalidation unit test to replace some integration-level tests...". Individual commits should be (relatively) useful.

illicitonion

Looks very nice!

illicitonion · 2018-07-03T10:29:58Z

src/rust/engine/graph/src/lib.rs

+/// generation is recorded on the consuming edge, and can later used to determine whether the
+/// inputs to a node have changed.
+///
+/// Unlike the RunToken (which is incremented whenever a node re-runs), the Generation is only


I'd be tempted to name these RunGeneration and OutputGeneration or something similarly contrasting.

They have very different uses, and aren't really related to one another at that level... so would rather keep the existing separation.

illicitonion · 2018-07-03T10:33:28Z

src/rust/engine/graph/src/lib.rs

+  NotStarted {
+    run_token: RunToken,
+    generation: Generation,
+    previous_result: Option<Result<N::Item, N::Error>>,


This could be a shared-across-instances Arc<Mutex<Vec<Result<N::Item, N::Error>>>> to get more than n-1 re-use of generations, but it's unclear what the usage patterns will actually be, so it's hard to know which would be more efficient. (If people flip back and forth a lot, a Vec saves a lot of computation; if they tend to make linear changes, the Mutex and linear scan may be more costly.)

(An Arc<Mutex<HashMap<Vec<Generation>, Result<N::Item, N::Error>>>> would come to mind, except for how we learnt about the minimum sizes of HashMaps)

As discussed in Slack, I'd like to hold off on storing multiple values in this first edition. We can easily add it later if we're not seeing any unexpected memory pressure due to just holding one.

illicitonion · 2018-07-03T10:39:24Z

src/rust/engine/graph/src/lib.rs

+    let run_token = run_token.next();
+    match entry_key {
+      &EntryKey::Valid(ref n) => {
+        let context2 = context.clone_for(entry_id);


We've generally been using context2 to mean "clone of context which I was required to make because of annoying move semantics"; possibly call this context_for_entry_id?

illicitonion · 2018-07-03T10:46:08Z

src/rust/engine/graph/src/lib.rs

+            // NB: The unwrap here avoids a clone and a comparison: see the method docs.
+            (
+              generation,
+              previous_result.unwrap_or_else(|| {


previous_result.expect("A node cannot be marked clean without a previous result.")?

illicitonion · 2018-07-03T10:47:38Z

src/rust/engine/src/core.rs

@@ -152,7 +160,7 @@ impl fmt::Debug for Value {
  }
 }

-#[derive(Debug, Clone)]
+#[derive(Clone, Debug, Eq, PartialEq)]


I think this can also be Copy?

Throw contains a Value with a python exception in it, so it's good to be explicit about copies.

illicitonion · 2018-07-03T10:49:08Z

src/rust/engine/graph/src/lib.rs

@@ -487,6 +508,7 @@ impl<N: Node> InnerGraph<N> {
    self.nodes.get(node)
  }

+  // TODO: Now that we never delete Entries, we should consider making this infalliable.


We're probably going to want to start deleting entries again at some point for memory reasons, just in a more principled/less aggressive way

illicitonion · 2018-07-03T10:56:58Z

src/rust/engine/graph/src/lib.rs

+    run_token: RunToken,
+    generation: Generation,
+    start_time: Instant,
+    waiters: Vec<oneshot::Sender<Result<(N::Item, Generation), N::Error>>>,


Does this signature suggest that Errors are always a new generation, even if they're the same? Possibly this should be a oneshot::Sender<(Result<N::Item, N::Error>, Generation)> so that identical errors don't trigger downstream invalidations?
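A minimal sketch of the suggestion above, under the assumption that both the item and error types are comparable: when the generation travels outside the `Result`, an `Err` that equals the previous `Err` can keep the old generation. The function `next_generation` and the `u32` alias are hypothetical, not names from this patch.

```rust
/// Hypothetical alias; the patch has its own Generation type.
type Generation = u32;

/// Hypothetical helper: choose the generation to hand to waiters alongside the
/// new result. The whole Result is compared, so an unchanged Err is treated
/// exactly like an unchanged Ok and does not bump the generation.
fn next_generation<T: PartialEq, E: PartialEq>(
    previous: Option<&Result<T, E>>,
    new: &Result<T, E>,
    generation: Generation,
) -> Generation {
    if previous == Some(new) {
        generation
    } else {
        generation + 1
    }
}
```

A waiter would then receive `(Result<T, E>, Generation)` rather than `Result<(T, Generation), E>`.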

illicitonion · 2018-07-03T10:57:17Z

src/rust/engine/graph/src/lib.rs

+              future::ok(()).to_boxed()
+            } else {
+              // The Node needs to (re-)run!
+              let context = context2.clone();


Definitely don't re-use the name context here, as it's actually different from context in the enclosing scope

illicitonion · 2018-07-03T10:58:02Z

src/rust/engine/graph/src/lib.rs

+          .map(move |res| (res, generation))
+          .to_boxed();
+      }
+      _ => {}


Comment in here that we're falling through to the second match?

stuhood · 2018-07-03T19:41:23Z

src/rust/engine/graph/src/lib.rs

 enum EntryState<N: Node> {
-  NotStarted(RunToken),
+  // A node that has either been explicitly cleared, or has not yet started Running for the first
+  // time. In this state there is no need for a dirty bit because the generation is either in its


Note to self: the Generation is not incremented when a node is cleared: only the RunToken.
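The distinction in this note can be sketched as a tiny state machine. This is a simplified illustration with hypothetical names (`EntryStateSketch` and its two variants carry far less than the real `EntryState`), and it assumes that dirtying leaves both counters untouched.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct RunToken(u32);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Generation(u32);

/// Hypothetical, reduced version of a per-entry state.
#[derive(Debug, PartialEq, Eq)]
enum EntryStateSketch {
    NotStarted { run_token: RunToken, generation: Generation },
    Completed { run_token: RunToken, generation: Generation, dirty: bool },
}

impl EntryStateSketch {
    /// Clearing forces a re-run: the RunToken advances, the Generation does not.
    fn clear(self) -> EntryStateSketch {
        let (RunToken(run), generation) = match self {
            EntryStateSketch::NotStarted { run_token, generation } => (run_token, generation),
            EntryStateSketch::Completed { run_token, generation, .. } => (run_token, generation),
        };
        EntryStateSketch::NotStarted { run_token: RunToken(run + 1), generation }
    }

    /// Dirtying only sets the dirty bit (an assumption of this sketch); a node
    /// that has not started has nothing to dirty.
    fn dirty(self) -> EntryStateSketch {
        match self {
            EntryStateSketch::Completed { run_token, generation, .. } => {
                EntryStateSketch::Completed { run_token, generation, dirty: true }
            }
            not_started => not_started,
        }
    }
}
```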

stuhood · 2018-07-03T21:27:12Z

This no longer has dependencies, and is definitely reviewable. Thanks!

stuhood · 2018-07-03T23:31:55Z

Note to self: two bugs to fix here.

- Files that do not exist do not always seem to be invalidated correctly (this is causing the wiki page target failure in travis).
- The invalidate_randomly test can occasionally experience nodes being concurrently invalidated. Likely this calls for the same kind of retry that we use in Scheduler::execute.
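For the second bug, the kind of retry being considered might look roughly like the sketch below. It does not reproduce `Scheduler::execute`; the `Failure` enum, its variants and `retry_while_invalidated` are all hypothetical names chosen for illustration.

```rust
/// Hypothetical failure type: `Invalidated` means the node was invalidated
/// while the request was in flight, so the request is worth repeating.
#[derive(Debug, PartialEq)]
enum Failure {
    Invalidated,
    Other(String),
}

/// Hypothetical helper: call `attempt` at least once, and again for as long as
/// it reports `Invalidated`, up to `max_attempts` calls in total. Any other
/// outcome (success or a different failure) is returned immediately.
fn retry_while_invalidated<T>(
    max_attempts: usize,
    mut attempt: impl FnMut() -> Result<T, Failure>,
) -> Result<T, Failure> {
    let mut attempts = 0;
    loop {
        attempts += 1;
        match attempt() {
            Err(Failure::Invalidated) if attempts < max_attempts => continue,
            other => return other,
        }
    }
}
```

Keeping the loop in the caller, rather than inside the graph itself, leaves the decision about how often to retry with whoever issued the request.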

stuhood force-pushed the stuhood/dirty-nodes branch from 1f2363c to 8e0e2b7 Compare July 2, 2018 03:19

stuhood mentioned this pull request Jul 2, 2018

Remove and prevent inaccurate __eq__ implementations on datatype #6061

Merged

stuhood force-pushed the stuhood/dirty-nodes branch from c1cafb1 to 1494abf Compare July 3, 2018 00:28

stuhood changed the title ~~WIP: Add support for reusing Graph node values if their inputs haven't changed~~ Add support for reusing Graph node values if their inputs haven't changed Jul 3, 2018

stuhood requested review from illicitonion, cosmicexplorer and dotordogh July 3, 2018 01:00

stuhood force-pushed the stuhood/dirty-nodes branch from 1494abf to 23b4051 Compare July 3, 2018 02:19

stuhood mentioned this pull request Jul 3, 2018

Switch to a per-entry state machine in Graph #6013

Merged

illicitonion approved these changes Jul 3, 2018

stuhood commented Jul 3, 2018

stuhood force-pushed the stuhood/dirty-nodes branch from 23b4051 to 73d017c Compare July 3, 2018 20:19

stuhood added 5 commits July 3, 2018 14:19

Record Generation values per Node, and increment them by comparing …

902e3a9

…a Node's result to a stored `previous_result`.

Make invalidation test more useful by having the context record which…

f9bfec0

… Nodes ran, and having Nodes include a Context.id in their output.

Record dependency Generations when we complete a Node, for comparison…

e01ed55

… after we dirty a Node. TODO: Storing these in a Vec on the state is potentially less efficient than storing them on the edges in the graph, since that avoids a bunch of extra allocation.

Implement cleaning of Nodes by comparing the stored generation values…

074141a

… to newly computed generations.

Add a concurrent invalidation test, and a validation function for nod…

c42e376

…e outputs.

stuhood force-pushed the stuhood/dirty-nodes branch from 73d017c to c42e376 Compare July 3, 2018 21:21

Add an invalidate_all method to invalidate all filesystem nodes. Ea…

9d090ac

…gerly prune edges for cleared entries, and lazily prune edges for dirtied entries.

stuhood force-pushed the stuhood/dirty-nodes branch from f2f0a76 to 9d090ac Compare July 6, 2018 19:02

stuhood added 2 commits July 6, 2018 13:30

Retry create while concurrent invalidations are occurring. The Sche…

82ea6b8

…duler already accounts for this, and it seems reasonable to not bake retry into the Graph.

Review feedback.

4a3befe

stuhood mentioned this pull request Jul 6, 2018

Move each EntryState in the graph under its own Mutex #6074

Closed

stuhood merged commit 14a3f94 into pantsbuild:master Jul 6, 2018

stuhood deleted the stuhood/dirty-nodes branch July 6, 2018 22:33

stuhood mentioned this pull request Jul 17, 2018

Implement support for @rules that can directly request Subsystems #5788

Closed

stuhood mentioned this pull request Sep 18, 2018

Prepare the 1.9.0 release. #6514

Merged
