Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pick notify watcher series #9741

Merged

Conversation

hrfuller
Copy link
Member

@hrfuller hrfuller commented May 11, 2020

Problem

To release notify watcher changes we need to pick the notify watcher series from upstream.

Solution

Cherry pick the notify watcher change series.

Result

Running ./pants --v2-ui --v2 --no-v1 --experimental-fs-watcher --no-watchman-enable --enable-pantsd fmt2 src/python/pants/:: works.

stuhood and others added 10 commits May 5, 2020 14:49
…ntsbuild#9071)

We're on an older version of tokio, which doesn't smoothly support usage of async/await.

Switch to tokio 0.2, which supports directly spawning and awaiting (via its macros) stdlib futures, which is an important step toward being able to utilize async/await more broadly. Additionally, port the `fs` and `task_executor` crates to stdlib futures.

Finally, transitively fixup for the new APIs: in particular, since both `task_executor` and `tokio` now consume stdlib futures to spawn tasks, we switch all relevant tests and main methods to use the `tokio::main` and `tokio::test` macros, which annotate async methods and spawn a runtime to allow for `await`ing futures inline.

Progress toward more usage of async/await!
* Use the notify crate to implement an `InvalidationWatcher` for Graph operations.

* Make watch async, and watcher log to pantsd.log.
Relativize paths returned from notify to the build_root.
Refactor invalidate method to be an associated method on the
InvalidationWatcher.

* Respond to feedback.
* Use spawn on io pool instead of custom future impl
* Write python fs tests
* Relativize paths to invalidate to build root
* invalidate nodes with parent paths.
* Comments

* Add rust tests.
Make some things public so we can use them in tests.
Use canonical path to build root for relativizing changed paths.

* Refactor Python tests.
Return watch errors as core::Failure all the way to user.
Move task executor onto invalidation watcher.
Move test_support trait impl into test_support mod.

* use futures lock on watcher

* Platform specific watching behavior. On Darwin recursively watch the
build root at startup. On Linux watch individual directory roots.

Co-authored-by: Stu Hood <stuhood@gmail.com>
* Create a git ignorer on the context object. Adjust all call sites which
create a posix fs to pass in an ignorer.

* Ignore fsevents from files that match pants_ignore patterns.

* Always pass is_dir = false to ignorer to avoid stat-ing every path the
event watch thread sees.
…tsbuild#9318 (pantsbuild#9416)

* Add a feature gate to disable the engine fs watcher introduced in pantsbuild#9318
by default, to mitigate issues seen in pantsbuild#9415 until a fix is in place.
…sbuild#9452)

* Don't rerun uncachable nodes if they are dirtied while running.
- Retry dependencies of uncacheable nodes a few times to get a result
  until we are exhausted from trying too many times.
- Bubble uncacheable node retry errors up to the user, tell them things
  were chaning to much.
- Don't propagate dirtiness past uncacheable nodes when invalidating
  from changed roots. Otherwise dirty dependents of uncacheable nodes
  will need to re-run.

* enable the engine fs watcher by default, now that it won't cause issues.
Remove execution option override from tests.

* use reference to self in stop_walk_predicate closure

* invalidate often enough that test doesn't flake
…antsbuild#9487)

* add --watchman-enable flag
* disable watchman when flag is false
* Don't wait for the initial watchman event if we aren't using watchman.
* check invalidation watcher liveness from scheduler service
The `watch` module directly accesses the `engine` crate's `Graph`, which makes it more challenging to test.

Extract a `watch` crate which is used via an `trait Invalidatable` which is implemented for the engine's `Graph`, as well as independently in tests.

[ci skip-jvm-tests]
Both the `Graph` and the `Scheduler` implemented retry for requested Nodes, but the `Scheduler` implementation was pre-async-await and much more complicated.

Unify the retry implementations into `Graph::get` (either roots or uncacheable nodes are retried), and simplify the `Scheduler`'s loop down to:
```
let maybe_display_handle = Self::maybe_display_initialize(&session);
let result = loop {
  if let Ok(res) = receiver.recv_timeout(refresh_interval) {
    break Ok(res);
  } else if let Err(e) = Self::maybe_display_render(&session, &mut tasks) {
    break Err(e);
  }
};
Self::maybe_display_teardown(session, maybe_display_handle);
result
```

A single, more modern retry implementation (thanks @hrfuller!), and a cleaner `Scheduler::execute` loop.
A few different kinds of file watching span the boundary between the `SchedulerService` and `FSEventService`:
1. pantsd invalidation globs - how `pantsd` detects that its implementing code or config has changed
2. pidfile - watches `pantsd`'s pidfile to ensure that the daemon exits if it loses exclusivity
3. graph invalidation - any files changing in the workspace should invalidate the engine's `Graph`
4. `--loop` - implemented directly in the `SchedulerService`

Because of the semi-cyclic nature of the relationship between the `SchedulerService` and `FSEventService`, it's challenging to understand the interplay of these usecases. And, unsurprisingly, that lead to the `notify` crate implementation only satisfying one of them.

The fundamental change in this PR is to add support for two new parameters to engine executions which are implemented by the `Graph`:
* `poll: bool` - When `poll` is enabled, a `product_request` will wait for the requested Nodes to have changed from their last-observed values before returning. When a poll waits, an optional `poll_delay` is applied before it returns to "debounce" polls.
* `timeout: Optional[Duration]` - When a `timeout` is set, a `product_request` will wait up to the given duration for the requested Node(s) to complete (including any time `poll`ing).

These features are then used by:
* `--loop`: uses `poll` (with a `poll_delay`, but without a `timeout`) to immediately re-run a `Goal` when its inputs have changed.
* invalidation globs and pidfile watching: use `poll` (with no `poll_delay`) and `timeout` to block their `SchedulerService` thread and watch for changes to those files.

The `FSEventService` and `SchedulerService` are decoupled, and each now interacts only with the `Scheduler`: `FSEventService` to push `watchman` events to the `Graph`, and the `SchedulerService` to pull invalidation information from the `Graph`.

Because all events now flow through the `Graph`, the `notify` crate has reached feature parity with `watchman`.

In followup changes we can remove the experimental flag, disable `watchman` (and thus the `FSEventService`) by default, and remove the dependency between `--loop` and `pantsd`.
@hrfuller hrfuller changed the title Hfuller/pick notify watcher series Pick notify watcher series May 11, 2020
@hrfuller hrfuller requested a review from blorente May 11, 2020 18:56
@hrfuller
Copy link
Member Author

Review methodology:
@wisechengyi : leading 1.25.x backport effort.
@pierrechevalier83 : General rust knowledge de-truck factoring notify watcher.
@blorente : watchman and pantsd.

Copy link
Contributor

@blorente blorente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed the code that is changed in the Daemon and the new watch crate. The daemon code looks good, I'm fairly confident in it, but another pair of eyes on the watch crate would be great.

Thanks for picking this, good work!

build-support/githooks/pre-commit Show resolved Hide resolved
@@ -413,6 +414,46 @@ def _write_named_sockets(self, socket_map):
for socket_name, socket_info in socket_map.items():
self.write_named_socket(socket_name, socket_info)

def _initialize_pid(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm so so happy for this change, this will bring so much stability :D

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think these changes are really great.

Henry Fuller and others added 5 commits May 13, 2020 12:07
…ckend was added, but which will now fairly consistently lose that race.

# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
…er/pants into hfuller/pick-notify-watcher-series
Copy link
Contributor

@pierrechevalier83 pierrechevalier83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WIP. Sharing comments I have so far

@@ -29,10 +29,12 @@ use hashing;

use petgraph;

mod entry;
// make the entry module public for testing purposes. We use it to contruct mock
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: contruct -> construct

@@ -29,10 +29,12 @@ use hashing;

use petgraph;

mod entry;
// make the entry module public for testing purposes. We use it to contruct mock
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a general guideline, I'm a bit opposed to changing an API for the purpose of testing. An API should be able to expose what logically makes sense for a user and test code should emulate a user. That being said, there are practicalities which do need some implementation details leaking, and mocking is one of them; so I'm willing to accept it as a compromise if the tests enabled by this carry their weight. Will comment on the test site if I don't think they do. If I add no comment, consider this a general rambly discussion point and feel free to ignore it 😄

// to use the extra test functions, and they should only be imported into
// test modules.
pub mod test_support {
use super::{EntryId, EntryState, Graph, Node};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I like this approach a much more 😄 : you force the user to declare they want this functionality for tests.
Do you think it would be possible to move the pub mod entry to inside this module and maybe instead of going from pub (crate) to pub, go to pub (in test_support)?


let http_client = reqwest::Client::new();
let rule_graph = RuleGraph::new(tasks.as_map(), root_subject_types);

Ok(Core {
graph: Graph::new(),
graph: graph,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm kind of being clippy, but can't it be omitted here?

///
/// TODO: Need the above polling
///
/// TODO: To simplify testing the InvalidationWatcher we could create a trait which
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This indirectly answers my question on exposing parts of the graph entry publicly. When we drop support for watchman, it should be possible to tighten the visibility of these functions and types too. It may be worth mentioning it in this comment so that when this TODO gets enacted, the implementer remember to tighten these APIs again.

@@ -49,6 +50,7 @@ impl InvalidationWatcher {
executor: Executor,
build_root: PathBuf,
ignorer: Arc<GitignoreStyleExcludes>,
enabled: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels to me that creating a disabled InvalidationWatcher is a weird pattern. Would it be possible to replace this logic with the caller holding an Option<InvalidationWatcher> and avoid the need for any change to this class?

@@ -287,12 +291,18 @@ impl<N: Node> InnerGraph<N> {
/// The Walk will iterate over all nodes that descend from the roots in the direction of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth updating the do comment to add: "until meeting a node that matches the stop_walking_predicate"

type Item = EntryId;

fn next(&mut self) -> Option<Self::Item> {
while let Some(id) = self.deque.pop_front() {
if !self.walked.insert(id) {
// Visit this node and it neighbors if this node has not yet be visited and we aren't
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: be -> been

// Visit this node and it neighbors if this node has not yet be visited and we aren't
// stopping our walk at this node, based on if it satifies the stop_walking_predicate.
// This mechanism gives us a way to selectively dirty parts of the graph respecting node boundaries
// like uncacheable nodes, which sholdn't be dirtied.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: sholdn't -> shouldn't

src/python/pants/option/global_options.py Show resolved Hide resolved
…el BUILD files and hope that pants does not notice them before we scan the directory again!

# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
@hrfuller
Copy link
Member Author

Ok @wisechengyi This is as green as it will get!

Copy link
Contributor

@pierrechevalier83 pierrechevalier83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left comments as I reviewed the code, but no blocker. Happy with this change. It's quite the heavy lifting that you and Stu did there! Thanks both for your work on this 😄

if generation == last_seen_generation && result.poll_should_wait(context) {
// The Node is currently clean with the observed generation: add a poller on the
// Completed node that will be notified when it is dirtied or dropped. If the Node moves
// to another state, the received will be notified that the sender was dropped, and it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: received -> receiver


// But polling with the previous token should wait, since nothing has changed.
let request = graph.poll(TNode::new(2), Some(token2), None, &context);
match timeout(Duration::from_millis(1000), request).await {
Copy link
Contributor

@pierrechevalier83 pierrechevalier83 May 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I right in interpreting this as this test will actually wait a full second before convincing itself that the polling timeout? Does it need to be that high? The drawback is that this test won't be responsive. What poses a lower bound? I'm guessing the polling interval of 100 ms does pose one and we want to allow a few polling intervals. Would it even be worth making that configurable and using a low value in the context of tests?


// Polling with the previous token (in the same session) should wait, since nothing has changed.
let request = graph.poll(TNode::new(2), Some(token1), None, &context);
match timeout(Duration::from_millis(1000), request).await {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment on full second wait. These add up pretty fast.

@@ -23,7 +23,7 @@ use tokio::task_local;
use ui::EngineDisplay;
use uuid::Uuid;

const TIME_FORMAT_STR: &str = "%H:%M:%S";
const TIME_FORMAT_STR: &str = "%H:%M:%S:%3f";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you comment to explain what this format means. I don't know by heart and it may help other readers too.

@hrfuller hrfuller merged commit c786629 into pantsbuild:1.25.x-twtr May 15, 2020
@hrfuller hrfuller deleted the hfuller/pick-notify-watcher-series branch May 15, 2020 17:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants