Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port nailgun client to rust #10865

Merged
merged 13 commits into from
Oct 7, 2020
Merged

Port nailgun client to rust #10865

merged 13 commits into from
Oct 7, 2020

Conversation

gshuflin
Copy link
Contributor

@gshuflin gshuflin commented Sep 25, 2020

This commit ports the nailgun client used in remote_pants_runner.py from Python to Rust, using the nails crate. This is in preparation for using the Rust nailgun library's heartbeat functionality to replace the current UNIX signal-based mechanism for the client to inform pantsd that a given pants run has been stopped, although this commit does not yet remove that signal-based code.

Removing the Python version of the client allows a lot of Python code around the client side of the nailgun protocol to be removed, and also results in the removal of the logic for retrying connecting to the server on failure. Since a lot of this code is old, and predates the current design of pantsd, I think it makes sense to remove this code now, and add back any retry logic we see as still necessary in future commits.

@coveralls
Copy link

coveralls commented Sep 25, 2020

Coverage Status

Coverage remained the same at 0.0% when pulling 2f98874 on gshuflin:heartbeat into 6eea7c4 on pantsbuild:master.

@gshuflin gshuflin changed the title WIP initial rust-py types WIP Port nailgun client to rust Sep 27, 2020
@gshuflin gshuflin force-pushed the heartbeat branch 6 times, most recently from 398d719 to 3a77186 Compare October 2, 2020 01:20
@gshuflin gshuflin changed the title WIP Port nailgun client to rust Port nailgun client to rust Oct 2, 2020
@gshuflin gshuflin marked this pull request as ready for review October 2, 2020 01:26
self_process = psutil.Process()
children = self_process.children()
logger.debug(f"Sending SIGINT to child processes: {children}")
logger.warning(f"Sending SIGINT to child processes: {children}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to replace SIGINT with the _received_signal or signal.strsignal(_received_signal) or signal.getsignal(_received_signal).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

signal.strsignal apparently only exists after python 3.8

self_process = psutil.Process()
children = self_process.children()
logger.debug(f"Sending SIGINT to child processes: {children}")
logger.warning(f"Sending SIGINT to child processes: {children}")
for child_process in children:
child_process.send_signal(signal.SIGINT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/signal.SIGINT/_received_signal/

def __init__(self, pailgun_client: NailgunClient, pid: int, timeout: float = 1):
self._pailgun_client = pailgun_client
self._timeout = timeout
def __init__(self, pid: int):
self.pid = pid
super().__init__(pantsd_instance=False)

def _forward_signal_with_timeout(self, signum, signame):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to rename this helper to _forward_signal since it no longer deals with timeouts.

def _backoff(attempt):
"""Minimal backoff strategy for daemon restarts."""
time.sleep(attempt + (attempt - 1))

def run(self) -> ExitCode:
"""Runs pants remotely with retry and recovery for nascent executions."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe lop off the second half of the doc sentence.

pantsd_signal_handler = PailgunClientSignalHandler(pid=pid)

with ExceptionSink.trapped_signals(pantsd_signal_handler), STTYSettings.preserved():
return cast(int, rust_nailgun_client.execute(signal_fn, command, args, modified_env))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to stick to the alias and cast to ExitCode.

Copy link
Sponsor Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @gshuflin !

@@ -27,6 +27,8 @@
// Arc<Mutex> can be more clear than needing to grok Orderings:
#![allow(clippy::mutex_atomic)]
#![type_length_limit = "2058438"]
#![allow(dead_code)]
#![allow(unused_variables)]

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is already pretty dense: it would be good to split it into client.rs and server.rs probably, loaded by a mostly empty lib.rs.

Comment on lines 30 to 31
#![allow(dead_code)]
#![allow(unused_variables)]
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be removed in a few different files where they were added.

for child_process in children:
child_process.send_signal(signal.SIGINT)

def handle_sigint(self, signum: int, _frame):
logger.warning("Calling handle_sigint in pantsd")
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any process that loads the ExceptionSink will end up with this handler installed, so this message might not always be accurate.

reason=KeyboardInterrupt("Sending user interrupt to pantsd"),
)
self._pailgun_client.maybe_send_signal(signum)
ExceptionSink._signal_sent = signum
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than reaching out to do this explicitly, should this handler be calling super() or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think it's a mistake for PailgunClientSignalHandler and SignalHandler to have a subclass-superclass relationship actually. If we were going to keep this code long term I would probably want to rewrite SignalHandler as an "interface"-like superclass and have the remote and non-remote cases subclass it separately.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. You'll probably want to expose some other API to set the signal then, rather than accessing the private field.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also recommend adding a comment docstring to ExceptionSink._signal_sent, as we appear to be depending on it here, and it appears to be the only class-level field that does not have a comment to document it.

Comment on lines 638 to 641
deprecation_start_version="",
help='["DEPRECATED: the pailgun client has been rewritten to no longer use this"].'
"The length of time (in seconds) to wait for further output after sending a "
"signal to the remote pantsd process before killing it.",
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
deprecation_start_version="",
help='["DEPRECATED: the pailgun client has been rewritten to no longer use this"].'
"The length of time (in seconds) to wait for further output after sending a "
"signal to the remote pantsd process before killing it.",
removal_version="2.1.0.dev0",
removal_hint="the pailgun client has been rewritten to no longer use this",
help="The length of time (in seconds) to wait for further output after sending a "
"signal to the remote pantsd process before killing it.",

Comment on lines 149 to 150
Either::Right((exited, execution_result_future)) => {
Err("Exiting nailgun client future via explicit quit message".to_string())
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is handling the case where ExceptionSink._signal_sent ends up set?

This is an interesting situation, and rolls back to what we had discussed about replacing the weird python nailgun "timeout" thing with "send SIGINT to the server the first time someone ctrl+c's and SIGTERM the second time". I think that the exact mechanism that might be good here is:

  1. when the client receives the first SIGINT, it should send SIGINT to the server, but not kill itself... this will allow it to continue to receive data as the server shuts down (and in a heartbeat future, for it to drain the socket after the server shuts down cleanly when heartbeats stop arriving)
  2. when the client receives the second SIGINT, it should SIGTERM to the server, and then set this exit condition (and in a heartbeat future, only kill itself without waiting for the server to notice, and rely on the client going away causing the server to tear down).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense, but I'd rather implement it as part of the commit that implements heartbeats, in the interest of getting this (now rather lengthy) PR merged.

@@ -33,6 +33,8 @@
clippy::transmute_ptr_to_ptr,
clippy::zero_ptr
)]
#![allow(dead_code)]
#![allow(unused_variables)]
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, here and elsewhere.

data executor: PyExecutor;
data port: u16;

def execute(&self, py_signal_fn: PyObject, command: PyString, args: PyList, env: PyDict) -> CPyResult<PyInt> {
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cpython crate can do more complicated argument transforms (via FromPyObject: note the implementations), so I think that this signature can be at least:

Suggested change
def execute(&self, py_signal_fn: PyObject, command: PyString, args: PyList, env: PyDict) -> CPyResult<PyInt> {
def execute(&self, py_signal_fn: PyObject, command: String, args: Vec<String>, env: PyDict) -> CPyResult<PyInt> {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok: then regarding my comment here: #10865 (comment), it sounds like the massaging into rust types is already done, and you can probably ignore that comment safely.

let output = loop {
let event = receiver.recv_timeout(timeout);
if let Some(_termination) = maybe_break_execution_loop(&python_signal_fn) {
break Err("Quitting becuase of explicit interrupt".to_string());
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/becuase/because/

Comment on lines +624 to +631
let _spawned_fut = executor.spawn(async move {
let exit_code = nailgun_fut.await;
let _ = sender.send(exit_code);
});
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should not need to spawn a task. If you got an error like "not running on the tokio runtime", you should "enter" the runtime (which sets some thread-local state that tokio uses to find the runtime):

let result = executor.enter(|| {
  // ... do some stuff on the runtime 
})

But also, using executor.block_on should enter the runtime... so I think that you might be able to replace most of this function body with:

let exit_code = executor.block_on(nailgun_fut);

Copy link
Contributor

@cosmicexplorer cosmicexplorer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a few comments asking about how we know that some futures won't be leaked, and a few asking for a TODO to wrap up the tty/no tty complexity into an enum. Feel free to answer or not, and to add TODOs or not, at your own discretion! ^_^

I love this work, it looks great, and the above comments were mostly trying to make it easier for me or others to work with and extend it once we have our great rust client! None of them should be considered a blocking change.

@@ -59,14 +59,16 @@ def _toggle_ignoring_sigint(self, toggle: bool) -> None:
with self._ignore_sigint_lock:
self._ignoring_sigint = toggle

def handle_sigint(self, signum: int, _frame):
def _send_signal_to_children(self, received_signal: int) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have a tiny docstring here? Something like:

def _send_signal_to_children(self, received_signal: int) -> None:
  """Send a signal to any children of this process in order.

  Pants may have spawned 0, 1, or multiple subprocesses via Python or Rust.
  Upon receiving a signal, this method is invoked to propagate the signal to all children,
  regardless of how they were spawned.
  """

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _send_signal_to_children(self, received_signal: int) -> None:
def _send_signal_to_children(self, signum: int, signame: str) -> None:

This would allow us to include the direct signal name in the debug message, instead of having to look it up.

@@ -635,6 +635,8 @@ def register_bootstrap_options(cls, register):
advanced=True,
type=float,
default=5.0,
removal_version="2.1.0.dev0",
removal_hint="The pailgun client has been rewritten to no longer use this",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
removal_hint="The pailgun client has been rewritten to no longer use this",
removal_hint="The pailgun client now uses a more robust API which avoids the need for any timeout logic.",

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we could also remove the TODO(#7514)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll have to reword this again since cf. stu's above comment we do need some kind of timeout logic after all.

self_process = psutil.Process()
children = self_process.children()
logger.debug(f"Sending SIGINT to child processes: {children}")
logger.debug(f"Sending signal number {received_signal} to child processes: {children}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.debug(f"Sending signal number {received_signal} to child processes: {children}")
logger.debug(f"Sending signal {signame} ({signum}) to child processes: {children}")

ExceptionSink._signal_sent = signum
self._send_signal_to_children(signum)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._send_signal_to_children(signum)
self._send_signal_to_children(signum, 'SIGINT')

For example.

reason=KeyboardInterrupt("Sending user interrupt to pantsd"),
)
self._pailgun_client.maybe_send_signal(signum)
ExceptionSink._signal_sent = signum
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also recommend adding a comment docstring to ExceptionSink._signal_sent, as we appear to be depending on it here, and it appears to be the only class-level field that does not have a comment to document it.

}

///
/// Corresponds to `ttynames_to_env` in `nailgun_protocol.py`. See this struct's rustdocs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does "corresponds" mean that it reads the env vars which are set in that method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea, I'm just moving this code around. I'd prefer to keep any proposed changes to server.rs out of scope for this PR, since this is just moving code that used to exist in lib.rs into a separate file.

}
};
debug!("Sending message to nailgun client task to exit.");
match exit_sender.send(()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be useful to add a TODO which wraps up the exit sending here. From looking at the code, it's not immediately clear to me what exit_sender.send()) does compared to seeing sender.send(exit_code) above. Maybe just comments or renaming these variables might be helpful, but it's confusing to see the exit_sender not sending the exit_code, for example.

@@ -715,7 +810,7 @@ fn nailgun_server_await_shutdown(
executor_ptr: PyExecutor,
nailgun_server_ptr: PyNailgunServer,
) -> PyUnitResult {
with_executor(py, executor_ptr, |executor| {
with_executor(py, &executor_ptr, |executor| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Thank you for looking into this!

@@ -29,7 +29,7 @@ def test_pantsd_file_logging(self) -> None:
["--backend-packages=pants.backend.python", "list", "3rdparty::"]
)
ctx.checker.assert_started()
assert "[DEBUG] connecting to pantsd on port" in daemon_run.stderr_data
assert "[DEBUG] Connecting to pantsd on port" in daemon_run.stderr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good change! Thanks!

data executor: PyExecutor;
data port: u16;

def execute(&self, py_signal_fn: PyObject, command: String, args: Vec<String>, env: PyDict) -> CPyResult<PyInt> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider moving the body of this method into a private function just above, e.g.

fn _execute_nailgun_client(client: &PyNailgunClient, py_signal_fn: Value, command: String, args: Vec<String>, env: Vec<String, String>) -> i64 { ... }

The benefit of that is that it would decouple the "massage cpython objects into rust" part from the nailgun execution. Since this is within a macro definition, it seems like there is benefit to reducing the amount of logic defined inline. Totally subjective.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably fine, see Stu's comment on argument transforms from cpython here: #10865 (comment).

@gshuflin gshuflin force-pushed the heartbeat branch 4 times, most recently from 5a639ca to d777079 Compare October 7, 2020 06:49
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
[ci skip-build-wheels]
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
[ci skip-build-wheels]
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
[ci skip-build-wheels]
@gshuflin gshuflin merged commit ee50791 into pantsbuild:master Oct 7, 2020
@gshuflin gshuflin deleted the heartbeat branch October 7, 2020 19:27
gshuflin added a commit to gshuflin/pants that referenced this pull request Oct 8, 2020
This reverts commit ee50791.

[ci skip-build-wheels]
gshuflin added a commit that referenced this pull request Oct 8, 2020
Commit ee50791, which ported the pantsd client nailgun functionality from python to rust, has a bug in it that makes inputting text in `./pants repl` unusably laggy. This commit should be reverted until a solution for that bug is found.
@Eric-Arellano Eric-Arellano mentioned this pull request Oct 11, 2020
Eric-Arellano added a commit that referenced this pull request Oct 12, 2020
Internal only changes left off from the changelog:

* Use cpython types in Rust functions that manipulate python objects (#10942)
  `PR #10942 <https://github.com/pantsbuild/pants/pull/10942>`_

* update libz-sys version to fix macOS compile error (#10941)
  `PR #10941 <https://github.com/pantsbuild/pants/pull/10941>`_

* Upgrade to Rust stable 1.47.0. (#10933)
  `PR #10933 <https://github.com/pantsbuild/pants/pull/10933>`_

* Finish CreateDigest Directory cleanup. (#10935)
  `PR #10935 <https://github.com/pantsbuild/pants/pull/10935>`_

* Hotfix broken import from merge conflict (#10934)
  `PR #10934 <https://github.com/pantsbuild/pants/pull/10934>`_

* Revert "Port nailgun client to rust (#10865)" (#10929)
  `PR #10929 <https://github.com/pantsbuild/pants/pull/10929>`_

* An ExternalTool for downloading the grpc_python_plugin. (#10927)
  `PR #10927 <https://github.com/pantsbuild/pants/pull/10927>`_

* Port nailgun client to rust (#10865)
  `PR #10865 <https://github.com/pantsbuild/pants/pull/10865>`_

* print stacktraces during import errors (#10906)
  `PR #10906 <https://github.com/pantsbuild/pants/pull/10906>`_

* fs.Digest is declared in Rust (#10905)
  `PR #10905 <https://github.com/pantsbuild/pants/pull/10905>`_

* add requests_ca_bundle to settable_env_vars (#10909)
  `PR #10909 <https://github.com/pantsbuild/pants/pull/10909>`_

[ci skip-rust]
stuhood added a commit to stuhood/pants that referenced this pull request Nov 11, 2020
stuhood added a commit that referenced this pull request Nov 12, 2020
### Problem

#10865 previously landed to port Pants' nailgun client to Rust. It was reverted in #10929 due to an issue with TTY access, where (in particular), the `repl` goal was mostly unresponsive.

### Solution

The unresponsive `repl` was due to a bug in the `nails` library, where `stdin` was being consumed eagerly regardless of whether the server signaled that it would like to receive `stdin`. Pants sends an environment variable to the server that indicates which TTY the client is connected to, and the server will directly connect to that TTY if it can. When the server directly connects to the client's TTY, it does not accept `stdin`, but since `stdin` was read eagerly by the `nails` client (and ending up stuck in a buffer, since the server would not request it), the result was two different processes reading `stdin` from the TTY: the client, and the server.

Bump to `nails` `0.7.0`, which [makes `stdin` initialization lazy](stuhood/nails@6b8c19a).
@stuhood stuhood mentioned this pull request Nov 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants