
[core/swarm] Graceful shutdown for connections, networks and swarms. #1682

Closed
wants to merge 23 commits

Conversation

romanb (Contributor) commented Jul 29, 2020

This is a proposal (at least for discussion) of enhanced graceful shutdown options for Connections, Networks and Swarms that builds on #1619. Even if #1619 is ultimately not merged, this may at least serve as a thread for discussing graceful shutdowns and their usefulness for libp2p. For a proper diff against #1619 see here.

Motivation

In the context of network I/O, a graceful shutdown is a form of shutdown that allows ongoing network traffic to cease in a controlled manner and at chosen boundaries (e.g. request/response boundaries), while at the same time rejecting new I/O streams (e.g. connections or substreams). The main purpose of a graceful shutdown of a networked application is to minimise friction on restarts, incremental rollouts of updates, etc., that is, to minimise the number of failed requests, responses or other errors resulting from sudden interruption of ongoing network I/O at an arbitrary point. The more often networked nodes get restarted or redeployed, possibly as a result of frequent updates, the more errors and "hiccups" appear in the network traffic without graceful shutdowns. Graceful shutdowns usually also include redirecting traffic away from the nodes being shut down, which can sometimes be assisted by proxies or load balancers operating at network layers for more classical client-server deployments, but only the end-to-end applications and protocols typically know what needs to be done for a clean shutdown that avoids unnecessary errors.

How it works

This PR builds on the ability to wait for connection shutdown to complete, introduced in #1619, and extends the ability to perform graceful shutdowns in the following ways:

  1. The ConnectionHandler (and thus also ProtocolsHandler) can participate in the shutdown, via new poll_close methods. The muxer and underlying transport connection only start closing once the connection handler signals readiness to do so. The default implementations of ConnectionHandler::poll_close always signal readiness to immediately proceed with the transport-level shutdown.

  2. A Network can be gracefully shut down via Network::close or Network::start_close, which involves a graceful shutdown of the underlying connection Pool. The Pool in turn proceeds with a shutdown by rejecting new connections while draining established connections. Draining established connections just means waiting for them to close, i.e. waiting for ConnectionHandler::poll_close followed by Muxer::poll_close to complete.

  3. A Swarm can be gracefully shut down via Swarm::close or Swarm::start_close, which involves a graceful shutdown of the underlying Network in tandem with the NetworkBehaviour, until the Network is closed and the NetworkBehaviour is ready for shutdown (see the usage sketch below).
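As a rough usage sketch of points 2 and 3 (the exact signatures of Swarm::start_close and of the Option-returning poll are only described above, so the details below are assumptions, not the final API), a node could drive its Swarm to a clean shutdown like this:

```rust
use futures::StreamExt;
use libp2p::swarm::{NetworkBehaviour, Swarm};

// Sketch only: `start_close()` and the stream being exhausted (`None`) once
// the shutdown has completed are the behaviours proposed above; the generic
// bounds and event handling are illustrative assumptions.
async fn shut_down_gracefully<B: NetworkBehaviour>(mut swarm: Swarm<B>) {
    // Stop accepting new connections and begin draining established ones:
    // each handler's `poll_close` is driven to completion before the muxer
    // and transport for that connection are closed.
    swarm.start_close();

    // Keep driving the swarm; events can still be emitted while draining.
    while let Some(_event) = swarm.next().await {}

    // The stream is exhausted: the Network is closed and the
    // NetworkBehaviour has signalled readiness for shutdown.
}
```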

Important Details

  • Analogous to how new inbound and outbound connections are rejected during shutdown, a single connection that is shutting down rejects new inbound substreams and, by the return type of ConnectionHandler::poll_close, no new outbound substreams can be requested.

  • The NodeHandlerWrapper managing the ProtocolsHandler always waits for already ongoing inbound and outbound substream upgrades to complete. Since the NodeHandlerWrapper is a ConnectionHandler, the previous point applies w.r.t. new inbound and outbound substreams.

  • While the connection handler is closing, it can still emit and receive events (e.g. to/from a NetworkBehaviour), but as mentioned above no longer receives new inbound substreams and can no longer request new outbound substreams.

  • When the connection_keep_alive expires, a graceful shutdown is initiated, rather than a sudden close via a keep-alive "error" as is done now.

  • Connection::poll, Network::poll and Swarm::poll now return the event in an Option, since None signals termination as a result of a clean shutdown, i.e. in the manner of a Stream that is exhausted.

  • I added ConnectionHandlerEvent::Close to allow a ConnectionHandler to request a graceful close of the connection (as opposed to returning an error, which will close the connection without delay).

  • I added NetworkBehaviourAction::CloseConnection and NetworkBehaviourAction::DisconnectPeer to allow a NetworkBehaviour to request a connection to be closed (gracefully) or for a peer to be disconnected (immediately, i.e. not gracefully), respectively (a rough sketch follows this list).
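To illustrate the last two points, a NetworkBehaviour might request these actions roughly as follows. This is a sketch only: the payloads of the proposed NetworkBehaviourAction::CloseConnection and NetworkBehaviourAction::DisconnectPeer variants are not spelled out above, so stand-in types are used here instead of the real ones.

```rust
use std::collections::VecDeque;
use std::task::{Context, Poll};

use libp2p::PeerId;

// Stand-ins for the proposed actions; the real variants live on
// `NetworkBehaviourAction` and their payloads may differ.
enum Action {
    // Gracefully close a single connection (the handler's `poll_close` runs first).
    CloseConnection { peer_id: PeerId, connection_id: usize },
    // Immediately disconnect all of a peer's connections (not graceful).
    DisconnectPeer { peer_id: PeerId },
}

struct ShutdownDecisions {
    graceful_closes: VecDeque<(PeerId, usize)>,
    banned_peers: VecDeque<PeerId>,
}

impl ShutdownDecisions {
    // Called from the behaviour's `poll`: emit pending actions, if any.
    fn poll(&mut self, _cx: &mut Context<'_>) -> Poll<Action> {
        if let Some(peer_id) = self.banned_peers.pop_front() {
            return Poll::Ready(Action::DisconnectPeer { peer_id });
        }
        if let Some((peer_id, connection_id)) = self.graceful_closes.pop_front() {
            return Poll::Ready(Action::CloseConnection { peer_id, connection_id });
        }
        Poll::Pending
    }
}
```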

Usage Example

For an example of how a connection handler can make use of the ability to participate in graceful connection shutdown, I've included an implementation of ProtocolsHandler::poll_close for the RequestResponseHandler in libp2p-request-response. In the case of libp2p-request-response, these changes amount to allowing already ongoing requests to complete when a connection closes, while redirecting new requests to other connections.
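To give a feel for the shape of such an implementation (a simplified sketch with stand-in types, not the actual RequestResponseHandler code or the final poll_close signature), a request/response-style handler could drain its in-flight requests before allowing the transport-level close:

```rust
use std::task::{Context, Poll};

// Illustrative stand-ins for the handler's internal state; the real
// `RequestResponseHandler` fields and event types differ.
struct InflightRequest;
struct PendingEvent;

struct RequestResponseHandlerSketch {
    inflight: Vec<InflightRequest>,
    pending_events: Vec<PendingEvent>,
}

impl RequestResponseHandlerSketch {
    /// Assumed shape of `ProtocolsHandler::poll_close`: return `Pending`
    /// while work remains, optionally emit final events, and return
    /// `Ready(None)` once the transport-level close may proceed.
    fn poll_close(&mut self, _cx: &mut Context<'_>) -> Poll<Option<PendingEvent>> {
        // Flush any events that still need to reach the `NetworkBehaviour`.
        if let Some(event) = self.pending_events.pop() {
            return Poll::Ready(Some(event));
        }
        // Let already ongoing requests/responses run to completion. New
        // requests are redirected to other connections by the behaviour.
        if !self.inflight.is_empty() {
            return Poll::Pending;
        }
        // Nothing left to do: the muxer and transport may now close.
        Poll::Ready(None)
    }
}
```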

Tests

I have essentially changed all tests that would previously emit frequent "broken pipe" errors (there are more of these since #1619, because a clean connection shutdown is no longer always attempted automatically in the background task, but only if actually requested) to perform a clean shutdown of the Swarms or Networks used in these tests.
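For example, where a test previously just dropped its Swarms at the end (often producing a "broken pipe" on the remote side), it can now finish with an explicit close. A sketch, assuming Swarm::close is awaitable and resolves once the graceful shutdown has completed:

```rust
use libp2p::swarm::{NetworkBehaviour, Swarm};

// Sketch only: `close()` is the API proposed above; that it is awaitable and
// resolves upon completion of the shutdown is an assumption here.
async fn finish_test<B: NetworkBehaviour>(mut alice: Swarm<B>, mut bob: Swarm<B>) {
    // Close both ends cleanly so neither side observes a broken pipe.
    futures::future::join(alice.close(), bob.close()).await;
}
```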

Roman S. Borschel and others added 14 commits June 19, 2020 15:06
The `Network` currently does not emit events for actively
closed connections, e.g. via `EstablishedConnection::close`
or `ConnectedPeer::disconnect()`. As a result, when actively
closing connections, `ConnectionEstablished` events are
emitted without a matching `ConnectionClosed` event
eventually following. This seems undesirable and has the
consequence that the `Swarm::ban_peer_id` feature in
`libp2p-swarm` does not result in appropriate calls to
`NetworkBehaviour::inject_connection_closed`
and `NetworkBehaviour::inject_disconnected`. Furthermore,
the `disconnect()` functionality in `libp2p-core` is currently
broken as it leaves the `Pool` in an inconsistent state.

This commit does the following:

  1. When connection background tasks are dropped
     (i.e. removed from the `Manager`), they
     always terminate immediately, without attempting
     an orderly close of the connection.
  2. An orderly close is sent to the background task
     of a connection as a regular command. The
     background task emits a `Closed` event
     before terminating.
  3. `Pool::disconnect()` removes all connection
     tasks for the affected peer from the `Manager`,
     i.e. without an orderly close, thereby also
     fixing the discovered state inconsistency
     due to not removing the corresponding entries
     in the `Pool` itself after removing them from
     the `Manager`.
  4. A new test is added to `libp2p-swarm` that
     exercises the ban/unban functionality and
     places assertions on the number and order
     of calls to the `NetworkBehaviour`. In that
     context some new testing utilities have
     been added to `libp2p-swarm`.

This addresses libp2p#1584.
Co-authored-by: Toralf Wittner <tw@dtex.org>
There is no need for a `StartClose` future.
Building on the ability to wait for connection shutdown to
complete introduced in libp2p#1619,
this commit extends the ability for performing graceful
shutdowns in the following ways:

  1. The `ConnectionHandler` (and thus also `ProtocolsHandler`) can
  participate in the shutdown, via new `poll_close` methods. The
  muxer and underlying transport connection only start closing once
  the connection handler signals readiness to do so.

  2. A `Network` can be gracefully shut down, which involves a
  graceful shutdown of the underlying connection `Pool`. The `Pool`
  in turn proceeds with a shutdown by rejecting new connections
  while draining established connections.

  3. A `Swarm` can be gracefully shut down, which involves a
  graceful shutdown of the underlying `Network` followed by
  polling the `NetworkBehaviour` until it returns `Poll::Pending`,
  i.e. it has no more output.

In particular, the following are important details:

  * Analogous to how new inbound and outbound connections are rejected
  during shutdown, a single connection that is shutting down rejects
  new inbound substreams and, by the return type of
  `ConnectionHandler::poll_close`, no new outbound substreams can be requested.

  * The `NodeHandlerWrapper` managing the `ProtocolsHandler`
  always waits for already ongoing inbound and outbound substream
  upgrades to complete. Since the `NodeHandlerWrapper` is a
  `ConnectionHandler`, the previous point applies w.r.t. new inbound
  and outbound substreams.

  * When the `connection_keep_alive` expires, a graceful shutdown
  is initiated.
Roman S. Borschel added 6 commits July 29, 2020 17:50
Shut down the `NetworkBehaviour` in tandem with the `Network`
via a dedicated `poll_close` API.
core/src/connection.rs (outdated)
if self.state == ConnectionState::Open {
    self.handler.inject_substream(substream, SubstreamEndpoint::Listener)
} else {
    log::trace!("Inbound substream dropped. Connection is closing.")
}
Contributor

Note that dropping a substream can cause write errors on the remote side, e.g. sending of initial data could fail when the substream is reset.

Contributor Author

Yes, the current idea is to treat new inbound substreams analogously (as much as possible anyway) to new connections during shutdown - that is, to refuse them at the earliest possible moment. Any failures relating to attempts at creating new substreams or connections towards the peer that is shutting down should ideally result in retries with a different peer, much like without the graceful shutdown.

Contributor

I am not sure the shutdown logic as currently implemented can really reduce error rates, because any error on one substream will often lead to a connection close, causing failures on other substreams. For example, the request-response handler will immediately close the connection when any inbound or outbound upgrade fails other than via timeout or protocol mismatch. This means that if the remote closes the connection and my next outbound request over this connection fails, all my ongoing requests on this connection will fail too.

Contributor Author

That is true, unfortunately. It depends very much on how substream write errors are handled, in the specific case of libp2p-request-response by RequestResponseCodec::write_request, which is currently completely up to the user. As you say, most other protocols probably consider write errors on a substream as fatal as well. The only somewhat clean solution that comes to mind leads back to giving the substream multiplexers a configurable inbound substream limit, similar to the connection limit of the Pool, which can be changed during shutdown (i.e. set to 0). Such a limit would not affect existing substreams but could be used to exert back-pressure in a way that lets the remote know it has reached the current substream limit on that connection, without breaking the connection itself.
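To make that idea a bit more concrete, such a limit could be surfaced as a small piece of per-connection muxer configuration along these lines. This is purely hypothetical; no such API exists in this PR, and all names below are invented for illustration:

```rust
// Hypothetical sketch of a per-connection inbound substream limit that can
// be lowered to 0 during shutdown without breaking existing substreams.
struct SubstreamLimit {
    max_inbound: usize,
}

impl SubstreamLimit {
    /// Called during shutdown: stop accepting new inbound substreams while
    /// leaving already established substreams untouched. The remote would
    /// observe back-pressure (limit reached) rather than a reset connection.
    fn begin_shutdown(&mut self) {
        self.max_inbound = 0;
    }

    /// Whether a new inbound substream may be accepted right now.
    fn allow_new_inbound(&self, currently_open: usize) -> bool {
        currently_open < self.max_inbound
    }
}
```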

core/src/connection/manager/task.rs (outdated)
protocols/request-response/src/handler.rs (outdated)
romanb (Contributor Author) commented Aug 3, 2020

So apart from the technical details concerning how it is done, I guess the main point for discussion here is whether the perceived benefits of providing such shutdown options justify the added complexity. I happen to think that, while the diff is large, the actual changes are relatively simple in nature, and I find that providing "clean" shutdown options for a Network or Swarm, analogous to a clean connection shutdown, is quite useful. However, as @twittner also pointed out, in many situations it may be sufficient to resort to means external to the libp2p-executing process itself to achieve similar effects for more graceful rolling updates of networked nodes, e.g. DNS. Not to forget that a graceful shutdown may prolong the time until the node is restarted and ready to receive new connections again, i.e. it may appear "unavailable" for longer than if it just dropped all connections immediately and restarted. One may of course also take the view that any kind of graceful shutdown is completely unnecessary and that trying to avoid errors during node upgrades is not worth any effort, even with frequent node upgrades accompanied by process restarts.

romanb (Contributor Author) commented Aug 3, 2020

There are some unresolved but important considerations relating to the handling of new inbound substreams and substream write errors during shutdown (see this discussion), which is why I'm putting this effort on ice for the moment.

romanb added the on-ice label Aug 3, 2020
romanb (Contributor Author) commented Jan 27, 2021

Closing, since I don't think I will pick this up again. The main difficulty is that a generic approach to a clean shutdown probably cannot be implemented without making too many assumptions about how concrete applications operate with the substreams on a connection; hence only a concrete application knows what can and needs to be done to gracefully cease traffic for a shutdown.
