
stability: Persistent failure in GRPC connections #8939

Closed
bdarnell opened this issue Aug 30, 2016 · 6 comments

@bdarnell
Contributor

The beta cluster (running beta release candidate 2a2fdd9) is showing symptoms similar to #8694 (the grpc connection timeout bug). Some ranges are unable to elect leaders, and while one node was down (due to operator error) the cluster ground to a halt. Once that node was restarted, the cluster was able to make progress again.

Several signs point to an rpc-level problem: there are only 3-4 active raft transport streams showing up in either requestz or goroutine stacks, and the raft status page only shows columns for 5 nodes. It appears that (to take one example) node 104.196.54.82 is unable to connect to node id 6. The logs show that we're getting a "connection reset" error from grpc:

I160830 09:25:30.980574 server/status.go:568  Failed to get ranges from 6: "rpc error: code = 13 desc = connection error: desc = \"transport: read tcp 10.142.0.20:38006->10.142.0.21:26257: read: connection reset by peer\""
I160830 09:25:54.035188 server/status.go:568  Failed to get ranges from 6: "rpc error: code = 13 desc = connection error: desc = \"transport: read tcp 10.142.0.20:38006->10.142.0.21:26257: read: connection reset by peer\""

The source port (38006) is not changing, and there are no logs from grpc/clientconn.go indicating that it is trying to create a new connection. Somehow this failure appears to be preventing grpc from retrying when it should (and this failure, combined with another real node failure, is enough to knock some ranges out of commission).
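
For illustration, here is a minimal sketch of watching a ClientConn's connectivity state to confirm whether gRPC is actually attempting to reconnect. The connectivity-state API shown here is newer than the gRPC version in use at the time, and the address is a placeholder:

```go
// Sketch only: observe a ClientConn's connectivity state transitions.
// If the state stops changing (e.g. it never re-enters CONNECTING after a
// failure), the client is not attempting to reconnect, which matches the
// symptom of a fixed source port and no new-connection logs.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// Placeholder address; TLS options omitted for brevity.
	conn, err := grpc.Dial("10.142.0.21:26257", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	conn.Connect() // kick off the initial dial instead of waiting for an RPC

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	for {
		state := conn.GetState()
		log.Printf("connectivity state: %v", state)
		// Blocks until the state changes or the context expires.
		if !conn.WaitForStateChange(ctx, state) {
			log.Printf("no state change before timeout; last state: %v", conn.GetState())
			return
		}
	}
}
```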

@bdarnell bdarnell added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Aug 30, 2016
@bdarnell bdarnell self-assigned this Aug 30, 2016
@bdarnell
Contributor Author

I think this is exactly the same as #8694, just with a different error. It happens that node 6 was restarted twice, three minutes apart. If the second of those restarts occurred while another node was in its TLS handshake to the previous incarnation, that node would get a "connection reset" error and consider it a permanent failure. This is a race that could happen to any TLS-enabled use of GRPC, but it's made much more common by our long startup delays.

The error handling in GRPC needs to be reworked: instead of treating errors from ClientHandshake as permanent by default with a whitelist of transient errors (currently io.EOF and context.DeadlineExceeded), it should assume that all errors are retryable (as it does for every operation except ClientHandshake) and use a blacklist of certificate-related errors to treat as permanent failures. (Or maybe even those should be transient: a misconfigured server could transiently reject valid certificates, and it shouldn't take a restart of all clients to recover from that.)
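
To make that concrete, here is a minimal sketch of the inverted classification: retryable by default, with a small blacklist of certificate-related errors treated as permanent. This is an illustration, not gRPC's actual transport code; the isPermanentHandshakeError helper and the specific error types it inspects are assumptions:

```go
// Sketch only: classify handshake errors as retryable by default and treat
// only certificate-related failures as permanent.
package main

import (
	"context"
	"crypto/x509"
	"errors"
	"fmt"
	"io"
)

// isPermanentHandshakeError reports whether an error from ClientHandshake
// should stop reconnection attempts. Everything else (io.EOF, deadline
// exceeded, "connection reset by peer", ...) is treated as transient and
// retried with backoff.
func isPermanentHandshakeError(err error) bool {
	var unknownAuthority x509.UnknownAuthorityError
	var hostname x509.HostnameError
	switch {
	case errors.As(err, &unknownAuthority), errors.As(err, &hostname):
		// Certificate problems: retrying against the same server won't help.
		return true
	default:
		return false
	}
}

func main() {
	for _, err := range []error{
		io.EOF,
		context.DeadlineExceeded,
		errors.New("read tcp 10.142.0.20:38006->10.142.0.21:26257: read: connection reset by peer"),
		x509.UnknownAuthorityError{},
	} {
		fmt.Printf("%-80v permanent=%v\n", err, isPermanentHandshakeError(err))
	}
}
```

The point is that an opaque network error like "connection reset by peer" falls through to the transient path and gets retried, instead of wedging the connection until a process restart.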

@tamird
Contributor

tamird commented Aug 30, 2016

A similar-in-spirit change was already made upstream: grpc/grpc-go#864. Perhaps we could try pulling that in?

@bdarnell
Contributor Author

Ah, cool. Pulling in grpc/grpc-go#864 should definitely be the first step. It looks to me like it should fix the problem, but grpc/grpc-go#870 suggests that there might still be something missing.

@tbg
Member

tbg commented Aug 31, 2016

Just restarted beta gamma with recent master (includes the grpc fix).

pssh -O StrictHostKeyChecking=No -h hosts/${cluster} -i -t 0 "grep 'connection reset by peer' logs/*.stderr" turns up a few entries, but nothing interesting (none of the ones you mentioned). Is that a good sanity check?

@tamird
Contributor

tamird commented Aug 31, 2016

As long as it isn't spamming, we might just be in luck.

@bdarnell
Contributor Author

bdarnell commented Sep 1, 2016

The other way to check for this is to look at the number of raft transport streams. Should be at least 5 (one long-lived stream for each other replica, plus other transient ones for snapshots). Looks like we're good:

$ pssh -h hosts/beta -i 'curl -s -k https://localhost:8080/debug/requests'|grep -A1 'MultiRaft&b=-1'
            <a href="?fam=grpc.Recv.MultiRaft&b=-1">
            [5 active]
--
            <a href="?fam=grpc.Sent.MultiRaft&b=-1">
            [6 active]
--
            <a href="?fam=grpc.Recv.MultiRaft&b=-1">
            [5 active]
--
            <a href="?fam=grpc.Sent.MultiRaft&b=-1">
            [7 active]
--
            <a href="?fam=grpc.Recv.MultiRaft&b=-1">
            [7 active]
--
            <a href="?fam=grpc.Sent.MultiRaft&b=-1">
            [6 active]
--
            <a href="?fam=grpc.Recv.MultiRaft&b=-1">
            [5 active]
--
            <a href="?fam=grpc.Sent.MultiRaft&b=-1">
            [5 active]
--
            <a href="?fam=grpc.Recv.MultiRaft&b=-1">
            [8 active]
--
            <a href="?fam=grpc.Sent.MultiRaft&b=-1">
            [5 active]
--
            <a href="?fam=grpc.Recv.MultiRaft&b=-1">
            [5 active]
--
            <a href="?fam=grpc.Sent.MultiRaft&b=-1">
            [6 active]

@bdarnell bdarnell closed this as completed Sep 1, 2016