
stability: Persistent failure in GRPC connections #8939

Closed
bdarnell opened this issue Aug 30, 2016 · 6 comments

@bdarnell
Contributor

The beta cluster (running beta release candidate 2a2fdd9) is showing symptoms similar to #8694 (the grpc connection timeout bug). Some ranges are unable to elect leaders, and while one node was down (due to operator error) the cluster ground to a halt. Once that node was restarted, the cluster was able to make progress again.

Several signs point to an rpc-level problem: there are only 3-4 active raft transport streams showing up in either requestz or goroutine stacks, and the raft status page only shows columns for 5 nodes. It appears that (to take one example) node 104.196.54.82 is unable to connect to node id 6. The logs show that we're getting a "connection reset" error from grpc:

I160830 09:25:30.980574 server/status.go:568  Failed to get ranges from 6: "rpc error: code = 13 desc = connection error: desc = \"transport: read tcp 10.142.0.20:38006->10.142.0.21:26257: read: connection reset by peer\""
I160830 09:25:54.035188 server/status.go:568  Failed to get ranges from 6: "rpc error: code = 13 desc = connection error: desc = \"transport: read tcp 10.142.0.20:38006->10.142.0.21:26257: read: connection reset by peer\""

The source port (38006) is not changing, and there are no logs from grpc/clientconn.go indicating that it is trying to create a new connection. Somehow this failure appears to be preventing grpc from retrying when it should (and this failure, combined with another real node failure, is enough to knock some ranges out of commission).
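
For illustration, here is a minimal sketch of watching a ClientConn's connectivity state to confirm whether gRPC is actually attempting to reconnect. The connectivity-state API shown here is newer than the gRPC version in use at the time, and the address is a placeholder:

```go
// Sketch only: observe a ClientConn's connectivity state transitions.
// If the state stops changing (e.g. it never re-enters CONNECTING after a
// failure), the client is not attempting to reconnect, which matches the
// symptom of a fixed source port and no new-connection logs.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// Placeholder address; TLS options omitted for brevity.
	conn, err := grpc.Dial("10.142.0.21:26257", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	conn.Connect() // kick off the initial dial instead of waiting for an RPC

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	for {
		state := conn.GetState()
		log.Printf("connectivity state: %v", state)
		// Blocks until the state changes or the context expires.
		if !conn.WaitForStateChange(ctx, state) {
			log.Printf("no state change before timeout; last state: %v", conn.GetState())
			return
		}
	}
}
```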

@bdarnell bdarnell added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Aug 30, 2016
@bdarnell bdarnell self-assigned this Aug 30, 2016
@bdarnell
Contributor Author

I think this is exactly the same as #8694, just with a different error. It happens that node 6 was restarted twice, three minutes apart. If the second of those restarts occurred while another node was in its TLS handshake to the previous incarnation, that node would get a "connection reset" error and consider it a permanent failure. This is a race that could happen to any TLS-enabled use of GRPC, but it's made much more common by our long startup delays.

The error handling in GRPC needs to be reworked: instead of treating errors from ClientHandshake as permanent by default with a whitelist of transient errors (currently io.EOF and context.DeadlineExceeded), it should assume that all errors are retryable (as it does for every operation except ClientHandshake) and use a blacklist of certificate-related errors to treat as permanent failures. (Or maybe even those should be transient: a misconfigured server could transiently reject valid certificates, and it shouldn't take a restart of all clients to recover from that.)
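
To make that concrete, here is a minimal sketch of the inverted classification: retryable by default, with a small blacklist of certificate-related errors treated as permanent. This is an illustration, not gRPC's actual transport code; the isPermanentHandshakeError helper and the specific error types it inspects are assumptions:

```go
// Sketch only: classify handshake errors as retryable by default and treat
// only certificate-related failures as permanent.
package main

import (
	"context"
	"crypto/x509"
	"errors"
	"fmt"
	"io"
)

// isPermanentHandshakeError reports whether an error from ClientHandshake
// should stop reconnection attempts. Everything else (io.EOF, deadline
// exceeded, "connection reset by peer", ...) is treated as transient and
// retried with backoff.
func isPermanentHandshakeError(err error) bool {
	var unknownAuthority x509.UnknownAuthorityError
	var hostname x509.HostnameError
	switch {
	case errors.As(err, &unknownAuthority), errors.As(err, &hostname):
		// Certificate problems: retrying against the same server won't help.
		return true
	default:
		return false
	}
}

func main() {
	for _, err := range []error{
		io.EOF,
		context.DeadlineExceeded,
		errors.New("read tcp 10.142.0.20:38006->10.142.0.21:26257: read: connection reset by peer"),
		x509.UnknownAuthorityError{},
	} {
		fmt.Printf("%-80v permanent=%v\n", err, isPermanentHandshakeError(err))
	}
}
```

The point is that an opaque network error like "connection reset by peer" falls through to the transient path and gets retried, instead of wedging the connection until a process restart.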

@tamird
Contributor

tamird commented Aug 30, 2016

A similar-in-spirit change was already made upstream: grpc/grpc-go#864. Perhaps we could try pulling that in?

@bdarnell
Contributor Author

Ah, cool. Pulling in grpc/grpc-go#864 should definitely be the first step. It looks to me like it should fix the problem, but grpc/grpc-go#870 suggests that there might still be something missing.

@tbg
Member

tbg commented Aug 31, 2016

Just restarted beta gamma with recent master (includes the grpc fix).

pssh -O StrictHostKeyChecking=No -h hosts/${cluster} -i -t 0 "grep 'connection reset by peer' logs/*.stderr" turns up a few entries, but nothing interesting (none of the ones you mentioned). Is that a good sanity check?

@tamird
Contributor

tamird commented Aug 31, 2016

As long as it isn't spamming, we might just be in luck.

@bdarnell
Contributor Author

bdarnell commented Sep 1, 2016

The other way to check for this is to look at the number of raft transport streams. Should be at least 5 (one long-lived stream for each other replica, plus other transient ones for snapshots). Looks like we're good:

$ pssh -h hosts/beta -i 'curl -s -k https://localhost:8080/debug/requests'|grep -A1 'MultiRaft&b=-1'
            <a href="?fam=grpc.Recv.MultiRaft&b=-1">
            [5 active]
--
            <a href="?fam=grpc.Sent.MultiRaft&b=-1">
            [6 active]
--
            <a href="?fam=grpc.Recv.MultiRaft&b=-1">
            [5 active]
--
            <a href="?fam=grpc.Sent.MultiRaft&b=-1">
            [7 active]
--
            <a href="?fam=grpc.Recv.MultiRaft&b=-1">
            [7 active]
--
            <a href="?fam=grpc.Sent.MultiRaft&b=-1">
            [6 active]
--
            <a href="?fam=grpc.Recv.MultiRaft&b=-1">
            [5 active]
--
            <a href="?fam=grpc.Sent.MultiRaft&b=-1">
            [5 active]
--
            <a href="?fam=grpc.Recv.MultiRaft&b=-1">
            [8 active]
--
            <a href="?fam=grpc.Sent.MultiRaft&b=-1">
            [5 active]
--
            <a href="?fam=grpc.Recv.MultiRaft&b=-1">
            [5 active]
--
            <a href="?fam=grpc.Sent.MultiRaft&b=-1">
            [6 active]

@bdarnell bdarnell closed this as completed Sep 1, 2016