cmd/roachtest: deflake gossip/chaos roachtest #44926

Conversation

petermattis (Collaborator)

Deflake `gossip/chaos` by adding a missing call to `waitForFullReplication`. This test loops, killing a node and then verifying that the remaining nodes in the cluster stabilize on the same view of gossip connectivity. Periodically the test was failing because gossip wasn't stabilizing. The root issue was that the SQL query used to retrieve gossip connectivity from one node was hanging, and it was hanging because a range was unavailable. Logs show that the leaseholder for that range was on a down node and that the range appeared to contain only a single replica. This could happen near the start of the test if we started killing nodes before full replication was achieved.

Fixes #38829

Release note: None
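
For illustration, here is a standalone sketch of the kind of replication barrier the fix adds before the chaos loop starts. This is not the PR's code: the connection string, the lib/pq driver, and the exact `crdb_internal.ranges` query are assumptions, and the real `waitForFullReplication` helper in the roachtest harness differs in detail.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire.
)

// waitForFullReplication blocks until every range reports at least `want`
// replicas, so a test doesn't start killing nodes while some range still has
// a single replica (the situation described in the commit message above).
func waitForFullReplication(db *sql.DB, want int) error {
	for {
		var minReplicas int
		if err := db.QueryRow(
			`SELECT min(array_length(replicas, 1)) FROM crdb_internal.ranges`,
		).Scan(&minReplicas); err != nil {
			return err
		}
		if minReplicas >= want {
			return nil
		}
		time.Sleep(time.Second)
	}
}

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// 3 is the default replication factor; only once every range has reached
	// it is it safe to start taking nodes down.
	if err := waitForFullReplication(db, 3); err != nil {
		log.Fatal(err)
	}
	log.Println("all ranges fully replicated")
}
```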

petermattis requested a review from tbg on February 10, 2020, 17:39

petermattis (Collaborator, Author)

@tbg `roachtest run -l gossip/chaos` was failing nearly every time for me. Now it has succeeded 10x in a row. I'm going to run this a few times on actual clusters.

tbg (Member) commented Feb 10, 2020

We should definitely merge this, but are you sure it fixes #38829? I don't understand that failure mode at all where it would complete after a few minutes or so (as opposed to getting stuck indefinitely).

tbg (Member) commented Feb 10, 2020

Also, more generally, using SQL to get the connectivity is probably the wrong thing to do in this test. In the real world, clusters may have lost quorum and we still want gossip to connect 100% of the time. So the test should check the connectivity in a more robust way. Not putting any of that on anyone's plate though. I'm fine calling that issue fixed if the test stops failing (as it seems it will).
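
For context, here is a sketch of what the SQL-based connectivity check looks like; the `crdb_internal.gossip_network` table and its `source_id`/`target_id` columns are assumptions based on the CockroachDB version under discussion, and the real test code differs. The fragility tbg points out is that this path goes through SQL and therefore through KV, so a range that has lost quorum can stall the check even while gossip itself is perfectly healthy.

```go
package gossipcheck

import "database/sql"

// gossipConnectivity returns a canonical string such as "1:2,1:3,2:3"
// describing the gossip edges as seen by the node behind db. Because this is
// a SQL query, it depends on KV availability: an unavailable range anywhere
// in the SQL path can make this call hang, which is the fragility noted above.
func gossipConnectivity(db *sql.DB) (string, error) {
	var s sql.NullString
	err := db.QueryRow(`
SELECT string_agg(source_id::TEXT || ':' || target_id::TEXT, ',')
FROM (SELECT * FROM crdb_internal.gossip_network ORDER BY source_id, target_id)`,
	).Scan(&s)
	return s.String, err
}
```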

petermattis (Collaborator, Author)

> We should definitely merge this, but are you sure it fixes #38829? I don't understand that failure mode at all where it would complete after a few minutes or so (as opposed to getting stuck indefinitely).

Definitely fixes the failures I was seeing, which include the ones in #38829. I'm not sure why the SQL query would sometimes return after ~300s. Is there a 5m timeout on some context somewhere? Note that I was seeing that failure mode myself; with this PR it appears to be gone.

petermattis force-pushed the pmattis/gossip-wait-for-full-replication branch from 2ab0be5 to 38f4bd9 on February 10, 2020, 17:54

petermattis (Collaborator, Author) left a comment

pkg/cmd/roachtest/gossip.go, line 92 at r1 (raw file):

				}
			}
			c.l.Printf("gossip ok: %s (%0.0fs)\n", expected, timeutil.Since(start).Seconds())

Also converted this spurious `fmt.Printf` to `c.l.Printf`.

petermattis (Collaborator, Author)

Running on real clusters produced `gossip.go:63,gossip.go:100,gossip.go:112,gossip.go:122,test_runner.go:741: gossip did not stabilize in 293.4s`. I'm going to dig more before merging this.

tbg (Member) commented Feb 11, 2020

> Is there a 5m timeout on some context somewhere?

If there were a timeout, the SQL query would return an error, which would lead to a t.Fatal, and we're not seeing that here. The problem appears to be that the query eventually does return, but it takes a long time to do so. Sleeping on it, I think I understand this somewhat. Let's say we're checking two up nodes.

t=0: we check n1. We get some connectivity C back.
t=1: we start checking n2. This hangs for five minutes.
t=5m: the query returns with a connectivity C' that differs from C, because C was captured before gossip had updated. So the loop wraps around, but it bails because more than 20s have passed.

So this all makes sense; the only remaining problem is that the query stalls at all.

I still think this is likely to be caused by getting a request stuck on a gc'able replica. You've identified startWriteNodeStatus as a potential culprit; this uses KV in the sync startup path and will block the SQL endpoint from operating until successful. The KV request in that method could end up on a replica that is gc'able, and will "just sit there" until the replica goes away.

Open question is why that can take minutes. The scanner loop is ~10 minutes by default, though in the scenario of the test we expect a reactive GC, i.e. within seconds, since former peers of the replica ought to be around. With a vmodule=*=1 log file (plus a goroutine profile showing a stuck KV request), I can probably come up with a coherent story here.
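
To make that timeline concrete, here is a simplified sketch of the stabilization loop being described. The names, the 20-second budget, and the error format are taken from this discussion and the `gossip did not stabilize in 293.4s` message above, not from the actual gossip.go source, which differs in detail; it continues the `gossipcheck` sketch from earlier and reuses its `gossipConnectivity` helper.

```go
package gossipcheck

import (
	"database/sql"
	"fmt"
	"time"
)

// waitForStableGossip polls every surviving node until they all report the
// same connectivity string, and gives up once the 20s budget is exhausted.
func waitForStableGossip(dbs []*sql.DB) error {
	start := time.Now()
	for {
		stable := true
		var expected string
		for i, db := range dbs {
			// If this query hangs behind an unavailable range (the t=1 step
			// above), it can take ~5 minutes to return; by then the budget
			// below is long gone even though gossip converged almost instantly.
			conn, err := gossipConnectivity(db)
			if err != nil {
				return err
			}
			if i == 0 {
				expected = conn
			} else if conn != expected {
				stable = false
			}
		}
		if stable {
			return nil
		}
		if elapsed := time.Since(start); elapsed > 20*time.Second {
			return fmt.Errorf("gossip did not stabilize in %.1fs", elapsed.Seconds())
		}
		time.Sleep(time.Second)
	}
}
```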

petermattis (Collaborator, Author)

> I still think this is likely to be caused by getting a request stuck on a gc'able replica. You've identified startWriteNodeStatus as a potential culprit; this uses KV in the sync startup path and will block the SQL endpoint from operating until successful. The KV request in that method could end up on a replica that is gc'able, and will "just sit there" until the replica goes away.
>
> Open question is why that can take minutes. The scanner loop is ~10 minutes by default, though in the scenario of the test we expect a reactive GC, i.e. within seconds, since former peers of the replica ought to be around. With a vmodule=*=1 log file (plus a goroutine profile showing a stuck KV request), I can probably come up with a coherent story here.

See my new comments on #38829. I believe there are two different failure modes for this test. The first is fixed by this PR: a range losing quorum, which causes startWriteNodeStatus to hang forever during startup.

The second failure mode is the SQL query returning successfully after roughly 5 minutes. I believe the timer we're hitting is time-until-store-dead. #38829 (comment) indicates that we're failing to query r26 (the new system.namespace table) because nobody can acquire the lease for that range for some reason. When time-until-store-dead kicks in, the old Raft leader for that range is removed, a new one is elected, and the query goes through. That's my theory; I'm not sure why we couldn't get a lease for r26.
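
If the ~5 minute figure is indeed the store-death timer, the relevant cluster setting is `server.time_until_store_dead` (default 5m0s). A hedged sketch, continuing the earlier `gossipcheck` examples, of inspecting it and, purely for local debugging, shrinking it so the theorized failure mode reproduces faster; the 75s value is near the setting's lower bound and is not a recommendation for the test itself.

```go
package gossipcheck

import (
	"database/sql"
	"log"
)

// inspectStoreDeadTimeout prints the current store-death timeout and,
// optionally for local debugging, shrinks it so a ~5 minute stall tied to
// this timer would instead resolve in about a minute.
func inspectStoreDeadTimeout(db *sql.DB, shrink bool) error {
	var d string
	if err := db.QueryRow(
		`SHOW CLUSTER SETTING server.time_until_store_dead`,
	).Scan(&d); err != nil {
		return err
	}
	log.Printf("server.time_until_store_dead = %s", d)
	if shrink {
		_, err := db.Exec(`SET CLUSTER SETTING server.time_until_store_dead = '75s'`)
		return err
	}
	return nil
}
```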

PS I'm at an offsite meeting all day and won't have a chance to look at this at all.

tbg (Member) commented Mar 12, 2020

This PR is now a part of #46045. Thanks for all the prep work, @petermattis. Turns out there was a real KV bug here.

tbg closed this on Mar 12, 2020
petermattis deleted the pmattis/gossip-wait-for-full-replication branch on March 12, 2020, 19:54
Closes: roachtest: gossip/chaos/nodes=9 failed (#38829)