cmd/roachtest: deflake gossip/chaos roachtest #44926

Conversation

petermattis (Collaborator)

Deflake `gossip/chaos` by adding a missing call to `waitForFullReplication`. This test loops, killing a node and then verifying that the remaining nodes in the cluster stabilize on the same view of gossip connectivity. Periodically the test was failing because gossip wasn't stabilizing. The root issue was that the SQL query used to retrieve gossip connectivity from one node was hanging, and it was hanging because a range was unavailable. Logs show that the leaseholder for that range was on a down node and that the range appeared to contain only a single replica. This could happen near the start of the test if we started killing nodes before full replication was achieved.

Fixes #38829

Release note: None
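
For illustration, here is a standalone sketch of the kind of replication barrier the fix adds before the chaos loop starts. This is not the PR's code: the connection string, the lib/pq driver, and the exact `crdb_internal.ranges` query are assumptions, and the real `waitForFullReplication` helper in the roachtest harness differs in detail.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire.
)

// waitForFullReplication blocks until every range reports at least `want`
// replicas, so a test doesn't start killing nodes while some range still has
// a single replica (the situation described in the commit message above).
func waitForFullReplication(db *sql.DB, want int) error {
	for {
		var minReplicas int
		if err := db.QueryRow(
			`SELECT min(array_length(replicas, 1)) FROM crdb_internal.ranges`,
		).Scan(&minReplicas); err != nil {
			return err
		}
		if minReplicas >= want {
			return nil
		}
		time.Sleep(time.Second)
	}
}

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// 3 is the default replication factor; only once every range has reached
	// it is it safe to start taking nodes down.
	if err := waitForFullReplication(db, 3); err != nil {
		log.Fatal(err)
	}
	log.Println("all ranges fully replicated")
}
```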

petermattis requested a review from tbg on February 10, 2020, 17:39

petermattis (Collaborator, Author)

@tbg `roachtest run -l gossip/chaos` was failing nearly every time for me. Now it has succeeded 10x in a row. I'm going to run this a few times on actual clusters.

tbg (Member) commented Feb 10, 2020

We should definitely merge this, but are you sure it fixes #38829? I don't understand that failure mode at all where it would complete after a few minutes or so (as opposed to getting stuck indefinitely).

tbg (Member) commented Feb 10, 2020

Also, more generally, using SQL to get the connectivity is probably the wrong thing to do in this test. In the real world, clusters may have lost quorum and we still want gossip to connect 100% of the time. So the test should check the connectivity in a more robust way. Not putting any of that on anyone's plate though. I'm fine calling that issue fixed if the test stops failing (as it seems it will).
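
For context, here is a sketch of what the SQL-based connectivity check looks like; the `crdb_internal.gossip_network` table and its `source_id`/`target_id` columns are assumptions based on the CockroachDB version under discussion, and the real test code differs. The fragility tbg points out is that this path goes through SQL and therefore through KV, so a range that has lost quorum can stall the check even while gossip itself is perfectly healthy.

```go
package gossipcheck

import "database/sql"

// gossipConnectivity returns a canonical string such as "1:2,1:3,2:3"
// describing the gossip edges as seen by the node behind db. Because this is
// a SQL query, it depends on KV availability: an unavailable range anywhere
// in the SQL path can make this call hang, which is the fragility noted above.
func gossipConnectivity(db *sql.DB) (string, error) {
	var s sql.NullString
	err := db.QueryRow(`
SELECT string_agg(source_id::TEXT || ':' || target_id::TEXT, ',')
FROM (SELECT * FROM crdb_internal.gossip_network ORDER BY source_id, target_id)`,
	).Scan(&s)
	return s.String, err
}
```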

petermattis (Collaborator, Author)

> We should definitely merge this, but are you sure it fixes #38829? I don't understand that failure mode at all where it would complete after a few minutes or so (as opposed to getting stuck indefinitely).

Definitely fixes the failures I was seeing, which include the ones in #38829. I'm not sure why the SQL query would sometimes return after ~300s. Is there a 5m timeout on some context somewhere? Note that I was seeing that failure mode myself; with this PR it appears to be gone.

petermattis force-pushed the pmattis/gossip-wait-for-full-replication branch from 2ab0be5 to 38f4bd9 on February 10, 2020, 17:54

petermattis (Collaborator, Author) left a comment

pkg/cmd/roachtest/gossip.go, line 92 at r1 (raw file):

				}
			}
			c.l.Printf("gossip ok: %s (%0.0fs)\n", expected, timeutil.Since(start).Seconds())

Also converted this spurious `fmt.Printf` to `c.l.Printf`.

petermattis (Collaborator, Author)

Running on real clusters produced `gossip.go:63,gossip.go:100,gossip.go:112,gossip.go:122,test_runner.go:741: gossip did not stabilize in 293.4s`. I'm going to dig more before merging this.

tbg (Member) commented Feb 11, 2020

> Is there a 5m timeout on some context somewhere?

If there were a timeout, the SQL query would return an error, which would lead to a t.Fatal, and we're not seeing that here. The problem appears to be that the query eventually does return, but it takes a long time to do so. Sleeping on it, I think I understand this somewhat. Let's say we're checking two up nodes.

t=0: we check n1. We get some connectivity C back.
t=1: we start checking n2. This hangs for five minutes.
t=5m: the query returns with a connectivity C' that differs from C, because C was captured before gossip had updated. So the loop wraps around, but it bails because more than 20s have passed.

So this all makes sense; the only remaining problem is that the query stalls at all.

I still think this is likely to be caused by getting a request stuck on a gc'able replica. You've identified startWriteNodeStatus as a potential culprit; this uses KV in the sync startup path and will block the SQL endpoint from operating until successful. The KV request in that method could end up on a replica that is gc'able, and will "just sit there" until the replica goes away.

Open question is why that can take minutes. The scanner loop is ~10 minutes by default, though in the scenario of the test we expect a reactive GC, i.e. within seconds, since former peers of the replica ought to be around. With a vmodule=*=1 log file (plus a goroutine profile showing a stuck KV request), I can probably come up with a coherent story here.
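
To make that timeline concrete, here is a simplified sketch of the stabilization loop being described. The names, the 20-second budget, and the error format are taken from this discussion and the `gossip did not stabilize in 293.4s` message above, not from the actual gossip.go source, which differs in detail; it continues the `gossipcheck` sketch from earlier and reuses its `gossipConnectivity` helper.

```go
package gossipcheck

import (
	"database/sql"
	"fmt"
	"time"
)

// waitForStableGossip polls every surviving node until they all report the
// same connectivity string, and gives up once the 20s budget is exhausted.
func waitForStableGossip(dbs []*sql.DB) error {
	start := time.Now()
	for {
		stable := true
		var expected string
		for i, db := range dbs {
			// If this query hangs behind an unavailable range (the t=1 step
			// above), it can take ~5 minutes to return; by then the budget
			// below is long gone even though gossip converged almost instantly.
			conn, err := gossipConnectivity(db)
			if err != nil {
				return err
			}
			if i == 0 {
				expected = conn
			} else if conn != expected {
				stable = false
			}
		}
		if stable {
			return nil
		}
		if elapsed := time.Since(start); elapsed > 20*time.Second {
			return fmt.Errorf("gossip did not stabilize in %.1fs", elapsed.Seconds())
		}
		time.Sleep(time.Second)
	}
}
```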

petermattis (Collaborator, Author)

> I still think this is likely to be caused by getting a request stuck on a gc'able replica. You've identified startWriteNodeStatus as a potential culprit; this uses KV in the sync startup path and will block the SQL endpoint from operating until successful. The KV request in that method could end up on a replica that is gc'able, and will "just sit there" until the replica goes away.
>
> Open question is why that can take minutes. The scanner loop is ~10 minutes by default, though in the scenario of the test we expect a reactive GC, i.e. within seconds, since former peers of the replica ought to be around. With a vmodule=*=1 log file (plus a goroutine profile showing a stuck KV request), I can probably come up with a coherent story here.

See my new comments on #38829. I believe there are two different failure modes for this test. The first is fixed by this PR: a range losing quorum, which causes startWriteNodeStatus to hang forever during startup.

The second failure mode is the SQL query returning successfully after roughly 5 minutes. I believe the timer we're hitting is time-until-store-dead. #38829 (comment) indicates that we're failing to query r26 (the new system.namespace table) because nobody can acquire the lease for that range for some reason. When time-until-store-dead kicks in, the old Raft leader for that range is removed, a new one is elected, and the query goes through. That's my theory; I'm not sure why we couldn't get a lease for r26.
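
If the ~5 minute figure is indeed the store-death timer, the relevant cluster setting is `server.time_until_store_dead` (default 5m0s). A hedged sketch, continuing the earlier `gossipcheck` examples, of inspecting it and, purely for local debugging, shrinking it so the theorized failure mode reproduces faster; the 75s value is near the setting's lower bound and is not a recommendation for the test itself.

```go
package gossipcheck

import (
	"database/sql"
	"log"
)

// inspectStoreDeadTimeout prints the current store-death timeout and,
// optionally for local debugging, shrinks it so a ~5 minute stall tied to
// this timer would instead resolve in about a minute.
func inspectStoreDeadTimeout(db *sql.DB, shrink bool) error {
	var d string
	if err := db.QueryRow(
		`SHOW CLUSTER SETTING server.time_until_store_dead`,
	).Scan(&d); err != nil {
		return err
	}
	log.Printf("server.time_until_store_dead = %s", d)
	if shrink {
		_, err := db.Exec(`SET CLUSTER SETTING server.time_until_store_dead = '75s'`)
		return err
	}
	return nil
}
```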

PS I'm at an offsite meeting all day and won't have a chance to look at this at all.

tbg (Member) commented Mar 12, 2020

This PR is now a part of #46045. Thanks for all the prep work, @petermattis. Turns out there was a real KV bug here.

tbg closed this on Mar 12, 2020
petermattis deleted the pmattis/gossip-wait-for-full-replication branch on March 12, 2020, 19:54
Closes: roachtest: gossip/chaos/nodes=9 failed (#38829)