
roachtest: gossip/chaos/nodes=9 failed #38829

Closed · cockroach-teamcity opened this issue Jul 11, 2019 · 52 comments · Fixed by #46045
Labels: branch-master (Failures and bugs on the master branch.), C-test-failure (Broken test (automatically or manually discovered).), O-roachtest, O-robot (Originated from a bot.)

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/07607a73daedbf57f47e42e2e3d6cfd529ab65a5

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=gossip/chaos/nodes=9 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1382906&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190711-1382906/gossip/chaos/nodes=9/run_1
	test_runner.go:693: test timed out (10h0m0s)

@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone Jul 11, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Jul 11, 2019
@tbg commented Jul 23, 2019

10:17:06 test.go:176: test status: waiting for gossip to stabilize
20:16:40 test.go:285: test failure: 	test_runner.go:693: test timed out (10h0m0s)

This happens at around the third iteration of stopping the cluster, starting it again, and waiting for gossip to stabilize.

The actual stabilization check is supposed to fail fatally after 20s:

gossipOK := func(start time.Time) bool {
	var expected string
	var initialized bool
	for i := 1; i <= c.spec.NodeCount; i++ {
		if elapsed := timeutil.Since(start); elapsed >= 20*time.Second {
			t.Fatalf("gossip did not stabilize in %.1fs", elapsed.Seconds())
		}

Unfortunately that gossip loop logs to stdout, which is hard to read, but I think all it ever prints is:

[10:17:02] 1: gossip not ok (dead node 4 present): 2:1,3:1,4:1,5:3,6:3,7:3,8:2,9:7 (0s)
[10:17:03] 1: gossip not ok (dead node 4 present): 2:1,3:1,4:1,5:3,6:3,7:3,8:2,9:7 (1s)
[10:17:04] gossip ok: 2:1,3:1,5:3,6:3,7:3,8:2,9:7 (2s)
[10:17:06] 2: gossip not ok (dead node 1 present): 3:1,5:3,6:3,7:3,8:2,9:7 (0s)
[10:17:07] 2: gossip not ok (dead node 1 present): 3:1,5:3,6:3,7:3,8:2,9:7 (1s)

I think this means that the loop got stuck in this method:

gossipNetwork := func(node int) string {
	const query = `
SELECT string_agg(source_id::TEXT || ':' || target_id::TEXT, ',')
FROM (SELECT * FROM crdb_internal.gossip_network ORDER BY source_id, target_id)
`
	db := c.Conn(ctx, node)
	defer db.Close()
	var s gosql.NullString
	if err := db.QueryRow(query).Scan(&s); err != nil {
		t.Fatal(err)
	}
	if s.Valid {
		return s.String
	}
	return ""
}

When the test runner times out the test, it also panics:

[20:21:47] test returned error: gossip/chaos/nodes=9: test timed out and afterwards failed to respond to cancelation
[20:21:52] panic: write /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190711-1382906/gossip/chaos/nodes=9/run_1/test.log: file already closed
	panic: write /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190711-1382906/gossip/chaos/nodes=9/run_1/test.log: file already closed

goroutine 173134 [running]:
main.(*logger).PrintfCtxDepth(0xc0013e93e0, 0x33b5260, 0xc000060070, 0x2, 0x2d41182, 0x2e, 0x0, 0x0, 0x0)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/log.go:254 +0xd7
main.(*logger).Printf(...)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/log.go:243
main.(*cluster).FailOnReplicaDivergence(0xc000912000, 0x33b52e0, 0xc00087e030, 0xc000e92fc0)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1306 +0x371
main.(*testRunner).runTest.func2.1(0xc000912000, 0x33b52e0, 0xc00087e030, 0xc000e92fc0, 0xc000a88600, 0xc000a88604, 0xc0003c3220)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:664 +0x4d
panic(0x29c9280, 0xc000c96030)
	/usr/local/go/src/runtime/panic.go:522 +0x1b5
main.(*logger).PrintfCtxDepth(0xc0013e93e0, 0x33b5260, 0xc000060070, 0x2, 0xc00077ee00, 0x6a, 0x0, 0x0, 0x0)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/log.go:254 +0xd7
main.(*logger).Printf(...)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/log.go:243
main.(*test).printAndFail(0xc000e92fc0, 0x2, 0xc000ecdbc8, 0x1, 0x1)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test.go:264 +0x140
main.(*test).fatalfInner(0xc000e92fc0, 0x0, 0x0, 0xc000ecdbc8, 0x1, 0x1)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test.go:250 +0xb3
main.(*test).Fatal(...)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test.go:237
main.registerGossip.func1.1(0x4, 0x0, 0x0)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:48 +0x17d
main.registerGossip.func1.2(0x312ec719, 0xed4b903a2, 0x0, 0x0)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:68 +0x139
main.registerGossip.func1.3()
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:99 +0xf1
main.registerGossip.func1(0x33b5220, 0xc000897640, 0xc000e92fc0, 0xc000912000)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:111 +0x3d3
main.registerGossip.func2(0x33b5220, 0xc000897640, 0xc000e92fc0, 0xc000912000)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:120 +0x4d
main.(*testRunner).runTest.func2(0xc000d3cde0, 0xc000912000, 0x33b52e0, 0xc00087e030, 0xc000e92fc0, 0xc000a88600, 0xc000a88604, 0xc0003c3220, 0x33b5220, 0xc000897640)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:678 +0xf2
created by main.(*testRunner).runTest
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:658 +0x596

tbg added a commit to tbg/cockroach that referenced this issue Jul 23, 2019
It was printing to Stdout, where the output is hard to find.

Touches cockroachdb#38829.

Release note: None
craig bot pushed a commit that referenced this issue Jul 25, 2019
39049: roachtest: print to correct logs in gossip/chaos r=nvanbenschoten a=tbg

It was printing to Stdout, where the output is hard to find.

Touches #38829.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/77f26d185efb436aaac88243de19a27caa5da9b6

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=gossip/chaos/nodes=9 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1509340&tab=artifacts#/gossip/chaos/nodes=9

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190926-1509340/gossip/chaos/nodes=9/run_1
	gossip.go:48,gossip.go:68,gossip.go:99,gossip.go:111,gossip.go:120,test_runner.go:689: dial tcp 104.154.198.192:26257: connect: connection timed out

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/f9a102814bdce90d687f6215acadf10a9d784c29

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=gossip/chaos/nodes=9 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1555992&tab=artifacts#/gossip/chaos/nodes=9

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191024-1555992/gossip/chaos/nodes=9/run_1
	cluster.go:1539,gossip.go:35,gossip.go:120,test_runner.go:689: cluster.Put: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod put teamcity-1555992-1571899205-29-n9cpu4:1-9 /home/agent/work/.go/src/github.com/cockroachdb/cockroach/cockroach.linux-2.6.32-gnu-amd64 ./cockroach returned:
		stderr:
		
		stdout:
		teamcity-1555992-1571899205-29-n9cpu4: putting (dist) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/cockroach.linux-2.6.32-gnu-amd64 ./cockroach
		..............
		   1: done
		   2: ~ scp -r -C -o StrictHostKeyChecking=no -i /root/.ssh/id_rsa -i /root/.ssh/google_compute_engine ubuntu@35.222.100.16:./cockroach ubuntu@104.154.129.16:./cockroach
		ssh_exchange_identification: Connection closed by remote host
		lost connection
		: exit status 1
		   3: done
		   4: done
		   5: done
		   6: done
		   7: done
		   8: done
		   9: done
		I191024 07:43:00.678995 1 cluster_synced.go:1100  put /home/agent/work/.go/src/github.com/cockroachdb/cockroach/cockroach.linux-2.6.32-gnu-amd64 failed
		: exit status 1

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/8b9f54761adc58eb9aecbf9b26f1a7987d8a01e5

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=gossip/chaos/nodes=9 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1573251&tab=artifacts#/gossip/chaos/nodes=9

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191105-1573251/gossip/chaos/nodes=9/run_1
	cluster.go:1712,gossip.go:112,gossip.go:120,test_runner.go:697: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1573251-1572938926-29-n9cpu4:9 returned:
		stderr:
		
		stdout:
		teamcity-1573251-1572938926-29-n9cpu4: starting
		0: exit status 255
		~ ./cockroach version
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.getCockroachVersion
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:88
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:149
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1553
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1337: 
		I191105 08:37:06.616686 1 cluster_synced.go:1635  command failed
		: exit status 1

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/7dec3577461a1e53ea70582de62bbd96bf512b73

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=gossip/chaos/nodes=9 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1614006&tab=artifacts#/gossip/chaos/nodes=9

The test failed on branch=release-2.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191127-1614006/gossip/chaos/nodes=9/run_1
	cluster.go:1712,gossip.go:112,gossip.go:120,test_runner.go:697: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1614006-1574838230-30-n9cpu4:1 returned:
		stderr:
		
		stdout:
		teamcity-1614006-1574838230-30-n9cpu4: starting
		0: exit status 255
		~ ./cockroach version
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.getCockroachVersion
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:88
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:149
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1553
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1337: 
		I191127 07:40:54.691223 1 cluster_synced.go:1635  command failed
		: exit status 1

@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on master@e81faed54ee90bdfed1dddc63bb13d1ecf8806da:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191212-1640161/gossip/chaos/nodes=9/run_1
	test_runner.go:715: test timed out (10h0m0s)

details

Artifacts: /gossip/chaos/nodes=9

make stressrace TESTS=gossip/chaos/nodes=9 PKG=./pkg/roachtest TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on master@299f2729c546b9a2444665c7ccd4b07e62bc84b5:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191213-1641315/gossip/chaos/nodes=9/run_1
	gossip.go:62,gossip.go:99,gossip.go:111,gossip.go:120,test_runner.go:700: gossip did not stabilize in 278.9s

details

Artifacts: /gossip/chaos/nodes=9

make stressrace TESTS=gossip/chaos/nodes=9 PKG=./pkg/roachtest TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on master@2f1e342c386973c35246bef68c177fcd0b8b609b:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191218-1649932/gossip/chaos/nodes=9/run_1
	test_runner.go:715: test timed out (10h0m0s)
Repro

Artifacts: /gossip/chaos/nodes=9

make stressrace TESTS=gossip/chaos/nodes=9 PKG=./pkg/roachtest TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on master@277e28b2ea78929563e85cb1c9efc573f37408a4:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191219-1653424/gossip/chaos/nodes=9/run_1
	test_runner.go:715: test timed out (10h0m0s)
Repro

Artifacts: /gossip/chaos/nodes=9

make stressrace TESTS=gossip/chaos/nodes=9 PKG=./pkg/roachtest TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on master@dd9b1c1f40dda59ee9d446416106d311ae5ce1e6:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191221-1658785/gossip/chaos/nodes=9/run_1
	test_runner.go:715: test timed out (10h0m0s)
Repro

Artifacts: /gossip/chaos/nodes=9

make stressrace TESTS=gossip/chaos/nodes=9 PKG=./pkg/roachtest TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on master@d54e001248f233734ff926daef5470487c5616b0:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191227-1664318/gossip/chaos/nodes=9/run_1
	test_runner.go:731: test timed out (10h0m0s)
Repro

Artifacts: /gossip/chaos/nodes=9

make stressrace TESTS=gossip/chaos/nodes=9 PKG=./pkg/roachtest TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on master@8f946b8bb62629002c5e958373b81ca9e1920dd6:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191228-1665456/gossip/chaos/nodes=9/run_1
	gossip.go:48,gossip.go:68,gossip.go:99,gossip.go:111,gossip.go:120,test_runner.go:716: dial tcp 35.194.11.139:26257: connect: connection timed out
Repro

Artifacts: /gossip/chaos/nodes=9

make stressrace TESTS=gossip/chaos/nodes=9 PKG=./pkg/roachtest TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on master@edd46b1405b7928760b65ec8aad59fd22a8adb6b:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191229-1666388/gossip/chaos/nodes=9/run_1
	test_runner.go:731: test timed out (10h0m0s)
Repro

Artifacts: /gossip/chaos/nodes=9

make stressrace TESTS=gossip/chaos/nodes=9 PKG=./pkg/roachtest TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on release-19.1@cf4d9a46193b9fc2c63b7adf89b3b0b5c84adb2b:

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20191230-1667196/gossip/chaos/nodes=9/run_1
	gossip.go:62,gossip.go:99,gossip.go:111,gossip.go:120,test_runner.go:716: gossip did not stabilize in 35.9s
Repro

Artifacts: /gossip/chaos/nodes=9

make stressrace TESTS=gossip/chaos/nodes=9 PKG=./pkg/roachtest TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on master@2559992b91b6823110c5a7ef8bc0d61395ca6f28:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200118-1698212/gossip/chaos/nodes=9/run_1
	test_runner.go:731: test timed out (10h0m0s)

Repro

Artifacts: /gossip/chaos/nodes=9
roachdash

powered by pkg/cmd/internal/issues

@tbg tbg added the branch-master Failures and bugs on the master branch. label Jan 22, 2020
@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on master@545c96573d36519c27115278187fdd441f54a64d:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200123-1705428/gossip/chaos/nodes=9/run_1
	test_runner.go:734: test timed out (10h0m0s)

More

Artifacts: /gossip/chaos/nodes=9

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on master@9bf1af0ec1b352419f621d7329fee5cd12d0bcdb:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/gossip/chaos/nodes=9/run_1
	test_runner.go:756: test timed out (10h0m0s)

More

Artifacts: /gossip/chaos/nodes=9

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on master@08bb94d98b7f60126666e9c5a317afca61422d7f:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/gossip/chaos/nodes=9/run_1
	gossip.go:62,gossip.go:99,gossip.go:111,gossip.go:121,test_runner.go:741: gossip did not stabilize in 307.1s

	cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:778: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1737376-1581149318-36-n9cpu4 --oneshot --ignore-empty-nodes: exit status 1 2: 5476
		9: 5401
		5: 6964
		6: 5222
		7: 4492
		3: 5366
		1: 6447
		4: dead
		8: 5894
		Error:  4: dead

More

Artifacts: /gossip/chaos/nodes=9

See this test on roachdash
powered by pkg/cmd/internal/issues

@petermattis

From n1:

W200210 20:21:05.836099 976 storage/replica_range_lease.go:1032  [n1,s1,r26/1:/NamespaceTable/{30-Max}] have been waiting 1m0s attempting to acquire lease
W200210 20:21:05.836396 977 storage/replica_write.go:209  [n1,s1,r26/1:/NamespaceTable/{30-Max}] have been waiting 60.00s for proposing command RequestLease [/NamespaceTable/30,/Min).
This range is likely unavailable.
Please submit this message at

  https://github.com/cockroachdb/cockroach/issues/new/choose

along with

	https://yourhost:8080/#/reports/range/26

and the following Raft status: {"id":"1","term":10,"vote":"0","commit":54,"lead":"0","raftState":"StateFollower","applied":54,"progress":{},"leadtransferee":"0"}

Notice the timestamps here. Shortly after n2 was stopped, n1 tried to acquire the lease for r26. A minute later it emits this warning. Later we see:

I200210 20:24:35.616380 156 storage/replica_raft.go:969  [n1,s1,r26/1:/NamespaceTable/{30-Max}] pending commands: reproposing 1 (at 54.26) reasonNewLeader

Presumably this happened right after n2 was removed from r26.

@tbg commented Feb 11, 2020

Hmm, I get a better idea of what's going wrong from these logs, but it unfortunately hints at the bowels of raft. It does look like the failure to acquire a lease on the part of n1 is the root cause of the stall. A minute in, the raft status

{"id":"1","term":10,"vote":"0","commit":54,"lead":"0","raftState":"StateFollower","applied":54,"progress":{},"leadtransferee":"0"}

basically says: I'm a follower, but I don't know of any leader, and yet I'm not campaigning. Maybe it's regularly trying to campaign; we can't tell (since we use PreVote, which doesn't bump the term). Either way, raft isn't getting its act together here. And yet, n1 is definitely a healthy replica of what looks like a range that should work "just fine", since it has members n1,n3,n2(down),n8,n9 and thus more than a healthy majority (I get this from the most recent replicate queue output in the logs and the fact that I see n9 have an active replica before things stall):

peter-1581364560-03-n9cpu4-0001> I200210 20:17:47.209788 5648 storage/replica_raft.go:248 [n1,s1,r26/1:/NamespaceTable/{30-Max}] proposing SIMPLE(v5) ADD_REPLICA[(n9,s9):5]: after=[(n1,s1):1 (n3,s3):2 (n2,s2):3 (n8,s8):4 (n9,s9):5] next=6

n2 went down around 20:19:52. All we need for things to work is for someone to get elected raft leader. (it's safe to assume that n2 was the raft leader at the time of its demise).

But basically nothing happens, until the situation seems to resolve itself a few minutes in:

peter-1581364560-03-n9cpu4-0009> I200210 20:24:35.613234 172 storage/replica_raft.go:969 [n9,s9,r26/5:/NamespaceTable/{30-Max}] pending commands: reproposing 1 (at 54.26) reasonNewLeader

(the other nodes quickly follow suit).

All that is to say, ugh, I think we really need to see the raft chatter here? It looks like --vmodule=raft=5,*=1 does the right thing.
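For reference, the "leader unknown" reading can be checked mechanically against the printed status: lead is "0" (i.e. raft.None) while raftState is StateFollower. A small sketch that decodes the logged JSON (the struct below is hand-rolled for just these fields; it is not the real raft.Status type):

import "encoding/json"

// loggedRaftStatus mirrors only the fields of the printed status we care about.
type loggedRaftStatus struct {
	ID        string `json:"id"`
	Term      int    `json:"term"`
	Lead      string `json:"lead"`
	RaftState string `json:"raftState"`
}

// leaderUnknown reports whether the status describes a follower that does not
// know of any leader ("lead" of "0" is raft.None).
func leaderUnknown(blob string) (bool, error) {
	var s loggedRaftStatus
	if err := json.Unmarshal([]byte(blob), &s); err != nil {
		return false, err
	}
	return s.RaftState == "StateFollower" && s.Lead == "0", nil
}

Feeding it the status quoted above returns true, which is exactly the "follower with no known leader" state described here.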

@tbg commented Feb 11, 2020

Oh, and I should also mention that this is similar to how leases can hang when a replica is gc'able but doesn't know it and gets an inbound request. But then, in this example, we'd expect to see a) a removal of the n1 replica being committed on other nodes while n1 is down and b) the n1 replica to eventually get replicaGC'ed unblocking things. We see neither here.

petermattis added a commit to petermattis/cockroach that referenced this issue Feb 11, 2020
Deflake `gossip/chaos` by adding a missing
`waitForFullReplication`. This test loops, killing a node and then
verifying that the remaining nodes in the cluster stabilize on the same
view of gossip connectivity. Periodically the test was failing because
gossip wasn't stabilizing. The root issue was that the SQL query to
retrieve the gossip connectivity from one node was hanging. And that
query was hanging due to unavailability of a range. Logs show that the
leaseholder for that range was on a down node and that the range only
seemed to contain a single replica. This could happen near the start of
the test if we started killing nodes before full replication was
achieved.

Fixes cockroachdb#38829

Release note: None
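For context, the missing waitForFullReplication boils down to polling the cluster until every range reports at least three replicas before the chaos loop starts killing nodes, so that losing any single node can never remove a quorum. A rough sketch of that idea, assuming a SQL connection to one node (the crdb_internal.ranges query and the constants are illustrative, not the exact roachtest helper):

import (
	"context"
	gosql "database/sql"
	"fmt"
	"time"
)

// waitForFullReplicationSketch polls until no range has fewer than three
// replicas, so that killing any single node cannot take away a quorum.
func waitForFullReplicationSketch(ctx context.Context, db *gosql.DB) error {
	const q = `SELECT min(array_length(replicas, 1)) FROM crdb_internal.ranges`
	for {
		var minReplicas int
		if err := db.QueryRowContext(ctx, q).Scan(&minReplicas); err != nil {
			return err
		}
		if minReplicas >= 3 {
			return nil
		}
		fmt.Printf("waiting for full replication (min replicas per range: %d)\n", minReplicas)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}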
@petermattis commented Feb 11, 2020

I've pushed all of the changes I'm running with to #44926. Reproduction amounted to roachtest run --count 10 --parallelism 5 --cpu-quota 1280 gossip/chaos/nodes=9 (I don't think you need that much cpu-quota, but it won't hurt). 1 out of the 10 runs failed, so you might have to run this 2 or 3 times to see a failure.

@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on master@de7a7bb804716835579e713ab7264f8af168a14f:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/gossip/chaos/nodes=9/run_1
	test_runner.go:756: test timed out (10h0m0s)

More

Artifacts: /gossip/chaos/nodes=9

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on master@2739821b911d777fa2a927295d699b559360a802:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/gossip/chaos/nodes=9/run_1
	gossip.go:48,gossip.go:68,gossip.go:99,gossip.go:111,gossip.go:121,test_runner.go:741: dial tcp 34.67.60.32:26257: connect: connection refused

	cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:778: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1747374-1581581035-41-n9cpu4 --oneshot --ignore-empty-nodes: exit status 1 6: 4570
		4: 5059
		8: 4851
		1: 4747
		2: 4318
		3: 3936
		9: 4655
		7: dead
		5: dead
		Error:  7: dead, 5: dead

More

Artifacts: /gossip/chaos/nodes=9

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on master@e671a4ef97cbc6cf5d22f8f322fd45733d302094:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/gossip/chaos/nodes=9/run_1
	gossip.go:48,gossip.go:68,gossip.go:99,gossip.go:111,gossip.go:121,test_runner.go:741: dial tcp 34.67.13.11:26257: connect: connection refused

	cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:778: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1749385-1581667748-31-n9cpu4 --oneshot --ignore-empty-nodes: exit status 1 3: dead
		5: 6549
		7: 6470
		1: 7354
		2: 6454
		4: 6665
		6: 6183
		8: 6213
		9: dead
		Error:  3: dead, 9: dead

More

Artifacts: /gossip/chaos/nodes=9

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on master@b797cad6d130714748983bc53d4611ddc6151153:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/gossip/chaos/nodes=9/run_1
	test_runner.go:756: test timed out (10h0m0s)

More

Artifacts: /gossip/chaos/nodes=9

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).gossip/chaos/nodes=9 failed on master@fc5c7f093bf1e86852c3b839bc0f6710d9902729:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/gossip/chaos/nodes=9/run_1
	gossip.go:62,gossip.go:99,gossip.go:111,gossip.go:121,test_runner.go:741: gossip did not stabilize in 20.0s

	cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:778: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1772892-1582877180-33-n9cpu4 --oneshot --ignore-empty-nodes: exit status 1 4: 6591
		5: 6817
		7: 8270
		3: 8276
		9: 6552
		1: dead
		6: 6501
		2: 8493
		8: 6038
		Error:  1: dead

More

Artifacts: /gossip/chaos/nodes=9
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@tbg tbg assigned tbg and unassigned andreimatei Mar 9, 2020
@tbg commented Mar 11, 2020

Yeah, this reproduced readily. Again I saw a proposal hanging on the namespace table. Going to throw some more logging in to get at the internals.

@tbg commented Mar 11, 2020

Current repro (note the improved message that I picked up via rebasing). We see that a lease request is proposed on s6. Nothing happens for a few minutes - and "nothing" means that we don't even see reproposals, which indicates that the replica is either stuck or quiesced. I assume it is quiesced, because the activity of the replicate queue on a different node plus the raft election timeout lines up precisely with when things unwedge. Time for some more logging. The replica isn't supposed to quiesce when it has pending proposals, but maybe there's a hole.

tobias-1583941428-02-n9cpu4-0006> E200311 15:52:00.752129 1094 kv/kvserver/replica_write.go:203  [n6,s6,r26/6:/NamespaceTable/{30-Max}] have been waiting 60.00s for proposing command RequestLease [/NamespaceTable/30,/Min).
This range is likely unavailable.
Please submit this message to Cockroach Labs support along with the following information:

Descriptor:  r26:{-} [(n6,s6):6, (n4,s4):8, (n9,s9):7, (n7,s7):4, (n8,s8):5, next=9, gen=20]
Live:        (n6,s6):6,(n9,s9):7,(n7,s7):4,(n8,s8):5
Non-live:    (n4,s4):8
Raft Status: {"id":"6","term":30,"vote":"8","commit":107,"lead":"0","raftState":"StateFollower","applied":107,"progress":{},"leadtransferee":"0"}

and a copy of https://yourhost:8080/#/reports/range/26
tobias-1583941428-02-n9cpu4-0007> I200311 15:55:41.919305 9008 kv/kvserver/queue.go:913  [n7,replicate,s7,r26/4:/NamespaceTable/{30-Max}] processing replica
tobias-1583941428-02-n9cpu4-0007> I200311 15:55:44.422167 137 kv/kvserver/replica_raft.go:962  [n7,s7,r26/4:/NamespaceTable/{30-Max}] pending commands: reproposing 1 (at 107.56) reasonNewLeader
tobias-1583941428-02-n9cpu4-0007> I200311 15:55:44.424701 121 kv/kvserver/replica_raft.go:962  [n7,s7,r26/4:/NamespaceTable/{30-Max}] pending commands: reproposing 1 (at 107.56) reasonNewLeader
tobias-1583941428-02-n9cpu4-0006> I200311 15:55:44.424751 179 kv/kvserver/replica_raft.go:962  [n6,s6,r26/6:/NamespaceTable/{30-Max}] pending commands: reproposing 1 (at 107.56) reasonNewLeader

....


tobias-1583941428-02-n9cpu4-0006> I200311 15:55:44.426742 1094 kv/kvserver/replica_write.go:162  [n6,s6,r26/6:/NamespaceTable/{30-Max}] slow command RequestLease [/NamespaceTable/30,/Min) finished after 283.67s with error cannot replace lease repl=(n7,s7):4 seq=11 start=1583942141.919321157,0 epo=9 pro=1583942141.919330290,0 with repl=(n6,s6):6 seq=11 start=1583941860.749319289,0 epo=12 pro=1583941860.749336644,0: proposed under invalid lease

@tbg commented Mar 11, 2020

Yep, it's quiesced. I'll continue tracking this down tomorrow; it can't be hard now because we can add an assertion.

@petermattis

Interesting that this is looping back around to me (I wrote the quiescence code originally). I wonder if this is something new, or something that has been there a long time. Eagerly awaiting the continuation of this debugging saga.

@tbg commented Mar 12, 2020

What I see in the logs is that everybody quiesces well before the proposal comes in (which makes sense, otherwise this would be very rare and it's not). So I think we're looking at a sequence of events in which the proposal does not lead to unquiescing the replica.
Let's say we have a quiesced replica and a proposal comes in. The proposalBuf should enqueue an update check in this method:

b.insertIntoArray(p, res.arrayIndex())

It's likely that it does and that this check somehow misfires. The code is here:

https://github.com/cockroachdb/cockroach/blob/ffdd8dfd5470ff438a93735cb5da6e29c92e91ed/pkg/kv/kvserver/replica_raft.go#L413-L421

Once this is called (which it should be, since we think it got scheduled), we should have hasReady == true and we should unquiesce:

https://github.com/cockroachdb/cockroach/blob/ffdd8dfd5470ff438a93735cb5da6e29c92e91ed/pkg/kv/kvserver/replica_raft_quiesce.go#L91

I put logging into each of them (thankfully we always get stuck on r26, so this is easy to target), so the next repro will tell us more.

@tbg commented Mar 12, 2020

20 runs passed. This had better repro again now.

@tbg commented Mar 12, 2020

Oh, I got one, and I see what this is. What breaks is the implicit invariant in the above that "if you put a proposal into raft, there will be a Ready". But that's not true, because

[n3,s3,r26/2:/NamespaceTable/{30-Max}] 2 no leader at term 38; dropping proposal

so the replica remains quiesced. Quiesced replicas aren't ticked, so nothing ever happens until some queue wakes up the leader. It's fairly clear that we need the replica to wake up in this situation.

What I don't understand is... why is it not aware of a leader? It can't be quiesced and not know a leader, for it wouldn't... quiesce in that state? Hmm, it doesn't look like we check that. The system will happily quiesce even if Raft already figured out that the leader is unresponsive:

if status != nil && status.Term == req.Message.Term && status.Commit == req.Message.Commit {
	if r.quiesce() {
		return nil
	}
}

(quiesce() doesn't check). I'm also seeing indications in the logs that nodes find out about the leader going away proactively:

vendor/go.etcd.io/etcd/raft/raft.go:1239  [n2,s2,r26/8:/NamespaceTable/{30-Max}] 8 failed to send message to 2 because it is unreachable [StateProbe match=258 next=259]

(this is from calling ReportUnreachable from the transport streams).

This seems like a compelling theory. I'm going to run a few iters with patches.
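The broken invariant can be stated compactly: handing a proposal to raft does not guarantee a Ready, because raft drops proposals while it knows of no leader. A toy model of that failure mode, with names paraphrased from the discussion above (this is a sketch, not the real replica code):

// toyReplica models only the quiescence bookkeeping discussed in this thread.
type toyReplica struct {
	quiesced    bool
	leaderKnown bool     // whether raft knows a leader for the current term
	proposalBuf []string // stand-in for the real proposal buffer
}

// propose inserts into the proposal buffer and runs the Ready check, mirroring
// the path traced two comments up.
func (r *toyReplica) propose(cmd string) {
	r.proposalBuf = append(r.proposalBuf, cmd)
	r.handleReadyCheck()
}

func (r *toyReplica) handleReadyCheck() {
	flushed := len(r.proposalBuf) > 0
	r.proposalBuf = nil
	// Raft drops the flushed proposal when it has no leader ("no leader at
	// term N; dropping proposal"), so flushing does not imply a Ready.
	hasReady := flushed && r.leaderKnown
	if hasReady {
		r.quiesced = false // unquiesce and wake the leader
	}
	// If leaderKnown is false we fall through with quiesced still true: the
	// replica is never ticked again and the proposal hangs, which is the
	// stall seen in this test.
}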

@petermattis

> What I don't understand is... why is it not aware of a leader? It can't be quiesced and not know a leader, for it wouldn't... quiesce in that state? Hmm, it doesn't look like we check that.

I thought a follower only accepted quiesce messages from the leader. That was the intention at least. A leader can quiesce itself and its followers. A follower should only quiesce in response to the current leader.

@tbg commented Mar 12, 2020

No, that is not the case. The follower checks that the commit index in the quiesce message matches the local one and that the term matches (so the message really is from the leader for that term), but it doesn't actually verify that the local instance knows who that leader is. I'll be sending a PR either today or tomorrow; right now I'm making sure the repro disappears.
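A sketch of the extra check being described, i.e. declining a quiesce request when the local raft group does not know of a leader (the field and function names here are illustrative, not the actual kvserver code):

// quiesceRequest carries the coordinates the follower already verifies today.
type quiesceRequest struct {
	Term   uint64
	Commit uint64
}

// localRaftStatus holds the follower's view of its raft group.
type localRaftStatus struct {
	Term   uint64
	Commit uint64
	Lead   uint64 // 0 (raft.None) means no known leader
}

// shouldAcceptQuiesce adds the missing condition: besides matching term and
// commit index, the follower must already know of a leader for that term,
// otherwise it keeps ticking instead of quiescing.
func shouldAcceptQuiesce(status *localRaftStatus, req quiesceRequest) bool {
	if status == nil {
		return false
	}
	if status.Term != req.Term || status.Commit != req.Commit {
		return false
	}
	return status.Lead != 0 // new: never quiesce with an unknown leader
}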

craig bot pushed a commit that referenced this issue Mar 17, 2020
46045: kvserver: avoid hanging proposal after leader goes down  r=nvanbenschoten,petermattis a=tbg

Deflakes gossip/chaos/nodes=9, i.e.

Fixes #38829.

There was a bug in range quiescence due to which commands would hang in
raft for minutes before actually getting replicated. This would occur
whenever a range was quiesced but a follower replica which didn't know
the (Raft) leader would receive a request.  This request would be
evaluated and put into the Raft proposal buffer, and a ready check would
be enqueued. However, no ready would be produced (since the proposal got
dropped by raft; leader unknown) and so the replica would not unquiesce.

This commit prevents this by always waking up the group if the proposal
buffer was initially nonempty, even if an empty Ready is produced.

It goes further than that by trying to ensure that a leader is always
known while quiesced. Previously, on an incoming request to quiesce, we
did not verify that the raft group had learned the leader's identity.

One shortcoming here is that in the situation in which the proposal
would originally hang "forever", it will now hang for one heartbeat
or election timeout where ideally it would be proposed more reactively. Since
this is so rare I didn't try to address this. Instead, refer to
the ideas in

#37906 (comment)

and

#21849

for future changes that could mitigate this.

Without this PR, the test would fail around 10% of the time. With this
change, it passed 40 iterations in a row without a hitch, via:

    ./bin/roachtest run -u tobias --count 40 --parallelism 10 --cpu-quota 1280 gossip/chaos/nodes=9

Release justification: bug fix
Release note (bug fix): a rare case in which requests to a quiesced
range could hang in the KV replication layer was fixed. This would
manifest as a message saying "have been waiting ... for proposing" even
though no loss of quorum occurred.


Co-authored-by: Peter Mattis <petermattis@gmail.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
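A condensed sketch of the first part of that fix, extending the toyReplica sketch from the earlier comment: remember whether the proposal buffer had anything in it before flushing, and wake the group on that signal even if raft hands back an empty Ready (paraphrased, not the actual patch):

// handleReadyCheckFixed wakes the group whenever proposals were flushed, even
// if raft produced no Ready because it dropped them for lack of a leader.
func (r *toyReplica) handleReadyCheckFixed(raftProducedReady bool) {
	hadProposals := len(r.proposalBuf) > 0
	r.proposalBuf = nil
	if raftProducedReady || hadProposals {
		// Unquiescing resumes ticking, so an election can happen and the
		// dropped proposal is eventually reproposed, after up to a heartbeat
		// or election timeout as the PR text notes.
		r.quiesced = false
	}
}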
@craig craig bot closed this as completed in 2783f1a Mar 17, 2020