roachtest: import/tpch/nodes=8 failed on master #30261

Closed
cockroach-teamcity opened this issue Sep 16, 2018 · 35 comments

Labels: C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.

@cockroach-teamcity (Member)

SHA: https://github.com/cockroachdb/cockroach/commits/5a566d893d5b25100f2dd77d3c1ac8dcc674f677

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'
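
For example, a hedged variant of the invocation above that caps concurrency via the '-p' stress flag mentioned in the comment (illustrative only; the particular value is an assumption):

make stress TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-p 2 -stderr=false -maxtime 20m -timeout 10m'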

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=906668&tab=buildLog

	test.go:498,import.go:92: pq: communication error: rpc error: code = Canceled desc = context canceled

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Sep 16, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Sep 16, 2018
@benesch benesch assigned maddyblue and unassigned petermattis Sep 17, 2018
@tbg tbg modified the milestones: 2.2, 2.1 Sep 24, 2018
@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/4d878135975265d185b4d0011ba1d2dbaee0adcb

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=970685&tab=buildLog

The test failed on release-2.1:
	test.go:1002: test timed out (4h0m0s)
	test.go:606,cluster.go:1441,import.go:122: context canceled

@tbg tbg assigned andreimatei and unassigned maddyblue Oct 17, 2018
tbg (Member) commented Oct 17, 2018

@andreimatei same story as with #31184.

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/310a04983cda8ab8d67cd401814341b9b7f8ce79

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=972821&tab=buildLog

The test failed on master:
	test.go:1002: test timed out (4h0m0s)
	test.go:606,cluster.go:1441,import.go:122: context canceled

andreimatei (Contributor) commented:

passing to @benesch in the hope that #31570 fixes this

@andreimatei andreimatei assigned benesch and unassigned andreimatei Oct 18, 2018
@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/e6348bb4abbfd117424c382ce5ab42e8abbe88f0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=974325&tab=buildLog

The test failed on release-2.1:
	test.go:1002: test timed out (4h0m0s)
	test.go:606,cluster.go:1453,import.go:122: context canceled

tbg (Member) commented Oct 19, 2018

I thought this might be related to #31618, but the failure above is on release-2.1, which doesn't have the new Raft-based limiting of the uncommitted log.
The failure prior to that, however, did include that change.

tbg (Member) commented Oct 19, 2018

The most recent failure shows a number of stuck Raft proposals and rejected Raft snapshots. Merges are on, so I don't think it's worth spending much time on this today: the known problems behind these symptoms are about to be fixed in #31570.

On master, we instead (or rather, additionally) have #31618, which poses a problem for this test.

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/04cba2800919bdcf6a8467e8da97ae44b77a9626

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=974812&tab=buildLog

The test failed on master:
	test.go:1002: test timed out (4h0m0s)
	test.go:606,cluster.go:1453,import.go:122: context canceled

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/3035b84a682e61fb1cd34db4027dd41f7f2f226a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=977057&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (4h0m0s)
	test.go:639,cluster.go:1453,import.go:122: context canceled

@benesch benesch assigned tbg and unassigned benesch Oct 21, 2018
@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/2998190f18fab952357133aaca9fdda8bc52d5ac

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=978508&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (4h0m0s)
	test.go:639,cluster.go:1453,import.go:122: context canceled

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/d3f39a5cb60eb9cea64635f78e25f67a435057a5

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=980026&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (4h0m0s)
	test.go:639,cluster.go:1453,import.go:122: context canceled

tbg (Member) commented Oct 23, 2018

This run shows the requests.slow.raft metric at 1.00 on n6/s5 throughout, and in fact that's the run I had going earlier today (though the investigation is in #31732 (comment)).

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/5a373445c0674f060a4bfe369ad290a0cacccb6c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=982644&tab=buildLog

The test failed on master:
	test.go:639,cluster.go:1194,cluster.go:1213,cluster.go:1318,restore.go:105,import.go:41,cluster.go:1432,errgroup.go:58: exit status 1
	test.go:639,cluster.go:1453,import.go:122: context canceled

tbg added a commit to tbg/cockroach that referenced this issue Oct 26, 2018
We were accounting for sideloaded payloads (SSTs) when adding them to
the log, but were omitting them during truncations. As a result, the
tracked raft log size would permanently skyrocket which in turn would
lead to extremely aggressive truncations and resulted in pathological
amounts of Raft snapshots.

I'm still concerned about this logic as we're now relying on numbers
obtained from the file system to match exactly a prior in-mem
computation, and there may be other bugs that cause a divergence. But
this fixes the blatant obvious one, so it's a step in the right
direction.

The added tests highlight a likely omission in the sideloaded storage
code which doesn't access the file system through the RocksDB env as it
seems like it should, filed as cockroachdb#31913.

At this point it's unclear whether it fixes the below issues, but at the
very least it seems to address a major problem they encountered:

Touches cockroachdb#31732.
Touches cockroachdb#31740.
Touches cockroachdb#30261.
Touches cockroachdb#31768.
Touches cockroachdb#31745.

Release note (bug fix): avoid a performance degradation related to
overly aggressive Raft log truncations that could occur during RESTORE
or IMPORT operations.
tbg added a commit to tbg/cockroach that referenced this issue Oct 26, 2018
craig bot pushed a commit that referenced this issue Oct 26, 2018
31914: storage: fix Raft log size accounting r=petermattis a=tschottdorf


Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
tbg (Member) commented Oct 26, 2018

FWIW, I still saw this soft-fail with #31914 in. There was a stuck Raft proposal and a slow lease (probably the same thing) for quite a while, and then it went away.

[screenshot]

This is likely because of the race in #31875, perhaps exacerbated by #31732 (comment).

I'm going to keep pulling on these threads until these tests reliably pass without requiring any Raft snapshots. There is a lot of dysfunction in these parts of the code.

tbg (Member) commented Oct 26, 2018

I picked one of the handful of stuck ranges at random; here's the log. We see that n7 tries twice to send a Raft snapshot to a follower but doesn't manage to within the 1-minute deadline (probably blocked on the receiver-side semaphore, since other snapshots are in flight). It takes around ~15 minutes until a snapshot goes through, totaling ~23m of outage:

slow command TransferLease [/Table/53/1/380474785/4,/Min) finished after 23m25.264043405s

The problem in #31732 (comment) probably also plays a role. The log is full of "intersecting range" failed snapshots, though not referring to this range:

$ roachprod ssh tobias-1540546843-import-tpch-nodes-8:1-8 -- "echo; grep -rE intersects.existing" | grep -oE '[IWEF]1810.*' | cut -c 2- | sort | cut -c 1-12 | sort | uniq -c
   3 181026 11:51
   3 181026 11:52
   4 181026 11:53
  11 181026 11:54
   4 181026 11:55
   2 181026 11:56
  14 181026 11:57
  41 181026 11:58
  15 181026 11:59
   1 181026 12:01
   4 181026 12:02
   2 181026 12:04
   2 181026 12:05
   2 181026 12:09
   2 181026 12:11
   1 181026 12:12
   1 181026 12:15
   1 181026 12:16
   3 181026 12:18
   1 181026 12:19
   5 181026 12:20
  24 181026 12:21
   8 181026 12:22
   5 181026 12:23
   4 181026 16:25
  10 181026 16:26
  33 181026 16:27
   9 181026 16:28
  12 181026 16:29
   5 181026 16:30
  39 181026 16:31
 116 181026 16:32
   1 181026 16:33
   2 181026 16:34
   2 181026 16:36
  10 181026 16:37
   1 181026 16:38
  18 181026 16:39
  13 181026 16:40
   9 181026 16:42
  42 181026 16:43
 111 181026 16:44
 233 181026 16:45
 182 181026 16:46
 182 181026 16:47
 179 181026 16:48
 123 181026 16:49
  90 181026 16:50
  45 181026 16:51
  71 181026 16:52
  87 181026 16:53
  53 181026 16:54
  66 181026 16:55
  25 181026 16:56
   1 181026 16:58
   2 181026 17:09
   2 181026 17:14

Note how this spikes dramatically while the request we're looking at is stuck. I'm pretty sure some of these "unsnapshottable" ranges are crowding out the ranges that actually need a snapshot to unblock this mess.

181026 11:56:52.329995 41384 storage/replica_command.go:300  [n7,s7,r1722/3:/Table/53/1/3{693635…-702615…}] initiating a split of this range at key /Table/53/1/369630626/4 [r1723]
181026 11:56:53.915460 41430 storage/replica_command.go:300  [n7,s7,r1723/3:/Table/53/1/3{696306…-702615…}] initiating a split of this range at key /Table/53/1/369897794/3 [r1724]
181026 11:56:56.992972 36870 storage/replica_command.go:816  [n5,replicate,s6,r1723/4:/Table/53/1/369{63062…-89779…}] change replicas (REMOVE_REPLICA (n7,s7):3): read existing descriptor r1723:/Table/53/1/369{630626/4-897794/3} [(n6,s8):1, (n8,s5):2, (n7,s7):3, (n5,s6):4, next=5, gen=1]
181026 11:56:57.200411 36870 storage/replica.go:3935  [n5,s6,r1723/4:/Table/53/1/369{63062…-89779…}] proposing REMOVE_REPLICA((n7,s7):3): updated=[(n6,s8):1 (n8,s5):2 (n5,s6):4] next=5
181026 11:56:57.291863 41512 storage/store.go:2547  [n7,replicaGC,s7,r1723/3:/Table/53/1/369{63062…-89779…}] removing replica r1723/3
181026 11:56:57.332060 41512 storage/replica.go:873  [n7,replicaGC,s7,r1723/3:/Table/53/1/369{63062…-89779…}] removed 267183 (267176+7) keys in 36ms [clear=1ms commit=36ms]
181026 16:30:05.738500 103340 storage/replica_command.go:300  [n7,s6,r1722/3:/Table/53/1/3{791656…-819682…}] initiating a split of this range at key /Table/53/1/380474785/4 [r1723]
181026 16:30:06.147423 106299 storage/queue.go:791  [n7,replicate,s6,r1723/3:/Table/53/1/38{04747…-19682…}] no removable replicas from range that needs a removal: [no raft status]
181026 16:30:07.166128 112948 storage/replica_command.go:300  [n7,s6,r1723/3:/Table/53/1/38{04747…-19682…}] initiating a split of this range at key /Table/53/1/380741700/8 [r1724]
181026 16:30:07.330639 114080 storage/replica_command.go:816  [n7,s6,r1723/3:/Table/53/1/380{47478…-74170…}] change replicas (REMOVE_REPLICA (n5,s5):2): read existing descriptor r1723:/Table/53/1/380{474785/4-741700/8} [(n6,s7):1, (n5,s5):2, (n7,s6):3, (n4,s4):4, (n8,s8):5, next=6, gen=1]
181026 16:30:07.441886 114080 storage/replica.go:3935  [n7,s6,r1723/3:/Table/53/1/380{47478…-74170…}] proposing REMOVE_REPLICA((n5,s5):2): updated=[(n6,s7):1 (n8,s8):5 (n7,s6):3 (n4,s4):4] next=6
181026 16:31:07.459022 49770 storage/queue.go:791  [n5,replicaGC,s5,r1723/2:/Table/53/1/380{47478…-74170…}] aborted during DistSender.Send: context deadline exceeded
181026 16:31:07.459902 114911 storage/replica.go:3396  [n7,s6,r1723/3:/Table/53/1/380{47478…-74170…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/380474785/4,/Min)
181026 16:31:07.510139 50370 storage/store.go:2547  [n5,replicaGC,s5,r1723/2:/Table/53/1/380{47478…-74170…}] removing replica r1723/2
181026 16:31:07.518188 50370 storage/replica.go:873  [n5,replicaGC,s5,r1723/2:/Table/53/1/380{47478…-74170…}] removed 6 (0+6) keys in 8ms [clear=0ms commit=7ms]
181026 16:37:22.636053 419729 storage/queue.go:791  [n7,raftlog,s6,r1723/3:/Table/53/1/380{47478…-74170…}] aborted during DistSender.Send: context deadline exceeded
181026 16:47:51.697514 425268 storage/queue.go:791  [n7,raftlog,s6,r1723/3:/Table/53/1/380{47478…-74170…}] aborted during DistSender.Send: context deadline exceeded
181026 16:53:32.687550 428076 storage/store_snapshot.go:636  [n7,raftsnapshot,s6,r1723/3:/Table/53/1/380{47478…-74170…}] sending Raft snapshot f07c5839 at applied index 19
181026 16:53:32.688164 428076 storage/store_snapshot.go:679  [n7,raftsnapshot,s6,r1723/3:/Table/53/1/380{47478…-74170…}] streamed snapshot to (n4,s4):4: kv pairs: 9, log entries: 9, rate-limit: 8.0 MiB/sec, 1ms
181026 16:53:32.688767 112656 storage/replica_raftstorage.go:798  [n4,s4,r1723/4:{-}] applying Raft snapshot at index 19 (id=f07c5839, encoded size=4656, 1 rocksdb batches, 9 log entries)
181026 16:53:32.704097 112656 storage/replica_raftstorage.go:804  [n4,s4,r1723/4:/Table/53/1/380{47478…-74170…}] applied Raft snapshot in 15ms [clear=1ms batch=0ms entries=0ms commit=14ms]
181026 16:53:32.706369 428076 storage/store_snapshot.go:636  [n7,raftsnapshot,s6,r1723/3:/Table/53/1/380{47478…-74170…}] sending Raft snapshot f6333413 at applied index 19
181026 16:53:32.706775 428076 storage/store_snapshot.go:679  [n7,raftsnapshot,s6,r1723/3:/Table/53/1/380{47478…-74170…}] streamed snapshot to (n8,s8):5: kv pairs: 9, log entries: 9, rate-limit: 8.0 MiB/sec, 2ms
181026 16:53:32.707448 61689 storage/replica_raftstorage.go:798  [n8,s8,r1723/5:{-}] applying Raft snapshot at index 19 (id=f6333413, encoded size=4656, 1 rocksdb batches, 9 log entries)
181026 16:53:32.715184 61689 storage/replica_raftstorage.go:804  [n8,s8,r1723/5:/Table/53/1/380{47478…-74170…}] applied Raft snapshot in 8ms [clear=1ms batch=0ms entries=0ms commit=7ms]
181026 16:53:32.723782 114911 storage/replica.go:3405  [n7,s6,r1723/3:/Table/53/1/380{47478…-74170…}] slow command TransferLease [/Table/53/1/380474785/4,/Min) finished after 23m25.264043405s
181026 16:53:32.739924 114080 storage/replica_command.go:816  [n7,s6,r1723/3:/Table/53/1/380{47478…-74170…}] change replicas (REMOVE_REPLICA (n8,s8):5): read existing descriptor r1723:/Table/53/1/380{474785/4-741700/8} [(n6,s7):1, (n8,s8):5, (n7,s6):3, (n4,s4):4, next=6, gen=1]
181026 16:53:32.774860 8274719 storage/queue.go:791  [n6,replicate,s7,r1723/1:/Table/53/1/380{47478…-74170…}] no removable replicas from range that needs a removal: [1*:490, 5:0, 3:0, 4:0]
181026 16:53:32.861699 8274720 storage/replica.go:3935  [n6,s7,r1723/1:/Table/53/1/380{47478…-74170…}] proposing REMOVE_REPLICA((n8,s8):5): updated=[(n6,s7):1 (n4,s4):4 (n7,s6):3] next=6
181026 16:53:32.889151 61717 storage/store.go:2547  [n8,replicaGC,s8,r1723/5:/Table/53/1/380{47478…-74170…}] removing replica r1723/5
181026 16:53:32.896775 61717 storage/replica.go:873  [n8,replicaGC,s8,r1723/5:/Table/53/1/380{47478…-74170…}] removed 6 (0+6) keys in 7ms [clear=0ms commit=7ms]
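
As a rough cross-check, here is a hedged sketch (assuming the same roachprod cluster name as above and the logs/cockroach.log path shown later in this thread) to tally, per node, the stalled-proposal warnings and the eventual slow-command completions that bracket this outage:

roachprod ssh tobias-1540546843-import-tpch-nodes-8:1-8 -- "echo; grep -c 'have been waiting .* for proposing command' logs/cockroach.log"
roachprod ssh tobias-1540546843-import-tpch-nodes-8:1-8 -- "echo; grep -c 'slow command .* finished after' logs/cockroach.log"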

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/5ef4d2c8621fc5465f73a96221b0bd0bc5cd27aa

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=990073&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (4h0m0s)
	test.go:639,cluster.go:1453,import.go:122: context canceled

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/109cf8705b773c0d3a1e7ab02ce63f764e101106

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=991714&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (4h0m0s)
	test.go:639,cluster.go:1453,import.go:122: context canceled

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/bbc646fc6de90b59c0253fd682667715959fb657

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=993605&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (4h0m0s)
	test.go:639,cluster.go:1453,import.go:122: context canceled

tbg (Member) commented Oct 30, 2018

^- also had #31988.

tbg (Member) commented Oct 30, 2018

I have this reproducing again. Ready for some (familiar) badness? Merges are off and I'm running off master with a few debuggability PRs. I have a few stuck Raft proposals. One is a TransferLease on this range:

tschottdorf@tobias-1540921719-import-tpch-nodes-8-0005:~$ grep -R 'have been waiting' logs/cockroach.log
W181030 21:00:47.512887 8626846 storage/replica.go:3400 [n7,s7,r3779/3:/Table/53/5/994{1/389…-9/344…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/5/9941/389530848/5,/Min)

Getting stuck in the middle of a TransferLease is pretty bad because n7 will send requests to n6 and n6 will send them back to n7! This infinite loop in DistSender shows clearly in logspy. This is kind of the extreme case of #22837 and explains why we're not seeing more stuck requests; they're artificially sent around in a tight loop instead. Fun. Ok, time to look at r3779:

[screenshot]

This range is stuck in the middle of a replication change. It has four replicas (which the "healthy" replicas agree on), but two of them never even got initialized, i.e. they're basically just Raft groups with no data. This must mean that the preemptive snapshots sent out during upreplication got GC'ed away, making the Raft snapshot necessary. Thanks to the same kind of badness as in (the comment burst starting at) #31409 (comment), getting a raft snapshot isn't easy.

Interestingly, in this run n8 is the only node that reports a constant stream of Raft snapshot failures (again the "intersecting existing range" flavor).

And we're also seeing again that the snapshots take ages (I have written about this elsewhere; this is 99% because of the semaphore we use at the receiver, which limits the total number of incoming snapshots but can really back up when there are 10 other nodes that all want to send snapshots).

I181030 22:26:35.058919 20435156 storage/store_snapshot.go:636  [n7,raftsnapshot,s7,r3154/3:/Table/53/7/883{1/517…-9/480…}] sending Raft snapshot 78a3ea23 at applied index 22
I181030 22:26:42.132450 20435156 storage/store_snapshot.go:679  [n7,raftsnapshot,s7,r3154/3:/Table/53/7/883{1/517…-9/480…}] streamed snapshot to (n8,s8):2: kv pairs: 1977106, log entries: 4, rate-limit: 8.0 MiB/sec, 41454ms
I181030 22:26:44.987221 20435156 storage/queue.go:751  [n7,raftsnapshot,s7,r3154/3:/Table/53/7/883{1/517…-9/480…}] done
I181030 22:26:44.987253 20435156 storage/queue.go:667  [n7,raftsnapshot,s7,r3154/3:/Table/53/7/883{1/517…-9/480…}] done 44.309719085s
I181030 22:26:44.987315 20604056 storage/queue.go:707  [n7,raftsnapshot,s7,r3979/3:/Table/53/7/842{7/220…-9/929…}] processing replica
I181030 22:26:44.987334 20604056 storage/queue.go:745  [n7,raftsnapshot,s7,r3979/3:/Table/53/7/842{7/220…-9/929…}] processing
I181030 22:26:44.987343 20604056 storage/queue.go:751  [n7,raftsnapshot,s7,r3979/3:/Table/53/7/842{7/220…-9/929…}] done
I181030 22:26:44.987354 20604056 storage/queue.go:667  [n7,raftsnapshot,s7,r3979/3:/Table/53/7/842{7/220…-9/929…}] done 65.268µs
I181030 22:26:44.987386 20604057 storage/queue.go:707  [n7,raftsnapshot,s7,r3019/3:/Table/53/6/100{32/51…-40/46…}] processing replica
I181030 22:26:44.987411 20604057 storage/queue.go:745  [n7,raftsnapshot,s7,r3019/3:/Table/53/6/100{32/51…-40/46…}] processing
I181030 22:26:44.987425 20604057 storage/raft_snapshot_queue.go:94  [n7,raftsnapshot,s7,r3019/3:/Table/53/6/100{32/51…-40/46…}] sending raft snapshot
I181030 22:26:46.120532 145 storage/raft_snapshot_queue.go:76  [n7,raftsnapshot,s7,r1716/1:/Table/53/1/40{08494…-11166…}] raft snapshot needed, enqueuing
I181030 22:26:47.419366 145 storage/raft_snapshot_queue.go:76  [n7,raftsnapshot,s7,r2242/6:/Table/53/2/47{61077…-85054…}] raft snapshot needed, enqueuing
I181030 22:26:50.882993 145 storage/raft_snapshot_queue.go:76  [n7,raftsnapshot,s7,r1698/1:/Table/53/1/398{72845…-99453…}] raft snapshot needed, enqueuing
I181030 22:26:53.479935 145 storage/raft_snapshot_queue.go:76  [n7,raftsnapshot,s7,r1768/1:/Table/53/1/465{28678…-55462…}] raft snapshot needed, enqueuing
I181030 22:26:53.913092 145 storage/raft_snapshot_queue.go:76  [n7,raftsnapshot,s7,r1959/3:/Table/53/1/450{48224…-75017…}] raft snapshot needed, enqueuing
I181030 22:26:59.540529 145 storage/raft_snapshot_queue.go:76  [n7,raftsnapshot,s7,r1712/1:/Table/53/1/{399795…-400062…}] raft snapshot needed, enqueuing
I181030 22:27:01.272210 145 storage/raft_snapshot_queue.go:76  [n7,raftsnapshot,s7,r2918/3:/Table/53/5/824{7/103…-8/587…}] raft snapshot needed, enqueuing
I181030 22:27:05.167608 145 storage/raft_snapshot_queue.go:76  [n7,raftsnapshot,s7,r2052/1:/Table/53/1/463{37971…-64701…}] raft snapshot needed, enqueuing
I181030 22:27:06.466713 145 storage/raft_snapshot_queue.go:76  [n7,raftsnapshot,s7,r1263/3:/Table/53/1/35{69938…-72608…}] raft snapshot needed, enqueuing
I181030 22:27:08.198814 145 storage/raft_snapshot_queue.go:76  [n7,raftsnapshot,s7,r1765/1:/Table/53/1/464{48490…-75209…}] raft snapshot needed, enqueuing
I181030 22:27:09.497488 145 storage/raft_snapshot_queue.go:76  [n7,raftsnapshot,s7,r1655/1:/Table/53/1/389{32707…-59408…}] raft snapshot needed, enqueuing
I181030 22:27:09.930547 145 storage/raft_snapshot_queue.go:76  [n7,raftsnapshot,s7,r3261/3:/Table/53/5/102{87/24…-95/19…}] raft snapshot needed, enqueuing
I181030 22:27:10.363637 145 storage/raft_snapshot_queue.go:76  [n7,raftsnapshot,s7,r1658/1:/Table/53/1/3{898609…-900234…}] raft snapshot needed, enqueuing
I181030 22:27:12.527891 145 storage/raft_snapshot_queue.go:76  [n7,raftsnapshot,s7,r2084/3:/Table/53/1/454{26813…-53437…}] raft snapshot needed, enqueuing
I181030 22:27:17.723137 145 storage/raft_snapshot_queue.go:76  [n7,raftsnapshot,s7,r2344/3:/Table/53/5/102{17/55…-25/51…}] raft snapshot needed, enqueuing
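
Back-of-envelope on the semaphore point above (hedged: assuming the receiver admits snapshots roughly one at a time): with each snapshot taking on the order of 45s, as in the 41454ms stream above, a backlog of ~30 waiting ranges already amounts to 30 × 45s ≈ 22.5 minutes, which is in the same ballpark as the ~23m TransferLease stall seen in the earlier failure.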

tbg (Member) commented Oct 30, 2018

So here the main problems are: a) the preemptive snapshots were removed (GC'ed) before the upreplication had completed (semi-tracked in #31875; I'll have to fix this problem before that PR merges), b) the Raft snapshot queue is mostly LIFO, and c) the three-member Raft group with one replica requiring a snapshot decided that it was a good idea to add a fourth replica (at which point it loses quorum).
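
To spell out why c) bites, using plain Raft quorum arithmetic (a generic illustration, not codebase-specific): a 3-replica group needs ⌊3/2⌋+1 = 2 voters, which the two healthy replicas still provide even with one replica awaiting a snapshot; after adding a fourth replica the quorum becomes ⌊4/2⌋+1 = 3, but only two replicas are caught up, so the range cannot commit anything until a snapshot lands.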

tbg (Member) commented Oct 30, 2018

and d) the semaphore on the receiver that can back up all too easily.

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/5b5738084b8cdc769d5e7973921de5cae4457380

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=995412&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (4h0m0s)
	test.go:639,cluster.go:1461,import.go:122: context canceled

tbg (Member) commented Oct 31, 2018

Another run, similar problems (my branch has the fix for Raft snapshot queue starvation described in #31947 already).

We (again) have a range with four replicas and two of them need a snapshot. It looks like the main problem is that the most recently added replica had its preemptive snapshot GC'ed before it had a chance to catch up, and isn't receiving a Raft snapshot because $UNKNOWN_REASONS.

It's interesting to note that n1 is uninitialized. Its replicaID is 2, which means that it hasn't been mucked with since the range was created (i.e., split off from its left neighbor). The fact that it's uninitialized means that the left neighbor jumped past the split trigger via a snapshot (perhaps something that would be avoided with #31875), which implicitly wipes the data the RHS could otherwise pick up, leaving it needing a snapshot. Ah, I'm seeing from the logs that it's "the other" scenario that can lead to this outcome: our LHS was moved off n1 right after the split, giving the replica no time to execute the split trigger:

I181031 11:36:19.953348 43512 storage/replica_command.go:299 [n8,s8,r1622/3:/{Table/53/1/5…-Max}] initiating a split of this range at key /Table/53/1/561045127/5 [r1542]
I181031 11:36:20.003817 43470 storage/replica_command.go:855 [n8,s8,r1622/3:/Table/53/1/56{07780…-10451…}] change replicas (REMOVE_REPLICA (n2,s2):1): read existing descriptor r1622:/Table/53/1/56{0778055/8-1045127/5} [(n2,s2):1, (n1,s1):2, (n8,s8):3, (n7,s7):4, next=5, gen=1]

[screenshot]

There's definitely still a lot to fix here. I wonder if something changed in import/restore that rebalances/scatters much more aggressively, because it feels like these problems have gotten worse (though there's no doubt that they existed for a long time). But perhaps that's just a subjective impression.

tbg (Member) commented Oct 31, 2018

What the? Logspying on the Raft snapshot queue on n5 (which is backed up with ~200 replicas), I'm seeing r3064 get processed (with a failure), then re-added and immediately processed again, despite running with #32053. I thought for a bit that this was because of the "requeue" mechanism, but that too removes and re-adds to the heap, so it should do the right thing.

I181031 13:27:01.560834 863439 storage/queue.go:745  [n5,raftsnapshot,s6,r3064/1:/Table/53/6/820{1/159…-8/559…}] processing
I181031 13:27:01.560846 863439 storage/raft_snapshot_queue.go:94  [n5,raftsnapshot,s6,r3064/1:/Table/53/6/820{1/159…-8/559…}] sending raft snapshot
I181031 13:27:01.995378 159 storage/queue.go:519  [n5,raftsnapshot,s6,r3078/3:/Table/53/7/1006{8/57…-9/26…}] adding: priority=0.000
I181031 13:27:05.670996 127 storage/raft_snapshot_queue.go:76  [n5,raftsnapshot,s6,r2797/1:/Table/53/3/{496828…-503051…}] raft snapshot needed, enqueuing
I181031 13:27:07.014778 863439 storage/queue.go:667  [n5,raftsnapshot,s6,r3064/1:/Table/53/6/820{1/159…-8/559…}] done 5.453986849s
E181031 13:27:07.014796 863439 storage/queue.go:831  [n5,raftsnapshot,s6,r3064/1:/Table/53/6/820{1/159…-8/559…}] snapshot failed: (n6,s5):3: remote couldn't accept Raft snapshot 1fa93ad8 at applied index 20 with error: [n6,s5],r3064: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n6,s5,r1891/3:/{Table/53/6/8…-Max}]
I181031 13:27:07.014864 863483 storage/queue.go:707  [n5,raftsnapshot,s6,r3956/2:/Table/53/9/77{6704/…-9231/…}] processing replica
I181031 13:27:07.014870 863483 storage/queue.go:745  [n5,raftsnapshot,s6,r3956/2:/Table/53/9/77{6704/…-9231/…}] processing
I181031 13:27:07.014880 863483 storage/raft_snapshot_queue.go:94  [n5,raftsnapshot,s6,r3956/2:/Table/53/9/77{6704/…-9231/…}] sending raft snapshot
I181031 13:27:07.462447 157 storage/queue.go:519  [n5,raftsnapshot,s6,r3064/1:/Table/53/6/820{1/159…-8/559…}] adding: priority=0.000
I181031 13:27:12.235036 863483 storage/queue.go:667  [n5,raftsnapshot,s6,r3956/2:/Table/53/9/77{6704/…-9231/…}] done 5.220199318s
E181031 13:27:12.235057 863483 storage/queue.go:831  [n5,raftsnapshot,s6,r3956/2:/Table/53/9/77{6704/…-9231/…}] snapshot failed: (n3,s4):3: remote couldn't accept Raft snapshot b3ff357e at applied index 22 with error: [n3,s4],r3956: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n3,s4,r3793/3:/{Table/53/9/2…-Max}]
I181031 13:27:12.235125 863492 storage/queue.go:707  [n5,raftsnapshot,s6,r3078/3:/Table/53/7/1006{8/57…-9/26…}] processing replica
I181031 13:27:12.235134 863492 storage/queue.go:745  [n5,raftsnapshot,s6,r3078/3:/Table/53/7/1006{8/57…-9/26…}] processing
I181031 13:27:12.235148 863492 storage/raft_snapshot_queue.go:94  [n5,raftsnapshot,s6,r3078/3:/Table/53/7/1006{8/57…-9/26…}] sending raft snapshot
I181031 13:27:12.456530 148 storage/queue.go:519  [n5,raftsnapshot,s6,r3956/2:/Table/53/9/77{6704/…-9231/…}] adding: priority=0.000
I181031 13:27:16.192758 127 storage/raft_snapshot_queue.go:76  [n5,raftsnapshot,s6,r2501/1:/Table/53/2/54{74516…-98515…}] raft snapshot needed, enqueuing
I181031 13:27:24.087050 127 storage/raft_snapshot_queue.go:76  [n5,raftsnapshot,s6,r2734/1:/Table/53/2/46{50889…-74859…}] raft snapshot needed, enqueuing
I181031 13:27:27.477469 863492 storage/queue.go:667  [n5,raftsnapshot,s6,r3078/3:/Table/53/7/1006{8/57…-9/26…}] done 15.242371997s
E181031 13:27:27.477510 863492 storage/queue.go:831  [n5,raftsnapshot,s6,r3078/3:/Table/53/7/1006{8/57…-9/26…}] snapshot failed: (n8,s8):1: remote couldn't accept Raft snapshot fc7c974e at applied index 18 with error: [n8,s8],r3078: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n8,s8,r3188/1:/{Table/53/7/8…-Max}]
I181031 13:27:27.477570 863590 storage/queue.go:707  [n5,raftsnapshot,s6,r3064/1:/Table/53/6/820{1/159…-8/559…}] processing replica
I181031 13:27:27.477579 863590 storage/queue.go:745  [n5,raftsnapshot,s6,r3064/1:/Table/53/6/820{1/159…-8/559…}] processing
I181031 13:27:27.477596 863590 storage/raft_snapshot_queue.go:94  [n5,raftsnapshot,s6,r3064/1:/Table/53/6/820{1/159…-8/559…}] sending raft snapshot
I181031 13:27:27.631466 127 storage/queue.go:570  [n5,raftsnapshot] 1082: removing
I181031 13:27:27.995434 149 storage/queue.go:519  [n5,raftsnapshot,s6,r3078/3:/Table/53/7/1006{8/57…-9/26…}] adding: priority=0.000
I181031 13:27:28.973971 127 storage/raft_snapshot_queue.go:76  [n5,raftsnapshot,s6,r1487/3:/Table/53/1/38{47686…-50362…}] raft snapshot needed, enqueuing
I181031 13:27:35.119150 863590 storage/queue.go:667  [n5,raftsnapshot,s6,r3064/1:/Table/53/6/820{1/159…-8/559…}] done 7.641604866s

tbg (Member) commented Oct 31, 2018

Ha. User error. The difference between ./build/builder.sh make build and ./build/builder.sh mkrelease is deadly. Ok, I'm glad this is the reason everything is so weird today.

benesch (Contributor) commented Oct 31, 2018

Ha. User error. The difference between ./build/builder.sh make build and ./build/builder.sh mkrelease is deadly. Ok, I'm glad this is the reason everything is so weird today.

Wait, what?! Does that imply all the slowness is a result of compiling RocksDB w/o optimizations?

tbg (Member) commented Oct 31, 2018

No, I was using a binary that had fewer fixes than I thought.

benesch (Contributor) commented Oct 31, 2018

Oh, cockroach vs cockroach-linux-2.6.32-gnu-amd64? 😅

tbg (Member) commented Oct 31, 2018

Exactly. Luckily this only happened in today's runs, and as a bonus, this last comment actually proves fairly definitively that #32053 will make the badness we're seeing a lot less bad. Curious what will be left over; my repros are running again.

@cockroach-teamcity (Member Author)

SHA: https://github.com/cockroachdb/cockroach/commits/c196a52e8ec61851e32a062c13135bbb4111eb96

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=8 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=997430&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (4h0m0s)
	test.go:639,cluster.go:1461,import.go:122: context canceled

tbg added a commit to tbg/cockroach that referenced this issue Nov 16, 2018
tbg added a commit to tbg/cockroach that referenced this issue Nov 16, 2018
tbg (Member) commented Dec 3, 2018

Hopefully fixed for good by #32594 and #32414

@tbg tbg closed this as completed Dec 3, 2018