ccl/partitionccl: TestInitialPartitioning failed under stress #39651

Closed
cockroach-teamcity opened this issue Aug 14, 2019 · 1 comment · Fixed by #39658
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/bad5e2ea8674f0513f949d3940609b1f4d685581

Parameters:

TAGS=
GOFLAGS=-parallel=4

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestInitialPartitioning PKG=github.com/cockroachdb/cockroach/pkg/ccl/partitionccl TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1438014&tab=buildLog


I190814 05:43:30.845972 176419 storage/replica_command.go:1198  [n3,replicate,s3,r242/1:/Table/83/1{-/00:00:…}] change replicas (ADD_REPLICA (n1,s1):3LEARNER): existing descriptor r242:/Table/83/1{-/00:00:00} [(n3,s3):1, next=3, gen=213]
I190814 05:43:30.848659 173954 ccl/partitionccl/partition_test.go:216  query: SELECT count(*) FROM interval_table WHERE a < '00:00:00':::INTERVAL
I190814 05:43:30.851700 173954 ccl/partitionccl/partition_test.go:216  query: SELECT count(*) FROM interval_table WHERE a < '00:00:00':::INTERVAL
I190814 05:43:30.854114 176379 storage/store.go:2571  [n1,replicaGC,s1,r242/2:/Table/83/1{-/00:00:…}] removing replica r242/2
I190814 05:43:30.854448 176379 storage/replica_destroy.go:146  [n1,replicaGC,s1,r242/2:/Table/83/1{-/00:00:…}] removed 7 (0+7) keys in 0ms [clear=0ms commit=0ms]
I190814 05:43:30.854805 173954 ccl/partitionccl/partition_test.go:216  query: SELECT count(*) FROM interval_table WHERE a < '00:00:00':::INTERVAL
I190814 05:43:30.856258 176419 storage/replica_raft.go:291  [n3,s3,r242/1:/Table/83/1{-/00:00:…}] proposing ADD_REPLICA[(n1,s1):3LEARNER]: after=[(n3,s3):1 (n1,s1):3LEARNER] next=4
I190814 05:43:30.857100 176419 storage/store_snapshot.go:786  [n3,replicate,s3,r242/1:/Table/83/1{-/00:00:…}] sending LEARNER snapshot 25383b71 at applied index 24
I190814 05:43:30.857235 176419 storage/store_snapshot.go:829  [n3,replicate,s3,r242/1:/Table/83/1{-/00:00:…}] streamed snapshot to (n1,s1):3: kv pairs: 14, log entries: 0, rate-limit: 8.0 MiB/sec, 0.00s
I190814 05:43:30.857513 173954 ccl/partitionccl/partition_test.go:216  query: SELECT count(*) FROM interval_table WHERE a IN ('00:00:00':::INTERVAL)
I190814 05:43:30.858089 176534 storage/replica_raftstorage.go:823  [n1,s1,r242/3:{-}] applying LEARNER snapshot at index 24 (id=25383b71, encoded size=2013, 1 rocksdb batches, 0 log entries)
I190814 05:43:30.858378 176534 storage/replica_raftstorage.go:829  [n1,s1,r242/3:/Table/83/1{-/00:00:…}] applied LEARNER snapshot in 0ms [clear=0ms batch=0ms entries=0ms commit=0ms]
I190814 05:43:30.858772 176419 storage/replica_command.go:1198  [n3,replicate,s3,r242/1:/Table/83/1{-/00:00:…}] change replicas (ADD_REPLICA (n1,s1):3): existing descriptor r242:/Table/83/1{-/00:00:00} [(n3,s3):1, (n1,s1):3LEARNER, next=4, gen=214]
I190814 05:43:30.860648 173954 ccl/partitionccl/partition_test.go:216  query: SELECT count(*) FROM interval_table WHERE a < '00:00:00':::INTERVAL
I190814 05:43:30.863465 173954 ccl/partitionccl/partition_test.go:216  query: SELECT count(*) FROM interval_table WHERE a < '00:00:00':::INTERVAL
I190814 05:43:30.866260 173954 ccl/partitionccl/partition_test.go:216  query: SELECT count(*) FROM interval_table WHERE a < '00:00:00':::INTERVAL
I190814 05:43:30.867352 176419 storage/replica_raft.go:291  [n3,s3,r242/1:/Table/83/1{-/00:00:…}] proposing ADD_REPLICA[(n1,s1):3]: after=[(n3,s3):1 (n1,s1):3] next=4
E190814 05:43:30.867779 176419 storage/queue.go:1027  [n3,replicate,s3,r242/1:/Table/83/1{-/00:00:…}] no removable replicas from range that needs a removal: [1*:26, 3:25]
I190814 05:43:30.868235 176590 storage/replica_command.go:1198  [n3,replicate,s3,r245/1:/Table/83/{1/1 day…-2}] change replicas (REMOVE_REPLICA (n1,s1):2LEARNER): existing descriptor r245:/Table/83/{1/1 day -24:00:00-2} [(n3,s3):1, (n1,s1):2LEARNER, next=3, gen=213]
I190814 05:43:30.869667 173954 ccl/partitionccl/partition_test.go:216  query: SELECT count(*) FROM interval_table WHERE a = '00:00:00':::INTERVAL
I190814 05:43:30.872257 173954 ccl/partitionccl/partition_test.go:216  query: SELECT count(*) FROM interval_table WHERE a < '00:00:00':::INTERVAL
I190814 05:43:30.872898 176590 storage/replica_raft.go:291  [n3,s3,r245/1:/Table/83/{1/1 day…-2}] proposing REMOVE_REPLICA[(n1,s1):2LEARNER]: after=[(n3,s3):1] next=3
I190814 05:43:30.873574 176590 storage/replica_command.go:1198  [n3,replicate,s3,r245/1:/Table/83/{1/1 day…-2}] change replicas (ADD_REPLICA (n1,s1):3LEARNER): existing descriptor r245:/Table/83/{1/1 day -24:00:00-2} [(n3,s3):1, next=3, gen=214]
cockroach-teamcity added the C-test-failure (Broken test, automatically or manually discovered) and O-robot (Originated from a bot) labels on Aug 14, 2019
cockroach-teamcity added this to the 19.2 milestone on Aug 14, 2019
tbg commented Aug 14, 2019

I think I broke this in #39571, looking.

F190814 11:26:59.316508 2725 storage/store.go:2162  [n1,s1,r234/2:/Table/83/{1/29 ye…-2}] replicaID cannot move backwards from 7 to 2
goroutine 2725 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc0003c6301, 0xc0003c6300, 0x0, 0xc004992400)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1016 +0xb1
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x6488280, 0xc000000004, 0x5e00346, 0x10, 0x872, 0xc004e2b020, 0x53)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x92b
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x40e6160, 0xc00387e240, 0x4, 0x2, 0x0, 0x0, 0xc0041e06f8, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2cc
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x40e6160, 0xc00387e240, 0x1, 0xc000000004, 0x0, 0x0, 0xc0041e06f8, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:189
github.com/cockroachdb/cockroach/pkg/storage.splitPostApply(0x40e6160, 0xc00387e240, 0x0, 0x15bac673223700e4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2162 +0x3ea
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleSplitResult(...)
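
The fatal above comes from a monotonicity assertion on the replica's ID: a store must never re-install a replica under an ID lower than one it has already seen. A minimal, self-contained sketch of that kind of guard, using hypothetical types and names rather than the actual cockroach code:

```go
package main

import "log"

// ReplicaID identifies a replica of a range on a store.
type ReplicaID int

// Replica keeps only what this sketch needs: the highest replicaID the
// store has observed for this range.
type Replica struct {
	replicaID ReplicaID
}

// setReplicaID illustrates the invariant behind the fatal message above:
// a replica's ID may only grow, so installing a lower ID is treated as a
// programming error and crashes the process.
func (r *Replica) setReplicaID(id ReplicaID) {
	if id < r.replicaID {
		log.Fatalf("replicaID cannot move backwards from %d to %d", r.replicaID, id)
	}
	r.replicaID = id
}

func main() {
	r := &Replica{replicaID: 7}
	r.setReplicaID(2) // fatals, mirroring the crash in the log above
}
```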

tbg added a commit to tbg/cockroach that referenced this issue Aug 14, 2019
See cockroachdb#39571.

The above PR was incomplete because we also need to wind back the
replicaID. I was able to confirm that this *actually* works by running
tests with this diff, which simulates an incoming raft message
reflecting a high replicaID just before the split gets applied.

```diff
diff --git a/pkg/storage/replica_raft.go b/pkg/storage/replica_raft.go
index 6d5a044657..981f3e1714 100644
--- a/pkg/storage/replica_raft.go
+++ b/pkg/storage/replica_raft.go
@@ -1525,7 +1525,7 @@ func (r *Replica) maybeAcquireSplitMergeLock(
 func (r *Replica) acquireSplitLock(
 	ctx context.Context, split *roachpb.SplitTrigger,
 ) (func(), error) {
-	rightRng, _, err := r.store.getOrCreateReplica(ctx, split.RightDesc.RangeID, 0, nil)
+	rightRng, _, err := r.store.getOrCreateReplica(ctx, split.RightDesc.RangeID, 100, nil)
 	if err != nil {
 		return nil, err
 	}
```

@danhhz is working on a real test for this, so we'll soon have
end-to-end coverage.

Fixes cockroachdb#39651.

Release note: None
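
A minimal sketch of the "wind back the replicaID" idea the commit message describes; the types and helper below are hypothetical and stand in for the actual splitPostApply change, they are not the patch itself:

```go
package main

import "fmt"

// ReplicaID identifies a replica of a range on a store.
type ReplicaID int

// rhsReplica models only the state this sketch needs: the replicaID
// currently recorded for the split's right-hand range.
type rhsReplica struct {
	replicaID ReplicaID
}

// applySplit is a toy illustration of the fix described above: if a stray
// raft message already created the right-hand replica with a higher ID than
// the split's descriptor assigns, wind the in-memory ID back to the
// descriptor's value instead of tripping the "cannot move backwards" fatal.
func applySplit(r *rhsReplica, descReplicaID ReplicaID) {
	if r.replicaID > descReplicaID {
		fmt.Printf("winding replicaID back from %d to %d\n", r.replicaID, descReplicaID)
		r.replicaID = descReplicaID
	}
	// ... the real code would go on to initialize the right-hand side here ...
}

func main() {
	// Mirrors the test diff above, which fakes an incoming raft message that
	// created the RHS replica with an artificially high ID (100).
	r := &rhsReplica{replicaID: 100}
	applySplit(r, 2)
}
```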
craig bot pushed a commit that referenced this issue Aug 14, 2019
39658: storage: avoid crash in splitPostApply r=danhhz a=tbg


Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
craig bot closed this as completed in #39658 on Aug 14, 2019