
kv/kvserver: don't bump ReadTimestamp on EndTxn batch with bounded read latches #46625

Merged

Conversation

nvanbenschoten
Member

This commit prevents an error-prone situation where requests with an
EndTxn request that acquire read-latches at bounded timestamps were
allowed to move their ReadTimestamp to their WriteTimestamp without
re-acquiring new latches. This could allow the request to read at an
unprotected timestamp and forgo proper synchronization with writes with
respect to its lock-table scan and timestamp cache update.

This relates to 11bffb2, which made a similar fix to server-side
refreshes, but not to this new (as of this release) pre-emptive
ReadTimestamp update mechanism.

This bug was triggering the following assertion when stressing
`TestKVNemesisSingleNode`:

```
F200326 15:56:11.350743 1199 kv/kvserver/concurrency/concurrency_manager.go:261  [n1,s1,r33/1:/{Table/50/"60…-Max}] caller violated contract: discovered non-conflicting lock
```

The batch that hit this issue looked like:

```
Get [/Table/50/"a1036fd0",/Min),
Get [/Table/50/"e54a51a8",/Min),
QueryIntent [/Table/50/"aafccb40",/Min),
QueryIntent [/Table/50/"be7f37c9",/Min),
EndTxn(commit:true tsflex:true)
```

This commit adds two subtests to `TestTxnCoordSenderRetries` that
create this scenario.
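
To make the failure mode concrete, here is a minimal, self-contained Go sketch of the invariant the fix enforces. All of the types and names below are illustrative stand-ins, not the actual CockroachDB code; the point is only that a ReadTimestamp must not be moved past the bounded timestamp at which the batch's read latches were acquired.

```go
package main

import "fmt"

// ts is a simplified stand-in for an HLC timestamp.
type ts struct{ wall int64 }

func (t ts) less(o ts) bool { return t.wall < o.wall }

// readLatch is a hypothetical, simplified view of a read latch: the key it
// covers and the bounded timestamp it was acquired at. A read latch acquired
// at timestamp T only synchronizes with writes at or below T.
type readLatch struct {
	key    string
	heldAt ts
}

// canBumpReadTimestamp sketches the guard: a batch may only move its
// ReadTimestamp forward to newTS without re-acquiring latches if every read
// latch it declared already covers newTS. Otherwise the request would scan
// the lock table and update the timestamp cache at a timestamp its latches
// do not protect.
func canBumpReadTimestamp(latches []readLatch, newTS ts) bool {
	for _, l := range latches {
		if l.heldAt.less(newTS) {
			return false
		}
	}
	return true
}

func main() {
	latches := []readLatch{
		{key: `/Table/50/"a1036fd0"`, heldAt: ts{wall: 100}},
		{key: `/Table/50/"e54a51a8"`, heldAt: ts{wall: 100}},
	}
	fmt.Println(canBumpReadTimestamp(latches, ts{wall: 100})) // true: latches still cover the timestamp
	fmt.Println(canBumpReadTimestamp(latches, ts{wall: 150})) // false: would read above the latched timestamp
}
```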

Release note (bug fix): A rare bug causing an assertion failure was
fixed. The assertion error message was "caller violated contract:
discovered non-conflicting lock".

Release justification: fixes a serious bug that could crash a server.
Additionally, the bug could have theoretically allowed isolation
violations between transactions without hitting the assertion, though
this was never observed in practice.

cc. @danhhz

@cockroach-teamcity
Member

This change is Reviewable

@danhhz
Contributor

danhhz commented Mar 26, 2020

Cool! Are there any ops that this allows us to reenable in kvnemesis?

```
func NewDefaultConfig() GeneratorConfig {
	config := newAllOperationsConfig()
	// TODO(dan): This fails with a WriteTooOld error if the same key is Put twice
	// in a single batch. However, if the same Batch is committed using txn.Run,
	// then it works and only the last one is materialized. We could make the
	// db.Run behavior match txn.Run by ensuring that all requests in a
	// nontransactional batch are disjoint and upgrading to a transactional batch
	// (see CrossRangeTxnWrapperSender) if they are. roachpb.SpanGroup can be used
	// to efficiently check this.
	//
	// TODO(dan): Make this `config.Ops.Batch.Ops.PutExisting = 0` once #46081 is
	// fixed.
	config.Ops.Batch = BatchOperationConfig{}
	// TODO(dan): Remove when #45586 is addressed.
	config.Ops.ClosureTxn.CommitBatchOps.GetExisting = 0
	config.Ops.ClosureTxn.CommitBatchOps.GetMissing = 0
	return config
}
```
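
For illustration, here is a hand-rolled sketch of the disjointness check that the first TODO above describes. It deliberately avoids the real `roachpb.Span`/`roachpb.SpanGroup` API and uses string keys instead, so everything in it is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// span is a simplified [key, endKey) key span standing in for roachpb.Span.
type span struct {
	key, endKey string
}

// pointSpan represents a single-key request (e.g. a Put) as the half-open
// span [k, k+"\x00") so the same overlap check applies to point and ranged
// requests.
func pointSpan(k string) span { return span{key: k, endKey: k + "\x00"} }

// disjoint reports whether no two spans overlap: sort by start key, then any
// span that starts before its predecessor ends is an overlap.
func disjoint(spans []span) bool {
	s := append([]span(nil), spans...)
	sort.Slice(s, func(i, j int) bool { return s[i].key < s[j].key })
	for i := 1; i < len(s); i++ {
		if s[i].key < s[i-1].endKey {
			return false
		}
	}
	return true
}

func main() {
	// Two Puts to different keys in one batch: disjoint, so safe to upgrade.
	fmt.Println(disjoint([]span{pointSpan("a"), pointSpan("b")})) // true
	// The same key Put twice in one batch: overlapping, the case the TODO
	// above calls out as failing with WriteTooOld.
	fmt.Println(disjoint([]span{pointSpan("a"), pointSpan("a")})) // false
}
```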

@andreimatei andreimatei left a comment
Contributor

LGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)


pkg/kv/kvserver/replica_write.go, line 312 at r1 (raw file):

```
// retryable error.
func (r *Replica) evaluateWriteBatch(
	ctx context.Context, idKey storagebase.CmdIDKey, ba *roachpb.BatchRequest, spans *spanset.SpanSet,
```

can we pls sprinkle some comments about this spans field everywhere? And can we rename it to latches?

Even better, let's combine it with ba in a struct (serializedRequest? or just request). It can embed a *roachpb.BatchRequest.
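
For concreteness, a rough sketch of what that suggestion could look like. The type name, field set, and import paths are guesses, not the actual implementation (the PR only adds a TODO for this, see the author's reply below).

```go
package kvserver

import (
	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/spanset"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// serializedRequest is a hypothetical bundling of a batch with the latch
// spans that were declared (and are held) for it, so the two can no longer
// drift apart as they are threaded through evaluation.
type serializedRequest struct {
	*roachpb.BatchRequest // embedded, per the suggestion above

	// latches are the spans this request acquired latches over; evaluation
	// must stay within them (and within the timestamps they were declared at).
	latches *spanset.SpanSet
}
```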

@jordanlewis jordanlewis mentioned this pull request Mar 30, 2020
@nvanbenschoten nvanbenschoten left a comment
Member Author

TFTRs!

Are there any ops that this allows us to reenable in kvnemesis?

I don't think so. This was resolving a crash that was possible with transactional batches, not a silent error.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/kv/kvserver/replica_write.go, line 312 at r1 (raw file):

can we pls sprinkle some comments about this spans field everywhere? And can we rename it to latches?

Done.

Even better, let's combine it with ba in a struct (serializedRequest? or just request). It can embed a *roachpb.BatchRequest.

Added a TODO.

@nvanbenschoten
Member Author

bors r+

@craig
Contributor

craig bot commented Mar 30, 2020

Build failed (retrying...)

@otan
Contributor

otan commented Mar 30, 2020

is the test failure in ^ (https://teamcity.cockroachdb.com/viewLog.html?buildId=1837699&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_Test) related to this PR?

seems scary either way to happen in CI :o

@nvanbenschoten
Member Author

is the test failure in ^ (https://teamcity.cockroachdb.com/viewLog.html?buildId=1837699&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_Test) related to this PR?

No, that looks like #21444.

@craig
Contributor

craig bot commented Mar 30, 2020

Build succeeded

@craig craig bot merged commit b0f33f8 into cockroachdb:master Mar 30, 2020
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/refreshReadOnly branch March 30, 2020 19:22
@otan otan mentioned this pull request Mar 30, 2020