-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv/kvserver: don't bump ReadTimestamp on EndTxn batch with bounded read latches #46625
kv/kvserver: don't bump ReadTimestamp on EndTxn batch with bounded read latches #46625
Conversation
5541c00
to
85fa1b0
Compare
Cool! Are there any ops that this allows us to reenable in kvnemesis? cockroach/pkg/kv/kvnemesis/generator.go Lines 173 to 189 in 4bce96a
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvserver/replica_write.go, line 312 at r1 (raw file):
// retryable error. func (r *Replica) evaluateWriteBatch( ctx context.Context, idKey storagebase.CmdIDKey, ba *roachpb.BatchRequest, spans *spanset.SpanSet,
can we pls sprinkle some comments about this spans
field everywhere? And can we rename it to latches
?
Even better, let's combine it with ba
in a struct (serializedRequest
? or just request
). It can embed a *roachpb.BatchRequest
).
…ad latches This commit prevents an error-prone situation where requests with an EndTxn request that acquire read-latches at bounded timestamps were allowed to move their ReadTimestamp to their WriteTimestamp without re-acquiring new latches. This could allow the request to read at an unprotected timestamp and forgo proper synchronization with writes with respect to its lock-table scan and timestamp cache update. This relates to 11bffb2, which made a similar fix to server-side refreshes, but not to this new (as of this release) pre-emptive ReadTimestamp update mechanism. This bug was triggering the following assertion when stressing `TestKVNemesisSingleNode`: ``` F200326 15:56:11.350743 1199 kv/kvserver/concurrency/concurrency_manager.go:261 [n1,s1,r33/1:/{Table/50/"60…-Max}] caller violated contract: discovered non-conflicting lock ``` The batch that hit this issue looked like: ``` Get [/Table/50/"a1036fd0",/Min), Get [/Table/50/"e54a51a8",/Min), QueryIntent [/Table/50/"aafccb40",/Min), QueryIntent [/Table/50/"be7f37c9",/Min), EndTxn(commit:true tsflex:true) ``` This commit adds two subtests to `TestTxnCoordSenderRetries` that create this scenario. Release note (bug fix): A rare bug causing an assertion failure was fixed. The assertion error message was "caller violated contract: discovered non-conflicting lock". Release justification: fixes a series bug that could crash a server. Additionally, the bug could have theoretically allowed isolation violations between transactions without hitting the assertion, though this was never observed in practice.
85fa1b0
to
add5e2d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs!
Are there any ops that this allows us to reenable in kvnemesis?
I don't think so. This was resolving a crash that was possible with transactional batches, not a silent error.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
pkg/kv/kvserver/replica_write.go, line 312 at r1 (raw file):
can we pls sprinkle some comments about this spans field everywhere? And can we rename it to latches?
Done.
Even better, let's combine it with ba in a struct (serializedRequest? or just request). It can embed a *roachpb.BatchRequest).
Added a TODO.
bors r+ |
Build failed (retrying...) |
is the test failure in ^ (https://teamcity.cockroachdb.com/viewLog.html?buildId=1837699&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_Test) related to this PR? seems scary either way to happen in CI :o |
No, that looks like #21444. |
Build succeeded |
This commit prevents an error-prone situation where requests with an
EndTxn request that acquire read-latches at bounded timestamps were
allowed to move their ReadTimestamp to their WriteTimestamp without
re-acquiring new latches. This could allow the request to read at an
unprotected timestamp and forgo proper synchronization with writes with
respect to its lock-table scan and timestamp cache update.
This relates to 11bffb2, which made a similar fix to server-side
refreshes, but not to this new (as of this release) pre-emptive
ReadTimestamp update mechanism.
This bug was triggering the following assertion when stressing
TestKVNemesisSingleNode
:The batch that hit this issue looked like:
This commit adds two subtests to
TestTxnCoordSenderRetries
thatcreate this scenario.
Release note (bug fix): A rare bug causing an assertion failure was
fixed. The assertion error message was "caller violated contract:
discovered non-conflicting lock".
Release justification: fixes a series bug that could crash a server.
Additionally, the bug could have theoretically allowed isolation
violations between transactions without hitting the assertion, though
this was never observed in practice.
cc. @danhhz