-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
client,kv: new savepoint API #43047
client,kv: new savepoint API #43047
Conversation
e3c7e56
to
098e5c9
Compare
67d22c7
to
f2e6f0f
Compare
Note from Peter: extend the new roachpb test to also assert compatibility with the earlier-introduced |
f2e6f0f
to
4600354
Compare
Please tell me if/when I should be looking at this. |
4600354
to
e196a32
Compare
Yes now is a good time to look at this. I'm expecting there may be a few more iterations on this until I finish polishing the followup #43051, but I'd appreciate your early feedback until then. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, but we still
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @nvanbenschoten)
pkg/internal/client/sender.go, line 278 at r1 (raw file):
} // SavepointToken represents a savepoint.
how about simply Savepoint
?
https://www.youtube.com/watch?v=PEgk2v6KntY
pkg/internal/client/txn.go, line 1185 at r1 (raw file):
// GetSavepoint establishes a savepoint. // This method is only valid when called on RootTxns. func (txn *Txn) GetSavepoint(ctx context.Context) (SavepointToken, error) {
nit: I'd call it CreateSavepoint()
. "Get" seems ambiguous.
pkg/internal/client/txn.go, line 1192 at r1 (raw file):
// RollbackToSavepoint rolls back to the given savepoint. The // savepoint must not have been rolled back or released already.
Why not allow rolling back to a savepoint multiple times? In fact, doesn't SQL actually need to be able to do this?
Or by "rolled back" here did you mean "released implicitly by releasing a parent savepoint"?
pkg/internal/client/txn.go, line 1207 at r1 (raw file):
// are also released and their token must not be used any more. // This method is only valid when called on RootTxns. func (txn *Txn) ReleaseSavepoint(ctx context.Context, s SavepointToken) error {
This doesn't do actually do anything, does it?
I thing defensive programming here is a good idea, but then I'd orient the interface this way - CheckSavepoint
or such.
pkg/kv/txn_coord_sender_savepoints.go, line 23 at r1 (raw file):
) type savepointToken struct {
pls comment
And I'd also name it simply savepoint
or txnSavepoint
.
pkg/kv/txn_coord_sender_savepoints.go, line 23 at r1 (raw file):
) type savepointToken struct {
Don't we also need the current size of txnSpanRefresher.refreshSpans
the list of tracked reads, such that upon rollback we tell the refresher interceptor to discard further reads?
And also something about in-flight writes (txnPipeliner.ifWrites
). There I guess we need to take some sort of snapshot of the current in-flight writes and, on rollback, discard in-flight writes that are not part of the savepoint. But, on rollback, I don't think we should (nor am I sure that we could) simply overwrite the set of in-flight writes with the ones from the savepoint because writes that have been verified since the snapshot has been taken should continue to be verified. Basically, on rollback I think we need to intersect the savepoint with the current set of in-flight writes.
pkg/kv/txn_coord_sender_savepoints.go, line 28 at r1 (raw file):
} var _ client.SavepointToken = (*savepointToken)(nil)
I think the idiomatic way is
var _ client.SavepointToken = &savepointToken{}
pkg/kv/txn_coord_sender_savepoints.go, line 30 at r1 (raw file):
var _ client.SavepointToken = (*savepointToken)(nil) func (s *savepointToken) SavepointToken() {}
say that this implements whatever
pkg/kv/txn_coord_sender_savepoints.go, line 40 at r1 (raw file):
tc.mu.Lock() defer tc.mu.Unlock() if tc.mu.txnState != txnPending || tc.mu.closed {
Looking at this tc.mu.closed
here and elsewhere bothers me. This field is supposed to be very narrow, tied to making cleanupTxnLocked()
idempotent.
tc.mu.txnState
seems to always be set to something before tc.cleanupTxnLocked()
is called, except when handling a TransactionAbortedError
where I guess we don't bother because the TCS is not supposed to be accessible to the txn after the error is returned. So I think it's fine to ignore that case (since GetSavepoint()
can't be called on that TCS any more) or we could set txnState
.
pkg/kv/txn_coord_sender_savepoints.go, line 45 at r1 (raw file):
} return &savepointToken{ seqNum: tc.interceptorAlloc.txnSeqNumAllocator.writeSeq,
and the epoch?
pkg/kv/txn_coord_sender_savepoints.go, line 50 at r1 (raw file):
// RollbackToSavepoint is part of the client.TxnSender interface. func (tc *TxnCoordSender) RollbackToSavepoint(ctx context.Context, s client.SavepointToken) error {
since we're all about defense, I'd also save a transaction ID in the savepoint and compare it to the current ID. This is to protect in particular for attempts to rollback after a TransactionAbortedError
.
pkg/kv/txn_coord_sender_savepoints.go, line 79 at r1 (raw file):
}) return nil
and the epoch from the savepoint? Did you mean to do something with it?
pkg/kv/txn_coord_sender_savepoints.go, line 91 at r1 (raw file):
defer tc.mu.Unlock() _, err := tc.checkSavepointLocked(false /*allowError*/, s)
What's the thinking behind passing allowError = false
here?
pkg/roachpb/ignored_seqnums.go, line 27 at r1 (raw file):
// also overlaps with every subsequent range in the list. // // 2) the new range's "end" seqnum is larger or equal to the "end"
doesn't 2) imply 1) ?
e196a32
to
391354c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed review!
See my updates below.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
pkg/internal/client/sender.go, line 278 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
how about simply
Savepoint
?
https://www.youtube.com/watch?v=PEgk2v6KntY
Meh. 'd prefer not.
pkg/internal/client/txn.go, line 1185 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: I'd call it
CreateSavepoint()
. "Get" seems ambiguous.
Good idea. Done.
pkg/internal/client/txn.go, line 1192 at r1 (raw file):
Why not allow rolling back to a savepoint multiple times?
Because it doesn't bring us anything. We should not offer more guarantees than strictly necessary for the goals. We can relax this later if deemed both useful and necessary.
In fact, doesn't SQL actually need to be able to do this?
No.
Or by "rolled back" here did you mean "released implicitly by releasing a parent savepoint"?
No that's not what I meant.
pkg/internal/client/txn.go, line 1207 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
This doesn't do actually do anything, does it?
I thing defensive programming here is a good idea, but then I'd orient the interface this way -CheckSavepoint
or such.
The client code to this API does not need to know that this doesn't do anything. Better have the API look and feel like a savepoint API in public docs, and keep the details of the internals to the TCS implementation.
pkg/kv/txn_coord_sender_savepoints.go, line 23 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
pls comment
And I'd also name it simplysavepoint
ortxnSavepoint
.
Done.
pkg/kv/txn_coord_sender_savepoints.go, line 23 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Don't we also need the current size of
txnSpanRefresher.refreshSpans
the list of tracked reads, such that upon rollback we tell the refresher interceptor to discard further reads?
And also something about in-flight writes (txnPipeliner.ifWrites
). There I guess we need to take some sort of snapshot of the current in-flight writes and, on rollback, discard in-flight writes that are not part of the savepoint. But, on rollback, I don't think we should (nor am I sure that we could) simply overwrite the set of in-flight writes with the ones from the savepoint because writes that have been verified since the snapshot has been taken should continue to be verified. Basically, on rollback I think we need to intersect the savepoint with the current set of in-flight writes.
I agree with all these points. Added a TODO with your sentences. I'll address this in a followup PR where we improve what can be rolled back. I am considering the current PR as a MVP to enable the SQL work. After this merges, we can make progress on the KV and SQL aspects independently from each other.
pkg/kv/txn_coord_sender_savepoints.go, line 28 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think the idiomatic way is
var _ client.SavepointToken = &savepointToken{}
That ship has sailed - both are idiomatic now. Mine also clarifies that no allocation needs to take place.
pkg/kv/txn_coord_sender_savepoints.go, line 30 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
say that this implements whatever
Done.
pkg/kv/txn_coord_sender_savepoints.go, line 40 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Looking at this
tc.mu.closed
here and elsewhere bothers me. This field is supposed to be very narrow, tied to makingcleanupTxnLocked()
idempotent.
tc.mu.txnState
seems to always be set to something beforetc.cleanupTxnLocked()
is called, except when handling aTransactionAbortedError
where I guess we don't bother because the TCS is not supposed to be accessible to the txn after the error is returned. So I think it's fine to ignore that case (sinceGetSavepoint()
can't be called on that TCS any more) or we could settxnState
.
Can you show me more precisely what the good condition could be?
pkg/kv/txn_coord_sender_savepoints.go, line 45 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
and the epoch?
Oops! Well spotted. Fixed and added the missing test.
pkg/kv/txn_coord_sender_savepoints.go, line 50 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
since we're all about defense, I'd also save a transaction ID in the savepoint and compare it to the current ID. This is to protect in particular for attempts to rollback after a
TransactionAbortedError
.
Good idea. Done.
pkg/kv/txn_coord_sender_savepoints.go, line 79 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
and the epoch from the savepoint? Did you mean to do something with it?
Not here. Added a TODO in the struct definition.
pkg/kv/txn_coord_sender_savepoints.go, line 91 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
What's the thinking behind passing
allowError = false
here?
- during a rollback, it's ok if the txn is currently in error (we'll want a rollback to "restore" a txn to a non-error state soon)
- during a release, it's not ok if the txn is currently in error - the SQL client expects the release to fail in that case
pkg/kv/txn_coord_sender_savepoints_test.go, line 71 at r2 (raw file):
case "abort": // TODO(knz): add code missing here: simulate an abort error
@andreimatei can you show me how to simulate a client.Txn processing a txn abort error. I'd like to add it here to exercise the code.
pkg/roachpb/ignored_seqnums.go, line 27 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
doesn't 2) imply 1) ?
No, the start value in the new range could also be greater. Added a clarifying example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @nvanbenschoten)
pkg/internal/client/txn.go, line 1192 at r1 (raw file):
In fact, doesn't SQL actually need to be able to do this?
No.
How come / how will it work? The Postgres docs on ROLLBACK TO SAVEPOINT
say:
"The savepoint remains valid and can be rolled back to again later, if needed."
https://www.postgresql.org/docs/current/sql-rollback-to.html
pkg/internal/client/txn.go, line 1207 at r1 (raw file):
Previously, knz (kena) wrote…
The client code to this API does not need to know that this doesn't do anything. Better have the API look and feel like a savepoint API in public docs, and keep the details of the internals to the TCS implementation.
Well but if you name it Release
, then I wonder if I'm supposed to always call this, or when do I have to and when don't I. If you keep it named like this, you should document it. I'd much rather not...
pkg/kv/txn_coord_sender_savepoints.go, line 23 at r1 (raw file):
Previously, knz (kena) wrote…
I agree with all these points. Added a TODO with your sentences. I'll address this in a followup PR where we improve what can be rolled back. I am considering the current PR as a MVP to enable the SQL work. After this merges, we can make progress on the KV and SQL aspects independently from each other.
k
pkg/kv/txn_coord_sender_savepoints.go, line 40 at r1 (raw file):
Previously, knz (kena) wrote…
Can you show me more precisely what the good condition could be?
just tc.mu.txnState != txnPending
pkg/kv/txn_coord_sender_savepoints.go, line 91 at r1 (raw file):
during a rollback, it's ok if the txn is currently in error (we'll want a rollback to "restore" a txn to a non-error state soon)
But at least for now RollbackToSavepoint()
returns an error immediately after the checkSavepointLocked
if the txn is in the error state...
during a release, it's not ok if the txn is currently in error - the SQL client expects the release to fail in that case
But returning that error would presumably be generated in the connExecutor
, where most statements are rejected when it (the SQL layer) is not in the "open" state. I don't think the TCS should worry about generating an error in this case. I think ReleaseSavepoint
should stick to checking whether the savepoint has been released before.
I really think this allowError
argument makes checkSavepointLocked
a confusing function. (As with all bool arguments), if the logic can reasonably be lifted to the callers, it should be lifted to the callers. And in this case it definitely can. And then given what I said above, I don't think either of the two callers even need the respective logic.
pkg/kv/txn_coord_sender_savepoints.go, line 145 at r2 (raw file):
if (tc.mu.txnState != txnPending && (!allowError || tc.mu.txnState != txnError)) || tc.mu.closed {
ditto about getting rid of tc.mu.closed
pkg/kv/txn_coord_sender_savepoints_test.go, line 30 at r2 (raw file):
datadriven.Walk(t, "testdata/savepoints", func(t *testing.T, path string) { // New database for each test file. s := createTestDB(t)
Unless there's a good reason, please do me a favor and don't use this createTestDB()
which creates some ancient LocalTestCluster
. Use a TestServer
instead like it's 2016.
pkg/kv/txn_coord_sender_savepoints_test.go, line 71 at r2 (raw file):
Previously, knz (kena) wrote…
@andreimatei can you show me how to simulate a client.Txn processing a txn abort error. I'd like to add it here to exercise the code.
Well, I guess you can hack it similarly to how GenerateForcedRetryableError()
is implemented - it calls a ManualRestart
method on the TCS that simulates the TCS having received an error. Or, you can have the transaction send a request and inject a TransactionAbortedError
in the evaluation of that request using StoreTestingKnobs.TestingRequestFilter
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @nvanbenschoten)
pkg/internal/client/txn.go, line 1192 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
In fact, doesn't SQL actually need to be able to do this?
No.
How come / how will it work? The Postgres docs on
ROLLBACK TO SAVEPOINT
say:
"The savepoint remains valid and can be rolled back to again later, if needed."
https://www.postgresql.org/docs/current/sql-rollback-to.html
woah TIL. I had misunderstood the semantics.
I'll rework the code and ping you.
pkg/internal/client/txn.go, line 1207 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Well but if you name it
Release
, then I wonder if I'm supposed to always call this, or when do I have to and when don't I. If you keep it named like this, you should document it. I'd much rather not...
Good question. I'll mull over this and decide and let you know.
pkg/kv/txn_coord_sender_savepoints.go, line 40 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
just
tc.mu.txnState != txnPending
thx will do
pkg/kv/txn_coord_sender_savepoints.go, line 91 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
during a rollback, it's ok if the txn is currently in error (we'll want a rollback to "restore" a txn to a non-error state soon)
But at least for now
RollbackToSavepoint()
returns an error immediately after thecheckSavepointLocked
if the txn is in the error state...during a release, it's not ok if the txn is currently in error - the SQL client expects the release to fail in that case
But returning that error would presumably be generated in the
connExecutor
, where most statements are rejected when it (the SQL layer) is not in the "open" state. I don't think the TCS should worry about generating an error in this case. I thinkReleaseSavepoint
should stick to checking whether the savepoint has been released before.I really think this
allowError
argument makescheckSavepointLocked
a confusing function. (As with all bool arguments), if the logic can reasonably be lifted to the callers, it should be lifted to the callers. And in this case it definitely can. And then given what I said above, I don't think either of the two callers even need the respective logic.
now I understand your argument. I'll fix it.
pkg/kv/txn_coord_sender_savepoints_test.go, line 71 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Well, I guess you can hack it similarly to how
GenerateForcedRetryableError()
is implemented - it calls aManualRestart
method on the TCS that simulates the TCS having received an error. Or, you can have the transaction send a request and inject aTransactionAbortedError
in the evaluation of that request usingStoreTestingKnobs.TestingRequestFilter
.
so yeah sorry I don't understand how to do that nor how does it work. Can you point me to another test that does this so I can follow an example?
391354c
to
6d38f3b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback so far. RFAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
pkg/internal/client/txn.go, line 1192 at r1 (raw file):
Previously, knz (kena) wrote…
woah TIL. I had misunderstood the semantics.
I'll rework the code and ping you.
Done.
pkg/internal/client/txn.go, line 1207 at r1 (raw file):
Previously, knz (kena) wrote…
Good question. I'll mull over this and decide and let you know.
I stand by the semantics but enhanced the comment to explain.
pkg/kv/txn_coord_sender_savepoints.go, line 40 at r1 (raw file):
Previously, knz (kena) wrote…
thx will do
Done.
pkg/kv/txn_coord_sender_savepoints.go, line 91 at r1 (raw file):
Previously, knz (kena) wrote…
now I understand your argument. I'll fix it.
Done.
pkg/kv/txn_coord_sender_savepoints.go, line 145 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
ditto about getting rid of
tc.mu.closed
Done.
pkg/kv/txn_coord_sender_savepoints_test.go, line 30 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Unless there's a good reason, please do me a favor and don't use this
createTestDB()
which creates some ancientLocalTestCluster
. Use aTestServer
instead like it's 2016.
Done.
pkg/kv/txn_coord_sender_savepoints_test.go, line 71 at r2 (raw file):
Previously, knz (kena) wrote…
so yeah sorry I don't understand how to do that nor how does it work. Can you point me to another test that does this so I can follow an example?
I found how to do it. PTAL.
c369c98
to
fd8fdba
Compare
friendly ping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
looks good
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @nvanbenschoten)
pkg/kv/txn_coord_sender_savepoints_test.go, line 40 at r3 (raw file):
// TxnCoordSender, from storage. params := base.TestServerArgs{} doAbort := false
I think you need to access this with sync.atomic
, otherwise the updates will race with random other background requests. And I think you need to find a way to make the aborting targeted to the test's txn because again you don't want to abort random background requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @nvanbenschoten)
pkg/kv/txn_coord_sender_savepoints_test.go, line 40 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think you need to access this with
sync.atomic
, otherwise the updates will race with random other background requests. And I think you need to find a way to make the aborting targeted to the test's txn because again you don't want to abort random background requests.
more targeted by filter for the txn id in req.Hdr.Txn
This patch introduces the KV savepoint API as discussed in the savepoints RFC: ```go // CreateSavepoint establishes a savepoint. // This method is only valid when called on RootTxns. CreateSavepoint(context.Context) (SavepointToken, error) // RollbackToSavepoint rolls back to the given savepoint. The // savepoint must not have been rolled back or released already. // All savepoints "under" the savepoint being rolled back // are also rolled back and their token must not be used any more. // This method is only valid when called on RootTxns. RollbackToSavepoint(context.Context, SavepointToken) error // ReleaseSavepoint releases the given savepoint. The savepoint // must not have been rolled back or released already. // All savepoints "under" the savepoint being released // are also released and their token must not be used any more. // This method is only valid when called on RootTxns. ReleaseSavepoint(context.Context, SavepointToken) error ``` Release note: None
fd8fdba
to
00e69da
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/txn_coord_sender_savepoints_test.go, line 40 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
more targeted by filter for the txn id in
req.Hdr.Txn
Well spotted. Done.
TFYR! bors r=andreimatei |
43047: client,kv: new savepoint API r=andreimatei a=knz This patch introduces the KV savepoint API as discussed in the savepoints RFC: ```go // CreateSavepoint establishes a savepoint. // This method is only valid when called on RootTxns. CreateSavepoint(context.Context) (SavepointToken, error) // RollbackToSavepoint rolls back to the given savepoint. The // savepoint must not have been rolled back or released already. // All savepoints "under" the savepoint being rolled back // are also rolled back and their token must not be used any more. // This method is only valid when called on RootTxns. RollbackToSavepoint(context.Context, SavepointToken) error // ReleaseSavepoint releases the given savepoint. The savepoint // must not have been rolled back or released already. // All savepoints "under" the savepoint being released // are also released and their token must not be used any more. // This method is only valid when called on RootTxns. ReleaseSavepoint(context.Context, SavepointToken) error ``` Ths initial implementation does not (yet) enable clients to roll back over errors. Release note: None 44626: workload/tpch: unskip query 12 r=yuzefovich a=yuzefovich Previously, query 12 was skipped because the results returned by lib/pq didn't match the expected output. The issue is that the driver returned []byte for decimals whereas we expected a string value and `fmt.Sprint` was giving us the ASCII codes of the digits instead of printing out the number. This is fixed by checking whether the returned value is []byte, and if so, converting it to string directly. Release note: None 44633: sem/tree: fix LIKE ESCAPE when the pattern contains Unicode symbols r=yuzefovich a=yuzefovich Previously, we were incorrectly updating the pattern when the current character was Unicode symbol that had the width of more than a single byte. Fixes: #44621. Release note (bug fix): Previously, running a query with LIKE operator using custom ESCAPE symbol when the pattern contained Unicode characters could result in an internal error in CockroachDB, and now this has been fixed. Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Build succeeded |
This patch introduces the KV savepoint API as discussed in the
savepoints RFC:
Ths initial implementation does not (yet) enable clients to roll back over errors.
Release note: None