-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
coordinator batching #2406
coordinator batching #2406
Conversation
aef6fd2
to
0e4a857
Compare
} else if c.Post != nil { | ||
err = c.Post() | ||
resetClientCmdID(c.Args) | ||
db.sender.Send(context.TODO(), c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this code path call db.sender.Send()
(and resetClientCmdID
) instead of db.send()
as on line 477?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one is the real send call. The other one is calling itself after having wrapped in a batch.
This is starting to look good. I've worked more towards code cleanup and filling in some gaps. I want to completely understand why (mostly) the Please review. It's unfortunately a huge change, sorry about that. The big base commit has been "observed" by a few of you and everything on top should be ok to review. |
9f2b739
to
e3c29ed
Compare
@@ -118,22 +131,21 @@ func TestTxnRequestTxnTimestamp(t *testing.T) { | |||
txn := NewTxn(*db) | |||
|
|||
for testIdx = range testCases { | |||
txn.db.sender.Send(context.Background(), | |||
proto.Call{Args: testPutReq, Reply: &proto.PutResponse{}}) | |||
_ = sendCall(txn.db.sender, proto.Call{Args: testPutReq, Reply: &proto.PutResponse{}}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this ignore the returned error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a check.
LGTM. I didn't review it too thoroughly on this pass, but let's get it checked in (once that circleci failure is resolved) so we can move forward with the TODOs in new PRs. |
150e664
to
ff6ad09
Compare
Ok. I'm worried about breaking other's builds (there have been more spurious test failures than I'm comfortable with). I added some commits which hopefully deal with it. Will rebuild this a bunch of times and if it's reliable we should be ok. Test times in
It's not clear to me whether this is expected but it gets harder to investigate once we land this (the anomaly tests are unfortunately not fun to debug, and prolonged test durations I've eliminated so far were due to push/intent issues). |
Ok, doesn't seem like the tests actually take longer, at least I'm seeing |
see cockroachdb#2130, cockroachdb#1998 for context. * change `(*DistSender).Send` to wrap all requests in a single- element Batch (TxnCoordSender unwrapped any `Batch` previously). * `(*Store).ExecuteCmd` unwraps this single-element `Batch`. * verifyPermissions: check only the inside of BatchRequest. * introduce `Reverse` flag on `BatchRequest` to use instead of checking for `ReverseScanRequest`. Goal: whole `Batch` is either `Reverse` or not, and any range-op in it is simply going to be using the reverse order. Not being able to mix reverse/non-reverse requests stems from the change in descriptor lookup logic. * implemented `proto.Combinable` for `BatchRequest`; changed it to return an `error`. * implement logic to deal with `proto.Countable` and `proto.Bounded` in `Send`. * batch expansion at (*Store.ExecuteCmd) updated ExecuteCmd so that it properly unrolls a Batch. Note that it has to do hacky things with `Key` and `EndKey` to make sure they are truncated appropriately if the `BatchRequest` is (after all, that's all `DistSender` adapts for cross-range ops). Other than that, we mostly propagate the whole header from the Batch and the Txn (if any) from each response to the next request. `TxnCoordSender` still unrolls Batches and `(*DistSender).Send` still wraps each individual request in a new Batch. Some store tests use batches "properly", and so does `resolveWriteIntentError`, so the new functionality is actually tested and it took a bit to get all tests to pass again. Note also that the first simplification was made: the test sender in ./storage lost its fake batch handling. * complete Is{Read,Write,...} for BatchRequest * add batch-compatible abstractions checking for EndTransaction and intents uses a (likely temp) helper that produces correct results for `BatchRequest`. * proper truncation of batch at DistSender this change removes the provisional truncation code in `(*Store).ExecuteCmd` in favor of cleaned up code in `(*DistSender).Send/sendAttempt`, which has also seen some refactoring and many new TODOs. * workaround for non-atomicity of batch on Store a retry half-way through a batch may leave an earlier CPut fail upon retry because it reads the intent it previously wrote. * refactor range descriptor cache keep the recursion in LookupRange and adapt the * retryableLocalSender -> DistSender remove retryableLocalSender and plug in a DistSender which uses a LocalSender as its rangeDescriptorDB. KV tests still fail occasionally, but this is likely due to lack of non-atomicity on Store. Now after rebase, they fail but I'm sure I can fix that. ./kv and ./storage pass with a few tests disabled (which are expected to fail) ./sql passes in 17s, have to check why * {client->batch}.Sender * refactor DistSender and TxnCoordSender * send only batches from client * don't send replies on errors (at least in {TxnCoord,Dist}Sender) we're now pretty close to req->(resp,err) everywhere. * re-enable most storage tests changed multiTestContext and some other test contexts used to use a DistSender attached to a LocalSender (instead of previously only LocalSender). This allows re-enabling all of the tests which had to be disabled because they would involve batches which required multi-range logic not available without a DistSender inbetween. * fix server.TestIntentResolution * expand kv.TestTruncate, refactor truncate() slightly * re-add auto txn wrap for cross-range reqs * CmdID generation in ChunkingSender * various test fixes * fix RangeLookup when in reverse mode looking up EndKey of range with intent * tweak RangeLookup logic so that empty results are never returned. Panic if it does happen. * client workaround for request errors the client code wants to know which request failed, but with the (response, error) pattern no reply is returned on error, so no reply header can be used to find out the location of the error. this needs further fixing, see cockroachdb#1891. * deactivate some assertions pending cockroachdb#2300 * disabled TestTree * feedback and shape-up for merge attempt to remove issues slowing down tests. those likely arose since sending of batches is more sensitive to intent errors (since a single "command" now covers more keys) * descNext -> bool in DistSender * fixed various comments/TODOs * fix rds<->kvs correspondence in RangeLookup when scanning extra descriptor in Reverse mode * disable txn auto gc'ing (b/c store atomicity missing) * make trace.Finalize() panic-aware (prints active trace to stdout, re-panics). * carry an error over in the temporary batch unwrapping logic * clarify the ConditionalPut hack * disable two tests temporarily
and remove the provisional logic in store.ExecuteCmd which unrolled batches. misc renaming
which in turn fixes TestSingleKey (though it's a bit scary that's the only test that caught it). The txn timestamp wasn't properly updated from the response timestamp (so that reads/writes would be pushed, but the Txn not updated, hence committing successfully).
when returning descriptors from intents, it was not properly checked whether the intent had been obtained through a forward scan during a reverse lookup, in which case some scenarios demand that it be discarded. removed spurious uses of KeyAddress and refactored some of DistSender for better legibility.
moved some parts of it to kv. KeyRange is now keys.Range, Short() is now String() on *BatchRequest in proto (changed so that it doesn't import 'keys' any more), the rest mostly went to 'client'. Did a bunch of gorename (to unexport a few things). No change of semantics in this commit.
it is not supposed to be mutable, this is a first step towards that (of course some data contained within is still shared, such as requests and the Txn).
requests which are saturated are now masked out via NoopRequest, stopping once all requests are or no more ranges are left to query.
this is relevant especially when resolving on EndTransaction: only resolve **after** the new txn entry is persisted.
previously, we had special handling for PushRequest when it was in service of a range lookup. This commit removes this logic in favor of the following: We ignore intents but when encountering an addressing-related restart, ask for them to be returned randomly.
Glued everything together, rebased and fixed bugs.
All tests pass, but there is still some stabilizing to do (
kv
takes long/timeouts sometimes, likely an intent/push issue).The big base commit is #2141 after rebase and resulting fixes, so it's mostly been "reviewed". Everything else is new.
Still some work to do, but mostly cleanup. @spencerkimball you probably want to take a look already.