Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coordinator batching #2406

Merged
merged 18 commits into from
Sep 15, 2015
Merged

coordinator batching #2406

merged 18 commits into from
Sep 15, 2015

Conversation

tbg
Copy link
Member

@tbg tbg commented Sep 8, 2015

Glued everything together, rebased and fixed bugs.
All tests pass, but there is still some stabilizing to do (kv takes long/timeouts sometimes, likely an intent/push issue).

The big base commit is #2141 after rebase and resulting fixes, so it's mostly been "reviewed". Everything else is new.

Still some work to do, but mostly cleanup. @spencerkimball you probably want to take a look already.

} else if c.Post != nil {
err = c.Post()
resetClientCmdID(c.Args)
db.sender.Send(context.TODO(), c)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this code path call db.sender.Send() (and resetClientCmdID) instead of db.send() as on line 477?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one is the real send call. The other one is calling itself after having wrapped in a batch.

@tbg
Copy link
Member Author

tbg commented Sep 9, 2015

This is starting to look good. I've worked more towards code cleanup and filling in some gaps. I want to completely understand why (mostly) the kv tests take longer (almost certainly has to do with pushes and intents) to make sure there's nothing sketchy going on.

Please review. It's unfortunately a huge change, sorry about that. The big base commit has been "observed" by a few of you and everything on top should be ok to review.

@tbg tbg changed the title [WIP] coordinator batching coordinator batching Sep 9, 2015
@tbg tbg force-pushed the gateway_batch branch 4 times, most recently from 9f2b739 to e3c29ed Compare September 12, 2015 09:08
@@ -118,22 +131,21 @@ func TestTxnRequestTxnTimestamp(t *testing.T) {
txn := NewTxn(*db)

for testIdx = range testCases {
txn.db.sender.Send(context.Background(),
proto.Call{Args: testPutReq, Reply: &proto.PutResponse{}})
_ = sendCall(txn.db.sender, proto.Call{Args: testPutReq, Reply: &proto.PutResponse{}})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this ignore the returned error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a check.

@bdarnell
Copy link
Contributor

LGTM. I didn't review it too thoroughly on this pass, but let's get it checked in (once that circleci failure is resolved) so we can move forward with the TODOs in new PRs.

@tbg tbg force-pushed the gateway_batch branch 2 times, most recently from 150e664 to ff6ad09 Compare September 15, 2015 18:01
@tbg
Copy link
Member Author

tbg commented Sep 15, 2015

Ok. I'm worried about breaking other's builds (there have been more spurious test failures than I'm comfortable with). I added some commits which hopefully deal with it. Will rebuild this a bunch of times and if it's reliable we should be ok. Test times in kv (and in other places) are still way higher though:

TestEvictCacheOnError gateway_batch 0.55s
TestEvictCacheOnError master 0.55s
TestKVDBInternalMethods master 1.09s
TestRangeSplitsWithConcurrentTxns gateway_batch 0.20s
TestRangeSplitsWithConcurrentTxns master 0.14s
TestRangeSplitsWithWritePressure master 0.17s
TestRetryOnDescriptorLookupError gateway_batch 0.25s
TestRetryOnDescriptorLookupError master 0.25s
TestRetryOnWrongReplicaError gateway_batch 0.24s
TestTxnDBInconsistentAnalysisAnomaly gateway_batch 1.41s
TestTxnDBInconsistentAnalysisAnomaly master 0.92s
TestTxnDBLostUpdateAnomaly gateway_batch 0.65s
TestTxnDBLostUpdateAnomaly master 0.41s
TestTxnDBPhantomDeleteAnomaly gateway_batch 0.58s
TestTxnDBPhantomDeleteAnomaly master 0.41s
TestTxnDBPhantomReadAnomaly gateway_batch 0.75s
TestTxnDBPhantomReadAnomaly master 0.42s
TestTxnDBUncertainty gateway_batch 0.18s
TestTxnDBWriteSkewAnomaly gateway_batch 1.84s
TestTxnDBWriteSkewAnomaly master 1.14s

It's not clear to me whether this is expected but it gets harder to investigate once we land this (the anomaly tests are unfortunately not fun to debug, and prolonged test durations I've eliminated so far were due to push/intent issues).

@tbg
Copy link
Member Author

tbg commented Sep 15, 2015

Ok, doesn't seem like the tests actually take longer, at least I'm seeing kv and others take about as long on master. I've built ~8x and got a green build on all of them (there was one obscure data race once, but it was related to debug logging and was missing a stack trace, so it was hard to debug - looked like printing an error that was also active in some raft goroutine, so not really tied to the change). So I'm going to merge this in a bit.

see cockroachdb#2130, cockroachdb#1998 for context.

* change `(*DistSender).Send` to wrap all requests in a single-
  element Batch (TxnCoordSender unwrapped any `Batch` previously).
* `(*Store).ExecuteCmd` unwraps this single-element `Batch`.
* verifyPermissions: check only the inside of BatchRequest.
* introduce `Reverse` flag on `BatchRequest` to use instead of checking for
  `ReverseScanRequest`. Goal: whole `Batch` is either `Reverse` or not, and
  any range-op in it is simply going to be using the reverse order. Not being
  able to mix reverse/non-reverse requests stems from the change in descriptor
  lookup logic.
* implemented `proto.Combinable` for `BatchRequest`; changed it to return an
`error`.
* implement logic to deal with `proto.Countable` and `proto.Bounded` in `Send`.
* batch expansion at (*Store.ExecuteCmd)
  updated ExecuteCmd so that it properly unrolls a Batch.
  Note that it has to do hacky things with `Key` and `EndKey`
  to make sure they are truncated appropriately if the `BatchRequest`
  is (after all, that's all `DistSender` adapts for cross-range ops).
  Other than that, we mostly propagate the whole header from the Batch
  and the Txn (if any) from each response to the next request.

  `TxnCoordSender` still unrolls Batches and `(*DistSender).Send` still
  wraps each individual request in a new Batch.

  Some store tests use batches "properly", and so does `resolveWriteIntentError`,
  so the new functionality is actually tested and it took a bit to get all
  tests to pass again.

  Note also that the first simplification was made: the test sender in ./storage
  lost its fake batch handling.
* complete Is{Read,Write,...} for BatchRequest
* add batch-compatible abstractions
  checking for EndTransaction and intents uses a (likely temp)
  helper that produces correct results for `BatchRequest`.
* proper truncation of batch at DistSender
  this change removes the provisional truncation code in
  `(*Store).ExecuteCmd` in favor of cleaned up code in
  `(*DistSender).Send/sendAttempt`, which has also seen
  some refactoring and many new TODOs.
* workaround for non-atomicity of batch on Store
  a retry half-way through a batch may leave an earlier CPut
  fail upon retry because it reads the intent it previously
  wrote.
* refactor range descriptor cache
  keep the recursion in LookupRange and adapt the
* retryableLocalSender -> DistSender
  remove retryableLocalSender and plug in a
  DistSender which uses a LocalSender as its
  rangeDescriptorDB.

  KV tests still fail occasionally, but this
  is likely due to lack of non-atomicity on
  Store.
  Now after rebase, they fail but I'm sure
  I can fix that.
  ./kv and ./storage pass with a few tests disabled (which are expected to fail)
  ./sql passes in 17s, have to check why
* {client->batch}.Sender
* refactor DistSender and TxnCoordSender
* send only batches from client
* don't send replies on errors (at least in {TxnCoord,Dist}Sender)
  we're now pretty close to req->(resp,err) everywhere.
* re-enable most storage tests
  changed multiTestContext and some other test contexts used
  to use a DistSender attached to a LocalSender (instead of
  previously only LocalSender). This allows re-enabling all
  of the tests which had to be disabled because they would
  involve batches which required multi-range logic not
  available without a DistSender inbetween.
* fix server.TestIntentResolution
* expand kv.TestTruncate, refactor truncate() slightly
* re-add auto txn wrap for cross-range reqs
* CmdID generation in ChunkingSender
* various test fixes
* fix RangeLookup when in reverse mode looking up EndKey of range with intent
* tweak RangeLookup logic so that empty results are never returned. Panic if it
  does happen.
* client workaround for request errors
  the client code wants to know which request failed, but
  with the (response, error) pattern no reply is returned
  on error, so no reply header can be used to find out the
  location of the error.
  this needs further fixing, see cockroachdb#1891.
* deactivate some assertions pending cockroachdb#2300
* disabled TestTree
* feedback and shape-up for merge
  attempt to remove issues slowing down tests.
  those likely arose since sending of batches
  is more sensitive to intent errors (since
  a single "command" now covers more keys)
* descNext -> bool in DistSender
* fixed various comments/TODOs
* fix rds<->kvs correspondence in RangeLookup when
  scanning extra descriptor in Reverse mode
* disable txn auto gc'ing (b/c store atomicity missing)
* make trace.Finalize() panic-aware (prints active trace to stdout, re-panics).
* carry an error over in the temporary batch unwrapping logic
* clarify the ConditionalPut hack
* disable two tests temporarily
and remove the provisional logic in store.ExecuteCmd which
unrolled batches.

misc renaming
which in turn fixes TestSingleKey (though it's a bit scary that's
the only test that caught it). The txn timestamp wasn't properly
updated from the response timestamp (so that reads/writes would
be pushed, but the Txn not updated, hence committing successfully).
when returning descriptors from intents, it was not
properly checked whether the intent had been obtained
through a forward scan during a reverse lookup, in which
case some scenarios demand that it be discarded.

removed spurious uses of KeyAddress and refactored some
of DistSender for better legibility.
moved some parts of it to kv. KeyRange is now keys.Range,
Short() is now String() on *BatchRequest in proto (changed so
that it doesn't import 'keys' any more), the rest mostly went
to 'client'.
Did a bunch of gorename (to unexport a few things). No change
of semantics in this commit.
it is not supposed to be mutable, this is a first step towards that
(of course some data contained within is still shared, such as
requests and the Txn).
requests which are saturated are now masked out via
NoopRequest, stopping once all requests are or no
more ranges are left to query.
this is relevant especially when resolving on EndTransaction:
only resolve **after** the new txn entry is persisted.
previously, we had special handling for PushRequest when it was in service of a
range lookup. This commit removes this logic in favor of the following: We
ignore intents but when encountering an addressing-related restart, ask for
them to be returned randomly.
tbg added a commit that referenced this pull request Sep 15, 2015
@tbg tbg merged commit 0ca14e2 into cockroachdb:master Sep 15, 2015
@tbg tbg deleted the gateway_batch branch September 15, 2015 22:14
@jess-edwards jess-edwards mentioned this pull request Sep 23, 2015
78 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants