
stability: Rebalances must be atomic #12768

Closed
bdarnell opened this issue Jan 8, 2017 · 31 comments
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-1-blocking-adoption

Comments

@bdarnell
Contributor

bdarnell commented Jan 8, 2017

Consider a deployment with three datacenters (and a replication configuration that places one replica of each range in each DC). During a rebalance, a range that normally requires a quorum of 2/3 will briefly change to a configuration with a 3/4 quorum and two replicas in one DC. If that DC (or the network) fails before the fourth replica is removed, the range becomes unavailable. (If rebalancing downreplicated before upreplicating, the intermediate state would instead have a quorum of 2/2, and a failure of either datacenter would lead to unavailability and much harder recovery.) While the window is fairly small, there is a risk that an ill-timed DC failure could make one or more ranges unavailable for the duration.

The most general solution to this problem is to implement "joint consensus", the original raft membership change proposal (it is described in the original raft paper and in section 4.3 of Diego's dissertation). Etcd implements a much simpler membership change protocol which has the limitation that only one change can be made at a time (leading to the brief exposure of the intermediate 3/4 state). Joint consensus would let us make these rebalancing changes atomically.

We might also be able to mitigate the problem by using ideas from Flexible Paxos. Flexible Paxos shows that committing entries can be done with a bare majority of nodes, so as long as the leader is not in the failed datacenter the removal of the fourth node can be completed while the DC is down and the range is restored to its 2/3 state. However, if the leader were in the failed DC then the two surviving ones would be unable to make progress since they would have to assume that the former leader is still making progress on the other side of its partition. I'm not sure if there's a full solution here or not.

Running with a higher replication factor (5 replicas in 3 DCs) could also mitigate the problem if the rebalancer were aware of it (so when the range goes from 3/5 to 4/6, those six replicas are guaranteed to be two per DC). This might be a quick fix to prevent this problem from striking a critical system range. Running with more DCs than replicas (3 replicas in 4 DCs, so you never need to place two replicas in the same DC) also avoids this problem without increasing storage costs, but it has the significant downside that all rebalancing must happen over the more expensive inter-DC links.
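
For intuition, here is a toy sketch (in Go, not CockroachDB code) of the joint-consensus commit rule from section 4.3 of the dissertation: while the range is in the joint configuration, an entry commits only once a majority of the old member set and a majority of the new member set have acknowledged it, so an atomic swap never exposes a plain 3/4 or 2/2 quorum.

```go
// Toy illustration (not CockroachDB code) of the joint-consensus commit rule:
// during the transition, an entry is committed only if it is acknowledged by
// a majority of the old member set AND a majority of the new member set.
package main

import "fmt"

func majorityAcked(members []string, acked map[string]bool) bool {
	n := 0
	for _, m := range members {
		if acked[m] {
			n++
		}
	}
	return n > len(members)/2
}

// jointCommitted reports whether an entry is committed while the group is in
// the joint configuration (oldCfg, newCfg).
func jointCommitted(oldCfg, newCfg []string, acked map[string]bool) bool {
	return majorityAcked(oldCfg, acked) && majorityAcked(newCfg, acked)
}

func main() {
	// Swapping the DC3 replica: {1A, 2B, 3C} -> {1A, 2B, 3D}.
	oldCfg := []string{"1A", "2B", "3C"}
	newCfg := []string{"1A", "2B", "3D"}

	// DC3 fails mid-change: only 1A and 2B respond. Both majorities are still
	// satisfied (2 of 3 in the old config and 2 of 3 in the new one), so the
	// range stays available, unlike the 3/4 quorum of the non-atomic
	// intermediate state.
	acked := map[string]bool{"1A": true, "2B": true}
	fmt.Println(jointCommitted(oldCfg, newCfg, acked)) // true
}
```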

@rjnn
Contributor

rjnn commented Jan 9, 2017

If we're using flexible paxos, we can make it work as long as we first ensure that the leader is not in the datacenter where the rebalance is taking place. If we denote datacenters by numbers and replicas by letters, then to go from {1A, 2B, 3C} to {1A, 2B, 3D}, we just need to ensure that A or B is the leader before upreplicating to {1A, 2B, 3C, 3D}. Now if we have a datacenter failure at 1 or 2, we have enough replicas (three) to elect a leader. If we have a datacenter failure at 3, we still have our leader so we can upreplicate again to a 5-replica range.

@bdarnell
Contributor Author

bdarnell commented Jan 9, 2017

Yeah, we'd need to ensure both that A or B is the leader and that C cannot become leader while the upreplication is in flight.

There are also performance considerations here (which we do not currently take into account): the new replica gets its snapshot from the leader, so for performance we'd want to have 3C be the leader when 3D is added. (or we introduce some way to send preemptive snapshots from a follower)

@petermattis petermattis added this to the Later milestone Feb 23, 2017
@bdarnell bdarnell modified the milestones: Later, 2.1 Apr 26, 2018
@bdarnell bdarnell added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. labels Apr 26, 2018
@bdarnell bdarnell self-assigned this Apr 26, 2018
@tbg tbg modified the milestones: 2.1, 2.2 Jul 19, 2018
@bdarnell bdarnell changed the title stability: Datacenter failures during rebalances stability: Rebalances must be atomic Jul 23, 2018
@bdarnell
Contributor Author

We have a real-world instance of this: a node crashed (in a persistent way so it crashes again each time it's restarted) while the range was only configured for two replicas. This required extreme efforts to revive the downed node instead of simply letting the cluster recover normally.

@a-robinson
Contributor

While it was only configured for two replicas? So this was when up-replicating from 1 to 3? I'm drawing a blank on when else a range would be configured to have two replicas during rebalancing (other than when removing a replica from a dead store and adding a new one to replace it, which this doesn't sound like).

@bdarnell
Contributor Author

I think this was while removing a replica from a dead store. So technically this was a two-simultaneous-failure event, but the first failure was transient while the second was persistent. If the replica change had been atomic, we'd have been able to recover when the first transient failure resolved.

@a-robinson
Contributor

Yeah, we need to fix that, although the fix for that will be different from / simpler than the fix for what this issue was originally describing. We really just need to up-replicate before removing the dead replica, not to make anything atomic.

For the record, we hit this issue of removing a dead replica before adding its replacement a couple months ago as well - #25392.

@tbg
Member

tbg commented Oct 11, 2018

I folded #2067 (can't change the membership with three replicas in a three-node cluster) into this issue.

tbg added a commit to tbg/cockroach that referenced this issue Mar 15, 2019
This is a PR just to show some code to the interested parties. The real thing
to look at here is the explanation and suggested strategy below. Don't
review the code.

----

Learner replicas are full Replicas minus the right to vote (i.e.
they don't count for quorum). They are interesting to us because they
allow us to phase out preemptive snapshots (via [delayed preemptive
snaps] as a migration strategy; see there for all the things we don't
like about preemptive snapshots), and because they can help us avoid
spending more time in vulnerable configurations than we need to.

To see an example of the latter, assume we're trying to upreplicate from
three replicas to five replicas. As is, we need to add a fourth replica,
wait for a preemptive snapshot, and add the fifth. We spend
approximately the duration of one preemptive snapshot in an even replica
configuration. In theory, we could send two preemptive snapshots first
and then carry out two replica additions, but with learners, it would be
much more straightforward and less error-prone. This doesn't solve the
problems in cockroachdb#12768, but it helps avoid them.

This PR shows the bare minimum of code changes to upreplicate using
learner replicas and suggests further steps to make them a reality.

Principally, to use learner replicas, we barely need to make any
changes. Our current up-replication code is this:

1. send preemptive snapshot
2. run the ChangeReplicas txn, which adds a `ReplicaDescriptor` to the
replicated range descriptor and, on commit, induces a Raft configuration
change (`raftpb.ConfChangeAddNode`).

The new up-replication code (note that the old one has to stick around
for compatibility):

1. run the ChangeReplicas txn, which adds a `ReplicaDescriptor` to the
replicated range descriptor with the `Learner` flag set to true and, on commit, induces a Raft configuration
change (`raftpb.ConfChangeAddLearnerNode`).
2. wait for the learner to have caught up or send it a Raft snapshot
proactively (either works, just have to make sure not to duplicate work)
3. run a ChangeReplicas txn which removes the `Learner` flag from the `ReplicaDescriptor`
and induces a Raft conf change of `raftpb.ConfChangeAddNode` (upgrading
the learner).

The existence of learners will require updates throughout the allocator
so that it realizes that they don't count for quorum and ensures they
are either upgraded or removed in a timely manner. None of that is in
this POC.

[delayed preemptive snaps]: cockroachdb#35786

Release note: None
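
To make the three steps in the commit message above concrete, here is a rough sketch of the learner-based flow. The raftpb conf-change types are real (go.etcd.io/etcd/raft/raftpb), but the helper functions are hypothetical stand-ins, not CockroachDB's actual ChangeReplicas API.

```go
// Rough sketch of the learner-based up-replication flow. Only the raftpb
// conf-change types are real; runChangeReplicasTxn and
// waitForCaughtUpOrSnapshot are hypothetical stand-ins for illustration.
package sketch

import "go.etcd.io/etcd/raft/raftpb"

// Hypothetical stand-ins for the descriptor txn and the catch-up wait.
func runChangeReplicasTxn(target uint64, learner bool, cc raftpb.ConfChange) error { return nil }
func waitForCaughtUpOrSnapshot(target uint64) error                                { return nil }

// addReplicaViaLearner adds `target` as a learner first, waits for it to
// catch up, and only then promotes it to a voter, so the range never carries
// an extra voter that is still far behind.
func addReplicaViaLearner(target uint64) error {
	// 1. Add the replica with the Learner flag set; this induces a
	//    ConfChangeAddLearnerNode, which does not count toward quorum.
	if err := runChangeReplicasTxn(target, true /* learner */, raftpb.ConfChange{
		Type:   raftpb.ConfChangeAddLearnerNode,
		NodeID: target,
	}); err != nil {
		return err
	}
	// 2. Wait for the learner to catch up (or send it a Raft snapshot
	//    proactively).
	if err := waitForCaughtUpOrSnapshot(target); err != nil {
		return err
	}
	// 3. Promote the learner to a voter via ConfChangeAddNode.
	return runChangeReplicasTxn(target, false /* learner */, raftpb.ConfChange{
		Type:   raftpb.ConfChangeAddNode,
		NodeID: target,
	})
}
```
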
@bdarnell
Contributor Author

While a complete fix requires the implementation of joint consensus, there may be things we can do to make partial improvements. There are a few factors that may be leaving ranges in this over-replicated, under-diversified state longer than necessary.

  • In the allocator, removing an "extra" replica is given a low priority, but as this example shows, the resulting under-diversification can be as severe a problem as a dead node.
  • The replicate queue is a single thread that is blocked while sending preemptive snapshots, preventing other ranges from downreplicating for the duration of a snapshot. (we can easily add more concurrency in this queue, but as long as adds and removes go through the same pool we'll likely end up with all threads blocked in snapshot sending)
  • After adding a replica, we just send the range back through the replicate queue to remove the extra one. That removal is not guaranteed to be the next thing we do. Even if it is, we only make one attempt, and if that fails the extra replica will persist until the next scanner cycle.

I think the last issue is probably most significant - a cluster that experiences one failure to remove an extra replica per ten minutes will usually have an overreplicated range and be vulnerable to region/AZ failure. There are a number of ways this can happen. For example, we try not to remove a node that is "behind", but I think the combination of heuristics here will err on the side of not removing anything (especially considering that the newly-added replica is likely not caught up yet, and if the queue is quick enough we reach this point while only 2/3 of the original replicas have acked the replica change itself).

bdarnell added a commit to bdarnell/cockroach that referenced this issue Mar 23, 2019
Over-replicated ranges can indicate problems with the replication
queue and in some configurations can lead to unavailability because
over-replicated ranges are often under-diversified.

Updates cockroachdb#12768

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Sep 27, 2019
Touches cockroachdb#12768.

Release justification: the previous code would enter vulnerable
replication configurations when it wasn't necessary, thus undermining
what we wanted to achieve in cockroachdb#12768.

PS: the newly added test passes when manually overriding useAtomic with
`false`, and I verified that it indeed uses no atomic replication
changes.

Release note: None
craig bot pushed a commit that referenced this issue Sep 27, 2019
41084: storage: use atomic replication changes in RelocateRange r=danhhz a=tbg

I wrote this up rather quickly, but wanted to get this out for review sooner
rather than later.

----

Touches #12768.

Release justification: the previous code would enter vulnerable
replication configurations when it wasn't necessary, thus undermining
what we wanted to achieve in #12768.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Sep 30, 2019
This is the last known work item for cockroachdb#12768. Previously, the replicate
queue would use an add-and-remove strategy to get replicas off
decommissioning or dead stores. Now it uses swaps whenever it can,
subject to the usual restrictions around single-replica groups and
cluster versions.

Release justification: feature completes atomic replication changes.

Release note: None
craig bot pushed a commit that referenced this issue Oct 1, 2019
41190: storage: move off dead/decommissioning stores via atomic swaps r=danhhz a=tbg

This is the last known work item for #12768. Previously, the replicate
queue would use an add-and-remove strategy to get replicas off
decommissioning or dead stores. Now it uses swaps whenever it can,
subject to the usual restrictions around single-replica groups and
cluster versions.

Release justification: feature completes atomic replication changes.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
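
For reference, a minimal sketch of what such an atomic swap looks like at the raft level, expressed with etcd's raftpb types. CockroachDB's real path additionally goes through a ChangeReplicas txn and joint descriptor states, which this omits.

```go
// Minimal sketch of an atomic swap (add one voter, remove another in a single
// joint-consensus step), expressed with etcd's raftpb types. This shows only
// the raft-level conf change, not CockroachDB's descriptor handling.
package sketch

import "go.etcd.io/etcd/raft/raftpb"

// swapConfChange builds a ConfChangeV2 that atomically replaces `remove`
// with `add`. Because more than one change is bundled, raft enters a joint
// configuration and leaves it automatically once the change is applied.
func swapConfChange(add, remove uint64) raftpb.ConfChangeV2 {
	return raftpb.ConfChangeV2{
		Transition: raftpb.ConfChangeTransitionAuto,
		Changes: []raftpb.ConfChangeSingle{
			{Type: raftpb.ConfChangeAddNode, NodeID: add},
			{Type: raftpb.ConfChangeRemoveNode, NodeID: remove},
		},
	}
}
```
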
@tbg
Member

tbg commented Oct 4, 2019

My new god query is

SELECT
  min("rangeID") AS rangeid,
  min("eventType") AS event,
  count(replica) AS numvoters,
  min("timestamp") AS ts,
  min(info) AS info
FROM
  [
    SELECT
      "uniqueID",
      "timestamp",
      "rangeID",
      "eventType",
      info,
      replica
    FROM
      system.rangelog,
      ROWS FROM (
        jsonb_array_elements(
          info::JSONB->'UpdatedDesc'->'internal_replicas'
        )
      )
        AS replica
    WHERE
      COALESCE(replica->'type', '0') IN ('0', '2')
  ]
WHERE
  "eventType" = 'add'
GROUP BY
  "uniqueID"
HAVING
  count(replica) > 3
ORDER BY
  rangeid, ts ASC;

This lists every range that entered a >3x replicated state. I'm running tpccbench/nodes=9/cpu=4/multi-region right now to verify that I'm not seeing any of them (other than the ones related to the system ranges).

@tbg
Member

tbg commented Oct 6, 2019

Good and bad news. The good news is that the above query verifies that we're never upreplicating to four nodes in tpccbench/nodes=9/cpu=4/multi-region. The bad news is that we're still violating the constraints all the time by putting more than one replica into one AZ.

tpccbench/nodes=9/cpu=4/multi-region has nodes 1-3 in region 1, 4-6 in region 2, 7-9 in region 3.

We can check whether we ever put two replicas into region 1 with this query:

SELECT
        min("rangeID") AS rangeid,
        min("eventType") AS event,
        count(replica) AS numvoters,
        min("timestamp") AS ts,
        min(info) AS info
FROM
        [
                SELECT
                        "uniqueID",
                        "timestamp",
                        "rangeID",
                        "eventType",
                        info,
                        replica
                FROM
                        system.rangelog,
                        ROWS FROM (
                                jsonb_array_elements(
                                        info::JSONB->'UpdatedDesc'->'internal_replicas'
                                )
                        )
                                AS replica
                WHERE
                        COALESCE(replica->'type', '0') IN ('0', '2')
                        AND replica->'node_id' IN ('1','2','3')
        ]
WHERE
        "eventType" = 'add'
GROUP BY
        "uniqueID"
HAVING
        count(replica) > 1
ORDER BY
        rangeid, ts ASC;

which returns lots of results. Picking a range at random, I see this wonderful behavior:

[screenshot: rangelog entries showing all of the range's replicas placed in a single region]

Note how all of the replicas were put into a single region 💀

I sort of expected the allocator to do stupid stuff like that, but this is hilariously bad; I hope I broke this just recently. On the "plus" side, the allocator is aware that it's breaking the constraints:

[screenshot: rangelog entry for the n4→n3 move, with details:(constraint check fail) on the replaced replica]

When the range is split off with replicas on n3, n4, and n8, the constraints should be OK, but then we move the n4 replica to n3. I don't know why, but note the "constraint check fail".

@tbg
Member

tbg commented Oct 6, 2019

The replicate queue applied to this range at present says

2019-10-06 12:25:03 | [n2,s2,r6772/4:/Table/60/2/0{-/1/"AB…}] running replicate.process
-- | --
2019-10-06 12:25:03 | [n2,s2,r6772/4:/Table/60/2/0{-/1/"AB…}] next replica action: consider rebalance
2019-10-06 12:25:03 | [n2,s2,r6772/4:/Table/60/2/0{-/1/"AB…}] no suitable rebalance target

This tells me that the allocator doesn't even think there is a problem, which is bad.

@tbg
Member

tbg commented Oct 6, 2019

I found out that this range belongs to the partitioned index customer_idx, and so it's intentionally glued to a region via constraints a la

ALTER PARTITION p1_0 OF INDEX tpcc.public.customer@customer_idx CONFIGURE ZONE USING
    constraints = '[+zone=us-east1-b]';

That makes it even more mysterious why the allocator thinks the constraints are violated, but at least it's better than I initially thought.

Unfortunately it will be hard to programmatically go through all of the ranges and to decide which ones are supposed to be on which set of replicas.

@tbg
Member

tbg commented Oct 6, 2019

For another bit of confusion, look how many "overreplicated ranges" this thing claims we have (it thinks they're basically all overreplicated)

root@localhost:26257/tpcc> select * from system.replication_stats;
  zone_id | subzone_id | report_id | total_ranges | unavailable_ranges | under_replicated_ranges | over_replicated_ranges
+---------+------------+-----------+--------------+--------------------+-------------------------+------------------------+
        0 |          0 |         3 |          198 |                  0 |                       0 |                      0
        1 |          0 |         3 |           18 |                  0 |                       0 |                      0
       15 |          0 |         3 |            1 |                  0 |                       0 |                      0
       16 |          0 |         3 |            1 |                  0 |                       0 |                      0
       17 |          0 |         3 |            2 |                  0 |                       0 |                      0
       22 |          0 |         3 |            1 |                  0 |                       0 |                      0
       53 |          1 |         3 |            5 |                  0 |                       0 |                      5
       53 |          2 |         3 |            6 |                  0 |                       0 |                      6
       53 |          3 |         3 |            4 |                  0 |                       0 |                      4
       54 |          1 |         3 |            1 |                  0 |                       0 |                      0
       54 |          2 |         3 |            1 |                  0 |                       0 |                      0
       54 |          3 |         3 |            1 |                  0 |                       0 |                      0
       55 |          1 |         3 |           21 |                  0 |                       0 |                     21
       55 |          2 |         3 |           20 |                  0 |                       0 |                     20
       55 |          3 |         3 |           22 |                  0 |                       0 |                     22
       56 |          1 |         3 |          434 |                  0 |                       0 |                    434
       56 |          2 |         3 |          424 |                  0 |                       0 |                    424
       56 |          3 |         3 |          428 |                  0 |                       0 |                    428
       56 |          4 |         3 |          243 |                  0 |                       0 |                    243
       56 |          5 |         3 |          251 |                  0 |                       0 |                    251
       56 |          6 |         3 |          251 |                  0 |                       0 |                    251
       57 |          1 |         3 |           89 |                  0 |                       0 |                     89
       57 |          2 |         3 |           88 |                  0 |                       0 |                     88
       57 |          3 |         3 |           87 |                  0 |                       0 |                     87
       57 |          4 |         3 |           48 |                  0 |                       0 |                     48
       57 |          5 |         3 |           47 |                  0 |                       0 |                     47
       57 |          6 |         3 |           47 |                  0 |                       0 |                     47
       57 |          7 |         3 |           42 |                  0 |                       0 |                     42
       57 |          8 |         3 |           40 |                  0 |                       0 |                     40
       57 |          9 |         3 |           41 |                  0 |                       0 |                     41
       58 |          1 |         3 |            2 |                  0 |                       0 |                      2
       58 |          2 |         3 |            2 |                  0 |                       0 |                      2
       58 |          3 |         3 |            2 |                  0 |                       0 |                      2
       59 |          1 |         3 |          709 |                  0 |                       0 |                    709
       59 |          2 |         3 |          705 |                  0 |                       0 |                    705
       59 |          3 |         3 |          692 |                  0 |                       0 |                    692
       60 |          1 |         3 |          379 |                  0 |                       0 |                    379
       60 |          2 |         3 |          377 |                  0 |                       0 |                    377
       60 |          3 |         3 |          377 |                  0 |                       0 |                    377
       60 |          4 |         3 |           56 |                  0 |                       0 |                     56
       60 |          5 |         3 |           53 |                  0 |                       0 |                     53
       60 |          6 |         3 |           56 |                  0 |                       0 |                     56
       61 |          1 |         3 |            1 |                  0 |                       0 |                      1
       61 |          2 |         3 |            1 |                  0 |                       0 |                      1
       61 |          3 |         3 |            1 |                  0 |                       0 |                      1
(45 rows)

Time: 69.719958ms

root@localhost:26257/tpcc>

I don't buy that; the ranges.overreplicated metric flatlines at zero.

cc @andreimatei / @darinpp (cluster is tobias-1570324285-01-n12cpu4-geo)

@andreimatei
Contributor

andreimatei commented Oct 7, 2019 via email

@andreimatei
Contributor

Actually, @darinpp, if you get here before me and are so inclined, feel free to take a look. It'll have to wait a bit on my part.

@darinpp
Contributor

darinpp commented Oct 8, 2019

#41328 shows a case where you have 4 replicas in 19.2. It does seem like a bug, however.

@a-robinson
Contributor

When the range is split off with replicas on n3, n4, and n8, the constraints should be OK, but then we move the n4 replica to n3. I don't know why, but note the "constraint check fail".

...

That makes it even more mysterious why the allocator thinks the constraints are violated, but at least it's better than I initially thought.

In that screenshot, the details:(constraint check fail) field is part of the description of the existing replica that's being replaced. It's telling you that the reason for removing the replica on s4 and adding a replica on s2 is that s4 violates the constraints of the range.

@ajwerner
Contributor

@tbg can we close this?

@tbg tbg closed this as completed Feb 12, 2020
lunevalex added a commit to lunevalex/cockroach that referenced this issue Aug 19, 2020
Closes cockroachdb#6782

This change modifies the replica_queue to allow rebalances between multiple stores within a single node.
This is possible because we previously introduced atomic rebalances in cockroachdb#12768.

The first step was to remove the constraints in the allocator that prevented same node rebalances and update the validation in the replica_queue to accept these rebalance proposals. There is one caveat that with 1x replication an atomic rebalance is not possible, so we now support adding multiple replicas of the range to the same node under this condition.

With the constraints removed, there would be nothing in the allocator to prevent it from placing multiple replicas of a range on the same node across multiple stores. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. The allocator uses locality tags to model failure domains, but a node was not considered to be a locality. It is thus natural to extend the failure domain definition to the node and model it as a locality tier. Now stores on the same node would be factored into the diversity_score and repel each other, just like nodes in the same datacenter do in a multi-region setup.

Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.
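
A toy sketch (a deliberate simplification, not the allocator's actual scoring code) of why adding the node as a locality tier keeps two stores on the same node apart: the earlier two stores' locality tiers diverge, the higher their pairwise diversity.

```go
// Toy sketch (a simplification, not the allocator's real scoring) of how a
// "node" locality tier makes stores on the same node repel each other: the
// fewer leading tiers two stores share, the more diverse they are.
package sketch

// A locality is an ordered list of tiers, e.g.
// [region=us-east1, zone=us-east1-b, node=n4].
type locality []string

// diversity returns 1 for completely distinct localities and 0 for two
// stores on the same node; intermediate values reward earlier divergence.
func diversity(a, b locality) float64 {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			// Diverge at tier i: the earlier the divergence, the higher
			// the score.
			return float64(n-i) / float64(n)
		}
	}
	return 0 // identical down to the node tier: no diversity at all
}
```

Under this toy scoring, two stores on the same node score 0, a store on a sibling node in the same zone scores 1/3, and stores in different regions score 1, so a diversity-maximizing allocator spreads replicas across nodes before it ever doubles up on one.
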
lunevalex added a commit to lunevalex/cockroach that referenced this issue Aug 20, 2020
Closes cockroachdb#6782

This change modifies the replica_queue to allow rebalances between multiple stores within a single node.
This is possible because we previously introduced atomic rebalances in cockroachdb#12768.

The first step was to remove the constraints in the allocator that prevented same node rebalances and update the validation in the replica_queue to accept these rebalance proposals. There is one caveat that with 1x replication an atomic rebalance is not possible, so we now support adding multiple replicas of the range to the same node under this condition.

With the constraints removed, there would be nothing in the allocator to prevent it from placing multiple replicas of a range on the same node across multiple stores. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. The allocator uses locality tags to model failure domains, but a node was not considered to be a locality. It is thus natural to extend the failure domain definition to the node and model it as a locality tier. Now stores on the same node would be factored into the diversity_score and repel each other, just like nodes in the same datacenter do in a multi-region setup.

Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.
craig bot pushed a commit that referenced this issue Aug 20, 2020
50053: Script for the PublishRelease TC build configuration r=pbardea a=jlinder

Before: the script wasn't implemented.

Now:

Part of the new release process, this script

- tags the selected SHA to the provided name
- compiles the binaries and archive and uploads them to S3 as the
  versioned name
- uploads the docker image to docker.io/cockroachdb/cockroach
- pushes the tag to github.com/cockroachdb/cockroach
- pushes all the artifacts to their respective `latest` locations as
  appropriate

Release note: None

51567: kvserver: Allow rebalances between stores on the same nodes. r=lunevalex a=lunevalex

Closes #6782

This change modifies the replica_queue to allow rebalances between multiple stores within a single node. 
This is possible because we previously introduced atomic rebalances in #12768. 

The first step was to remove the constraints in the allocator that prevented same node rebalances and update the validation in the replica_queue to accept these rebalance proposals. There is one caveat that with 1x replication an atomic rebalance is not possible, so we now support adding multiple replicas of the range to the same node under this condition. 

With the constraints removed, there would be nothing in the allocator to prevent it from placing multiple replicas of a range on the same node across multiple stores. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. The allocator uses locality tags to model failure domains, but a node was not considered to be a locality. It is thus natural to extend the failure domain definition to the node and model it as a locality tier. Now stores on the same node would be factored into the diversity_score and repel each other, just like nodes in the same datacenter do in a multi-region setup.

Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.

52754: importccl: speed up revert of IMPORT INTO empty table r=dt a=dt

When IMPORT INTO fails, it reverts the tables to their pre-IMPORT state.
Typically this requires running a somewhat expensive RevertRange operation
that finds the keys written by the IMPORT in amongst all the table data
and deletes just those keys. This is somewhat expensive -- we need to
iterate the keys in the target table and check them to see if they
need to be reverted.

Non-INTO style IMPORTs create the table into which they will IMPORT and
thus can just drop it wholesale on failure, instead of doing this expensive
revert. However INTO-style IMPORTs could use a similarly fast/cheap
wholesale delete *if they knew the table was empty* when the IMPORT was
started.

This change tracks which tables were empty when the IMPORT started and
then deletes, rather than reverts, the table span on failure.

Release note (performance improvement): Cleaning up after a failure during IMPORT INTO a table which was empty is now faster.

53023: opt: add index acceleration support for ~ and && bounding box operators r=rytaft a=rytaft

This commit adds index acceleration support for the bounding box comparison
operators, `~` and `&&`. It maps `~` to Covers and `&&` to Intersects.

Release note (performance improvement): The ~ and && geospatial bounding
box operations can now benefit from index acceleration if one of the
operands is an indexed geometry column.

53049: bulkio: Fix transaction semantics in job scheduler. r=miretskiy a=miretskiy

Fixes #53033
Fixes #52959 

Use transaction when querying for the schedules to run.
In addition, ensure that a single bad schedule does not cause
all of the previous work to be wasted by using transaction savepoints.


Release Notes: None

53132: sql/opt: add implicit SELECT FOR UPDATE support for UPSERT statements r=nvanbenschoten a=nvanbenschoten

Fixes #50180.

This commit adds implicit SELECT FOR UPDATE support for UPSERT statements with a VALUES clause. This should improve throughput and latency for contended UPSERT statements in much the same way that 435fa43 did for UPDATE statements. However, this only has an effect on UPSERT statements into tables with multiple indexes, because UPSERT statements into single-index tables hit a fast path where they perform a blind write without doing an initial row scan.

Conceptually, if we picture an UPSERT statement as the composition of a SELECT statement and an INSERT statement (with loosened semantics around existing rows) then this change performs the following transformation:
```
UPSERT t = SELECT FROM t + INSERT INTO t
=>
UPSERT t = SELECT FROM t FOR UPDATE + INSERT INTO t
```

I plan to test this out on a contended `indexes` workload at some point in the future.

Release note (sql change): UPSERT statements now acquire locks using the FOR UPDATE locking mode during their initial row scan, which improves performance for contended workloads when UPSERTing into tables with multiple indexes. This behavior is configurable using the enable_implicit_select_for_update session variable and the sql.defaults.implicit_select_for_update.enabled cluster setting.

Co-authored-by: James H. Linder <jamesl@cockroachlabs.com>
Co-authored-by: Alex Lunev <alexl@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>