
stability: Rebalances must be atomic #12768

Closed
bdarnell opened this issue Jan 8, 2017 · 31 comments
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-1-blocking-adoption

Comments

@bdarnell
Contributor

bdarnell commented Jan 8, 2017

Consider a deployment with three datacenters (and a replication configuration that places one replica of each range in each DC). During a rebalance, a range that normally requires a quorum of 2/3 will briefly change to a configuration with a 3/4 quorum and two replicas in one DC. If that DC (or the network) fails before the fourth replica is removed, the range becomes unavailable. (If rebalancing downreplicated before upreplicating, the intermediate state would instead have a quorum of 2/2, and a failure of either datacenter would lead to unavailability and much harder recovery.) While the window is fairly small, there is a risk that an ill-timed DC failure could make one or more ranges unavailable for the duration.

The most general solution to this problem is to implement "joint consensus", the original raft membership change proposal (it is described in the original raft paper and in section 4.3 of Diego's dissertation). Etcd implements a much simpler membership change protocol which has the limitation that only one change can be made at a time (leading to the brief exposure of the intermediate 3/4 state). Joint consensus would let us make these rebalancing changes atomically.

We might also be able to mitigate the problem by using ideas from Flexible Paxos. Flexible Paxos shows that committing entries can be done with a bare majority of nodes, so as long as the leader is not in the failed datacenter the removal of the fourth node can be completed while the DC is down and the range is restored to its 2/3 state. However, if the leader were in the failed DC then the two surviving ones would be unable to make progress since they would have to assume that the former leader is still making progress on the other side of its partition. I'm not sure if there's a full solution here or not.

Running with a higher replication factor (5 replicas in 3 DCs) could also mitigate the problem if the rebalancer were aware of it (so when the range goes from 3/5 to 4/6, those six replicas are guaranteed to be two per DC). This might be a quick fix to prevent this problem from striking a critical system range. Running with more DCs than replicas (3 replicas in 4 DCs, so you never need to place two replicas in the same DC) also avoids this problem without increasing storage costs, but it has the significant downside that all rebalancing must happen over the more expensive inter-DC links.
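
For intuition, here is a toy sketch (in Go, not CockroachDB code) of the joint-consensus commit rule from section 4.3 of the dissertation: while the range is in the joint configuration, an entry commits only once a majority of the old member set and a majority of the new member set have acknowledged it, so an atomic swap never exposes a plain 3/4 or 2/2 quorum.

```go
// Toy illustration (not CockroachDB code) of the joint-consensus commit rule:
// during the transition, an entry is committed only if it is acknowledged by
// a majority of the old member set AND a majority of the new member set.
package main

import "fmt"

func majorityAcked(members []string, acked map[string]bool) bool {
	n := 0
	for _, m := range members {
		if acked[m] {
			n++
		}
	}
	return n > len(members)/2
}

// jointCommitted reports whether an entry is committed while the group is in
// the joint configuration (oldCfg, newCfg).
func jointCommitted(oldCfg, newCfg []string, acked map[string]bool) bool {
	return majorityAcked(oldCfg, acked) && majorityAcked(newCfg, acked)
}

func main() {
	// Swapping the DC3 replica: {1A, 2B, 3C} -> {1A, 2B, 3D}.
	oldCfg := []string{"1A", "2B", "3C"}
	newCfg := []string{"1A", "2B", "3D"}

	// DC3 fails mid-change: only 1A and 2B respond. Both majorities are still
	// satisfied (2 of 3 in the old config and 2 of 3 in the new one), so the
	// range stays available, unlike the 3/4 quorum of the non-atomic
	// intermediate state.
	acked := map[string]bool{"1A": true, "2B": true}
	fmt.Println(jointCommitted(oldCfg, newCfg, acked)) // true
}
```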

@rjnn
Contributor

rjnn commented Jan 9, 2017

If we're using flexible paxos, we can make it work as long as we first ensure that the leader is not in the datacenter where the rebalance is taking place. If we denote datacenters by numbers and replicas by letters, then to go from {1A, 2B, 3C} to {1A, 2B, 3D}, we just need to ensure that A or B is the leader before upreplicating to {1A, 2B, 3C, 3D}. Now if we have a datacenter failure at 1 or 2, we have enough replicas (three) to elect a leader. If we have a datacenter failure at 3, we still have our leader so we can upreplicate again to a 5-replica range.

@bdarnell
Contributor Author

bdarnell commented Jan 9, 2017

Yeah, we'd need to ensure both that A or B is the leader and that C cannot become leader while the upreplication is in flight.

There are also performance considerations here (which we do not currently take into account): the new replica gets its snapshot from the leader, so for performance we'd want to have 3C be the leader when 3D is added. (or we introduce some way to send preemptive snapshots from a follower)

@petermattis petermattis added this to the Later milestone Feb 23, 2017
@bdarnell bdarnell modified the milestones: Later, 2.1 Apr 26, 2018
@bdarnell bdarnell added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. labels Apr 26, 2018
@bdarnell bdarnell self-assigned this Apr 26, 2018
@tbg tbg modified the milestones: 2.1, 2.2 Jul 19, 2018
@bdarnell bdarnell changed the title stability: Datacenter failures during rebalances stability: Rebalances must be atomic Jul 23, 2018
@bdarnell
Contributor Author

We have a real-world instance of this: a node crashed (in a persistent way so it crashes again each time it's restarted) while the range was only configured for two replicas. This required extreme efforts to revive the downed node instead of simply letting the cluster recover normally.

@a-robinson
Contributor

While it was only configured for two replicas? So this was when up-replicating from 1 to 3? I'm drawing a blank on when else a range would be configured to have two replicas during rebalancing (other than when removing a replica from a dead store and adding a new one to replace it, which this doesn't sound like).

@bdarnell
Contributor Author

I think this was while removing a replica from a dead store. So technically this was a two-simultaneous-failure event, but the first failure was transient while the second was persistent. If the replica change had been atomic, we'd have been able to recover when the first transient failure resolved.

@a-robinson
Contributor

Yeah, we need to fix that, although the fix for that will be different from / simpler than the fix for what this issue was originally describing. We really just need to up-replicate before removing the dead replica, not to make anything atomic.

For the record, we hit this issue of removing a dead replica before adding its replacement a couple months ago as well - #25392.

@tbg
Member

tbg commented Oct 11, 2018

I folded #2067 (can't change the membership with three replicas in a three-node cluster) into this issue.

tbg added a commit to tbg/cockroach that referenced this issue Mar 15, 2019
This is a PR just to show some code to the interested parties. The real thing
to look at here is the explanation and suggested strategy below. Don't
review the code.

----

Learner replicas are full Replicas minus the right to vote (i.e.
they don't count for quorum). They are interesting to us because they
allow us to phase out preemptive snapshots (via [delayed preemptive
snaps] as a migration strategy; see there for all the things we don't
like about preemptive snapshots), and because they can help us avoid
spending more time in vulnerable configurations than we need to.

To see an example of the latter, assume we're trying to upreplicate from
three replicas to five replicas. As is, we need to add a fourth replica,
wait for a preemptive snapshot, and add the fifth. We spend
approximately the duration of one preemptive snapshot in an even replica
configuration. In theory, we could send two preemptive snapshots first
and then carry out two replica additions, but with learners, it would be
much more straightforward and less error-prone. This doesn't solve the
problems in cockroachdb#12768, but it helps avoid them.

This PR shows the bare minimum of code changes to upreplicate using
learner replicas and suggests further steps to make them a reality.

Principally, to use learner replicas, we barely need to make any
changes. Our current up-replication code is this:

1. send preemptive snapshot
2. run the ChangeReplicas txn, which adds a `ReplicaDescriptor` to the
replicated range descriptor and, on commit, induces a Raft configuration
change (`raftpb.ConfChangeAddNode`).

The new up-replication code (note that the old one has to stick around
for compatibility):

1. run the ChangeReplicas txn, which adds a `ReplicaDescriptor` to the
replicated range descriptor with the `Learner` flag set to true and, on commit, induces a Raft configuration
change (`raftpb.ConfChangeAddLearnerNode`).
2. wait for the learner to have caught up or send it a Raft snapshot
proactively (either works, just have to make sure not to duplicate work)
3. run a ChangeReplicas txn which removes the `Learner` flag from the `ReplicaDescriptor`
and induces a Raft conf change of `raftpb.ConfChangeAddNode` (upgrading
the learner).

The existence of learners will require updates throughout the allocator
so that it realizes that they don't count for quorum and ensures they
are either upgraded or removed in a timely manner. None of that is in
this POC.

[delayed preemptive snaps]: cockroachdb#35786

Release note: None
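
To make the three steps in the commit message above concrete, here is a rough sketch of the learner-based flow. The raftpb conf-change types are real (go.etcd.io/etcd/raft/raftpb), but the helper functions are hypothetical stand-ins, not CockroachDB's actual ChangeReplicas API.

```go
// Rough sketch of the learner-based up-replication flow. Only the raftpb
// conf-change types are real; runChangeReplicasTxn and
// waitForCaughtUpOrSnapshot are hypothetical stand-ins for illustration.
package sketch

import "go.etcd.io/etcd/raft/raftpb"

// Hypothetical stand-ins for the descriptor txn and the catch-up wait.
func runChangeReplicasTxn(target uint64, learner bool, cc raftpb.ConfChange) error { return nil }
func waitForCaughtUpOrSnapshot(target uint64) error                                { return nil }

// addReplicaViaLearner adds `target` as a learner first, waits for it to
// catch up, and only then promotes it to a voter, so the range never carries
// an extra voter that is still far behind.
func addReplicaViaLearner(target uint64) error {
	// 1. Add the replica with the Learner flag set; this induces a
	//    ConfChangeAddLearnerNode, which does not count toward quorum.
	if err := runChangeReplicasTxn(target, true /* learner */, raftpb.ConfChange{
		Type:   raftpb.ConfChangeAddLearnerNode,
		NodeID: target,
	}); err != nil {
		return err
	}
	// 2. Wait for the learner to catch up (or send it a Raft snapshot
	//    proactively).
	if err := waitForCaughtUpOrSnapshot(target); err != nil {
		return err
	}
	// 3. Promote the learner to a voter via ConfChangeAddNode.
	return runChangeReplicasTxn(target, false /* learner */, raftpb.ConfChange{
		Type:   raftpb.ConfChangeAddNode,
		NodeID: target,
	})
}
```
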
@bdarnell
Contributor Author

While a complete fix requires the implementation of joint consensus, there may be things we can do to make partial improvements. There are a few factors that may be leaving ranges in this over-replicated, under-diversified state longer than necessary.

  • In the allocator, removing an "extra" replica is given a low priority, but as this example shows, the resulting under-diversification can be as severe a problem as a dead node.
  • The replicate queue is a single thread that is blocked while sending preemptive snapshots, preventing other ranges from downreplicating for the duration of a snapshot. (we can easily add more concurrency in this queue, but as long as adds and removes go through the same pool we'll likely end up with all threads blocked in snapshot sending)
  • After adding a replica, we just send the range back through the replicate queue to remove the extra one. That removal is not guaranteed to be the next thing we do. Even if it is, we only make one attempt, and if that fails the extra replica will persist until the next scanner cycle.

I think the last issue is probably most significant - a cluster that experiences one failure to remove an extra replica per ten minutes will usually have an overreplicated range and be vulnerable to region/AZ failure. There are a number of ways this can happen. For example, we try not to remove a node that is "behind", but I think the combination of heuristics here will err on the side of not removing anything (especially considering that the newly-added replica is likely not caught up yet, and if the queue is quick enough we reach this point while only 2/3 of the original replicas have acked the replica change itself).

bdarnell added a commit to bdarnell/cockroach that referenced this issue Mar 23, 2019
Over-replicated ranges can indicate problems with the replication
queue and in some configurations can lead to unavailability because
over-replicated ranges are often under-diversified.

Updates cockroachdb#12768

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Sep 27, 2019
Touches cockroachdb#12768.

Release justification: the previous code would enter vulnerable
replication configurations when it wasn't necessary, thus undermining
what we wanted to achieve in cockroachdb#12768.

PS: the newly added test passes when manually overriding useAtomic with
`false`, and I verified that it indeed uses no atomic replication
changes.

Release note: None
craig bot pushed a commit that referenced this issue Sep 27, 2019
41084: storage: use atomic replication changes in RelocateRange r=danhhz a=tbg

I wrote this up rather quickly, but wanted to get this out for review sooner
rather than later.

----

Touches #12768.

Release justification: the previous code would enter vulnerable
replication configurations when it wasn't necessary, thus undermining
what we wanted to achieve in #12768.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Sep 30, 2019
This is the last known work item for cockroachdb#12768. Previously, the replicate
queue would use an add-and-remove strategy to get replicas off
decommissioning or dead stores. Now it uses swaps whenever it can,
subject to the usual restrictions around single-replica groups and
cluster versions.

Release justification: feature completes atomic replication changes.

Release note: None
craig bot pushed a commit that referenced this issue Oct 1, 2019
41190: storage: move off dead/decommissioning stores via atomic swaps r=danhhz a=tbg

This is the last known work item for #12768. Previously, the replicate
queue would use an add-and-remove strategy to get replicas off
decommissioning or dead stores. Now it uses swaps whenever it can,
subject to the usual restrictions around single-replica groups and
cluster versions.

Release justification: feature completes atomic replication changes.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
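
For reference, a minimal sketch of what such an atomic swap looks like at the raft level, expressed with etcd's raftpb types. CockroachDB's real path additionally goes through a ChangeReplicas txn and joint descriptor states, which this omits.

```go
// Minimal sketch of an atomic swap (add one voter, remove another in a single
// joint-consensus step), expressed with etcd's raftpb types. This shows only
// the raft-level conf change, not CockroachDB's descriptor handling.
package sketch

import "go.etcd.io/etcd/raft/raftpb"

// swapConfChange builds a ConfChangeV2 that atomically replaces `remove`
// with `add`. Because more than one change is bundled, raft enters a joint
// configuration and leaves it automatically once the change is applied.
func swapConfChange(add, remove uint64) raftpb.ConfChangeV2 {
	return raftpb.ConfChangeV2{
		Transition: raftpb.ConfChangeTransitionAuto,
		Changes: []raftpb.ConfChangeSingle{
			{Type: raftpb.ConfChangeAddNode, NodeID: add},
			{Type: raftpb.ConfChangeRemoveNode, NodeID: remove},
		},
	}
}
```
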
@tbg
Member

tbg commented Oct 4, 2019

My new god query is

SELECT
  min("rangeID") AS rangeid,
  min("eventType") AS event,
  count(replica) AS numvoters,
  min("timestamp") AS ts,
  min(info) AS info
FROM
  [
    SELECT
      "uniqueID",
      "timestamp",
      "rangeID",
      "eventType",
      info,
      replica
    FROM
      system.rangelog,
      ROWS FROM (
        jsonb_array_elements(
          info::JSONB->'UpdatedDesc'->'internal_replicas'
        )
      )
        AS replica
    WHERE
      COALESCE(replica->'type', '0') IN ('0', '2')
  ]
WHERE
  "eventType" = 'add'
GROUP BY
  "uniqueID"
HAVING
  count(replica) > 3
ORDER BY
  rangeid, ts ASC;

This lists every range that entered a >3x replicated state. I'm running tpccbench/nodes=9/cpu=4/multi-region right now to verify that I'm not seeing any of them (other than the ones related to the system ranges).

@tbg
Member

tbg commented Oct 6, 2019

Good and bad news. The good news is that the above query verifies that we're never upreplicating to four nodes in tpccbench/nodes=9/cpu=4/multi-region. The bad news is that we're still violating the constraints all the time by putting more than one replica into one AZ.

tpccbench/nodes=9/cpu=4/multi-region has nodes 1-3 in region 1, 4-6 in region 2, 7-9 in region 3.

We can check whether we ever put two replicas into region 1 with this query:

SELECT
        min("rangeID") AS rangeid,
        min("eventType") AS event,
        count(replica) AS numvoters,
        min("timestamp") AS ts,
        min(info) AS info
FROM
        [
                SELECT
                        "uniqueID",
                        "timestamp",
                        "rangeID",
                        "eventType",
                        info,
                        replica
                FROM
                        system.rangelog,
                        ROWS FROM (
                                jsonb_array_elements(
                                        info::JSONB->'UpdatedDesc'->'internal_replicas'
                                )
                        )
                                AS replica
                WHERE
                        COALESCE(replica->'type', '0') IN ('0', '2')
                        AND replica->'node_id' IN ('1','2','3')
        ]
WHERE
        "eventType" = 'add'
GROUP BY
        "uniqueID"
HAVING
        count(replica) > 1
ORDER BY
        rangeid, ts ASC;

which returns lots of results. Picking a range at random, I see this wonderful behavior:

[screenshot: rangelog entries showing all of the range's replicas placed in a single region]

Note how all of the replicas were put into a single region 💀

I sort of expected the allocator to do stupid stuff like that, but this is hilariously bad; I hope I broke this just recently. On the "plus" side, the allocator is aware that it's breaking the constraints:

[screenshot: rangelog entry for the n4→n3 move, with details:(constraint check fail) on the replaced replica]

When the range is split off with replicas on n3, n4, and n8, the constraints should be OK, but then we move the n4 replica to n3. I don't know why, but note the "constraint check fail".

@tbg
Member

tbg commented Oct 6, 2019

The replicate queue applied to this range at present says

2019-10-06 12:25:03 | [n2,s2,r6772/4:/Table/60/2/0{-/1/"AB…}] running replicate.process
-- | --
2019-10-06 12:25:03 | [n2,s2,r6772/4:/Table/60/2/0{-/1/"AB…}] next replica action: consider rebalance
2019-10-06 12:25:03 | [n2,s2,r6772/4:/Table/60/2/0{-/1/"AB…}] no suitable rebalance target

This tells me that the allocator doesn't even think there is a problem, which is bad.

@tbg
Member

tbg commented Oct 6, 2019

I found out that this range belongs to the partitioned index customer_idx, and so it's intentionally glued to a region via constraints a la

ALTER PARTITION p1_0 OF INDEX tpcc.public.customer@customer_idx CONFIGURE ZONE USING
    constraints = '[+zone=us-east1-b]';

That makes it even more mysterious why the allocator thinks the constraints are violated, but at least it's better than I initially thought.

Unfortunately it will be hard to programmatically go through all of the ranges and to decide which ones are supposed to be on which set of replicas.

@tbg
Member

tbg commented Oct 6, 2019

For another bit of confusion, look how many "overreplicated ranges" this thing claims we have (it thinks they're basically all overreplicated)

root@localhost:26257/tpcc> select * from system.replication_stats;
  zone_id | subzone_id | report_id | total_ranges | unavailable_ranges | under_replicated_ranges | over_replicated_ranges
+---------+------------+-----------+--------------+--------------------+-------------------------+------------------------+
        0 |          0 |         3 |          198 |                  0 |                       0 |                      0
        1 |          0 |         3 |           18 |                  0 |                       0 |                      0
       15 |          0 |         3 |            1 |                  0 |                       0 |                      0
       16 |          0 |         3 |            1 |                  0 |                       0 |                      0
       17 |          0 |         3 |            2 |                  0 |                       0 |                      0
       22 |          0 |         3 |            1 |                  0 |                       0 |                      0
       53 |          1 |         3 |            5 |                  0 |                       0 |                      5
       53 |          2 |         3 |            6 |                  0 |                       0 |                      6
       53 |          3 |         3 |            4 |                  0 |                       0 |                      4
       54 |          1 |         3 |            1 |                  0 |                       0 |                      0
       54 |          2 |         3 |            1 |                  0 |                       0 |                      0
       54 |          3 |         3 |            1 |                  0 |                       0 |                      0
       55 |          1 |         3 |           21 |                  0 |                       0 |                     21
       55 |          2 |         3 |           20 |                  0 |                       0 |                     20
       55 |          3 |         3 |           22 |                  0 |                       0 |                     22
       56 |          1 |         3 |          434 |                  0 |                       0 |                    434
       56 |          2 |         3 |          424 |                  0 |                       0 |                    424
       56 |          3 |         3 |          428 |                  0 |                       0 |                    428
       56 |          4 |         3 |          243 |                  0 |                       0 |                    243
       56 |          5 |         3 |          251 |                  0 |                       0 |                    251
       56 |          6 |         3 |          251 |                  0 |                       0 |                    251
       57 |          1 |         3 |           89 |                  0 |                       0 |                     89
       57 |          2 |         3 |           88 |                  0 |                       0 |                     88
       57 |          3 |         3 |           87 |                  0 |                       0 |                     87
       57 |          4 |         3 |           48 |                  0 |                       0 |                     48
       57 |          5 |         3 |           47 |                  0 |                       0 |                     47
       57 |          6 |         3 |           47 |                  0 |                       0 |                     47
       57 |          7 |         3 |           42 |                  0 |                       0 |                     42
       57 |          8 |         3 |           40 |                  0 |                       0 |                     40
       57 |          9 |         3 |           41 |                  0 |                       0 |                     41
       58 |          1 |         3 |            2 |                  0 |                       0 |                      2
       58 |          2 |         3 |            2 |                  0 |                       0 |                      2
       58 |          3 |         3 |            2 |                  0 |                       0 |                      2
       59 |          1 |         3 |          709 |                  0 |                       0 |                    709
       59 |          2 |         3 |          705 |                  0 |                       0 |                    705
       59 |          3 |         3 |          692 |                  0 |                       0 |                    692
       60 |          1 |         3 |          379 |                  0 |                       0 |                    379
       60 |          2 |         3 |          377 |                  0 |                       0 |                    377
       60 |          3 |         3 |          377 |                  0 |                       0 |                    377
       60 |          4 |         3 |           56 |                  0 |                       0 |                     56
       60 |          5 |         3 |           53 |                  0 |                       0 |                     53
       60 |          6 |         3 |           56 |                  0 |                       0 |                     56
       61 |          1 |         3 |            1 |                  0 |                       0 |                      1
       61 |          2 |         3 |            1 |                  0 |                       0 |                      1
       61 |          3 |         3 |            1 |                  0 |                       0 |                      1
(45 rows)

Time: 69.719958ms

root@localhost:26257/tpcc>

I don't buy that; the ranges.overreplicated metric flatlines at zero.

cc @andreimatei / @darinpp (cluster is tobias-1570324285-01-n12cpu4-geo)

@andreimatei
Contributor

andreimatei commented Oct 7, 2019 via email

@andreimatei
Contributor

Actually, @darinpp, if you get here before me and are so inclined, feel free to take a look. It'll have to wait a bit on my part.

@darinpp
Contributor

darinpp commented Oct 8, 2019

#41328 shows a case where you have 4 replicas in 19.2. It does seem like a bug, however.

@a-robinson
Contributor

When the range is split off with replicas on n3, n4, and n8, the constraints should be OK, but then we move the n4 replica to n3. I don't know why, but note the "constraint check fail".

...

That makes it even more mysterious why the allocator thinks the constraints are violated, but at least it's better than I initially thought.

In that screenshot, the details:(constraint check fail) field is part of the description of the existing replica that's being replaced. It's telling you that the reason for removing the replica on s4 and adding a replica on s2 is that s4 violates the constraints of the range.

@ajwerner
Contributor

@tbg can we close this?

@tbg tbg closed this as completed Feb 12, 2020
lunevalex added a commit to lunevalex/cockroach that referenced this issue Aug 19, 2020
Closes cockroachdb#6782

This change modifies the replica_queue to allow rebalances between multiple stores within a single node.
This is possible because we previously introduced atomic rebalances in cockroachdb#12768.

The first step was to remove the constraints in the allocator that prevented same node rebalances and update the validation in the replica_queue to accept these rebalance proposals. There is one caveat that with 1x replication an atomic rebalance is not possible, so we now support adding multiple replicas of the range to the same node under this condition.

With the constraints removed, there would be nothing in the allocator to prevent it from placing multiple replicas of a range on the same node across multiple stores. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. The allocator uses locality tags to model failure domains, but a node was not considered to be a locality. It is thus natural to extend the failure domain definition to the node and model it as a locality tier. Now stores on the same node would be factored into the diversity_score and repel each other, just like nodes in the same datacenter do in a multi-region setup.

Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.
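
A toy sketch (a deliberate simplification, not the allocator's actual scoring code) of why adding the node as a locality tier keeps two stores on the same node apart: the earlier two stores' locality tiers diverge, the higher their pairwise diversity.

```go
// Toy sketch (a simplification, not the allocator's real scoring) of how a
// "node" locality tier makes stores on the same node repel each other: the
// fewer leading tiers two stores share, the more diverse they are.
package sketch

// A locality is an ordered list of tiers, e.g.
// [region=us-east1, zone=us-east1-b, node=n4].
type locality []string

// diversity returns 1 for completely distinct localities and 0 for two
// stores on the same node; intermediate values reward earlier divergence.
func diversity(a, b locality) float64 {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			// Diverge at tier i: the earlier the divergence, the higher
			// the score.
			return float64(n-i) / float64(n)
		}
	}
	return 0 // identical down to the node tier: no diversity at all
}
```

Under this toy scoring, two stores on the same node score 0, a store on a sibling node in the same zone scores 1/3, and stores in different regions score 1, so a diversity-maximizing allocator spreads replicas across nodes before it ever doubles up on one.
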
lunevalex added a commit to lunevalex/cockroach that referenced this issue Aug 20, 2020
Closes cockroachdb#6782

This change modifies the replica_queue to allow rebalances between multiple stores within a single node.
This is possible because we previously introduced atomic rebalances in cockroachdb#12768.

The first step was to remove the constraints in the allocator that prevented same node rebalances and update the validation in the replica_queue to accept these rebalance proposals. There is one caveat that with 1x replication an atomic rebalance is not possible, so we now support adding multiple replicas of the range to the same node under this condition.

With the constraints removed, there would be nothing in the allocator to prevent it from placing multiple replicas of a range on the same node across multiple stores. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. The allocator uses locality tags to model failure domains, but a node was not considered to be a locality. It is thus natural to extend the failure domain definition to the node and model it as a locality tier. Now stores on the same node would be factored into the diversity_score and repel each other, just like nodes in the same datacenter do in a multi-region setup.

Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.
craig bot pushed a commit that referenced this issue Aug 20, 2020
50053: Script for the PublishRelease TC build configuration r=pbardea a=jlinder

Before: the script wasn't implemented.

Now:

Part of the new release process, this script

- tags the selected SHA to the provided name
- compiles the binaries and archive and uploads them to S3 as the
  versioned name
- uploads the docker image to docker.io/cockroachdb/cockroach
- pushes the tag to github.com/cockroachdb/cockroach
- pushes all the artifacts to their respective `latest` locations as
  appropriate

Release note: None

51567: kvserver: Allow rebalances between stores on the same nodes. r=lunevalex a=lunevalex

Closes #6782

This change modifies the replica_queue to allow rebalances between multiple stores within a single node. 
This is possible because we previously introduced atomic rebalances in #12768. 

The first step was to remove the constraints in the allocator that prevented same node rebalances and update the validation in the replica_queue to accept these rebalance proposals. There is one caveat that with 1x replication an atomic rebalance is not possible, so we now support adding multiple replicas of the range to the same node under this condition. 

With the constraints removed, there would be nothing in the allocator to prevent it from placing multiple replicas of a range on the same node across multiple stores. This is not desired because in a simple 3x replication scenario a single node crash may result in a whole range becoming unavailable. The allocator uses locality tags to model failure domains, but a node was not considered to be a locality. It is thus natural to extend the failure domain definition to the node and model it as a locality tier. Now stores on the same node would be factored into the diversity_score and repel each other, just like nodes in the same datacenter do in a multi-region setup.

Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.

52754: importccl: speed up revert of IMPORT INTO empty table r=dt a=dt

When IMPORT INTO fails, it reverts the tables to their pre-IMPORT state.
Typically this requires running a somewhat expensive RevertRange operation
that finds the keys written by the IMPORT in amongst all the table data
and deletes just those keys. This is somewhat expensive -- we need to
iterate the keys in the target table and check them to see if they
need to be reverted.

Non-INTO style IMPORTs create the table into which they will IMPORT and
thus can just drop it wholesale on failure, instead of doing this expensive
revert. However INTO-style IMPORTs could use a similarly fast/cheap
wholesale delete *if they knew the table was empty* when the IMPORT was
started.

This change tracks which tables were empty when the IMPORT started and
then deletes, rather than reverts, the table span on failure.

Release note (performance improvement): Cleaning up after a failure during IMPORT INTO a table which was empty is now faster.

53023: opt: add index acceleration support for ~ and && bounding box operators r=rytaft a=rytaft

This commit adds index acceleration support for the bounding box comparison
operators, `~` and `&&`. It maps `~` to Covers and `&&` to Intersects.

Release note (performance improvement): The ~ and && geospatial bounding
box operations can now benefit from index acceleration if one of the
operands is an indexed geometry column.

53049: bulkio: Fix transaction semantics in job scheduler. r=miretskiy a=miretskiy

Fixes #53033
Fixes #52959 

Use transaction when querying for the schedules to run.
In addition, ensure that a single bad schedule does not cause
all of the previous work to be wasted by using transaction savepoints.


Release Notes: None

53132: sql/opt: add implicit SELECT FOR UPDATE support for UPSERT statements r=nvanbenschoten a=nvanbenschoten

Fixes #50180.

This commit adds implicit SELECT FOR UPDATE support for UPSERT statements with a VALUES clause. This should improve throughput and latency for contended UPSERT statements in much the same way that 435fa43 did for UPDATE statements. However, this only has an effect on UPSERT statements into tables with multiple indexes, because UPSERT statements into single-index tables hit a fast path where they perform a blind write without doing an initial row scan.

Conceptually, if we picture an UPSERT statement as the composition of a SELECT statement and an INSERT statement (with loosened semantics around existing rows) then this change performs the following transformation:
```
UPSERT t = SELECT FROM t + INSERT INTO t
=>
UPSERT t = SELECT FROM t FOR UPDATE + INSERT INTO t
```

I plan to test this out on a contended `indexes` workload at some point in the future.

Release note (sql change): UPSERT statements now acquire locks using the FOR UPDATE locking mode during their initial row scan, which improves performance for contended workloads when UPSERTing into tables with multiple indexes. This behavior is configurable using the enable_implicit_select_for_update session variable and the sql.defaults.implicit_select_for_update.enabled cluster setting.

Co-authored-by: James H. Linder <jamesl@cockroachlabs.com>
Co-authored-by: Alex Lunev <alexl@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>