WIP: Add cluster upgrade logic for backward-compatible migrations #11658

a-robinson · 2016-11-28T15:16:58Z

As outlined in #9404.

I have yet to use this to make any substantial changes to a cluster -- I expect that to make doing so convenient this will have to move from its own package into the sql package. For example, it's not easy to add a new system table from outside the SQL package, since that sort of thing doesn't go through the public interfaces.

Anyway, while I put together a more proper SQL migration (such as creating system.jobs) and an approach for testing specific real migrations this afternoon, the core of the code is ready for review.

cc @vivekmenezes @paperstreet

This change is

bdarnell · 2016-12-02T08:06:45Z

Reviewed 4 of 4 files at r1, 4 of 4 files at r2, 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.

pkg/internal/client/lease.go, line 128 at r3 (raw file):

		if !l.val.Expiration.Equal(val.Expiration) {
			log.Warningf(ctx,
				"local lease expiration time %s not equal to persisted expiration time %s for key %s",

This should probably use a conditional put instead of manually checking for the expected value.

pkg/internal/client/lease.go, line 169 at r3 (raw file):

				val.Owner, m.clientID)
		}
		return txn.Put(l.key, nil)

Also use a conditional put here.

pkg/migrations/migrations.go, line 111 at r3 (raw file):

// initialized in order to mark all known migrations as complete.
func (m *Manager) InitNewCluster(ctx context.Context) error {
	return m.migrations(ctx, false /* runWorkFns */)

I think I'd rather run the migrations at new cluster initialization than have tables be created one way for new clusters and a different way for old clusters. We shouldn't have too many of these and they should be pretty fast on an empty cluster. (we could pass a flag in to tell the work functions that the cluster is starting from scratch if that could save a significant amount of work)

pkg/migrations/migrations.go, line 122 at r3 (raw file):

func (m *Manager) migrations(ctx context.Context, runWorkFns bool) error {
	// Before doing anything, grab the migration lease to ensure that only one

We should check whether any migrations are necessary before grabbing the lease (and then again after we have it). In the common case we won't have to do anything so we won't have to have every new server process do two writes to the lease record.

pkg/migrations/migrations.go, line 142 at r3 (raw file):

	}
	if err != nil {
		return errors.Wrapf(err, "failed to acquire lease for running necessary migrations")

Returning an error here kills the server. Instead, we need to wait for the lease holder to finish its migrations (or for the lease to expire so we can pick up the work).

pkg/migrations/migrations.go, line 166 at r3 (raw file):

			}
			if m.leaseManager.TimeRemaining(lease) < leaseRefreshInterval {
				log.Fatal(ctx, "not enough time left on migration lease, terminating for safety")

This seems harsh. Instead, could we use the lease expiration as a deadline for the transactions used by the work functions (similar to the way we use table metadata leases)?

pkg/migrations/migrations.go, line 219 at r3 (raw file):

		}
		var err error
		for r := retry.StartWithCtx(ctx, retry.Options{MaxRetries: 5}); r.Next(); {

We don't generally set MaxRetries in our retry loops, we just keep looping until they succeed. Or we just run things once and return the error (client.DB has internal retries so I don't think we generally need to wrap another loop around a single Put). Why do we loop here but we don't call the workFn in a loop, for example?

Comments from Reviewable

danhhz · 2016-12-02T15:50:44Z

Review status: all files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.

pkg/internal/client/lease.go, line 51 at r1 (raw file):

// LeaseManagerOptions are used to configure a new LeaseManager.
type LeaseManagerOptions struct {
	// clientID must be unique to this LeaseManager instance.

nit: ClientID

pkg/migrations/migrations.go, line 118 at r2 (raw file):

// safe to run).
func (m *Manager) EnsureMigrations(ctx context.Context) error {
	return m.migrations(ctx, true /* runWorkFns */)

I'm not a fan of the inline bool comments. They're harder to wrap, the stutter makes it harder to read, and it's too easy for them to rot. A little better is this pattern

type runWorkFns bool
const (
  doRunWorkFns runWorkFns = true
  dontRunWorkFns runWorkFns = false
)
m.migrations(ctx, doRunWorkFns)

pkg/migrations/migrations.go, line 147 at r2 (raw file):

	// Ensure that we hold the lease throughout the migration process and release
	// it when we're done.
	done := make(chan interface{}, 1)

maybe I need to rtfm, but how is a buffered channel of size 1 different than an unbuffered channel?

pkg/migrations/migrations.go, line 176 at r2 (raw file):

	// them up one-by-one, which could potentially add a noticeable delay to
	// process startup if we eventually have a lot of migrations to check for.
	var keyvals []client.KeyValue

you could tighten the scope on this by moving it inside the retry and completedMigrations up here

pkg/migrations/migrations.go, line 181 at r2 (raw file):

	}
	for r := retry.StartWithCtx(ctx, retry.Options{MaxRetries: 3}); r.Next(); {
		keyvals, err = m.db.Scan(ctx, keys.MigrationPrefix, keys.MigrationKeyMax, 10000 /* maxRows */)

Either use 0 (unlimited) or actually do the paging, plz

pkg/migrations/migrations.go, line 190 at r2 (raw file):

		return errors.Wrapf(err, "failed to get list of completed migrations")
	}
	completedMigrations := make(map[string]bool)

we use map[string]struct{} for this

pkg/migrations/migrations.go, line 237 at r2 (raw file):

func migrationKey(migration migrationDescriptor) roachpb.Key {
	return bytes.Join([][]byte{keys.MigrationPrefix, roachpb.RKey(migration.name)}, nil)

why not append(keys.MigrationPrefix, roachpb.RKey(migration.name)...)?

pkg/migrations/migrations.go, line 122 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We should check whether any migrations are necessary before grabbing the lease (and then again after we have it). In the common case we won't have to do anything so we won't have to have every new server process do two writes to the lease record.

Grabbing the lease before checking if there are migrations also linearizes startup of nodes that you bring up at the same time, no?

pkg/migrations/migrations_test.go, line 136 at r2 (raw file):

	}
	for i, migrations := range testCases {
		t.Run(fmt.Sprintf("%d", i), func(t *testing.T) {

if you pass an empty string to t.Run, it numbers them for you

pkg/server/server.go, line 505 at r2 (raw file):

	s.stopper.RunWorker(func() {
		<-serveSQL

this might be a bit too paranoid, but you could select between this and the s.stopper.stopper channel in case this node gets ctrl-c'd before the migrations finish

pkg/server/server.go, line 659 at r2 (raw file):

	// seamlessly use the kv client against other nodes in the cluster.
	migMgr := migrations.NewManager(
		s.stopper, s.db, s.sqlExecutor, s.clock, fmt.Sprintf("%d", s.NodeID()))

strconv.Itoa

Comments from Reviewable

a-robinson · 2016-12-02T19:39:02Z

Thanks for the reviews!

Review status: 3 of 9 files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.

pkg/internal/client/lease.go, line 51 at r1 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

nit: ClientID

Done.

pkg/internal/client/lease.go, line 128 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This should probably use a conditional put instead of manually checking for the expected value.

I kind of like having the added safety of detecting if our local state is incorrect and immediately clearing out our local lease if we're not actually the owner. We certainly don't expect that to ever happen, but if it did this would protect us against multiple nodes thinking they're the leaseholder for too long.

Do you think the safety check isn't worth the extra code?

pkg/internal/client/lease.go, line 169 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Also use a conditional put here.

Done. The change will mean we don't release the lease if we're wrong about its expiration time, but that'll just be annoying, not dangerous. We also don't expect it to ever happen, of course.

pkg/migrations/migrations.go, line 118 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

I'm not a fan of the inline bool comments. They're harder to wrap, the stutter makes it harder to read, and it's too easy for them to rot. A little better is this pattern
type runWorkFns bool
const (
  doRunWorkFns runWorkFns = true
  dontRunWorkFns runWorkFns = false
)
m.migrations(ctx, doRunWorkFns)

Sure, done.

pkg/migrations/migrations.go, line 147 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

maybe I need to rtfm, but how is a buffered channel of size 1 different than an unbuffered channel?

You can send one item on it without blocking:
https://play.golang.org/p/_YSEWGVzNP

It's not super important, but it allows us to release the lease and continue on with starting up the server without needing to block on the extend task.

pkg/migrations/migrations.go, line 176 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

you could tighten the scope on this by moving it inside the retry and completedMigrations up here

Eh, I realize it's pretty subjective, but I find it more readable when the code in the retry as short and simple as possible rather than sticking work inside of it that doesn't actually need to be retried.

pkg/migrations/migrations.go, line 181 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

Either use 0 (unlimited) or actually do the paging, plz

Ah, unlimited is perfect. Although I certainly hope we don't ever have 10k migrations.

pkg/migrations/migrations.go, line 190 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

we use map[string]struct{} for this

For memory usage reasons? I've always seen/used bools just because it makes reads against the map really simple, since the default value of bool works perfectly.

And we have at least a handful of places using a pattern like this (albeit sql/drop.go was added by me):

$ git grep "map\[.*\]bool"
acceptance/continuous_load_test.go:     started := make(map[int]bool)
internal/rsg/rsg.go:    seen  map[string]bool
internal/rsg/rsg.go:            seen:  make(map[string]bool),
migrations/migrations.go:       completedMigrations := make(map[string]bool)
server/admin.go:                oks map[roachpb.StoreID]bool
server/admin.go:                oks: make(map[roachpb.StoreID]bool),
sql/drop.go:    cascadedTables := make(map[sqlbase.ID]bool)
sql/drop.go:    dependentTables map[sqlbase.ID]bool, desc *sqlbase.TableDescriptor,
storage/client_split_test.go:           sids map[storagebase.CmdIDKey][2]bool
storage/client_split_test.go:   seen.sids = make(map[storagebase.CmdIDKey][2]bool)
storage/engine/rocksdb.go:      seenLevel := make(map[int]bool)
storage/node_liveness.go:func (nl *NodeLiveness) GetLivenessMap() map[roachpb.NodeID]bool {
storage/node_liveness.go:       lMap := map[roachpb.NodeID]bool{}
storage/node_liveness_test.go:  expectedLMap := map[roachpb.NodeID]bool{1: true, 2: true, 3: true}
storage/node_liveness_test.go:  expectedLMap = map[roachpb.NodeID]bool{1: true, 2: false, 3: false}
storage/queue_test.go:  shouldAddMap := map[*Replica]bool{
storage/store.go:       var livenessMap map[roachpb.NodeID]bool
util/leaktest/leaktest.go:      orig := map[string]bool{}
util/log/clog_test.go:var vGlobs = map[string]bool{
util/protoutil/clone.go:        known map[typeKey]bool
util/protoutil/clone.go:        types.known = make(map[typeKey]bool)
util/randutil/rand_test.go:     numbers := make(map[int]bool)

pkg/migrations/migrations.go, line 237 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

why not append(keys.MigrationPrefix, roachpb.RKey(migration.name)...)?

No good reason. Switched.

pkg/migrations/migrations.go, line 111 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think I'd rather run the migrations at new cluster initialization than have tables be created one way for new clusters and a different way for old clusters. We shouldn't have too many of these and they should be pretty fast on an empty cluster. (we could pass a flag in to tell the work functions that the cluster is starting from scratch if that could save a significant amount of work)

I agree it'd be ideal if all tables were created/modified the same way regardless of whether you're running an old cluster or a new cluster. However, I'm worried that might make some unit tests pretty awkward though -- for example, will SQL tests that rely on the system.jobs table or on an updated system.eventlog schema have to run the migration code in addition to the setup that they run today? Or if we do a SQL FormatVersion upgrade in one of these migrations, will we have to create initial tables on startup using the old FormatVersion such that all tables were upgraded to the new FormatVersion in the same way?

I can double check whether there's any effect after working through #5887, since my worries may be completely unfounded.

pkg/migrations/migrations.go, line 122 at r3 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

Grabbing the lease before checking if there are migrations also linearizes startup of nodes that you bring up at the same time, no?

Good point, it's as if I forgot that our kv store could handle concurrent operations 🐶

pkg/migrations/migrations.go, line 142 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Returning an error here kills the server. Instead, we need to wait for the lease holder to finish its migrations (or for the lease to expire so we can pick up the work).

That's fair, my time at Google probably influenced me a little too much here -- I've been conditioned to prefer failing fast at startup rather than waiting a long time, but I suppose that's the wrong tradeoff in a lot of other environments. Switched to retrying indefinitely.

pkg/migrations/migrations.go, line 166 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This seems harsh. Instead, could we use the lease expiration as a deadline for the transactions used by the work functions (similar to the way we use table metadata leases)?

If I'm understanding correctly, that would assume that the work functions will use kv transactions and that they won't take longer than the initial lease duration, since it looks like we don't extend transactions' deadlines.

I'll cancel the runner's context instead, given that that's a little more graceful. Is that sufficient, or do you think we should do more?

pkg/migrations/migrations.go, line 219 at r3 (raw file):
Ok, the internal retries are probably enough, so I've removed the retry loop. It was just an attempt to be overly protective of not wasting the work done in the workFn by failing to persist the fact that it got done.

I've also removed the retry loop when getting the list of completed migrations.

Why do we loop here but we don't call the workFn in a loop, for example?

I figured the details of the workFn could handle its own retries internally based on the particular logic involved.

pkg/migrations/migrations_test.go, line 136 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

if you pass an empty string to t.Run, it numbers them for you

Well that's handy. Done. We have a good handful of other tests that do this too. I'll send out a separate PR to clean those up.

pkg/server/server.go, line 505 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

this might be a bit too paranoid, but you could select between this and the s.stopper.stopper channel in case this node gets ctrl-c'd before the migrations finish

Good call, although we'll have to fix #10138 before the stopper will ever really help on a real node.

pkg/server/server.go, line 659 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

strconv.Itoa

Done. Although after adding in the int cast, I'm not sure it's much better.

Comments from Reviewable

danhhz · 2016-12-02T20:37:28Z

Review status: 3 of 9 files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.

pkg/internal/client/lease.go, line 128 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I kind of like having the added safety of detecting if our local state is incorrect and immediately clearing out our local lease if we're not actually the owner. We certainly don't expect that to ever happen, but if it did this would protect us against multiple nodes thinking they're the leaseholder for too long.

Do you think the safety check isn't worth the extra code?

A conditional put failure returns a ConditionFailedError, which you can catch and use to do these extra safety checks

pkg/migrations/migrations.go, line 147 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

You can send one item on it without blocking:
https://play.golang.org/p/_YSEWGVzNP

It's not super important, but it allows us to release the lease and continue on with starting up the server without needing to block on the extend task.

Makes sense

pkg/migrations/migrations.go, line 190 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

For memory usage reasons? I've always seen/used bools just because it makes reads against the map really simple, since the default value of bool works perfectly.

And we have at least a handful of places using a pattern like this (albeit sql/drop.go was added by me):

$ git grep "map\[.*\]bool"
acceptance/continuous_load_test.go:     started := make(map[int]bool)
internal/rsg/rsg.go:    seen  map[string]bool
internal/rsg/rsg.go:            seen:  make(map[string]bool),
migrations/migrations.go:       completedMigrations := make(map[string]bool)
server/admin.go:                oks map[roachpb.StoreID]bool
server/admin.go:                oks: make(map[roachpb.StoreID]bool),
sql/drop.go:    cascadedTables := make(map[sqlbase.ID]bool)
sql/drop.go:    dependentTables map[sqlbase.ID]bool, desc *sqlbase.TableDescriptor,
storage/client_split_test.go:           sids map[storagebase.CmdIDKey][2]bool
storage/client_split_test.go:   seen.sids = make(map[storagebase.CmdIDKey][2]bool)
storage/engine/rocksdb.go:      seenLevel := make(map[int]bool)
storage/node_liveness.go:func (nl *NodeLiveness) GetLivenessMap() map[roachpb.NodeID]bool {
storage/node_liveness.go:       lMap := map[roachpb.NodeID]bool{}
storage/node_liveness_test.go:  expectedLMap := map[roachpb.NodeID]bool{1: true, 2: true, 3: true}
storage/node_liveness_test.go:  expectedLMap = map[roachpb.NodeID]bool{1: true, 2: false, 3: false}
storage/queue_test.go:  shouldAddMap := map[*Replica]bool{
storage/store.go:       var livenessMap map[roachpb.NodeID]bool
util/leaktest/leaktest.go:      orig := map[string]bool{}
util/log/clog_test.go:var vGlobs = map[string]bool{
util/protoutil/clone.go:        known map[typeKey]bool
util/protoutil/clone.go:        types.known = make(map[typeKey]bool)
util/randutil/rand_test.go:     numbers := make(map[int]bool)

Do the same search for `git grep "map[.*]struct" : - )

Yeah less memory was the original motivation, but in this case it's more for consistency

pkg/migrations/migrations.go, line 142 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

That's fair, my time at Google probably influenced me a little too much here -- I've been conditioned to prefer failing fast at startup rather than waiting a long time, but I suppose the wrong tradeoff in a lot of other environments. Switched to retrying indefinitely.

I think the people I've asked about this are fairly evenly divided on whether to block or exit. I was originally imaging erroring because at Foursquare that would have made it obvious much more quickly that your roll was bad

Comments from Reviewable

a-robinson · 2016-12-02T21:38:29Z

Review status: 2 of 9 files reviewed at latest revision, 17 unresolved discussions.

pkg/internal/client/lease.go, line 128 at r3 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

A conditional put failure returns a ConditionFailedError, which you can catch and use to do these extra safety checks

Ah, thanks for pointing that out.

pkg/migrations/migrations.go, line 190 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

Do the same search for `git grep "map[.*]struct" : - )

Yeah less memory was the original motivation, but in this case it's more for consistency

Touche :) I can't say I've changed my preference yet, but I've changed the code.

pkg/migrations/migrations.go, line 142 at r3 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

I think the people I've asked about this are fairly evenly divided on whether to block or exit. I was originally imaging erroring because at Foursquare that would have made it obvious much more quickly that your roll was bad

Right, that's been my experience as well. I'll wait for Ben to chime in again before changing anything.

Comments from Reviewable

bdarnell · 2016-12-03T04:25:04Z

Reviewed 2 of 7 files at r4, 4 of 4 files at r5, 1 of 1 files at r6.
Review status: all files reviewed at latest revision, 16 unresolved discussions, all commit checks successful.

pkg/migrations/migrations.go, line 190 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Touche :) I can't say I've changed my preference yet, but I've changed the code.

It's not just for the memory savings - the struct{} version guarantees that there can never be a false value stored in the map so you don't have to worry about the difference between missing and false.

pkg/migrations/migrations.go, line 142 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Right, that's been my experience as well. I'll wait for Ben to chime in again before changing anything.

If another node acquired the lease, then they're working on the migrations and may need this node to be up for those migrations to complete. We need to treat "another node has the lease" separately from other errors (for other errors, bailing out early is fine).

pkg/migrations/migrations.go, line 166 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

If I'm understanding correctly, that would assume that the work functions will use kv transactions and that they won't take longer than the initial lease duration, since it looks like we don't extend transactions' deadlines.

I'll cancel the runner's context instead, given that that's a little more graceful. Is that sufficient, or do you think we should do more?

Yes, that's assuming that the migrations would either use KV transactions or we'd build a similar deadline mechanism for whatever they use instead. We may not extend transactions' deadlines today but I think we could if we wanted to (the only time the deadline really matters is on EndTransaction).

Canceling the context is fine, but no matter what we do here there's going to be a chance of two migrations running concurrently for at least a short time. Only a deadline in the migrations themselves can guard against that completely.

pkg/server/server.go, line 659 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Done. Although after adding in the int cast, I'm not sure it's much better.

This could be s.NodeID().String() instead.

Comments from Reviewable

a-robinson · 2016-12-05T22:17:39Z

Review status: all files reviewed at latest revision, 16 unresolved discussions, all commit checks successful.

pkg/migrations/migrations.go, line 142 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If another node acquired the lease, then they're working on the migrations and may need this node to be up for those migrations to complete. We need to treat "another node has the lease" separately from other errors (for other errors, bailing out early is fine).

Gah, of course! Thanks for bringing that up. That hadn't come up in an issue in my testing because the migrations I was running were fast enough that the other nodes didn't time out while waiting for the lease.

I'll leave this as is to just continue blocking and logging the error, but I've added an error type to the lease client in case another use case needs to differentiate or we decide to bail out for other errors.

pkg/migrations/migrations.go, line 166 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yes, that's assuming that the migrations would either use KV transactions or we'd build a similar deadline mechanism for whatever they use instead. We may not extend transactions' deadlines today but I think we could if we wanted to (the only time the deadline really matters is on EndTransaction).

Canceling the context is fine, but no matter what we do here there's going to be a chance of two migrations running concurrently for at least a short time. Only a deadline in the migrations themselves can guard against that completely.

Crashing the server could also guard against that pretty completely, but I understand your point.

I may be missing some subtlety here, but at a high level it seems like as long as the migrations are idempotent (which we should be requiring anyway), wrapping them in ACID transactions should mean migrations running on the same time on different machines should actually be safe anyways, since they'll be serialized anyway.

What change do you propose I make here? Implement a way to pass an updatable deadline to the migrations' transaction? Institute a standard practice of the migrations checking the lease expiration time before attempting to commit their changes? Leave it as is? Go back to crashing?

pkg/server/server.go, line 659 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This could be s.NodeID().String() instead.

Done, thanks.

Comments from Reviewable

bdarnell · 2016-12-06T05:51:31Z

Reviewed 2 of 7 files at r7, 4 of 4 files at r8, 1 of 1 files at r9.
Review status: all files reviewed at latest revision, 15 unresolved discussions, all commit checks successful.

pkg/migrations/migrations.go, line 111 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I agree it'd be ideal if all tables were created/modified the same way regardless of whether you're running an old cluster or a new cluster. However, I'm worried that might make some unit tests pretty awkward though -- for example, will SQL tests that rely on the system.jobs table or on an updated system.eventlog schema have to run the migration code in addition to the setup that they run today? Or if we do a SQL FormatVersion upgrade in one of these migrations, will we have to create initial tables on startup using the old FormatVersion such that all tables were upgraded to the new FormatVersion in the same way?

I can double check whether there's any effect after working through #5887, since my worries may be completely unfounded.

Yeah, my thinking is that tests would run through the migrations on startup and we'd create the initial tables using their original FormatVersion and migrate forward. I'm expecting that these migrations will be rare enough (and cheap enough on empty in-memory databases) that it won't be too painful to add a little setup work. I'm willing to change my mind here if it's too expensive to always go through the migration process, but I think I'd rather start by always doing it the same way.

pkg/migrations/migrations.go, line 166 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Crashing the server could also guard against that pretty completely, but I understand your point.

I may be missing some subtlety here, but at a high level it seems like as long as the migrations are idempotent (which we should be requiring anyway), wrapping them in ACID transactions should mean migrations running on the same time on different machines should actually be safe anyways, since they'll be serialized anyway.

What change do you propose I make here? Implement a way to pass an updatable deadline to the migrations' transaction? Institute a standard practice of the migrations checking the lease expiration time before attempting to commit their changes? Leave it as is? Go back to crashing?

Crashing the server still leaves a small race - what if we blocked in ExtendLease for an unexpectedly long time and we're now well past our lease expiration? But you're right that we'll need to make these migrations idempotent (if they're not transaction-based), so maybe it doesn't matter.

I think for now let's go back to crashing if we're unable to extend the lease; this is something to keep in mind as we're implementing actual migrations.

Comments from Reviewable

a-robinson · 2016-12-06T15:34:14Z

Review status: all files reviewed at latest revision, 15 unresolved discussions, all commit checks successful.

pkg/migrations/migrations.go, line 111 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yeah, my thinking is that tests would run through the migrations on startup and we'd create the initial tables using their original FormatVersion and migrate forward. I'm expecting that these migrations will be rare enough (and cheap enough on empty in-memory databases) that it won't be too painful to add a little setup work. I'm willing to change my mind here if it's too expensive to always go through the migration process, but I think I'd rather start by always doing it the same way.

Alright, removed all the InitNewCluster stuff. I hope it works out this way!

pkg/migrations/migrations.go, line 166 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Crashing the server still leaves a small race - what if we blocked in ExtendLease for an unexpectedly long time and we're now well past our lease expiration? But you're right that we'll need to make these migrations idempotent (if they're not transaction-based), so maybe it doesn't matter.

I think for now let's go back to crashing if we're unable to extend the lease; this is something to keep in mind as we're implementing actual migrations.

Good point, makes sense. Switched back to killing the process for now, along with a note that we may want to improve it.

Comments from Reviewable

danhhz · 2016-12-12T16:36:52Z

but you should probably wait for @bdarnell. Thanks for picking this off!

Review status: 4 of 9 files reviewed at latest revision, 16 unresolved discussions, all commit checks successful.

pkg/server/server.go, line 515 at r11 (raw file):

		select {
		case <-serveSQL:
		case <-s.stopper.ShouldStop():

I'm not sure either way, but maybe this should be ShouldQuiese

Comments from Reviewable

a-robinson · 2016-12-12T16:50:15Z

Review status: 4 of 9 files reviewed at latest revision, 16 unresolved discussions, all commit checks successful.

pkg/server/server.go, line 515 at r11 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

I'm not sure either way, but maybe this should be ShouldQuiese

You're right, done.

Comments from Reviewable

bdarnell · 2016-12-13T05:15:50Z

Reviewed 3 of 4 files at r11, 1 of 1 files at r12, 1 of 1 files at r13.
Review status: all files reviewed at latest revision, 16 unresolved discussions, all commit checks successful.

Comments from Reviewable

Built to initially be used by the migration code since our existing lease mechanisms for ranges and SQL schemas aren't very generalizable.

This is needed to allow us to make certain changes between versions that wouldn't be safe otherwise.

Now that there's a delay between when a node starts serving kv and when it starts serving SQL (due to the SQL migration code path), acceptance tests that rely on SQL can have a bad time without this.

a-robinson force-pushed the migrations branch 8 times, most recently from accd9f8 to 9f29b78 Compare November 30, 2016 17:08

This was referenced Nov 30, 2016

sql: Add system.jobs table and migration to add it to existing DBs #11722

Closed

sql: remove experimental_unique_bytes (requires system table migration) #5887

Closed

dianasaur323 mentioned this pull request Dec 2, 2016

Product Roadmap Q4 2016 #10528

Closed

37 tasks

a-robinson force-pushed the migrations branch from 9f29b78 to 117fc42 Compare December 2, 2016 19:38

a-robinson mentioned this pull request Dec 2, 2016

*: Don't pass testcase index to t.Run() #11788

Merged

a-robinson force-pushed the migrations branch from 117fc42 to 3660725 Compare December 2, 2016 21:38

a-robinson force-pushed the migrations branch 2 times, most recently from 7b7bf34 to 1944f02 Compare December 5, 2016 22:23

a-robinson force-pushed the migrations branch from 1944f02 to 411b40c Compare December 6, 2016 15:34

a-robinson force-pushed the migrations branch from 411b40c to bb727fe Compare December 6, 2016 21:12

a-robinson mentioned this pull request Dec 7, 2016

sql: Add migration to change schema of system.eventlog #12158

Closed

a-robinson force-pushed the migrations branch from bb727fe to 4cc1a79 Compare December 12, 2016 16:50

a-robinson added 3 commits December 14, 2016 10:01

client: Add a package or managing leases on arbitrary kv keys

e745b63

Built to initially be used by the migration code since our existing lease mechanisms for ranges and SQL schemas aren't very generalizable.

migrations: Add a package to manage system migrations on startup

3a603c7

This is needed to allow us to make certain changes between versions that wouldn't be safe otherwise.

acceptance: Wait in StartCluster until nodes can serve SQL

967327d

Now that there's a delay between when a node starts serving kv and when it starts serving SQL (due to the SQL migration code path), acceptance tests that rely on SQL can have a bad time without this.

a-robinson force-pushed the migrations branch from 4cc1a79 to 967327d Compare December 14, 2016 15:09

a-robinson merged commit b7e146d into cockroachdb:master Dec 14, 2016

a-robinson mentioned this pull request Dec 14, 2016

sql: system objects should be created if they don't exist #4267

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Add cluster upgrade logic for backward-compatible migrations #11658

WIP: Add cluster upgrade logic for backward-compatible migrations #11658

a-robinson commented Nov 28, 2016 •

edited by rjnn

Loading

bdarnell commented Dec 2, 2016

danhhz commented Dec 2, 2016

a-robinson commented Dec 2, 2016 •

edited

Loading

danhhz commented Dec 2, 2016

a-robinson commented Dec 2, 2016

bdarnell commented Dec 3, 2016

a-robinson commented Dec 5, 2016

bdarnell commented Dec 6, 2016

a-robinson commented Dec 6, 2016

danhhz commented Dec 12, 2016

a-robinson commented Dec 12, 2016

bdarnell commented Dec 13, 2016

WIP: Add cluster upgrade logic for backward-compatible migrations #11658

WIP: Add cluster upgrade logic for backward-compatible migrations #11658

Conversation

a-robinson commented Nov 28, 2016 • edited by rjnn Loading

bdarnell commented Dec 2, 2016

danhhz commented Dec 2, 2016

a-robinson commented Dec 2, 2016 • edited Loading

danhhz commented Dec 2, 2016

a-robinson commented Dec 2, 2016

bdarnell commented Dec 3, 2016

a-robinson commented Dec 5, 2016

bdarnell commented Dec 6, 2016

a-robinson commented Dec 6, 2016

danhhz commented Dec 12, 2016

a-robinson commented Dec 12, 2016

bdarnell commented Dec 13, 2016

a-robinson commented Nov 28, 2016 •

edited by rjnn

Loading

a-robinson commented Dec 2, 2016 •

edited

Loading