
WIP: Add cluster upgrade logic for backward-compatible migrations #11658

Merged (3 commits) on Dec 14, 2016

Conversation

@a-robinson (Contributor) commented Nov 28, 2016

As outlined in #9404.

I have yet to use this to make any substantial changes to a cluster -- I expect that, to make doing so convenient, this will have to move from its own package into the sql package. For example, it's not easy to add a new system table from outside the SQL package, since that sort of thing doesn't go through the public interfaces.

Anyway, while I put together a more proper SQL migration (such as creating system.jobs) and an approach for testing specific real migrations this afternoon, the core of the code is ready for review.

cc @vivekmenezes @paperstreet



@bdarnell (Contributor) commented Dec 2, 2016

Reviewed 4 of 4 files at r1, 4 of 4 files at r2, 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


pkg/internal/client/lease.go, line 128 at r3 (raw file):

		if !l.val.Expiration.Equal(val.Expiration) {
			log.Warningf(ctx,
				"local lease expiration time %s not equal to persisted expiration time %s for key %s",

This should probably use a conditional put instead of manually checking for the expected value.


pkg/internal/client/lease.go, line 169 at r3 (raw file):

				val.Owner, m.clientID)
		}
		return txn.Put(l.key, nil)

Also use a conditional put here.


pkg/migrations/migrations.go, line 111 at r3 (raw file):

// initialized in order to mark all known migrations as complete.
func (m *Manager) InitNewCluster(ctx context.Context) error {
	return m.migrations(ctx, false /* runWorkFns */)

I think I'd rather run the migrations at new cluster initialization than have tables be created one way for new clusters and a different way for old clusters. We shouldn't have too many of these and they should be pretty fast on an empty cluster. (we could pass a flag in to tell the work functions that the cluster is starting from scratch if that could save a significant amount of work)


pkg/migrations/migrations.go, line 122 at r3 (raw file):

func (m *Manager) migrations(ctx context.Context, runWorkFns bool) error {
	// Before doing anything, grab the migration lease to ensure that only one

We should check whether any migrations are necessary before grabbing the lease (and then again after we have it). In the common case we won't have to do anything so we won't have to have every new server process do two writes to the lease record.


pkg/migrations/migrations.go, line 142 at r3 (raw file):

	}
	if err != nil {
		return errors.Wrapf(err, "failed to acquire lease for running necessary migrations")

Returning an error here kills the server. Instead, we need to wait for the lease holder to finish its migrations (or for the lease to expire so we can pick up the work).


pkg/migrations/migrations.go, line 166 at r3 (raw file):

			}
			if m.leaseManager.TimeRemaining(lease) < leaseRefreshInterval {
				log.Fatal(ctx, "not enough time left on migration lease, terminating for safety")

This seems harsh. Instead, could we use the lease expiration as a deadline for the transactions used by the work functions (similar to the way we use table metadata leases)?
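
(As a generic illustration of Ben's suggestion -- not code from this PR -- one way to bound a work function by the lease expiration is a context deadline. All names below are hypothetical, and only the standard library is used.)

package main

import (
	"context"
	"fmt"
	"time"
)

// runWithLeaseDeadline derives the work function's context from the lease
// expiration so that the work cannot outlive the lease. Hypothetical sketch.
func runWithLeaseDeadline(ctx context.Context, leaseExpiration time.Time, workFn func(context.Context) error) error {
	// Leave a small buffer so the work observes the deadline slightly before
	// the lease actually lapses.
	workCtx, cancel := context.WithDeadline(ctx, leaseExpiration.Add(-time.Second))
	defer cancel()
	return workFn(workCtx)
}

func main() {
	expiration := time.Now().Add(3 * time.Second) // pretend lease expiry
	err := runWithLeaseDeadline(context.Background(), expiration, func(ctx context.Context) error {
		select {
		case <-time.After(5 * time.Second): // a migration that takes too long
			return nil
		case <-ctx.Done():
			return ctx.Err() // context.DeadlineExceeded
		}
	})
	fmt.Println(err)
}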


pkg/migrations/migrations.go, line 219 at r3 (raw file):

		}
		var err error
		for r := retry.StartWithCtx(ctx, retry.Options{MaxRetries: 5}); r.Next(); {

We don't generally set MaxRetries in our retry loops, we just keep looping until they succeed. Or we just run things once and return the error (client.DB has internal retries so I don't think we generally need to wrap another loop around a single Put). Why do we loop here but we don't call the workFn in a loop, for example?



@danhhz (Contributor) commented Dec 2, 2016

Review status: all files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.


pkg/internal/client/lease.go, line 51 at r1 (raw file):

// LeaseManagerOptions are used to configure a new LeaseManager.
type LeaseManagerOptions struct {
	// clientID must be unique to this LeaseManager instance.

nit: ClientID


pkg/migrations/migrations.go, line 118 at r2 (raw file):

// safe to run).
func (m *Manager) EnsureMigrations(ctx context.Context) error {
	return m.migrations(ctx, true /* runWorkFns */)

I'm not a fan of the inline bool comments. They're harder to wrap, the stutter makes it harder to read, and it's too easy for them to rot. A little better is this pattern

type runWorkFns bool
const (
  doRunWorkFns runWorkFns = true
  dontRunWorkFns runWorkFns = false
)
m.migrations(ctx, doRunWorkFns)

pkg/migrations/migrations.go, line 147 at r2 (raw file):

	// Ensure that we hold the lease throughout the migration process and release
	// it when we're done.
	done := make(chan interface{}, 1)

maybe I need to rtfm, but how is a buffered channel of size 1 different than an unbuffered channel?


pkg/migrations/migrations.go, line 176 at r2 (raw file):

	// them up one-by-one, which could potentially add a noticeable delay to
	// process startup if we eventually have a lot of migrations to check for.
	var keyvals []client.KeyValue

you could tighten the scope on this by moving it inside the retry, and move completedMigrations up here


pkg/migrations/migrations.go, line 181 at r2 (raw file):

	}
	for r := retry.StartWithCtx(ctx, retry.Options{MaxRetries: 3}); r.Next(); {
		keyvals, err = m.db.Scan(ctx, keys.MigrationPrefix, keys.MigrationKeyMax, 10000 /* maxRows */)

Either use 0 (unlimited) or actually do the paging, plz


pkg/migrations/migrations.go, line 190 at r2 (raw file):

		return errors.Wrapf(err, "failed to get list of completed migrations")
	}
	completedMigrations := make(map[string]bool)

we use map[string]struct{} for this


pkg/migrations/migrations.go, line 237 at r2 (raw file):

func migrationKey(migration migrationDescriptor) roachpb.Key {
	return bytes.Join([][]byte{keys.MigrationPrefix, roachpb.RKey(migration.name)}, nil)

why not append(keys.MigrationPrefix, roachpb.RKey(migration.name)...)?
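
(As a generic aside on the trade-off -- not code from this PR -- bytes.Join always allocates a fresh slice, while a bare append can write into the prefix's backing array if it has spare capacity, so copying the prefix first is the defensive form. Placeholder byte slices below, not the real keys package.)

package main

import (
	"bytes"
	"fmt"
)

func main() {
	prefix := []byte("migration-") // stand-in for keys.MigrationPrefix
	name := []byte("create-system-jobs")

	// bytes.Join always returns a freshly allocated slice.
	joined := bytes.Join([][]byte{prefix, name}, nil)

	// A bare append(prefix, name...) may reuse prefix's backing array if it
	// has spare capacity; copying the prefix first guarantees each key gets
	// its own storage.
	appended := append(append([]byte(nil), prefix...), name...)

	fmt.Println(bytes.Equal(joined, appended)) // true
}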


pkg/migrations/migrations.go, line 122 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We should check whether any migrations are necessary before grabbing the lease (and then again after we have it). In the common case we won't have to do anything so we won't have to have every new server process do two writes to the lease record.

Grabbing the lease before checking if there are migrations also linearizes startup of nodes that you bring up at the same time, no?


pkg/migrations/migrations_test.go, line 136 at r2 (raw file):

	}
	for i, migrations := range testCases {
		t.Run(fmt.Sprintf("%d", i), func(t *testing.T) {

if you pass an empty string to t.Run, it numbers them for you
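
(For readers unfamiliar with this behavior: with an empty name, go test reports the subtests as #00, #01, and so on. A minimal standalone example, not from this PR:)

package example

import "testing"

func TestCases(t *testing.T) {
	testCases := []int{10, 20, 30}
	for _, tc := range testCases {
		tc := tc
		// With an empty name, `go test -v` reports these subtests as
		// TestCases/#00, TestCases/#01, TestCases/#02.
		t.Run("", func(t *testing.T) {
			if tc <= 0 {
				t.Fatalf("expected a positive value, got %d", tc)
			}
		})
	}
}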


pkg/server/server.go, line 505 at r2 (raw file):

	s.stopper.RunWorker(func() {
		<-serveSQL

this might be a bit too paranoid, but you could select between this and the s.stopper.stopper channel in case this node gets ctrl-c'd before the migrations finish


pkg/server/server.go, line 659 at r2 (raw file):

	// seamlessly use the kv client against other nodes in the cluster.
	migMgr := migrations.NewManager(
		s.stopper, s.db, s.sqlExecutor, s.clock, fmt.Sprintf("%d", s.NodeID()))

strconv.Itoa



@a-robinson (Contributor, Author) commented Dec 2, 2016

Thanks for the reviews!


Review status: 3 of 9 files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.


pkg/internal/client/lease.go, line 51 at r1 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

nit: ClientID

Done.


pkg/internal/client/lease.go, line 128 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This should probably use a conditional put instead of manually checking for the expected value.

I kind of like having the added safety of detecting if our local state is incorrect and immediately clearing out our local lease if we're not actually the owner. We certainly don't expect that to ever happen, but if it did this would protect us against multiple nodes thinking they're the leaseholder for too long.

Do you think the safety check isn't worth the extra code?


pkg/internal/client/lease.go, line 169 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Also use a conditional put here.

Done. The change will mean we don't release the lease if we're wrong about its expiration time, but that'll just be annoying, not dangerous. We also don't expect it to ever happen, of course.


pkg/migrations/migrations.go, line 118 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

I'm not a fan of the inline bool comments. They're harder to wrap, the stutter makes it harder to read, and it's too easy for them to rot. A little better is this pattern

type runWorkFns bool
const (
  doRunWorkFns runWorkFns = true
  dontRunWorkFns runWorkFns = false
)
m.migrations(ctx, doRunWorkFns)

Sure, done.


pkg/migrations/migrations.go, line 147 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

maybe I need to rtfm, but how is a buffered channel of size 1 different than an unbuffered channel?

You can send one item on it without blocking:
https://play.golang.org/p/_YSEWGVzNP

It's not super important, but it allows us to release the lease and continue on with starting up the server without needing to block on the extend task.
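
(A self-contained illustration of the same point, separate from the playground link above: the first send on a buffered channel of size 1 returns immediately, whereas a send on an unbuffered channel blocks until someone receives.)

package main

import "fmt"

func main() {
	buffered := make(chan struct{}, 1)
	buffered <- struct{}{} // returns immediately; the value sits in the buffer
	fmt.Println("sent without any receiver")

	unbuffered := make(chan struct{})
	select {
	case unbuffered <- struct{}{}:
		fmt.Println("unreachable: no receiver is ready")
	default:
		fmt.Println("a send on the unbuffered channel would block")
	}
}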


pkg/migrations/migrations.go, line 176 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

you could tighten the scope on this by moving it inside the retry, and move completedMigrations up here

Eh, I realize it's pretty subjective, but I find it more readable when the code in the retry is as short and simple as possible rather than sticking work inside of it that doesn't actually need to be retried.


pkg/migrations/migrations.go, line 181 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

Either use 0 (unlimited) or actually do the paging, plz

Ah, unlimited is perfect. Although I certainly hope we don't ever have 10k migrations.


pkg/migrations/migrations.go, line 190 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

we use map[string]struct{} for this

For memory usage reasons? I've always seen/used bools just because it makes reads against the map really simple, since the default value of bool works perfectly.

And we have at least a handful of places using a pattern like this (albeit sql/drop.go was added by me):

$ git grep "map\[.*\]bool"
acceptance/continuous_load_test.go:     started := make(map[int]bool)
internal/rsg/rsg.go:    seen  map[string]bool
internal/rsg/rsg.go:            seen:  make(map[string]bool),
migrations/migrations.go:       completedMigrations := make(map[string]bool)
server/admin.go:                oks map[roachpb.StoreID]bool
server/admin.go:                oks: make(map[roachpb.StoreID]bool),
sql/drop.go:    cascadedTables := make(map[sqlbase.ID]bool)
sql/drop.go:    dependentTables map[sqlbase.ID]bool, desc *sqlbase.TableDescriptor,
storage/client_split_test.go:           sids map[storagebase.CmdIDKey][2]bool
storage/client_split_test.go:   seen.sids = make(map[storagebase.CmdIDKey][2]bool)
storage/engine/rocksdb.go:      seenLevel := make(map[int]bool)
storage/node_liveness.go:func (nl *NodeLiveness) GetLivenessMap() map[roachpb.NodeID]bool {
storage/node_liveness.go:       lMap := map[roachpb.NodeID]bool{}
storage/node_liveness_test.go:  expectedLMap := map[roachpb.NodeID]bool{1: true, 2: true, 3: true}
storage/node_liveness_test.go:  expectedLMap = map[roachpb.NodeID]bool{1: true, 2: false, 3: false}
storage/queue_test.go:  shouldAddMap := map[*Replica]bool{
storage/store.go:       var livenessMap map[roachpb.NodeID]bool
util/leaktest/leaktest.go:      orig := map[string]bool{}
util/log/clog_test.go:var vGlobs = map[string]bool{
util/protoutil/clone.go:        known map[typeKey]bool
util/protoutil/clone.go:        types.known = make(map[typeKey]bool)
util/randutil/rand_test.go:     numbers := make(map[int]bool)

pkg/migrations/migrations.go, line 237 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

why not append(keys.MigrationPrefix, roachpb.RKey(migration.name)...)?

No good reason. Switched.


pkg/migrations/migrations.go, line 111 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think I'd rather run the migrations at new cluster initialization than have tables be created one way for new clusters and a different way for old clusters. We shouldn't have too many of these and they should be pretty fast on an empty cluster. (we could pass a flag in to tell the work functions that the cluster is starting from scratch if that could save a significant amount of work)

I agree it'd be ideal if all tables were created/modified the same way regardless of whether you're running an old cluster or a new cluster. However, I'm worried that might make some unit tests pretty awkward -- for example, will SQL tests that rely on the system.jobs table or on an updated system.eventlog schema have to run the migration code in addition to the setup that they run today? Or if we do a SQL FormatVersion upgrade in one of these migrations, will we have to create initial tables on startup using the old FormatVersion so that all tables get upgraded to the new FormatVersion in the same way?

I can double check whether there's any effect after working through #5887, since my worries may be completely unfounded.


pkg/migrations/migrations.go, line 122 at r3 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

Grabbing the lease before checking if there are migrations also linearizes startup of nodes that you bring up at the same time, no?

Good point, it's as if I forgot that our kv store could handle concurrent operations 🐶


pkg/migrations/migrations.go, line 142 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Returning an error here kills the server. Instead, we need to wait for the lease holder to finish its migrations (or for the lease to expire so we can pick up the work).

That's fair, my time at Google probably influenced me a little too much here -- I've been conditioned to prefer failing fast at startup rather than waiting a long time, but I suppose that's the wrong tradeoff in a lot of other environments. Switched to retrying indefinitely.


pkg/migrations/migrations.go, line 166 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This seems harsh. Instead, could we use the lease expiration as a deadline for the transactions used by the work functions (similar to the way we use table metadata leases)?

If I'm understanding correctly, that would assume that the work functions will use kv transactions and that they won't take longer than the initial lease duration, since it looks like we don't extend transactions' deadlines.

I'll cancel the runner's context instead, given that that's a little more graceful. Is that sufficient, or do you think we should do more?


pkg/migrations/migrations.go, line 219 at r3 (raw file):
Ok, the internal retries are probably enough, so I've removed the retry loop. It was just an attempt to be extra careful not to waste the work done in the workFn by failing to persist the fact that it got done.

I've also removed the retry loop when getting the list of completed migrations.

Why do we loop here but we don't call the workFn in a loop, for example?

I figured the details of the workFn could handle its own retries internally based on the particular logic involved.


pkg/migrations/migrations_test.go, line 136 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

if you pass an empty string to t.Run, it numbers them for you

Well that's handy. Done. We have a good handful of other tests that do this too. I'll send out a separate PR to clean those up.


pkg/server/server.go, line 505 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

this might be a bit too paranoid, but you could select between this and the s.stopper.stopper channel in case this node gets ctrl-c'd before the migrations finish

Good call, although we'll have to fix #10138 before the stopper will ever really help on a real node.


pkg/server/server.go, line 659 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

strconv.Itoa

Done. Although after adding in the int cast, I'm not sure it's much better.



@danhhz (Contributor) commented Dec 2, 2016

Review status: 3 of 9 files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.


pkg/internal/client/lease.go, line 128 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I kind of like having the added safety of detecting if our local state is incorrect and immediately clearing out our local lease if we're not actually the owner. We certainly don't expect that to ever happen, but if it did this would protect us against multiple nodes thinking they're the leaseholder for too long.

Do you think the safety check isn't worth the extra code?

A conditional put failure returns a ConditionFailedError, which you can catch and use to do these extra safety checks
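
(A toy, self-contained illustration of that pattern -- the ConditionFailedError type and conditionalPut function below are stand-ins for illustration, not the real client API:)

package main

import "fmt"

// ConditionFailedError is a stand-in for the typed error a conditional put
// returns when the stored value doesn't match the expected one.
type ConditionFailedError struct{ ActualValue string }

func (e *ConditionFailedError) Error() string {
	return fmt.Sprintf("unexpected value: %q", e.ActualValue)
}

// conditionalPut is a toy CPut: it only writes if the stored value matches
// expValue, and otherwise reports what it actually found.
func conditionalPut(store map[string]string, key, value, expValue string) error {
	if actual := store[key]; actual != expValue {
		return &ConditionFailedError{ActualValue: actual}
	}
	store[key] = value
	return nil
}

func main() {
	store := map[string]string{"lease": "owned-by-node-2"}
	err := conditionalPut(store, "lease", "", "owned-by-node-1")
	if cErr, ok := err.(*ConditionFailedError); ok {
		// This is where the extra safety check can live: we believed we held
		// the lease, but the persisted owner disagrees, so clear local state.
		fmt.Println("condition failed; persisted value is", cErr.ActualValue)
	}
}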


pkg/migrations/migrations.go, line 147 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

You can send one item on it without blocking:
https://play.golang.org/p/_YSEWGVzNP

It's not super important, but it allows us to release the lease and continue on with starting up the server without needing to block on the extend task.

Makes sense


pkg/migrations/migrations.go, line 190 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

For memory usage reasons? I've always seen/used bools just because it makes reads against the map really simple, since the default value of bool works perfectly.

And we have at least a handful of places using a pattern like this (albeit sql/drop.go was added by me):

$ git grep "map\[.*\]bool"
acceptance/continuous_load_test.go:     started := make(map[int]bool)
internal/rsg/rsg.go:    seen  map[string]bool
internal/rsg/rsg.go:            seen:  make(map[string]bool),
migrations/migrations.go:       completedMigrations := make(map[string]bool)
server/admin.go:                oks map[roachpb.StoreID]bool
server/admin.go:                oks: make(map[roachpb.StoreID]bool),
sql/drop.go:    cascadedTables := make(map[sqlbase.ID]bool)
sql/drop.go:    dependentTables map[sqlbase.ID]bool, desc *sqlbase.TableDescriptor,
storage/client_split_test.go:           sids map[storagebase.CmdIDKey][2]bool
storage/client_split_test.go:   seen.sids = make(map[storagebase.CmdIDKey][2]bool)
storage/engine/rocksdb.go:      seenLevel := make(map[int]bool)
storage/node_liveness.go:func (nl *NodeLiveness) GetLivenessMap() map[roachpb.NodeID]bool {
storage/node_liveness.go:       lMap := map[roachpb.NodeID]bool{}
storage/node_liveness_test.go:  expectedLMap := map[roachpb.NodeID]bool{1: true, 2: true, 3: true}
storage/node_liveness_test.go:  expectedLMap = map[roachpb.NodeID]bool{1: true, 2: false, 3: false}
storage/queue_test.go:  shouldAddMap := map[*Replica]bool{
storage/store.go:       var livenessMap map[roachpb.NodeID]bool
util/leaktest/leaktest.go:      orig := map[string]bool{}
util/log/clog_test.go:var vGlobs = map[string]bool{
util/protoutil/clone.go:        known map[typeKey]bool
util/protoutil/clone.go:        types.known = make(map[typeKey]bool)
util/randutil/rand_test.go:     numbers := make(map[int]bool)

Do the same search for `git grep "map\[.*\]struct"` :-)

Yeah less memory was the original motivation, but in this case it's more for consistency


pkg/migrations/migrations.go, line 142 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

That's fair, my time at Google probably influenced me a little too much here -- I've been conditioned to prefer failing fast at startup rather than waiting a long time, but I suppose that's the wrong tradeoff in a lot of other environments. Switched to retrying indefinitely.

I think the people I've asked about this are fairly evenly divided on whether to block or exit. I was originally imagining erroring because at Foursquare that would have made it obvious much more quickly that your roll was bad.



@a-robinson (Contributor, Author)

Review status: 2 of 9 files reviewed at latest revision, 17 unresolved discussions.


pkg/internal/client/lease.go, line 128 at r3 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

A conditional put failure returns a ConditionFailedError, which you can catch and use to do these extra safety checks

Ah, thanks for pointing that out.


pkg/migrations/migrations.go, line 190 at r2 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

Do the same search for `git grep "map\[.*\]struct"` :-)

Yeah less memory was the original motivation, but in this case it's more for consistency

Touché :) I can't say I've changed my preference yet, but I've changed the code.


pkg/migrations/migrations.go, line 142 at r3 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

I think the people I've asked about this are fairly evenly divided on whether to block or exit. I was originally imagining erroring because at Foursquare that would have made it obvious much more quickly that your roll was bad.

Right, that's been my experience as well. I'll wait for Ben to chime in again before changing anything.



@bdarnell (Contributor) commented Dec 3, 2016

Reviewed 2 of 7 files at r4, 4 of 4 files at r5, 1 of 1 files at r6.
Review status: all files reviewed at latest revision, 16 unresolved discussions, all commit checks successful.


pkg/migrations/migrations.go, line 190 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Touche :) I can't say I've changed my preference yet, but I've changed the code.

It's not just for the memory savings - the struct{} version guarantees that there can never be a false value stored in the map so you don't have to worry about the difference between missing and false.
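
(A small standalone illustration of the difference between the two set idioms, not from this PR:)

package main

import "fmt"

func main() {
	// With a bool map, "stored as false" and "not present" both read back as
	// false, so sloppy writes can create a confusing third state.
	seenBool := map[string]bool{"a": true, "b": false}
	fmt.Println(seenBool["b"], seenBool["c"]) // false false -- indistinguishable

	// With struct{}, membership is purely presence; the only meaningful check
	// is the comma-ok form, and there is no value to get wrong.
	seenSet := map[string]struct{}{"a": {}}
	_, ok := seenSet["b"]
	fmt.Println(ok) // false
}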


pkg/migrations/migrations.go, line 142 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Right, that's been my experience as well. I'll wait for Ben to chime in again before changing anything.

If another node acquired the lease, then they're working on the migrations and may need this node to be up for those migrations to complete. We need to treat "another node has the lease" separately from other errors (for other errors, bailing out early is fine).


pkg/migrations/migrations.go, line 166 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

If I'm understanding correctly, that would assume that the work functions will use kv transactions and that they won't take longer than the initial lease duration, since it looks like we don't extend transactions' deadlines.

I'll cancel the runner's context instead, given that that's a little more graceful. Is that sufficient, or do you think we should do more?

Yes, that's assuming that the migrations would either use KV transactions or we'd build a similar deadline mechanism for whatever they use instead. We may not extend transactions' deadlines today but I think we could if we wanted to (the only time the deadline really matters is on EndTransaction).

Canceling the context is fine, but no matter what we do here there's going to be a chance of two migrations running concurrently for at least a short time. Only a deadline in the migrations themselves can guard against that completely.


pkg/server/server.go, line 659 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Done. Although after adding in the int cast, I'm not sure it's much better.

This could be s.NodeID().String() instead.



@a-robinson (Contributor, Author)

Review status: all files reviewed at latest revision, 16 unresolved discussions, all commit checks successful.


pkg/migrations/migrations.go, line 142 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If another node acquired the lease, then they're working on the migrations and may need this node to be up for those migrations to complete. We need to treat "another node has the lease" separately from other errors (for other errors, bailing out early is fine).

Gah, of course! Thanks for bringing that up. That hadn't come up as an issue in my testing because the migrations I was running were fast enough that the other nodes didn't time out while waiting for the lease.

I'll leave this as is to just continue blocking and logging the error, but I've added an error type to the lease client in case another use case needs to differentiate or we decide to bail out for other errors.


pkg/migrations/migrations.go, line 166 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yes, that's assuming that the migrations would either use KV transactions or we'd build a similar deadline mechanism for whatever they use instead. We may not extend transactions' deadlines today but I think we could if we wanted to (the only time the deadline really matters is on EndTransaction).

Canceling the context is fine, but no matter what we do here there's going to be a chance of two migrations running concurrently for at least a short time. Only a deadline in the migrations themselves can guard against that completely.

Crashing the server could also guard against that pretty completely, but I understand your point.

I may be missing some subtlety here, but at a high level it seems like as long as the migrations are idempotent (which we should be requiring anyway), wrapping them in ACID transactions should mean that migrations running at the same time on different machines are actually safe, since they'll be serialized.

What change do you propose I make here? Implement a way to pass an updatable deadline to the migrations' transaction? Institute a standard practice of the migrations checking the lease expiration time before attempting to commit their changes? Leave it as is? Go back to crashing?


pkg/server/server.go, line 659 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This could be s.NodeID().String() instead.

Done, thanks.



@a-robinson force-pushed the migrations branch 2 times, most recently from 7b7bf34 to 1944f02 on December 5, 2016 at 22:23
@bdarnell (Contributor) commented Dec 6, 2016

Reviewed 2 of 7 files at r7, 4 of 4 files at r8, 1 of 1 files at r9.
Review status: all files reviewed at latest revision, 15 unresolved discussions, all commit checks successful.


pkg/migrations/migrations.go, line 111 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I agree it'd be ideal if all tables were created/modified the same way regardless of whether you're running an old cluster or a new cluster. However, I'm worried that might make some unit tests pretty awkward -- for example, will SQL tests that rely on the system.jobs table or on an updated system.eventlog schema have to run the migration code in addition to the setup that they run today? Or if we do a SQL FormatVersion upgrade in one of these migrations, will we have to create initial tables on startup using the old FormatVersion so that all tables get upgraded to the new FormatVersion in the same way?

I can double check whether there's any effect after working through #5887, since my worries may be completely unfounded.

Yeah, my thinking is that tests would run through the migrations on startup and we'd create the initial tables using their original FormatVersion and migrate forward. I'm expecting that these migrations will be rare enough (and cheap enough on empty in-memory databases) that it won't be too painful to add a little setup work. I'm willing to change my mind here if it's too expensive to always go through the migration process, but I think I'd rather start by always doing it the same way.


pkg/migrations/migrations.go, line 166 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Crashing the server could also guard against that pretty completely, but I understand your point.

I may be missing some subtlety here, but at a high level it seems like as long as the migrations are idempotent (which we should be requiring anyway), wrapping them in ACID transactions should mean that migrations running at the same time on different machines are actually safe, since they'll be serialized.

What change do you propose I make here? Implement a way to pass an updatable deadline to the migrations' transaction? Institute a standard practice of the migrations checking the lease expiration time before attempting to commit their changes? Leave it as is? Go back to crashing?

Crashing the server still leaves a small race - what if we blocked in ExtendLease for an unexpectedly long time and we're now well past our lease expiration? But you're right that we'll need to make these migrations idempotent (if they're not transaction-based), so maybe it doesn't matter.

I think for now let's go back to crashing if we're unable to extend the lease; this is something to keep in mind as we're implementing actual migrations.



@a-robinson (Contributor, Author)

Review status: all files reviewed at latest revision, 15 unresolved discussions, all commit checks successful.


pkg/migrations/migrations.go, line 111 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yeah, my thinking is that tests would run through the migrations on startup and we'd create the initial tables using their original FormatVersion and migrate forward. I'm expecting that these migrations will be rare enough (and cheap enough on empty in-memory databases) that it won't be too painful to add a little setup work. I'm willing to change my mind here if it's too expensive to always go through the migration process, but I think I'd rather start by always doing it the same way.

Alright, removed all the InitNewCluster stuff. I hope it works out this way!


pkg/migrations/migrations.go, line 166 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Crashing the server still leaves a small race - what if we blocked in ExtendLease for an unexpectedly long time and we're now well past our lease expiration? But you're right that we'll need to make these migrations idempotent (if they're not transaction-based), so maybe it doesn't matter.

I think for now let's go back to crashing if we're unable to extend the lease; this is something to keep in mind as we're implementing actual migrations.

Good point, makes sense. Switched back to killing the process for now, along with a note that we may want to improve it.



@danhhz (Contributor) commented Dec 12, 2016

:lgtm_strong: but you should probably wait for @bdarnell. Thanks for picking this off!


Review status: 4 of 9 files reviewed at latest revision, 16 unresolved discussions, all commit checks successful.


pkg/server/server.go, line 515 at r11 (raw file):

		select {
		case <-serveSQL:
		case <-s.stopper.ShouldStop():

I'm not sure either way, but maybe this should be ShouldQuiesce



@a-robinson (Contributor, Author)

Review status: 4 of 9 files reviewed at latest revision, 16 unresolved discussions, all commit checks successful.


pkg/server/server.go, line 515 at r11 (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

I'm not sure either way, but maybe this should be ShouldQuiesce

You're right, done.



@bdarnell (Contributor)

:lgtm:


Reviewed 3 of 4 files at r11, 1 of 1 files at r12, 1 of 1 files at r13.
Review status: all files reviewed at latest revision, 16 unresolved discussions, all commit checks successful.



Built to initially be used by the migration code since our existing
lease mechanisms for ranges and SQL schemas aren't very generalizable.
This is needed to allow us to make certain changes between versions that
wouldn't be safe otherwise.
Now that there's a delay between when a node starts serving kv and when
it starts serving SQL (due to the SQL migration code path), acceptance
tests that rely on SQL can have a bad time without this.