Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: upgrade 1.1.5 -> 2.0.2 failed to run migration #26422

Closed
kmcrawford opened this issue Jun 5, 2018 · 34 comments
Closed

sql: upgrade 1.1.5 -> 2.0.2 failed to run migration #26422

kmcrawford opened this issue Jun 5, 2018 · 34 comments
Assignees
Labels
A-schema-descriptors Relating to SQL table/db descriptor handling. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
Milestone

Comments

@kmcrawford
Copy link

I upgraded to a dev cluster from 2.0.2 from 1.1.5. The upgrade didn't complete due to the following error:
failed to run migration "upgrade table descs to interleaved format version": parentID 50 does not exist

I was able to rollback to 1.1.5 and can execute queries successfully. However, if I connect to the cluster using the cockroach client and set my database, then run show tables; I get pq: no database with ID 50 found. I spoke to a developer this morning if he has seen anything strange with this instance and he told me things have been strange when he was updating a table and the internet when down before it could complete. I'm not sure if this is what caused this, but I am unable to upgrade the cluster.

@benesch

@tbg tbg added S-0-corruption-or-data-loss Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting and removed S-0-corruption-or-data-loss Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading. labels Jun 5, 2018
@tbg tbg assigned knz and benesch and unassigned mberhault Jun 5, 2018
@tbg
Copy link
Member

tbg commented Jun 5, 2018

@knz could you triage?

@knz
Copy link
Contributor

knz commented Jun 5, 2018

This can happen if a table descriptor lands in an invalid state, e.g. via a schema change.
In general the migrations should be tolerant of invalid table states.

@knz
Copy link
Contributor

knz commented Jun 5, 2018

@kmcrawford are you able to determine which specific table is causing the problem?

@kmcrawford
Copy link
Author

kmcrawford commented Jun 5, 2018

@knz I am not. I do see this in the console that is strange.

image

@knz
Copy link
Contributor

knz commented Jun 5, 2018

Yes this is helpful. thanks

@knz
Copy link
Contributor

knz commented Jun 5, 2018

@mjibson @vivekmenezes can you have a brief look, then we'll talk together to determine what is the best way to move forward (we need both to ensure that failed migrations don't leave bad descriptors around, and also make migrations more tolerant of bad descriptors).

Perhaps an opportunity to revive SCRUB too.

@knz knz added A-schema-descriptors Relating to SQL table/db descriptor handling. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Jun 5, 2018
@benesch
Copy link
Contributor

benesch commented Jun 5, 2018

also make migrations more tolerant of bad descriptors

This scares me. The migration in question needs to ensure that every table has been upgraded to the latest format version. If some table descriptors fail to validate how can we be sure that the cluster is in a consistent state? In fact I think this is working as designed: the user is forced to stay on a version that is guaranteed to be able to handle the pre-migration state of the table descriptors.

Obviously it's a huge failing that we've managed to write a bad table descriptor—apologies, @kmcrawford!—but I'd prefer that we build tooling to repair the busted table descriptor rather than attempting to make the system tolerant of bad table descriptors.

@benesch
Copy link
Contributor

benesch commented Jun 5, 2018

@kmcrawford could you run the following query and share the output?

SELECT * FROM crdb_internal.tables;

Unlike SHOW TABLES I believe crdb_internal.tables is tolerant of a missing parent database. You should see (at least) one row with a parent_id of 50. That's the invalid descriptor and it should give us some insight into what's gone wrong. In particular, if it's the delivery.overrides table that's showing up on the jobs page, it's very likely a bug in schema changes.

@vivekmenezes
Copy link
Contributor

vivekmenezes commented Jun 5, 2018

can't imagine how this can happen but there is one case where this can happen. When you run DROP DATABASE, the database gets dropped and the tables linger around in the DROP state until the underlying data gets garbage collected. Such tables do not have parentID. And we do upgrade DROP tables during migration

@kmcrawford
Copy link
Author

@benesch You are correct, and here is the infamous '50'
image

@knz
Copy link
Contributor

knz commented Jun 5, 2018

@benesch I think that a parent-less table should be considered a valid state for a descriptor. As Vivek states there are several known-good situations where this can happen. I suggest we relax the descriptor validation for this specific case.

@vivekmenezes
Copy link
Contributor

vivekmenezes commented Jun 5, 2018

So on further looking at this, this problem is a combination of three things.

  1. DROP database takes a long time to GC the underlying tables in the 1.1 release . Remember the implementation doesn't use ClearRange(). Therefore a large database dropped weeks earlier might still be around.
  2. DROP tables being GC-ed do not show up in the jobs table.
  3. The migration path is the only path that can modify a dropped table. After modifying the dropped table it calls validate on it which will fail.

I think the solution here is to not call validate on an upgrade to a dropped table.

@knz
Copy link
Contributor

knz commented Jun 12, 2018

@vivekmenezes please advise what is your preference.

@vivekmenezes
Copy link
Contributor

@knz I think the minimum we can do here is to not do the descriptor validation for the parent id so that these headless descriptors do not prevent an upgrade. This will at lease unblock the upgrade process. Thoughts?

@knz
Copy link
Contributor

knz commented Jun 12, 2018

I would recommend running the validation but with a boolean to accept a parent-less table.

@benesch
Copy link
Contributor

benesch commented Jun 12, 2018

I'm not sure that #26543 would help: won't DROP TABLE [50] try to validate the descriptor before it drops it?

There is a third option:

DELETE FROM system.descriptor WHERE id = 50;
DELETE FROM system.namespace WHERE id = 50;

@vivekmenezes
Copy link
Contributor

I'm going to be working on this issue tomorrow.

vivekmenezes added a commit to vivekmenezes/cockroach that referenced this issue Jun 18, 2018
The COMMIT statement used to return
error: got 0 values but expected 1

The column backfiller's updateExprs[] was not being
initialized properly. A NULL default value reflects now in
a proper initialization of the default value to tree.DNull.

related to cockroachdb#26422

Release note: None
craig bot pushed a commit that referenced this issue Jun 18, 2018
26781: sql: sensible error when column default dropped in transaction r=vivekmenezes a=vivekmenezes

The COMMIT statement used to return
error: got 0 values but expected 1

The column backfiller's updateExprs[] was not being
initialized properly. A NULL default value reflects now in
a proper initialization of the default value to tree.DNull.

related to #26422

Release note: None

Co-authored-by: Vivek Menezes <vivek@cockroachlabs.com>
vivekmenezes added a commit to vivekmenezes/cockroach that referenced this issue Jun 20, 2018
…db#26422

This validation is fine because the function itself is only
upgrading the format via the migration path and not changing any
cross references. If in the future if we decide to use this
descriptor upgrade function for something more complex
we can reintroduce the use of Validate.

related to cockroachdb#26422

Release note: None
craig bot pushed a commit that referenced this issue Jun 20, 2018
26784: sqlmigrations: simplify descriptor validation to get around #26422 r=vivekmenezes a=vivekmenezes

This validation is fine because the function itself is only
upgrading the format.

related to #26422

Release note: None

Co-authored-by: Vivek Menezes <vivek@cockroachlabs.com>
vivekmenezes added a commit to vivekmenezes/cockroach that referenced this issue Jun 20, 2018
…db#26422

This validation is fine because the function itself is only
upgrading the format via the migration path and not changing any
cross references. If in the future if we decide to use this
descriptor upgrade function for something more complex
we can reintroduce the use of Validate.

related to cockroachdb#26422

Release note: None
craig bot pushed a commit that referenced this issue Jun 20, 2018
26857: backport-2.0: sqlmigrations: simplify descriptor validation to get around #26422 r=vivekmenezes a=vivekmenezes

Backport 1/1 commits from #26784.

/cc @cockroachdb/release

---

This validation is fine because the function itself is only
upgrading the format.

related to #26422

Release note: None


Co-authored-by: Vivek Menezes <vivek@cockroachlabs.com>
@vivekmenezes
Copy link
Contributor

The upgrade to the next 2.0 release 2.0.4 (due mid July) should succeed.

@knz knz added this to the 2.1 milestone Jul 23, 2018
@knz knz changed the title Upgrade 1.1.5 -> 2.0.2 failed to run migration sql: upgrade 1.1.5 -> 2.0.2 failed to run migration Jul 23, 2018
@vivekmenezes
Copy link
Contributor

@kmcrawford release 2.0.4 is out

@vivekmenezes
Copy link
Contributor

We were not able to reproduce the dangling database reference for the table. Closing until it is reproducible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-schema-descriptors Relating to SQL table/db descriptor handling. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
Projects
None yet
Development

No branches or pull requests

6 participants