-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: upgrade 1.1.5 -> 2.0.2 failed to run migration #26422
Comments
@knz could you triage? |
This can happen if a table descriptor lands in an invalid state, e.g. via a schema change. |
@kmcrawford are you able to determine which specific table is causing the problem? |
@knz I am not. I do see this in the console that is strange. |
Yes this is helpful. thanks |
@mjibson @vivekmenezes can you have a brief look, then we'll talk together to determine what is the best way to move forward (we need both to ensure that failed migrations don't leave bad descriptors around, and also make migrations more tolerant of bad descriptors). Perhaps an opportunity to revive SCRUB too. |
This scares me. The migration in question needs to ensure that every table has been upgraded to the latest format version. If some table descriptors fail to validate how can we be sure that the cluster is in a consistent state? In fact I think this is working as designed: the user is forced to stay on a version that is guaranteed to be able to handle the pre-migration state of the table descriptors. Obviously it's a huge failing that we've managed to write a bad table descriptor—apologies, @kmcrawford!—but I'd prefer that we build tooling to repair the busted table descriptor rather than attempting to make the system tolerant of bad table descriptors. |
@kmcrawford could you run the following query and share the output?
Unlike |
can't imagine how this can happen but there is one case where this can happen. When you run DROP DATABASE, the database gets dropped and the tables linger around in the DROP state until the underlying data gets garbage collected. Such tables do not have parentID. And we do upgrade DROP tables during migration |
@benesch You are correct, and here is the infamous '50' |
@benesch I think that a parent-less table should be considered a valid state for a descriptor. As Vivek states there are several known-good situations where this can happen. I suggest we relax the descriptor validation for this specific case. |
So on further looking at this, this problem is a combination of three things.
I think the solution here is to not call validate on an upgrade to a dropped table. |
@vivekmenezes please advise what is your preference. |
@knz I think the minimum we can do here is to not do the descriptor validation for the parent id so that these headless descriptors do not prevent an upgrade. This will at lease unblock the upgrade process. Thoughts? |
I would recommend running the validation but with a boolean to accept a parent-less table. |
I'm not sure that #26543 would help: won't There is a third option:
|
I'm going to be working on this issue tomorrow. |
The COMMIT statement used to return error: got 0 values but expected 1 The column backfiller's updateExprs[] was not being initialized properly. A NULL default value reflects now in a proper initialization of the default value to tree.DNull. related to cockroachdb#26422 Release note: None
26781: sql: sensible error when column default dropped in transaction r=vivekmenezes a=vivekmenezes The COMMIT statement used to return error: got 0 values but expected 1 The column backfiller's updateExprs[] was not being initialized properly. A NULL default value reflects now in a proper initialization of the default value to tree.DNull. related to #26422 Release note: None Co-authored-by: Vivek Menezes <vivek@cockroachlabs.com>
…db#26422 This validation is fine because the function itself is only upgrading the format via the migration path and not changing any cross references. If in the future if we decide to use this descriptor upgrade function for something more complex we can reintroduce the use of Validate. related to cockroachdb#26422 Release note: None
…db#26422 This validation is fine because the function itself is only upgrading the format via the migration path and not changing any cross references. If in the future if we decide to use this descriptor upgrade function for something more complex we can reintroduce the use of Validate. related to cockroachdb#26422 Release note: None
26857: backport-2.0: sqlmigrations: simplify descriptor validation to get around #26422 r=vivekmenezes a=vivekmenezes Backport 1/1 commits from #26784. /cc @cockroachdb/release --- This validation is fine because the function itself is only upgrading the format. related to #26422 Release note: None Co-authored-by: Vivek Menezes <vivek@cockroachlabs.com>
The upgrade to the next 2.0 release 2.0.4 (due mid July) should succeed. |
@kmcrawford release 2.0.4 is out |
We were not able to reproduce the dangling database reference for the table. Closing until it is reproducible. |
I upgraded to a dev cluster from 2.0.2 from 1.1.5. The upgrade didn't complete due to the following error:
failed to run migration "upgrade table descs to interleaved format version": parentID 50 does not exist
I was able to rollback to 1.1.5 and can execute queries successfully. However, if I connect to the cluster using the cockroach client and set my database, then run
show tables;
I getpq: no database with ID 50 found
. I spoke to a developer this morning if he has seen anything strange with this instance and he told me things have been strange when he was updating a table and the internet when down before it could complete. I'm not sure if this is what caused this, but I am unable to upgrade the cluster.@benesch
The text was updated successfully, but these errors were encountered: