Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-20.2: sql: failure to upgrade FK representation during table validation produces spurious errors and makes table unavailable #57032

Closed
halverneus opened this issue Nov 23, 2020 · 23 comments · Fixed by #57066
Assignees
Labels
branch-master Failures and bugs on the master branch. O-community Originated from the community release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. S-1 High impact: many users impacted, serious risk of high unavailability or data loss X-blathers-triaged blathers was able to find an owner

Comments

@halverneus
Copy link

I'm not sure if this is part of the upgrade to 20.2.0 or some other reason that correlates in time, but I seem to have suddenly lost all foreign key references on our production servers.

Please describe the issue you observed, and any steps we can take to reproduce it:

Make a database using an older version and upgrade. This was noticed some time after the upgrade.

What did you do? Describe in your own words.

If possible, provide steps to reproduce the behavior:

  1. Set up CockroachDB 3-node cluster and initialize the database.
  2. Use the following SQL to setup the database for the errors below:

-- The various known types of machines.
CREATE TABLE IF NOT EXISTS machine_type (
    id INT NOT NULL PRIMARY KEY,
    name STRING NOT NULL UNIQUE
);

INSERT INTO machine_type (id, name) VALUES
    (0, 'unknown'),
    (1, 'known');

-- Machines and their associated information.
CREATE TABLE IF NOT EXISTS machine (
    id BYTES DEFAULT uuid_v4() PRIMARY KEY NOT NULL,
    machine_type INT NOT NULL,
    name STRING,
    FOREIGN KEY(machine_type) REFERENCES machine_type(id) ON UPDATE CASCADE ON DELETE CASCADE
);

  1. Add some data.
  2. Upgrade to CockroachDB 20.2.0.
  3. Sudd

Expected behavior
Return results from query like normal.

Additional data / screenshots

root@:26257/fini> SELECT * from backup
-> ;
ERROR: internal error: missing fk back reference "fk_machine_id_ref_machine" to "backup" from "machine"
SQLSTATE: XX000
DETAIL: stack trace:
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/tabledesc/structured.go:1500: validateCrossReferences()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/tabledesc/structured.go:1440: Validate()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/catalogkv/catalogkv.go:298: unwrapDescriptor()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/catalogkv/catalogkv.go:160: GetDescriptorByID()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:215: func1()
/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:707: func1()
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:811: exec()
/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:706: Txn()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:193: acquire()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:858: func1()
/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:128: doCall()
/usr/local/go/src/runtime/asm_amd64.s:1357: goexit()

HINT: You have encountered an unexpected error.

Please check the public issue tracker to check whether this problem is
already tracked. If you cannot find it there, please report the error
with details by creating a new issue.

If you would rather not post publicly, please contact us directly
using the support form.

We appreciate your feedback.
root@:26257/fini> SELECT * FROM machine;
ERROR: internal error: missing fk back reference "fk_machine_type_ref_machine_type" to "machine" from "machine_type"
SQLSTATE: XX000
DETAIL: stack trace:
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/tabledesc/structured.go:1500: validateCrossReferences()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/tabledesc/structured.go:1440: Validate()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/catalogkv/catalogkv.go:298: unwrapDescriptor()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/catalogkv/catalogkv.go:160: GetDescriptorByID()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:215: func1()
/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:707: func1()
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:811: exec()
/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:706: Txn()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:193: acquire()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:858: func1()
/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:128: doCall()
/usr/local/go/src/runtime/asm_amd64.s:1357: goexit()

HINT: You have encountered an unexpected error.

Please check the public issue tracker to check whether this problem is
already tracked. If you cannot find it there, please report the error
with details by creating a new issue.

If you would rather not post publicly, please contact us directly
using the support form.

We appreciate your feedback.
root@:26257/fini> SELECT * FROM machine_type;
ERROR: internal error: missing fk forward reference "fk_machine_type_ref_machine_type" to "machine_type" from "machine"
SQLSTATE: XX000
DETAIL: stack trace:
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/tabledesc/structured.go:1519: validateCrossReferences()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/tabledesc/structured.go:1440: Validate()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/catalogkv/catalogkv.go:298: unwrapDescriptor()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/catalogkv/catalogkv.go:160: GetDescriptorByID()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:215: func1()
/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:707: func1()
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:811: exec()
/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:706: Txn()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:193: acquire()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:858: func1()
/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:128: doCall()
/usr/local/go/src/runtime/asm_amd64.s:1357: goexit()

HINT: You have encountered an unexpected error.

Please check the public issue tracker to check whether this problem is
already tracked. If you cannot find it there, please report the error
with details by creating a new issue.

If you would rather not post publicly, please contact us directly
using the support form.

We appreciate your feedback.

If a node in your cluster encountered a fatal error, supply the contents of the
log directories (at minimum of the affected node(s), but preferably all nodes).

Note that log files can contain confidential information. Please continue
creating this issue, but contact support@cockroachlabs.com to submit the log
files in private.

If applicable, add screenshots to help explain your problem.

Environment:

  • CockroachDB version 20.2.0
  • Server OS: Ubuntu Bionic
  • Client app: cockroach sql

Additional context
What was the impact?

I'm unable to run queries against our production servers.

@blathers-crl
Copy link

blathers-crl bot commented Nov 23, 2020

Hello, I am Blathers. I am here to help you get the issue triaged.

It looks like you have not filled out the issue in the format of any of our templates. To best assist you, we advise you to use one of these templates.

I have CC'd a few people who may be able to assist you:

  • @cockroachdb/bulk-io (found keywords: backup)
  • @nvanbenschoten (found keywords: kv)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Nov 23, 2020
@rafiss rafiss assigned thoszhang and unassigned rafiss Nov 23, 2020
@ajwerner ajwerner self-assigned this Nov 23, 2020
@halverneus
Copy link
Author

This affects only the foreign keys that were created originally in the database when the database version was 19.2.0. All foreign keys created in subsequent upgrades work perfectly fine. I tried to manually add the constraint and drop the constraint with no luck:

oot@:26257/fini> SELECT * FROM usr_permission LIMIT 1;
ERROR: internal error: missing fk forward reference "fk_permission_ref_usr_permission" to "usr_permission" from "usr"
SQLSTATE: XX000
DETAIL: stack trace:
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/tabledesc/structured.go:1519: validateCrossReferences()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/tabledesc/structured.go:1440: Validate()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/catalogkv/catalogkv.go:298: unwrapDescriptor()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/catalogkv/catalogkv.go:160: GetDescriptorByID()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:215: func1()
/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:707: func1()
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:811: exec()
/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:706: Txn()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:193: acquire()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:858: func1()
/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:128: doCall()
/usr/local/go/src/runtime/asm_amd64.s:1357: goexit()

HINT: You have encountered an unexpected error.

Please check the public issue tracker to check whether this problem is
already tracked. If you cannot find it there, please report the error
with details by creating a new issue.

If you would rather not post publicly, please contact us directly
using the support form.

We appreciate your feedback.
root@:26257/fini> ALTER TABLE usr ADD CONSTRAINT fk_permission_ref_usr_permission FOREIGN KEY(permission) REFERENCES usr_permission(id) ON UPDATE CASCADE ON DELETE CASCADE;
ERROR: duplicate constraint name: "fk_permission_ref_usr_permission"
SQLSTATE: 42710
root@:26257/fini> ALTER TABLE usr DROP CONSTRAINT fk_permission_ref_usr_permission;
ERROR: internal error: missing fk back reference "fk_permission_ref_usr_permission" to "usr" from "usr_permission"
SQLSTATE: XX000
DETAIL: stack trace:
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/tabledesc/structured.go:1500: validateCrossReferences()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/tabledesc/structured.go:1440: Validate()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/alter_table.go:577: startExec()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan.go:514: func2()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:119: func1()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:298: visitInternal()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:86: visit()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:50: walkPlan()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan.go:517: startExec()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_node_to_row_source.go:125: Start()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/execinfra/processorsbase.go:774: Run()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:392: Run()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:422: Run()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:991: PlanAndRun()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1001: execWithDistSQLEngine()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:872: dispatchToExecutionEngine()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:639: execStmtInOpenState()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:114: execStmt()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1465: func1()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1467: execCmd()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1391: run()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:508: ServeConn()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:626: func1()
/usr/local/go/src/runtime/asm_amd64.s:1357: goexit()

HINT: You have encountered an unexpected error.

Please check the public issue tracker to check whether this problem is
already tracked. If you cannot find it there, please report the error
with details by creating a new issue.

If you would rather not post publicly, please contact us directly
using the support form.

We appreciate your feedback.

@jordanlewis
Copy link
Member

jordanlewis commented Nov 23, 2020

Hi @halverneus, thanks for the report. The detail about 19.2 is extra helpful. This is symptomatic of "descriptor corruption", a known issue caused by validation for table metadata becoming more strict in 20.2 and thus uncovering pre-existing inconsistencies. Your data is safe, but we'll need to repair the corrupted descriptors to recover the data.

Could you run cockroach debug doctor cluster --url=<cluster connection string> (you'll have to use a 20.2 cockroach binary for this) to connect to your running cluster and run checks against the metadata, and send the output to us? (This command isn't publicly documented, but cockroach debug doctor cluster --help provides more documentation, including flags for connecting to the cluster.)

Also, please send us a debug zip. Note that debug zips can have sensitive information, so please send it to us via email at sql-schema-team@cockroachlabs.com

@halverneus
Copy link
Author

I ran it from inside one of the containers (I don't have an exposed CRDB port outside of the Docker Swarm cluster). This is what I got:

root@us1fini-svr001:/home/keytech# docker exec -it d7e0adaa1bc4 ./cockroach debug doctor cluster --certs-dir=/run/secrets
Examining 48 descriptors and 51 namespace entries...
   Table   2: ParentID   1, ParentSchemaID 29, Name 'namespace': not being dropped but no namespace entry found
   Table  30: ParentID   1, ParentSchemaID 29, Name 'namespace2': namespace entry {ParentID:1 ParentSchemaID:29 Name:namespace} not found in draining names
   Table  30: ParentID   1, ParentSchemaID 29, Name 'namespace2': could not find name in namespace table
   Table  62: ParentID  52, ParentSchemaID 29, Name 'backup': missing fk back reference "fk_machine_id_ref_machine" to "backup" from "machine"
Examining 0 running jobs...
ERROR: validation failed
Failed running "debug doctor cluster"

@halverneus
Copy link
Author

The debug.zip has just been sent with the reference to this issue number in the subject and the body.

@jordanlewis
Copy link
Member

Hi @halverneus, we had a Google group misconfiguration before, but it's resolved now. Could you please re-send your email.

@halverneus
Copy link
Author

It has been resent. Thanks!

@thoszhang
Copy link
Contributor

The issue turns out to be that in 20.2, we introduced a bug in transforming on-disk descriptors to a new representation that had been introduced in 19.2. We're passing in a oneLevelUncachedDescGetter to Validate (same as in the stack trace above):

params.ctx, catalogkv.NewOneLevelUncachedDescGetter(params.p.Txn(), params.ExecCfg().Codec),

Since the oneLevelUncachedDescGetter passes a nil DescGetter for its reads one level down, when we read referenced descriptors while validating foreign keys, we ultimately pass nil into tabledesc.maybeFillInDescriptor, which causes the foreign key upgrade to be skipped entirely. So Validate sees no post-19.2 FK references and reports a violation.

To fix this for 20.2 in a minimally invasive way, the plan is to extend Validate to treat un-upgraded referenced descriptors as valid.

@thoszhang
Copy link
Contributor

@halverneus We'll get back to you within a day or so about fixing this for your cluster specifically.

@thoszhang thoszhang added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Nov 24, 2020
@blathers-crl
Copy link

blathers-crl bot commented Nov 24, 2020

Hi @lucy-zhang, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@thoszhang thoszhang added branch-release-20.2 branch-master Failures and bugs on the master branch. labels Nov 24, 2020
@thoszhang thoszhang changed the title missing fk back/forward reference * to * from * release-20.2: sql: failure to upgrade FK representation during table validation produces spurious errors and makes table unavailable Nov 24, 2020
@halverneus
Copy link
Author

I'm afraid I didn't. I've just been monitoring releases going to Docker Hub and changing the version in my Docker Swarm Compose files (and testing on another stack), so I wasn't aware there were any instructions before now. That is unfortunate, on my part.

@jordanlewis
Copy link
Member

Hi @halverneus, we've merged a fix for this issue into our master branch and into our 20.2 release branch #57083. There will be a nightly build available tonight, and I'll update this issue when it's available. I highly recommend waiting for that nightly build.

@halverneus
Copy link
Author

Thanks! I'll look forward to updating in the morning. I'm not seeing a nightly build tag on Docker Hub, so I'm guessing I'll need to build my own Docker image for the time being. I think I can manage that just fine.

@jordanlewis
Copy link
Member

FYI, we decided to include this critical fix in the official 20.2.2 version, which will very likely be released tomorrow. We'll include a link to the release binary, along with its Docker image, when it's available.

@halverneus
Copy link
Author

Awesome! Even better! Thank you so very much!

@halverneus
Copy link
Author

I just deployed the new release. Somethings have started working, but some of the foreign keys are giving me a slightly different error (perhaps I just need to wait for everything to flush?).

SELECT * FROM usr LIMIT 1;
ERROR: internal error: missing fk forward reference "fk_usr_ref_usr" to "usr" from "session" (un-upgraded foreign key references present)
SQLSTATE: XX000
DETAIL: stack trace:
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/tabledesc/structured.go:1635: validateCrossReferences()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/tabledesc/structured.go:1440: Validate()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/catalogkv/catalogkv.go:298: unwrapDescriptor()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/catalogkv/catalogkv.go:160: GetDescriptorByID()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:215: func1()
/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:707: func1()
/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:811: exec()
/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:706: Txn()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:193: acquire()
/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:858: func1()
/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:128: doCall()
/usr/local/go/src/runtime/asm_amd64.s:1357: goexit()

HINT: You have encountered an unexpected error.

Please check the public issue tracker to check whether this problem is
already tracked. If you cannot find it there, please report the error
with details by creating a new issue.

If you would rather not post publicly, please contact us directly
using the support form.

We appreciate your feedback.

@halverneus
Copy link
Author

halverneus commented Nov 25, 2020

I've just checked doing a select on all tables and the only tables that are affected (if you still have my debug.zip) create 'missing fk forward reference' errors (never 'back reference', now, and only on the first half of the first schema upgrade appears affected, so I don't understand the specific cause, in this case). The following are the only remaining errors I have:

SELECT * FROM session_scope LIMIT 1;   
ERROR: internal error: missing fk forward reference "fk_scope_ref_session_scope" to "session_scope" from "session" (un-upgraded foreign key references present)
#####
root@:26257/fini> SELECT * FROM usr_permission LIMIT 1;             
ERROR: internal error: missing fk forward reference "fk_permission_ref_usr_permission" to "usr_permission" from "usr" (un-upgraded foreign key references present)
#####
SELECT * FROM usr LIMIT 1;
ERROR: internal error: missing fk forward reference "fk_usr_ref_usr" to "usr" from "session" (un-upgraded foreign key references present)
#####
SELECT * FROM machine_type LIMIT 1;       
ERROR: internal error: missing fk forward reference "fk_machine_type_ref_machine_type" to "machine_type" from "machine" (un-upgraded foreign key references present)

I can SELECT on all other tables from all other schema version without any issues. Thoughts?

Any way to manually upgrade an 'un-upgraded foreign key'?

@halverneus
Copy link
Author

Changes made to a Docker image that was emailed directly to me fixed the issue.

@jordanlewis
Copy link
Member

jordanlewis commented Nov 25, 2020

Thanks, @halverneus.

To close the loop, the PR #57133 corrected this issue. This resolution will be available in v20.2.3.

@jordanlewis
Copy link
Member

If anyone finds this later and needs an urgent resolution before v20.2.3 is released, please see the binaries in this forum post: https://forum.cockroachlabs.com/t/issue-whenever-a-requests-comes-in-on-cockroachdb-20-2/4116/5

@mishak87
Copy link

mishak87 commented Dec 5, 2020

In case someone needs patched docker image 😊

FROM cockroachdb/cockroach:v20.2.2

RUN microdnf install tar gzip \
 && mkdir -p /tmp/patch \
 && cd /tmp/patch/ \
 && curl -O https://cockroach-builds.s3.amazonaws.com/cockroach-v20.2.2-2-gb91c2e7506.linux-amd64.tgz \
 && tar --strip-components=1 -xzf *.tgz \
 && microdnf remove gzip tar \
 && cp /tmp/patch/cockroach /cockroach/ \
 && rm -rf /tmp/patch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. O-community Originated from the community release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. S-1 High impact: many users impacted, serious risk of high unavailability or data loss X-blathers-triaged blathers was able to find an owner
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants