Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

schemachanger: improve panic- and error handling #91411

Merged
merged 1 commit into from
Nov 8, 2022

Conversation

postamar
Copy link
Contributor

@postamar postamar commented Nov 7, 2022

Previously, the declarative schema changer would only recover from runtime errors in certain subsystems like building the targets or planning the execution of operations. Consequently an implementation bug leading to a runtime error in the execution layer would trigger a panic which would not be recovered and which would cause the whole process to crash.

This commit fixes this by introducing a common error handler in the form of scerrors.HandleErrorOrPanic, which recovers from panics, wraps errors, and prints informative log messages in a uniform way, to be used at the top of (or near the top of) the declarative schema changer call stack.

Fixes #91400.

Release note (bug fix): fixed a bug in which panics triggered by certain DDL statements were not properly recovered, leading to the cluster node crashing.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@postamar
Copy link
Contributor Author

postamar commented Nov 7, 2022

I'm keen for some suggestions on how to improve this. Here's what the log looks like now after crashing the node by doing ALTER PRIMARY KEY USING COLUMNS (k, k):

I221107 16:23:10.066992 1888 sql/schemachanger/scerrors/errors.go:41 ⋮ [n1,client=127.0.0.1:50816,user=root] 83  building declarative schema change targets for 
‹ALTER TABLE›
I221107 16:23:10.124597 1888 sql/schemachanger/scerrors/errors.go:62 ⋮ [n1,client=127.0.0.1:50816,user=root] 84  done building declarative schema change targets for ‹ALTER TABLE›
I221107 16:23:10.124972 1888 sql/schemachanger/scerrors/errors.go:41 ⋮ [n1,client=127.0.0.1:50816,user=root] 85  building declarative schema changer plan
I221107 16:23:10.146719 1888 sql/schemachanger/scerrors/errors.go:62 ⋮ [n1,client=127.0.0.1:50816,user=root] 86  done building declarative schema changer plan
I221107 16:23:10.146827 1888 sql/schemachanger/scerrors/errors.go:41 ⋮ [n1,client=127.0.0.1:50816,user=root] 87  executing declarative schema change ‹StatementPhase stage 1 of 1 with 14 MutationType ops› (rollback=false)
I221107 16:23:10.197301 1888 sql/schemachanger/scerrors/errors.go:62 ⋮ [n1,client=127.0.0.1:50816,user=root] 88  done executing declarative schema change ‹StatementPhase stage 1 of 1 with 14 MutationType ops› (rollback=false) with error: executing declarative schema change ‹StatementPhase stage 1 of 1 with 14 MutationType ops› (rollback=false): error executing ‹StatementPhase stage 1 of 1 with 14 MutationType ops›: relation ‹"kv"› (104): index ‹"crdb_internal_index_2_name_placeholder"› contains key column ‹""› with unknown ID 0

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 9 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @postamar)


pkg/sql/schemachanger/scerrors/errors.go line 34 at r1 (raw file):

// logging and wrapping errors, for convenience.
func HandleErrorOrPanic(
	ctx context.Context, err *error, wrapMsgFmt string, wrapMsgArgs ...interface{},

I feel like we have an obnoxious linter that's not going to like this. Maybe just add a wrap function that's like func(error) error to which you can pass the message in closure scope? Then just log the error as opposed to this?

@postamar
Copy link
Contributor Author

postamar commented Nov 7, 2022

I told the linter not to annoy us with HandleErrorOrPanic. I feel that it's appropriate to do so, for what it's worth.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 6 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @postamar)


pkg/sql/schemachanger/scerrors/errors.go line 28 at r2 (raw file):

)

// HandleErrorOrPanic generates a closure, intended to be deferred, to handle

I don't really get why this function unconditionally logs twice. I feel like it's doing too much. Can we have it either only log on the error path?

@postamar
Copy link
Contributor Author

postamar commented Nov 7, 2022

I don't really get why this function unconditionally logs twice. I feel like it's doing too much. Can we have it either only log on the error path?

Fair enough, we don't need to log twice in most cases. I'll remove the "done" logs unless there's an error or expensive logging is enabled. The result will still be a bit more chatty than what we had before but that's a good thing IMHO.

@postamar
Copy link
Contributor Author

postamar commented Nov 7, 2022

Thanks for the feedback. Things are now in a state where I'm happy with them.

Sample:

I221107 20:08:27.196832 1056 sql/schemachanger/scbuild/build.go:39 ⋮ [n1,client=127.0.0.1:57081,user=root] 81  building declarative schema change targets for ALTER TABLE
I221107 20:08:27.357109 1056 sql/schemachanger/scplan/plan.go:84 ⋮ [n1,client=127.0.0.1:57081,user=root] 82  building declarative schema changer plan in StatementPhase (rollback=false) for ALTER TABLE
I221107 20:08:27.366527 1056 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,client=127.0.0.1:57081,user=root] 83  executing declarative schema change StatementPhase stage 1 of 1 with 14 MutationType ops (rollback=false) for ALTER TABLE
I221107 20:08:27.367318 1056 sql/schemachanger/scplan/plan.go:84 ⋮ [n1,client=127.0.0.1:57081,user=root] 84  building declarative schema changer plan in PreCommitPhase (rollback=false) for ALTER TABLE
I221107 20:08:27.376256 1056 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,client=127.0.0.1:57081,user=root] 85  executing declarative schema change PreCommitPhase stage 1 of 1 with 2 MutationType ops (rollback=false) for ALTER TABLE
I221107 20:08:27.428143 1056 sql/conn_executor.go:3362 ⋮ [n1,client=127.0.0.1:57081,user=root] 86  queued new schema change job 811929787974385665 using the new schema changer
I221107 20:08:27.506433 1606 jobs/adopt.go:243 ⋮ [n1,client=127.0.0.1:57081,user=root] 87  job 811929787974385665: resuming execution
I221107 20:08:27.661893 1496 jobs/registry.go:1224 ⋮ [n1] 88  NEW SCHEMA CHANGE job 811929787974385665: stepping through state running
I221107 20:08:27.742482 1496 sql/schemachanger/scrun/scrun.go:205 ⋮ [n1,job=811929787974385665] 89  rebuilding declarative schema change state from descriptors [104]
I221107 20:08:27.743631 1496 sql/schemachanger/scplan/plan.go:84 ⋮ [n1,job=811929787974385665] 90  building declarative schema changer plan in PostCommitPhase (rollback=false) for ALTER TABLE
I221107 20:08:27.752033 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 91  executing declarative schema change PostCommitPhase stage 1 of 15 with 3 MutationType ops (rollback=false) for ALTER TABLE
I221107 20:08:27.831907 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 92  executing declarative schema change PostCommitPhase stage 2 of 15 with 1 BackfillType op (rollback=false) for ALTER TABLE
I221107 20:08:27.986663 1596 kv/kvserver/replica_command.go:416 ⋮ [n1,s1,r53/1:‹/{Table/52-Max}›,*roachpb.AdminSplitRequest] 93  initiating a split of this range at key /Table/104/2 [r54] (‹manual›)‹›
I221107 20:08:28.312150 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 94  executing declarative schema change PostCommitPhase stage 3 of 15 with 3 MutationType ops (rollback=false) for ALTER TABLE
I221107 20:08:28.411362 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 95  executing declarative schema change PostCommitPhase stage 4 of 15 with 3 MutationType ops (rollback=false) for ALTER TABLE
I221107 20:08:28.514337 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 96  executing declarative schema change PostCommitPhase stage 5 of 15 with 1 BackfillType op (rollback=false) for ALTER TABLE
I221107 20:08:28.517451 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 97  executing declarative schema change PostCommitPhase stage 6 of 15 with 3 MutationType ops (rollback=false) for ALTER TABLE
I221107 20:08:28.598136 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 98  executing declarative schema change PostCommitPhase stage 7 of 15 with 1 ValidationType op (rollback=false) for ALTER TABLE
I221107 20:08:28.601216 1724 sql/backfill.go:1967 ⋮ [n1,job=811929787974385665] 99  validation: table ‹kv› row count = 0, took 2.252786ms
I221107 20:08:28.601413 1723 sql/backfill.go:1921 ⋮ [n1,job=811929787974385665] 100  validation: index ‹kv›/‹crdb_internal_index_2_name_placeholder› row count = 0, time so far 2.377254ms
I221107 20:08:28.601498 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 101  executing declarative schema change PostCommitPhase stage 8 of 15 with 9 MutationType ops (rollback=false) for ALTER TABLE
I221107 20:08:28.699425 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 102  executing declarative schema change PostCommitPhase stage 9 of 15 with 3 MutationType ops (rollback=false) for ALTER TABLE
I221107 20:08:28.821437 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 103  executing declarative schema change PostCommitPhase stage 10 of 15 with 1 BackfillType op (rollback=false) for ALTER TABLE
I221107 20:08:28.883097 1762 kv/kvserver/replica_command.go:416 ⋮ [n1,s1,r54/1:‹/{Table/104/2-Max}›,*roachpb.AdminSplitRequest] 104  initiating a split of this range at key /Table/104/4 [r55] (‹manual›)‹›
I221107 20:08:29.155207 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 105  executing declarative schema change PostCommitPhase stage 11 of 15 with 3 MutationType ops (rollback=false) for ALTER TABLE
I221107 20:08:29.274668 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 106  executing declarative schema change PostCommitPhase stage 12 of 15 with 3 MutationType ops (rollback=false) for ALTER TABLE
I221107 20:08:29.374953 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 107  executing declarative schema change PostCommitPhase stage 13 of 15 with 1 BackfillType op (rollback=false) for ALTER TABLE
I221107 20:08:29.377211 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 108  executing declarative schema change PostCommitPhase stage 14 of 15 with 3 MutationType ops (rollback=false) for ALTER TABLE
I221107 20:08:29.477555 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 109  executing declarative schema change PostCommitPhase stage 15 of 15 with 1 ValidationType op (rollback=false) for ALTER TABLE
I221107 20:08:29.479569 1877 sql/backfill.go:1967 ⋮ [n1,job=811929787974385665] 110  validation: table ‹kv› row count = 0, took 1.24265ms
I221107 20:08:29.479687 1876 sql/backfill.go:1921 ⋮ [n1,job=811929787974385665] 111  validation: index ‹kv›/‹crdb_internal_index_4_name_placeholder› row count = 0, time so far 1.266765ms
I221107 20:08:29.479775 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 112  executing declarative schema change PostCommitNonRevertiblePhase stage 1 of 4 with 6 MutationType ops (rollback=false) for ALTER TABLE
I221107 20:08:29.580499 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 113  executing declarative schema change PostCommitNonRevertiblePhase stage 2 of 4 with 9 MutationType ops (rollback=false) for ALTER TABLE
I221107 20:08:29.684535 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 114  executing declarative schema change PostCommitNonRevertiblePhase stage 3 of 4 with 3 MutationType ops (rollback=false) for ALTER TABLE
I221107 20:08:29.806314 1496 sql/schemachanger/scrun/scrun.go:160 ⋮ [n1,job=811929787974385665] 115  executing declarative schema change PostCommitNonRevertiblePhase stage 4 of 4 with 9 MutationType ops (rollback=false) for ALTER TABLE
I221107 20:08:29.930438 1496 jobs/registry.go:1224 ⋮ [n1] 116  NEW SCHEMA CHANGE job 811929787974385665: stepping through state succeeded

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @postamar)


pkg/sql/schemachanger/scerrors/errors.go line 34 at r1 (raw file):

Previously, ajwerner wrote…

I feel like we have an obnoxious linter that's not going to like this. Maybe just add a wrap function that's like func(error) error to which you can pass the message in closure scope? Then just log the error as opposed to this?

I don't know, it seems pretty crazy to have a thing called HandleErrorOnPanic unconditionally log a message when called that has nothing to do with any possible error. I was absolutely fine with the message being logged in the case of an error, but I'm pretty opposed to the choice you've made without really making it clear in the function.

If you renamed the method to something clearer and then made the message a part of the description of the function, I'd be more amenable.


pkg/sql/schemachanger/scerrors/errors.go line 60 at r3 (raw file):

			msgFmt = "failed " + msgFmt + " with error: %v"
			args = append(args[:len(args):len(args)], unwrappedErr)
			log.InfofDepth(ctx, logDepth, msgFmt, args...)

should this be Warning? Also, should we also check and see if the context is canceled?

@postamar
Copy link
Contributor Author

postamar commented Nov 7, 2022

Agreed, HandleErrorOnPanic is a bad name. I'll fix that.

I'd be more amenable.

I'm keen to avoid further back and forth here, please don't hesitate to be more explicit about what I should do.

should this be Warning? Also, should we also check and see if the context is canceled?

I don't know the answer to either of these questions. Please tell me.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, let's do Warning if there was an error and the context was not canceled.

As for nameing, this is a little convoluted, but I think it reads nicely:

type EventLogger struct {
    format string
    args []interface
}

func StartEvent(format string, args ...interface) EventLogger { /* ... */ }

func (el EventLogger) HandlePanicAndLogError(err *error) { /*...*/ }

Then this thing would be called like:

defer scerrors.LogStart(
   "building declarative schema changer plan in %s (rollback=%v) for %s",
   redact.Safe(params.ExecutionPhase),
   redact.Safe(params.InRollback),
   redact.Safe(initial.StatementTags()),
).RecoverPanicAndLogError(&err)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @postamar)

@postamar
Copy link
Contributor Author

postamar commented Nov 8, 2022

Thanks! Much appreciated. I like this a lot. Will do 👍

Previously, the declarative schema changer would only recover from
runtime errors in certain subsystems like building the targets or
planning the execution of operations. Consequently an implementation bug
leading to a runtime error in the execution layer would trigger a panic
which would not be recovered and which would cause the whole process to
crash.

This commit fixes this by introducing a common error handler in the form
of scerrors.EventLogger, which recovers from panics, wraps errors, and
prints informative log messages in a uniform way, to be used at the top
of (or near the top of) the declarative schema changer call stack.

Fixes cockroachdb#91400.

Release note (bug fix): fixed a bug in which panics triggered by certain
DDL statements were not properly recovered, leading to the cluster node
crashing.
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @postamar)

@postamar postamar marked this pull request as ready for review November 8, 2022 20:19
@postamar postamar requested a review from a team November 8, 2022 20:19
@postamar
Copy link
Contributor Author

postamar commented Nov 8, 2022

Thanks for the review!

bors r+

@craig
Copy link
Contributor

craig bot commented Nov 8, 2022

Build succeeded:

@blathers-crl
Copy link

blathers-crl bot commented Nov 8, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 93e620e to blathers/backport-release-22.1-91411: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

schemachanger: handle runtime error panics
3 participants