-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: remove re-evaluations of requests after a snapshot is applied #33381
storage: remove re-evaluations of requests after a snapshot is applied #33381
Conversation
cc @spencerkimball - #28985 was assigned to you, but I didn't see a PR open and so I've taken it back as a Christmas present to myself. Hope you don't mind. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/storage/replica.go, line 4823 at r1 (raw file):
// snapshot, so nobody removed p from r.mu.proposals). In this // ambiguous case, we return an ambiguous error. // NOTE: We used to perform a re-evaluation here in order to avoid the
nit: newline above this one
pkg/storage/replica.go, line 4828 at r1 (raw file):
// error to reflect the ambiguity. This mechanism also proved very hard to // test and also is of questionable utility since snapshots are only // applied in the first place if the leaseholder is diverced from the Raft
divorced
pkg/storage/replica.go, line 4840 at r1 (raw file):
p.finishApplication(proposalResult{Err: roachpb.NewError( roachpb.NewAmbiguousResultError( fmt.Sprintf("unknown status for command; " +
"unable to determine whether command was applied via snapshot"?
pkg/storage/replica.go, line 3061 at r2 (raw file):
for { br, pErr, retry := r.tryExecuteWriteBatch(ctx, ba) if retry == proposalIllegalLeaseIndex {
Not for this PR, but you could drop a TODO in my name to add a metric to the proposal count.
pkg/storage/replica.go, line 3094 at r2 (raw file):
// its result (which could be an error) is returned to the client. // // TODO(tschottdorf): take special care with "special" commands and their
Mind removing this vague TODO while you're here?
acea9a5
to
0a86be4
Compare
0a97e4d
to
53ba4c9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs. I needed to do another small tweak to processRaftCommand()
. PTAQuickL
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/replica.go, line 4828 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
divorced
Done.
pkg/storage/replica.go, line 4840 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
"unable to determine whether command was applied via snapshot"?
Done.
pkg/storage/replica.go, line 3061 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Not for this PR, but you could drop a TODO in my name to add a metric to the proposal count.
done, with my name too
pkg/storage/replica.go, line 3094 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Mind removing this vague TODO while you're here?
done. This was just FUD, wasn't it? Something is preventing those writes, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/replica.go, line 3094 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
done. This was just FUD, wasn't it? Something is preventing those writes, right?
The comment might predate sanity that was applied since, but yeah, I don't know of obvious problems here.
Monster nit, but you write reevaluations
earlier and re-evaluations
now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained
53ba4c9
to
263b334
Compare
This patch affects the handling of the following scenario: - A lease-holder had proposed a command and then received a snapshot with a LeaseAppliedIndex above the command's index. - What can the leaseholder do? It doesn't know whether the command applied or not - the snapshot is too raw for such a determination. So, the command's status is ambiguous. Note that the command can't be reproposed - the proposal would simply be rejected by the lease index check. - Before this patch, in a desperate attempt to avoid returning an ambiguous error to the client, the leaseholder was re-evaluating the request and declared success if the re-evaluation succeeded. It was remembering to declare AmbiguousResultError if the re-evaluation did not succeed for any reason. This patch removes the re-evaluation. We'll now return an AmbiguousResultError directly. This is motivated by the fact that testing the cases proved quite difficult, and the code involved is conceptually complicated (remembering the ambiguity and overwriting errors) and the benefit of the mechanism doesn't seem worth it: snapshots can only be applied when the leaseholder and the Raft leader are divorced - which should be rare. Perhaps when we get more idempotent requests we can re-introduce the re-evaluation in a simpler way. Fixes cockroachdb#28985 Release note: None
263b334
to
50c12f8
Compare
This patch refactors the way ambiguous errors are returned from the evaluation of requests. Before this patch, the situation was confusing and buggy: the responsibility for generating these errors was split between Replica.executeWriteBatch() and the lower lever Replica.tryExecuteWriteBatch(). The latter was returning both errors and so called "retry reasons", except that some retry reasons were not actually resulting in retries. The reasons in question were returned together with an error, which was in blatant violation of proposalResult comment that said that exactly one of these guys can be set. I believe the fact that both an error and a retry reason was set was also causing the timestamp cache updating code to be buggy, I think: https://github.com/cockroachdb/cockroach/blob/a5eddb23cbb7db5cbb9655b8ed7a1875779b3c62/pkg/storage/replica.go#L2214 The code says that we update the ts cache if the request is not to be retried, but it was missing cases. The actual ambiguous errors were sometimes generated in tryExecuteWriteBatch (for reason proposalRangeNoLongerExists) and sometimes in executeWriteBatch (for reason proposalErrorReproposing), which was confusing. This patch does away with the troublesome retry reasons and removes any responsibility around creating ambiguous errors from executeWriteBatch. The lower layers know whether the conditions they're reporting should return in ambiguous errors or not, and so they create ambiguous errors directly. This patch changes the errors returned for some non-commit requests when the reason is proposalErrorReproposing: it used to be a non-ambiguous error and now it will be an ambiguous one. executeWriteBatch used to discriminate between commit and non-commit batches, but that was inconsistent (it didn't take effect for proposalRangeNoLongerExists) and, more importantly, it was the wrong layer to take commits into consideration (in my opinion). Release note: None
50c12f8
to
80eea9e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/replica.go, line 3094 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
The comment might predate sanity that was applied since, but yeah, I don't know of obvious problems here.
Monster nit, but you write
reevaluations
earlier andre-evaluations
now.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
Build failed |
Build failure was #33683
bors r+
…On Sun, Jan 13, 2019 at 2:27 PM craig[bot] ***@***.***> wrote:
Build failed
- GitHub CI (Cockroach)
<https://teamcity.cockroachdb.com/viewLog.html?buildId=1091739&buildTypeId=Cockroach_UnitTests>
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#33381 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAXBcSw5-NDme4t7C7GEGu8oiRN43Lunks5vC4i9gaJpZM4ZjSiV>
.
|
Build failed |
bors r+
…On Sun, Jan 13, 2019 at 2:56 PM craig[bot] ***@***.***> wrote:
Build failed
- GitHub CI (Cockroach)
<https://teamcity.cockroachdb.com/viewLog.html?buildId=1091750&buildTypeId=Cockroach_UnitTests>
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#33381 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAXBcbwDVGR___4jokxvuaZcUO5oRo5Mks5vC499gaJpZM4ZjSiV>
.
|
Build succeeded |
See individual commits.