Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: remove re-evaluations of requests after a snapshot is applied #33381

Merged

Conversation

andreimatei
Copy link
Contributor

See individual commits.

@andreimatei andreimatei requested review from bdarnell, tbg and a team December 27, 2018 23:35
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andreimatei
Copy link
Contributor Author

cc @spencerkimball - #28985 was assigned to you, but I didn't see a PR open and so I've taken it back as a Christmas present to myself. Hope you don't mind.

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: and thank you

Reviewed 2 of 2 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/storage/replica.go, line 4823 at r1 (raw file):

			// snapshot, so nobody removed p from r.mu.proposals). In this
			// ambiguous case, we return an ambiguous error.
			// NOTE: We used to perform a re-evaluation here in order to avoid the

nit: newline above this one


pkg/storage/replica.go, line 4828 at r1 (raw file):

			// error to reflect the ambiguity. This mechanism also proved very hard to
			// test and also is of questionable utility since snapshots are only
			// applied in the first place if the leaseholder is diverced from the Raft

divorced


pkg/storage/replica.go, line 4840 at r1 (raw file):

				p.finishApplication(proposalResult{Err: roachpb.NewError(
					roachpb.NewAmbiguousResultError(
						fmt.Sprintf("unknown status for command; " +

"unable to determine whether command was applied via snapshot"?


pkg/storage/replica.go, line 3061 at r2 (raw file):

	for {
		br, pErr, retry := r.tryExecuteWriteBatch(ctx, ba)
		if retry == proposalIllegalLeaseIndex {

Not for this PR, but you could drop a TODO in my name to add a metric to the proposal count.


pkg/storage/replica.go, line 3094 at r2 (raw file):

//   its result (which could be an error) is returned to the client.
//
// TODO(tschottdorf): take special care with "special" commands and their

Mind removing this vague TODO while you're here?

@andreimatei andreimatei force-pushed the storage.remove-snapshot-reevaluation branch from acea9a5 to 0a86be4 Compare January 3, 2019 00:31
@andreimatei andreimatei requested review from a team January 3, 2019 00:31
@andreimatei andreimatei force-pushed the storage.remove-snapshot-reevaluation branch 2 times, most recently from 0a97e4d to 53ba4c9 Compare January 3, 2019 18:54
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs. I needed to do another small tweak to processRaftCommand(). PTAQuickL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica.go, line 4828 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

divorced

Done.


pkg/storage/replica.go, line 4840 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

"unable to determine whether command was applied via snapshot"?

Done.


pkg/storage/replica.go, line 3061 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Not for this PR, but you could drop a TODO in my name to add a metric to the proposal count.

done, with my name too


pkg/storage/replica.go, line 3094 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Mind removing this vague TODO while you're here?

done. This was just FUD, wasn't it? Something is preventing those writes, right?

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica.go, line 3094 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

done. This was just FUD, wasn't it? Something is preventing those writes, right?

The comment might predate sanity that was applied since, but yeah, I don't know of obvious problems here.

Monster nit, but you write reevaluations earlier and re-evaluations now.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@andreimatei andreimatei force-pushed the storage.remove-snapshot-reevaluation branch from 53ba4c9 to 263b334 Compare January 13, 2019 18:36
This patch affects the handling of the following scenario:
- A lease-holder had proposed a command and then received a snapshot
with a LeaseAppliedIndex above the command's index.
- What can the leaseholder do? It doesn't know whether the command
applied or not - the snapshot is too raw for such a determination. So,
the command's status is ambiguous. Note that the command can't be
reproposed - the proposal would simply be rejected by the lease index
check.
- Before this patch, in a desperate attempt to avoid returning an
ambiguous error to the client, the leaseholder was re-evaluating the
request and declared success if the re-evaluation succeeded. It was
remembering to declare AmbiguousResultError if the re-evaluation did not
succeed for any reason.

This patch removes the re-evaluation. We'll now return an
AmbiguousResultError directly. This is motivated by the fact that
testing the cases proved quite difficult, and the code involved is
conceptually complicated (remembering the ambiguity and overwriting
errors) and the benefit of the mechanism doesn't seem worth it:
snapshots can only be applied when the leaseholder and the Raft leader
are divorced - which should be rare.
Perhaps when we get more idempotent requests we can re-introduce the
re-evaluation in a simpler way.

Fixes cockroachdb#28985

Release note: None
@andreimatei andreimatei force-pushed the storage.remove-snapshot-reevaluation branch from 263b334 to 50c12f8 Compare January 13, 2019 18:44
This patch refactors the way ambiguous errors are returned from the
evaluation of requests. Before this patch, the situation was confusing
and buggy: the responsibility for generating these errors was split
between Replica.executeWriteBatch() and the lower lever
Replica.tryExecuteWriteBatch(). The latter was returning both errors
and so called "retry reasons", except that some retry reasons were not
actually resulting in retries. The reasons in question were returned
together with an error, which was in blatant violation of proposalResult
comment that said that exactly one of these guys can be set.
I believe the fact that both an error and a retry reason was set was
also causing the timestamp cache updating code to be buggy, I think:
https://github.com/cockroachdb/cockroach/blob/a5eddb23cbb7db5cbb9655b8ed7a1875779b3c62/pkg/storage/replica.go#L2214
The code says that we update the ts cache if the request is not to be
retried, but it was missing cases.

The actual ambiguous errors were sometimes generated in
tryExecuteWriteBatch (for reason proposalRangeNoLongerExists) and
sometimes in executeWriteBatch (for reason proposalErrorReproposing),
which was confusing.

This patch does away with the troublesome retry reasons and removes any
responsibility around creating ambiguous errors from executeWriteBatch.
The lower layers know whether the conditions they're reporting should
return in ambiguous errors or not, and so they create ambiguous errors
directly.

This patch changes the errors returned for some non-commit requests when
the reason is proposalErrorReproposing: it used to be a non-ambiguous
error and now it will be an ambiguous one. executeWriteBatch used to
discriminate between commit and non-commit batches, but that was
inconsistent (it didn't take effect for proposalRangeNoLongerExists)
and, more importantly, it was the wrong layer to take commits into
consideration (in my opinion).

Release note: None
@andreimatei andreimatei force-pushed the storage.remove-snapshot-reevaluation branch from 50c12f8 to 80eea9e Compare January 13, 2019 18:50
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica.go, line 3094 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

The comment might predate sanity that was applied since, but yeah, I don't know of obvious problems here.

Monster nit, but you write reevaluations earlier and re-evaluations now.

Done.

Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@craig
Copy link
Contributor

craig bot commented Jan 13, 2019

Build failed

@andreimatei
Copy link
Contributor Author

andreimatei commented Jan 13, 2019 via email

@craig
Copy link
Contributor

craig bot commented Jan 13, 2019

Build failed

@andreimatei
Copy link
Contributor Author

andreimatei commented Jan 13, 2019 via email

craig bot pushed a commit that referenced this pull request Jan 13, 2019
33381: storage: remove re-evaluations of requests after a snapshot is applied r=andreimatei a=andreimatei

See individual commits.

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@craig
Copy link
Contributor

craig bot commented Jan 13, 2019

Build succeeded

@craig craig bot merged commit 80eea9e into cockroachdb:master Jan 13, 2019
@andreimatei andreimatei deleted the storage.remove-snapshot-reevaluation branch January 15, 2019 19:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants