Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix premature context cancellation in Distributor.QueryStream() when experimental ingest storage is enabled #7437

Merged
merged 2 commits into from
Feb 22, 2024

Conversation

pracucci
Copy link
Collaborator

@pracucci pracucci commented Feb 21, 2024

What this PR does

In #7388 I've introduced a bug when the experimental ingest storage is enabled. When enabled, Distributor.queryIngesterStream() runs with multiple ReplicationSet so I wrapped ring.DoUntilQuorumWithoutSuccessfulContextCancellation() with concurrency.ForEachJobMergeResults(). Unfortunately, concurrency.ForEachJobMergeResults() cancels the callback's context as soon as concurrency.ForEachJobMergeResults() returns, invalidating the "WithoutSuccessfulContextCancellation" part of ring.DoUntilQuorumWithoutSuccessfulContextCancellation().

In this PR I'm fixing the issue via new function ring.DoMultiUntilQuorumWithoutSuccessfulContextCancellation() introduced in dskit (see PR) which is a multi-ReplicationSet wrapper on top of ring.DoUntilQuorumWithoutSuccessfulContextCancellation().

Why the bug only affects ingest storage?

The bug didn't happen with the classic ingesters ring because it has only 1 replication set and concurrency.ForEachJobMergeResults() did a pass-through if there was only 1 job. concurrency.ForEachJobMergeResults() for the "1 job" case has changed in this PR, to actually always cancel the context as expected by its function contract (see dskit PR which is vendored in this PR).

Tests

The new test TestDistributor_QueryStream_ShouldSuccessfullyRunOnSlowIngesterWithStreamingChunksIsEnabled fails with the version of the code in main.

In addition to the unit test, I've manually tested it locally hacking the pkg/distributor/ code to inject mocked errors and delays in different places, and cancelling query request too. I've also added debug logs to ensure for each ring.DoMultiUntilQuorumWithoutSuccessfulContextCancellation() workersCtx created at some point there was its cancel function being called.

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@pracucci pracucci marked this pull request as ready for review February 21, 2024 17:30
@pracucci pracucci requested review from grafanabot and a team as code owners February 21, 2024 17:30
@@ -5599,9 +5599,6 @@ func (i *mockIngester) series() map[uint32]*mimirpb.PreallocTimeseries {
}

func (i *mockIngester) Check(context.Context, *grpc_health_v1.HealthCheckRequest, ...grpc.CallOption) (*grpc_health_v1.HealthCheckResponse, error) {
i.Lock()
defer i.Unlock()

i.trackCall("Check")
Copy link
Collaborator Author

@pracucci pracucci Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: I moved calls to i.trackCall() as first thing and it takes the lock. See details here.

@@ -5690,17 +5687,19 @@ func makeWireChunk(c chunk.EncodedChunk) client.Chunk {
}

func (i *mockIngester) QueryStream(ctx context.Context, req *client.QueryRequest, _ ...grpc.CallOption) (client.Ingester_QueryStreamClient, error) {
i.trackCall("QueryStream")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviews: moving call to trackCall() fixes TestDistributor_QueryStream_ShouldSupportIngestStorage() flakyness. Why? Because when ingester requests are NOT minimized, the context gets canceled as soon as we reach quorum. There are cases where it gets canceled before enforceReadConsistency() is called. In such cases, the trackCall() was not called.

…experimental ingest storage is enabled

Signed-off-by: Marco Pracucci <marco@pracucci.com>
… the ingester calls as first thing

Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci force-pushed the fix-premature-context-cancellation branch from fa18a1c to dfee880 Compare February 22, 2024 12:53
@pracucci
Copy link
Collaborator Author

As with grafana/dskit#495 I'm going to merge this just to move forward with the testing, but I will promptly address any post-merge comment.

The change in this PR should be relatively safe for the classic Mimir architecture because in that we only have 1 replication set and the new ring.DoMultiUntilQuorumWithoutSuccessfulContextCancellation() (multi version) shortcut to ring.DoUntilQuorumWithoutSuccessfulContextCancellation() (single version) when there's just 1 replication set.

@pracucci pracucci merged commit ff17c12 into main Feb 22, 2024
28 checks passed
@pracucci pracucci deleted the fix-premature-context-cancellation branch February 22, 2024 13:23
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants