-
Notifications
You must be signed in to change notification settings - Fork 524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix premature context cancellation in Distributor.QueryStream() when experimental ingest storage is enabled #7437
Conversation
@@ -5599,9 +5599,6 @@ func (i *mockIngester) series() map[uint32]*mimirpb.PreallocTimeseries { | |||
} | |||
|
|||
func (i *mockIngester) Check(context.Context, *grpc_health_v1.HealthCheckRequest, ...grpc.CallOption) (*grpc_health_v1.HealthCheckResponse, error) { | |||
i.Lock() | |||
defer i.Unlock() | |||
|
|||
i.trackCall("Check") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to reviewers: I moved calls to i.trackCall()
as first thing and it takes the lock. See details here.
@@ -5690,17 +5687,19 @@ func makeWireChunk(c chunk.EncodedChunk) client.Chunk { | |||
} | |||
|
|||
func (i *mockIngester) QueryStream(ctx context.Context, req *client.QueryRequest, _ ...grpc.CallOption) (client.Ingester_QueryStreamClient, error) { | |||
i.trackCall("QueryStream") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to reviews: moving call to trackCall()
fixes TestDistributor_QueryStream_ShouldSupportIngestStorage()
flakyness. Why? Because when ingester requests are NOT minimized, the context gets canceled as soon as we reach quorum. There are cases where it gets canceled before enforceReadConsistency()
is called. In such cases, the trackCall()
was not called.
…experimental ingest storage is enabled Signed-off-by: Marco Pracucci <marco@pracucci.com>
… the ingester calls as first thing Signed-off-by: Marco Pracucci <marco@pracucci.com>
fa18a1c
to
dfee880
Compare
As with grafana/dskit#495 I'm going to merge this just to move forward with the testing, but I will promptly address any post-merge comment. The change in this PR should be relatively safe for the classic Mimir architecture because in that we only have 1 replication set and the new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
What this PR does
In #7388 I've introduced a bug when the experimental ingest storage is enabled. When enabled,
Distributor.queryIngesterStream()
runs with multipleReplicationSet
so I wrappedring.DoUntilQuorumWithoutSuccessfulContextCancellation()
withconcurrency.ForEachJobMergeResults()
. Unfortunately,concurrency.ForEachJobMergeResults()
cancels the callback's context as soon asconcurrency.ForEachJobMergeResults()
returns, invalidating the "WithoutSuccessfulContextCancellation" part ofring.DoUntilQuorumWithoutSuccessfulContextCancellation()
.In this PR I'm fixing the issue via new function
ring.DoMultiUntilQuorumWithoutSuccessfulContextCancellation()
introduced in dskit (see PR) which is a multi-ReplicationSet
wrapper on top ofring.DoUntilQuorumWithoutSuccessfulContextCancellation()
.Why the bug only affects ingest storage?
The bug didn't happen with the classic ingesters ring because it has only 1 replication set and
concurrency.ForEachJobMergeResults()
did a pass-through if there was only 1 job.concurrency.ForEachJobMergeResults()
for the "1 job" case has changed in this PR, to actually always cancel the context as expected by its function contract (see dskit PR which is vendored in this PR).Tests
The new test
TestDistributor_QueryStream_ShouldSuccessfullyRunOnSlowIngesterWithStreamingChunksIsEnabled
fails with the version of the code inmain
.In addition to the unit test, I've manually tested it locally hacking the
pkg/distributor/
code to inject mocked errors and delays in different places, and cancelling query request too. I've also added debug logs to ensure for eachring.DoMultiUntilQuorumWithoutSuccessfulContextCancellation()
workersCtx
created at some point there was its cancel function being called.Which issue(s) this PR fixes or relates to
N/A
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]
.about-versioning.md
updated with experimental features.