-
Notifications
You must be signed in to change notification settings - Fork 524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add -ingest-storage.kafka.target-consumer-lag-at-startup support #8579
Conversation
a86cb02
to
c3e0fc3
Compare
} | ||
} | ||
|
||
return boff.Err() | ||
// TODO should be moved to dskit's backoff |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to reviewers: I will open a PR to add context.Cause()
support to dskit's backoff
but I prefer to not block on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See: grafana/dskit#538
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no major concerns, LGTM
boff.Wait() | ||
continue | ||
} | ||
|
||
// Send a direct request to the Kafka backend to fetch the last produced offset. | ||
// We intentionally don't use WaitNextFetchLastProducedOffset() to not introduce further | ||
// latency. | ||
lastProducedOffsetRequestedAt := time.Now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that was a subtle bug previously 👍
3919855
to
d2bd338
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating the docs! I've left a few comments.
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
0cf787f
to
6de7564
Compare
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Co-authored-by: Taylor C <41653732+tacole02@users.noreply.github.com>
Thanks @tacole02 for the review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable, thank you.
// written to the partition, we give the reader maxLag time to replay the backlog + ingest the new data | ||
// written in the meanwhile. | ||
func() (time.Duration, error) { | ||
timedCtx, cancel := context.WithTimeoutCause(ctx, maxLag, errWaitTargetLagDeadlineExceeded) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL about context.WithTimeoutCause
, nice.
Going to merge this. Will address the TODO once dskit PR is merged. |
What this PR does
As soon as an ingester switches to
Running
, the reported end-to-end latency spikes high. We can see the correlation here:Why? Because the ingesters are configured with
-ingest-storage.kafka.max-consumer-lag-at-startup=15s
(default), which means an ingester is guaranteed to have an end-to-end latency of up to 15s when switched to Running but there's no guarantee that the latency will be significantly lower than that. Significantly reducing-ingest-storage.kafka.max-consumer-lag-at-startup
setting (e.g. to a couple of seconds) is tricky because if the ingester can't catch up to that low latency then it will never startup.In this PR I'm proposing to add a new config option called
-ingest-storage.kafka.target-consumer-lag-at-startup
and making it a best-effort. How it works:-ingest-storage.kafka.max-consumer-lag-at-startup
(as inmain
)-ingest-storage.kafka.target-consumer-lag-at-startup
(defaults 2s). The grace period is equal to-ingest-storage.kafka.max-consumer-lag-at-startup
; the rationale here is that we expect at least a 1x replay speed from Kafka.This means that
-ingest-storage.kafka.target-consumer-lag-at-startup
is a best-effort, and-ingest-storage.kafka.max-consumer-lag-at-startup
is guaranteed (as inmain
).I've tested it in a Mimir dev cluster and looks fixing the issue:
Which issue(s) this PR fixes or relates to
N/A
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]
.about-versioning.md
updated with experimental features.