Sampled logging: log only 1 in N of specific errors #5584
Conversation
pkg/ingester/limiter.go
	validation.MaxSeriesPerUserFlag,
))
return log.SampledError{
	Sampler: l.sampler,
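For context, a minimal sketch of the sampler idea discussed in this thread. Names and structure are a guess from the diff (and from the "(sampled 1/N)" decoration mentioned in a later commit), not necessarily the PR's exact code:

```go
package log

import (
	"fmt"
	"sync/atomic"
)

// Sampler lets through roughly one in every freq occurrences.
type Sampler struct {
	freq  int64
	count atomic.Int64
}

func NewSampler(freq int64) *Sampler {
	return &Sampler{freq: freq}
}

// Sample reports whether this occurrence should be logged.
func (s *Sampler) Sample() bool {
	if s == nil || s.freq <= 0 {
		return true // sampling disabled: log everything
	}
	return s.count.Add(1)%s.freq == 1
}

// SampledError pairs an error with the sampler deciding whether to log it,
// and decorates the message so readers know the log lines are sampled.
type SampledError struct {
	Sampler *Sampler
	Err     error
}

func (e SampledError) Error() string {
	return fmt.Sprintf("%s (sampled 1/%d)", e.Err.Error(), e.Sampler.freq)
}

// ShouldLog is consulted by the logger before emitting the error.
func (e SampledError) ShouldLog() bool {
	return e.Sampler.Sample()
}
```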
There's still just one sampler, but I'd be specific about what this sampler is for, as I assume we don't want to reuse the same sampler for different errors (because that would make it impossible to infer the number of errors logged).
I had considered 1 sampler per ingester, and 1 per tenant.
Both of them seem to me to allow estimation of the true volume; can you explain why not?
My concern is that in the future someone would add a new error here and reuse the same sampler, because it's called "sampler" and not "perUserSeriesErrorSampler".
Anyway, this is totally unimportant, and definitely not blocking.
I added another 6 errors. As I understand your comment, I am doing exactly what you didn't want.
Force-pushed 405f279 to da1e50f
pkg/ingester/ingester.go
				return i.limiter.sampler.WrapError(newIngestErrSampleTimestampTooOld(model.Time(timestamp), labels))
			})
			return true

		case storage.ErrOutOfOrderSample:
			stats.sampleOutOfOrderCount++
			updateFirstPartial(func() error {
-				return newIngestErrSampleOutOfOrder(model.Time(timestamp), labels)
+				return i.limiter.sampler.WrapError(newIngestErrSampleOutOfOrder(model.Time(timestamp), labels))
This is what I was talking about on the previous PR when I mentioned that we should have different samplers.
With a single sampler, there's no way to tell how many of each error is happening, right? There might be 27 "too old" errors and 3 "out of order" errors, and we could still only see the "out of order" ones.
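A hypothetical simulation of that worry, using a single shared 1-in-10 counter (illustrative code, not from the PR):

```go
package main

import "fmt"

func main() {
	var count int64
	logged := map[string]int{}
	for i := 0; i < 100; i++ {
		kind := "sample timestamp too old" // 90% of the errors
		if i%10 == 0 {
			kind = "out-of-order sample" // 10% of the errors
		}
		count++
		if count%10 == 1 { // one shared 1-in-10 sampler
			logged[kind]++
		}
	}
	// Prints map[out-of-order sample:10]: the dominant error never appears,
	// because the rare error always lands on the sampled slots.
	fmt.Println(logged)
}
```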
That is very difficult to arrange. In real life the different kinds of errors will be randomly spaced and over time the sampler will provide a statistical picture of them.
I don't think that having a separate sampler for each error would be that difficult to arrange.
But with the current approach I'd rather just drop and not log these errors at all, as I don't think it adds much value to see some log lines from time to time without being able to tell how many of each one are happening.
BTW, in case it's unclear: when I said "That is very difficult to arrange", I meant getting the 27 errors and the 3 errors to come out exactly in a pattern such that you only see the 3.
Quick update on this: to be clear, I'm not blocking, because I also wouldn't complain if we completely removed the errors. I think this is the typical thing where Murphy's Law will play on me, but I also have to trust your experience and if you think it's useful, then it probably is.
I have updated the PR by creating a different sampler for each new error type.
Force-pushed a5c87e1 to 78ab903
Would it make sense to make the sampling rate a runtime config? If something is going on, I might want to peek into the logs without having to restart the ingesters.
Force-pushed 78ab903 to 50be552
Yes, I looked into this. It was very hard to plumb through pushed updates, and somewhat annoying to do periodic checks. Specifically, adding an atomic variable or lock around the sample frequency was annoying.
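For illustration, a sketch of what the runtime-config approach discussed here would have needed: an atomically readable frequency so the hot push path avoids a lock. This is hypothetical, describing the option that was considered and not implemented:

```go
package log

import "sync/atomic"

// adjustableSampler allows the sample frequency to be changed at runtime
// (e.g. by a periodic config reload) without locking the push path.
type adjustableSampler struct {
	freq  atomic.Int64
	count atomic.Int64
}

// SetFrequency would be called by the config reloader.
func (s *adjustableSampler) SetFrequency(n int64) { s.freq.Store(n) }

// Sample reports whether this occurrence should be logged, reading the
// frequency atomically on every call.
func (s *adjustableSampler) Sample() bool {
	f := s.freq.Load()
	if f <= 0 {
		return true // sampling disabled: log everything
	}
	return s.count.Add(1)%f == 1
}
```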
I have rebased and fixed up tests.
Force-pushed 57f37ff to e27a3b1
Commits:
* Error sampling rate is configurable. Sampled errors are decorated like "(sampled 1/N)".
* These are likely to be high-volume.
* Check it contains the string instead.
LGTM, thanks for addressing all the feedback.
Force-pushed 0feb589 to fd75545
@@ -8,6 +8,9 @@
* [ENHANCEMENT] Query-frontend: add `cortex_query_frontend_enqueue_duration_seconds` metric that records the time taken to enqueue or reject a query request when not using the query-scheduler. #5879
* [ENHANCEMENT] Expose `/sync/mutex/wait/total:seconds` Go runtime metric as `go_sync_mutex_wait_total_seconds_total` from all components. #5879
* [ENHANCEMENT] Query-scheduler: improve latency with many concurrent queriers. #5880
* [ENHANCEMENT] Go: updated to 1.21.1. #5955
This looks unrelated.
@@ -8,6 +8,9 @@
* [ENHANCEMENT] Query-frontend: add `cortex_query_frontend_enqueue_duration_seconds` metric that records the time taken to enqueue or reject a query request when not using the query-scheduler. #5879
* [ENHANCEMENT] Expose `/sync/mutex/wait/total:seconds` Go runtime metric as `go_sync_mutex_wait_total_seconds_total` from all components. #5879
* [ENHANCEMENT] Query-scheduler: improve latency with many concurrent queriers. #5880
* [ENHANCEMENT] Go: updated to 1.21.1. #5955
* [ENHANCEMENT] Ingester: added support for sampling errors, which can be enabled by setting `-ingester.error-sample-rate`. This way each error will be logged once in the configured number of times. #5584
* [BUGFIX] Ingester: fix spurious `not found` errors on label values API during head compaction. #5957
This looks unrelated.
What this PR does
This work was initially done by @bboreham and then inherited by @duricanikolic.
This PR adds support for sampling ingester errors. Sampling can be enabled by setting the newly added experimental CLI flag `-ingester.error-sample-rate` to a positive value. This way each ingester error will be logged once in the configured number of occurrences. For backwards compatibility, the default value of `-ingester.error-sample-rate` is 0.

For each error type, a new `Sampler` is created. Initialization and ownership of `Sampler`s is in `ingester.Limiter`, because it was convenient. This means sampling is per-tenant, which makes some sense, but it also means ingesters with a lot of tenants will log more.

There is no specific metric counting the log lines dropped. The dropped samples are already counted in `cortex_discarded_samples_total`.
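A rough sketch of the per-error-type wiring described above, reusing the `Sampler` sketch from earlier in this thread (field and constructor names are illustrative, not the PR's exact code):

```go
package ingester

// Limiter owns one sampler per error type, so the volume of each error
// kind can be estimated independently from its sampled log lines.
type Limiter struct {
	timestampTooOldSampler  *Sampler
	outOfOrderSampler       *Sampler
	maxSeriesPerUserSampler *Sampler
	// ... one sampler per sampled error type
}

// NewLimiter builds one sampler per error type from the configured rate
// (0 disables sampling, preserving the old log-everything behavior).
func NewLimiter(errorSampleRate int64) *Limiter {
	return &Limiter{
		timestampTooOldSampler:  NewSampler(errorSampleRate),
		outOfOrderSampler:       NewSampler(errorSampleRate),
		maxSeriesPerUserSampler: NewSampler(errorSampleRate),
	}
}
```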
Which issue(s) this PR fixes or relates to
Relates to #1900, #5894, #6008
Checklist
* `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`