
Sampled logging: log only 1 in N of specific errors #5584

Merged: 24 commits merged into main from sampled-error-logging on Sep 12, 2023

Conversation

@bboreham (Contributor) commented on Jul 25, 2023

What this PR does

This work was initially done by @bboreham and then inherited by @duricanikolic.

This PR adds support for sampling ingester errors. Sampling can be enabled by setting the newly added experimental CLI flag -ingester.error-sample-rate to a positive value; each ingester error is then logged once per configured number of occurrences. For backwards compatibility, the default value of -ingester.error-sample-rate is 0, which disables sampling.

For each error type, a new Sampler is created. Initialization and ownership of the Samplers live in ingester.Limiter, because that was convenient. This means sampling is per-tenant, which makes some sense, but it also means ingesters with many tenants will log more.

There is no specific metric counting the dropped log lines. The discarded samples themselves are already counted in cortex_discarded_samples_total.
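
For illustration, here is a minimal sketch of the 1-in-N mechanism. The Sampler and WrapError names appear in the diffs below, and the "(sampled 1/N)" decoration matches the commit messages, but the body here is an assumption, not the actual pkg/util/log/sampled.go implementation:

```go
package log

import (
	"fmt"
	"sync/atomic"
)

// Sampler decides which occurrences of an error get logged.
// A zero or negative freq disables sampling: everything is logged.
type Sampler struct {
	freq  int64        // log 1 in every freq occurrences
	count atomic.Int64 // occurrences seen so far
}

func NewSampler(freq int64) *Sampler { return &Sampler{freq: freq} }

// Sample reports whether the current occurrence should be logged:
// the 1st, the (freq+1)-th, the (2*freq+1)-th, and so on.
func (s *Sampler) Sample() bool {
	if s == nil || s.freq <= 0 {
		return true
	}
	return (s.count.Add(1)-1)%s.freq == 0
}

// WrapError decorates err with the sampling rate, so a reader of the
// logs knows each logged line stands for roughly freq occurrences.
func (s *Sampler) WrapError(err error) error {
	if s == nil || s.freq <= 0 || err == nil {
		return err
	}
	return fmt.Errorf("%w (sampled 1/%d)", err, s.freq)
}
```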

Which issue(s) this PR fixes or relates to

Relates to #1900, #5894, #6008

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Review thread on pkg/util/log/sampled.go (outdated, resolved)
Review thread on pkg/ingester/limiter.go (outdated, resolved):
validation.MaxSeriesPerUserFlag,
))
return log.SampledError{
Sampler: l.sampler,
Contributor:

There's still just one sampler here, but I'd be specific about what this sampler is for, as I assume we don't want to reuse the same sampler for different errors (because that would make it impossible to infer the number of errors logged).

Contributor Author:

I had considered one sampler per ingester, and one per tenant.
Both seem to me to allow estimation of the true volume; can you explain why not?

Contributor:

My concern is that in the future someone would add a new error here and reuse the same sampler, because it's called "sampler" and not "perUserSeriesErrorSampler".

Anyway, this is totally unimportant, and definitely not blocking.

Contributor Author:

I added another 6 errors. As I understand your comment, I am doing exactly what you didn't want.

@bboreham changed the title from "WIP: Sampled logging: log only 1 in 1000 of specific errors" to "WIP: Sampled logging: log only 1 in N of specific errors" on Jul 27, 2023
Comment on lines 913 to 930
		return i.limiter.sampler.WrapError(newIngestErrSampleTimestampTooOld(model.Time(timestamp), labels))
	})
	return true

case storage.ErrOutOfOrderSample:
	stats.sampleOutOfOrderCount++
	updateFirstPartial(func() error {
-		return newIngestErrSampleOutOfOrder(model.Time(timestamp), labels)
+		return i.limiter.sampler.WrapError(newIngestErrSampleOutOfOrder(model.Time(timestamp), labels))
Contributor:

This is what I was talking about on the previous PR when I mentioned that we should have different samplers.

With a single sampler, there's no way to tell how many of each error are happening, right? There might be 27 "too old" errors and 3 "out of order" errors, and we could still see only the "out of order" ones.

Contributor Author:

That is very difficult to arrange. In real life the different kinds of errors will be randomly interleaved, and over time the sampler will provide a statistical picture of them.
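
(An editorial illustration of that statistical picture: suppose "too old" errors arrive at 27/s and "out of order" at 3/s, and the sample rate is N = 10. A single shared sampler logs 3 lines/s, and because the two types interleave randomly, on average 2.7/s of the logged lines are "too old" and 0.3/s are "out of order"; multiplying each type's logged rate by N recovers roughly 27/s and 3/s. Only a pathologically regular arrival pattern, such as every 10th error being "out of order", would hide one type entirely.)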

Contributor:

I don't think that having a separate sampler for each error would be that difficult to arrange.

But with the current approach I'd rather just drop these errors and not log them at all, as I don't think it adds much value to see some log lines from time to time without being able to tell how many of each one are happening.

Contributor Author:

This is absolutely not my experience. I added sampled "events" to Cortex and used them every week to investigate cardinality and other issues (removed in #766 because the feature was never added to blocks storage).

Contributor Author:

BTW, in case it was unclear: when I said "that is very difficult to arrange", I meant getting the 27 errors and the 3 errors to come out in exactly such a pattern that you only ever see the 3.

Contributor:

Quick update on this: to be clear, I'm not blocking, because I also wouldn't complain if we completely removed these errors. I think this is the typical thing where Murphy's Law will play on me, but I also have to trust your experience: if you think it's useful, then it probably is.

Contributor:

I have updated the PR by creating a different sampler for each new error type.
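
A hypothetical sketch of what one-sampler-per-error-type could look like in the limiter, reusing the Sampler sketch above (the struct and field names here are illustrative, not the PR's actual identifiers); with a dedicated sampler per type, each type's logged count can be scaled by N independently:

```go
// errorSamplers holds one Sampler per error type, so the true volume of
// each type can be estimated from its own logged lines.
type errorSamplers struct {
	sampleTimestampTooOld *log.Sampler
	sampleOutOfOrder      *log.Sampler
	maxSeriesPerUser      *log.Sampler
}

func newErrorSamplers(freq int64) errorSamplers {
	return errorSamplers{
		sampleTimestampTooOld: log.NewSampler(freq),
		sampleOutOfOrder:      log.NewSampler(freq),
		maxSeriesPerUser:      log.NewSampler(freq),
	}
}
```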

@colega (Contributor) commented on Aug 7, 2023

Would it make sense to make the sampling rate a runtime config? If something is going on, I might want to peek into the logs without having to restart the ingesters.

@bboreham (Contributor, Author) commented on Aug 9, 2023

> Would it make sense to make the sampling rate a runtime config? If something is going on, I might want to peek into the logs without having to restart the ingesters.

Yes, I looked into this. It was very hard to plumb through pushed updates, and somewhat annoying to do periodic checks like dskit/limiter does. I figured I would get a basic version working first.

Specifically, adding an atomic variable or a lock around the sample frequency was annoying.
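
As a sketch of the direction being discussed (not something this PR implements, and SetFreq is a hypothetical name), making the frequency itself atomic would let a runtime-config reload change the rate without restarting ingesters:

```go
import "sync/atomic"

// ReloadableSampler is a hypothetical Sampler variant whose rate can be
// changed at runtime, e.g. from a runtime-config reload loop.
type ReloadableSampler struct {
	freq  atomic.Int64 // current 1-in-N rate; <= 0 disables sampling
	count atomic.Int64
}

// SetFreq atomically swaps in a new sampling rate.
func (s *ReloadableSampler) SetFreq(freq int64) { s.freq.Store(freq) }

// Sample loads the rate atomically on every call, so concurrent SetFreq
// calls need no extra locking.
func (s *ReloadableSampler) Sample() bool {
	freq := s.freq.Load()
	if freq <= 0 {
		return true
	}
	return (s.count.Add(1)-1)%freq == 0
}
```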

@bboreham (Contributor, Author) commented on Aug 9, 2023

I have rebased and fixed up tests.

@bboreham changed the title from "WIP: Sampled logging: log only 1 in N of specific errors" to "Sampled logging: log only 1 in N of specific errors" on Aug 14, 2023
Review thread on pkg/ingester/ingester.go (outdated, resolved)

@duricanikolic force-pushed the sampled-error-logging branch 2 times, most recently from 57f37ff to e27a3b1 on September 5, 2023 at 11:45
duricanikolic and others added 17 commits on September 12, 2023, each signed off by Yuri Nikolic <durica.nikolic@grafana.com>. Commit messages include:

  • Error sampling rate is configurable.
  • Sampled errors are decorated like "(sampled 1/N)".
  • These are likely to be high-volume.
  • Check it contains the string instead.
@colega (Contributor) left a review comment:

LGTM, thank you for addressing all the feedback.

@colega enabled auto-merge (squash) on September 12, 2023 at 15:55
@colega merged commit fa33346 into main on Sep 12, 2023
28 checks passed
@colega deleted the sampled-error-logging branch on September 12, 2023 at 16:14
@@ -8,6 +8,9 @@
* [ENHANCEMENT] Query-frontend: add `cortex_query_frontend_enqueue_duration_seconds` metric that records the time taken to enqueue or reject a query request when not using the query-scheduler. #5879
* [ENHANCEMENT] Expose `/sync/mutex/wait/total:seconds` Go runtime metric as `go_sync_mutex_wait_total_seconds_total` from all components. #5879
* [ENHANCEMENT] Query-scheduler: improve latency with many concurrent queriers. #5880
* [ENHANCEMENT] Go: updated to 1.21.1. #5955
Collaborator:

This looks unrelated.

@@ -8,6 +8,9 @@
* [ENHANCEMENT] Query-frontend: add `cortex_query_frontend_enqueue_duration_seconds` metric that records the time taken to enqueue or reject a query request when not using the query-scheduler. #5879
* [ENHANCEMENT] Expose `/sync/mutex/wait/total:seconds` Go runtime metric as `go_sync_mutex_wait_total_seconds_total` from all components. #5879
* [ENHANCEMENT] Query-scheduler: improve latency with many concurrent queriers. #5880
* [ENHANCEMENT] Go: updated to 1.21.1. #5955
* [ENHANCEMENT] Ingester: added support for sampling errors, which can be enabled by setting `-ingester.error-sample-rate`. This way each error will be logged once in the configured number of times. #5584
* [BUGFIX] Ingester: fix spurious `not found` errors on label values API during head compaction. #5957
Collaborator:

This looks unrelated.
