
Sampled logging: log only 1 in N of specific errors #5584

Merged: 24 commits merged into main from sampled-error-logging on Sep 12, 2023

Conversation

@bboreham (Contributor) commented on Jul 25, 2023

What this PR does

This work was initially done by @bboreham and then inherited by @duricanikolic.

This PR adds support for sampling ingester errors. Sampling can be enabled by setting the newly added experimental CLI flag -ingester.error-sample-rate to a positive value; each ingester error is then logged once per configured number of occurrences. For backwards compatibility, the default value of -ingester.error-sample-rate is 0, which disables sampling.

For each error type, a new Sampler is created. Initialization and ownership of the Samplers live in ingester.Limiter, because that was convenient. This means sampling is per-tenant, which makes some sense, but it also means ingesters with many tenants will log more.

There is no specific metric counting the dropped log lines. The discarded samples themselves are already counted in cortex_discarded_samples_total.
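
For illustration, here is a minimal sketch of the 1-in-N mechanism. The Sampler and WrapError names appear in the diffs below, and the "(sampled 1/N)" decoration matches the commit messages, but the body here is an assumption, not the actual pkg/util/log/sampled.go implementation:

```go
package log

import (
	"fmt"
	"sync/atomic"
)

// Sampler decides which occurrences of an error get logged.
// A zero or negative freq disables sampling: everything is logged.
type Sampler struct {
	freq  int64        // log 1 in every freq occurrences
	count atomic.Int64 // occurrences seen so far
}

func NewSampler(freq int64) *Sampler { return &Sampler{freq: freq} }

// Sample reports whether the current occurrence should be logged:
// the 1st, the (freq+1)-th, the (2*freq+1)-th, and so on.
func (s *Sampler) Sample() bool {
	if s == nil || s.freq <= 0 {
		return true
	}
	return (s.count.Add(1)-1)%s.freq == 0
}

// WrapError decorates err with the sampling rate, so a reader of the
// logs knows each logged line stands for roughly freq occurrences.
func (s *Sampler) WrapError(err error) error {
	if s == nil || s.freq <= 0 || err == nil {
		return err
	}
	return fmt.Errorf("%w (sampled 1/%d)", err, s.freq)
}
```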

Which issue(s) this PR fixes or relates to

Relates to #1900, #5894, #6008

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Review thread on pkg/util/log/sampled.go (outdated, resolved)
Review thread on pkg/ingester/limiter.go (outdated, resolved):
validation.MaxSeriesPerUserFlag,
))
return log.SampledError{
Sampler: l.sampler,
Contributor:

There's still just one sampler here, but I'd be specific about what this sampler is for, as I assume we don't want to reuse the same sampler for different errors (because that would make it impossible to infer the number of errors logged).

Contributor Author:

I had considered one sampler per ingester, and one per tenant.
Both seem to me to allow estimation of the true volume; can you explain why not?

Contributor:

My concern is that in the future someone would add a new error here and reuse the same sampler, because it's called "sampler" and not "perUserSeriesErrorSampler".

Anyway, this is totally unimportant, and definitely not blocking.

Contributor Author:

I added another 6 errors. As I understand your comment, I am doing exactly what you didn't want.

@bboreham changed the title from "WIP: Sampled logging: log only 1 in 1000 of specific errors" to "WIP: Sampled logging: log only 1 in N of specific errors" on Jul 27, 2023
Comment on lines 913 to 930
		return i.limiter.sampler.WrapError(newIngestErrSampleTimestampTooOld(model.Time(timestamp), labels))
	})
	return true

case storage.ErrOutOfOrderSample:
	stats.sampleOutOfOrderCount++
	updateFirstPartial(func() error {
-		return newIngestErrSampleOutOfOrder(model.Time(timestamp), labels)
+		return i.limiter.sampler.WrapError(newIngestErrSampleOutOfOrder(model.Time(timestamp), labels))
Contributor:

This is what I was talking about on the previous PR when I mentioned that we should have different samplers.

With a single sampler, there's no way to tell how many of each error are happening, right? There might be 27 "too old" errors and 3 "out of order" errors, and we could still see only the "out of order" ones.

Contributor Author:

That is very difficult to arrange. In real life the different kinds of errors will be randomly interleaved, and over time the sampler will provide a statistical picture of them.
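
(An editorial illustration of that statistical picture: suppose "too old" errors arrive at 27/s and "out of order" at 3/s, and the sample rate is N = 10. A single shared sampler logs 3 lines/s, and because the two types interleave randomly, on average 2.7/s of the logged lines are "too old" and 0.3/s are "out of order"; multiplying each type's logged rate by N recovers roughly 27/s and 3/s. Only a pathologically regular arrival pattern, such as every 10th error being "out of order", would hide one type entirely.)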

Contributor:

I don't think that having a separate sampler for each error would be that difficult to arrange.

But with the current approach I'd rather just drop these errors and not log them at all, as I don't think it adds much value to see some log lines from time to time without being able to tell how many of each one are happening.

Contributor Author:

This is absolutely not my experience. I added sampled "events" to Cortex and used them every week to investigate cardinality and other issues (removed in #766 because the feature was never added to blocks storage).

Contributor Author:

BTW, in case it was unclear: when I said "that is very difficult to arrange", I meant getting the 27 errors and the 3 errors to come out in exactly such a pattern that you only ever see the 3.

Contributor:

Quick update on this: to be clear, I'm not blocking, because I also wouldn't complain if we completely removed these errors. I think this is the typical thing where Murphy's Law will play on me, but I also have to trust your experience: if you think it's useful, then it probably is.

Contributor:

I have updated the PR by creating a different sampler for each new error type.
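
A hypothetical sketch of what one-sampler-per-error-type could look like in the limiter, reusing the Sampler sketch above (the struct and field names here are illustrative, not the PR's actual identifiers); with a dedicated sampler per type, each type's logged count can be scaled by N independently:

```go
// errorSamplers holds one Sampler per error type, so the true volume of
// each type can be estimated from its own logged lines.
type errorSamplers struct {
	sampleTimestampTooOld *log.Sampler
	sampleOutOfOrder      *log.Sampler
	maxSeriesPerUser      *log.Sampler
}

func newErrorSamplers(freq int64) errorSamplers {
	return errorSamplers{
		sampleTimestampTooOld: log.NewSampler(freq),
		sampleOutOfOrder:      log.NewSampler(freq),
		maxSeriesPerUser:      log.NewSampler(freq),
	}
}
```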

@colega (Contributor) commented on Aug 7, 2023

Would it make sense to make the sampling rate a runtime config? If something is going on, I might want to peek into the logs without having to restart the ingesters.

@bboreham (Contributor, Author) commented on Aug 9, 2023

> Would it make sense to make the sampling rate a runtime config? If something is going on, I might want to peek into the logs without having to restart the ingesters.

Yes, I looked into this. It was very hard to plumb through pushed updates, and somewhat annoying to do periodic checks like dskit/limiter does. I figured I would get a basic version working first.

Specifically, adding an atomic variable or a lock around the sample frequency was annoying.
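
As a sketch of the direction being discussed (not something this PR implements, and SetFreq is a hypothetical name), making the frequency itself atomic would let a runtime-config reload change the rate without restarting ingesters:

```go
import "sync/atomic"

// ReloadableSampler is a hypothetical Sampler variant whose rate can be
// changed at runtime, e.g. from a runtime-config reload loop.
type ReloadableSampler struct {
	freq  atomic.Int64 // current 1-in-N rate; <= 0 disables sampling
	count atomic.Int64
}

// SetFreq atomically swaps in a new sampling rate.
func (s *ReloadableSampler) SetFreq(freq int64) { s.freq.Store(freq) }

// Sample loads the rate atomically on every call, so concurrent SetFreq
// calls need no extra locking.
func (s *ReloadableSampler) Sample() bool {
	freq := s.freq.Load()
	if freq <= 0 {
		return true
	}
	return (s.count.Add(1)-1)%freq == 0
}
```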

@bboreham (Contributor, Author) commented on Aug 9, 2023

I have rebased and fixed up tests.

@bboreham changed the title from "WIP: Sampled logging: log only 1 in N of specific errors" to "Sampled logging: log only 1 in N of specific errors" on Aug 14, 2023
Review thread on pkg/ingester/ingester.go (outdated, resolved)

@duricanikolic force-pushed the sampled-error-logging branch 2 times, most recently from 57f37ff to e27a3b1 on September 5, 2023 at 11:45
duricanikolic and others added 17 commits on September 12, 2023, each signed off by Yuri Nikolic <durica.nikolic@grafana.com>. Commit messages include:

  • Error sampling rate is configurable.
  • Sampled errors are decorated like "(sampled 1/N)".
  • These are likely to be high-volume.
  • Check it contains the string instead.
@colega (Contributor) left a review comment:

LGTM, thank you for addressing all the feedback.

@colega enabled auto-merge (squash) on September 12, 2023 at 15:55
@colega merged commit fa33346 into main on Sep 12, 2023
28 checks passed
@colega deleted the sampled-error-logging branch on September 12, 2023 at 16:14
@@ -8,6 +8,9 @@
* [ENHANCEMENT] Query-frontend: add `cortex_query_frontend_enqueue_duration_seconds` metric that records the time taken to enqueue or reject a query request when not using the query-scheduler. #5879
* [ENHANCEMENT] Expose `/sync/mutex/wait/total:seconds` Go runtime metric as `go_sync_mutex_wait_total_seconds_total` from all components. #5879
* [ENHANCEMENT] Query-scheduler: improve latency with many concurrent queriers. #5880
* [ENHANCEMENT] Go: updated to 1.21.1. #5955
Collaborator:

This looks unrelated.

@@ -8,6 +8,9 @@
* [ENHANCEMENT] Query-frontend: add `cortex_query_frontend_enqueue_duration_seconds` metric that records the time taken to enqueue or reject a query request when not using the query-scheduler. #5879
* [ENHANCEMENT] Expose `/sync/mutex/wait/total:seconds` Go runtime metric as `go_sync_mutex_wait_total_seconds_total` from all components. #5879
* [ENHANCEMENT] Query-scheduler: improve latency with many concurrent queriers. #5880
* [ENHANCEMENT] Go: updated to 1.21.1. #5955
* [ENHANCEMENT] Ingester: added support for sampling errors, which can be enabled by setting `-ingester.error-sample-rate`. This way each error will be logged once in the configured number of times. #5584
* [BUGFIX] Ingester: fix spurious `not found` errors on label values API during head compaction. #5957
Collaborator:

This looks unrelated.
