
ingest consumer: more granular error handling, committer sanity check #6951

Merged: 4 commits into main, Dec 18, 2023

Conversation

dimitarvdimitrov (Contributor)

Follow-up of #6929

  • check that the offset we're committing is certainly from the partition we're committing to (a sketch of such a check is shown below)
  • process a fetch even when it contains some errors; this allows processing fetches with partial data


Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
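
For illustration, a minimal sketch of what the committer sanity check could look like on top of franz-go's kgo types. The helper name and exact wiring are assumptions, not the code from this PR:

```go
package ingest

import (
	"fmt"

	"github.com/twmb/franz-go/pkg/kgo"
)

// lastOffsetToCommit is a hypothetical helper: it walks the fetched records
// and returns the highest offset seen, but refuses to produce an offset if
// any record came from a topic/partition other than the one this reader owns.
func lastOffsetToCommit(fetches kgo.Fetches, topic string, partition int32) (int64, error) {
	lastOffset := int64(-1)
	var err error
	fetches.EachRecord(func(rec *kgo.Record) {
		if rec.Topic != topic || rec.Partition != partition {
			err = fmt.Errorf("record from unexpected topic/partition %s/%d, expected %s/%d",
				rec.Topic, rec.Partition, topic, partition)
			return
		}
		if rec.Offset > lastOffset {
			lastOffset = rec.Offset
		}
	})
	return lastOffset, err
}
```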
r.recordFetchesMetrics(fetches)
r.logFetchErrs(fetches)
Collaborator

I'm still not super convinced about this approach. I think mixing error and non-error fetches in these functions could be error-prone. What if we construct a slice of successful fetches (which is the slice returned by PollFetches() if there are no errors) and then call recordFetchesMetrics(), consumeFetches() and enqueueCommit(), passing only the successful fetches?

Contributor Author

The fetches abstraction is difficult to work with in pure Go in the first place, and we use its iterators everywhere (iterating over partitions or records), but filtering out the error fetches may make later reasoning easier. I did that in the latest commit, PTAL.
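
For illustration, a rough sketch of such a filter over franz-go's kgo.Fetches; the helper names are assumptions and may not match what the commit actually adds:

```go
package ingest

import "github.com/twmb/franz-go/pkg/kgo"

// filterOutErrFetches returns only the fetches whose partitions carry no
// error, so that metrics recording, consumption, and committing never see
// errored fetches. (Sketch only; the committed helper may differ.)
func filterOutErrFetches(fetches kgo.Fetches) kgo.Fetches {
	okFetches := make(kgo.Fetches, 0, len(fetches))
	for _, f := range fetches {
		if !fetchHasErr(f) {
			okFetches = append(okFetches, f)
		}
	}
	return okFetches
}

// fetchHasErr reports whether any topic/partition in the fetch carries an error.
func fetchHasErr(f kgo.Fetch) bool {
	for _, t := range f.Topics {
		for _, p := range t.Partitions {
			if p.Err != nil {
				return true
			}
		}
	}
	return false
}
```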

@@ -341,6 +340,10 @@ func newReaderMetrics(partitionID int32, reg prometheus.Registerer) readerMetric
Help: "The number of records received by the consumer in a single fetch operation.",
Buckets: prometheus.ExponentialBuckets(1, 2, 15),
}),
fetchesErrors: factory.NewCounter(prometheus.CounterOpts{
Name: "cortex_ingest_storage_reader_fetch_errors_total",
Collaborator

Do we already have a metric with the total number of fetches, to compute a % of failing ones?

Contributor Author

Maybe we can infer something from the franz-go metrics, but I'd prefer to avoid that (is a batch the same as a fetch?). I added a counter for the number of fetches too.
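
For reference, a sketch of the resulting pair of counters; the error counter name comes from the diff above, while the total-fetches counter name is an assumption:

```go
package ingest

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// newFetchCounters is a sketch: with both counters registered, the failing
// fetch percentage can be computed as rate(errors) / rate(total).
func newFetchCounters(reg prometheus.Registerer) (fetchesTotal, fetchErrors prometheus.Counter) {
	factory := promauto.With(reg)
	fetchesTotal = factory.NewCounter(prometheus.CounterOpts{
		// Assumed name; only the error counter name appears in the diff above.
		Name: "cortex_ingest_storage_reader_fetches_total",
		Help: "Total number of Kafka fetch operations performed by the consumer.",
	})
	fetchErrors = factory.NewCounter(prometheus.CounterOpts{
		Name: "cortex_ingest_storage_reader_fetch_errors_total",
		Help: "The number of fetch errors encountered by the consumer.",
	})
	return fetchesTotal, fetchErrors
}
```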

Collaborator

@pracucci pracucci left a comment

Thanks for addressing my feedback. Approved. I just have a doubt about the receiveDelay metric, which could end up being tracked for errored fetches too.

Also, I'm wondering if we could enhance existing tests to assert on new logic too (e.g. new metrics).

level.Error(r.logger).Log("msg", "encountered error while fetching", "err", err)
continue
}

r.recordFetchesMetrics(fetches)
Collaborator

Could some of the metrics be skewed if we track them for errored fetches too?

Collaborator

I'm thinking of receiveDelay in particular.

Contributor Author

Whenever a fetch has an error it never has records, so at most we could get misleading values for the number of records per fetch and for the number of fetches we do from Kafka. Since some error fetches can be synthetic (i.e. not actually returned from Kafka), I moved metrics recording after filtering out error fetches.
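
Putting the thread together, a rough sketch of the resulting poll-loop ordering (method and helper names are assumptions; it reuses the filterOutErrFetches helper sketched earlier): errors are logged first, errored fetches are dropped, and only then are metrics recorded and the remaining records consumed and committed.

```go
package ingest

import (
	"context"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
	"github.com/twmb/franz-go/pkg/kgo"
)

// pollLoopSketch illustrates the ordering discussed above; the real reader
// holds more state and uses different names.
func pollLoopSketch(ctx context.Context, client *kgo.Client, logger log.Logger,
	recordMetrics, consume, enqueueCommit func(kgo.Fetches)) {
	for ctx.Err() == nil {
		fetches := client.PollFetches(ctx)

		// Log every fetch error but keep going: a fetch can still carry partial data.
		fetches.EachError(func(topic string, partition int32, err error) {
			level.Error(logger).Log("msg", "encountered error while fetching", "topic", topic, "partition", partition, "err", err)
		})

		// Drop errored fetches (some can be synthetic, i.e. not returned by Kafka)
		// so that metrics, consumption, and committing only see real records.
		fetches = filterOutErrFetches(fetches)

		recordMetrics(fetches)
		consume(fetches)
		enqueueCommit(fetches)
	}
}
```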

pkg/storage/ingest/reader.go (thread resolved)
@dimitarvdimitrov (Contributor Author)

> Also, I'm wondering if we could enhance existing tests to assert on new logic too (e.g. new metrics).

It's difficult with testutil.GatherAndCompare. The metrics aren't reliable, since records can arrive in a single fetch or spread over multiple fetches. The number of errors isn't consistent either, because retries are infinite, so there's a race condition in the test.

@dimitarvdimitrov dimitarvdimitrov merged commit 9de67c5 into main Dec 18, 2023
28 checks passed
@dimitarvdimitrov dimitarvdimitrov deleted the dimitar/ingest/address-post-merge-fb-on-6929 branch December 18, 2023 14:21