Distributor: improve error handling in otlp and push handler #8339

duricanikolic · 2024-06-11T10:37:29Z

What this PR does

Errors produced by distributors are of type distributor.Error. Depending on the use case, they are mapped to the corresponding gRPC and/or HTTP errors:

distributor.Push() converts them to errors with a gRPC status (by using gogo's status package) and, therefore, gRPC status codes.
distributor.handler() converts them to HTTP errors with HTTP status codes.
distributor.otlpHandler() converts them to an HTTP response, encoding a gRPC status, gRPC status code and HTTP status code.

For distributor.Push(), one method maps distributor.Error implementations to gRPC status codes. Similarly, another method maps distributor.Error implementations to OTLP-related gRPC codes. Although in most cases these two methods map the same distributor.Error to the same gRPC code, there are some differences. The following table shows the differences and a proposed resolution for each, so that the conversion to a gRPC status code happens in one place only:

error cause	`Push()`	`otlpHandler()`	proposed solution	motivation
`BAD_DATA`	`FailedPrecondition`	`InvalidArgument`	`InvalidArgument`	`BAD_DATA` represents the errors caused by processing bad input data, which fits well here.
`REPLICAS_DID_NOT_MATCH`	`AlreadyExists`	`OK`	`AlreadyExists`	`AlreadyExists` corresponds better to `mimirpb.REPLICAS_DID_NOT_MATCH`
`TOO_MANY_CLUSTERS`	`FailedPrecondition`	`InvalidArgument`	`FailedPrecondition`	`FailedPrecondition` is used when a system is not in the state required for the operation's execution, which applies to this situation. The error is caused by the system state, not by input data.
`TSDB_UNAVAILABLE`	`Unavailable`	`Internal`	`Internal`	`TSDB_UNAVAILABLE` errors come from ingesters when they cannot get data from TSDB. From the distributor's perspective this is an internal error, because the distributor should not know the ingester logic.
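
The proposed column of the table above can be sketched as a single lookup function. This is only an illustration, not the code from this PR: `ErrorCause`, `GRPCCode`, and `toGRPCCode` are hypothetical stand-ins for the mimirpb error cause enum and for `codes.Code`, and falling back to `Unknown` for causes not listed in the table is an assumption.

```go
package main

import "fmt"

// ErrorCause is a hypothetical stand-in for the mimirpb error cause enum.
type ErrorCause string

const (
	BadData             ErrorCause = "BAD_DATA"
	ReplicasDidNotMatch ErrorCause = "REPLICAS_DID_NOT_MATCH"
	TooManyClusters     ErrorCause = "TOO_MANY_CLUSTERS"
	TSDBUnavailable     ErrorCause = "TSDB_UNAVAILABLE"
)

// GRPCCode is a hypothetical stand-in for codes.Code from google.golang.org/grpc/codes.
type GRPCCode string

const (
	InvalidArgument    GRPCCode = "InvalidArgument"
	AlreadyExists      GRPCCode = "AlreadyExists"
	FailedPrecondition GRPCCode = "FailedPrecondition"
	Internal           GRPCCode = "Internal"
	Unknown            GRPCCode = "Unknown"
)

// toGRPCCode applies the "proposed solution" column of the table.
// Only the four conflicting causes are covered here; returning Unknown
// for anything else is an assumption made for this sketch.
func toGRPCCode(cause ErrorCause) GRPCCode {
	switch cause {
	case BadData:
		return InvalidArgument
	case ReplicasDidNotMatch:
		return AlreadyExists
	case TooManyClusters:
		return FailedPrecondition
	case TSDBUnavailable:
		return Internal
	default:
		return Unknown
	}
}

func main() {
	causes := []ErrorCause{BadData, ReplicasDidNotMatch, TooManyClusters, TSDBUnavailable}
	for _, c := range causes {
		fmt.Printf("%s -> %s\n", c, toGRPCCode(c))
	}
}
```

With one function like this, both Push() and otlpHandler() could share the same cause-to-code mapping instead of maintaining two slightly different ones.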

Last but not least, I would like to replace the following code from distributor.otlpHandler():

var (
	httpCode int
	grpcCode codes.Code
	errorMsg string
)
if resp, ok := httpgrpc.HTTPResponseFromError(err); ok {
	s, _ := grpcutil.ErrorToStatus(err)
	httpCode = int(resp.Code)
	grpcCode = s.Code() // this will be the same as httpCode.
	errorMsg = string(resp.Body)
} else {
	grpcCode, httpCode = toOtlpGRPCHTTPStatus(err)
	errorMsg = err.Error()
}

with

var (
	httpCode int
	grpcCode codes.Code
	errorMsg string
)
if st, ok := grpcutil.ErrorToStatus(err); ok {
	httpCode = int(st.Code())
	grpcCode = st.Code()
	errorMsg = st.Message()
} else {
	grpcCode, httpCode = toOtlpGRPCHTTPStatus(err)
	errorMsg = err.Error()
}

This way we will avoid a double conversion from an error to a gRPC status that is currently done first in httpgrpc.HTTPResponseFromError() and then in grpcutil.ErrorToStatus().

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

Tests updated.
[na] Documentation added.
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
[na] about-versioning.md updated with experimental features.

CHANGELOG.md

aknuds1

Please see my comments, especially the one about otlpHandler.

pkg/distributor/errors.go

pkg/distributor/errors_test.go

aknuds1 · 2024-06-11T16:20:38Z

pkg/distributor/otel.go

-				grpcCode = s.Code() // this will be the same as httpCode.
-				errorMsg = string(resp.Body)
+			if st, ok := grpcutil.ErrorToStatus(err); ok {
+				httpCode = int(st.Code())


Confirmed with @ying-jeanne: I think a problem here (carried over from @ying-jeanne's previous PR) is that httpCode doesn't get translated with regard to OTel's retryable HTTP status codes.

Should we possibly instead just map retryable Prometheus HTTP status codes to OTel ones? We could avoid duplication between toHTTPStatus and toOtlpGRPCHTTPStatus?

@ying-jeanne was the goal of #8324 to map all 5xx HTTP status codes that are non-retryable according to the OTLP specification to retryable ones?
So far, the actual delta of #8324 is that 500 is mapped to 503. But all other 5xx errors different from 502, 503 and 504 are still non-retryable (for example 501). What do we want to do with 501? Do we want it to be retryable or non-retryable?

After a discussion in Slack we agreed that the mapping from Prometheus retryable HTTP status codes to OTLP retryable HTTP status codes should be:

Prometheus HTTP status code	OTLP HTTP status code
`2xx`	`2xx`
`4xx`	`4xx`
`500`	`503`
`501`	`503`
`502`	`502`
`503`	`503`
`504`	`504`
`>504`	`503`
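
The agreed mapping can be sketched as a small helper. This is a minimal illustration, not the function merged in this PR: the name `toOTLPHTTPStatus` is hypothetical, and passing every code below 500 through unchanged (including 1xx and 3xx, which the table does not mention) is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// toOTLPHTTPStatus is a hypothetical helper that translates a Prometheus
// HTTP status code into the OTLP HTTP status code from the table above.
func toOTLPHTTPStatus(code int) int {
	// 2xx and 4xx are kept as they are. Codes below 500 that the table
	// does not list are also passed through (an assumption of this sketch).
	if code < 500 {
		return code
	}
	switch code {
	case http.StatusBadGateway, http.StatusServiceUnavailable, http.StatusGatewayTimeout:
		// 502, 503 and 504 are kept as they are.
		return code
	default:
		// 500, 501 and everything above 504 become 503.
		return http.StatusServiceUnavailable
	}
}

func main() {
	for _, code := range []int{200, 400, 500, 501, 502, 503, 504, 505} {
		fmt.Printf("%d -> %d\n", code, toOTLPHTTPStatus(code))
	}
}
```

Keeping the translation in one helper means the OTLP handler can reuse the regular HTTP status mapping and apply this step last.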

> @ying-jeanne was the goal of #8324 to map all 5xx HTTP status codes that are non-retryable according to the OTLP specification to retryable ones? So far, the actual delta of #8324 is that 500 is mapped to 503. But all other 5xx errors different from 502, 503 and 504 are still non-retryable (for example 501). What do we want to do with 501? Do we want it to be retryable or non-retryable?

The original function contains only 500 and 501. Since 501 means Not Implemented, and to align with Loki's fix (https://github.com/grafana/loki/pull/13173/files), we just mapped 500 to 503. But now that the fix applies to all paths, we should at least map 504+ to 503, and 529 to 429.

The table I posted above shows the current Mimir mapping. We map 501 (coming from ingesters) to 503 (OTLP) because Prometheus retries 501 errors too.

aknuds1

Generally looks great to me! I saw some misleading (non-nit) error test case names though (which I've left suggestions for).

pkg/distributor/errors_test.go

pkg/distributor/otel_test.go

pkg/distributor/push.go

pkg/distributor/push_test.go

aknuds1

LGTM, thanks!

duricanikolic self-assigned this Jun 11, 2024

duricanikolic requested a review from a team as a code owner June 11, 2024 10:37

duricanikolic force-pushed the yuri/distributor-error-handling branch from 09c80b6 to 4af491e Compare June 11, 2024 11:25

aknuds1 self-requested a review June 11, 2024 12:01

pstibrany reviewed Jun 11, 2024

aknuds1 reviewed Jun 11, 2024

aknuds1 mentioned this pull request Jun 12, 2024

mapping no retryable 5xx errors to retryable error in otlp handler #8310

Closed

aknuds1 self-requested a review June 12, 2024 11:17

duricanikolic added 5 commits June 12, 2024 13:35

Distributor: remove redundant code and incongruencies in error handli…

cd90e5d

…ng between otlp and push handler Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

Fixing failing tests

79551bd

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

Fixing review findings

1d97dde

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

Removing CHANGELONG entries

f8e7825

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

Fixing a failing test

db54ce9

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

duricanikolic force-pushed the yuri/distributor-error-handling branch from de681cd to db54ce9 Compare June 12, 2024 11:35

Fixing failing tests

aca8ade

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

aknuds1 requested changes Jun 12, 2024

Fixing review findings

8f36fc9

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

aknuds1 approved these changes Jun 12, 2024

duricanikolic added 2 commits June 12, 2024 14:57

Updating CHANGELOG

130863e

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

Merge remote-tracking branch 'origin/main' into yuri/distributor-erro…

7af8a38

…r-handling

duricanikolic merged commit 9bf7fe4 into main Jun 12, 2024
29 checks passed

duricanikolic deleted the yuri/distributor-error-handling branch June 12, 2024 15:00
