Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distributor: improve error handling in otlp and push handler #8339

Merged
merged 9 commits into from
Jun 12, 2024

Conversation

duricanikolic
Copy link
Contributor

@duricanikolic duricanikolic commented Jun 11, 2024

What this PR does

Errors produced by distributors are of type distributor.Error. Depending on a use case, they are mapped to the corresponding gRPC and/or HTTP errors:

  • distributor.Push() converts them to errors with a gRPC status (by using gogo's status package) and, therefore, gRPC status codes.
  • distributor.handler() converts them to HTTP errors with HTTP status codes.
  • distributor.otlpHandler() converts them to an HTTP response, encoding a gRPC status, gRPC status code and HTTP status code.

In case of distributor.Push(), there is a method that does mapping between distributor.Error implementations and gRPC status codes (here). Similarly, there is another method that does mapping between distributor.Error implementations and OTLP-related gRPC codes (here). Although in most of the cases these 2 methods map the same distributor.Error errors to the same gRPC code, there are some differences. The following table shows the differences and a proposed conflict solution, in order to do a conversion to a gRPC status code in one place only:

error cause Push() otlpHandler() proposed solution motivation
BAD_DATA FailedPrecondition InvalidArgument InvalidArgument BAD_DATA represents the
errors caused by processing
bad input data, which fits well here.
REPLICAS_DID_NOT_MATCH AlreadyExists OK AlreadyExists AlreadyExists corresponds better
to mimirpb.REPLICAS_DID_NOT_MATCH
TOO_MANY_CLUSTERS FailedPrecondition InvalidArgument FailedPrecondition FailedPrecondition is used when a
system is not in a state required
for the operation's execution which
applies to this situation. The error
is caused by a system state, not by
an input data.
TSDB_UNAVAILABLE Unavailable Internal Internal TSDB_UNAVAILABLE are errors coming
from ingesters when they cannot get
data from TSDB. From the distributor
perspective it is an internal error,
because the distributor should not
know the ingester logic.

Last but not least, I am would like to replace the following code from distributor.otlpHandler():

var (
	httpCode int
	grpcCode codes.Code
	errorMsg string
)
if resp, ok := httpgrpc.HTTPResponseFromError(err); ok {
	s, _ := grpcutil.ErrorToStatus(err)
	httpCode = int(resp.Code)
	grpcCode = s.Code() // this will be the same as httpCode.
	errorMsg = string(resp.Body)
} else {
	grpcCode, httpCode = toOtlpGRPCHTTPStatus(err)
	errorMsg = err.Error()
}

with

var (
	httpCode int
	grpcCode codes.Code
	errorMsg string
)
if st, ok := grpcutil.ErrorToStatus(err); ok {
	httpCode = int(st.Code())
	grpcCode = s.Code()
	errorMsg = st.Message()
} else {
	grpcCode, httpCode = toOtlpGRPCHTTPStatus(err)
	errorMsg = err.Error()
}

This way we will avoid a double conversion from an error to a gRPC status that is currently done first in httpgrpc.HTTPResponseFromError() and then in grpcutil.ErrorToStatus().

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • [na] Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • [na] about-versioning.md updated with experimental features.

@duricanikolic duricanikolic self-assigned this Jun 11, 2024
@duricanikolic duricanikolic requested a review from a team as a code owner June 11, 2024 10:37
@duricanikolic duricanikolic force-pushed the yuri/distributor-error-handling branch from 09c80b6 to 4af491e Compare June 11, 2024 11:25
@aknuds1 aknuds1 self-requested a review June 11, 2024 12:01
CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Contributor

@aknuds1 aknuds1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comments, especially the one about otlpHandler.

pkg/distributor/errors.go Show resolved Hide resolved
pkg/distributor/errors_test.go Outdated Show resolved Hide resolved
pkg/distributor/errors_test.go Outdated Show resolved Hide resolved
pkg/distributor/errors_test.go Outdated Show resolved Hide resolved
pkg/distributor/errors_test.go Outdated Show resolved Hide resolved
pkg/distributor/errors_test.go Outdated Show resolved Hide resolved
pkg/distributor/errors_test.go Outdated Show resolved Hide resolved
grpcCode = s.Code() // this will be the same as httpCode.
errorMsg = string(resp.Body)
if st, ok := grpcutil.ErrorToStatus(err); ok {
httpCode = int(st.Code())
Copy link
Contributor

@aknuds1 aknuds1 Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed with @ying-jeanne, I think a problem here (carried over from @ying-jeanne's previous PR) is that httpCode doesn't get translated with regards to OTel's retriable HTTP status codes.

Should we possibly instead just map retryable Prometheus HTTP status codes to OTel ones? We could avoid duplication between toHTTPStatus and toOtlpGRPCHTTPStatus?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ying-jeanne was the goal of #8324 to map all 5xx HTTP status codes that are non-retryable according to the OTLP specification to retryable ones?
So far, the actual delta of #8324 is that 500 is mapped to 503. But all other 5xx errors different from 502, 503 and 504 are still non-retryable (for example 501). What do we want to do with 501? Do we want it to be retryable or non-retryable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a discussion in Slack we agreed that the mapping from Prometheus retryable HTTP status codes to OTLP retryable HTTP status codes should be:

Prometheus HTTP status code OTLP HTTP status code
2xx 2xx
4xx 4xx
500 503
501 503
502 502
503 503
504 504
>504 503

Copy link
Contributor

@ying-jeanne ying-jeanne Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ying-jeanne was the goal of #8324 to map all 5xx HTTP status codes that are non-retryable according to the OTLP specification to retryable ones? So far, the actual delta of #8324 is that 500 is mapped to 503. But all other 5xx errors different from 502, 503 and 504 are still non-retryable (for example 501). What do we want to do with 501? Do we want it to be retryable or non-retryable?

The original function contains only 500 and 501, and since 501 is not implemented and align with loki's fix https://github.com/grafana/loki/pull/13173/files, we just map 500 to 503. But now the fix is on all path, we should at least map 504+ to 503, and 529 to 429

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The table I posted above shows the current mimir mapping. We map 501 (coming from ingesters) to 503 (OTLP) because Prometheus retries 501 errors too.

…ng between otlp and push handler

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
@duricanikolic duricanikolic force-pushed the yuri/distributor-error-handling branch from de681cd to db54ce9 Compare June 12, 2024 11:35
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Copy link
Contributor

@aknuds1 aknuds1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks great to me! I saw some misleading (non-nit) error test case names though (which I've left suggestions for).

pkg/distributor/errors_test.go Outdated Show resolved Hide resolved
pkg/distributor/errors_test.go Outdated Show resolved Hide resolved
pkg/distributor/errors_test.go Outdated Show resolved Hide resolved
pkg/distributor/errors_test.go Outdated Show resolved Hide resolved
pkg/distributor/errors_test.go Outdated Show resolved Hide resolved
pkg/distributor/otel_test.go Outdated Show resolved Hide resolved
pkg/distributor/otel_test.go Outdated Show resolved Hide resolved
pkg/distributor/otel_test.go Outdated Show resolved Hide resolved
pkg/distributor/push.go Outdated Show resolved Hide resolved
pkg/distributor/push_test.go Show resolved Hide resolved
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Copy link
Contributor

@aknuds1 aknuds1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
@duricanikolic duricanikolic merged commit 9bf7fe4 into main Jun 12, 2024
29 checks passed
@duricanikolic duricanikolic deleted the yuri/distributor-error-handling branch June 12, 2024 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants