-
Notifications
You must be signed in to change notification settings - Fork 524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Distributor: improve error handling in otlp and push handler #8339
Conversation
09c80b6
to
4af491e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my comments, especially the one about otlpHandler
.
pkg/distributor/otel.go
Outdated
grpcCode = s.Code() // this will be the same as httpCode. | ||
errorMsg = string(resp.Body) | ||
if st, ok := grpcutil.ErrorToStatus(err); ok { | ||
httpCode = int(st.Code()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed with @ying-jeanne, I think a problem here (carried over from @ying-jeanne's previous PR) is that httpCode
doesn't get translated with regards to OTel's retriable HTTP status codes.
Should we possibly instead just map retryable Prometheus HTTP status codes to OTel ones? We could avoid duplication between toHTTPStatus
and toOtlpGRPCHTTPStatus
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ying-jeanne was the goal of #8324 to map all 5xx
HTTP status codes that are non-retryable according to the OTLP specification to retryable ones?
So far, the actual delta of #8324 is that 500
is mapped to 503
. But all other 5xx
errors different from 502
, 503
and 504
are still non-retryable (for example 501
). What do we want to do with 501
? Do we want it to be retryable or non-retryable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a discussion in Slack we agreed that the mapping from Prometheus retryable HTTP status codes to OTLP retryable HTTP status codes should be:
Prometheus HTTP status code | OTLP HTTP status code |
---|---|
2xx |
2xx |
4xx |
4xx |
500 |
503 |
501 |
503 |
502 |
502 |
503 |
503 |
504 |
504 |
>504 |
503 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ying-jeanne was the goal of #8324 to map all
5xx
HTTP status codes that are non-retryable according to the OTLP specification to retryable ones? So far, the actual delta of #8324 is that500
is mapped to503
. But all other5xx
errors different from502
,503
and504
are still non-retryable (for example501
). What do we want to do with501
? Do we want it to be retryable or non-retryable?
The original function contains only 500 and 501, and since 501 is not implemented
and align with loki's fix https://github.com/grafana/loki/pull/13173/files, we just map 500 to 503. But now the fix is on all path, we should at least map 504+ to 503, and 529 to 429
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The table I posted above shows the current mimir mapping. We map 501 (coming from ingesters) to 503 (OTLP) because Prometheus retries 501 errors too.
…ng between otlp and push handler Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
de681cd
to
db54ce9
Compare
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks great to me! I saw some misleading (non-nit) error test case names though (which I've left suggestions for).
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
What this PR does
Errors produced by distributors are of type
distributor.Error
. Depending on a use case, they are mapped to the corresponding gRPC and/or HTTP errors:distributor.Push()
converts them to errors with a gRPC status (by using gogo'sstatus
package) and, therefore, gRPC status codes.distributor.handler()
converts them to HTTP errors with HTTP status codes.distributor.otlpHandler()
converts them to an HTTP response, encoding a gRPC status, gRPC status code and HTTP status code.In case of
distributor.Push()
, there is a method that does mapping betweendistributor.Error
implementations and gRPC status codes (here). Similarly, there is another method that does mapping betweendistributor.Error
implementations and OTLP-related gRPC codes (here). Although in most of the cases these 2 methods map the samedistributor.Error
errors to the same gRPC code, there are some differences. The following table shows the differences and a proposed conflict solution, in order to do a conversion to a gRPC status code in one place only:Push()
otlpHandler()
BAD_DATA
FailedPrecondition
InvalidArgument
InvalidArgument
BAD_DATA
represents theerrors caused by processing
bad input data, which fits well here.
REPLICAS_DID_NOT_MATCH
AlreadyExists
OK
AlreadyExists
AlreadyExists
corresponds betterto
mimirpb.REPLICAS_DID_NOT_MATCH
TOO_MANY_CLUSTERS
FailedPrecondition
InvalidArgument
FailedPrecondition
FailedPrecondition
is used when asystem is not in a state required
for the operation's execution which
applies to this situation. The error
is caused by a system state, not by
an input data.
TSDB_UNAVAILABLE
Unavailable
Internal
Internal
TSDB_UNAVAILABLE
are errors comingfrom ingesters when they cannot get
data from TSDB. From the distributor
perspective it is an internal error,
because the distributor should not
know the ingester logic.
Last but not least, I am would like to replace the following code from
distributor.otlpHandler()
:with
This way we will avoid a double conversion from an error to a gRPC status that is currently done first in
httpgrpc.HTTPResponseFromError()
and then ingrpcutil.ErrorToStatus()
.Which issue(s) this PR fixes or relates to
Fixes #
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]
.about-versioning.md
updated with experimental features.