All errors should be safe to be wrapped and retained #6008

Closed
colega opened this issue Sep 12, 2023 · 3 comments
colega (Contributor) commented Sep 12, 2023

Is your feature request related to a problem? Please describe.

Some of our errors retain references to unsafely built objects, such as labels:

func makeMetricLimitError(labels labels.Labels, err error) error {
	return validationError{
		err:    err,
		code:   http.StatusBadRequest,
		labels: labels,
	}
}

@pracucci said:

some errors carry the series labels but these series labels are only safe to read during the execution of push() because they're unmarshalled from the write request into a pool. So before returning, we effectively "make a copy" to ensure the error will be safe after the push() has returned.

As a retrospective, we could have better handled the returned error, maybe implementing a specific interface for unsafe error messages, because it's currently very obscure. But that's for another day :)

And I think that day has come: it's time to do it. Errors are meant to escape the function that creates them, and there's no way to ensure they won't also escape the caller. It's only a matter of time before this causes corrupted error messages, or something worse.

Describe the solution you'd like

Any error should be safe to store and retain indefinitely, even if that makes error creation/formatting more expensive.

Describe alternatives you've considered

Just ignore the problem.

@colega colega changed the title All errors should be safe to wrap All errors should be safe to be wrapped and retained Sep 12, 2023
duricanikolic (Contributor) commented

We have taken the first steps towards creating safe errors: this PR introduces the following interface in the ingester package:

type safeToWrap interface {
  safeToWrap()
}

as well as a type implementing that interface:

type safeToWrapError string

func (s safeToWrapError) safeToWrap() {}

func (s safeToWrapError) Error() string {
	return string(s)
}

Most of the errors returned by the ingester are now created by calling safeToWrapError(msg).

@duricanikolic duricanikolic self-assigned this Oct 26, 2023
duricanikolic (Contributor) commented Nov 24, 2023

This comment describes the current status of this issue. So far, error handling has been improved in the following components: ingester and distributor.

Main Achievements

  • Consistency.
  • Ability to send additional details about the error via a gRPC communication.
  • Use gRPC status codes as status_code label in the request duration metrics and in Mimir's dashboards.

Report gRPC codes as labels in cortex and ingester_client request duration metrics

Before these improvements, the possible values of the status_code label of the cortex_request_duration_seconds and cortex_ingester_client_request_duration_seconds metrics were:

  • success or 2xx, representing successful statuses,
  • 4xx, representing all HTTP status codes starting with 4,
  • 5xx, representing all HTTP status codes starting with 5,
  • error, representing all other cases.

It therefore wasn't possible to use any specific gRPC code as a status_code value; that is no longer the case.

  • Setting the -server.report-grpc-codes-in-instrumentation-label-enabled CLI flag to true enables reporting gRPC codes as values of the status_code label in the cortex_request_duration_seconds metric. It defaults to false, meaning that successful and erroneous gRPC status codes are represented as success and error respectively.
  • Setting the -ingester.client.report-grpc-codes-in-instrumentation-label-enabled CLI flag to true enables reporting gRPC codes as values of the status_code label in the cortex_ingester_client_request_duration_seconds metric. It defaults to false, meaning that successful and erroneous gRPC status codes are represented as 2xx and error respectively.

Additional error details

Additional details about the gRPC errors returned by ingester and distributor are modelled by the following protobuf messages belonging to the mimirpb package:

message ErrorDetails {
  ErrorCause Cause = 1;
}

enum ErrorCause {
  UNKNOWN_CAUSE = 0;
  REPLICAS_DID_NOT_MATCH = 1;
  TOO_MANY_CLUSTERS = 2;
  BAD_DATA = 3;
  INGESTION_RATE_LIMITED = 4;
  REQUEST_RATE_LIMITED = 5;
  INSTANCE_LIMIT = 6;
  SERVICE_UNAVAILABLE = 7;
  TSDB_UNAVAILABLE = 8;
  TOO_BUSY = 9;
}

Ingester

The ingester is divided into two parts: Ingester Client (IC) and Ingester Business Logic (IS). Ingester-related errors occurring on the IS side implement the internal ingesterError interface, defined like this:

type ingesterError interface {
  errorCause() mimirpb.ErrorCause
}

IS and IC communicate via gRPC, and the errors exchanged between them should also carry gRPC status codes. Previously, this wasn't always the case. For example, the gRPC error corresponding to err-mimir-max-series-per-user (of type ingesterError) used to be created by dskit's httpgrpc package with code http.StatusBadRequest (400) and then propagated to the IC. With the improved error handling, this error is created using gogo's status package, with gRPC code codes.FailedPrecondition and error cause mimirpb.BAD_DATA. The table at the end of this section shows the differences between the errors transmitted from IS to IC before and after the improvement.

In order to switch to the new way of creating gRPC errors within IS, it is necessary to set the newly introduced experimental CLI flag -ingester.return-only-grpc-errors to true. The default value of this flag is false, meaning that by default ingesters keep the semantics they had before the error handling improvement; this preserves backwards compatibility.

| error | `-ingester.return-only-grpc-errors=false` | `-ingester.return-only-grpc-errors=true` |
|---|---|---|
| sample errors<br>exemplar errors<br>tsdb ingest exemplar errors<br>err-mimir-max-series-per-user<br>err-mimir-max-metadata-per-user<br>err-mimir-max-series-per-metric<br>err-mimir-max-metadata-per-metric | created by: httpgrpc.Errorf()<br>code: http.StatusBadRequest | created by: status.New()<br>code: codes.FailedPrecondition<br>cause: mimirpb.BAD_DATA |
| tsdb unavailable error | created by: httpgrpc.Errorf()<br>code: http.StatusServiceUnavailable | created by: status.New()<br>code: codes.Internal<br>cause: mimirpb.TSDB_UNAVAILABLE |
| ingester unavailable error | created by: status.New()<br>code: codes.Unavailable<br>remark: opens circuit breaker | created by: status.New()<br>code: codes.Unavailable<br>cause: mimirpb.SERVICE_UNAVAILABLE<br>remark: opens circuit breaker |
| instance limit reached error | created by: status.New()<br>code: codes.Unavailable<br>remark: non-loggable, opens circuit breaker | created by: status.New()<br>code: codes.Unavailable<br>cause: mimirpb.INSTANCE_LIMIT<br>remark: non-loggable, opens circuit breaker |
| other non-ingester-related errors (not implementing the ingesterError interface) | propagated by Ingester.Push()<br>implicitly assigned codes.Unknown by gRPC | created by: status.New()<br>code: codes.Internal<br>cause: mimirpb.INVALID |

Distributor

The distributor is divided into two parts: Distributor Client (DC) and Distributor Business Logic (DS). Distributor-related errors occurring on the DS side implement the internal distributorError interface, defined like this:

type distributorError interface {
  errorCause() mimirpb.ErrorCause
}

DS and DC communicate via gRPC, and the errors exchanged between them should also carry gRPC status codes. Previously, this wasn't always the case. For example, the gRPC error corresponding to err-mimir-tenant-max-ingestion-rate (of type distributorError) used to be created by dskit's httpgrpc package with code http.StatusTooManyRequests (429) and then propagated to the DC. With the improved error handling, this error is created using gogo's status package, with gRPC code codes.ResourceExhausted and error cause mimirpb.INGESTION_RATE_LIMITED. The table at the end of this section shows the differences between the errors transmitted from DS to DC before and after the improvement.

On the other hand, the distributor's push handler translates DS errors into HTTP errors. The semantics of the push handler have not been changed by this improvement.

No distributor-related configuration is needed to enable the new distributor error handling; it is enabled by default, meaning that all errors between DS and DC are gRPC errors with gRPC error codes and appropriate error details. It is worth noting that if -ingester.return-only-grpc-errors is false, gRPC errors with an HTTP status code returned by the ingester are propagated from DS to DC with that same code.

| error type | gRPC code | ErrorCause |
|---|---|---|
| validation errors | codes.FailedPrecondition | mimirpb.VALIDATION |
| ingestion rate limited error | codes.ResourceExhausted | mimirpb.INGESTION_RATE_LIMITED |
| request rate limited error | codes.Unavailable or codes.ResourceExhausted | mimirpb.REQUEST_RATE_LIMITED |
| replicas did not match error | codes.AlreadyExists | mimirpb.REPLICAS_DID_NOT_MATCH |
| too many clusters error | codes.FailedPrecondition | mimirpb.TOO_MANY_CLUSTERS |
| ingester errors with a gRPC code | the gRPC code | the mimirpb.ErrorCause extracted from the error |
| ingester errors with an HTTP status code | the HTTP status code | none |
| other errors | codes.Internal | none |

duricanikolic (Contributor) commented Dec 13, 2023

Migration plan

Current status

  • -server.report-grpc-codes-in-instrumentation-label-enabled has been rolled out gradually since weekly release r266.
  • -ingester.client.report-grpc-codes-in-instrumentation-label-enabled is enabled as of weekly release r272.
  • -ingester.return-only-grpc-errors has been rolled out gradually since weekly release r267.
    This means that by the beginning of February 2024 all our cells will have been migrated to the improved error handling.

Proposal

If no problems arise while rolling out the newly introduced CLI flags to the production cells, the migration plan is as follows:

Mimir 2.12.0

  • The default value of -server.report-grpc-codes-in-instrumentation-label-enabled will be set to true, and the CLI flag will be deprecated.
  • The default value of -ingester.client.report-grpc-codes-in-instrumentation-label-enabled will be set to true, and the CLI flag will be deprecated.
  • The default value of -ingester.return-only-grpc-errors will be set to true, and the CLI flag will be deprecated.

Mimir 2.14.0

  • CLI flag -server.report-grpc-codes-in-instrumentation-label-enabled will be removed.
  • CLI flag -ingester.client.report-grpc-codes-in-instrumentation-label-enabled will be removed.
  • CLI flag -ingester.return-only-grpc-errors will be removed.
  • The legacy code will be removed.

Compatibility issues

CLI flag -ingester.return-only-grpc-errors

The following table shows the incompatibilities between systems running different versions of the distributor and the ingester. The abbreviations used to denote versions are:

  • 2.14+ is version 2.14 or higher
  • 2.11-2.13 is a version between 2.11 and 2.13
  • 2.10- is version 2.10 or lower

| distributor | ingester | -ingester.return-only-grpc-errors | remark |
|---|---|---|---|
| 2.14+ | 2.14+ | n/a | OK |
| 2.14+ | 2.11-2.13 | true | OK |
| 2.14+ | 2.11-2.13 | false | All ingester errors returned with status code 500 (/api/v1/push) or status code Internal (/distributor.Distributor/Push) |
| 2.14+ | 2.10- | n/a | All ingester errors returned with status code 500 (/api/v1/push) or status code Internal (/distributor.Distributor/Push) |
| 2.11-2.13 | 2.14+ | n/a | OK |
| 2.11-2.13 | 2.11-2.13 | true | OK |
| 2.11-2.13 | 2.11-2.13 | false | OK |
| 2.11-2.13 | 2.10- | n/a | OK |
| 2.10- | 2.14+ | n/a | All ingester errors returned with status code 500 (/api/v1/push and /distributor.Distributor/Push) |
| 2.10- | 2.11-2.13 | true | All ingester errors returned with status code 500 (/api/v1/push and /distributor.Distributor/Push) |
| 2.10- | 2.11-2.13 | false | OK |
| 2.10- | 2.10- | n/a | OK |
