Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingester write path: refactor error creation #6324

Merged
merged 9 commits into from
Oct 12, 2023

Conversation

duricanikolic
Copy link
Contributor

@duricanikolic duricanikolic commented Oct 10, 2023

What this PR does

Similarly to what has been done on distributor site (PR), this PR introduces the following error types, representing the problems on the ingester’s write path that need some special handling:

  • unavailableError,
  • tsdbUnavailableError,
  • sampleError,
  • exemplarError,
  • tsdbIngestExemplarError,
  • instanceLimitReachedError,
  • perUserSeriesLimitReachedError,
  • perUserMetadataLimitReachedError,
  • perMetricSeriesLimitReachedError,
  • perMetricMetadataLimitReachedError.
    They all belong to the ingester package.

Moreover, this PR refactors the ingester's write path in such a way that all mappings between errors created by the ingester and the HTTP status or gRPC status codes have been moved to the topmost possible level, i.e., to ingester.Push().

The semantics of the ingester's write path has not been changed, i.e., the newly introduced errors are mapped to the same HTTP or gRPC status codes their corresponding problems used to be assigned before this PR.

This refactoring is just a preliminary change needed for a bigger refactoring and improving of intester's and distributor's write path error handling.

Which issue(s) this PR fixes or relates to

Related to #6008

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@duricanikolic duricanikolic self-assigned this Oct 10, 2023
@duricanikolic duricanikolic force-pushed the yuri/handling-ingester-errors branch 2 times, most recently from 4a84994 to c6c126f Compare October 10, 2023 20:20
@duricanikolic duricanikolic marked this pull request as ready for review October 10, 2023 20:22
@duricanikolic duricanikolic requested a review from a team as a code owner October 10, 2023 20:22
pkg/ingester/errors.go Outdated Show resolved Hide resolved
pkg/ingester/errors.go Outdated Show resolved Hide resolved
pkg/ingester/errors.go Outdated Show resolved Hide resolved
pkg/ingester/errors.go Outdated Show resolved Hide resolved
pkg/ingester/errors.go Outdated Show resolved Hide resolved
pkg/ingester/errors.go Outdated Show resolved Hide resolved
pkg/ingester/errors.go Outdated Show resolved Hide resolved
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
pkg/mimir/grpc_push_check.go Show resolved Hide resolved
pkg/ingester/errors.go Outdated Show resolved Hide resolved
@duricanikolic duricanikolic force-pushed the yuri/handling-ingester-errors branch 2 times, most recently from 7e93c86 to 5d5e118 Compare October 11, 2023 15:48
pkg/ingester/errors.go Outdated Show resolved Hide resolved
Comment on lines 102 to 123
message: fmt.Sprintf(
"%s. The affected sample has timestamp %s and is from series %s",
prefixMsg,
timestamp.Time().UTC().Format(time.RFC3339Nano),
mimirpb.FromLabelAdaptersToLabels(labels).String(),
),
)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should do this formatting in Error() call, not in "new" call. It's wasteful if message is never used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I have re-organized the code accordingly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not quite complete. There's still unnecessary string formatting done when calling globalerror.ID.Message(...) too early. This should also be delayed to Error calls.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps keeping globalerror.ID in errors could also help with deciding which sampling wrapper to use? 🤔 Then we can move sampling out of calls to updateFirstPartial, and let that function only work ith ingesterError.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't touch error sampling now. This is a separate problem, and it would additionally complicate this PR and this task, which is basically to remove httpgrpc.Errorf() calls from ingester and distributor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not quite complete. There's still unnecessary string formatting done when calling globalerror.ID.Message(...) too early. This should also be delayed to Error calls.

I was asked by @colega to do it this way here. If your opinion about formatting is strong, I will put the globalID and message back to the error as fields, and format them during Error().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would suggest to put it back. While I see Oleg's point, creating additional strings in "new" call is premature, and can be avoided by putting formatting into Error call. If Error call is actually never called, then we just saved formatting completely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix this immediately.

pkg/ingester/errors.go Outdated Show resolved Hide resolved
pkg/ingester/ingester.go Show resolved Hide resolved
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
code := http.StatusBadRequest
errWithUser := wrapOrAnnotateWithUser(firstPartialErr, userID)
return newErrorWithHTTPStatus(errWithUser, code)
return wrapOrAnnotateWithUser(firstPartialErr, userID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously we would always return HTTP Status code 400 for ANY error here, even if it wasn't "safe to wrap". After this change, any error that's "not safe to wrap" will be translated to 500. Is that correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @pstibrany,
Not exactly. All the errors returned here were safe, because they are the soft errors found by [pushSamplesToAppender()](https://github.com/grafana/mimir/blob/main/pkg/ingester/ingester.go#L874). These errors are always safe, and after the changes from this PR they are all defined in errors.go. This PR keeps the same semantics as the current master.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the errors returned here were safe,

These errors are always safe

I don't think this is true. If it is, let's use type system to make sure. But I think there are possible errors that can be returned by pushSamplesToAppender that are not wrapped into our "safe" wrappers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current implementation, updateFirstPartial (here) is responsible for assigning a value to variable firstPartialErr, that is then retrurned with the 400 HTTP status code (here).

The updateFirstPartial is passed as a parameter to pushSamplesToAppender(), and is called different times inside its implementation (e.g., here, here, etc...)

Related to this PR, all the calls to updateFirstPartial in pushSamplesToAppend() return what used to be called a safeToWrap error, and what now became an ingesterError, i.e., an error that wrapOrAnnotateWithUserId() would actually wrap with userID. Moreover, errorType() of all these ingesterErrors is ingesterErrorType.badData, i.e., the type of errors that is mapped to 400.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current implementation, updateFirstPartial (here) is responsible for assigning a value to variable firstPartialErr, that is then retrurned with the 400 HTTP status code (here).

The updateFirstPartial is passed as a parameter to pushSamplesToAppender(), and is called different times inside its implementation (e.g., here, here, etc...)

I see, and I think you're right. It would be great if we could codify that, for example by defining updateFirstPartial so that it only accepts ingesterErrors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand which problem with SampledError wrapping you are referring to. Anyway, I don't think it is a good idea to pass a *log.Sampler to updateFirstPartial because there are some errors that don't need to be sampled, for example this.

Copy link
Contributor Author

@duricanikolic duricanikolic Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, and I think you're right. It would be great if we could codify that, for example by defining updateFirstPartial so that it only accepts ingesterErrors.

I can try to do it, but what do we actually want to achieve? If the error returned by updateFirstPartial()'s parameter is something we consider a silent error, we go on, but if it is another type of error we want to fail immediately?
WDYT if we leave it like it is now (it doesn't change the previous behavior) and to improve this in a separate PR that will be dedicated only to that problem? Otherwise this PR will become very big.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but what do we actually want to achieve? If the error returned by updateFirstPartial()'s parameter is something we consider a silent error, we go on, but if it is another type of error we want to fail immediately?

What we want to achieve is to make sure that we don't "leak" any other types of errors via firstPartialErr (that can't be safely wrapped). As stated previously, even if that was the case before this PR, we would still end up returning 400, but after this PR that's not the case anymore -- instead call to wrapOrAnnotateWithUser with unwrappable error will "destroy" underlying error type and our mapping will not convert it to anything.

While we can verify that only "ingester errors" are returned firstPartialErr today by doing a code review, it would be better to actually use the type system to avoid any future "leaks" of non-ingester errors here.

I agree about doing this in separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand which problem with SampledError wrapping you are referring to.

I've tried to change updateFirstPartial to only accept ingesterError, but that wasn't working for errors wrapped by SampledError.

Anyway, I don't think it is a good idea to pass a *log.Sampler to updateFirstPartial because there are some errors that don't need to be sampled, for example this.

I don't think that's a problem -- we would pass nil in such case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's a problem -- we would pass nil in such case.

I have updated the updateFirstPartial signature accordingly. I will do additional changes in a separate PR.

pkg/ingester/ingester.go Outdated Show resolved Hide resolved
Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, nice work! I've left some non-blocking comments.


// exemplarError is an ingesterError indicating a problem with an exemplar.
type exemplarError struct {
originalErr error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks somewhat strange that exemplarError has this optional originalErr. That almost looks like a separate type to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added another type, tsdbIngestExemplarError to handle that situation.

if firstPartialErr == nil {
firstPartialErr = errFn()
if sampler != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was suggested in combination of making errFn return only ingesterError, and changing firstPartialErr to that type as well. It may make more sense then, but we can keep it for now.

Comment on lines +1993 to +1994
require.True(t, ok)
require.Equal(t, http.StatusServiceUnavailable, int(stat.Code()))
require.Equal(t, tooBusyErrorMsg, stat.Message())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: These checks repeat at different places. We could extract them into a checkGrpcStatusError method.

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
@duricanikolic
Copy link
Contributor Author

I have implemented requested findings and merged the PR.

@duricanikolic duricanikolic merged commit 847212d into main Oct 12, 2023
28 checks passed
@duricanikolic duricanikolic deleted the yuri/handling-ingester-errors branch October 12, 2023 17:04
Comment on lines +1229 to +1231
if err == nil {
return nil
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this if necessary?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants