-
Notifications
You must be signed in to change notification settings - Fork 524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ingester write path: refactor error creation #6324
Conversation
4a84994
to
c6c126f
Compare
c6c126f
to
176604b
Compare
7e93c86
to
5d5e118
Compare
pkg/ingester/errors.go
Outdated
message: fmt.Sprintf( | ||
"%s. The affected sample has timestamp %s and is from series %s", | ||
prefixMsg, | ||
timestamp.Time().UTC().Format(time.RFC3339Nano), | ||
mimirpb.FromLabelAdaptersToLabels(labels).String(), | ||
), | ||
) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should do this formatting in Error()
call, not in "new" call. It's wasteful if message is never used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, I have re-organized the code accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not quite complete. There's still unnecessary string formatting done when calling globalerror.ID.Message(...)
too early. This should also be delayed to Error
calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps keeping globalerror.ID
in errors could also help with deciding which sampling wrapper to use? 🤔 Then we can move sampling out of calls to updateFirstPartial
, and let that function only work ith ingesterError
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't touch error sampling now. This is a separate problem, and it would additionally complicate this PR and this task, which is basically to remove httpgrpc.Errorf()
calls from ingester and distributor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not quite complete. There's still unnecessary string formatting done when calling globalerror.ID.Message(...) too early. This should also be delayed to Error calls.
I was asked by @colega to do it this way here. If your opinion about formatting is strong, I will put the globalID and message back to the error as fields, and format them during Error()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I would suggest to put it back. While I see Oleg's point, creating additional strings in "new" call is premature, and can be avoided by putting formatting into Error
call. If Error
call is actually never called, then we just saved formatting completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix this immediately.
code := http.StatusBadRequest | ||
errWithUser := wrapOrAnnotateWithUser(firstPartialErr, userID) | ||
return newErrorWithHTTPStatus(errWithUser, code) | ||
return wrapOrAnnotateWithUser(firstPartialErr, userID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously we would always return HTTP Status code 400 for ANY error here, even if it wasn't "safe to wrap". After this change, any error that's "not safe to wrap" will be translated to 500. Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @pstibrany,
Not exactly. All the errors returned here were safe, because they are the soft errors found by [pushSamplesToAppender()](https://github.com/grafana/mimir/blob/main/pkg/ingester/ingester.go#L874)
. These errors are always safe, and after the changes from this PR they are all defined in errors.go
. This PR keeps the same semantics as the current master.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the errors returned here were safe,
These errors are always safe
I don't think this is true. If it is, let's use type system to make sure. But I think there are possible errors that can be returned by pushSamplesToAppender
that are not wrapped into our "safe" wrappers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current implementation, updateFirstPartial
(here) is responsible for assigning a value to variable firstPartialErr
, that is then retrurned with the 400 HTTP status code (here).
The updateFirstPartial
is passed as a parameter to pushSamplesToAppender()
, and is called different times inside its implementation (e.g., here, here, etc...)
Related to this PR, all the calls to updateFirstPartial
in pushSamplesToAppend()
return what used to be called a safeToWrap
error, and what now became an ingesterError
, i.e., an error that wrapOrAnnotateWithUserId()
would actually wrap with userID. Moreover, errorType()
of all these ingesterError
s is ingesterErrorType.badData
, i.e., the type of errors that is mapped to 400.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current implementation,
updateFirstPartial
(here) is responsible for assigning a value to variablefirstPartialErr
, that is then retrurned with the 400 HTTP status code (here).The
updateFirstPartial
is passed as a parameter topushSamplesToAppender()
, and is called different times inside its implementation (e.g., here, here, etc...)
I see, and I think you're right. It would be great if we could codify that, for example by defining updateFirstPartial
so that it only accepts ingesterError
s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand which problem with SampledError
wrapping you are referring to. Anyway, I don't think it is a good idea to pass a *log.Sampler
to updateFirstPartial
because there are some errors that don't need to be sampled, for example this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, and I think you're right. It would be great if we could codify that, for example by defining
updateFirstPartial
so that it only acceptsingesterError
s.
I can try to do it, but what do we actually want to achieve? If the error returned by updateFirstPartial()
's parameter is something we consider a silent error, we go on, but if it is another type of error we want to fail immediately?
WDYT if we leave it like it is now (it doesn't change the previous behavior) and to improve this in a separate PR that will be dedicated only to that problem? Otherwise this PR will become very big.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but what do we actually want to achieve? If the error returned by updateFirstPartial()'s parameter is something we consider a silent error, we go on, but if it is another type of error we want to fail immediately?
What we want to achieve is to make sure that we don't "leak" any other types of errors via firstPartialErr
(that can't be safely wrapped). As stated previously, even if that was the case before this PR, we would still end up returning 400, but after this PR that's not the case anymore -- instead call to wrapOrAnnotateWithUser
with unwrappable error will "destroy" underlying error type and our mapping will not convert it to anything.
While we can verify that only "ingester errors" are returned firstPartialErr
today by doing a code review, it would be better to actually use the type system to avoid any future "leaks" of non-ingester errors here.
I agree about doing this in separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand which problem with SampledError wrapping you are referring to.
I've tried to change updateFirstPartial
to only accept ingesterError
, but that wasn't working for errors wrapped by SampledError
.
Anyway, I don't think it is a good idea to pass a *log.Sampler to updateFirstPartial because there are some errors that don't need to be sampled, for example this.
I don't think that's a problem -- we would pass nil
in such case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's a problem -- we would pass nil in such case.
I have updated the updateFirstPartial
signature accordingly. I will do additional changes in a separate PR.
0707e49
to
a10c540
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, nice work! I've left some non-blocking comments.
pkg/ingester/errors.go
Outdated
|
||
// exemplarError is an ingesterError indicating a problem with an exemplar. | ||
type exemplarError struct { | ||
originalErr error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks somewhat strange that exemplarError
has this optional originalErr
. That almost looks like a separate type to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added another type, tsdbIngestExemplarError
to handle that situation.
if firstPartialErr == nil { | ||
firstPartialErr = errFn() | ||
if sampler != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was suggested in combination of making errFn
return only ingesterError
, and changing firstPartialErr
to that type as well. It may make more sense then, but we can keep it for now.
require.True(t, ok) | ||
require.Equal(t, http.StatusServiceUnavailable, int(stat.Code())) | ||
require.Equal(t, tooBusyErrorMsg, stat.Message()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: These checks repeat at different places. We could extract them into a checkGrpcStatusError
method.
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
bba50dc
to
43680a3
Compare
I have implemented requested findings and merged the PR. |
if err == nil { | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this if
necessary?
What this PR does
Similarly to what has been done on distributor site (PR), this PR introduces the following error types, representing the problems on the ingester’s write path that need some special handling:
unavailableError
,tsdbUnavailableError
,sampleError
,exemplarError
,tsdbIngestExemplarError
,instanceLimitReachedError
,perUserSeriesLimitReachedError
,perUserMetadataLimitReachedError
,perMetricSeriesLimitReachedError
,perMetricMetadataLimitReachedError
.They all belong to the
ingester
package.Moreover, this PR refactors the ingester's write path in such a way that all mappings between errors created by the ingester and the HTTP status or gRPC status codes have been moved to the topmost possible level, i.e., to
ingester.Push()
.The semantics of the ingester's write path has not been changed, i.e., the newly introduced errors are mapped to the same HTTP or gRPC status codes their corresponding problems used to be assigned before this PR.
This refactoring is just a preliminary change needed for a bigger refactoring and improving of intester's and distributor's write path error handling.
Which issue(s) this PR fixes or relates to
Related to #6008
Which issue(s) this PR fixes or relates to
Fixes #
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]