Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add reason label to the numTotalFailedNotifications metric #3094

Merged
merged 10 commits into from
Feb 3, 2023

Conversation

qinxx108
Copy link
Contributor

@qinxx108 qinxx108 commented Oct 7, 2022

This is to follow up the discussion on issue: #2927

Instead of creating a new metric, we will add the status code label to the numTotalFailedNotifications metric. The default status code is going to be "5xx" and if the integration receivers defined the NewErrorWithStatusCode, we will translate the returned status code and put that in the label override

@qinxx108 qinxx108 force-pushed the 4xx-error-for-notification-failure branch 2 times, most recently from 106cf64 to b99dad3 Compare October 12, 2022 02:39
notify/notify.go Outdated
}
r.metrics.numTotalFailedNotifications.WithLabelValues(r.integration.Name(), statusCodeCategory).Inc()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this increment still be inside the if statement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@qinxx108 qinxx108 force-pushed the 4xx-error-for-notification-failure branch from b99dad3 to eb74eb5 Compare October 12, 2022 19:34
@qinxx108
Copy link
Contributor Author

qinxx108 commented Oct 13, 2022

@gotjosh @roidelapluie
Hi Josh and Julien. This is to follow up with #2927
To address the comments to put the status category into the label

Could you please take a look to see if this look good to you?

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some ideas to improve this code, but generally looks amazing. Thanks for coming to Prometheus ContribFest and this contribution!

notify/notify.go Outdated
@@ -262,7 +265,7 @@ func NewMetrics(r prometheus.Registerer) *Metrics {
Namespace: "alertmanager",
Name: "notifications_failed_total",
Help: "The total number of failed notifications.",
}, []string{"integration"}),
}, []string{"integration", "statusCode"}),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's be concise. Perhaps code makes more sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

notify/util.go Outdated
return "5xx", nil
}

return "", fmt.Errorf("unexpected status code %v", statusCode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to fail some application logic if status code is unknown? I would just return unknown string here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the Notify function which call this i did not failed the logic, instead use the default if the status code is unexpected. I agreed empty string is less explained itself. changed to unknown

notify/notify.go Outdated
@@ -293,7 +296,7 @@ func NewMetrics(r prometheus.Registerer) *Metrics {
"telegram",
} {
m.numNotifications.WithLabelValues(integration)
m.numTotalFailedNotifications.WithLabelValues(integration)
m.numTotalFailedNotifications.WithLabelValues(integration, "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not make a lot of sense... I think we want to populate here counters with each possible 3 status codes: 4xx, 5xx and unknown.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@qinxx108 qinxx108 force-pushed the 4xx-error-for-notification-failure branch 2 times, most recently from edf6a31 to 6ce9839 Compare November 3, 2022 17:20
notify/util.go Outdated
// getFailureStatusCodeCategory return the status code category for failure request
// the status starts with 4 will return 4xx and starts with 5 will return 5xx
// other than 4xx and 5xx input status will return an error.
func getFailureStatusCodeCategory(statusCode int) (string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on how we use this, I feel we should not return error here, but just return default value, WDYT? Would make more clean and simple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of wanting to return an error so that the user of the function has knowledge about if the status code has been successfully translated. Otherwise they will have to rely on comparing if the result = failureUnknownCategoryCode if failureUnknownCategoryCode is not the ideal result for them in error case. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, well there are no users of this function yet, and as YAGNI goes, we should assume there will be none. I would focus on our usage:

result, interErr := getFailureStatusCodeCategory(e.StatusCode)
			if interErr == nil {
				statusCodeCategory = result
			}

We don't want to know IF the status was translated correctly, but rather we want defaultStatusCodeCategory in this case. So I would just return defaultStatusCodeCategory

Also, now when we talk about it, util is not a great place for it - it's pretty generic - we want it in util if we have more than one usage of it. Given the function is very small and used only in one place, why not:

  1. Moving this function next to notify.go file?
    OR
  2. Copy the content of this function to caller place? It's might be too shallow for a function, but up to you (:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Bartek, the function should return 4xx if [400,550), 5xx if [500,599) or fallback toother.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM just one nit (:

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last nits, thanks!

notify/util.go Outdated
// UserAgentHeader is the default User-Agent for notification requests
var UserAgentHeader = fmt.Sprintf("Alertmanager/%s", version.Version)

// PossibleFailureStatusCategory is a list of possible failure status code category
var PossibleFailureStatusCategory = []string{failure4xxCategoryCode, failure5xxCategoryCode, failureUnknownCategoryCode}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why it's exported? Can we make it private and move it next to notify.go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to notify.go

notify/util.go Outdated
// getFailureStatusCodeCategory return the status code category for failure request
// the status starts with 4 will return 4xx and starts with 5 will return 5xx
// other than 4xx and 5xx input status will return an error.
func getFailureStatusCodeCategory(statusCode int) (string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, well there are no users of this function yet, and as YAGNI goes, we should assume there will be none. I would focus on our usage:

result, interErr := getFailureStatusCodeCategory(e.StatusCode)
			if interErr == nil {
				statusCodeCategory = result
			}

We don't want to know IF the status was translated correctly, but rather we want defaultStatusCodeCategory in this case. So I would just return defaultStatusCodeCategory

Also, now when we talk about it, util is not a great place for it - it's pretty generic - we want it in util if we have more than one usage of it. Given the function is very small and used only in one place, why not:

  1. Moving this function next to notify.go file?
    OR
  2. Copy the content of this function to caller place? It's might be too shallow for a function, but up to you (:

notify/notify.go Outdated
if err != nil {
r.metrics.numTotalFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
if e, ok := errors.Cause(err).(*ErrorWithStatusCode); ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why we need to call Cause() here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the error returned from

 r.exec(ctx, l, alerts...)

The error already wrapped once with errors.Wrapf

notify/util.go Outdated
// getFailureStatusCodeCategory return the status code category for failure request
// the status starts with 4 will return 4xx and starts with 5 will return 5xx
// other than 4xx and 5xx input status will return an error.
func getFailureStatusCodeCategory(statusCode int) (string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Bartek, the function should return 4xx if [400,550), 5xx if [500,599) or fallback toother.

notify/notify.go Outdated
@@ -662,8 +668,16 @@ func NewRetryStage(i Integration, groupName string, metrics *Metrics) *RetryStag
func (r RetryStage) Exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
r.metrics.numNotifications.WithLabelValues(r.integration.Name()).Inc()
ctx, alerts, err := r.exec(ctx, l, alerts...)

statusCodeCategory := defaultStatusCodeCategory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For integrations that don't implement ErrorWithStatusCode, this could be inaccurate (e.g. the receiving side might have returned a 400 status code). It's also not correct if Alertmanager couldn't even send the notification (template error for instance). And the email notifier is another case that doesn't fit here.

I wonder if we shouldn't abstract away from the HTTP situation and consider a more generic label like reason with a handful number of possible values:

  • server error (e.g. 5xx)
  • client error (e.g. 4xx)
  • authentication error
  • template
  • other

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about

  • DestinationServerError
  • DestinationClientError
  • Fatal (all code error)
  • Template
  • Other

authentication error will belong to DestinationClientError

And in case the integration doesn't implement the ErrorWithStatusCode, the category will belong to Other

WDYT?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonpasquier would this work?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we can start with:

  • ClientError (typically templating errors, 4xx responses)
  • ServerError (typically 5xx responses from the server)
  • Other (e.g. receivers for which we haven't implemented ClientError/ServerError logic yet)

And we can always add more in the future where we see fit (e.g. maybe we'll want to distinguish authn failures, serialization errors, templating errors, ...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@qinxx108 qinxx108 force-pushed the 4xx-error-for-notification-failure branch 2 times, most recently from 8b09c90 to 73c8b96 Compare November 16, 2022 19:07
@qinxx108 qinxx108 force-pushed the 4xx-error-for-notification-failure branch 6 times, most recently from 0aac23f to 8940bb2 Compare January 12, 2023 17:37
@qinxx108 qinxx108 force-pushed the 4xx-error-for-notification-failure branch from 8940bb2 to 205b802 Compare January 20, 2023 05:54
Copy link
Member

@gotjosh gotjosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

All my comments are nits - but please let's address them, and we're ready to merge. Assuming that at this point, both @simonpasquier and @bwplotka's comments are resolved and they have nothing else to add.

notify/util.go Outdated Show resolved Hide resolved
notify/util.go Outdated Show resolved Hide resolved
notify/util.go Outdated Show resolved Hide resolved
notify/util.go Outdated Show resolved Hide resolved
@@ -662,8 +665,13 @@ func NewRetryStage(i Integration, groupName string, metrics *Metrics) *RetryStag
func (r RetryStage) Exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
r.metrics.numNotifications.WithLabelValues(r.integration.Name()).Inc()
ctx, alerts, err := r.exec(ctx, l, alerts...)

failureReason := DefaultReason.String()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've filed #3231 so that we can do the same for all other notifiers.

Signed-off-by: Yijie Qin <qinyijie@amazon.com>
… when unknown status code

Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Signed-off-by: Yijie Qin <qinyijie@amazon.com>
@qinxx108 qinxx108 force-pushed the 4xx-error-for-notification-failure branch from 25b9227 to 3f5d3a2 Compare February 2, 2023 18:41
@gotjosh
Copy link
Member

gotjosh commented Feb 3, 2023

Build is broken right now but tests pass, I'll assume everything is proper order here until we get that fixed that merge this.

@gotjosh gotjosh enabled auto-merge (squash) February 3, 2023 12:08
@gotjosh gotjosh disabled auto-merge February 3, 2023 12:09
@gotjosh gotjosh merged commit 7923bc5 into prometheus:main Feb 3, 2023
@gotjosh gotjosh changed the title add status code label to the numTotalFailedNotifications metric add reason label to the numTotalFailedNotifications metric Feb 3, 2023
@gotjosh
Copy link
Member

gotjosh commented Feb 3, 2023

Thanks very much for your contribution!

hoperays pushed a commit to hoperays/alertmanager that referenced this pull request Apr 23, 2023
…etheus#3094)

* add reason label to the numTotalFailedNotifications metric

Signed-off-by: Yijie Qin <qinyijie@amazon.com>
radek-ryckowski pushed a commit to goldmansachs/alertmanager that referenced this pull request Nov 6, 2023
…etheus#3094)

* add reason label to the numTotalFailedNotifications metric

Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants