Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating new metric for filtering 4xx errors #2927

Conversation

krishnateja325
Copy link

  1. Creating a new metric for filtering 4xx errors - numTotal4xxFailedNotifications
  2. Modifying the return of Notify function to return a boolean value is4xx. We will use this value to update the above metric in notify.go.
func (n *Notifier) Notify(ctx context.Context, alert ...*types.Alert) (bool, bool, error) { }
retry, is4xx, err := r.integration.Notify(ctx, sent...)
  1. For this change we have to update the Notify function in all the integrations. For now masked this change by returning false for this variable from all other integrations. only updated this variable when coming from SNS. Will update this boolean value for other integrations after discussion with the community.
  2. The is4xx variable is used to increment the metric inside exec function of the retry Stage. This variable helps in identifying if the error thrown is 4xx or not.
if is4xx {
       r.metrics.numTotal4xxFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
}
  1. Wrote a test TestRetryStageWithErrorCode in notify_test.go to test if this metric is getting incremented or not.

Files Impacted:

notify/email/email.go
notify/email/email_test.go
notify/notify.go
notify/notify_test.go
notify/opsgenie/opsgenie.go
notify/pagerduty/pagerduty.go
notify/pagerduty/pagerduty_test.go
notify/pushover/pushover.go
notify/slack/slack.go
notify/sns/sns.go
notify/telegram/telegram.go
notify/test/test.go
notify/victorops/victorops.go
notify/victorops/victorops_test.go
notify/webhook/webhook.go
notify/wechat/wechat.go

@krishnateja325 krishnateja325 force-pushed the metric-creation-4xx-error-filtering branch from ab8cd28 to b581947 Compare May 23, 2022 08:20
notify/sns/sns.go Outdated Show resolved Hide resolved
@krishnateja325 krishnateja325 force-pushed the metric-creation-4xx-error-filtering branch from b581947 to bb7b19e Compare May 23, 2022 16:59
@roidelapluie
Copy link
Member

Ping @gotjosh @simonpasquier what do you think about this?

@krishnateja325 krishnateja325 force-pushed the metric-creation-4xx-error-filtering branch from bb7b19e to 344e2cc Compare May 23, 2022 17:20
Signed-off-by: Krishna Teja Puttagunta <krishtez@amazon.com>
@krishnateja325 krishnateja325 force-pushed the metric-creation-4xx-error-filtering branch from 344e2cc to 319bc80 Compare May 23, 2022 17:36
@krishnateja325
Copy link
Author

@roidelapluie @gotjosh @simonpasquier any comments on this?

@gotjosh
Copy link
Member

gotjosh commented Jun 13, 2022

Seems like I got tagged on this during some of my PTO and apparently I missed the tag - apologies about that.

I'm not sure this is the direction that I want to go. We have a alertmanager_notification_latency_seconds histogram that tracks the latency for notifications could we not add the status codes to that histogram as labels instead?

Keeping in mind that this would increase the number of series produced by the Alertmanager but perhaps there are some areas where we can compromise.

Off the top of my head:

  • Don't expose a series for integrations that don't have any configuration - right now we expose a series for all integrations regardless of whether they're used or not.
  • If we think that including status_code would be a cardinality issue, an alternative could be to use 2xx, 3xx, 4xx and 5xx as the status code to reduce the number of series.

One last thing I'd like to get some clarity on is: What exactly is the problem that you're trying to solve here? In my experience notifications_failed_total has been enough to be able to tell if integration is correctly or not.

@qinxx108
Copy link
Contributor

qinxx108 commented Sep 8, 2022

Hi Josh @gotjosh. Thanks a lot for your comments. The main problem that we are trying to solve is to identify the notification failure from customer input failure (for eg: permission deny from integrations that is expected to happen) with the programatic error (the programatic error + integrations 5xx error). To address concern of the cardinality issue, we could do

  1. Publish a new metric for SNS integration 4xx
  2. Publish a notifications_failed_total with 4xx and 5xx labels

Please let us know what's your preference

@gotjosh
Copy link
Member

gotjosh commented Feb 24, 2023

Discussed this during the Alertmanager Working Group - closing due to #3094 being merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants