Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add time-based muting to routing tree #2393

Merged
merged 47 commits into from
Mar 1, 2021

Conversation

benridley
Copy link
Contributor

@benridley benridley commented Oct 13, 2020

Addresses #876

This adds the ability to define time intervals as per this design document.

Time intervals are defined in their own configuration section like so:

mute_time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times: 
        - start_time: "09:00"
          end_time: "17:00"

Then they can be referenced in the routing tree like so:

  # The root route on which each incoming alert enters.
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 1s
  group_interval: 30s

  # A default receiver
  receiver: team-X-mail
  routes:
    - match:
        alertname: NodeIsDown
      mute_times:
        - business_hours
      receiver: team-X-mail

Time zones are not supported at the moment because it would break the ability to run Alertmanager on Windows in a reliable way. Once this issue is addressed we can add support fairly easily.

@benridley benridley force-pushed the dev_time_interval branch 3 times, most recently from 7a16adb to 1323d17 Compare October 13, 2020 01:31
@benridley benridley marked this pull request as draft October 13, 2020 02:37
@benridley benridley marked this pull request as ready for review October 13, 2020 02:41
@benridley
Copy link
Contributor Author

Just added some additional tests for the config

@benridley
Copy link
Contributor Author

@simonpasquier Can you please let me know if there's anything else I can do for this one? Thank you!

@fredrik-j-lindberg
Copy link

This would help our team greatly. Hope you get time to look at this sooner rather than later! @simonpasquier

@juliusv
Copy link
Member

juliusv commented Nov 6, 2020

@simonpasquier There seems to be good interest in this one... in case you are held up by other stuff, maybe someone else wants to take a look? @roidelapluie @beorn7

I am unfortunately totally swamped at the moment.

notify/notify.go Outdated Show resolved Hide resolved
notify/notify.go Outdated Show resolved Hide resolved
notify/notify.go Outdated Show resolved Hide resolved
}

// ContainsTime returns true if the TimeInterval contains the given time, otherwise returns false
func (tp TimeInterval) ContainsTime(t time.Time) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we check somewhere for empty time intervals? WIth an empty time interval, this would always return true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this makes sense due to the semantics of the time interval (where not specifying a restriction means all possible times are included), I think an empty time interval is probably a symptom of an error. I'll add a check for this, thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've thought more about this, I think we should leave this as is. This makes sense with the semantics of a time interval and modifying it would complicate the function by possibly returning only a single error case.

if err != nil {
return err
}
r.setBegin(val)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we set them individualy somewhere? Can we do with one setter function ? Do we need those extra setters, can't we set EndDate directly?

Signed-off-by: Ben Ridley <benridley29@gmail.com>
Signed-off-by: Ben Ridley <benridley29@gmail.com>
Signed-off-by: Ben Ridley <benridley29@gmail.com>
Signed-off-by: Ben Ridley <benridley29@gmail.com>
…ther error message wording

Signed-off-by: Ben Ridley <benridley29@gmail.com>
Signed-off-by: Ben Ridley <benridley29@gmail.com>
Signed-off-by: Ben Ridley <benridley29@gmail.com>
Signed-off-by: Ben Ridley <benridley29@gmail.com>
…r feedback

Signed-off-by: Ben Ridley <benridley29@gmail.com>
@benridley
Copy link
Contributor Author

Thanks for validating that @beorn7. I've rebased and the tests are looking much happier.

Do we want to add some additional words around group_interval in the docs to indicate this behavior? The current documentation is:

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)

which doesn't give much indication of the issue we've discovered. Adding some words here might stop people from being surprised by delays in notification delivery.

@beorn7
Copy link
Member

beorn7 commented Mar 1, 2021

@benridley I totally agree that the current documentation is incomplete. However, let's first get clarity if the current behavior is actually the intended behavior or a bug (in which case we would document it very differently). I'll file an issue about this topic.

In the meantime, we can merge this PR.

@beorn7 beorn7 merged commit e66c803 into prometheus:master Mar 1, 2021
of the form <start_day>:<end_day> and are inclusive on both ends. For example:
`[‘monday:wednesday','saturday', 'sunday']`

`days_of_month_ramge`: A list of numerical days in the month. Days begin at 1.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

days_of_month_range?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want me to fix it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here #2497

@lopesit0
Copy link

Can someone explain to me how to integrate this functionality?

@benridley
Copy link
Contributor Author

benridley commented Apr 19, 2021

Hey @lopesit0, no problem!
This has been merged into master now, but isn't part of an official release yet, so if you download Alertmanager using traditional methods (e.g. downloading them from https://prometheus.io/download/), this functionality won't be there. If you want to try it out, you can build the master branch yourself which is very easy thanks to the Golang toolchain. Simply download Go and then clone this repository from GitHub. Once you've done that, on the command line simply run

go build -o alertmanager cmd/alertmanager/main.go

which will give you an alertmanager binary with this included.

You can find the raw documentation also in the master branch which explains how to configure your time intervals.

Just be warned this is new and potentially unstable. If you do decide to run it, please feel free to report back here or open issues against the repository so we can fix then ASAP!

EDIT: Alternatively, I've realized there's a Docker image built from master. So the easiest way to try this out might be using the official docker image built from master.
docker pull prom/alertmanager:master

@lopesit0
Copy link

Hey @lopesit0, no problem!
This has been merged into master now, but isn't part of an official release yet, so if you download Alertmanager using traditional methods (e.g. downloading them from https://prometheus.io/download/), this functionality won't be there. If you want to try it out, you can build the master branch yourself which is very easy thanks to the Golang toolchain. Simply download Go and then clone this repository from GitHub. Once you've done that, on the command line simply run

go build -o alertmanager cmd/alertmanager/main.go

which will give you an alertmanager binary with this included.

You can find the raw documentation also in the master branch which explains how to configure your time intervals.

Just be warned this is new and potentially unstable. If you do decide to run it, please feel free to report back here or open issues against the repository so we can fix then ASAP!

EDIT: Alternatively, I've realized there's a Docker image built from master. So the easiest way to try this out might be using the official docker image built from master.
docker pull prom/alertmanager:master

Thanks for your answer :)

@ijungmann
Copy link

Is there any way to make use of this in the helm chart at the moment? It seems like the operator does not handle this yet.

@beorn7
Copy link
Member

beorn7 commented Apr 28, 2021

We need a release of Alertmanager. I think @roidelapluie was working on it, but there were flaky tests blocking it. Perhaps those are resolved now?

@roidelapluie
Copy link
Member

See #2557

@bimmerkiev
Copy link

bimmerkiev commented May 14, 2021

Hi all.
I'm trying to use that feature with release version 0.22.0-rc.1, but I have no luck with it.
My alertmanager.yml:

mute_time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
        - start_time: "09:00"
          end_time: "11:00"

templates:
- '/etc/alertmanager/templates/*.tmpl'
receivers:
- name: pagerduty
  pagerduty_configs:
  - client_url: http://x.x.x.x/
    description: '{{ if .CommonAnnotations.summary }}{{ .CommonAnnotations.summary
      }}{{ end }}'
    routing_key: xxx
    severity: '{{ if .CommonLabels.severity }}{{ .CommonLabels.severity | toLower
      }}{{ end }}'

route:
  group_by:
  - alertname
  - cluster
  - service
  - env
  group_interval: 5m
  group_wait: 30s
  receiver: pagerduty
  repeat_interval: 4h
  routes:
  - group_wait: 10s
    match:
      severity: low
      mute_time_intervals:
        - business_hours
      receiver: pagerduty

But with that config alertmanager don't want to start.
Part of log:

level=error ts=2021-05-14T13:38:06.667Z caller=coordinator.go:118 component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/alertmanager.yml err="yaml: unmarshal errors:\n  line 44: cannot unmarshal !!seq into string"

Can anyone help me?
Thanks

@bimmerkiev
Copy link

the error about line:
" - business_hours"

@brunocriado
Copy link

Hi 👋
Before sending an issue I wanted to confirm if I'm doing it right. I have a test configuration where I want to mute the alerts for for 1 minute on Mondays for a specific receiver as you can see in below.

...
  - group_by:
    - alertname
    - severity
    - env
    match:
      type: aws-alarm
      mute_times: test
    receiver: aws-alarm
    continue: true
...
mute_time_intervals:
- name: test
  time_intervals:
  - weekdays: ['monday']
    times:
    - start_time: '09:00'
      end_time: '09:01'

I'm testing with amtool and when I send the alert I didn't receive it. If I remove that time interval it go off as expected. BTW today is not Monday and I'm testing for more than 1 minute.

@brunocriado
Copy link

brunocriado commented May 20, 2021

Just found the issue. I should use mute_time_intervals and not inside match

  - group_by:
    - alertname
    - severity
    - env
    match:
      type: aws-alarm
    mute_time_intervals:
      - test
    receiver: aws-alarm

@rschirin
Copy link

rschirin commented Jun 4, 2021

sorry team, is there any chance to see this feature within the docker image?

@beorn7
Copy link
Member

beorn7 commented Jun 7, 2021

It has been released in v0.22. Any Docker image with that or a later release should support it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.