Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dispatch: don't garbage-collect alerts from store #2040

Merged

Conversation

simonpasquier
Copy link
Member

Ref prometheus-operator/kube-prometheus#205

The aggregation group is already responsible for removing the resolved alerts. Running the garbage collection in parallel introduces a race and eventually resolved notifications may be dropped.

I couldn't find a way to add a test since the first GC happened after 15min. I have a script to reproduce the issue and I've verified that the issue is gone after this change. The script goes like:

  • send firing alert repeatedly for about 14min.
  • send resolved alert at 14min45s (just before the GC which itself happens before the next group interval).
route:
  group_by: [job]
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 10h

The aggregation group is already responsible for removing the resolved
alerts. Running the garbage collection in parallel introduces a race and
eventually resolved notifications may be dropped.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Copy link
Member

@brancz brancz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, nice find! 🎉

@brancz brancz merged commit 76f339f into prometheus:master Sep 23, 2019
@simonpasquier simonpasquier deleted the dont-gc-alerts-in-aggr-groups branch September 24, 2019 07:48
mpeters pushed a commit to cisco-kusanagi/alertmanager-ha that referenced this pull request Oct 4, 2019
Updated to compile with these changes: prometheus#2040
grobinson-grafana added a commit to grobinson-grafana/alertmanager that referenced this pull request May 13, 2024
This commit removes the GC and callback function from store.go
to address a number of data races that have occurred in the past
(prometheus#2040 and prometheus#3648). The store is no longer responsible for removing
resolved alerts after some elapsed period of time, and is instead
deferred to the consumer of the store (as done in prometheus#2040 and prometheus#3648).

Signed-off-by: George Robinson <george.robinson@grafana.com>
grobinson-grafana added a commit to grobinson-grafana/alertmanager that referenced this pull request May 23, 2024
This commit removes the GC and callback function from store.go
to address a number of data races that have occurred in the past
(prometheus#2040 and prometheus#3648). The store is no longer responsible for removing
resolved alerts after some elapsed period of time, and is instead
deferred to the consumer of the store (as done in prometheus#2040 and prometheus#3648).

Signed-off-by: George Robinson <george.robinson@grafana.com>
grobinson-grafana added a commit to grobinson-grafana/alertmanager that referenced this pull request May 26, 2024
This commit removes the GC and callback function from store.go
to address a number of data races that have occurred in the past
(prometheus#2040 and prometheus#3648). The store is no longer responsible for removing
resolved alerts after some elapsed period of time, and is instead
deferred to the consumer of the store (as done in prometheus#2040 and prometheus#3648).

Signed-off-by: George Robinson <george.robinson@grafana.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants