Read alertmanager state from storage if peer settling fails. #4021

stevesg · 2021-03-26T15:50:59Z

What this PR does:
Reads the alertmanager state (silences, notification log) from storage if obtaining state from a running peer fails.
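
To make the intended order of operations concrete, here is a minimal sketch of a peers-first, storage-second restore. It is not the code from this PR: `fullState`, `stateSource`, `loadInitialState`, `fakeSource` and `errUnavailable` are all hypothetical names introduced for illustration, and the real implementation works with the alertmanager's own state types and a bucket client.

```go
package main

import (
	"errors"
	"fmt"
)

// fullState is a hypothetical stand-in for the serialized silences and
// notification log of one tenant.
type fullState struct {
	Silences        []byte
	NotificationLog []byte
}

// stateSource is a hypothetical abstraction over anything state can be read
// from (a running peer, or object storage).
type stateSource interface {
	ReadState(tenant string) (fullState, error)
}

// loadInitialState asks the peers first and only falls back to storage when
// the peer read fails. It reports which source supplied the state.
func loadInitialState(tenant string, peers, storage stateSource) (fullState, string, error) {
	st, peerErr := peers.ReadState(tenant)
	if peerErr == nil {
		return st, "peers", nil
	}

	st, storageErr := storage.ReadState(tenant)
	if storageErr == nil {
		return st, "storage", nil
	}

	// Both sources failed; the caller decides whether to continue with empty state.
	return fullState{}, "", fmt.Errorf("read from peers: %v; read from storage: %w", peerErr, storageErr)
}

// fakeSource is a test double that returns a fixed state or a fixed error.
type fakeSource struct {
	state fullState
	err   error
}

func (f fakeSource) ReadState(string) (fullState, error) { return f.state, f.err }

// errUnavailable simulates a source that cannot be reached.
var errUnavailable = errors.New("source unavailable")
```

Calling `loadInitialState("tenant-1", fakeSource{err: errUnavailable}, fakeSource{})` in this sketch reports `"storage"` as the source. When both reads fail, the sketch returns an error, and a caller could log it and start with empty state, which matches the "continuing anyway" log message discussed in the review.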

Signed-off-by: Steve Simpson <steve.simpson@grafana.com>

pracucci

Good job, LGTM! (modulo a couple of nits)

pkg/alertmanager/state_replication.go

pracucci · 2021-03-31T10:44:35Z

+		}
+	}
+
+	level.Info(s.logger).Log("msg", "failed to read state from storage; continuing anyway", "err", err)


Assuming we circuit-break on an "object not found" error, I would raise this to a warning:

Suggested change

-	level.Info(s.logger).Log("msg", "failed to read state from storage; continuing anyway", "err", err)
+	level.Warn(s.logger).Log("msg", "failed to read state from storage; continuing anyway", "err", err)

ranton256 · 2021-04-02

By "circuit break", do you mean not retrying any calls to object storage when the object does not exist?

pracucci · 2021-04-06

By "circuit break" I meant a guard (if the condition holds, return early). There is no retry mechanism here yet. Whether we want one is still to be discussed, given that retries may also be provided by the bucket client.
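
The guard described in this thread could look roughly like the sketch below. It is an illustration only: `errObjectNotFound` and `errBucketTimeout` are hypothetical sentinel errors standing in for whatever the bucket client returns, `storageReadOutcome` is an invented helper, and the level used for the "not found" case is an assumption. The sketch returns the log level as a string instead of calling a logger, so that it stays self-contained.

```go
package main

import "errors"

// Hypothetical sentinel errors; a real bucket client exposes its own way of
// detecting a missing object.
var (
	errObjectNotFound = errors.New("object not found")
	errBucketTimeout  = errors.New("bucket request timed out")
)

// storageReadOutcome decides how the result of a storage read is reported.
// A missing object is treated as an expected case and handled by an early
// return (the "guard"); any other error is reported as a warning.
func storageReadOutcome(err error) (level, msg string) {
	if err == nil {
		return "info", "state read from storage"
	}
	if errors.Is(err, errObjectNotFound) {
		// Guard: nothing has been persisted yet, so there is nothing to warn about.
		return "info", "no state found in storage; starting with empty state"
	}
	return "warn", "failed to read state from storage; continuing anyway"
}
```

With the guard in place, the warning is reserved for unexpected failures such as timeouts, which is the premise under which the reviewer suggested raising the log level. No retry is attempted in either branch.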

pstibrany

LGTM, thanks.

In cortexproject/cortex#3925 the ability to restore alertmanager state from peer alertmanagers was added, short-circuiting if there is only a single replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read state from storage was added in case reading from peers failed. However, the short-circuit for a single replica was not removed. As a result, state is never restored in an alertmanager that runs only a single replica.

Fixes grafana/mimir#2245

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
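
The interaction described in this commit message can be sketched as two small decision functions. This does not reproduce the Cortex or Mimir code: `restoreSource`, `chooseSourceBuggy` and `chooseSourceFixed` are hypothetical names, and the booleans stand in for "peers answered" and "storage holds a state object".

```go
package main

// restoreSource names where the initial state ends up being loaded from.
type restoreSource string

const (
	sourceNone    restoreSource = "none"
	sourcePeers   restoreSource = "peers"
	sourceStorage restoreSource = "storage"
)

// chooseSourceBuggy models the behaviour described above: the early return
// for a single replica happens before the storage fallback is considered.
func chooseSourceBuggy(replicationFactor int, peersOK, storageOK bool) restoreSource {
	if replicationFactor <= 1 {
		return sourceNone // short-circuit: storage is never consulted
	}
	if peersOK {
		return sourcePeers
	}
	if storageOK {
		return sourceStorage
	}
	return sourceNone
}

// chooseSourceFixed skips only the peer read for a single replica and still
// falls through to storage.
func chooseSourceFixed(replicationFactor int, peersOK, storageOK bool) restoreSource {
	if replicationFactor > 1 && peersOK {
		return sourcePeers
	}
	if storageOK {
		return sourceStorage
	}
	return sourceNone
}
```

For a single replica with state available in storage, the first function yields `sourceNone` while the second yields `sourceStorage`. Behaviour with more than one replica is the same in both.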

pull-request-size bot added the size/L label Mar 26, 2021

stevesg force-pushed the am-read-state branch from 01cd546 to 291e8f7 Compare March 30, 2021 08:03

pull-request-size bot added size/M and removed size/L labels Mar 30, 2021

Read alertmanager state from storage if peer settling fails.

fa0f9e5

Signed-off-by: Steve Simpson <steve.simpson@grafana.com>

stevesg force-pushed the am-read-state branch from 291e8f7 to fa0f9e5 Compare March 30, 2021 08:26

stevesg marked this pull request as ready for review March 30, 2021 08:29

stevesg mentioned this pull request Mar 30, 2021

Implement periodic writing of alertmanager state to storage. #4031

Merged

3 tasks

pracucci approved these changes Mar 31, 2021

Review comments.

d1cb5cc

Signed-off-by: Steve Simpson <steve.simpson@grafana.com>

pracucci reviewed Apr 6, 2021

Review comments.

71f4c7e

Signed-off-by: Steve Simpson <steve.simpson@grafana.com>

pstibrany approved these changes Apr 6, 2021

pracucci merged commit 9037020 into cortexproject:master Apr 7, 2021

56quarters mentioned this pull request Jun 30, 2022

Alertmanager state is not restored from remote storage if replicationFactor == 1 grafana/mimir#2245

Closed

56quarters mentioned this pull request Jun 30, 2022

Restore alertmanager state from storage as fallback grafana/mimir#2293

Merged

2 tasks


pracucci Mar 31, 2021

ranton256 Apr 2, 2021

pracucci Apr 6, 2021
