Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read alertmanager state from storage if peer settling fails. #4021

Merged
merged 3 commits into from
Apr 7, 2021

Conversation

stevesg
Copy link
Contributor

@stevesg stevesg commented Mar 26, 2021

What this PR does:
Reads the alertmanager state (silences, notification log) from storage if obtaining state from a running peer fails.

Signed-off-by: Steve Simpson <steve.simpson@grafana.com>
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, LGTM! (module a couple of nits)

pkg/alertmanager/state_replication.go Show resolved Hide resolved
}
}

level.Info(s.logger).Log("msg", "failed to read state from storage; continuing anyway", "err", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we circuit break in case of "object not found" error, then I would raise this to warning:

Suggested change
level.Info(s.logger).Log("msg", "failed to read state from storage; continuing anyway", "err", err)
level.Warn(s.logger).Log("msg", "failed to read state from storage; continuing anyway", "err", err)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By circuit break, you mean don't retry any calls to object storage when no object?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By circuit break I was meaning a "guard" (if condition then return). There's no retry mechanism here yet (to be discussed if we want it, given retries may also be offered by the bucket client).

pkg/alertmanager/state_replication_test.go Outdated Show resolved Hide resolved
Signed-off-by: Steve Simpson <steve.simpson@grafana.com>
Signed-off-by: Steve Simpson <steve.simpson@grafana.com>
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.

@pracucci pracucci merged commit 9037020 into cortexproject:master Apr 7, 2021
56quarters added a commit to grafana/mimir that referenced this pull request Jun 30, 2022
In cortexproject/cortex#3925 the ability to restore alertmanager state from
peer alertmanagers was added, short-circuiting if there is only a single
replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read
state from storage was added in case reading from peers failed. However, the
short-circuiting if there is only a single peer was not removed. This has the
effect of never restoring state in an alertmanager if only running a single
replica.

Fixes #2245

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit to grafana/mimir that referenced this pull request Jul 5, 2022
In cortexproject/cortex#3925 the ability to restore alertmanager state from
peer alertmanagers was added, short-circuiting if there is only a single
replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read
state from storage was added in case reading from peers failed. However, the
short-circuiting if there is only a single peer was not removed. This has the
effect of never restoring state in an alertmanager if only running a single
replica.

Fixes #2245

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
pracucci pushed a commit to grafana/mimir that referenced this pull request Jul 6, 2022
* Restore alertmanager state from storage as fallback

In cortexproject/cortex#3925 the ability to restore alertmanager state from
peer alertmanagers was added, short-circuiting if there is only a single
replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read
state from storage was added in case reading from peers failed. However, the
short-circuiting if there is only a single peer was not removed. This has the
effect of never restoring state in an alertmanager if only running a single
replica.

Fixes #2245

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Code review changes

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
masonmei pushed a commit to udmire/mimir that referenced this pull request Jul 11, 2022
* Restore alertmanager state from storage as fallback

In cortexproject/cortex#3925 the ability to restore alertmanager state from
peer alertmanagers was added, short-circuiting if there is only a single
replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read
state from storage was added in case reading from peers failed. However, the
short-circuiting if there is only a single peer was not removed. This has the
effect of never restoring state in an alertmanager if only running a single
replica.

Fixes grafana#2245

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Code review changes

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants