
[Kubernetes] Peers ip caching causes all clusters to degrade over time #2250

Closed
grzesuav opened this issue May 5, 2020 · 13 comments · Fixed by prometheus-community/helm-charts#4877


@grzesuav

grzesuav commented May 5, 2020

What did you do?

We have deployed several different Alertmanagers, in different namespaces.

What did you expect to see?

Each alertmanager installation forms a separate cluster

What did you see instead? Under which circumstances?

After some time, we noticed that all those installations had formed one big cluster.
We think that:

  • the address of each peer passed via --cluster.peer=<DNS address> is resolved to an IP address, and later on this IP address is used
  • when an alertmanager instance is restarted/evicted to a different node/etc., its IP address can be assigned to another pod, which may happen to be an alertmanager instance from a different cluster
  • they join into one big cluster

Environment

Kubernetes cluster

  • System information:

    will add if needed

  • Alertmanager version:

    insert output of alertmanager --version here

Different versions - 0.20, 0.17, 0.18

  • Prometheus version:

    not related

  • Alertmanager configuration file:

will add if needed

  • Prometheus configuration file:

will add if needed

  • Logs:

will add if needed

@grzesuav changed the title from "[Kubernetes] Peers ip caching causes all cluster to degrade over time" to "[Kubernetes] Peers ip caching causes all clusters to degrade over time" on May 5, 2020
@simonpasquier
Member

Can you elaborate about your setup? Which names do you use for --cluster.peer?
Alertmanager will try to reconnect to a previously known IP address for 6 hours by default. After this, it will forget about it.

@grzesuav
Author

grzesuav commented May 7, 2020

hi @simonpasquier ,
so my alertmanager pod arg line looks like

  - args:
    - --config.file=/etc/alertmanager/config/alertmanager.yaml
    - --cluster.listen-address=[$(POD_IP)]:9094
    - --storage.path=/alertmanager
    - --data.retention=120h
    - --web.listen-address=:9093
    - --web.external-url=https://alertmanager-address
    - --web.route-prefix=/
    - --cluster.peer=alertmanager-main-0.alertmanager-operated.namespace.svc:9094
    - --cluster.peer=alertmanager-main-1.alertmanager-operated.namespace.svc:9094
    - --cluster.peer=alertmanager-main-2.alertmanager-operated.namespace.svc:9094

and yes, fresh after a hard reset of all instances at the same time, it starts with three peers. But previously it had formed a cluster along with all the other alertmanager installations in our cluster (over 50 instances, I guess?)

So, from what you are saying, if during those 6 hours the IP gets re-assigned to a completely different pod, AM will try to form a cluster with this new pod?

@simonpasquier
Member

So, from what you are saying, if during those 6 hours the IP gets re-assigned to a completely different pod, AM will try to form a cluster with this new pod?

yes

@simonpasquier
Member

BTW you can set the --cluster.reconnect-timeout flag to a lower value than the default 6 hours.
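
For anyone deploying Alertmanager directly (outside prometheus-operator, which is discussed below), a minimal sketch of where that flag would go; the peer names and the 15m value are only illustrative, mirroring the args already posted in this thread:

```
  - args:
    - --config.file=/etc/alertmanager/config/alertmanager.yaml
    - --cluster.listen-address=[$(POD_IP)]:9094
    - --cluster.peer=alertmanager-main-0.alertmanager-operated.namespace.svc:9094
    - --cluster.peer=alertmanager-main-1.alertmanager-operated.namespace.svc:9094
    - --cluster.peer=alertmanager-main-2.alertmanager-operated.namespace.svc:9094
    # Forget a cached peer IP after 15 minutes instead of the default 6h, so a
    # recycled pod IP from another installation stops being contacted much sooner.
    - --cluster.reconnect-timeout=15m
```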

@grzesuav
Author

Actually I'm trying to resolve it with NetworkPolicies, however it is still a workaround. I would still consider this something which should be handled, or at least mentioned in the documentation (we are using prometheus-operator, so we rely a bit on the defaults there), as in our case silencing an alert on one alertmanager caused somebody else to miss a notification.
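
For reference, a minimal sketch of the kind of NetworkPolicy meant here; the namespace and labels are hypothetical, and it assumes the CNI actually enforces NetworkPolicy and that gossip uses port 9094 over both TCP and UDP:

```
# Hypothetical policy: only pods of the same Alertmanager installation (matched by
# label, in the same namespace) may reach the gossip port, so a reused IP from a
# different installation is rejected at the network layer.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: alertmanager-gossip-isolation
  namespace: monitoring              # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: alertmanager              # hypothetical label on the Alertmanager pods
  policyTypes:
    - Ingress
  ingress:
    # Gossip (9094): only from peers of the same installation.
    - from:
        - podSelector:
            matchLabels:
              app: alertmanager
      ports:
        - protocol: TCP
          port: 9094
        - protocol: UDP
          port: 9094
    # Web/API (9093): once a pod is selected by a policy, anything not listed is
    # denied, so keep the UI/API port open for Prometheus and users.
    - ports:
        - protocol: TCP
          port: 9093
```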

@bjakubski

I found this issue when trying to figure out why alertmanagers formed a mesh across cluster boundaries. I was only investigating because alerts about alertmanager being in an inconsistent state fired; nothing more serious happened.
Same story - prometheus-operator, k8s.
Granted, the issue was caused mostly by a misconfiguration on our side (a different cluster has the same IP range and is routable).

I do find it surprising that alertmanager (often used in dynamic environments like k8s) will not try to resolve the names on every connection attempt.

@hwoarang

hwoarang commented Aug 3, 2020

BTW you can set the --cluster.reconnect-timeout flag to a lower value than the default 6 hours.

That's a reasonable suggestion, but prometheus-operator does not let you pass additional parameters to the alertmanager instance. All in all, this is something that needs to be addressed there, of course.

hwoarang added a commit to hwoarang/prometheus-operator that referenced this issue Aug 24, 2020
In a highly dynamic environment like Kubernetes, it's possible that
alertmanager pods come and go at frequent intervals. The default timeout
value of 6h is not suitable in that case, as alertmanager will keep
trying to reconnect to a non-existing pod over and over until it gives
up and goes through another DNS resolution process. As such, it's best
to use a lower value, which will allow the alertmanager cluster to
recover in case of an update/rollout/etc. process in the kubernetes
cluster.

Related: prometheus/alertmanager#2250
hwoarang added a commit to hwoarang/prometheus-operator that referenced this issue Aug 25, 2020
Alertmanager in cluster mode resolves the DNS name of each peer and
caches its IP address, which it uses at regular intervals to 'refresh'
the connection.

In a highly dynamic environment like Kubernetes, it's possible that
alertmanager pods come and go at frequent intervals. The default timeout
value of 6h is not suitable in that case, as alertmanager will keep
trying to reconnect to a non-existing pod over and over until it gives
up and removes that peer from the member list. During this period of
time, the cluster is reported to be in a degraded state due to the
missing member.

As such, it's best to use a lower value, which will allow
alertmanager to remove the pod from the list of peers soon
after it disappears.

Related: prometheus/alertmanager#2250
@b10s

b10s commented Sep 6, 2021

I've got the same issue and am able to reproduce it.

I think it is in the nature of the gossip protocol, which should be suppressed a bit by alertmanager, since it has knowledge of the peers from the config file and can verify the table of available peers.

UPD
To reproduce:

  1. Start your kind cluster:
$ kind create cluster
...
  2. Deploy two alertmanager clusters there:
$ helm install my-release foo/bar
$ helm install my-bad-release foo/bar
  3. Find your kind k8s cluster container and enter it:
docker exec -it 942e41a1c6e6 bash
  4. Inside the container, change the CNI settings and restart the kubelet:
# sed -i 's/"subnet": "10.244.0.0\/24"/"subnet": "10.244.0.0\/28"/g' /etc/cni/net.d/10-kindnet.conflist
# systemctl restart kubelet
  5. Create a few more Pods with nginx to make sure there are no more available IPs.

  6. Delete one alertmanager Pod from one cluster and one from the other using the same command, so there is a chance they will reuse each other's IPs.

  7. Enjoy the merged alertmanager cluster.


 Args:
      --storage.path=/alertmanager
      --config.file=/config_out/alertmanager.yml
      --cluster.advertise-address=$(POD_IP):9094
      --cluster.listen-address=0.0.0.0:9094
      --cluster.peer=my-release-alertmanager-0.my-release-alertmanager-headless:9094
      --cluster.peer=my-release-alertmanager-1.my-release-alertmanager-headless:9094
      --cluster.peer=my-release-alertmanager-2.my-release-alertmanager-headless:9094

You can see there are only three peers here.

Before making them switch IPs, their IP assignment was:

[screenshot: initial pod IP assignments]

After restarting a few Pods a few times, I can make them reuse IPs:

[screenshot: pod IP assignments after the restarts]

Since the other Pods were not restarted, they still keep the old IPs in their gossip table of available peers. Therefore the two clusters merge into one:
[screenshot: merged cluster member list]
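
To confirm which peers each instance actually sees (beyond the screenshots above), you can query the Alertmanager status API; a sketch, with a hypothetical pod name and namespace, assuming the default web port 9093 and that jq is available locally:

```
# Forward the web port of one instance to your machine (pod/namespace are hypothetical).
$ kubectl -n monitoring port-forward pod/my-release-alertmanager-0 9093:9093 &

# The v2 status endpoint lists the gossip members this instance knows about; addresses
# belonging to the other installation showing up here means the clusters have merged.
$ curl -s http://localhost:9093/api/v2/status | jq '.cluster.peers[].address'
```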

@grzesuav
Author

I am not familiar with the gossip protocol, but I can imagine that using some identifier for the cluster (e.g. the StatefulSet name) to verify whether another pod should join my network would also be a viable solution.

@grzesuav
Author

Of course, at the Alertmanager level it would be a CLI argument, which people not using prom-op would need to set; otherwise some default would be used.

@b10s

b10s commented Oct 22, 2021

@grzesuav ,

It seems TLS support for gossip is coming in am (if not yet released), which is one way to avoid the issue:
https://github.com/prometheus/alertmanager/blob/main/docs/https.md#gossip-traffic

Also some notes:
https://github.com/prometheus/alertmanager/tree/main/examples/ha/tls

Thanks to @simonpasquier for sharing these docs over IRC : )
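
For context, a rough sketch of how that is wired in on the command line; the --cluster.tls-config flag and the file format come from the docs linked above (it is an experimental feature, so double-check the flag for your version), and the mount path is hypothetical:

```
  - args:
    # Experimental mutual TLS for gossip traffic; the referenced file holds the
    # server/client TLS settings described in docs/https.md#gossip-traffic.
    - --cluster.tls-config=/etc/alertmanager/cluster_tls_config.yml
    # ...plus the usual --cluster.listen-address / --cluster.peer flags shown earlier...
```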

@b10s

b10s commented Nov 5, 2021

Another possible solution is to add a cluster ID:
https://groups.google.com/g/prometheus-developers/c/wJ60O2Mk3js/m/qixf31fRBQAJ

greed42 added a commit to greed42/alertmanager that referenced this issue Feb 15, 2023
This is an alternate mechanism for isolating Alertmanager clusters without having to set up the right components of TLS.

It should solve issues such as <prometheus#2250>, although enabling this feature will lead to a loss of non-persisted state. (For example, if you rely on alertmanager cluster peering to maintain silences instead of using persistent volume storage in Kubernetes.) The gossip label serves as the "cluster ID" idea mentioned in <prometheus#2250 (comment)>.

You can enable it with the command-line flag `--cluster.gossip-label`; any non-empty string will form an effective namespace for gossip communication.

If you use Prometheus Operator, you can set the `ALERTMANAGER_CLUSTER_GOSSIP_LABEL` environment variable (as Prometheus Operator does not have a way of adding additional command-line flags). You would need to modify your Alertmanager object something like this:

```
kind: Alertmanager
...
spec:
  ...
  containers:
    - name: alertmanager
      env:
        - name: ALERTMANAGER_CLUSTER_GOSSIP_LABEL
          value: infrastructure-eu-west-2
  ...
```

This is a low-security mechanism, suitable for use with Alertmanager configurations where anyone can add or remove a silence. It protects against surprising cluster expansion due to IP:port re-use.

Signed-off-by: Graham Reed <greed@7deadly.org>
@simonpasquier
Member

This should be fixed by #3354, which allows defining a label identifying the cluster and prevents external instances from joining the cluster if they don't share the same label.
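
For completeness, a sketch of what that looks like once the feature is available; this assumes the flag introduced by #3354 is called --cluster.label and that every member of one installation passes the same value (the value itself is hypothetical):

```
  - args:
    # All peers of this installation must use the same label; gossip traffic carrying a
    # different (or missing) label is rejected, so instances from another Alertmanager
    # cluster can no longer join even if they reuse a former peer's IP address.
    - --cluster.label=namespace/alertmanager-main
    # ...plus the usual --cluster.listen-address / --cluster.peer flags shown earlier...
```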
