
race condition when flock to chef pause file fails on a node #34

Closed
tpatzig opened this issue Apr 19, 2017 · 11 comments

tpatzig commented Apr 19, 2017

We had an outage in production.
Symptoms:

nova-api was no longer available. haproxy.cfg was changed by a local chef-client run on the control node running haproxy: all nova services were removed from haproxy.cfg. nova.conf was changed as well: all listen IPs/ports were reset to their defaults.

To fix it quickly, we restored the local chef backup of the config files.

First analysis:

The nova proposal was applied to include new computes. One older compute had an issue and the proposal failed with:

"crowbar-failed": "Failed to apply the proposal: On d00-25-b5-a0-02-d7.os4.eu-de-1.cc.cloud.sap, 'flock /var/chef/cache/pause-file.lock.meta bash -es' (pid 2146) failed, exitcode 255\nSTDERR:\nssh: connect to host d00-25-b5-a0-02-d7.os4.eu-de-1.cc.cloud.sap port 22: No route to host

At the same time (or right after), the local/periodic chef-client run on the control nodes happened (because the flock didn't run on all nodes) and changed nova.conf and haproxy.cfg.
The local chef-client run failed:

ERROR: RuntimeError: crowbar-pacemaker_sync_mark[wait-nova_database] (nova::database line 28) had an error: RuntimeError: Cluster founder didn't set nova_database to 1!

The node mentioned in the proposal error was no longer in the kvm compute list of the nova proposal, but it still had the nova role assigned in its crowbar role. We manually removed it and re-applied the nova proposal. That run was successful, with no haproxy.cfg or nova.conf changes. The following local chef-client runs were also successful.

So my guess is that the missing flock on the chef pause file resulted in this race condition, where the local chef-client got wrong/default data (from the founder?).
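For context, a rough sketch of the pause-file mechanism as it appears in the error above (illustration only, not the actual Crowbar code; node1 is a placeholder):

# During an apply, a lock is taken on each affected node over ssh, so that the
# periodic chef-client run is held off while the proposal is being applied.
ssh root@node1 "flock /var/chef/cache/pause-file.lock.meta bash -es" <<'EOF'
echo "apply-time work runs here while the pause lock is held"
EOF
# If the ssh itself fails ("No route to host"), the lock is never taken on that
# node, so its periodic chef-client is free to run in the middle of the apply.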

@matelakat

This is blocked because @vuntz is missing a profile, so he cannot log in to the system and investigate.


vuntz commented Apr 24, 2017

For the record, the "no route to host" error occurred at 2017-04-19T13:42:25.501410 (in case others need to look at production.log).


vuntz commented Apr 24, 2017

So, I'm seeing some nodes where the nova entries for haproxy got dropped during the periodic chef-client runs at 2017-04-19T14:45:17+00:00 and 2017-04-19T15:13:53+00:00. What's interesting is that these are the first periodic chef-client runs after the apply failure.


vuntz commented Apr 24, 2017

So here's what happened:

  • the periodic chef run didn't do HA for nova, because the HA attribute in nova was absent / set to false
  • this happened because apply_role_pre_chef_client wasn't called
  • it wasn't called because of the lock issue, which happens in apply_role before apply_role_pre_chef_client is called
  • so why was the attribute wrong? Because before apply_role, we copy the proposal to the chef role nova-config-default, and the data in there is not fully "correct" until apply_role_pre_chef_client is called. Therefore we ended up with the nova attributes coming from the nova-config-default role being incomplete (basically lacking everything from apply_role_pre_chef_client, including the HA bits).

Not sure yet how to best fix it.
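For anyone checking this on an affected admin node, the incomplete data should be visible directly in the saved chef role. A hedged sketch (the role name nova-config-default is taken from the explanation above; the knife invocation is standard Chef tooling, and the grep pattern assumes the HA settings live under an "ha" key, as the nova.ha.enabled attribute suggests):

# On the Crowbar admin node, dump the role the proposal was copied to and look
# for the HA attributes that apply_role_pre_chef_client would normally have
# added; after a failed apply they are expected to be missing.
knife role show nova-config-default -F json | grep -A 2 '"ha"'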

vuntz added a commit to vuntz/crowbar-core that referenced this issue Apr 24, 2017
Until now, we saved the role for the applied proposal early in
apply_role. However, apply_role could fail after that but still before
apply_role_pre_chef_client, resulting in the role containing incomplete
data (since apply_role_pre_chef_client is used to process the role and
add some info, like "use HA").

In case of such a failure (and that can happen when we fail to get a
lock for a node), then a periodic chef-client run would use a role with
incomplete information, possibly leading to changing the config on the
node wrongly.

Closes sap-oc/crowbar-openstack#34

vuntz commented Apr 24, 2017

I think crowbar/crowbar-core#1225 is the right fix.

matelakat self-assigned this Apr 25, 2017
@matelakat

Testing the change locally.


matelakat commented Apr 25, 2017

Steps to reproduce:

  1. Have an HA cloud:
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled:  true
  2. Deny crowbar from connecting to node1:
root@crowbar:~ # iptables -A OUTPUT -p tcp --dport ssh -d 192.168.124.82 -j REJECT
  3. Apply nova barclamp through the UI - see that it fails
  4. Log in to node1, do a chef-client run
  5. See that ha is reported to be false:
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled:  false
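An additional check for this reproduction, assuming node1 is a control node running haproxy (the path /etc/haproxy/haproxy.cfg is the stock location, not confirmed from the logs above): the run with the incomplete role should also drop the nova entries from haproxy.cfg, mirroring the original outage.

# Count the nova entries left in haproxy.cfg; expected to drop to zero after
# the chef-client run that picked up the incomplete role.
root@node1:~ # grep -c nova /etc/haproxy/haproxy.cfg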

In case the node got fenced, you can recover it with:

rm  /var/spool/corosync/block_automatic_start
systemctl start crowbar_join.service


tpatzig commented Apr 25, 2017

Great! That's exactly how it can be reproduced. Thanks @matelakat!

@matelakat

Applying the fix:

curl -L https://patch-diff.githubusercontent.com/raw/crowbar/crowbar-core/pull/1225.patch | patch -p1 -d /opt/dell
systemctl restart crowbar.service
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled:  true
# Apply nova proposal, see that it fails
root@crowbar:~ # iptables -D OUTPUT -p tcp --dport ssh -d 192.168.124.82 -j REJECT
root@crowbar:~ # ssh node1 chef-client
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled:  true

vuntz added a commit to vuntz/crowbar-core that referenced this issue May 10, 2017
Until now, we saved the role for the applied proposal early in
apply_role. However, apply_role could fail after that but still before
apply_role_pre_chef_client, resulting in the role containing incomplete
data (since apply_role_pre_chef_client is used to process the role and
add some info, like "use HA").

In case of such a failure (and that can happen when we fail to get a
lock for a node), then a periodic chef-client run would use a role with
incomplete information, possibly leading to changing the config on the
node wrongly.

This partially addresses
https://bugzilla.suse.com/show_bug.cgi?id=857375

Closes sap-oc/crowbar-openstack#34
vuntz added a commit to vuntz/crowbar-core that referenced this issue May 10, 2017
Until now, we saved the role for the applied proposal early in
apply_role. However, apply_role could fail after that but still before
apply_role_pre_chef_client, resulting in the role containing incomplete
data (since apply_role_pre_chef_client is used to process the role and
add some info, like "use HA").

In case of such a failure (and that can happen when we fail to get a
lock for a node), then a periodic chef-client run would use a role with
incomplete information, possibly leading to changing the config on the
node wrongly.

This partially addresses
https://bugzilla.suse.com/show_bug.cgi?id=857375

Closes sap-oc/crowbar-openstack#34

(cherry picked from commit 23dbcca)
@vuntz
Copy link

vuntz commented May 10, 2017

Backport at sap-oc/crowbar-core#29

vuntz added a commit to vuntz/crowbar-core that referenced this issue Oct 9, 2017
Until now, we saved the role for the applied proposal early in
apply_role. However, apply_role could fail after that but still before
apply_role_pre_chef_client, resulting in the role containing incomplete
data (since apply_role_pre_chef_client is used to process the role and
add some info, like "use HA").

In case of such a failure (and that can happen when we fail to get a
lock for a node), then a periodic chef-client run would use a role with
incomplete information, possibly leading to changing the config on the
node wrongly.

This partially addresses
https://bugzilla.suse.com/show_bug.cgi?id=857375

Closes sap-oc/crowbar-openstack#34

(cherry picked from commit 23dbcca)