
race condition when flock to chef pause file fails on a node #34

Closed
tpatzig opened this issue Apr 19, 2017 · 11 comments

tpatzig commented Apr 19, 2017

We had an outage in production.
Symptoms:

nova-api was no longer available. haproxy.cfg was changed by a local chef-client run on the control node running haproxy: all nova services were removed from haproxy.cfg. nova.conf was changed as well: all listen IPs/ports were reset to their defaults.

To fix it quickly, we restored the local chef backup of the config files.

First analysis:

The nova proposal was applied to include new computes. One older compute had an issue and the proposal failed with:

"crowbar-failed": "Failed to apply the proposal: On d00-25-b5-a0-02-d7.os4.eu-de-1.cc.cloud.sap, 'flock /var/chef/cache/pause-file.lock.meta bash -es' (pid 2146) failed, exitcode 255\nSTDERR:\nssh: connect to host d00-25-b5-a0-02-d7.os4.eu-de-1.cc.cloud.sap port 22: No route to host

At the same time (or right after), the local/periodic chef-client run on the control nodes happened (because the flock didn't run on all nodes) and changed nova.conf and haproxy.cfg.
The local chef-client run failed:

ERROR: RuntimeError: crowbar-pacemaker_sync_mark[wait-nova_database] (nova::database line 28) had an error: RuntimeError: Cluster founder didn't set nova_database to 1!

The node mentioned in the proposal error was no longer in the kvm compute list of the nova proposal, but it still had the nova role assigned in its crowbar role. We manually removed it and re-applied the nova proposal. That run was successful, with no haproxy.cfg or nova.conf changes. The following local chef-client runs were also successful.

So my guess is that the missing flock on the chef pause file resulted in this race condition, where the local chef-client got wrong/default data (from the founder?).
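For context, a rough sketch of the pause-file mechanism as it appears in the error above (illustration only, not the actual Crowbar code; node1 is a placeholder):

# During an apply, a lock is taken on each affected node over ssh, so that the
# periodic chef-client run is held off while the proposal is being applied.
ssh root@node1 "flock /var/chef/cache/pause-file.lock.meta bash -es" <<'EOF'
echo "apply-time work runs here while the pause lock is held"
EOF
# If the ssh itself fails ("No route to host"), the lock is never taken on that
# node, so its periodic chef-client is free to run in the middle of the apply.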

@matelakat

This is blocked because @vuntz is missing a profile, so he cannot log in to the system and investigate.


vuntz commented Apr 24, 2017

For the record, the "no route to host" error occurred at 2017-04-19T13:42:25.501410 (in case others need to look at production.log).


vuntz commented Apr 24, 2017

So, I'm seeing some nodes where the nova entries for haproxy got dropped during the periodic chef-client runs at 2017-04-19T14:45:17+00:00 and 2017-04-19T15:13:53+00:00. What's interesting is that these are the first periodic chef-client runs after the apply failure.


vuntz commented Apr 24, 2017

So here's what happened:

  • the periodic chef run didn't do HA for nova, because the HA attribute in nova was absent / set to false
  • this happened because apply_role_pre_chef_client wasn't called
  • it wasn't called because of the lock issue, which happens in apply_role before apply_role_pre_chef_client is called
  • so why was the attribute wrong? Because before apply_role, we copy the proposal to the chef role nova-config-default, and the data in there is not fully "correct" until apply_role_pre_chef_client is called. Therefore we ended up with the nova attributes coming from the nova-config-default role being incomplete (basically lacking everything from apply_role_pre_chef_client, including the HA bits).

Not sure yet how to best fix it.
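For anyone checking this on an affected admin node, the incomplete data should be visible directly in the saved chef role. A hedged sketch (the role name nova-config-default is taken from the explanation above; the knife invocation is standard Chef tooling, and the grep pattern assumes the HA settings live under an "ha" key, as the nova.ha.enabled attribute suggests):

# On the Crowbar admin node, dump the role the proposal was copied to and look
# for the HA attributes that apply_role_pre_chef_client would normally have
# added; after a failed apply they are expected to be missing.
knife role show nova-config-default -F json | grep -A 2 '"ha"'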

vuntz added a commit to vuntz/crowbar-core that referenced this issue Apr 24, 2017
Until now, we saved the role for the applied proposal early in
apply_role. However, apply_role could fail after that but still before
apply_role_pre_chef_client, resulting in the role containing incomplete
data (since apply_role_pre_chef_client is used to process the role and
add some info, like "use HA").

In case of such a failure (and that can happen when we fail to get a
lock for a node), then a periodic chef-client run would use a role with
incomplete information, possibly leading to changing the config on the
node wrongly.

Closes sap-oc/crowbar-openstack#34

vuntz commented Apr 24, 2017

I think crowbar/crowbar-core#1225 is the right fix.

matelakat self-assigned this Apr 25, 2017
@matelakat

Testing the change locally.


matelakat commented Apr 25, 2017

Steps to reproduce:

  1. Have an HA cloud:
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled:  true
  2. Deny crowbar from connecting to node1:
root@crowbar:~ # iptables -A OUTPUT -p tcp --dport ssh -d 192.168.124.82 -j REJECT
  3. Apply nova barclamp through the UI - see that it fails
  4. Log in to node1, do a chef-client run
  5. See that ha is reported to be false:
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled:  false
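An additional check for this reproduction, assuming node1 is a control node running haproxy (the path /etc/haproxy/haproxy.cfg is the stock location, not confirmed from the logs above): the run with the incomplete role should also drop the nova entries from haproxy.cfg, mirroring the original outage.

# Count the nova entries left in haproxy.cfg; expected to drop to zero after
# the chef-client run that picked up the incomplete role.
root@node1:~ # grep -c nova /etc/haproxy/haproxy.cfg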

In case the node got fenced, you can recover it with:

rm  /var/spool/corosync/block_automatic_start
systemctl start crowbar_join.service


tpatzig commented Apr 25, 2017

Great! That's exactly how it can be reproduced. Thanks @matelakat!

@matelakat

Applying the fix:

curl -L https://patch-diff.githubusercontent.com/raw/crowbar/crowbar-core/pull/1225.patch | patch -p1 -d /opt/dell
systemctl restart crowbar.service
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled:  true
# Apply nova proposal, see that it fails
root@crowbar:~ # iptables -D OUTPUT -p tcp --dport ssh -d 192.168.124.82 -j REJECT
root@crowbar:~ # ssh node1 chef-client
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled:  true

vuntz added a commit to vuntz/crowbar-core that referenced this issue May 10, 2017
Until now, we saved the role for the applied proposal early in
apply_role. However, apply_role could fail after that but still before
apply_role_pre_chef_client, resulting in the role containing incomplete
data (since apply_role_pre_chef_client is used to process the role and
add some info, like "use HA").

In case of such a failure (and that can happen when we fail to get a
lock for a node), then a periodic chef-client run would use a role with
incomplete information, possibly leading to changing the config on the
node wrongly.

This partially addresses
https://bugzilla.suse.com/show_bug.cgi?id=857375

Closes sap-oc/crowbar-openstack#34
vuntz added a commit to vuntz/crowbar-core that referenced this issue May 10, 2017
Until now, we saved the role for the applied proposal early in
apply_role. However, apply_role could fail after that but still before
apply_role_pre_chef_client, resulting in the role containing incomplete
data (since apply_role_pre_chef_client is used to process the role and
add some info, like "use HA").

In case of such a failure (and that can happen when we fail to get a
lock for a node), then a periodic chef-client run would use a role with
incomplete information, possibly leading to changing the config on the
node wrongly.

This partially addresses
https://bugzilla.suse.com/show_bug.cgi?id=857375

Closes sap-oc/crowbar-openstack#34

(cherry picked from commit 23dbcca)
@vuntz
Copy link

vuntz commented May 10, 2017

Backport at sap-oc/crowbar-core#29

vuntz added a commit to vuntz/crowbar-core that referenced this issue Oct 9, 2017
Until now, we saved the role for the applied proposal early in
apply_role. However, apply_role could fail after that but still before
apply_role_pre_chef_client, resulting in the role containing incomplete
data (since apply_role_pre_chef_client is used to process the role and
add some info, like "use HA").

In case of such a failure (and that can happen when we fail to get a
lock for a node), then a periodic chef-client run would use a role with
incomplete information, possibly leading to changing the config on the
node wrongly.

This partially addresses
https://bugzilla.suse.com/show_bug.cgi?id=857375

Closes sap-oc/crowbar-openstack#34

(cherry picked from commit 23dbcca)