change when we send job notifications to avoid a race condition #6290

Conversation

@ryanpetrello (Contributor) commented Mar 13, 2020

cc @jainnikhil30

success/failure notifications for playbooks include summary data about
the hosts, based on the contents of the playbook_on_stats event

the current implementation suffers from a number of race conditions that
can sometimes cause that data to be missing or incomplete; this change
makes it so that for playbooks we build (and send) the notification in
response to the playbook_on_stats event, not the EOF event
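A minimal sketch of the idea (illustrative names only; the real logic lives in the event-handling code of main.models.events, quoted further down - handle_success_and_failure_notifications is the actual task name, while on_event_saved and the direct call are stand-ins):

# hypothetical sketch, not the actual AWX code
def handle_success_and_failure_notifications(job_id):
    # stand-in for AWX's background task of the same name
    print(f"building and sending notifications for job {job_id}")

def on_event_saved(event_type, job_id):
    if event_type == 'playbook_on_stats':
        # new behavior: the stats event carries the full per-host summary,
        # so a notification built now is complete
        handle_success_and_failure_notifications(job_id)
    elif event_type == 'EOF':
        # old behavior: building the notification here raced against host
        # summary rows that might not have been persisted yet
        pass

on_event_saved('playbook_on_stats', 45)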

@ryanpetrello (Contributor, Author) commented Mar 13, 2020

@chrismeyersfsu @AlanCoding I reproduced and verified this by:

  1. Setting up an inventory with 300+ total hosts, all using ansible_connection: local.
  2. Configuring a web hook notification on success and failure for a JT, and running a simple hello world playbook with that inventory.
  3. Running an HTTP service to capture the JSON payload for the notification (a minimal capture-server sketch follows this list).
  4. Taking the "hosts" value in the payload and determining the number of keys (hosts) in the dictionary.
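For step 3, a minimal capture service only needs to dump each POST body to notification-body.json so it can be inspected with jq; a stand-in sketch (assuming Python 3's standard library - the actual service used isn't shown in this thread):

# capture.py - hypothetical minimal capture server for step 3
from http.server import BaseHTTPRequestHandler, HTTPServer

class CaptureHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        with open('notification-body.json', 'wb') as f:
            f.write(self.rfile.read(length))
        self.send_response(200)
        self.end_headers()

# port 9000 matches the Host header in the payload shown below
HTTPServer(('', 9000), CaptureHandler).serve_forever()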

In my first attempt at this, the payloads never had all the hosts - only ~100 of the 301 made it through (because they hadn't all been inserted yet when the EOF event was persisted):

~ cat notification-body.json | jq '.hosts | length'
103

Here's a run with a single host after my PR:

POST / HTTP/1.1
Host: localhost:9000
User-Agent: Tower 9.3.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Content-Length: 558
Content-Type: application/json

{"id": 45, "name": "Demo Job Template", "url": "https://towerhost/#/jobs/playbook/45", "created_by": "ryan", "started": "2020-03-13T21:45:24.951700+00:00", "finished": "2020-03-13T21:45:28.797077+00:00", "status": "successful", "traceback": "", "inventory": "Demo Inventory", "project": "Demo Project", "playbook": "hello_world.yml", "credential": "Demo Credential", "limit": "localhost", "extra_vars": "{}", "hosts": {"localhost": {"failed": false, "changed": 0, "dark": 0, "failures": 0, "ok": 2, "processed": 1, "skipped": 0, "rescued": 0, "ignored": 0}}}

Here's one that I set up with many hosts:

~ cat notification-body.json | jq '.hosts | length'
301
~ awx-manage shell_plus
In [1]: Inventory.objects.first().hosts.count()
Out[1]: 301

# event handling code in main.models.events (excerpt from this PR's diff)
pass
elif hasattr(uj, 'send_notification_templates'):
    # build and send the notification in a background task rather than
    # inline in the callback receiver
    handle_success_and_failure_notifications.apply_async([uj.id])
@ryanpetrello (Contributor, Author) commented Mar 13, 2020

Some of these payloads are pretty big, and they really belong in a background task so we're not monopolizing a callback worker while waiting a second or two for the job to go into the finished state.

@softwarefactory-project-zuul (Contributor)

Build succeeded.

@@ -356,6 +356,14 @@ def _update_from_event_data(self):
job_id=self.job_id, uuid__in=failed
).update(failed=True)
@AlanCoding (Member) commented:
Why not cram everything else from line 340 up to here into that handle_success_and_failure_notifications task, and remove that lag too? The issue really sounded like it scaled with the number of hosts, so it's extremely likely that the job host summaries, as opposed to the notifications, are the main cause of the race condition.

@ryanpetrello (Contributor, Author) commented Mar 16, 2020

I'm not sure I want to move the saving of job host summary data into a background task; if we did, there's potentially a window where a playbook_on_stats event exists in the database but the API's actual summary data is empty/incomplete for some period of time. That seems weird to me, and I worry it could introduce regressions or side effects somewhere.

@AlanCoding (Member) commented:

This isn't in a transaction now, is it? If it's not in a transaction, I don't see how it makes a difference in a separate task vs. same task regarding the data correspondence you're talking about.

If it works, it works. If the integration test shows this fixes the issue, then you should move ahead with it.

I just couldn't draw the line between the issue and this, because my assumption was that the creation of job host summaries would be the driving O(N) cost in the number of hosts.

@ryanpetrello (Contributor, Author) commented Mar 16, 2020

> This isn't in a transaction now, is it?

Yeah, I think you're right here - this is probably running in autocommit mode since it's in the callback receiver.

I do agree with this, though:

> If it works, it works. If the integration test shows this fixes the issue, then you should move ahead with it.

I get a little nervous changing this code too much because it's so critical :).

Also, even if there is a higher cost here that consumes a callback worker for a period of a few seconds, it is just once per playbook run (at the end).

@AlanCoding (Member) commented:

> Also, even if there is a higher cost here that consumes a callback worker for a period of a few seconds, it is just once per playbook run (at the end).

But it scales with the number of hosts, and not the number of tasks. Doing a simple thing with a large number of hosts is a common use case.

Delaying the playbook_on_stats also delays the events_processing_finished status of the job, and getting the EOF before the stats event feels like something that could confuse the UI.

But I agree with your overall sentiment - while some of this is new, the host summaries stuff isn't, and I believe that's the bottleneck. If the new stuff were the bottleneck, it could introduce regressions, but I don't see that being the case.

@jainnikhil30 (Contributor) commented:

Thanks, Ryan, for the PR. The fix looks good; I tested it locally. Here are the results.

Test: create an inventory with 1000 hosts, run the job, and check the notification that gets sent.

1.) Without the above patch

[niks@localhost-localdomain tmp]$ curl -k -u admin https://18.138.58.238/api/v2/notifications/41737/ | grep -o "1.1.1." | wc -l
Enter host password for user 'admin':
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 68493  100 68493    0     0  86372      0 --:--:-- --:--:-- --:--:-- 86372
228

2.) With the above patch

[niks@localhost-localdomain tmp]$ curl -k -u admin https://localhost:8043/api/v2/notifications/2/ | grep -o "1.1.1." | wc -l
Enter host password for user 'admin':
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  289k  100  289k    0     0  1383k      0 --:--:-- --:--:-- --:--:-- 1376k
1000

With the above patch, all 1000 hosts appear in the notification.

@ryanpetrello force-pushed the notification-host-summary-race branch from 353f2c1 to d40a5de on March 19, 2020 at 14:02
@softwarefactory-project-zuul (Contributor)

Build succeeded.

@AlanCoding (Member) left a review comment:

Earlier, I missed the critical change: job notifications moved from the EOF hook to on_commit of the playbook_on_stats processing, which necessarily assures the host summaries are available when the notification is built. Polling of the job status already exists to handle the events/task-finish synchronization. My prior suggestions can be discarded.

Also, the integration test made it very clear what this addresses, and that it is effective.
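A hedged sketch of the on_commit pattern referenced above, using Django's standard transaction API (the exact AWX call site may differ; update_host_summaries is a hypothetical helper standing in for the real host-summary writes):

# illustrative only - not the actual AWX implementation
from django.db import transaction

def save_playbook_on_stats(event):
    with transaction.atomic():
        event.save()                  # persist the stats event itself
        update_host_summaries(event)  # hypothetical helper: per-host summary rows
        # fire the notification task only once everything above has actually
        # been committed, so the payload can never observe partially
        # inserted host summaries
        transaction.on_commit(
            lambda: handle_success_and_failure_notifications.apply_async([event.job_id])
        )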

@softwarefactory-project-zuul (Contributor)

Build succeeded (gate pipeline).
