change when we send job notifications to avoid a race condition #6290

Conversation

@ryanpetrello (Contributor) commented Mar 13, 2020

cc @jainnikhil30

success/failure notifications for playbooks include summary data about
the hosts, based on the contents of the playbook_on_stats event

the current implementation suffers from a number of race conditions that
can sometimes cause that data to be missing or incomplete; this change
makes it so that for playbooks we build (and send) the notification in
response to the playbook_on_stats event, not the EOF event
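A minimal sketch of the idea (illustrative names only; the real logic lives in the event-handling code of main.models.events, quoted further down - handle_success_and_failure_notifications is the actual task name, while on_event_saved and the direct call are stand-ins):

# hypothetical sketch, not the actual AWX code
def handle_success_and_failure_notifications(job_id):
    # stand-in for AWX's background task of the same name
    print(f"building and sending notifications for job {job_id}")

def on_event_saved(event_type, job_id):
    if event_type == 'playbook_on_stats':
        # new behavior: the stats event carries the full per-host summary,
        # so a notification built now is complete
        handle_success_and_failure_notifications(job_id)
    elif event_type == 'EOF':
        # old behavior: building the notification here raced against host
        # summary rows that might not have been persisted yet
        pass

on_event_saved('playbook_on_stats', 45)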

@ryanpetrello (Contributor, Author) commented Mar 13, 2020

@chrismeyersfsu @AlanCoding I reproduced and verified this by:

  1. Setting up an inventory with 300+ total hosts, all using ansible_connection: local.
  2. Configuring a web hook notification on success and failure for a JT, and running a simple hello world playbook with that inventory.
  3. Running an HTTP service to capture the JSON payload for the notification (a minimal capture-server sketch follows this list).
  4. Taking the "hosts" value in the payload and determining the number of keys (hosts) in the dictionary.
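For step 3, a minimal capture service only needs to dump each POST body to notification-body.json so it can be inspected with jq; a stand-in sketch (assuming Python 3's standard library - the actual service used isn't shown in this thread):

# capture.py - hypothetical minimal capture server for step 3
from http.server import BaseHTTPRequestHandler, HTTPServer

class CaptureHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        with open('notification-body.json', 'wb') as f:
            f.write(self.rfile.read(length))
        self.send_response(200)
        self.end_headers()

# port 9000 matches the Host header in the payload shown below
HTTPServer(('', 9000), CaptureHandler).serve_forever()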

In my first attempt at this, the payloads never had all the hosts - only ~100 of the 301 made it through (because they hadn't all been inserted yet when the EOF event was persisted):

~ cat notification-body.json | jq '.hosts | length'
103

Here's a run with a single host after my PR:

POST / HTTP/1.1
Host: localhost:9000
User-Agent: Tower 9.3.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Content-Length: 558
Content-Type: application/json

{"id": 45, "name": "Demo Job Template", "url": "https://towerhost/#/jobs/playbook/45", "created_by": "ryan", "started": "2020-03-13T21:45:24.951700+00:00", "finished": "2020-03-13T21:45:28.797077+00:00", "status": "successful", "traceback": "", "inventory": "Demo Inventory", "project": "Demo Project", "playbook": "hello_world.yml", "credential": "Demo Credential", "limit": "localhost", "extra_vars": "{}", "hosts": {"localhost": {"failed": false, "changed": 0, "dark": 0, "failures": 0, "ok": 2, "processed": 1, "skipped": 0, "rescued": 0, "ignored": 0}}}

Here's one that I set up with many hosts:

~ cat notification-body.json | jq '.hosts | length'
301
~ awx-manage shell_plus
In [1]: Inventory.objects.first().hosts.count()
Out[1]: 301

# event handling code in main.models.events (excerpt from this PR's diff)
pass
elif hasattr(uj, 'send_notification_templates'):
    # build and send the notification in a background task rather than
    # inline in the callback receiver
    handle_success_and_failure_notifications.apply_async([uj.id])
@ryanpetrello (Contributor, Author) commented Mar 13, 2020

Some of these payloads are pretty big, and they really belong in a background task so we're not monopolizing a callback worker while waiting a second or two for the job to go into the finished state.

@softwarefactory-project-zuul (Contributor)

Build succeeded.

@@ -356,6 +356,14 @@ def _update_from_event_data(self):
job_id=self.job_id, uuid__in=failed
).update(failed=True)
@AlanCoding (Member) commented:
Why not cram everything else from line 340 up to here into that handle_success_and_failure_notifications task, and remove that lag too? The issue really sounded like it scaled with the number of hosts, so it's extremely likely that the job host summaries, as opposed to the notifications, are the main cause of the race condition.

@ryanpetrello (Contributor, Author) commented Mar 16, 2020

I'm not sure I want to move the saving of job host summary data into a background task; if we did, there's potentially a window where a playbook_on_stats event exists in the database but the API's actual summary data is empty/incomplete for some period of time. That seems weird to me, and I worry it could introduce regressions or side effects somewhere.

@AlanCoding (Member) commented:

This isn't in a transaction now, is it? If it's not in a transaction, I don't see how it makes a difference in a separate task vs. same task regarding the data correspondence you're talking about.

If it works, it works. If the integration test shows this fixes the issue, then you should move ahead with it.

I just couldn't draw the line between the issue and this, because my assumption was that the creation of job host summaries would be the driving O(N) cost in the number of hosts.

@ryanpetrello (Contributor, Author) commented Mar 16, 2020

> This isn't in a transaction now, is it?

Yeah, I think you're right here - this is probably running in autocommit mode since it's in the callback receiver.

I do agree with this, though:

> If it works, it works. If the integration test shows this fixes the issue, then you should move ahead with it.

I get a little nervous changing this code too much because it's so critical :).

Also, even if there is a higher cost here that consumes a callback worker for a period of a few seconds, it is just once per playbook run (at the end).

@AlanCoding (Member) commented:

> Also, even if there is a higher cost here that consumes a callback worker for a period of a few seconds, it is just once per playbook run (at the end).

But it scales with the number of hosts, and not the number of tasks. Doing a simple thing with a large number of hosts is a common use case.

Delaying the playbook_on_stats also delays the events_processing_finished status of the job, and getting the EOF before the stats event feels like something that could confuse the UI.

But I agree with your overall sentiment - while some of this is new, the host summaries stuff isn't, and I believe that's the bottleneck. If the new stuff were the bottleneck, it could introduce regressions, but I don't see that being the case.

@jainnikhil30 (Contributor) commented:

Thanks, Ryan, for the PR. The fix looks good; I tested it locally. Here are the results.

Test: create an inventory with 1000 hosts, run the job, and check the notification that gets sent.

1.) Without the above patch

[niks@localhost-localdomain tmp]$ curl -k -u admin https://18.138.58.238/api/v2/notifications/41737/ | grep -o "1.1.1." | wc -l
Enter host password for user 'admin':
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 68493  100 68493    0     0  86372      0 --:--:-- --:--:-- --:--:-- 86372
228

2.) With the above patch

[niks@localhost-localdomain tmp]$ curl -k -u admin https://localhost:8043/api/v2/notifications/2/ | grep -o "1.1.1." | wc -l
Enter host password for user 'admin':
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  289k  100  289k    0     0  1383k      0 --:--:-- --:--:-- --:--:-- 1376k
1000

With the above patch, all 1000 hosts appear in the notification.

@ryanpetrello force-pushed the notification-host-summary-race branch from 353f2c1 to d40a5de on March 19, 2020 at 14:02
@softwarefactory-project-zuul (Contributor)

Build succeeded.

@AlanCoding (Member) left a review comment:

Earlier, I missed the critical change: job notifications moved from the EOF hook to on_commit of the playbook_on_stats processing, which necessarily assures the host summaries are available when the notification is built. Polling of the job status already exists to handle the events/task-finish synchronization. My prior suggestions can be discarded.

Also, the integration test made it very clear what this addresses, and that it is effective.
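A hedged sketch of the on_commit pattern referenced above, using Django's standard transaction API (the exact AWX call site may differ; update_host_summaries is a hypothetical helper standing in for the real host-summary writes):

# illustrative only - not the actual AWX implementation
from django.db import transaction

def save_playbook_on_stats(event):
    with transaction.atomic():
        event.save()                  # persist the stats event itself
        update_host_summaries(event)  # hypothetical helper: per-host summary rows
        # fire the notification task only once everything above has actually
        # been committed, so the payload can never observe partially
        # inserted host summaries
        transaction.on_commit(
            lambda: handle_success_and_failure_notifications.apply_async([event.job_id])
        )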

@softwarefactory-project-zuul (Contributor)

Build succeeded (gate pipeline).
