change when we send job notifications to avoid a race condition #6290
Conversation
@chrismeyersfsu @AlanCoding I reproduced and verified this by:

In my first attempt at this, the payloads never had all the hosts - only ~100 of the 301 made it through (because they haven't all been inserted yet when the EOF event is persisted):

~ cat notification-body.json | jq '.hosts | length'
103

Here's a run with a single host after my PR:

Here's one that I set up with many hosts:

~ cat notification-body.json | jq '.hosts | length'
301

~ awx-manage dbshell
In [1]: Inventory.objects.first().hosts.count()
Out[1]: 301
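The race described above can be sketched in plain Python (a simplified, hypothetical model; `hosts_table`, `insert_host_events`, and `build_notification_body` are illustrative names, not AWX code):

```python
# Simplified model of the race: host rows are still being inserted when
# the EOF event is persisted, so a notification payload built at EOF time
# only sees a partial host list.
hosts_table = []  # stand-in for the inventory's host rows

def insert_host_events(count):
    """Pretend the callback receiver is still writing host rows."""
    start = len(hosts_table)
    for i in range(start, start + count):
        hosts_table.append(f"host-{i}")

def build_notification_body():
    """Snapshot whatever host rows exist right now."""
    return {"hosts": list(hosts_table)}

insert_host_events(103)            # only some inserts have landed...
early = build_notification_body()  # ...so the EOF-time payload is short

insert_host_events(301 - 103)      # the rest land a moment later
late = build_notification_body()   # a stats-time payload sees all 301

print(len(early["hosts"]), len(late["hosts"]))  # → 103 301
```

Building the payload later (after the stats-time inserts) is what recovers the full 301-host list.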
# event handling code in main.models.events
pass
elif hasattr(uj, 'send_notification_templates'):
    handle_success_and_failure_notifications.apply_async([uj.id])
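The dispatch pattern in the snippet above can be modeled without the real dispatcher (a minimal sketch; `apply_async`, `task_queue`, and `Job` here are toy stand-ins, not AWX internals):

```python
import queue

# Stand-in for the task broker; in AWX this would be the dispatcher.
task_queue = queue.Queue()

def apply_async(task_name, args):
    """Enqueue the work instead of running it inline in the callback worker."""
    task_queue.put((task_name, args))

class Job:
    """Hypothetical unified job with a notification hook."""
    id = 42
    def send_notification_templates(self, status):
        pass

def handle_final_event(uj):
    # Mirror the elif above: defer the (potentially large) notification
    # payload build to a background task.
    if hasattr(uj, 'send_notification_templates'):
        apply_async('handle_success_and_failure_notifications', [uj.id])

handle_final_event(Job())
print(task_queue.queue[0])  # → ('handle_success_and_failure_notifications', [42])
```

The point of the indirection is that the callback worker only pays the cost of one enqueue, not the cost of building and sending the payload.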
Some of these payloads are pretty big, and they really belong in a background task so we're not monopolizing a callback worker waiting a second or two on the job to go into the finished state.
Force-pushed from 259142e to 353f2c1. Build succeeded.
@@ -356,6 +356,14 @@ def _update_from_event_data(self):
    job_id=self.job_id, uuid__in=failed
).update(failed=True)
Why not cram everything else from line 340 up to here into that handle_success_and_failure_notifications task, and remove that lag too? The issue really sounded like it scaled with the number of hosts, so it's extremely likely that the main cause of the race condition is the job host summaries, as opposed to the notifications.
I'm not sure I want to move the saving of job host summary data into a background task. Otherwise there's potentially a scenario where you have a playbook_on_stats event in the database, but the API for the actual summary data is empty/incomplete for some period of time. That seems weird to me, and I worry it could introduce regressions or side effects somewhere.
This isn't in a transaction now, is it? If it's not in a transaction, I don't see how it makes a difference in a separate task vs. the same task regarding the data correspondence you're talking about.

If it works, it works. If the integration test shows this fixes the issue, then you should move ahead with it.

I just couldn't draw the line between the issue and this, because my assumption was that the creation of job host summaries would be the driving O(N) cost for the number of hosts.
This isn't in a transaction now, is it?
Yeah, I think you're actually right here - this is probably autocommit mode since it's in the callback receiver.
I do agree with this, though:
If it works, it works. If the integration test shows this fixes the issue, then you should move ahead with it.
I get a little nervous changing this code too much because it's so critical :).
Also, even if there is a higher cost here that consumes a callback worker for a period of a few seconds, it is just once per playbook run (at the end).
Also, even if there is a higher cost here that consumes a callback worker for a period of a few seconds, it is just once per playbook run (at the end).
But it scales with the number of hosts, and not the number of tasks. Doing a simple thing with a large number of hosts is a common use case.
Delaying the playbook_on_stats also delays the events_processing_finished status of the job, and getting the EOF before the stats event feels like something that could confuse the UI.

But I agree with your overall sentiment - while some of this is new, the host summaries stuff isn't new, and I believe that's the bottleneck. If the new stuff were a bottleneck, that could introduce regressions, but I don't see that being the case.
Thanks Ryan for the PR. Fix looks good. I tested it locally. Here are the results. Create an inventory with 1000 hosts, run the job, and check the notification sent.

1.) Without the above patch
2.) With the above patch

With the above patch I get all 1000 nodes in the notification.
success/failure notifications for *playbooks* include summary data about the hosts based on the contents of the playbook_on_stats event. The current implementation suffers from a number of race conditions that can sometimes cause that data to be missing or incomplete; this change makes it so that for *playbooks* we build (and send) the notification in response to the playbook_on_stats event, not the EOF event.
Force-pushed from 353f2c1 to d40a5de. Build succeeded.
Earlier, I missed the critical change: job notifications moved from the EOF hook to on_commit of the playbook_on_stats processing, which necessarily assures the host summaries are available for them. Polling the job status already exists to handle the events/task-finish synchronization. My prior suggestions can be discarded.

Also, the integration test made it very clear what this addresses, and that it is effective.
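The on_commit idea can be sketched without Django (a toy model; `FakeTransaction`, `save_stats_and_schedule_notification`, and the row/hook names are hypothetical stand-ins for django.db.transaction semantics, not the actual AWX code):

```python
class FakeTransaction:
    """Toy model of commit hooks: registered callbacks run only after commit."""
    def __init__(self):
        self._hooks = []
        self.rows = []
    def on_commit(self, fn):
        self._hooks.append(fn)
    def commit(self):
        for fn in self._hooks:
            fn()

sent = []
txn = FakeTransaction()

def save_stats_and_schedule_notification(job_id):
    # Write the host summaries and register the notification send as a
    # commit hook, so it cannot observe uncommitted (partial) data.
    txn.rows.append(('job_host_summary', job_id))
    txn.on_commit(lambda: sent.append(('notify', job_id)))

save_stats_and_schedule_notification(7)
assert sent == []   # nothing is sent before the summaries are committed
txn.commit()
print(sent)         # → [('notify', 7)]
```

Ordering the send after the commit is what closes the window where a notification could be built from an incomplete host list.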
Build succeeded (gate pipeline).
cc @jainnikhil30