Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ResponseOps] Discrepancy between the rule state and task manager when performing bulkEnable or bulkDisable rules #192207

Open
js-jankisalvi opened this issue Sep 5, 2024 · 4 comments
Labels
bug Fixes for quality problems that affect the customer experience Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@js-jankisalvi
Copy link
Contributor

Related to: #181050

When enabling rules, the Alerting framework skips rules if their saved object has alert.attributes.enabled: true.
This behavior creates issues described here and in the attached SDH.

There might be a situation where a rule is marked as enabled, but no corresponding task exists in the Task Manager.
In the UI, these rules will appear enabled but will never run. Users expect that all rules affected by the bulk enable action will get a corresponding task created in the Task Manager and be scheduled for execution.
Therefore, it would be best to also check if the rules to be enabled have tasks in the Task Manager, instead of relying solely on the rule's current enabled state.

@js-jankisalvi js-jankisalvi added bug Fixes for quality problems that affect the customer experience Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Sep 5, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@pmuellr
Copy link
Member

pmuellr commented Sep 5, 2024

From a comment in a previous attempted PR: #189041 ...

In order to make the double update (rule and task) more resilient, we determined that this should be the order of updates:

  • for enable, we should update the task and then the rule
  • for disable, we should update the rule and then the task
analysis

Trying to capture what happens when a request to enable/disable a task is run, but one of the updates to the task doc or the rule doc fails - in this case, it's the second one (task updated then rule => task doc updated, then rule doc updated).

Two tables for each of enable / disable, showing the difference in the ordering of the updates (rule then task, or task then rule). The first column is the input state of the task/doc, and the second column is the final state after the failed update.

The mismatched input states would be the result of a bad enable/disable update, or some other bad thing that happened.

Final state of "task: !enabled, rule: enabled" should be avoided if possible. The rule will look enabled in the UX, but actually will not be running. These final states have an - XXX appended to them.

Final state of "task: enabled, rule: !enabled" is acceptable. In this case, the rule is the source of truth for enablement, so when the task runs, it will check if the rule is enabled. If it isn't, the rule execution code will instead disable the task, so it won't be polled for till re-enabled. Even if that update fails, we'll get it again the next time.

enable: task updated then rule update fails

input state rule update fails
task: enabled, rule: enabled task: enabled, rule: enabled
task: !enabled, rule: !enabled task: enabled, rule: !enabled
task: enabled, rule: !enabled task: enabled, rule: !enabled
task: !enabled, rule: enabled task: enabled, rule: enabled

enable: rule updated then task update fails

input state task update fails
task: enabled, rule: enabled task: enabled, rule: enabled
task: !enabled, rule: !enabled task: !enabled, rule: enabled - XXX
task: enabled, rule: !enabled task: enabled, rule: enabled
task: !enabled, rule: enabled task: !enabled, rule: enabled - XXX

disable: task updated then rule update fails

input state rule update fails
task: enabled, rule: enabled task: !enabled, rule: enabled - XXX
task: !enabled, rule: !enabled task: !enabled, rule: !enabled
task: enabled, rule: !enabled task: !enabled, rule: !enabled
task: !enabled, rule: enabled task: !enabled, rule: enabled - XXX

disable: rule updated then task update fails

input state rule update fails
task: enabled, rule: enabled task: enabled, rule: !enabled
task: !enabled, rule: !enabled task: !enabled, rule: !enabled
task: enabled, rule: !enabled task: enabled, rule: !enabled
task: !enabled, rule: enabled task: !enabled, rule: !enabled

@pmuellr
Copy link
Member

pmuellr commented Sep 5, 2024

In one of the previous related PRs, I believe I noticed we did NOT have a test where we created a rule in a disabled state, and then did a bulk update of it. We should add a FT for this case, and anything similar/related. You can repro this today, manually, by importing a rule that has been exported - it will be disabled on import (with no backing task doc). You can then enable it via the rule list, which will use the bulk enable/disable functionality.

@pmuellr
Copy link
Member

pmuellr commented Sep 9, 2024

Assuming this PR gets merged, it may change the way we fix this problem: Add bulk update function that directly updates using the esClient #191760.

Since this allows a partial update, what I'm wondering is if we just blast the enabled field in the task docs, as appropriate, without any OCC. Previously we were doing a full update (essentially). We were thinking we would need to OCC a bunch of the task doc updates, but not really sure that we do. We have conflict resolution in the task claimer, but I don't think there's really a need for it here, for the tasks anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Fixes for quality problems that affect the customer experience Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)
Projects
None yet
Development

No branches or pull requests

3 participants