[Response Ops][Alerting] Investigate refactoring `ExecutionHandler` class that schedules actions during rule execution #182181

ymao1 · 2024-04-30T17:26:14Z

The alerting execution handler makes the determination of which if any actions to trigger for a rule run. In recent months, we've added summary actions and system actions to the class so the logic for when an action is triggered and what type of action is confusing and hard to follow. We should investigate how we can refactor this code to make clearer what is going on.

elasticmachine · 2024-04-30T17:26:16Z

Pinging @elastic/response-ops (Team:ResponseOps)

ymao1 · 2024-06-18T12:49:54Z

The ExecutionHandler class is responsible for scheduling actions during rule execution, after alerts have been categorized (as new, ongoing or recovered) and alerts as data documents have been indexed. It contains a main run function that is called within the alerting task runner which takes as input all of the alerts and the rule's configured actions.

Within the run function, there are two primary for-loops.

First for-loop (within the generateExecutables function)

Iterates over each action
- determines whether the action requires getting summarized alerts and performs the query if it does
- checks whether action is summary action and returns summarized alerts if yes
- iterates over each alert to determine whether to "generate an executable"
Iterates over each system action
- gets summarized alerts for system action

Second for-loop (within the run function, after generateExecutables is called)

Iterates over the "executables" (actions that need to be scheduled)
- checks that action limit hasn't been reached
- increments various metrics for tracking what's been scheduled or skipped
- debug logs
- calls helper functions based on whether action is summary action, system action or per-alert action to format the action and add to the array of actions to bulk schedule

Finally, the execution handler bulk enqueues the formatted actions.

Because we have 3 different types of actions being handled in the same class, it can be difficult to follow the logic when trying to determine, for example, what happens when a system action is scheduled. You would need to dig into the first for-loop, find the code where it's testing for system action and see what happens, then jump back to the second for-loop, find the code where it's testing for system action and see what happens.

Here is how we could refactor this

rename ExecutionHandler to ActionHandler or ActionScheduler for clarity
create classes to encapsulate the logic for the different action types; these classes can implement the same IActionScheduler interface
- SystemActionScheduler
- SummaryActionScheduler
- PerAlertActionScheduler
each class can register with the ActionScheduler
when the ActionScheduler.run() function is called, it can iterate through the registered schedulers and call a run function which will
- get summarized alerts if needed
- handle logging and metrics for the action type
- format the actions to schedule and return the action to enqueue
after calling each registered scheduler, the ActionScheduler can bulk enqueue all the returned actions

We should make sure enforce some sort of ordering between the ActionSchedulers since we're applying a limit to the number of actions we're running. For example, the per alert actions could increase in number depending on the number of alerts where the summary actions would not so we should be scheduling the summary and system actions first before enqueuing the per-alert actions. This is happening in the current for-loops so this logic should be preserved.

Another optimization that might be good to make is to consolidate all of the summary queries required for all the actions to ensure we're not performing duplicate queries. Currently, each action is considered separately from any others and a summary query is performed for each if required. If multiple actions use the same alert filter or don't use any alert filter, we could cache the query results for reuse instead of running the same query repeatedly.

ActionScheduler

constructor(ruleActions)
- new SystemActionScheduler(ruleActions)
  - filters out and saves just system actions in private variable
- new SummaryActionScheduler(ruleActions)
  - filters out and saves just summary actions in private variable
- new PerAlertActionScheduler(ruleActions)
  - filters out and saves just per-alert actions in private variable
run
- for each registered scheduler, call scheduler.getSummaryQueries which returns all the summary queries that need to run
- run all the summary queries, caching the results and skipping duplicates
- for each registered scheduler, call scheduler.run (or maybe scheduler.getActionsToSchedule for clarity?) which returns an array of actions to enqueue
- bulk enqueue all of the returned actions.

This refactor could be done in a few stages:

Stage 1

Rename ExecutionHandler to ActionScheduler.
Create the different scheduler classes.
In this first stage, we'll split the for-loop currently inside ExecutionHandler.generateExecutables into the appropriate class (SystemActionScheduler.generateExecutables) and combine the returned executables from each scheduler class.
In this stage, we'll keep the second for-loop that loops over the executables inside the ExecutionHandler/ActionScheduler class.

Stage 2

Split the second for-loop functionality into the appropriate class. We can rename the generateExecutables functions into run or something like getActionsToSchedule which returns the formatted actions to bulk queue.

Stage 3 (optional)

Investigate optimizing getting summary queries so that we're not performing duplicate queries and instead caching the query results so they can be reused. If we're getting all the queries up front, we could also look into doing an msearch to perform all the queries at once.

ymao1 · 2024-06-20T14:24:52Z

Closing the research issue as the followup issues are created.

ymao1 added Feature:Alerting technical debt Improvement of the software architecture and operational architecture Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Apr 30, 2024

ymao1 changed the title ~~[Response Ops][Alerting] Investigate refactoring alerting execution handler~~ [Response Ops][Alerting] Investigate refactoring ExecutionHandler class that schedules actions during rule execution May 2, 2024

ymao1 self-assigned this Jun 12, 2024

This was referenced Jun 20, 2024

[Response Ops][Alerting] Refactor ExecutionHandler Pt 1 #186533

Closed

[Response Ops][Alerting] Refactor ExecutionHandler Pt 2 #186534

Closed

[Response Ops][Alerting] Optimize querying for summary alerts in ExecutionHandler #186535

Open

ymao1 closed this as completed Jun 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Response Ops][Alerting] Investigate refactoring `ExecutionHandler` class that schedules actions during rule execution #182181

[Response Ops][Alerting] Investigate refactoring `ExecutionHandler` class that schedules actions during rule execution #182181

ymao1 commented Apr 30, 2024

elasticmachine commented Apr 30, 2024

ymao1 commented Jun 18, 2024 •

edited

Loading

ymao1 commented Jun 20, 2024

[Response Ops][Alerting] Investigate refactoring ExecutionHandler class that schedules actions during rule execution #182181

[Response Ops][Alerting] Investigate refactoring ExecutionHandler class that schedules actions during rule execution #182181

Comments

ymao1 commented Apr 30, 2024

elasticmachine commented Apr 30, 2024

ymao1 commented Jun 18, 2024 • edited Loading

ymao1 commented Jun 20, 2024

[Response Ops][Alerting] Investigate refactoring `ExecutionHandler` class that schedules actions during rule execution #182181

[Response Ops][Alerting] Investigate refactoring `ExecutionHandler` class that schedules actions during rule execution #182181

ymao1 commented Jun 18, 2024 •

edited

Loading