Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Response Ops][Alerting] Investigate refactoring ExecutionHandler class that schedules actions during rule execution #182181

Closed
ymao1 opened this issue Apr 30, 2024 · 3 comments
Assignees
Labels
Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) technical debt Improvement of the software architecture and operational architecture

Comments

@ymao1
Copy link
Contributor

ymao1 commented Apr 30, 2024

The alerting execution handler makes the determination of which if any actions to trigger for a rule run. In recent months, we've added summary actions and system actions to the class so the logic for when an action is triggered and what type of action is confusing and hard to follow. We should investigate how we can refactor this code to make clearer what is going on.

@ymao1 ymao1 added Feature:Alerting technical debt Improvement of the software architecture and operational architecture Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Apr 30, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@ymao1 ymao1 changed the title [Response Ops][Alerting] Investigate refactoring alerting execution handler [Response Ops][Alerting] Investigate refactoring ExecutionHandler class that schedules actions during rule execution May 2, 2024
@ymao1 ymao1 self-assigned this Jun 12, 2024
@ymao1
Copy link
Contributor Author

ymao1 commented Jun 18, 2024

The ExecutionHandler class is responsible for scheduling actions during rule execution, after alerts have been categorized (as new, ongoing or recovered) and alerts as data documents have been indexed. It contains a main run function that is called within the alerting task runner which takes as input all of the alerts and the rule's configured actions.

Within the run function, there are two primary for-loops.

First for-loop (within the generateExecutables function)

  • Iterates over each action
    • determines whether the action requires getting summarized alerts and performs the query if it does
    • checks whether action is summary action and returns summarized alerts if yes
    • iterates over each alert to determine whether to "generate an executable"
  • Iterates over each system action
    • gets summarized alerts for system action

Second for-loop (within the run function, after generateExecutables is called)

  • Iterates over the "executables" (actions that need to be scheduled)
    • checks that action limit hasn't been reached
    • increments various metrics for tracking what's been scheduled or skipped
    • debug logs
    • calls helper functions based on whether action is summary action, system action or per-alert action to format the action and add to the array of actions to bulk schedule

Finally, the execution handler bulk enqueues the formatted actions.

Because we have 3 different types of actions being handled in the same class, it can be difficult to follow the logic when trying to determine, for example, what happens when a system action is scheduled. You would need to dig into the first for-loop, find the code where it's testing for system action and see what happens, then jump back to the second for-loop, find the code where it's testing for system action and see what happens.

Here is how we could refactor this

  • rename ExecutionHandler to ActionHandler or ActionScheduler for clarity
  • create classes to encapsulate the logic for the different action types; these classes can implement the same IActionScheduler interface
    • SystemActionScheduler
    • SummaryActionScheduler
    • PerAlertActionScheduler
  • each class can register with the ActionScheduler
  • when the ActionScheduler.run() function is called, it can iterate through the registered schedulers and call a run function which will
    • get summarized alerts if needed
    • handle logging and metrics for the action type
    • format the actions to schedule and return the action to enqueue
  • after calling each registered scheduler, the ActionScheduler can bulk enqueue all the returned actions

We should make sure enforce some sort of ordering between the ActionSchedulers since we're applying a limit to the number of actions we're running. For example, the per alert actions could increase in number depending on the number of alerts where the summary actions would not so we should be scheduling the summary and system actions first before enqueuing the per-alert actions. This is happening in the current for-loops so this logic should be preserved.

Another optimization that might be good to make is to consolidate all of the summary queries required for all the actions to ensure we're not performing duplicate queries. Currently, each action is considered separately from any others and a summary query is performed for each if required. If multiple actions use the same alert filter or don't use any alert filter, we could cache the query results for reuse instead of running the same query repeatedly.

ActionScheduler

  • constructor(ruleActions)
    • new SystemActionScheduler(ruleActions)
      • filters out and saves just system actions in private variable
    • new SummaryActionScheduler(ruleActions)
      • filters out and saves just summary actions in private variable
    • new PerAlertActionScheduler(ruleActions)
      • filters out and saves just per-alert actions in private variable
  • run
    • for each registered scheduler, call scheduler.getSummaryQueries which returns all the summary queries that need to run
    • run all the summary queries, caching the results and skipping duplicates
    • for each registered scheduler, call scheduler.run (or maybe scheduler.getActionsToSchedule for clarity?) which returns an array of actions to enqueue
    • bulk enqueue all of the returned actions.

This refactor could be done in a few stages:

Stage 1

  • Rename ExecutionHandler to ActionScheduler.
  • Create the different scheduler classes.
  • In this first stage, we'll split the for-loop currently inside ExecutionHandler.generateExecutables into the appropriate class (SystemActionScheduler.generateExecutables) and combine the returned executables from each scheduler class.
  • In this stage, we'll keep the second for-loop that loops over the executables inside the ExecutionHandler/ActionScheduler class.

Stage 2

  • Split the second for-loop functionality into the appropriate class. We can rename the generateExecutables functions into run or something like getActionsToSchedule which returns the formatted actions to bulk queue.

Stage 3 (optional)

  • Investigate optimizing getting summary queries so that we're not performing duplicate queries and instead caching the query results so they can be reused. If we're getting all the queries up front, we could also look into doing an msearch to perform all the queries at once.

@ymao1
Copy link
Contributor Author

ymao1 commented Jun 20, 2024

Closing the research issue as the followup issues are created.

@ymao1 ymao1 closed this as completed Jun 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) technical debt Improvement of the software architecture and operational architecture
Projects
No open projects
Development

No branches or pull requests

2 participants