Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SecuritySolution] Migrate bulk enable/diable rules to Alerting methods #180796

Merged
merged 1 commit into from
Apr 22, 2024

Conversation

xcrzx
Copy link
Contributor

@xcrzx xcrzx commented Apr 15, 2024

Resolves: #171350
Resolves partially: #177634

Summary

This PR migrates the /api/detection_engine/rules/_bulk_action API endpoint to use the rulesClient.bulkEnableRules and rulesClient.bulkDisableRules methods under the hood. This change helps mitigate Task Manager flooding when users enable many detection rules at once. The alerting framework's bulk methods implement task staggering to ensure that multiple tasks are not scheduled for execution simultaneously. For more details, refer to the ticket.

@xcrzx xcrzx self-assigned this Apr 15, 2024
@xcrzx xcrzx force-pushed the bulk-enable-rules branch 6 times, most recently from 3ca10e9 to e5c6dca Compare April 17, 2024 12:28
@xcrzx xcrzx added release_note:fix Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Feature:Rule Management Security Solution Detection Rule Management Team:Detection Rule Management Security Detection Rule Management Team 8.14 candidate labels Apr 17, 2024
@xcrzx xcrzx marked this pull request as ready for review April 17, 2024 15:11
@xcrzx xcrzx requested review from a team as code owners April 17, 2024 15:11
@xcrzx xcrzx requested a review from maximpn April 17, 2024 15:11
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detection-rule-management (Team:Detection Rule Management)

@@ -55,19 +55,15 @@ const throwMlAuthError = (mlAuthz: MlAuthz, ruleType: RuleType) =>
* @param params - {@link BulkActionsValidationArgs}
*/
export const validateBulkEnableRule = async ({ rule, mlAuthz }: BulkActionsValidationArgs) => {
if (!rule.enabled) {
Copy link
Contributor

@jpdjere jpdjere Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reasoning behind this change? If the rule is enabled but included in the payload to be enabled anyways, throwing a validation error for it doesn't look too helpful at first look. But would like to undersand your thought process here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we now delegate to the rules client the decision of whether a rule should be enabled or not, we need to validate access to the rules regardless of their current state. This is necessary because a rule's status might change to enabled or disabled after our initial check, potentially due to a race condition.

Copy link
Contributor

@maximpn maximpn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xcrzx it's a great improvement to bulk actions 👍

I've tested enabling and disabling all installed Elastic prebuilt rules and it works like charm 🙌

The major part of the PR's diff is moving logic out bulk actions route file. A special thanks for that 👍 I left some nit comments. It looks like implementation can be improved. It's interesting to check if we can use rulesClient aggregations to optimize rules data fetching. But it's mostly out of this PR's scope.

* Updating too many rules in parallel can cause the denial of service of the
* Elasticsearch cluster.
*/
const MAX_RULES_TO_UPDATE_IN_PARALLEL = 50;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I saw constant 50 a few times but there is no explanation why it works the best than 20 or 70. Is it possible to add a comment to explain reasoning behind? Even if it's just a round number everyone will know that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've noticed the same constant used several times across Kibana. I'm not sure why it was originally chosen, but it works pretty well in the rules management area, so I don't see a need to question it too much. My intuition tells me that this default is sensible 🙂


// Go through the original rules array and update rules that were not returned
// as failed from the bulkEnableRules. We cannot rely on the results from the
// bulkEnableRules because the response is not consistent and might not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious what it means "the response isn't consistent"? Do you have examples of such inconsistent responses?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, some rules you pass to the bulk method might disappear from the response. I've expanded on this in the comment and included a link to a corresponding issue with examples: #181050.

const ruleIds = validatedRules.map(({ id }) => id);

// Perform actual update using the rulesClient
let results;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: results doesn't have to a variable. In this case it's necessary only to handle both enable and disable cases. It could be a const if defined like

const results = operation === 'enable' ? await rulesClient.bulkEnableRules({ ids: ruleIds }) : await rulesClient.bulkDisableRules({ ids: ruleIds });

or a separate function to return results.

import { findRules } from '../../../logic/search/find_rules';
import { MAX_RULES_TO_PROCESS_TOTAL } from './route';

export const fetchRulesByQueryOrIds = async ({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This function should be split into fetchRulesByIds and fetchRulesByQuery since implementation doesn't have anything in common.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to keep those methods together. They encapsulate two different methods for retrieving rules based on the input and switch logic. From a business logic standpoint, it's a single operation: rule retrieval. How the rules are retrieved is just a detail of implementation.

it('returns error if disable rule throws error', async () => {
clients.rulesClient.disable.mockImplementation(async () => {
throw new Error('Test error');
it('returns error if disable rule returns an error', async () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
it('returns error if disable rule returns an error', async () => {
it('returns an error when rulesClient.bulkDisableRules fails', async () => {

Copy link
Contributor

@jpdjere jpdjere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these changes and the more than welcome refactor! Tested and all looking good 👍 ✅

Comment on lines -66 to -231
): IKibanaResponse<BulkEditActionResponse> => {
const numSucceeded = updated.length + created.length + deleted.length;
const numSkipped = skipped.length;
const numFailed = errors.length;

const summary: BulkEditActionSummary = {
failed: numFailed,
succeeded: numSucceeded,
skipped: numSkipped,
total: numSucceeded + numFailed + numSkipped,
};

// if response is for dry_run, empty lists of rules returned, as rules are not actually updated and stored within ES
// thus, it's impossible to return reliably updated/duplicated/deleted rules
const results: BulkEditActionResults = isDryRun
? {
updated: [],
created: [],
deleted: [],
skipped: [],
}
: {
updated: updated.map((rule) => internalRuleToAPIResponse(rule)),
created: created.map((rule) => internalRuleToAPIResponse(rule)),
deleted: deleted.map((rule) => internalRuleToAPIResponse(rule)),
skipped,
};

if (numFailed > 0) {
return response.custom<BulkEditActionResponse>({
headers: { 'content-type': 'application/json' },
body: {
message: summary.succeeded > 0 ? 'Bulk edit partially failed' : 'Bulk edit failed',
status_code: 500,
attributes: {
errors: normalizeErrorResponse(errors),
results,
summary,
},
},
statusCode: 500,
});
}

const responseBody: BulkEditActionResponse = {
success: true,
rules_count: summary.total,
attributes: { results, summary },
};

return response.ok({ body: responseBody });
};

const fetchRulesByQueryOrIds = async ({
query,
ids,
rulesClient,
abortSignal,
}: {
query: string | undefined;
ids: string[] | undefined;
rulesClient: RulesClient;
abortSignal: AbortSignal;
}): Promise<PromisePoolOutcome<string, RuleAlertType>> => {
if (ids) {
return initPromisePool({
concurrency: MAX_RULES_TO_UPDATE_IN_PARALLEL,
items: ids,
executor: async (id: string) => {
const rule = await readRules({ id, rulesClient, ruleId: undefined });
if (rule == null) {
throw Error('Rule not found');
}
return rule;
},
abortSignal,
});
}

const { data, total } = await findRules({
rulesClient,
perPage: MAX_RULES_TO_PROCESS_TOTAL,
filter: query,
page: undefined,
sortField: undefined,
sortOrder: undefined,
fields: undefined,
});

if (total > MAX_RULES_TO_PROCESS_TOTAL) {
throw new BadRequestError(
`More than ${MAX_RULES_TO_PROCESS_TOTAL} rules matched the filter query. Try to narrow it down.`
);
}

return {
results: data.map((rule) => ({ item: rule.id, result: rule })),
errors: [],
};
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very welcome refactor 👍 🚀 Thanks!

Comment on lines +121 to +123
const actionsClient = ctx.actions.getActionsClient();

const { getExporter, getClient } = (await ctx.core).savedObjects;
const { getExporter, getClient } = ctx.core.savedObjects;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔

What changed here? Were we just awaiting properties that didn't need awaiting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. ctx.core is not a promise, so no need for await.


// Go through the original rules array and update rules that were not returned
// as failed from the bulkEnableRules. We cannot rely on the results from the
// bulkEnableRules because the response is not consistent and might not
Copy link
Contributor

@jpdjere jpdjere Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Not consistent" how? Is this logic not correct?

total number of rules to enable/disable = results.rules.length + results.errors.length

I.e., sometimes some of the passed ruleIds are completely dropped from the response?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, some rules you pass to the bulk method might disappear from the response. I've expanded on this in the comment and included a link to a corresponding issue with examples: #181050.

@banderror banderror added bug Fixes for quality problems that affect the customer experience impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. v8.14.0 v8.15.0 and removed 8.14 candidate labels Apr 18, 2024
Comment on lines +192 to +194
await pMap(
rulesFinderRules,
async (rule) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we inline this so it doesn't create such a big diff

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline what?

Copy link
Contributor

@JiaweiWu JiaweiWu Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

await pMap(rulesFinderRules, async (rule) => {
        try {
          if (scheduleValidationError) {
...

since it's hard to tell what actually changed in this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @JiaweiWu, the formatting is applied by Prettier - like it or not, there's nothing I can do about it. The actual change is the addition of the concurrency: MAX_RULES_TO_UPDATE_IN_PARALLEL param to the pMap config. It was previously set up incorrectly, which led to too many simultaneous requests to Elasticsearch, resulting in 503 errors when enabling many rules at once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

Copy link
Contributor

@JiaweiWu JiaweiWu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @xcrzx

@xcrzx xcrzx merged commit 2169c0f into elastic:main Apr 22, 2024
37 checks passed
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Apr 22, 2024
…ds (elastic#180796)

**Resolves: elastic#171350
**Resolves partially: elastic#177634

## Summary

This PR migrates the `/api/detection_engine/rules/_bulk_action` API
endpoint to use the `rulesClient.bulkEnableRules` and
`rulesClient.bulkDisableRules` methods under the hood. This change helps
mitigate Task Manager flooding when users enable many detection rules at
once. The alerting framework's bulk methods implement task staggering to
ensure that multiple tasks are not scheduled for execution
simultaneously. For more details, refer to [the
ticket](elastic#171350).

(cherry picked from commit 2169c0f)
@kibanamachine
Copy link
Contributor

💚 All backports created successfully

Status Branch Result
8.14

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Apr 22, 2024
…g methods (#180796) (#181312)

# Backport

This will backport the following commits from `main` to `8.14`:
- [[SecuritySolution] Migrate bulk enable/diable rules to Alerting
methods (#180796)](#180796)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Dmitrii
Shevchenko","email":"dmitrii.shevchenko@elastic.co"},"sourceCommit":{"committedDate":"2024-04-22T13:20:46Z","message":"[SecuritySolution]
Migrate bulk enable/diable rules to Alerting methods
(#180796)\n\n**Resolves:
https://github.com/elastic/kibana/issues/171350**\r\n**Resolves
partially: https://github.com/elastic/kibana/issues/177634**\r\n\r\n##
Summary\r\n\r\nThis PR migrates the
`/api/detection_engine/rules/_bulk_action` API\r\nendpoint to use the
`rulesClient.bulkEnableRules` and\r\n`rulesClient.bulkDisableRules`
methods under the hood. This change helps\r\nmitigate Task Manager
flooding when users enable many detection rules at\r\nonce. The alerting
framework's bulk methods implement task staggering to\r\nensure that
multiple tasks are not scheduled for execution\r\nsimultaneously. For
more details, refer to
[the\r\nticket](https://github.com/elastic/kibana/issues/171350).","sha":"2169c0f9fc6cd0738e94154ec2a0f75ac9c6106d","branchLabelMapping":{"^v8.15.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["bug","release_note:fix","impact:high","Team:Detections
and Resp","Team: SecuritySolution","Feature:Rule
Management","Team:Detection Rule
Management","v8.14.0","v8.15.0"],"title":"[SecuritySolution] Migrate
bulk enable/diable rules to Alerting
methods","number":180796,"url":"https://github.com/elastic/kibana/pull/180796","mergeCommit":{"message":"[SecuritySolution]
Migrate bulk enable/diable rules to Alerting methods
(#180796)\n\n**Resolves:
https://github.com/elastic/kibana/issues/171350**\r\n**Resolves
partially: https://github.com/elastic/kibana/issues/177634**\r\n\r\n##
Summary\r\n\r\nThis PR migrates the
`/api/detection_engine/rules/_bulk_action` API\r\nendpoint to use the
`rulesClient.bulkEnableRules` and\r\n`rulesClient.bulkDisableRules`
methods under the hood. This change helps\r\nmitigate Task Manager
flooding when users enable many detection rules at\r\nonce. The alerting
framework's bulk methods implement task staggering to\r\nensure that
multiple tasks are not scheduled for execution\r\nsimultaneously. For
more details, refer to
[the\r\nticket](https://github.com/elastic/kibana/issues/171350).","sha":"2169c0f9fc6cd0738e94154ec2a0f75ac9c6106d"}},"sourceBranch":"main","suggestedTargetBranches":["8.14"],"targetPullRequestStates":[{"branch":"8.14","label":"v8.14.0","branchLabelMappingKey":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"main","label":"v8.15.0","branchLabelMappingKey":"^v8.15.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/180796","number":180796,"mergeCommit":{"message":"[SecuritySolution]
Migrate bulk enable/diable rules to Alerting methods
(#180796)\n\n**Resolves:
https://github.com/elastic/kibana/issues/171350**\r\n**Resolves
partially: https://github.com/elastic/kibana/issues/177634**\r\n\r\n##
Summary\r\n\r\nThis PR migrates the
`/api/detection_engine/rules/_bulk_action` API\r\nendpoint to use the
`rulesClient.bulkEnableRules` and\r\n`rulesClient.bulkDisableRules`
methods under the hood. This change helps\r\nmitigate Task Manager
flooding when users enable many detection rules at\r\nonce. The alerting
framework's bulk methods implement task staggering to\r\nensure that
multiple tasks are not scheduled for execution\r\nsimultaneously. For
more details, refer to
[the\r\nticket](https://github.com/elastic/kibana/issues/171350).","sha":"2169c0f9fc6cd0738e94154ec2a0f75ac9c6106d"}}]}]
BACKPORT-->

Co-authored-by: Dmitrii Shevchenko <dmitrii.shevchenko@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Fixes for quality problems that affect the customer experience Feature:Rule Management Security Solution Detection Rule Management impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. release_note:fix Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v8.14.0 v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Security Solution] Bulk enabling many rules can result in flooding Task Manager
8 participants