-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SecuritySolution] Migrate bulk enable/diable rules to Alerting methods #180796
Conversation
3ca10e9
to
e5c6dca
Compare
Pinging @elastic/security-detections-response (Team:Detections and Resp) |
Pinging @elastic/security-solution (Team: SecuritySolution) |
Pinging @elastic/security-detection-rule-management (Team:Detection Rule Management) |
@@ -55,19 +55,15 @@ const throwMlAuthError = (mlAuthz: MlAuthz, ruleType: RuleType) => | |||
* @param params - {@link BulkActionsValidationArgs} | |||
*/ | |||
export const validateBulkEnableRule = async ({ rule, mlAuthz }: BulkActionsValidationArgs) => { | |||
if (!rule.enabled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reasoning behind this change? If the rule is enabled but included in the payload to be enabled anyways, throwing a validation error for it doesn't look too helpful at first look. But would like to undersand your thought process here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we now delegate to the rules client the decision of whether a rule should be enabled or not, we need to validate access to the rules regardless of their current state. This is necessary because a rule's status might change to enabled or disabled after our initial check, potentially due to a race condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xcrzx it's a great improvement to bulk actions 👍
I've tested enabling and disabling all installed Elastic prebuilt rules and it works like charm 🙌
The major part of the PR's diff is moving logic out bulk actions route file. A special thanks for that 👍 I left some nit comments. It looks like implementation can be improved. It's interesting to check if we can use rulesClient aggregations to optimize rules data fetching. But it's mostly out of this PR's scope.
* Updating too many rules in parallel can cause the denial of service of the | ||
* Elasticsearch cluster. | ||
*/ | ||
const MAX_RULES_TO_UPDATE_IN_PARALLEL = 50; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I saw constant 50 a few times but there is no explanation why it works the best than 20 or 70. Is it possible to add a comment to explain reasoning behind? Even if it's just a round number everyone will know that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've noticed the same constant used several times across Kibana. I'm not sure why it was originally chosen, but it works pretty well in the rules management area, so I don't see a need to question it too much. My intuition tells me that this default is sensible 🙂
|
||
// Go through the original rules array and update rules that were not returned | ||
// as failed from the bulkEnableRules. We cannot rely on the results from the | ||
// bulkEnableRules because the response is not consistent and might not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious what it means "the response isn't consistent"? Do you have examples of such inconsistent responses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, some rules you pass to the bulk method might disappear from the response. I've expanded on this in the comment and included a link to a corresponding issue with examples: #181050.
const ruleIds = validatedRules.map(({ id }) => id); | ||
|
||
// Perform actual update using the rulesClient | ||
let results; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: results
doesn't have to a variable. In this case it's necessary only to handle both enable
and disable
cases. It could be a const if defined like
const results = operation === 'enable' ? await rulesClient.bulkEnableRules({ ids: ruleIds }) : await rulesClient.bulkDisableRules({ ids: ruleIds });
or a separate function to return results.
import { findRules } from '../../../logic/search/find_rules'; | ||
import { MAX_RULES_TO_PROCESS_TOTAL } from './route'; | ||
|
||
export const fetchRulesByQueryOrIds = async ({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This function should be split into fetchRulesByIds
and fetchRulesByQuery
since implementation doesn't have anything in common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to keep those methods together. They encapsulate two different methods for retrieving rules based on the input and switch logic. From a business logic standpoint, it's a single operation: rule retrieval. How the rules are retrieved is just a detail of implementation.
it('returns error if disable rule throws error', async () => { | ||
clients.rulesClient.disable.mockImplementation(async () => { | ||
throw new Error('Test error'); | ||
it('returns error if disable rule returns an error', async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it('returns error if disable rule returns an error', async () => { | |
it('returns an error when rulesClient.bulkDisableRules fails', async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for these changes and the more than welcome refactor! Tested and all looking good 👍 ✅
): IKibanaResponse<BulkEditActionResponse> => { | ||
const numSucceeded = updated.length + created.length + deleted.length; | ||
const numSkipped = skipped.length; | ||
const numFailed = errors.length; | ||
|
||
const summary: BulkEditActionSummary = { | ||
failed: numFailed, | ||
succeeded: numSucceeded, | ||
skipped: numSkipped, | ||
total: numSucceeded + numFailed + numSkipped, | ||
}; | ||
|
||
// if response is for dry_run, empty lists of rules returned, as rules are not actually updated and stored within ES | ||
// thus, it's impossible to return reliably updated/duplicated/deleted rules | ||
const results: BulkEditActionResults = isDryRun | ||
? { | ||
updated: [], | ||
created: [], | ||
deleted: [], | ||
skipped: [], | ||
} | ||
: { | ||
updated: updated.map((rule) => internalRuleToAPIResponse(rule)), | ||
created: created.map((rule) => internalRuleToAPIResponse(rule)), | ||
deleted: deleted.map((rule) => internalRuleToAPIResponse(rule)), | ||
skipped, | ||
}; | ||
|
||
if (numFailed > 0) { | ||
return response.custom<BulkEditActionResponse>({ | ||
headers: { 'content-type': 'application/json' }, | ||
body: { | ||
message: summary.succeeded > 0 ? 'Bulk edit partially failed' : 'Bulk edit failed', | ||
status_code: 500, | ||
attributes: { | ||
errors: normalizeErrorResponse(errors), | ||
results, | ||
summary, | ||
}, | ||
}, | ||
statusCode: 500, | ||
}); | ||
} | ||
|
||
const responseBody: BulkEditActionResponse = { | ||
success: true, | ||
rules_count: summary.total, | ||
attributes: { results, summary }, | ||
}; | ||
|
||
return response.ok({ body: responseBody }); | ||
}; | ||
|
||
const fetchRulesByQueryOrIds = async ({ | ||
query, | ||
ids, | ||
rulesClient, | ||
abortSignal, | ||
}: { | ||
query: string | undefined; | ||
ids: string[] | undefined; | ||
rulesClient: RulesClient; | ||
abortSignal: AbortSignal; | ||
}): Promise<PromisePoolOutcome<string, RuleAlertType>> => { | ||
if (ids) { | ||
return initPromisePool({ | ||
concurrency: MAX_RULES_TO_UPDATE_IN_PARALLEL, | ||
items: ids, | ||
executor: async (id: string) => { | ||
const rule = await readRules({ id, rulesClient, ruleId: undefined }); | ||
if (rule == null) { | ||
throw Error('Rule not found'); | ||
} | ||
return rule; | ||
}, | ||
abortSignal, | ||
}); | ||
} | ||
|
||
const { data, total } = await findRules({ | ||
rulesClient, | ||
perPage: MAX_RULES_TO_PROCESS_TOTAL, | ||
filter: query, | ||
page: undefined, | ||
sortField: undefined, | ||
sortOrder: undefined, | ||
fields: undefined, | ||
}); | ||
|
||
if (total > MAX_RULES_TO_PROCESS_TOTAL) { | ||
throw new BadRequestError( | ||
`More than ${MAX_RULES_TO_PROCESS_TOTAL} rules matched the filter query. Try to narrow it down.` | ||
); | ||
} | ||
|
||
return { | ||
results: data.map((rule) => ({ item: rule.id, result: rule })), | ||
errors: [], | ||
}; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very welcome refactor 👍 🚀 Thanks!
const actionsClient = ctx.actions.getActionsClient(); | ||
|
||
const { getExporter, getClient } = (await ctx.core).savedObjects; | ||
const { getExporter, getClient } = ctx.core.savedObjects; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔
What changed here? Were we just awaiting properties that didn't need awaiting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. ctx.core
is not a promise, so no need for await
.
|
||
// Go through the original rules array and update rules that were not returned | ||
// as failed from the bulkEnableRules. We cannot rely on the results from the | ||
// bulkEnableRules because the response is not consistent and might not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Not consistent" how? Is this logic not correct?
total number of rules to enable/disable = results.rules.length + results.errors.length
I.e., sometimes some of the passed ruleIds are completely dropped from the response?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, some rules you pass to the bulk method might disappear from the response. I've expanded on this in the comment and included a link to a corresponding issue with examples: #181050.
e5c6dca
to
20247e4
Compare
await pMap( | ||
rulesFinderRules, | ||
async (rule) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we inline this so it doesn't create such a big diff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inline what?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
await pMap(rulesFinderRules, async (rule) => {
try {
if (scheduleValidationError) {
...
since it's hard to tell what actually changed in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @JiaweiWu, the formatting is applied by Prettier - like it or not, there's nothing I can do about it. The actual change is the addition of the concurrency: MAX_RULES_TO_UPDATE_IN_PARALLEL
param to the pMap
config. It was previously set up incorrectly, which led to too many simultaneous requests to Elasticsearch, resulting in 503 errors when enabling many rules at once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
20247e4
to
3c57bd7
Compare
3c57bd7
to
acdeb8d
Compare
💚 Build Succeeded
Metrics [docs]
History
To update your PR or re-run it, just comment with: cc @xcrzx |
…ds (elastic#180796) **Resolves: elastic#171350 **Resolves partially: elastic#177634 ## Summary This PR migrates the `/api/detection_engine/rules/_bulk_action` API endpoint to use the `rulesClient.bulkEnableRules` and `rulesClient.bulkDisableRules` methods under the hood. This change helps mitigate Task Manager flooding when users enable many detection rules at once. The alerting framework's bulk methods implement task staggering to ensure that multiple tasks are not scheduled for execution simultaneously. For more details, refer to [the ticket](elastic#171350). (cherry picked from commit 2169c0f)
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions ?Please refer to the Backport tool documentation |
…g methods (#180796) (#181312) # Backport This will backport the following commits from `main` to `8.14`: - [[SecuritySolution] Migrate bulk enable/diable rules to Alerting methods (#180796)](#180796) <!--- Backport version: 9.4.3 --> ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport) <!--BACKPORT [{"author":{"name":"Dmitrii Shevchenko","email":"dmitrii.shevchenko@elastic.co"},"sourceCommit":{"committedDate":"2024-04-22T13:20:46Z","message":"[SecuritySolution] Migrate bulk enable/diable rules to Alerting methods (#180796)\n\n**Resolves: https://github.com/elastic/kibana/issues/171350**\r\n**Resolves partially: https://github.com/elastic/kibana/issues/177634**\r\n\r\n## Summary\r\n\r\nThis PR migrates the `/api/detection_engine/rules/_bulk_action` API\r\nendpoint to use the `rulesClient.bulkEnableRules` and\r\n`rulesClient.bulkDisableRules` methods under the hood. This change helps\r\nmitigate Task Manager flooding when users enable many detection rules at\r\nonce. The alerting framework's bulk methods implement task staggering to\r\nensure that multiple tasks are not scheduled for execution\r\nsimultaneously. For more details, refer to [the\r\nticket](https://github.com/elastic/kibana/issues/171350).","sha":"2169c0f9fc6cd0738e94154ec2a0f75ac9c6106d","branchLabelMapping":{"^v8.15.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["bug","release_note:fix","impact:high","Team:Detections and Resp","Team: SecuritySolution","Feature:Rule Management","Team:Detection Rule Management","v8.14.0","v8.15.0"],"title":"[SecuritySolution] Migrate bulk enable/diable rules to Alerting methods","number":180796,"url":"https://github.com/elastic/kibana/pull/180796","mergeCommit":{"message":"[SecuritySolution] Migrate bulk enable/diable rules to Alerting methods (#180796)\n\n**Resolves: https://github.com/elastic/kibana/issues/171350**\r\n**Resolves partially: https://github.com/elastic/kibana/issues/177634**\r\n\r\n## Summary\r\n\r\nThis PR migrates the `/api/detection_engine/rules/_bulk_action` API\r\nendpoint to use the `rulesClient.bulkEnableRules` and\r\n`rulesClient.bulkDisableRules` methods under the hood. This change helps\r\nmitigate Task Manager flooding when users enable many detection rules at\r\nonce. The alerting framework's bulk methods implement task staggering to\r\nensure that multiple tasks are not scheduled for execution\r\nsimultaneously. For more details, refer to [the\r\nticket](https://github.com/elastic/kibana/issues/171350).","sha":"2169c0f9fc6cd0738e94154ec2a0f75ac9c6106d"}},"sourceBranch":"main","suggestedTargetBranches":["8.14"],"targetPullRequestStates":[{"branch":"8.14","label":"v8.14.0","branchLabelMappingKey":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"main","label":"v8.15.0","branchLabelMappingKey":"^v8.15.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/180796","number":180796,"mergeCommit":{"message":"[SecuritySolution] Migrate bulk enable/diable rules to Alerting methods (#180796)\n\n**Resolves: https://github.com/elastic/kibana/issues/171350**\r\n**Resolves partially: https://github.com/elastic/kibana/issues/177634**\r\n\r\n## Summary\r\n\r\nThis PR migrates the `/api/detection_engine/rules/_bulk_action` API\r\nendpoint to use the `rulesClient.bulkEnableRules` and\r\n`rulesClient.bulkDisableRules` methods under the hood. This change helps\r\nmitigate Task Manager flooding when users enable many detection rules at\r\nonce. The alerting framework's bulk methods implement task staggering to\r\nensure that multiple tasks are not scheduled for execution\r\nsimultaneously. For more details, refer to [the\r\nticket](https://github.com/elastic/kibana/issues/171350).","sha":"2169c0f9fc6cd0738e94154ec2a0f75ac9c6106d"}}]}] BACKPORT--> Co-authored-by: Dmitrii Shevchenko <dmitrii.shevchenko@elastic.co>
Resolves: #171350
Resolves partially: #177634
Summary
This PR migrates the
/api/detection_engine/rules/_bulk_action
API endpoint to use therulesClient.bulkEnableRules
andrulesClient.bulkDisableRules
methods under the hood. This change helps mitigate Task Manager flooding when users enable many detection rules at once. The alerting framework's bulk methods implement task staggering to ensure that multiple tasks are not scheduled for execution simultaneously. For more details, refer to the ticket.