
[Security Solutions] Updates usage collector telemetry to use PIT (Point in Time) and restructuring of folders #124912

Merged

FrankHassanabad merged 19 commits into elastic:main from telemtry-use-pit on Feb 15, 2022

Conversation

@FrankHassanabad (Contributor) commented Feb 8, 2022

Summary

Changes the usage collector telemetry within Security Solutions to use PIT (Point in Time), along with a few other bug fixes and some restructuring.

  • The main goal is to replace the full queries of up to 10k items with 1k batches at a time using PIT (Point in Time); see the sketch after this list. See this ticket for more information and here for an example where they changed their code to use 1k batches. I use PIT with the saved objects API, with searches, and then with composite aggregations, all of which support PIT. The PIT timeouts are all set to 5 minutes, and the 10k maximums remain in place so memory usage does not increase. However, we could now raise the 10k limit if we wanted the usage collector to count beyond 10k; the initial 10k was an Elasticsearch limitation that PIT avoids.
  • This also fixes a bug where the aggregations were only returning the top 10 items instead of the full 10k. That is changed to use composite aggregations.
  • This restructures the folder layout to be as reductive and uniform as we can make it. I could not apply the same reduction to the schema, as the tooling does not allow it, but the rest repeats the same structure in the way developers hopefully expect, which should also make it easier to add new telemetry usage collector counters in the same fashion.
  • This exchanges the hand-spun TypeScript types in favor of the caseComments, sanitized alerts, and ML job types, using Partial and other TypeScript utilities.
  • This removes the Cyclomatic Complexity warnings coming from the linters by breaking down the functions into smaller units.
  • This removes the "as casts" in all but 1 area which can lead to subtle TypeScript problems.
  • This pushes the logger down and uses it to report errors and some debug information.
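
A minimal sketch of the saved objects side of this pattern, assuming a savedObjectsClient and a hypothetical RuleAttributes shape (the names here are illustrative, not the PR's actual code):

```ts
import type { SavedObjectsClientContract, SavedObjectsFindResult } from 'kibana/server';

// Hypothetical attributes shape, for illustration only.
interface RuleAttributes {
  name: string;
  enabled: boolean;
}

async function fetchAllRules(
  savedObjectsClient: SavedObjectsClientContract
): Promise<Array<SavedObjectsFindResult<RuleAttributes>>> {
  // The finder opens a PIT behind the scenes and pages 1k saved objects at a
  // time, instead of one find() call with size: 10_000.
  const finder = savedObjectsClient.createPointInTimeFinder<RuleAttributes>({
    type: 'alert',
    perPage: 1000,
  });
  const results: Array<SavedObjectsFindResult<RuleAttributes>> = [];
  for await (const response of finder.find()) {
    results.push(...response.saved_objects);
  }
  // Release the PIT explicitly rather than waiting out its keep-alive.
  await finder.close();
  return results;
}
```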

Checklist

Delete any items that are not applicable to this PR.

@FrankHassanabad FrankHassanabad changed the title from "Telemtry use pit" to "[Security Solutions] Changes telemetry to use PIT (Point in Time) and other refactors" Feb 8, 2022
@FrankHassanabad FrankHassanabad self-assigned this Feb 8, 2022
@FrankHassanabad FrankHassanabad added the v8.0.0, v8.1.0, release_note:skip, Team:Security Solution Platform, and v8.2.0 labels and removed the v8.0.0 and v8.1.0 labels Feb 8, 2022
@FrankHassanabad FrankHassanabad changed the title from "[Security Solutions] Changes telemetry to use PIT (Point in Time) and other refactors" to "[Security Solutions] Updates usage collector telemetry to use PIT (Point in Time) and has large refactors" Feb 8, 2022
@FrankHassanabad FrankHassanabad marked this pull request as ready for review February 8, 2022 15:39
@FrankHassanabad FrankHassanabad requested review from a team as code owners February 8, 2022 15:39
@FrankHassanabad FrankHassanabad changed the title from "[Security Solutions] Updates usage collector telemetry to use PIT (Point in Time) and has large refactors" to "[Security Solutions] Updates usage collector telemetry to use PIT (Point in Time) and restructuring of folders" Feb 8, 2022
@pjhampton (Contributor) left a comment

RE: #93770

I'm not sure about this PR @FrankHassanabad. Instead of making one upfront call, you are making significantly more network calls to Elasticsearch, potentially slowing down the entire usage collector for Kibana and possibly timing it out on our end, resulting in no telemetry. As telemetry processing is time sensitive, I'm not sure the size: 10_000 advice really applies here; it seems aimed at sporadic, ad-hoc requests to ES from Kibana APIs.

Also, are you sure the PIT API applies to aggregations? The documentation only discusses queries: https://www.elastic.co/guide/en/elasticsearch/reference/current/point-in-time-api.html

AFAIK, snapshot telemetry is collected via the /api/telemetry/v2/clusters/_stats endpoint, which times out after one minute. Really, our usage collector needs to finish processing within one minute for us to get anything. You are keeping the connection live in some cases up to 5m.

I will continue to test it, but I think we should briefly meet to discuss these changes. We need to do some profiling on a production-like cluster. I think the folder reorg is fine.

@pjhampton (Contributor) commented

@afharo is a very strong ally of the security business from a stack perspective. Alejandro, can you run your eyes over this PR, please?

@FrankHassanabad (Contributor, Author) commented Feb 9, 2022

Edit:

Looking at other telemetry code, I see the pattern of using 1k batches with PITs, like so, to bring the code in line with the Kibana core recommendations and guidelines:
https://github.com/elastic/kibana/blob/main/src/plugins/visualizations/server/usage_collector/get_usage_collector.ts#L39-L43

From this PR:
#99031

So I think I'm on the right track and have adjusted the default from 100 to 1k in the constants and updated the description. Let me know if there's something else for me to do with this.

The PIT recommendations are from here, and this is the ticket we have to solve based on it:

Returns a ISavedObjectsPointInTimeFinder to help page through large sets of saved objects. We strongly recommend using this API for any find queries that might return more than 1000 saved objects, however this API is only intended for use in server-side "batch" processing of objects where you are collecting all objects in memory or streaming them back to the client.

So, as far as I know, I am following their guidance whether I'm using the SO PIT or the ES PIT.
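
For the ES side, a minimal sketch of the same 1k paging with an explicit PIT and search_after, assuming the v8 Elasticsearch client (the index name and sort field are illustrative):

```ts
import type { ElasticsearchClient } from 'kibana/server';

async function fetchAllDocs(esClient: ElasticsearchClient) {
  // Open a PIT so every page sees a consistent view of the index.
  const pit = await esClient.openPointInTime({ index: 'my-index-*', keep_alive: '5m' });
  const docs: unknown[] = [];
  let searchAfter: any; // typed loosely for brevity
  do {
    const response = await esClient.search({
      size: 1000,
      pit: { id: pit.id, keep_alive: '5m' },
      sort: [{ '@timestamp': 'asc' }],
      ...(searchAfter ? { search_after: searchAfter } : {}),
    });
    const hits = response.hits.hits;
    docs.push(...hits);
    // Resume from the sort values of the last hit; stop on an empty page.
    searchAfter = hits.length > 0 ? hits[hits.length - 1].sort : undefined;
  } while (searchAfter);
  // Destroy the PIT as soon as we are done paging.
  await esClient.closePointInTime({ id: pit.id });
  return docs;
}
```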

As a medium-term goal on that ticket, Kibana core is saying they will start throwing in development mode:

Introduce a soft-limit that throws if Kibana searches with a size > 1000 in development mode

which is why I added PIT to the queries.


Also, are you sure the PIT API applies to aggregations? The documentation only discusses queries: https://www.elastic.co/guide/en/elasticsearch/reference/current/point-in-time-api.html

Although the docs don't call it out explicitly for composite aggregations, the TypeScript types accept it, the ES responses return the PIT id, and it works with search_after. I tested it in dev tools as well and it all works as far as I can tell.

Edit: Yes, asking on Slack I got an affirmative that this is OK 👍
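
A minimal sketch of how a composite aggregation pages under a PIT via its after_key (index, field, and aggregation names are illustrative assumptions):

```ts
import type { ElasticsearchClient } from 'kibana/server';

async function countAlertsByRule(esClient: ElasticsearchClient) {
  const pit = await esClient.openPointInTime({ index: 'my-index-*', keep_alive: '5m' });
  const counts: Record<string, number> = {};
  let afterKey: Record<string, unknown> | undefined;
  do {
    // size: 0 because we only want the aggregation buckets, not the hits.
    const response: any = await esClient.search({
      size: 0,
      pit: { id: pit.id, keep_alive: '5m' },
      aggs: {
        by_rule: {
          composite: {
            size: 1000,
            sources: [{ rule_id: { terms: { field: 'rule.id' } } }],
            ...(afterKey ? { after: afterKey } : {}),
          },
        },
      },
    });
    const byRule = response.aggregations?.by_rule; // typed loosely for brevity
    for (const bucket of byRule?.buckets ?? []) {
      counts[bucket.key.rule_id] = bucket.doc_count;
    }
    // Loop until a page comes back with no buckets (and thus no after_key).
    afterKey = byRule?.after_key;
  } while (afterKey);
  await esClient.closePointInTime({ id: pit.id });
  return counts;
}
```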

you are making significantly more network calls to Elasticsearch and potentially slowing down the entire usage collector for Kibana, and possibly timing it out on our end resulting in no telemetry.

We can change the default from 100-item paging to 10k paging and it will be very close to the previous behavior. However, you now also have the ability to fetch more than 10k records if you want them. Do you want to keep querying up to 10k at a time? It's just a constants-file change now.

Edit: See above, I changed it to 1k like the others are in the code base.

AFAIK, Snapshot telemetry is collected via the /api/telemetry/v2/clusters/_stats which times out after one minute.

As mentioned above, we can change the defaults and keep the PIT while querying 10k at a time again. I wouldn't expect a PIT with 10k batches to reduce your odds compared to before; you just have 1 or 2 extra network calls to create and then destroy the PIT.

Edit: See above, I changed it to 1k like the others in the code base. Hopefully we are good with this and timing.

You are keeping the connection live in some cases up to 5m.

The PIT is not the same thing as the connection. The PIT has a keep-alive of 5m, meaning that if it is not explicitly released, it will expire on its own within 5m. Unless an exception occurs, we close those handles well before the 5m mark, as soon as querying is completed. We can reduce it to 1m if we only have 1 minute to query.
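
Concretely, one way to guarantee the handle is released as soon as querying completes (a sketch, assuming the same esClient as above, not the PR's exact code):

```ts
const pit = await esClient.openPointInTime({ index: 'my-index-*', keep_alive: '5m' });
try {
  // ...run the paged queries here; the PIT stays valid between calls...
} finally {
  // Close as soon as querying completes; the 5m keep-alive only matters if
  // this close never runs (e.g. the Kibana process dies mid-collection).
  await esClient.closePointInTime({ id: pit.id });
}
```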

```ts
// Collect jobs from every installed module, then pull a summary of all ML jobs.
const moduleJobs = modules.flatMap((module) => module.jobs);
const jobs = await mlClient.jobServiceProvider(fakeRequest, savedObjectsClient).jobsSummary();

// Only jobs tagged as security jobs are counted toward usage.
jobsUsage = jobs.filter(isSecurityJob).reduce((usage, job) => {
  // ...
```

Contributor:

One interesting thing I noticed last week is that this doesn't represent custom security jobs accurately, because updateMlJobUsage checks whether a job has been tagged with siem or security, and it is pretty unlikely that a user creating a custom job will label it with our labels. We probably shouldn't filter custom ML jobs with this condition, and could simplify updateMlJobUsage, but you don't need to worry about that in this PR. Just making you aware.

Contributor (Author):

Ok, I am not going to change any of this part at this point for this PR.

@pjhampton (Contributor) commented:

Pete's notes

When I start Kibana from your PR I get the following error message (before any rules / ml jobs loaded in):

```console
$ gh pr checkout 124912
$ yarn start --no-base-path
...
> [2022-02-14T11:21:47.701+00:00][ERROR][plugins.securitySolution] Encountered error in telemetry of message: No known job with id 'security', error: Error: No known job with id 'security'. Telemetry for "ml_jobs" will be skipped.

$ curl -XGET 'http://elastic:changeme@localhost:5601/api/stats?extended=true&legacy=true'
> [2022-02-14T11:26:25.300+00:00][ERROR][plugins.securitySolution] Encountered error in telemetry of message: No known job with id 'security', error: Error: No known job with id 'security'. Telemetry for "ml_jobs" will be skipped.
```

Manual Actions (Detection Rules)

  1. Load all prebuilt rules
  2. Enable 7 of them, including the Endpoint package
  3. Run yarn test:generate to generate some alerts.
  4. Create a custom rule - enabled
  5. Create another custom rule - disabled
  6. Trigger usage collector:

https://gist.github.com/pjhampton/17b0f353a0dae0da20f8a35ff50117d3

Manual Actions (Machine Learning)

  1. Enable ml rare event jobs
  2. Clone those jobs to make custom jobs
```console
$ curl -XGET 'http://elastic:changeme@localhost:5601/api/stats?extended=true&legacy=true' \
   | jq '.usage.security_solution.detectionMetrics.ml_jobs'
```

https://gist.github.com/pjhampton/0c4a2cb17ae914c54b239f1f49e23bfb


  • It seems that we are transmitting custom ML job detail back if the custom job was created from a preconfigured security job. I need to consult the ML team about whether this is what they want / expect. cc @ajosh0504 @SourinPaul. I know we need to do work around here anyway.

[Screenshot: 2022-02-14 at 12:00:09]

  • The [ERROR][plugins.securitySolution] log mentioned above (No known job with id 'security' ... Telemetry for "ml_jobs" will be skipped) seems to go away when an ML job is created.

Overall I think this PR is good and works as expected except for a couple of caveats. I really like the new folder structure, I think that is the biggest win of this PR. I think given you have fixed the alert counts for detection rules in this PR we should ship it in 8.1 assuming you can get it merged this week.

@FrankHassanabad (Contributor, Author) commented:

Thanks for testing it out... For this error you're seeing below:

```console
$ gh pr checkout 124912
$ yarn start --no-base-path
...
> [2022-02-14T11:21:47.701+00:00][ERROR][plugins.securitySolution] Encountered error in telemetry of message: No known job with id 'security', error: Error: No known job with id 'security'. Telemetry for "ml_jobs" will be skipped.

$ curl -XGET 'http://elastic:changeme@localhost:5601/api/stats?extended=true&legacy=true'
> [2022-02-14T11:26:25.300+00:00][ERROR][plugins.securitySolution] Encountered error in telemetry of message: No known job with id 'security', error: Error: No known job with id 'security'. Telemetry for "ml_jobs" will be skipped.
```

I think that has existed for a while, but I don't know whether that is expected from the ML API or not; it does appear to be coming from their API. I don't know if there is another way we should shape the query, but my changes aren't causing this as something "new".

I will change the error to info as suggested above, but I can change it from info to debug or even omit it if you think it's too noisy.
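
For reference, a one-line sketch of that downgrade (the logger and message variables are assumptions, not the PR's exact code):

```ts
// Report the skipped "ml_jobs" telemetry at debug instead of error so it
// doesn't alarm users on a fresh install with no ML jobs.
logger.debug(
  `Encountered error in telemetry of message: ${message}. Telemetry for "ml_jobs" will be skipped.`
);
```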

```ts
/**
 * Same as "RuleSearchResult" just a partial is applied. Useful for unit tests since these types aren't exact.
 */
export type PartialRuleSearchResult = Omit<
  // ...
```

Contributor (Author):

Dead type; already removed in a follow-up commit.

@pjhampton (Contributor) left a comment

Looks fine.

Please reduce the logging level to debug in this PR before merge instead of info. I mistyped in one of my previous comments.

@kibana-ci (Collaborator) commented:

💛 Build succeeded, but was flaky

Test Failures

  • [job] [logs] Default CI Group #1 / Execution context Browser apps dashboard app propagates context for Vertical bar visualization

Metrics [docs]

Unknown metric groups

ESLint disabled in files

| id | before | after | diff |
| --- | --- | --- | --- |
| securitySolution | 67 | 66 | -1 |

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| securitySolution | 442 | 444 | +2 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| securitySolution | 509 | 510 | +1 |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @FrankHassanabad

@FrankHassanabad FrankHassanabad added the auto-backport label Feb 15, 2022
@FrankHassanabad FrankHassanabad merged commit 7c850dd into elastic:main Feb 15, 2022
@FrankHassanabad FrankHassanabad deleted the telemtry-use-pit branch February 15, 2022 19:58

@FrankHassanabad FrankHassanabad removed the auto-backport label Feb 15, 2022
@kibanamachine kibanamachine added the backport missing label Feb 17, 2022
@kibanamachine (Contributor) commented:

Friendly reminder: Looks like this PR hasn’t been backported yet.
To create backports run node scripts/backport --pr 124912 or prevent reminders by adding the backport:skip label.

2 similar comments followed.

@afharo afharo added the backport:skip label Feb 22, 2022
@kibanamachine kibanamachine removed the backport missing label Feb 22, 2022
@afharo (Member) commented Feb 22, 2022

v8.2.0 does not need backporting today... adding the label backport:skip

Labels: backport:skip, release_note:skip, Team:Security Solution Platform, v8.2.0

6 participants