
[Security Solutions] Updates usage collector telemetry to use PIT (Point in Time) and restructuring of folders #124912

Merged

FrankHassanabad merged 19 commits into elastic:main from telemtry-use-pit on Feb 15, 2022

Conversation

@FrankHassanabad (Contributor) commented Feb 8, 2022

Summary

Changes the usage collector telemetry within Security Solutions to use PIT (Point in Time), along with a few other bug fixes and some restructuring.

  • The main goal is to replace the full queries of up to 10k items with 1k batches at a time using PIT (Point in Time); see the sketch after this list. See this ticket for more information and here for an example where they changed their code to use 1k batches. I use PIT with the saved objects API, with searches, and then with composite aggregations, all of which support PIT. The PIT timeouts are all set to 5 minutes, and the 10k maximums remain in place so memory usage does not increase. However, we could now raise the 10k limit if we wanted the usage collector to count beyond 10k; the initial 10k was an Elasticsearch limitation that PIT avoids.
  • This also fixes a bug where the aggregations were only returning the top 10 items instead of the full 10k. That is changed to use composite aggregations.
  • This restructures the folder layout to be as reductive and uniform as we can make it. I could not apply the same reduction to the schema, as the tooling does not allow it, but the rest repeats the same structure in the way developers hopefully expect, which should also make it easier to add new telemetry usage collector counters in the same fashion.
  • This exchanges the hand-spun TypeScript types in favor of the caseComments, sanitized alerts, and ML job types, using Partial and other TypeScript utilities.
  • This removes the Cyclomatic Complexity warnings coming from the linters by breaking down the functions into smaller units.
  • This removes the "as casts" in all but 1 area which can lead to subtle TypeScript problems.
  • This pushes the logger down and uses it to report errors and some debug information.
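
A minimal sketch of the saved objects side of this pattern, assuming a savedObjectsClient and a hypothetical RuleAttributes shape (the names here are illustrative, not the PR's actual code):

```ts
import type { SavedObjectsClientContract, SavedObjectsFindResult } from 'kibana/server';

// Hypothetical attributes shape, for illustration only.
interface RuleAttributes {
  name: string;
  enabled: boolean;
}

async function fetchAllRules(
  savedObjectsClient: SavedObjectsClientContract
): Promise<Array<SavedObjectsFindResult<RuleAttributes>>> {
  // The finder opens a PIT behind the scenes and pages 1k saved objects at a
  // time, instead of one find() call with size: 10_000.
  const finder = savedObjectsClient.createPointInTimeFinder<RuleAttributes>({
    type: 'alert',
    perPage: 1000,
  });
  const results: Array<SavedObjectsFindResult<RuleAttributes>> = [];
  for await (const response of finder.find()) {
    results.push(...response.saved_objects);
  }
  // Release the PIT explicitly rather than waiting out its keep-alive.
  await finder.close();
  return results;
}
```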

Checklist

Delete any items that are not applicable to this PR.

@FrankHassanabad FrankHassanabad changed the title from "Telemtry use pit" to "[Security Solutions] Changes telemetry to use PIT (Point in Time) and other refactors" Feb 8, 2022
@FrankHassanabad FrankHassanabad self-assigned this Feb 8, 2022
@FrankHassanabad FrankHassanabad added the v8.0.0, v8.1.0, release_note:skip, Team:Security Solution Platform, and v8.2.0 labels and removed the v8.0.0 and v8.1.0 labels Feb 8, 2022
@FrankHassanabad FrankHassanabad changed the title from "[Security Solutions] Changes telemetry to use PIT (Point in Time) and other refactors" to "[Security Solutions] Updates usage collector telemetry to use PIT (Point in Time) and has large refactors" Feb 8, 2022
@FrankHassanabad FrankHassanabad marked this pull request as ready for review February 8, 2022 15:39
@FrankHassanabad FrankHassanabad requested review from a team as code owners February 8, 2022 15:39
@FrankHassanabad FrankHassanabad changed the title from "[Security Solutions] Updates usage collector telemetry to use PIT (Point in Time) and has large refactors" to "[Security Solutions] Updates usage collector telemetry to use PIT (Point in Time) and restructuring of folders" Feb 8, 2022
@pjhampton (Contributor) left a comment

RE: #93770

I'm not sure about this PR @FrankHassanabad. Instead of making one upfront call, you are making significantly more network calls to Elasticsearch, potentially slowing down the entire usage collector for Kibana and possibly timing it out on our end, resulting in no telemetry. As telemetry processing is time sensitive, I'm not sure the size: 10_000 advice really applies here; it seems aimed at sporadic, ad-hoc requests to ES from Kibana APIs.

Also, are you sure the PIT API applies to aggregations? The documentation only discusses queries: https://www.elastic.co/guide/en/elasticsearch/reference/current/point-in-time-api.html

AFAIK, snapshot telemetry is collected via the /api/telemetry/v2/clusters/_stats endpoint, which times out after one minute. Really, our usage collector needs to finish processing within one minute for us to get anything. You are keeping the connection live in some cases up to 5m.

I will continue to test it, but I think we should briefly meet to discuss these changes. We need to do some profiling on a production-like cluster. I think the folder reorg is fine.

@pjhampton (Contributor) commented

@afharo is a very strong ally of the security business from a stack perspective. Alejandro, can you run your eyes over this PR, please?

@FrankHassanabad (Contributor, Author) commented Feb 9, 2022

Edit:

Looking at other telemetry code, I see the pattern of using 1k batches with PITs, like so, to bring the code in line with the Kibana core recommendations and guidelines:
https://github.com/elastic/kibana/blob/main/src/plugins/visualizations/server/usage_collector/get_usage_collector.ts#L39-L43

From this PR:
#99031

So I think I'm on the right track and have adjusted the default from 100 to 1k in the constants and updated the description. Let me know if there's something else for me to do with this.

The PIT recommendations are from here, and this is the ticket we have to solve based on it:

Returns a ISavedObjectsPointInTimeFinder to help page through large sets of saved objects. We strongly recommend using this API for any find queries that might return more than 1000 saved objects, however this API is only intended for use in server-side "batch" processing of objects where you are collecting all objects in memory or streaming them back to the client.

So, as far as I know, I am following their guidance whether I'm using the SO PIT or the ES PIT.
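
For the ES side, a minimal sketch of the same 1k paging with an explicit PIT and search_after, assuming the v8 Elasticsearch client (the index name and sort field are illustrative):

```ts
import type { ElasticsearchClient } from 'kibana/server';

async function fetchAllDocs(esClient: ElasticsearchClient) {
  // Open a PIT so every page sees a consistent view of the index.
  const pit = await esClient.openPointInTime({ index: 'my-index-*', keep_alive: '5m' });
  const docs: unknown[] = [];
  let searchAfter: any; // typed loosely for brevity
  do {
    const response = await esClient.search({
      size: 1000,
      pit: { id: pit.id, keep_alive: '5m' },
      sort: [{ '@timestamp': 'asc' }],
      ...(searchAfter ? { search_after: searchAfter } : {}),
    });
    const hits = response.hits.hits;
    docs.push(...hits);
    // Resume from the sort values of the last hit; stop on an empty page.
    searchAfter = hits.length > 0 ? hits[hits.length - 1].sort : undefined;
  } while (searchAfter);
  // Destroy the PIT as soon as we are done paging.
  await esClient.closePointInTime({ id: pit.id });
  return docs;
}
```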

As a medium-term goal on that ticket, Kibana core is saying they will start throwing in development mode:

Introduce a soft-limit that throws if Kibana searches with a size > 1000 in development mode

which is why I added PIT to the queries.


Also, are you sure the PIT API applies to aggregations? The documentation only discusses queries: https://www.elastic.co/guide/en/elasticsearch/reference/current/point-in-time-api.html

Although the docs don't call it out explicitly for composite aggregations, the TypeScript types accept it, the ES responses return the PIT id, and it works with search_after. I tested it in dev tools as well and it all works as far as I can tell.

Edit: Yes, asking on Slack I got an affirmative that this is OK 👍
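
A minimal sketch of how a composite aggregation pages under a PIT via its after_key (index, field, and aggregation names are illustrative assumptions):

```ts
import type { ElasticsearchClient } from 'kibana/server';

async function countAlertsByRule(esClient: ElasticsearchClient) {
  const pit = await esClient.openPointInTime({ index: 'my-index-*', keep_alive: '5m' });
  const counts: Record<string, number> = {};
  let afterKey: Record<string, unknown> | undefined;
  do {
    // size: 0 because we only want the aggregation buckets, not the hits.
    const response: any = await esClient.search({
      size: 0,
      pit: { id: pit.id, keep_alive: '5m' },
      aggs: {
        by_rule: {
          composite: {
            size: 1000,
            sources: [{ rule_id: { terms: { field: 'rule.id' } } }],
            ...(afterKey ? { after: afterKey } : {}),
          },
        },
      },
    });
    const byRule = response.aggregations?.by_rule; // typed loosely for brevity
    for (const bucket of byRule?.buckets ?? []) {
      counts[bucket.key.rule_id] = bucket.doc_count;
    }
    // Loop until a page comes back with no buckets (and thus no after_key).
    afterKey = byRule?.after_key;
  } while (afterKey);
  await esClient.closePointInTime({ id: pit.id });
  return counts;
}
```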

you are making significantly more network calls to Elasticsearch and potentially slowing down the entire usage collector for Kibana, and possibly timing it out on our end resulting in no telemetry.

We can change the default from 100-item paging to 10k paging and it will be very close to the previous behavior. However, you now also have the ability to fetch more than 10k records if you want them. Do you want to keep querying up to 10k at a time? It's just a constants-file change now.

Edit: See above, I changed it to 1k like the others are in the code base.

AFAIK, Snapshot telemetry is collected via the /api/telemetry/v2/clusters/_stats which times out after one minute.

As mentioned above, we can change the defaults and keep the PIT while querying 10k at a time again. I wouldn't expect a PIT with 10k batches to reduce your odds compared to before; you just have 1 or 2 extra network calls to create and then destroy the PIT.

Edit: See above, I changed it to 1k like the others in the code base. Hopefully we are good with this and timing.

You are keeping the connection live in some cases up to 5m.

The PIT is not the same thing as the connection. The PIT has a keep-alive of 5m, meaning that if it is not explicitly released, it will expire on its own within 5m. Unless an exception occurs, we close those handles well before the 5m mark, as soon as querying is completed. We can reduce it to 1m if we only have 1 minute to query.
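
Concretely, one way to guarantee the handle is released as soon as querying completes (a sketch, assuming the same esClient as above, not the PR's exact code):

```ts
const pit = await esClient.openPointInTime({ index: 'my-index-*', keep_alive: '5m' });
try {
  // ...run the paged queries here; the PIT stays valid between calls...
} finally {
  // Close as soon as querying completes; the 5m keep-alive only matters if
  // this close never runs (e.g. the Kibana process dies mid-collection).
  await esClient.closePointInTime({ id: pit.id });
}
```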

```ts
// Collect jobs from every installed module, then pull a summary of all ML jobs.
const moduleJobs = modules.flatMap((module) => module.jobs);
const jobs = await mlClient.jobServiceProvider(fakeRequest, savedObjectsClient).jobsSummary();

// Only jobs tagged as security jobs are counted toward usage.
jobsUsage = jobs.filter(isSecurityJob).reduce((usage, job) => {
  // ...
```

Contributor:

One interesting thing I noticed last week is that this doesn't represent custom security jobs accurately, because updateMlJobUsage checks whether a job has been tagged with siem or security, and it is pretty unlikely that a user creating a custom job will label it with our labels. We probably shouldn't filter custom ML jobs with this condition, and could simplify updateMlJobUsage, but you don't need to worry about that in this PR. Just making you aware.

Contributor (Author):

Ok, I am not going to change any of this part at this point for this PR.

@pjhampton (Contributor) commented:

Pete's notes

When I start Kibana from your PR I get the following error message (before any rules / ml jobs loaded in):

```console
$ gh pr checkout 124912
$ yarn start --no-base-path
...
> [2022-02-14T11:21:47.701+00:00][ERROR][plugins.securitySolution] Encountered error in telemetry of message: No known job with id 'security', error: Error: No known job with id 'security'. Telemetry for "ml_jobs" will be skipped.

$ curl -XGET 'http://elastic:changeme@localhost:5601/api/stats?extended=true&legacy=true'
> [2022-02-14T11:26:25.300+00:00][ERROR][plugins.securitySolution] Encountered error in telemetry of message: No known job with id 'security', error: Error: No known job with id 'security'. Telemetry for "ml_jobs" will be skipped.
```

Manual Actions (Detection Rules)

  1. Load all prebuilt rules
  2. Enable 7 of them, including the Endpoint package
  3. Run yarn test:generate to generate some alerts.
  4. Create a custom rule - enabled
  5. Create another custom rule - disabled
  6. Trigger usage collector:

https://gist.github.com/pjhampton/17b0f353a0dae0da20f8a35ff50117d3

Manual Actions (Machine Learning)

  1. Enable ml rare event jobs
  2. Clone those jobs to make custom jobs
```console
$ curl -XGET 'http://elastic:changeme@localhost:5601/api/stats?extended=true&legacy=true' \
   | jq '.usage.security_solution.detectionMetrics.ml_jobs'
```

https://gist.github.com/pjhampton/0c4a2cb17ae914c54b239f1f49e23bfb


  • It seems that we are transmitting custom ML job detail back if the custom job was created from a preconfigured security job. I need to consult the ML team about whether this is what they want / expect. cc @ajosh0504 @SourinPaul. I know we need to do work around here anyway.

[Screenshot: 2022-02-14 at 12:00:09]

  • The [ERROR][plugins.securitySolution] log mentioned above (No known job with id 'security' ... Telemetry for "ml_jobs" will be skipped) seems to go away when an ML job is created.

Overall I think this PR is good and works as expected except for a couple of caveats. I really like the new folder structure, I think that is the biggest win of this PR. I think given you have fixed the alert counts for detection rules in this PR we should ship it in 8.1 assuming you can get it merged this week.

@FrankHassanabad (Contributor, Author) commented:

Thanks for testing it out... For this error you're seeing below:

```console
$ gh pr checkout 124912
$ yarn start --no-base-path
...
> [2022-02-14T11:21:47.701+00:00][ERROR][plugins.securitySolution] Encountered error in telemetry of message: No known job with id 'security', error: Error: No known job with id 'security'. Telemetry for "ml_jobs" will be skipped.

$ curl -XGET 'http://elastic:changeme@localhost:5601/api/stats?extended=true&legacy=true'
> [2022-02-14T11:26:25.300+00:00][ERROR][plugins.securitySolution] Encountered error in telemetry of message: No known job with id 'security', error: Error: No known job with id 'security'. Telemetry for "ml_jobs" will be skipped.
```

I think that has existed for a while, but I don't know whether that is expected from the ML API or not; it does appear to be coming from their API. I don't know if there is another way we should shape the query, but my changes aren't causing this as something "new".

I will change the error to info as suggested above, but I can change it from info to debug or even omit it if you think it's too noisy.
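
For reference, a one-line sketch of that downgrade (the logger and message variables are assumptions, not the PR's exact code):

```ts
// Report the skipped "ml_jobs" telemetry at debug instead of error so it
// doesn't alarm users on a fresh install with no ML jobs.
logger.debug(
  `Encountered error in telemetry of message: ${message}. Telemetry for "ml_jobs" will be skipped.`
);
```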

```ts
/**
 * Same as "RuleSearchResult" just a partial is applied. Useful for unit tests since these types aren't exact.
 */
export type PartialRuleSearchResult = Omit<
  // ...
```

Contributor (Author):

Dead type; already removed in a follow-up commit.

@pjhampton (Contributor) left a comment

Looks fine.

Please reduce the logging level to debug in this PR before merge instead of info. I mistyped in one of my previous comments.

@kibana-ci (Collaborator) commented:

💛 Build succeeded, but was flaky

Test Failures

  • [job] [logs] Default CI Group #1 / Execution context Browser apps dashboard app propagates context for Vertical bar visualization

Metrics [docs]

Unknown metric groups

ESLint disabled in files

| id | before | after | diff |
| --- | --- | --- | --- |
| securitySolution | 67 | 66 | -1 |

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| securitySolution | 442 | 444 | +2 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| securitySolution | 509 | 510 | +1 |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @FrankHassanabad

@FrankHassanabad FrankHassanabad added the auto-backport label Feb 15, 2022
@FrankHassanabad FrankHassanabad merged commit 7c850dd into elastic:main Feb 15, 2022
@FrankHassanabad FrankHassanabad deleted the telemtry-use-pit branch February 15, 2022 19:58

@FrankHassanabad FrankHassanabad removed the auto-backport label Feb 15, 2022
@kibanamachine kibanamachine added the backport missing label Feb 17, 2022
@kibanamachine (Contributor) commented:

Friendly reminder: Looks like this PR hasn’t been backported yet.
To create backports run node scripts/backport --pr 124912 or prevent reminders by adding the backport:skip label.

2 similar comments followed.

@afharo afharo added the backport:skip label Feb 22, 2022
@kibanamachine kibanamachine removed the backport missing label Feb 22, 2022
@afharo (Member) commented Feb 22, 2022

v8.2.0 does not need backporting today... adding the label backport:skip

Labels: backport:skip, release_note:skip, Team:Security Solution Platform, v8.2.0

6 participants