[Telemetry] track and warn event loop delays thresholds #103615

Bamieh · 2021-06-29T07:42:56Z

Summary

Part of a larger effort to measure platform performance and make performance issues easier to debug (#63848).

In 7.14 we started sending an hourly-updated event loop delays histogram (#101580). This helps us investigate the average delays our customers see, percentiles, etc.

This PR warns users when the event loop delay exceeds a configurable threshold duration, ops.eventLoopDelayThreshold.

By default this duration is 350ms, and the warning is logged once every 30 seconds for as long as the delay stays above that threshold.
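A minimal sketch of the threshold check described above, not the code in this PR. The names `WARN_THRESHOLD_MS`, `CHECK_INTERVAL_MS`, and `checkThreshold`, the use of the mean delay as the compared value, and the message text are all assumptions made for illustration; only the 350ms and 30 second defaults come from the description.

```typescript
// Hypothetical constants; the values mirror the defaults stated in the description.
const WARN_THRESHOLD_MS = 350;
const CHECK_INTERVAL_MS = 30_000;

interface ThresholdCheck {
  exceeded: boolean;
  meanMs: number;
  message?: string;
}

// Node's event loop delay histogram reports nanoseconds, so convert before comparing.
// Comparing the mean (rather than max or a percentile) is an assumption of this sketch.
function checkThreshold(meanNs: number, thresholdMs: number = WARN_THRESHOLD_MS): ThresholdCheck {
  const meanMs = meanNs / 1e6;
  if (meanMs > thresholdMs) {
    return {
      exceeded: true,
      meanMs,
      // Hypothetical message text.
      message: `Event loop delay threshold exceeded ${thresholdMs}ms. Received ${meanMs}ms.`,
    };
  }
  return { exceeded: false, meanMs };
}

// Runs the check once per interval, so a sustained delay logs at most one warning
// every CHECK_INTERVAL_MS. `readMeanNs` and `warn` are supplied by the caller.
function startThresholdWatcher(
  readMeanNs: () => number,
  warn: (message: string) => void,
  intervalMs: number = CHECK_INTERVAL_MS
): () => void {
  const timer = setInterval(() => {
    const result = checkThreshold(readMeanNs());
    if (result.exceeded && result.message) warn(result.message);
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Checking on a fixed interval, rather than on every sample, is what keeps the log at one warning per 30 seconds while the delay remains high.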

Once we have a representative sample from the reported delays histogram, we can adjust this default to be more meaningful and closer to real-world cases.

metrics.ops already reports collected_at and the whole ecs object for further debuggability around the logs.

Implementation direction

This is another approach to implementing the event loop threshold (original draft PR: #103478)

Instead of using the ops.metrics implementation, I used the event loop delays histogram. The current ops metrics implementation does not really capture event loop delays, as it only captures the delay in the immediate loop when the measurement is made.

The perf.monitorEventLoopDelay() used here tracks delays over time, not only at collection time. This way we really capture delays and spikes. I also added some telemetry around these spikes to report them back to our cluster along with the full histogram for diagnosis, which is not possible inside core at the moment without a lot of piping. I prefer this approach; let me know what you think.
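To illustrate why a continuously sampled histogram catches spikes that a point-in-time measurement can miss, here is a small sketch using Node's `monitorEventLoopDelay` from `perf_hooks`. The helpers `blockFor` and `measureSpike` and the 10ms resolution are assumptions for the example and are not part of this PR.

```typescript
import { monitorEventLoopDelay } from 'perf_hooks';

// Hypothetical helper: blocks the event loop synchronously to simulate a spike.
function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // busy wait
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: samples delays for the whole window, including the spike in the middle.
async function measureSpike(blockMs: number): Promise<{ maxMs: number; meanMs: number }> {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();
  await sleep(50); // let the sampler record a few quiet intervals
  blockFor(blockMs); // the spike happens between collections
  await sleep(50); // the loop is idle again by the time we read the values
  histogram.disable();
  // Histogram values are in nanoseconds.
  return { maxMs: histogram.max / 1e6, meanMs: histogram.mean / 1e6 };
}
```

A measurement taken only at the final read would see an idle loop, while the histogram still holds the spike in its `max` and percentile values.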

Notes

I experimented with event loop utilization (ELU) but realized it serves a different purpose from the original intention of this PR (draft #103477).

Related: #98673
Closes: #96192


Bamieh · 2021-06-29T07:53:20Z

This is another approach to implementing the event loop threshold (original draft PR: #103478)

Instead of using the ops.metrics implementation, I used the event loop delays histogram. The current ops metrics implementation does not really capture event loop delays, as it only captures the delay in the immediate loop when the measurement is made.

The perf.monitorEventLoopDelay() tracks delays over time, not only during collection. This way we really capture delays and spikes. I also added some telemetry around these spikes to report them back to our cluster along with the full histogram for diagnosis, which is not possible inside core at the moment without a lot of piping.

I like this approach more. Let me know what you think. cc @joshdover


joshdover

Just one nit, but otherwise LGTM. Totally agree that this is going to be a much more helpful, stable, and accurate way of warning the admin of an issue. I do worry that this warning isn't highly actionable, though it would signal to support that they may have some scaling issues with Task Manager.

I do wonder if we should consider linking to our documentation on scaling task manager in production: https://www.elastic.co/guide/en/kibana/master/task-manager-production-considerations.html#_deployment_considerations

src/plugins/kibana_usage_collection/server/collectors/event_loop_delays/track_threshold.ts


kibanamachine · 2021-06-29T17:57:34Z

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

💚 Build #135022 succeeded 0c47654
💔 Build #134960 failed 821372e

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

kibanamachine · 2021-06-29T18:00:56Z

💚 Backport successful

Status	Branch	Result
✅	7.x

This backport PR will be merged automatically after passing CI.


Bamieh added 5 commits June 28, 2021 15:04

log event loop delay

892edf8

track delay thresholds via kibana collection

e22bf12

Merge branch 'master' of github.com:elastic/kibana into metrics/log_e…

863bd9d

…vent_loop_delay

update tests

60c994a

revert metrics changes

821372e

Bamieh requested a review from joshdover June 29, 2021 10:37

Bamieh added 2 commits June 29, 2021 13:48

update collection mock

3e77e4f

Merge branch 'master' of github.com:elastic/kibana into telemetry/tra…

0c47654

…ck_event_loop_threshold

Bamieh marked this pull request as ready for review June 29, 2021 13:19

Bamieh requested a review from a team as a code owner June 29, 2021 13:19

Bamieh added the auto-backport, release_note:skip, v7.14.0, and v8.0.0 labels Jun 29, 2021

joshdover approved these changes Jun 29, 2021

src/plugins/kibana_usage_collection/server/collectors/event_loop_delays/track_threshold.ts (outdated, resolved)

Bamieh added 2 commits June 29, 2021 18:10

Merge branch 'master' of github.com:elastic/kibana into telemetry/tra…

9af1464

…ck_event_loop_threshold

named parameters for configs + update warning message

f83f9c1

Bamieh enabled auto-merge (squash) June 29, 2021 15:49

Bamieh merged commit de19795 into elastic:master Jun 29, 2021

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Jun 29, 2021

[Telemetry] track and warn event loop delays thresholds (elastic#103615)

355c769

kibanamachine mentioned this pull request Jun 29, 2021

[7.x] [Telemetry] track and warn event loop delays thresholds (#103615) #103728

Merged

Bamieh deleted the telemetry/track_event_loop_threshold branch June 29, 2021 20:44

kibanamachine added a commit that referenced this pull request Jun 29, 2021

[Telemetry] track and warn event loop delays thresholds (#103615) (#1…

f66507e

…03728) Co-authored-by: Ahmad Bamieh <ahmadbamieh@gmail.com>

Bamieh mentioned this pull request Aug 31, 2021

[Telemetry] track event loop utilization #103477

Closed

This was referenced Mar 28, 2022

Attach a long event loop delay span to an APM transaction #128646

Closed

Attach a long event loop delay "span" to an APM transaction #128647

Open
