Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Telemetry] track and warn event loop delays thresholds #103615

Merged
merged 9 commits into from
Jun 29, 2021

Conversation

Bamieh
Copy link
Member

@Bamieh Bamieh commented Jun 29, 2021

Summary

Part of a larger work to measure platform performance and ease debugging performance issues (#63848)

In 7.14 we started sending hourly updated event loop delays histogram (#101580). This helps us investigate average delays our customers have, percentiles, etc.

This PR warns users when the event loop delay exceeds a configurable threshold duration ops. eventLoopDelayThreshold

By default this duration is 350ms logged once every 30 seconds as long as the delay is still above that target.

Once we have a representative sample from the reported delays histogram we can adjust this default to be more meaningful and closer to real world cases.

metrics.ops already reports collected_at and all the ecs object for further debuggablity around the logs.

Implementation direction

This is another approach to implementing the event loop threshold (original draft PR: #103478)

Instead of using the ops.metrics implementation I used the the event loop delays histogram. The current ops metics implementation does not really capture event loop delays as it only captures the delay in the immediate loop when the measurement is made.

The perf.monitorEventLoopDelay() here tracks the delays over time and not only on collection. This way we really capture delays and spikes. I also added some telemetry around these spikes to report them back to our cluster along the full histogram for diagnosis. which is not possible inside core at the moment without a lot of piping. I prefer this approach, let me know what you think.

Notes

I've experimented with using event loop utilization (ELU) but realized it serves a different purpose than the original intention of this PR (draft #103477)

Related: #98673
Closes: #96192

@Bamieh
Copy link
Member Author

Bamieh commented Jun 29, 2021

This is another approach to implementing the event loop threshold (original draft PR: #103478)

Instead of using the ops.metrics implementation I used the the event loop delays histogram. The current ops metics implementation does not really capture event loop delays as it only captures the delay in the immediate loop when the measurement is made.

The perf.monitorEventLoopDelay() tracks the delays over time and not only during the collection. This way we really capture delays and spikes. I also added some telemetry around these spikes to report them back to our cluster along the full histogram for diagnosis. which is not possible inside core at the moment without a lot of piping.

I like this approach more. let me know what you think. cc @joshdover

@Bamieh Bamieh requested a review from joshdover June 29, 2021 10:37
@Bamieh Bamieh marked this pull request as ready for review June 29, 2021 13:19
@Bamieh Bamieh requested a review from a team as a code owner June 29, 2021 13:19
@Bamieh Bamieh added auto-backport Deprecated - use backport:version if exact versions are needed release_note:skip Skip the PR/issue when compiling release notes v7.14.0 v8.0.0 labels Jun 29, 2021
Copy link
Contributor

@joshdover joshdover left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one nit, but otherwise LGTM. Totally agree that this is going to be a much more helpful, stable, and accurate way of warning the admin of an issue. I do worry that this warning isn't highly actionable though, but would signals to support that they may have some scaling issues with Task Manager.

I do wonder if we should consider linking to our documentation on scaling task manager in production: https://www.elastic.co/guide/en/kibana/master/task-manager-production-considerations.html#_deployment_considerations

@Bamieh Bamieh enabled auto-merge (squash) June 29, 2021 15:49
@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Backport successful

Status Branch Result
7.x

This backport PR will be merged automatically after passing CI.

@Bamieh Bamieh deleted the telemetry/track_event_loop_threshold branch June 29, 2021 20:44
kibanamachine added a commit that referenced this pull request Jun 29, 2021
…03728)

Co-authored-by: Ahmad Bamieh <ahmadbamieh@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Deprecated - use backport:version if exact versions are needed release_note:skip Skip the PR/issue when compiling release notes v7.14.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Log a warning for event loop delay > threshold
3 participants