"Monitor" tab for health metrics #736
Comments
@albertteoh maybe we should move this to the main repo since it spans frontend and backend work.
or at least we should have a master ticket in the main repo
Yup sure, I'll move this to the main repo as suggested, and will make this the master ticket as well, since this issue will drive the requirements for UI and backend work.
Closing in favour of master ticket: jaegertracing/jaeger#2954
Requirement - what kind of business use case are you trying to solve?
The main proposal is documented in: jaegertracing/jaeger#2736.
The motivation is to help identify interesting traces (high-QPS, slow, or erroneous) without knowing the service or operations up-front.
Use cases include:
Proposal - what do you suggest to solve the problem or improve the existing situation?
Add a new "Monitor" tab, situated after "Compare", containing service-level request rates, error rates, latencies, and impact (= latency × request rate, which avoids "false positives" from low-QPS endpoints with high latencies). The data will be sourced from jaeger-query's new metrics endpoints.
As the jaeger-query metrics endpoints are opt-in, the Monitor tab will have a sensible empty state, perhaps with a link to documentation on how to enable the metrics querying capability.
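As a rough sketch of how the proposed impact column could be computed and used for the default sort (the `ServiceMetrics` shape and field names here are hypothetical, not an actual jaeger-query response type):

```typescript
// Hypothetical per-service metrics row; field names are illustrative only.
interface ServiceMetrics {
  service: string;
  p95LatencyMs: number; // e.g. 95th-percentile latency in milliseconds
  requestRate: number; // requests per second (QPS)
}

// impact = latency * request rate: a slow but rarely-called endpoint
// scores lower than a moderately slow, high-QPS one.
function impact(m: ServiceMetrics): number {
  return m.p95LatencyMs * m.requestRate;
}

// Sort descending by impact, the proposed default ordering of the tab.
function rankByImpact(rows: ServiceMetrics[]): ServiceMetrics[] {
  return [...rows].sort((a, b) => impact(b) - impact(a));
}
```

With this weighting, a 500 ms endpoint receiving 0.1 req/s (impact 50) ranks below a 50 ms endpoint receiving 20 req/s (impact 1000), which is the intended "false positive" suppression.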
Workflow
The screen will open to a per-service level set of metrics sorted, by default, on Impact. Columns are configurable by the user with other latency percentiles available, among others. A search box will be available to filter on service names.
The user need only supply the time period to fetch metrics for (similar to Find Traces), defaulting to a 1-hour lookback.
Note the user is not required to define the step size (the period between data points), at least in this iteration, to keep the user experience as simple as possible. Instead, we propose to derive the step size from a sensible heuristic based on the query period and/or the width of the chart. For example:
search period < 30m -> 15s step
search period < 1h -> 1m step, etc.
There are two possible actions from this tab: drilling down into a service's metrics, or jumping to the Search tab, both described below.
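The step-size heuristic above could be sketched as a lookup table keyed on the lookback period. The two tiers come from the examples above; the chart-width fallback for longer periods is an illustrative assumption:

```typescript
const MINUTE_MS = 60 * 1000;

// [max lookback, step] pairs; both tiers taken from the examples above.
const STEP_TIERS: Array<[number, number]> = [
  [30 * MINUTE_MS, 15 * 1000], // search period < 30m -> 15s step
  [60 * MINUTE_MS, MINUTE_MS], // search period < 1h  -> 1m step
];

// Derive a step size so the user never has to choose one themselves.
function stepForLookback(lookbackMs: number, chartWidthPx = 240): number {
  for (const [maxLookback, stepMs] of STEP_TIERS) {
    if (lookbackMs < maxLookback) return stepMs;
  }
  // Assumed fallback for longer lookbacks: roughly one data point per
  // pixel of chart width, rounded up to a whole second.
  return Math.ceil(lookbackMs / chartWidthPx / 1000) * 1000;
}
```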
Service metrics page
If drilling down into the service-level metrics, the page will show a summary of the RED metrics at the top, along with the equivalent per-operation metrics presented as in the per-service view above. Similarly, there will be a search box to filter on operations, and the user will have the option to "View all traces" for a given operation.
Search tab
The Search tab is the final stage in the workflow (unless the user navigates back to a previous state); it is pre-populated with the service and/or operation as well as the search period.
The search period will be sticky between each of these screens to maintain consistency in search results.
Demo
Courtesy of @Danafrid.
Screen.Recording.2021-04-14.at.11.52.58.mov
Any open questions to address