Create a metrics query endpoint #2736

Closed
albertteoh opened this issue Jan 21, 2021 · 15 comments

@albertteoh
Contributor

albertteoh commented Jan 21, 2021

Requirement - what kind of business use case are you trying to solve?

Follow-up to: #2574

To integrate operational (R.E.D) metrics into Jaeger, sourced from the OTEL collector's spanmetrics processor. This opens up potential use cases such as:

  • Building an operational homepage.
  • Enriching the DAG feature with call count, latency, and error data.
  • Sorting services/operations in the Jaeger UI search by relevancy, such as highest latency/errors first, instead of alphabetically.

Though not the focus of this issue, the following mockup (courtesy of @Danafrid) illustrates what a potential operational homepage could look like:

[Mockup image: homepage_jaeger]

Problem - what in Jaeger blocks you from solving the requirement?

Jaeger UI currently has no way to query a metrics backend.

Proposal - what do you suggest to solve the problem or improve the existing situation?

More details can be found in the proposal document.

Add an /api/metrics endpoint to the existing jaeger-query service, or introduce a new service if that is the preferred approach.
This endpoint will relay requests to a configured PromQL-compatible metrics backend such as Prometheus or M3DB.
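
To make the shape of the relay concrete, here is a minimal sketch in Go (not an implementation commitment), assuming the backend exposes Prometheus's standard /api/v1/query_range API. The parameter names and the hard-coded backend address are placeholders for whatever configuration jaeger-query would actually use:

```go
// Minimal sketch of the proposed relay (names and config are illustrative only).
package main

import (
	"io"
	"net/http"
	"net/url"
)

// promBaseURL stands in for whatever backend address jaeger-query would be configured with.
const promBaseURL = "http://prometheus:9090"

// metricsHandler forwards an /api/metrics request to the backend's range-query API
// and relays the JSON response back to Jaeger UI unchanged.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	params := url.Values{}
	// The PromQL expression and time range would be derived from the UI's
	// service/operation/lookback parameters; they are passed through verbatim here.
	for _, k := range []string{"query", "start", "end", "step"} {
		params.Set(k, r.URL.Query().Get(k))
	}

	resp, err := http.Get(promBaseURL + "/api/v1/query_range?" + params.Encode())
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(resp.StatusCode)
	io.Copy(w, resp.Body)
}

func main() {
	http.HandleFunc("/api/metrics", metricsHandler)
	http.ListenAndServe(":16686", nil)
}
```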

We would appreciate feedback from the community on this feature proposal.

@jpkrohling
Contributor

I think this isn't something that belongs in the Jaeger core. We already have quite a few things to do on the tracing front, and there are capable UIs out there serving as metrics dashboards. For instance, when used with Istio, Kiali can get this kind of information:

[Screenshot: kiali-example]

Instead of having to use a metrics backend for this, we could brainstorm how we can use our trace information to provide this kind of data. We've talked a bit in the past about trace analytics with notebooks and/or Spark, and I think this is the path we should explore.

@albertteoh
Contributor Author

albertteoh commented Jan 21, 2021

@jpkrohling thanks for your thoughts. In fact, the proposal is to use our trace information to provide this kind of data. This is described in more detail in the linked proposal document. Sorry it wasn't clear in the issue description; I've updated it.

The idea is that the OTEL collector will aggregate metrics from span data via a new spanmetrics processor (WIP) and write them, through a Prometheus exporter, into a Prometheus-compliant backend like M3 or, of course, Prometheus. This data can then be used to enrich the user experience of Jaeger UI: for example, sorting services by relevance based on error rates, latency, or call count, and potentially enriching the DAG with latency or error annotations.
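
As a rough sketch of how a UI-facing query could look, assuming the spanmetrics processor emits a counter along the lines of calls_total labelled with service_name and status_code (those metric and label names, and the STATUS_CODE_ERROR value, are assumptions and may differ from what the processor ultimately produces), ranking services by error rate could be a single PromQL evaluation via the standard Prometheus Go client:

```go
// Illustrative only: rank services by error rate using metrics assumed to be
// produced by the spanmetrics processor (calls_total with service_name/status_code labels).
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// Error rate per service over the last 5 minutes, highest first.
	query := `sort_desc(
	  sum by (service_name) (rate(calls_total{status_code="STATUS_CODE_ERROR"}[5m]))
	  / sum by (service_name) (rate(calls_total[5m]))
	)`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result) // instant vector: one error-rate sample per service
}
```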

IMHO, using another tool like a notebook or Spark involves a context switch instead of a seamless experience within Jaeger UI itself. The intention is to give new users, and even experienced users, the ability to quickly deep-dive into the most relevant traces and hit the ground running with Jaeger UI, and to make good use of the empty real estate on the opening search screen.

I think that in-depth/advanced analysis of trace data belongs in the realm of notebooks and Spark, and I believe these satisfy a different group of use cases, where ML can reveal interesting insights from trace data.

However, in my mind, this is intended to be quite simple and opinionated towards the specific use cases that improve the user experience of Jaeger UI, and should not try to replicate powerful metrics dashboard products like Grafana or Kiali (which couples users to Istio), nor play the role of a trace analysis tool.

@jpkrohling
Contributor

Nice proposal, sorry for not properly reading it first. I think I need some time to process it, as my initial reaction is the same: we might be going down a path that deviates from our main purpose (tracing). If we were to derive this information from the storage we already use, I'd be fine with it, but getting this data from a metrics backend sounds weird, especially as the source of the information was trace data in the first place.

@kevinearls do you think this type of query would be possible with InfluxDB IOx?

@kevinearls
Contributor

@jpkrohling to be honest, I don't really know enough about IOx yet to say either way.

@yurishkuro
Member

I don't view this as a "metrics" solution, but as a way to navigate to interesting traces through aggregate views. Most of these aggregates naturally take the form of time series, hence the need for a storage backend that can store time series.

@jpkrohling
Contributor

hence the need for a storage backend that can store time series

This is actually why I asked above about IOx. If our own storage can answer this, we don't need to require people to convert traces to data points only to come back to Jaeger later.

@afishler-zz

@jpkrohling If you take the data from the tracing storage, you are not necessarily providing an accurate view of the system's state, since sampling limits the amount of trace data in storage.
When the metrics are aggregated at the collector level, before sampling is applied, you can provide a broader picture of the overall trace data.
I assume that aggregating ad hoc directly on the tracing storage might also have performance penalties that can be avoided by using an external time-series source for the metrics.

@jpkrohling
Contributor

Not sure about the other arguments, but you are right about the sampling. I think we'll talk about storage tomorrow, so we can discuss this further after that. But I think this is looking good.

@devrimdemiroz

As an end user, I have been trying for some time to get aggregated statistics/metrics of the very same data presented under a trace's "Trace Statistics". I also tried via Grafana. I checked the metrics exposed under Prometheus; they did not exist. I checked the OpenTelemetry metrics, both from the javaagent metrics exporter and those exposed by the OTEL collector, with no luck. In that sense, I value @albertteoh's motivation here. I cannot argue with the points of view in the responses, but if this were available in the UI or in Prometheus, it would have been very beneficial. When metrics are discussed, the main metrics produced by tracing itself are almost unreachable. They are there, and they are so valuable, but we cannot view them. These trace/span-derived metrics are what distinguish a commercial APM product. We can see the trace itself in Grafana, but the treasure inside is locked. If this were available in the Jaeger UI, it would absolutely be very practical for many newcomers to observability.

@albertteoh
Contributor Author

@jaegertracing/jaeger-maintainers and community, I wanted to check in to see if we're okay to go ahead with this proposal, or if there are any outstanding questions/concerns that still need addressing.

@yurishkuro
Member

+1

@jpkrohling
Contributor

go go go!

@pavolloffay
Member

+1, this is great functionality and a great addition to the Jaeger project. Doing it at the collector with tail-based sampling makes a lot of sense.

The data itself can live in any (query-compatible) storage; I see this as a pluggable feature, like the dependency diagram (calculated by Spark).

@pavolloffay
Member

@albertteoh can we close this one?

@albertteoh
Contributor Author

@pavolloffay yup, I've closed it. We're tracking progress in #2954.
