Create a metrics query endpoint #2736

Closed
albertteoh opened this issue Jan 21, 2021 · 15 comments

@albertteoh
Contributor

albertteoh commented Jan 21, 2021

Requirement - what kind of business use case are you trying to solve?

Follow-up to: #2574

To integrate operational (R.E.D) metrics into Jaeger, sourced from the OTEL collector's spanmetrics processor. This opens up potential use cases such as:

  • Building an operational homepage.
  • Enriching the DAG feature with call count, latency, and error data.
  • Sorting services/operations in the Jaeger UI search by relevancy, such as highest latency/errors first, instead of alphabetically.

Though not the focus of this issue, the following mockup (courtesy of @Danafrid) illustrates what a potential operational homepage could look like:

[Mockup image: homepage_jaeger]

Problem - what in Jaeger blocks you from solving the requirement?

Jaeger UI currently has no way to query a metrics backend.

Proposal - what do you suggest to solve the problem or improve the existing situation?

More details can be found in the proposal document.

Add an /api/metrics endpoint to the existing jaeger-query service, or introduce a new service if that is the preferred approach.
This endpoint will relay requests to a configured PromQL-compatible metrics backend such as Prometheus or M3DB.
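
To make the shape of the relay concrete, here is a minimal sketch in Go (not an implementation commitment), assuming the backend exposes Prometheus's standard /api/v1/query_range API. The parameter names and the hard-coded backend address are placeholders for whatever configuration jaeger-query would actually use:

```go
// Minimal sketch of the proposed relay (names and config are illustrative only).
package main

import (
	"io"
	"net/http"
	"net/url"
)

// promBaseURL stands in for whatever backend address jaeger-query would be configured with.
const promBaseURL = "http://prometheus:9090"

// metricsHandler forwards an /api/metrics request to the backend's range-query API
// and relays the JSON response back to Jaeger UI unchanged.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	params := url.Values{}
	// The PromQL expression and time range would be derived from the UI's
	// service/operation/lookback parameters; they are passed through verbatim here.
	for _, k := range []string{"query", "start", "end", "step"} {
		params.Set(k, r.URL.Query().Get(k))
	}

	resp, err := http.Get(promBaseURL + "/api/v1/query_range?" + params.Encode())
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(resp.StatusCode)
	io.Copy(w, resp.Body)
}

func main() {
	http.HandleFunc("/api/metrics", metricsHandler)
	http.ListenAndServe(":16686", nil)
}
```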

We would appreciate feedback from the community on this feature proposal.

@jpkrohling
Contributor

I think this isn't something that belongs in the Jaeger core. We already have quite a few things to do on the tracing front, and there are capable UIs out there serving as metrics dashboards. For instance, when used with Istio, Kiali can get this kind of information:

[Screenshot: kiali-example]

Instead of having to use a metrics backend for this, we could brainstorm how we can use our trace information to provide this kind of data. We've talked a bit in the past about trace analytics with notebooks and/or Spark, and I think this is the path we should explore.

@albertteoh
Contributor Author

albertteoh commented Jan 21, 2021

@jpkrohling thanks for your thoughts. In fact, the proposal is to use our trace information to provide this kind of data. This is described in more detail in the linked proposal document. Sorry it wasn't clear in the issue description; I've updated it.

The idea is that the OTEL collector will aggregate metrics from span data via a new spanmetrics processor (WIP) and write them, through a Prometheus exporter, into a Prometheus-compliant backend like M3 or, of course, Prometheus. This data can then be used to enrich the user experience of Jaeger UI: for example, sorting services by relevance based on error rates, latency, or call count, and potentially enriching the DAG with latency or error annotations.
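
As a rough sketch of how a UI-facing query could look, assuming the spanmetrics processor emits a counter along the lines of calls_total labelled with service_name and status_code (those metric and label names, and the STATUS_CODE_ERROR value, are assumptions and may differ from what the processor ultimately produces), ranking services by error rate could be a single PromQL evaluation via the standard Prometheus Go client:

```go
// Illustrative only: rank services by error rate using metrics assumed to be
// produced by the spanmetrics processor (calls_total with service_name/status_code labels).
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// Error rate per service over the last 5 minutes, highest first.
	query := `sort_desc(
	  sum by (service_name) (rate(calls_total{status_code="STATUS_CODE_ERROR"}[5m]))
	  / sum by (service_name) (rate(calls_total[5m]))
	)`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result) // instant vector: one error-rate sample per service
}
```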

IMHO, using another tool like a notebook or Spark involves a context switch instead of a seamless experience within Jaeger UI itself. The intention is to give new users, and even experienced users, the ability to quickly deep-dive into the most relevant traces and hit the ground running with Jaeger UI, and to make good use of the empty real estate on the opening search screen.

I think that in-depth/advanced analysis of trace data belongs in the realm of notebooks and Spark, and I believe these satisfy a different group of use cases, where ML can reveal interesting insights from trace data.

However, in my mind, this is intended to be quite simple and opinionated towards the specific use cases that improve the user experience of Jaeger UI, and should not try to replicate powerful metrics dashboard products like Grafana or Kiali (which couples users to Istio), nor play the role of a trace analysis tool.

@jpkrohling
Contributor

Nice proposal, sorry for not properly reading it first. I think I need some time to process it, as my initial reaction is the same: we might be going down a path that deviates from our main purpose (tracing). If we were to derive this information from the storage we already use, I'd be fine with it, but getting this data from a metrics backend sounds weird, especially as the source of the information was trace data in the first place.

@kevinearls do you think this type of query would be possible with InfluxDB IOx?

@kevinearls
Contributor

@jpkrohling to be honest, I don't really know enough about IOx yet to say either way.

@yurishkuro
Member

I don't view this as a "metrics" solution, but as a way to navigate to interesting traces through aggregate views. Most of these aggregates naturally take the form of time series, hence the need for a storage backend that can store time series.

@jpkrohling
Contributor

hence the need for a storage backend that can store time series

This is actually why I asked above about IOx. If our own storage can answer this, we don't need to require people to convert traces to data points only to come back to Jaeger later.

@afishler-zz

@jpkrohling If you take the data from the tracing storage, you are not necessarily providing an accurate view of the system's state, since sampling limits the amount of trace data in storage.
When the metrics are aggregated at the collector level, before sampling is applied, you can provide a broader picture of the overall trace data.
I assume that aggregating ad hoc directly on the tracing storage might also have performance penalties that can be avoided by using an external time-series source for the metrics.

@jpkrohling
Contributor

Not sure about the other arguments, but you are right about the sampling. I think we'll talk about storage tomorrow, so we can discuss this further after that. But I think this is looking good.

@devrimdemiroz

As an end user, I have been trying for some time to get aggregated statistics/metrics of the very same data presented under a trace's "Trace Statistics". I also tried via Grafana. I checked the metrics exposed under Prometheus; they did not exist. I checked the OpenTelemetry metrics, both from the javaagent metrics exporter and those exposed by the OTEL collector, with no luck. In that sense, I value @albertteoh's motivation here. I cannot argue with the points of view in the responses, but if this were available in the UI or in Prometheus, it would have been very beneficial. When metrics are discussed, the main metrics produced by tracing itself are almost unreachable. They are there, and they are so valuable, but we cannot view them. These trace/span-derived metrics are what distinguish a commercial APM product. We can see the trace itself in Grafana, but the treasure inside is locked. If this were available in the Jaeger UI, it would absolutely be very practical for many newcomers to observability.

@albertteoh
Contributor Author

@jaegertracing/jaeger-maintainers and community, I wanted to check in to see if we're okay to go ahead with this proposal, or if there are any outstanding questions/concerns that still need addressing.

@yurishkuro
Member

+1

@jpkrohling
Contributor

go go go!

@pavolloffay
Member

+1, this is great functionality and a great addition to the Jaeger project. Doing it at the collector with tail-based sampling makes a lot of sense.

The data itself can live in any (query-compatible) storage; I see this as a pluggable feature, like the dependency diagram (calculated by Spark).

@pavolloffay
Member

@albertteoh can we close this one?

@albertteoh
Contributor Author

@pavolloffay yup, I've closed it. We're tracking progress in #2954.
