
Metrics API RFCs to eliminate Raw statistics #4

Merged · 15 commits · Aug 13, 2019

Conversation

@jmacd (Contributor) commented Jun 28, 2019

No description provided.

@bogdandrutu (Member) commented:

Sorry, I am a bit lost. Does this represent 4 RFCs? If so, should we split them? Otherwise, should we merge them into one RFC?

@jmacd (Contributor, Author) commented Jul 1, 2019

Answered your question here:
open-telemetry/opentelemetry-specification#169

@jmacd (Contributor, Author) commented Jul 1, 2019

I think we should review these as a block, and if any need to be removed from this PR, they will be.

@iredelmeier (Member) left a comment:

Overall LGTM

@bogdandrutu (Member) left a comment:

I think there is a lot of confusion about this; it may be worth discussing today in the SIG meeting. I will give a 10-minute presentation about metrics.


## Explanation

In the current proposal, Metrics are used for pre-aggregated metric types, whereas Raw statistics are used for uncommon and vendor-specific aggregations. The optimization and the usability advantages gained with pre-defined labels should be extended to Raw statistics because they are equally important and equally applicable. This is a new requirement.
Member:

"Raw statistics" which probably is not a good name (I made it up, and probably confused more than helped) is not for "uncommon and vendor-specific aggregations". We use them for example in gRPC and HTTP metrics where we don't want the framework developers (gRPC devs) to define the aggregation and labels for the end user.

Member:

@bogdandrutu HTTP and gRPC metrics seem to me like they should be regular metrics, not "raw statistics". Using them this way seems in line with, for instance, Prometheus practices; see, e.g., the metric and label naming guide, the label examples in their querying documentation, and the [suggestions around label usage](https://prometheus.io/docs/practices/instrumentation/#things-to-watch-out-for).

Could you clarify why you think that HTTP and gRPC are better served by raw statistics?

Member:

See one of the examples I gave you in the other comment. As I mentioned in the SIG call, I think we have different understandings of "raw statistics" (delayed metrics). We do not want the gRPC/HTTP devs to decide how the metrics are aggregated; we want them only to record the measurements and let the application owner decide how they are aggregated.

One important thing: after the "Views" are applied, the result of the "raw statistics" (delayed metrics) is regular metrics. The main difference is who decides which label keys and aggregation function are used.

@@ -0,0 +1,37 @@
# Pre-defined label support for all metric operations

Let all Metric objects (Cumulative, Gauge, ...) and Raw statistics support pre-defined label values.
Member:

For the "raw statistics" we don't know in advance all the label keys that will be used to break down these metrics. In OpenCensus what we did was to allow users to call record with what we called extra labels. We cannot do the same optimization that we do for the pre-defined metrics (aggregation, labels are pre-defined) because we don't know at compilation time the set of labels that we need to use. Also for "raw statistics" the final set of label values will be a subset of Tags (a.k.a DistributedContext) and the extraLabels.

Member:

At least from my understanding, the idea isn't that label values are known at compile time, but that label keys are known. Does that clarify things?

Member:

That is the opposite of the idea behind "raw statistics" ("delayed metrics"). The whole point of supporting this is to allow deferred/delayed label keys and aggregation. See for example https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/gRPC.md: we record all these measurements and recommend some views, but application owners (who configure the Views) can define any label keys they want. One example is when an A/B test is performed in ServiceA (ServiceA calls ServiceB): you would want to add test_label as a label key for all rpc stats (or maybe just some of them) to monitor the effects. But you cannot change the gRPC instrumentation code, so we need to allow this mechanism.
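
A sketch of the View side of that A/B-test example, again against the OpenCensus Go API; the measure below stands in for whatever the gRPC instrumentation actually defines, and the view name and bucket bounds are invented:

```go
package example

import (
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

// rpcLatencyMs stands in for a measure defined inside the gRPC instrumentation;
// the application owner never touches that code.
var rpcLatencyMs = stats.Float64("grpc_roundtrip_latency", "RPC round-trip latency", stats.UnitMilliseconds)

// testLabel is propagated from ServiceA to ServiceB via Tags (DistributedContext).
var testLabel = tag.MustNewKey("test_label")

// registerABTestView is written by the application owner: it chooses both the
// label keys and the aggregation, long after the instrumentation was shipped.
func registerABTestView() error {
	return view.Register(&view.View{
		Name:        "rpc_latency_by_experiment",
		Description: "RPC latency broken down by A/B experiment arm",
		Measure:     rpcLatencyMs,
		TagKeys:     []tag.Key{testLabel},
		Aggregation: view.Distribution(1, 5, 10, 50, 100, 500), // bucket bounds are illustrative
	})
}
```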

@iredelmeier (Member), Jul 2, 2019:

That sounds like an issue with the gRPC instrumentation that can be solved with increased configurability, e.g., allowing end users to override the default label names upfront.

Member:

It is not about "not allowing" users to override the default label names; it is about adding extra labels that are propagated from different services with the context (tags/DistributedContext). This is the whole idea. I also mentioned the aggregation function (for example, do I want to build a histogram with these buckets or those buckets, or just build a Sum?).

If you try to come up with that "flexible configuration", you will see that you get very close to what we propose in this framework: "raw statistics" + views.

Member:

+1000 to being able to pre-specify defaults for some tags, rather than having all the tags re-checked at runtime. To be all technical about it: this feels like the metrics version of partial application.
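
For what it's worth, that "partial application" of labels might look roughly like the hypothetical Go sketch below; none of these types or methods come from an existing library, they only illustrate binding pre-defined label values once so the hot path skips label handling:

```go
package example

// Everything below is a hypothetical sketch, not an existing API. It only
// illustrates "partial application" of labels: bind the pre-defined label
// values once, then record cheaply on the hot path.

type LabelSet interface{} // opaque, pre-encoded label values

type Meter interface {
	Labels(keyValues ...string) LabelSet
	NewCounter(name string) Counter
}

type Counter interface {
	// Bind partially applies a LabelSet, returning a handle that does not need
	// to re-validate or re-encode labels on every call.
	Bind(labels LabelSet) BoundCounter
}

type BoundCounter interface {
	Add(value float64)
}

func serveLoop(m Meter) {
	requests := m.NewCounter("http_requests")
	bound := requests.Bind(m.Labels("method", "GET", "route", "/users"))
	for i := 0; i < 3; i++ {
		bound.Add(1) // hot path: no label lookup or validation here
	}
}
```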


## Internal details

This RFC is accompanied by RFC 0002-metric-measure which proposes to create a new Metric type to replace Raw statistics. The metric type, named "Measure", would replace the existing concept and type named "Measure" in the metrics API. The new MeasureMetric object would support a `Record` method to record measurements.
Member:

I agree, and I also talked with @SergeyKanzhelev about this. One important optimization for these metrics is "batch" recording. Because we don't know the labels and aggregation in advance, what we do is put every record entry into a producer-consumer queue, and the consumer thread "applies" all defined views (views define the aggregation and labels for these metrics so that we can produce Metric data). An alternative is to do the work on the critical path: extract the necessary labels and apply the aggregation there. Based on my experience with this system (used extensively at Google), that has more overhead than the previous implementation.

An option that I had in mind is to have Measure and MeasureCollection, where MeasureCollection is a set of Measures that allows batch recording for all the Measures in the collection.
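
A rough sketch of that producer-consumer shape in Go (a hypothetical internal structure, not any existing SDK): the hot path only enqueues raw entries, and a single consumer applies the configured views off the critical path:

```go
package example

import "time"

// entry is one recorded measurement plus the labels known at record time.
type entry struct {
	measure string
	value   float64
	labels  map[string]string
	at      time.Time
}

// aggregator applies one configured view (aggregation + label keys) to entries.
type aggregator interface {
	apply(e entry)
}

type recorder struct {
	queue chan entry   // producer-consumer queue; Record only enqueues
	views []aggregator // configured by the application owner
}

// Record runs on the critical path and does no aggregation itself.
func (r *recorder) Record(e entry) {
	select {
	case r.queue <- e:
	default:
		// Queue is full: drop (or count the drop) rather than block the request path.
	}
}

// run is the consumer thread: it applies every configured view to each entry.
func (r *recorder) run() {
	for e := range r.queue {
		for _, v := range r.views {
			v.apply(e)
		}
	}
}
```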


## Open questions

This RFC is co-dependent on several others; it's an open question how to address this concern if the other RFCs are not accepted.
Member:

I shared the same concern in my previous question on this PR. I think it makes more sense to have one or two independent RFCs than a chain. I don't feel that strongly right now during the review, but when we merge this we can probably also merge the RFCs into one or two independent ones.


Define a new Metric type named "Measure" to cover existing "Raw" statistics uses.

## Motivation
Member:

I agree with the motivation; see open-telemetry/opentelemetry-specification#145, which proposes removing the Measurement class and moving the Record method onto the Measure.


## Explanation

This proposal suggests we think about which aggregations apply to a metric independently of its type. A MeasureMetric could be used to produce a Histogram, or a Summary, or _both_ of these aggregations simultaneously. This proposal makes the metric type independent of the aggregation type, whereas there is precedent for combining these types into one.
Member:

I think I confused you a bit because we have not defined Histograms/Summaries; see open-telemetry/opentelemetry-specification#146.

The idea is the following: we offer support for users to record metrics where the developer who instruments the code can choose between exporting a metric with pre-defined aggregation and labels (type A) or a "raw statistic" (type B, with no pre-defined aggregation or labels). The decision tries to follow these rules (a hypothetical sketch follows the list):

  1. If the metric is pre-aggregated (e.g., CPU usage is a counter in /proc) and has no labels (meaning there are no labels used to break down this metric in /proc, although constant labels can be added to it if necessary), then a type A metric should be used, because we cannot apply any other aggregation. (If you want to calculate a rate, which is possible for any counter, i.e., a monotonically increasing metric, that can be done in all modern backends: Prometheus, Stackdriver, etc.)
  2. If the metric is recorded with every request (rpc latency, bytes received, bytes sent) and the developer of the instrumentation is not the application owner (a third-party library, like OpenTelemetry itself), then a type B metric should be used. This way the application owner can decide how to aggregate these metrics.
  3. If the metric is a simple metric that does not make sense to aggregate in a different way (e.g., queue length, which is a gauge and probably does not require any labels for breakdown), then a type A metric should be used.
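
A hypothetical Go sketch of the type A / type B split described by these rules; the constructor and interface names are invented for illustration and do not come from any released API:

```go
package example

// Hypothetical instrument constructors, for illustration only.

type Cumulative interface{ Add(v float64) }
type Gauge interface{ Set(v float64) }
type Measure interface{ Record(v float64) }

type Meter interface {
	NewCumulative(name string) Cumulative
	NewGauge(name string) Gauge
	NewMeasure(name string) Measure
}

// Rules 1 and 3: the instrumentation already knows the only sensible
// aggregation, so it uses pre-aggregated "type A" instruments.
func reportProcessStats(m Meter) {
	cpu := m.NewCumulative("process_cpu_seconds") // counter copied from /proc
	queue := m.NewGauge("work_queue_length")      // last value is what matters

	cpu.Add(0.25)
	queue.Set(42)
}

// Rule 2: a third-party library records raw per-request values as a "type B"
// Measure and defers aggregation and label keys to the application owner.
func recordRequest(m Meter, latencyMs float64) {
	m.NewMeasure("rpc_latency_ms").Record(latencyMs)
}
```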

@bogdandrutu (Member) commented:

We will have a meeting tomorrow (July 9) at 1pm PST. Please use https://lightstep.zoom.us/j/516112682

@@ -0,0 +1,37 @@
# Pre-defined label support for all metric operations
Member:

One thing I remember from the meeting about this RFC is how, for example, gRPC will set grpc_method. In OpenCensus we decided to set the method as a Tag, and the measure reads the value from the tags, not from the extra labels provided by gRPC. The main reason is that we want other metrics recorded for this request to be able to use the grpc_method tag as a label as well.

For the status we used the extra labels (the measure labels proposed in this RFC), because the status is only known at the end of the request and cannot be used in other metrics.

So the main label that a system like gRPC adds, the request (method) name, is read from the Tags (a.k.a. DistributedContext).
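
A short Go sketch of that split against the OpenCensus API (the measure and key names are illustrative rather than the exact ones from the gRPC plugin): grpc_method is set as a Tag at the start of the request, while grpc_status is recorded as a measure-level extra label at the end:

```go
package example

import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/tag"
)

var (
	// serverLatencyMs stands in for the latency measure a gRPC plugin would define.
	serverLatencyMs = stats.Float64("grpc_server_latency", "Server latency", stats.UnitMilliseconds)

	keyMethod = tag.MustNewKey("grpc_method") // a Tag, so other metrics can reuse it
	keyStatus = tag.MustNewKey("grpc_status") // only known at the end of the request
)

// startRPC puts grpc_method into the context as a Tag, so every metric recorded
// during this request can break down by it.
func startRPC(ctx context.Context, fullMethod string) context.Context {
	ctx, _ = tag.New(ctx, tag.Upsert(keyMethod, fullMethod))
	return ctx
}

// finishRPC records the latency; grpc_status is passed as an "extra" measure
// label because it is only known here and is not useful to other metrics.
func finishRPC(ctx context.Context, latencyMs float64, status string) {
	_ = stats.RecordWithTags(ctx,
		[]tag.Mutator{tag.Upsert(keyStatus, status)},
		serverLatencyMs.M(latencyMs))
}
```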

@@ -0,0 +1,73 @@
# Let Metrics support configurable, recommended aggregations
Member:

We need to allow users to configure the aggregation for the "raw statistics" as well as the set of labels. In the OpenCensus case, the full set of supported labels is the Tags (available in the context when the record method is called) + the extraLabels (recorded explicitly with every request), and users should define a subset from there.

@bogdandrutu (Member) commented:

During the meeting there was a mention of CounterVec in Prometheus. Based on my understanding, a Prometheus CounterVec represents what we call a Counter in OpenTelemetry, and a Prometheus Counter represents what we call a TimeSeries.

So we do support the same things for Counters and Gauges.

@jmacd (Contributor, Author) commented Jul 20, 2019

I've put together a PR to demonstrate how I would satisfy these RFCs in Go.

jmacd/opentelemetry-go#2

I'd prefer not to drag out the individual threads above. We have a meeting to discuss on Monday.

I think RFC 0001 has good support so far.

RFCs 0002 and 0004 created some confusion. I believe the requirement is to allow the creation of metrics that do not have a semantic interpretation, so that SDKs can do what's best rather than necessarily following the programmer's recommendation. I agree with this goal, and I see the main change here as a compromise: let the programmer recommend a good default at the place where metrics are defined, but call it "advisory". This way the SDK can still implement the behavior it wants, potentially disregarding the programmer's input.
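
Hypothetically, that "advisory" recommendation could be expressed as an option at the definition site, roughly like the Go sketch below. All names here are invented (this is not the API in jmacd/opentelemetry-go#2); the point is only that the recommendation lives with the definition and the SDK is free to ignore it:

```go
package example

// Hypothetical API sketch: the recommended aggregation is a hint recorded at
// the definition site; an SDK (or a View configured by the application owner)
// is free to override or ignore it.

type Aggregation int

const (
	AggSum Aggregation = iota
	AggCount
	AggDistribution
	AggLastValue
)

type measureConfig struct {
	recommended []Aggregation
}

type MeasureOption func(*measureConfig)

// WithRecommendedAggregation attaches an advisory default aggregation.
func WithRecommendedAggregation(aggs ...Aggregation) MeasureOption {
	return func(c *measureConfig) { c.recommended = aggs }
}

type Measure struct {
	name        string
	recommended []Aggregation // advisory only
}

func NewMeasure(name string, opts ...MeasureOption) *Measure {
	cfg := &measureConfig{}
	for _, o := range opts {
		o(cfg)
	}
	return &Measure{name: name, recommended: cfg.recommended}
}

// Usage: the instrumentation recommends a distribution but does not require it.
var rpcLatency = NewMeasure("rpc_latency_ms", WithRecommendedAggregation(AggDistribution))
```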

iredelmeier referenced this pull request in iredelmeier/oteps Jul 26, 2019
@jmacd (Contributor, Author) commented Jul 31, 2019

I will merge this in the "proposed" state, as discussed in the spec meeting today.

@jmacd (Contributor, Author) commented Jul 31, 2019

@iredelmeier Actually, I can't merge this. I wasn't sure where to write the current status, so I put it at the top. Will you have a look?

@bhs (Contributor) commented Aug 4, 2019

@jmacd what's the current status on this PR? What are we blocked on, if anything? Thanks in advance...

@bogdandrutu (Member) commented:

Some high-level comments before we can merge this:

  1. Use the path suggested in the README (text/XYZ).
  2. Consider merging these RFCs into one.
  3. Consider adding details about the MeasureBatch.

@jmacd (Contributor, Author) commented Aug 12, 2019

@iredelmeier I've assigned numbers 3, 4, 5, and 6 to these RFCs, which I think can now be submitted in the Status: proposed state as discussed at this morning's otel-metrics meeting.

@bogdandrutu (Member) left a comment:

Sorry for probably being a bit late to this. I am not 100% confident about the last RFC; can we split this PR, merge the first three (to which we have no objections), and discuss the fourth one a bit more?


This change, while it eliminates the need for a Raw statistics concept, potentially introduces new required concepts. Whereas Raw statistics have no directly-declared aggregations, introducing MeasureMetric raises the question of which aggregations apply. We will propose how a programmer can declare recommended aggregations (and good defaults) in RFC 0006-configurable-aggregation.

## Prior art and alternatives
Member:

I think, based on the discussions, we do agree that a Measure is different from a Summary/Histogram. A Measure can produce a Summary or a Histogram, but it can also produce a Counter.

There are two reasons we know of to maintain a low-level API:

1. For _generality_. An application that forwards metrics from another source may need to handle metrics in generic code. For these applications, having type-specific Metric handles could actually require more code to be written, whereas the low-level `stats.Record` API is more amenable to generic use.
1. For _atomicity_. An application that wishes to record multiple statistics in a single operation can confidently compute formulas based on multiple metrics without worrying about inconsistent views of the data.
Member:

We decided that we will support a MeasureBatch.
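
For reference, OpenCensus Go already has this batch-record shape, where a single `stats.Record` call accepts several measurements; a MeasureBatch could offer the same atomicity guarantee. The measure names and units below are illustrative:

```go
package example

import (
	"context"

	"go.opencensus.io/stats"
)

var (
	bytesIn   = stats.Int64("bytes_received", "Bytes received", stats.UnitBytes)
	bytesOut  = stats.Int64("bytes_sent", "Bytes sent", stats.UnitBytes)
	latencyMs = stats.Float64("latency", "Request latency", stats.UnitMilliseconds)
)

func recordRequest(ctx context.Context, in, out int64, ms float64) {
	// One call records all three measurements together, so views derived from
	// them (for example, a bytes-per-millisecond formula) never observe a
	// partially updated set.
	stats.Record(ctx, bytesIn.M(in), bytesOut.M(out), latencyMs.M(ms))
}
```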

1. Unless the programmer declares otherwise, suggesting a default aggregation:
   1. For Gauges: LAST_VALUE is interesting, SUM is likely not interesting
   1. For Cumulatives: SUM is interesting, LAST_VALUE is likely not interesting
   1. For Measures: all aggregations apply; the default is MIN, MAX, SUM, COUNT.
Member:

I don't think you need min/max. Based on OpenMetrics, and also Prometheus/Stackdriver (Stackdriver has a max/min but does not use it), they do not have min/max.

@bogdandrutu (Member) commented:

Please also rebase.

@jmacd (Contributor, Author) commented Aug 12, 2019

OK. I will merge the first three into one RFC and leave the last one separate. I will post this later today.

@jmacd (Contributor, Author) commented Aug 13, 2019

Please take another look, @bogdandrutu. I've rebased and merged the first three RFCs into one. The remaining RFCs are numbers 3 and 4.

@bogdandrutu (Member) commented:

Sorry, probably my English was not the best. I meant that we should merge this PR with the first 3 RFCs (or, as it stands now, with what is RFC number 3) and open a new PR with RFC 4. I have some questions and concerns about letting users define the aggregation during instrumentation.

Does that make sense?

@jmacd (Contributor, Author) commented Aug 13, 2019

I thought our goal was to merge these PRs so that the individual RFCs can be assigned unique numbers, and then we can discuss them via new PRs against these drafts, which will have Status: proposed. There are two RFCs, and we can discuss the now-combined-for-approval RFC 3 independently from the still-needs-discussion RFC 4 about aggregation support.

@bogdandrutu merged commit 25ab28a into open-telemetry:master on Aug 13, 2019
@jmacd deleted the jmacd/metrics_part1 branch on September 13, 2019 18:16