Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC 0003, 0007 updates (Metrics measure type, metrics handle specification), RFC 0004 (configurable aggregation) deleted #29

Merged
merged 26 commits into from
Sep 12, 2019
Merged
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
132 changes: 114 additions & 18 deletions text/0003-measure-metric-type.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,50 +2,146 @@

**Status:** `proposed`

## Forward
# Foreward
jmacd marked this conversation as resolved.
Show resolved Hide resolved

This propsal was originally split into three semi-related parts. Based on the feedback, they are now combined here into a single proposal. The original proposals were:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/propsal/proposal


000x-metric-pre-defined-labels
000x-metric-measure
000x-eliminate-stats-record

## Overview
### Updated 8/27/2019

Introduce a `Measure` type of metric object that supports a `Record` API. Like existing `Gauge` and `Cumulative` metrics, the new `Measure` metric supports pre-defined labels. A new measurement batch API is introduced for recording multiple metric observations simultaneously.
A working group convened on 8/21/2019 to discuss and debate the two metrics RFCs (0003 and 0004) and several surrounding concerns. This document has been revised with related updates that were agreed upon during this working session. See the [meeting notes](https://docs.google.com/document/d/1d0afxe3J6bQT-I6UbRXeIYNcTIyBQv4axfjKF4yvAPA/edit#).

## Motivation
# Overview

In the current `Metric.GetOrCreateTimeSeries` API for Gauges and Cumulatives, the caller obtains a `TimeSeries` handle for repeatedly recording metrics with certain pre-defined label values set. This is an important optimization, especially for exporting aggregated metrics.
Introduce a `Measure` kind of metric object that supports a `Record` API method. Like existing `Gauge` and `Cumulative` metrics, the new `Measure` metric supports pre-defined labels. A new measurement batch API is introduced for recording multiple metric observations simultaneously.

The use of pre-defined labels improves usability too, for working with metrics in code. Application programs with long-lived objects and associated Metrics can compute predefined label values once (e.g., in a constructor), rather than once per call site.
## Terminology

The current raw statistics API does not support pre-defined labels. This RFC replaces the raw statistics API by a new, general-purpose type of metric, `MeasureMetric`, generally intended for recording individual measurements the way raw statistics did, with added support for pre-defined labels.
This RFC changes how "Measure" is used in the OpenTelemetry metrics specification. Before, "Measure" was the name of a series of raw measurements. After, "Measure" is the kind of a metric object used for recording a series raw measurements.

The former raw statistics API supported all-or-none recording for interdependent measurements. This RFC introduces a `MeasurementBatch` to support recording batches of metric observations.
Since this document will be read in the future after the proposal has been written, uses of the word "current" lead to confusion. For this document, the term "preceding" refers to the state that was current prior to these changes.

## Explanation
# Motivation

In the current proposal, Metrics are used for pre-aggregated metric types, whereas Raw statistics are used for uncommon and vendor-specific aggregations. The optimization and the usability advantages gained with pre-defined labels should be extended to Raw statistics because they are equally important and equally applicable. This is a new requirement.
In the preceding `Metric.GetOrCreateTimeSeries` API for Gauges and Cumulatives, the caller obtains a `TimeSeries` handle for repeatedly recording metrics with certain pre-defined label values set. This enables an important optimization for exporting pre-aggregated metrics, since the implementation is able to compute the aggregate summary "entry" using a pointer or fast table lookup. The efficiency gain requires that the aggregation keys be a subset of the pre-defined labels.

For example, where the application wants to compute a histogram of some value (e.g., latency), there's good reason to pre-aggregate such information. In this example, it allows an implementation to effienctly export the histogram of latencies "grouped" into individual results by label value(s).
Application programs with long-lived objects and associated Metrics can take advantage of pre-defined labels by computing label values once per object (e.g., in a constructor), rather than once per call site. In this way, the use of pre-defined labels improves the usability of the API as well as makes an important optimization possible to the implementation.

The new `MeasureMetric` API satisfies the requirements of a single-argument call to record raw statistics, but the raw statistics API had secondary purpose, that of supporting recording multiple observed values simultaneously. This proposal introduces a `MeasurementBatch` API to record multiple metric observations in a single call.
The preceding raw statistics API did not specify support for pre-defined labels. This RFC replaces the raw statistics API by a new, general-purpose kind of metric, `MeasureMetric`, generally intended for recording individual measurements like the preceding raw statistics API, with explicit support for pre-defined labels.

## Internal details
The preceding raw statistics API supported all-or-none recording for interdependent measurements. This RFC introduces a `RecordBatch` API to support recording batches of measurements in a single API call, where a `Measurement` is now defined as a tuple of `MeasureMetric`, `Value` (integer or floating point), and `Labels`.

The type known as `MeasureMetric` is a direct replacement for the raw statistics `Measure` type. The `MeasureMetric.Record` method records a single observation of the metric. The `MeasureMetric.GetOrCreateTimeSeries` supports pre-defined labels, just the same as `Gauge` and `Cumulative` metrics.
# Explanation

## Trade-offs and mitigations
The common use for `MeasureMetric`, like the preceding raw statistics API, is for reporting information about rates and distributions over structured, numerical event data. Measure metrics are the most general-purpose of metrics. Informally, the individual metric event has a logical format expressed as one primary key=value (the metric name and a numerical value) and any number of secondary key=values (the labels, resources, and context).

This Measure Metric API is conceptually close to the Prometheus [Histogram, Summary, and Untyped metric types](https://prometheus.io/docs/concepts/metric_types/), but there is no way in OpenTelemetry to distinguish these cases at the declaration site, in code. This topic is covered in 0004-metric-configurable-aggregation.
metric_name=_number_
pre_defined1=_any_value_
pre_defined2=_any_value_
...
resource1=_any_value_
jmacd marked this conversation as resolved.
Show resolved Hide resolved
resource2=_any_value_
...
context_tag1=_any_value_
context_tag2=_any_value_
...

Here, "pre_defined" keys are those captured in the metrics handle, "resource" keys are those configured when the SDK was initialized, and "context_tag" keys are those propagated via context.

Events of this form can logically capture a single update to a named metric, whether a cumulative, gauge, or measure kind of metric. This logical structure defines a _low-level encoding_ of any metric event, across the three kinds of metric. An SDK could simply encode a stream of these events and the consumer, provided access to the metric definition, should be able to interpret these events according to the semantics prescribed for each kind of metric.

## Metrics API concepts

The `Meter` interface represents the metrics portion of the OpenTelemetry API.

There are three kinds of metric instrument, `CumulativeMetric`, `GaugeMetric`, and `MeasureMetric`.

Metric instruments are constructed by the API, they are not constructed by any specific SDK.

| Name | A string. |
jmacd marked this conversation as resolved.
Show resolved Hide resolved
| Kind | One of Cumulative, Gauge, or Measure. |
| Keys | List of always-defined keys in handles for this metric. |
| Unit | The unit of measurement being recorded. |
| Description | Information about this metric. |

See the specification for more information on these fields, including formatting and uniqueness requirements. To define a new metric, use one of the language-specific API methods (e.g., with names like `NewCumulativeMetric`, `NewGaugeMetric`, or `NewMeasureMetric`).

Metric instrument Handles are SDK-provided objects that combine a metric instrument with a set of pre-defined labels. Handles are obtained by calling a language-specific API method (e.g., `GetHandle`) on the metric instrument with its label values. Handles may be used to `Set()`, `Add()`, or `Record()` metrics according to their kind. The `Set()`, `Add()`, and `Record()`
jmacd marked this conversation as resolved.
Show resolved Hide resolved

## Selecting Metric Kind

By separation of API and implementation in OpenTelemetry, we know that an implementation is free to do _anything_ in response to a metric API call. By the low-level interpretation defined above, all metric events have the same structural representation, only their logical interpretation varies according to the metric definition. Therefore, we select metric kinds based on two primary concerns:

1. What should be the default implementation behavior? Unless configured otherwise, how should the implementation treat this metric variable?
1. How will the program source code read? Each metric uses a different verb, which helps convey meaning and describe default behavior. Cumulatives have an `Add()` method. Gauges have a `Set()` method. Measures have a `Record()` method.

To guide the user in selecting the right kind of metric for an application, we'll consider the following questions about the primary intent of reporting given data. We use "of primary interest" here to mean information that is almost certainly useful in understanding system behavior. Consider these questions:

- Does the measurement represent a quantity of something? Is it also non-negative?
- Is the sum a matter of primary interest?
- Is the event count a matter of primary interest?
- Is the distribution (p50, p99, etc.) a matter of primary interest?

The specification will be updated with the following guidance.

### Cumulative metric

Likely to be the most common kind of metric, cumulative metric events express the computation of a sum. Choose this kind of metric when the value is a quantity, the sum is of primary interest, and the event count and distribution are not of primary interest. To raise (or lower) a cumulative metric, call the `Add()` method.

If the quantity in question is always non-negative, it implies that the sum is strictly ascending. When this is the case, the cumulative metric also serves to define a rate. For this reason, cumulative metrics have an option to be declared as non-negative. The API will reject negative updates to non-negative cumulative metrics, instead submitting an SDK error event, which helps ensure meaningful rate calculations.

For cumulative metrics, the default OpenTelemetry implementation exports the sum of event values taken over an interval of time.

### Gauge metric

Gauge metrics express a pre-calculated value that is either `Set()` by explicit instrumentation or observed through a callback. Generally, this kind of metric should be used when the metric cannot be expressed as a sum or a rate because the measurement interval is arbitrary. Use this kind of metric when the measurement is not a quantity, and the sum and event count are not of interest.

Only the gauge kind of metric supports observing the metric via a gauge `Observer` callback (as an option). Semantically, there is an important difference between explicitly setting a gauge and observing it through a callback. In case of setting the gauge explicitly, the call happens inside of an implicit or explicit context. The implementation is free to associate the explicit `Set()` event with a context, for example. When observing gauge metrics via a callback, there is no context associated with the event.

As a special case, to support existing metrics infrastructure, a gauge metric may be declared as a precomputed sum, in which case it is defined as strictly ascending. The API will reject descending updates to strictly-ascending gauges, instead submitting an SDK error event.

For gauge metrics, the default OpenTelemetry implementation exports the last value that was explicitly `Set()`, or if using a callback, the current value from the Observer.

### Measure metric

Measure metrics express a distribution of values. This kind of metric should be used when the count or rate of events is meaningful and either:

1. The sum is of interest in addition to the count (rate)
1. Quantile information is of interest.

The key property of a measure metric event is that computing quantiles and/or summarizing a distribution (e.g., via a histogram) may be expensive. Not only will implementations have various capabilities and algorithms for this task, users may wish to control the quality and cost of aggregating measure metrics.

Like cumulative metrics, non-negative measures are an important case because they support rate calculations. As an option, measure metrics may be declared as non-negative. The API will reject negative metric events for non-negative measures, instead submitting an SDK error event.

Because measure metrics have such wide application, implementations are likely to provide configurable behavior. OpenTelemetry may provide such a facility in its standard SDK, but in case no configuration is provided by the application, a low-cost policy is specified as the default behavior, whic is to export the sum, the count (rate), the minimum value, and the maximum value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure the min/max are relevant. can you please point me to how this can be useful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are P0 and P100 of the distribution, and they're the only two quantiles that can be computed in O(1) space. I realize there are problems in aggregating these, but I've used this sort of summary to get an idea of the worst-case latency in a high-throughput section of code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way how cloudwatch makes use of this is by computing what we called "interval stats". They require you to have a period and always send the metric that represents the statistic for this period of time (kind of already rated) so that's how they can do some aggregation :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I can see this as one of the aggregation "Statistic over time", then min/max are useful.


### Disable selected metrics by default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand this. Do we offer an "enable" option when a metric is created? If so, please mention that clearly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote this section to say that enabled is the default and Disabled is an option.


All OpenTelemetry metrics may be disabled by default, as an option. Use this option to indicate that the default implementation should be to do nothing for events about this metric.

### RecordBatch API

Applications sometimes want to record multiple metrics in a single API call, either becase the values are inter-related or because it lowers overhead. We agree that recording batch measurements will be restricted to measure metrics, although this support could be extended to all kinds of metric in the future.

Logically, a measurement is defined as:

- Measure metric: which metric is being updated
- Value: a floating point or integer
- Pre-defined label values: associated via metrics API handle

The batch measurement API uses a language-specific method name (e.g., `RecordBatch`). The entire batch of measurements takes place within some (implicit or explicit) context.

## Prior art and alternatives

Prometheus supports the notion of vector metrics, which are those which support pre-defined labels. The vector-metric API supports a variety of methods like `WithLabelValues` to associate labels with a metric handle, similar to `GetOrCreateTimeSeries` in OpenTelemetry. As in this proposal, Prometheus supports a vector API for all metric types.
Prometheus supports the notion of vector metrics, which are those that support pre-defined labels for a specific set of Keys. The vector-metric API supports a variety of methods like `WithLabelValues` to associate labels with a metric handle, similar to `GetHandle` in OpenTelemetry. As in this proposal, Prometheus supports a vector API for all metric types.

Statsd libraries generally report metric events individually. To implement statsd reporting from the OpenTelemetry, a `Meter` SDK would be installed that converts metric events into statsd updates.

## Open questions
jmacd marked this conversation as resolved.
Show resolved Hide resolved

Argument ordering has been proposed as the way to pass pre-defined label values in `GetOrCreateTimeseries`. The argument list must match the parameter list exactly, and if it doesn't we generally find out at runtime or not at all. This model has more optimization potential, but is easier to misuse, than the alternative. The alternative approach is to always pass label:value pairs to `GetOrCreateTimeseries`, as opposed to an ordered list of values.
Argument ordering has been proposed as the way to pass pre-defined label values in `GetHandle`. The argument list must match the parameter list exactly, and if it doesn't we generally find out at runtime or not at all. This model has more optimization potential, but is easier to misuse than the alternative. The alternative approach is to always pass label:value pairs to `GetOrCreateTimeseries`, as opposed to an ordered list of values.

The same discussion can be had for the `MeasurementBatch` type described here. It can be declared with an ordered list of metrics, then the `Record` API takes only an ordered list of numbers. Alternatively, and less prone to misuse, the `MeasurementBatch.Record` API could be declared with a list of metric:number pairs.