Metrics API RFCs to eliminate Raw statistics #4

Merged
merged 15 commits into from
Aug 13, 2019
51 changes: 51 additions & 0 deletions text/0003-measure-metric-type.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# Consolidate pre-aggregated and raw metrics APIs

**Status:** `proposed`

## Foreword

This proposal was originally split into three semi-related parts. Based on feedback, they are now combined into a single proposal. The original proposals were:

000x-metric-pre-defined-labels
000x-metric-measure
000x-eliminate-stats-record

## Overview

Introduce a `Measure` type of metric object that supports a `Record` API. Like existing `Gauge` and `Cumulative` metrics, the new `Measure` metric supports pre-defined labels. A new measurement batch API is introduced for recording multiple metric observations simultaneously.

## Motivation

In the current `Metric.GetOrCreateTimeSeries` API for Gauges and Cumulatives, the caller obtains a `TimeSeries` handle for repeatedly recording metrics with certain pre-defined label values set. This is an important optimization, especially for exporting aggregated metrics.

The use of pre-defined labels improves usability too, for working with metrics in code. Application programs with long-lived objects and associated Metrics can compute predefined label values once (e.g., in a constructor), rather than once per call site.

The current raw statistics API does not support pre-defined labels. This RFC replaces the raw statistics API with a new, general-purpose type of metric, `MeasureMetric`, intended for recording individual measurements the way raw statistics did, with added support for pre-defined labels.

The former raw statistics API supported all-or-none recording for interdependent measurements. This RFC introduces a `MeasurementBatch` to support recording batches of metric observations.

## Explanation

In the current proposal, Metrics are used for pre-aggregated metric types, whereas Raw statistics are used for uncommon and vendor-specific aggregations. The optimization and the usability advantages gained with pre-defined labels should be extended to Raw statistics because they are equally important and equally applicable. This is a new requirement.

For example, where the application wants to compute a histogram of some value (e.g., latency), there is good reason to pre-aggregate that information. Pre-aggregation allows an implementation to efficiently export the histogram of latencies, grouped into individual results by label value(s).

The new `MeasureMetric` API satisfies the requirements of a single-argument call to record raw statistics, but the raw statistics API had a secondary purpose: recording multiple observed values simultaneously. This proposal introduces a `MeasurementBatch` API to record multiple metric observations in a single call.

## Internal details

The type known as `MeasureMetric` is a direct replacement for the raw statistics `Measure` type. The `MeasureMetric.Record` method records a single observation of the metric. The `MeasureMetric.GetOrCreateTimeSeries` supports pre-defined labels, just the same as `Gauge` and `Cumulative` metrics.
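
As a rough illustration of the shape described above, the following Go sketch models a measure-type metric whose handles carry pre-defined label values. Everything here is an assumption made for illustration: the RFC does not fix signatures, so the `MeasureTimeSeries` type, the variadic constructor, the error return, and placing `Record` on the handle are hypothetical choices, and raw values are stored only so the example can inspect them.

```go
package main

import (
	"fmt"
	"strings"
)

// MeasureTimeSeries is a hypothetical handle bound to one set of
// pre-defined label values. It keeps raw observations only so the
// example can inspect them; a real SDK would aggregate instead.
type MeasureTimeSeries struct {
	labelValues []string
	values      []float64
}

// Record records a single observation on this handle.
func (ts *MeasureTimeSeries) Record(v float64) {
	ts.values = append(ts.values, v)
}

// MeasureMetric is a hypothetical measure-type metric declared with
// an ordered list of label keys.
type MeasureMetric struct {
	name      string
	labelKeys []string
	series    map[string]*MeasureTimeSeries
}

// NewMeasureMetric declares a metric and its pre-defined label keys.
func NewMeasureMetric(name string, labelKeys ...string) *MeasureMetric {
	return &MeasureMetric{
		name:      name,
		labelKeys: labelKeys,
		series:    map[string]*MeasureTimeSeries{},
	}
}

// GetOrCreateTimeSeries returns the handle for the given label values,
// which are matched to the label keys by position.
func (m *MeasureMetric) GetOrCreateTimeSeries(labelValues ...string) (*MeasureTimeSeries, error) {
	if len(labelValues) != len(m.labelKeys) {
		return nil, fmt.Errorf("%s: got %d label values, want %d",
			m.name, len(labelValues), len(m.labelKeys))
	}
	key := strings.Join(labelValues, "\x00")
	ts, ok := m.series[key]
	if !ok {
		ts = &MeasureTimeSeries{labelValues: labelValues}
		m.series[key] = ts
	}
	return ts, nil
}

func main() {
	latency := NewMeasureMetric("ex.com/latency", "method", "status")

	// Compute the handle once, then record repeatedly without
	// re-specifying labels at each call site.
	ts, err := latency.GetOrCreateTimeSeries("GET", "200")
	if err != nil {
		panic(err)
	}
	ts.Record(12.5)
	ts.Record(7.0)

	again, _ := latency.GetOrCreateTimeSeries("GET", "200")
	fmt.Println(again == ts, len(again.values))
}
```

The second lookup returns the same handle, which is what lets long-lived objects resolve their label values once, for example in a constructor.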

## Trade-offs and mitigations

This Measure Metric API is conceptually close to the Prometheus [Histogram, Summary, and Untyped metric types](https://prometheus.io/docs/concepts/metric_types/), but there is no way in OpenTelemetry to distinguish these cases at the declaration site, in code. This topic is covered in 0004-metric-configurable-aggregation.

## Prior art and alternatives

Prometheus supports the notion of vector metrics, which are metrics that support pre-defined labels. The vector-metric API provides methods such as `WithLabelValues` to associate labels with a metric handle, similar to `GetOrCreateTimeSeries` in OpenTelemetry. As in this proposal, Prometheus supports a vector API for all metric types.

## Open questions

Argument ordering has been proposed as the way to pass pre-defined label values in `GetOrCreateTimeSeries`. The argument list must match the parameter list exactly, and if it does not, we generally find out at runtime or not at all. This model has more optimization potential than the alternative, but is easier to misuse. The alternative approach is to always pass label:value pairs to `GetOrCreateTimeSeries`, as opposed to an ordered list of values.

The same discussion can be had for the `MeasurementBatch` type described here. It can be declared with an ordered list of metrics, in which case the `Record` API takes only an ordered list of numbers. Alternatively, and less prone to misuse, the `MeasurementBatch.Record` API could be declared with a list of metric:number pairs.
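
The two batch shapes discussed above can be contrasted in a small Go sketch. All names here (`Measure`, `OrderedBatch`, `Measurement`, `RecordBatch`) are hypothetical stand-ins, not part of the proposal, and the metrics store raw values only so the result can be inspected.

```go
package main

import (
	"errors"
	"fmt"
)

// Measure is a hypothetical stand-in for a measure-type metric; it
// only stores recorded values so the example can inspect them.
type Measure struct {
	Name   string
	Values []float64
}

// OrderedBatch is declared once with an ordered list of metrics.
type OrderedBatch struct {
	metrics []*Measure
}

func NewOrderedBatch(metrics ...*Measure) *OrderedBatch {
	return &OrderedBatch{metrics: metrics}
}

// Record takes one number per metric, matched by position. The count
// is checked first, so either all values are recorded or none are.
func (b *OrderedBatch) Record(values ...float64) error {
	if len(values) != len(b.metrics) {
		return errors.New("batch: value count does not match metric count")
	}
	for i, v := range values {
		b.metrics[i].Values = append(b.metrics[i].Values, v)
	}
	return nil
}

// Measurement is one metric:number pair.
type Measurement struct {
	Metric *Measure
	Value  float64
}

// RecordBatch is the pair-based alternative: each number names the
// metric it belongs to, so argument order no longer matters.
func RecordBatch(ms ...Measurement) {
	for _, m := range ms {
		m.Metric.Values = append(m.Metric.Values, m.Value)
	}
}

func main() {
	bytesIn := &Measure{Name: "ex.com/bytes_in"}
	latency := &Measure{Name: "ex.com/latency_ms"}

	batch := NewOrderedBatch(bytesIn, latency)
	if err := batch.Record(512, 3.2); err != nil {
		panic(err)
	}
	// A swapped call such as batch.Record(3.2, 512) would also succeed,
	// silently attributing each number to the wrong metric.

	RecordBatch(
		Measurement{Metric: latency, Value: 4.1},
		Measurement{Metric: bytesIn, Value: 1024},
	)

	fmt.Println(bytesIn.Values, latency.Values)
}
```

The ordered form can catch a wrong number of values but not a wrong order, whereas the pair form trades a little verbosity for calls that cannot be misattributed.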
75 changes: 75 additions & 0 deletions text/0004-metric-configurable-aggregation.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
# Let Metrics support configurable, recommended aggregations

**Status:** `proposed`

Let the user configure recommended Metric aggregations (SUM, COUNT, MIN, MAX, LAST_VALUE, HISTOGRAM, SUMMARY).

## Motivation

In the current API proposal, Metric types like Gauge and Cumulative are mapped into specific aggregations: Gauge:LAST_VALUE and Cumulative:SUM. Depending on RFC 0003-measure-metric-type, which creates a new MeasureMetric type, this proposal introduces the ability to configure alternative, potentially multiple aggregations for Metrics. This allows the MeasureMetric type to support HISTOGRAM and SUMMARY aggregations, as an alternative to raw statistics.

## Explanation

This proposal completes the elimination of Raw statistics by recognizing that aggregations should be independent of metric type. This recognizes that _sometimes_ we have a cumulative but want to compute a histogram of increment values, and _sometimes_ we have a measure that has multiple interesting aggregations.

Following this change, we should think of the _Metric type_ as:

1. Indicating something about what kind of numbers are being recorded (i.e., the input domain, e.g., restricted to values >= 0?)
    1. For Gauges: Something pre-computed where rate or count is not relevant
    1. For Cumulatives: Something where rate or count is relevant
    1. For Measures: Something where individual values are relevant
1. Indicating something about the default interpretation, based on the action verb (Set, Inc, Record, etc.)
    1. For Gauges: the action is Set()
    1. For Cumulatives: the action is Inc()
    1. For Measures: the action is Record()
1. Unless the programmer declares otherwise, suggesting a default aggregation
    1. For Gauges: LAST_VALUE is interesting, SUM is likely not interesting
    1. For Cumulatives: SUM is interesting, LAST_VALUE is likely not interesting
    1. For Measures: all aggregations apply, default is MIN, MAX, SUM, COUNT.

## Internal details

Metric constructors should take an optional list of aggregations to override the default behavior. When a metric is constructed with an explicit list of aggregations, the implementation may use the list as a hint about which aggregations should be exported by default. However, the implementation is not bound by these recommendations in any way and is free to control which aggregations are applied.
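
One possible way to model the per-type defaults and the optional override is Go's functional-options pattern, sketched below. The `Kind` enumeration, the `Metric` struct, and the `NewMetric` constructor are hypothetical names chosen for this sketch; only the aggregation names and the per-type defaults come from this proposal.

```go
package main

import "fmt"

// Aggregation names one of the aggregations listed in this proposal.
type Aggregation string

const (
	SUM        Aggregation = "SUM"
	COUNT      Aggregation = "COUNT"
	MIN        Aggregation = "MIN"
	MAX        Aggregation = "MAX"
	LAST_VALUE Aggregation = "LAST_VALUE"
	HISTOGRAM  Aggregation = "HISTOGRAM"
	SUMMARY    Aggregation = "SUMMARY"
)

// Kind is a hypothetical enumeration of the three metric types.
type Kind int

const (
	Gauge Kind = iota
	Cumulative
	Measure
)

// defaultAggregations returns the default suggested by the metric type.
func defaultAggregations(k Kind) []Aggregation {
	switch k {
	case Gauge:
		return []Aggregation{LAST_VALUE}
	case Cumulative:
		return []Aggregation{SUM}
	default:
		return []Aggregation{MIN, MAX, SUM, COUNT}
	}
}

// Metric holds only what this sketch needs: the recommended
// aggregations are a hint that an SDK is free to ignore.
type Metric struct {
	Name        string
	Kind        Kind
	Recommended []Aggregation
}

// Option customizes a metric at construction time.
type Option func(*Metric)

// WithAggregations overrides the default recommendation.
func WithAggregations(aggs ...Aggregation) Option {
	return func(m *Metric) { m.Recommended = aggs }
}

// NewMetric applies the type's default, then any explicit options.
func NewMetric(name string, kind Kind, opts ...Option) *Metric {
	m := &Metric{Name: name, Kind: kind, Recommended: defaultAggregations(kind)}
	for _, opt := range opts {
		opt(m)
	}
	return m
}

func main() {
	depth := NewMetric("ex.com/queue_depth", Gauge)
	latency := NewMetric("ex.com/latency", Measure, WithAggregations(HISTOGRAM))
	fmt.Println(depth.Recommended, latency.Recommended)
}
```

Because the list is stored as a recommendation instead of being enforced, this shape leaves the implementation free to export a different set of aggregations.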

The standard defined aggregations are broken into two groups, those which are "decomposable" (i.e., inexpensive) and those which are not.

The decomposable aggregations are simple to define:

1. SUM: The sum of observed values.
1. COUNT: The number of observations.
1. MIN: The smallest value.
1. MAX: The largest value.
1. LAST_VALUE: The latest value.
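
The sketch below shows why these five aggregations are inexpensive: each needs only constant state per time series, and two partial results can be combined without revisiting the individual observations. Reading "decomposable" as "mergeable" is an interpretation made for this example, and the `Aggregate` type and `Merge` function are hypothetical names.

```go
package main

import "fmt"

// Aggregate holds constant-size state for all five decomposable
// aggregations. Min, Max and Last are meaningful only when Count > 0.
type Aggregate struct {
	Sum   float64
	Count int64
	Min   float64
	Max   float64
	Last  float64
}

// Observe folds one observation into the running state.
func (a *Aggregate) Observe(v float64) {
	if a.Count == 0 || v < a.Min {
		a.Min = v
	}
	if a.Count == 0 || v > a.Max {
		a.Max = v
	}
	a.Sum += v
	a.Count++
	a.Last = v
}

// Merge combines two partial aggregates, where "later" covers
// observations made after those in "earlier".
func Merge(earlier, later Aggregate) Aggregate {
	if earlier.Count == 0 {
		return later
	}
	if later.Count == 0 {
		return earlier
	}
	out := Aggregate{
		Sum:   earlier.Sum + later.Sum,
		Count: earlier.Count + later.Count,
		Min:   earlier.Min,
		Max:   earlier.Max,
		Last:  later.Last,
	}
	if later.Min < out.Min {
		out.Min = later.Min
	}
	if later.Max > out.Max {
		out.Max = later.Max
	}
	return out
}

func main() {
	var first, second Aggregate
	for _, v := range []float64{3, 1, 4} {
		first.Observe(v)
	}
	for _, v := range []float64{1, 5} {
		second.Observe(v)
	}
	fmt.Printf("%+v\n", Merge(first, second))
}
```

Note that merging LAST_VALUE requires knowing which partial result is more recent, which is why `Merge` takes its arguments in time order.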

The non-decomposable aggregations do not have standard definitions; they are purely advisory. The intentions behind them are:

1. HISTOGRAM: The intended output is a distribution summary, specifically summarizing counts into non-overlapping ranges.
1. SUMMARY: This is a more generic way to request information about a distribution, perhaps represented in some vendor-specific way that is not a histogram.
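
Since HISTOGRAM has no standard definition here, the following Go sketch shows just one possible reading of "summarizing counts into non-overlapping ranges". The explicit, ascending bucket boundaries and the convention that each boundary is an inclusive upper bound are assumptions of this example, and the `Histogram` type is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Histogram counts observations into non-overlapping ranges.
// bounds holds ascending, inclusive upper bounds; counts has one
// extra slot for values above the last bound.
type Histogram struct {
	bounds []float64
	counts []uint64
}

// NewHistogram expects bounds in ascending order (an assumption of
// this sketch; no validation is performed).
func NewHistogram(bounds ...float64) *Histogram {
	return &Histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// Record increments the count of the range that contains v.
func (h *Histogram) Record(v float64) {
	// SearchFloat64s returns the first index whose bound is >= v,
	// or len(bounds) when v exceeds every bound.
	i := sort.SearchFloat64s(h.bounds, v)
	h.counts[i]++
}

func main() {
	h := NewHistogram(10, 100)
	for _, v := range []float64{5, 10, 50, 500} {
		h.Record(v)
	}
	fmt.Println(h.counts)
}
```

Unlike the decomposable aggregations, the state here grows with the number of ranges, and the choice of boundaries is a policy decision that the proposal leaves to the implementation.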

## Example

To declare a MeasureMetric,

```
myMetric := metric.NewMeasureMetric(
	"ex.com/mymetric",
	metric.WithAggregations(metric.SUM, metric.COUNT),
	metric.WithLabelKeys(aKey, bKey),
)
```

Here, we have declared a Measure-type metric with recommended SUM and COUNT aggregations (from which the average can be computed) and with `aKey` and `bKey` as recommended aggregation dimensions. While the SDK has full control over which aggregations are actually performed, the programmer has specified a good default behavior for the implementation to use.

## Trade-offs and mitigations

This avoids requiring programmers to use the `view` API, which is an SDK API, not a user-facing instrumentation API. Letting the application programmer recommend aggregations directly gives the implementation more information about the raw statistics. Letting programmers declare their intent has few downsides, since there is a well-defined default behavior.

## Prior art and alternatives

Existing systems generally declare separate Metric types according to the desired aggregation. Raw statistics were invented to overcome this, and the present proposal brings back the ability to specify an Aggregation at the point where a Metric is defined.

## Open questions

There are questions about the value of the MIN and MAX aggregations. While they are simple to compute, they are difficult to use in practice.

There are questions about the interpretation of HISTOGRAM and SUMMARY. The point of Raw statistics was that we shouldn't specify these aggregations because they are expensive and many implementations are possible. This is still true. What is the value in specifying HISTOGRAM as opposed to SUMMARY? How is SUMMARY different from MIN/MAX/COUNT/SUM? Does it imply implementation-defined quantiles?