Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics API with RFC 0003 #87

Merged
merged 35 commits into from
Sep 12, 2019
Merged

Metrics API with RFC 0003 #87

merged 35 commits into from
Sep 12, 2019

Conversation

lzchen
Copy link
Contributor

@lzchen lzchen commented Aug 14, 2019

This is a continuation of [#68], with the initial Metrics API as well as Metrics RFC 0003 outlined here: open-telemetry/oteps#4

The RFC included three changes:

  1. Get rid of Measure and Measurement class and have Measure be another Metric (like gauge and counter)
  2. Have the ability to pass in pre-defined label values (that match the label keys when creating the metric) when creating the time series for a metric. This is an important optimization, as programs with long-lived objects can compute pre-defined label values once, rather than once per call site.
  3. The former raw stats API supported all-or-none recording of interdependent measurements, RFC introduces MeasureBatch class to support recording of multiple observed values simultaneously.

TODO:

  1. Decisions on supporting Resource and Components (should they simply be labels?)
  2. DistributedContext (might be a label)
  3. Decisions for MeasureBatch function signature (how to record)

An example recording of raw statistics:

METER = Meter()
LABEL_KEYS = [LabelKey("environment",
                       "the environment the application is running in")]
MEASURE = METER.create_float_measure("idle_cpu_percentage",
                                     "cpu idle over time",
                                     "percentage",
                                     LastValueAggregation)
LABEL_VALUE_TESTING = [LabelValue("Testing")]
LABEL_VALUE_STAGING = [LabelValue("Staging")]

# Metrics sent to some exporter
MEASURE_METRIC_TESTING = MEASURE.get_or_create_time_series(LABEL_VALUE_TESTING)
MEASURE_METRIC_STAGING = MEASURE.get_or_create_time_series(LABEL_VALUE_STAGING)

# record individual measures
idle = psutil.cpu_times_percent().idle
MEASURE_METRIC_STAGING.record(idle)

# record multiple observed values
batch = MeasureBatch()
batch.record([(MEASURE_METRIC_TESTING, idle), (MEASURE_METRIC_STAGING, idle)])

An example of pre-aggregated metrics:

METER = Meter()
LABEL_KEYS = [LabelKey("environment", 
                       "the environment the application is running in")]
COUNTER = METER.create_int_counter("sum numbers", 
                                      "sum numbers over time",
                                      "number",
                                      LABEL_KEYS)
LABEL_VALUE_TESTING = [LabelValue("Testing")]
LABEL_VALUE_STAGING = [LabelValue("Staging")]

# Metrics sent to some exporter
COUNTER_METRIC_TESTING = COUNTER.get_or_create_time_series(LABEL_VALUE_TESTING)
COUNTER_METRIC_STAGING = COUNTER.get_or_create_time_series(LABEL_VALUE_STAGING)

for i in range(100):
    COUNTER_METRIC_STAGING.add(i)

Copy link

@jmacd jmacd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API looks good. There may be more questions when we start to attach an SDK and think about how to efficiently perform pre-aggregation when it is requested.

updater_function: The callback function to execute.
"""

def remove_time_series(self,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the remove method should take a Timeseries object, instead of a LabelSet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no problem with this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might revisit this, if we decide to remove TimeSeries entirely.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we treat Metric as a registry for these value containers (TimeSeries right now, but 👍 to changing this name or axing the class completely) then I don't think we should expect to get the same instance for the same label values at different times.

Another argument for keeping this signature: We expect users to call this to stop reporting a metric for a given set of label values. If they've got the time series they can use its label values without making another call to get_or_create_timeseries here.

opentelemetry-api/src/opentelemetry/metrics/__init__.py Outdated Show resolved Hide resolved
"""

@abstractmethod
def get_default_time_series(self) -> 'object':
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking, would you agree that this is equivalent to self.get_or_create_time_series([])?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the Open Questions section of RFC 0003, this is stated as "should we eliminate GetDefaultHandle()".

Copy link
Member

@toumorokoshi toumorokoshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I have full context on the metrics API enough to be a good approver or otherwise.

But my thoughts on the UX side of things are:

  1. consolidating create methods into a single method might be worth it.
  2. allowing strings for shortcuts to things like LabelKey, LabelValue saves a ton of typing on the consumer side.
  3. I think time series is a misnomer, since it looks like just a way to get an aggregator with pre-populated tag values. I feel like even the spec should change to clarify that usage.

for the exported metric are deferred.
"""

def create_float_counter(self,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thoughts on passing in the float or int as types to the create method?

Might also be a good use case for enums:

def create(self, metric_type=METRIC_TYPE.counter, data_type=float, ...)

"""

@abstractmethod
def get_or_create_time_series(self,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is probably for the RFC or OTel spec, but is there a better description than timeseries? I think most metric systems don't expose a nuance between a timeseries and an individual logging of a measure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best explanation for the name "timeseries" is that something got lost in the translation from OC code to the OT spec.

  1. TimeSeries used to be a list of a Metric's timestamped measurements.
  2. Gauges and counters aren't meant to record multiple measurements per export interval, just report the current value (for each set of labels) at export time. To be consistent with Metrics, they could export each value as a single-point TimeSeries.
  3. Fast-forward to the OT spec: we've kept the name, but changed the underlying logic. TimeSeries is now a container for single values, and Gauges are now Metrics.

I may be missing something here, but I don't see a reason that these should still be called "time series".


# Metrics sent to some exporter
COUNTER_METRIC_TESTING = COUNTER.get_or_create_time_series(LABEL_VALUE_TESTING)
COUNTER_METRIC_STAGING = COUNTER.get_or_create_time_series(LABEL_VALUE_STAGING)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API still feel too complicated to me, but I don't think that's a reflection of this PR.

I guess maybe a question to the spec, but why are we not attempting to expose a simpler consumer API? for example, referencing prometheus/client_python:

from prometheus_client import Counter
c = Counter('my_requests_total', 'HTTP Failures', ['method', 'endpoint'])
c.labels('get', '/').inc()
c.labels('post', '/submit').inc()

The behavior behind the hood, using OTel terminology:

  1. the instantation of a global Meter() object, creating the gauge.
  2. the creation of time series with the labels provided
  3. the aggregation of the values, pivoted by the labels

Using that to clarify my thinking, I think the things that feel out of place to me are:

  1. time series. What does that really mean, to the end user? It seems like the time series object that is returned back is a counter with specific labels built in. I feel like a method with a name like counter.with_predefined_labels(LABEL_VALUE_STAGING) might make more sense.

I think if the terminology were improved a little bit, that might help things. in addition, maybe providing some shorthand, such as LabelValues being a LabelValue object or a string. If strings were accepted, you'd probably have an interface almost identical to prometheus.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like a great feature to pilot in this repo (in another PR) and push up to specs or RFCs for discussion.

"""Returns a `CounterTimeSeries` with a cumulative float value."""


class CounterInt(Metric):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity: how will these APIs be used in practice? I kind of imagine that we wouldn't hook in different implementations of a counter or gauge: the behavior feels pretty clear and might be more beneficial to remain consistent.

Again an example that makes opentelemetry-sdk basically a must to use opentelemetry-api

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We see this problem with almost every class in the metrics package. Tracing has separate API and SDK packages largely because vendors want to be able to change the implementation. Metrics has separate API and SDK packages to be consistent with tracing, but if (e.g.) vendors are going to change metrics, they're much more likely to add new aggregations or metric types than to change these built-in classes.

This touches on a deeper issue regarding the API/SDK separation. The API already includes some implementation to support context propagation without an SDK, so how do we decide where to draw the line?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a great question to discuss in the specification. I agree with all the points you've raised here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current metrics RFC discusses the semantic connection between metrics APIs and trace context. The current specification describes three kinds of metric "instrument", cumulative, gauge, and measure, which are pure API objects. Handles (known as TimeSeries currently) are SDK-specific implementations of the Add(), Set(), and Record() APIs. Nothing is said about aggregations at the API level apart from the the standard interpretation given to each, which is to say that by default cumulative metric instruments aggregate a sum, gauge metric instruments aggregate a last-value, and measure instruments aggregate a rate of some distribution.
open-telemetry/oteps#29

By separating the API from the implementation, it should be possible for to define an exporter for direct integration with existing metrics client libraries. This will require some sort of new metrics event exporter to be specified.

METER = Meter()
LABEL_KEYS = [LabelKey("environment",
"the environment the application is running in")]
MEASURE = METER.create_float_measure("idle_cpu_percentage",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there be an api to allow direct recording, if no label values are passed?

Would simplify the creation quite a bit for a trivial counter

idle_cpu = METER.create_float_measure("idle_cpu_percentage", "cpu idle over time", 'percentage")
idle_cpu.record(psutil.cpu_times_percent().idle)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added to the Open Questions section of RFC 0003 this question. I am in favor of the idea.

@toumorokoshi
Copy link
Member

Good stuff by the way! I think it's only minor issues and maybe requiring less terminology to start, the API is getting simplerl

@reyang
Copy link
Member

reyang commented Aug 17, 2019

  1. allowing strings for shortcuts to things like LabelKey, LabelValue saves a ton of typing on the consumer side.

+1 same here.

@jmacd
Copy link

jmacd commented Aug 19, 2019

@toumorokoshi I agree that "Timeseries" is not a great term. In the Go repository, it currently stands at "Handle". In OpenCensus, it was "Entry".

I also don't like the method "GetOrCreate" for this thing, whatever we call it, because "GetOrCreate" implies some behavior not the semantics implied. In a streaming implementation, there will be nothing to create, after which "get" by itself will not be very descriptive.

I'd name the method "With", following the terminology in Prometheus.

Then the code will read

  handle = metric.With(labels...)
  handle.Set(value, labels...)

@lzchen
Copy link
Contributor Author

lzchen commented Aug 22, 2019

LGTM?
TODOs:

  1. Keeping/removing TimeSeries
  2. How to handle Aggregations
  3. Which metric types are we supporting
  4. Terminology

that the metric is associated with.

Returns:
A new `GaugeFloat`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor personal opinion here, it might be better to use create_float_guage with FloatGauge rather than mixing "float gauge" and "gauge float". Up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I think I will leave this for the next PR, seeing as we are getting rid of TimeSeries and these naming changes should be done all at once for consistency.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current metrics RFC PR open-telemetry/oteps#29, I've renamed TimeSeries to Handle following a consensus reached in the 8/21 working session, but it's just a name change relative to the code here.

import typing


class CounterTimeSeries:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I have the same question regarding the TimeSeries name :)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In open-telemetry/oteps#29 this becomes Handle. Please review.

"""
def __init__(self,
value: str) -> None:
self.value = value
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's any good reasons to have mandatory containers for label values? Currently it seems to be engineered like that just for the sake of completeness.

As far as I understand majority of industry solutions, there's no support for anything but strings for labels, and it's unlikely to change in the years to come.

Copy link
Contributor Author

@lzchen lzchen Sep 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good point and a common ask. I think it will make sense to simply use strings for the label values.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think each language should decide how to handle this question. If there is some kind of built-in support for key-value in the language, they should prefer that. In the Go repository, we have a common core.KeyValue type that can be used as both an event label and a metric label. In logging (events), it's more common to have non-string values.

I'm not sure the API needs to specify anything about how a vendor should behave when, say, a numeric value is passed as the value for a label (and the compiler didn't prevent it). I would say the default behavior should coerce values to a string silently, not something to worry about.

Copy link

@GreyCat GreyCat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In overall picture of the world, at this point I believe the original distinction between "gauges" vs "counters" seems to have lost its meaning. "Counter" is now basically the same as "gauge" with some extra artificial limitations. Can we move forward and just deal with "aggregates" vs "raw measurements", agreeing on some term for both (previously suggested "metric" and "measurement" seems to fit quite well)?

@lzchen
Copy link
Contributor Author

lzchen commented Sep 3, 2019

@GreyCat
Thanks for the review. Gauge metrics express a pre-calculated value that is either "set" or observed through a callback. These metrics usually cannot be represented using a sum or rate because the measurement interval is arbitrary (such as getting cpu_usage). As well, when setting a gauge explicitly, it occurs within an implicit or explicit context, as opposed to the callback, where there is no context.

The counter metric will be changed to a "cumulative" metric, in which the value has an option of either going up or down. This will be done in a separate PR.

Cumulative expresses a computation of a sum. The meaning, purpose and behavior of these types of metrics are inherently different than those of a gauge. For those reasons, I believe there is a strong argument for keeping the distinction between the two.

@GreyCat
Copy link

GreyCat commented Sep 3, 2019

@lzchen
Sorry, I don't quite follow.

Gauge metrics express a pre-calculated value that is either "set" or observed through a callback.

Formally, the very same is allowed for "counters" now too. I don't quite follow how observing through a callback is related to these scenarios in whole: it is a very generic mechanism, applicable to all possible emissions at all.

These metrics usually cannot be represented using a sum or rate because the measurement interval is arbitrary (such as getting cpu_usage).

It actually can be (and will be). If in a certain interval we've got 3 readings of CPU usage of 40%, then 50%, then 60%, we'll happily have a sum of 40+50+60 = 150, count of 3, and thus derive that average is 150 / 3 = 50%.

It's not any different from other use cases proposed for counters. There are quite a few systems that don't make this distinction, and, what's even worse from my point of view, some systems (coming from statsd semantics, namely) use that "gauge" vs "counter" distinction for totally different purpose, having a notion of "flush interval" and either resetting (for counter) or not resetting (for gauge) on the expiration of flush interval.

The counter metric will be changed to a "cumulative" metric, in which the value has an option of either going up or down. This will be done in a separate PR.

Makes sense, thanks!

Cumulative expresses a computation of a sum.

Not really. Cumulative just expresses that there is a state, and we're aiming to have multiple ways to keep that state: that's summing, keeping min/max/last, keeping tabs on a histograms / distributions, etc.

@lzchen
Copy link
Contributor Author

lzchen commented Sep 3, 2019

@GreyCat

It actually can be (and will be). If in a certain interval we've got 3 readings of CPU usage of 40%, then 50%, then 60%, we'll happily have a sum of 40+50+60 = 150, count of 3, and thus derive that average is 150 / 3 = 50%.

I was referring to capturing the initial readings of CPU Usage. For example if the user wanted to see the CPU usage simply at certain points in time (instead of aggregating), in which the value will be set through a callback.

Not really. Cumulative just expresses that there is a state, and we're aiming to have multiple ways to keep that state: that's summing, keeping min/max/last, keeping tabs on a histograms / distributions, etc.

In this context, cumulative represents a specific type of preaggregation, which we represent with as only a sum. It's purpose is to represent metrics when the value is a quantity, the sum is of primary interest, the event count and distribution are not of primary interest. For the cases you have listed (histograms, distributions, etc.), the measure metric should be used.

But going back to your question of counter vs. gauge, the introduction of the cumulative will have a clearer picture of the distinction between that metric and gauge, which will be done in a separate PR.

@lzchen lzchen requested a review from a-feld as a code owner September 3, 2019 18:22
@jmacd
Copy link

jmacd commented Sep 3, 2019

@GreyCat I hope the current RFC 0003 and its new sibling (unnumbered, probably 0007) on metric handles explain the intent of the specification, which is to separate the API from the implementation with a clear semantic explanation. The essence of this is that there are three kinds of object with distinct verbs Add(), Set(), and Record(). Handles of these are the way an SDK can control the behavior behind these events.

See open-telemetry/oteps#29

Note that PR29 removes RFC 0004-metric-configurable-aggregation.md, which addressed the topics you raised. We decided this could be left out of the API, although it's something I personally like very much.

Currently the RFC (0003) says what the default aggregation will be for each of the three kinds of metric, and does not specify any way to influence this behavior at the site where metrics objects are used. Carrying over from OpenCensus, there is a desire to specify a "view" API for application code to declare various levels of detail in metrics reporting. As you suggest, sometimes we're interested in recording both the count and the sum for a cumulative metric, as opposed to only the count. The SDK is certainly capable of changing this behavior. It may be possible for the application to configure this behavior (via "views", maybe). Should the application be capable of suggesting or recommending the aggregation itself? That's the topic of 0004 that is currently off the table. An SDK can do this as it pleases. Please comment on PR29, thanks!

@toumorokoshi
Copy link
Member

Thanks for the focus here! Loving the direction this is going.

@lzchen lzchen merged commit fb11568 into open-telemetry:master Sep 12, 2019
@lzchen lzchen mentioned this pull request Sep 19, 2019
srikanthccv pushed a commit to srikanthccv/opentelemetry-python that referenced this pull request Nov 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants