Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wip: support native histograms along classic histograms in panels #1121

Closed
wants to merge 16 commits into from

Conversation

krajorama
Copy link
Contributor

@krajorama krajorama commented Jan 9, 2024

This branch will have force pushes.

Related to grafana/mimir#7154

Native histograms are
https://grafana.com/docs/mimir/latest/send/native-histograms/

@krajorama krajorama marked this pull request as draft January 9, 2024 12:38
This branch will have force pushes.

Native histograms are
https://grafana.com/docs/mimir/latest/send/native-histograms/

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
krajorama and others added 15 commits January 10, 2024 08:25
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Must put metric query template into parenthesis otherwise we just multiply
one term of the query.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Easier to add on to existing selector this way

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
nativeClassicHistogramQuantile(percentile, metric, selector, sum_by=[], rate_interval='$__rate_interval')::
local classicSumBy = if std.length(sum_by) > 0 then ' by (%(lbls)s) ' % { lbls: std.join(',', ['le'] + sum_by) } else ' by (le) ';
local nativeSumBy = if std.length(sum_by) > 0 then ' by (%(lbls)s) ' % { lbls: std.join(',', sum_by) } else ' ';
'histogram_quantile(%(percentile)s, sum%(nativeSumBy)s(rate(%(metric)s{%(selector)s}[%(rateInterval)s]))) or histogram_quantile(%(percentile)s, sum%(classicSumBy)s(rate(%(metric)s_bucket{%(selector)s}[%(rateInterval)s])))' % {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for reviewers: e.g.:

(histogram_quantile(0.99, sum (rate(cortex_request_duration_seconds{}[$__rate_interval]))) or
histogram_quantile(0.99, sum by (le) (rate(cortex_request_duration_seconds_bucket{}[$__rate_interval]))))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be improved.

Imagine you have a fairly long rate_interval (the problem exists for all intervals, but the longer, the more serious it gets), e.g. multiple days or so. While migrating to native histograms, you ingest classic and native histograms in parallel. Very soon, the first leg of the query above will yield a result, but it will be based just on a few samples of native histograms, or the last few minutes of the multi-day range.

So what you want is to use the leg of the query that has a complete coverage of the rate_interval (and in doubt prefer the native histogram). I'll try to come up with a way to do that in the next comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this could work:

(
      histogram_quantile(
        0.99,
        sum(rate(cortex_request_duration_seconds[1w])) and group(cortex_request_duration_seconds offset 1w)
      )
    or
      histogram_quantile(
        0.99,
          sum by (le) (rate(cortex_request_duration_seconds_bucket[1w]))
        and
          group by (le) (cortex_request_duration_seconds_bucket)
      )
  or
    histogram_quantile(0.99, sum(rate(cortex_request_duration_seconds[1w])))
)

This assumes the migration is already going from classic to native histograms. First, the query checks if there has been a native histogram 1w ago. If so, it goes for native histograms. Second, it checks if there is a classic histogram now, and uses them. As the third and final option, it uses native histograms (which is relevant for a new metric that never had classic histograms).

Of course, the stupid PromQL engine will calculate all three legs, even if only one will ever be used.

// The classicNativeHistogramSumRate function is used to calculate the histogram sum of rate from native histograms or classic histograms.
// Metric name should be provided without _sum suffix.
nativeClassicHistogramSumRate(metric, selector, rate_interval='$__rate_interval')::
'histogram_sum(rate(%(metric)s{%(selector)s}[%(rateInterval)s])) or rate(%(metric)s_sum{%(selector)s}[%(rateInterval)s])' % {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for reviewers e.g., in context of a sum:

sum(
 histogram_sum(rate(cortex_request_duration_seconds{}[$__rate_interval])) or
 rate(cortex_request_duration_seconds_sum{}[$__rate_interval])
)

@krajorama
Copy link
Contributor Author

Superseded by #1150

@krajorama krajorama closed this Feb 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants