
promql: more Kahan summation (avg) and less incremental mean calculation (avg, avg_over_time) #14413

Merged: 6 commits into main, Aug 6, 2024

Conversation

@beorn7 (Member) commented Jul 4, 2024

See various commits for details.

Studying #14074, I noticed that the tests added for the sum aggregation wouldn't work for the avg aggregation. Also, the tests already there for sum_over_time (with a series `1 1e100 1 -1e100`) wouldn't work for avg_over_time.

Simply applying Kahan summation wouldn't help. (In fact, it was already applied to avg_over_time, but of course without adding the test mentioned above.) The reason is that we always use incremental calculation of the mean value. That's a good approach if the sum of the averaged values overflows float64, but in general it's more costly and has far more potential to accumulate numerical errors.

This PR changes the behavior so that incremental calculation of the mean is only used if the sum would overflow. It also uses Kahan summation in all cases, which compensates for the errors that occur when adding small numbers to large ones. (Incremental calculation of the mean might help there, too, but I believe Kahan summation is cheaper, more accurate, and consistent with how we calculate sums.)
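For reference, Kahan summation (in its Neumaier variant) carries a compensation term alongside the running sum that captures the low-order bits plain float64 addition would discard. Below is a minimal Go sketch in the spirit of the `kahanSumInc` helper in promql/functions.go (the exact Prometheus implementation may differ in details), exercised on the `1 1e100 1 -1e100` series from the tests mentioned above:

```go
package main

import (
	"fmt"
	"math"
)

// kahanSumInc adds inc to the running sum and updates the compensation
// term c, which accumulates the low-order bits that plain float64
// addition would lose to rounding.
func kahanSumInc(inc, sum, c float64) (newSum, newC float64) {
	t := sum + inc
	// Neumaier's variant: compensate based on whichever operand is larger.
	if math.Abs(sum) >= math.Abs(inc) {
		c += (sum - t) + inc
	} else {
		c += (inc - t) + sum
	}
	return t, c
}

func main() {
	var sum, c float64
	for _, v := range []float64{1, 1e100, 1, -1e100} {
		sum, c = kahanSumInc(v, sum, c)
	}
	// Naive summation loses the two 1s next to 1e100 and yields 0.
	fmt.Println(sum + c) // prints 2
}
```

The corrected result is `sum + c`; reading `sum` alone would still give 0 here, which is why the compensation term has to be carried through to the end.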

@beorn7 beorn7 requested a review from roidelapluie as a code owner July 4, 2024 13:32
@beorn7 beorn7 marked this pull request as draft July 4, 2024 13:34
It's always used as such. Let's avoid the countless conversions.

Signed-off-by: beorn7 <beorn@grafana.com>
Signed-off-by: beorn7 <beorn@grafana.com>
@beorn7 beorn7 force-pushed the beorn7/promql branch 3 times, most recently from 18eeefb to 6c2a37d on July 4, 2024 16:56
Signed-off-by: beorn7 <beorn@grafana.com>
@beorn7 beorn7 changed the title promql: use Kahan summation for avg() promql: more Kahan summation (avg) and less incremental mean calculation (avg, avg_over_time) Jul 4, 2024
@beorn7 (Member, Author) commented Jul 4, 2024

BTW, the speedup for the avg calculation isn't significant, but note that this PR adds Kahan summation to avg, which led to a measurable slowdown for sum in #14074. In other words, the slowdown caused by Kahan summation is approximately compensated by the speedup gained from removing incremental mean calculation.

name                                          old time/op    new time/op    delta
RangeQuery/expr=avg(a_hundred),steps=1-16        224µs ± 2%     216µs ± 1%  -3.39%  (p=0.000 n=10+9)
RangeQuery/expr=avg(a_hundred),steps=100-16      683µs ± 1%     686µs ± 4%    ~     (p=0.780 n=9+10)
RangeQuery/expr=avg(a_hundred),steps=1000-16    4.29ms ± 7%    4.34ms ± 2%    ~     (p=0.393 n=10+10)

@beorn7 beorn7 marked this pull request as ready for review July 4, 2024 17:19
@beorn7 beorn7 requested review from bboreham and removed request for roidelapluie July 4, 2024 17:19
@beorn7 (Member, Author) commented Jul 4, 2024

@bboreham I nominate you as the reviewer because you inspired this through #14074.

@bboreham (Member) previously approved these changes Jul 5, 2024 and left a comment:


Looks ok, but I have a number of thoughts.

@@ -2807,21 +2809,18 @@ func (ev *evaluator) aggregation(e *parser.AggregateExpr, q float64, inputMatrix
*group = groupedAggregation{
seen: true,
floatValue: f,
floatMean: f,
bboreham (Member) commented:

why initialize this in cases where it isn't used?

beorn7 (Member, Author) replied:

In all cases where we need it, we have to initialize it that way, and in the cases where we don't need it, we are doing nothing wrong by initializing it.

Admittedly, we need it in only two cases. However, I assumed the computational effort is negligible and opted for one fewer line of code and the peace of mind that we cannot forget to initialize it by doing it here once and forever.

I don't feel strongly about this decision. If my explanation doesn't convince you, let me know.

promql/engine.go: two review threads resolved
Comment on lines 2870 to 2873
// point in copying the histogram in that case.
} else {
bboreham (Member) commented:

Could put a break here and out-dent the following.

Comment on lines +2881 to +2889
group.floatMean = group.floatValue / (group.groupCount - 1)
group.floatKahanC /= group.groupCount - 1
bboreham (Member) commented:

What sort of situation makes this useful? We've overflowed a float64, but dividing by groupCount brings us back within range?

beorn7 (Member, Author) replied:

We would have overflowed, but instead of actually doing so, we revert to incremental mean calculation, whose whole purpose is to avoid overflowing by "dividing early" (with some tricks). This is what we did before this PR, even in the (vast majority of) cases that don't overflow.

I have tried to make the comment a bit more explanatory.

bboreham (Member) replied:

Right, I mean: what sort of data will bring us to this case?
Perhaps 1,000 series, each of whose value is 2e305?
And I am hinting that it may not be a very common case.

beorn7 (Member, Author) replied:

Yes, I would assume it's very rare in practice.
And that's one of the motivations behind this change.
Prior to this change, we always did incremental mean calculation, which is expensive and less accurate.
With this change, we do "normal" mean calculation, backed by Kahan summation, and only switch to incremental mean calculation in the rare cases that we need it.
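bboreham's hypothetical above (1,000 values of 2e305 each) makes a concrete test case for the one situation where incremental mean calculation is still needed. A minimal Go sketch of that technique (a simplified illustration of "dividing early", not the actual Prometheus code):

```go
package main

import "fmt"

// incrementalMean computes the mean one sample at a time via
// mean += (x - mean) / n. Dividing early keeps the intermediate value
// within float64 range even when the plain sum would overflow, at the
// cost of an extra division per sample and more accumulated rounding
// error.
func incrementalMean(samples []float64) float64 {
	var mean float64
	for i, x := range samples {
		mean += (x - mean) / float64(i+1)
	}
	return mean
}

func main() {
	// 1,000 values of 2e305: their plain sum (2e308) exceeds the largest
	// float64 (~1.8e308), but the incrementally computed mean is finite.
	samples := make([]float64, 1000)
	for i := range samples {
		samples[i] = 2e305
	}
	var naive float64
	for _, x := range samples {
		naive += x
	}
	fmt.Println(naive)                    // prints +Inf
	fmt.Println(incrementalMean(samples)) // prints 2e+305
}
```

As the discussion above notes, data like this is rare in practice, which is why the PR makes this the fallback path rather than the default.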

promql/functions.go: review thread resolved (outdated)
for _, f := range s.Floats {
count++
if !incrementalMean {
newSum, newC := kahanSumInc(f.F, sum, c)
if count == 1 || !math.IsInf(newSum, 0) {
bboreham (Member) commented:

Why do we care about count=1 ?

beorn7 (Member, Author) replied:

We would get division-by-zero problems below if the very first value to sum up is already ±Inf. I have added a comment to explain that. In the case of the aggregators, the flow is different because we essentially deal with the first step during initialization, which happens in the same way for all the different aggregations.

The basic idea here is that the previous code was always doing
incremental calculation of the mean value, which is more costly and
can be less precise. It protects against overflows, but in most cases,
an overflow doesn't happen anyway.

The other idea applied here is to expand on #14074, where Kahan
summation was applied to sum().

With this commit, the average is calculated in a conventional way
(adding everything up and dividing at the end) as long as the sum isn't
overflowing float64. This is combined with Kahan summation so that the
avg aggregation, in most cases, is really equivalent to the sum
aggregation with a following division (which is the user's expectation,
as avg is supposed to be syntactic sugar for sum with a following
division).

If the sum hits ±Inf, the calculation reverts to incremental
calculation of the mean value. Kahan summation is also applied here,
although it cannot fully compensate for the numerical errors
introduced by the incremental mean calculation. (The tests added in
this commit would fail if incremental mean calculation was always
used.)

Signed-off-by: beorn7 <beorn@grafana.com>
The calculation of the mean value in avg_over_time is performed in an
incremental fashion. This introduces additional numerical errors that
even Kahan summation cannot compensate for, but at least we can use the
Kahan-corrected mean value when we use the intermediate mean value in
the calculation.

Signed-off-by: beorn7 <beorn@grafana.com>
Same idea as for the avg aggregator before: Most of the time, there is
no overflow, so we don't have to revert to the more expensive and less
precise incremental calculation of the mean value.

Signed-off-by: beorn7 <beorn@grafana.com>
@beorn7 (Member, Author) left a comment:

Thanks for the review.

I have addressed all comments, but you might want to double-check that it happened to your satisfaction, so I'll wait for your response before merging.


@beorn7 beorn7 merged commit ee5bba0 into main Aug 6, 2024
42 checks passed
@beorn7 beorn7 deleted the beorn7/promql branch August 6, 2024 17:56
charleskorn added a commit to grafana/mimir that referenced this pull request Aug 12, 2024
… `sum_over_time` and `avg_over_time` (#8934)

* Ensure non-name labels are retained in `count_over_time`, `last_over_time` and `present_over_time`

* Enable test cases for `min_over_time` and `max_over_time`

* Add `min_over_time` and `max_over_time`

* Add support for `sum_over_time`

* Add support for `avg_over_time`

* Expand concurrent queries test to include new functions that support native histograms

* Add changelog entry

* Fix typo in test

* Don't create unnecessary `FloatHistogram` instances in `avg_over_time`

* Add benchmarks

* Add test cases for combinations of infinity and NaN

* Explain why we need to copy histograms in `avg_over_time`

* Use simpler and more accurate average calculation in `avg_over_time` from prometheus/prometheus#14413