promql: more Kahan summation (avg) and less incremental mean calculation (avg, avg_over_time) #14413
Conversation
It's always used as such. Let's avoid the countless conversions.

Signed-off-by: beorn7 <beorn@grafana.com>
Force-pushed from 18eeefb to 6c2a37d.
BTW, the speedup for
Looks ok, but I have a number of thoughts.
@@ -2807,21 +2809,18 @@ func (ev *evaluator) aggregation(e *parser.AggregateExpr, q float64, inputMatrix
*group = groupedAggregation{
	seen:       true,
	floatValue: f,
	floatMean:  f,
why initialize this in cases where it isn't used?
In all cases where we need it, we have to initialize it that way, and in the cases where we don't need it, we are doing nothing wrong by initializing it.
Admittedly, we need it in only two cases. However, I assumed the computational effort is negligible and opted for one fewer line of code and the peace of mind that we cannot forget to initialize it by doing it here once and forever.
I don't feel strongly about this decision. If my explanation doesn't convince you, let me know.
// point in copying the histogram in that case.
} else {
Could put a `break` here and out-dent the following.
group.floatMean = group.floatValue / (group.groupCount - 1)
group.floatKahanC /= group.groupCount - 1
What sort of situation makes this useful? We've overflowed a float64, but dividing by groupCount brings us back within range?
We would have overflowed, but instead of actually doing it, we revert to incremental mean calculation, whose whole purpose is to avoid overflowing by "dividing early" (with some tricks). This is what we did before this PR in all cases, even the (vast majority of) cases that don't overflow.
I have tried to make the comment a bit more explanatory.
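To make "dividing early" concrete, here is a minimal sketch, not the literal Prometheus code (the function name `incrementalMean` is made up for the example):

```go
package main

import "fmt"

// incrementalMean keeps a running mean rather than a running sum. Because
// the intermediate value is always a mean, it stays near the magnitude of
// the data and cannot overflow even when the plain sum would.
func incrementalMean(values []float64) float64 {
	mean := 0.0
	for i, v := range values {
		count := float64(i + 1)
		// Equivalent to mean += (v - mean) / count, but splitting the
		// division scales both a huge v and a huge mean down before
		// anything is added.
		mean += v/count - mean/count
	}
	return mean
}

func main() {
	// 1e308 + 1e308 would overflow to +Inf, but the running mean never
	// leaves the representable range.
	fmt.Println(incrementalMean([]float64{1e308, 1e308})) // 1e+308
}
```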
Right, I mean: what sort of data will bring us to this case? Perhaps 1,000 series, each with a value of 2e305?
And I am hinting that it may not be a very common case.
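For scale: math.MaxFloat64 is roughly 1.797e308, so a thousand values of 2e305 sum to 2e308 and overflow, while the mean itself, 2e305, is comfortably representable. A quick demo of that arithmetic:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// 1,000 series, each with the value 2e305, as in the example above.
	sum := 0.0
	for i := 0; i < 1000; i++ {
		sum += 2e305
	}
	// 1000 * 2e305 = 2e308 exceeds math.MaxFloat64 (~1.797e308),
	// so the plain sum overflows ...
	fmt.Println(math.IsInf(sum, 1)) // true
	// ... even though the mean, 2e305, is well within float64 range.
}
```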
Yes, I would assume it's very rare in practice.
And that's one of the motivations behind this change.
Prior to this change, we always did incremental mean calculation, which is expensive and less accurate.
With this change, we do "normal" mean calculation, backed by Kahan summation, and only switch to incremental mean calculation in the rare cases that we need it.
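For reference, the compensated summation this builds on (introduced for sum() in #14074) follows the Neumaier variant of Kahan summation. The sketch below is modeled on the kahanSumInc helper used in the code fragment that follows; the upstream Prometheus implementation is authoritative:

```go
package promqlsketch

import "math"

// kahanSumInc adds inc to a running sum while maintaining a compensation
// term c that captures the low-order bits lost to float64 rounding.
// The numerically corrected result is sum + c.
func kahanSumInc(inc, sum, c float64) (newSum, newC float64) {
	t := sum + inc
	switch {
	case math.IsInf(t, 0):
		c = 0 // The sum overflowed; any compensation is meaningless now.
	case math.Abs(sum) >= math.Abs(inc):
		c += (sum - t) + inc // Low-order bits of inc were rounded away.
	default:
		c += (inc - t) + sum // Low-order bits of sum were rounded away.
	}
	return t, c
}
```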
for _, f := range s.Floats {
	count++
	if !incrementalMean {
		newSum, newC := kahanSumInc(f.F, sum, c)
		if count == 1 || !math.IsInf(newSum, 0) {
Why do we care about `count == 1`?
We would get division-by-zero problems below if the very first value to sum up is already ±Inf. I have added a comment to explain that. In the case of the aggregators, the flow is different because we essentially deal with the first step during initialization, which happens in the same way for all the different aggregations.
The basic idea here is that the previous code was always doing incremental calculation of the mean value, which is more costly and can be less precise. It protects against overflows, but in most cases, an overflow doesn't happen anyway.

The other idea applied here is to expand on #14074, where Kahan summation was applied to sum(). With this commit, the average is calculated in a conventional way (adding everything up and dividing at the end) as long as the sum isn't overflowing float64. This is combined with Kahan summation so that the avg aggregation is, in most cases, really equivalent to the sum aggregation with a following division (which is the user's expectation, as avg is supposed to be syntactic sugar for sum with a following division).

If the sum hits ±Inf, the calculation reverts to incremental calculation of the mean value. Kahan summation is also applied here, although it cannot fully compensate for the numerical errors introduced by the incremental mean calculation. (The tests added in this commit would fail if incremental mean calculation were always used.)

Signed-off-by: beorn7 <beorn@grafana.com>
The calculation of the mean value in avg_over_time is performed in an incremental fashion. This introduces additional numerical errors that even Kahan summation cannot fully compensate for, but at least we can use the Kahan-corrected mean whenever the intermediate mean value enters the calculation.

Signed-off-by: beorn7 <beorn@grafana.com>
Same idea as for the avg aggregator before: Most of the time, there is no overflow, so we don't have to revert to the more expensive and less precise incremental calculation of the mean value.

Signed-off-by: beorn7 <beorn@grafana.com>
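Putting the three commit messages together, the overall control flow looks roughly like the following condensed sketch (not the literal Prometheus code; it lives in the same hypothetical package as the kahanSumInc sketch above, and `avgFloats` is a made-up name):

```go
// avgFloats averages values with plain Kahan-compensated summation and
// only falls back to incremental mean calculation once the running sum
// would overflow float64.
func avgFloats(values []float64) float64 {
	var sum, c, mean, count float64
	incrementalMean := false
	for _, f := range values {
		count++
		if !incrementalMean {
			newSum, newC := kahanSumInc(f, sum, c)
			// count == 1 also lets a ±Inf first value enter the
			// sum directly; without it, we would divide by
			// count-1 == 0 below and produce NaN.
			if count == 1 || !math.IsInf(newSum, 0) {
				sum, c = newSum, newC
				continue
			}
			// The sum would overflow: switch to incremental mean
			// calculation, seeded from the sum accumulated so far.
			incrementalMean = true
			mean = sum / (count - 1)
			c /= count - 1
		}
		// Incremental update, feeding the Kahan-corrected
		// intermediate mean back into the calculation.
		corrected := mean + c
		mean, c = kahanSumInc(f/count-corrected/count, mean, c)
	}
	if incrementalMean {
		return mean + c
	}
	return (sum + c) / count
}
```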
Thanks for the review.
I have addressed all comments, but you might want to double check if it happened to your satisfaction, so I'll wait for your response before merging.
… `sum_over_time` and `avg_over_time` (#8934)

* Ensure non-name labels are retained in `count_over_time`, `last_over_time` and `present_over_time`
* Enable test cases for `min_over_time` and `max_over_time`
* Add `min_over_time` and `max_over_time`
* Add support for `sum_over_time`
* Add support for `avg_over_time`
* Expand concurrent queries test to include new functions that support native histograms
* Add changelog entry
* Fix typo in test
* Don't create unnecessary `FloatHistogram` instances in `avg_over_time`
* Add benchmarks
* Add test cases for combinations of infinity and NaN
* Explain why we need to copy histograms in `avg_over_time`
* Use simpler and more accurate average calculation in `avg_over_time` from prometheus/prometheus#14413
See various commits for details.
Studying #14074, I noticed that the tests added for the `sum` aggregation wouldn't work for the `avg` aggregation. Also, the tests already there for `sum_over_time` (with a series `1 1e100 1 -1e100`) wouldn't work for `avg_over_time`.

Simply applying Kahan summation wouldn't help. (In fact, it was already applied to `avg_over_time`, but of course without adding the test mentioned above.) The reason is that we always use incremental calculation of the mean value. That's a good approach if the sum of the averaged values overflows float64. But in general, it's just more costly and has far more potential for accumulating numerical errors.

This PR changes the behavior so that incremental calculation of the mean is only used if the sum would overflow. It also uses Kahan summation in all cases, which should take care of the errors that occur when adding small numbers to large ones (a case where incremental calculation of the mean might help, too, but I believe Kahan summation is cheaper and better, and also consistent with how we calculate sums).