Implement groupby aggregations `M2` and `MERGE_M2` #8605
Conversation
What is the difference between `MERGE_VARIANCES` and `MERGE_STDS`? For Spark, we really just want something that will take the M2, MEAN, and COUNT_VALID and produce a merged M2, MEAN, and COUNT_VALID. We may do multiple intermediate merges along the way, and if the schema changes as part of the MERGE step, we cannot do that any more.
Both `MERGE_VARIANCES` and `MERGE_STDS` perform the same merging step; they differ only in the final value (variance or std) computed from the merged results. If a merging step is the final merge, we can directly use the computed final values.

Yes, it is a bit wasteful to always do the second (finalizing) step when its results may not be used. Unless we know exactly when to do it, we could add an option to specify whether to do it when calling the merging aggs. Spark uses a 3-step approach: computing + merging + finalizing. In libcudf so far, we only implement computing + merging aggregations, so we have to combine the merging and finalizing steps within the merging aggs. Rather than adding more aggs like these, I'm thinking of refactoring the entire groupby aggs to give them 3 APIs that support a distributed computing environment, similar to Spark's. An illustrative sketch of the computing step is below.
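(Editorial illustration, not from this PR: a minimal C++ sketch of what the "computing" step of such a 3-step aggregation could produce. It uses Welford's online algorithm to build a `(COUNT_VALID, MEAN, M2)` partial per batch; the struct and function names are hypothetical, and libcudf may compute `M2` differently.)

```cpp
#include <cstddef>
#include <cstdio>

// One partial result per key group: COUNT_VALID, MEAN, M2.
// A merging step would consume and produce this same schema.
struct VarPartial {
  double count;
  double mean;
  double m2;
};

// Computing step: build a partial from one local batch of values
// using Welford's numerically stable online algorithm.
VarPartial compute_partial(double const* vals, std::size_t n) {
  VarPartial p{0.0, 0.0, 0.0};
  for (std::size_t i = 0; i < n; ++i) {
    p.count += 1.0;
    double delta = vals[i] - p.mean;
    p.mean += delta / p.count;           // running mean
    p.m2 += delta * (vals[i] - p.mean);  // running sum of squared diffs
  }
  return p;
}

int main() {
  double batch[] = {1.0, 2.0, 3.0, 4.0};
  VarPartial p = compute_partial(batch, 4);
  std::printf("count=%g mean=%g M2=%g\n", p.count, p.mean, p.m2);  // M2 = 5
}
```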
That should be fine. I didn't see any implementation or doxygen comments to really understand what was proposed. I personally would be fine with just a `MERGE_M2`.
I'm trying to understand why we need these new "merge" aggregations at all. Looking at the Dask algorithm, it uses the relation that the variance of a set X is the sum of the squares minus the square of the sum divided by the sample size, i.e., `var(X) = (sum(x_i^2) - sum(x_i)^2 / n) / n`. So a distributed groupby variance can be done as a groupby sum and a groupby sum-of-squares, with a division at the end, without any new aggregations needing to be added.
Apparently that algorithm has numerical stability issues: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Na%C3%AFve_algorithm. But that page also lists several alternative algorithms with better stability that should be explored. I would prefer to find a solution that doesn't require adding several new, specialized "merge" aggregation types.
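(Editorial illustration, not from this PR: a small standalone C++ program showing the instability of the naive formula. With values that share a large common offset, the naive sum-of-squares route loses essentially all precision in 64-bit floating point, while a two-pass computation does not.)

```cpp
#include <cstdio>
#include <vector>

int main() {
  // Four values with population variance 22.5, shifted by a large offset.
  std::vector<double> x = {1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16};
  double n = static_cast<double>(x.size());

  double sum = 0.0, sum_sq = 0.0;
  for (double v : x) { sum += v; sum_sq += v * v; }

  // Naive one-pass formula: (sum of squares - square of sum / n) / n.
  // The two huge terms cancel, destroying the small true difference (90).
  double naive = (sum_sq - sum * sum / n) / n;

  // Two-pass: subtract the mean first, then accumulate squared differences.
  double mean = sum / n, m2 = 0.0;
  for (double v : x) m2 += (v - mean) * (v - mean);

  std::printf("naive: %g   two-pass: %g   (expected 22.5)\n", naive, m2 / n);
}
```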
Thanks Jake. As Bobby said, we may just need `M2` and `MERGE_M2`. Yes, by just having `M2` and `MERGE_M2`, we no longer need the separate `MERGE_VARIANCES` and `MERGE_STDS` aggregations. Update: I have removed `MERGE_VARIANCES` and `MERGE_STDS` from this PR.
(The PR title was changed from "Implement groupby aggregations M2, MERGE_VARIANCES, and MERGE_STDS" to "Implement groupby aggregations M2, MERGE_M2", and finally to "Implement groupby aggregations M2 and MERGE_M2".)
@ttnghia could you describe the algorithm being used here to compute a distributed variance? An example would be especially helpful.
Sure. I have added more details about the algorithm in this PR's description, and plan to add several usage examples in the unit tests, which will be pushed up later.
@jrhemstad From the algorithm I described in the top post, one more thing I'm still concerned about is the merging step for `MERGE_M2`. Maybe I will just need to rewrite my implementation.
Why would you need to cast back to …?
How is Spark's algorithm different from what you listed here? https://www.wikiwand.com/en/Algorithms_for_calculating_variance#/Parallel_algorithm
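(For reference, the pairwise combination on that page merges two partial results `(n_a, mean_a, M2_a)` and `(n_b, mean_b, M2_b)` as `n = n_a + n_b`, `delta = mean_b - mean_a`, `mean = mean_a + delta * n_b / n`, and `M2 = M2_a + M2_b + delta^2 * n_a * n_b / n`; this is the combination `MERGE_M2` is expected to implement.)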
Note that 1) is no longer applicable, as I have changed the implementation.
@gpucibot merge |
This PR adds the following new groupby aggregations:

- `M2`: sum of squares of differences from the group mean, which is essentially an unscaled version of variance
- `MERGE_M2`: for merging distributedly computed `M2` values

These are auxiliary aggregations used for generating intermediate results for distributed computing of variance and standard deviation, with numerical stability.

Below is an overview of the algorithm for distributed computing of a `VARIANCE` aggregation (an illustrative sketch of the merging and finalizing steps follows the list):

1. Each node computes `COUNT_VALID`, `MEAN`, and `M2` aggregations for each batch of data.
2. The partial results of `COUNT_VALID`, `MEAN`, and `M2` are gathered, then assembled into a structs column with these values in separate children columns.
3. A `MERGE_M2` aggregation is called to produce new merged values for those partial results, following the algorithm described here: https://www.wikiwand.com/en/Algorithms_for_calculating_variance#/Parallel_algorithm. As such, the output of `MERGE_M2` is also a structs column similar to its input.
4. The partial results (`COUNT_VALID`, `MEAN`, and `M2`) may be merged in several merging steps.
5. After the last merging step, we have the final `M2` values. A finalizing step then computes variance/std for each key group using cudf binops (`variance_pop = M2 / group_size` and `variance_samp = M2 / (group_size - 1)`).

Reference: https://www.wikiwand.com/en/Algorithms_for_calculating_variance#/Parallel_algorithm
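(Editorial illustration, not part of the PR: a minimal host-side C++ sketch of steps 3-5 for a single key group. `Partial` and `merge` are hypothetical names, not libcudf API; the PR performs these steps on the GPU over structs columns.)

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// One partial result for a key group: COUNT_VALID, MEAN, M2.
struct Partial {
  double count;
  double mean;
  double m2;
};

// Pairwise combination from the linked parallel algorithm (Chan et al.):
// input and output have the same schema, so merges can be chained.
Partial merge(Partial a, Partial b) {
  double n     = a.count + b.count;
  double delta = b.mean - a.mean;
  double mean  = a.mean + delta * b.count / n;
  double m2    = a.m2 + b.m2 + delta * delta * a.count * b.count / n;
  return {n, mean, m2};
}

int main() {
  // Partial results for one key group, computed independently per batch.
  std::vector<Partial> partials = {{3, 10.0, 8.0}, {2, 16.0, 2.0}, {4, 7.0, 20.0}};

  // Steps 3-4: fold the partials together (any grouping order works).
  Partial acc = partials[0];
  for (std::size_t i = 1; i < partials.size(); ++i) acc = merge(acc, partials[i]);

  // Step 5: finalize variance from the fully merged M2.
  double var_pop  = acc.m2 / acc.count;          // population variance
  double var_samp = acc.m2 / (acc.count - 1.0);  // sample variance
  std::printf("count=%g mean=%g M2=%g var_pop=%g var_samp=%g\n",
              acc.count, acc.mean, acc.m2, var_pop, var_samp);
}
```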
There is also some other clean-up and code re-ordering in several related files.