Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement groupby aggregations M2 and MERGE_M2 #8605

Merged
merged 22 commits into from
Jul 8, 2021
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
4c272df
Add aggregation definitions
ttnghia Jun 23, 2021
d446076
Adding new aggregations: M2, MERGE_VARIANCES, MERGE_STDS
ttnghia Jun 24, 2021
ebc718a
Remove `MERGE_VARIANCES` and `MERGE_STDS`, add `MERGE_M2`
ttnghia Jun 25, 2021
93b38c4
Finish implementation, no unit tests implemented yet
ttnghia Jun 30, 2021
8c74cb4
Finish unit tests for M2 aggregation
ttnghia Jul 1, 2021
44a3b16
Rewrite doxygen
ttnghia Jul 2, 2021
1d54ef5
Fix `MERGE_M2`implementation
ttnghia Jul 2, 2021
96dc792
Finish unit tests for `MERGE_M2`
ttnghia Jul 2, 2021
e131be6
Merge branch 'branch-21.08' into merge_std_variance
ttnghia Jul 2, 2021
99ab9a1
Fix copyright header
ttnghia Jul 2, 2021
0ac08fe
Rename functor
ttnghia Jul 2, 2021
b751d15
Merge branch 'branch-21.08' into merge_std_variance
ttnghia Jul 6, 2021
7863b6f
Rewrite the merge functor, adding `partial_result` struct to store in…
ttnghia Jul 6, 2021
0ab2c18
Rewrite doxygen
ttnghia Jul 6, 2021
90d984a
Add a unit test when the input column column is a structs column with…
ttnghia Jul 6, 2021
9967ec2
Add unit tests for the cases when the input values column has negativ…
ttnghia Jul 6, 2021
f3d0a3c
Fix comments
ttnghia Jul 6, 2021
bbb961f
Rewrite unit tests, separating multiple steps merging and one step me…
ttnghia Jul 6, 2021
26e17f6
Change `ResultType` to `result_type` to enforce name consistency
ttnghia Jul 6, 2021
1872846
Rewrite doxygen
ttnghia Jul 7, 2021
53023af
Merge branch 'branch-21.08' into merge_std_variance
ttnghia Jul 7, 2021
a1d00b1
Fix formatting
ttnghia Jul 7, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -199,6 +199,7 @@ add_library(cudf
src/groupby/sort/aggregate.cpp
src/groupby/sort/group_collect.cu
src/groupby/sort/group_merge_lists.cu
src/groupby/sort/group_merge_variances.cu
src/groupby/sort/group_count.cu
src/groupby/sort/group_max.cu
src/groupby/sort/group_min.cu
Expand Down
134 changes: 93 additions & 41 deletions cpp/include/cudf/aggregation.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -57,33 +57,36 @@ class aggregation {
* @brief Possible aggregation operations
*/
enum Kind {
SUM, ///< sum reduction
PRODUCT, ///< product reduction
MIN, ///< min reduction
MAX, ///< max reduction
COUNT_VALID, ///< count number of valid elements
COUNT_ALL, ///< count number of elements
ANY, ///< any reduction
ALL, ///< all reduction
SUM_OF_SQUARES, ///< sum of squares reduction
MEAN, ///< arithmetic mean reduction
VARIANCE, ///< groupwise variance
STD, ///< groupwise standard deviation
MEDIAN, ///< median reduction
QUANTILE, ///< compute specified quantile(s)
ARGMAX, ///< Index of max element
ARGMIN, ///< Index of min element
NUNIQUE, ///< count number of unique elements
NTH_ELEMENT, ///< get the nth element
ROW_NUMBER, ///< get row-number of current index (relative to rolling window)
COLLECT_LIST, ///< collect values into a list
COLLECT_SET, ///< collect values into a list without duplicate entries
MERGE_LISTS, ///< merge multiple lists values into one list
MERGE_SETS, ///< merge multiple lists values into one list then drop duplicate entries
LEAD, ///< window function, accesses row at specified offset following current row
LAG, ///< window function, accesses row at specified offset preceding current row
PTX, ///< PTX UDF based reduction
CUDA ///< CUDA UDF based reduction
SUM, ///< sum reduction
PRODUCT, ///< product reduction
MIN, ///< min reduction
MAX, ///< max reduction
COUNT_VALID, ///< count number of valid elements
COUNT_ALL, ///< count number of elements
ANY, ///< any reduction
ALL, ///< all reduction
SUM_OF_SQUARES, ///< sum of squares reduction
MEAN, ///< arithmetic mean reduction
M2, ///< groupwise sum of squares of differences from the current mean
VARIANCE, ///< groupwise variance
STD, ///< groupwise standard deviation
MEDIAN, ///< median reduction
QUANTILE, ///< compute specified quantile(s)
ARGMAX, ///< Index of max element
ARGMIN, ///< Index of min element
NUNIQUE, ///< count number of unique elements
NTH_ELEMENT, ///< get the nth element
ROW_NUMBER, ///< get row-number of current index (relative to rolling window)
COLLECT_LIST, ///< collect values into a list
COLLECT_SET, ///< collect values into a list without duplicate entries
LEAD, ///< window function, accesses row at specified offset following current row
LAG, ///< window function, accesses row at specified offset preceding current row
PTX, ///< PTX UDF based reduction
CUDA, ///< CUDA UDF based reduction
MERGE_LISTS, ///< merge multiple lists values into one list
MERGE_SETS, ///< merge multiple lists values into one list then drop duplicate entries
MERGE_VARIANCES, ///< merge partial variance values
MERGE_STDS ///< merge partial standard deviation values
};

aggregation() = delete;
Expand Down Expand Up @@ -159,6 +162,16 @@ std::unique_ptr<Base> make_sum_of_squares_aggregation();
template <typename Base = aggregation>
std::unique_ptr<Base> make_mean_aggregation();

/**
* @brief Factory to create a M2 aggregation
*
* A M2 aggregation is groupwise sum of squares of differences from the current mean. From this,
* a `VARIANCE` aggregation can be computed as `M2 / (N - ddof)`, where `N` is the population size
* and `ddof` is the delta degrees of freedom.
*/
template <typename Base = aggregation>
std::unique_ptr<Base> make_m2_aggregation();

/**
* @brief Factory to create a VARIANCE aggregation
*
Expand Down Expand Up @@ -271,6 +284,28 @@ std::unique_ptr<Base> make_collect_set_aggregation(null_policy null_handling = n
null_equality nulls_equal = null_equality::EQUAL,
nan_equality nans_equal = nan_equality::UNEQUAL);

/// Factory to create a LAG aggregation
template <typename Base = aggregation>
std::unique_ptr<Base> make_lag_aggregation(size_type offset);

/// Factory to create a LEAD aggregation
template <typename Base = aggregation>
std::unique_ptr<Base> make_lead_aggregation(size_type offset);

/**
* @brief Factory to create an aggregation base on UDF for PTX or CUDA
*
* @param[in] type: either udf_type::PTX or udf_type::CUDA
* @param[in] user_defined_aggregator A string containing the aggregator code
* @param[in] output_type expected output type
*
* @return aggregation unique pointer housing user_defined_aggregator string.
*/
template <typename Base = aggregation>
std::unique_ptr<Base> make_udf_aggregation(udf_type type,
std::string const& user_defined_aggregator,
data_type output_type);

/**
* @brief Factory to create a MERGE_LISTS aggregation.
*
Expand Down Expand Up @@ -308,27 +343,44 @@ template <typename Base = aggregation>
std::unique_ptr<Base> make_merge_sets_aggregation(null_equality nulls_equal = null_equality::EQUAL,
nan_equality nans_equal = nan_equality::UNEQUAL);

/// Factory to create a LAG aggregation
template <typename Base = aggregation>
std::unique_ptr<Base> make_lag_aggregation(size_type offset);

/// Factory to create a LEAD aggregation
/**
* @brief Factory to create a MERGE_VARIANCES aggregation
*
* This aggregation is designed specificly to perform distributed computing of `VARIANCE`
* aggregation. The partial results input to this aggregation is generated by two groupby
* aggregations: `VARIANCE` and `COUNT_VALID`.
*
* In order to use this aggregation, the `aggregation_request` array input to `groupby::aggregate`
* must contain at least two requests:
* - A request for `COLLECT_LIST` aggregation to collect the partial results of `COUNT_VALID`
* - This `MERGE_VARIANCES` request, which must be given AFTER the request above so that it can
* access the cached results generated by that request
*
* For a merging operation that is not a final merge (i.e., its outputs will be used as input to
* perform another `MERGE_VARIANCES` aggregation), a `SUM` aggregation must also be added to the
* same request for `COLLECT_LIST` above to produce the merged values for `COUNT_VALID`.
ttnghia marked this conversation as resolved.
Show resolved Hide resolved
*
* Since the partial results output from `VARIANCE` and `COUNT_VALID` do not contain nulls, the
* input values columns to those two requests must be non-nullable.
*
* @param ddof Delta degrees of freedom. The divisor used in calculation of `variance` is
* `N - ddof`, where `N` is the population size.
*/
template <typename Base = aggregation>
std::unique_ptr<Base> make_lead_aggregation(size_type offset);
std::unique_ptr<Base> make_merge_variances_aggregation(size_type ddof = 1);

/**
* @brief Factory to create an aggregation base on UDF for PTX or CUDA
* @brief Factory to create a MERGE_STDS aggregation
*
* @param[in] type: either udf_type::PTX or udf_type::CUDA
* @param[in] user_defined_aggregator A string containing the aggregator code
* @param[in] output_type expected output type
* This aggregation is designed specificly to perform distributed computing of `STD`
* aggregation. The partial results input to this aggregation and its usage are the same as of
* `MERGE_VARIANCES` aggregation.
*
* @return aggregation unique pointer housing user_defined_aggregator string.
* @param ddof Delta degrees of freedom. The divisor used in calculation of `variance` is
* `N - ddof`, where `N` is the population size.
*/
template <typename Base = aggregation>
std::unique_ptr<Base> make_udf_aggregation(udf_type type,
std::string const& user_defined_aggregator,
data_type output_type);
std::unique_ptr<Base> make_merge_stds_aggregation(size_type ddof = 1);

/** @} */ // end of group
} // namespace cudf
Loading