Support for groupby/scan rank and dense_rank aggregations (#8652)
resolves #7208 and resolves #8440 (Rank in rolling window is a functional equivalent of scan)
replaces #8138 and #8506

Adds the rank and dense_rank aggregation operators. Both are supported by scan and groupby scan (segmented scan). This PR also includes Java support for the added aggregations.

Authors:
  - https://github.com/rwlee

Approvers:
  - Conor Hoekstra (https://github.com/codereport)
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Nghia Truong (https://github.com/ttnghia)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #8652
rwlee authored Jul 22, 2021
1 parent 825f132 commit b803c4e
Showing 19 changed files with 1,571 additions and 209 deletions.
2 changes: 2 additions & 0 deletions cpp/CMakeLists.txt
@@ -242,8 +242,10 @@ add_library(cudf
src/groupby/sort/group_sum.cu
src/groupby/sort/scan.cpp
src/groupby/sort/group_count_scan.cu
src/groupby/sort/group_dense_rank_scan.cu
src/groupby/sort/group_max_scan.cu
src/groupby/sort/group_min_scan.cu
src/groupby/sort/group_rank_scan.cu
src/groupby/sort/group_sum_scan.cu
src/groupby/sort/group_replace_nulls.cu
src/groupby/sort/sort_helper.cu
108 changes: 107 additions & 1 deletion cpp/include/cudf/aggregation.hpp
@@ -77,6 +77,8 @@ class aggregation {
NUNIQUE, ///< count number of unique elements
NTH_ELEMENT, ///< get the nth element
ROW_NUMBER, ///< get row-number of current index (relative to rolling window)
RANK, ///< get rank of current index
DENSE_RANK, ///< get dense rank of current index
COLLECT_LIST, ///< collect values into a list
COLLECT_SET, ///< collect values into a list without duplicate entries
LEAD, ///< window function, accesses row at specified offset following current row
@@ -253,6 +255,110 @@ std::unique_ptr<Base> make_nth_element_aggregation(
template <typename Base = aggregation>
std::unique_ptr<Base> make_row_number_aggregation();

/**
* @brief Factory to create a RANK aggregation
*
* `RANK` returns a non-nullable column of size_type "ranks": one plus the number of rows in the
* group that order strictly before the current row. As a result, tied rows share a rank and gaps
* appear in the ranking sequence after ties.
*
* This aggregation only works with "scan" algorithms. The input column into the group or
* ungrouped scan is an orderby column that orders the rows that the aggregate function ranks.
* If rows are ordered by more than one column, the orderby input column should be a struct
* column containing the ordering columns.
*
* Note:
* 1. This method requires that the rows are presorted by the group keys and order_by columns.
* 2. `RANK` aggregations will return a fully valid column regardless of null_handling policy
* specified in the scan.
* 3. `RANK` aggregations are not compatible with exclusive scans.
*
* @code{.pseudo}
* Example: Consider a motor-racing statistics dataset, containing the following columns:
* 1. driver_name: (STRING) Name of the car driver
* 2. num_overtakes: (INT32) Number of times the driver overtook another car in a lap
* 3. lap_number: (INT32) The number of the lap
*
* For the following presorted data:
*
* [ // driver_name, num_overtakes, lap_number
* { "bottas", 2, 3 },
* { "bottas", 2, 7 },
* { "bottas", 2, 7 },
* { "bottas", 1, 1 },
* { "bottas", 1, 2 },
* { "hamilton", 4, 1 },
* { "hamilton", 4, 1 },
* { "hamilton", 3, 4 },
* { "hamilton", 2, 4 }
* ]
*
* A grouped rank aggregation scan with:
* groupby column : driver_name
* input orderby column: struct_column{num_overtakes, lap_number}
* result: column<size_type>{1, 2, 2, 4, 5, 1, 1, 3, 4}
*
* A grouped rank aggregation scan with:
* groupby column : driver_name
* input orderby column: num_overtakes
* result: column<size_type>{1, 1, 1, 4, 4, 1, 1, 3, 4}
* @endcode
*/
template <typename Base = aggregation>
std::unique_ptr<Base> make_rank_aggregation();

/**
* @brief Factory to create a DENSE_RANK aggregation
*
* `DENSE_RANK` returns a non-nullable column of size_type "dense ranks": the dense rank of the
* preceding distinct value plus one. As a result, tied rows share a rank but there are no gaps in
* the ranking sequence (unlike RANK aggregations).
*
* This aggregation only works with "scan" algorithms. The input column into the group or
* ungrouped scan is an orderby column that orders the rows that the aggregate function ranks.
* If rows are ordered by more than one column, the orderby input column should be a struct
* column containing the ordering columns.
*
* Note:
* 1. This method requires that the rows are presorted by the group keys and order_by columns.
* 2. `DENSE_RANK` aggregations will return a fully valid column regardless of null_handling
* policy specified in the scan.
* 3. `DENSE_RANK` aggregations are not compatible with exclusive scans.
*
* @code{.pseudo}
* Example: Consider a motor-racing statistics dataset, containing the following columns:
* 1. driver_name: (STRING) Name of the car driver
* 2. num_overtakes: (INT32) Number of times the driver overtook another car in a lap
* 3. lap_number: (INT32) The number of the lap
*
* For the following presorted data:
*
* [ // driver_name, num_overtakes, lap_number
* { "bottas", 2, 3 },
* { "bottas", 2, 7 },
* { "bottas", 2, 7 },
* { "bottas", 1, 1 },
* { "bottas", 1, 2 },
* { "hamilton", 4, 1 },
* { "hamilton", 4, 1 },
* { "hamilton", 3, 4 },
* { "hamilton", 2, 4 }
* ]
*
* A grouped dense rank aggregation scan with:
* groupby column : driver_name
* input orderby column: struct_column{num_overtakes, lap_number}
* result: column<size_type>{1, 2, 2, 3, 4, 1, 1, 2, 3}
*
* A grouped dense rank aggregation scan with:
* groupby column : driver_name
* input orderby column: num_overtakes
* result: column<size_type>{1, 1, 1, 2, 2, 1, 1, 2, 3}
* @endcode
*/
template <typename Base = aggregation>
std::unique_ptr<Base> make_dense_rank_aggregation();

/**
* @brief Factory to create a COLLECT_LIST aggregation
*
@@ -268,7 +374,7 @@ std::unique_ptr<Base> make_collect_list_aggregation(
null_policy null_handling = null_policy::INCLUDE);

/**
- * @brief Factory to create a COLLECT_SET aggregation.
+ * @brief Factory to create a COLLECT_SET aggregation
*
* `COLLECT_SET` returns a lists column of all included elements in the group/series. Within each
* list, the duplicated entries are dropped out such that each entry appears only once.
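
The rank and dense rank semantics documented in the header above can be illustrated with a small host-side sketch. This is not the libcudf implementation (which runs on the GPU); `grouped_rank_scan` and its parameters are hypothetical names chosen for illustration. It assumes, as the doc comments require, that rows are presorted by group key and then by the order-by key, and it models a struct order-by column as a `std::pair`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical reference helper, not part of libcudf.
// Ranks rows that are already sorted by group and then by order key.
// Key only needs operator== (e.g. int, or std::pair for a struct column).
template <typename Key>
std::vector<int> grouped_rank_scan(const std::vector<std::string>& groups,
                                   const std::vector<Key>& order_keys,
                                   bool dense)
{
  assert(groups.size() == order_keys.size());
  std::vector<int> ranks(groups.size());
  int row_in_group = 0;  // 1-based position of the current row in its group
  for (std::size_t i = 0; i < groups.size(); ++i) {
    const bool new_group = (i == 0) || (groups[i] != groups[i - 1]);
    if (new_group) {
      row_in_group = 1;
      ranks[i]     = 1;  // every group restarts at rank 1
      continue;
    }
    ++row_in_group;
    if (order_keys[i] == order_keys[i - 1]) {
      ranks[i] = ranks[i - 1];  // ties share the previous row's rank
    } else if (dense) {
      ranks[i] = ranks[i - 1] + 1;  // dense rank: no gaps after ties
    } else {
      ranks[i] = row_in_group;  // rank: skip ahead to the row position
    }
  }
  return ranks;
}
```

Called with the `driver_name` values as `groups` and `(num_overtakes, lap_number)` pairs as `order_keys`, this sketch yields `{1, 2, 2, 4, 5, 1, 1, 3, 4}` with `dense = false` and `{1, 2, 2, 3, 4, 1, 1, 2, 3}` with `dense = true`, matching the results listed in the doc comments. The only difference between the two modes is what happens on the first row after a tie: rank jumps to the row's position in the group, while dense rank increments by one.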

1 comment on commit b803c4e

@chil-trek

Looks like this feature (grouped dense rank) is still not available in Python. Any plans to make it available?