Support for groupby/scan rank and dense_rank aggregations (#8652)

Resolves #7208 and resolves #8440 (rank in a rolling window is functionally equivalent to a scan). Replaces #8138 and #8506.

Adds functionality for the aggregation operators rank and dense_rank. Both are supported by scan and groupby scan (segmented scan). This PR also includes Java support for the added aggregations.

Authors:
- https://github.com/rwlee

Approvers:
- Conor Hoekstra (https://github.com/codereport)
- Robert (Bobby) Evans (https://github.com/revans2)
- Nghia Truong (https://github.com/ttnghia)
- Jake Hemstad (https://github.com/jrhemstad)

URL: #8652
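To make the difference between the two aggregations concrete, here is a minimal CPU-side sketch of their semantics in plain C++. It is not libcudf's GPU implementation: the names `OrderKey`, `RankColumns`, and `grouped_rank_scan` are hypothetical, and the sketch assumes rows are already sorted by group key and then by the order-by columns.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical order-by row, e.g. {num_overtakes, lap_number}.
using OrderKey = std::pair<int, int>;

// Hypothetical result holder: one rank and one dense rank per input row.
struct RankColumns {
  std::vector<int> rank;
  std::vector<int> dense_rank;
};

// Reference semantics of a grouped inclusive rank / dense_rank scan.
// Assumes the input is presorted by group key, then by order key.
RankColumns grouped_rank_scan(const std::vector<std::string>& group_keys,
                              const std::vector<OrderKey>& order_keys)
{
  RankColumns out;
  int position = 0;  // 1-based row position within the current group
  int rank     = 0;
  int dense    = 0;
  for (std::size_t i = 0; i < group_keys.size(); ++i) {
    const bool new_group = (i == 0) || (group_keys[i] != group_keys[i - 1]);
    if (new_group) {
      position = 0;
      rank     = 0;
      dense    = 0;
    }
    ++position;
    // A row ties with its predecessor when both are in the same group
    // and their order keys compare equal.
    const bool tie = !new_group && (order_keys[i] == order_keys[i - 1]);
    if (!tie) {
      rank = position;  // jumps past tied rows, so gaps can appear
      ++dense;          // advances by exactly one, so no gaps appear
    }
    out.rank.push_back(rank);
    out.dense_rank.push_back(dense);
  }
  return out;
}
```

Fed the motor-racing rows from the `make_rank_aggregation` documentation in this commit (grouped by `driver_name`, ordered by `{num_overtakes, lap_number}`), this sketch yields ranks 1, 2, 2, 4, 5, 1, 1, 3, 4 and dense ranks 1, 2, 2, 3, 4, 1, 1, 2, 3, matching the documented results.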
rapidsai · Jul 22, 2021 · b803c4e · b803c4e · chil-trek · Feb 3, 2022
1 parent 825f132
commit b803c4e
Showing 19 changed files with 1,571 additions and 209 deletions.
diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
@@ -242,8 +242,10 @@ add_library(cudf
     src/groupby/sort/group_sum.cu
     src/groupby/sort/scan.cpp
     src/groupby/sort/group_count_scan.cu
+    src/groupby/sort/group_dense_rank_scan.cu
     src/groupby/sort/group_max_scan.cu
     src/groupby/sort/group_min_scan.cu
+    src/groupby/sort/group_rank_scan.cu
     src/groupby/sort/group_sum_scan.cu
     src/groupby/sort/group_replace_nulls.cu
     src/groupby/sort/sort_helper.cu

diff --git a/cpp/include/cudf/aggregation.hpp b/cpp/include/cudf/aggregation.hpp
@@ -77,6 +77,8 @@ class aggregation {
     NUNIQUE,         ///< count number of unique elements
     NTH_ELEMENT,     ///< get the nth element
     ROW_NUMBER,      ///< get row-number of current index (relative to rolling window)
+    RANK,            ///< get rank       of current index
+    DENSE_RANK,      ///< get dense rank of current index
     COLLECT_LIST,    ///< collect values into a list
     COLLECT_SET,     ///< collect values into a list without duplicate entries
     LEAD,            ///< window function, accesses row at specified offset following current row
@@ -253,6 +255,110 @@ std::unique_ptr<Base> make_nth_element_aggregation(
 template <typename Base = aggregation>
 std::unique_ptr<Base> make_row_number_aggregation();
 
+/**
+ * @brief Factory to create a RANK aggregation
+ *
+ * `RANK` returns a non-nullable column of size_type "ranks": one plus the number of rows that
+ * order strictly before the current row. As a result, tied rows share a rank and gaps appear
+ * in the ranking sequence after ties.
+ *
+ * This aggregation only works with "scan" algorithms. The input column to the grouped or
+ * ungrouped scan is an orderby column that orders the rows that the aggregate function ranks.
+ * If rows are ordered by more than one column, the orderby input column should be a struct
+ * column containing the ordering columns.
+ *
+ * Note:
+ *  1. This method requires that the rows are presorted by the group keys and order_by columns.
+ *  2. `RANK` aggregations will return a fully valid column regardless of null_handling policy
+ *     specified in the scan.
+ *  3. `RANK` aggregations are not compatible with exclusive scans.
+ *
+ * @code{.pseudo}
+ * Example: Consider a motor-racing statistics dataset, containing the following columns:
+ *   1. driver_name:   (STRING) Name of the car driver
+ *   2. num_overtakes: (INT32)  Number of times the driver overtook another car in a lap
+ *   3. lap_number:    (INT32)  The number of the lap
+ *
+ * For the following presorted data:
+ *
+ *  [ // driver_name,  num_overtakes,  lap_number
+ *    {   "bottas",        2,            3        },
+ *    {   "bottas",        2,            7        },
+ *    {   "bottas",        2,            7        },
+ *    {   "bottas",        1,            1        },
+ *    {   "bottas",        1,            2        },
+ *    {   "hamilton",      4,            1        },
+ *    {   "hamilton",      4,            1        },
+ *    {   "hamilton",      3,            4        },
+ *    {   "hamilton",      2,            4        }
+ *  ]
+ *
+ * A grouped rank aggregation scan with:
+ *   groupby column      : driver_name
+ *   input orderby column: struct_column{num_overtakes, lap_number}
+ *  result: column<size_type>{1, 2, 2, 4, 5, 1, 1, 3, 4}
+ *
+ * A grouped rank aggregation scan with:
+ *   groupby column      : driver_name
+ *   input orderby column: num_overtakes
+ *  result: column<size_type>{1, 1, 1, 4, 4, 1, 1, 3, 4}
+ * @endcode
+ */
+template <typename Base = aggregation>
+std::unique_ptr<Base> make_rank_aggregation();
+
+/**
+ * @brief Factory to create a DENSE_RANK aggregation
+ *
+ * `DENSE_RANK` returns a non-nullable column of size_type "dense ranks": the dense rank of the
+ * preceding distinct value plus one, starting at one. As a result, tied rows share a rank but
+ * there are no gaps in the ranking sequence (unlike RANK aggregations).
+ *
+ * This aggregation only works with "scan" algorithms. The input column to the grouped or
+ * ungrouped scan is an orderby column that orders the rows that the aggregate function ranks.
+ * If rows are ordered by more than one column, the orderby input column should be a struct
+ * column containing the ordering columns.
+ *
+ * Note:
+ *  1. This method requires that the rows are presorted by the group keys and order_by columns.
+ *  2. `DENSE_RANK` aggregations will return a fully valid column regardless of null_handling
+ *     policy specified in the scan.
+ *  3. `DENSE_RANK` aggregations are not compatible with exclusive scans.
+ *
+ * @code{.pseudo}
+ * Example: Consider a motor-racing statistics dataset, containing the following columns:
+ *   1. driver_name:   (STRING) Name of the car driver
+ *   2. num_overtakes: (INT32)  Number of times the driver overtook another car in a lap
+ *   3. lap_number:    (INT32)  The number of the lap
+ *
+ * For the following presorted data:
+ *
+ *  [ // driver_name,  num_overtakes,  lap_number
+ *    {   "bottas",        2,            3        },
+ *    {   "bottas",        2,            7        },
+ *    {   "bottas",        2,            7        },
+ *    {   "bottas",        1,            1        },
+ *    {   "bottas",        1,            2        },
+ *    {   "hamilton",      4,            1        },
+ *    {   "hamilton",      4,            1        },
+ *    {   "hamilton",      3,            4        },
+ *    {   "hamilton",      2,            4        }
+ *  ]
+ *
+ * A grouped dense rank aggregation scan with:
+ *   groupby column      : driver_name
+ *   input orderby column: struct_column{num_overtakes, lap_number}
+ *  result: column<size_type>{1, 2, 2, 3, 4, 1, 1, 2, 3}
+ *
+ * A grouped dense rank aggregation scan with:
+ *   groupby column      : driver_name
+ *   input orderby column: num_overtakes
+ *  result: column<size_type>{1, 1, 1, 2, 2, 1, 1, 2, 3}
+ * @endcode
+ */
+template <typename Base = aggregation>
+std::unique_ptr<Base> make_dense_rank_aggregation();
+
 /**
  * @brief Factory to create a COLLECT_LIST aggregation
  *
@@ -268,7 +374,7 @@ std::unique_ptr<Base> make_collect_list_aggregation(
   null_policy null_handling = null_policy::INCLUDE);
 
 /**
- * @brief Factory to create a COLLECT_SET aggregation.
+ * @brief Factory to create a COLLECT_SET aggregation
  *
  * `COLLECT_SET` returns a lists column of all included elements in the group/series. Within each
  * list, the duplicated entries are dropped out such that each entry appears only once.