Add `cudf::stable_sort_by_key` #10387

PointKernel · 2022-03-02T22:05:37Z

This PR adds a new `stable_sort_by_key` API to libcudf. The new API simplifies the Cython/JNI bindings of `drop_duplicates` (#10370).
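To illustrate the semantics a stable sort-by-key provides, here is a minimal host-side sketch using only the C++ standard library. It is not the libcudf implementation and does not use libcudf types; the helper name `stable_sort_by_key_sketch` and the `int` key / `std::string` value types are assumptions made for illustration. Rows of `values` are reordered by ascending `keys`, and rows with equal keys keep their input order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical host-side helper (not part of libcudf): reorders `values`
// by ascending `keys`, keeping the input order of rows whose keys are equal.
std::vector<std::string> stable_sort_by_key_sketch(std::vector<std::string> const& values,
                                                   std::vector<int> const& keys)
{
  // One key per value row is required.
  assert(values.size() == keys.size());

  // Start from the identity permutation 0, 1, ..., n-1.
  std::vector<std::size_t> order(keys.size());
  std::iota(order.begin(), order.end(), std::size_t{0});

  // Stable sort the permutation by key: ties keep their original relative order.
  std::stable_sort(order.begin(), order.end(), [&keys](std::size_t a, std::size_t b) {
    return keys[a] < keys[b];
  });

  // Gather the values according to the sorted permutation.
  std::vector<std::string> result;
  result.reserve(values.size());
  for (auto const i : order) { result.push_back(values[i]); }
  return result;
}
```

With keys `{2, 1, 2, 1, 0}` and values `{"a", "b", "c", "d", "e"}`, the rows with key 1 come out as `"b"` then `"d"`, and the rows with key 2 as `"a"` then `"c"`, which is the ordering guarantee a later "keep first" or "keep last" deduplication step relies on.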

cpp/src/sort/sort.cu

cpp/tests/sort/stable_sort_tests.cpp

codecov · 2022-03-03T00:17:45Z

Codecov Report

Merging #10387 (7486621) into branch-22.04 (a7d88cd) will increase coverage by 0.07%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.04   #10387      +/-   ##
================================================
+ Coverage         10.42%   10.50%   +0.07%     
================================================
  Files               119      126       +7     
  Lines             20603    21218     +615     
================================================
+ Hits               2148     2228      +80     
- Misses            18455    18990     +535

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...ython/custreamz/custreamz/tests/test_dataframes.py | `99.39% <0.00%> (-0.01%)` | ⬇️ |
| python/cudf/cudf/errors.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/io/orc.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/_version.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/core/ops.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/datasets.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/core/frame.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/core/index.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/io/parquet.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/core/scalar.py | `0.00% <0.00%> (ø)` | |
| ... and 45 more | | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5d8ea19...7486621.

PointKernel · 2022-03-07T22:05:42Z

@gpucibot merge

davidwendt · 2022-03-07T22:12:15Z

@PointKernel you should be getting two C++ approvals for libcudf PRs before merging.

PointKernel · 2022-03-07T22:32:54Z

> @PointKernel you should be getting two C++ approvals for libcudf PRs before merging.

Thanks for the reminder. I realized this right after merging.

Closes #9413

Depends on #10387.

There are several changes involved in this PR:

- Refactors `cudf::drop_duplicates` to match `std::unique`'s behavior and renames it as `cudf::unique`. `cudf::unique` creates a table by removing duplicate rows in each consecutive group of equivalent rows of the input.
- Renames `cudf::unordered_drop_duplicates` as `cudf::distinct`. `cudf::distinct` creates a table by keeping unique rows across the whole input table. Unique rows in the new table are in an unspecified order due to the nature of hash-based algorithms.
- Renames `cudf::unordered_distinct_count` as `cudf::distinct_count`: count of `cudf::distinct`
- Renames `cudf::distinct_count` as `cudf::unique_count`: count of `cudf::unique`
- Updates corresponding tests and benchmarks.
- Updates related JNI/Cython bindings. In order not to break the existing behavior in Java and Python, the JNI and Cython bindings of `drop_duplicates` are updated to stably sort the input table first and then call `cudf::unique`.

Performance hints for `cudf::unique` and `cudf::distinct`:

- If the input is pre-sorted, use `cudf::unique`
- If the input is **not** pre-sorted and the behavior of `pandas.DataFrame.drop_duplicates` is desired:
  - If `keep` control (keep the first, last, or none of the duplicates) doesn't matter, use the hash-based `cudf::distinct`
  - If `keep` control is required, stable sort the input then `cudf::unique`

Authors:

- Yunsong Wang (https://github.com/PointKernel)

Approvers:

- Bradley Dice (https://github.com/bdice)
- https://github.com/brandon-b-miller
- MithunR (https://github.com/mythrocks)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10370
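The "stable sort, then unique" path from the performance hints can be sketched on the host with the standard library. This is an analogy only, not libcudf code: the `row` alias and the `unique_sketch` and `drop_duplicates_keep_first_sketch` names are hypothetical, and rows are modeled as `(key, payload)` pairs compared by key.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical row type for illustration: (key, payload).
using row = std::pair<int, char>;

// Mirrors std::unique semantics: within each consecutive run of rows with
// equal keys, only the first row is kept. Non-adjacent duplicates survive.
std::vector<row> unique_sketch(std::vector<row> rows)
{
  auto const last = std::unique(rows.begin(), rows.end(), [](row const& a, row const& b) {
    return a.first == b.first;
  });
  rows.erase(last, rows.end());
  return rows;
}

// "Keep first" drop-duplicates: stable sort by key so equal keys become
// adjacent while preserving input order among ties, then remove duplicates.
std::vector<row> drop_duplicates_keep_first_sketch(std::vector<row> rows)
{
  auto const by_key = [](row const& a, row const& b) { return a.first < b.first; };
  std::stable_sort(rows.begin(), rows.end(), by_key);
  assert(std::is_sorted(rows.begin(), rows.end(), by_key));
  return unique_sketch(std::move(rows));
}
```

Because the sort is stable, the first row of each run after sorting is also the first occurrence of that key in the original input, which is what makes the `keep` choice well defined. An unstable sort would leave the surviving payload unspecified.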

PointKernel added 5 commits March 1, 2022 15:35

Add stable_sort_by_key

48a8ac9

Update tests

a0a47e3

Minor correction: use proper memory resource

0d3bbc5

Move stable sort tests to a new file

7911897

Add mixed null order test

6f0089d

PointKernel added the feature request, 3 - Ready for Review, libcudf, CMake, and non-breaking labels Mar 2, 2022

PointKernel self-assigned this Mar 2, 2022

PointKernel requested a review from a team as a code owner March 2, 2022 22:05

PointKernel requested review from devavret and hyperbolic2346 March 2, 2022 22:05

PointKernel mentioned this pull request Mar 2, 2022

Refactor stream compaction APIs #10370

Merged

cmake-format

bf57a49

davidwendt reviewed Mar 2, 2022

cpp/src/sort/sort.cu (outdated, resolved)

Fix memory resource bugs

d6c2aa9

davidwendt requested changes Mar 2, 2022

cpp/tests/sort/stable_sort_tests.cpp (outdated, resolved)

davidwendt reviewed Mar 2, 2022

cpp/tests/sort/stable_sort_tests.cpp (outdated, resolved)

PointKernel added 2 commits March 7, 2022 12:10

Update bool type tests

75aef03

Add fixed point tests for stable sort

7486621

PointKernel requested a review from davidwendt March 7, 2022 17:26

davidwendt approved these changes Mar 7, 2022

rapids-bot bot merged commit 4f8c60a into rapidsai:branch-22.04 Mar 7, 2022

PointKernel deleted the stable-sort-by-key branch May 26, 2022 17:43
