[FEA] Support `duplicate_keep_option` in places that use unordered map #11050

ttnghia · 2022-06-03T21:58:46Z

In several places, we need to support `duplicate_keep_option` so that duplicates can be dropped while keeping only the first or the last element of each duplicate sequence. Supporting this option currently requires stable-sorting the input column so that the first or last duplicate element can be identified.

Using a hash table/unordered_map to drop duplicates avoids sorting and improves performance from O(n log n) to O(n), but the output elements are unordered, so this approach cannot support `duplicate_keep_option`. I believe this limitation can be overcome somehow. I'll think about it and prototype a solution if I come up with an idea.
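One way the limitation described above might be overcome is to store, for each distinct key, the row index to keep, and to reduce that index with min (keep first) or max (keep last) whenever a duplicate is seen. The host-side sketch below illustrates the idea with `std::unordered_map`; it is not the cudf implementation, and `keep_option` and `distinct_indices` are hypothetical names introduced only for this example. On a GPU, the same reduction could plausibly be done with atomic min/max on the stored index, but that is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for cudf's duplicate_keep_option (illustration only).
enum class keep_option { KEEP_ANY, KEEP_FIRST, KEEP_LAST };

// Returns the indices of the rows to keep, in ascending (input) order.
// The keys are never sorted: one pass fills the map, one pass compacts.
std::vector<std::size_t> distinct_indices(std::vector<int> const& keys, keep_option keep)
{
  // Maps each distinct key to the index of the row chosen to represent it.
  std::unordered_map<int, std::size_t> chosen;
  chosen.reserve(keys.size());

  for (std::size_t i = 0; i < keys.size(); ++i) {
    auto [it, inserted] = chosen.try_emplace(keys[i], i);
    if (inserted) { continue; }
    // Duplicate key: reduce the stored index according to the keep option.
    if (keep == keep_option::KEEP_FIRST) {
      it->second = std::min(it->second, i);
    } else if (keep == keep_option::KEEP_LAST) {
      it->second = std::max(it->second, i);
    }
    // KEEP_ANY: whichever index was stored first simply stays.
  }

  // Mark the chosen rows, then emit them in input order.
  std::vector<bool> kept(keys.size(), false);
  for (auto const& entry : chosen) { kept[entry.second] = true; }

  std::vector<std::size_t> result;
  result.reserve(chosen.size());
  for (std::size_t i = 0; i < keys.size(); ++i) {
    if (kept[i]) { result.push_back(i); }
  }
  return result;
}
```

The hash map's iteration order never leaks into the result, because the kept rows are recovered through a per-row flag rather than by walking the map in order. That is what lets the keep-first/keep-last choice coexist with an unordered container in this sketch.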


ttnghia · 2022-06-03T23:08:46Z

@PointKernel I have some ideas for this, so I have a question: can we access the value associated with a key from inside a kernel? Is that done with the `__device__ static_map::device_view::find()` function? I'm not sure whether calling it in a user-provided kernel will be fast, and should I use the overload of `find` that takes a `CG g` parameter or not?

PointKernel · 2022-06-03T23:36:06Z

> Is that using the `__device__ static_map::device_view::find()` function?

Yes

> I'm not sure if calling it in a provided kernel will be fast or not

Whether it's a cuco kernel or a custom kernel doesn't matter. Those APIs are designed to be invoked on the device side.

> should I use the overload of find that has CG g parameter or not?

There is no short answer for this. Whether to use a CG, and the optimal CG size, vary from use case to use case. But in general, if we stick with 50% occupancy (the default occupancy for all use cases of `static_map` in cudf), the non-CG version is slightly more efficient, since hash collisions are rare at such low occupancy.
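The point about collisions being rare at 50% occupancy can be explored with a small host-side simulation. The sketch below does not use cuco at all: it assumes an open-addressing table with linear probing and a splitmix64-style mixer as the hash, and `mix` and `average_probes` are hypothetical helpers written for this example. It counts how many slots are inspected per insertion at a given fill level.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// splitmix64-style finalizer, used here only as an assumed hash function.
std::uint64_t mix(std::uint64_t x)
{
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// Average number of slots inspected per insertion when `num_keys` distinct
// keys are inserted into a linear-probing table with `capacity` slots.
// Requires 0 < num_keys <= capacity.
double average_probes(std::size_t capacity, std::size_t num_keys)
{
  constexpr std::uint64_t empty_slot = ~std::uint64_t{0};  // sentinel for "unused"
  std::vector<std::uint64_t> slots(capacity, empty_slot);

  std::size_t total_probes = 0;
  for (std::uint64_t key = 0; key < num_keys; ++key) {
    std::size_t pos    = static_cast<std::size_t>(mix(key) % capacity);
    std::size_t probes = 1;
    // Walk forward (with wrap-around) until an unused slot is found.
    while (slots[pos] != empty_slot) {
      pos = (pos + 1) % capacity;
      ++probes;
    }
    slots[pos] = key;
    total_probes += probes;
  }
  return static_cast<double>(total_probes) / static_cast<double>(num_keys);
}
```

Comparing, for example, `average_probes(1 << 16, 1 << 15)` (50% full) with a table filled to about 90% shows how quickly probe sequences grow as occupancy rises. When most lookups finish in one or two probes, there is little extra work for a cooperative group to share, which is consistent with the reasoning given above.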

This adds `duplicate_keep_option` to `cudf::distinct`, allowing callers to specify a `keep` option that selects which of the duplicate elements to keep. It paves the way for many drop-duplicate applications to achieve `O(n)` performance. A `KEEP_ANY` option is also added to `duplicate_keep_option`; this was attempted in #9417 but did not get merged.

Partially addresses #11050 and #11053.

----

Main implementation: https://github.com/rapidsai/cudf/pull/11052/files#diff-4c2d4268b3c50000ae845ba15a890bb743709c30e5cab4847af7ad633c5a2823R47

Follow up work:
* #11053
* #11089
* #11092

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Yunsong Wang (https://github.com/PointKernel)
- Bradley Dice (https://github.com/bdice)

URL: #11052

github-actions · 2022-07-04T00:12:19Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions · 2022-10-02T00:18:51Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

GregoryKimball · 2022-10-04T22:55:20Z

Closed by #11052

ttnghia added the "feature request" and "Needs Triage" labels Jun 3, 2022

ttnghia self-assigned this Jun 3, 2022

This was referenced Jun 4, 2022

Support duplicate_keep_option in cudf::distinct #11052

Merged

[FEA] Implement lists::drop_list_duplicates using cudf::distinct #11053

Closed

github-actions bot added the inactive-30d label Jul 4, 2022

github-actions bot added the inactive-90d label Oct 2, 2022

GregoryKimball added this to the Refactor using cuco containers milestone Oct 4, 2022

GregoryKimball closed this as completed Oct 4, 2022

bdice removed the "Needs Triage" label Mar 4, 2024
