[FEA] Support `duplicate_keep_option` in places that use unordered map #11050
Comments
@PointKernel I have some ideas for this. So I have a question: Can we access a value associated with a key in a kernel? Is that using the |
Yes
Whether it's a cuco kernel or a custom kernel doesn't matter. Those APIs are designed to be invoked on the device side.
There is no short answer for this. Using CG or not and the optimal size of CG vary on a use-case base. But in general, if we stick with 50% occupancy (the default occupancy for all use cases of |
This adds `duplicate_keep_option` to `cudf::distinct`, allowing callers to specify a `keep` option that selects which of the duplicate elements to keep. It paves the way for many drop-duplicate applications to achieve `O(n)` performance. A `KEEP_ANY` option is also added to `duplicate_keep_option`; this was attempted in #9417 but was not merged.

Partially addresses #11050 and #11053.

----

Main implementation: https://github.com/rapidsai/cudf/pull/11052/files#diff-4c2d4268b3c50000ae845ba15a890bb743709c30e5cab4847af7ad633c5a2823R47

Follow-up work:
* #11053
* #11089
* #11092

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Yunsong Wang (https://github.com/PointKernel)
- Bradley Dice (https://github.com/bdice)

URL: #11052
This issue has been labeled |
This issue has been labeled |
Closed by #11052
In several places, we need to support `duplicate_keep_option` to drop duplicates except for the first or last element in a duplicate sequence. Such an option requires stable-sorting the input column so that we can identify the first or last duplicate element.

Using a hash table/unordered_map to drop duplicates can avoid sorting and improve performance from `O(n log n)` to `O(n)`, but the output elements are unordered, so it cannot support `duplicate_keep_option`. I believe this limitation can be overcome somehow. I'll think about it and will prototype a solution if I find an idea.
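One possible way to get around the ordering problem, sketched on the host in plain C++17: instead of storing only the key in the hash map, store a row index per key and update it on every duplicate (keep the smallest index for "first", the largest for "last"). This is only an illustration under stated assumptions, not necessarily the approach taken in cudf. The `keep_option` enum and the `distinct_indices` function are hypothetical names invented for this sketch, and a device-side version would presumably need an atomic min/max on the stored index rather than the sequential update shown here.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical enum for this sketch only; it is NOT the cudf definition.
enum class keep_option { KEEP_FIRST, KEEP_LAST };

// Returns the indices of the rows to keep, in ascending order.
// Expected O(n) time: one pass to fill the map, one pass to gather.
std::vector<std::size_t> distinct_indices(std::vector<int> const& keys, keep_option keep)
{
  // Map each distinct key to the index of the row chosen to represent it.
  std::unordered_map<int, std::size_t> chosen;
  chosen.reserve(keys.size());

  for (std::size_t i = 0; i < keys.size(); ++i) {
    // try_emplace leaves an existing entry untouched, so the first index wins by default.
    auto [it, inserted] = chosen.try_emplace(keys[i], i);
    // For KEEP_LAST, overwrite with the later index whenever the key was already present.
    if (!inserted && keep == keep_option::KEEP_LAST) { it->second = i; }
  }

  // Mark the chosen rows, then gather them in input order without sorting.
  std::vector<bool> is_kept(keys.size(), false);
  for (auto const& entry : chosen) { is_kept[entry.second] = true; }

  std::vector<std::size_t> result;
  result.reserve(chosen.size());
  for (std::size_t i = 0; i < keys.size(); ++i) {
    if (is_kept[i]) { result.push_back(i); }
  }
  return result;
}
```

The map's iteration order never affects the result, because the chosen indices are scattered into a boolean mask and gathered in input order. That is what lets a first/last choice coexist with an unordered container.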