Support `duplicate_keep_option` in `cudf::distinct` #11052
Below are benchmarks on a struct column with 8 children (4 int + 4 string), generated randomly and then repeated 4 times (so less than 25% of its rows are unique). Comparing
I believe that this brings undeniable benefits. We can apply the same approach to |
So this work is intended to fill the "hole" when keys are not sorted and keep control is required. Currently, we need to stable sort the key first and then invoke
Did I miss something? |
The first option should not be handled internally by |
Based on the offline discussion, @ttnghia is splitting the existing
@davidwendt Any suggestions to further reduce compile time are highly appreciated. |
CMake LGTM (did not look at the C++)
One non-blocking nitpick; otherwise looks good to me. 🔥
There were some significant changes from the last time I reviewed. I'd like to see one more round of review on this to resolve some questions on naming.
Two small edits removing mentions of sparsity since we renamed the function (and it seems that it could be performant even if the data is mostly unique). Otherwise LGTM!
# Conflicts: # cpp/src/stream_compaction/distinct.cu
@gpucibot merge |
This adds a `nan_equality` parameter to `cudf::distinct`, allowing callers to specify the desired behavior for floating-point data: whether a `NaN` should compare equal to other `NaN` values or not.

Depends on #11052 (built on top of it).

Closes #11092.

This is a blocker for set-like operations (#11043) and also the last blocker for #11053.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Yunsong Wang (https://github.com/PointKernel)

URL: #11118
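To make the two `NaN` behaviors concrete, here is a minimal host-side sketch. It is not the libcudf implementation: `nan_policy` and `distinct_values` are hypothetical names introduced for illustration (the enumerators mirror `cudf::nan_equality`), and the quadratic scan is chosen only for brevity.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cudf::nan_equality (illustration only).
enum class nan_policy { ALL_EQUAL, UNEQUAL };

// Returns the distinct values of `input`, keeping the first occurrence of each.
// O(n^2) on purpose: the point is the comparison rule, not performance.
std::vector<double> distinct_values(std::vector<double> const& input, nan_policy nans)
{
  std::vector<double> out;
  for (double v : input) {
    bool duplicate = false;
    for (double kept : out) {
      bool const both_nan = std::isnan(v) && std::isnan(kept);
      // `kept == v` is always false when either side is NaN, so NaNs are
      // treated as duplicates only when the policy says they are all equal.
      if (kept == v || (both_nan && nans == nan_policy::ALL_EQUAL)) {
        duplicate = true;
        break;
      }
    }
    if (!duplicate) { out.push_back(v); }
  }
  assert(out.size() <= input.size());
  return out;
}
```

With the input `{1.0, NaN, 1.0, NaN}`, the `ALL_EQUAL` policy collapses both `NaN` values into one entry, while `UNEQUAL` keeps each `NaN` as its own distinct value.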
This PR adds the following APIs for set operations:
* `lists::have_overlap`
* `lists::intersect_distinct`
* `lists::union_distinct`
* `lists::difference_distinct`

### Name Convention
Except for the first API (`lists::have_overlap`), which returns a boolean column, the `_distinct` suffix of the remaining APIs denotes that their results are lists columns in which every list row has been post-processed to remove duplicates. As such, their results are effectively "set" columns in which each row is a "set" of distinct elements.

---

Depends on:
* #10945
* #11017
* NVIDIA/cuCollections#175
* #11052
* #11118
* #11100
* #11149

Closes #10409.

Authors:
- Nghia Truong (https://github.com/ttnghia)
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Michael Wang (https://github.com/isVoid)
- AJ Schmidt (https://github.com/ajschmidt8)
- Bradley Dice (https://github.com/bdice)
- Yunsong Wang (https://github.com/PointKernel)

URL: #11043
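The row-wise semantics described in the naming convention can be sketched on the host as follows. This is an illustration under assumptions, not the cudf API: the functions below reuse the names `have_overlap` and `intersect_distinct` for readability but operate on plain `std::vector` rows, ignore nulls, and return each result row in sorted order (an artifact of using `std::set` here, not a claim about cudf's output order).

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical host-side model of a lists column: one inner vector per row.
using list_rows = std::vector<std::vector<int>>;

// For each row, true if the two lists share at least one element.
std::vector<bool> have_overlap(list_rows const& lhs, list_rows const& rhs)
{
  assert(lhs.size() == rhs.size());
  std::vector<bool> out(lhs.size(), false);
  for (std::size_t i = 0; i < lhs.size(); ++i) {
    std::set<int> const left(lhs[i].begin(), lhs[i].end());
    for (int v : rhs[i]) {
      if (left.count(v) > 0) {
        out[i] = true;
        break;
      }
    }
  }
  return out;
}

// For each row, the elements present in both lists, with duplicates removed.
list_rows intersect_distinct(list_rows const& lhs, list_rows const& rhs)
{
  assert(lhs.size() == rhs.size());
  list_rows out(lhs.size());
  for (std::size_t i = 0; i < lhs.size(); ++i) {
    std::set<int> const left(lhs[i].begin(), lhs[i].end());
    std::set<int> common;  // a set drops repeated matches automatically
    for (int v : rhs[i]) {
      if (left.count(v) > 0) { common.insert(v); }
    }
    out[i].assign(common.begin(), common.end());
  }
  return out;
}
```

The first function yields one boolean per row, while the second yields one duplicate-free list per row, which is the distinction the `_distinct` suffix is meant to signal.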
This adds `duplicate_keep_option` to `cudf::distinct`, allowing callers to specify a `keep` option that selects which of the duplicate elements to keep. It paves the way for many drop-duplicates applications to achieve `O(n)` performance.

A `KEEP_ANY` option is also added to `duplicate_keep_option`; this was attempted in #9417 but did not get merged.

Partially addresses #11050 and #11053.
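As a rough illustration of the keep semantics described above, the sketch below models them on the host with a hash map. It is not the code in this PR: `keep_option`, `key_entry`, and `distinct_indices` are hypothetical names (the enumerators mirror `cudf::duplicate_keep_option`), and the result is returned in ascending index order for readability, which is an assumption of this sketch rather than a guarantee of `cudf::distinct`.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical host-side stand-in for cudf::duplicate_keep_option.
enum class keep_option { KEEP_ANY, KEEP_FIRST, KEEP_LAST, KEEP_NONE };

// Per-key bookkeeping: first and last row index seen, and the occurrence count.
struct key_entry {
  std::size_t first;
  std::size_t last;
  std::size_t count;
};

// Returns the indices of the rows to keep, in ascending order.
// One hash-map pass over the keys plus one pass over the rows: O(n) expected,
// with no sorting of the keys required.
std::vector<std::size_t> distinct_indices(std::vector<int> const& keys, keep_option keep)
{
  std::unordered_map<int, key_entry> seen;
  seen.reserve(keys.size());
  for (std::size_t i = 0; i < keys.size(); ++i) {
    auto [it, inserted] = seen.try_emplace(keys[i], key_entry{i, i, 1});
    if (!inserted) {
      it->second.last = i;
      ++it->second.count;
    }
  }

  std::vector<bool> selected(keys.size(), false);
  for (auto const& kv : seen) {
    key_entry const& e = kv.second;
    switch (keep) {
      case keep_option::KEEP_ANY:  // any representative is valid; this sketch picks the first
      case keep_option::KEEP_FIRST: selected[e.first] = true; break;
      case keep_option::KEEP_LAST: selected[e.last] = true; break;
      case keep_option::KEEP_NONE:
        if (e.count == 1) { selected[e.first] = true; }
        break;
    }
  }

  std::vector<std::size_t> out;
  for (std::size_t i = 0; i < selected.size(); ++i) {
    if (selected[i]) { out.push_back(i); }
  }
  // Every option except KEEP_NONE keeps exactly one row per distinct key.
  assert(keep == keep_option::KEEP_NONE || out.size() == seen.size());
  return out;
}
```

For the keys `{3, 1, 3, 2, 1, 3}`, `KEEP_FIRST` selects rows 0, 1, and 3, `KEEP_LAST` selects rows 3, 4, and 5, and `KEEP_NONE` selects only row 3, since `2` is the only key that occurs once. `KEEP_ANY` relaxes the choice of representative, which gives an implementation more freedom in which row it returns.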
Main implementation: https://github.com/rapidsai/cudf/pull/11052/files#diff-4c2d4268b3c50000ae845ba15a890bb743709c30e5cab4847af7ad633c5a2823R47
Follow up work:
* `lists::drop_list_duplicates` using `cudf::distinct` #11053
* `cudf::distinct` in Python and Java when it has support for `duplicate_keep_option` #11089
* `nan_equality` in `cudf::distinct` #11092