Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement lists::distinct and cudf::detail::stable_distinct #11149

Merged
merged 29 commits into from
Jul 11, 2022

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Jun 24, 2022

This adds new APIs:

  • lists::distinct as a stream compaction component of cudf::lists::, allowing to extract distinct elements from lists in a lists column. The new API does a similar job as lists::drop_list_duplicate but can operate on arbitrary data types while lists::drop_list_duplicate can only work on basic data types and flat structs.
  • cudf::detail::stable_distinct, which is implemented in the main stream compaction module. This API is introduced as just a detail:: API first (which means we can expose it to the public if needed), producing the equivalent output as cudf::distinct but with row order preserved. It is used as a building block to implement lists::distinct.

This PR is a dependency to implement set-like operations (#11043).

Note: This new lists::distinct API will completely replace lists::drop_list_duplicate (which in turn will be deprecated). This will be the follow-up work.

@ttnghia ttnghia added feature request New feature or request 2 - In Progress Currently a work in progress libcudf blocker libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Jun 24, 2022
@ttnghia ttnghia self-assigned this Jun 24, 2022
@github-actions github-actions bot added the CMake CMake build issue label Jun 24, 2022
@ttnghia ttnghia mentioned this pull request Jun 24, 2022
@codecov

This comment was marked as off-topic.

@ttnghia ttnghia changed the title Implement lists::distinct Implement cudf::stable_distinct and lists::distinct Jun 27, 2022
@ttnghia ttnghia changed the title Implement cudf::stable_distinct and lists::distinct Implement cudf::detail::stable_distinct and lists::distinct Jun 27, 2022
@ttnghia ttnghia changed the title Implement cudf::detail::stable_distinct and lists::distinct Implement lists::distinct and cudf::detail::stable_distinct Jun 27, 2022
@ttnghia ttnghia added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jun 28, 2022
Copy link
Contributor

@robertmaynard robertmaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake changes LGTM

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some final nitpicks. Looks good in general.

cpp/src/stream_compaction/stable_distinct.cu Outdated Show resolved Hide resolved
cpp/include/cudf/lists/stream_compaction.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/lists/stream_compaction.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/detail/stream_compaction.hpp Outdated Show resolved Hide resolved
cpp/src/stream_compaction/stable_distinct.cu Outdated Show resolved Hide resolved
@ttnghia ttnghia requested a review from PointKernel June 30, 2022 16:06
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't pay attention to the default stream until this round of review. Otherwise, looks good to me.

cpp/src/lists/stream_compaction/distinct.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@ttnghia
Copy link
Contributor Author

ttnghia commented Jul 11, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 073cbd8 into rapidsai:branch-22.08 Jul 11, 2022
@ttnghia ttnghia deleted the add_lists_distinct branch July 14, 2022 02:48
rapids-bot bot pushed a commit that referenced this pull request Jul 22, 2022
This PR completely removes `cudf::lists::drop_list_duplicates`. It is replaced by the new API `cudf::list::distinct` which has a simpler implementation but better performance. The replacements for internal cudf usage have all been merged before thus there is no side effect or breaking for the existing APIs in this work.

Closes #11114, #11093, #11053, #11034, and closes #9257.

Depends on:
 * #11228
 * #11149
 * #11234
 * #11233

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jordan Jacobelli (https://github.com/Ethyling)
  - Robert Maynard (https://github.com/robertmaynard)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #11236
rapids-bot bot pushed a commit that referenced this pull request Jul 25, 2022
This PR follows up on #11149 to add the lists filtering (stream compaction) APIs to a doxygen group. The previous doxygen group `lists_drop_duplicates` is empty after #11326.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11336
rapids-bot bot pushed a commit that referenced this pull request Jul 26, 2022
This PR adds the following APIs for set operations:
 * `lists::have_overlap`
 * `lists::intersect_distinct`
 * `lists::union_distinct`
 * `lists::difference_distinct`

### Name Convention
Except for the first API (`lists::have_overlap`) that returns a boolean column, the suffix `_distinct` of the rest APIs denotes that their results will be lists columns in which all list rows have been post-processed to remove duplicates. As such, their results are actually "set" columns in which each row is a "set" of distinct elements.

---

Depends on:
 * #10945
 * #11017
 * NVIDIA/cuCollections#175
 * #11052
 * #11118
 * #11100
 * #11149

Closes #10409.

Authors:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11043
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants