Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support set operations #11043

Merged
merged 298 commits into from
Jul 26, 2022
Merged

Support set operations #11043

merged 298 commits into from
Jul 26, 2022

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Jun 3, 2022

This PR adds the following APIs for set operations:

  • lists::have_overlap
  • lists::intersect_distinct
  • lists::union_distinct
  • lists::difference_distinct

Name Convention

Except for the first API (lists::have_overlap) that returns a boolean column, the suffix _distinct of the rest APIs denotes that their results will be lists columns in which all list rows have been post-processed to remove duplicates. As such, their results are actually "set" columns in which each row is a "set" of distinct elements.


Depends on:

Closes #10409.

@ttnghia ttnghia added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. 0 - Blocked Cannot progress due to external reasons Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Jun 3, 2022
@ttnghia ttnghia self-assigned this Jun 3, 2022
@github-actions github-actions bot added CMake CMake build issue conda labels Jun 3, 2022
@codecov

This comment was marked as off-topic.

Copy link
Contributor

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it has the potential to be a very large PR. I would strongly advocate for breaking this up into smaller, easier to review ones. Likely be doing one of the set_ algorithms at a time.

@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 3, 2022

This looks like it has the potential to be a very large PR. I would strongly advocate for breaking this up into smaller, easier to review ones. Likely be doing one of the set_ algorithms at a time.

I'm not sure. If you read my algorithm descriptions (from here) then the implementation for each set-op should be very short. I will try to have smaller dependent PRs if I figured out what to extract from here.

Currently, I see some overlap in implementation for this PR and the cudf::contains with nested type supports (#10656). In both places, will are doing similar steps: calling create/insert/contains with map. Maybe I will extract the common code and put it somewhere in stream_compaction module.

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ttnghia I have some comments attached. I think this is some of your best work I've reviewed. Really great job! The way the set algorithms all fall back to contains is very nice -- I love when we can write concise higher-order algorithms like these.

cpp/include/cudf/lists/set_operations.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/lists/set_operations.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/lists/set_operations.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/lists/set_operations.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/lists/set_operations.hpp Outdated Show resolved Hide resolved
cpp/src/lists/set_operations.cu Show resolved Hide resolved
cpp/src/lists/set_operations.cu Outdated Show resolved Hide resolved
cpp/src/lists/set_operations.cu Show resolved Hide resolved
cpp/include/cudf/lists/set_operations.hpp Show resolved Hide resolved
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor suggestions, otherwise LGTM!

cpp/include/cudf/lists/set_operations.hpp Outdated Show resolved Hide resolved
cpp/src/lists/set_operations.cu Outdated Show resolved Hide resolved
cpp/src/lists/set_operations.cu Outdated Show resolved Hide resolved
@bdice bdice requested a review from PointKernel July 25, 2022 22:10
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

I've learned a lot by going through your PRs in the chain and here we are finally! Thanks @ttnghia for your persistence and effort put into this work.

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
@ttnghia
Copy link
Contributor Author

ttnghia commented Jul 26, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit b5a2efb into rapidsai:branch-22.08 Jul 26, 2022
@ttnghia ttnghia deleted the set_operations branch July 26, 2022 19:46
rapids-bot bot pushed a commit that referenced this pull request Jul 28, 2022
This PR add Java binding for the set-like operations:
 * `lists::have_overlap`
 * `lists::intersect_distinct`
 * `lists::union_distinct`
 * `lists::difference_distinct`

Depends on:
 * #11043
 * #11220

New Java APIs start here: https://github.com/rapidsai/cudf/pull/11143/files#diff-50ba2711690aca8e4f28d7b491373a4dd76443127c8b452a77b6c1fe2388d9e3R3545

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #11143
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support set like operators intersect, union, and difference on lists
7 participants