Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor collect_set to use cudf::distinct and cudf::lists::distinct #11228

Merged
merged 53 commits into from
Jul 15, 2022

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Jul 8, 2022

The current groupby/reducttion collect_set aggregations use lists::drop_list_duplicates to generate set(s) of distinct elements. This PR changes that to use cudf::distinct and cudf::lists::distinct instead, which have some advantages including:

  • Fully supporting nested types, and:
  • Achieving better performance (O(n) instead of O(nlogn)) by internally using hash table instead of segmented sort.

This also enables nested types support for collect_set in spark-rapids (issue NVIDIA/spark-rapids#5508).

The changes in Java code here are only to fix unit tests. Previously, they were implemented with the assumption that the collect_set results are sorted, now they fail when the results are no longer sorted.

@github-actions github-actions bot added the Java Affects Java cuDF API. label Jul 12, 2022
@ttnghia ttnghia added 3 - Ready for Review Ready for review by team and removed 0 - Blocked Cannot progress due to external reasons labels Jul 12, 2022
Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java approval

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nitpicks. Looks good, otherwise.

cpp/src/groupby/sort/aggregate.cpp Outdated Show resolved Hide resolved
cpp/src/reductions/collect_ops.cu Outdated Show resolved Hide resolved
cpp/src/reductions/collect_ops.cu Show resolved Hide resolved
cpp/tests/groupby/collect_set_tests.cpp Outdated Show resolved Hide resolved
@ttnghia ttnghia requested a review from mythrocks July 13, 2022 18:24
@ttnghia ttnghia requested a review from davidwendt July 13, 2022 19:48
Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@mythrocks
Copy link
Contributor

mythrocks commented Jul 14, 2022

Ah, it looks like #11250 wrecked this. (I didn't realize it's been merged.)
#11250 was a simple change. One only need switch drop_list_duplicates to distinct in rolling/detail/rolling.cuh.

Edit: @ttnghia beat me to it. Looks good now. 👍

@ttnghia
Copy link
Contributor Author

ttnghia commented Jul 15, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit b654597 into rapidsai:branch-22.08 Jul 15, 2022
rapids-bot bot pushed a commit that referenced this pull request Jul 22, 2022
This PR completely removes `cudf::lists::drop_list_duplicates`. It is replaced by the new API `cudf::list::distinct` which has a simpler implementation but better performance. The replacements for internal cudf usage have all been merged before thus there is no side effect or breaking for the existing APIs in this work.

Closes #11114, #11093, #11053, #11034, and closes #9257.

Depends on:
 * #11228
 * #11149
 * #11234
 * #11233

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jordan Jacobelli (https://github.com/Ethyling)
  - Robert Maynard (https://github.com/robertmaynard)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #11236
@ttnghia ttnghia deleted the refactor_collect_set branch July 28, 2022 16:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants