Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement groupby collect_set #7420

Merged
merged 24 commits into from
Mar 23, 2021
Merged

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Feb 22, 2021

This partially addresses #2973.

This PR implements groupby collect_set aggregation. The idea of this PR is to simply apply drop_list_duplicates (#7528) to the result generated by groupby collect_list, obtaining collect lists without duplicate entries.

Examples:

keys = {1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3};
vals = {10, 11, 10, 10, 20, 21, 21, 20, 30, 33, 32, 31};

keys_output = {1, 2, 3};
vals_output = {{10, 11}, {20, 21}, {30, 31, 32, 33}};

In this PR, a simple, incomplete Python binding for collect_set has been added, and no Java binding is implemented yet. Complete bindings for those Python/Java sides need to be implemented later in some other separate PRs.

@ttnghia ttnghia requested a review from a team as a code owner February 22, 2021 16:33
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 22, 2021
@ttnghia ttnghia added 2 - In Progress Currently a work in progress 4 - Needs Review Waiting for reviewer to review or respond feature request New feature or request labels Feb 22, 2021
@jrhemstad
Copy link
Contributor

Is this only for rolling? If so, can you clarify in the title? Also, the PR description needs to be more complete as it is used in generating the CHANGELOG.

@ttnghia
Copy link
Contributor Author

ttnghia commented Feb 22, 2021

Is this only for rolling? If so, can you clarify in the title? Also, the PR description needs to be more complete as it is used in generating the CHANGELOG.

Hi Jake. I just started to work on this thus things are still unclear to put in. I'll update more details for the description along the way.

@mythrocks
Copy link
Contributor

Might I suggest a change to the title and description of this issue? Let's just tackle renaming COLLECT to COLLECT_LIST.
(This is assuming that @shwina has no objection.)

We can address COLLECT_SET in groupby and rolling_window as separate tasks.

@mythrocks
Copy link
Contributor

Also, if this is a work in progress, it would be best to switch this to a draft PR rather than use [WIP] in the title.

@shwina
Copy link
Contributor

shwina commented Feb 24, 2021

Changing COLLECT to COLLECT_LIST sounds absolutely fine to me, although for now, we probably still want the name of the aggregation in Cython/Python to be `"collect"``.

@ttnghia ttnghia changed the title [WIP] Implement collect_set aggregation Implement collect_set aggregation [skip ci] Mar 3, 2021
@ttnghia ttnghia removed the 2 - In Progress Currently a work in progress label Mar 3, 2021
@ttnghia ttnghia marked this pull request as draft March 3, 2021 14:28
@karthikeyann karthikeyann added the 2 - In Progress Currently a work in progress label Mar 11, 2021
@ttnghia ttnghia changed the title Implement collect_set aggregation [skip ci] Implement collect_set [skip ci] Mar 12, 2021
@ttnghia ttnghia changed the title Implement collect_set [skip ci] Implement collect_set on top of collect_list and drop_list_duplicates [skip ci] Mar 12, 2021
@ttnghia ttnghia requested a review from a team as a code owner March 18, 2021 13:36
@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 18, 2021

Rerun tests.

1 similar comment
@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 18, 2021

Rerun tests.

Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cmake / conda lgtm, will let @shwina review the Python side 😄

Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java approval

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 18, 2021

Rerun tests.

2 similar comments
@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 18, 2021

Rerun tests.

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 19, 2021

Rerun tests.

# Conflicts:
#	cpp/src/groupby/sort/aggregate.cpp
@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 23, 2021

Rerun tests.

2 similar comments
@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 23, 2021

Rerun tests.

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 23, 2021

Rerun tests.

Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cython LGTM!

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer labels Mar 23, 2021
@kkraus14
Copy link
Collaborator

@gpucibot merge

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 23, 2021

Thanks all 😃

@codecov
Copy link

codecov bot commented Mar 23, 2021

Codecov Report

Merging #7420 (ac3e9a2) into branch-0.19 (7871e7a) will increase coverage by 0.61%.
The diff coverage is n/a.

❗ Current head ac3e9a2 differs from pull request most recent head 918eeb7. Consider uploading reports for the commit 918eeb7 to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #7420      +/-   ##
===============================================
+ Coverage        81.86%   82.47%   +0.61%     
===============================================
  Files              101      101              
  Lines            16884    17402     +518     
===============================================
+ Hits             13822    14353     +531     
+ Misses            3062     3049      -13     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/categorical.py 91.97% <ø> (+0.58%) ⬆️
python/cudf/cudf/core/column/column.py 87.86% <ø> (+0.10%) ⬆️
python/cudf/cudf/core/column/datetime.py 89.63% <ø> (+0.54%) ⬆️
python/cudf/cudf/core/column/decimal.py 92.75% <ø> (-2.12%) ⬇️
python/cudf/cudf/core/column/lists.py 92.50% <ø> (+1.10%) ⬆️
python/cudf/cudf/core/column/numerical.py 94.83% <ø> (-0.20%) ⬇️
python/cudf/cudf/core/column/string.py 86.79% <ø> (+0.30%) ⬆️
python/cudf/cudf/core/column/timedelta.py 88.57% <ø> (+0.33%) ⬆️
python/cudf/cudf/core/column_accessor.py 95.45% <ø> (+0.14%) ⬆️
python/cudf/cudf/core/dataframe.py 90.90% <ø> (+0.44%) ⬆️
... and 61 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d500142...918eeb7. Read the comment docs.

@rapids-bot rapids-bot bot merged commit 30e493c into rapidsai:branch-0.19 Mar 23, 2021
@ttnghia ttnghia self-assigned this Apr 25, 2021
@ttnghia ttnghia deleted the collect_set branch May 3, 2021 21:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants