Use cudf::distinct in Java binding #11232

Merged 2 commits on Jul 8, 2022

@ttnghia ttnghia commented Jul 8, 2022

The Java binding has a dropDuplicates API that removes duplicate rows from a table. Previously, it was implemented by sorting the table and then calling cudf::unique. This PR changes the implementation to call cudf::distinct directly, which can significantly improve performance by avoiding the sort of the input table.
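
For readers unfamiliar with the two strategies, here is a rough CPU-side analogy in plain Java. It does not call the cuDF Java API; the class and method names (`DropDuplicatesSketch`, `sortThenUnique`, `hashDistinct`) are hypothetical, and single integers stand in for table rows. It only illustrates why skipping the sort can be cheaper, and why the output order may differ between the two approaches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DropDuplicatesSketch {
    // Sort-based approach (analogous to sort + unique): O(n log n).
    // Duplicates become adjacent after sorting, so one pass removes them.
    // The result comes back in sorted order.
    static List<Integer> sortThenUnique(List<Integer> rows) {
        List<Integer> sorted = new ArrayList<>(rows);
        Collections.sort(sorted);
        List<Integer> out = new ArrayList<>();
        for (Integer v : sorted) {
            if (out.isEmpty() || !out.get(out.size() - 1).equals(v)) {
                out.add(v);
            }
        }
        return out;
    }

    // Hash-based approach (analogous in spirit to a distinct operation):
    // expected O(n) and no sort. The iteration order of the result is
    // not something callers should rely on.
    static Set<Integer> hashDistinct(List<Integer> rows) {
        return new HashSet<>(rows);
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(3, 1, 3, 2, 1);
        System.out.println("sort + unique: " + sortThenUnique(rows));
        System.out.println("hash distinct: " + hashDistinct(rows));
    }
}
```

Both methods return the same set of values; only the cost and the ordering guarantees differ, which is the kind of output change a caller that assumed sorted results would need to account for.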

@ttnghia ttnghia added the 3 - Ready for Review, Performance, Java, Spark, improvement, and non-breaking labels Jul 8, 2022
@ttnghia ttnghia requested a review from a team as a code owner July 8, 2022 21:20
@ttnghia ttnghia self-assigned this Jul 8, 2022
@revans2 revans2 left a comment

It would have been nice to have not broken an existing public API. But it does not appear to be used right now, and there are semantic changes to the output too, so I am not as concerned.

codecov bot commented Jul 8, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@9ecb672).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08   #11232   +/-   ##
===============================================
  Coverage                ?   86.30%           
===============================================
  Files                   ?      144           
  Lines                   ?    22698           
  Branches                ?        0           
===============================================
  Hits                    ?    19589           
  Misses                  ?     3109           
  Partials                ?        0           


ttnghia commented Jul 8, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 89a8e70 into rapidsai:branch-22.08 Jul 8, 2022
@ttnghia ttnghia deleted the use_distinct_in_java branch July 12, 2022 20:52