Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor MD5 implementation. #9212

Merged
merged 54 commits into from
Oct 21, 2021
Merged

Conversation

bdice
Copy link
Contributor

@bdice bdice commented Sep 10, 2021

This PR refactors the MD5 hash implementation in libcudf. I used the MD5 code as a reference while working on SHA (extending #6020, PR #9215 to follow).

List of high-level changes:

  • I moved the implementation of MD5Hash and related logic from include/cudf/detail/utilities/hash_functions.cuh to src/hash/md5_hash.cu because it is only used in that file and nowhere else. We don't need to include and build MD5 in hash_functions.cuh for all the collections/sorting/groupby tools that only use Murmur3 variants and IdentityHash. (This will be a bigger deal once we add the SHA hash functions, soon to follow this PR, because the size of hash_functions.cuh would be substantially larger without this separation.)
  • I removed an MD5Hash constructor that accepted and stored a seed whose value was unused.
  • Improved use of namespaces.
  • Use named constants instead of magic numbers.
  • Introduced a hash_circular_buffer and refactored dispatch logic.

No changes were made to the feature scope or public APIs of the MD5 feature, so existing unit tests and bindings should remain the same.

@bdice bdice requested a review from a team as a code owner September 10, 2021 03:10
@bdice bdice requested review from harrism and vyasr September 10, 2021 03:10
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 10, 2021
@bdice bdice added improvement Improvement / enhancement to an existing function tech debt non-breaking Non-breaking change labels Sep 10, 2021
@bdice bdice self-assigned this Sep 10, 2021
@bdice
Copy link
Contributor Author

bdice commented Sep 14, 2021

rerun tests

@codecov
Copy link

codecov bot commented Sep 14, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.12@4349232). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head fa7321c differs from pull request most recent head c7d132f. Consider uploading reports for the commit c7d132f to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.12    #9212   +/-   ##
===============================================
  Coverage                ?   10.75%           
===============================================
  Files                   ?      116           
  Lines                   ?    19480           
  Branches                ?        0           
===============================================
  Hits                    ?     2096           
  Misses                  ?    17384           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4349232...c7d132f. Read the comment docs.

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome. I would be curious to know if this improved overall build time as well since hash_functions.cuh is also included in the row_operators.cuh header file.

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just need to remove some unreferenced variables.

cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice cleanup. I recommended some further improvements, but be aware that I may not have caught all of the places where those improvements could be made because there seems to be quite a bit of code duplication in this file. I don't know if you want to try and tackle it in this PR or not, but it may be possible to unify a lot of these implementations to a greater degree with some suitably defined device functions. If you think that's out of scope that's fine, just double-check that my suggestions wouldn't also apply in other places.

cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
@bdice
Copy link
Contributor Author

bdice commented Sep 15, 2021

@davidwendt @vyasr Thanks for the great suggestions! I'll follow up on this soon. Many of these suggestions will also help improve #9215.

bdice added a commit to bdice/cudf that referenced this pull request Sep 16, 2021
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any reason to block this PR beyond other reviewers' comments. Just one question.

cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
@bdice
Copy link
Contributor Author

bdice commented Oct 18, 2021

@vyasr @davidwendt If you want to do a final review of this PR, feel free. I'm aiming to merge by EOD tomorrow to unblock some next steps on SHA work. Thanks!

I ran benchmarks. For a column of int32 values, this code is a bit slower than the previous implementation for large numbers of elements, but I hope that's okay for the moment (18.1 ms in branch-21.12 vs. 20.2 ms in this PR for processing 2^24 elements). I am planning to investigate performance a bit further once I merge this refactored code with the SHA implementation. I suspect the performance might be improved with minor changes that I will apply to both the MD5 and SHA algorithms.

@jrhemstad
Copy link
Contributor

@vyasr @davidwendt If you want to do a final review of this PR, feel free. I'm aiming to merge by EOD tomorrow to unblock some next steps on SHA work. Thanks!

I ran benchmarks. For a column of int32 values, this code is a bit slower than the previous implementation for large numbers of elements, but I hope that's okay for the moment (18.1 ms in branch-21.12 vs. 20.2 ms in this PR for processing 2^24 elements). I am planning to investigate performance a bit further once I merge this refactored code with the SHA implementation. I suspect the performance might be improved with minor changes that I will apply to both the MD5 and SHA algorithms.

For performance, I think the lowest hanging fruit is to move the circular_buffer backing storage into shared memory.

@bdice
Copy link
Contributor Author

bdice commented Oct 18, 2021

@jrhemstad I agree shared memory should be a good perf boost -- I'll give that a shot when I refactor with the SHA code.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a big improvement. Got some further suggestions before merging, but it's almost there. Any performance impact of these changes?

cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Show resolved Hide resolved
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a big improvement. Got some further suggestions before merging, but it's almost there. Any performance impact of these changes?

@bdice
Copy link
Contributor Author

bdice commented Oct 21, 2021

@vyasr This is ready for a final review.

Any performance impact of these changes?

See #9212 (comment). It's slightly slower but I'll refactor this to use shared memory when refactoring with SHA, which should boost performance.

@bdice bdice requested a review from vyasr October 21, 2021 18:24
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple small questions but I think this is good to go however you choose to address them.

cpp/src/hash/md5_hash.cu Outdated Show resolved Hide resolved
cpp/src/hash/md5_hash.cu Show resolved Hide resolved
@bdice
Copy link
Contributor Author

bdice commented Oct 21, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 72694d2 into rapidsai:branch-21.12 Oct 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Needs Review Waiting for reviewer to review or respond improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants