Add accurate hash join size functions #8453

Merged
merged 21 commits into rapidsai:branch-21.08
Jun 14, 2021

Conversation

PointKernel
Member

Addresses #8237

This PR adds three join size APIs to the hash_join class (hash_join::inner_join_size, hash_join::left_join_size, and hash_join::full_join_size), one for each type of join. Each returns the exact number of matches for the specified probe table. The PR also completely removes the deprecated size estimation logic from the current implementation.
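
To illustrate what "exact number of matches" means for each join type, here is a minimal host-side sketch. It is not the libcudf implementation or API: toy_hash_join, its integer keys, and the std::unordered_map standing in for the device hash table are all hypothetical, and null handling is not modeled.

```cpp
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical toy model, NOT the libcudf API: keys are plain ints and the
// "hash table" is a host-side map from build key to its occurrence count.
struct toy_hash_join {
  std::unordered_map<int, std::size_t> build_counts;

  explicit toy_hash_join(std::vector<int> const& build)
  {
    for (int k : build) { ++build_counts[k]; }
  }

  // Exact number of (probe row, build row) pairs with equal keys.
  std::size_t inner_join_size(std::vector<int> const& probe) const
  {
    std::size_t n = 0;
    for (int k : probe) {
      auto const it = build_counts.find(k);
      if (it != build_counts.end()) { n += it->second; }
    }
    return n;
  }

  // Every probe row contributes at least one output row, matched or not.
  std::size_t left_join_size(std::vector<int> const& probe) const
  {
    std::size_t n = 0;
    for (int k : probe) {
      auto const it = build_counts.find(k);
      n += (it != build_counts.end()) ? it->second : 1;
    }
    return n;
  }

  // Two steps: the left join size, plus build rows that no probe row matched.
  std::size_t full_join_size(std::vector<int> const& probe) const
  {
    std::unordered_set<int> const probe_keys(probe.begin(), probe.end());
    std::size_t n = left_join_size(probe);
    for (auto const& [key, count] : build_counts) {
      if (probe_keys.count(key) == 0) { n += count; }
    }
    return n;
  }
};
```

In this toy model, full_join_size reuses left_join_size and then makes a second pass over the build side.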

This PR also updates the existing join APIs by adding an optional output_size argument. If output_size.has_value(), that value is used directly for further computation. Otherwise, the join internally invokes its corresponding size function.
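
The optional-argument pattern described above can be sketched as follows. This is an assumption-laden illustration on host vectors, not the actual libcudf signatures: toy_inner_join, toy_inner_join_size, and index_pair are hypothetical names, and the precomputed size is only used to reserve the output buffer.

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not the libcudf API.
using index_pair = std::pair<std::size_t, std::size_t>;  // (probe index, build index)

// Exact number of equal-key pairs (nested loop for brevity, no hashing).
std::size_t toy_inner_join_size(std::vector<int> const& build, std::vector<int> const& probe)
{
  std::size_t n = 0;
  for (int p : probe) {
    for (int b : build) {
      if (p == b) { ++n; }
    }
  }
  return n;
}

// If the caller already knows the output size, use it directly;
// otherwise fall back to the corresponding size function.
std::vector<index_pair> toy_inner_join(std::vector<int> const& build,
                                       std::vector<int> const& probe,
                                       std::optional<std::size_t> output_size = std::nullopt)
{
  std::size_t const size =
    output_size.has_value() ? *output_size : toy_inner_join_size(build, probe);

  std::vector<index_pair> out;
  out.reserve(size);  // the size only drives allocation in this sketch
  for (std::size_t i = 0; i < probe.size(); ++i) {
    for (std::size_t j = 0; j < build.size(); ++j) {
      if (probe[i] == build[j]) { out.emplace_back(i, j); }
    }
  }
  return out;
}
```

A caller that has already computed the size can pass it in to skip the second counting pass, for example toy_inner_join(build, probe, toy_inner_join_size(build, probe)). In this sketch a wrong value only affects the reservation. A real implementation that allocates exactly output_size elements would depend on the caller passing the correct value.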

TODO: the current full_join_size uses a two-step algorithm similar to the one used in hash_join::full_join. It therefore duplicates some of the computation done in full_join and should be refactored during cuco integration.

PointKernel requested a review from a team as a code owner June 7, 2021 21:30
github-actions bot added the libcudf label Jun 7, 2021
PointKernel added the 3 - Ready for Review, feature request, and libcudf labels and removed the libcudf label Jun 7, 2021
codecov bot commented Jun 8, 2021

Codecov Report

Merging #8453 (6ce1562) into branch-21.08 (709adb1) will increase coverage by 0.38%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-21.08    #8453      +/-   ##
================================================
+ Coverage         82.53%   82.91%   +0.38%     
================================================
  Files               110      110              
  Lines             17739    18094     +355     
================================================
+ Hits              14640    15002     +362     
+ Misses             3099     3092       -7     
Impacted Files Coverage Δ
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/applyutils.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/_version.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_csv.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_orc.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_json.py 100.00% <0.00%> (ø)
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py 100.00% <0.00%> (ø)
python/cudf/cudf/core/column/numerical.py 94.09% <0.00%> (+0.01%) ⬆️
... and 41 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 709adb1...6ce1562.

PointKernel added the breaking label Jun 8, 2021
jrhemstad requested a review from revans2 June 9, 2021 13:47
Contributor

revans2 left a comment

The API looks great for what we would want.

@PointKernel
Member Author

@gpucibot merge

rapids-bot bot merged commit 82005fe into rapidsai:branch-21.08 Jun 14, 2021
PointKernel deleted the hash-join-size branch June 14, 2021 18:33
Labels
3 - Ready for Review: Ready for review by team
breaking: Breaking change
feature request: New feature or request
libcudf: Affects libcudf (C++/CUDA) code.
5 participants