Use std::size_t when computing join output size #9626

Merged
merged 1 commit into from
Nov 8, 2021

Conversation

jlowe
Member

@jlowe jlowe commented Nov 8, 2021

Fixes #9625. Updates hash_join::compute_join_output_size to use std::size_t instead of cudf::size_type as the intermediate type to hold the computed output size.
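To illustrate why the intermediate type matters, here is a minimal sketch that is not part of the PR. It assumes only that `cudf::size_type` is a 32-bit signed integer; the local `size_type` alias and the helpers `fits_in_size_type` and `checked_narrow` are hypothetical names introduced for this example. A count held in `std::size_t` can be range-checked before it is narrowed, whereas casting first loses the information needed for the check.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Local stand-in for cudf::size_type (a 32-bit signed integer).
using size_type = std::int32_t;

// Hypothetical helper: true if a wide row count is representable in size_type.
constexpr bool fits_in_size_type(std::size_t count)
{
  return count <= static_cast<std::size_t>(std::numeric_limits<size_type>::max());
}

// Hypothetical helper: narrow a wide row count only after a range check,
// instead of casting blindly and risking a negative result.
inline size_type checked_narrow(std::size_t count)
{
  if (!fits_in_size_type(count)) {
    throw std::overflow_error("join output size exceeds size_type range");
  }
  return static_cast<size_type>(count);
}
```

With a count of 3,000,000,000 rows, a plain `static_cast<size_type>` on mainstream compilers wraps to a negative value, which matches the symptom described in #9625; `checked_narrow` would throw instead.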

@jlowe jlowe added the bug, libcudf, Spark, and non-breaking labels Nov 8, 2021
@jlowe jlowe self-assigned this Nov 8, 2021
@jlowe jlowe requested a review from a team as a code owner November 8, 2021 16:57
Comment on lines +128 to +132
+  std::size_t size;
   if constexpr (JoinKind == join_kind::LEFT_JOIN) {
-    size = static_cast<size_type>(
-      hash_table.pair_count_outer(iter, iter + probe_table_num_rows, equality, stream.value()));
+    size = hash_table.pair_count_outer(iter, iter + probe_table_num_rows, equality, stream.value());
   } else {
-    size = static_cast<size_type>(
-      hash_table.pair_count(iter, iter + probe_table_num_rows, equality, stream.value()));
+    size = hash_table.pair_count(iter, iter + probe_table_num_rows, equality, stream.value());
Contributor

@ttnghia ttnghia Nov 8, 2021


Nit: you can just return directly from each branch:

  if constexpr (JoinKind == join_kind::LEFT_JOIN) {
    return hash_table.pair_count_outer(iter, iter + probe_table_num_rows, equality, stream.value());
  } else {
    return hash_table.pair_count(iter, iter + probe_table_num_rows, equality, stream.value());
  }
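The suggestion above can be sketched in a self-contained form. This is an illustration only, not the libcudf code: `fake_hash_table`, its two fields, and the local `join_kind` enum are hypothetical stand-ins for the real hash table and its `pair_count` / `pair_count_outer` calls. It shows that when the function's return type is `std::size_t`, each `if constexpr` branch can return the count directly, with no intermediate variable and no narrowing cast.

```cpp
#include <cstddef>

// Local stand-in for the join kinds referenced in the review.
enum class join_kind { INNER_JOIN, LEFT_JOIN };

// Hypothetical stand-in for the hash table: it simply stores precomputed counts.
struct fake_hash_table {
  std::size_t matches;               // number of matching (build, probe) pairs
  std::size_t unmatched_probe_rows;  // probe rows with no match

  // Inner-join style count: matching pairs only.
  std::size_t pair_count() const { return matches; }

  // Outer-join style count: every unmatched probe row still yields one output row.
  std::size_t pair_count_outer() const { return matches + unmatched_probe_rows; }
};

// Returns the count straight from the selected branch; only one branch is
// instantiated for a given JoinKind.
template <join_kind JoinKind>
std::size_t compute_join_output_size(fake_hash_table const& hash_table)
{
  if constexpr (JoinKind == join_kind::LEFT_JOIN) {
    return hash_table.pair_count_outer();
  } else {
    return hash_table.pair_count();
  }
}
```

For example, with 10 matches and 2 unmatched probe rows, the `LEFT_JOIN` instantiation returns 12 and the `INNER_JOIN` instantiation returns 10.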

@ttnghia
Contributor

ttnghia commented Nov 8, 2021

Rerun tests.

Contributor

@mythrocks mythrocks left a comment


LGTM! It's a wonder that this worked before for (moderately) explosive joins.
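As a rough illustration of how quickly an "explosive" join can exceed the 32-bit range, the following sketch counts inner-join output rows on the host. It is an assumption-laden toy, not how libcudf computes the size: `inner_join_output_size` and `duplicate_key_join_size` are hypothetical names, and keys are plain `int` values. Each probe row contributes one output row per build row with the same key, so duplicated keys multiply.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper: counts inner-join output rows by summing, for every
// probe key, the number of build rows that carry the same key.
inline std::size_t inner_join_output_size(std::vector<int> const& build_keys,
                                          std::vector<int> const& probe_keys)
{
  std::unordered_map<int, std::size_t> build_counts;
  for (int key : build_keys) { ++build_counts[key]; }

  std::size_t total = 0;  // accumulated in a wide type on purpose
  for (int key : probe_keys) {
    auto const it = build_counts.find(key);
    if (it != build_counts.end()) { total += it->second; }
  }
  return total;
}

// Hypothetical helper: output rows when every build row and every probe row
// share one single key, so each probe row matches every build row.
constexpr std::size_t duplicate_key_join_size(std::size_t build_rows, std::size_t probe_rows)
{
  return build_rows * probe_rows;
}
```

With 50,000 rows on each side all sharing one key, the output is 2,500,000,000 rows, which is already above the `int32_t` maximum of 2,147,483,647.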

@codecov

codecov bot commented Nov 8, 2021

Codecov Report

Merging #9626 (00d9ba9) into branch-21.12 (ab4bfaa) will decrease coverage by 0.12%.
The diff coverage is n/a.


@@               Coverage Diff                @@
##           branch-21.12    #9626      +/-   ##
================================================
- Coverage         10.79%   10.66%   -0.13%     
================================================
  Files               116      117       +1     
  Lines             18869    19825     +956     
================================================
+ Hits               2036     2115      +79     
- Misses            16833    17710     +877     
Impacted Files                           Coverage Δ
python/dask_cudf/dask_cudf/sorting.py    92.90% <0.00%> (-1.21%) ⬇️
python/cudf/cudf/io/csv.py               0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py               0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py               0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py             0.00% <0.00%> (ø)
python/cudf/cudf/_version.py             0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py             0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py            0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py            0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py           0.00% <0.00%> (ø)
... and 67 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eda31b6...00d9ba9.

@PointKernel
Member

Oh, it's my mistake when refactoring hash join. Thanks for fixing it!

@jlowe
Member Author

jlowe commented Nov 8, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 281fed9 into rapidsai:branch-21.12 Nov 8, 2021
@jlowe jlowe deleted the fix-join-output-size branch November 8, 2021 22:05
Labels
bug, libcudf, non-breaking, Spark
Development

Successfully merging this pull request may close these issues.

[BUG] join output row count returns negative number when row count exceeds int32_t
7 participants