[BUG] join output row count returns negative number when row count exceeds int32_t #9625

jlowe · 2021-11-08T16:51:39Z

Describe the bug
When an inner join would produce more output rows than fit in an int32_t (more than 2^31 - 1), hash_join::inner_join_size returns a negative number rather than the correct result.

Steps/Code to reproduce bug
Apply the following patch and run JOIN_TEST.

diff --git a/cpp/tests/join/join_tests.cpp b/cpp/tests/join/join_tests.cpp
index d64b40c38b..e6ae709f00 100644
--- a/cpp/tests/join/join_tests.cpp
+++ b/cpp/tests/join/join_tests.cpp
@@ -1418,6 +1418,19 @@ TEST_F(JoinTest, HashJoinWithStructsAndNulls)
   }
 }
 
+TEST_F(JoinTest, HashJoinLargeOutputSize)
+{
+  // self-join a table of zeroes to generate an output row count that would overflow int32_t
+  std::size_t col_size = 65567;
+  rmm::device_buffer zeroes(col_size * sizeof(int32_t), rmm::cuda_stream_default);
+  CUDA_TRY(cudaMemsetAsync(zeroes.data(), 0, zeroes.size(), rmm::cuda_stream_default.value()));
+  cudf::column_view col_zeros(cudf::data_type{cudf::type_id::INT32}, col_size, zeroes.data());
+  cudf::table_view tview{{col_zeros}};
+  cudf::hash_join hash_join(tview, cudf::null_equality::UNEQUAL);
+  std::size_t output_size = hash_join.inner_join_size(tview);
+  EXPECT_EQ(col_size * col_size, output_size);
+}
+
 struct JoinDictionaryTest : public cudf::test::BaseFixture {
 };

Expected behavior
The output row count is correct even if the value exceeds 31 bits.

Fixes #9625. Updates `hash_join::compute_join_output_size` to use `std::size_t` instead of `cudf::size_type` as the intermediate type to hold the computed output size.

Authors:
- Jason Lowe (https://github.com/jlowe)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Alessandro Bellina (https://github.com/abellina)
- MithunR (https://github.com/mythrocks)
- Mike Wilson (https://github.com/hyperbolic2346)
- https://github.com/nvdbaranec

URL: #9626

jlowe added the bug, Needs Triage, libcudf, and Spark labels Nov 8, 2021

jlowe removed the Needs Triage label Nov 8, 2021

jlowe mentioned this issue Nov 8, 2021

Use std::size_t when computing join output size #9626

Merged

jlowe self-assigned this Nov 8, 2021

rapids-bot bot closed this as completed in #9626 Nov 8, 2021
