Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] cudf::hash_join compute size fails on struct columns #9095

Closed
jlowe opened this issue Aug 23, 2021 · 0 comments · Fixed by #9107
Closed

[BUG] cudf::hash_join compute size fails on struct columns #9095

jlowe opened this issue Aug 23, 2021 · 0 comments · Fixed by #9107
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@jlowe
Copy link
Member

jlowe commented Aug 23, 2021

Describe the bug
Using cudf::hash_join to calculate the output row count for a join involving struct columns results in the following exception being thrown.

cuDF failure at: ../include/cudf/table/row_operators.cuh:231: Mismatched number of columns.

It looks like the build table is flattened when the hash is built but the probe table isn't also flattened when computing the output size.

Steps/Code to reproduce bug
Apply the following patch and run the libcudf join tests.

diff --git a/cpp/tests/join/join_tests.cpp b/cpp/tests/join/join_tests.cpp
index e468368842..22bf7aa2ff 100644
--- a/cpp/tests/join/join_tests.cpp
+++ b/cpp/tests/join/join_tests.cpp
@@ -501,6 +501,50 @@ TEST_F(JoinTest, LeftJoinWithStructsAndNulls)
   CUDF_TEST_EXPECT_TABLES_EQUIVALENT(*sorted_gold, *sorted_result);
 }
 
+TEST_F(JoinTest, LeftHashJoinWithStructsAndNulls)
+{
+  column_wrapper<int32_t> col0_0{{3, 1, 2, 0, 2}};
+  strcol_wrapper col0_1({"s1", "s1", "", "s4", "s0"}, {1, 1, 0, 1, 1});
+  column_wrapper<int32_t> col0_2{{0, 1, 2, 4, 1}};
+  auto col0_names_col = strcol_wrapper{
+    "Samuel Vimes", "Carrot Ironfoundersson", "Detritus", "Samuel Vimes", "Angua von Überwald"};
+  auto col0_ages_col = column_wrapper<int32_t>{{48, 27, 351, 31, 25}};
+
+  auto col0_is_human_col = column_wrapper<bool>{{true, true, false, false, false}, {1, 1, 0, 1, 0}};
+
+  auto col0_3 =
+    cudf::test::structs_column_wrapper{{col0_names_col, col0_ages_col, col0_is_human_col}};
+
+  column_wrapper<int32_t> col1_0{{2, 2, 0, 4, 3}};
+  strcol_wrapper col1_1({"s1", "s0", "s1", "s2", "s1"});
+  column_wrapper<int32_t> col1_2{{1, 0, 1, 2, 1}, {1, 0, 1, 1, 1}};
+  auto col1_names_col = strcol_wrapper{
+    "Samuel Vimes", "Detritus", "Detritus", "Carrot Ironfoundersson", "Angua von Überwald"};
+  auto col1_ages_col = column_wrapper<int32_t>{{48, 35, 351, 22, 25}};
+
+  auto col1_is_human_col = column_wrapper<bool>{{true, true, false, false, true}, {1, 1, 0, 1, 1}};
+
+  auto col1_3 =
+    cudf::test::structs_column_wrapper{{col1_names_col, col1_ages_col, col1_is_human_col}};
+
+  CVector cols0, cols1;
+  cols0.push_back(col0_0.release());
+  cols0.push_back(col0_1.release());
+  cols0.push_back(col0_2.release());
+  cols0.push_back(col0_3.release());
+  cols1.push_back(col1_0.release());
+  cols1.push_back(col1_1.release());
+  cols1.push_back(col1_2.release());
+  cols1.push_back(col1_3.release());
+
+  Table t0(std::move(cols0));
+  Table t1(std::move(cols1));
+
+  cudf::hash_join hash_join(t1, cudf::null_equality::EQUAL);
+  auto output_size = hash_join.left_join_size(t0, cudf::null_equality::EQUAL);
+  EXPECT_EQ(output_size, 5);
+}
+
 TEST_F(JoinTest, LeftJoinOnNulls)
 {
   // clang-format off

Expected behavior
The test above should pass.

@jlowe jlowe added bug Something isn't working Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS labels Aug 23, 2021
@ttnghia ttnghia self-assigned this Aug 24, 2021
@ttnghia ttnghia removed the Needs Triage Need team to review and classify label Aug 24, 2021
@ttnghia ttnghia assigned jlowe and unassigned ttnghia Aug 24, 2021
rapids-bot bot pushed a commit that referenced this issue Aug 25, 2021
Fixes #9095.

This adds calls to `flatten_nested_columns` in the `cudf::hash_join` join output size APIs along with tests for joins on struct columns using `cudf::hash_join`.

Authors:
  - Jason Lowe (https://github.com/jlowe)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #9107
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants