Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] corruption in string column after contig split + partition #10717

Closed
abellina opened this issue Apr 22, 2022 · 3 comments · Fixed by #10724
Closed

[BUG] corruption in string column after contig split + partition #10717

abellina opened this issue Apr 22, 2022 · 3 comments · Fixed by #10724
Assignees
Labels
bug Something isn't working

Comments

@abellina
Copy link
Contributor

This issue happens after a combination of cudf calls are applied and started occurring after this PR was merged: #10673.

If we first call contiguous_split with 0 splits (an operation that should return the original data but packed in a single buffer) and later call partition on a row that has an empty string "", the result is a null string instead of the original empty string.

Before the linked PR this wasn't an issue (as we noticed this in tests that broke NVIDIA/spark-rapids#5286). If contiguous_split is removed from the sequence of calls, the issue doesn't manifest.

Here's a repro case, and thanks to @davidwendt for taking a cuDF java example I had and converting it to the C++ equivalent.

TEST_F(ContiguousSplitTableCornerCases, SplitEmptyColumn)
{
  cudf::test::strings_column_wrapper a{""};
  cudf::table_view t({a});
  auto cs = cudf::contiguous_split(t, {});
  std::cout << "csplit size = " << cs.size() << std::endl;
  cudf::test::print(cs.front().table.column(0));
  std::cout << "csplit columns = " << cs.front().table.num_columns() << "\n";
  std::cout << "csplit rows = " << cs.front().table.num_rows() << "\n";
  cudf::test::print(cs.back().table.column(0));

  auto map    = cudf::test::fixed_width_column_wrapper<int32_t>{0};
  auto result = cudf::partition(cs.front().table, map, 2);
  std::cout << "partition size = " << result.second.size() << "\n";
  std::cout << "partition table columns = " << result.first->num_columns() << "\n";
  std::cout << "partition table rows = " << result.first->num_rows() << "\n";
  cudf::test::print(result.first->view().column(0));
}

In this case, the last print: cudf::test::print(result.first->view().column(0));, prints NULL when it should be the empty string.

Thanks to @jlowe and @revans2 for pointing me in the right direction.

@abellina abellina added bug Something isn't working Needs Triage Need team to review and classify labels Apr 22, 2022
@abellina
Copy link
Contributor Author

@davidwendt graciously offered to take a look at this issue, so I've assigned it to him.

@abellina
Copy link
Contributor Author

abellina commented Apr 24, 2022

This seems like another case of: #4480, where data() for the empty string is null, so when we compare it with the null placeholder it converts the value to null when the scatter code is trying to create the resulting column.

This patch fixes this particular instance, but I am not 100% sure if this is the correct place, but essentially I believe we should detect that the element is the empty string, and return string_view{"", 0} instead of string_view{nullptr, 0}.

If this is the correct place (column_device_view::element(size_type) which is overloaded for string_view) , I am happy to put up a PR + test.

diff --git a/cpp/include/cudf/column/column_device_view.cuh b/cpp/include/cudf/column/column_device_view.cuh
index 070ca80..6508d6e 100644
--- a/cpp/include/cudf/column/column_device_view.cuh
+++ b/cpp/include/cudf/column/column_device_view.cuh
@@ -395,8 +395,14 @@ class alignas(16) column_device_view : public detail::column_device_view_base {
     size_type index       = element_index + offset();  // account for this view's _offset
     const auto* d_offsets = d_children[strings_column_view::offsets_column_index].data<int32_t>();
     const char* d_strings = d_children[strings_column_view::chars_column_index].data<char>();
-    size_type offset      = d_offsets[index];
-    return string_view{d_strings + offset, d_offsets[index + 1] - offset};
+    size_type offset      = d_offsets[index];   
+    const char* element   = d_strings + offset;
+    // special case empty string
+    if (element == nullptr and !is_null(element_index)) {
+      return string_view{"", 0};
+    }
+
+    return string_view{element, d_offsets[index + 1] - offset};
   }

@davidwendt
Copy link
Contributor

@abellina I don't agree with this solution. I want to keep this function as efficient as possible. I think there is a better way to fix this where the specific factory functions are used (which are rarely). I will work on this tomorrow.

rapids-bot bot pushed a commit that referenced this issue Apr 27, 2022
Closes #10717 

Fixes bug introduced with changes in #10673 which uses the `cudf::make_strings_column` that accepts a span of `string_view` objects with a null-placeholder. The placeholder can be unintentionally created in `create_string_vector_from_column` when given a strings column where all the rows are empty. The utility is fixed to prevent creating the placeholder for empty strings.

A gtest was added to scatter from/to an all-empty strings column to verify this behavior.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #10724
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants