Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memcheck error found in STRINGS_TEST #13578

Merged

Conversation

davidwendt
Copy link
Contributor

Description

Fixes a memcheck error found in STRINGS_TEST where an atomicOr was used on a boolean device scalar. The workaround uses a cub::WarpReduce to compute the result in the warp-per-string kernel.

Reference #13574

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) non-breaking Non-breaking change labels Jun 14, 2023
@davidwendt davidwendt self-assigned this Jun 14, 2023
@davidwendt davidwendt requested a review from a team as a code owner June 14, 2023 19:04
@bdice
Copy link
Contributor

bdice commented Jun 14, 2023

Does this change have performance implications?

@davidwendt
Copy link
Contributor Author

Does this change have performance implications?

Nothing noticeable. I just thought the atomicOr code looked cleaner with the thrust::for_each taking care of all the kernel launch details. Maybe @PointKernel could recommend the cuda::atomic_ref equivalent? I had this working before he mentioned replacing atomicOr.

contains_warp_fn{*d_strings, d_target, results_view.data<bool>()});
auto const d_strings = column_device_view::create(input.parent(), stream);
constexpr int block_size = 256;
cudf::detail::grid_1d grid{input.size() * cudf::detail::warp_size, block_size};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if input.size() * cudf::detail::warp_size overflow? grid_1d only has int members.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I just found the similar code in other files (attributes.cu and find.cu). So this may be a new potential issue.

Copy link
Contributor Author

@davidwendt davidwendt Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You missed this line of code perhaps?
https://github.com/rapidsai/cudf/pull/13578/files#diff-048f86c21559b14f64f86aaeaa57776d366c3a4948a5aba7c0ab1a3801be87bcR292

 if (idx >= (d_strings.size() * cudf::detail::warp_size)) { return; }

That did not format too well. It is line 292 currently.
This line is in attributes.cu as well.

Copy link
Contributor

@ttnghia ttnghia Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, that line is inside the kernel, while this line is before kernel launch. If we have overflow here, we may still launch a kernel with some (large?) input. We should avoid launching the kernel from here instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An overflow cannot technically occur here since this code path is only for long strings which is always much greater than 32 bytes on average. This means the number of rows * 32 will never overflow under these conditions.

if (d_target.compare(d_str.data() + i, d_target.size_bytes()) == 0) { found = true; }
}
if (found) { atomicOr(d_results + str_idx, true); }
if (idx >= (d_strings.size() * cudf::detail::warp_size)) { return; }
Copy link
Contributor

@ttnghia ttnghia Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I suspect that this mul can overflow, as all the operands are of type int. So maybe we should cast into int64_t?

Suggested change
if (idx >= (d_strings.size() * cudf::detail::warp_size)) { return; }
if (static_cast<int64_t>(idx) >= static_cast<...>(d_strings.size()) * static_cast<...>(cudf::detail::warp_size)) { return; }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An overflow cannot technically occur here since this code path is only for long strings which is always much greater than 32 bytes on average. This means the (number of rows * 32) will never overflow under these conditions.

@PointKernel
Copy link
Member

I was inclined to use atomic_ref for this issue and now realize the warp-reduce workaround is a better solution: atomic_ref supports exclusively 4-byte or 8-byte types since only 32-bit and 64-bit atomic CAS are supported at the hardware level (plus 128-bit on hopper). Technically, we should never do atomic operations over bools.

@davidwendt davidwendt requested a review from bdice June 20, 2023 20:11
@davidwendt davidwendt changed the title Fix memcheck error found in STRINGS_TEST Fix memcheck error found in STRINGS_TEST Jun 21, 2023
}
};
auto const result = warp_reduce(temp_storage).Reduce(found, cub::Max());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we could get an early-exit benefit by checking the warp-reduced result before reading the full string. (Discussed offline with @davidwendt.) I don't have a good expectation for the synchronization cost of a single warp sync. It'll probably be slower, but I'd like to learn by how much.

@davidwendt
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 7cbef2a into rapidsai:branch-23.08 Jun 22, 2023
@davidwendt davidwendt deleted the bug-strings-find-memcheck branch June 22, 2023 12:26
rapids-bot bot pushed a commit that referenced this pull request Jul 10, 2023
Contributes to #13575

Depends on #13574, #13578

This PR cleans up custom atomic implementations in libcudf by using `cuda::atomic_ref` when possible. It removes atomic bitwise operations like `and`, `or` and `xor` since libcudac++ already provides proper replacements.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - David Wendt (https://github.com/davidwendt)

URL: #13583
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants