[BUG] inclusive_scan_by_key OOM on >= INT_MAX elements #766
For the sizes less than 2^31, is there any pattern to those failures? E.g., is it only with certain hardware or compiler flags, or is it just random from run to run in the same binary?
@allisonvacanti I've only tested on a Turing RTX 8000, so I'm not sure whether other devices have different limits. In this standalone repro with GCC 7.5/CUDA 10.2, the limit appears to be fixed; it does not change between runs of the same binary.
The OOM portion of the bug is solved by switching num_items from int to size_t. However, that reveals an invalid memory access exception deeper in the kernel. Right now I've tracked it to a suspiciously large tile_base.
@cwharris Does the branch you're looking at include NVIDIA/thrust@8f876ba ? That fixed several similar overflows in tile size calculations. |
I'm seeing this behavior on Thrust 1.12.0.
In reality it looks like tile_base is jumping from near INT_MAX to near SIZE_MAX once it reaches a certain threshold. I'll post some example output and the printf I've added for debugging.
I've added the following printf immediately following
here is part of the output I'm getting (I've sorted the lines so tile_idx appears in order)...
Finally, the error I see is.
This is reproducible using the example code provided by @trxcllnt and modifying the 1.12.0 source to add the printf.
Fixed. Still need to verify high-level functionality.
With the aforementioned changes:

diff --git a/thrust/system/cuda/detail/scan_by_key.h b/thrust/system/cuda/detail/scan_by_key.h
index fe4b321c..b3974c69 100644
--- a/thrust/system/cuda/detail/scan_by_key.h
+++ b/thrust/system/cuda/detail/scan_by_key.h
@@ -513,7 +513,7 @@ namespace __scan_by_key {
scan_op(scan_op_)
{
int tile_idx = blockIdx.x;
- Size tile_base = ITEMS_PER_TILE * tile_idx;
+ Size tile_base = static_cast<Size>(ITEMS_PER_TILE) * tile_idx;
Size num_remaining = num_items - tile_base;
if (num_remaining > ITEMS_PER_TILE)
@@ -734,7 +734,7 @@ namespace __scan_by_key {
ScanOp scan_op,
AddInitToScan add_init_to_scan)
{
- int num_items = static_cast<int>(thrust::distance(keys_first, keys_last));
+ size_t num_items = static_cast<size_t>(thrust::distance(keys_first, keys_last));
size_t storage_size = 0;
cudaStream_t stream = cuda_cub::stream(policy);
bool debug_sync = THRUST_DEBUG_SYNC_FLAG;
I'm able to run the following code successfully:

// test_inclusive_scan_limit.cu
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/scan.h>
#include <cstdint>
#include <iostream>
int main(void) {
uint64_t const scan_by_key_limit = (1uL << 41uL);
auto keys = thrust::make_constant_iterator<uint64_t>(0);
auto vals = thrust::make_constant_iterator<uint64_t>(1);
auto iota = thrust::make_counting_iterator<uint64_t>(0);
thrust::host_vector<uint64_t> h_sums(1);
thrust::device_vector<uint64_t> d_sums(2);
std::cout << "size: " << scan_by_key_limit << std::endl;
// make a scatter output iterator to effectively ignore/compress all but the last value
// (device lambdas require nvcc's --expt-extended-lambda flag; the lambda must
// capture scan_by_key_limit by value)
auto scatter_out = thrust::make_permutation_iterator(
    d_sums.begin(),
    thrust::make_transform_iterator(iota, [=] __device__ (uint64_t idx) {
        if (idx == scan_by_key_limit - 1) { return 0; } else { return 1; }
    }));
auto s_end = thrust::inclusive_scan_by_key(
thrust::device, keys, keys + scan_by_key_limit, vals, scatter_out);
thrust::copy(s_end - 1, s_end, h_sums.begin());
std::cout << "sums: " << h_sums[0] << std::endl;
}

and get the following output
Thanks for tracking this down, @cwharris! The changes look reasonable to me, though I'm surprised that the placement of the cast matters. Could you submit a PR with your changes?
It doesn't matter which argument gets cast to the wider type:

#include <stdint.h>
enum {
ITEMS_PER_TILE = 9 * 256
};
__global__ void kernel(uint64_t* res) {
int tile_idx = blockIdx.x;
if constexpr (true) {
uint64_t tile_base = ITEMS_PER_TILE * static_cast<uint64_t>(tile_idx);
res[tile_idx] = tile_base;
} else {
uint64_t tile_base = ITEMS_PER_TILE * tile_idx;
res[tile_idx] = tile_base;
}
}

The PTX nvcc emits (shown as a diff of the narrow multiply vs. the widened one):

.visible .entry _Z6kernelPm(
.param .u64 _Z6kernelPm_param_0
)
{
ld.param.u64 %rd1, [_Z6kernelPm_param_0];
cvta.to.global.u64 %rd2, %rd1;
mov.u32 %r1, %ctaid.x;
- mul.lo.s32 %r2, %r1, 2304;
- cvt.s64.s32 %rd3, %r2;
+ mul.wide.s32 %rd3, %r1, 2304;
mul.wide.s32 %rd4, %r1, 8;
add.s64 %rd5, %rd2, %rd4;
st.global.u64 [%rd5], %rd3;
}
vs how gcc handles something similar:

mov rbp, rsp
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-28], esi
mov edx, DWORD PTR [rbp-28]
- movsx rdx, edx
- mov rax, rdx
- sal rax, 3
- add rax, rdx
- sal rax, 8
+ mov eax, edx
+ sal eax, 3
+ add eax, edx
+ sal eax, 8
+ cdqe
mov QWORD PTR [rbp-8], rax
mov eax, DWORD PTR [rbp-28]
cdqe

In the gcc case, it seems the result is the same for both sets of operations.
D'oh, I misread the diff originally -- that totally makes sense. Thanks for digging into this. I'll take a look at your PR in a minute.
Possibly related to the fix at NVIDIA/cub#221, we're seeing an OOM in inclusive_scan_by_key when the input size is close to (but not exactly) INT_MAX. The limit seems to depend on the accumulator type -- sometimes (1 << 31) - 1 works, but other times it's (1 << 31) - 2048 or (1 << 31) - 4096.

Tested in the nvidia/cuda:10.2-devel-ubuntu18.04 container with GCC 7.5.0/CUDA 10.2, but present in newer versions too.