[BUG] inclusive_scan_by_key OOM on >= INT_MAX elements #766

Open
trxcllnt opened this issue Apr 9, 2021 · 12 comments
Labels: thrust (For all items related to Thrust.)

@trxcllnt (Member) commented Apr 9, 2021

Possibly related to the fix at NVIDIA/cub#221, we're seeing an OOM in inclusive_scan_by_key when input size is close to (but not exactly) INT_MAX. The limit seems to depend on the accumulator type -- sometimes (1 << 31) - 1 works, but other times it's (1 << 31) - 2048 or (1 << 31) - 4096.

Tested in nvidia/cuda:10.2-devel-ubuntu18.04 container with GCC 7.5.0/CUDA 10.2, but present in newer versions too.

# CMakeLists.txt
cmake_minimum_required(VERSION 3.20)
project(test VERSION 1.0.0 LANGUAGES C CXX CUDA)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

# https://github.com/cpm-cmake/CPM.cmake/blob/310efb9b17d0befe9ccd4f5bf2e39942869777fc/cmake/get_cpm.cmake
set(CPM_DOWNLOAD_VERSION 0.32.0)
if(CPM_SOURCE_CACHE)
  # Expand relative path. This is important if the provided path contains a tilde (~)
  get_filename_component(CPM_SOURCE_CACHE ${CPM_SOURCE_CACHE} ABSOLUTE)
  set(CPM_DOWNLOAD_LOCATION "${CPM_SOURCE_CACHE}/cpm/CPM_${CPM_DOWNLOAD_VERSION}.cmake")
elseif(DEFINED ENV{CPM_SOURCE_CACHE})
  set(CPM_DOWNLOAD_LOCATION "$ENV{CPM_SOURCE_CACHE}/cpm/CPM_${CPM_DOWNLOAD_VERSION}.cmake")
else()
  set(CPM_DOWNLOAD_LOCATION "${CMAKE_BINARY_DIR}/cmake/CPM_${CPM_DOWNLOAD_VERSION}.cmake")
endif()
if(NOT (EXISTS ${CPM_DOWNLOAD_LOCATION}))
  message(STATUS "Downloading CPM.cmake to ${CPM_DOWNLOAD_LOCATION}")
  file(DOWNLOAD
       https://github.com/cpm-cmake/CPM.cmake/releases/download/v${CPM_DOWNLOAD_VERSION}/CPM.cmake
       ${CPM_DOWNLOAD_LOCATION}
  )
endif()
include(${CPM_DOWNLOAD_LOCATION})

# Get Thrust v1.12.0
CPMAddPackage(NAME      Thrust
    VERSION             1.12.0
    GIT_REPOSITORY      https://github.com/NVIDIA/thrust.git
    GIT_TAG             1.12.0
    GIT_SHALLOW         TRUE
    UPDATE_DISCONNECTED TRUE)

thrust_create_target(test::Thrust FROM_OPTIONS)

add_executable(test_inclusive_scan_limit test_inclusive_scan_limit.cu)

target_compile_features(test_inclusive_scan_limit
    PRIVATE cxx_std_14 $<BUILD_INTERFACE:cuda_std_14>)

target_link_libraries(test_inclusive_scan_limit test::Thrust)

execute_process(COMMAND cmake -E create_symlink ${CMAKE_CURRENT_BINARY_DIR}/compile_commands.json  compile_commands.json)

// test_inclusive_scan_limit.cu
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/scan.h>

#include <cstdint>
#include <iostream>

int main(void) {
  //
  // limit is somewhere between these values, otherwise get this error:
  // terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
  //   what():  std::bad_alloc: cudaErrorMemoryAllocation: out of memory
  //
  auto const scan_by_key_limit = (1uL << 31uL);  // error
  // auto const scan_by_key_limit = (1uL << 31uL) - 1uL;  // no error
  // auto const scan_by_key_limit = (1uL << 31uL) - 2048uL;  // no error
  // auto const scan_by_key_limit = (1uL << 31uL) - 4096uL;  // no error

  auto keys = thrust::make_constant_iterator(0);
  auto vals = thrust::make_counting_iterator(uint64_t{0});

  thrust::host_vector<uint64_t> h_sum(1);
  thrust::device_vector<uint64_t> d_sums(scan_by_key_limit);

  std::cout << "size: " << scan_by_key_limit << std::endl;

  // OOM here
  auto s_end = thrust::inclusive_scan_by_key(
      thrust::device, keys, keys + scan_by_key_limit, vals, d_sums.begin());

  thrust::copy(s_end - 1, s_end, h_sum.begin());

  std::cout << "sums: " << h_sum[0] << std::endl;
}
@alliepiper (Collaborator)

For the sizes less than 2^31, is there any pattern to those failures? E.g., is it only with certain hardware or compiler flags, or is it just random from run-to-run in the same binary?

@trxcllnt (Member, Author) commented Apr 9, 2021

@allisonvacanti I've only tested on a Turing RTX 8000, so I'm not sure if other devices have different limits.

In this standalone repro with GCC 7.5/CUDA 10.2, it seems the limit is 2^31 - 1, so compiler flags could be affecting things. I originally ran this test in rapidsai/cuspatial, and the limit was between 2^31 - (2048|4096) in both GCC 7.5/CUDA 10.2 and GCC 9.3/CUDA 11.2.

It does not seem to change between runs of the same binary.

@cwharris (Contributor) commented May 3, 2021

The OOM portion of the bug is solved by switching int to size_t on this line:

https://github.com/NVIDIA/thrust/blob/fa54f2c6f1217237953f27ddf67f901b6b34fbdd/thrust/system/cuda/detail/scan_by_key.h#L737

However, that reveals an invalid memory access exception deeper in the kernel. Right now I've tracked it to a suspiciously large tile_base value in scan_by_key.h where it may be overflowing.
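
A minimal sketch of the int truncation behind the OOM (hypothetical; the real temp-storage sizing in scan_by_key.h is more involved, but the cast is the key step):

#include <cstdint>
#include <cstdio>

int main() {
  // 2^31 input elements, as in the repro above
  size_t distance = size_t{1} << 31;

  // the static_cast<int> on the linked line truncates this to INT_MIN
  // (implementation-defined for out-of-range values, but -2147483648
  // on the usual targets)
  int num_items = static_cast<int>(distance);

  // a later conversion back to an unsigned size turns the negative count
  // into an enormous allocation request, hence cudaErrorMemoryAllocation
  size_t as_size = static_cast<size_t>(num_items);

  std::printf("num_items = %d, as unsigned size = %zu\n", num_items, as_size);
  // prints: num_items = -2147483648, as unsigned size = 18446744071562067968
}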

@alliepiper (Collaborator)

@cwharris Does the branch you're looking at include NVIDIA/thrust@8f876ba? That fixed several similar overflows in tile size calculations.

@cwharris (Contributor) commented May 5, 2021

I'm seeing this behavior on 1.12.0.

@cwharris (Contributor) commented May 5, 2021

In reality it looks like tile_base is jumping from near INT_MAX to near SIZE_MAX once it crosses a certain threshold. I'll post some example output and the printf I've added for debugging.

@cwharris (Contributor) commented May 5, 2021

I've added the following printf immediately after the sync_threadblock() that follows the calls to scan_tile(...) in scan_by_key.h:

        // note: the format string prints Size's max value first, then tile_base
        if (threadIdx.x == 0) {
          printf("AFTER SCAN TILE... tile_idx: %i, tile_base: %lu %lu\n",
                 tile_idx,
                 std::numeric_limits<Size>::max(),
                 tile_base);
        }

Here is part of the output I'm getting (I've sorted the lines so tile_idx appears in order):

AFTER SCAN TILE... tile_idx: 922975, tile_base: 18446744073709551615 2126534400
AFTER SCAN TILE... tile_idx: 922976, tile_base: 18446744073709551615 2126536704
AFTER SCAN TILE... tile_idx: 922978, tile_base: 18446744073709551615 2126541312
AFTER SCAN TILE... tile_idx: 922979, tile_base: 18446744073709551615 2126543616
AFTER SCAN TILE... tile_idx: 922980, tile_base: 18446744073709551615 2126545920
AFTER SCAN TILE... tile_idx: 922981, tile_base: 18446744073709551615 2126548224
AFTER SCAN TILE... tile_idx: 922983, tile_base: 18446744073709551615 2126552832
AFTER SCAN TILE... tile_idx: 922984, tile_base: 18446744073709551615 2126555136
AFTER SCAN TILE... tile_idx: 922986, tile_base: 18446744073709551615 2126559744
AFTER SCAN TILE... tile_idx: 922998, tile_base: 18446744073709551615 2126587392
AFTER SCAN TILE... tile_idx: 922999, tile_base: 18446744073709551615 2126589696
AFTER SCAN TILE... tile_idx: 923001, tile_base: 18446744073709551615 2126594304
AFTER SCAN TILE... tile_idx: 923002, tile_base: 18446744073709551615 2126596608
AFTER SCAN TILE... tile_idx: 932162, tile_base: 18446744073709551615 18446744071562285568
AFTER SCAN TILE... tile_idx: 932172, tile_base: 18446744073709551615 18446744071562308608
AFTER SCAN TILE... tile_idx: 932174, tile_base: 18446744073709551615 18446744071562313216
AFTER SCAN TILE... tile_idx: 932176, tile_base: 18446744073709551615 18446744071562317824
AFTER SCAN TILE... tile_idx: 932177, tile_base: 18446744073709551615 18446744071562320128
AFTER SCAN TILE... tile_idx: 932185, tile_base: 18446744073709551615 18446744071562338560
AFTER SCAN TILE... tile_idx: 932187, tile_base: 18446744073709551615 18446744071562343168
AFTER SCAN TILE... tile_idx: 932189, tile_base: 18446744073709551615 18446744071562347776

tile_base goes from a more normal value 2126596608 to 18446744071562285568 in a single step/block, which makes me suspicious of an overflow somewhere in the math leading up to the tile_base calculation. The unexpectedly large value differs from uint64_t max (18446744073709551615) by a value close to INT_MAX (18446744073709551615 - 18446744071562285568 = 2147266047).
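
A minimal sketch of that suspicion (hypothetical, but plugging in ITEMS_PER_TILE = 9 * 256 = 2304 and the first bad tile index from the log reproduces the logged value exactly):

#include <cstdint>
#include <cstdio>

int main() {
  int items_per_tile = 2304;   // 9 items/thread * 256 threads, per the log
  int tile_idx       = 932162; // first tile with a garbage tile_base above

  // 2304 * 932162 = 2147701248 > INT_MAX, so the 32-bit product wraps
  // (signed overflow, formally UB) to -2147266048 ...
  int wrapped = items_per_tile * tile_idx;

  // ... and converting that negative value to a 64-bit Size sign-extends it,
  // reproducing the value from the log
  uint64_t tile_base = static_cast<uint64_t>(wrapped);

  std::printf("tile_base = %lu\n", tile_base);
  // prints: tile_base = 18446744071562285568
}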

Finally, the error I see is:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  scan_by_key: failed on 2nd step: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted (core dumped)

This is reproducible using the example code provided by @trxcllnt, after modifying the 1.12.0 source to add the printf mentioned in this comment and to use size_t instead of int on this line: https://github.com/NVIDIA/thrust/blob/fa54f2c6f1217237953f27ddf67f901b6b34fbdd/thrust/system/cuda/detail/scan_by_key.h#L737

@cwharris (Contributor) commented May 5, 2021

Fixed the tile_base value with a static_cast<Size>(ITEMS_PER_TILE) here:
https://github.com/NVIDIA/thrust/blob/fa54f2c6f1217237953f27ddf67f901b6b34fbdd/thrust/system/cuda/detail/scan_by_key.h#L516

Still need to verify high-level functionality.

@cwharris (Contributor) commented May 5, 2021

With the aforementioned changes:

diff --git a/thrust/system/cuda/detail/scan_by_key.h b/thrust/system/cuda/detail/scan_by_key.h
index fe4b321c..b3974c69 100644
--- a/thrust/system/cuda/detail/scan_by_key.h
+++ b/thrust/system/cuda/detail/scan_by_key.h
@@ -513,7 +513,7 @@ namespace __scan_by_key {
             scan_op(scan_op_)
       {
         int  tile_idx      = blockIdx.x;
-        Size tile_base     = ITEMS_PER_TILE * tile_idx;
+        Size tile_base     = static_cast<Size>(ITEMS_PER_TILE) * tile_idx;
         Size num_remaining = num_items - tile_base;
 
         if (num_remaining > ITEMS_PER_TILE)
@@ -734,7 +734,7 @@ namespace __scan_by_key {
                              ScanOp                     scan_op,
                              AddInitToScan              add_init_to_scan)
   {
-    int          num_items    = static_cast<int>(thrust::distance(keys_first, keys_last));
+    size_t       num_items    = static_cast<size_t>(thrust::distance(keys_first, keys_last));
     size_t       storage_size = 0;
     cudaStream_t stream       = cuda_cub::stream(policy);
     bool         debug_sync   = THRUST_DEBUG_SYNC_FLAG;

I'm able to run the following code successfully:

// test_inclusive_scan_limit.cu
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/scan.h>

#include <cstdint>
#include <iostream>

int main(void) {
  uint64_t const scan_by_key_limit = (1uL << 41uL);

  auto keys = thrust::make_constant_iterator<uint64_t>(0);
  auto vals = thrust::make_constant_iterator<uint64_t>(1);
  auto iota = thrust::make_counting_iterator<uint64_t>(0);

  thrust::host_vector<uint64_t> h_sums(1);
  thrust::device_vector<uint64_t> d_sums(2);

  std::cout << "size: " << scan_by_key_limit << std::endl;

  // make a scatter output iterator to effectively ignore/compress all but the last value
  // (building the __device__ lambda requires nvcc's --expt-extended-lambda flag)
  auto scatter_out = thrust::make_permutation_iterator(
      d_sums.begin(),
      thrust::make_transform_iterator(iota, [] __device__(uint64_t idx) {
        return idx == scan_by_key_limit - 1 ? 0 : 1;
      }));

  auto s_end = thrust::inclusive_scan_by_key(
      thrust::device, keys, keys + scan_by_key_limit, vals, scatter_out);

  thrust::copy(s_end - 1, s_end, h_sums.begin());

  std::cout << "sums: " << h_sums[0] << std::endl;
}

and get the following output:

Scanning dependencies of target test_inclusive_scan_limit
[ 50%] Building CUDA object CMakeFiles/test_inclusive_scan_limit.dir/test_inclusive_scan_limit.cu.o
[100%] Linking CUDA executable test_inclusive_scan_limit
[100%] Built target test_inclusive_scan_limit
Invoking transform::agent<<<1, 256, 0, 1>>>(), 2 items total, 2 items per thread, 4 SM occupancy, 0 vshmem size, 520 ptx_version 
size: 2199023255552
Invoking scan_by_key::init_agent<<<7456541, 128, 0, 1>>>(), 954437177 items total, 1 items per thread, 8 SM occupancy, 0 vshmem size, 520 ptx_version 
Invoking scan_by_key::scan_agent<<<954437177, 256, 18448, 1>>>(), 2199023255552 items total, 9 items per thread, 3 SM occupancy, 0 vshmem size, 520 ptx_version 
Invoking transform::agent<<<1, 256, 0, 1>>>(), 1 items total, 2 items per thread, 4 SM occupancy, 0 vshmem size, 520 ptx_version 
sums: 2199023255552

@alliepiper (Collaborator)

Thanks for tracking this down, @cwharris!

The changes look reasonable to me, though I'm surprised that static_cast on the enum value is necessary. Do you have any insight into why that changed the behavior?

I think the size_t in the second part of the patch should just be Size, since distance(keys_first, keys_last) won't exceed num_items, which is guaranteed to fit into Size.
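
A minimal sketch of that alternative (hypothetical; assuming the Size type used by the agents is visible at this point in scan_by_key.h):

-    size_t       num_items    = static_cast<size_t>(thrust::distance(keys_first, keys_last));
+    Size         num_items    = static_cast<Size>(thrust::distance(keys_first, keys_last));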

Could you submit a PR with your changes?

@cwharris (Contributor) commented May 6, 2021

It doesn't matter which argument gets cast to uint64_t, just so long as they're both promoted before multiplication.

#include <stdint.h>

enum {
    ITEMS_PER_TILE = 9 * 256
};

__global__ void kernel(uint64_t* res) {
    int tile_idx = blockIdx.x;
    if constexpr (true) {
        // promoted before multiplication: full 64-bit product
        uint64_t tile_base = ITEMS_PER_TILE * static_cast<uint64_t>(tile_idx);
        res[tile_idx] = tile_base;
    } else {
        // unpromoted: the multiply happens in 32 bits, then widens
        uint64_t tile_base = ITEMS_PER_TILE * tile_idx;
        res[tile_idx] = tile_base;
    }
}

The PTX diff shows the 32-bit multiply + widen being replaced by a single widening multiply:
.visible .entry _Z6kernelPm(
        .param .u64 _Z6kernelPm_param_0
)
{

         ld.param.u64    %rd1, [_Z6kernelPm_param_0];
         cvta.to.global.u64      %rd2, %rd1;
         mov.u32         %r1, %ctaid.x;
-        mul.lo.s32      %r2, %r1, 2304;
-        cvt.s64.s32     %rd3, %r2;
+        mul.wide.s32    %rd3, %r1, 2304;
         mul.wide.s32    %rd4, %r1, 8;
         add.s64         %rd5, %rd2, %rd4;
         st.global.u64   [%rd5], %rd3;

}

vs how gcc handles something similar:

         mov     rbp, rsp
         mov     QWORD PTR [rbp-24], rdi
         mov     DWORD PTR [rbp-28], esi
         mov     edx, DWORD PTR [rbp-28]
-        movsx   rdx, edx
-        mov     rax, rdx
-        sal     rax, 3
-        add     rax, rdx
-        sal     rax, 8
+        mov     eax, edx
+        sal     eax, 3
+        add     eax, edx
+        sal     eax, 8
+        cdqe
         mov     QWORD PTR [rbp-8], rax
         mov     eax, DWORD PTR [rbp-28]
         cdqe

In the gcc case, the result seems to be the same for both sets of operations, though the unpromoted version still multiplies in 32-bit registers and only sign-extends afterwards (cdqe).

@alliepiper (Collaborator)

> It doesn't matter which argument gets cast to uint64_t, just so long as they're both promoted before multiplication.

D'oh, I misread the diff originally -- that totally makes sense. Thanks for digging into this.

I'll take a look at your PR in a minute.

jrhemstad added the thrust label Feb 22, 2023
jarmak-nv transferred this issue from NVIDIA/thrust Nov 8, 2023