[BUG] inclusive_scan_by_key OOM on >= INT_MAX elements #766

Open
trxcllnt opened this issue Apr 9, 2021 · 12 comments
Labels: thrust (For all items related to Thrust.)

@trxcllnt (Member) commented Apr 9, 2021

Possibly related to the fix at NVIDIA/cub#221, we're seeing an OOM in inclusive_scan_by_key when input size is close to (but not exactly) INT_MAX. The limit seems to depend on the accumulator type -- sometimes (1 << 31) - 1 works, but other times it's (1 << 31) - 2048 or (1 << 31) - 4096.

Tested in nvidia/cuda:10.2-devel-ubuntu18.04 container with GCC 7.5.0/CUDA 10.2, but present in newer versions too.

# CMakeLists.txt
cmake_minimum_required(VERSION 3.20)
project(test VERSION 1.0.0 LANGUAGES C CXX CUDA)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

# https://github.com/cpm-cmake/CPM.cmake/blob/310efb9b17d0befe9ccd4f5bf2e39942869777fc/cmake/get_cpm.cmake
set(CPM_DOWNLOAD_VERSION 0.32.0)
if(CPM_SOURCE_CACHE)
  # Expand relative path. This is important if the provided path contains a tilde (~)
  get_filename_component(CPM_SOURCE_CACHE ${CPM_SOURCE_CACHE} ABSOLUTE)
  set(CPM_DOWNLOAD_LOCATION "${CPM_SOURCE_CACHE}/cpm/CPM_${CPM_DOWNLOAD_VERSION}.cmake")
elseif(DEFINED ENV{CPM_SOURCE_CACHE})
  set(CPM_DOWNLOAD_LOCATION "$ENV{CPM_SOURCE_CACHE}/cpm/CPM_${CPM_DOWNLOAD_VERSION}.cmake")
else()
  set(CPM_DOWNLOAD_LOCATION "${CMAKE_BINARY_DIR}/cmake/CPM_${CPM_DOWNLOAD_VERSION}.cmake")
endif()
if(NOT (EXISTS ${CPM_DOWNLOAD_LOCATION}))
  message(STATUS "Downloading CPM.cmake to ${CPM_DOWNLOAD_LOCATION}")
  file(DOWNLOAD
       https://github.com/cpm-cmake/CPM.cmake/releases/download/v${CPM_DOWNLOAD_VERSION}/CPM.cmake
       ${CPM_DOWNLOAD_LOCATION}
  )
endif()
include(${CPM_DOWNLOAD_LOCATION})

# Get Thrust v1.12.0
CPMAddPackage(NAME      Thrust
    VERSION             1.12.0
    GIT_REPOSITORY      https://github.com/NVIDIA/thrust.git
    GIT_TAG             1.12.0
    GIT_SHALLOW         TRUE
    UPDATE_DISCONNECTED TRUE)

thrust_create_target(test::Thrust FROM_OPTIONS)

add_executable(test_inclusive_scan_limit test_inclusive_scan_limit.cu)

target_compile_features(test_inclusive_scan_limit
    PRIVATE cxx_std_14 $<BUILD_INTERFACE:cuda_std_14>)

target_link_libraries(test_inclusive_scan_limit test::Thrust)

execute_process(COMMAND cmake -E create_symlink ${CMAKE_CURRENT_BINARY_DIR}/compile_commands.json  compile_commands.json)

// test_inclusive_scan_limit.cu
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/scan.h>

#include <cstdint>
#include <iostream>

int main(void) {
  //
  // limit is somewhere between these values, otherwise get this error:
  // terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
  //   what():  std::bad_alloc: cudaErrorMemoryAllocation: out of memory
  //
  auto const scan_by_key_limit = (1uL << 31uL);  // error
  // auto const scan_by_key_limit = (1uL << 31uL) - 1uL;  // no error
  // auto const scan_by_key_limit = (1uL << 31uL) - 2048uL;  // no error
  // auto const scan_by_key_limit = (1uL << 31uL) - 4096uL;  // no error

  auto keys = thrust::make_constant_iterator(0);
  auto vals = thrust::make_counting_iterator(uint64_t{0});

  thrust::host_vector<uint64_t> h_sum(1);
  thrust::device_vector<uint64_t> d_sums(scan_by_key_limit);

  std::cout << "size: " << scan_by_key_limit << std::endl;

  // OOM here
  auto s_end = thrust::inclusive_scan_by_key(
      thrust::device, keys, keys + scan_by_key_limit, vals, d_sums.begin());

  thrust::copy(s_end - 1, s_end, h_sum.begin());

  std::cout << "sums: " << h_sum[0] << std::endl;
}
@alliepiper (Collaborator)

For the sizes less than 2^31, is there any pattern to those failures? E.g., is it only with certain hardware or compiler flags, or is it just random from run-to-run in the same binary?

@trxcllnt (Member, Author) commented Apr 9, 2021

@allisonvacanti I've only tested on a Turing RTX 8000, so I'm not sure if other devices have different limits.

In this standalone repro with GCC 7.5/CUDA 10.2, it seems the limit is 2^31 - 1, so compiler flags could be affecting things. I originally ran this test in rapidsai/cuspatial, and the limit was between 2^31 - (2048|4096) in both GCC 7.5/CUDA 10.2 and GCC 9.3/CUDA 11.2.

It does not seem to change between runs of the same binary.

@cwharris (Contributor) commented May 3, 2021

The OOM portion of the bug is solved by switching int to size_t on this line:

https://github.com/NVIDIA/thrust/blob/fa54f2c6f1217237953f27ddf67f901b6b34fbdd/thrust/system/cuda/detail/scan_by_key.h#L737

However, that reveals an invalid memory access exception deeper in the kernel. Right now I've tracked it to a suspiciously large tile_base value in scan_by_key.h where it may be overflowing.
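
A minimal sketch of the int truncation behind the OOM (hypothetical; the real temp-storage sizing in scan_by_key.h is more involved, but the cast is the key step):

#include <cstdint>
#include <cstdio>

int main() {
  // 2^31 input elements, as in the repro above
  size_t distance = size_t{1} << 31;

  // the static_cast<int> on the linked line truncates this to INT_MIN
  // (implementation-defined for out-of-range values, but -2147483648
  // on the usual targets)
  int num_items = static_cast<int>(distance);

  // a later conversion back to an unsigned size turns the negative count
  // into an enormous allocation request, hence cudaErrorMemoryAllocation
  size_t as_size = static_cast<size_t>(num_items);

  std::printf("num_items = %d, as unsigned size = %zu\n", num_items, as_size);
  // prints: num_items = -2147483648, as unsigned size = 18446744071562067968
}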

@alliepiper (Collaborator)

@cwharris Does the branch you're looking at include NVIDIA/thrust@8f876ba? That fixed several similar overflows in tile size calculations.

@cwharris (Contributor) commented May 5, 2021

I'm seeing this behavior on 1.12.0.

@cwharris (Contributor) commented May 5, 2021

In reality it looks like tile_base is jumping from near INT_MAX to near SIZE_MAX once it crosses a certain threshold. I'll post some example output and the printf I've added for debugging.

@cwharris (Contributor) commented May 5, 2021

I've added the following printf immediately after the sync_threadblock() that follows the calls to scan_tile(...) in scan_by_key.h:

        // note: the format string prints Size's max value first, then tile_base
        if (threadIdx.x == 0) {
          printf("AFTER SCAN TILE... tile_idx: %i, tile_base: %lu %lu\n",
                 tile_idx,
                 std::numeric_limits<Size>::max(),
                 tile_base);
        }

Here is part of the output I'm getting (I've sorted the lines so tile_idx appears in order):

AFTER SCAN TILE... tile_idx: 922975, tile_base: 18446744073709551615 2126534400
AFTER SCAN TILE... tile_idx: 922976, tile_base: 18446744073709551615 2126536704
AFTER SCAN TILE... tile_idx: 922978, tile_base: 18446744073709551615 2126541312
AFTER SCAN TILE... tile_idx: 922979, tile_base: 18446744073709551615 2126543616
AFTER SCAN TILE... tile_idx: 922980, tile_base: 18446744073709551615 2126545920
AFTER SCAN TILE... tile_idx: 922981, tile_base: 18446744073709551615 2126548224
AFTER SCAN TILE... tile_idx: 922983, tile_base: 18446744073709551615 2126552832
AFTER SCAN TILE... tile_idx: 922984, tile_base: 18446744073709551615 2126555136
AFTER SCAN TILE... tile_idx: 922986, tile_base: 18446744073709551615 2126559744
AFTER SCAN TILE... tile_idx: 922998, tile_base: 18446744073709551615 2126587392
AFTER SCAN TILE... tile_idx: 922999, tile_base: 18446744073709551615 2126589696
AFTER SCAN TILE... tile_idx: 923001, tile_base: 18446744073709551615 2126594304
AFTER SCAN TILE... tile_idx: 923002, tile_base: 18446744073709551615 2126596608
AFTER SCAN TILE... tile_idx: 932162, tile_base: 18446744073709551615 18446744071562285568
AFTER SCAN TILE... tile_idx: 932172, tile_base: 18446744073709551615 18446744071562308608
AFTER SCAN TILE... tile_idx: 932174, tile_base: 18446744073709551615 18446744071562313216
AFTER SCAN TILE... tile_idx: 932176, tile_base: 18446744073709551615 18446744071562317824
AFTER SCAN TILE... tile_idx: 932177, tile_base: 18446744073709551615 18446744071562320128
AFTER SCAN TILE... tile_idx: 932185, tile_base: 18446744073709551615 18446744071562338560
AFTER SCAN TILE... tile_idx: 932187, tile_base: 18446744073709551615 18446744071562343168
AFTER SCAN TILE... tile_idx: 932189, tile_base: 18446744073709551615 18446744071562347776

tile_base goes from a more normal value 2126596608 to 18446744071562285568 in a single step/block, which makes me suspicious of an overflow somewhere in the math leading up to the tile_base calculation. The unexpectedly large value differs from uint64_t max (18446744073709551615) by a value close to INT_MAX (18446744073709551615 - 18446744071562285568 = 2147266047).
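
A minimal sketch of that suspicion (hypothetical, but plugging in ITEMS_PER_TILE = 9 * 256 = 2304 and the first bad tile index from the log reproduces the logged value exactly):

#include <cstdint>
#include <cstdio>

int main() {
  int items_per_tile = 2304;   // 9 items/thread * 256 threads, per the log
  int tile_idx       = 932162; // first tile with a garbage tile_base above

  // 2304 * 932162 = 2147701248 > INT_MAX, so the 32-bit product wraps
  // (signed overflow, formally UB) to -2147266048 ...
  int wrapped = items_per_tile * tile_idx;

  // ... and converting that negative value to a 64-bit Size sign-extends it,
  // reproducing the value from the log
  uint64_t tile_base = static_cast<uint64_t>(wrapped);

  std::printf("tile_base = %lu\n", tile_base);
  // prints: tile_base = 18446744071562285568
}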

Finally, the error I see is:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  scan_by_key: failed on 2nd step: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted (core dumped)

This is reproducible using the example code provided by @trxcllnt, after modifying the 1.12.0 source to add the printf mentioned in this comment and to use size_t instead of int on this line: https://github.com/NVIDIA/thrust/blob/fa54f2c6f1217237953f27ddf67f901b6b34fbdd/thrust/system/cuda/detail/scan_by_key.h#L737

@cwharris (Contributor) commented May 5, 2021

Fixed the tile_base value with a static_cast<Size>(ITEMS_PER_TILE) here:
https://github.com/NVIDIA/thrust/blob/fa54f2c6f1217237953f27ddf67f901b6b34fbdd/thrust/system/cuda/detail/scan_by_key.h#L516

Still need to verify high-level functionality.

@cwharris (Contributor) commented May 5, 2021

With the aforementioned changes:

diff --git a/thrust/system/cuda/detail/scan_by_key.h b/thrust/system/cuda/detail/scan_by_key.h
index fe4b321c..b3974c69 100644
--- a/thrust/system/cuda/detail/scan_by_key.h
+++ b/thrust/system/cuda/detail/scan_by_key.h
@@ -513,7 +513,7 @@ namespace __scan_by_key {
             scan_op(scan_op_)
       {
         int  tile_idx      = blockIdx.x;
-        Size tile_base     = ITEMS_PER_TILE * tile_idx;
+        Size tile_base     = static_cast<Size>(ITEMS_PER_TILE) * tile_idx;
         Size num_remaining = num_items - tile_base;
 
         if (num_remaining > ITEMS_PER_TILE)
@@ -734,7 +734,7 @@ namespace __scan_by_key {
                              ScanOp                     scan_op,
                              AddInitToScan              add_init_to_scan)
   {
-    int          num_items    = static_cast<int>(thrust::distance(keys_first, keys_last));
+    size_t       num_items    = static_cast<size_t>(thrust::distance(keys_first, keys_last));
     size_t       storage_size = 0;
     cudaStream_t stream       = cuda_cub::stream(policy);
     bool         debug_sync   = THRUST_DEBUG_SYNC_FLAG;

I'm able to run the following code successfully:

// test_inclusive_scan_limit.cu
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/scan.h>

#include <cstdint>
#include <iostream>

int main(void) {
  uint64_t const scan_by_key_limit = (1uL << 41uL);

  auto keys = thrust::make_constant_iterator<uint64_t>(0);
  auto vals = thrust::make_constant_iterator<uint64_t>(1);
  auto iota = thrust::make_counting_iterator<uint64_t>(0);

  thrust::host_vector<uint64_t> h_sums(1);
  thrust::device_vector<uint64_t> d_sums(2);

  std::cout << "size: " << scan_by_key_limit << std::endl;

  // make a scatter output iterator to effectively ignore/compress all but the last value
  // (building the __device__ lambda requires nvcc's --expt-extended-lambda flag)
  auto scatter_out = thrust::make_permutation_iterator(
      d_sums.begin(),
      thrust::make_transform_iterator(iota, [] __device__(uint64_t idx) {
        return idx == scan_by_key_limit - 1 ? 0 : 1;
      }));

  auto s_end = thrust::inclusive_scan_by_key(
      thrust::device, keys, keys + scan_by_key_limit, vals, scatter_out);

  thrust::copy(s_end - 1, s_end, h_sums.begin());

  std::cout << "sums: " << h_sums[0] << std::endl;
}

and get the following output:

Scanning dependencies of target test_inclusive_scan_limit
[ 50%] Building CUDA object CMakeFiles/test_inclusive_scan_limit.dir/test_inclusive_scan_limit.cu.o
[100%] Linking CUDA executable test_inclusive_scan_limit
[100%] Built target test_inclusive_scan_limit
Invoking transform::agent<<<1, 256, 0, 1>>>(), 2 items total, 2 items per thread, 4 SM occupancy, 0 vshmem size, 520 ptx_version 
size: 2199023255552
Invoking scan_by_key::init_agent<<<7456541, 128, 0, 1>>>(), 954437177 items total, 1 items per thread, 8 SM occupancy, 0 vshmem size, 520 ptx_version 
Invoking scan_by_key::scan_agent<<<954437177, 256, 18448, 1>>>(), 2199023255552 items total, 9 items per thread, 3 SM occupancy, 0 vshmem size, 520 ptx_version 
Invoking transform::agent<<<1, 256, 0, 1>>>(), 1 items total, 2 items per thread, 4 SM occupancy, 0 vshmem size, 520 ptx_version 
sums: 2199023255552

@alliepiper (Collaborator)

Thanks for tracking this down, @cwharris!

The changes look reasonable to me, though I'm surprised that static_cast on the enum value is necessary. Do you have any insight into why that changed the behavior?

I think the size_t in the second part of the patch should just be Size, since distance(keys_first, keys_last) won't exceed num_items, which is guaranteed to fit into Size.
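
A minimal sketch of that alternative (hypothetical; assuming the Size type used by the agents is visible at this point in scan_by_key.h):

-    size_t       num_items    = static_cast<size_t>(thrust::distance(keys_first, keys_last));
+    Size         num_items    = static_cast<Size>(thrust::distance(keys_first, keys_last));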

Could you submit a PR with your changes?

@cwharris (Contributor) commented May 6, 2021

It doesn't matter which argument gets cast to uint64_t, just so long as they're both promoted before multiplication.

#include <stdint.h>

enum {
    ITEMS_PER_TILE = 9 * 256
};

__global__ void kernel(uint64_t* res) {
    int tile_idx = blockIdx.x;
    if constexpr (true) {
        // promoted before multiplication: full 64-bit product
        uint64_t tile_base = ITEMS_PER_TILE * static_cast<uint64_t>(tile_idx);
        res[tile_idx] = tile_base;
    } else {
        // unpromoted: the multiply happens in 32 bits, then widens
        uint64_t tile_base = ITEMS_PER_TILE * tile_idx;
        res[tile_idx] = tile_base;
    }
}

The PTX diff shows the 32-bit multiply + widen being replaced by a single widening multiply:
.visible .entry _Z6kernelPm(
        .param .u64 _Z6kernelPm_param_0
)
{

         ld.param.u64    %rd1, [_Z6kernelPm_param_0];
         cvta.to.global.u64      %rd2, %rd1;
         mov.u32         %r1, %ctaid.x;
-        mul.lo.s32      %r2, %r1, 2304;
-        cvt.s64.s32     %rd3, %r2;
+        mul.wide.s32    %rd3, %r1, 2304;
         mul.wide.s32    %rd4, %r1, 8;
         add.s64         %rd5, %rd2, %rd4;
         st.global.u64   [%rd5], %rd3;

}

vs how gcc handles something similar:

         mov     rbp, rsp
         mov     QWORD PTR [rbp-24], rdi
         mov     DWORD PTR [rbp-28], esi
         mov     edx, DWORD PTR [rbp-28]
-        movsx   rdx, edx
-        mov     rax, rdx
-        sal     rax, 3
-        add     rax, rdx
-        sal     rax, 8
+        mov     eax, edx
+        sal     eax, 3
+        add     eax, edx
+        sal     eax, 8
+        cdqe
         mov     QWORD PTR [rbp-8], rax
         mov     eax, DWORD PTR [rbp-28]
         cdqe

In the gcc case, the result seems to be the same for both sets of operations, though the unpromoted version still multiplies in 32-bit registers and only sign-extends afterwards (cdqe).

@alliepiper (Collaborator)

> It doesn't matter which argument gets cast to uint64_t, just so long as they're both promoted before multiplication.

D'oh, I misread the diff originally -- that totally makes sense. Thanks for digging into this.

I'll take a look at your PR in a minute.

jrhemstad added the thrust label Feb 22, 2023
jarmak-nv transferred this issue from NVIDIA/thrust Nov 8, 2023