[BUG] nightly test failed ROW_CONVERSION constantly #1567

Closed
pxLi opened this issue Nov 15, 2023 · 10 comments · Fixed by #1577
@pxLi
Collaborator

pxLi commented Nov 15, 2023

Describe the bug
spark-rapids-jni_nightly-pre_release, build ID: 204

So far we have only seen this once in cuda12, jni ref: ff59e68, cudf ref: rapidsai/cudf@330d389

We have also started seeing this more frequently in the submodule sync-up pipeline (cuda11 ENV):
spark-rapids-jni_submodule-sync-pre_release, build ID 593, 591, 592 (constantly failing now)

failed

[2023-11-15T06:06:25.878Z] [INFO]      [exec] The following tests FAILED:
[2023-11-15T06:06:25.878Z] [INFO]      [exec] 	  4 - ROW_CONVERSION (Failed)
[2023-11-15T06:06:25.878Z] [INFO]      [exec] Errors while running CTest

from target/cmake-build/Testing/Temporary/LastTest.log (full log 204.log)

4/8 Testing: ROW_CONVERSION
4/8 Test: ROW_CONVERSION
Command: "/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/cmake-build/gtests/ROW_CONVERSION"
Directory: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/cmake-build/tests
"ROW_CONVERSION" start time: Nov 15 06:06 UTC
Output:
----------------------------------------------------------
Running main() from /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/thirdparty/cudf/cpp/build/_deps/gtest-src/googletest/src/gtest_main.cc
[==========] Running 28 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 13 tests from ColumnToRowTests
[ RUN      ] ColumnToRowTests.Single
[       OK ] ColumnToRowTests.Single (587 ms)
[ RUN      ] ColumnToRowTests.SimpleString
[       OK ] ColumnToRowTests.SimpleString (14 ms)
[ RUN      ] ColumnToRowTests.DoubleString
[       OK ] ColumnToRowTests.DoubleString (0 ms)
[ RUN      ] ColumnToRowTests.BigStrings
[       OK ] ColumnToRowTests.BigStrings (3 ms)
[ RUN      ] ColumnToRowTests.ManyStrings
unknown file: Failure
C++ exception with description "inclusive_scan failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] ColumnToRowTests.ManyStrings (2688 ms)
[ RUN      ] ColumnToRowTests.Simple
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] ColumnToRowTests.Simple (0 ms)
[ RUN      ] ColumnToRowTests.Tall
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] ColumnToRowTests.Tall (0 ms)
[ RUN      ] ColumnToRowTests.Wide
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] ColumnToRowTests.Wide (0 ms)
[ RUN      ] ColumnToRowTests.SingleByteWide
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] ColumnToRowTests.SingleByteWide (0 ms)
[ RUN      ] ColumnToRowTests.Non2Power
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] ColumnToRowTests.Non2Power (0 ms)
[ RUN      ] ColumnToRowTests.Big
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] ColumnToRowTests.Big (8 ms)
[ RUN      ] ColumnToRowTests.Bigger
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] ColumnToRowTests.Bigger (10 ms)
[ RUN      ] ColumnToRowTests.Biggest
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] ColumnToRowTests.Biggest (14 ms)
[----------] 13 tests from ColumnToRowTests (3328 ms total)

[----------] 15 tests from RowToColumnTests
[ RUN      ] RowToColumnTests.Single
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] RowToColumnTests.Single (0 ms)
[ RUN      ] RowToColumnTests.Simple
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] RowToColumnTests.Simple (0 ms)
[ RUN      ] RowToColumnTests.Tall
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] RowToColumnTests.Tall (0 ms)
[ RUN      ] RowToColumnTests.Wide
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] RowToColumnTests.Wide (0 ms)
[ RUN      ] RowToColumnTests.SingleByteWide
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] RowToColumnTests.SingleByteWide (0 ms)
[ RUN      ] RowToColumnTests.AllTypes
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] RowToColumnTests.AllTypes (0 ms)
[ RUN      ] RowToColumnTests.AllTypesLarge
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] RowToColumnTests.AllTypesLarge (1 ms)
[ RUN      ] RowToColumnTests.Non2Power
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] RowToColumnTests.Non2Power (0 ms)
[ RUN      ] RowToColumnTests.Big
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] RowToColumnTests.Big (6 ms)
[ RUN      ] RowToColumnTests.Bigger
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] RowToColumnTests.Bigger (6 ms)
[ RUN      ] RowToColumnTests.Biggest
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] RowToColumnTests.Biggest (31 ms)
[ RUN      ] RowToColumnTests.SimpleString
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] RowToColumnTests.SimpleString (0 ms)
[ RUN      ] RowToColumnTests.DoubleString
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] RowToColumnTests.DoubleString (0 ms)
[ RUN      ] RowToColumnTests.BigStrings
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] RowToColumnTests.BigStrings (0 ms)
[ RUN      ] RowToColumnTests.ManyStrings
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-204-cuda12/target/libcudf-install/include/rmm/mr/device/cuda_memory_resource.hpp:76: cudaErrorIllegalAddress an illegal memory access was encountered" thrown in the test body.
[  FAILED  ] RowToColumnTests.ManyStrings (3 ms)
[----------] 15 tests from RowToColumnTests (49 ms total)

[----------] Global test environment tear-down
[==========] 28 tests from 2 test suites ran. (3377 ms total)
[  PASSED  ] 4 tests.
[  FAILED  ] 24 tests, listed below:
[  FAILED  ] ColumnToRowTests.ManyStrings
[  FAILED  ] ColumnToRowTests.Simple
[  FAILED  ] ColumnToRowTests.Tall
[  FAILED  ] ColumnToRowTests.Wide
[  FAILED  ] ColumnToRowTests.SingleByteWide
[  FAILED  ] ColumnToRowTests.Non2Power
[  FAILED  ] ColumnToRowTests.Big
[  FAILED  ] ColumnToRowTests.Bigger
[  FAILED  ] ColumnToRowTests.Biggest
[  FAILED  ] RowToColumnTests.Single
[  FAILED  ] RowToColumnTests.Simple
[  FAILED  ] RowToColumnTests.Tall
[  FAILED  ] RowToColumnTests.Wide
[  FAILED  ] RowToColumnTests.SingleByteWide
[  FAILED  ] RowToColumnTests.AllTypes
[  FAILED  ] RowToColumnTests.AllTypesLarge
[  FAILED  ] RowToColumnTests.Non2Power
[  FAILED  ] RowToColumnTests.Big
[  FAILED  ] RowToColumnTests.Bigger
[  FAILED  ] RowToColumnTests.Biggest
[  FAILED  ] RowToColumnTests.SimpleString
[  FAILED  ] RowToColumnTests.DoubleString
[  FAILED  ] RowToColumnTests.BigStrings
[  FAILED  ] RowToColumnTests.ManyStrings

24 FAILED TESTS
<end of output>
Test time =   3.84 sec
----------------------------------------------------------
Test Failed.
"ROW_CONVERSION" end time: Nov 15 06:06 UTC
"ROW_CONVERSION" time elapsed: 00:00:03

Steps/Code to reproduce bug
run test with cuda 12 (cuda 12.2, driver: 535.104, GPU: A30)

Expected behavior
Pass the test

@pxLi pxLi added the bug (Something isn't working), ? - Needs Triage, and test labels Nov 15, 2023
@pxLi pxLi changed the title [BUG] nightly failed ROW_CONVERSION in cuda 12 [BUG] nightly test failed ROW_CONVERSION in cuda 12 Nov 15, 2023
@pxLi pxLi changed the title [BUG] nightly test failed ROW_CONVERSION in cuda 12 [BUG] nightly test failed ROW_CONVERSION intermittently Nov 16, 2023
@jlowe
Member

jlowe commented Nov 16, 2023

I tried to git bisect this, but it's failing even on commits from over two weeks ago. That makes me think the failure was triggered not by a change in spark-rapids-jni or thirdparty/cudf but some dependency that always gets downloaded from latest (e.g.: rapids-cmake, rmm, etc.)

@gerashegalov
Collaborator

gerashegalov commented Nov 16, 2023

The issue reproduces on my laptop deterministically,

 NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2 

e.g. for ColumnToRowTests.Tall

./target/cmake-build/gtests/ROW_CONVERSION  --gtest_filter=ColumnToRowTests.Tall --rerun-failed --output-on-failure --gtest_recreate_environments_when_repeating --gtest_repeat=10

Rerunning with initcheck

[ RUN      ] ColumnToRowTests.Tall
========= Uninitialized __global__ memory read of size 16 bytes
=========     at 0x2760 in void cub::CUB_101702_860_NS::DeviceScanKernel<cub::CUB_101702_860_NS::DeviceScanPolicy<unsigned long>::Policy600, thrust::constant_iterator<unsigned long, thrust::use_default, thrust::use_default>, unsigned long *, cub::CUB_101702_860_NS::ScanTileState<unsigned long, (bool)1>, thrust::plus<void>, cub::CUB_101702_860_NS::NullType, int>(T2, T3, T4, int, T5, T6, T7)
=========     by thread (1,0,0) in block (1,0,0)
=========     Address 0x7f0a6ec041f0
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2fa190]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:__cudart1071 [0x31f4beb]
=========                in /path/issue1567/target/cmake-build/libcudf.so
=========     Host Frame:cudaLaunchKernel_ptsz [0x3234838]
=========                in /path/issue1567/target/cmake-build/libcudf.so
=========     Host Frame:spark_rapids_jni::detail::batch_data spark_rapids_jni::detail::build_batches<thrust::constant_iterator<unsigned long, thrust::use_default, thrust::use_default> >(int, thrust::constant_iterator<unsigned long, thrust::use_default, thrust::use_default>, bool, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1146686]
=========                in /path/issue1567/target/cmake-build/libcudf.so
=========     Host Frame:spark_rapids_jni::convert_to_rows(cudf::table_view const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x113539d]
=========                in /path/issue1567/target/cmake-build/libcudf.so
=========     Host Frame:ColumnToRowTests_Tall_Test::TestBody() [0x1f4435]
=========                in /path/issue1567/./target/cmake-build/gtests/ROW_CONVERSION
=========     Host Frame:void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) [0x82995d]
=========                in /path/issue1567/./target/cmake-build/gtests/ROW_CONVERSION
=========     Host Frame:testing::Test::Run() [0x81a0ae]
=========                in /path/issue1567/./target/cmake-build/gtests/ROW_CONVERSION
=========     Host Frame:testing::TestInfo::Run() [0x81a24d]
=========                in /path/issue1567/./target/cmake-build/gtests/ROW_CONVERSION
=========     Host Frame:testing::TestSuite::Run() [0x81a7dd]
=========                in /path/issue1567/./target/cmake-build/gtests/ROW_CONVERSION
=========     Host Frame:testing::internal::UnitTestImpl::RunAllTests() [0x82036f]
=========                in /path/issue1567/./target/cmake-build/gtests/ROW_CONVERSION
=========     Host Frame:testing::UnitTest::Run() [0x81a320]
=========                in /path/issue1567/./target/cmake-build/gtests/ROW_CONVERSION
=========     Host Frame:main [0x1bc24b]
=========                in /path/issue1567/./target/cmake-build/gtests/ROW_CONVERSION
=========     Host Frame:../sysdeps/nptl/libc_start_call_main.h:58:__libc_start_call_main [0x29d90]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:../csu/libc-start.c:379:__libc_start_main [0x29e40]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame: [0x1e414e]
=========                in /path/issue1567/./target/cmake-build/gtests/ROW_CONVERSION
=====

@gerashegalov
Collaborator

gerashegalov commented Nov 16, 2023

With memcheck the test actually passes after intercepting

========= Program hit cudaErrorMissingConfiguration (error 52) due to "__global__ function call is not configured" on CUDA API call to cudaGetLastError.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x480ea6]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:cudaGetLastError [0x868d54]
=========                in /target/cmake-build/gtests/ROW_CONVERSION
=========     Host Frame:thrust::cuda_cub::throw_on_error(cudaError, char const*) [0x236d64]
=========                in /target/cmake-build/gtests/ROW_CONVERSION
=========     Host Frame:spark_rapids_jni::detail::batch_data spark_rapids_jni::detail::build_batches<thrust::constant_iterator<unsigned long, thrust::use_default, thrust::use_default> >(int, thrust::constant_iterator<unsigned long, thrust::use_default, thrust::use_default>, bool, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1144ba3]
=========                in /home/gshegalov/gits/NVIDIA/spark-rapids-jni.worktrees/gerashegalov/issue1567/target/cmake-build/libcudf.so
=========     Host Frame:spark_rapids_jni::convert_to_rows(cudf::table_view const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x113539d]

...
[       OK ] ColumnToRowTests.Tall (361 ms)
[----------] 1 test from ColumnToRowTests (361 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (361 ms total)
[  PASSED  ] 1 test.
========= ERROR SUMMARY: 1 error

@hyperbolic2346
Collaborator

hyperbolic2346 commented Nov 17, 2023

> I tried to git bisect this, but it's failing even on commits from over two weeks ago. That makes me think the failure was triggered not by a change in spark-rapids-jni or thirdparty/cudf but some dependency that always gets downloaded from latest (e.g.: rapids-cmake, rmm, etc.)

Or it was always a bug and we are just now hitting it for some reason. That doesn't seem to be the case if it is deterministic. I originally thought this was due to the rand() inside this test. If I had to pick a line to investigate it would be

thrust::exclusive_scan(rmm::exec_policy(stream),
but it doesn't look suspect to me. I wonder if the CCCL work has introduced any optimizations down in cub. @jrhemstad

@ttnghia
Collaborator

ttnghia commented Nov 18, 2023

Can you try changing thrust/libcxx to their older version and test again?

@ttnghia
Collaborator

ttnghia commented Nov 19, 2023

I tried reverting libcudacxx from version 2.1.0 back to 1.9.1. That also requires reverting rmm, since a recent rmm commit depends on libcudacxx 2.1.0. However, that doesn't help.

I then added a bunch of code to retrieve the CUDA error throughout the row conversion code, which also adds cudaDeviceSynchronize, and suddenly that resolves the issue. So there is probably a bug in the row conversion code where the device is not properly synced. I'm trying to identify where we actually need to sync the device and will update.

@ttnghia
Collaborator

ttnghia commented Nov 19, 2023

It seems that cudaDeviceSynchronize is not needed. I inserted this code in some places and that helps:

{
 cudaError_t err = cudaGetLastError();
 if (err != cudaSuccess) { throw std::runtime_error("Kernel failed on non-default stream! line: " + std::to_string(__LINE__)); }
}

So there seems to be something else wrong, likely due to missing stream synchronization?

@ttnghia
Collaborator

ttnghia commented Nov 19, 2023

There seems to be a bug in thrust::inclusive_scan that overflows with a large input range. The function feeds the iterators many index values that are out of range (either < 0 or > num_rows). I have no idea what is wrong in this case; maybe somebody more familiar with thrust has some insight?

Here is a rough (not correct) diff that works around the issue above. I'm still investigating it...

diff --git a/src/main/cpp/src/row_conversion.cu b/src/main/cpp/src/row_conversion.cu
index c1f94598d0..7da2c74c2f 100644
--- a/src/main/cpp/src/row_conversion.cu
+++ b/src/main/cpp/src/row_conversion.cu
@@ -1516,7 +1516,7 @@ struct row_size_functor {
 
   __device__ inline uint64_t operator()(int i) const
   {
-    return i >= _row_end ? 0 : _row_sizes[i + _last_row_end];
+    return i < 0 || i >= _row_end ? 0 : _row_sizes[i + _last_row_end];
   }
 
   size_type _row_end;
@@ -1556,8 +1556,14 @@ batch_data build_batches(size_type num_rows,
   batch_row_boundaries.push_back(0);
   size_type last_row_end = 0;
   device_uvector<uint64_t> cumulative_row_sizes(num_rows, stream);
-  thrust::inclusive_scan(
-    rmm::exec_policy(stream), row_sizes, row_sizes + num_rows, cumulative_row_sizes.begin());
+
+  thrust::inclusive_scan(rmm::exec_policy(stream),
+      thrust::make_counting_iterator<int64_t>(0L),
+      thrust::make_counting_iterator<int64_t>((int64_t) num_rows),
+     cumulative_row_sizes.begin(),
+     [row_sizes]__device__(auto i, auto j) -> uint64_t {
+      return row_sizes[i] + row_sizes[j];
+    });
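
For orientation, the call being replaced in this diff is a plain prefix sum: entry `i` of `cumulative_row_sizes` should hold the total size of rows `0..i`. Below is a host-only sketch of that expected result, using `std::partial_sum` in place of `thrust::inclusive_scan`. The helper names are hypothetical and this is not the code in `row_conversion.cu`. For a fixed-width table such as the one in `ColumnToRowTests.Tall`, where the backtrace shows `row_sizes` is a `thrust::constant_iterator`, every entry should simply be `(i + 1) * row_size`.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical host-side reference; not the code in row_conversion.cu.
// Returns the inclusive prefix sum of per-row sizes, accumulated in 64 bits.
std::vector<std::uint64_t> cumulative_row_sizes_host(std::vector<std::uint64_t> const& row_sizes)
{
  std::vector<std::uint64_t> out(row_sizes.size());
  std::partial_sum(row_sizes.begin(), row_sizes.end(), out.begin());
  return out;
}

// Fixed-width case: every row has the same size, as with a constant iterator.
std::vector<std::uint64_t> cumulative_fixed_width_host(std::size_t num_rows, std::uint64_t row_size)
{
  return cumulative_row_sizes_host(std::vector<std::uint64_t>(num_rows, row_size));
}
```

Copying the device result back to the host and comparing it with this reference is one way to tell whether the scan output itself is wrong. Note also that a scan's binary operator receives accumulated values as well as input elements, so a lambda that uses its arguments as indices into `row_sizes` would not compute a prefix sum, which fits the "not correct" label on the diff.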

@ttnghia
Collaborator

ttnghia commented Nov 19, 2023

Hit another overflow bug. Still investigating....

alloc batch bytes: 18446744072182824960
========= Program hit cudaErrorMemoryAllocation (error 2) due to "out of memory" on CUDA API call to cudaMalloc.
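
One way to read the logged size, offered as a guess rather than a diagnosis: 18446744072182824960 equals 2^64 - 1526726656, which is the bit pattern produced when a negative 32-bit signed value is widened to an unsigned 64-bit size. The same low 32 bits, read as unsigned, give 2768240640 bytes (about 2.58 GiB), just past the `int32_t` maximum of 2147483647. The sketch below checks that arithmetic at compile time on the host. The helper names are hypothetical and nothing here is taken from `row_conversion.cu`.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; not part of spark-rapids-jni.

// Widen a signed 32-bit value to an unsigned 64-bit size.
// Signed-to-unsigned conversion is modular, so a negative input wraps to a value near 2^64.
constexpr std::uint64_t widen_to_size(std::int32_t v)
{
  return static_cast<std::uint64_t>(v);
}

// Keep only the low 32 bits of a 64-bit value (unsigned narrowing is modular too).
constexpr std::uint32_t low32(std::uint64_t bits)
{
  return static_cast<std::uint32_t>(bits);
}

// The size printed in the log above.
constexpr std::uint64_t logged_bytes = 18446744072182824960ULL;

// A negative int32 reproduces the logged value exactly when widened.
static_assert(widen_to_size(-1526726656) == logged_bytes, "sign extension matches the log");

// The low 32 bits, read as unsigned, are a byte count above INT32_MAX.
static_assert(low32(logged_bytes) == 2768240640U, "low 32 bits of the logged size");
static_assert(low32(logged_bytes) > static_cast<std::uint32_t>(INT32_MAX),
              "exceeds the int32 range");
```

If this reading is right, a batch size slightly above 2 GiB passed through a 32-bit signed type somewhere before reaching the allocator. The thread does not confirm where.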

@pxLi pxLi changed the title [BUG] nightly test failed ROW_CONVERSION intermittently [BUG] nightly test failed ROW_CONVERSION constantly Nov 20, 2023
@ttnghia ttnghia self-assigned this Nov 20, 2023
@ttnghia ttnghia reopened this Nov 21, 2023
@ttnghia
Collaborator

ttnghia commented Nov 21, 2023

I've filed a follow on issue: #1579.

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Dec 21, 2023
This temporarily moves the row conversion code from spark-rapids-jni into libcudf. The row conversion code must be compiled in a static library to overcome a CCCL issue that triggers invalid memory access when calling `thrust::in(ex)clusive_scan` (NVIDIA/spark-rapids-jni#1567).

In the future, when we have CCCL updated to fix the issue (1567), we may need to move the code back into spark-rapids-jni.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - MithunR (https://github.com/mythrocks)

URL: #14664
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Mar 5, 2024
This removes the row conversion code from libcudf. It was moved from spark-rapids-jni (by #14664) to temporarily work around a kernel-name conflict that causes invalid memory access when calling `thrust::in(ex)clusive_scan` (NVIDIA/spark-rapids-jni#1567).

Now we have fixes for the namespace visibility issue (by marking all libcudf kernels private in rapidsai/rapids-cmake#523 and NVIDIA/cuCollections#422) and need to move the code back.

Closes #14853.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)

URL: #15234