
[BUG] Illegal memory access while sorting lists #12201

Closed
jlowe opened this issue Nov 18, 2022 · 16 comments
Assignees: davidwendt
Labels: 2 - In Progress (Currently a work in progress), bug (Something isn't working), libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS)

Comments

jlowe (Member) commented Nov 18, 2022

Describe the bug
We have recently seen memory corruption and GPU illegal address crashes in the RAPIDS Accelerator for Apache Spark 22.12 nightly test runs since Nov 16. It appears to be related to sorting lists. See NVIDIA/spark-rapids#7092 (comment) for an illegal access captured by compute-sanitizer.

Steps/Code to reproduce bug
So far we can only sometimes reproduce this by running a large suite of RAPIDS Accelerator integration tests. I'm working on trying to narrow this down to something that only requires cudf to reproduce, but I wanted to file this early to raise awareness.

Expected behavior
Tests should pass

Environment overview (please complete the following information)
RAPIDS Accelerator for Apache Spark 22.12.0-SNAPSHOT on Apache Spark 3.1.2 using spark-rapids-jni-22.12.0-SNAPSHOT based on cudf 22.12.0

jlowe added the bug, Needs Triage, and Spark labels on Nov 18, 2022
davidwendt self-assigned this on Nov 18, 2022
ttnghia (Contributor) commented Nov 18, 2022

If the bug shows up randomly, it looks similar to an out-of-bounds error we have experienced in the past: a temporary rmm::device_uvector is almost always allocated with some extra padding memory, so no error shows up until an input of some "magic" size is passed in. In that situation the rmm::device_uvector is allocated without padding and, boom...
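
As a minimal illustration of that failure mode, here is a hypothetical sketch in plain CUDA (not code from cudf or rmm): an off-by-one device read usually lands in allocator padding and silently returns garbage, and it only becomes an illegal access when the allocation happens to end right at an unmapped boundary.

  #include <cuda_runtime.h>
  #include <cstdio>

  // Deliberately buggy kernel: reads one element past the end of `data`.
  __global__ void read_past_end(int const* data, int n, int* out)
  {
    if (threadIdx.x == 0 && blockIdx.x == 0) { *out = data[n]; }  // out-of-bounds read
  }

  int main()
  {
    int const n = 40;
    int *d_data = nullptr, *d_out = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&d_data), n * sizeof(int));
    cudaMalloc(reinterpret_cast<void**>(&d_out), sizeof(int));
    read_past_end<<<1, 32>>>(d_data, n, d_out);
    cudaDeviceSynchronize();
    // Often prints "no error": the stray read is absorbed by allocation padding
    // unless the buffer happens to end at an unmapped page.
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_data);
    cudaFree(d_out);
    return 0;
  }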

jlowe (Member, Author) commented Nov 18, 2022

I'm able to reliably reproduce it if I run a large suite of hash aggregation pyspark tests, but it's not reliable if I cut the number of tests down to just the ones that were failing. I was also able to save off, as a Parquet file, the input data of a list sort operation that failed, but loading that Parquet data and executing the list sort does not reproduce the problem, even with compute-sanitizer running and the memory pool disabled.

I'm currently doing some experiments to verify which libcudf commit introduced the issue.

ttnghia (Contributor) commented Nov 18, 2022

Can you try disabling the RMM memory pool when testing your saved Parquet file with the cudf list sort?

jlowe (Member, Author) commented Nov 18, 2022

> Can you try disabling the RMM memory pool when testing your saved Parquet file with the cudf list sort?

Yes, that's what I already tried without success:

> executing the list sort does not reproduce the problem, even with compute-sanitizer running and the memory pool disabled.

jlowe (Member, Author) commented Nov 18, 2022

This appears to be triggered by 90f0a77 (#11969). Spark plugin tests pass on the commit just before this and reliably fail after pulling in just this commit.

I still cannot explain why I cannot reproduce it in isolation. There must be some state (e.g., values of unallocated memory, the state of the GPU cache, etc.) that is key to getting it to fail reliably.

ttnghia (Contributor) commented Nov 18, 2022

@jlowe Do you know the data type of the sorted lists? Knowing it would help identify the source of the issue.

jlowe (Member, Author) commented Nov 18, 2022

In the case where I can reliably get it to fail as part of a larger suite, the list is of booleans (BOOL8 in cudf). Here's the data from the debug dump to Parquet, taken just before the list sort call:

scala> df.printSchema
root
 |-- c_list1: array (nullable = true)
 |    |-- element: boolean (containsNull = true)


scala> df.show(truncate=false)
+---------------------------------------------------------------+
|c_list1                                                        |
+---------------------------------------------------------------+
|[true, false, false, true, true]                               |
|[true, true, true, true, true]                                 |
|[true, true, true, false, false]                               |
|[false, false, true, false, false]                             |
|[true, true, true, true, true]                                 |
|[true, true, false, true, false]                               |
|[true, true, true, true, true, true, false, true, false, false]|
+---------------------------------------------------------------+

ttnghia (Contributor) commented Nov 18, 2022

So it is a bool type with nulls. Previously this would have been sorted by the cub radix sort. Now it is sorted by detail::sorted_order, not the new fast_sort algorithm (which uses cub). So perhaps there was a bug in detail::sorted_order with the bool type that didn't show up before (I'm not sure)?

jlowe (Member, Author) commented Nov 18, 2022

There are no nulls in the sample data; Spark thinks the type is nullable, but in practice the data contains no nulls. I believe this is a bug in the fast sort algorithm. @davidwendt asked me to hack the code to disable the fast path, and tests pass with that hack.

ttnghia (Contributor) commented Nov 18, 2022

@davidwendt From the old code:

  cub::DeviceSegmentedRadixSort::SortPairs(d_temp_storage.data(),
                                           temp_storage_bytes,
                                           keys_in,
                                           keys_out,
                                           values_in,
                                           values_out,
                                           num_items,
                                           num_segments,
                                           begin_offsets,
                                           end_offsets,
                                           0,
                                           sizeof(KeyT) * 8,
                                           stream.value());

So there are 13 params. In the new code:

  if (ascending) {
    cub::DeviceSegmentedSort::SortPairs(
      d_temp_storage.data(), temp_storage_bytes, std::forward<decltype(args)>(args)...);

  fast_sort_impl(ascending,
                 input.begin<T>(),
                 output_view.begin<T>(),
                 indices.begin<size_type>(),
                 indices.begin<size_type>(),
                 input.size(),
                 segment_offsets.size() - 1,
                 segment_offsets.begin<size_type>(),
                 segment_offsets.begin<size_type>() + 1,
                 stream.value());

And the number of params is just 11. Could this be the issue?

If not, should we test switching back to cub::DeviceSegmentedRadixSort from cub::DeviceSegmentedSort?

jlowe (Member, Author) commented Nov 19, 2022

Looks like there's a bug in cub::DeviceSegmentedSort::SortPairs, at least for boolean values. I added printing of the indices before and after the cub calls; here's the debug output. It shows the incoming column's offsets and data, along with the indices before and after. Note that one of the indices towards the end becomes negative, which explains the bad behavior of the gather call later.

GPU COLUMN *** list to sort false - NC: 0 DATA: null VAL: null
GPU COLUMN *** list to sort false:DATA - NC: 0 DATA: DeviceMemoryBufferView{address=0x7f55ba000480, length=40, id=-1} VAL: null
COLUMN *** list to sort false - LIST
OFFSETS
0 [0 - 5)
1 [5 - 10)
2 [10 - 15)
3 [15 - 20)
4 [20 - 25)
5 [25 - 30)
6 [30 - 40)
COLUMN *** list to sort false:DATA - BOOL8
0 true
1 false
2 false
3 true
4 true
5 true
6 true
7 true
8 true
9 true
10 true
11 true
12 true
13 false
14 false
15 false
16 false
17 true
18 false
19 false
20 true
21 true
22 true
23 true
24 true
25 true
26 true
27 false
28 true
29 false
30 true
31 true
32 true
33 true
34 true
35 true
36 false
37 true
38 false
39 false
Indices before fast sort
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
Indices after fast sort
1
2
0
3
4
5
6
7
8
9
13
14
10
11
12
15
16
18
19
17
20
21
22
23
24
27
29
25
26
28
36
38
39
2147483647
30
31
32
33
34
35
*** isDescending=false
GPU COLUMN *** list to sort false - NC: 0 DATA: null VAL: null
GPU COLUMN *** list to sort false:DATA - NC: 0 DATA: DeviceMemoryBufferView{address=0x7f55ba000500, length=12, id=-1} VAL: null
========= Invalid __global__ read of size 1 bytes
=========     at 0x590 in void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<const bool *, cudf::detail::input_indexalator>, bool *, thrust::cuda_cub::__transform::no_stencil_tag, thrust::cuda_cub::identity, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<const bool *, cudf::detail::input_indexalator>, bool *, thrust::cuda_cub::__transform::no_stencil_tag, thrust::cuda_cub::identity, thrust::cuda_cub::__transform::always_true_predicate>, long>(T2, T3)
=========     by thread (33,0,0) in block (0,0,0)
=========     Address 0x7f563a00047f is out of bounds
=========     and is 536872064 bytes after the nearest allocation at 0x7f55ba000000 of size 1610612736 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:cuLaunchKernel_ptsz [0x2d53e6]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x3e4342b]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame: [0x3e80b78]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:void thrust::cuda_cub::parallel_for<thrust::detail::execute_with_allocator<rmm::mr::thrust_allocator<char>, thrust::cuda_cub::execute_on_stream_nosync_base>, thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<bool const*, cudf::detail::input_indexalator>, bool*, thrust::cuda_cub::__transform::no_stencil_tag, thrust::cuda_cub::identity, thrust::cuda_cub::__transform::always_true_predicate>, long>(thrust::cuda_cub::execution_policy<thrust::detail::execute_with_allocator<rmm::mr::thrust_allocator<char>, thrust::cuda_cub::execute_on_stream_nosync_base> >&, thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<bool const*, cudf::detail::input_indexalator>, bool*, thrust::cuda_cub::__transform::no_stencil_tag, thrust::cuda_cub::identity, thrust::cuda_cub::__transform::always_true_predicate>, long) [0x17c4902]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:void cudf::detail::gather_helper<bool const*, bool*, cudf::detail::input_indexalator>(bool const*, int, bool*, cudf::detail::input_indexalator, cudf::detail::input_indexalator, bool, rmm::cuda_stream_view) [0x17c4ce2]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:std::unique_ptr<cudf::table, std::default_delete<cudf::table> > cudf::detail::gather<cudf::detail::input_indexalator>(cudf::table_view const&, cudf::detail::input_indexalator, cudf::detail::input_indexalator, cudf::out_of_bounds_policy, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x17d321e]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:cudf::detail::gather(cudf::table_view const&, cudf::column_view const&, cudf::out_of_bounds_policy, cudf::detail::negative_index_policy, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x179abf1]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:cudf::detail::(anonymous namespace)::segmented_sort_by_key_common(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, cudf::detail::(anonymous namespace)::sort_method, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x2a3ad04]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:cudf::detail::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x2a3b7f8]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:cudf::lists::detail::sort_lists(cudf::lists_column_view const&, cudf::order, cudf::null_order, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1fbbf3a]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:cudf::lists::sort_lists(cudf::lists_column_view const&, cudf::order, cudf::null_order, rmm::mr::device_memory_resource*) [0x1fbcc3a]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:Java_ai_rapids_cudf_ColumnView_listSortRows [0x149aef8]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame: [0x769dae17]
=========                in 

I also tried this debug build without Spark, using the data I captured as a Parquet file (the same data as in the log). In that run, the output index slot that becomes negative above becomes zero instead, duplicating another entry in the gather map that is zero (i.e., it's a bad but not invalid gather map).

So we should be able to construct a test using the input values shown in the log above and show that the cub call generates an index/gather map containing duplicated or invalid entries even though the input had no duplicates. @davidwendt volunteered to write the C++ repro and see if it shows the issue.

davidwendt (Contributor) commented:

The error appears to occur with bool but not with int8_t in CUB's segmented sort:

  // Headers assumed for building this snippet standalone:
  // <cudf_test/column_wrapper.hpp>, <cudf_test/column_utilities.hpp>,
  // <cudf/sorting.hpp>, <cudf/table/table_view.hpp>
  cudf::test::fixed_width_column_wrapper<bool> col1{
    {true,  false, false, true, true,  true,  true, true, true,  true, true,  true, true, false,
     false, false, false, true, false, false, true, true, true,  true, true,  true, true, false,
     true,  false, true,  true, true,  true,  true, true, false, true, false, false}};
  cudf::test::fixed_width_column_wrapper<int> segments{{0, 5, 10, 15, 20, 25, 30, 40}};
  // segmented_sorted_order returns the per-segment sorted row indices (a gather map)
  auto result = cudf::segmented_sorted_order(cudf::table_view({col1}), segments);
  cudf::test::print(result->view());

Output

1,2,0,3,4,5,6,7,8,9,13,14,10,11,12,15,16,18,19,17,20,21,22,23,24,27,29,25,26,28,36,38,39,32724,30,31,32,33,34,35

Garbage value 32724 at position 33
I will open an issue with CUB
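
For reference, a standalone CUB-only reproducer might look something like the following sketch (a hypothetical sketch, not necessarily the code filed upstream): it segment-sorts the 40 bool keys from the log above with cub::DeviceSegmentedSort::SortPairs, carrying the row indices as values, and prints the carried indices. A correct result would be a permutation of 0..39.

  #include <cub/device/device_segmented_sort.cuh>
  #include <cuda_runtime.h>
  #include <cstdio>

  int main()
  {
    // Keys and segment offsets copied from the column dump above (7 segments).
    bool const h_keys[40] = {true,  false, false, true,  true,  true,  true,  true,
                             true,  true,  true,  true,  true,  false, false, false,
                             false, true,  false, false, true,  true,  true,  true,
                             true,  true,  true,  false, true,  false, true,  true,
                             true,  true,  true,  true,  false, true,  false, false};
    int const h_offsets[8] = {0, 5, 10, 15, 20, 25, 30, 40};
    int h_vals[40];
    for (int i = 0; i < 40; ++i) { h_vals[i] = i; }  // values carry the row index

    bool *d_keys_in, *d_keys_out;
    int *d_vals_in, *d_vals_out, *d_offsets;
    cudaMalloc(reinterpret_cast<void**>(&d_keys_in), sizeof(h_keys));
    cudaMalloc(reinterpret_cast<void**>(&d_keys_out), sizeof(h_keys));
    cudaMalloc(reinterpret_cast<void**>(&d_vals_in), sizeof(h_vals));
    cudaMalloc(reinterpret_cast<void**>(&d_vals_out), sizeof(h_vals));
    cudaMalloc(reinterpret_cast<void**>(&d_offsets), sizeof(h_offsets));
    cudaMemcpy(d_keys_in, h_keys, sizeof(h_keys), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals_in, h_vals, sizeof(h_vals), cudaMemcpyHostToDevice);
    cudaMemcpy(d_offsets, h_offsets, sizeof(h_offsets), cudaMemcpyHostToDevice);

    // Standard two-phase CUB call: query temp storage size, then sort.
    void* d_temp   = nullptr;
    size_t n_bytes = 0;
    cub::DeviceSegmentedSort::SortPairs(d_temp, n_bytes,
                                        d_keys_in, d_keys_out, d_vals_in, d_vals_out,
                                        40, 7, d_offsets, d_offsets + 1);
    cudaMalloc(&d_temp, n_bytes);
    cub::DeviceSegmentedSort::SortPairs(d_temp, n_bytes,
                                        d_keys_in, d_keys_out, d_vals_in, d_vals_out,
                                        40, 7, d_offsets, d_offsets + 1);

    // Print the carried indices; with the suspected bug, some entries are garbage.
    cudaMemcpy(h_vals, d_vals_out, sizeof(h_vals), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 40; ++i) { printf("%d%c", h_vals[i], i == 39 ? '\n' : ','); }
    return 0;
  }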

ttnghia (Contributor) commented Nov 19, 2022

I'm so glad that we can identify and catch the bug so quickly 👍. Thanks :)

GregoryKimball added the libcudf and 2 - In Progress labels and removed the Needs Triage label on Nov 21, 2022
davidwendt (Contributor) commented:

Created CUB issue here: NVIDIA/cub#594
Still working on possible solutions.

davidwendt (Contributor) commented:

Verified that adding a cudf::bit_cast to the test code above works around the error:

  // Reinterpret the BOOL8 column as INT8 (zero-copy) so CUB sorts int8_t keys:
  auto intcol = cudf::bit_cast(cudf::column_view{col1}, cudf::data_type{cudf::type_id::INT8});
  auto result = cudf::segmented_sorted_order(cudf::table_view({intcol}), segments);

The CUB issue is specifically due to an error in handling the bool type for the keys.
The cudf::bit_cast is a zero-copy function.

GregoryKimball (Contributor) commented:

Closed by #12217
