
[BUG] Illegal memory access while sorting lists #12201

Closed
jlowe opened this issue Nov 18, 2022 · 16 comments
Assignees: davidwendt
Labels: 2 - In Progress (Currently a work in progress), bug (Something isn't working), libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS)

Comments

jlowe (Member) commented Nov 18, 2022

Describe the bug
We have recently seen memory corruption and GPU illegal address crashes in the RAPIDS Accelerator for Apache Spark 22.12 nightly test runs since Nov 16. It appears to be related to sorting lists. See NVIDIA/spark-rapids#7092 (comment) for an illegal access captured by compute-sanitizer.

Steps/Code to reproduce bug
So far we can only sometimes reproduce this by running a large suite of RAPIDS Accelerator integration tests. I'm working on trying to narrow this down to something that only requires cudf to reproduce, but I wanted to file this early to raise awareness.

Expected behavior
Tests should pass

Environment overview (please complete the following information)
RAPIDS Accelerator for Apache Spark 22.12.0-SNAPSHOT on Apache Spark 3.1.2 using spark-rapids-jni-22.12.0-SNAPSHOT based on cudf 22.12.0

jlowe added the bug, Needs Triage, and Spark labels on Nov 18, 2022
davidwendt self-assigned this on Nov 18, 2022
ttnghia (Contributor) commented Nov 18, 2022

If the bug shows up randomly, it looks similar to an out-of-bounds error we have experienced in the past: a temporary rmm::device_uvector is almost always allocated with some extra padding memory, so no error shows up until an input of some "magic" size is passed in. In that situation the rmm::device_uvector is allocated without padding and, boom...
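
As a minimal illustration of that failure mode, here is a hypothetical sketch in plain CUDA (not code from cudf or rmm): an off-by-one device read usually lands in allocator padding and silently returns garbage, and it only becomes an illegal access when the allocation happens to end right at an unmapped boundary.

  #include <cuda_runtime.h>
  #include <cstdio>

  // Deliberately buggy kernel: reads one element past the end of `data`.
  __global__ void read_past_end(int const* data, int n, int* out)
  {
    if (threadIdx.x == 0 && blockIdx.x == 0) { *out = data[n]; }  // out-of-bounds read
  }

  int main()
  {
    int const n = 40;
    int *d_data = nullptr, *d_out = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&d_data), n * sizeof(int));
    cudaMalloc(reinterpret_cast<void**>(&d_out), sizeof(int));
    read_past_end<<<1, 32>>>(d_data, n, d_out);
    cudaDeviceSynchronize();
    // Often prints "no error": the stray read is absorbed by allocation padding
    // unless the buffer happens to end at an unmapped page.
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_data);
    cudaFree(d_out);
    return 0;
  }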

jlowe (Member, Author) commented Nov 18, 2022

I'm able to reliably reproduce it if I run a large suite of hash aggregation pyspark tests, but it's not reliable if I cut the number of tests down to just the ones that were failing. I was also able to save off, as a Parquet file, the input data of a list sort operation that failed, but loading that Parquet data and executing the list sort does not reproduce the problem, even with compute-sanitizer running and the memory pool disabled.

I'm currently doing some experiments to verify which libcudf commit introduced the issue.

ttnghia (Contributor) commented Nov 18, 2022

Can you try disabling the RMM memory pool when testing your saved Parquet file with the cudf list sort?

jlowe (Member, Author) commented Nov 18, 2022

> Can you try disabling the RMM memory pool when testing your saved Parquet file with the cudf list sort?

Yes, that's what I already tried without success:

> executing the list sort does not reproduce the problem, even with compute-sanitizer running and the memory pool disabled.

jlowe (Member, Author) commented Nov 18, 2022

This appears to be triggered by 90f0a77 (#11969). Spark plugin tests pass on the commit just before this and reliably fail after pulling in just this commit.

I still cannot explain why I cannot reproduce it in isolation. There must be some state (e.g., values of unallocated memory, the state of the GPU cache, etc.) that is key to getting it to fail reliably.

ttnghia (Contributor) commented Nov 18, 2022

@jlowe Do you know the data type of the sorted lists? Knowing it would help identify the source of the issue.

jlowe (Member, Author) commented Nov 18, 2022

In the case where I can reliably get it to fail as part of a larger suite, the list is of booleans (BOOL8 in cudf). Here's the data from the debug dump to Parquet, taken just before the list sort call:

scala> df.printSchema
root
 |-- c_list1: array (nullable = true)
 |    |-- element: boolean (containsNull = true)


scala> df.show(truncate=false)
+---------------------------------------------------------------+
|c_list1                                                        |
+---------------------------------------------------------------+
|[true, false, false, true, true]                               |
|[true, true, true, true, true]                                 |
|[true, true, true, false, false]                               |
|[false, false, true, false, false]                             |
|[true, true, true, true, true]                                 |
|[true, true, false, true, false]                               |
|[true, true, true, true, true, true, false, true, false, false]|
+---------------------------------------------------------------+

ttnghia (Contributor) commented Nov 18, 2022

So it is a bool type with nulls. Previously this would have been sorted by the cub radix sort. Now it is sorted by detail::sorted_order, not the new fast_sort algorithm (which uses cub). So perhaps there was a bug in detail::sorted_order with the bool type that didn't show up before (I'm not sure)?

jlowe (Member, Author) commented Nov 18, 2022

There are no nulls in the sample data; Spark thinks the type is nullable, but in practice the data contains no nulls. I believe this is a bug in the fast sort algorithm. @davidwendt asked me to hack the code to disable the fast path, and tests pass with that hack.

ttnghia (Contributor) commented Nov 18, 2022

@davidwendt From the old code:

  cub::DeviceSegmentedRadixSort::SortPairs(d_temp_storage.data(),
                                           temp_storage_bytes,
                                           keys_in,
                                           keys_out,
                                           values_in,
                                           values_out,
                                           num_items,
                                           num_segments,
                                           begin_offsets,
                                           end_offsets,
                                           0,
                                           sizeof(KeyT) * 8,
                                           stream.value());

So there are 13 params. In the new code:

  if (ascending) {
    cub::DeviceSegmentedSort::SortPairs(
      d_temp_storage.data(), temp_storage_bytes, std::forward<decltype(args)>(args)...);

  fast_sort_impl(ascending,
                 input.begin<T>(),
                 output_view.begin<T>(),
                 indices.begin<size_type>(),
                 indices.begin<size_type>(),
                 input.size(),
                 segment_offsets.size() - 1,
                 segment_offsets.begin<size_type>(),
                 segment_offsets.begin<size_type>() + 1,
                 stream.value());

And the number of params is just 11. Could this be the issue?

If not, should we test switching back to cub::DeviceSegmentedRadixSort from cub::DeviceSegmentedSort?

jlowe (Member, Author) commented Nov 19, 2022

Looks like there's a bug in cub::DeviceSegmentedSort::SortPairs, at least for boolean values. I added printing of the indices before and after the cub calls; here's the debug output. It shows the incoming column's offsets and data, along with the indices before and after. Note that one of the indices towards the end becomes negative, which explains the bad behavior of the gather call later.

GPU COLUMN *** list to sort false - NC: 0 DATA: null VAL: null
GPU COLUMN *** list to sort false:DATA - NC: 0 DATA: DeviceMemoryBufferView{address=0x7f55ba000480, length=40, id=-1} VAL: null
COLUMN *** list to sort false - LIST
OFFSETS
0 [0 - 5)
1 [5 - 10)
2 [10 - 15)
3 [15 - 20)
4 [20 - 25)
5 [25 - 30)
6 [30 - 40)
COLUMN *** list to sort false:DATA - BOOL8
0 true
1 false
2 false
3 true
4 true
5 true
6 true
7 true
8 true
9 true
10 true
11 true
12 true
13 false
14 false
15 false
16 false
17 true
18 false
19 false
20 true
21 true
22 true
23 true
24 true
25 true
26 true
27 false
28 true
29 false
30 true
31 true
32 true
33 true
34 true
35 true
36 false
37 true
38 false
39 false
Indices before fast sort
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
Indices after fast sort
1
2
0
3
4
5
6
7
8
9
13
14
10
11
12
15
16
18
19
17
20
21
22
23
24
27
29
25
26
28
36
38
39
2147483647
30
31
32
33
34
35
*** isDescending=false
GPU COLUMN *** list to sort false - NC: 0 DATA: null VAL: null
GPU COLUMN *** list to sort false:DATA - NC: 0 DATA: DeviceMemoryBufferView{address=0x7f55ba000500, length=12, id=-1} VAL: null
========= Invalid __global__ read of size 1 bytes
=========     at 0x590 in void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<const bool *, cudf::detail::input_indexalator>, bool *, thrust::cuda_cub::__transform::no_stencil_tag, thrust::cuda_cub::identity, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<const bool *, cudf::detail::input_indexalator>, bool *, thrust::cuda_cub::__transform::no_stencil_tag, thrust::cuda_cub::identity, thrust::cuda_cub::__transform::always_true_predicate>, long>(T2, T3)
=========     by thread (33,0,0) in block (0,0,0)
=========     Address 0x7f563a00047f is out of bounds
=========     and is 536872064 bytes after the nearest allocation at 0x7f55ba000000 of size 1610612736 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:cuLaunchKernel_ptsz [0x2d53e6]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x3e4342b]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame: [0x3e80b78]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:void thrust::cuda_cub::parallel_for<thrust::detail::execute_with_allocator<rmm::mr::thrust_allocator<char>, thrust::cuda_cub::execute_on_stream_nosync_base>, thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<bool const*, cudf::detail::input_indexalator>, bool*, thrust::cuda_cub::__transform::no_stencil_tag, thrust::cuda_cub::identity, thrust::cuda_cub::__transform::always_true_predicate>, long>(thrust::cuda_cub::execution_policy<thrust::detail::execute_with_allocator<rmm::mr::thrust_allocator<char>, thrust::cuda_cub::execute_on_stream_nosync_base> >&, thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<bool const*, cudf::detail::input_indexalator>, bool*, thrust::cuda_cub::__transform::no_stencil_tag, thrust::cuda_cub::identity, thrust::cuda_cub::__transform::always_true_predicate>, long) [0x17c4902]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:void cudf::detail::gather_helper<bool const*, bool*, cudf::detail::input_indexalator>(bool const*, int, bool*, cudf::detail::input_indexalator, cudf::detail::input_indexalator, bool, rmm::cuda_stream_view) [0x17c4ce2]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:std::unique_ptr<cudf::table, std::default_delete<cudf::table> > cudf::detail::gather<cudf::detail::input_indexalator>(cudf::table_view const&, cudf::detail::input_indexalator, cudf::detail::input_indexalator, cudf::out_of_bounds_policy, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x17d321e]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:cudf::detail::gather(cudf::table_view const&, cudf::column_view const&, cudf::out_of_bounds_policy, cudf::detail::negative_index_policy, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x179abf1]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:cudf::detail::(anonymous namespace)::segmented_sort_by_key_common(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, cudf::detail::(anonymous namespace)::sort_method, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x2a3ad04]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:cudf::detail::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x2a3b7f8]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:cudf::lists::detail::sort_lists(cudf::lists_column_view const&, cudf::order, cudf::null_order, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1fbbf3a]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:cudf::lists::sort_lists(cudf::lists_column_view const&, cudf::order, cudf::null_order, rmm::mr::device_memory_resource*) [0x1fbcc3a]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame:Java_ai_rapids_cudf_ColumnView_listSortRows [0x149aef8]
=========                in /tmp/cudf8802830850562852060.so
=========     Host Frame: [0x769dae17]
=========                in 

I also tried this debug build without Spark, using the data I captured as a Parquet file (the same data as in the log). In that run, the output index slot that becomes negative above becomes zero instead, duplicating another entry in the gather map that is zero (i.e., it's a bad but not invalid gather map).

So we should be able to construct a test using the input values shown in the log above and show that the cub call generates an index/gather map containing duplicated or invalid entries even though the input had no duplicates. @davidwendt volunteered to write the C++ repro and see if it shows the issue.

davidwendt (Contributor) commented:

The error appears to occur with bool but not with int8_t in CUB's segmented sort:

  // Headers assumed for building this snippet standalone:
  // <cudf_test/column_wrapper.hpp>, <cudf_test/column_utilities.hpp>,
  // <cudf/sorting.hpp>, <cudf/table/table_view.hpp>
  cudf::test::fixed_width_column_wrapper<bool> col1{
    {true,  false, false, true, true,  true,  true, true, true,  true, true,  true, true, false,
     false, false, false, true, false, false, true, true, true,  true, true,  true, true, false,
     true,  false, true,  true, true,  true,  true, true, false, true, false, false}};
  cudf::test::fixed_width_column_wrapper<int> segments{{0, 5, 10, 15, 20, 25, 30, 40}};
  // segmented_sorted_order returns the per-segment sorted row indices (a gather map)
  auto result = cudf::segmented_sorted_order(cudf::table_view({col1}), segments);
  cudf::test::print(result->view());

Output

1,2,0,3,4,5,6,7,8,9,13,14,10,11,12,15,16,18,19,17,20,21,22,23,24,27,29,25,26,28,36,38,39,32724,30,31,32,33,34,35

Garbage value 32724 at position 33
I will open an issue with CUB
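
For reference, a standalone CUB-only reproducer might look something like the following sketch (a hypothetical sketch, not necessarily the code filed upstream): it segment-sorts the 40 bool keys from the log above with cub::DeviceSegmentedSort::SortPairs, carrying the row indices as values, and prints the carried indices. A correct result would be a permutation of 0..39.

  #include <cub/device/device_segmented_sort.cuh>
  #include <cuda_runtime.h>
  #include <cstdio>

  int main()
  {
    // Keys and segment offsets copied from the column dump above (7 segments).
    bool const h_keys[40] = {true,  false, false, true,  true,  true,  true,  true,
                             true,  true,  true,  true,  true,  false, false, false,
                             false, true,  false, false, true,  true,  true,  true,
                             true,  true,  true,  false, true,  false, true,  true,
                             true,  true,  true,  true,  false, true,  false, false};
    int const h_offsets[8] = {0, 5, 10, 15, 20, 25, 30, 40};
    int h_vals[40];
    for (int i = 0; i < 40; ++i) { h_vals[i] = i; }  // values carry the row index

    bool *d_keys_in, *d_keys_out;
    int *d_vals_in, *d_vals_out, *d_offsets;
    cudaMalloc(reinterpret_cast<void**>(&d_keys_in), sizeof(h_keys));
    cudaMalloc(reinterpret_cast<void**>(&d_keys_out), sizeof(h_keys));
    cudaMalloc(reinterpret_cast<void**>(&d_vals_in), sizeof(h_vals));
    cudaMalloc(reinterpret_cast<void**>(&d_vals_out), sizeof(h_vals));
    cudaMalloc(reinterpret_cast<void**>(&d_offsets), sizeof(h_offsets));
    cudaMemcpy(d_keys_in, h_keys, sizeof(h_keys), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals_in, h_vals, sizeof(h_vals), cudaMemcpyHostToDevice);
    cudaMemcpy(d_offsets, h_offsets, sizeof(h_offsets), cudaMemcpyHostToDevice);

    // Standard two-phase CUB call: query temp storage size, then sort.
    void* d_temp   = nullptr;
    size_t n_bytes = 0;
    cub::DeviceSegmentedSort::SortPairs(d_temp, n_bytes,
                                        d_keys_in, d_keys_out, d_vals_in, d_vals_out,
                                        40, 7, d_offsets, d_offsets + 1);
    cudaMalloc(&d_temp, n_bytes);
    cub::DeviceSegmentedSort::SortPairs(d_temp, n_bytes,
                                        d_keys_in, d_keys_out, d_vals_in, d_vals_out,
                                        40, 7, d_offsets, d_offsets + 1);

    // Print the carried indices; with the suspected bug, some entries are garbage.
    cudaMemcpy(h_vals, d_vals_out, sizeof(h_vals), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 40; ++i) { printf("%d%c", h_vals[i], i == 39 ? '\n' : ','); }
    return 0;
  }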

ttnghia (Contributor) commented Nov 19, 2022

I'm so glad that we can identify and catch the bug so quickly 👍. Thanks :)

GregoryKimball added the libcudf and 2 - In Progress labels and removed the Needs Triage label on Nov 21, 2022
davidwendt (Contributor) commented:

Created CUB issue here: NVIDIA/cub#594
Still working on possible solutions.

davidwendt (Contributor) commented:

Verified that adding a cudf::bit_cast to the test code above works around the error:

  // Reinterpret the BOOL8 column as INT8 (zero-copy) so CUB sorts int8_t keys:
  auto intcol = cudf::bit_cast(cudf::column_view{col1}, cudf::data_type{cudf::type_id::INT8});
  auto result = cudf::segmented_sorted_order(cudf::table_view({intcol}), segments);

The CUB issue is specifically due to an error in handling the bool type for the keys.
The cudf::bit_cast is a zero-copy function.

GregoryKimball (Contributor) commented:

Closed by #12217
