This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Invalid result from DeviceSegmentedSort::SortPairs/SortKeys when keys are bool type #594

Closed
davidwendt opened this issue Nov 21, 2022 · 3 comments · Fixed by #595
Labels
P0: must have (Absolutely necessary. Critical issue, major blocker, etc.); type: bug: functional (Does not work as intended.)
Milestone

Comments

@davidwendt
Contributor

If the keys parameter is of type bool, then DeviceSegmentedSort::SortPairs and DeviceSegmentedSort::SortKeys return invalid results. The following example reproduces the error consistently. The error occurs with SortKeys as well but is easier to see with SortPairs.

#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/sequence.h>
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>
#include <iostream>

template <typename T>
std::vector<int> segmented_sort(bool const *h_data, size_t input_size,
                                int const *h_segs, size_t segs_size)
{
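  // copy the bool keys to device memory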
  T *d_data;
  cudaMalloc(&d_data, input_size * sizeof(bool));
  cudaMemcpy(d_data, h_data, input_size * sizeof(bool), cudaMemcpyHostToDevice);

  thrust::device_vector<int> d_segs(segs_size);
  cudaMemcpy(d_segs.data().get(), h_segs, segs_size * sizeof(int), cudaMemcpyHostToDevice);

  T *d_output;
  cudaMalloc(&d_output, input_size * sizeof(bool));

  thrust::device_vector<int> d_indices(input_size);
  thrust::sequence(thrust::device, d_indices.begin(), d_indices.end(), 0);

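  // first call computes the required temporary storage size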
  void *d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
  cub::DeviceSegmentedSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                      d_data, d_output,
                                      d_indices.data().get(), d_indices.data().get(),
                                      input_size, d_segs.size() - 1,
                                      d_segs.begin(), d_segs.begin() + 1);

  cudaMalloc(&d_temp_storage, temp_storage_bytes);

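  // second call performs the segmented sort of the keys and paired indices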
  cub::DeviceSegmentedSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                      d_data, d_output,
                                      d_indices.data().get(), d_indices.data().get(),
                                      input_size, d_segs.size() - 1,
                                      d_segs.begin(), d_segs.begin() + 1);

  cudaFree(d_temp_storage);
  cudaFree(d_data);
  cudaFree(d_output);

  std::vector<int> h_indices(d_indices.size());
  cudaMemcpy(h_indices.data(), d_indices.data().get(), d_indices.size() * sizeof(int), cudaMemcpyDeviceToHost);
  return h_indices;
}

int main()
{
  bool h_data[] = {true, false, false, true, true,
                   true, true, true, true, true,
                   true, true, true, false, false,
                   false, false, true, false, false,
                   true, true, true, true, true,
                   true, true, false, true, false,
                   true, true, true, true, true, true, false, true, false, false};
  auto input_size = sizeof(h_data) / sizeof(h_data[0]);
  std::vector<int> h_segs({0, 5, 10, 15, 20, 25, 30, 40});

  auto result = segmented_sort<bool>(h_data, input_size, h_segs.data(), h_segs.size());

  std::vector<int> h_valid(
      { 1,  2,  0,  3,  4,  5,  6,  7,  8,  9,
       13, 14, 10, 11, 12, 15, 16, 18, 19, 17,
       20, 21, 22, 23, 24, 27, 29, 25, 26, 28,
       36, 38, 39, 30, 31, 32, 33, 34, 35, 37});

  for (size_t i = 0; i < result.size(); ++i)  {
    if (result[i] != h_valid[i]) {
      std::cout << "error at position " << i << ": " << result[i] << "!=" << h_valid[i] << "\n";
      break;
    }
  }
}

The error consistently occurs at position 33 in the indices/result vector. The value itself is random. The values after position 33 are also incorrect but appear to be shifted: result[34] should be in result[33], result[35] should be in result[34], etc.

The error appears to be caused by the following line:

KeyT oob_default = reinterpret_cast<KeyT &>(default_key_bits);

The oob_default value is 0xFF, which is neither true nor false, so the sort algorithm is unable to process it.
Note that true does not compare consistently against def in this example: https://godbolt.org/z/f8bcjaMhW
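
For illustration, a minimal host-side sketch of the same failure mode (this is not the godbolt code; memcpy-ing 0xFF into a bool is used here only to mimic the out-of-bounds default, and the resulting comparisons are compiler-dependent):

#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
  std::uint8_t default_key_bits = 0xFF;  // the bit pattern reported above for oob_default
  bool oob_default;
  std::memcpy(&oob_default, &default_key_bits, sizeof(bool));

  // With the 0xFF object representation, comparisons against true/false can
  // give inconsistent answers depending on how the compiler lowers them.
  std::cout << (oob_default == true) << " " << (oob_default > false) << "\n";
  return 0;
}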

@ttnghia

ttnghia commented Nov 21, 2022

CC myself to subscribe to the issue.

@gevtushenko
Collaborator

@davidwendt thank you for reporting the issue! I can reproduce it. We haven't tested bool type for keys. I guess the issue can be addressed if we do something like:

return Traits<KeyT>::TwiddleIn(lhs) > Traits<KeyT>::TwiddleIn(rhs);

in the AgentSubWarpSort::BinaryOpT. I'll create a PR with the fix and additional tests soon. Meanwhile, you could reinterpret_cast your bool* to std::uint8_t*, or use that type in the first place.
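
For concreteness, a sketch of that workaround applied to the reproducer above (variable names are reused from the example and std::uint8_t needs <cstdint>; this is an assumed wiring, not code from the upcoming fix):

// Workaround sketch: hand the bool keys to CUB as std::uint8_t so the 0xFF
// out-of-bounds default is a valid key value.
auto d_keys_in  = reinterpret_cast<std::uint8_t const *>(d_data);
auto d_keys_out = reinterpret_cast<std::uint8_t *>(d_output);

cub::DeviceSegmentedSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                    d_keys_in, d_keys_out,
                                    d_indices.data().get(), d_indices.data().get(),
                                    input_size, d_segs.size() - 1,
                                    d_segs.begin(), d_segs.begin() + 1);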

@gevtushenko gevtushenko added type: bug: functional Does not work as intended. P0: must have Absolutely necessary. Critical issue, major blocker, etc. labels Nov 21, 2022
@gevtushenko gevtushenko added this to the 2.1.0 milestone Nov 21, 2022
@davidwendt
Contributor Author

I made this change to the oob_default declaration, which seems to work:

KeyT oob_default = std::is_same_v<KeyT, bool> ? !IS_DESCENDING : reinterpret_cast<KeyT &>(default_key_bits);

Perhaps Traits<bool> can be specialized and set with LOWEST_KEY=false and MAX_KEY=true.
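
Roughly the shape of that idea (a hypothetical illustration only, not CUB's actual Traits machinery; sort_key_limits is an invented stand-in):

// Hypothetical helper, not CUB code: pin the extreme key values for bool so
// that any out-of-bounds filler built from them is itself a valid bool.
template <typename KeyT> struct sort_key_limits;

template <> struct sort_key_limits<bool>
{
  static constexpr bool LOWEST_KEY = false;
  static constexpr bool MAX_KEY    = true;
};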

@gevtushenko gevtushenko linked a pull request Nov 22, 2022 that will close this issue
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Nov 29, 2022
Fix in CUB DeviceSegmentedSort allows the workaround to be removed. The CUB fix is applied as a patch in the libcudf build process.
Reference NVIDIA/cub#594 and #12217

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Elias Stehle (https://github.com/elstehle)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #12234