
[BUG] Memcheck error in nvcomp::unsnap_kernel reported in libcudf cuIO gtests #14440

Closed
davidwendt opened this issue Nov 17, 2023 · 4 comments
Labels: bug (Something isn't working), cuIO (cuIO issue), libcudf (Affects libcudf (C++/CUDA) code)

Comments


davidwendt commented Nov 17, 2023

The nightly build https://github.com/rapidsai/cudf/actions/runs/6902410549/job/18779068356 reports a compute-sanitizer failure in the memcheck tests, indicating an out-of-bounds read in nvcomp::unsnap_kernel for the following gtests:

  • ArrowIOTest.S3FileSystem
  • OrcWriterNumericTypeTest/6.SingleColumn
  • ParquetWriterNumericTypeTest/3.SingleColumnWithNulls

Partial output from the compute-sanitizer run:

[ RUN      ] ParquetWriterNumericTypeTest/3.SingleColumnWithNulls
========= Invalid __global__ read of size 1 bytes
=========     at 0x39e0 in nvcomp::unsnap_kernel(const void *const *, const unsigned long *, void *const *, const unsigned long *, nvcompStatus_t *, unsigned long *)
=========     by thread (20,0,0) in block (0,0,0)
=========     Address 0x7f94de4008f0 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7f94de400800 of size 240 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x304e32]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:__cudart808 [0xdbb6b]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/./libnvcomp.so
=========     Host Frame:cudaLaunchKernel [0x138b5b]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/./libnvcomp.so
=========     Host Frame:__device_stub__ZN6nvcomp13unsnap_kernelEPKPKvPKmPKPvS5_P14nvcompStatus_tPm(void const* const*, unsigned long const*, void* const*, unsigned long const*, nvcompStatus_t*, unsigned long*) [0x86e1c]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/./libnvcomp.so
=========     Host Frame:nvcomp::gpu_unsnap(void const* const*, unsigned long const*, void* const*, unsigned long const*, nvcompStatus_t*, unsigned long*, unsigned long, CUstream_st*) [0x86f07]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/./libnvcomp.so
=========     Host Frame:nvcompBatchedSnappyDecompressAsync [0xc7b6e]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/./libnvcomp.so
=========     Host Frame:cudf::io::nvcomp::batched_decompress(cudf::io::nvcomp::compression_type, cudf::device_span<cudf::device_span<unsigned char const, 18446744073709551615ul> const, 18446744073709551615ul>, cudf::device_span<cudf::device_span<unsigned char, 18446744073709551615ul> const, 18446744073709551615ul>, cudf::device_span<cudf::io::compression_result, 18446744073709551615ul>, unsigned long, unsigned long, rmm::cuda_stream_view) [0x132e06f]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:cudf::io::parquet::detail::(anonymous namespace)::decompress_page_data(cudf::detail::hostdevice_vector<cudf::io::parquet::detail::ColumnChunkDesc>&, cudf::detail::hostdevice_vector<cudf::io::parquet::detail::PageInfo>&, rmm::cuda_stream_view) [0x14b9570]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:cudf::io::parquet::detail::reader::impl::load_and_decompress_data() [0x14babe6]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:cudf::io::parquet::detail::reader::impl::prepare_data(long, std::optional<int> const&, bool, cudf::host_span<std::vector<int, std::allocator<int> > const, 18446744073709551615ul>, std::optional<std::reference_wrapper<cudf::ast::expression const> >) [0x148899c]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:cudf::io::parquet::detail::reader::impl::read(long, std::optional<int> const&, bool, cudf::host_span<std::vector<int, std::allocator<int> > const, 18446744073709551615ul>, std::optional<std::reference_wrapper<cudf::ast::expression const> >) [0x148d03f]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:cudf::io::parquet::detail::reader::read(cudf::io::parquet_reader_options const&) [0x147bc47]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:cudf::io::read_parquet(cudf::io::parquet_reader_options const&, rmm::mr::device_memory_resource*) [0x13716be]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:ParquetWriterNumericTypeTest_SingleColumnWithNulls_Test<long>::TestBody() [0x28ecc2]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/PARQUET_TEST

The following can be used to reproduce these errors:
compute-sanitizer --tool memcheck gtests/PARQUET_TEST --gtest_filter=ParquetWriterNumericTypeTest/3.SingleColumnWithNulls --rmm_mode=cuda
compute-sanitizer --tool memcheck gtests/ORC_TEST --gtest_filter=OrcWriterNumericTypeTest/6.SingleColumn --rmm_mode=cuda
compute-sanitizer --tool memcheck gtests/ARROW_IO_SOURCE_TEST --gtest_filter=ArrowIOTest.S3FileSystem --rmm_mode=cuda

The Arrow test may require an S3 setup, but I suspect fixing the error for the other two will fix this one as well.

davidwendt added the bug (Something isn't working), Needs Triage (Need team to review and classify), libcudf (Affects libcudf (C++/CUDA) code), and cuIO (cuIO issue) labels on Nov 17, 2023

madsbk commented Nov 20, 2023

Running with compute-sanitizer --tool racecheck, there also seems to be a race in nvcomp::unsnap_kernel:

========= Race reported between Read access at 0xb90 in nvcomp::unsnap_kernel(const void *const *, const unsigned long *, void *const *, const unsigned long *, nvcompStatus_t *, unsigned long *)
=========     and Write access at 0x38e0 in nvcomp::unsnap_kernel(const void *const *, const unsigned long *, void *const *, const unsigned long *, nvcompStatus_t *, unsigned long *) [2 hazards]
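
This was presumably the same kind of invocation as the memcheck repros above, with the tool swapped out, e.g.:
compute-sanitizer --tool racecheck gtests/PARQUET_TEST --gtest_filter=ParquetWriterNumericTypeTest/3.SingleColumnWithNulls --rmm_mode=cuda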

nvdbaranec commented

I think I ran into this with a simple repro. @jbrennan333 sent me here:

// Includes assumed to make this repro standalone (cudf test utilities + Parquet I/O):
#include <cudf_test/column_wrapper.hpp>

#include <cudf/detail/iterator.cuh>
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

#include <string>

void nvcomp_repro()
{
  std::string filepath("table_with_dict.parquet");
  constexpr auto num_rows = 100;
  auto iter1 = cudf::detail::make_counting_transform_iterator(0, [](int) { return 15; });
  cudf::test::fixed_width_column_wrapper<int> col1(iter1, iter1 + num_rows);
  auto tbl = cudf::table_view{{col1}};

  // Write a Snappy-compressed file with dictionary encoding disabled.
  cudf::io::parquet_writer_options out_opts =
    cudf::io::parquet_writer_options::builder(cudf::io::sink_info{filepath}, tbl)
      .compression(cudf::io::compression_type::SNAPPY)
      .dictionary_policy(cudf::io::dictionary_policy::NEVER);
  cudf::io::write_parquet(out_opts);

  // Read it back; the Snappy decompression path is what trips compute-sanitizer.
  cudf::io::parquet_reader_options in_opts =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info{filepath});
  auto result = cudf::io::read_parquet(in_opts);
}

Symptoms are the same. No crash, but I get a compute-sanitizer report that looks suspiciously similar:

========= COMPUTE-SANITIZER
========= Invalid __global__ read of size 1 bytes
=========     at 0x3500 in nvcomp::unsnap_kernel(const void *const *, const unsigned long *, void *const *, const unsigned long *, nvcompStatus_t *, unsigned long *)
=========     by thread (27,0,0) in block (0,0,0)
=========     Address 0x7f6bf1e00430 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7f6bf1e00400 of size 48 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x305c18]

GregoryKimball commented

@nvdbaranec, in the repro you shared above, does the data faithfully roundtrip or does it change?
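
One way to check (a sketch, not tested; it assumes the repro above plus the cudf test macro CUDF_TEST_EXPECT_TABLES_EQUAL from <cudf_test/table_utilities.hpp>) would be to compare the written table against what comes back:

#include <cudf_test/table_utilities.hpp>

// At the end of nvcomp_repro(), after the read_parquet() call:
auto result = cudf::io::read_parquet(in_opts);
// Compare the table view that was written with the table read back from the file.
CUDF_TEST_EXPECT_TABLES_EQUAL(tbl, result.tbl->view());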

rapids-bot bot pushed a commit to rapidsai/rapids-cmake that referenced this issue Dec 6, 2023
Upgrading to nvCOMP 3.0.5, which fixes some memcheck errors.
Reference: rapidsai/cudf#14581 and rapidsai/cudf#14440

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)

URL: #498
rapids-bot bot pushed a commit that referenced this issue Dec 6, 2023
This fixes some memcheck errors found by the libcudf nightly builds as documented here: #14440

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Ray Douglass (https://github.com/raydouglass)

URL: #14581
karthikeyann pushed a commit to karthikeyann/cudf that referenced this issue Dec 12, 2023
This fixes some memcheck errors found by the libcudf nightly builds as documented here: rapidsai#14440

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#14581
GregoryKimball commented

Closed by #14581
