[BUG] groupby result cache error in dask-cudf test #9507

karthikeyann · 2021-10-22T22:50:24Z

Describe the bug
@bdice pointed out a recurring issue in gpuCI from dask-cudf in test_groupby_reset_index_names . I see the message:
RuntimeError: cuDF failure at: ../src/aggregation/result_cache.cpp:42: Result does not exist in cache
@karthikeyann recently edited result_cache.cpp , and we discussed this briefly earlier in the week. I don't think we were able to identify exactly what was going wrong. Are others seeing this? Any thoughts on what might be the cause? It failed on CentOS 7 both times, but with different CUDA versions/Python versions/GPUs.

Examples:
dask_cudf.tests.test_groupby.test_groupby_reset_index_names (from pytest)
dask_cudf.tests.test_groupby.test_groupby_reset_index_names (from pytest)
Error Message
RuntimeError: cuDF failure at: ../src/aggregation/result_cache.cpp:42: Result does not exist in cache
Stacktrace

def test_groupby_reset_index_names():
        df = cudf.datasets.randomdata(
            nrows=10, dtypes={"a": str, "b": int, "c": int}
        )
        pdf = df.to_pandas()
    
        gddf = dask_cudf.from_cudf(df, 2)
        pddf = dd.from_pandas(pdf, 2)
    
>       g_res = gddf.groupby("a", sort=True).sum()

Steps/Code to reproduce bug
Reproduced the error using Minimal cuDF python code.

import cudf
df = cudf.DataFrame({'a': ["x", "x", "y"], 'b': [1, 2, 3]})
df2 = cudf.DataFrame({'a': df.a, 'b': df.b, 'c': df.b})
print(df2.groupby('a').sum())

run this code with compute-sanitizer --tool memcheck python code.py

thanks 🙏 @bdice @jakirkham

The text was updated successfully, but these errors were encountered:

karthikeyann · 2021-10-22T22:54:09Z

Issue is in std::unordered_map with std::reference_wrapper<aggregation const> in the key.

  std::unordered_map<std::pair<column_view, std::reference_wrapper<aggregation const>>,
                     std::pair<std::unique_ptr<aggregation>, std::unique_ptr<column>>,
                     pair_column_aggregation_hash,
                     pair_column_aggregation_equal_to>
    _cache;

void result_cache::add_result(column_view const& input,
                              aggregation const& agg,
                              std::unique_ptr<column>&& col)
{
  // We can't guarantee that agg will outlive the cache, so we need to take ownership of a copy.
  // To allow lookup by reference, make the key a reference and keep the owner in the value pair.
  auto owned_agg       = agg.clone();
  auto const& key      = *owned_agg;
  auto value           = std::make_pair(std::move(owned_agg), std::move(col));
  _cache[{input, key}] = std::move(value);
}

When add_result(col1, agg1, column1); add_result(col1, agg1, column2); is called (see twice), then _cache doesn't contain any value for {col1, agg1} anymore.
This happens either randomly or always running via cuda-memcheck.

When _cache[{input, key}] = std::move(value); executes 2nd time, old key is destroyed. But the key's reference never changes which points to the destroyed key.
So, when compared again, pair_column_aggregation_equal_to fails because we are comparing a destroyed object (whose memory may have been overwritten).

Fixes #9507 Prevents inserting to groupby result cache if the result for <column, aggregation> pair is already present in the cache. Added unit test to test this. **Details:** When `add_result(col1, agg1, column1); add_result(col1, agg1, column2);` is called (see twice), then _cache doesn't contain any value for {col1, agg1} anymore. Issue is in `_cache` `std::unordered_map` with `std::reference_wrapper<aggregation const>` in the key. When `_cache[{input, key}] = std::move(value);` executes 2nd time, old key is destroyed. But the key's reference never changes which points to the destroyed key. So, when compared again, `pair_column_aggregation_equal_to` fails because we are comparing a destroyed object (whose memory may have been overwritten). #9507 (comment) Authors: - Karthikeyan (https://github.com/karthikeyann) Approvers: - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) - Bradley Dice (https://github.com/bdice) URL: #9508

karthikeyann added bug Something isn't working Needs Triage Need team to review and classify labels Oct 22, 2021

karthikeyann added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Oct 22, 2021

karthikeyann mentioned this issue Oct 22, 2021

Fix for inserting duplicates in groupby result cache #9508

Merged

karthikeyann self-assigned this Oct 22, 2021

rapids-bot bot closed this as completed in #9508 Oct 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] groupby result cache error in dask-cudf test #9507

[BUG] groupby result cache error in dask-cudf test #9507

karthikeyann commented Oct 22, 2021 •

edited

Loading

karthikeyann commented Oct 22, 2021

[BUG] groupby result cache error in dask-cudf test #9507

[BUG] groupby result cache error in dask-cudf test #9507

Comments

karthikeyann commented Oct 22, 2021 • edited Loading

karthikeyann commented Oct 22, 2021

karthikeyann commented Oct 22, 2021 •

edited

Loading