Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] groupby result cache error in dask-cudf test #9507

Closed
karthikeyann opened this issue Oct 22, 2021 · 1 comment · Fixed by #9508
Closed

[BUG] groupby result cache error in dask-cudf test #9507

karthikeyann opened this issue Oct 22, 2021 · 1 comment · Fixed by #9508
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments

@karthikeyann
Copy link
Contributor

karthikeyann commented Oct 22, 2021

Describe the bug
@bdice pointed out a recurring issue in gpuCI from dask-cudf in test_groupby_reset_index_names . I see the message:
RuntimeError: cuDF failure at: ../src/aggregation/result_cache.cpp:42: Result does not exist in cache
@karthikeyann recently edited result_cache.cpp , and we discussed this briefly earlier in the week. I don't think we were able to identify exactly what was going wrong. Are others seeing this? Any thoughts on what might be the cause? It failed on CentOS 7 both times, but with different CUDA versions/Python versions/GPUs.

Examples:
dask_cudf.tests.test_groupby.test_groupby_reset_index_names (from pytest)
dask_cudf.tests.test_groupby.test_groupby_reset_index_names (from pytest)
Error Message
RuntimeError: cuDF failure at: ../src/aggregation/result_cache.cpp:42: Result does not exist in cache
Stacktrace

def test_groupby_reset_index_names():
        df = cudf.datasets.randomdata(
            nrows=10, dtypes={"a": str, "b": int, "c": int}
        )
        pdf = df.to_pandas()
    
        gddf = dask_cudf.from_cudf(df, 2)
        pddf = dd.from_pandas(pdf, 2)
    
>       g_res = gddf.groupby("a", sort=True).sum()

Steps/Code to reproduce bug
Reproduced the error using Minimal cuDF python code.

import cudf
df = cudf.DataFrame({'a': ["x", "x", "y"], 'b': [1, 2, 3]})
df2 = cudf.DataFrame({'a': df.a, 'b': df.b, 'c': df.b})
print(df2.groupby('a').sum())

run this code with compute-sanitizer --tool memcheck python code.py

thanks 🙏 @bdice @jakirkham

@karthikeyann karthikeyann added bug Something isn't working Needs Triage Need team to review and classify labels Oct 22, 2021
@karthikeyann
Copy link
Contributor Author

Issue is in std::unordered_map with std::reference_wrapper<aggregation const> in the key.

  std::unordered_map<std::pair<column_view, std::reference_wrapper<aggregation const>>,
                     std::pair<std::unique_ptr<aggregation>, std::unique_ptr<column>>,
                     pair_column_aggregation_hash,
                     pair_column_aggregation_equal_to>
    _cache;

void result_cache::add_result(column_view const& input,
                              aggregation const& agg,
                              std::unique_ptr<column>&& col)
{
  // We can't guarantee that agg will outlive the cache, so we need to take ownership of a copy.
  // To allow lookup by reference, make the key a reference and keep the owner in the value pair.
  auto owned_agg       = agg.clone();
  auto const& key      = *owned_agg;
  auto value           = std::make_pair(std::move(owned_agg), std::move(col));
  _cache[{input, key}] = std::move(value);
}

When add_result(col1, agg1, column1); add_result(col1, agg1, column2); is called (see twice), then _cache doesn't contain any value for {col1, agg1} anymore.
This happens either randomly or always running via cuda-memcheck.

When _cache[{input, key}] = std::move(value); executes 2nd time, old key is destroyed. But the key's reference never changes which points to the destroyed key.
So, when compared again, pair_column_aggregation_equal_to fails because we are comparing a destroyed object (whose memory may have been overwritten).

@karthikeyann karthikeyann added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Oct 22, 2021
@karthikeyann karthikeyann self-assigned this Oct 22, 2021
rapids-bot bot pushed a commit that referenced this issue Oct 26, 2021
Fixes #9507

Prevents inserting to groupby result cache if the result for <column, aggregation> pair is already present in the cache.
Added unit test to test this. 

**Details:**
When `add_result(col1, agg1, column1); add_result(col1, agg1, column2);` is called (see twice), then _cache doesn't contain any value for {col1, agg1} anymore.
Issue is in `_cache` `std::unordered_map` with `std::reference_wrapper<aggregation const>` in the key.

When `_cache[{input, key}] = std::move(value);` executes 2nd time, old key is destroyed. But the key's reference never changes which points to the destroyed key.
So, when compared again, `pair_column_aggregation_equal_to` fails because we are comparing a destroyed object (whose memory may have been overwritten).
#9507 (comment)

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #9508
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant