
[BUG] Hausdorff fails with OOM for > intmax num inputs #393

Closed
cwharris opened this issue May 3, 2021 · 2 comments · Fixed by rapidsai/cudf#8199 or #424
Assignees
Labels
bug Something isn't working Needs Triage Need team to review and classify

Comments


cwharris commented May 3, 2021

Hausdorff fails for a large number of inputs. The problem originates in Thrust, which stores num_items as an int; fixing that uncovers an illegal memory access bug.

NVIDIA/cccl#766
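As a rough illustration (not the library's actual indexing code), the number of pairwise distances Hausdorff feeds into the scan grows quadratically with the total point count, which is how a moderate-looking input overflows a 32-bit num_items:

```python
# Hypothetical back-of-the-envelope sketch: the scan over all pairwise
# distances must index (num_spaces * points_per_space)**2 items, which
# overflows a 32-bit signed int long before memory runs out.
INT32_MAX = 2**31 - 1

def total_pair_count(num_spaces: int, points_per_space: int) -> int:
    # Every point is compared against every other point across all
    # space pairs: (total points)^2 candidate distances.
    total_points = num_spaces * points_per_space
    return total_points ** 2

n = total_pair_count(6000, 100)
print(n)               # 360,000,000,000 pair distances
print(n > INT32_MAX)   # True: an int num_items cannot index this
```

The 6000-spaces/100-points shape matches the failing test case discussed below; the function name is illustrative only.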

@cwharris cwharris added bug Something isn't working Needs Triage Need team to review and classify labels May 3, 2021
@cwharris cwharris self-assigned this May 3, 2021
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue May 11, 2021

Same fix as seen here, but applied via patch: NVIDIA/thrust#1424

Also fixes rapidsai/cuspatial#393

Alternatively, we could wait and update our thrust version, rather than patching the existing one.

Authors:
  - Christopher Harris (https://github.com/cwharris)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Paul Taylor (https://github.com/trxcllnt)

URL: #8199
@cwharris cwharris reopened this May 20, 2021
@cwharris

The fix had to be rolled back due to a performance regression in libcudf. We will have to tackle this problem by partitioning inputs, rather than accepting a single large input. Partitions will have to begin and end at specific offsets to ensure the inclusive_scan results are correct; these offsets are deterministic, according to the cartesian_product_group_index begin/end pairs. The math for this is a forward-only packing problem, which is simple to compute on the CPU, but not as straightforward on the GPU. Attempts to compute it on the CPU resulted in slow performance, but this may have been expected due to the high number of inputs I was testing with at the time. Further investigation is required.
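The forward-only packing described above can be sketched as a greedy pass over the group sizes (a hypothetical simplification; names and the capacity parameter are illustrative, not cuspatial API):

```python
# Hypothetical sketch of forward-only packing: pack whole groups into
# partitions of at most `capacity` elements, so that every partition
# boundary coincides with a group boundary and a per-partition
# inclusive_scan remains correct.
def pack_partitions(group_sizes, capacity):
    partitions = []
    current, used = [], 0
    for size in group_sizes:
        # Never split a group: flush the current partition if this
        # group would overflow it.
        if used + size > capacity and current:
            partitions.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        partitions.append(current)
    return partitions

print(pack_partitions([4, 4, 4, 4], 10))  # [[4, 4], [4, 4]]
```

Because the pass is strictly forward and each decision depends on the running total, it is trivial on the CPU but awkward to parallelize on the GPU, which matches the trade-off noted above.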


cwharris commented Jun 22, 2021

I reapplied the Thrust scan_by_key patch locally to allow 64-bit indices, ran all @rapidsai/cudf benchmarks, and found little to no performance difference. From there, I was able to get Hausdorff working with 5000 spaces and 100 points. However, larger inputs resulted in an OOM from Thrust's temporary memory allocations:

```
(rapids) rapids@compose:~$ test-cuspatial-cpp HAUSDORFF_TEST

+ cd /home/charris/dev/rapids/cuspatial/cpp/build/release
[2/2] Linking CXX executable gtests/HAUSDORFF_TEST
+ ctest --force-new-ctest-process --output-on-failure -R HAUSDORFF_TEST
Test project /home/charris/dev/rapids/cuspatial/cpp/build/release
    Start 5: HAUSDORFF_TEST
1/1 Test #5: HAUSDORFF_TEST ...................***Exception: SegFault  1.27 sec
Running main() from ../googletest/src/gtest_main.cc
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from HausdorffTest/0, where TypeParam = float
[ RUN      ] HausdorffTest/0.6000Spaces100Points
Thread,Time,Action,Pointer,Size,Stream
57027,01:13:22.531965,try allocate,_,4800000,0x0
57027,01:13:23.206586,did allocate,0x7f71fc000000,4800000,0x0
57027,01:13:23.208725,try allocate,_,4800000,0x0
57027,01:13:23.208874,did allocate,0x7f71fc600000,4800000,0x0
57027,01:13:23.209343,try allocate,_,24000,0x0
57027,01:13:23.209485,did allocate,0x7f71efe00000,24000,0x0
57027,01:13:23.303551,try allocate,_,288000000,0x0
57027,01:13:23.304031,did allocate,0x7f71c2000000,288000000,0x0
57027,01:13:23.348967,try allocate,_,288000000,0x0
57027,01:13:23.349418,did allocate,0x7f71b0000000,288000000,0x0
57027,01:13:23.349435,try allocate,_,1152000000,0x0
57027,01:13:23.350452,did allocate,0x7f716a000000,1152000000,0x0
57027,01:13:23.350481,try allocate,_,56953128704,0x0
57027,01:13:23.354702,free,0x7f716a000000,1152000000,0x0
57027,01:13:23.355787,free,0x7f71b0000000,288000000,0x0
57027,01:13:23.358178,free,0x7f71c2000000,288000000,0x0
57027,01:13:23.360274,free,0x7f71efe00000,24000,0x0
57027,01:13:23.361089,free,0x7f71fc600000,4800000,0x0
57027,01:13:23.362148,free,0x7f71fc000000,4800000,0x0
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/charris/dev/rapids/rmm/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory" thrown in the test body.
[  FAILED  ] HausdorffTest/0.6000Spaces100Points, where TypeParam = float (874 ms)
[----------] 1 test from HausdorffTest/0 (874 ms total)

[----------] 1 test from HausdorffTest/1, where TypeParam = double
[ RUN      ] HausdorffTest/1.6000Spaces100Points
Thread,Time,Action,Pointer,Size,Stream
57027,01:13:23.363653,try allocate,_,4800000,0x0
57027,01:13:23.363661,did allocate,0x7f724bd62c90,4800000,0x0
```

I hypothesize that this is a legitimate OOM caused by the design of temporary memory usage in the single-pass scan algorithm within thrust/cub. However, I believe it is unreasonable to attempt to reduce the temporary memory requirements of the thrust/cub single-pass scan implementation at this time (specifically, as part of this bug). Therefore, I will look for alternatives that do not require patching Thrust.

edit:

For anyone coming across this comment wondering why inclusive_scan is being used: the Hausdorff implementation uses an associative (non-commutative) reduction operator, but thrust/cub does not provide an associative reduction algorithm. Instead, inclusive_scan is used in conjunction with a custom "scatter output iterator" to compress the output at write time, keeping only the last element of each scan. Essentially, the thrust/cub scan algorithm was never designed to be used with this many inputs, and therefore makes a reasonable assumption about the maximum size of temporary storage. Patching Thrust to use a 64-bit integer increased the maximum number of inputs, and therefore increased the maximum possible temporary storage size, resulting in an OOM for a large number of inputs. This could be alleviated by using a circular buffer in the single-pass scan algorithm, rather than retaining all intermediate results, but that is far out of scope for this bug.
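The scan-then-keep-last trick can be sketched on the CPU (a hypothetical illustration of the idea, not the Thrust code path; the scatter output iterator does the "keep only the last element" step at write time on the GPU):

```python
# Hypothetical sketch: reduce each key-run of `values` with an
# associative but NON-commutative operator by running an inclusive
# scan over the run and keeping only the run's last element --
# emulating a segmented reduction that thrust/cub does not provide
# directly for non-commutative operators.
from itertools import accumulate, groupby

def segmented_fold(keys, values, op):
    out = []
    for _, group in groupby(zip(keys, values), key=lambda kv: kv[0]):
        run = [v for _, v in group]
        # Last element of the inclusive scan == the fold of the run.
        out.append(list(accumulate(run, op))[-1])
    return out

# String concatenation is associative but not commutative, so order
# of application matters and a plain commutative reduce would be wrong.
print(segmented_fold([0, 0, 1, 1, 1],
                     ["a", "b", "c", "d", "e"],
                     lambda x, y: x + y))  # ['ab', 'cde']
```

On the GPU the scan is a single pass over all segments keyed by space pair, which is why its temporary storage scales with the total input size.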

@rapids-bot rapids-bot bot closed this as completed in #424 Jul 2, 2021
rapids-bot bot pushed a commit that referenced this issue Jul 2, 2021
Fixes #393

We switched Hausdorff to the exclusive-scan approach because certain benchmarks indicated better performance. Apparently those benchmarks were inadequate, or just plain badly written (by me), and performance was in fact worse. This became apparent while fixing the OOM error reported in #393.

I copied the 0.14 implementation into the 21.08 branch to re-benchmark. Here are the results:

cuspatial@0.14:
```
------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/64/manual_time         1.62 ms         1.78 ms          428 items_per_second=23.9898G/s
HausdorffBenchmark/hausdorff/512/64/manual_time         43.9 ms         44.1 ms           16 items_per_second=23.6053G/s
HausdorffBenchmark/hausdorff/4096/64/manual_time        2810 ms         2810 ms            1 items_per_second=23.6845G/s
HausdorffBenchmark/hausdorff/6000/64/manual_time        6148 ms         6148 ms            1 items_per_second=23.2318G/s
HausdorffBenchmark/hausdorff/100/100/manual_time        3.31 ms         3.47 ms          210 items_per_second=29.0333G/s
HausdorffBenchmark/hausdorff/512/100/manual_time        88.9 ms         89.1 ms            8 items_per_second=28.7737G/s
HausdorffBenchmark/hausdorff/4096/100/manual_time       5842 ms         5842 ms            1 items_per_second=28.132G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time      12698 ms        12698 ms            1 items_per_second=27.7783G/s
```
cuspatial@21.08 (with fix for OOM, as seen in previous commits of this PR)
```
------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/64/manual_time         17.4 ms         17.6 ms           38 items_per_second=2.2391G/s
HausdorffBenchmark/hausdorff/512/64/manual_time          489 ms          490 ms            2 items_per_second=2.11979G/s
HausdorffBenchmark/hausdorff/4096/64/manual_time       37120 ms        37119 ms            1 items_per_second=1.79299G/s
HausdorffBenchmark/hausdorff/6000/64/manual_time       82732 ms        82729 ms            1 items_per_second=1.7265G/s
HausdorffBenchmark/hausdorff/100/100/manual_time        43.4 ms         43.7 ms           16 items_per_second=2.21402G/s
HausdorffBenchmark/hausdorff/512/100/manual_time        1341 ms         1341 ms            1 items_per_second=1.90885G/s
HausdorffBenchmark/hausdorff/4096/100/manual_time      94898 ms        94894 ms            1 items_per_second=1.7319G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time     199120 ms       199115 ms            1 items_per_second=1.77138G/s
```

The performance is bad, and this regression is my fault. Fortunately, I was able to quickly reverse the regression and improve performance while removing a bunch of code (and learning a lot in the process). This PR re-implements Hausdorff as a straightforward custom kernel that requires zero intermediate memory.
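The reason a custom kernel needs no intermediate memory can be seen in a plain CPU sketch of the directed Hausdorff distance (hypothetical pseudocode for the idea, not the actual CUDA kernel): the max-of-mins can be held in registers, so nothing is materialized between passes.

```python
# Hypothetical sketch: directed Hausdorff distance from space A to
# space B. Each thread in the real kernel can keep `best` and
# `nearest` in registers, so no intermediate buffer is ever written.
import math

def directed_hausdorff(space_a, space_b):
    best = 0.0
    for ax, ay in space_a:
        # Minimum distance from this point of A to any point of B...
        nearest = min(math.hypot(ax - bx, ay - by) for bx, by in space_b)
        # ...and the maximum of those minimums over all points of A.
        best = max(best, nearest)
    return best

a = [(0.0, 0.0), (0.0, 1.0)]
b = [(1.0, 0.0)]
print(directed_hausdorff(a, b))  # sqrt(2) ~= 1.4142
```

Contrast this with the scan approach, which materializes (or at least indexes) every pairwise distance, driving both the intmax overflow and the temporary-storage OOM above.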

This PR:
```
------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/64/manual_time         1.31 ms         1.47 ms          526 items_per_second=29.6763G/s
HausdorffBenchmark/hausdorff/512/64/manual_time         23.2 ms         23.3 ms           30 items_per_second=44.7567G/s
HausdorffBenchmark/hausdorff/4096/64/manual_time        1589 ms         1590 ms            1 items_per_second=41.8747G/s
HausdorffBenchmark/hausdorff/6000/64/manual_time        3170 ms         3170 ms            1 items_per_second=45.0638G/s
HausdorffBenchmark/hausdorff/100/100/manual_time        2.92 ms         3.08 ms          239 items_per_second=32.8852G/s
HausdorffBenchmark/hausdorff/512/100/manual_time        55.8 ms         55.8 ms           12 items_per_second=45.8415G/s
HausdorffBenchmark/hausdorff/4096/100/manual_time       3547 ms         3547 ms            1 items_per_second=46.3317G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time       7658 ms         7658 ms            1 items_per_second=46.0564G/s
```

Authors:
  - Christopher Harris (https://github.com/cwharris)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Paul Taylor (https://github.com/trxcllnt)
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #424