
[BUG] Hausdorff fails with OOM for > intmax num inputs #393

Closed
cwharris opened this issue May 3, 2021 · 2 comments · Fixed by rapidsai/cudf#8199 or #424
Assignees
Labels
bug Something isn't working Needs Triage Need team to review and classify

Comments


cwharris commented May 3, 2021

Hausdorff fails for a large number of inputs. The problem originates in Thrust, which stores num_items as an int; fixing that uncovers an illegal memory access bug.

NVIDIA/cccl#766
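As a rough illustration (not the library's actual indexing code), the number of pairwise distances Hausdorff feeds into the scan grows quadratically with the total point count, which is how a moderate-looking input overflows a 32-bit num_items:

```python
# Hypothetical back-of-the-envelope sketch: the scan over all pairwise
# distances must index (num_spaces * points_per_space)**2 items, which
# overflows a 32-bit signed int long before memory runs out.
INT32_MAX = 2**31 - 1

def total_pair_count(num_spaces: int, points_per_space: int) -> int:
    # Every point is compared against every other point across all
    # space pairs: (total points)^2 candidate distances.
    total_points = num_spaces * points_per_space
    return total_points ** 2

n = total_pair_count(6000, 100)
print(n)               # 360,000,000,000 pair distances
print(n > INT32_MAX)   # True: an int num_items cannot index this
```

The 6000-spaces/100-points shape matches the failing test case discussed below; the function name is illustrative only.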

@cwharris cwharris added bug Something isn't working Needs Triage Need team to review and classify labels May 3, 2021
@cwharris cwharris self-assigned this May 3, 2021
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue May 11, 2021

Same fix as seen here, but applied via patch: NVIDIA/thrust#1424

Also fixes rapidsai/cuspatial#393

Alternatively, we could wait and update our thrust version, rather than patching the existing one.

Authors:
  - Christopher Harris (https://github.com/cwharris)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Paul Taylor (https://github.com/trxcllnt)

URL: #8199
@cwharris cwharris reopened this May 20, 2021
@cwharris

The fix had to be rolled back due to a performance regression in libcudf. We will have to tackle this problem by partitioning inputs, rather than accepting a single large input. Partitions will have to begin and end at specific offsets to ensure the inclusive_scan results are correct; these offsets are deterministic, according to the cartesian_product_group_index begin/end pairs. The math for this is a forward-only packing problem, which is simple to compute on the CPU, but not as straightforward on the GPU. Attempts to compute it on the CPU resulted in slow performance, but this may have been expected due to the high number of inputs I was testing with at the time. Further investigation is required.
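The forward-only packing described above can be sketched as a greedy pass over the group sizes (a hypothetical simplification; names and the capacity parameter are illustrative, not cuspatial API):

```python
# Hypothetical sketch of forward-only packing: pack whole groups into
# partitions of at most `capacity` elements, so that every partition
# boundary coincides with a group boundary and a per-partition
# inclusive_scan remains correct.
def pack_partitions(group_sizes, capacity):
    partitions = []
    current, used = [], 0
    for size in group_sizes:
        # Never split a group: flush the current partition if this
        # group would overflow it.
        if used + size > capacity and current:
            partitions.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        partitions.append(current)
    return partitions

print(pack_partitions([4, 4, 4, 4], 10))  # [[4, 4], [4, 4]]
```

Because the pass is strictly forward and each decision depends on the running total, it is trivial on the CPU but awkward to parallelize on the GPU, which matches the trade-off noted above.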


cwharris commented Jun 22, 2021

I reapplied the Thrust scan_by_key patch locally to allow 64-bit indices, ran all @rapidsai/cudf benchmarks, and found little to no performance difference. From there, I was able to get Hausdorff working with 5000 spaces and 100 points. However, larger inputs resulted in an OOM from Thrust's temporary memory allocations:

```
(rapids) rapids@compose:~$ test-cuspatial-cpp HAUSDORFF_TEST

+ cd /home/charris/dev/rapids/cuspatial/cpp/build/release
[2/2] Linking CXX executable gtests/HAUSDORFF_TEST
+ ctest --force-new-ctest-process --output-on-failure -R HAUSDORFF_TEST
Test project /home/charris/dev/rapids/cuspatial/cpp/build/release
    Start 5: HAUSDORFF_TEST
1/1 Test #5: HAUSDORFF_TEST ...................***Exception: SegFault  1.27 sec
Running main() from ../googletest/src/gtest_main.cc
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from HausdorffTest/0, where TypeParam = float
[ RUN      ] HausdorffTest/0.6000Spaces100Points
Thread,Time,Action,Pointer,Size,Stream
57027,01:13:22.531965,try allocate,_,4800000,0x0
57027,01:13:23.206586,did allocate,0x7f71fc000000,4800000,0x0
57027,01:13:23.208725,try allocate,_,4800000,0x0
57027,01:13:23.208874,did allocate,0x7f71fc600000,4800000,0x0
57027,01:13:23.209343,try allocate,_,24000,0x0
57027,01:13:23.209485,did allocate,0x7f71efe00000,24000,0x0
57027,01:13:23.303551,try allocate,_,288000000,0x0
57027,01:13:23.304031,did allocate,0x7f71c2000000,288000000,0x0
57027,01:13:23.348967,try allocate,_,288000000,0x0
57027,01:13:23.349418,did allocate,0x7f71b0000000,288000000,0x0
57027,01:13:23.349435,try allocate,_,1152000000,0x0
57027,01:13:23.350452,did allocate,0x7f716a000000,1152000000,0x0
57027,01:13:23.350481,try allocate,_,56953128704,0x0
57027,01:13:23.354702,free,0x7f716a000000,1152000000,0x0
57027,01:13:23.355787,free,0x7f71b0000000,288000000,0x0
57027,01:13:23.358178,free,0x7f71c2000000,288000000,0x0
57027,01:13:23.360274,free,0x7f71efe00000,24000,0x0
57027,01:13:23.361089,free,0x7f71fc600000,4800000,0x0
57027,01:13:23.362148,free,0x7f71fc000000,4800000,0x0
unknown file: Failure
C++ exception with description "std::bad_alloc: CUDA error at: /home/charris/dev/rapids/rmm/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory" thrown in the test body.
[  FAILED  ] HausdorffTest/0.6000Spaces100Points, where TypeParam = float (874 ms)
[----------] 1 test from HausdorffTest/0 (874 ms total)

[----------] 1 test from HausdorffTest/1, where TypeParam = double
[ RUN      ] HausdorffTest/1.6000Spaces100Points
Thread,Time,Action,Pointer,Size,Stream
57027,01:13:23.363653,try allocate,_,4800000,0x0
57027,01:13:23.363661,did allocate,0x7f724bd62c90,4800000,0x0
```

I hypothesize that this is a legitimate OOM caused by the design of temporary memory usage in the single-pass scan algorithm within thrust/cub. However, I believe it is unreasonable to attempt to reduce the temporary memory requirements of the thrust/cub single-pass scan implementation at this time (specifically, as part of this bug). Therefore, I will look for alternatives that do not require patching Thrust.

edit:

For anyone coming across this comment wondering why inclusive_scan is being used: the Hausdorff implementation uses an associative (non-commutative) reduction operator, but thrust/cub does not provide an associative reduction algorithm. Instead, inclusive_scan is used in conjunction with a custom "scatter output iterator" to compress the output at write time, keeping only the last element of each scan. Essentially, the thrust/cub scan algorithm was never designed to be used with this many inputs, and therefore makes a reasonable assumption about the maximum size of temporary storage. Patching Thrust to use a 64-bit integer increased the maximum number of inputs, and therefore increased the maximum possible temporary storage size, resulting in an OOM for a large number of inputs. This could be alleviated by using a circular buffer in the single-pass scan algorithm, rather than retaining all intermediate results, but that is far out of scope for this bug.
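The scan-then-keep-last trick can be sketched on the CPU (a hypothetical illustration of the idea, not the Thrust code path; the scatter output iterator does the "keep only the last element" step at write time on the GPU):

```python
# Hypothetical sketch: reduce each key-run of `values` with an
# associative but NON-commutative operator by running an inclusive
# scan over the run and keeping only the run's last element --
# emulating a segmented reduction that thrust/cub does not provide
# directly for non-commutative operators.
from itertools import accumulate, groupby

def segmented_fold(keys, values, op):
    out = []
    for _, group in groupby(zip(keys, values), key=lambda kv: kv[0]):
        run = [v for _, v in group]
        # Last element of the inclusive scan == the fold of the run.
        out.append(list(accumulate(run, op))[-1])
    return out

# String concatenation is associative but not commutative, so order
# of application matters and a plain commutative reduce would be wrong.
print(segmented_fold([0, 0, 1, 1, 1],
                     ["a", "b", "c", "d", "e"],
                     lambda x, y: x + y))  # ['ab', 'cde']
```

On the GPU the scan is a single pass over all segments keyed by space pair, which is why its temporary storage scales with the total input size.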

@rapids-bot rapids-bot bot closed this as completed in #424 Jul 2, 2021
rapids-bot bot pushed a commit that referenced this issue Jul 2, 2021
Fixes #393

We switched Hausdorff to the exclusive-scan approach because certain benchmarks indicated better performance. Apparently those benchmarks were inadequate, or just plain badly written (by me), and performance was in fact worse. This became apparent while fixing the OOM error reported in #393.

I copied the 0.14 implementation into the 21.08 branch to re-benchmark. Here are the results:

cuspatial@0.14:
```
------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/64/manual_time         1.62 ms         1.78 ms          428 items_per_second=23.9898G/s
HausdorffBenchmark/hausdorff/512/64/manual_time         43.9 ms         44.1 ms           16 items_per_second=23.6053G/s
HausdorffBenchmark/hausdorff/4096/64/manual_time        2810 ms         2810 ms            1 items_per_second=23.6845G/s
HausdorffBenchmark/hausdorff/6000/64/manual_time        6148 ms         6148 ms            1 items_per_second=23.2318G/s
HausdorffBenchmark/hausdorff/100/100/manual_time        3.31 ms         3.47 ms          210 items_per_second=29.0333G/s
HausdorffBenchmark/hausdorff/512/100/manual_time        88.9 ms         89.1 ms            8 items_per_second=28.7737G/s
HausdorffBenchmark/hausdorff/4096/100/manual_time       5842 ms         5842 ms            1 items_per_second=28.132G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time      12698 ms        12698 ms            1 items_per_second=27.7783G/s
```
cuspatial@21.08 (with fix for OOM, as seen in previous commits of this PR)
```
------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/64/manual_time         17.4 ms         17.6 ms           38 items_per_second=2.2391G/s
HausdorffBenchmark/hausdorff/512/64/manual_time          489 ms          490 ms            2 items_per_second=2.11979G/s
HausdorffBenchmark/hausdorff/4096/64/manual_time       37120 ms        37119 ms            1 items_per_second=1.79299G/s
HausdorffBenchmark/hausdorff/6000/64/manual_time       82732 ms        82729 ms            1 items_per_second=1.7265G/s
HausdorffBenchmark/hausdorff/100/100/manual_time        43.4 ms         43.7 ms           16 items_per_second=2.21402G/s
HausdorffBenchmark/hausdorff/512/100/manual_time        1341 ms         1341 ms            1 items_per_second=1.90885G/s
HausdorffBenchmark/hausdorff/4096/100/manual_time      94898 ms        94894 ms            1 items_per_second=1.7319G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time     199120 ms       199115 ms            1 items_per_second=1.77138G/s
```

The performance is bad, and this regression is my fault. Fortunately, I was able to quickly reverse the regression and improve performance while removing a bunch of code (and learning a lot in the process). This PR re-implements Hausdorff as a straightforward custom kernel that requires zero intermediate memory.
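The reason a custom kernel needs no intermediate memory can be seen in a plain CPU sketch of the directed Hausdorff distance (hypothetical pseudocode for the idea, not the actual CUDA kernel): the max-of-mins can be held in registers, so nothing is materialized between passes.

```python
# Hypothetical sketch: directed Hausdorff distance from space A to
# space B. Each thread in the real kernel can keep `best` and
# `nearest` in registers, so no intermediate buffer is ever written.
import math

def directed_hausdorff(space_a, space_b):
    best = 0.0
    for ax, ay in space_a:
        # Minimum distance from this point of A to any point of B...
        nearest = min(math.hypot(ax - bx, ay - by) for bx, by in space_b)
        # ...and the maximum of those minimums over all points of A.
        best = max(best, nearest)
    return best

a = [(0.0, 0.0), (0.0, 1.0)]
b = [(1.0, 0.0)]
print(directed_hausdorff(a, b))  # sqrt(2) ~= 1.4142
```

Contrast this with the scan approach, which materializes (or at least indexes) every pairwise distance, driving both the intmax overflow and the temporary-storage OOM above.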

This PR:
```
------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
HausdorffBenchmark/hausdorff/100/64/manual_time         1.31 ms         1.47 ms          526 items_per_second=29.6763G/s
HausdorffBenchmark/hausdorff/512/64/manual_time         23.2 ms         23.3 ms           30 items_per_second=44.7567G/s
HausdorffBenchmark/hausdorff/4096/64/manual_time        1589 ms         1590 ms            1 items_per_second=41.8747G/s
HausdorffBenchmark/hausdorff/6000/64/manual_time        3170 ms         3170 ms            1 items_per_second=45.0638G/s
HausdorffBenchmark/hausdorff/100/100/manual_time        2.92 ms         3.08 ms          239 items_per_second=32.8852G/s
HausdorffBenchmark/hausdorff/512/100/manual_time        55.8 ms         55.8 ms           12 items_per_second=45.8415G/s
HausdorffBenchmark/hausdorff/4096/100/manual_time       3547 ms         3547 ms            1 items_per_second=46.3317G/s
HausdorffBenchmark/hausdorff/6000/100/manual_time       7658 ms         7658 ms            1 items_per_second=46.0564G/s
```

Authors:
  - Christopher Harris (https://github.com/cwharris)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Paul Taylor (https://github.com/trxcllnt)
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #424