
Optimize string gather performance for large strings #7980

Merged

Conversation

Contributor

@gaohao95 gaohao95 commented Apr 16, 2021

This PR intends to improve string gather performance for large strings. Two kernels are implemented:

  • A string-parallel kernel assigns strings to warps, and each warp collectively copies the characters using a large data type. This kernel is best suited for large strings.
  • A char-parallel kernel assigns characters to threads. This is similar to the existing implementation, except that this PR uses shared memory and assigns a fixed number of strings per threadblock to improve binary search performance. This kernel is best suited for small strings.

This PR selects between the two kernels based on the average string size (a sketch of the string-parallel strategy follows this paragraph).
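For illustration, here is a minimal sketch of the string-parallel strategy, with plain byte copies standing in for the PR's vectorized loads (the names, signature, and int32 offsets are assumptions for this sketch, not the PR's actual kernel):

__global__ void gather_chars_warp_per_string(char const* in_chars,
                                             int const* in_offsets,
                                             int const* gather_map,
                                             char* out_chars,
                                             int const* out_offsets,
                                             int num_output_strings)
{
  // One warp per output string: adjacent lanes copy adjacent bytes, so the
  // warp's accesses stay coalesced even when one string spans many words.
  int const warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
  int const lane    = threadIdx.x % 32;
  if (warp_id >= num_output_strings) return;

  int const src_idx     = gather_map[warp_id];  // source row for this output row
  char const* src_begin = in_chars + in_offsets[src_idx];
  int const length      = in_offsets[src_idx + 1] - in_offsets[src_idx];
  char* dst             = out_chars + out_offsets[warp_id];

  // Warp-strided copy: lane i handles bytes i, i + 32, i + 64, ...
  for (int i = lane; i < length; i += 32) { dst[i] = src_begin[i]; }
}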

The following benchmark results were collected on a V100 using ./gbenchmarks/STRINGS_BENCH --benchmark_filter=StringCopy/gather

Before this PR at 8a504d19c725e0ff01e28f36e5f1daf02fbf86c4:

----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------
StringCopy/gather/4096/32/manual_time          0.127 ms        0.151 ms         5269 bytes_per_second=545.365M/s
StringCopy/gather/4096/128/manual_time         0.130 ms        0.154 ms         5135 bytes_per_second=2.0329G/s
StringCopy/gather/4096/512/manual_time         0.156 ms        0.179 ms         4331 bytes_per_second=6.85358G/s
StringCopy/gather/4096/2048/manual_time        0.255 ms        0.277 ms         2731 bytes_per_second=16.711G/s
StringCopy/gather/4096/8192/manual_time        0.650 ms        0.673 ms         1076 bytes_per_second=26.0888G/s
StringCopy/gather/32768/32/manual_time         0.148 ms        0.171 ms         4602 bytes_per_second=3.64833G/s
StringCopy/gather/32768/128/manual_time        0.206 ms        0.228 ms         3345 bytes_per_second=10.3745G/s
StringCopy/gather/32768/512/manual_time        0.438 ms        0.462 ms         1599 bytes_per_second=19.421G/s
StringCopy/gather/32768/2048/manual_time        1.38 ms         1.40 ms          506 bytes_per_second=24.7168G/s
StringCopy/gather/32768/8192/manual_time        5.14 ms         5.16 ms          136 bytes_per_second=26.5093G/s
StringCopy/gather/262144/32/manual_time        0.336 ms        0.358 ms         2082 bytes_per_second=12.8318G/s
StringCopy/gather/262144/128/manual_time       0.878 ms        0.901 ms          795 bytes_per_second=19.4286G/s
StringCopy/gather/262144/512/manual_time        3.05 ms         3.07 ms          229 bytes_per_second=22.3358G/s
StringCopy/gather/262144/2048/manual_time       11.8 ms         11.8 ms           59 bytes_per_second=23.2139G/s
StringCopy/gather/2097152/32/manual_time        2.05 ms         2.07 ms          341 bytes_per_second=16.8261G/s
StringCopy/gather/2097152/128/manual_time       6.96 ms         6.99 ms          100 bytes_per_second=19.6048G/s
StringCopy/gather/2097152/512/manual_time       26.7 ms         26.7 ms           26 bytes_per_second=20.434G/s
StringCopy/gather/16777216/32/manual_time       19.0 ms         19.0 ms           37 bytes_per_second=14.5447G/s
StringCopy/gather/67108864/2/manual_time        34.1 ms         34.2 ms           20 bytes_per_second=2.01153G/s

This PR:

----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------
StringCopy/gather/4096/32/manual_time          0.105 ms        0.127 ms         6430 bytes_per_second=660.581M/s
StringCopy/gather/4096/128/manual_time         0.103 ms        0.125 ms         6383 bytes_per_second=2.57612G/s
StringCopy/gather/4096/512/manual_time         0.105 ms        0.126 ms         6249 bytes_per_second=10.2033G/s
StringCopy/gather/4096/2048/manual_time        0.114 ms        0.134 ms         6114 bytes_per_second=37.5549G/s
StringCopy/gather/4096/8192/manual_time        0.155 ms        0.178 ms         4547 bytes_per_second=109.744G/s
StringCopy/gather/32768/32/manual_time         0.109 ms        0.130 ms         6210 bytes_per_second=4.9546G/s
StringCopy/gather/32768/128/manual_time        0.124 ms        0.145 ms         5441 bytes_per_second=17.1911G/s
StringCopy/gather/32768/512/manual_time        0.137 ms        0.159 ms         5057 bytes_per_second=62.082G/s
StringCopy/gather/32768/2048/manual_time       0.209 ms        0.232 ms         3362 bytes_per_second=163.045G/s
StringCopy/gather/32768/8192/manual_time       0.526 ms        0.549 ms         1332 bytes_per_second=259.064G/s
StringCopy/gather/262144/32/manual_time        0.184 ms        0.205 ms         3777 bytes_per_second=23.4435G/s
StringCopy/gather/262144/128/manual_time       0.328 ms        0.349 ms         2132 bytes_per_second=51.986G/s
StringCopy/gather/262144/512/manual_time       0.400 ms        0.421 ms         1751 bytes_per_second=170.506G/s
StringCopy/gather/262144/2048/manual_time      0.965 ms        0.987 ms          725 bytes_per_second=282.969G/s
StringCopy/gather/2097152/32/manual_time        1.10 ms         1.12 ms          637 bytes_per_second=31.35G/s
StringCopy/gather/2097152/128/manual_time       1.92 ms         1.94 ms          364 bytes_per_second=71.1531G/s
StringCopy/gather/2097152/512/manual_time       2.48 ms         2.50 ms          282 bytes_per_second=220.297G/s
StringCopy/gather/16777216/32/manual_time       11.0 ms         11.0 ms           64 bytes_per_second=25.0771G/s
StringCopy/gather/67108864/2/manual_time        33.7 ms         33.7 ms           21 bytes_per_second=2.03768G/s

When there are enough strings and the strings are large (e.g. StringCopy/gather/262144/2048), this PR improves throughput from 23.21 GB/s to 282.97 GB/s, a 12x improvement.

For large strings, an ncu profile of gathering 524288 strings with an average string size of 2048 shows the kernel taking 3.48 ms, for an achieved throughput of 308.55 GB/s (which is 68.7% of DRAM SOL on V100).
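For reference, the arithmetic behind those figures, assuming the SOL percentage counts both the bytes read and the bytes written against V100's ~900 GB/s peak HBM2 bandwidth:

  524288 strings x 2048 B/string              ≈ 1.074 GB of character data
  1.074 GB / 3.48 ms                          ≈ 308.55 GB/s written
  (308.55 read + 308.55 written) / ~900 GB/s  ≈ 68.7% of peak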

@gaohao95 gaohao95 requested a review from a team as a code owner April 16, 2021 16:28
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Apr 16, 2021
@gaohao95
Contributor Author

gaohao95 commented Apr 16, 2021

Note that the string-parallel version with a large data type could load from memory outside of the input string, up to the next 4B-aligned location. I think this is safe since cudaMalloc/RMM allocations are 256B aligned, so the extra bytes outside the input string must belong to the same allocation.

Edit: This assumption has been dropped; see #7980 (comment).

@codecov

codecov bot commented Apr 16, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@e555643).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.08    #7980   +/-   ##
===============================================
  Coverage                ?   82.83%           
===============================================
  Files                   ?      109           
  Lines                   ?    17901           
  Branches                ?        0           
===============================================
  Hits                    ?    14828           
  Misses                  ?     3073           
  Partials                ?        0           


@davidwendt
Contributor

> Note that the string-parallel version with a large data type could load from memory outside of the input string, up to the next 4B-aligned location. I think this is safe since cudaMalloc/RMM allocations are 256B aligned, so the extra bytes outside the input string must belong to the same allocation.

This assumption has come up before. There is no guarantee that the input data is allocated from RMM since it is just a view to some device memory wrapped in a column_view class. It very well could point to the last valid byte of some device buffer.

@gaohao95
Contributor Author

> Note that the string-parallel version with a large data type could load from memory outside of the input string, up to the next 4B-aligned location. I think this is safe since cudaMalloc/RMM allocations are 256B aligned, so the extra bytes outside the input string must belong to the same allocation.
>
> This assumption has come up before. There is no guarantee that the input data is allocated from RMM since it is just a view to some device memory wrapped in a column_view class. It very well could point to the last valid byte of some device buffer.

Sorry, I think my original comment was poorly worded. I meant to say that a device memory allocation should always be 256B aligned, even if it's not allocated through RMM. The programming guide says:

> Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes.

Does this mean we are safe to read up to the nearest 256B-aligned location, even if it's beyond the end or before the start of the input string?

@kkraus14 kkraus14 added 3 - Ready for Review Ready for review by team non-breaking Non-breaking change Performance Performance related issue improvement Improvement / enhancement to an existing function labels Apr 19, 2021
@davidwendt davidwendt self-requested a review April 19, 2021 20:55
Collaborator

@nsakharnykh nsakharnykh left a comment


Overall looks good. I left some minor comments and a suggestion to try for the possible out-of-bounds issue, based on our discussion today.

3 review threads on cpp/include/cudf/strings/detail/gather.cuh (outdated, resolved)
@gaohao95
Contributor Author

> Note that the string-parallel version with a large data type could load from memory outside of the input string, up to the next 4B-aligned location. I think this is safe since cudaMalloc/RMM allocations are 256B aligned, so the extra bytes outside the input string must belong to the same allocation.

To avoid reading beyond string boundaries, the start and end locations for the large-data-type loads are calculated using out_start + 4 and out_end - 4 in 93dd1a3. This way, the next 4B-aligned location is guaranteed to be within the string limits (see the illustrative sketch below). The performance decreases only slightly.
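For illustration, the boundary handling might look like this device helper (a hedged sketch with hypothetical names, using byte stores where the PR's kernel also vectorizes the stores):

#include <cstdint>
#include <cstring>

__device__ void copy_within_bounds(char* dst, char const* src, int length)
{
  char const* const end = src + length;
  // First 4B-aligned source address at or after src + 4: starting the aligned
  // loads here guarantees they never read before the string's first byte.
  char const* vec_begin = end;
  if (length >= 8) {
    auto const addr = reinterpret_cast<std::uintptr_t>(src + 4);
    vec_begin       = reinterpret_cast<char const*>((addr + 3) & ~std::uintptr_t{3});
  }
  // Stop the aligned loads at or before end - 4 so the last word never reads
  // past the string's final byte.
  char const* vec_end = vec_begin;
  if (end - 4 > vec_begin) { vec_end = vec_begin + (end - 4 - vec_begin) / 4 * 4; }

  char const* p = src;
  for (; p < vec_begin; ++p) { *dst++ = *p; }  // head: single-byte copies
  for (; p < vec_end; p += 4, dst += 4) {      // middle: aligned 4B loads
    std::uint32_t const word = *reinterpret_cast<std::uint32_t const*>(p);
    memcpy(dst, &word, 4);                     // byte-wise store of the word
  }
  for (; p < end; ++p) { *dst++ = *p; }        // tail: single-byte copies
}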

Performance before 93dd1a3:

-----------------------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
StringCopy/gather/67108864/2/manual_time         33.7 ms         33.7 ms           21 bytes_per_second=2.03787G/s
StringCopy/gather/16777216/32/manual_time        11.1 ms         11.1 ms           63 bytes_per_second=24.8907G/s
StringCopy/gather/8388608/128/manual_time        8.21 ms         8.23 ms           85 bytes_per_second=66.517G/s
StringCopy/gather/4194304/512/manual_time        5.04 ms         5.06 ms          138 bytes_per_second=216.588G/s
StringCopy/gather/2097152/1024/manual_time       4.09 ms         4.11 ms          171 bytes_per_second=267.24G/s
StringCopy/gather/524288/2048/manual_time        1.86 ms         1.88 ms          377 bytes_per_second=294.088G/s

Performance after 93dd1a3:

-----------------------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
StringCopy/gather/67108864/2/manual_time         33.7 ms         33.7 ms           21 bytes_per_second=2.03835G/s
StringCopy/gather/16777216/32/manual_time        11.1 ms         11.1 ms           63 bytes_per_second=24.8951G/s
StringCopy/gather/8388608/128/manual_time        8.47 ms         8.49 ms           83 bytes_per_second=64.4481G/s
StringCopy/gather/4194304/512/manual_time        5.16 ms         5.18 ms          135 bytes_per_second=211.516G/s
StringCopy/gather/2097152/1024/manual_time       4.11 ms         4.13 ms          170 bytes_per_second=265.9G/s
StringCopy/gather/524288/2048/manual_time        1.87 ms         1.89 ms          375 bytes_per_second=292.396G/s

Collaborator

@nsakharnykh nsakharnykh left a comment


All my comments seem to be addressed, thanks @gaohao95! Approving.

/**
* @brief Gather characters from the input iterator, with string parallel strategy.
*
* This strategy assigns strings to warps so that each warp can cooperatively copy from the input
Contributor


This can be future work, but we should explore using a sub-warp per string via cooperative groups. Using a whole warp may be overkill compared to using 4, 8, or 16 threads.
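For illustration, a hypothetical sketch of that idea using cooperative groups, with an 8-thread tile per string (not code from this PR; the caller would pick dst/src per tile from the gather map):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Four 8-thread tiles share each warp; each tile copies one string.
__device__ void copy_string_tile(char* dst, char const* src, int length)
{
  auto const tile = cg::tiled_partition<8>(cg::this_thread_block());
  for (int i = tile.thread_rank(); i < length; i += tile.size()) {
    dst[i] = src[i];
  }
}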

Contributor

@jrhemstad jrhemstad left a comment


Very nice.

Contributor

@ttnghia ttnghia left a comment


Rerun tests.

@galipremsagar
Contributor

rerun tests

@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels May 26, 2021
@kkraus14 kkraus14 changed the base branch from branch-21.06 to branch-21.08 May 26, 2021 20:26
@kkraus14
Collaborator

Pushed to 21.08 to be safe. This is a large rework that we don't want to squeeze in right before release.

@VibhuJawa
Member

rerun tests

@VibhuJawa
Member

@gaohao95, is this good to merge?

@gaohao95
Contributor Author

gaohao95 commented Jun 2, 2021

> @gaohao95, is this good to merge?

I think it's good to go :) Thanks!

@VibhuJawa
Member

VibhuJawa commented Jun 2, 2021

@gpucibot merge

@galipremsagar
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 53b0170 into rapidsai:branch-21.08 Jun 2, 2021
Labels
  • 5 - Ready to Merge: Testing and reviews complete, ready to merge
  • improvement: Improvement / enhancement to an existing function
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change
  • Performance: Performance related issue