
Optimize string gather performance for large strings #7980

Merged

Conversation

Contributor

@gaohao95 gaohao95 commented Apr 16, 2021

This PR intends to improve string gather performance for large strings. Two kernels are implemented:

  • A string-parallel kernel assigns strings to warps, and each warp collectively copies the characters using a large data type. This kernel is best suited for large strings.
  • A char-parallel kernel assigns characters to threads. This is similar to the existing implementation, except that this PR uses shared memory and assigns a fixed number of strings per threadblock to improve binary search performance. This kernel is best suited for small strings.

This PR selects between the two kernels based on the average string size (a sketch of the string-parallel strategy follows this paragraph).
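For illustration, here is a minimal sketch of the string-parallel strategy, with plain byte copies standing in for the PR's vectorized loads (the names, signature, and int32 offsets are assumptions for this sketch, not the PR's actual kernel):

__global__ void gather_chars_warp_per_string(char const* in_chars,
                                             int const* in_offsets,
                                             int const* gather_map,
                                             char* out_chars,
                                             int const* out_offsets,
                                             int num_output_strings)
{
  // One warp per output string: adjacent lanes copy adjacent bytes, so the
  // warp's accesses stay coalesced even when one string spans many words.
  int const warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
  int const lane    = threadIdx.x % 32;
  if (warp_id >= num_output_strings) return;

  int const src_idx     = gather_map[warp_id];  // source row for this output row
  char const* src_begin = in_chars + in_offsets[src_idx];
  int const length      = in_offsets[src_idx + 1] - in_offsets[src_idx];
  char* dst             = out_chars + out_offsets[warp_id];

  // Warp-strided copy: lane i handles bytes i, i + 32, i + 64, ...
  for (int i = lane; i < length; i += 32) { dst[i] = src_begin[i]; }
}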

The following benchmark results were collected on a V100 using ./gbenchmarks/STRINGS_BENCH --benchmark_filter=StringCopy/gather

Before this PR at 8a504d19c725e0ff01e28f36e5f1daf02fbf86c4:

----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------
StringCopy/gather/4096/32/manual_time          0.127 ms        0.151 ms         5269 bytes_per_second=545.365M/s
StringCopy/gather/4096/128/manual_time         0.130 ms        0.154 ms         5135 bytes_per_second=2.0329G/s
StringCopy/gather/4096/512/manual_time         0.156 ms        0.179 ms         4331 bytes_per_second=6.85358G/s
StringCopy/gather/4096/2048/manual_time        0.255 ms        0.277 ms         2731 bytes_per_second=16.711G/s
StringCopy/gather/4096/8192/manual_time        0.650 ms        0.673 ms         1076 bytes_per_second=26.0888G/s
StringCopy/gather/32768/32/manual_time         0.148 ms        0.171 ms         4602 bytes_per_second=3.64833G/s
StringCopy/gather/32768/128/manual_time        0.206 ms        0.228 ms         3345 bytes_per_second=10.3745G/s
StringCopy/gather/32768/512/manual_time        0.438 ms        0.462 ms         1599 bytes_per_second=19.421G/s
StringCopy/gather/32768/2048/manual_time        1.38 ms         1.40 ms          506 bytes_per_second=24.7168G/s
StringCopy/gather/32768/8192/manual_time        5.14 ms         5.16 ms          136 bytes_per_second=26.5093G/s
StringCopy/gather/262144/32/manual_time        0.336 ms        0.358 ms         2082 bytes_per_second=12.8318G/s
StringCopy/gather/262144/128/manual_time       0.878 ms        0.901 ms          795 bytes_per_second=19.4286G/s
StringCopy/gather/262144/512/manual_time        3.05 ms         3.07 ms          229 bytes_per_second=22.3358G/s
StringCopy/gather/262144/2048/manual_time       11.8 ms         11.8 ms           59 bytes_per_second=23.2139G/s
StringCopy/gather/2097152/32/manual_time        2.05 ms         2.07 ms          341 bytes_per_second=16.8261G/s
StringCopy/gather/2097152/128/manual_time       6.96 ms         6.99 ms          100 bytes_per_second=19.6048G/s
StringCopy/gather/2097152/512/manual_time       26.7 ms         26.7 ms           26 bytes_per_second=20.434G/s
StringCopy/gather/16777216/32/manual_time       19.0 ms         19.0 ms           37 bytes_per_second=14.5447G/s
StringCopy/gather/67108864/2/manual_time        34.1 ms         34.2 ms           20 bytes_per_second=2.01153G/s

This PR:

----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------
StringCopy/gather/4096/32/manual_time          0.105 ms        0.127 ms         6430 bytes_per_second=660.581M/s
StringCopy/gather/4096/128/manual_time         0.103 ms        0.125 ms         6383 bytes_per_second=2.57612G/s
StringCopy/gather/4096/512/manual_time         0.105 ms        0.126 ms         6249 bytes_per_second=10.2033G/s
StringCopy/gather/4096/2048/manual_time        0.114 ms        0.134 ms         6114 bytes_per_second=37.5549G/s
StringCopy/gather/4096/8192/manual_time        0.155 ms        0.178 ms         4547 bytes_per_second=109.744G/s
StringCopy/gather/32768/32/manual_time         0.109 ms        0.130 ms         6210 bytes_per_second=4.9546G/s
StringCopy/gather/32768/128/manual_time        0.124 ms        0.145 ms         5441 bytes_per_second=17.1911G/s
StringCopy/gather/32768/512/manual_time        0.137 ms        0.159 ms         5057 bytes_per_second=62.082G/s
StringCopy/gather/32768/2048/manual_time       0.209 ms        0.232 ms         3362 bytes_per_second=163.045G/s
StringCopy/gather/32768/8192/manual_time       0.526 ms        0.549 ms         1332 bytes_per_second=259.064G/s
StringCopy/gather/262144/32/manual_time        0.184 ms        0.205 ms         3777 bytes_per_second=23.4435G/s
StringCopy/gather/262144/128/manual_time       0.328 ms        0.349 ms         2132 bytes_per_second=51.986G/s
StringCopy/gather/262144/512/manual_time       0.400 ms        0.421 ms         1751 bytes_per_second=170.506G/s
StringCopy/gather/262144/2048/manual_time      0.965 ms        0.987 ms          725 bytes_per_second=282.969G/s
StringCopy/gather/2097152/32/manual_time        1.10 ms         1.12 ms          637 bytes_per_second=31.35G/s
StringCopy/gather/2097152/128/manual_time       1.92 ms         1.94 ms          364 bytes_per_second=71.1531G/s
StringCopy/gather/2097152/512/manual_time       2.48 ms         2.50 ms          282 bytes_per_second=220.297G/s
StringCopy/gather/16777216/32/manual_time       11.0 ms         11.0 ms           64 bytes_per_second=25.0771G/s
StringCopy/gather/67108864/2/manual_time        33.7 ms         33.7 ms           21 bytes_per_second=2.03768G/s

When there are enough strings and the strings are large (e.g. StringCopy/gather/262144/2048), this PR improves throughput from 23.21 GB/s to 282.97 GB/s, a 12x improvement.

For large strings, an ncu profile of gathering 524288 strings with an average string size of 2048 shows the kernel taking 3.48 ms, for an achieved throughput of 308.55 GB/s (which is 68.7% of DRAM SOL on V100).
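For reference, the arithmetic behind those figures, assuming the SOL percentage counts both the bytes read and the bytes written against V100's ~900 GB/s peak HBM2 bandwidth:

  524288 strings x 2048 B/string              ≈ 1.074 GB of character data
  1.074 GB / 3.48 ms                          ≈ 308.55 GB/s written
  (308.55 read + 308.55 written) / ~900 GB/s  ≈ 68.7% of peak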

@gaohao95 gaohao95 requested a review from a team as a code owner April 16, 2021 16:28
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Apr 16, 2021
@gaohao95
Contributor Author

gaohao95 commented Apr 16, 2021

Note that the string-parallel version with a large data type could load from memory outside of the input string, up to the next 4B-aligned location. I think this is safe since cudaMalloc/RMM allocations are 256B aligned, so the extra bytes outside the input string must belong to the same allocation.

Edit: This assumption has been dropped; see #7980 (comment).

@codecov

codecov bot commented Apr 16, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@e555643).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.08    #7980   +/-   ##
===============================================
  Coverage                ?   82.83%           
===============================================
  Files                   ?      109           
  Lines                   ?    17901           
  Branches                ?        0           
===============================================
  Hits                    ?    14828           
  Misses                  ?     3073           
  Partials                ?        0           


@davidwendt
Contributor

> Note that the string-parallel version with a large data type could load from memory outside of the input string, up to the next 4B-aligned location. I think this is safe since cudaMalloc/RMM allocations are 256B aligned, so the extra bytes outside the input string must belong to the same allocation.

This assumption has come up before. There is no guarantee that the input data is allocated from RMM since it is just a view to some device memory wrapped in a column_view class. It very well could point to the last valid byte of some device buffer.

@gaohao95
Contributor Author

> Note that the string-parallel version with a large data type could load from memory outside of the input string, up to the next 4B-aligned location. I think this is safe since cudaMalloc/RMM allocations are 256B aligned, so the extra bytes outside the input string must belong to the same allocation.
>
> This assumption has come up before. There is no guarantee that the input data is allocated from RMM since it is just a view to some device memory wrapped in a column_view class. It very well could point to the last valid byte of some device buffer.

Sorry, I think my original comment was poorly worded. I meant to say that a device memory allocation should always be 256B aligned, even if it's not allocated through RMM. The programming guide says:

> Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes.

Does this mean we are safe to read up to the nearest 256B-aligned location, even if it's beyond the end or before the start of the input string?

@kkraus14 kkraus14 added 3 - Ready for Review Ready for review by team non-breaking Non-breaking change Performance Performance related issue improvement Improvement / enhancement to an existing function labels Apr 19, 2021
@davidwendt davidwendt self-requested a review April 19, 2021 20:55
Collaborator

@nsakharnykh nsakharnykh left a comment


Overall looks good. I left some minor comments and a suggestion to try for the possible out-of-bounds issue, based on our discussion today.

3 review threads on cpp/include/cudf/strings/detail/gather.cuh (outdated, resolved)
@gaohao95
Contributor Author

> Note that the string-parallel version with a large data type could load from memory outside of the input string, up to the next 4B-aligned location. I think this is safe since cudaMalloc/RMM allocations are 256B aligned, so the extra bytes outside the input string must belong to the same allocation.

To avoid reading beyond string boundaries, the start and end locations for the large-data-type loads are calculated using out_start + 4 and out_end - 4 in 93dd1a3. This way, the next 4B-aligned location is guaranteed to be within the string limits (see the illustrative sketch below). The performance decreases only slightly.
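For illustration, the boundary handling might look like this device helper (a hedged sketch with hypothetical names, using byte stores where the PR's kernel also vectorizes the stores):

#include <cstdint>
#include <cstring>

__device__ void copy_within_bounds(char* dst, char const* src, int length)
{
  char const* const end = src + length;
  // First 4B-aligned source address at or after src + 4: starting the aligned
  // loads here guarantees they never read before the string's first byte.
  char const* vec_begin = end;
  if (length >= 8) {
    auto const addr = reinterpret_cast<std::uintptr_t>(src + 4);
    vec_begin       = reinterpret_cast<char const*>((addr + 3) & ~std::uintptr_t{3});
  }
  // Stop the aligned loads at or before end - 4 so the last word never reads
  // past the string's final byte.
  char const* vec_end = vec_begin;
  if (end - 4 > vec_begin) { vec_end = vec_begin + (end - 4 - vec_begin) / 4 * 4; }

  char const* p = src;
  for (; p < vec_begin; ++p) { *dst++ = *p; }  // head: single-byte copies
  for (; p < vec_end; p += 4, dst += 4) {      // middle: aligned 4B loads
    std::uint32_t const word = *reinterpret_cast<std::uint32_t const*>(p);
    memcpy(dst, &word, 4);                     // byte-wise store of the word
  }
  for (; p < end; ++p) { *dst++ = *p; }        // tail: single-byte copies
}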

Performance before 93dd1a3:

-----------------------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
StringCopy/gather/67108864/2/manual_time         33.7 ms         33.7 ms           21 bytes_per_second=2.03787G/s
StringCopy/gather/16777216/32/manual_time        11.1 ms         11.1 ms           63 bytes_per_second=24.8907G/s
StringCopy/gather/8388608/128/manual_time        8.21 ms         8.23 ms           85 bytes_per_second=66.517G/s
StringCopy/gather/4194304/512/manual_time        5.04 ms         5.06 ms          138 bytes_per_second=216.588G/s
StringCopy/gather/2097152/1024/manual_time       4.09 ms         4.11 ms          171 bytes_per_second=267.24G/s
StringCopy/gather/524288/2048/manual_time        1.86 ms         1.88 ms          377 bytes_per_second=294.088G/s

Performance after 93dd1a3:

-----------------------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
StringCopy/gather/67108864/2/manual_time         33.7 ms         33.7 ms           21 bytes_per_second=2.03835G/s
StringCopy/gather/16777216/32/manual_time        11.1 ms         11.1 ms           63 bytes_per_second=24.8951G/s
StringCopy/gather/8388608/128/manual_time        8.47 ms         8.49 ms           83 bytes_per_second=64.4481G/s
StringCopy/gather/4194304/512/manual_time        5.16 ms         5.18 ms          135 bytes_per_second=211.516G/s
StringCopy/gather/2097152/1024/manual_time       4.11 ms         4.13 ms          170 bytes_per_second=265.9G/s
StringCopy/gather/524288/2048/manual_time        1.87 ms         1.89 ms          375 bytes_per_second=292.396G/s

Collaborator

@nsakharnykh nsakharnykh left a comment


All my comments seem to be addressed, thanks @gaohao95! Approving.

/**
* @brief Gather characters from the input iterator, with string parallel strategy.
*
* This strategy assigns strings to warps so that each warp can cooperatively copy from the input
Contributor


This can be future work, but we should explore using a sub-warp per string via cooperative groups. Using a whole warp may be overkill compared to using 4, 8, or 16 threads.
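For illustration, a hypothetical sketch of that idea using cooperative groups, with an 8-thread tile per string (not code from this PR; the caller would pick dst/src per tile from the gather map):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Four 8-thread tiles share each warp; each tile copies one string.
__device__ void copy_string_tile(char* dst, char const* src, int length)
{
  auto const tile = cg::tiled_partition<8>(cg::this_thread_block());
  for (int i = tile.thread_rank(); i < length; i += tile.size()) {
    dst[i] = src[i];
  }
}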

Contributor

@jrhemstad jrhemstad left a comment


Very nice.

Contributor

@ttnghia ttnghia left a comment


Rerun tests.

@galipremsagar
Contributor

rerun tests

@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels May 26, 2021
@kkraus14 kkraus14 changed the base branch from branch-21.06 to branch-21.08 May 26, 2021 20:26
@kkraus14
Collaborator

Pushed to 21.08 to be safe. This is a large rework that we don't want to squeeze in right before release.

@VibhuJawa
Member

rerun tests

@VibhuJawa
Member

@gaohao95, is this good to merge?

@gaohao95
Contributor Author

gaohao95 commented Jun 2, 2021

> @gaohao95, is this good to merge?

I think it's good to go :) Thanks!

@VibhuJawa
Member

VibhuJawa commented Jun 2, 2021

@gpucibot merge

@galipremsagar
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 53b0170 into rapidsai:branch-21.08 Jun 2, 2021
Labels
  • 5 - Ready to Merge: Testing and reviews complete, ready to merge
  • improvement: Improvement / enhancement to an existing function
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change
  • Performance: Performance related issue