Improve performance of nvtext::tokenize_with_vocabulary for long strings #14336

davidwendt · 2023-10-26T20:57:44Z

Description

Improves nvtext::tokenize_with_vocabulary performance for long strings. Also adds additional tests and an nvbench benchmark.
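For readers unfamiliar with the function, here is a minimal pure-Python model of what a vocabulary tokenizer computes. This is a conceptual sketch, not the libcudf implementation. The name `tokenize_with_vocabulary_model`, the whitespace-only splitting, and the `default_id=-1` fallback are assumptions made for illustration.

```python
def tokenize_with_vocabulary_model(rows, vocabulary, default_id=-1):
    """Hypothetical reference model: map each whitespace-separated token
    in every row to its position in the vocabulary."""
    # The position of a token in the vocabulary list serves as its id.
    token_ids = {token: idx for idx, token in enumerate(vocabulary)}
    result = []
    for row in rows:
        # str.split() with no arguments splits on runs of whitespace
        # and never yields empty tokens.
        tokens = row.split()
        # Tokens missing from the vocabulary receive default_id.
        result.append([token_ids.get(tok, default_id) for tok in tokens])
    return result


rows = ["hello world", "hello there", "there there world", "watch out world"]
vocabulary = ["hello", "there", "world"]
ids = tokenize_with_vocabulary_model(rows, vocabulary)
```

In this model the work per row grows with the row's length, which is why long strings are the interesting case for a performance change.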

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

davidwendt · 2023-10-26T21:08:48Z

Performance numbers for long strings from the included benchmark

| width |  num_rows  |   Ref Time |   Cmp Time |          Diff |   %Diff | Speedup (Ref/Cmp) |
|-------|------------|------------|------------|---------------|---------|-------------------|
|  256  |   262144   |   3.267 ms |   1.694 ms |  -1572.363 us | -48.13% | 1.93 |
|  512  |   262144   |   8.746 ms |   3.193 ms |  -5553.180 us | -63.49% | 2.74 |
| 1024  |   262144   |  20.265 ms |   6.129 ms | -14135.536 us | -69.75% | 3.31 |
|  256  |   524288   |   6.344 ms |   3.326 ms |  -3018.149 us | -47.57% | 1.91 |
|  512  |   524288   |  18.740 ms |   6.387 ms | -12352.539 us | -65.92% | 2.93 |
| 1024  |   524288   |  42.879 ms |  12.394 ms | -30485.431 us | -71.10% | 3.46 |
|  256  |  1048576   |  11.448 ms |   6.693 ms |  -4755.292 us | -41.54% | 1.71 |
|  512  |  1048576   |  34.554 ms |  12.878 ms | -21676.282 us | -62.73% | 2.68 |
| 1024  |  1048576   |  80.785 ms |  24.909 ms | -55876.306 us | -69.17% | 3.24 |
|  256  |  2097152   |  23.507 ms |  13.482 ms | -10025.858 us | -42.65% | 1.74 |
|  512  |  2097152   |  69.627 ms |  26.090 ms | -43537.188 us | -62.53% | 2.67 |
|  256  |  4194304   |  45.730 ms |  27.060 ms | -18669.250 us | -40.83% | 1.69 |
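The derived columns in the table can be checked with a small sketch. It assumes that Ref Time is the baseline, that Cmp Time is the run with this change, and that the last column is the ratio Ref/Cmp. The helper name `compare` is hypothetical. Because the table shows rounded times, recomputed values match the printed ones only approximately.

```python
def compare(ref_ms, cmp_ms):
    """Return (diff in microseconds, percent diff, speedup ratio)
    for a reference time and a compared time, both in milliseconds."""
    diff_us = (cmp_ms - ref_ms) * 1000.0           # negative means faster
    pct_diff = (cmp_ms - ref_ms) / ref_ms * 100.0  # relative to the reference
    speedup = ref_ms / cmp_ms                      # above 1.0 means faster
    return diff_us, pct_diff, speedup


# First row of the table: width 256, num_rows 262144.
diff_us, pct_diff, speedup = compare(3.267, 1.694)
print(f"{diff_us:.0f} us, {pct_diff:.2f}%, {speedup:.2f}x")
```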

bdice

All looks good. I have one suggested rewrite for some of the counting math.

cpp/src/text/vocabulary_tokenize.cu

mythrocks

Thanks for the comment and explanation. +1.

davidwendt · 2023-11-03T19:58:25Z

/merge

Fixes a bug introduced in #14336 when trying to simplify the token-counting logic as per this discussion: #14336 (comment). The simplification caused an error which was found when running the nvtext benchmarks. The appropriate gtest has been updated to cover this case now.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Karthikeyan (https://github.com/karthikeyann)

URL: #14393

davidwendt added 2 commits October 26, 2023 16:52

Improve performance of nvtext::tokenize_vocabulary for long strings

1d4e48e

fix merge conflict

e19d0f9

davidwendt added the 2 - In Progress, libcudf, strings, improvement, and non-breaking labels Oct 26, 2023

davidwendt self-assigned this Oct 26, 2023

github-actions bot added the CMake label Oct 26, 2023

davidwendt added 2 commits October 27, 2023 08:34

Merge branch 'branch-23.12' into vocab-tokenize-perf

febfa58

fix divide-by-zero error

fe74eec

davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label Oct 27, 2023

davidwendt added 4 commits October 27, 2023 12:01

Merge branch 'branch-23.12' into vocab-tokenize-perf

dafe06d

Merge branch 'branch-23.12' into vocab-tokenize-perf

cd2d273

Merge branch 'branch-23.12' into vocab-tokenize-perf

1c1ab04

remove commented out code

a10bbb0

davidwendt marked this pull request as ready for review October 31, 2023 21:02

davidwendt requested a review from a team as a code owner October 31, 2023 21:02

davidwendt requested review from mythrocks and divyegala October 31, 2023 21:02

bdice approved these changes Oct 31, 2023

cpp/src/text/vocabulary_tokenize.cu (outdated, resolved)

davidwendt added 3 commits November 1, 2023 09:10

Merge branch 'branch-23.12' into vocab-tokenize-perf

a27ed1d

simplify count calculation in token counts kernel

fe76198

Merge branch 'branch-23.12' into vocab-tokenize-perf

5aafcc5

mythrocks approved these changes Nov 3, 2023

rapids-bot bot merged commit f97e74f into rapidsai:branch-23.12 Nov 3, 2023
61 checks passed

davidwendt deleted the vocab-tokenize-perf branch November 3, 2023 19:58

davidwendt mentioned this pull request Nov 10, 2023

Fix token-count logic in nvtext::tokenize_with_vocabulary #14393 (merged)
