Fix token-count logic in nvtext::tokenize_with_vocabulary #14393

davidwendt · 2023-11-10T00:29:49Z

Description

Fixes a bug introduced in #14336 by an attempt to simplify the token-counting logic, as suggested in this discussion: #14336 (comment)
The simplification caused an error that surfaced when running the nvtext benchmarks.
The relevant gtest has been updated to cover this case.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

davidwendt · 2023-11-14T17:49:14Z

/merge

davidwendt added 2 commits November 9, 2023 19:24

Fix token-count logic in nvtext::tokenize_with_vocabulary

9242359

Merge branch 'branch-23.12' into bug-vocab-tokenizer

22e09f0

davidwendt added the bug, 3 - Ready for Review, libcudf, strings, and non-breaking labels Nov 10, 2023

davidwendt self-assigned this Nov 10, 2023

davidwendt requested a review from a team as a code owner November 10, 2023 00:29

davidwendt requested review from vyasr and karthikeyann November 10, 2023 00:29

bdice approved these changes Nov 13, 2023

karthikeyann approved these changes Nov 14, 2023

rapids-bot bot merged commit b446a6f into rapidsai:branch-23.12 Nov 14, 2023
65 checks passed

davidwendt deleted the bug-vocab-tokenizer branch November 14, 2023 17:49
