Change our tokenizer a bit to be more accurate #616
Merged
Description of the Change
We have a lightweight tokenizer class that attempts to estimate how many tokens are in a string. It does this mostly by dividing the number of characters in the string by an assumed average number of characters per token. This isn't meant to be 100% accurate, but it is meant to be close enough for our use case (ensuring we stay within the token limits of our model).
Recently I ran across an article that was super long and I received a token length error when trying to process it. In debugging, I found we weren't being aggressive enough with the counting of tokens, and thus we weren't trimming enough of the content to stay within the model limits.
This PR lowers the number of characters per token from 4 to 3.5. This fixed the issue I ran into and seems to be more accurate in counting tokens.
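For illustration, here is a minimal sketch of the character-based estimate described above. This is not the plugin's actual tokenizer class; the function names, the trim helper, and the 4,000-token budget in the usage example are hypothetical, and only the 3.5 characters-per-token ratio comes from this PR.

```python
# Hypothetical sketch of a character-based token estimate -- not the plugin's
# real Tokenizer class; names and helpers here are for illustration only.

CHARS_PER_TOKEN = 3.5  # lowered from 4 in this PR for a more aggressive estimate


def estimate_tokens(text: str) -> int:
    """Approximate the token count by dividing the character length
    by an assumed average number of characters per token."""
    return int(len(text) / CHARS_PER_TOKEN)


def trim_to_token_limit(text: str, max_tokens: int) -> str:
    """Trim content so the estimated token count stays within the model limit."""
    max_chars = int(max_tokens * CHARS_PER_TOKEN)
    return text if len(text) <= max_chars else text[:max_chars]


if __name__ == "__main__":
    article = "word " * 4000                    # roughly 20,000 characters
    print(estimate_tokens(article))             # ~5714 tokens at 3.5 chars/token (vs ~5000 at 4)
    trimmed = trim_to_token_limit(article, 4000)
    print(estimate_tokens(trimmed))             # now within the 4,000-token budget
```

The effect of the change is that the same string is estimated at more tokens than before (dividing by 3.5 instead of 4), so long content gets trimmed more aggressively before being sent to the model.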
How to test the Change
Send content to OpenAI (either generate an excerpt or generate titles) and ensure things still work and no errors are shown
Changelog Entry
Credits
Props @dkotter
Checklist: