
When completions streaming mode is enabled, the response no longer includes the usage field. #291

Open · 3 tasks done · Hime-Hina opened this issue Mar 18, 2023 · 2 comments
Labels: enhancement (New feature or request), UI V2

Comments

@Hime-Hina

Clear and concise description of the problem

As the official cookbook How to stream completions notes:

Another small drawback of streaming responses is that the response no longer includes the usage field to tell you how many tokens were consumed. After receiving and combining all of the responses, you can calculate this yourself using tiktoken.

Personally, I think it would be useful to implement that feature. Users wouldn't have to check the daily usage breakdown on their account page, and it would make for a more responsive and user-friendly experience.

Suggested solution

I have actually implemented that feature on the front-end already, using the @dqbd/tiktoken library, which is a third-party TypeScript port of the official tiktoken library. OpenAI also provides an example of how to count tokens with the tiktoken library. For the specific implementation, please refer to my repo.
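A minimal sketch of the idea (not the exact code from my repo; the function name and the streamed-chunk shape here are just for illustration):

```ts
// Minimal sketch: count the tokens of a streamed completion on the front-end
// with @dqbd/tiktoken. gpt-3.5-turbo uses the cl100k_base encoding.
import { encoding_for_model } from "@dqbd/tiktoken";

export function countCompletionTokens(streamedDeltas: string[]): number {
  const enc = encoding_for_model("gpt-3.5-turbo");
  try {
    // Combine all streamed deltas into the full completion text,
    // then encode it and count the resulting tokens.
    const fullText = streamedDeltas.join("");
    return enc.encode(fullText).length;
  } finally {
    // The wasm-backed encoder must be freed explicitly to avoid leaking memory.
    enc.free();
  }
}
```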

Alternative

Maybe there is a way to implement it on the back-end by providing an API, but I have not succeeded so far because it seems impossible to load a wasm file when deploying on Vercel. I followed the tutorial in the Vercel docs and tried some plugins to load the wasm file, but failed. If anyone knows about this, please let me know! 😁

Additional context

[screen recording: GIF 2023-3-18 2-14-08]

I have not optimized my code, but it suffices for now. There are some bugs, as shown below:

[screenshot: 7_local_1__]

The first completion is primed with \n\n, and 20 tokens are used. After some testing, I observed that the reported number of tokens for the completion seems to equal the number of tokens in the completion content only, which indicates that special tokens and line breaks are not included in the count (please refer to the code for more details).

[screenshot: 7_local_1_]

The second completion has exactly the same content as the first one, but is not primed with \n\n. Since \n\n is encoded as token 271, it takes up exactly one token; therefore the result is 19, which is what we expected.
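This is easy to verify directly (a quick check, assuming the same @dqbd/tiktoken setup as above):

```ts
import { get_encoding } from "@dqbd/tiktoken";

const enc = get_encoding("cl100k_base");
console.log(enc.encode("\n\n")); // Uint32Array [ 271 ]  -> "\n\n" is a single token
enc.free();
```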

But here is the paradox:

[screenshot: 7_remote]

The daily usage page reports 19 for both. I have no idea why; this requires further testing.

If you know anything about this, please let me know! I would appreciate it.

In addition, my implementation is still quite rough and only supports the 'gpt-3.5' model; I have not tested it on other models. If you have any advice, please let me know as well.

@Hime-Hina added the enhancement (New feature or request) label on Mar 18, 2023
@yzh990918
Member

Thank you for the thoughtful suggestion.

@CNSeniorious000
Contributor

Thanks a lot @Hime-Hina! Inspired by your demo, I integrated the latest tiktoken library into my chatgpt-demo fork; you can view a live demo here. I reached some conclusions about token counting.

In summary, the pseudo-formula can be represented as:

$$
\begin{align*}
\text{prompt tokens} &= \sum_{\texttt{msg}} \left( \texttt{encode(msg).length} + 4 \right) + 3 \\
\text{completion tokens} &= \texttt{encode(completion).length}
\end{align*}
$$

Specifically, the 3 extra tokens for the context are for <|im_start|>, "assistant", \n, and the 4 extra tokens for each message are for <|im_start|>, role/name, \n, <|im_end|>.
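As a rough sketch (not my exact implementation; the message type and helper name here are made up for illustration), the formula translates to something like:

```ts
import { get_encoding } from "@dqbd/tiktoken";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Rough sketch of the pseudo-formula above (gpt-3.5-turbo / cl100k_base):
//   prompt tokens     = sum over messages of (encode(msg).length + 4) + 3
//   completion tokens = encode(completion).length
export function countChatTokens(messages: ChatMessage[], completion: string) {
  const enc = get_encoding("cl100k_base");
  try {
    const promptTokens =
      messages.reduce(
        // +4 per message: <|im_start|>, role/name, \n, <|im_end|>
        (sum, msg) => sum + enc.encode(msg.content).length + 4,
        0,
      ) + 3; // +3 to prime the reply: <|im_start|>, "assistant", \n
    const completionTokens = enc.encode(completion).length;
    return { promptTokens, completionTokens };
  } finally {
    enc.free();
  }
}
```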

I've compared the token count in the API response's header with the count I calculated myself in both Python and JavaScript, and found no discrepancy. (Interestingly, I also found that the official tokenizer demo is in fact a GPT-3 tokenizer, which encodes Chinese characters much less efficiently than gpt-3.5-turbo's tokenizer.)

OpenAI also has a note on ChatML, the markup language they created for conversations.


As you said, getting WASM to work on edge functions is incredibly tough; I spent almost half a day fighting bugs. At first I found that this approach works well on a self-hosted route, which is similar to your solution in the dev branch of your demo repo. But it doesn't work on Vercel or Netlify Edge Functions (yes, serverless functions work, but they can't stream responses). In the end, I used fetch to load the wasm and a dynamic import to solve this.
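Roughly, the idea looks like this (a simplified sketch; the module paths follow @dqbd/tiktoken's docs for edge runtimes, and the wasm URL is whatever your deployment exposes, so treat the details as assumptions):

```ts
// Simplified sketch: load tiktoken's wasm in an edge function via fetch
// plus a dynamic import, instead of bundling the wasm statically.
import model from "@dqbd/tiktoken/encoders/cl100k_base.json";

export async function createEncoder(wasmUrl: string) {
  // Dynamic import so the wasm-backed module is only pulled in at runtime.
  const { init, Tiktoken } = await import("@dqbd/tiktoken/lite/init");

  // Fetch the wasm binary ourselves and hand it to tiktoken's init hook.
  const wasmBinary = await fetch(wasmUrl).then((res) => res.arrayBuffer());
  await init((imports) => WebAssembly.instantiate(wasmBinary, imports));

  return new Tiktoken(model.bpe_ranks, model.special_tokens, model.pat_str);
}
```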

You can view my implementation through the following pages:
