
Gemma "▁viciss" token appearing randomly on summary #36

Open
brunodoamaral opened this issue May 10, 2024 · 9 comments

Comments

@brunodoamaral

Hi!

I noticed that Gemma-generated summaries have some issues when the model "hallucinates" the specific token "▁viciss" (id: 200507, as found in the tokenizer file). Here are a few examples (today's news):

LogiCola, a software for learning logic, has been redesigned and released as version 3.0 vicissolar definitions and propositional translations are now available in a quiz mode. Malik Piara aims to continuously improve and maintain the open-source software.

Scientists have found evidence that giant blobs of material left behind by a cosmic collision 4 vicissitation 4 Kün 4 vicissitation billion years ago may be responsible for modern plate tectonics. Their computer models suggest the blobs caused subduction and surface sinking, leading to the formation of early tectonic boundaries.

I didn't have time to look at this repo's code, but I'm a regular user of https://hackernews.betacat.io/ and I remember seeing the same issue yesterday.

@QINGCHARLES

I see it in pretty much every summary that contains a number now. It's been like that for about a week. It makes the summaries very hard to read. I wonder what code change caused this? It seems to have something to do with numerical parsing, AFAICT.

@polyrabbit
Owner

@brunodoamaral Thanks for reporting. @thiswillbeyourgithub also mentioned the same issue and suggested using logit_bias to avoid those words. I gave it a quick try but it didn't work as expected. Now I know the reason: the token ids I used were wrong, because I didn't know about the ▁ prefix trick (why?).

I just added those words to the bias list and it works perfectly now. I suppose there are more words like this; I'll keep an eye on it. Thanks for the knowledge!

```python
if model_family() == Model.GEMMA:
    # Gemma outputs weird words like Kün/viciss/▁purcha/▁xPos/▁Gorb
    kwargs['logit_bias'] = {200507: -100, 225856: -100, 6204: -100, 232014: -100, 172406: -100}
```
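For anyone hitting the same wall: the "prefix trick" comes from SentencePiece tokenizers (which Gemma uses) storing word-initial pieces with a leading "▁" (U+2581), so looking up the bare string gives a different token id than the one the model actually emits. A minimal sketch of building the bias map, using a tiny mock vocab in place of the real tokenizer file (the ids for the bare pieces are hypothetical):

```python
# Mock vocab standing in for Gemma's tokenizer.json; only 200507 and 225856
# are real ids from the thread above, the rest are illustrative.
mock_vocab = {
    "viciss": 11111,         # hypothetical id for the mid-word piece
    "\u2581viciss": 200507,  # "▁viciss" - the word-initial piece actually emitted
    "K\u00fcn": 225856,      # "Kün" has no ▁ form in this mock vocab
}

def build_logit_bias(words, vocab, bias=-100):
    """Ban both the bare and the ▁-prefixed form of each word, if present."""
    out = {}
    for word in words:
        for key in (word, "\u2581" + word):
            if key in vocab:
                out[vocab[key]] = bias
    return out

bias = build_logit_bias(["viciss", "K\u00fcn"], mock_vocab)
# Covers both forms of "viciss" plus the bare "Kün" piece.
```

Banning both forms is the safe default, since the same surface string can appear mid-word or word-initial depending on context.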

@QINGCHARLES Yes, there was a change recently. I used to use the GPT model from OpenAI, but it is expensive for such a free and long-running project. So I switched to the free Gemma model from openrouter, and that's where the issue comes from.

@polyrabbit
Owner

Oops, I see lots of weird "196" tokens interspersed in the output now. Need to seek another model...

@thiswillbeyourgithub

Have you tried playing with the frequency and repetition penalty?

https://platform.openai.com/docs/guides/text-generation/parameter-details
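For reference, both penalties are plain request parameters in the OpenAI-compatible chat API (openrouter accepts the same names). A hedged sketch of how they could be wired into the request kwargs; the model name and penalty values here are illustrative, not tuned recommendations:

```python
def completion_kwargs(model, prompt, frequency_penalty=0.5, presence_penalty=0.0):
    """Build chat-completion kwargs with repetition controls.

    frequency_penalty > 0 lowers a token's score proportionally to how often
    it has already appeared, discouraging verbatim repeats.
    presence_penalty > 0 penalizes any token that has appeared at least once,
    nudging the model toward new content. Both range from -2.0 to 2.0.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "frequency_penalty": frequency_penalty,
        "presence_penalty": presence_penalty,
    }

kwargs = completion_kwargs("google/gemma-7b-it", "Summarize: ...")
```

Note that logit_bias is a hard per-token ban, while the penalties only reshape scores, so they may not fully suppress a token the model strongly wants to emit.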

@polyrabbit
Owner

Haven't tried other values; both are set to 1 currently. I suppose we cannot get rid of those magic words completely in Gemma, so I need to find a better model.

You can find parameters here: https://github.com/polyrabbit/hacker-news-digest/blob/master/hacker_news/llm/openai.py#L69-L77

@thiswillbeyourgithub

Alright, I do think it's a good rule of thumb not to stop before banning around 10 tokens; right now you've banned 5. I already had to do this kind of thing a while ago, and after banning a few more tokens the model worked as expected (not Gemma, though).

Also, I don't know what you use to parse the webpage, but you might be interested in this: https://github.com/jina-ai/reader

It's a very simple parser that converts URLs into LLM-friendly text; it even turns images into captions! It's quite new, though, and they had issues with scaling at some point, so maybe use a timeout when querying it.

I'm bringing that up because good web parsing can greatly help LLMs summarize, especially smaller models.
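The reader service works by prefixing the target URL with its endpoint. A minimal sketch of the timeout-plus-fallback pattern suggested above; the function names are hypothetical, and the fallback simply fetches the raw page for the caller's own parser to handle:

```python
from urllib.request import urlopen

READER_PREFIX = "https://r.jina.ai/"

def reader_url(url):
    # jina reader convention: prepend the endpoint to the full target URL
    return READER_PREFIX + url

def fetch_for_llm(url, timeout=10):
    """Fetch an LLM-friendly rendering via the jina reader, falling back to
    the raw page if the reader is slow or down (it has had scaling issues)."""
    try:
        with urlopen(reader_url(url), timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
```

Keeping the fallback cheap matters here, since a summarizer pipeline that blocks on a flaky third-party service would stall the whole digest run.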

@polyrabbit
Owner

> not stop before banning like 10 tokens

I tweaked some code and switched to llama3 now; I'll use it for a while and see how it goes. Hopefully I won't need to spend time working around a model's tokenizer quirks again.

> Also, I don't know what you use to parse the webpage, but you might be interested in this: https://github.com/jina-ai/reader

It's a handwritten Python library that is small and easy to maintain. It's been in use for more than 10 years, since the very beginning of this project.

The jina parser looks very helpful. I'm considering using it as a fallback for dynamic web pages. Thanks!

@QINGCHARLES

@polyrabbit I just want to say thank you for this app. The amount of time it saves me each day is literally life-changing, since I don't have to click into articles on HN to see if they are worth exploring.

@polyrabbit
Owner

> I'm considering using it as a fallback for dynamic web pages.

Done; now we have summaries for Substack etc.
