
Gemma "▁viciss" token appearing randomly on summary #36

Open
brunodoamaral opened this issue May 10, 2024 · 9 comments

Comments

@brunodoamaral

Hi!

I noticed that Gemma-generated summaries have some issues when the model "hallucinates" the specific token "▁viciss" (id: 200507, as found in the tokenizer file). Here are a few examples (today's news):

LogiCola, a software for learning logic, has been redesigned and released as version 3.0 vicissolar definitions and propositional translations are now available in a quiz mode. Malik Piara aims to continuously improve and maintain the open-source software.

Scientists have found evidence that giant blobs of material left behind by a cosmic collision 4 vicissitation 4 Kün 4 vicissitation billion years ago may be responsible for modern plate tectonics. Their computer models suggest the blobs caused subduction and surface sinking, leading to the formation of early tectonic boundaries.

I didn't have time to look at this repo's code, but I'm a regular user of https://hackernews.betacat.io/ and I remember seeing the same issue yesterday.

@QINGCHARLES

I see it in pretty much every summary that contains a number now. It's been like that for about a week. It makes the summaries very hard to read. I wonder what code change caused this? It seems to have something to do with numerical parsing, AFAICT.

@polyrabbit
Owner

@brunodoamaral Thanks for reporting. @thiswillbeyourgithub also mentioned the same issue and suggested using logit_bias to avoid those words. I gave it a quick try but it didn't work as expected. Now I know the reason: the token ids I used were wrong, because I didn't know about the ▁ prefix trick (why?).

I just added those words to the bias list and it works perfectly now. I suppose there are more words like this; I'll keep an eye on it. Thanks for the knowledge!

```python
if model_family() == Model.GEMMA:
    # Gemma outputs weird words like Kün/viciss/▁purcha/▁xPos/▁Gorb
    kwargs['logit_bias'] = {200507: -100, 225856: -100, 6204: -100, 232014: -100, 172406: -100}
```
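For anyone hitting the same wall: the "prefix trick" comes from SentencePiece tokenizers (which Gemma uses) storing word-initial pieces with a leading "▁" (U+2581), so looking up the bare string gives a different token id than the one the model actually emits. A minimal sketch of building the bias map, using a tiny mock vocab in place of the real tokenizer file (the ids for the bare pieces are hypothetical):

```python
# Mock vocab standing in for Gemma's tokenizer.json; only 200507 and 225856
# are real ids from the thread above, the rest are illustrative.
mock_vocab = {
    "viciss": 11111,         # hypothetical id for the mid-word piece
    "\u2581viciss": 200507,  # "▁viciss" - the word-initial piece actually emitted
    "K\u00fcn": 225856,      # "Kün" has no ▁ form in this mock vocab
}

def build_logit_bias(words, vocab, bias=-100):
    """Ban both the bare and the ▁-prefixed form of each word, if present."""
    out = {}
    for word in words:
        for key in (word, "\u2581" + word):
            if key in vocab:
                out[vocab[key]] = bias
    return out

bias = build_logit_bias(["viciss", "K\u00fcn"], mock_vocab)
# Covers both forms of "viciss" plus the bare "Kün" piece.
```

Banning both forms is the safe default, since the same surface string can appear mid-word or word-initial depending on context.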

@QINGCHARLES Yes, there was a change recently. I used to use the GPT model from OpenAI, but it is expensive for such a free and long-running project. So I switched to the free Gemma model from openrouter, and that's where the issue comes from.

@polyrabbit
Owner

Oops, I see lots of weird "196" tokens interspersed in the output now. Need to seek another model...

@thiswillbeyourgithub

Have you tried playing with the frequency and repetition penalty?

https://platform.openai.com/docs/guides/text-generation/parameter-details
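For reference, both penalties are plain request parameters in the OpenAI-compatible chat API (openrouter accepts the same names). A hedged sketch of how they could be wired into the request kwargs; the model name and penalty values here are illustrative, not tuned recommendations:

```python
def completion_kwargs(model, prompt, frequency_penalty=0.5, presence_penalty=0.0):
    """Build chat-completion kwargs with repetition controls.

    frequency_penalty > 0 lowers a token's score proportionally to how often
    it has already appeared, discouraging verbatim repeats.
    presence_penalty > 0 penalizes any token that has appeared at least once,
    nudging the model toward new content. Both range from -2.0 to 2.0.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "frequency_penalty": frequency_penalty,
        "presence_penalty": presence_penalty,
    }

kwargs = completion_kwargs("google/gemma-7b-it", "Summarize: ...")
```

Note that logit_bias is a hard per-token ban, while the penalties only reshape scores, so they may not fully suppress a token the model strongly wants to emit.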

@polyrabbit
Owner

Haven't tried other values; both are set to 1 currently. I suppose we cannot get rid of those magic words completely in Gemma, so I need to find a better model.

You can find parameters here: https://github.com/polyrabbit/hacker-news-digest/blob/master/hacker_news/llm/openai.py#L69-L77

@thiswillbeyourgithub

Alright, I do think it's a good rule of thumb not to stop before banning around 10 tokens; right now you've banned 5. I already had to do this kind of thing a while ago, and after banning a few more tokens the model worked as expected (not Gemma, though).

Also, I don't know what you use to parse the webpage, but you might be interested in this: https://github.com/jina-ai/reader

It's a very simple parser that converts URLs into LLM-friendly text; it even turns images into captions! It's quite new, though, and they had issues with scaling at some point, so maybe use a timeout when querying it.

I'm bringing that up because good web parsing can greatly help LLMs summarize, especially smaller models.
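The reader service works by prefixing the target URL with its endpoint. A minimal sketch of the timeout-plus-fallback pattern suggested above; the function names are hypothetical, and the fallback simply fetches the raw page for the caller's own parser to handle:

```python
from urllib.request import urlopen

READER_PREFIX = "https://r.jina.ai/"

def reader_url(url):
    # jina reader convention: prepend the endpoint to the full target URL
    return READER_PREFIX + url

def fetch_for_llm(url, timeout=10):
    """Fetch an LLM-friendly rendering via the jina reader, falling back to
    the raw page if the reader is slow or down (it has had scaling issues)."""
    try:
        with urlopen(reader_url(url), timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
```

Keeping the fallback cheap matters here, since a summarizer pipeline that blocks on a flaky third-party service would stall the whole digest run.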

@polyrabbit
Owner

> not stop before banning like 10 tokens

I tweaked some code and switched to llama3 now; I'll use it for a while and see how it goes. Hopefully I won't need to spend time working around a model's tokenizer quirks again.

> Also, I don't know what you use to parse the webpage, but you might be interested in this: https://github.com/jina-ai/reader

It's a handwritten Python library that is small and easy to maintain. It's been in use for more than 10 years, since the very beginning of this project.

The jina parser looks very helpful. I'm considering using it as a fallback for dynamic web pages. Thanks!

@QINGCHARLES

@polyrabbit I just want to say thank you for this app. The amount of time it saves me each day is literally life-changing, since I don't have to click into articles on HN to see if they are worth exploring.

@polyrabbit
Owner

> I'm considering using it as a fallback for dynamic web pages.

Done; now we have summaries for Substack etc.
