Pretraining loss explosion #554

mattgorb · 2024-09-13T16:14:04Z

I have been trying to get this repo working for several months, but my loss keeps exploding between 30k and 100k iterations.

I have tried many things:
Turning flash attention off (based on issue #524)
Using fp16 (based on #468)
Using the GPT-4 tokenizer (based on #468)

At first the loss was going back up to about 8-10; now it just goes to NaN with fp16.

I have also tinkered with other settings such as gradient clipping and learning rate. I keep my configuration at a batch size of roughly 500k.
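
To make the batch size figure concrete, here is a minimal sketch of how a batch of roughly 500k could be composed, assuming the number counts tokens per optimizer step. Every value and name below (`micro_batch_size`, `block_size`, `grad_accum_steps`, `num_gpus`) is a hypothetical placeholder chosen for illustration, not a setting taken from this post.

```python
# Hypothetical values for illustration only; none are taken from the post.
micro_batch_size = 12   # sequences per forward/backward pass on one GPU
block_size = 1024       # tokens per sequence (context length)
grad_accum_steps = 5    # micro-batches accumulated per GPU before one optimizer step
num_gpus = 8            # data-parallel workers


def tokens_per_step(micro_batch_size, block_size, grad_accum_steps, num_gpus):
    """Return the number of tokens consumed by a single optimizer step."""
    return micro_batch_size * block_size * grad_accum_steps * num_gpus


total = tokens_per_step(micro_batch_size, block_size, grad_accum_steps, num_gpus)
print(f"{total:,} tokens per optimizer step")  # 491,520 tokens per optimizer step
```

Changing any one factor changes the effective batch size, so it may be worth re-checking this product whenever the GPU count or accumulation steps change between runs.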

I am lost on what to try next. Did anyone else fix this issue?

I have gotten GPT-2 Small down to about 3.0 loss.
