GPT Neo #10848
Conversation
I tried running this with the 2.7B checkpoint and got:
Hi @StellaAthena,

```python
['global', 'local', 'global', 'local', 'global', 'local', 'global', 'local',
 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local',
 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local',
 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local']
```

I've converted these checkpoints and will push them to the hub in a couple of hours. I'll ping you once that's done, so you can directly download them.
I see! Is this a problem with my local config file, or is something up with the code on the repo? I downloaded my file directly from the-eye before running the conversion script, so if the local config file is wrong that's a bit of a problem for us.
Hey @patil-suraj, I haven't had a chance to look over the whole PR yet, so I'm not sure how you load up the configuration, but I wonder why you even have separate fields for "attention_types" and "attention_layers", since they configure the same thing and the attention layers can be derived from the attention types.
Hi @sdtblck,
For an example config file, see https://huggingface.co/valhalla/gpt_neo_xl_test/blob/main/config.json. I've uploaded the 1.3B checkpoint under my namespace temporarily; here's a colab if you wanna give it a try.
Our config file doesn't define `attn_layers`.
Yes, you are right! I hardcoded it since we usually prefer to keep everything explicit, but yeah, I agree this would be a problem for your side. I will change it so that it is derived from `attention_types` instead. Are there any other issues?
The 2.7B model is up! https://huggingface.co/valhalla/gpt_neo_2.7B/tree/main
I tried out the 2.7B model you posted @patil-suraj but it wouldn't run. I get the error:

Looking through the readout, I see:

I think that there's an unneeded transpose hanging out in the code.
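For context, a hedged sketch of the kind of check that surfaces such a mismatch (the file name and parameter-name suffix below are assumptions, not taken from the thread): GPT-2-style `Conv1D` modules store weights as `(in_features, out_features)`, while `nn.Linear` expects `(out_features, in_features)`, so a missing or extra transpose in a conversion script tends to show up as a size-mismatch error when the state dict is loaded.

```python
import torch

# Hypothetical inspection of a converted checkpoint: print the shapes of the
# projection weights. If a weight that feeds an nn.Linear has shape
# (in_features, out_features), the conversion script still needs a .t();
# if it already transposed once, a second transpose is the bug.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
for name, tensor in state_dict.items():
    if name.endswith("out_proj.weight"):
        print(name, tuple(tensor.shape))
```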
It looks like you are using the … Could you check out this PR and try loading it using the … And yes, …
This is fantastic, good job going so fast @patil-suraj! Could you list here the tasks left to be done, even if you plan on doing them in a future PR? Thanks!
Was there no way to add some "# Copied from" statements to ensure that the two models do not diverge?
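For readers unfamiliar with the convention: `# Copied from` comments let the transformers repo's consistency checker keep duplicated code in sync with its source module. A sketch of the shape such a statement takes (the simplified MLP body is illustrative, not the actual GPT-2 code):

```python
import torch.nn as nn


# Copied from transformers.models.gpt2.modeling_gpt2.GPT2MLP with GPT2->GPTNeo
class GPTNeoMLP(nn.Module):
    def __init__(self, intermediate_size, hidden_size):
        super().__init__()
        # Simplified stand-in body; the marker above is what the consistency check reads.
        self.c_fc = nn.Linear(hidden_size, intermediate_size)
        self.c_proj = nn.Linear(intermediate_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        return self.c_proj(self.act(self.c_fc(hidden_states)))
```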
I have made some changes to the code, mostly related to naming and passing …
Very nice work! Left mostly comments on style/docstrings and names; the two main comments I have are:

- there should be an easier API in the config to get the `attn_layers` (instead of having to pass a tuple of 24 elements following a given pattern), I made a suggestion.
- if the tokenizer is a copy of GPT-2, the model should just use the GPT-2 tokenizer.
```
:obj:`inputs_ids` passed when calling :class:`~transformers.GPTNeoModel` or
:class:`~transformers.TFGPTNeoModel`. Vocabulary size of the model. Defines the different tokens that can
be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.GPTNeoModel`.
attn_layers (:obj:`Tuple[str]`, `optional`, defaults to :obj:`("global","local","global","local","global","local","global","local","global","local","global","local","global","local","global","local","global","local","global","local","global","local","global","local")`):
```
Our training code produces a config file that includes the entry `"attention_types": [[["global", "local"], 16]]`. I would highly recommend using

```python
attn_layers = [copy.copy(e) for _ in range(args.attention_types[1]) for e in args.attention_types[0]]
```

This has the added advantage of allowing support for additional layer types that aren't used in these pretrained models but are implemented in the repo, like Mixture of Experts and Linear Attention.
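A runnable sketch of that expansion; note that the config entry quoted above nests each (pattern, repeat count) pair inside an outer list, so the indexing is adjusted accordingly (variable names are illustrative):

```python
import copy

attention_types = [[["global", "local"], 16]]

# Expand each (pattern, repeat) pair into a flat per-layer list of attention types.
attn_layers = [
    copy.copy(layer_type)
    for pattern, repeat in attention_types
    for _ in range(repeat)
    for layer_type in pattern
]

print(attn_layers[:4], len(attn_layers))
# ['global', 'local', 'global', 'local'] 32
```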
```python
attn_layers=(
    "global", "local", "global", "local", "global", "local",
    "global", "local", "global", "local", "global", "local",
    "global", "local", "global", "local", "global", "local",
    "global", "local", "global", "local", "global", "local",
),
```
Following up on my comment above, I think this should just be removed as an argument and entirely calculated internally.
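A hedged sketch of what calculating it internally could look like on the config (class and method names here are assumptions for illustration, not the merged implementation):

```python
class GPTNeoConfig:
    """Minimal stand-in for the real config class, showing only the expansion logic."""

    def __init__(self, attention_types=None):
        if attention_types is None:
            attention_types = [[["global", "local"], 16]]
        self.attention_types = attention_types
        # Derived internally instead of being accepted as a separate argument.
        self.attention_layers = self.expand_attention_types_params(attention_types)

    @staticmethod
    def expand_attention_types_params(attention_types):
        attentions = []
        for pattern, repeat in attention_types:
            attentions.extend(pattern * repeat)
        return attentions


config = GPTNeoConfig()
print(config.attention_layers[:4], len(config.attention_layers))
# ['global', 'local', 'global', 'local'] 32
```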
An update from our end: We got the 2.7B model up and running in our evaluation harness! Unfortunately the run revealed that the harness is bugged... Running it by hand gives reasonable-looking results, but I don't know how much I should trust myself to judge that.
(To clarify: the bugs in the eval harness were introduced by a series of pretty aggressive optimizations I implemented just a few hours earlier today.)
I tried fine-tuning the model with DeepSpeed and gradient checkpointing, but unlike with GPT2, the loss explodes. I used the default run_clm.py from the examples folder, but added one line to activate gradient checkpointing. Here is the command I ran:

Here is my ds_config_gptneo.json (it is almost the default, except for a lower min_loss_scaling, otherwise I got overflows; optimizer and warmup hyperparameters are overwritten by the flags above):

I also tried the exact hyperparameters that EleutherAI used, with long warmup phases, but it is still the same. If the learning rate is low enough the loss doesn't change, and once it's big enough, it immediately explodes. I also did a hyperparameter sweep with the same result. Could this be an issue with the model implementation, as fine-tuning with EleutherAI's implementation in Mesh TensorFlow on Colab seems to work? Here are the exact steps that I did (on the bottom half part): https://github.com/Xirider/finetune-gpt2xl
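The added line isn't shown in the thread; a hedged sketch of what enabling gradient checkpointing in run_clm.py typically looks like (the placement after model instantiation is an assumption; the call itself is the standard transformers API):

```python
# Hypothetical one-line addition to run_clm.py, after the model is created:
# recompute activations during the backward pass to save memory at the cost of speed.
model.gradient_checkpointing_enable()
```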
Hi @Xirider, let me take a look; meanwhile, could you try without fp16?
Hi, yes, I will try it.
Hm, disabling fp16 doesn't work with ZeRO:
One thing I've caught testing the neo model is that if I try to add a padding token to the tokenizer after loading it from pretrained (i.e. to predict batches instead of a single sequence at a time), then I get:

I guess because the tokenizer vocabulary is different to the way it was initialized. I'm not sure if this is a HF-wide problem (although I don't recall this being a problem with `GPT2Tokenizer.from_pretrained('gpt2')`) or specific to neo, but here is the code to reproduce the error:

```python
import torch
from transformers import GPTNeoForCausalLM, GPT2Tokenizer

ckpt_2b = "EleutherAI/gpt_neo_2-7B"
tokenizer = GPT2Tokenizer.from_pretrained(ckpt_2b)
tokenizer.add_special_tokens({'pad_token': '<|padding|>'})
ids = tokenizer("hello world", return_tensors="pt").input_ids.to("cuda")
```
Maybe I'm just going insane, or doing something stupid, because swapping out `ckpt_2b` for `'gpt2'` gives the same error. We never had this problem training with gpt-neox. Can anyone reproduce, and if so, should I open up a new issue?
Hey @sdtblck! I think the issue here is because you're adding a new token to your tokenizer (so you're extending your vocab), but you're not resizing the token embedding matrix. When you're creating the GPT-2 tokenizer from your checkpoint, you should have a tokenizer size of 50257:

```python
from transformers import GPTNeoForCausalLM, GPT2Tokenizer

ckpt_2b = "EleutherAI/gpt_neo_2-7B"
tokenizer = GPT2Tokenizer.from_pretrained(ckpt_2b)
print(len(tokenizer))
# 50257
```

That's the same size as the model token embedding matrix:

```python
print(model.get_input_embeddings())
# Embedding(50257, 2560)
```

When adding a new token, you should also resize the token embedding matrix alongside it. Otherwise you'll get some index out of range issues, as you'll be trying to obtain the 50258th row of a matrix with 50257 rows. Please add the following line to your code, once you have added a token to your tokenizer and instantiated your model:

```python
model.resize_token_embeddings(len(tokenizer))
```

Everything should be working smoothly now :)
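To make the failure mode concrete, a small self-contained sketch of the underlying lookup (sizes taken from the comment above):

```python
import torch

# An embedding matrix with 50257 rows, like the unresized model's input embeddings.
emb = torch.nn.Embedding(50257, 2560)

# The newly added pad token gets id 50257, one past the last valid row (50256)...
ids = torch.tensor([50257])

# ...so looking it up raises "IndexError: index out of range in self".
emb(ids)
```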
Hm, @LysandreJik so doing that does make the error go away, but sampling with the model when I've added padding tokens seems to cause almost everything in the prediction to become padding. Let me know if I should take this somewhere else btw, don't want to clog up this PR if this issue doesn't relate to it at all.

```python
prompt = "Q: What is the meaning of life? A:"

gen_text = predict(prompt)  # predict is the user's generation helper (not shown)
print('-' * 100)
print(gen_text)

tokenizer.add_special_tokens({'pad_token': '<|padding|>'})
model.resize_token_embeddings(len(tokenizer))
model.half()

gen_text = predict(prompt)
print('-' * 100)
print(gen_text)
```

Outputs:
Hi @sdtblck, for batch generation with GPT-like models, the text should be padded to the left. This is how batch generation works:

```python
model.config.pad_token_id = tokenizer.pad_token_id
tokenizer.padding_side = "left"

inputs = tokenizer(sentences, return_tensors="pt", padding=True)
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
)
```
Also, the actual vocab size of the model is 50257, so token ids range from 0 to 50256. Instead of adding a new padding token, you can reuse the eos token for padding:

```python
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left"

inputs = tokenizer(sentences, return_tensors="pt", padding=True)
gen_tokens = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
)
```

This should work. Or feel free to open an issue if this is not working.
LGTM! Thanks for the very quick implementation @patil-suraj.
Follow-up PRs will clean up some cosmetics and add more robust tests for the attention mechanism; parity has been achieved with EleutherAI's implementation on inference, and it has been verified that the model can be fine-tuned.
Merging!
Do you by any chance have an example input/output with the conversion script? I was having trouble getting the new code to work with the default configs in the gpt-neo repo.
There are models listed on the EleutherAI HuggingFace account that AFAIK we did not post. Are these the pretrained models @patil-suraj had been hosting?
I was referring to the pre-trained models posted here: https://the-eye.eu/public/AI/gptneo-release/
Hi @StellaAthena, which models are you talking about? The only two models available are the 1.3B and the 2.7B versions.
Hi. I'm getting this issue on Colab when trying to import it:
Hi @zanderbush, please make sure you:
@LysandreJik Thank you! That worked. I face a new issue, however, as I look to return the most probable next token. This works with the typical GPT-2, but not with this model for some reason:
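The snippet that failed wasn't captured above; a minimal sketch of returning the most probable next token, which works the same way for GPT-2 and GPT Neo (the prompt is illustrative, and the checkpoint name is the one used earlier in this thread; this is not the code zanderbush ran):

```python
import torch
from transformers import GPTNeoForCausalLM, GPT2Tokenizer

ckpt = "EleutherAI/gpt_neo_2-7B"
tokenizer = GPT2Tokenizer.from_pretrained(ckpt)
model = GPTNeoForCausalLM.from_pretrained(ckpt)

inputs = tokenizer("The meaning of life is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# The logits at the last position score every candidate next token.
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode([next_token_id]))
```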
Why is …

@leogao2 …
@zanderbush I believe this is unrelated to GPT Neo and related to your code instead. Please open a new issue with a reproducible code example (tokenizer and model defined). Thank you!
* lets begin
* boom boom
* fix out proj in attn
* fix attention
* fix local attention
* add tokenizer
* fix imports
* autotokenizer
* fix checkpoint name
* cleanup
* more clean-up
* more cleanup
* output attentions
* fix attn mask creation
* fix imports
* config doc
* add tests
* add slow tests
* quality
* add conversion script
* copyright
* typo
* another bites the dust
* fix attention tests
* doc
* add embed init in convert function
* fix copies
* remove tokenizer
* enable caching
* address review comments
* improve config and create attn layer list internally
* more consistent naming
* init hf config from mesh-tf config json file
* remove neo tokenizer from doc
* handle attention_mask in local attn layer
* attn_layers => attention_layers
* add tokenizer_class in config
* fix docstring
* raise if len of attention_layers is not same as num_layers
* remove tokenizer_class from config
* more consistent naming
* fix doc
* fix checkpoint names
* fp16 compat
* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
What does this PR do?
This PR adds the GPT Neo model.
The model architecture is very similar to GPT2, except it uses local attention in alternate layers.

- The `LocalAttention` module implements the local attention. The implementation is not as clean as it should be and will be cleaned up in a follow-up PR.
- When caching is enabled (`use_cache`), the local attention layer caches the `hidden_states` instead of `past_key_value_states`. Also, right now when `use_cache` is enabled the current length cannot be greater than 1.

Example usage:
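A minimal sketch of generating with the model, using the temporary 1.3B checkpoint posted earlier in this thread and the GPT-2 tokenizer (the prompt and sampling arguments are illustrative):

```python
from transformers import GPTNeoForCausalLM, GPT2Tokenizer

model = GPTNeoForCausalLM.from_pretrained("valhalla/gpt_neo_xl_test")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

prompt = "In a shocking finding, scientists discovered a herd of unicorns"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Sample a continuation from the model.
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100)
print(tokenizer.batch_decode(gen_tokens)[0])
```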
Future TODOs:

- Clean up the `LocalAttention` module, especially the creation of the `attention_mask`.
- Fix generation when `use_cache` is enabled.
- Add more tests for the `LocalAttention` module.
- Add the TF model.