Add GPT-J support (#285) #288
Conversation
Hello @lerouxrgd
This is great - thank you very much for this PR! Could you please open pull requests on the Huggingface model hub to upload the model weights that are registered in the crate (you may want to refer to this PR as part of the PRs there)? It would also be great to extend the test with a valid generation example - the crate has a flag for disabling large tests by default (in the CI pipeline), but allows running them manually on a larger machine.
The default values for use_float16 and preload_on_gpu both sound reasonable to me.
if isinstance(v, Tensor):
    nps[k] = np.ascontiguousarray(v.cpu().numpy().astype(np.float32))
I am wondering if this breaks a current model implementation. It looks like this was added when implementing MBart - but the weights on the Huggingface model hub do not seem to differ. Maybe we could make the target precision an argument?
Done
src/gpt_j/mod.rs
Outdated
//! # }
//! ```

pub(crate) mod attention;
Is this level of visibility needed for attention and transformer?
Indeed it's not needed, I took it from GPT2. I'll remove the pub(crate).
Done
src/gpt_j/attention.rs
Outdated
);
let dim_per_head = config.n_embd / config.n_head;

let scale_attn = Tensor::from(dim_per_head as f32)
Is there value in storing this as a tensor as opposed to an f32? I believe broadcasting should work with the float value as well, and it is a bit lighter to manipulate/store.
Good point, I originally used a Tensor as in the Python implementation. However, it looks like the broadcasting is only available for f64 with <Scalar as From<f64>>, which is fine I guess.
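For illustration, a minimal sketch of the plain-float variant (not the PR's code; attention_scores, shapes and dimensions are placeholders), relying on the Tensor arithmetic with values convertible into a Scalar mentioned above:

use tch::{Device, Kind, Tensor};

// Sketch only: scale attention logits by sqrt(dim_per_head) with a plain f64
// instead of a stored Tensor; Tensor * f64 works through Into<Scalar>.
fn scale_scores(attention_scores: Tensor, n_embd: i64, n_head: i64) -> Tensor {
    let dim_per_head = n_embd / n_head;
    let scale_attn = (dim_per_head as f64).sqrt();
    attention_scores * (1.0 / scale_attn)
}

fn main() {
    let scores = Tensor::rand(&[1, 4, 8, 8], (Kind::Float, Device::Cpu));
    let _scaled = scale_scores(scores, 64, 4);
}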
Done
let out_proj = nn::linear(p / "out_proj", config.n_embd, config.n_embd, linear_config);
if config.use_float16 {
    (p / "out_proj").half();
}
It might be enough to call p.half() once for all sublayers of the attention (it should propagate to all the sub-paths).
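For reference, a hedged sketch of that suggestion, continuing the snippet from the diff above (projection names other than out_proj are assumptions based on the GPT-J architecture, not the PR's code):

let q_proj = nn::linear(p / "q_proj", config.n_embd, config.n_embd, linear_config);
let k_proj = nn::linear(p / "k_proj", config.n_embd, config.n_embd, linear_config);
let v_proj = nn::linear(p / "v_proj", config.n_embd, config.n_embd, linear_config);
let out_proj = nn::linear(p / "out_proj", config.n_embd, config.n_embd, linear_config);
if config.use_float16 {
    // Single conversion for the whole attention path instead of one call per
    // sub-layer; assumed to propagate to all variables registered under `p`.
    p.half();
}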
Right, but I think that converting to half as soon as possible is safer, especially since at layer 28 the memory usage starts to be significant.
src/gpt_j/attention.rs
Outdated
    tensor.permute(&[0, 2, 1, 3]) // (batch, head, seq_length, head_features)
} else {
    panic!(
        "Input tensor rank should be one of [4, 5], but is: {}",
I guess it should be one of [4, 5] or a rotary head?
Done
src/gpt_j/gpt_j_model.rs
Outdated
let (layer_past, _layer_past_length) = match layer_past {
    Some(value) => {
        assert_eq!(
Since this method returns a Result you could raise an Error rather than panicking.
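A hedged sketch of what that could look like, assuming the crate's RustBertError::ValueError variant and an illustrative rank check (the actual condition used in the PR differs):

use rust_bert::RustBertError;
use tch::Tensor;

// Sketch only: validate the cached past tensor and return an error instead of
// panicking; the expected rank is illustrative.
fn check_layer_past(layer_past: Option<&Tensor>) -> Result<(), RustBertError> {
    if let Some(value) = layer_past {
        if value.size().len() != 5 {
            return Err(RustBertError::ValueError(format!(
                "Expected a past tensor of rank 5, but got rank {}",
                value.size().len()
            )));
        }
    }
    Ok(())
}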
Done
src/gpt_j/gpt_j_model.rs
Outdated
    .unsqueeze(2)
    .to_kind(input_embeddings.kind());

let attention_mask: Tensor = (1.0 - attention_mask) * (-10000.0);
Is this valid for both fp32 and fp16?
It looked OK in my tests (same text generated as in Python) with models of both precisions. Do you see another kind of test for this in particular?
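For what it's worth, a small sanity-check sketch (not part of the PR): -10000.0 is well within the float16 range (roughly ±65504) and is exactly representable, so the additive mask value itself does not overflow in fp16:

use tch::{Kind, Tensor};

fn main() {
    // Convert the mask constant to half precision and read it back.
    let mask_value = Tensor::of_slice(&[-10000.0_f32]).to_kind(Kind::Half);
    assert!((mask_value.double_value(&[0]) - (-10000.0)).abs() < 1e-3);
    println!("fp16 mask value: {}", mask_value.double_value(&[0]));
}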
I guess the test you suggested with the tiny model will help, I'll let you know the results.
I am not sure if you had the chance to test this - could you please run the Python version with half precision weights and update the "correctness" integration tests with this transformation?
Well, using half precision doesn't work well on CPU (as CPUs usually don't support float16), so this would require CUDA and a GPU to run, which is not what the GitHub CI provides I guess.
I will still try to run it on a GPU on my side to check the correctness.
I usually try to run a broader set of tests using a GPU - maybe a solution could be to add a test for half precision that runs only on GPU, i.e.:
#[test]
fn test_gpt_j_half_precision() {
    if Cuda::is_available() {
        // ... tests here ...
    }
}
While giving it a try I realized that with the latest Python transformers I don't get the same results for the "correctness" test anymore. After investigating, it looks like in 4.21.0 HF made changes to the implementation of GPT-J. I will try to adapt the PR accordingly first.
Interestingly the generation test is still fine ...
In the end I have made the correctness test use the CUDA device if available. On CPU this is the same test as before, with precision 1e-4. On GPU the test forces float16 usage, and in this case the correctness only holds at 1e-2.
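For illustration, a hedged sketch of that device-dependent setup (names are illustrative, this is not the PR's test code):

use tch::{Device, Kind};

fn main() {
    let device = Device::cuda_if_available();
    // Force float16 on GPU (with a looser tolerance); keep float32 on CPU.
    let (kind, tolerance) = if matches!(device, Device::Cuda(_)) {
        (Kind::Half, 1e-2)
    } else {
        (Kind::Float, 1e-4)
    };
    println!("device: {:?}, kind: {:?}, tolerance: {}", device, kind, tolerance);
}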
@@ -0,0 +1,89 @@
use rust_bert::gpt_j::{
Can you please add this test to the integration tests as well (after --test sentence_embeddings)?
Done
tests/gpt_j.rs
Outdated
/// gen_texts = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)
/// ```
#[test]
fn test_generation_gpt_j() -> anyhow::Result<()> {
Thank you for adding tests.
I would suggest the following:
- For the tiny test that will run in the pipeline, please instead do a forward pass through the model by passing an input tensor and a past tensor, and compare that to the output generated by Python, e.g. to the 4th decimal, see line 64 in 78da0f4: (model_output.decoder_output.double_value(&[0, 0, 0]) - -2.047429323196411).abs() < 1e-4 (see the sketch after this list).
- It would be nice to add a test for a full-sized model. Can you add a test that generates a sensible output to test the generation process using float16, on CPU, flagging the test as #[cfg_attr(not(feature = "all-tests"), ignore)]?
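A hedged sketch of the first point (run_tiny_forward_pass is a hypothetical helper standing in for the actual forward pass of the randomly initialised tiny model, and the reference value is a placeholder taken from the example above; GPT-J's output field may differ):

#[test]
fn gpt_j_tiny_correctness() -> anyhow::Result<()> {
    // Hypothetical helper returning the forward-pass output tensor for a fixed
    // input tensor and past tensor.
    let output = run_tiny_forward_pass()?;
    // Placeholder reference value, to be produced with the Python implementation.
    let expected = -2.047429323196411;
    assert!((output.double_value(&[0, 0, 0]) - expected).abs() < 1e-4);
    Ok(())
}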
I'll work on the test you suggested for the tiny model.
As for the full-sized model using float16, it won't run on the CPU: when I tried I got a Torch error: "LayerNormKernelImpl" not implemented for 'Half'. I think some kernels are not available for inference on the CPU when using float16. Using the float32 version requires at least 32 GB of RAM though. I can write a test that runs on the GPU using float16 if possible, and on the CPU using float32 otherwise. What do you think?
"gpt-j-6B/model", | ||
"https://huggingface.co/EleutherAI/gpt-j-6B/resolve/float16/rust_model.ot", | ||
); | ||
pub const GPT_J_TINY_RANDOM: (&'static str, &'static str) = ( |
It would be good to clarify the license for this model with the author or the Huggingface team
@lerouxrgd to my understanding this pull request is held up by the pending model weights pull request on the Huggingface hub. It seems that hub model contributors do not automatically get notifications when an issue/PR is created on their models; you may want to try reaching out directly to the authors (if you find their contact information) or to the Huggingface maintainers?
Yes, I have reached out to the maintainers but I don't have a final answer yet.
Hello @lerouxrgd, if it is difficult to get the Rust weights merged, I suggest updating this pull request to remove the pretrained weights and replace them with a comment indicating how to generate the set of weights from the Python version. If and when the original model authors approve hosting Rust weights (or if they encourage creating a new model on the Hub for the Rust weights), you could open a new pull request making these weights available to the community. What do you think?
Hello @guillaume-be, it's sad to see that the model's maintainers aren't really willing to add the Rust weights nor to give constructive feedback. Therefore I think you're right, we can explain in comments how to create them. I have updated the PR accordingly, let me know what you think.
tests/gpt_j.rs
Outdated
/// [gpt-j-6B-float16]: https://huggingface.co/EleutherAI/gpt-j-6B/tree/float16
#[test]
#[cfg_attr(not(feature = "all-tests"), ignore)]
fn gpt_j_generation() -> anyhow::Result<()> {
If these weights are not available, could we please remove this from the integration tests? This could be added as an example instead if you wish, to be run manually.
You could also try loading the pytorch_model.bin model file with the latest tch-rs update. It does not work for all model files yet, but if it is compatible with GPT-J files this would allow readily loading the model from pretrained resources.
It looks like I also get an "Expected GenericDict but got None" error as mentioned in tch-rs/issues/595.
Ah that's unfortunate, thank you for checking anyway. It may make sense to move this code snippet to the examples if the download/conversion cannot be easily automated - what do you think?
@lerouxrgd it looks like the PR is almost ready to be merged -- any chance this could be moved to an example? This would give more visibility to this new feature, and keep all tests independent of pre-download/conversion steps
That's exactly what I was doing :)
Done (And I have re-checked that it works fine on a T4 GPU)
Hello @lerouxrgd, I want to point you to the existing utilities that can be found in https://github.com/guillaume-be/rust-bert/blob/master/src/common/kind.rs which may be useful for completing this pull request. Please feel free to extend them if you need values that are not yet covered.
Oh I wasn't aware of it, thanks for pointing it out. Although I think it's just half of the story, the other half being the ...
Could you please share a pointer to what you are trying to implement?
So this line:
let attention_mask: Tensor = (1.0 - attention_mask) * (-10000.0);
works because of impl From<f64> for Scalar and impl<S: Into<Scalar>> Mul<S> for Tensor. However, as mentioned, in the latest transformers version this is now:
attention_mask = (1.0 - attention_mask) * torch.finfo(self.dtype).min
So we need 2 things to port this code correctly:
Writing this I realize that even the ... So maybe simply using your function for getting ...
The attention mask may be of type ..., where ...
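A hedged sketch of what this could look like (get_min is assumed to be, or to be added to, the utilities in src/common/kind.rs, returning the minimum value of a Kind as a Scalar wrapped in a Result, hence the ?; this is not the final PR code):

// Sketch only: port of
//   attention_mask = (1.0 - attention_mask) * torch.finfo(self.dtype).min
// using a kind-aware minimum instead of the hard-coded -10000.0. The incoming
// mask may be an integer tensor, hence the conversion to the embeddings' kind.
let kind = input_embeddings.kind();
let attention_mask = attention_mask.to_kind(kind);
let attention_mask: Tensor = (1.0 - attention_mask) * get_min(kind)?;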
I see, thanks, I'll give it a try on Monday!
So it compiles fine, but sadly it doesn't change the results... There are still some changes made in HF between 4.20.1 and 4.21.0 that affect the correctness results, and I haven't found what they are.
@lerouxrgd I noticed a discrepancy when implementing ... (see line 42 in d7e9c03) that is merged as part of #333 - could you have a look and see if this helps?
@guillaume-be Good news, after a long println debugging session I've finally figured out that the main issue was that I didn't set the attention mask in my unit test. I guess this was still OK in the previous attention version but not with the latest one.
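For context, a hedged sketch of the kind of mask that was missing in that unit test (shapes and names are illustrative):

use tch::{Device, Kind, Tensor};

fn main() {
    let (batch_size, sequence_length) = (1, 4);
    // An all-ones attention mask (i.e. no padding) matching the input ids'
    // shape, passed explicitly to the forward pass in the unit test.
    let attention_mask = Tensor::ones(
        &[batch_size, sequence_length],
        (Kind::Int64, Device::Cpu),
    );
    println!("{:?}", attention_mask);
}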
This is awesome, thank you so much for your patience in getting to the bottom of this! I have one last request (moving the run with the actual GPT-J weights to the examples) and this should be good to merge.
Ok I think everything is fine now. I have made the correctness test runnable on both CPU and GPU (forcing it to use float16 in the latter case, which in turn lowers the precision to 1e-2 instead of 1e-4). Let me know if some other changes are needed.
This is great - thanks again for this PR @lerouxrgd !
Remaining points:
- GptJModelResources rust_model.ot files should be added to the HF Hub (I have never tried to add models there but I can try if you want)
- GptJConfig::use_float16 defaults to true, do you want to keep it this way?
- GptJConfig::preload_on_cpu defaults to true, do you want to keep it this way?

Note that GptJConfig::preload_on_cpu makes it possible to load GPT-J 6B (both float16, and float32 with conversion to float16 on the fly) on a T4 GPU (with ~16 GB VRAM). Without this flag there are memory issues when loading the weights with var_store.load(weights_path).