Add GPT-J support (#285) #288

Merged: 5 commits into guillaume-be:master on Feb 15, 2023
Conversation

@lerouxrgd (Contributor) commented on Oct 13, 2022

Remaining points:

  • The GptJModelResources rust_model.ot files should be added to the HF Hub (I have never uploaded models there, but I can try if you want).
  • GptJConfig::use_float16 defaults to true; do you want to keep it this way?
  • GptJConfig::preload_on_cpu defaults to true; do you want to keep it this way?

Note that GptJConfig::preload_on_cpu makes it possible to load GPT-J 6B (both float16, and float32 with on-the-fly conversion to float16) on a T4 GPU (~16 GB VRAM). Without this flag, there are memory issues when loading the weights with var_store.load(weights_path).
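For context, a rough back-of-the-envelope estimate (the parameter count below is approximate) of why float16 is needed to fit the weights in a T4's 16 GB of VRAM:

fn main() {
    let params = 6.05e9_f64; // approximate GPT-J 6B parameter count
    let gib = 1024_f64.powi(3);
    // fp32: roughly 22-23 GiB of weights alone, more than a T4 offers
    println!("fp32 weights: ~{:.1} GiB", params * 4.0 / gib);
    // fp16: roughly 11-12 GiB, which fits with room to spare for activations
    println!("fp16 weights: ~{:.1} GiB", params * 2.0 / gib);
}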

@guillaume-be (Owner) left a comment

Hello @lerouxrgd,
This is great, thank you very much for this PR! Could you please open pull requests on the Huggingface model hub to upload the model weights that are registered in the crate (you may want to reference this PR in the PRs there)? It would also be great to extend the tests with a valid generation example; the crate has a flag that disables large tests by default (in the CI pipeline) but allows running them manually on a larger machine.

The default values for use_float16 and preload_on_cpu both sound reasonable to me.

if isinstance(v, Tensor):
    nps[k] = np.ascontiguousarray(v.cpu().numpy().astype(np.float32))
@guillaume-be (Owner)

I am wondering if this breaks a current model implementation. It looks like this was added when implementing MBart, but the weights on the Huggingface model hub do not seem to differ. Maybe we could make the target precision an argument?

@lerouxrgd (Contributor, Author) commented on Oct 16, 2022

Done

src/gpt_j/mod.rs Outdated
//! # }
//! ```

pub(crate) mod attention;
@guillaume-be (Owner)

Is this level of visibility needed for attention and transformer?

@lerouxrgd (Contributor, Author) commented on Oct 16, 2022

Indeed it's not needed; I took it from GPT2. I'll remove the pub(crate).

@lerouxrgd (Contributor, Author)

Done

);
let dim_per_head = config.n_embd / config.n_head;

let scale_attn = Tensor::from(dim_per_head as f32)
@guillaume-be (Owner)

Is there value in storing this as a Tensor as opposed to an f32? I believe broadcasting should work with the float value as well, and it is a bit lighter to manipulate and store.

@lerouxrgd (Contributor, Author)

Good point, I originally used a Tensor as in the Python implementation. However, it looks like the broadcasting is only available for f64 via <Scalar as From<f64>>, which is fine I guess.
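For illustration, a minimal tch-rs sketch of the scalar broadcasting discussed here (shapes and config values are made up); dividing a Tensor by a plain f64 goes through the Scalar conversion, so no extra Tensor needs to be stored on the layer:

use tch::{Device, Kind, Tensor};

fn main() {
    let (n_embd, n_head) = (4096_i64, 16_i64); // hypothetical config values
    let dim_per_head = n_embd / n_head;
    // Store the scale as a plain float instead of a Tensor.
    let scale_attn = (dim_per_head as f64).sqrt();
    let attn_weights = Tensor::rand(&[1, n_head, 8, 8], (Kind::Float, Device::Cpu));
    // Division by an f64 broadcasts over the whole tensor via <Scalar as From<f64>>.
    let scaled = attn_weights / scale_attn;
    println!("{:?}", scaled.size());
}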

@lerouxrgd (Contributor, Author)

Done

let out_proj = nn::linear(p / "out_proj", config.n_embd, config.n_embd, linear_config);
if config.use_float16 {
    (p / "out_proj").half();
}
@guillaume-be (Owner)

It might be enough to call p.half() once for all sublayers of the attention (it should propagate to all the sub-paths).

@lerouxrgd (Contributor, Author)

Right, but I think that converting to half precision as early as possible is safer, especially since by layer 28 the memory usage starts to become significant.

    tensor.permute(&[0, 2, 1, 3]) // (batch, head, seq_length, head_features)
} else {
    panic!(
        "Input tensor rank should be one of [4, 5], but is: {}",
@guillaume-be (Owner)

I guess it should be one of [4, 5] or a rotary head?

@lerouxrgd (Contributor, Author)

Done


let (layer_past, _layer_past_length) = match layer_past {
    Some(value) => {
        assert_eq!(
@guillaume-be (Owner)

Since this method returns a Result, you could return an Error rather than panicking.

@lerouxrgd (Contributor, Author)

Done

    .unsqueeze(2)
    .to_kind(input_embeddings.kind());

let attention_mask: Tensor = (1.0 - attention_mask) * (-10000.0);
@guillaume-be (Owner)

Is this valid for both fp32 and fp16?

@lerouxrgd (Contributor, Author)

It looked OK in my tests (the same text was generated as in Python) with models of both precisions. Do you have another kind of test in mind for this in particular?

@lerouxrgd (Contributor, Author) commented on Oct 16, 2022

I guess the test you suggested with the tiny model will help; I'll let you know the results.

@guillaume-be (Owner)

I am not sure if you had the chance to test this. Could you please run the Python version with half-precision weights and update the "correctness" integration tests with this transformation?

@lerouxrgd (Contributor, Author)

Well, using half precision doesn't work well on CPU (CPUs usually don't have float16 kernels), so this would require CUDA and a GPU to run, which is not what the GitHub CI provides, I guess.

I will still try to run it on a GPU on my side to check the correctness.

@guillaume-be (Owner)

I usually try to run a broader set of tests using a GPU. Maybe a solution could be to add a test for half precision that only runs on GPU, i.e.

#[test]
fn test_gpt_j_half_precision() {
    if Cuda::is_available() {
        // ... half-precision assertions here ...
    }
}

@lerouxrgd (Contributor, Author) commented on Jan 3, 2023

While giving it a try, I realized that with the latest Python transformers I no longer get the same results for the "correctness" test. After investigating, it looks like HF made changes to the GPT-J implementation in 4.21.0. I will try to adapt the PR accordingly first.

@lerouxrgd (Contributor, Author)

Interestingly, the generation test is still fine...

@lerouxrgd (Contributor, Author)

In the end I have made the correctness test use the CUDA device if available. On CPU this is the same test as before, with a 1e-4 tolerance. On GPU the test forces float16, and in that case correctness only holds at a 1e-2 tolerance.
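A minimal sketch of that device/tolerance switch (illustrative only, not the exact test code):

use tch::{Cuda, Device};

fn main() {
    let device = Device::cuda_if_available();
    // Force float16 on GPU; keep float32 on CPU, where some half kernels are missing.
    let use_half = Cuda::is_available();
    let tolerance = if use_half { 1e-2 } else { 1e-4 };
    println!("device: {:?}, float16: {}, tolerance: {}", device, use_half, tolerance);
}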

@@ -0,0 +1,89 @@
use rust_bert::gpt_j::{
@guillaume-be (Owner)

Can you please also add this test to the integration tests (after --test sentence_embeddings)?

@lerouxrgd (Contributor, Author)

Done

tests/gpt_j.rs Outdated
/// gen_texts = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)
/// ````
#[test]
fn test_generation_gpt_j() -> anyhow::Result<()> {
@guillaume-be (Owner)

Thank you for adding tests.

I would suggest the following:

  • For the tiny test that will run in the pipeline, please instead do a forward pass through the model by passing an input tensor and a past tensor, and compare the result to the output generated by Python, e.g. to the 4th decimal, as in
    (model_output.decoder_output.double_value(&[0, 0, 0]) - -2.047429323196411).abs() < 1e-4
    (a self-contained sketch of this kind of assertion follows after this list).
  • It would also be nice to add a test for the full-sized model: can you add a test that generates a sensible output using float16 on CPU, flagging it with #[cfg_attr(not(feature = "all-tests"), ignore)]?
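A self-contained sketch of that kind of assertion (the tensor below merely stands in for the model output, and the reference value is illustrative):

use tch::{Device, Kind, Tensor};

fn main() {
    // Stand-in for the first logit of the tiny model's forward pass.
    let model_output = Tensor::zeros(&[1, 1, 1], (Kind::Double, Device::Cpu)) - 2.047429323196411;
    let expected = -2.047429323196411; // value taken from the Python reference run
    let diff = (model_output.double_value(&[0, 0, 0]) - expected).abs();
    assert!(diff < 1e-4, "logit mismatch: {}", diff);
}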

@lerouxrgd (Contributor, Author) commented on Oct 16, 2022

I'll work on the test you suggested for the tiny model.

As for the full-sized model using float16, it won't run on the CPU; when I tried I got a Torch error: "LayerNormKernelImpl" not implemented for 'Half'. I think some float16 kernels are only implemented for CUDA, not for CPU inference. Using the float32 version requires at least 32 GB of RAM, though. I can write a test that runs on the GPU using float16 when possible, and on the CPU using float32 otherwise. What do you think?

"gpt-j-6B/model",
"https://huggingface.co/EleutherAI/gpt-j-6B/resolve/float16/rust_model.ot",
);
pub const GPT_J_TINY_RANDOM: (&'static str, &'static str) = (
@guillaume-be (Owner)

It would be good to clarify the license for this model with the author or the Huggingface team

@guillaume-be (Owner)

@lerouxrgd, to my understanding this pull request is held up by the pending model weights pull request on the Huggingface hub. It seems that hub model contributors do not automatically get notifications when an issue/PR is created on their models; you may want to try reaching out directly to the authors (if you can find their contact information) or to the Huggingface maintainers.

@lerouxrgd (Contributor, Author)

Yes, I have reached out to maintainers but I don't have a final answer yet.

@guillaume-be (Owner)

Hello @lerouxrgd,

If merging the Rust weights proves difficult, I suggest updating this pull request to remove the pretrained weights and replace them with a comment explaining how to generate the weights from the Python version. If and when the original model authors approve hosting Rust weights (or encourage creating a new model on the Hub for them), you could open a new pull request making those weights available to the community.

What do you think?

@lerouxrgd (Contributor, Author) commented on Dec 23, 2022

Hello @guillaume-be,

It's sad to see that the model's maintainers aren't really willing to add the Rust weights, nor to give constructive feedback. So I think you're right: we can explain in comments how to create them. I have updated the PR accordingly; let me know what you think.

tests/gpt_j.rs Outdated
/// [gpt-j-6B-float16]:https://huggingface.co/EleutherAI/gpt-j-6B/tree/float16
#[test]
#[cfg_attr(not(feature = "all-tests"), ignore)]
fn gpt_j_generation() -> anyhow::Result<()> {
@guillaume-be (Owner)

If these weights are not available, could we please remove this from the integration tests? It could be added as an example instead, to be run manually if you wish.

You could also try loading the pytorch_model.bin model file with the latest tch-rs update. It does not work for all model files yet, but if it is compatible with the GPT-J files this would allow loading the model directly from pretrained resources.

@lerouxrgd (Contributor, Author)

It looks like I also get an "Expected GenericDict but got None" error, as mentioned in tch-rs/issues/595.

@guillaume-be (Owner)

Ah that's unfortunate, thank you for checking anyway. It may make sense to move this code snippet to the examples if the download/conversion cannot be easily automated - what do you think?

@guillaume-be (Owner)

@lerouxrgd it looks like the PR is almost ready to be merged -- any chance this could be moved to an example? This would give more visibility to this new feature, and keep all tests independent of pre-download/conversion steps

@lerouxrgd (Contributor, Author)

That's exactly what I was doing :)

@lerouxrgd (Contributor, Author)

Done (and I have re-checked that it works fine on a T4 GPU).

@guillaume-be (Owner)

Hello @lerouxrgd,
I see you opened LaurentMazare/tch-rs#604.

I want to point out that the existing utilities in https://github.com/guillaume-be/rust-bert/blob/master/src/common/kind.rs may be useful for completing this pull request.

Please feel free to extend them if you need values other than MAX/Infinity.

@lerouxrgd (Contributor, Author)

Oh, I wasn't aware of it, thanks for pointing it out.

Although I think it's only half of the story, the other half being impl From<half::f16> for Scalar etc. for all Elements (or Kinds?), so that they can be multiplied with a Tensor. Let's see how the issue goes in tch-rs.

@guillaume-be (Owner)

Could you please share a pointer to what you are trying to implement?

@lerouxrgd (Contributor, Author)

So this line:

let attention_mask: Tensor = (1.0 - attention_mask) * (-10000.0);

works because of impl From<f64> for Scalar and impl<S: Into<Scalar>> Mul<S> for Tensor. However, as mentioned in the tch-rs issue, the attention mask in the HuggingFace Python code is implemented as follows since version 4.21.0:

attention_mask = (1.0 - attention_mask) * torch.finfo(self.dtype).min

So we need two things to port this code correctly:

  1. Something in Rust similar to torch.finfo(self.dtype).min to get the correct value (instead of the hardcoded -10000.0); a sketch of such a helper is shown after this list.
  2. The ability to multiply a Tensor by the resulting min value, which in the float16 case is not currently possible, as I mentioned in the tch-rs issue.
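For illustration, a hedged sketch of what a torch.finfo(dtype).min equivalent could look like (rust-bert's actual helper lives in src/common/kind.rs and returns a Result; the half crate is assumed to be available as a dependency):

use tch::{Kind, Scalar};

// Illustrative only: map a Kind to the minimum finite value of its dtype.
fn kind_min(kind: Kind) -> Option<Scalar> {
    match kind {
        Kind::Half => Some(Scalar::float(half::f16::MIN.to_f64())),
        Kind::Float => Some(Scalar::float(f32::MIN as f64)),
        Kind::Double => Some(Scalar::float(f64::MIN)),
        _ => None,
    }
}

fn main() {
    assert!(kind_min(Kind::Half).is_some());
}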

@lerouxrgd (Contributor, Author)

Writing this I realize that even the 1.0 - attention_mask part is probably not correct, as the same remark holds for impl<S: Into<Scalar>> Add<S> for Tensor, and some implicit conversion to f64 is already happening... which might defeat the point of using float16, I guess.

So maybe simply using your function for getting a Kind's min is enough. What do you think?

@guillaume-be (Owner)

The attention mask may be of type Kind::Bool or Kind::Int64 when the subtraction happens. I was working on a LongT5 implementation this morning and implemented that same mechanism as:

(attention_mask.ones_like()
            - attention_mask.to_kind(input_embeddings.kind()))
            * get_negative_infinity(input_embeddings.kind()).unwrap()

where input_embeddings is the input tensor to this layer (you may have hidden_states or another tensor you can use as a reference).
I have not yet tested it, but it compiles fine; maybe this helps?
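A self-contained variant of that snippet for reference, using Kind::Float and f32::MIN as the fill value in place of get_negative_infinity (the HF Python code fills with torch.finfo(dtype).min rather than negative infinity, which avoids 0 * -inf turning unmasked positions into NaN):

use tch::{Device, Kind, Tensor};

fn main() {
    // 1 = attend, 0 = masked; Int64 as it typically arrives from the tokenizer.
    let attention_mask = Tensor::ones(&[1, 4], (Kind::Int64, Device::Cpu));
    let input_embeddings = Tensor::zeros(&[1, 4, 8], (Kind::Float, Device::Cpu));
    // Fill masked positions with the dtype minimum (here f32::MIN for Kind::Float).
    let extended_mask: Tensor = (attention_mask.ones_like()
        - attention_mask.to_kind(input_embeddings.kind()))
        * (f32::MIN as f64);
    println!("{:?}", extended_mask.size());
}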

@lerouxrgd (Contributor, Author)

I see, thanks! I'll give it a try on Monday.

@lerouxrgd (Contributor, Author)

So it compiles fine, but sadly it doesn't change the results... There are still some changes made in HF between 4.20.1 and 4.21.0 that affect the correctness results, and I haven't found what they are.

@guillaume-be (Owner)

@lerouxrgd I noticed a discrepancy when implementing LongT5: the masked positions are not filled with NEGATIVE_INFINITY but with MIN. I have created a new method to access either the infinity or the min/max values for the different dtypes; can you please try using the get_min method:

pub(crate) fn get_min(kind: Kind) -> Result<Scalar, RustBertError> {

which is merged as part of #333, and see if this helps?

@lerouxrgd (Contributor, Author) commented on Feb 12, 2023

Yes, this is what I have already tried, as you can see here and here.

@lerouxrgd (Contributor, Author)

@guillaume-be Good news: after a long println debugging session, I've finally figured out that the main issue was that I didn't set the attention mask in my unit test. I guess this was still OK with the previous attention implementation, but not with the latest one.

@guillaume-be (Owner)

This is awesome, thank you so much for your patience in getting to the bottom of this! I have one last request (moving the run with the actual GPT-J weights to the examples), and then this should be good to merge.

@lerouxrgd (Contributor, Author)

OK, I think everything is fine now. I have made the correctness test runnable on both CPU and GPU (forcing float16 in the GPU case, which in turn relaxes the tolerance to 1e-2 instead of 1e-4).

Let me know if some other changes are needed.

@guillaume-be (Owner) left a comment

This is great - thanks again for this PR, @lerouxrgd!

@guillaume-be merged commit c448862 into guillaume-be:master on Feb 15, 2023
@lerouxrgd deleted the gpt-j branch on February 15, 2023 at 19:12