Add GPT-J support (#285) #288

Merged: 5 commits into guillaume-be:master on Feb 15, 2023
Conversation

@lerouxrgd (Contributor) commented on Oct 13, 2022

Remaining points:

  • The GptJModelResources rust_model.ot files should be added to the HF Hub (I have never uploaded models there, but I can try if you want).
  • GptJConfig::use_float16 defaults to true; do you want to keep it this way?
  • GptJConfig::preload_on_cpu defaults to true; do you want to keep it this way?

Note that GptJConfig::preload_on_cpu makes it possible to load GPT-J 6B (both float16, and float32 with on-the-fly conversion to float16) on a T4 GPU (~16 GB VRAM). Without this flag, there are memory issues when loading the weights with var_store.load(weights_path).
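For context, a rough back-of-the-envelope estimate (the parameter count below is approximate) of why float16 is needed to fit the weights in a T4's 16 GB of VRAM:

fn main() {
    let params = 6.05e9_f64; // approximate GPT-J 6B parameter count
    let gib = 1024_f64.powi(3);
    // fp32: roughly 22-23 GiB of weights alone, more than a T4 offers
    println!("fp32 weights: ~{:.1} GiB", params * 4.0 / gib);
    // fp16: roughly 11-12 GiB, which fits with room to spare for activations
    println!("fp16 weights: ~{:.1} GiB", params * 2.0 / gib);
}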

@guillaume-be (Owner) left a comment

Hello @lerouxrgd,
This is great, thank you very much for this PR! Could you please open pull requests on the Huggingface model hub to upload the model weights that are registered in the crate (you may want to reference this PR in the PRs there)? It would also be great to extend the tests with a valid generation example; the crate has a flag that disables large tests by default (in the CI pipeline) but allows running them manually on a larger machine.

The default values for use_float16 and preload_on_cpu both sound reasonable to me.

if isinstance(v, Tensor):
    nps[k] = np.ascontiguousarray(v.cpu().numpy().astype(np.float32))
@guillaume-be (Owner)

I am wondering if this breaks a current model implementation. It looks like this was added when implementing MBart, but the weights on the Huggingface model hub do not seem to differ. Maybe we could make the target precision an argument?

@lerouxrgd (Contributor, Author) commented on Oct 16, 2022

Done

src/gpt_j/mod.rs Outdated
//! # }
//! ```

pub(crate) mod attention;
@guillaume-be (Owner)

Is this level of visibility needed for attention and transformer?

@lerouxrgd (Contributor, Author) commented on Oct 16, 2022

Indeed it's not needed; I took it from GPT2. I'll remove the pub(crate).

@lerouxrgd (Contributor, Author)

Done

);
let dim_per_head = config.n_embd / config.n_head;

let scale_attn = Tensor::from(dim_per_head as f32)
@guillaume-be (Owner)

Is there value in storing this as a Tensor as opposed to an f32? I believe broadcasting should work with the float value as well, and it is a bit lighter to manipulate and store.

@lerouxrgd (Contributor, Author)

Good point, I originally used a Tensor as in the Python implementation. However, it looks like the broadcasting is only available for f64 via <Scalar as From<f64>>, which is fine I guess.
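For illustration, a minimal tch-rs sketch of the scalar broadcasting discussed here (shapes and config values are made up); dividing a Tensor by a plain f64 goes through the Scalar conversion, so no extra Tensor needs to be stored on the layer:

use tch::{Device, Kind, Tensor};

fn main() {
    let (n_embd, n_head) = (4096_i64, 16_i64); // hypothetical config values
    let dim_per_head = n_embd / n_head;
    // Store the scale as a plain float instead of a Tensor.
    let scale_attn = (dim_per_head as f64).sqrt();
    let attn_weights = Tensor::rand(&[1, n_head, 8, 8], (Kind::Float, Device::Cpu));
    // Division by an f64 broadcasts over the whole tensor via <Scalar as From<f64>>.
    let scaled = attn_weights / scale_attn;
    println!("{:?}", scaled.size());
}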

@lerouxrgd (Contributor, Author)

Done

let out_proj = nn::linear(p / "out_proj", config.n_embd, config.n_embd, linear_config);
if config.use_float16 {
    (p / "out_proj").half();
}
@guillaume-be (Owner)

It might be enough to call p.half() once for all sublayers of the attention (it should propagate to all the sub-paths).

@lerouxrgd (Contributor, Author)

Right, but I think that converting to half precision as early as possible is safer, especially since by layer 28 the memory usage starts to become significant.

    tensor.permute(&[0, 2, 1, 3]) // (batch, head, seq_length, head_features)
} else {
    panic!(
        "Input tensor rank should be one of [4, 5], but is: {}",
@guillaume-be (Owner)

I guess it should be one of [4, 5] or a rotary head?

@lerouxrgd (Contributor, Author)

Done


let (layer_past, _layer_past_length) = match layer_past {
    Some(value) => {
        assert_eq!(
@guillaume-be (Owner)

Since this method returns a Result, you could return an Error rather than panicking.

@lerouxrgd (Contributor, Author)

Done

    .unsqueeze(2)
    .to_kind(input_embeddings.kind());

let attention_mask: Tensor = (1.0 - attention_mask) * (-10000.0);
@guillaume-be (Owner)

Is this valid for both fp32 and fp16?

@lerouxrgd (Contributor, Author)

It looked OK in my tests (the same text was generated as in Python) with models of both precisions. Do you have another kind of test in mind for this in particular?

@lerouxrgd (Contributor, Author) commented on Oct 16, 2022

I guess the test you suggested with the tiny model will help; I'll let you know the results.

@guillaume-be (Owner)

I am not sure if you had the chance to test this. Could you please run the Python version with half-precision weights and update the "correctness" integration tests with this transformation?

@lerouxrgd (Contributor, Author)

Well, using half precision doesn't work well on CPU (CPUs usually don't have float16 kernels), so this would require CUDA and a GPU to run, which is not what the GitHub CI provides, I guess.

I will still try to run it on a GPU on my side to check the correctness.

@guillaume-be (Owner)

I usually try to run a broader set of tests using a GPU. Maybe a solution could be to add a test for half precision that only runs on GPU, i.e.

#[test]
fn test_gpt_j_half_precision() {
    if Cuda::is_available() {
        // ... half-precision assertions here ...
    }
}

@lerouxrgd (Contributor, Author) commented on Jan 3, 2023

While giving it a try, I realized that with the latest Python transformers I no longer get the same results for the "correctness" test. After investigating, it looks like HF made changes to the GPT-J implementation in 4.21.0. I will try to adapt the PR accordingly first.

@lerouxrgd (Contributor, Author)

Interestingly, the generation test is still fine...

@lerouxrgd (Contributor, Author)

In the end I have made the correctness test use the CUDA device if available. On CPU this is the same test as before, with a 1e-4 tolerance. On GPU the test forces float16, and in that case correctness only holds at a 1e-2 tolerance.
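A minimal sketch of that device/tolerance switch (illustrative only, not the exact test code):

use tch::{Cuda, Device};

fn main() {
    let device = Device::cuda_if_available();
    // Force float16 on GPU; keep float32 on CPU, where some half kernels are missing.
    let use_half = Cuda::is_available();
    let tolerance = if use_half { 1e-2 } else { 1e-4 };
    println!("device: {:?}, float16: {}, tolerance: {}", device, use_half, tolerance);
}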

@@ -0,0 +1,89 @@
use rust_bert::gpt_j::{
@guillaume-be (Owner)

Can you please also add this test to the integration tests (after --test sentence_embeddings)?

@lerouxrgd (Contributor, Author)

Done

tests/gpt_j.rs Outdated
/// gen_texts = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)
/// ````
#[test]
fn test_generation_gpt_j() -> anyhow::Result<()> {
@guillaume-be (Owner)

Thank you for adding tests.

I would suggest the following:

  • For the tiny test that will run in the pipeline, please instead do a forward pass through the model by passing an input tensor and a past tensor, and compare the result to the output generated by Python, e.g. to the 4th decimal, as in
    (model_output.decoder_output.double_value(&[0, 0, 0]) - -2.047429323196411).abs() < 1e-4
    (a self-contained sketch of this kind of assertion follows after this list).
  • It would also be nice to add a test for the full-sized model: can you add a test that generates a sensible output using float16 on CPU, flagging it with #[cfg_attr(not(feature = "all-tests"), ignore)]?
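A self-contained sketch of that kind of assertion (the tensor below merely stands in for the model output, and the reference value is illustrative):

use tch::{Device, Kind, Tensor};

fn main() {
    // Stand-in for the first logit of the tiny model's forward pass.
    let model_output = Tensor::zeros(&[1, 1, 1], (Kind::Double, Device::Cpu)) - 2.047429323196411;
    let expected = -2.047429323196411; // value taken from the Python reference run
    let diff = (model_output.double_value(&[0, 0, 0]) - expected).abs();
    assert!(diff < 1e-4, "logit mismatch: {}", diff);
}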

@lerouxrgd (Contributor, Author) commented on Oct 16, 2022

I'll work on the test you suggested for the tiny model.

As for the full-sized model using float16, it won't run on the CPU; when I tried I got a Torch error: "LayerNormKernelImpl" not implemented for 'Half'. I think some float16 kernels are only implemented for CUDA, not for CPU inference. Using the float32 version requires at least 32 GB of RAM, though. I can write a test that runs on the GPU using float16 when possible, and on the CPU using float32 otherwise. What do you think?

"gpt-j-6B/model",
"https://huggingface.co/EleutherAI/gpt-j-6B/resolve/float16/rust_model.ot",
);
pub const GPT_J_TINY_RANDOM: (&'static str, &'static str) = (
@guillaume-be (Owner)

It would be good to clarify the license for this model with the author or the Huggingface team

@guillaume-be (Owner)

@lerouxrgd, to my understanding this pull request is held up by the pending model weights pull request on the Huggingface hub. It seems that hub model contributors do not automatically get notifications when an issue/PR is created on their models; you may want to try reaching out directly to the authors (if you can find their contact information) or to the Huggingface maintainers.

@lerouxrgd (Contributor, Author)

Yes, I have reached out to maintainers but I don't have a final answer yet.

@guillaume-be (Owner)

Hello @lerouxrgd,

If merging the Rust weights proves difficult, I suggest updating this pull request to remove the pretrained weights and replace them with a comment explaining how to generate the weights from the Python version. If and when the original model authors approve hosting Rust weights (or encourage creating a new model on the Hub for them), you could open a new pull request making those weights available to the community.

What do you think?

@lerouxrgd (Contributor, Author) commented on Dec 23, 2022

Hello @guillaume-be,

It's sad to see that the model's maintainers aren't really willing to add the Rust weights, nor to give constructive feedback. So I think you're right: we can explain in comments how to create them. I have updated the PR accordingly; let me know what you think.

tests/gpt_j.rs Outdated
/// [gpt-j-6B-float16]:https://huggingface.co/EleutherAI/gpt-j-6B/tree/float16
#[test]
#[cfg_attr(not(feature = "all-tests"), ignore)]
fn gpt_j_generation() -> anyhow::Result<()> {
@guillaume-be (Owner)

If these weights are not available, could we please remove this from the integration tests? It could be added as an example instead, to be run manually if you wish.

You could also try loading the pytorch_model.bin model file with the latest tch-rs update. It does not work for all model files yet, but if it is compatible with the GPT-J files this would allow loading the model directly from pretrained resources.

@lerouxrgd (Contributor, Author)

It looks like I also get an "Expected GenericDict but got None" error, as mentioned in tch-rs/issues/595.

@guillaume-be (Owner)

Ah that's unfortunate, thank you for checking anyway. It may make sense to move this code snippet to the examples if the download/conversion cannot be easily automated - what do you think?

@guillaume-be (Owner)

@lerouxrgd it looks like the PR is almost ready to be merged -- any chance this could be moved to an example? This would give more visibility to this new feature, and keep all tests independent of pre-download/conversion steps

@lerouxrgd (Contributor, Author)

That's exactly what I was doing :)

@lerouxrgd (Contributor, Author)

Done (and I have re-checked that it works fine on a T4 GPU).

@guillaume-be (Owner)

Hello @lerouxrgd,
I see you opened LaurentMazare/tch-rs#604.

I want to point out that the existing utilities in https://github.com/guillaume-be/rust-bert/blob/master/src/common/kind.rs may be useful for completing this pull request.

Please feel free to extend them if you need values other than MAX/Infinity.

@lerouxrgd (Contributor, Author)

Oh, I wasn't aware of it, thanks for pointing it out.

Although I think it's only half of the story, the other half being impl From<half::f16> for Scalar etc. for all Elements (or Kinds?), so that they can be multiplied with a Tensor. Let's see how the issue goes in tch-rs.

@guillaume-be (Owner)

Could you please share a pointer to what you are trying to implement?

@lerouxrgd (Contributor, Author)

So this line:

let attention_mask: Tensor = (1.0 - attention_mask) * (-10000.0);

works because of impl From<f64> for Scalar and impl<S: Into<Scalar>> Mul<S> for Tensor. However, as mentioned in the tch-rs issue, the attention mask in the HuggingFace Python code is implemented as follows since version 4.21.0:

attention_mask = (1.0 - attention_mask) * torch.finfo(self.dtype).min

So we need two things to port this code correctly:

  1. Something in Rust similar to torch.finfo(self.dtype).min to get the correct value (instead of the hardcoded -10000.0); a sketch of such a helper is shown after this list.
  2. The ability to multiply a Tensor by the resulting min value, which in the float16 case is not currently possible, as I mentioned in the tch-rs issue.
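For illustration, a hedged sketch of what a torch.finfo(dtype).min equivalent could look like (rust-bert's actual helper lives in src/common/kind.rs and returns a Result; the half crate is assumed to be available as a dependency):

use tch::{Kind, Scalar};

// Illustrative only: map a Kind to the minimum finite value of its dtype.
fn kind_min(kind: Kind) -> Option<Scalar> {
    match kind {
        Kind::Half => Some(Scalar::float(half::f16::MIN.to_f64())),
        Kind::Float => Some(Scalar::float(f32::MIN as f64)),
        Kind::Double => Some(Scalar::float(f64::MIN)),
        _ => None,
    }
}

fn main() {
    assert!(kind_min(Kind::Half).is_some());
}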

@lerouxrgd (Contributor, Author)

Writing this I realize that even the 1.0 - attention_mask part is probably not correct, as the same remark holds for impl<S: Into<Scalar>> Add<S> for Tensor, and some implicit conversion to f64 is already happening... which might defeat the point of using float16, I guess.

So maybe simply using your function for getting a Kind's min is enough. What do you think?

@guillaume-be (Owner)

The attention mask may be of type Kind::Bool or Kind::Int64 when the subtraction happens. I was working on a LongT5 implementation this morning and implemented that same mechanism as:

(attention_mask.ones_like()
            - attention_mask.to_kind(input_embeddings.kind()))
            * get_negative_infinity(input_embeddings.kind()).unwrap()

where input_embeddings is the input tensor to this layer (you may have hidden_states or another tensor you can use as a reference).
I have not yet tested it, but it compiles fine; maybe this helps?
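A self-contained variant of that snippet for reference, using Kind::Float and f32::MIN as the fill value in place of get_negative_infinity (the HF Python code fills with torch.finfo(dtype).min rather than negative infinity, which avoids 0 * -inf turning unmasked positions into NaN):

use tch::{Device, Kind, Tensor};

fn main() {
    // 1 = attend, 0 = masked; Int64 as it typically arrives from the tokenizer.
    let attention_mask = Tensor::ones(&[1, 4], (Kind::Int64, Device::Cpu));
    let input_embeddings = Tensor::zeros(&[1, 4, 8], (Kind::Float, Device::Cpu));
    // Fill masked positions with the dtype minimum (here f32::MIN for Kind::Float).
    let extended_mask: Tensor = (attention_mask.ones_like()
        - attention_mask.to_kind(input_embeddings.kind()))
        * (f32::MIN as f64);
    println!("{:?}", extended_mask.size());
}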

@lerouxrgd (Contributor, Author)

I see, thanks! I'll give it a try on Monday.

@lerouxrgd (Contributor, Author)

So it compiles fine, but sadly it doesn't change the results... There are still some changes made in HF between 4.20.1 and 4.21.0 that affect the correctness results, and I haven't found what they are.

@guillaume-be (Owner)

@lerouxrgd I noticed a discrepancy when implementing LongT5: the masked positions are not filled with NEGATIVE_INFINITY but with MIN. I have created a new method to access either the infinity or the min/max values for the different dtypes; can you please try using the get_min method:

pub(crate) fn get_min(kind: Kind) -> Result<Scalar, RustBertError> {

which is merged as part of #333, and see if this helps?

@lerouxrgd (Contributor, Author) commented on Feb 12, 2023

Yes, this is what I have already tried, as you can see here and here.

@lerouxrgd (Contributor, Author)

@guillaume-be Good news: after a long println debugging session, I've finally figured out that the main issue was that I didn't set the attention mask in my unit test. I guess this was still OK with the previous attention implementation, but not with the latest one.

@guillaume-be (Owner)

This is awesome, thank you so much for your patience in getting to the bottom of this! I have one last request (moving the run with the actual GPT-J weights to the examples), and then this should be good to merge.

@lerouxrgd (Contributor, Author)

OK, I think everything is fine now. I have made the correctness test runnable on both CPU and GPU (forcing float16 in the GPU case, which in turn relaxes the tolerance to 1e-2 instead of 1e-4).

Let me know if some other changes are needed.

@guillaume-be (Owner) left a comment

This is great - thanks again for this PR, @lerouxrgd!

@guillaume-be merged commit c448862 into guillaume-be:master on Feb 15, 2023
@lerouxrgd deleted the gpt-j branch on February 15, 2023 at 19:12