
⚠️⚠️[T5Tokenize] Fix T5 family tokenizers⚠️⚠️ #24565

Merged: 12 commits into huggingface:main on Jun 30, 2023

Conversation

@ArthurZucker (Collaborator) commented Jun 29, 2023

What does this PR do?

Fixes the T5Tokenizer (not the fast one yet); this also addresses part of #11531.
When converting UMT5 I created a reproduction snippet for any t5x model from the original repo. I realized that a very small variation in the input completely changes the output for non-finetuned models. The issue lies in the way we process <extra_id_xx> tokens.

Example:

# t5-base tokenizer
>>> tokenizer.encode("<extra_id_0>. Hello", add_special_tokens = False)
[32099, 3, 5, 8774] # ['<extra_id_0>', ' ▁', '.', '▁Hello']
# seqio.SentencePieceVocabulary(vocab_path, extra_ids = 300)
>>> processor.encode("<extra_id_0>. Hello")
[32099, 5, 8774] # ['<extra_id_0>', '.', '▁Hello']

# After the fix:
>>> tokenizer.encode("<extra_id_0>. Hello", add_special_tokens = False)
[32099, 5, 8774] # ['<extra_id_0>', '.', '▁Hello']

The reason is that t5x wraps around sentencepiece and adds the extra ids to the vocab, but they are not saved that way.
We don't add them to the vocab, so when we tokenize, we split on special tokens; the sentencepiece model therefore only sees:

>>> tokenizer.sp_model.encode(". Hello")
[273, 274, 9] 

The original model, however, never sees a "." (or many other characters) on its own, and thus we add an extra space...
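
To make this concrete, here is a minimal reproduction sketch (assuming `transformers` and `sentencepiece` are installed and the t5-base checkpoint is available; exact ids depend on the vocabulary):

```python
# Minimal reproduction sketch: the special token is split off before the text
# reaches SentencePiece, so the model only ever sees ". Hello" on its own.
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")

# With the dummy prefix SentencePiece inserts, this yields a stray '▁' piece
# before the '.', something the original t5x pipeline never produces:
print(tok.sp_model.encode(". Hello", out_type=str))
```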

This is a bug fix with regard to training; it is breaking in the sense that it now removes the space.

TODO:

  • Extra checks should be added to make sure this does not strip anything else (like a leading "."). For example, that would break tokenizer.encode(". Hello"), as it would remove the prefix space that is normally added; a regression check along these lines is sketched below.
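
A hypothetical shape for such a check (the test name and exact assertion are illustrative, not the actual tests added in this PR):

```python
from transformers import T5Tokenizer

def test_prefix_space_not_stripped():
    tok = T5Tokenizer.from_pretrained("t5-base")
    # With no special token in front, ". Hello" should tokenize exactly like
    # the raw SentencePiece model output, keeping the usual prefix space.
    assert tok.tokenize(". Hello") == tok.sp_model.encode(". Hello", out_type=str)
```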

@HuggingFaceDocBuilderDev commented Jun 29, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker marked this pull request as ready for review June 29, 2023 03:28
@ArthurZucker (Collaborator, Author) commented Jun 29, 2023

Actually, the Switch Transformers and T5 tests have to be updated!
This means I have to check whether the models were trained with this extra token (i.e. whether they used the HF tokenizer) or not.

  • tests.models.instructblip.test_modeling_instructblip.InstructBlipModelIntegrationTest testMethod=test_inference_flant5_xl is failing on main too, so not related.
  • tests.models.mt5.test_modeling_flax_mt5.MT5IntegrationTest also fails on main.

  • tests/models/t5/test_tokenization_t5.py: the issue comes from the convert_slow modification. Need to investigate:
    - [ ] tests/models/t5/test_tokenization_t5.py:399 T5TokenizationTest.test_get_sentinel_token_ids_for_fasttokenizer
    - [ ] tests/test_tokenization_common.py:3425 T5TokenizationTest.test_save_pretrained
    - [ ] tests/models/t5/test_tokenization_t5.py:271 T5TokenizationTest.test_special_tokens_initialization

@ArthurZucker (Collaborator, Author) commented:

This can also be made non-breaking with a flag. Up for debate, since it is a bug fix.

@sgugger (Collaborator) left a comment:

Thanks for the fix! Let's roll with it since it's a bug fix and if people complain about the breaking change we will see if we add a flag to enable the buggy behavior.

Review comments on src/transformers/models/t5/tokenization_t5.py (outdated, resolved)
ArthurZucker and others added 3 commits June 30, 2023 03:54
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@ArthurZucker merged commit b52a03c into huggingface:main on Jun 30, 2023
4 checks passed
@dtiarks mentioned this pull request Jun 30, 2023
@ArthurZucker (Collaborator, Author) commented Jul 2, 2023

Edit: just to make sure, I did more testing and unfortunately there is one bug:

>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁', '▁Hello', '<extra_id_0>']

instead of

>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁Hello', '<extra_id_0>']

This is because we have to prepend a ▁ (SPIECE_UNDERLINE) instead of a space (text = SPIECE_UNDERLINE + text); see the sketch below. Not a single test caught this when running pytest tests -k t5, which is interesting.
Fixing asap and adding tests. This is becoming very complex 😓
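
A simplified sketch of the distinction (here `sp` stands for the tokenizer's loaded tok.sp_model; the expected outputs follow the examples quoted above):

```python
SPIECE_UNDERLINE = "▁"

# A raw space is escaped to '▁' and, combined with the dummy prefix that
# SentencePiece inserts itself, leaves a stray standalone '▁' piece:
print(sp.encode(" Hello", out_type=str))                    # e.g. ['▁', '▁Hello']

# Prepending the '▁' marker directly merges with the following word:
print(sp.encode(SPIECE_UNDERLINE + "Hello", out_type=str))  # e.g. ['▁Hello']
```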

@pointonjoel commented:

I'm getting this legacy-behaviour warning when simply loading a T5 tokenizer; it appears even before the tokenizer is used. Is there an updated way to load the tokenizer? The warning appears when running the following lines of code:

from transformers import AutoTokenizer
tokeniser = AutoTokenizer.from_pretrained("google/mt5-small")

The warning is:
You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at #24565
/usr/local/lib/python3.10/dist-packages/transformers/convert_slow_tokenizer.py:470: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
warnings.warn(

@ArthurZucker (Collaborator, Author) replied:

Yep, just set legacy=False. The goal of the warning is for you to decide whether or not the legacy behaviour is alright for you, for example:
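
A usage sketch (assuming a transformers version that exposes the `legacy` flag):

```python
from transformers import AutoTokenizer

# The kwarg is forwarded to the underlying T5 tokenizer and opts in to the
# fixed behaviour, which also silences the warning.
tokeniser = AutoTokenizer.from_pretrained("google/mt5-small", legacy=False)
```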

@ArthurZucker (Collaborator, Author) commented:

You might want to open an issue on ilab.

AironasGit added a commit to AironasGit/Audio2Text2Summary that referenced this pull request Sep 22, 2024
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
mergify bot added a commit to instructlab/instructlab that referenced this pull request Oct 2, 2024
Update the split function to split on the last occurrence of "." since the file path now includes "." on macOS in InstructLab version 0.19.

**Issue resolved by this Pull Request:**
Resolves #2356

## Before

```bash
ilab model train --pipeline simple
```

```bash
##Output
ᕙ(•̀‸•́‶)ᕗ  Training has started! ᕙ(•̀‸•́‶)ᕗ 

*********
Epoch 1: Iter 1: Val loss 3.961, Val took 40.321s
Iter 010: Train loss 3.318, It/sec 0.212, Tokens/sec 88.791
Epoch 1: Iter 10: Val loss 2.477, Val took 41.355s
Traceback (most recent call last):
  File "/Users/ahmedazraq/Documents/instructlab/venv/bin/ilab", line 8, in <module>
    sys.exit(ilab())
             ^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/instructlab/clickext.py", line 319, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/instructlab/model/train.py", line 730, in train
    load_and_train(
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/instructlab/train/lora_mlx/lora.py", line 306, in load_and_train
    train_model(
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/instructlab/train/lora_mlx/lora.py", line 204, in train_model
    a, b = adapter_file.split(".")
    ^^^^
ValueError: too many values to unpack (expected 2)
```
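
A minimal sketch of the fix described above (variable names follow the traceback; the example path is modeled on the log output, and the actual change lives in instructlab's `lora.py`):

```python
# macOS paths now contain extra dots (e.g. "/Users/user/.local/..."), so
# splitting on every "." yields more than the two expected parts.
adapter_file = "/Users/user/.local/share/instructlab/checkpoints/adapters.npz"

# Buggy: a, b = adapter_file.split(".")  ->  ValueError: too many values to unpack

# Fixed: split on the last occurrence of "." only.
a, b = adapter_file.rsplit(".", 1)
print(a)  # /Users/user/.local/share/instructlab/checkpoints/adapters
print(b)  # npz
```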

## After

```bash
[INFO] Loading
Fetching 11 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 4778.60it/s]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
dtype=mlx.core.float16
[INFO] Quantizing
Using model_type='llama'
Loading pretrained model
Using model_type='llama'
Total parameters 1165.829M
Trainable parameters 2.097M
Loading datasets
*********

ᕙ(•̀‸•́‶)ᕗ  Training has started! ᕙ(•̀‸•́‶)ᕗ 

*********
Epoch 1: Iter 1: Val loss 3.957, Val took 40.587s
Iter 010: Train loss 3.302, It/sec 0.201, Tokens/sec 84.345
Epoch 1: Iter 10: Val loss 2.450, Val took 43.608s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 10: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-010.npz.
Iter 020: Train loss 1.670, It/sec 0.194, Tokens/sec 76.757
Epoch 1: Iter 20: Val loss 1.315, Val took 41.714s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 20: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-020.npz.
Iter 030: Train loss 0.902, It/sec 0.167, Tokens/sec 69.851
Epoch 1: Iter 30: Val loss 1.010, Val took 41.833s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 30: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-030.npz.
Iter 040: Train loss 0.679, It/sec 0.188, Tokens/sec 76.435
Epoch 1: Iter 40: Val loss 0.861, Val took 42.498s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 40: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-040.npz.
Iter 050: Train loss 0.618, It/sec 0.167, Tokens/sec 69.118
Epoch 1: Iter 50: Val loss 0.797, Val took 42.391s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 50: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-050.npz.
Iter 060: Train loss 0.442, It/sec 0.184, Tokens/sec 75.080
Epoch 2: Iter 60: Val loss 0.764, Val took 42.659s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 60: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-060.npz.
Iter 070: Train loss 0.496, It/sec 0.158, Tokens/sec 65.023
Epoch 2: Iter 70: Val loss 0.704, Val took 42.645s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 70: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-070.npz.
Iter 080: Train loss 0.420, It/sec 0.165, Tokens/sec 68.490
Epoch 2: Iter 80: Val loss 0.682, Val took 44.482s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 80: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-080.npz.
Iter 090: Train loss 0.422, It/sec 0.147, Tokens/sec 62.503
Epoch 2: Iter 90: Val loss 0.647, Val took 42.198s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 90: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-090.npz.
Iter 100: Train loss 0.404, It/sec 0.199, Tokens/sec 79.267
Epoch 2: Iter 100: Val loss 0.631, Val took 42.723s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 100: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-100.npz.
```

**Checklist:**

- [x] **Commit Message Formatting**: Commit titles and messages follow guidelines in the
  [conventional commits](https://www.conventionalcommits.org/en/v1.0.0/#summary).
- [ ] [Changelog](https://github.com/instructlab/instructlab/blob/main/CHANGELOG.md) updated with breaking and/or notable changes for the next minor release.
- [ ] Documentation has been updated, if necessary.
- [x] Unit tests have been added, if necessary.
- [x] Functional tests have been added, if necessary.
- [x] E2E Workflow tests have been added, if necessary.



Approved-by: jaideepr97

Approved-by: bjhargrave