
⚠️⚠️[T5Tokenize] Fix T5 family tokenizers⚠️⚠️ #24565

Merged: 12 commits into huggingface:main on Jun 30, 2023

Conversation

@ArthurZucker (Collaborator) commented Jun 29, 2023

What does this PR do?

Fixes the T5Tokenizer (not the fast one yet); this also addresses part of #11531.
When converting UMT5 I created a reproduction snippet for any t5x model from the original repo. I realized that a very small variation in the input completely changes the output for non-finetuned models. The issue lies in the way we process <extra_id_xx> tokens.

Example:

# t5-base tokenizer
>>> tokenizer.encode("<extra_id_0>. Hello", add_special_tokens = False)
[32099, 3, 5, 8774] # ['<extra_id_0>', ' ▁', '.', '▁Hello']
# seqio.SentencePieceVocabulary(vocab_path, extra_ids = 300)
>>> processor.encode("<extra_id_0>. Hello")
[32099, 5, 8774] # ['<extra_id_0>', '.', '▁Hello']

# After the fix:
>>> tokenizer.encode("<extra_id_0>. Hello", add_special_tokens = False)
[32099, 5, 8774] # ['<extra_id_0>', '.', '▁Hello']

The reason is that t5x wraps around sentencepiece and adds the extra ids to the vocab, but they are not saved that way.
We don't add them to the vocab, so when we tokenize, we split on special tokens; the sentencepiece model therefore only sees:

>>> tokenizer.sp_model.encode(". Hello")
[273, 274, 9] 

The original model, however, never sees a "." (or many other characters) on its own, and thus we add an extra space...
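
To make this concrete, here is a minimal reproduction sketch (assuming `transformers` and `sentencepiece` are installed and the t5-base checkpoint is available; exact ids depend on the vocabulary):

```python
# Minimal reproduction sketch: the special token is split off before the text
# reaches SentencePiece, so the model only ever sees ". Hello" on its own.
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")

# With the dummy prefix SentencePiece inserts, this yields a stray '▁' piece
# before the '.', something the original t5x pipeline never produces:
print(tok.sp_model.encode(". Hello", out_type=str))
```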

This is a bug fix with regard to training; it is breaking in the sense that it now removes the space.

TODO:

  • Extra checks should be added to make sure this does not strip anything else (like a leading "."). For example, that would break tokenizer.encode(". Hello"), as it would remove the prefix space that is normally added; a regression check along these lines is sketched below.
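
A hypothetical shape for such a check (the test name and exact assertion are illustrative, not the actual tests added in this PR):

```python
from transformers import T5Tokenizer

def test_prefix_space_not_stripped():
    tok = T5Tokenizer.from_pretrained("t5-base")
    # With no special token in front, ". Hello" should tokenize exactly like
    # the raw SentencePiece model output, keeping the usual prefix space.
    assert tok.tokenize(". Hello") == tok.sp_model.encode(". Hello", out_type=str)
```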

@HuggingFaceDocBuilderDev commented Jun 29, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker marked this pull request as ready for review June 29, 2023 03:28
@ArthurZucker (Collaborator, Author) commented Jun 29, 2023

Actually, the Switch Transformers and T5 tests have to be updated!
This means I have to check whether the models were trained with this extra token (i.e. whether they used the HF tokenizer) or not.

  • tests.models.instructblip.test_modeling_instructblip.InstructBlipModelIntegrationTest testMethod=test_inference_flant5_xl is failing on main too, so not related.
  • tests.models.mt5.test_modeling_flax_mt5.MT5IntegrationTest also fails on main.

  • tests/models/t5/test_tokenization_t5.py: the issue comes from the convert_slow modification. Need to investigate:
    - [ ] tests/models/t5/test_tokenization_t5.py:399 T5TokenizationTest.test_get_sentinel_token_ids_for_fasttokenizer
    - [ ] tests/test_tokenization_common.py:3425 T5TokenizationTest.test_save_pretrained
    - [ ] tests/models/t5/test_tokenization_t5.py:271 T5TokenizationTest.test_special_tokens_initialization

@ArthurZucker (Collaborator, Author) commented:

This can also be made non-breaking with a flag. Up for debate, since it is a bug fix.

@sgugger (Collaborator) left a comment:

Thanks for the fix! Let's roll with it since it's a bug fix and if people complain about the breaking change we will see if we add a flag to enable the buggy behavior.

Review comments on src/transformers/models/t5/tokenization_t5.py (outdated, resolved)
ArthurZucker and others added 3 commits June 30, 2023 03:54
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@ArthurZucker merged commit b52a03c into huggingface:main on Jun 30, 2023
4 checks passed
@dtiarks mentioned this pull request Jun 30, 2023
@ArthurZucker (Collaborator, Author) commented Jul 2, 2023

Edit: just to make sure, I did more testing and unfortunately there is one bug:

>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁', '▁Hello', '<extra_id_0>']

instead of

>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁Hello', '<extra_id_0>']

This is because we have to prepend a ▁ (SPIECE_UNDERLINE) instead of a space (text = SPIECE_UNDERLINE + text); see the sketch below. Not a single test caught this when running pytest tests -k t5, which is interesting.
Fixing asap and adding tests. This is becoming very complex 😓
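
A simplified sketch of the distinction (here `sp` stands for the tokenizer's loaded tok.sp_model; the expected outputs follow the examples quoted above):

```python
SPIECE_UNDERLINE = "▁"

# A raw space is escaped to '▁' and, combined with the dummy prefix that
# SentencePiece inserts itself, leaves a stray standalone '▁' piece:
print(sp.encode(" Hello", out_type=str))                    # e.g. ['▁', '▁Hello']

# Prepending the '▁' marker directly merges with the following word:
print(sp.encode(SPIECE_UNDERLINE + "Hello", out_type=str))  # e.g. ['▁Hello']
```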

@pointonjoel commented:

I'm getting this legacy-behaviour warning when simply loading a T5 tokenizer; it appears even before the tokenizer is used. Is there an updated way to load the tokenizer? The warning appears when running the following lines of code:

from transformers import AutoTokenizer
tokeniser = AutoTokenizer.from_pretrained("google/mt5-small")

The warning is:
You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at #24565
/usr/local/lib/python3.10/dist-packages/transformers/convert_slow_tokenizer.py:470: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
warnings.warn(

@ArthurZucker (Collaborator, Author) replied:

Yep, just set legacy=False. The goal of the warning is for you to decide whether or not the legacy behaviour is alright for you, for example:
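
A usage sketch (assuming a transformers version that exposes the `legacy` flag):

```python
from transformers import AutoTokenizer

# The kwarg is forwarded to the underlying T5 tokenizer and opts in to the
# fixed behaviour, which also silences the warning.
tokeniser = AutoTokenizer.from_pretrained("google/mt5-small", legacy=False)
```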

@ArthurZucker (Collaborator, Author) commented:

You might want to open an issue on ilab.

AironasGit added a commit to AironasGit/Audio2Text2Summary that referenced this pull request Sep 22, 2024
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
mergify bot added a commit to instructlab/instructlab that referenced this pull request Oct 2, 2024
Update the split function to split on the last occurrence of "." since the file path now includes "." on macOS in InstructLab version 0.19.

**Issue resolved by this Pull Request:**
Resolves #2356

## Before

```bash
ilab model train --pipeline simple
```

```bash
##Output
ᕙ(•̀‸•́‶)ᕗ  Training has started! ᕙ(•̀‸•́‶)ᕗ 

*********
Epoch 1: Iter 1: Val loss 3.961, Val took 40.321s
Iter 010: Train loss 3.318, It/sec 0.212, Tokens/sec 88.791
Epoch 1: Iter 10: Val loss 2.477, Val took 41.355s
Traceback (most recent call last):
  File "/Users/ahmedazraq/Documents/instructlab/venv/bin/ilab", line 8, in <module>
    sys.exit(ilab())
             ^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/instructlab/clickext.py", line 319, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/instructlab/model/train.py", line 730, in train
    load_and_train(
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/instructlab/train/lora_mlx/lora.py", line 306, in load_and_train
    train_model(
  File "/Users/ahmedazraq/Documents/instructlab/venv/lib/python3.11/site-packages/instructlab/train/lora_mlx/lora.py", line 204, in train_model
    a, b = adapter_file.split(".")
    ^^^^
ValueError: too many values to unpack (expected 2)
```
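
A minimal sketch of the fix described above (variable names follow the traceback; the example path is modeled on the log output, and the actual change lives in instructlab's `lora.py`):

```python
# macOS paths now contain extra dots (e.g. "/Users/user/.local/..."), so
# splitting on every "." yields more than the two expected parts.
adapter_file = "/Users/user/.local/share/instructlab/checkpoints/adapters.npz"

# Buggy: a, b = adapter_file.split(".")  ->  ValueError: too many values to unpack

# Fixed: split on the last occurrence of "." only.
a, b = adapter_file.rsplit(".", 1)
print(a)  # /Users/user/.local/share/instructlab/checkpoints/adapters
print(b)  # npz
```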

## After

```bash
[INFO] Loading
Fetching 11 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 4778.60it/s]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
dtype=mlx.core.float16
[INFO] Quantizing
Using model_type='llama'
Loading pretrained model
Using model_type='llama'
Total parameters 1165.829M
Trainable parameters 2.097M
Loading datasets
*********

ᕙ(•̀‸•́‶)ᕗ  Training has started! ᕙ(•̀‸•́‶)ᕗ 

*********
Epoch 1: Iter 1: Val loss 3.957, Val took 40.587s
Iter 010: Train loss 3.302, It/sec 0.201, Tokens/sec 84.345
Epoch 1: Iter 10: Val loss 2.450, Val took 43.608s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 10: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-010.npz.
Iter 020: Train loss 1.670, It/sec 0.194, Tokens/sec 76.757
Epoch 1: Iter 20: Val loss 1.315, Val took 41.714s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 20: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-020.npz.
Iter 030: Train loss 0.902, It/sec 0.167, Tokens/sec 69.851
Epoch 1: Iter 30: Val loss 1.010, Val took 41.833s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 30: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-030.npz.
Iter 040: Train loss 0.679, It/sec 0.188, Tokens/sec 76.435
Epoch 1: Iter 40: Val loss 0.861, Val took 42.498s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 40: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-040.npz.
Iter 050: Train loss 0.618, It/sec 0.167, Tokens/sec 69.118
Epoch 1: Iter 50: Val loss 0.797, Val took 42.391s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 50: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-050.npz.
Iter 060: Train loss 0.442, It/sec 0.184, Tokens/sec 75.080
Epoch 2: Iter 60: Val loss 0.764, Val took 42.659s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 60: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-060.npz.
Iter 070: Train loss 0.496, It/sec 0.158, Tokens/sec 65.023
Epoch 2: Iter 70: Val loss 0.704, Val took 42.645s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 70: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-070.npz.
Iter 080: Train loss 0.420, It/sec 0.165, Tokens/sec 68.490
Epoch 2: Iter 80: Val loss 0.682, Val took 44.482s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 80: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-080.npz.
Iter 090: Train loss 0.422, It/sec 0.147, Tokens/sec 62.503
Epoch 2: Iter 90: Val loss 0.647, Val took 42.198s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 90: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-090.npz.
Iter 100: Train loss 0.404, It/sec 0.199, Tokens/sec 79.267
Epoch 2: Iter 100: Val loss 0.631, Val took 42.723s
/Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters
npz
Iter 100: Saved adapter weights to /Users/ahmedazraq/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-100.npz.
```

**Checklist:**

- [x] **Commit Message Formatting**: Commit titles and messages follow guidelines in the
  [conventional commits](https://www.conventionalcommits.org/en/v1.0.0/#summary).
- [ ] [Changelog](https://github.com/instructlab/instructlab/blob/main/CHANGELOG.md) updated with breaking and/or notable changes for the next minor release.
- [ ] Documentation has been updated, if necessary.
- [x] Unit tests have been added, if necessary.
- [x] Functional tests have been added, if necessary.
- [x] E2E Workflow tests have been added, if necessary.



Approved-by: jaideepr97

Approved-by: bjhargrave