
[Patch-t5-tokenizer] Patches the changes on T5 to make sure previous behaviour is still valide for beginning of words #24622

Merged
merged 59 commits into huggingface:main on Jul 11, 2023

Conversation

@ArthurZucker (Collaborator) commented Jul 2, 2023

What does this PR do?

There was a small typo introduced in #24565 that modified the behaviour, and the tests were not able to catch it (see #24569).
When a sentence does not start with a space, a space was added.
Before:

>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁', '▁Hello', '<extra_id_0>']

After:

>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁Hello', '<extra_id_0>']

Big bug: 35 models are involved

Not only punctuation: anything that comes after a special token is basically tokenized wrong... Let's ignore for now that we also split when it's the beginning of a word, which is less important.

Tests were added, and they were green before merging.
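For anyone who wants to reproduce the comparison locally, here is a minimal sketch (the t5-small checkpoint is only an example, and no exact token ids are asserted, only the absence of the spurious leading `▁`):

    from transformers import T5Tokenizer

    # Slow (sentencepiece-based) T5 tokenizer; any T5 checkpoint should
    # show the same behaviour.
    tokenizer = T5Tokenizer.from_pretrained("t5-small")

    tokens = tokenizer.tokenize("Hello <extra_id_0>")
    print(tokens)

    # With the typo from #24565 a spurious "▁" was prepended even though
    # the sentence does not start with a space; after this patch it is gone.
    assert tokens[0] != "▁"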

@ArthurZucker ArthurZucker marked this pull request as ready for review July 2, 2023 03:58
@HuggingFaceDocBuilderDev commented Jul 2, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker (Collaborator, Author)

Ran all the tests with RUN_SLOW; the switch CI is fixed.

@amyeroberts (Collaborator) left a comment

Thanks for fixing! The fix looks OK to me. I'm not super familiar with tokenizers, so it would be good to have a second approval before merging.

Do we have equivalent tests for other tokenizers in the library?

src/transformers/models/t5/tokenization_t5.py
@ArthurZucker ArthurZucker requested a review from Narsil July 3, 2023 11:41
@Narsil (Contributor) left a comment

I don't feel good about this kind of change.

Adding ifs to tokenization methods is, in general, asking for trouble, as the number of different cases is astronomical (we have too much surface).

I wasn't a big fan of the original fix either.

IMO we need to strengthen the test suite much more.

@Narsil (Contributor) left a comment

LGTM.

  • legacy means we are not modifying existing tokenizers without knowing it. (And we need to manually update those core tokenizers.)
  • The test suite is more extensive, so if we make code modifications later I feel more confident we're not going to re-break things unknowingly (either way).
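As a rough illustration of how the flag is meant to be used from the user side (a sketch only, assuming the keyword argument added in this PR is exposed as `legacy` on `from_pretrained`):

    from transformers import T5Tokenizer

    # Default: keep the historical behaviour so already-trained models are
    # not silently changed; a warning should point users at the `legacy` flag.
    legacy_tok = T5Tokenizer.from_pretrained("t5-small")

    # Explicit opt-in to the corrected handling of text after special tokens.
    fixed_tok = T5Tokenizer.from_pretrained("t5-small", legacy=False)

    print(legacy_tok.tokenize("Hello <extra_id_0> world"))
    print(fixed_tok.tokenize("Hello <extra_id_0> world"))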

@amyeroberts (Collaborator) left a comment

Thanks for iterating on this!

Just to make sure I've understood: was this bug always part of the T5 tokenizer, or was it introduced in #24565?
I'm asking because of the default value of legacy: at the moment any new tokenizers created will default to the legacy behaviour. If the bug was recently introduced in #24565, then it wasn't part of the most recent release, and so we can default to the "new" behaviour.

@@ -177,15 +178,6 @@
)


if is_protobuf_available():
Collaborator

Why is this deleted?

Collaborator Author

Because it's messing up the import. By deleting it here, we can at least run the test individually

@ydshieh (Collaborator) commented Jul 11, 2023

I am not sure what's messed up. With the fix in #24689, everything should be fine.

Also, moving this to another place might count as a breaking change, since

from transformers.utils import sentencepiece_model_pb2

won't work anymore (even though direct usage of it is probably low).

See here.

Could we bring it back quickly, and could you open an issue describing the import error you got? Thanks!

Collaborator Author

Having a look

Collaborator Author

Also, `from transformers.utils import sentencepiece_model_pb2` works for me.

Collaborator Author

Comes from #24690

Collaborator Author

The idea is that we only import it once, where we actually use it.
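In other words, roughly the following pattern (a sketch with a hypothetical helper name, not the actual code in this PR): the generated protobuf module is imported lazily, inside the only function that needs it, so its descriptors are registered only when sentencepiece conversion actually runs.

    def load_sentencepiece_model_proto(model_file):
        # Hypothetical helper: import the generated protobuf module right
        # where it is used instead of at the top of the module, so a plain
        # `import transformers` does not register sentencepiece_model.proto
        # in the protobuf descriptor pool.
        from transformers.utils import sentencepiece_model_pb2 as model_pb2

        proto = model_pb2.ModelProto()
        with open(model_file, "rb") as f:
            proto.ParseFromString(f.read())
        return proto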

@ArthurZucker (Collaborator, Author) commented Jul 11, 2023

This does not fix the version error, but it fixes the issue with protobuf 3.20.3, where we cannot use seqio or anything else that imports protobuf:

! pip install protobuf==3.20.3
from transformers import AutoTokenizer
from seqio import SentencePieceVocabulary

In [3]: from seqio import SentencePieceVocabulary
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 from seqio import SentencePieceVocabulary

File /opt/conda/envs/py39/lib/python3.9/site-packages/seqio/__init__.py:18
     15 """Import to top-level API."""
     16 # pylint:disable=wildcard-import,g-bad-import-order
---> 18 from seqio.dataset_providers import *
     19 from seqio import evaluation
     20 from seqio import experimental

File /opt/conda/envs/py39/lib/python3.9/site-packages/seqio/dataset_providers.py:39
     37 from packaging import version as version_lib
     38 import pyglove as pg
---> 39 from seqio import metrics as metrics_lib
     40 from seqio import preprocessors as seqio_preprocessors
     41 from seqio import task_registry_provenance_tracking

File /opt/conda/envs/py39/lib/python3.9/site-packages/seqio/metrics.py:27
     25 import jax.numpy as jnp
     26 import numpy as np
---> 27 from seqio import utils
     28 import tensorflow.compat.v2 as tf
     31 @dataclasses.dataclass
     32 class MetricValue:

File /opt/conda/envs/py39/lib/python3.9/site-packages/seqio/utils.py:29
     27 from absl import logging
     28 import numpy as np
---> 29 from seqio.vocabularies import Vocabulary
     30 import tensorflow.compat.v2 as tf
     31 import tensorflow_datasets as tfds

File /opt/conda/envs/py39/lib/python3.9/site-packages/seqio/vocabularies.py:27
     24 import tensorflow.compat.v2 as tf
     25 import tensorflow_text as tf_text
---> 27 from sentencepiece import sentencepiece_model_pb2
     28 import sentencepiece as sentencepiece_processor
     30 PAD_ID = 0

File /opt/conda/envs/py39/lib/python3.9/site-packages/sentencepiece/sentencepiece_model_pb2.py:16
      9 # @@protoc_insertion_point(imports)
     11 _sym_db = _symbol_database.Default()
---> 16 DESCRIPTOR = _descriptor.FileDescriptor(
     17   name='sentencepiece_model.proto',
     18   package='sentencepiece',
     19   syntax='proto2',
     20   serialized_options=b'H\003',
     21   create_key=_descriptor._internal_create_key,
     22   serialized_pb=b'\n\x19sentencepiece_model.proto\x12\rsentencepiece\"\x80\x0c\n\x0bTrainerSpec\x12\r\n\x05input\x18\x01 \x03(\t\x12\x14\n\x0cinput_format\x18\x07 \x01(\t\x12\x14\n\x0cmodel_prefix\x18\x02 \x01(\t\x12\x41\n\nmodel_type\x18\x03 \x01(\x0e\x32$.sentencepiece.TrainerSpec.ModelType:\x07UNIGRAM\x12\x18\n\nvocab_size\x18\x04 \x01(\x05:\x04\x38\x30\x30\x30\x12\x17\n\x0f\x61\x63\x63\x65pt_language\x18\x05 \x03(\t\x12 \n\x15self_test_sample_size\x18\x06 \x01(\x05:\x01\x30\x12*\n\x1b\x65nable_differential_privacy\x18\x32 \x01(\x08:\x05\x66\x61lse\x12+\n differential_privacy_noise_level\x18\x33 \x01(\x02:\x01\x30\x12\x32\n\'differential_privacy_clipping_threshold\x18\x34 \x01(\x04:\x01\x30\x12\"\n\x12\x63haracter_coverage\x18\n \x01(\x02:\x06\x30.9995\x12\x1e\n\x13input_sentence_size\x18\x0b \x01(\x04:\x01\x30\x12$\n\x16shuffle_input_sentence\x18\x13 \x01(\x08:\x04true\x12 \n\x14mining_sentence_size\x18\x0c \x01(\x05\x42\x02\x18\x01\x12\"\n\x16training_sentence_size\x18\r \x01(\x05\x42\x02\x18\x01\x12(\n\x17seed_sentencepiece_size\x18\x0e \x01(\x05:\x07\x31\x30\x30\x30\x30\x30\x30\x12\x1e\n\x10shrinking_factor\x18\x0f \x01(\x02:\x04\x30.75\x12!\n\x13max_sentence_length\x18\x12 \x01(\x05:\x04\x34\x31\x39\x32\x12\x17\n\x0bnum_threads\x18\x10 \x01(\x05:\x02\x31\x36\x12\x1d\n\x12num_sub_iterations\x18\x11 \x01(\x05:\x01\x32\x12$\n\x18max_sentencepiece_length\x18\x14 \x01(\x05:\x02\x31\x36\x12%\n\x17split_by_unicode_script\x18\x15 \x01(\x08:\x04true\x12\x1d\n\x0fsplit_by_number\x18\x17 \x01(\x08:\x04true\x12!\n\x13split_by_whitespace\x18\x16 \x01(\x08:\x04true\x12)\n\x1atreat_whitespace_as_suffix\x18\x18 \x01(\x08:\x05\x66\x61lse\x12+\n\x1c\x61llow_whitespace_only_pieces\x18\x1a \x01(\x08:\x05\x66\x61lse\x12\x1b\n\x0csplit_digits\x18\x19 \x01(\x08:\x05\x66\x61lse\x12#\n\x19pretokenization_delimiter\x18\x35 \x01(\t:\x00\x12\x17\n\x0f\x63ontrol_symbols\x18\x1e \x03(\t\x12\x1c\n\x14user_defined_symbols\x18\x1f \x03(\t\x12\x16\n\x0erequired_chars\x18$ \x01(\t\x12\x1c\n\rbyte_fallback\x18# \x01(\x08:\x05\x66\x61lse\x12+\n\x1dvocabulary_output_piece_score\x18  \x01(\x08:\x04true\x12\x1e\n\x10hard_vocab_limit\x18! \x01(\x08:\x04true\x12\x1c\n\ruse_all_vocab\x18\" \x01(\x08:\x05\x66\x61lse\x12\x11\n\x06unk_id\x18( \x01(\x05:\x01\x30\x12\x11\n\x06\x62os_id\x18) \x01(\x05:\x01\x31\x12\x11\n\x06\x65os_id\x18* \x01(\x05:\x01\x32\x12\x12\n\x06pad_id\x18+ \x01(\x05:\x02-1\x12\x18\n\tunk_piece\x18- \x01(\t:\x05<unk>\x12\x16\n\tbos_piece\x18. 
\x01(\t:\x03<s>\x12\x17\n\teos_piece\x18/ \x01(\t:\x04</s>\x12\x18\n\tpad_piece\x18\x30 \x01(\t:\x05<pad>\x12\x1a\n\x0bunk_surface\x18, \x01(\t:\x05 \xe2\x81\x87 \x12+\n\x1ctrain_extremely_large_corpus\x18\x31 \x01(\x08:\x05\x66\x61lse\"5\n\tModelType\x12\x0b\n\x07UNIGRAM\x10\x01\x12\x07\n\x03\x42PE\x10\x02\x12\x08\n\x04WORD\x10\x03\x12\x08\n\x04\x43HAR\x10\x04*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02\"\xd1\x01\n\x0eNormalizerSpec\x12\x0c\n\x04name\x18\x01 \x01(\t\x12\x1c\n\x14precompiled_charsmap\x18\x02 \x01(\x0c\x12\x1e\n\x10\x61\x64\x64_dummy_prefix\x18\x03 \x01(\x08:\x04true\x12&\n\x18remove_extra_whitespaces\x18\x04 \x01(\x08:\x04true\x12 \n\x12\x65scape_whitespaces\x18\x05 \x01(\x08:\x04true\x12\x1e\n\x16normalization_rule_tsv\x18\x06 \x01(\t*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02\"y\n\x0cSelfTestData\x12\x33\n\x07samples\x18\x01 \x03(\x0b\x32\".sentencepiece.SelfTestData.Sample\x1a)\n\x06Sample\x12\r\n\x05input\x18\x01 \x01(\t\x12\x10\n\x08\x65xpected\x18\x02 \x01(\t*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02\"\xfe\x03\n\nModelProto\x12\x37\n\x06pieces\x18\x01 \x03(\x0b\x32\'.sentencepiece.ModelProto.SentencePiece\x12\x30\n\x0ctrainer_spec\x18\x02 \x01(\x0b\x32\x1a.sentencepiece.TrainerSpec\x12\x36\n\x0fnormalizer_spec\x18\x03 \x01(\x0b\x32\x1d.sentencepiece.NormalizerSpec\x12\x33\n\x0eself_test_data\x18\x04 \x01(\x0b\x32\x1b.sentencepiece.SelfTestData\x12\x38\n\x11\x64\x65normalizer_spec\x18\x05 \x01(\x0b\x32\x1d.sentencepiece.NormalizerSpec\x1a\xd2\x01\n\rSentencePiece\x12\r\n\x05piece\x18\x01 \x01(\t\x12\r\n\x05score\x18\x02 \x01(\x02\x12\x42\n\x04type\x18\x03 \x01(\x0e\x32,.sentencepiece.ModelProto.SentencePiece.Type:\x06NORMAL\"T\n\x04Type\x12\n\n\x06NORMAL\x10\x01\x12\x0b\n\x07UNKNOWN\x10\x02\x12\x0b\n\x07\x43ONTROL\x10\x03\x12\x10\n\x0cUSER_DEFINED\x10\x04\x12\x08\n\x04\x42YTE\x10\x06\x12\n\n\x06UNUSED\x10\x05*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02\x42\x02H\x03'
     23 )
     27 _TRAINERSPEC_MODELTYPE = _descriptor.EnumDescriptor(
     28   name='ModelType',
     29   full_name='sentencepiece.TrainerSpec.ModelType',
   (...)
     58   serialized_end=1570,
     59 )
     60 _sym_db.RegisterEnumDescriptor(_TRAINERSPEC_MODELTYPE)

File /opt/conda/envs/py39/lib/python3.9/site-packages/google/protobuf/descriptor.py:1024, in FileDescriptor.__new__(cls, name, package, options, serialized_options, serialized_pb, dependencies, public_dependencies, syntax, pool, create_key)
   1022     raise RuntimeError('Please link in cpp generated lib for %s' % (name))
   1023 elif serialized_pb:
-> 1024   return _message.default_pool.AddSerializedFile(serialized_pb)
   1025 else:
   1026   return super(FileDescriptor, cls).__new__(cls)

TypeError: Couldn't build proto file into descriptor pool!
Invalid proto descriptor for file "sentencepiece_model.proto":
  sentencepiece_model.proto: A file with this name is already in the pool.

Collaborator

Hi, thanks for the information. Could you provide the full trace 🙏? It might be useful.

Collaborator Author

Just added it!

src/transformers/convert_slow_tokenizer.py
src/transformers/models/llama/tokenization_llama.py
self.assertEqual(tokens, ["▁", ".", "▁He", "ll", "o"])

# `'▁'` is also a whitespace
input_ids = self.tokenizer.encode("▁He is not")
Collaborator

What if we have two different whitespaces, e.g. " ▁He is not"?

Collaborator Author

Not sure I understand the test case: `" "` (a plain space) is always converted to `▁`. So `" ▁He"` and `"▁▁He"` are the same!

Collaborator

Can we add a test to make sure it's treated that way?

Collaborator Author

There is one:

        input_ids = self.tokenizer.encode("▁He is not             ▁He")
        self.assertEqual(input_ids, [156, 46, 44, 156, 2])
        tokens = self.tokenizer.tokenize("▁He is not              ▁He")

This is making sure that `             ▁He` is just encoded as a single `▁He`.
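Spelled out as a self-contained check (a hypothetical sketch; the concrete ids depend on the fixture tokenizer's vocab), the property being tested is:

    def test_whitespace_run_collapses(self):
        # The run of plain spaces before the second "▁He" is normalized away,
        # so both occurrences of "▁He" map to the same token id.
        ids = self.tokenizer.encode("▁He is not             ▁He")
        self.assertEqual(ids[0], ids[3])  # e.g. 156 at both positions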

tests/models/llama/test_tokenization_llama.py
tests/models/t5/test_tokenization_t5.py
tests/models/t5/test_tokenization_t5.py
@ArthurZucker (Collaborator, Author)

The bug has always been in T5, but since some models were trained with the bugged T5, we will let users decide whether or not they incorporate the change.

ArthurZucker and others added 6 commits July 10, 2023 21:35
@amyeroberts (Collaborator) left a comment

Thanks for fixing this and for all the work making it safe and backwards compatible!

src/transformers/models/t5/tokenization_t5.py
src/transformers/models/t5/tokenization_t5.py
src/transformers/models/llama/tokenization_llama.py
ArthurZucker and others added 3 commits July 11, 2023 13:39
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
@ArthurZucker ArthurZucker merged commit b15343d into huggingface:main Jul 11, 2023
4 checks passed
@ydshieh ydshieh mentioned this pull request Jul 11, 2023
Lorenzobattistela pushed a commit to Lorenzobattistela/transformers that referenced this pull request Jul 13, 2023
…behaviour is still valide for beginning of words (huggingface#24622)

* patch `_tokenize` function

* more tests

* properly fix

* fixup

* Update src/transformers/models/t5/tokenization_t5.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* fix without ifs

* update

* protect import

* add python processing

* is first needed

* add doc and update with lefacy

* updaate

* fix T5 SPM converter

* styling

* fix T5 warning

* add is_seqio_available

* remove is_first

* revert some changes

* more tests and update

* update llama test batterie

* fixup

* refactor T5 spm common tests

* draft the llama tests

* update

* uopdate test

* nits

* refine

* name nit

* fix t5 tests

* fix T5

* update

* revert convert slow to fast changes that fail lots of tests

* legacy support

* fixup

* nits is first not defined

* don't use legacy behaviour for switch transformers

* style

* My attempt to check.

* nits

* fixes

* update

* fixup

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates

* fixup

* add legacy warning

* fixup

* warning_once nit

* update t5 documentation test

* update llama tok documentation

* add space to warning

* nits

* nit

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* last nits

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023