
[Patch-t5-tokenizer] Patches the changes on T5 to make sure previous behaviour is still valide for beginning of words #24622

Merged
merged 59 commits into huggingface:main on Jul 11, 2023

Conversation

@ArthurZucker (Collaborator) commented Jul 2, 2023

What does this PR do?

There was a small typo introduced in #24565 that modified the behaviour, and the tests were not able to catch it (see #24569).
When a sentence does not start with a space, a space was added.
Before:

>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁', '▁Hello', '<extra_id_0>']

After:

>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁Hello', '<extra_id_0>']

Big bug: 35 models are involved

Not only punctuation: anything that comes after a special token is basically tokenized wrong... Let's ignore for now that we also split when it's the beginning of a word, which is less important.

Tests were added, and they were green before merging.
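For anyone who wants to reproduce the comparison locally, here is a minimal sketch (the t5-small checkpoint is only an example, and no exact token ids are asserted, only the absence of the spurious leading `▁`):

    from transformers import T5Tokenizer

    # Slow (sentencepiece-based) T5 tokenizer; any T5 checkpoint should
    # show the same behaviour.
    tokenizer = T5Tokenizer.from_pretrained("t5-small")

    tokens = tokenizer.tokenize("Hello <extra_id_0>")
    print(tokens)

    # With the typo from #24565 a spurious "▁" was prepended even though
    # the sentence does not start with a space; after this patch it is gone.
    assert tokens[0] != "▁"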

@ArthurZucker ArthurZucker marked this pull request as ready for review July 2, 2023 03:58
@HuggingFaceDocBuilderDev commented Jul 2, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker (Collaborator, Author)

Ran all the tests with RUN_SLOW; the switch CI is fixed.

@amyeroberts (Collaborator) left a comment

Thanks for fixing! The fix looks OK to me. I'm not super familiar with tokenizers, so it would be good to have a second approval before merging.

Do we have equivalent tests for other tokenizers in the library?

src/transformers/models/t5/tokenization_t5.py
@ArthurZucker ArthurZucker requested a review from Narsil July 3, 2023 11:41
@Narsil (Contributor) left a comment

I don't feel good about this kind of change.

Adding ifs to tokenization methods is, in general, asking for trouble, as the number of different cases is astronomical (we have too much surface).

I wasn't a big fan of the original fix either.

IMO we need to strengthen the test suite much more.

@Narsil (Contributor) left a comment

LGTM.

  • legacy means we are not modifying existing tokenizers without knowing it. (And we need to manually update those core tokenizers.)
  • The test suite is more extensive, so if we make code modifications later I feel more confident we're not going to re-break things unknowingly (either way).
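As a rough illustration of how the flag is meant to be used from the user side (a sketch only, assuming the keyword argument added in this PR is exposed as `legacy` on `from_pretrained`):

    from transformers import T5Tokenizer

    # Default: keep the historical behaviour so already-trained models are
    # not silently changed; a warning should point users at the `legacy` flag.
    legacy_tok = T5Tokenizer.from_pretrained("t5-small")

    # Explicit opt-in to the corrected handling of text after special tokens.
    fixed_tok = T5Tokenizer.from_pretrained("t5-small", legacy=False)

    print(legacy_tok.tokenize("Hello <extra_id_0> world"))
    print(fixed_tok.tokenize("Hello <extra_id_0> world"))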

@amyeroberts (Collaborator) left a comment

Thanks for iterating on this!

Just to make sure I've understood: was this bug always part of the T5 tokenizer, or was it introduced in #24565?
I'm asking because of the default value of legacy: at the moment any new tokenizers created will default to the legacy behaviour. If the bug was recently introduced in #24565, then it wasn't part of the most recent release, and so we can default to the "new" behaviour.

@@ -177,15 +178,6 @@
)


if is_protobuf_available():
Collaborator

Why is this deleted?

Collaborator Author

Because it's messing up the import. By deleting it here, we can at least run the test individually

@ydshieh (Collaborator) commented Jul 11, 2023

I am not sure what's messed up. With the fix in #24689, everything should be fine.

Also, moving this to another place might count as a breaking change, since

from transformers.utils import sentencepiece_model_pb2

won't work anymore (even though direct usage of it is probably low).

See here.

Could we bring it back quickly, and could you open an issue describing the import error you got? Thanks!

Collaborator Author

Having a look

Collaborator Author

Also, `from transformers.utils import sentencepiece_model_pb2` works for me.

Collaborator Author

Comes from #24690

Collaborator Author

The idea is that we only import it once, where we actually use it.
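In other words, roughly the following pattern (a sketch with a hypothetical helper name, not the actual code in this PR): the generated protobuf module is imported lazily, inside the only function that needs it, so its descriptors are registered only when sentencepiece conversion actually runs.

    def load_sentencepiece_model_proto(model_file):
        # Hypothetical helper: import the generated protobuf module right
        # where it is used instead of at the top of the module, so a plain
        # `import transformers` does not register sentencepiece_model.proto
        # in the protobuf descriptor pool.
        from transformers.utils import sentencepiece_model_pb2 as model_pb2

        proto = model_pb2.ModelProto()
        with open(model_file, "rb") as f:
            proto.ParseFromString(f.read())
        return proto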

@ArthurZucker (Collaborator, Author) commented Jul 11, 2023

This does not fix the version error, but it fixes the issue with protobuf 3.20.3, where we cannot use seqio or anything else that imports protobuf:

! pip install protobuf==3.20.3
from transformers import AutoTokenizer
from seqio import SentencePieceVocabulary

In [3]: from seqio import SentencePieceVocabulary
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 from seqio import SentencePieceVocabulary

File /opt/conda/envs/py39/lib/python3.9/site-packages/seqio/__init__.py:18
     15 """Import to top-level API."""
     16 # pylint:disable=wildcard-import,g-bad-import-order
---> 18 from seqio.dataset_providers import *
     19 from seqio import evaluation
     20 from seqio import experimental

File /opt/conda/envs/py39/lib/python3.9/site-packages/seqio/dataset_providers.py:39
     37 from packaging import version as version_lib
     38 import pyglove as pg
---> 39 from seqio import metrics as metrics_lib
     40 from seqio import preprocessors as seqio_preprocessors
     41 from seqio import task_registry_provenance_tracking

File /opt/conda/envs/py39/lib/python3.9/site-packages/seqio/metrics.py:27
     25 import jax.numpy as jnp
     26 import numpy as np
---> 27 from seqio import utils
     28 import tensorflow.compat.v2 as tf
     31 @dataclasses.dataclass
     32 class MetricValue:

File /opt/conda/envs/py39/lib/python3.9/site-packages/seqio/utils.py:29
     27 from absl import logging
     28 import numpy as np
---> 29 from seqio.vocabularies import Vocabulary
     30 import tensorflow.compat.v2 as tf
     31 import tensorflow_datasets as tfds

File /opt/conda/envs/py39/lib/python3.9/site-packages/seqio/vocabularies.py:27
     24 import tensorflow.compat.v2 as tf
     25 import tensorflow_text as tf_text
---> 27 from sentencepiece import sentencepiece_model_pb2
     28 import sentencepiece as sentencepiece_processor
     30 PAD_ID = 0

File /opt/conda/envs/py39/lib/python3.9/site-packages/sentencepiece/sentencepiece_model_pb2.py:16
      9 # @@protoc_insertion_point(imports)
     11 _sym_db = _symbol_database.Default()
---> 16 DESCRIPTOR = _descriptor.FileDescriptor(
     17   name='sentencepiece_model.proto',
     18   package='sentencepiece',
     19   syntax='proto2',
     20   serialized_options=b'H\003',
     21   create_key=_descriptor._internal_create_key,
     22   serialized_pb=b'\n\x19sentencepiece_model.proto\x12\rsentencepiece\"\x80\x0c\n\x0bTrainerSpec\x12\r\n\x05input\x18\x01 \x03(\t\x12\x14\n\x0cinput_format\x18\x07 \x01(\t\x12\x14\n\x0cmodel_prefix\x18\x02 \x01(\t\x12\x41\n\nmodel_type\x18\x03 \x01(\x0e\x32$.sentencepiece.TrainerSpec.ModelType:\x07UNIGRAM\x12\x18\n\nvocab_size\x18\x04 \x01(\x05:\x04\x38\x30\x30\x30\x12\x17\n\x0f\x61\x63\x63\x65pt_language\x18\x05 \x03(\t\x12 \n\x15self_test_sample_size\x18\x06 \x01(\x05:\x01\x30\x12*\n\x1b\x65nable_differential_privacy\x18\x32 \x01(\x08:\x05\x66\x61lse\x12+\n differential_privacy_noise_level\x18\x33 \x01(\x02:\x01\x30\x12\x32\n\'differential_privacy_clipping_threshold\x18\x34 \x01(\x04:\x01\x30\x12\"\n\x12\x63haracter_coverage\x18\n \x01(\x02:\x06\x30.9995\x12\x1e\n\x13input_sentence_size\x18\x0b \x01(\x04:\x01\x30\x12$\n\x16shuffle_input_sentence\x18\x13 \x01(\x08:\x04true\x12 \n\x14mining_sentence_size\x18\x0c \x01(\x05\x42\x02\x18\x01\x12\"\n\x16training_sentence_size\x18\r \x01(\x05\x42\x02\x18\x01\x12(\n\x17seed_sentencepiece_size\x18\x0e \x01(\x05:\x07\x31\x30\x30\x30\x30\x30\x30\x12\x1e\n\x10shrinking_factor\x18\x0f \x01(\x02:\x04\x30.75\x12!\n\x13max_sentence_length\x18\x12 \x01(\x05:\x04\x34\x31\x39\x32\x12\x17\n\x0bnum_threads\x18\x10 \x01(\x05:\x02\x31\x36\x12\x1d\n\x12num_sub_iterations\x18\x11 \x01(\x05:\x01\x32\x12$\n\x18max_sentencepiece_length\x18\x14 \x01(\x05:\x02\x31\x36\x12%\n\x17split_by_unicode_script\x18\x15 \x01(\x08:\x04true\x12\x1d\n\x0fsplit_by_number\x18\x17 \x01(\x08:\x04true\x12!\n\x13split_by_whitespace\x18\x16 \x01(\x08:\x04true\x12)\n\x1atreat_whitespace_as_suffix\x18\x18 \x01(\x08:\x05\x66\x61lse\x12+\n\x1c\x61llow_whitespace_only_pieces\x18\x1a \x01(\x08:\x05\x66\x61lse\x12\x1b\n\x0csplit_digits\x18\x19 \x01(\x08:\x05\x66\x61lse\x12#\n\x19pretokenization_delimiter\x18\x35 \x01(\t:\x00\x12\x17\n\x0f\x63ontrol_symbols\x18\x1e \x03(\t\x12\x1c\n\x14user_defined_symbols\x18\x1f \x03(\t\x12\x16\n\x0erequired_chars\x18$ \x01(\t\x12\x1c\n\rbyte_fallback\x18# \x01(\x08:\x05\x66\x61lse\x12+\n\x1dvocabulary_output_piece_score\x18  \x01(\x08:\x04true\x12\x1e\n\x10hard_vocab_limit\x18! \x01(\x08:\x04true\x12\x1c\n\ruse_all_vocab\x18\" \x01(\x08:\x05\x66\x61lse\x12\x11\n\x06unk_id\x18( \x01(\x05:\x01\x30\x12\x11\n\x06\x62os_id\x18) \x01(\x05:\x01\x31\x12\x11\n\x06\x65os_id\x18* \x01(\x05:\x01\x32\x12\x12\n\x06pad_id\x18+ \x01(\x05:\x02-1\x12\x18\n\tunk_piece\x18- \x01(\t:\x05<unk>\x12\x16\n\tbos_piece\x18. 
\x01(\t:\x03<s>\x12\x17\n\teos_piece\x18/ \x01(\t:\x04</s>\x12\x18\n\tpad_piece\x18\x30 \x01(\t:\x05<pad>\x12\x1a\n\x0bunk_surface\x18, \x01(\t:\x05 \xe2\x81\x87 \x12+\n\x1ctrain_extremely_large_corpus\x18\x31 \x01(\x08:\x05\x66\x61lse\"5\n\tModelType\x12\x0b\n\x07UNIGRAM\x10\x01\x12\x07\n\x03\x42PE\x10\x02\x12\x08\n\x04WORD\x10\x03\x12\x08\n\x04\x43HAR\x10\x04*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02\"\xd1\x01\n\x0eNormalizerSpec\x12\x0c\n\x04name\x18\x01 \x01(\t\x12\x1c\n\x14precompiled_charsmap\x18\x02 \x01(\x0c\x12\x1e\n\x10\x61\x64\x64_dummy_prefix\x18\x03 \x01(\x08:\x04true\x12&\n\x18remove_extra_whitespaces\x18\x04 \x01(\x08:\x04true\x12 \n\x12\x65scape_whitespaces\x18\x05 \x01(\x08:\x04true\x12\x1e\n\x16normalization_rule_tsv\x18\x06 \x01(\t*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02\"y\n\x0cSelfTestData\x12\x33\n\x07samples\x18\x01 \x03(\x0b\x32\".sentencepiece.SelfTestData.Sample\x1a)\n\x06Sample\x12\r\n\x05input\x18\x01 \x01(\t\x12\x10\n\x08\x65xpected\x18\x02 \x01(\t*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02\"\xfe\x03\n\nModelProto\x12\x37\n\x06pieces\x18\x01 \x03(\x0b\x32\'.sentencepiece.ModelProto.SentencePiece\x12\x30\n\x0ctrainer_spec\x18\x02 \x01(\x0b\x32\x1a.sentencepiece.TrainerSpec\x12\x36\n\x0fnormalizer_spec\x18\x03 \x01(\x0b\x32\x1d.sentencepiece.NormalizerSpec\x12\x33\n\x0eself_test_data\x18\x04 \x01(\x0b\x32\x1b.sentencepiece.SelfTestData\x12\x38\n\x11\x64\x65normalizer_spec\x18\x05 \x01(\x0b\x32\x1d.sentencepiece.NormalizerSpec\x1a\xd2\x01\n\rSentencePiece\x12\r\n\x05piece\x18\x01 \x01(\t\x12\r\n\x05score\x18\x02 \x01(\x02\x12\x42\n\x04type\x18\x03 \x01(\x0e\x32,.sentencepiece.ModelProto.SentencePiece.Type:\x06NORMAL\"T\n\x04Type\x12\n\n\x06NORMAL\x10\x01\x12\x0b\n\x07UNKNOWN\x10\x02\x12\x0b\n\x07\x43ONTROL\x10\x03\x12\x10\n\x0cUSER_DEFINED\x10\x04\x12\x08\n\x04\x42YTE\x10\x06\x12\n\n\x06UNUSED\x10\x05*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02\x42\x02H\x03'
     23 )
     27 _TRAINERSPEC_MODELTYPE = _descriptor.EnumDescriptor(
     28   name='ModelType',
     29   full_name='sentencepiece.TrainerSpec.ModelType',
   (...)
     58   serialized_end=1570,
     59 )
     60 _sym_db.RegisterEnumDescriptor(_TRAINERSPEC_MODELTYPE)

File /opt/conda/envs/py39/lib/python3.9/site-packages/google/protobuf/descriptor.py:1024, in FileDescriptor.__new__(cls, name, package, options, serialized_options, serialized_pb, dependencies, public_dependencies, syntax, pool, create_key)
   1022     raise RuntimeError('Please link in cpp generated lib for %s' % (name))
   1023 elif serialized_pb:
-> 1024   return _message.default_pool.AddSerializedFile(serialized_pb)
   1025 else:
   1026   return super(FileDescriptor, cls).__new__(cls)

TypeError: Couldn't build proto file into descriptor pool!
Invalid proto descriptor for file "sentencepiece_model.proto":
  sentencepiece_model.proto: A file with this name is already in the pool.

Collaborator

Hi, thanks for the information. Could you provide the full trace 🙏? It might be useful.

Collaborator Author

Just added it!

src/transformers/convert_slow_tokenizer.py
src/transformers/models/llama/tokenization_llama.py
self.assertEqual(tokens, ["▁", ".", "▁He", "ll", "o"])

# `'▁'` is also a whitespace
input_ids = self.tokenizer.encode("▁He is not")
Collaborator

What if we have two different whitespaces, e.g. " ▁He is not"?

Collaborator Author

Not sure I understand the test case: `" "` (a plain space) is always converted to `▁`. So `" ▁He"` and `"▁▁He"` are the same!

Collaborator

Can we add a test to make sure it's treated that way?

Collaborator Author

There is one:

        input_ids = self.tokenizer.encode("▁He is not             ▁He")
        self.assertEqual(input_ids, [156, 46, 44, 156, 2])
        tokens = self.tokenizer.tokenize("▁He is not              ▁He")

This is making sure that `             ▁He` is just encoded as a single `▁He`.
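Spelled out as a self-contained check (a hypothetical sketch; the concrete ids depend on the fixture tokenizer's vocab), the property being tested is:

    def test_whitespace_run_collapses(self):
        # The run of plain spaces before the second "▁He" is normalized away,
        # so both occurrences of "▁He" map to the same token id.
        ids = self.tokenizer.encode("▁He is not             ▁He")
        self.assertEqual(ids[0], ids[3])  # e.g. 156 at both positions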

tests/models/llama/test_tokenization_llama.py
tests/models/t5/test_tokenization_t5.py
tests/models/t5/test_tokenization_t5.py
@ArthurZucker (Collaborator, Author)

The bug has always been in T5, but since some models were trained with the bugged T5, we will let users decide whether or not they incorporate the change.

ArthurZucker and others added 6 commits July 10, 2023 21:35
@amyeroberts (Collaborator) left a comment

Thanks for fixing this and for all the work making it safe and backwards compatible!

src/transformers/models/t5/tokenization_t5.py
src/transformers/models/t5/tokenization_t5.py
src/transformers/models/llama/tokenization_llama.py
ArthurZucker and others added 3 commits July 11, 2023 13:39
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
@ArthurZucker ArthurZucker merged commit b15343d into huggingface:main Jul 11, 2023
4 checks passed
@ydshieh ydshieh mentioned this pull request Jul 11, 2023
Lorenzobattistela pushed a commit to Lorenzobattistela/transformers that referenced this pull request Jul 13, 2023
…behaviour is still valide for beginning of words (huggingface#24622)

* patch `_tokenize` function

* more tests

* properly fix

* fixup

* Update src/transformers/models/t5/tokenization_t5.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* fix without ifs

* update

* protect import

* add python processing

* is first needed

* add doc and update with lefacy

* updaate

* fix T5 SPM converter

* styling

* fix T5 warning

* add is_seqio_available

* remove is_first

* revert some changes

* more tests and update

* update llama test batterie

* fixup

* refactor T5 spm common tests

* draft the llama tests

* update

* uopdate test

* nits

* refine

* name nit

* fix t5 tests

* fix T5

* update

* revert convert slow to fast changes that fail lots of tests

* legacy support

* fixup

* nits is first not defined

* don't use legacy behaviour for switch transformers

* style

* My attempt to check.

* nits

* fixes

* update

* fixup

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates

* fixup

* add legacy warning

* fixup

* warning_once nit

* update t5 documentation test

* update llama tok documentation

* add space to warning

* nits

* nit

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* last nits

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023