
Add m2m100 #10236

Merged: 56 commits merged into huggingface:master on Mar 6, 2021

Conversation

@patil-suraj (Contributor) commented on Feb 17, 2021

What does this PR do?

Adds the M2M100 model
https://github.com/pytorch/fairseq/tree/master/examples/m2m_100

Fixes #8054
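
For context, here is a minimal usage sketch of the model this PR adds, translating English to French. The checkpoint name facebook/m2m100_418M and the exact tokenizer/generation arguments are assumptions based on the final documentation rather than something stated in this PR description:

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Assumed checkpoint name; the tokenizer takes source and target language codes.
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="en", tgt_lang="fr")

text = "Life is like a box of chocolates."
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start generating with the French language token.
generated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
# => ["La vie est comme une boîte de chocolat."]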

Comment on lines 88 to 89
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."
@patil-suraj (Contributor, Author):

This one got it right @lhoestq 😉

Member:

Nice! 🍫🍫🍫

@patrickvonplaten (Contributor) left a comment

Great work @patil-suraj - could you add as many # Copied from statements as possible in the modeling_...py file and ping me again for review?
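
For readers unfamiliar with the convention: these are comments in the modeling file that mark code duplicated from another model, so the repository's consistency checks can keep the copies in sync. A hedged illustration follows, using shift_tokens_right as the copied function; whether this exact function carries the tag in the final modeling_m2m_100.py is not asserted here:

import torch

# Copied from transformers.models.bart.modeling_bart.shift_tokens_right
def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int, decoder_start_token_id: int):
    """Shift input ids one token to the right, prepending the decoder start token."""
    shifted_input_ids = input_ids.new_zeros(input_ids.shape)
    shifted_input_ids[:, 1:] = input_ids[:, :-1].clone()
    shifted_input_ids[:, 0] = decoder_start_token_id
    # Replace any -100 label-padding values with the real pad token id.
    shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)
    return shifted_input_ids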

@patil-suraj (Contributor, Author):

Sure, Patrick!

@sgugger (Collaborator) left a comment

Great addition! Out of curiosity, what is missing to get a fast tokenizer like for mBART?

Review comments on the following files were marked outdated and resolved:
* docs/source/model_doc/m2m_100.rst (2 threads)
* src/transformers/models/auto/modeling_auto.py (2 threads)
* src/transformers/models/m2m_100/configuration_m2m_100.py
* src/transformers/models/m2m_100/modeling_m2m_100.py
* src/transformers/models/m2m_100/tokenization_m2m_100.py
@LysandreJik (Member) left a comment

This looks great! Would it be hard to implement the fast tokenizer as well?

A review comment on src/transformers/models/m2m_100/configuration_m2m_100.py was marked outdated and resolved.
@patil-suraj (Contributor, Author):

I’ve addressed all the review comments, and all the slow/fast tests are now passing.

I didn't add a fast tokenizer because M2M100's tokenizer is sentencepiece-based, but it uses sentencepiece only for splitting text into tokens and then relies on a separate vocab file to convert tokens to ids and ids back to tokens, so our current SpmConverter doesn't work for it. I'll try to add a fast tokenizer in a follow-up PR.
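
To illustrate the two-step flow described above, here is a rough sketch (not the actual M2M100Tokenizer code; the file names spm.128k.model and vocab.json are placeholders):

import json
import sentencepiece as spm

# SentencePiece is used only to split raw text into subword pieces.
sp = spm.SentencePieceProcessor()
sp.Load("spm.128k.model")  # placeholder path to the sentencepiece model

# A separate vocabulary file provides the token <-> id mapping.
with open("vocab.json", "r", encoding="utf-8") as f:  # placeholder vocab file
    token_to_id = json.load(f)
id_to_token = {idx: tok for tok, idx in token_to_id.items()}

pieces = sp.EncodeAsPieces("Life is like a box of chocolates.")
ids = [token_to_id[p] for p in pieces if p in token_to_id]
tokens = [id_to_token[i] for i in ids]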

Merging!

@patil-suraj patil-suraj merged commit f6e74a6 into huggingface:master Mar 6, 2021
@patil-suraj patil-suraj deleted the add-m2m100 branch March 6, 2021 16:44
Iwontbecreative pushed a commit to Iwontbecreative/transformers that referenced this pull request Jul 15, 2021
* m2m_100

* no layernorm_embedding

* sinusoidal positional embeddings

* update pos embeddings

* add default config values

* tokenizer

* add conversion script

* fix config

* fix pos embed

* remove _float_tensor

* update tokenizer

* update lang codes

* handle lang codes

* fix pos embeds

* fix spm key

* put embedding weights on device

* remove qa and seq classification heads

* fix convert script

* lang codes on one line

* fix embeds

* fix tokenizer

* fix tokenizer

* add fast tokenizer

* style

* M2M100MT => M2M100

* fix copyright, style

* tokenizer converter

* vocab file

* remove fast tokenizer

* fix embeds

* fix tokenizer

* fix tests

* add tokenizer tests

* add integration test

* quality

* fix model name

* fix test

* doc

* doc

* fix doc

* add copied from statements

* fix tokenizer tests

* apply review suggestions

* fix urls

* fix shift_tokens_right

* apply review suggestions

* fix

* fix doc

* add lang code to id

* remove unused function

* update checkpoint names

* fix copy

* fix tokenizer

* fix checkpoint names

* fix merge issue

* style
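
Several of the commits above deal with sinusoidal positional embeddings (used without layernorm_embedding). For reference, a minimal sketch of the standard fixed sinusoidal formulation; this is the generic recipe, not necessarily the exact code added in this PR:

import math
import torch

def build_sinusoidal_embeddings(num_positions: int, embedding_dim: int) -> torch.Tensor:
    """Fixed (non-learned) sinusoidal position embeddings, fairseq-style layout."""
    half_dim = embedding_dim // 2
    # Geometric progression of frequencies, as in "Attention Is All You Need".
    freqs = torch.exp(torch.arange(half_dim, dtype=torch.float) * -(math.log(10000.0) / (half_dim - 1)))
    positions = torch.arange(num_positions, dtype=torch.float).unsqueeze(1)
    angles = positions * freqs.unsqueeze(0)                         # (num_positions, half_dim)
    emb = torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)  # (num_positions, 2 * half_dim)
    if embedding_dim % 2 == 1:                                      # zero-pad for odd dimensions
        emb = torch.cat([emb, torch.zeros(num_positions, 1)], dim=1)
    return emb
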
@Muennighoff (Contributor) commented:

> I've addressed all the review comments, and all the slow/fast tests are now passing.
>
> I didn't add a fast tokenizer because M2M100's tokenizer is sentencepiece-based, but it uses sentencepiece only for splitting text into tokens and then relies on a separate vocab file to convert tokens to ids and ids back to tokens, so our current SpmConverter doesn't work for it. I'll try to add a fast tokenizer in a follow-up PR.
>
> Merging!

Hey, I was wondering if there's any progress on a fast tokenizer for M2M100, or if any help is needed?
Thanks :)

Successfully merging this pull request may close these issues: Add m2m 100 multilingual translation model from FAIR (#8054)

6 participants