
Add m2m100 #10236

Merged: 56 commits merged into huggingface:master on Mar 6, 2021

Conversation

@patil-suraj (Contributor) commented on Feb 17, 2021

What does this PR do?

Adds the M2M100 model
https://github.com/pytorch/fairseq/tree/master/examples/m2m_100

Fixes #8054
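
For context, here is a minimal usage sketch of the model this PR adds, translating English to French. The checkpoint name facebook/m2m100_418M and the exact tokenizer/generation arguments are assumptions based on the final documentation rather than something stated in this PR description:

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Assumed checkpoint name; the tokenizer takes source and target language codes.
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="en", tgt_lang="fr")

text = "Life is like a box of chocolates."
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start generating with the French language token.
generated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
# => ["La vie est comme une boîte de chocolat."]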

Comment on lines 88 to 89
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."
@patil-suraj (Contributor, Author):

This one got it right @lhoestq 😉

Member:

Nice! 🍫🍫🍫

@patrickvonplaten (Contributor) left a comment

Great work @patil-suraj - could you add as many # Copied from statements as possible in the modeling_...py file and ping me again for review?
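
For readers unfamiliar with the convention: these are comments in the modeling file that mark code duplicated from another model, so the repository's consistency checks can keep the copies in sync. A hedged illustration follows, using shift_tokens_right as the copied function; whether this exact function carries the tag in the final modeling_m2m_100.py is not asserted here:

import torch

# Copied from transformers.models.bart.modeling_bart.shift_tokens_right
def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int, decoder_start_token_id: int):
    """Shift input ids one token to the right, prepending the decoder start token."""
    shifted_input_ids = input_ids.new_zeros(input_ids.shape)
    shifted_input_ids[:, 1:] = input_ids[:, :-1].clone()
    shifted_input_ids[:, 0] = decoder_start_token_id
    # Replace any -100 label-padding values with the real pad token id.
    shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)
    return shifted_input_ids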

@patil-suraj (Contributor, Author):

Sure, Patrick!

@sgugger (Collaborator) left a comment

Great addition! Out of curiosity, what is missing to get a fast tokenizer like for mBART?

Review comments on the following files were marked outdated and resolved:
* docs/source/model_doc/m2m_100.rst (2 threads)
* src/transformers/models/auto/modeling_auto.py (2 threads)
* src/transformers/models/m2m_100/configuration_m2m_100.py
* src/transformers/models/m2m_100/modeling_m2m_100.py
* src/transformers/models/m2m_100/tokenization_m2m_100.py
@LysandreJik (Member) left a comment

This looks great! Would it be hard to implement the fast tokenizer as well?

A review comment on src/transformers/models/m2m_100/configuration_m2m_100.py was marked outdated and resolved.
@patil-suraj (Contributor, Author):

I’ve addressed all the review comments, and all the slow/fast tests are now passing.

I didn't add a fast tokenizer because M2M100's tokenizer is sentencepiece-based, but it uses sentencepiece only for splitting text into tokens and then relies on a separate vocab file to convert tokens to ids and ids back to tokens, so our current SpmConverter doesn't work for it. I'll try to add a fast tokenizer in a follow-up PR.
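
To illustrate the two-step flow described above, here is a rough sketch (not the actual M2M100Tokenizer code; the file names spm.128k.model and vocab.json are placeholders):

import json
import sentencepiece as spm

# SentencePiece is used only to split raw text into subword pieces.
sp = spm.SentencePieceProcessor()
sp.Load("spm.128k.model")  # placeholder path to the sentencepiece model

# A separate vocabulary file provides the token <-> id mapping.
with open("vocab.json", "r", encoding="utf-8") as f:  # placeholder vocab file
    token_to_id = json.load(f)
id_to_token = {idx: tok for tok, idx in token_to_id.items()}

pieces = sp.EncodeAsPieces("Life is like a box of chocolates.")
ids = [token_to_id[p] for p in pieces if p in token_to_id]
tokens = [id_to_token[i] for i in ids]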

Merging!

@patil-suraj patil-suraj merged commit f6e74a6 into huggingface:master Mar 6, 2021
@patil-suraj patil-suraj deleted the add-m2m100 branch March 6, 2021 16:44
Iwontbecreative pushed a commit to Iwontbecreative/transformers that referenced this pull request Jul 15, 2021
* m2m_100

* no layernorm_embedding

* sinusoidal positional embeddings

* update pos embeddings

* add default config values

* tokenizer

* add conversion script

* fix config

* fix pos embed

* remove _float_tensor

* update tokenizer

* update lang codes

* handle lang codes

* fix pos embeds

* fix spm key

* put embedding weights on device

* remove qa and seq classification heads

* fix convert script

* lang codes on one line

* fix embeds

* fix tokenizer

* fix tokenizer

* add fast tokenizer

* style

* M2M100MT => M2M100

* fix copyright, style

* tokenizer converter

* vocab file

* remove fast tokenizer

* fix embeds

* fix tokenizer

* fix tests

* add tokenizer tests

* add integration test

* quality

* fix model name

* fix test

* doc

* doc

* fix doc

* add copied from statements

* fix tokenizer tests

* apply review suggestions

* fix urls

* fix shift_tokens_right

* apply review suggestions

* fix

* fix doc

* add lang code to id

* remove unused function

* update checkpoint names

* fix copy

* fix tokenizer

* fix checkpoint names

* fix merge issue

* style
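
Several of the commits above deal with sinusoidal positional embeddings (used without layernorm_embedding). For reference, a minimal sketch of the standard fixed sinusoidal formulation; this is the generic recipe, not necessarily the exact code added in this PR:

import math
import torch

def build_sinusoidal_embeddings(num_positions: int, embedding_dim: int) -> torch.Tensor:
    """Fixed (non-learned) sinusoidal position embeddings, fairseq-style layout."""
    half_dim = embedding_dim // 2
    # Geometric progression of frequencies, as in "Attention Is All You Need".
    freqs = torch.exp(torch.arange(half_dim, dtype=torch.float) * -(math.log(10000.0) / (half_dim - 1)))
    positions = torch.arange(num_positions, dtype=torch.float).unsqueeze(1)
    angles = positions * freqs.unsqueeze(0)                         # (num_positions, half_dim)
    emb = torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)  # (num_positions, 2 * half_dim)
    if embedding_dim % 2 == 1:                                      # zero-pad for odd dimensions
        emb = torch.cat([emb, torch.zeros(num_positions, 1)], dim=1)
    return emb
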
@Muennighoff (Contributor) commented:

> I've addressed all the review comments, and all the slow/fast tests are now passing.
>
> I didn't add a fast tokenizer because M2M100's tokenizer is sentencepiece-based, but it uses sentencepiece only for splitting text into tokens and then relies on a separate vocab file to convert tokens to ids and ids back to tokens, so our current SpmConverter doesn't work for it. I'll try to add a fast tokenizer in a follow-up PR.
>
> Merging!

Hey, I was wondering if there's any progress on a fast tokenizer for M2M100, or if any help is needed?
Thanks :)

Successfully merging this pull request may close these issues: Add m2m 100 multilingual translation model from FAIR (#8054)

6 participants