Add m2m100 #10236
Conversation
docs/source/model_doc/m2m_100.rst
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."
This one got it right @lhoestq 😉
Nice! 🍫🍫🍫
Great work @patil-suraj - could you add as many `# Copied from` statements as possible in the `modeling_...py` file and ping me again for review?
Sure, Patrick!
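(For reference, a `# Copied from` marker in a transformers modeling file looks like the sketch below. The specific source class, `BartAttention`, is an assumption here, based on M2M100 reusing the BART encoder-decoder architecture.)

```python
import torch.nn as nn

# The marker below is the convention: a repo checker script verifies that this
# class stays byte-identical with the referenced source class, after applying
# the Bart->M2M100 rename. Source class assumed, not taken from this PR.

# Copied from transformers.models.bart.modeling_bart.BartAttention with Bart->M2M100
class M2M100Attention(nn.Module):
    """Multi-headed attention (body omitted in this sketch)."""
    pass
```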
Great addition! Out of curiosity, what is missing to get a fast tokenizer like for mBART?
This looks great! Would it be hard to implement the fast tokenizer as well?
I've addressed all the review comments, and all the slow/fast tests are now passing. I didn't add the fast tokenizer because …

Merging!
* m2m_100
* no layernorm_embedding
* sinusoidal positional embeddings
* update pos embeddings
* add default config values
* tokenizer
* add conversion script
* fix config
* fix pos embed
* remove _float_tensor
* update tokenizer
* update lang codes
* handle lang codes
* fix pos embeds
* fix spm key
* put embedding weights on device
* remove qa and seq classification heads
* fix convert script
* lang codes on one line
* fix embeds
* fix tokenizer
* fix tokenizer
* add fast tokenizer
* style
* M2M100MT => M2M100
* fix copyright, style
* tokenizer converter
* vocab file
* remove fast tokenizer
* fix embeds
* fix tokenizer
* fix tests
* add tokenizer tests
* add integration test
* quality
* fix model name
* fix test
* doc
* doc
* fix doc
* add copied from statements
* fix tokenizer tests
* apply review suggestions
* fix urls
* fix shift_tokens_right
* apply review suggestions
* fix
* fix doc
* add lang code to id
* remove unused function
* update checkpoint names
* fix copy
* fix tokenizer
* fix checkpoint names
* fix merge issue
* style
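Several of the commits above rework the sinusoidal positional embeddings. As background, here is a minimal sketch of the standard fairseq-style construction these fixed embeddings follow; the function name is illustrative and padding-index handling is omitted:

```python
import math
import torch

def build_sinusoidal_embeddings(num_positions: int, dim: int) -> torch.Tensor:
    """Fixed (non-learned) position embeddings: sin on one half of the
    channels, cos on the other, with geometrically spaced frequencies.
    Assumes dim is even."""
    half_dim = dim // 2
    # Frequencies decay from 1 down to 1/10000 across half_dim channels.
    freqs = torch.exp(
        torch.arange(half_dim, dtype=torch.float32)
        * -(math.log(10000.0) / (half_dim - 1))
    )
    angles = torch.arange(num_positions, dtype=torch.float32)[:, None] * freqs[None, :]
    # Shape: (num_positions, dim)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)
```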
Hey, I was wondering if there's any progress on a fast tokenizer for M2M, or if any help is needed?
What does this PR do?
Adds the M2M100 model
https://github.com/pytorch/fairseq/tree/master/examples/m2m_100
Fixes #8054