
Add m2m100 #10236

Merged
56 commits merged Mar 6, 2021
Changes from 1 commit

Commits (56)
6ea416f
m2m_100
patil-suraj Jan 29, 2021
7750073
no layernorm_embedding
patil-suraj Jan 31, 2021
01fd460
sinusoidal positional embeddings
patil-suraj Jan 31, 2021
faf1d59
update pos embeddings
patil-suraj Jan 31, 2021
50e2327
add default config values
patil-suraj Jan 31, 2021
1657d16
tokenizer
patil-suraj Jan 31, 2021
a107379
add conversion script
patil-suraj Jan 31, 2021
b7c27c0
fix config
patil-suraj Jan 31, 2021
e469dd7
fix pos embed
patil-suraj Jan 31, 2021
3ac1d6d
remove _float_tensor
patil-suraj Feb 1, 2021
55d9a3b
update tokenizer
patil-suraj Feb 4, 2021
107ce93
update lang codes
patil-suraj Feb 4, 2021
2ca2b04
handle lang codes
patil-suraj Feb 4, 2021
feed630
fix pos embeds
patil-suraj Feb 4, 2021
30bdba7
fix spm key
patil-suraj Feb 5, 2021
5ad6f91
put embedding weights on device
patil-suraj Feb 5, 2021
17afb95
remove qa and seq classification heads
patil-suraj Feb 7, 2021
0b21878
fix convert script
patil-suraj Feb 7, 2021
ecc8d2d
lang codes on one line
patil-suraj Feb 7, 2021
525fc97
fix embeds
patil-suraj Feb 17, 2021
1be458a
fix tokenizer
patil-suraj Feb 17, 2021
6c98aca
fix tokenizer
patil-suraj Feb 17, 2021
50291c6
add fast tokenizer
patil-suraj Feb 17, 2021
2eb83a5
style
patil-suraj Feb 17, 2021
c9f171f
M2M100MT => M2M100
patil-suraj Feb 17, 2021
5d3225a
fix copyright, style
patil-suraj Feb 17, 2021
4495c2c
tokenizer converter
patil-suraj Feb 17, 2021
02f3766
vocab file
patil-suraj Feb 17, 2021
da7a595
remove fast tokenizer
patil-suraj Feb 17, 2021
2a05942
fix embeds
patil-suraj Feb 17, 2021
799783b
fix tokenizer
patil-suraj Feb 17, 2021
58df655
fix tests
patil-suraj Feb 17, 2021
f364413
add tokenizer tests
patil-suraj Feb 17, 2021
9cc69ac
add integration test
patil-suraj Feb 17, 2021
f9f63b8
quality
patil-suraj Feb 17, 2021
78c2dc5
fix model name
patil-suraj Feb 17, 2021
5b406ee
fix test
patil-suraj Feb 17, 2021
ce9a147
doc
patil-suraj Feb 17, 2021
38559d7
doc
patil-suraj Feb 17, 2021
c3702aa
fix doc
patil-suraj Feb 17, 2021
96df893
add copied from statements
patil-suraj Feb 17, 2021
bb90cbf
fix tokenizer tests
patil-suraj Feb 17, 2021
9f92f21
apply review suggestions
patil-suraj Feb 17, 2021
5afb2f6
fix urls
patil-suraj Feb 17, 2021
b8ac87a
fix shift_tokens_right
patil-suraj Feb 17, 2021
7d47e9d
apply review suggestions
patil-suraj Mar 5, 2021
eb20a6b
fix
patil-suraj Mar 5, 2021
f16b244
fix doc
patil-suraj Mar 5, 2021
beaa589
add lang code to id
patil-suraj Mar 5, 2021
cfdb807
remove unused function
patil-suraj Mar 5, 2021
bcd6b78
update checkpoint names
patil-suraj Mar 5, 2021
41f1799
fix copy
patil-suraj Mar 5, 2021
fd3c01d
fix tokenizer
patil-suraj Mar 5, 2021
e8dc722
fix checkpoint names
patil-suraj Mar 5, 2021
0173e43
fix merge issue
patil-suraj Mar 5, 2021
efcbdbd
style
patil-suraj Mar 5, 2021
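Several of the commits above ("sinusoidal positional embeddings", "update pos embeddings", "fix pos embed") concern the model's fixed positional embeddings. As a rough illustration, here is the standard Transformer sinusoidal table (even columns sine, odd columns cosine). Note this is the textbook formulation, not necessarily the exact layout the PR implements; the fairseq-style embedding that M2M100 follows packs sines and cosines differently and reserves an offset for the padding index.

```python
import math

def sinusoidal_embedding(num_positions, dim):
    # Fixed (non-learned) positional table: for position `pos` and
    # channel index `i`, even channels get sin(pos / 10000^(2i/dim))
    # and odd channels get the matching cosine.
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table
```

Because the table is a pure function of position, it is computed once at model init and never trained, which is why one commit above moves the weights onto the right device rather than registering them as parameters.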
fix tokenizer tests
patil-suraj committed Mar 6, 2021
commit bb90cbf08ff466c0961ae3c59e83758994128982
6 changes: 4 additions & 2 deletions tests/test_tokenization_m2m_100.py
@@ -34,7 +34,7 @@

 if is_torch_available():
-    from transformers.models.mbart.modeling_mbart import shift_tokens_right
+    from transformers.models.m2m_100.modeling_m2m_100 import shift_tokens_right

 EN_CODE = 128022
 FR_CODE = 128028
@@ -153,7 +153,9 @@ def test_batch_fairseq_parity(self):
         with self.tokenizer.as_target_tokenizer():
             batch["labels"] = self.tokenizer(self.tgt_text, padding=True, return_tensors="pt").input_ids

-        batch["decoder_input_ids"] = shift_tokens_right(batch["labels"], self.tokenizer.pad_token_id)
+        batch["decoder_input_ids"] = shift_tokens_right(
+            batch["labels"], self.tokenizer.pad_token_id, self.tokenizer.eos_token_id
+        )

         for k in batch:
             batch[k] = batch[k].tolist()
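The test now imports `shift_tokens_right` from the M2M100 module and passes the tokenizer's EOS id as a third argument, which serves as the decoder start token. A minimal sketch of what a BART-style `shift_tokens_right` does, using plain Python lists instead of tensors (the real function in `transformers` operates on PyTorch tensors and may differ in detail):

```python
def shift_tokens_right(input_ids, pad_token_id, decoder_start_token_id):
    # Shift every sequence one position to the right and prepend the
    # decoder start token (for M2M100 this is the EOS id), dropping the
    # last token so the length is unchanged.
    shifted = [[decoder_start_token_id] + seq[:-1] for seq in input_ids]
    # Replace any -100 (the ignored-label marker) with the pad token so
    # the decoder never sees a label-masking value as input.
    return [[pad_token_id if tok == -100 else tok for tok in seq] for seq in shifted]
```

This is why the diff adds `self.tokenizer.eos_token_id`: the M2M100 variant of the helper takes the decoder start token explicitly instead of inferring it, unlike the mBART version the test previously imported.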