This repository contains the code for training mCOLT, a multilingual NMT training framework implemented on top of fairseq.

mRASP2/mCOLT, short for multilingual Contrastive Learning for Transformer, is a multilingual neural machine translation model that supports complete many-to-many multilingual machine translation. It employs both parallel corpora and multilingual corpora in a unified training framework. For details, please refer to the paper.
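To make the contrastive idea concrete, here is a minimal, framework-free sketch of an InfoNCE-style objective: the encoder representation of a sentence is pulled toward that of its translation (the positive) and pushed away from unrelated sentences (the negatives). All names below are illustrative and do not come from the mCOLT codebase, which implements this inside fairseq.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(src, pos, negatives, temperature=0.1):
    """InfoNCE-style loss: score the positive (a translation of src)
    against in-batch negatives (representations of unrelated sentences)."""
    logits = [cosine(src, pos) / temperature] + [
        cosine(src, n) / temperature for n in negatives
    ]
    # Numerically stable -log softmax of the positive's logit.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```

The loss is small when the positive pair is much more similar than any negative pair, which is what drives representations of mutual translations toward a shared space.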
```bash
pip install -r requirements.txt
```
We release our preprocessed training data and checkpoints below.

We merge 32 English-centric language pairs, resulting in 64 directed translation pairs in total. The original corpus of these 32 language pairs contains about 197M sentence pairs. After applying RAS we obtain about 262M sentence pairs, since we keep both the original and the substituted sentences. We release both the original dataset and the dataset after applying RAS.
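Random Aligned Substitution (RAS) augments the data by replacing source words with their dictionary translations in other languages, so that words meaning the same thing land in similar contexts. A toy sketch, assuming a simple token-level bilingual dictionary (the real dictionaries cover many languages and the real pipeline operates on subword data):

```python
import random

# Toy bilingual dictionary for illustration only.
TOY_DICT = {"hello": "bonjour", "world": "monde"}

def ras(tokens, dictionary, prob=0.3, rng=None):
    """Random Aligned Substitution sketch: each token that has a
    dictionary translation is replaced with probability `prob`."""
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        if tok in dictionary and rng.random() < prob:
            out.append(dictionary[tok])
        else:
            out.append(tok)
    return out
```

Because both the original sentence and its substituted variant are kept, the corpus grows from roughly 197M to roughly 262M pairs, as reflected in the table below.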
Dataset | #Pairs |
---|---|
32-lang-pairs-TRAIN | 197603294 |
32-lang-pairs-RAS-TRAIN | 262662792 |
mono-split-a | - |
mono-split-b | - |
mono-split-c | - |
mono-split-d | - |
mono-split-e | - |
mono-split-de-fr-en | - |
mono-split-nl-pl-pt | - |
32-lang-pairs-DEV-en-centric | - |
32-lang-pairs-DEV-many-to-many | - |
Vocab | - |
BPE Code | - |
Note that the provided checkpoint is slightly different from that in the paper.
```bash
bash train_w_mono.sh ${model_config}
```
- We give an example of `${model_config}` in `${PROJECT_REPO}/examples/configs/parallel_mono_12e12d_contrastive.yml`.
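The config file bundles the model and training hyperparameters consumed by the training script. The fragment below is a purely hypothetical sketch of the shape such a file can take; every key and value here is an illustrative assumption, not the contents of the shipped config:

```yaml
# Hypothetical sketch only -- consult the shipped
# examples/configs/parallel_mono_12e12d_contrastive.yml for the real keys.
model:
  encoder_layers: 12   # "12e12d" in the filename suggests 12 encoder layers
  decoder_layers: 12   # ... and 12 decoder layers
training:
  contrastive_temperature: 0.1   # assumed value for illustration
```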