Two datasets, mt5-wiki14 and mt5-16, are created for intermediate training, in order to encode knowledge from domain- and language-specific corpora.
You can download the data from mt5-wiki14 and mt5-16, or modify the scripts under the wiki and ccnet folders for your own use.
You can modify the script run_mlm_t5.sh with different hyperparameters and datasets for your own use.
Here, we keep only the QA translation pairs. To build this data, run the script prepare_aug.py. The data can be downloaded from aug_data.
Here, we translate both the QA pairs and the extracted passages. To build this data, run the script prepare_translation.py. The data can be downloaded from aug_data_trans.
We use translatepy as the translator; you can install it with pip install translatepy==2.3 for your own use.
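
The two augmentation modes described above (QA pairs only, or QA pairs plus passages) can be illustrated with a minimal sketch. This is not the code in prepare_aug.py or prepare_translation.py: the function names, the `question`/`answer`/`passage`/`lang` field names, and the `translate_passages` flag are all hypothetical and chosen only for illustration. The translator is passed in as a plain callable so the sketch runs without network access; the optional `translatepy_backend` helper shows one way translatepy might be plugged in.

```python
from typing import Callable, Dict, List

# Hypothetical record layout: the real data files may use different keys.
QAExample = Dict[str, str]
TranslateFn = Callable[[str, str], str]  # (text, target_lang) -> translated text


def translate_examples(
    examples: List[QAExample],
    translate: TranslateFn,
    target_lang: str,
    translate_passages: bool = False,
) -> List[QAExample]:
    """Return translated copies of the examples; the inputs are not modified.

    With translate_passages=False only the question and answer are translated
    (QA-pairs-only mode). With translate_passages=True the passage is
    translated as well, when the example has one.
    """
    augmented = []
    for ex in examples:
        new_ex = dict(ex)  # shallow copy so the original example is untouched
        new_ex["question"] = translate(ex["question"], target_lang)
        new_ex["answer"] = translate(ex["answer"], target_lang)
        if translate_passages and "passage" in ex:
            new_ex["passage"] = translate(ex["passage"], target_lang)
        new_ex["lang"] = target_lang
        augmented.append(new_ex)
    return augmented


def translatepy_backend(text: str, target_lang: str) -> str:
    """Optional backend using translatepy (requires the package and network).

    Assumes the translatepy 2.x interface, where Translator().translate(...)
    returns an object with a .result attribute.
    """
    from translatepy import Translator  # imported lazily; not needed for the sketch

    return Translator().translate(text, target_lang).result
```

For example, `translate_examples(data, translatepy_backend, "Japanese")` would cover the QA-pairs-only case, and adding `translate_passages=True` would also translate the passages. Any function with the same `(text, target_lang)` signature can stand in for the translator.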