This repository contains code for the following paper:
Pengfei Li, Liangyou Li, Meng Zhang, Minghao Wu, Qun Liu. Universal Conditional Masked Language Pre-training for Neural Machine Translation. In Proceedings of ACL 2022 (main conference).
This is a reimplementation based on fairseq and Mask-Predict (for non-autoregressive neural machine translation), and it should reproduce reasonable results.
- Python >= 3.7
- Pytorch >= 1.7.0
- Fairseq 1.0.0a0
- sacrebleu==1.5.1
- fastBPE (for BPE codes)
- Moses (for tokenization)
- Apex (for fp16 training)
- kytea (for Japanese tokenization)
- jieba (for Chinese tokenization)
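A minimal installation sketch for the Python-side dependencies (repository URLs and the editable fairseq install are assumptions; adjust versions and paths to your environment):

```bash
# Python packages (versions as listed above)
pip install torch sacrebleu==1.5.1 jieba

# fairseq from source (a 1.0.0a0 checkout installed in editable mode is assumed)
git clone https://github.com/pytorch/fairseq.git
cd fairseq && pip install --editable . && cd ..

# fastBPE: compile the `fast` binary used in the BPE step below
git clone https://github.com/glample/fastBPE.git
cd fastBPE && g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast && cd ..

# Moses scripts provide the tokenizer; Apex (fp16) and KyTea are built separately
git clone https://github.com/moses-smt/mosesdecoder.git
```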
The pipeline contains two steps: Pre-training and fine-tuning.
You can skip this step if you directly download the pre-trained model and vocab we provide.
We use both bilingual and monolingual corpora for pre-training: the bilingual data covers 32 languages and is provided by mRASP, and the monolingual corpora can be downloaded from news-crawl.
You can skip this step if you directly download the data and vocab we provide.
Romanian, Chinese, and Japanese require special tokenization; we directly use the vocab and BPE codes provided by mRASP.
# bilingual
bash ${PROJECT_ROOT}/cemat_scripts/process/preprocess_Para.sh
# monolingual
bash ${PROJECT_ROOT}/cemat_scripts/process/preprocess_Mono.sh
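For reference, a rough sketch of the language-specific tokenization these scripts perform (jieba for Chinese, KyTea for Japanese, the Moses tokenizer for most other languages); the file names are placeholders and the provided scripts remain authoritative:

```bash
# Chinese: jieba word segmentation (space-delimited output)
python -m jieba -d ' ' raw.zh > tok.zh

# Japanese: KyTea segmentation, suppressing POS/pronunciation tags
kytea -notags < raw.ja > tok.ja

# Most other languages: Moses tokenizer, e.g. English
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < raw.en > tok.en
```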
If you want to use your own codes, you should learn your own BPE codes before applying BPE, and then build the vocab file from the subword data.
FASTBPE=fastBPE/fast
# number of BPE merge operations
CODES=35000
SRC_TRAIN_TOK=
TGT_TRAIN_TOK=
#output
BPE_CODES=
VOCAB=
SRC_TRAIN_BPE=
TGT_TRAIN_BPE=
$FASTBPE learnbpe $CODES ${SRC_TRAIN_TOK} ${TGT_TRAIN_TOK} > $BPE_CODES
$FASTBPE applybpe $SRC_TRAIN_BPE $SRC_TRAIN_TOK $BPE_CODES
$FASTBPE applybpe $TGT_TRAIN_BPE $TGT_TRAIN_TOK $BPE_CODES
$FASTBPE getvocab $SRC_TRAIN_BPE $TGT_TRAIN_BPE > $VOCAB
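The same codes can then be applied to the dev/test sides; fastBPE optionally takes the training vocab so that subwords unseen in training are split further (file names here are placeholders):

```bash
# apply the learned codes to validation/test data, constrained by the training vocab
$FASTBPE applybpe valid.bpe valid.tok $BPE_CODES $VOCAB
$FASTBPE applybpe test.bpe test.tok $BPE_CODES $VOCAB
```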
For bilingual data:
SRC=
TGT=
# your bpe corpus path
OUTPATH=
#OUTPUT
DEST=
fairseq-preprocess \
--source-lang ${SRC} \
--target-lang ${TGT} \
--trainpref ${OUTPATH}/train.${SRC}-${TGT}.spm.clean \
--validpref ${OUTPATH}/valid.spm \
--testpref ${OUTPATH}/test.spm \
--destdir ${DEST}/ \
--thresholdtgt 0 \
--thresholdsrc 0 \
--srcdict $VOCAB \
--tgtdict $VOCAB \
--workers 70
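Note that the same joint vocabulary file is passed as both --srcdict and --tgtdict, so all source and target languages share a single dictionary, which keeps the binarized data consistent with the vocabulary used by the pre-trained model.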
For monolingual data:
lang=
# your bpe corpus path
OUTPATH=
#OUTPUT
DEST=
fairseq-preprocess \
--only-source \
--source-lang ${lang} \
--trainpref ${OUTPATH}/train.spm \
--validpref ${OUTPATH}/valid.spm \
--destdir ${DEST}/ \
--thresholdtgt 0 \
--thresholdsrc 0 \
--srcdict $VOCAB \
--tgtdict $VOCAB \
--workers 70
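# fairseq-preprocess names single-language output <lang>-None; rename it to the <lang>-<lang> pair format expected when loading monolingual data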
mv ${DEST}/train.$lang-None.$lang.bin ${DEST}/train.$lang-$lang.$lang.bin
mv ${DEST}/train.$lang-None.$lang.idx ${DEST}/train.$lang-$lang.$lang.idx
mv ${DEST}/valid.$lang-None.$lang.bin ${DEST}/valid.$lang-$lang.$lang.bin
mv ${DEST}/valid.$lang-None.$lang.idx ${DEST}/valid.$lang-$lang.$lang.idx
You can skip this step if you directly download the data (train.merge.json, valid.merge.json) we provide.
# vocab file path.
vocab_path=
# bilingual (multilingual) word translation dict path (word_trans2id.dict)
wordTrans_path=
# data path (BPE format)
data_path=
prefix=
# 'zh-en'
langs=
# output path
out_path=
python extract_aligned_pairs.py --vocab-path $vocab_path --trans-path $wordTrans_path --data-path $data_path --output-path $out_path --prefix $prefix --langs $langs --add-mask --merge
After the above preprocessing, all training data is ready. Please put all the dictionaries (including word_trans2id.dict, train.merge.json, valid.merge.json, and dict.txt) into the ./Dicts directory, which is located alongside the binarized data.
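A minimal sketch of that layout (assuming the Dicts folder lives inside the binarized-data directory ${DEST} from the steps above; check the training scripts for the exact path they expect):

```bash
# collect the aligned dictionaries and vocab next to the binarized data
mkdir -p ${DEST}/Dicts
cp word_trans2id.dict train.merge.json valid.merge.json dict.txt ${DEST}/Dicts/
```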
bash ${PROJECT_ROOT}/CeMAT_plugins/task_pt_cemat.sh
You can modify the configs to choose the model architecture or dataset used.
# bilingual
bash ${PROJECT_ROOT}/process/preprocess_NMT.sh
bash ${PROJECT_ROOT}/CeMAT_plugins/task_NMT_cemat.sh
bash ${PROJECT_ROOT}/CeMAT_plugins/task_infer_nmt.sh
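task_infer_nmt.sh is the entry point for decoding; as a rough illustration only (not the script's exact options), a fairseq decoding and scoring run typically looks like the following, with checkpoint and data paths as placeholders:

```bash
# decode the test set with the fine-tuned checkpoint
fairseq-generate ${DEST} \
    --source-lang ${SRC} --target-lang ${TGT} \
    --path checkpoints/checkpoint_best.pt \
    --gen-subset test --beam 5 --remove-bpe \
    --batch-size 64 > gen.out

# pair hypotheses with references from the generation log and score them
# (scores are on tokenized text; the provided script may detokenize first)
grep ^H gen.out | LC_ALL=C sort -V | cut -f3- > hyp.txt
grep ^T gen.out | LC_ALL=C sort -V | cut -f2- > ref.txt
sacrebleu ref.txt < hyp.txt
```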
# bilingual
bash ${PROJECT_ROOT}/process/preprocess_NMT.sh
bash ${PROJECT_ROOT}/CeMAT_maskPredict/task_NAT_cemat.sh
bash ${PROJECT_ROOT}/CeMAT_maskPredict/task_infer_nat.sh
Dataset |
---|
Aligned Dicts |
Vocab |
We release our pre-trained models and NMT models for selected bilingual pairs.
Language | Model |
---|---|
Pre-trained | model |
CeMAT is released under the MIT License, with one exception: CeMAT_maskPredict (for non-autoregressive NMT) is licensed under CC-BY-NC 4.0, as required by the license of Mask-Predict. These licenses apply to the pre-trained models as well.
If you find this repo helpful for your research, please cite:
@inproceedings{CeMAT,
author = {Li, Pengfei and Li, Liangyou and Zhang, Meng and Wu, Minghao and Liu, Qun},
title = {Universal Conditional Masked Language Pre-training for Neural Machine Translation},
booktitle = {Annual Meeting of the Association for Computational Linguistics},
year = {2022}
}