Trains an OpenNMT model and a SentencePiece tokenizer, then packages them with a Stanza model for use with Argos Translate.
Pre-built Argos Translate packages are available for download.
Uses data from the OPUS project in the Moses format.
This is the setup currently used to train models:
- NVIDIA Tesla K80 GPU
- 7 CPU cores, 30GB memory
- 75-200GB swap space
- Ubuntu 20.04
Tested on Ubuntu 20.04 with this script:
```
curl https://raw.githubusercontent.com/PJ-Finlay/cuda-setup/main/setup.sh | sh
sudo reboot
```
Using the nvidia/cuda Docker container should also work.
```
cd
git clone https://github.com/OpenNMT/OpenNMT-py.git
cd OpenNMT-py
pip3 install -e .
pip3 install -r requirements.opt.txt
export PATH="$HOME/.local/bin:$PATH"
```
```
cd
git clone https://github.com/argosopentech/onmt-models.git
cd ~/onmt-models/raw_data
wget https://object.pouta.csc.fi/OPUS-Wikipedia/v1.0/moses/en-es.txt.zip
unzip en-es.txt.zip
cat *.en >> source.en
cat *.es >> source.es
```
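Before training, it is worth confirming that the concatenated corpus files are still parallel. A minimal sketch (`check_parallel` is a hypothetical helper, not part of the repo) that compares line counts, since every source sentence must align with exactly one target sentence:

```shell
# check_parallel SRC TGT: a quick parallel-corpus sanity check that
# verifies both files contain the same number of lines.
# grep -c '' counts lines and prints a bare number on every platform.
check_parallel() {
    src_lines=$(grep -c '' "$1")
    tgt_lines=$(grep -c '' "$2")
    if [ "$src_lines" -eq "$tgt_lines" ]; then
        echo "OK: $src_lines pairs"
    else
        echo "Mismatch: $src_lines vs $tgt_lines"
    fi
}
```

Usage: `check_parallel source.en source.es`.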
75GB of swap works for most models; if you have free disk space you can allocate more.
```
sudo fallocate -l 75G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
sudo swapon --show
```
```
cd ~/onmt-models
sudo ./setup.sh
screen
./train.sh
```
Detaching: `Ctrl-a d`

Reattaching: `screen -r`
`metadata.json` example:
```json
{
    "package_version": "1.0",
    "argos_version": "1.1",
    "from_code": "en",
    "from_name": "English",
    "to_code": "zh",
    "to_name": "Chinese"
}
```
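A malformed `metadata.json` will break packaging, so it can be worth checking the syntax first. A minimal sketch (`validate_metadata` is a hypothetical helper; it assumes `python3` is available, which it already is for this training setup):

```shell
# validate_metadata FILE: print "valid" if FILE parses as JSON,
# "invalid" otherwise, using Python's stdlib json.tool module.
validate_metadata() {
    if python3 -m json.tool "$1" > /dev/null 2>&1; then
        echo "valid"
    else
        echo "invalid"
    fi
}
```

Usage: `validate_metadata metadata.json`.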
`MODEL_README.md` is a Markdown document that will be packaged with your model.
`./package.sh`

`./reset.sh`
- Reset training but leave data.
- If you're running out of GPU memory, reduce `batch_size` and `valid_batch_size` in `config.yml`.
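For illustration, the relevant lines of `config.yml` might look like the fragment below. The values shown are assumptions for a smaller GPU, not the repo's defaults; halving them is a reasonable first step when you hit out-of-memory errors.

```yaml
# Illustrative excerpt of config.yml (values are examples only).
# Lower these if training runs out of GPU memory.
batch_size: 4096
valid_batch_size: 2048
```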