This directory contains tools for building and evaluating vocabulary mapping files (a.k.a. vmap). These files are used to dynamically reduce the size of the target vocabulary during translation.
A vocabulary map is a text file mapping source N-grams to a list of target tokens. Each line has the following format:
source N-gram\tw_1 w_2 w_3 ... w_K
A source N-gram can be empty (0-gram).
Since the goal is only to predict the vocabulary of a full sentence, the words associated with a given N-gram are not necessarily its actual translation; they are the words needed to express the meaning of the N-gram.
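As an illustration of the format above, here is a minimal sketch of loading a vmap into a dictionary. The function name `load_vmap` and the sample entries are hypothetical and are not part of the provided scripts; the sketch only assumes the tab-separated layout described above.

```python
import io


def load_vmap(lines):
    """Parse lines of the form 'source N-gram<TAB>w_1 w_2 ... w_K'.

    Hypothetical helper for illustration only.
    """
    vmap = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only; the key may be empty (the 0-gram).
        ngram, _, targets = line.partition("\t")
        # Target tokens are separated by spaces.
        vmap.setdefault(ngram, set()).update(targets.split())
    return vmap


# Made-up sample: one 0-gram entry, one unigram, one bigram.
sample = io.StringIO("\tthe . ,\nchat\tcat\nchat noir\tblack cat\n")
vmap = load_vmap(sample)
print(sorted(vmap["chat noir"]))
```

Storing each entry as a set makes it cheap to merge the entries of several N-grams later on.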
Given a sentence S = m_1 ... m_k, the target vocabulary Tvoc is calculated as follows:
Tvoc.insert(vmap[''])
for i = 1, k do
  for p = 1, k-i+1 do
    seq = concat(m_p, ' ', ..., m_{p+i-1})
    Tvoc.insert(vmap[seq])
  end
end
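The pseudocode above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used by the translation system: the vmap is assumed to be a dictionary from N-gram strings to sets of target tokens, and the `max_n` parameter is a hypothetical addition that skips N-grams longer than those the vmap was built with.

```python
def target_vocabulary(tokens, vmap, max_n=None):
    """Collect the target tokens mapped from every N-gram of a sentence."""
    # Start with the tokens attached to the empty N-gram (0-gram).
    tvoc = set(vmap.get("", ()))
    k = len(tokens)
    # max_n is an assumed optimization; by default all lengths are tried.
    longest = k if max_n is None else min(k, max_n)
    for n in range(1, longest + 1):
        for start in range(k - n + 1):
            seq = " ".join(tokens[start:start + n])
            tvoc.update(vmap.get(seq, ()))
    return tvoc


# Made-up vmap entries for illustration.
vmap = {
    "": {".", "the"},
    "le": {"the"},
    "chat": {"cat"},
    "chat noir": {"black", "cat"},
}
print(sorted(target_vocabulary("le chat noir".split(), vmap)))
# → ['.', 'black', 'cat', 'the']
```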
Building a vmap requires a phrase table for your language pair. It can be generated using the provided Dockerfile:
docker build -f Dockerfile . -t build-pt
docker run --rm -v MYCORPUSPATH:/root/corpus build-pt CORPUSNAME SS TT N > phrase-table.gz
where:
MYCORPUSPATH/CORPUSNAME.{SS,TT} are the tokenized source and target files
N is the max N-gram length (3 is usually a good value)
The vmap can be generated with the build-vmap.py script:
usage: build-vmap.py [-h] -pt PHRASE_TABLE [-zg ZERO_GENERATE_LIST]
                     [-ms MAX_SIZE] [-mf MIN_FREQ] [-km KEEP_MEANING]
                     [-tv TGT_VOCAB] [-l LIMIT]

optional arguments:
  -h, --help            show this help message and exit
  -pt PHRASE_TABLE, --phrase_table PHRASE_TABLE
                        phrase table
  -zg ZERO_GENERATE_LIST, --zero_generate_list ZERO_GENERATE_LIST
                        list of terms generated from 0
  -ms MAX_SIZE, --max_size MAX_SIZE
                        maximal size of source sequences
  -mf MIN_FREQ, --min_freq MIN_FREQ
                        minimal frequency of pair
  -km KEEP_MEANING, --keep_meaning KEEP_MEANING
                        number of meaning to keep per entry
  -tv TGT_VOCAB, --tgt_vocab TGT_VOCAB
                        save target vocabulary for max coverage calculation
  -l LIMIT, --limit LIMIT
                        limit the number of entries (for dev)
The vmap is written to standard output, and the 20 most common meanings are written to standard error.
These common meanings can be saved to a file and passed to the -zg option. They are then added to the vmap as tokens that are always considered for a translation. For example:
# Dry run to generate the 20 most common meanings:
python build-vmap.py -pt phrase-table.gz 1> /dev/null 2> zg.txt
# Build the vmap:
python build-vmap.py -pt phrase-table.gz -zg zg.txt 1> vmap.txt 2> /dev/null
You can evaluate the coverage of a vmap on a given test set (tokenized source and target files) with the eval-vmap.py script, optionally passing the TGT_VOCAB file saved by the build-vmap.py script (-tv option).
usage: eval-vmap.py [-h] -vmap VMAP [-tv TV] -src SRC -tgt TGT

optional arguments:
  -h, --help  show this help message and exit
  -vmap VMAP  a vocab mapping file
  -tv TV      target vocabulary file
  -src SRC    source file
  -tgt TGT    target file
The script outputs the coverage ratio of the target sentences by the vocabulary predicted from the source sentences with the vmap, as well as the size of the predicted vocabulary per sentence.
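To make the idea of coverage concrete, here is a rough sketch of one way such figures could be computed. It is not the implementation of eval-vmap.py: coverage is assumed here to be the fraction of target tokens found in the predicted vocabulary, and the names `predict_vocab` and `evaluate` are hypothetical.

```python
def predict_vocab(tokens, vmap):
    """Union of the vmap entries for the 0-gram and every source N-gram."""
    vocab = set(vmap.get("", ()))
    for n in range(1, len(tokens) + 1):
        for start in range(len(tokens) - n + 1):
            vocab.update(vmap.get(" ".join(tokens[start:start + n]), ()))
    return vocab


def evaluate(src_lines, tgt_lines, vmap):
    """Return (token-level coverage ratio, average predicted vocab size)."""
    covered = total = vocab_sum = count = 0
    for src, tgt in zip(src_lines, tgt_lines):
        vocab = predict_vocab(src.split(), vmap)
        tgt_tokens = tgt.split()
        # Assumed definition: a target token is covered if it was predicted.
        covered += sum(1 for tok in tgt_tokens if tok in vocab)
        total += len(tgt_tokens)
        vocab_sum += len(vocab)
        count += 1
    ratio = covered / total if total else 0.0
    avg_vocab = vocab_sum / count if count else 0.0
    return ratio, avg_vocab


# Made-up vmap and sentence pair for illustration.
vmap = {"": {".", "the"}, "chat": {"cat"}, "noir": {"black"}}
ratio, avg_vocab = evaluate(["le chat noir ."], ["the black cat sleeps ."], vmap)
print(ratio, avg_vocab)
# → 0.8 4.0
```

In this toy case, four of the five target tokens are predicted ("sleeps" is missing), and the predicted vocabulary for the sentence holds four tokens.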