Generating vocabulary mapping files

This directory contains tools for building and evaluating vocabulary mapping files (a.k.a. vmap). These files are used to dynamically reduce the size of the target vocabulary during translation.

File format

A vocabulary map is a text file mapping source N-grams to a list of target tokens. Each line has the following format:

source N-gram\tw_1 w_2 w_3 ... w_K

A source N-gram can be empty (0-gram).

Since the goal is only to predict the vocabulary of a full sentence, the words associated with a specific N-gram are not necessarily its actual translation; they are the words needed to express the meaning of the N-gram.
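As an illustration of the format, here is a minimal Python sketch of a loader. It is not taken from the scripts in this directory; the function name load_vmap and the choice of a dict of sets are assumptions made for the example.

```python
from collections import defaultdict


def load_vmap(lines):
    """Parse vmap lines into a dict: source N-gram -> set of target tokens.

    `load_vmap` is a hypothetical helper, not part of build-vmap.py.
    """
    vmap = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only: the left side is the source N-gram
        # (empty for the 0-gram), the right side the space-separated targets.
        ngram, _, targets = line.partition("\t")
        vmap[ngram].update(targets.split())
    return dict(vmap)


sample = ["\tthe of\n", "hello\tbonjour salut\n", "hello world\tbonjour monde\n"]
vmap = load_vmap(sample)
```

The same function accepts an open file object, since it only iterates over lines.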

Vocabulary calculation

Given a sentence S = m_1 ... m_k, the target vocabulary Tvoc is calculated as follows:

Tvoc.insert(vmap[''])

for i = 1, k do
  for p = 1, k-i+1 do
    seq = concat(m_p, ' ', ..., m_{p+i-1})
    Tvoc.insert(vmap[seq])
  end
end
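The pseudocode above can be sketched in Python as follows. This is an illustrative version, assuming the vmap is held in memory as a dict mapping each source N-gram to a set of target tokens; the function name predict_vocab and the optional max_n cap are hypothetical and not part of the original description.

```python
def predict_vocab(tokens, vmap, max_n=None):
    """Union of the vmap entries for the 0-gram and every source N-gram.

    `max_n` is a hypothetical cap on the N-gram length; by default all
    lengths up to the sentence length are tried, as in the pseudocode.
    """
    k = len(tokens)
    if max_n is None:
        max_n = k
    # Tokens attached to the empty N-gram are always included.
    tvoc = set(vmap.get("", ()))
    for n in range(1, min(max_n, k) + 1):
        for start in range(k - n + 1):
            seq = " ".join(tokens[start:start + n])
            tvoc.update(vmap.get(seq, ()))
    return tvoc


vmap = {"": {"."}, "hello": {"bonjour"}, "hello world": {"monde"}}
tvoc = predict_vocab("hello world".split(), vmap)
```

Since a vmap built from a phrase table with max N-gram length N has no longer entries, capping the loop at N avoids lookups that can never match.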

Building a phrase table

Building a vmap requires a phrase table for your language pair. It can be generated using the provided Dockerfile:

docker build -f Dockerfile . -t build-pt
docker run --rm -v MYCORPUSPATH:/root/corpus build-pt CORPUSNAME SS TT N > phrase-table.gz

where:

  • MYCORPUSPATH/CORPUSNAME.SS and MYCORPUSPATH/CORPUSNAME.TT are the tokenized source and target files
  • N is the max N-gram length (3 is usually a good value)

Building a vmap

The vmap can be generated with the build-vmap.py script:

usage: build-vmap.py [-h] -pt PHRASE_TABLE [-zg ZERO_GENERATE_LIST]
                     [-ms MAX_SIZE] [-mf MIN_FREQ] [-km KEEP_MEANING]
                     [-tv TGT_VOCAB] [-l LIMIT]

optional arguments:
  -h, --help            show this help message and exit
  -pt PHRASE_TABLE, --phrase_table PHRASE_TABLE
                        phrase table
  -zg ZERO_GENERATE_LIST, --zero_generate_list ZERO_GENERATE_LIST
                        list of terms generated from 0
  -ms MAX_SIZE, --max_size MAX_SIZE
                        maximal size of source sequences
  -mf MIN_FREQ, --min_freq MIN_FREQ
                        minimal frequency of pair
  -km KEEP_MEANING, --keep_meaning KEEP_MEANING
                        number of meaning to keep per entry
  -tv TGT_VOCAB, --tgt_vocab TGT_VOCAB
                        save target vocabulary for max coverage calculation
  -l LIMIT, --limit LIMIT
                        limit the number of entries (for dev)

The vmap is written to the standard output, and the 20 most common meanings are written to the standard error output.

These common meanings can be saved to a file and passed to the -zg option. They are then added to the vmap as tokens that are always considered during translation. For example:

# Dry run to generate the 20 most common meanings:
python build-vmap.py -pt phrase-table.gz 1> /dev/null 2> zg.txt

# Build the vmap:
python build-vmap.py -pt phrase-table.gz -zg zg.txt 1> vmap.txt 2> /dev/null

Evaluating the vmap

You can evaluate the coverage of a vmap on a given test set (tokenized source and target files) with the eval-vmap.py script. The target vocabulary file optionally saved by build-vmap.py (TGT_VOCAB) can be passed with the -tv option.

usage: eval-vmap.py [-h] -vmap VMAP [-tv TV] -src SRC -tgt TGT

optional arguments:
  -h, --help  show this help message and exit
  -vmap VMAP  a vocab mapping file
  -tv TV      target vocabulary file
  -src SRC    source file
  -tgt TGT    target file

The script outputs the coverage ratio of the target sentences by the vocabulary predicted from the source sentences according to the vmap, and the number of predicted vocabulary tokens per sentence.
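To make these two figures concrete, here is a rough Python sketch of one way to compute them. It does not reproduce eval-vmap.py: it assumes that coverage means the fraction of target tokens found in the predicted vocabulary and that the vocabulary size is averaged over sentences. The names ngram_vocab and evaluate are hypothetical.

```python
def ngram_vocab(tokens, vmap):
    """Predicted target vocabulary: 0-gram entry plus all matching N-grams."""
    tvoc = set(vmap.get("", ()))
    k = len(tokens)
    for n in range(1, k + 1):
        for start in range(k - n + 1):
            tvoc.update(vmap.get(" ".join(tokens[start:start + n]), ()))
    return tvoc


def evaluate(vmap, src_sentences, tgt_sentences):
    """Return (coverage, average predicted vocabulary size per sentence).

    Assumed definition: coverage = covered target tokens / all target tokens.
    """
    covered = total = vocab_size = n_sent = 0
    for src, tgt in zip(src_sentences, tgt_sentences):
        tvoc = ngram_vocab(src.split(), vmap)
        tgt_tokens = tgt.split()
        covered += sum(1 for t in tgt_tokens if t in tvoc)
        total += len(tgt_tokens)
        vocab_size += len(tvoc)
        n_sent += 1
    coverage = covered / total if total else 0.0
    avg_vocab = vocab_size / n_sent if n_sent else 0.0
    return coverage, avg_vocab


vmap = {"": {"."}, "hello": {"bonjour"}}
coverage, avg_vocab = evaluate(vmap, ["hello world"], ["bonjour monde ."])
```

In this toy case two of the three target tokens are covered, and the predicted vocabulary for the single sentence holds two tokens.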