fastText

fastText is a library for efficient computation of word representations and sentence classification.

Requirements

fastText compiles on all modern platforms, including Mac OS and Linux. Because it uses C++11 features, it requires a compiler with C++11 support. Suitable compilers include:

  • (gcc-4.6.3 or newer) or (clang-3.3 or newer)

Compilation is carried out using a Makefile, so you will need to have a working make. For the word-similarity evaluation script you will need:

  • python 2.6 or newer

Building fastText

In order to build fastText, use the following:

$ git clone git@github.com:facebookresearch/fastText.git
$ cd fastText
$ make

This will produce object files for all the classes as well as the main binary fasttext. If you do not plan on using the default system-wide compiler, please update the two macros defined at the beginning of the Makefile (CC and INCLUDES).

Example use cases

This library has two main use cases that we will describe here. These two uses correspond to papers [1] and [2].

Word representation

To compute word vectors as described in [1], run:

$ ./fasttext skipgram -input data.txt -output model

where data.txt is a training file containing UTF-8 encoded text. By default, the word vectors take into account character n-grams of 3 to 6 characters. At the end of optimization, the program saves two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters. It can be used later to compute word vectors or to restart the optimization.
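To make the "character n-grams of 3 to 6 characters" default concrete, here is a minimal Python sketch of n-gram extraction. It is an illustration only, not the library's C++ implementation: the function name char_ngrams is hypothetical, and wrapping the word in `<` and `>` boundary markers is an assumption taken from the description in paper [1]. The minn and maxn parameters mirror the -minn and -maxn options.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word` with lengths minn..maxn.

    Assumption (from paper [1]): the word is wrapped in '<' and '>'
    boundary markers before n-grams are extracted.
    """
    token = "<" + word + ">"
    grams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n over the marked-up word.
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


if __name__ == "__main__":
    # Trigrams only, to keep the output short.
    print(char_ngrams("where", minn=3, maxn=3))
```

A word shorter than the window simply yields no n-grams of that length. How the real library maps n-grams to vectors (for example via the -bucket option) is outside the scope of this sketch.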

Obtaining word vectors for out-of-vocabulary words

The previously trained model can be reused to compute word vectors for out-of-vocabulary words. Provided you have a text file queries.txt containing words for which you want to compute vectors, please issue the following command:

$ ./fasttext print-vectors model.bin < queries.txt

This prints each word and its vector to standard output, one word per line. It can also be used with pipes:

$ cat queries.txt | ./fasttext print-vectors model.bin

See the provided scripts for an example. For instance, running:

$ ./get-vectors.sh

will compile the code, download data, compute the word vectors, and evaluate them on the rare words similarity dataset RW [Thang et al. 2013].
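As a rough illustration of what can be done with the print-vectors output, the Python sketch below parses lines of the form "word v1 v2 ..." and compares two words by cosine similarity. The helper names parse_vectors and cosine are hypothetical, and whether the provided evaluation script works exactly this way is an assumption; cosine similarity is simply a common choice for word-similarity benchmarks.

```python
import math


def parse_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Made-up two-dimensional vectors, for illustration only.
    sample = ["king 1.0 0.0", "queen 1.0 0.0", "apple 0.0 1.0"]
    vecs = parse_vectors(sample)
    print(cosine(vecs["king"], vecs["queen"]))
    print(cosine(vecs["king"], vecs["apple"]))
```

In practice the lines would come from a file holding the saved output of print-vectors rather than from a hard-coded list.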

Text classification

This library can also be used to train supervised text classifiers, for instance for sentiment analysis. In order to train a text classifier using the method described in [2], issue:

$ ./fasttext supervised -input train.txt -output model

where train.txt is a text file containing one training sentence per line along with its labels. By default, we assume that labels are words in a sentence that are prefixed by __label__. This will output two files: model.bin and model.vec. Once the model is trained, you can evaluate it by computing the precision at 1 (P@1) on a test set using:

$ ./fasttext test model.bin test.txt
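For readers unfamiliar with the metric, the following Python sketch shows the usual definition of precision at 1: the fraction of test examples whose single top-ranked prediction is one of the gold labels. The function name precision_at_1 and the sample labels are hypothetical, and the exact bookkeeping inside the test command may differ, so treat this as an approximation of the idea rather than a reimplementation.

```python
def precision_at_1(gold_labels, predicted):
    """Fraction of examples whose top prediction is among the gold labels.

    gold_labels: list of sets of labels, one set per example
    predicted:   list of top-1 predicted labels, one per example
    """
    if len(gold_labels) != len(predicted):
        raise ValueError("gold_labels and predicted must have the same length")
    if not gold_labels:
        return 0.0
    hits = sum(1 for gold, pred in zip(gold_labels, predicted) if pred in gold)
    return hits / float(len(gold_labels))


if __name__ == "__main__":
    gold = [
        {"__label__pos"},
        {"__label__neg"},
        {"__label__pos", "__label__neutral"},
        {"__label__neg"},
    ]
    predicted = ["__label__pos", "__label__pos", "__label__neutral", "__label__neg"]
    print(precision_at_1(gold, predicted))  # 3 of 4 correct -> 0.75
```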

If you want to obtain the most likely label for a piece of text, please use:

$ ./fasttext predict model.bin test.txt

where test.txt contains one piece of text to classify per line. Doing so prints the most likely label for each line to standard output. Please check classification.sh for an example use case. To reproduce the results from paper [2], run classification-results.sh; this will download all the datasets and reproduce the results from Table 1.
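When preparing train.txt or test.txt, the labeled-line format described above (labels are words prefixed by __label__) can be produced and read back with a few lines of Python. This is a minimal sketch under stated assumptions: the helper names format_example and split_example are hypothetical, and placing the labels at the start of the line is a convention chosen here, not a requirement stated in this document.

```python
LABEL_PREFIX = "__label__"  # default value of the -label option


def format_example(labels, text, prefix=LABEL_PREFIX):
    """Build one training line: prefixed labels followed by the text."""
    # Each example must stay on a single line, so collapse any whitespace runs.
    words = text.split()
    tagged = [prefix + label for label in labels]
    return " ".join(tagged + words)


def split_example(line, prefix=LABEL_PREFIX):
    """Split one line into (labels without prefix, remaining words)."""
    labels, words = [], []
    for token in line.split():
        if token.startswith(prefix):
            labels.append(token[len(prefix):])
        else:
            words.append(token)
    return labels, words


if __name__ == "__main__":
    line = format_example(["positive"], "a great movie")
    print(line)
    print(split_example(line))
```

If a different prefix is passed to fastText with -label, the same value should be passed as the prefix argument here.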

Full documentation

The following arguments are mandatory:
  -input      training file path
  -output     output file path

The following arguments are optional:
  -lr         learning rate [0.05]
  -dim        size of word vectors [100]
  -ws         size of the context window [5]
  -epoch      number of epochs [5]
  -minCount   minimal number of word occurrences [1]
  -neg        number of negatives sampled [5]
  -wordNgrams max length of word ngram [1]
  -sampling   sampling distribution {sqrt, log, tf, uni} [log]
  -loss       loss function {ns, hs, softmax} [ns]
  -bucket     number of buckets [2000000]
  -minn       min length of char ngram [3]
  -maxn       max length of char ngram [6]
  -onlyWord   number of words with no ngrams [0]
  -thread     number of threads [12]
  -verbose    how often to print to stdout [1000]
  -t          sampling threshold [0.0001]
  -label      labels prefix [__label__]
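When fastText is driven from a script, it can help to assemble the command line programmatically. The Python sketch below builds an argument list using only the mode names and option names listed in this document; the helper build_command is hypothetical, and the "./fasttext" binary path is an assumption about where the compiled binary lives.

```python
# Option names copied from the listing above; anything else is rejected.
KNOWN_OPTIONS = set([
    "lr", "dim", "ws", "epoch", "minCount", "neg", "wordNgrams", "sampling",
    "loss", "bucket", "minn", "maxn", "onlyWord", "thread", "verbose", "t",
    "label",
])


def build_command(mode, input_path, output_path, binary="./fasttext", **options):
    """Return an argv list such as ['./fasttext', 'skipgram', '-input', ...]."""
    unknown = sorted(set(options) - KNOWN_OPTIONS)
    if unknown:
        raise ValueError("unknown option(s): " + ", ".join(unknown))
    argv = [binary, mode, "-input", input_path, "-output", output_path]
    # Sort option names so the resulting command is deterministic.
    for name in sorted(options):
        argv.extend(["-" + name, str(options[name])])
    return argv


if __name__ == "__main__":
    argv = build_command("supervised", "train.txt", "model",
                         dim=50, epoch=10, wordNgrams=2)
    print(" ".join(argv))
```

The resulting list can be passed to subprocess.check_call(argv) to run the training, assuming the binary has been built in the current directory.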

References

[1] Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information, arXiv 1607.04606, 2016

[2] Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, Bag of Tricks for Efficient Text Classification, arXiv 1607.01759, 2016

Join the fastText community

See the CONTRIBUTING file for information about how to help out.

License

fastText is BSD-licensed. We also provide an additional patent grant.
