Word embeddings update (#159) · paperplanet/gluon-nlp@ab75fde

Commit

Word embeddings update (dmlc#159)

* Mask accidental hits

* Simplify frequent token subsampling

* Remove tqdm dependency

* Simplifications

* Support read from vec format

* Add back DeduplicatedFasttext

* Average the subword embeddings for FastText

* Fix Fasttext hash function for ngrams containing non-ASCII data

std::string in C++ stores char, which is signed on most implementations. The
signedness is implementation defined, so binary Fasttext models trained with
Fasttext builds from different compilers may hash such ngrams differently. Here
we match the behavior of the officially distributed binary models.

* Merge train_word2vec and train_fasttext

* Clean up fasttext evaluation binary script

- Fix support of loading bin Fasttext models without subwords

* Remove waitall

* Only evaluate at end of training by default

* Set mxnet env variables

* Increase number of subword units considered by default

* Update hyperparameters

* Fix cbow

* Use separate batch-size for evaluation

* Fix lint

* Rerun extended_results.ipynb and commit dependent results/*tsv files to repo

* Refactor TokenEmbedding OOV inference

* Clean up TokenEmbedding API docs

* Use GluonNLP load_fasttext_model for word embeddings evaluation script

Instead of custom evaluate_fasttext_bin script

* Add tests

* Remove deprecated to_token_embedding method from train/embedding.py

* Merge TokenEmbedding.extend into TokenEmbedding.__setitem__

Previously __setitem__ was only allowed to update known tokens.

* Use full link to #11314

* Improve test coverage

* Update notebook

* Fix doc

* Cache word ngram hashes

* Move results to dmlc/web-data

* Move candidate_sampler to scripts

* Update --negative doc

* Match old default behavior of TokenEmbedding and add warnings

* Match weight context in UnigramCandidateSampler

* Add Pad test case with empty ndarray input

* Address review comments

* Fix doc and superfluous inheritance
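
The hash-function item above ("Fix Fasttext hash function for ngrams containing non-ASCII data") can be illustrated with a minimal sketch. It assumes the ngram hash is a 32-bit FNV-1a over the UTF-8 bytes of the ngram and that each byte is treated as a signed char before being widened to 32 bits. The names `fnv1a_32` and `ngram_hash` are hypothetical and are not taken from this commit.

```python
FNV_OFFSET_BASIS = 2166136261  # 32-bit FNV-1a offset basis
FNV_PRIME = 16777619           # 32-bit FNV prime
MASK32 = 0xFFFFFFFF            # keep intermediate values in uint32 range


def fnv1a_32(data: bytes, signed_char: bool = True) -> int:
    """32-bit FNV-1a hash (illustrative helper, not the GluonNLP code).

    With signed_char=True, bytes >= 0x80 are reinterpreted as negative
    8-bit values and sign-extended to 32 bits before the XOR, mimicking
    a C++ build where char is signed.
    """
    h = FNV_OFFSET_BASIS
    for byte in data:
        if signed_char and byte >= 0x80:
            byte -= 0x100  # e.g. 0xC3 becomes -61
        # Masking a negative int yields its 32-bit two's complement form.
        h = (h ^ (byte & MASK32)) & MASK32
        h = (h * FNV_PRIME) & MASK32
    return h


def ngram_hash(ngram: str) -> int:
    """Hash an ngram the way a signed-char build would (assumption)."""
    return fnv1a_32(ngram.encode("utf-8"), signed_char=True)


if __name__ == "__main__":
    # ASCII bytes are below 0x80, so signedness does not matter for them.
    print(hex(ngram_hash("a")))
    # Non-ASCII bytes hash differently under the two interpretations.
    print(fnv1a_32(b"\xc3", signed_char=True),
          fnv1a_32(b"\xc3", signed_char=False))
```

For pure ASCII ngrams both interpretations give the same hash, which is why the mismatch only shows up for ngrams containing non-ASCII data.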


leezu authored and szha committed Jul 1, 2018

1 parent f20d281 commit ab75fde

docs/api/embedding.rst

@@ -27,3 +27,4 @@ API Reference
     .. automodule:: gluonnlp.embedding
         :members:
         :imported-members:
+        :special-members: __contains__, __getitem__, __setitem__
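
The three special methods that this hunk adds to the generated docs correspond to the `in`, `[]` read, and `[]` assignment operators. The sketch below is a hypothetical toy class, not `gluonnlp.embedding.TokenEmbedding`; it only illustrates the behavior the commit message describes, where `__setitem__` may also add tokens that were previously unknown. The zero-vector fallback for unknown tokens is an assumption made for this example.

```python
class ToyEmbedding:
    """Hypothetical stand-in used only to illustrate the special methods."""

    def __init__(self, dim):
        self.dim = dim
        self._vectors = {}  # token -> list of floats

    def __contains__(self, token):
        # Supports: token in embedding
        return token in self._vectors

    def __getitem__(self, token):
        # Supports: embedding[token]
        # Unknown tokens fall back to a zero vector in this sketch.
        return self._vectors.get(token, [0.0] * self.dim)

    def __setitem__(self, token, vector):
        # Supports: embedding[token] = vector
        # Known tokens are updated, unknown tokens are added.
        if len(vector) != self.dim:
            raise ValueError("expected a vector of length %d" % self.dim)
        self._vectors[token] = list(vector)


if __name__ == "__main__":
    emb = ToyEmbedding(dim=2)
    emb["hello"] = [1.0, 2.0]       # adds a previously unknown token
    print("hello" in emb, emb["hello"])
```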
