Word embeddings update (dmlc#159)
* Mask accidental hits

* Simplify frequent token subsampling

* Remove tqdm dependency

* Simplifications

* Support read from vec format

* Add back DeduplicatedFasttext

* Average the subword embeddings for FastText

* Fix Fasttext hash function for ngrams containing non-ASCII data

std::string in C++ stores plain char, which is signed on most implementations.
The signedness is implementation-defined, so binary Fasttext models trained
with Fasttext builds from different compilers may hash such ngrams differently.
Match the behavior of the officially distributed binary models here.

* Merge train_word2vec and train_fasttext

* Clean up fasttext evaluation binary script

- Fix support of loading bin Fasttext models without subwords

* Remove waitall

* Only evaluate at end of training by default

* Set mxnet env variables

* Increase number of subword units considered by default

* Update hyperparameters

* Fix cbow

* Use separate batch-size for evaluation

* Fix lint

* Rerun extended_results.ipynb and commit dependent results/*tsv files to repo

* Refactor TokenEmbedding OOV inference

* Clean up TokenEmbedding API docs

* Use GluonNLP load_fasttext_model for word embeddings evaluation script

Instead of custom evaluate_fasttext_bin script

* Add tests

* Remove deprecated to_token_embedding method from train/embedding.py

* Merge TokenEmbedding.extend into TokenEmbedding.__setitem__

Previously __setitem__ was only allowed to update known tokens.

* Use full link to #11314

* Improve test coverage

* Update notebook

* Fix doc

* Cache word ngram hashes

* Move results to dmlc/web-data

* Move candidate_sampler to scripts

* Update --negative doc

* Match old default behavior of TokenEmbedding and add warnings

* Match weight context in UnigramCandidateSampler

* Add Pad test case with empty ndarray input

* Address review comments

* Fix doc and superfluous inheritance
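
The signed-char issue described under the Fasttext hash fix above can be sketched as follows. This is a minimal illustration, not the code from this commit: it assumes the ngram hash is 32-bit FNV-1a over the UTF-8 bytes of the ngram, and the names `fnv1a_32` and `fasttext_ngram_hash` are hypothetical.

```python
FNV_OFFSET_BASIS = 2166136261  # 32-bit FNV offset basis
FNV_PRIME = 16777619           # 32-bit FNV prime
MASK32 = 0xFFFFFFFF


def fnv1a_32(data: bytes, signed_bytes: bool) -> int:
    """32-bit FNV-1a over `data` (hypothetical helper).

    With signed_bytes=True each byte is first interpreted as a signed
    8-bit value and then widened to 32 bits, which mimics casting a
    signed C++ char to uint32_t. Bytes >= 0x80 then sign-extend to
    0xFFFFFFxx instead of 0x000000xx.
    """
    h = FNV_OFFSET_BASIS
    for b in data:
        if signed_bytes and b >= 0x80:
            b -= 0x100  # reinterpret as a negative signed char
        h = (h ^ (b & MASK32)) & MASK32  # widen to 32 bits, then XOR
        h = (h * FNV_PRIME) & MASK32     # multiply modulo 2**32
    return h


def fasttext_ngram_hash(ngram: str) -> int:
    """Hash an ngram with signed-byte semantics (assumed behavior)."""
    return fnv1a_32(ngram.encode("utf-8"), signed_bytes=True)
```

For pure-ASCII ngrams the signed and unsigned variants agree, since every byte is below 0x80. They diverge only when an ngram contains non-ASCII bytes, which is the case the fix addresses.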
leezu authored and szha committed Jul 1, 2018
1 parent f20d281 commit ab75fde
Showing 30 changed files with 1,501 additions and 8,233 deletions.
1 change: 1 addition & 0 deletions docs/api/embedding.rst
@@ -27,3 +27,4 @@ API Reference
 .. automodule:: gluonnlp.embedding
     :members:
     :imported-members:
+    :special-members: __contains__, __getitem__, __setitem__