* Mask accidental hits
* Simplify frequent token subsampling (see the subsampling sketch after this list)
* Remove tqdm dependency
* Simplifications
* Support reading from the vec format
* Add back DeduplicatedFasttext
* Average the subword embeddings for FastText (see the averaging sketch after this list)
* Fix Fasttext hash function for ngrams containing non-ASCII data (see the hash sketch after this list)

  std::string in C++ uses signed char on most implementations. While the behavior is implementation-defined, and binary Fasttext models trained after compiling Fasttext with different compilers may therefore differ, we match the behavior of the officially distributed binary models here.
* Merge train_word2vec and train_fasttext
* Clean up the fasttext evaluation binary script
  - Fix support for loading bin Fasttext models without subwords
* Remove waitall
* Only evaluate at the end of training by default
* Set mxnet env variables
* Increase the number of subword units considered by default
* Update hyperparameters
* Fix cbow
* Use a separate batch size for evaluation
* Fix lint
* Rerun extended_results.ipynb and commit dependent results/*.tsv files to the repo
* Refactor TokenEmbedding OOV inference
* Clean up TokenEmbedding API docs
* Use GluonNLP load_fasttext_model for the word embeddings evaluation script, instead of the custom evaluate_fasttext_bin script
* Add tests
* Remove deprecated to_token_embedding method from train/embedding.py
* Merge TokenEmbedding.extend into TokenEmbedding.__setitem__ (see the usage sketch after this list)

  Previously __setitem__ was only allowed to update known tokens.
* Use full link to #11314
* Improve test coverage
* Update notebook
* Fix doc
* Cache word ngram hashes
* Move results to dmlc/web-data
* Move candidate_sampler to scripts
* Update --negative doc
* Match old default behavior of TokenEmbedding and add warnings
* Match weight context in UnigramCandidateSampler
* Add Pad test case with empty ndarray input
* Address review comments
* Fix doc and superfluous inheritance
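Subsampling sketch: a minimal illustration of the frequent-token subsampling rule from the word2vec paper, which the "Simplify frequent token subsampling" item refers to. The threshold `t = 1e-4`, the function name, and the exact formula variant used in this repo are assumptions, not the committed code.

```python
import numpy as np

def subsample(tokens, counts, num_corpus_tokens, t=1e-4, rng=None):
    """Word2vec-style frequent-token subsampling (sketch).

    A token with relative corpus frequency f = count / num_corpus_tokens
    is kept with probability min(1, sqrt(t / f)), so very frequent tokens
    ("the", "of", ...) are discarded most of the time while rare tokens
    are always kept. `counts` maps token -> corpus count.
    """
    rng = rng or np.random.default_rng()
    kept = []
    for tok in tokens:
        f = counts[tok] / num_corpus_tokens
        keep_prob = min(1.0, np.sqrt(t / f))
        if rng.random() < keep_prob:
            kept.append(tok)
    return kept

# Example: "the" (f = 0.5) is kept with probability sqrt(1e-4 / 0.5) ~ 1.4%,
# while "rare" (f = 3e-5) is always kept.
print(subsample(["the", "rare", "the"], {"the": 5000, "rare": 3}, 10000))
```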
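Averaging sketch: what "Average the subword embeddings for FastText" amounts to, with hypothetical inputs (GluonNLP's actual classes differ). A fastText word representation combines the word's own vector with the vectors of its hashed character n-grams; averaging rather than summing keeps the magnitude of the result independent of how many n-grams a word has.

```python
import numpy as np

def fasttext_word_vector(word_vec, subword_vecs):
    """Average the word vector with its subword (n-gram) vectors.

    Long words produce more n-grams than short ones; dividing by the
    number of contributing vectors keeps representations on a
    comparable scale across words.
    """
    return np.mean([word_vec] + list(subword_vecs), axis=0)
```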
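Hash sketch: the signed-char issue described in the "Fix Fasttext hash function" item, transcribed to Python. fastText's FNV-1a-style hash computes `h = h ^ uint32_t(int8_t(byte))` in C++, so UTF-8 bytes at or above 0x80 (which appear only in non-ASCII data) become negative as signed chars and are sign-extended before the XOR; a port that treats bytes as unsigned hashes such n-grams differently and silently looks up the wrong subword buckets. The function name here is illustrative.

```python
def fasttext_ngram_hash(ngram: str) -> int:
    """FNV-1a-style hash matching the officially distributed
    fastText binary models (signed-char byte interpretation)."""
    h = 2166136261
    for b in ngram.encode('utf-8'):
        signed = b - 256 if b >= 128 else b   # reinterpret byte as int8_t
        h ^= signed & 0xFFFFFFFF              # sign-extend to uint32_t
        h = (h * 16777619) & 0xFFFFFFFF       # FNV prime, mod 2**32
    return h
```

The n-gram's bucket index is then derived from this hash modulo the number of buckets, which is why a mismatched hash corrupts subword lookups for non-ASCII words without raising any error.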
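Usage sketch for the merged `TokenEmbedding.__setitem__`, assuming GluonNLP's `nlp.embedding.create` factory and the `idx_to_vec` attribute; the embedding source name is illustrative. After the merge, assigning to an unknown token extends the embedding rather than raising an error, which previously required a separate `extend` call.

```python
import mxnet as mx
import gluonnlp as nlp

emb = nlp.embedding.create('fasttext', source='wiki.simple')

# Updating a known token worked before; assigning to an unknown token
# now appends a new row to the embedding matrix instead of failing.
dim = emb.idx_to_vec.shape[1]
emb['an-unseen-token'] = mx.nd.zeros(dim)
```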