A curated list of pretrained sentence and word embedding models

faithxuyanyan/awesome-sentence-embedding

 
 

awesome-sentence-embedding

Table of Contents

About This Repo

  • There are already some awesome-lists for word embeddings and sentence embeddings, but all of them are outdated and, more importantly, incomplete.
  • This repo will also be incomplete, but I'll try my best to find and include all the papers with pretrained models.
  • This is not a typical awesome list because it uses tables, but I think that's fine and much better than one huge list.
  • If you find a mistake or a missing paper, please send a pull request and help me keep this list up to date.
  • Enjoy!

General Framework

  • Almost all sentence embedding methods work like this:
  • Given some sort of word embeddings and an optional encoder (for example, an LSTM), they obtain contextualized word embeddings.
  • Then they define some sort of pooling (it can be as simple as last pooling, i.e. keeping only the final token's vector).
  • Based on the pooled vector, they either use it directly for a supervised classification task (like InferSent) or generate a target sequence (like Skip-Thought).
  • So, in general, there are many sentence embeddings that you have never heard of: simply mean-pooling over any word embeddings already gives you a sentence embedding!
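
The pooling step described above can be sketched in a few lines of plain Python. This is only an illustration, not code from any of the listed papers: the tiny `EMBEDDINGS` table, its values, and the function names are all made up for the example, and there is no encoder, so the "contextualized" vectors are just the raw word vectors.

```python
from typing import Dict, List, Sequence

Vector = List[float]

# Hypothetical 3-dimensional embedding table; real tables have
# hundreds of dimensions and many thousands of words.
EMBEDDINGS: Dict[str, Vector] = {
    "the": [0.0, 1.0, 2.0],
    "cat": [6.0, 1.0, -1.0],
    "sat": [3.0, 4.0, 2.0],
}


def embed_tokens(tokens: Sequence[str], table: Dict[str, Vector]) -> List[Vector]:
    """Look up each token and silently skip out-of-vocabulary words."""
    return [table[t] for t in tokens if t in table]


def _require_vectors(vectors: List[Vector]) -> None:
    if not vectors:
        raise ValueError("cannot pool an empty sequence of vectors")


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average every dimension over all tokens."""
    _require_vectors(vectors)
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def max_pool(vectors: List[Vector]) -> Vector:
    """Keep the largest value of every dimension."""
    _require_vectors(vectors)
    return [max(dim) for dim in zip(*vectors)]


def last_pool(vectors: List[Vector]) -> Vector:
    """Keep only the final token's vector."""
    _require_vectors(vectors)
    return list(vectors[-1])


token_vectors = embed_tokens(["the", "cat", "sat"], EMBEDDINGS)
sentence_mean = mean_pool(token_vectors)  # [3.0, 2.0, 1.0]
sentence_max = max_pool(token_vectors)    # [6.0, 4.0, 2.0]
sentence_last = last_pool(token_vectors)  # [3.0, 4.0, 2.0]
```

With a real encoder, `token_vectors` would be replaced by the encoder's per-token outputs and the pooling functions would stay the same.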

Word Embeddings

  • Note: don't worry about the language of the training code. Except for the subword models, you can almost always load the pretrained embedding table in the framework of your choice and ignore the training code.
| date | paper | citation count | training code | pretrained models |
| --- | --- | --- | --- | --- |
| - | GloVe: Global Vectors for Word Representation | N/A | C | GloVe |
| - | Dependency-Based Word Embeddings | N/A | C++ | word2vecf |
| - | Learning Word Meta-Embeddings | N/A | - | Meta-Emb(broken) |
| - | Dict2vec : Learning Word Embeddings using Lexical Dictionaries | N/A | C++ | Dict2vec |
| - | WebVectors: A Toolkit for Building Web Interfaces for Vector Semantic Models | N/A | - | RusVectōrēs |
| - | SensEmbed: Learning Sense Embeddings for Word and Relational Similarity | N/A | - | SensEmbed |
| 2013/01 | Efficient Estimation of Word Representations in Vector Space | 998 | C | Word2Vec |
| 2014/12 | Word Representations via Gaussian Embedding | 134 | Cython | - |
| 2014/?? | A Probabilistic Model for Learning Multi-Prototype Word Embeddings | 88 | DMTK | - |
| 2015/06 | From Paraphrase Database to Compositional Paraphrase Model and Back | 0 | Theano | PARAGRAM |
| 2015/06 | Non-distributional Word Vector Representations | 42 | Python | WordFeat |
| 2015/06 | Sparse Overcomplete Word Vector Representations | 78 | C++ | - |
| 2015/?? | Joint Learning of Character and Word Embeddings | 111 | C | - |
| 2015/?? | Topical Word Embeddings | 156 | Cython | |
| 2016/02 | Swivel: Improving Embeddings by Noticing What's Missing | 47 | TF | - |
| 2016/03 | Counter-fitting Word Vectors to Linguistic Constraints | 130 | Python | counter-fitting(broken) |
| 2016/05 | Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec | 35 | Chainer | - |
| 2016/06 | Siamese CBOW: Optimizing Word Embeddings for Sentence Representations | 109 | Theano | Siamese CBOW |
| 2016/06 | Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations | 32 | Go | lexvec |
| 2016/07 | Enriching Word Vectors with Subword Information | 999+ | C++ | fastText |
| 2016/08 | Morphological Priors for Probabilistic Neural Word Embeddings | 23 | Theano | - |
| 2016/11 | A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks | 194 | C++ | charNgram2vec |
| 2016/12 | ConceptNet 5.5: An Open Multilingual Graph of General Knowledge | 186 | Python | Numberbatch |
| 2017/02 | Offline bilingual word vectors, orthogonal transformations and the inverted softmax | 167 | Python | - |
| 2017/04 | Multimodal Word Distributions | 30 | TF | word2gm |
| 2017/05 | Poincaré Embeddings for Learning Hierarchical Representations | 145 | Pytorch | - |
| 2017/06 | Context encoders as a simple but powerful extension of word2vec | 4 | Python | - |
| 2017/06 | Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints | 71 | TF | Attract-Repel |
| 2017/08 | Learning Chinese Word Representations From Glyphs Of Characters | 18 | C | - |
| 2017/08 | Making Sense of Word Embeddings | 23 | Python | sensegram |
| 2017/09 | Hash Embeddings for Efficient Word Representations | 7 | Keras | - |
| 2017/10 | BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages | 22 | Gensim | BPEmb |
| 2017/11 | SPINE: SParse Interpretable Neural Embeddings | 16 | Pytorch | SPINE |
| 2017/?? | Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics | 13 | C | - |
| 2017/?? | Joint Embeddings of Chinese Words, Characters, and Fine-grained Subcharacter Components | 26 | C | - |
| 2017/?? | AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP | 40 | Gensim | AraVec |
| 2018/04 | Dynamic Meta-Embeddings for Improved Sentence Representations | 19 | Pytorch | DME/CDME |
| 2018/04 | Representation Tradeoffs for Hyperbolic Embeddings | 32 | Pytorch | h-MDS |
| 2018/05 | Analogical Reasoning on Chinese Morphological and Semantic Relations | 35 | - | ChineseWordVectors |
| 2018/06 | Probabilistic FastText for Multi-Sense Word Embeddings | 12 | C++ | Probabilistic FastText |
| 2018/09 | Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks | 1 | TF | SynGCN |
| 2018/09 | FRAGE: Frequency-Agnostic Word Representation | 27 | Pytorch | - |
| 2018/12 | Wikipedia2Vec: An Optimized Tool for Learning Embeddings of Words and Entities from Wikipedia | 3 | Cython | Wikipedia2Vec |
| 2018/?? | cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information | 18 | C++ | - |
| 2018/?? | Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings | 21 | - | ChineseEmbedding |
| 2019/02 | VCWE: Visual Character-Enhanced Word Embeddings | 0 | Pytorch | VCWE |
| 2019/05 | Learning Cross-lingual Embeddings from Twitter via Distant Supervision | 1 | Text | - |
| 2019/08 | ViCo: Word Embeddings from Visual Co-occurrences | 0 | Pytorch | ViCo |
| 2019/08 | An Unsupervised Character-Aware Neural Approach to Word and Context Representation Learning | 2 | TF | - |
| 2019/?? | Unsupervised word embeddings capture latent knowledge from materials science literature | 3 | Gensim | - |
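
As the note at the top of this section says, the pretrained table can usually be read without touching the training code. Below is a minimal sketch of that, assuming the vectors are distributed as plain text with one `word v1 v2 ...` entry per line, optionally preceded by a word2vec-style `<vocab_size> <dim>` header. That layout is an assumption; check the format of the specific model you download. The function name and the sample data are hypothetical.

```python
import io
from typing import Dict, List, TextIO


def load_text_embeddings(handle: TextIO) -> Dict[str, List[float]]:
    """Parse a whitespace-separated text embedding file into a dict."""
    table: Dict[str, List[float]] = {}
    for line_no, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        # Assumed word2vec-style header: two integers, "<vocab_size> <dim>".
        # (A genuine 1-dimensional entry on the first line whose word is
        # all digits would be misread as a header.)
        if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        table[word] = [float(v) for v in values]
    return table


# Made-up sample standing in for a downloaded file.
SAMPLE = "2 3\nking 0.5 0.1 -0.2\nqueen 0.4 0.2 -0.1\n"
table = load_text_embeddings(io.StringIO(SAMPLE))
```

For a real file, pass `open(path, encoding="utf-8")` instead of the `io.StringIO` sample. Binary formats and subword models need their own loaders.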

OOV Handling

Contextualized Word Embeddings

  • Note: all the unofficial implementations can load the official pretrained models.
| date | paper | citation count | code | pretrained models |
| --- | --- | --- | --- | --- |
| - | Language Models are Unsupervised Multitask Learners | N/A | TF; Pytorch, TF2.0; Keras | GPT-2 |
| 2017/08 | Learned in Translation: Contextualized Word Vectors | 265 | Pytorch; Keras | CoVe |
| 2018/01 | Universal Language Model Fine-tuning for Text Classification | 383 | Pytorch | ULMFit(English, Zoo) |
| 2018/02 | Deep contextualized word representations | 999+ | Pytorch; TF | ELMO(AllenNLP, TF-Hub) |
| 2018/04 | Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling | 10 | Pytorch | LD-Net |
| 2018/07 | Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation | 39 | Pytorch | ELMo |
| 2018/08 | Direct Output Connection for a High-Rank Language Model | 11 | Pytorch | DOC |
| 2018/10 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | 999+ | TF; Keras; Pytorch, TF2.0; MXNet; PaddlePaddle; TF; Keras | BERT(BERT, ERNIE, KoBERT) |
| 2018/?? | Contextual String Embeddings for Sequence Labeling | 100 | Pytorch | Flair |
| 2018/?? | Improving Language Understanding by Generative Pre-Training | 441 | TF; Keras; Pytorch, TF2.0 | GPT |
| 2019/01 | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 101 | TF; Pytorch; Pytorch, TF2.0 | Transformer-XL |
| 2019/01 | BioBERT: pre-trained biomedical language representation model for biomedical text mining | 62 | TF | BioBERT |
| 2019/01 | Multi-Task Deep Neural Networks for Natural Language Understanding | 76 | Pytorch | MT-DNN |
| 2019/01 | Cross-lingual Language Model Pretraining | 91 | Pytorch; Pytorch, TF2.0 | XLM |
| 2019/02 | Efficient Contextual Representation Learning Without Softmax Layer | 1 | Pytorch | - |
| 2019/03 | SciBERT: Pretrained Contextualized Embeddings for Scientific Text | 1 | Pytorch, TF | SciBERT |
| 2019/04 | Publicly Available Clinical BERT Embeddings | 15 | Text | clinicalBERT |
| 2019/04 | ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission | 8 | Pytorch | ClinicalBERT |
| 2019/05 | ERNIE: Enhanced Language Representation with Informative Entities | 14 | Pytorch | ERNIE |
| 2019/05 | Unified Language Model Pre-training for Natural Language Understanding and Generation | 19 | Pytorch | UniLM |
| 2019/05 | HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization | 3 | - | |
| 2019/06 | Pre-Training with Whole Word Masking for Chinese BERT | 3 | Pytorch, TF | BERT-wwm |
| 2019/06 | XLNet: Generalized Autoregressive Pretraining for Language Understanding | 86 | TF; Pytorch, TF2.0 | XLNet |
| 2019/07 | ERNIE 2.0: A Continual Pre-training Framework for Language Understanding | 1 | PaddlePaddle | ERNIE 2.0 |
| 2019/07 | RoBERTa: A Robustly Optimized BERT Pretraining Approach | 45 | Pytorch; Pytorch, TF2.0 | RoBERTa |
| 2019/07 | SpanBERT: Improving Pre-training by Representing and Predicting Spans | 8 | Pytorch | SpanBERT |
| 2019/09 | MultiFiT: Efficient Multi-lingual Language Model Fine-tuning | 0 | Pytorch | - |
| 2019/09 | ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | 0 | TF | - |
| 2019/09 | Extreme Language Model Compression with Optimal Subwords and Shared Projections | 0 | - | |
| 2019/09 | UNITER: Learning UNiversal Image-TExt Representations | 0 | - | |
| 2019/09 | MULE: Multimodal Universal Language Embedding | 0 | - | |
| 2019/09 | Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks | 0 | - | |
| 2019/09 | Knowledge Enhanced Contextual Word Representations | 0 | - | |
| 2019/09 | TinyBERT: Distilling BERT for Natural Language Understanding | 0 | - | |
| 2019/09 | K-BERT: Enabling Language Representation with Knowledge Graph | 0 | - | |
| 2019/09 | Subword ELMo | 0 | Pytorch | - |
| 2019/10 | DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | 0 | Pytorch, TF2.0 | DistilBERT |

Pooling Methods

Encoders

| date | paper | citation count | code | model_name |
| --- | --- | --- | --- | --- |
| - | Incremental Domain Adaptation for Neural Machine Translation in Low-Resource Settings | N/A | Python | AraSIF |
| 2014/05 | Distributed Representations of Sentences and Documents | 999+ | Pytorch; Python | Doc2Vec |
| 2014/11 | Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models | 497 | Theano; Pytorch | VSE |
| 2015/06 | Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books | 307 | Theano; TF; Pytorch, Torch | SkipThought |
| 2015/11 | Order-Embeddings of Images and Language | 197 | Theano | order-embedding |
| 2015/11 | Towards Universal Paraphrastic Sentence Embeddings | 256 | Theano | ParagramPhrase |
| 2015/?? | From Word Embeddings to Document Distances | 523 | C, Python | Word Mover's Distance |
| 2016/02 | Learning Distributed Representations of Sentences from Unlabelled Data | 248 | Python | FastSent |
| 2016/07 | Charagram: Embedding Words and Sentences via Character n-grams | 86 | Theano | Charagram |
| 2016/11 | Learning Generic Sentence Representations Using Convolutional Neural Networks | 42 | Theano | ConvSent |
| 2017/03 | Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features | 143 | C++ | Sent2Vec |
| 2017/04 | Learning to Generate Reviews and Discovering Sentiment | 173 | TF; Pytorch; Pytorch | Sentiment Neuron |
| 2017/05 | Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings | 40 | Theano | GRAN |
| 2017/05 | Supervised Learning of Universal Sentence Representations from Natural Language Inference Data | 502 | Pytorch | InferSent |
| 2017/07 | VSE++: Improving Visual-Semantic Embeddings with Hard Negatives | 57 | Pytorch | VSE++ |
| 2017/08 | Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm | 165 | Keras; Pytorch | DeepMoji |
| 2017/09 | StarSpace: Embed All The Things! | 56 | C++ | StarSpace |
| 2017/10 | DisSent: Learning Sentence Representations from Explicit Discourse Relations | 39 | Pytorch | DisSent |
| 2017/11 | Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations | 47 | Theano | para-nmt |
| 2017/11 | Dual-Path Convolutional Image-Text Embedding with Instance Loss | 5 | Matlab | Image-Text-Embedding |
| 2018/03 | Universal Sentence Encoder | 118 | TF-Hub | USE |
| 2018/03 | An efficient framework for learning sentence representations | 64 | TF | Quick-Thought |
| 2018/04 | End-Task Oriented Textual Entailment via Deep Explorations of Inter-Sentence Interactions | 2 | Theano | DEISTE |
| 2018/04 | Learning general purpose distributed sentence representations via large scale multi-task learning | 94 | Pytorch | GenSen |
| 2018/06 | Embedding Text in Hyperbolic Spaces | 18 | TF | HyperText |
| 2018/07 | Representation Learning with Contrastive Predictive Coding | 87 | Keras | CPC |
| 2018/08 | Context Mover’s Distance & Barycenters: Optimal transport of contexts for building representations | 2 | Python | CMD |
| 2018/09 | Learning Universal Sentence Representations with Mean-Max Attention Autoencoder | 2 | TF | Mean-MaxAAE |
| 2018/10 | Improving Sentence Representations with Consensus Maximisation | 0 | - | Multi-view |
| 2018/10 | BioSentVec: creating sentence embeddings for biomedical texts | 13 | Python | BioSentVec |
| 2018/10 | Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model | 5 | TF-Hub | USE-xling |
| 2018/11 | Word Mover's Embedding: From Word2Vec to Document Embedding | 14 | C, Python | WordMoversEmbeddings |
| 2018/11 | A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks | 20 | Pytorch | HMTL |
| 2018/12 | Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond | 40 | Pytorch | LASER |
| 2018/?? | Convolutional Neural Network for Universal Sentence Embeddings | 0 | Theano | CSE |
| 2019/01 | No Training Required: Exploring Random Encoders for Sentence Classification | 13 | Pytorch | randsent |
| 2019/02 | CBOW Is Not All You Need: Combining CBOW with the Compositional Matrix Space Model | 0 | Pytorch | CMOW |
| 2019/07 | GLOSS: Generative Latent Optimization of Sentence Representations | 0 | - | GLOSS |
| 2019/07 | Multilingual Universal Sentence Encoder | 1 | TF-Hub | MultilingualUSE |
| 2019/08 | Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks | 1 | Pytorch | Sentence-BERT |

Evaluation

Misc

Vector Mapping

Articles
