awesome-sentence-embedding

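License: GPL-3.0
Language: Python
Category: Neural Networks / Artificial Intelligence, Natural Language Processing
Software type: Open source
Region: Unknown
Submitted by: 程振濂
Operating system: Cross-platform
Open-source organization:
Target audience: Unknown

Software Overview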

awesome-sentence-embedding

A curated list of pretrained sentence and word embedding models

About This Repo

  • There are several awesome lists for word embeddings and sentence embeddings, but all of them are outdated and, more importantly, incomplete.
  • This repo will also be incomplete, but I'll try my best to find and include every paper that comes with pretrained models.
  • This is not a typical awesome list because it uses tables, but I think that is fine and much better than one huge list.
  • If you find a mistake or a missing paper, please send a pull request and help me keep this list up to date.
  • Enjoy!

General Framework

  • Almost all sentence embedding methods work like this:
  • Given some word embeddings and an optional encoder (for example, an LSTM), they obtain contextualized word embeddings.
  • Then they define a pooling step (it can be as simple as last pooling, i.e. taking the final token's vector).
  • On top of the pooled vector, they either train directly on a supervised classification task (like InferSent) or generate a target sequence (like Skip-Thought).
  • So, in general, there are many sentence embeddings you have never heard of: mean-pooling over any word embedding already gives you a sentence embedding!
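
The pooling step described above can be sketched in a few lines of NumPy. This is only an illustration: the tiny `EMBEDDINGS` table, the helper names `embed_tokens` and `pool`, and the choice to map unknown words to zero vectors are all assumptions made for the example, not part of any model listed here.

```python
import numpy as np

# Hypothetical 3-dimensional embedding table; a real one would come from
# one of the pretrained models listed in this repo.
EMBEDDINGS = {
    "the": np.array([0.1, 0.3, -0.2]),
    "cat": np.array([0.7, -0.1, 0.4]),
    "sat": np.array([-0.3, 0.5, 0.2]),
}


def embed_tokens(tokens, table, dim=3):
    """Stack word vectors into a (num_tokens, dim) matrix.

    Unknown words are mapped to a zero vector (an assumption for this sketch).
    """
    return np.stack([table.get(tok, np.zeros(dim)) for tok in tokens])


def pool(token_vectors, method="mean"):
    """Reduce a (num_tokens, dim) matrix to a single (dim,) sentence vector."""
    if method == "mean":
        return token_vectors.mean(axis=0)
    if method == "max":
        return token_vectors.max(axis=0)
    if method == "last":
        return token_vectors[-1]
    raise ValueError(f"unknown pooling method: {method}")


vectors = embed_tokens("the cat sat".split(), EMBEDDINGS)
sentence_embedding = pool(vectors, "mean")
```

Swapping `"mean"` for `"max"` or `"last"` changes only the reduction; the word vectors and the output dimensionality stay the same.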

Word Embeddings

  • Note: don't worry about the language of the training code. Except for the subword models, you can almost always load the pretrained embedding table in the framework of your choice and ignore the training code.
| Date | Paper | Citation count | Training code | Pretrained models |
|---|---|---|---|---|
| - | WebVectors: A Toolkit for Building Web Interfaces for Vector Semantic Models | N/A | - | RusVectōrēs |
| 2013/01 | Efficient Estimation of Word Representations in Vector Space | 999+ | C | Word2Vec |
| 2014/12 | Word Representations via Gaussian Embedding | 221 | Cython | - |
| 2014/?? | A Probabilistic Model for Learning Multi-Prototype Word Embeddings | 127 | DMTK | - |
| 2014/?? | Dependency-Based Word Embeddings | 719 | C++ | word2vecf |
| 2014/?? | GloVe: Global Vectors for Word Representation | 999+ | C | GloVe |
| 2015/06 | Sparse Overcomplete Word Vector Representations | 129 | C++ | - |
| 2015/06 | From Paraphrase Database to Compositional Paraphrase Model and Back | 3 | Theano | PARAGRAM |
| 2015/06 | Non-distributional Word Vector Representations | 68 | Python | WordFeat |
| 2015/?? | Joint Learning of Character and Word Embeddings | 195 | C | - |
| 2015/?? | SensEmbed: Learning Sense Embeddings for Word and Relational Similarity | 249 | - | SensEmbed |
| 2015/?? | Topical Word Embeddings | 292 | Cython | |
| 2016/02 | Swivel: Improving Embeddings by Noticing What's Missing | 61 | TF | - |
| 2016/03 | Counter-fitting Word Vectors to Linguistic Constraints | 232 | Python | counter-fitting (broken) |
| 2016/05 | Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec | 91 | Chainer | - |
| 2016/06 | Siamese CBOW: Optimizing Word Embeddings for Sentence Representations | 166 | Theano | Siamese CBOW |
| 2016/06 | Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations | 58 | Go | lexvec |
| 2016/07 | Enriching Word Vectors with Subword Information | 999+ | C++ | fastText |
| 2016/08 | Morphological Priors for Probabilistic Neural Word Embeddings | 34 | Theano | - |
| 2016/11 | A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks | 359 | C++ | charNgram2vec |
| 2016/12 | ConceptNet 5.5: An Open Multilingual Graph of General Knowledge | 604 | Python | Numberbatch |
| 2016/?? | Learning Word Meta-Embeddings | 58 | - | Meta-Emb (broken) |
| 2017/02 | Offline bilingual word vectors, orthogonal transformations and the inverted softmax | 336 | Python | - |
| 2017/04 | Multimodal Word Distributions | 57 | TF | word2gm |
| 2017/05 | Poincaré Embeddings for Learning Hierarchical Representations | 413 | Pytorch | - |
| 2017/06 | Context encoders as a simple but powerful extension of word2vec | 13 | Python | - |
| 2017/06 | Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints | 99 | TF | Attract-Repel |
| 2017/08 | Learning Chinese Word Representations From Glyphs Of Characters | 44 | C | - |
| 2017/08 | Making Sense of Word Embeddings | 92 | Python | sensegram |
| 2017/09 | Hash Embeddings for Efficient Word Representations | 25 | Keras | - |
| 2017/10 | BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages | 91 | Gensim | BPEmb |
| 2017/11 | SPINE: SParse Interpretable Neural Embeddings | 48 | Pytorch | SPINE |
| 2017/?? | AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP | 161 | Gensim | AraVec |
| 2017/?? | Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics | 25 | C | - |
| 2017/?? | Dict2vec : Learning Word Embeddings using Lexical Dictionaries | 49 | C++ | Dict2vec |
| 2017/?? | Joint Embeddings of Chinese Words, Characters, and Fine-grained Subcharacter Components | 63 | C | - |
| 2018/04 | Representation Tradeoffs for Hyperbolic Embeddings | 120 | Pytorch | h-MDS |
| 2018/04 | Dynamic Meta-Embeddings for Improved Sentence Representations | 60 | Pytorch | DME/CDME |
| 2018/05 | Analogical Reasoning on Chinese Morphological and Semantic Relations | 128 | - | ChineseWordVectors |
| 2018/06 | Probabilistic FastText for Multi-Sense Word Embeddings | 39 | C++ | Probabilistic FastText |
| 2018/09 | Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks | 3 | TF | SynGCN |
| 2018/09 | FRAGE: Frequency-Agnostic Word Representation | 64 | Pytorch | - |
| 2018/12 | Wikipedia2Vec: An Optimized Tool for Learning Embeddings of Words and Entities from Wikipedia | 17 | Cython | Wikipedia2Vec |
| 2018/?? | Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings | 106 | - | ChineseEmbedding |
| 2018/?? | cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information | 45 | C++ | - |
| 2019/02 | VCWE: Visual Character-Enhanced Word Embeddings | 5 | Pytorch | VCWE |
| 2019/05 | Learning Cross-lingual Embeddings from Twitter via Distant Supervision | 2 | Text | - |
| 2019/08 | An Unsupervised Character-Aware Neural Approach to Word and Context Representation Learning | 5 | TF | - |
| 2019/08 | ViCo: Word Embeddings from Visual Co-occurrences | 7 | Pytorch | ViCo |
| 2019/11 | Spherical Text Embedding | 25 | C | - |
| 2019/?? | Unsupervised word embeddings capture latent knowledge from materials science literature | 150 | Gensim | - |
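
As the note at the top of this section says, a pretrained table can usually be read without touching the training code. The sketch below assumes the common plain-text layout used by many word2vec- and GloVe-style releases, where each line is `word v1 v2 ... vn` and an optional first line holds the vocabulary size and dimension. The function name `load_text_embeddings` and the in-memory sample are hypothetical; check the actual file format of whichever model you download.

```python
import io

import numpy as np


def load_text_embeddings(handle):
    """Parse lines of the form `word v1 v2 ... vn` into (word -> row index, matrix).

    Lines with fewer than three fields are skipped, which drops blank lines and
    a word2vec-style "vocab_size dim" header. This assumes vectors have at
    least two dimensions.
    """
    words, rows = [], []
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 3:
            continue
        words.append(parts[0])
        rows.append([float(value) for value in parts[1:]])
    index = {word: i for i, word in enumerate(words)}
    return index, np.asarray(rows, dtype=np.float32)


# Stand-in for a real file; with a downloaded table you would pass
# open(path, encoding="utf-8") instead.
sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
word_to_row, matrix = load_text_embeddings(sample)
queen_vector = matrix[word_to_row["queen"]]
```

A plain lookup like this only covers words present in the table. Subword models such as fastText build vectors for unseen words from character n-grams, which is why the note excludes them.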

OOV Handling

Contextualized Word Embeddings

  • Note: all the unofficial implementations can load the official pretrained models.
| Date | Paper | Citation count | Code | Pretrained models |
|---|---|---|---|---|
| - | Language Models are Unsupervised Multitask Learners | N/A | TF; Pytorch, TF2.0; Keras | GPT-2(117M, 124M, 345M, 355M, 774M, 1558M) |
| 2017/08 | Learned in Translation: Contextualized Word Vectors | 524 | Pytorch; Keras | CoVe |
| 2018/01 | Universal Language Model Fine-tuning for Text Classification | 167 | Pytorch | ULMFit(English, Zoo) |
| 2018/02 | Deep contextualized word representations | 999+ | Pytorch; TF | ELMO(AllenNLP, TF-Hub) |
| 2018/04 | Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling | 26 | Pytorch | LD-Net |
| 2018/07 | Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation | 120 | Pytorch | ELMo |
| 2018/08 | Direct Output Connection for a High-Rank Language Model | 24 | Pytorch | DOC |
| 2018/10 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | 999+ | TF; Keras; Pytorch, TF2.0; MXNet; PaddlePaddle; TF; Keras | BERT(BERT, ERNIE, KoBERT) |
| 2018/?? | Contextual String Embeddings for Sequence Labeling | 486 | Pytorch | Flair |
| 2018/?? | Improving Language Understanding by Generative Pre-Training | 999+ | TF; Keras; Pytorch, TF2.0 | GPT |
| 2019/01 | Multi-Task Deep Neural Networks for Natural Language Understanding | 364 | Pytorch | MT-DNN |
| 2019/01 | BioBERT: pre-trained biomedical language representation model for biomedical text mining | 634 | TF | BioBERT |
| 2019/01 | Cross-lingual Language Model Pretraining | 639 | Pytorch; Pytorch, TF2.0 | XLM |
| 2019/01 | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | 754 | TF; Pytorch; Pytorch, TF2.0 | Transformer-XL |
| 2019/02 | Efficient Contextual Representation Learning Without Softmax Layer | 2 | Pytorch | - |
| 2019/03 | SciBERT: Pretrained Contextualized Embeddings for Scientific Text | 124 | Pytorch, TF | SciBERT |
| 2019/04 | Publicly Available Clinical BERT Embeddings | 229 | Text | clinicalBERT |
| 2019/04 | ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission | 84 | Pytorch | ClinicalBERT |
| 2019/05 | ERNIE: Enhanced Language Representation with Informative Entities | 210 | Pytorch | ERNIE |
| 2019/05 | Unified Language Model Pre-training for Natural Language Understanding and Generation | 278 | Pytorch | UniLMv1(unilm1-large-cased, unilm1-base-cased) |
| 2019/05 | HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization | 81 | - | |
| 2019/06 | Pre-Training with Whole Word Masking for Chinese BERT | 98 | Pytorch, TF | BERT-wwm |
| 2019/06 | XLNet: Generalized Autoregressive Pretraining for Language Understanding | 999+ | TF; Pytorch, TF2.0 | XLNet |
| 2019/07 | ERNIE 2.0: A Continual Pre-training Framework for Language Understanding | 107 | PaddlePaddle | ERNIE 2.0 |
| 2019/07 | SpanBERT: Improving Pre-training by Representing and Predicting Spans | 282 | Pytorch | SpanBERT |
| 2019/07 | RoBERTa: A Robustly Optimized BERT Pretraining Approach | 999+ | Pytorch; Pytorch, TF2.0 | RoBERTa |
| 2019/09 | Subword ELMo | 1 | Pytorch | - |
| 2019/09 | Knowledge Enhanced Contextual Word Representations | 115 | - | |
| 2019/09 | TinyBERT: Distilling BERT for Natural Language Understanding | 129 | - | |
| 2019/09 | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | 136 | Pytorch | Megatron-LM(BERT-345M, GPT-2-345M) |
| 2019/09 | MultiFiT: Efficient Multi-lingual Language Model Fine-tuning | 29 | Pytorch | - |
| 2019/09 | Extreme Language Model Compression with Optimal Subwords and Shared Projections | 32 | - | |
| 2019/09 | MULE: Multimodal Universal Language Embedding | 5 | - | |
| 2019/09 | Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks | 51 | - | |
| 2019/09 | K-BERT: Enabling Language Representation with Knowledge Graph | 59 | - | |
| 2019/09 | UNITER: Learning UNiversal Image-TExt Representations | 60 | - | |
| 2019/09 | ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | 803 | TF | - |
| 2019/10 | BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension | 349 | Pytorch | BART(bart.base, bart.large, bart.large.mnli, bart.large.cnn, bart.large.xsum) |
| 2019/10 | DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | 481 | Pytorch, TF2.0 | DistilBERT |
| 2019/10 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | 696 | TF | T5 |
| 2019/11 | CamemBERT: a Tasty French Language Model | 102 | - | CamemBERT |
| 2019/11 | ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations | 15 | Pytorch | - |
| 2019/11 | Unsupervised Cross-lingual Representation Learning at Scale | 319 | Pytorch | XLM-R (XLM-RoBERTa)(xlmr.large, xlmr.base) |
| 2020/01 | ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training | 35 | Pytorch | ProphetNet(ProphetNet-large-16GB, ProphetNet-large-160GB) |
| 2020/02 | CodeBERT: A Pre-Trained Model for Programming and Natural Languages | 25 | Pytorch | CodeBERT |
| 2020/02 | UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training | 33 | Pytorch | - |
| 2020/03 | ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators | 203 | TF | ELECTRA(ELECTRA-Small, ELECTRA-Base, ELECTRA-Large) |
| 2020/04 | MPNet: Masked and Permuted Pre-training for Language Understanding | 5 | Pytorch | MPNet |
| 2020/05 | ParsBERT: Transformer-based Model for Persian Language Understanding | 1 | Pytorch | ParsBERT |
| 2020/05 | Language Models are Few-Shot Learners | 382 | - | - |
| 2020/07 | InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training | 12 | Pytorch | - |
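
Models like the ones in this table return one vector per token, usually for a padded batch. Following the General Framework, a sentence embedding can then be obtained by pooling over the real (non-padding) tokens only. The sketch below assumes the encoder output is already available as a NumPy array together with a 0/1 attention mask; `masked_mean_pool` and the toy arrays are hypothetical and stand in for whatever a specific framework returns.

```python
import numpy as np


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padding positions.

    token_embeddings: float array of shape (batch, seq_len, dim)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Clip the token count so an all-padding row cannot divide by zero.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Hypothetical encoder output: 2 sentences, 3 positions, 2 dimensions.
# The last position of the first sentence is padding and must be ignored.
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])
sentence_embeddings = masked_mean_pool(token_embeddings, attention_mask)
```

Without the mask, the padding vector in the first sentence would be averaged in and distort the result.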

Pooling Methods

Encoders

| Date | Paper | Citation count | Code | Model name |
|---|---|---|---|---|
| - | Incremental Domain Adaptation for Neural Machine Translation in Low-Resource Settings | N/A | Python | AraSIF |
| 2014/05 | Distributed Representations of Sentences and Documents | 999+ | Pytorch; Python | Doc2Vec |
| 2014/11 | Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models | 849 | Theano; Pytorch | VSE |
| 2015/06 | Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books | 795 | Theano; TF; Pytorch, Torch | SkipThought |
| 2015/11 | Order-Embeddings of Images and Language | 354 | Theano | order-embedding |
| 2015/11 | Towards Universal Paraphrastic Sentence Embeddings | 411 | Theano | ParagramPhrase |
| 2015/?? | From Word Embeddings to Document Distances | 999+ | C, Python | Word Mover's Distance |
| 2016/02 | Learning Distributed Representations of Sentences from Unlabelled Data | 363 | Python | FastSent |
| 2016/07 | Charagram: Embedding Words and Sentences via Character n-grams | 144 | Theano | Charagram |
| 2016/11 | Learning Generic Sentence Representations Using Convolutional Neural Networks | 76 | Theano | ConvSent |
| 2017/03 | Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features | 319 | C++ | Sent2Vec |
| 2017/04 | Learning to Generate Reviews and Discovering Sentiment | 293 | TF; Pytorch; Pytorch | Sentiment Neuron |
| 2017/05 | Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings | 60 | Theano | GRAN |
| 2017/05 | Supervised Learning of Universal Sentence Representations from Natural Language Inference Data | 999+ | Pytorch | InferSent |
| 2017/07 | VSE++: Improving Visual-Semantic Embeddings with Hard Negatives | 132 | Pytorch | VSE++ |
| 2017/08 | Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm | 357 | Keras; Pytorch | DeepMoji |
| 2017/09 | StarSpace: Embed All The Things! | 129 | C++ | StarSpace |
| 2017/10 | DisSent: Learning Sentence Representations from Explicit Discourse Relations | 47 | Pytorch | DisSent |
| 2017/11 | Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations | 128 | Theano | para-nmt |
| 2017/11 | Dual-Path Convolutional Image-Text Embedding with Instance Loss | 44 | Matlab | Image-Text-Embedding |
| 2018/03 | An efficient framework for learning sentence representations | 183 | TF | Quick-Thought |
| 2018/03 | Universal Sentence Encoder | 564 | TF-Hub | USE |
| 2018/04 | End-Task Oriented Textual Entailment via Deep Explorations of Inter-Sentence Interactions | 14 | Theano | DEISTE |
| 2018/04 | Learning general purpose distributed sentence representations via large scale multi-task learning | 198 | Pytorch | GenSen |
| 2018/06 | Embedding Text in Hyperbolic Spaces | 50 | TF | HyperText |
| 2018/07 | Representation Learning with Contrastive Predictive Coding | 736 | Keras | CPC |
| 2018/08 | Context Mover’s Distance & Barycenters: Optimal transport of contexts for building representations | 8 | Python | CMD |
| 2018/09 | Learning Universal Sentence Representations with Mean-Max Attention Autoencoder | 14 | TF | Mean-MaxAAE |
| 2018/10 | Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model | 35 | TF-Hub | USE-xling |
| 2018/10 | Improving Sentence Representations with Consensus Maximisation | 4 | - | Multi-view |
| 2018/10 | BioSentVec: creating sentence embeddings for biomedical texts | 70 | Python | BioSentVec |
| 2018/11 | Word Mover's Embedding: From Word2Vec to Document Embedding | 47 | C, Python | WordMoversEmbeddings |
| 2018/11 | A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks | 76 | Pytorch | HMTL |
| 2018/12 | Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond | 238 | Pytorch | LASER |
| 2018/?? | Convolutional Neural Network for Universal Sentence Embeddings | 6 | Theano | CSE |
| 2019/01 | No Training Required: Exploring Random Encoders for Sentence Classification | 54 | Pytorch | randsent |
| 2019/02 | CBOW Is Not All You Need: Combining CBOW with the Compositional Matrix Space Model | 4 | Pytorch | CMOW |
| 2019/07 | GLOSS: Generative Latent Optimization of Sentence Representations | 1 | - | GLOSS |
| 2019/07 | Multilingual Universal Sentence Encoder | 52 | TF-Hub | MultilingualUSE |
| 2019/08 | Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks | 261 | Pytorch | Sentence-BERT |
| 2020/02 | SBERT-WK: A Sentence Embedding Method By Dissecting BERT-based Word Models | 11 | Pytorch | SBERT-WK |
| 2020/06 | DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations | 4 | Pytorch | DeCLUTR |
| 2020/07 | Language-agnostic BERT Sentence Embedding | 5 | TF-Hub | LaBSE |
| 2020/11 | On the Sentence Embeddings from Pre-trained Language Models | 0 | TF | BERT-flow |

Evaluation

Misc

Vector Mapping

Articles
