Distributional and Word Embedding Models
Word2Vec provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.
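As a minimal sketch of how such vectors are trained in practice, the following uses gensim's Word2Vec implementation on a toy placeholder corpus (any real use would substitute a large tokenized corpus):

```python
# Minimal sketch: training skip-gram vectors with gensim on a toy corpus.
from gensim.models import Word2Vec

# Placeholder corpus; in practice, a large collection of tokenized sentences.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["words", "are", "represented", "as", "vectors"],
]

# sg=1 selects the skip-gram architecture; sg=0 selects continuous bag-of-words.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vector = model.wv["king"]                         # 100-dimensional vector
similar = model.wv.most_similar("king", topn=3)   # nearest neighbors by cosine
```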
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
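One way to see those linear substructures is the word-analogy test on pretrained GloVe vectors. This sketch assumes the optional gensim-data downloader and network access to fetch the model:

```python
# Sketch: probing linear substructure in pretrained GloVe vectors.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # 100-dimensional GloVe vectors

# vec(king) - vec(man) + vec(woman) should land near vec(queen).
result = glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # expected to rank "queen" first
```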
Explicit Semantic Analysis (ESA) is a vectorial representation of text (individual words or entire documents) that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus and a document (string of words) is represented as the centroid of the vectors representing its words.
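A small sketch of that construction, using scikit-learn's tf-idf on a toy corpus that stands in for the knowledge base (the helper names here are illustrative):

```python
# Sketch of the ESA construction: words are tf-idf columns, documents are centroids.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Each corpus document plays the role of a concept in the knowledge base.
corpus = [
    "cats are small domesticated animals",
    "dogs are loyal domesticated animals",
    "stock markets fluctuate with the economy",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # shape: (n_documents, n_words)
vocab = vectorizer.vocabulary_

def word_vector(word):
    # A word is a column of the tf-idf matrix: its weight in every document.
    return tfidf[:, vocab[word]].toarray().ravel()

def document_vector(text):
    # A document is the centroid of the vectors of its in-vocabulary words.
    words = [w for w in text.lower().split() if w in vocab]
    return np.mean([word_vector(w) for w in words], axis=0)

print(document_vector("domesticated cats and dogs"))
```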
Dependency-Based Word Embeddings are word embeddings in which the context elements are the syntactic contexts of the target word, rather than the words in a window around it.
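The sketch below shows a simplified version of that context extraction in the style of Levy & Goldberg (2014), using spaCy for dependency parsing; it assumes the en_core_web_sm model is installed, and the resulting (word, context) pairs would replace window-based pairs during training:

```python
# Sketch: extracting (word, syntactic-context) pairs from a dependency parse.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Australian scientist discovers star with telescope")

pairs = []
for token in doc:
    for child in token.children:
        # Context is the dependent word plus the relation label,
        # e.g. ("discovers", "scientist/nsubj").
        pairs.append((token.text, f"{child.text}/{child.dep_}"))
        # The inverse relation is recorded for the dependent word.
        pairs.append((child.text, f"{token.text}/{child.dep_}-1"))
print(pairs)
```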
Latent Semantic Analysis (LSA) is an algorithm that uses a collection of documents to construct a semantic space. The algorithm builds a word-by-document matrix in which each row corresponds to a unique word in the corpus, each column corresponds to a document, and each entry is the number of times the row's word occurs in the column's document. The Singular Value Decomposition of this matrix is then computed, producing three matrices (UΣVᵀ): U, the word space; Σ, the singular values; and Vᵀ, the document space. The columns of U are then truncated to a small number of dimensions (typically 300), which produces the final semantic vectors.
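A compact sketch of that pipeline on a toy corpus, with k kept tiny only because the corpus is tiny:

```python
# Sketch: LSA via truncated SVD of a word-by-document count matrix.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "relation of user perceived response time to error measurement",
]

counts = CountVectorizer().fit_transform(corpus)   # (documents x words)
X = counts.T.toarray().astype(float)               # word-by-document matrix

# X = U @ diag(S) @ Vt: U spans the word space, Vt the document space.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                      # typically ~300; tiny here for the toy corpus
word_vectors = U[:, :k]    # truncated columns of U are the semantic vectors
```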
Positive Pointwise Mutual Information (PPMI), which works well for measuring semantic similarity in the Term-Sentence Matrix (TSM), is used in our method to assign a weight to each entry in the TSM. The Sentence-Rank Matrix generated from this weighted TSM is then used to extract a summary from the document.
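The PPMI weighting itself is straightforward; this sketch applies it to a raw term-sentence count matrix (the matrix values are illustrative):

```python
# Sketch: PPMI weighting of a term-sentence count matrix.
import numpy as np

def ppmi(counts):
    """counts: (terms x sentences) matrix of raw co-occurrence counts."""
    total = counts.sum()
    p_ts = counts / total                      # joint probability P(t, s)
    p_t = p_ts.sum(axis=1, keepdims=True)      # marginal P(t)
    p_s = p_ts.sum(axis=0, keepdims=True)      # marginal P(s)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_ts / (p_t * p_s))
    pmi[~np.isfinite(pmi)] = 0.0               # zero counts get PMI of 0
    return np.maximum(pmi, 0.0)                # keep only positive PMI

tsm = np.array([[2, 0, 1],
                [0, 3, 0],
                [1, 1, 1]], dtype=float)
print(ppmi(tsm))
```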
ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. It is built using an ensemble that combines models generated with PPMI, word2vec, and GloVe.
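The published Numberbatch files are distributed in the word2vec text format, so gensim can load them directly; the file name below is an assumption and should be checked against the current release:

```python
# Sketch: loading ConceptNet Numberbatch vectors with gensim.
# The file name is an assumption; verify it against the Numberbatch releases.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "numberbatch-en-19.08.txt.gz", binary=False
)
print(vectors.most_similar("coffee", topn=3))
```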