
Distributional and Word Embedding Models

Siamak Barzegar edited this page Jul 14, 2017 · 9 revisions

Word2Vec (W2V)

Word2Vec provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.
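
To make the difference between the two architectures concrete, here is a minimal sketch of how training examples could be derived from a tokenized sentence. The function names `skipgram_pairs` and `cbow_examples` and the `window` parameter are illustrative assumptions, not part of the Word2Vec tool itself, and the sketch covers only example generation, not the training of the vectors.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each target word is paired with every word in its window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the bag of surrounding words is used to predict the target word."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


tokens = "the quick brown fox jumps".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
```

With `window=1`, skip-gram yields one (target, context) pair per neighbouring word, while CBOW yields one example per position with all neighbours grouped together.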

See also:

Global Vectors (GloVe)

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
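
The "aggregated global word-word co-occurrence statistics" can be illustrated with a small sketch that counts co-occurrences inside a symmetric window. The function name `cooccurrence_counts`, the toy corpus, and the choice to weight a pair that is `d` positions apart by `1/d` are assumptions made for illustration. The sketch covers only the counting step, not the fitting of vectors to these counts.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Accumulate symmetric, distance-weighted word-word co-occurrence counts."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only to the left; the symmetric entry is added explicitly.
            for j in range(max(0, i - window), i):
                distance = i - j
                counts[(word, tokens[j])] += 1.0 / distance
                counts[(tokens[j], word)] += 1.0 / distance
    return dict(counts)


corpus = [["ice", "is", "cold"], ["steam", "is", "hot"]]
X = cooccurrence_counts(corpus, window=2)
```

Pairs that never share a window (for example "ice" and "hot" here) simply have no entry, so the resulting table is sparse.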

See also:

Explicit Semantic Analysis (ESA)

Explicit Semantic Analysis (ESA) is a vectorial representation of text (individual words or entire documents) that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus and a document (string of words) is represented as the centroid of the vectors representing its words.
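
The two representations described above can be sketched in a few lines of plain Python. The toy knowledge base, the function names `tfidf_matrix` and `text_vector`, and the particular idf variant (`log(N / df)`) are assumptions chosen for illustration. A real ESA setup would use a large document corpus as the knowledge base.

```python
import math
from collections import Counter


def tfidf_matrix(concepts):
    """Return (document names, {word: tf-idf vector with one entry per document})."""
    names = sorted(concepts)
    n_docs = len(names)
    tf = {name: Counter(concepts[name]) for name in names}
    vocab = sorted({w for tokens in concepts.values() for w in tokens})
    vectors = {}
    for w in vocab:
        df = sum(1 for name in names if tf[name][w] > 0)
        idf = math.log(n_docs / df)  # assumed idf variant
        vectors[w] = [tf[name][w] * idf for name in names]
    return names, vectors


def text_vector(tokens, vectors, dim):
    """Represent a text as the centroid of the vectors of its known words."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Hypothetical three-document knowledge base.
concepts = {
    "Cat": ["cat", "purr", "pet"],
    "Dog": ["dog", "bark", "pet"],
    "Car": ["car", "engine", "road"],
}
names, vectors = tfidf_matrix(concepts)
doc = text_vector(["cat", "pet"], vectors, len(names))
```

Each dimension of a vector corresponds to one document of the knowledge base, which is what makes the representation "explicit".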

See also:

Dependency-Based Word Embeddings

In dependency-based word embeddings, the context elements are the syntactic contexts of the target word, taken from a dependency parse of the sentence, rather than the words in a window around it.
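
A small sketch can show what syntactic contexts look like in practice. The arcs below are a hand-written, hypothetical parse of "australian scientist discovers star", and the `word/relation` context format (with `-1` marking the inverse direction) is an assumed convention for illustration. In a real pipeline the arcs would come from a dependency parser.

```python
def dependency_contexts(arcs):
    """Turn (head, relation, dependent) arcs into (word, context) pairs."""
    pairs = []
    for head, rel, dep in arcs:
        # The head sees its dependent together with the relation label.
        pairs.append((head, f"{dep}/{rel}"))
        # The dependent sees its head through the inverse relation.
        pairs.append((dep, f"{head}/{rel}-1"))
    return pairs


def contexts_of(word, pairs):
    """Collect all contexts extracted for one target word."""
    return [context for target, context in pairs if target == word]


# Hypothetical parse of: "australian scientist discovers star"
arcs = [
    ("scientist", "amod", "australian"),
    ("discovers", "nsubj", "scientist"),
    ("discovers", "dobj", "star"),
]
pairs = dependency_contexts(arcs)
```

Here "discovers" receives the contexts `scientist/nsubj` and `star/dobj`, whereas a window of size one would give it only its immediate neighbours regardless of their grammatical role.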

See also:

Dependency-Based Word Embeddings

Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA) is an algorithm that uses a collection of documents to construct a semantic space. The algorithm constructs a word-by-document matrix where each row corresponds to a unique word in the document corpus and each column corresponds to a document. The value at each position is the number of times the row’s word occurs in the column’s document. The Singular Value Decomposition of the word-document matrix is then computed, producing three matrices (UΣVᵀ): U, the word space; Σ, the singular values; and V, the document space. U is then truncated to a small number of columns (typically 300 dimensions), which produces the final semantic vectors.
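
The steps above can be sketched with NumPy (assumed to be installed). The toy corpus and the choice of `k = 2` are illustrative assumptions. Note that `numpy.linalg.svd` returns the document-space factor already transposed, named `Vt` here.

```python
import numpy as np

# Hypothetical toy corpus; each document is a list of tokens.
docs = [
    "cat purr pet".split(),
    "dog bark pet".split(),
    "car engine road".split(),
    "car road trip".split(),
]
vocab = sorted({w for doc in docs for w in doc})
index = {w: i for i, w in enumerate(vocab)}

# Word-by-document count matrix: rows are words, columns are documents.
X = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for w in doc:
        X[index[w], j] += 1

# Thin SVD: X = U @ diag(S) @ Vt, with singular values in descending order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the first k columns of U as the word vectors.
k = 2
word_vectors = U[:, :k]
```

Row `index[w]` of `word_vectors` is the vector for word `w`. Some variants additionally scale these columns by the corresponding singular values, which this sketch does not do.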

See also:

Positive Pointwise Mutual Information (PPMI)

Positive Pointwise Mutual Information (PPMI) is a weighting scheme that works well for measuring semantic similarity in a Term-Sentence-Matrix (TSM). In the summarization method this description is drawn from, PPMI assigns a weight to each entry of the TSM; the Sentence-Rank-Matrix generated from this weighted TSM is then used to extract a summary from the document.
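
As a rough illustration of the weighting step, PPMI replaces each raw count with max(0, log2(p(t, s) / (p(t) p(s)))), where the probabilities are estimated from the matrix itself. The function name `ppmi` and the toy term-sentence counts below are assumptions. Building the Sentence-Rank-Matrix is not covered by this sketch.

```python
import math


def ppmi(counts):
    """Weight a count matrix (rows: terms, columns: sentences) with PPMI."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    weighted = []
    for i, row in enumerate(counts):
        out = []
        for j, c in enumerate(row):
            if c == 0:
                # PMI is undefined (negative infinity) for zero counts; PPMI uses 0.
                out.append(0.0)
            else:
                pmi = math.log2(c * total / (row_sums[i] * col_sums[j]))
                out.append(max(0.0, pmi))  # clip negative associations to 0
        weighted.append(out)
    return weighted


# Hypothetical term-sentence counts: 2 terms x 2 sentences.
tsm = [
    [3, 1],
    [1, 3],
]
weights = ppmi(tsm)
```

In this toy matrix the off-diagonal entries occur less often than chance would predict, so their PMI is negative and PPMI sets them to zero.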

See also:

ConceptNet Numberbatch

ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. It is built using an ensemble that combines models generated by PPMI, Word2Vec, and GloVe.

See also: