- Latent Dirichlet Allocation
Latent Dirichlet allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics.
Notebook : Latent Dirichlet Allocation
Good read : Gensim : Topic Modelling for Humans
Consider the bag-of-words (BOW) model below, where we have documents d, terms t, and the probability P(t|d) that a word in document d is the term t. With 500 documents and 1000 terms, this BOW model has 500 times 1000 = 500,000 parameters.
This is too many parameters to estimate. Another way to approach this problem is to add a small set of topics, or latent variables, to our model, so that every document in the above example is associated with an underlying mixture of topics. In the factored example below, we see 2 sets of probabilities :
- Probability of a particular topic z, given the document d :
P(z|d)
- Probability of a particular term t, given the topic z :
P(t|z)
The final probability of a term given a document, P(t|d), is obtained by multiplying the two probabilities above for each topic and summing over all topics: P(t|d) = Σ_z P(t|z) P(z|d).
Now, using topics, the number of parameters is (500 documents times 10 topics = 5,000) + (10 topics times 1000 terms = 10,000) = 15,000, which is significantly fewer than the 500,000 of the BOW model.
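The parameter counts above can be checked with a few lines of arithmetic. This is only a sketch; the variable names are made up for illustration, and the sizes are the ones used in the text.

```python
# Sizes taken from the example in the text.
n_documents = 500
n_terms = 1000
n_topics = 10

# Direct model: one probability P(t|d) per (document, term) pair.
bow_params = n_documents * n_terms

# Topic model: P(z|d) for each (document, topic) pair,
# plus P(t|z) for each (topic, term) pair.
topic_params = n_documents * n_topics + n_topics * n_terms

print(bow_params)    # 500000
print(topic_params)  # 15000
```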
Here, we have 5 documents and 7 topics, and we have pre-processed every document to keep only the important words. We count the occurrences of each word in a document, put the counts in the corresponding row, and divide them by the total number of words in that document to get probabilities.
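A minimal sketch of this counting step is shown below. The three tiny documents are hypothetical stand-ins (the ones in the original figure are not available), and `term_probabilities` is an illustrative helper name, not part of any library.

```python
from collections import Counter

# Hypothetical pre-processed documents, used only for illustration.
documents = [
    ["space", "climate", "space", "vote"],
    ["vote", "rule", "rule", "vote"],
    ["climate", "space", "climate", "climate"],
]

# One column per distinct word, in alphabetical order.
vocabulary = sorted({word for doc in documents for word in doc})


def term_probabilities(doc, vocabulary):
    """Return count(term) / total words in the document, for every term."""
    counts = Counter(doc)
    total = len(doc)
    return [counts[term] / total for term in vocabulary]


# One row per document; each row sums to 1.
doc_term_matrix = [term_probabilities(doc, vocabulary) for doc in documents]

print(vocabulary)
for row in doc_term_matrix:
    print(row)
```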
Here, we have 5 documents and, say, 3 topics. Consider document no. 3, which is a combination of 70% science, 10% politics and 20% sports. We save these values in a matrix, one row per document, to form our document matrix.
Here, let's say we have 3 topics and 6 random words, and the probability that each word is generated by the topic politics is as given below. We put these probabilities into a matrix, one row per topic, to form our topic matrix.
Consider a coin that is tossed and comes up heads a times and tails b times. The Beta distribution over the coin's probability x of heads is then written with the gamma function as below -
P(x) = Γ(a + b) / (Γ(a) Γ(b)) · x^(a-1) · (1 - x)^(b-1)
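The gamma function is defined for x > 0 as the integral Γ(x) = ∫ t^(x-1) e^(-t) dt, taken over t from 0 to infinity.
The advantage of the gamma function is that it extends the factorial to non-integer (decimal) arguments: for a positive integer n, Γ(n) = (n-1)!, but Γ is also defined for values such as 0.5.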
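The relationship between the Beta distribution and the gamma function can be tried out with Python's standard `math.gamma`. This is a sketch that assumes the standard Beta(a, b) density; `beta_pdf` is an illustrative helper, not a library function.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for 0 < x < 1 and a, b > 0."""
    normaliser = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return normaliser * x ** (a - 1) * (1 - x) ** (b - 1)


# For whole numbers the gamma function is a shifted factorial: gamma(n) = (n - 1)!
print(math.gamma(5), math.factorial(4))

# It is also defined for decimals, e.g. gamma(0.5) equals sqrt(pi).
print(math.gamma(0.5), math.sqrt(math.pi))

# Density at x = 0.5 for a = 2, b = 2.
print(beta_pdf(0.5, 2, 2))  # 1.5
```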
Consider an example where a news article comprises 60% politics, 30% science and 10% sports news. Then the corresponding Dirichlet distribution is given by -
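The corresponding 3-D distribution would be similar to the diagram below. If we want a good topic model, we need to pick small parameters, like the one on the left: parameters below 1 push the probability mass towards the corners, so each document is dominated by only a few topics.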
Consider a case where we pick a point in the Dirichlet distribution that corresponds to a news article containing 80% politics, 10% science and 10% sports, as below :
Now we sample multiple documents from the Dirichlet distribution and stack the resulting vectors to create our first matrix : a matrix that indexes documents with their corresponding topics.
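The sampling step can be sketched with the standard library alone. The approach below draws one gamma variate per topic and normalises them, which is a common way to sample from a Dirichlet distribution. The concentration values in `alpha`, the seed and the helper name `sample_dirichlet` are assumptions made for illustration.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one point from Dirichlet(alphas) by normalising gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
topics = ["politics", "science", "sports"]

# Assumed concentration parameters; values below 1 favour documents
# that are dominated by one or two topics.
alpha = [0.5, 0.5, 0.5]

# One row per sampled document, one column per topic.
document_topic_matrix = [sample_dirichlet(alpha, rng) for _ in range(5)]

for row in document_topic_matrix:
    print([round(p, 2) for p in row])
```

Each printed row is one document's topic mix and sums to 1.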
Consider a news article that consists of the following words : space, climate, vote, rule. We draw its corresponding Dirichlet distribution as below:
Similarly, we sample multiple topics and create a probabilistic distribution for them.
Later, we combine the vectors to form a matrix. From this matrix we can say that, since topic 1 has a 40% probability of generating the word space and a 40% probability of generating the word climate, topic 1 is probably the topic science.
Now we perform a matrix multiplication of the two matrices above, i.e. the document-topic matrix and the topic-word matrix, as below. The entries of the first matrix come from picking points in the Dirichlet distribution alpha, and those of the second matrix come from the Dirichlet distribution beta. The idea is to find the best locations of these points to obtain the topics we want.
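A small sketch of this multiplication follows. The first document row (70% science, 10% politics, 20% sports) and the science row (0.4, 0.4, 0.1, 0.1) come from the examples in the text; every other number is made up for illustration, and `matmul` is a plain-Python helper written for this sketch.

```python
topics = ["science", "politics", "sports"]
words = ["space", "climate", "vote", "rule"]

# P(z|d): one row per document, one column per topic.
doc_topic = [
    [0.7, 0.1, 0.2],  # mix from the text
    [0.1, 0.8, 0.1],  # hypothetical
]

# P(t|z): one row per topic, one column per word.
topic_word = [
    [0.40, 0.40, 0.10, 0.10],  # science, from the text
    [0.10, 0.10, 0.40, 0.40],  # politics, hypothetical
    [0.25, 0.25, 0.25, 0.25],  # sports, hypothetical
]


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m) using nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# P(t|d) = sum over topics z of P(z|d) * P(t|z).
doc_word = matmul(doc_topic, topic_word)

for row in doc_word:
    print([round(p, 3) for p in row])
```

For the first document this works out to roughly 0.34 for space and climate and 0.16 for vote and rule, and every row still sums to 1.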
First, we generate some fake documents and compare them to the original docs. Let's look at some document points in the Dirichlet distribution alpha. Each point gives us values for the 3 topics (Sports, Science, Politics), thus generating a multinomial distribution theta. This theta is the multinomial distribution over topics for document 1.
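Based on theta, we generate a list of topics for the document: the number of topics to draw (one per word) comes from a Poisson distribution, and each topic is drawn from the multinomial theta.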
Now we assign words to the topics generated above using the Dirichlet distribution beta. In this distribution we locate each topic, and each topic gives a distribution over words; for example, topic 1 (science) generates the words space and climate with probability 0.4 each, and vote and rule with probability 0.1 each. These distributions are called phi.
Now, for each topic we have chosen, we pick a word from the corresponding multinomial distribution phi. For example, if the first topic is science, we look at the science row in phi and pick a word, e.g. space. We do this for every topic and thus generate our fake document number 1. We keep repeating this process to create more fake documents.
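The steps for producing fake documents can be put together in a short sketch. It assumes a Poisson-distributed document length with an arbitrary mean of 8, symmetric concentration values of 0.5 for alpha and beta, and a fixed seed. The helper names are illustrative, and the Poisson sampler uses Knuth's simple multiplication method because the standard library has none.

```python
import math
import random


def sample_dirichlet(alphas, rng):
    """Draw one point from Dirichlet(alphas) by normalising gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_poisson(lam, rng):
    """Knuth's method: count uniform draws until their product drops to exp(-lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def generate_document(theta, phi, vocabulary, length, rng):
    """For each word slot, draw a topic from theta, then a word from that topic's phi row."""
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        words.append(rng.choices(vocabulary, weights=phi[topic])[0])
    return words


rng = random.Random(42)
vocabulary = ["space", "climate", "vote", "rule"]
n_topics = 3
alpha = [0.5] * n_topics         # assumed prior over topic mixes
beta = [0.5] * len(vocabulary)   # assumed prior over word distributions

# phi is shared by all documents: one word distribution per topic.
phi = [sample_dirichlet(beta, rng) for _ in range(n_topics)]

fake_documents = []
for _ in range(3):
    theta = sample_dirichlet(alpha, rng)      # topic mix of this document
    length = max(1, sample_poisson(8, rng))   # number of words, at least 1
    fake_documents.append(generate_document(theta, phi, vocabulary, length, rng))

for doc in fake_documents:
    print(doc)
```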
After generating the fake documents, we compare them to the original ones. The probability of generating a fake document close to a real one is very small, but we try to maximize it. By maximizing this probability, and so minimizing the error, we find the arrangement of points that gives us the topics.
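One way to make "the probability of generating a real document" concrete is to score a document under a fixed theta and phi. The sketch below does only that scoring; it does not search for the best arrangement of points. The theta and science row reuse the numbers from the earlier examples, the other rows are hypothetical, and `log_likelihood` is an illustrative helper.

```python
import math

words = ["space", "climate", "vote", "rule"]

# Topic mix of one document: science, politics, sports.
theta = [0.7, 0.1, 0.2]

# Word distribution per topic (science row from the text, others hypothetical).
phi = [
    [0.40, 0.40, 0.10, 0.10],
    [0.10, 0.10, 0.40, 0.40],
    [0.25, 0.25, 0.25, 0.25],
]


def log_likelihood(document, theta, phi, words):
    """Sum of log P(word), where P(word) = sum over topics of theta[z] * phi[z][word]."""
    total = 0.0
    for word in document:
        j = words.index(word)
        p_word = sum(theta[z] * phi[z][j] for z in range(len(theta)))
        total += math.log(p_word)
    return total


science_doc = ["space", "climate", "space"]
politics_doc = ["vote", "rule", "vote"]

ll_science = log_likelihood(science_doc, theta, phi, words)
ll_politics = log_likelihood(politics_doc, theta, phi, words)

print(ll_science, ll_politics)
```

Because this theta leans towards science, the science-like document gets the higher log-likelihood; a better arrangement of points is one that raises this score across the real documents.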
The error tells us how far we are from generating the real documents. This error is back-propagated through the distributions, which tells us where to move the points to reduce it.
A good arrangement of points will give us the topics. The Dirichlet distribution alpha will give us the articles associated with those topics, and beta will give us the words related to each topic.