Topic modeling is an unsupervised method for text analysis. Given a large set of unlabeled documents, it is very difficult to get insight into what the documents actually discuss. This is where topic modeling comes in. It helps identify a number of hidden topics within a set of documents. Based on the identified topics, the whole set of documents can be grouped. It can also be used to predict the topic of new documents. Applications include analyzing customer reviews, predicting story genres, classifying tweets, and so on.
LDA (Latent Dirichlet Allocation) is one of the most widely used and well-known techniques for topic modeling.
The word latent means hidden: we don’t know in advance what topics a set of documents contains. Based on the probability distribution of the words occurring in each document, the words are allocated to the defined topics.
Working of LDA:
What we want from LDA is to learn the topic mix in each document and the word mix in each topic.
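We choose a number of topics for the given dataset up front. This number is a setting we pick (an arbitrary starting guess is fine), not something LDA learns.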
Randomly assign each word in each document to one of the chosen topics.
Now we go through every word in every document and look at the topic it is currently assigned to. For that word, we check how often each topic occurs in the document and how often the word occurs in each topic across all documents. Based on this analysis, a new topic is assigned to the word.
This process is repeated for a number of iterations, and eventually the topics start making sense. We can then inspect the discovered topics and give each one a suitable name that best describes it.
Fig: Schematic diagram of LDA algorithm
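The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production implementation: it assumes the reassignment step is done with collapsed Gibbs sampling (one common way to fit LDA, which the description above resembles), and the function name lda_gibbs, the smoothing values alpha and beta, and the toy documents are all made up for this example.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy LDA via collapsed Gibbs sampling (illustrative sketch only).

    alpha and beta are assumed smoothing values, not tuned settings.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # doc_topic[d][k]: how often topic k occurs in document d
    doc_topic = [[0] * num_topics for _ in docs]
    # topic_word[k][w]: how often word w is assigned to topic k overall
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Randomly assign each word in each document to a topic.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    # Revisit every word and reassign its topic, many times over.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Take the word out of the counts before re-scoring it.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Weight = (topic frequency in this document)
                #        * (word frequency in this topic).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word

# Hypothetical toy corpus: each document is a list of words.
docs = [
    "cat dog pet cat vet".split(),
    "dog pet vet dog cat".split(),
    "stock market trade stock bank".split(),
    "bank trade market money stock".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)

# Show the most frequent words per topic so we can name the topics by hand.
for k, counts in enumerate(topic_word):
    print("topic", k, [w for w, c in counts.most_common(3) if c > 0])
```

The rows of doc_topic give the topic mix of each document and topic_word gives the word mix of each topic, which are the two things we wanted to learn. On real data, a library such as gensim or scikit-learn is usually a better choice than hand-written sampling code.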