LSA (Latent Semantic Analysis) is another technique used for topic modeling. The main idea behind topic modeling is that the meaning of a document is driven by latent variables, i.e., topics, so we use topic modeling techniques to uncover those hidden variables and make sense of the document. LSA is mostly suitable for large sets of documents. It converts the documents into a document-term matrix before deriving topics from them.
Working of LSA:
The given text is converted into a document-term matrix using either a bag-of-words representation or Term Frequency-Inverse Document Frequency (TF-IDF) weighting.
Then, the document-term matrix is decomposed using Truncated Singular Value Decomposition (SVD). It is at this stage that the topics within the documents are identified. Mathematically, it can be given as,
$$A \approx U S V^{T}$$
Though it may look difficult to understand at first glance, in simple terms the above formula decomposes a high-dimensional matrix into smaller matrices, i.e., U, S, and V, where,
A = n*m document-term matrix (n = no. of documents and m = no. of words)
U = n*r document-topic matrix (n = no. of documents and r = no. of topics)
S = r*r diagonal matrix of singular values (r = no. of topics)
V = m*r word-topic matrix (m = no. of words and r = no. of topics)
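To make the shapes above concrete, here is a minimal NumPy sketch of a truncated SVD. The toy matrix and the choice of r = 2 are assumptions made purely for illustration; they do not come from any real corpus.

```python
import numpy as np

# Toy document-term matrix A: n = 4 documents (rows), m = 5 words (columns).
# The counts are invented for illustration only.
A = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

r = 2  # number of topics to keep (an assumed value)

# Thin SVD: A = U_full @ diag(s_full) @ Vt_full
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=False)

# Keep only the r largest singular values and their vectors.
U = U_full[:, :r]        # n x r document-topic matrix
S = np.diag(s_full[:r])  # r x r diagonal matrix of singular values
V = Vt_full[:r, :].T     # m x r word-topic matrix

# Rank-r approximation of the original matrix.
A_approx = U @ S @ V.T

print(U.shape, S.shape, V.shape, A_approx.shape)
```

The product U S V^T has the same n*m shape as A but only rank r, which is why the decomposition is an approximation rather than an exact equality once it is truncated.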
Finally, using the document-topic matrix U, we can determine which documents belong to which topics.
Schematic diagram of LSA algorithm