Tuesday, December 6, 2022

Word embedding techniques for text processing

It is difficult to perform analysis directly on raw text, so we use word embedding techniques to convert text into a numerical representation. This is also called vectorization of the words. It is a representation technique in which words with the same meaning are given similar representations. It helps to extract features from the text.

i. Bag of Words (BoW)

It is one of the most widely used techniques for word embedding. This method focuses on the frequency of the words and ignores their order. It is easy to implement and also efficient. In Python, we can use the CountVectorizer class from the scikit-learn (sklearn) library to implement Bag of Words. A graphical representation of Bag of Words is shown in the figure below.

The reason for selecting BoW as a word embedding technique is that it is still widely used because it is simple to implement. It is also very useful when working on a domain-specific dataset. Even if it is not the final choice, it gives researchers a baseline idea of how well an approach performs.

Fig: Representation of BoW (image: "NLP: Bag of words and TF-IDF explained!" by Koushik kumar, Medium)


ii. TF-IDF (Term Frequency-Inverse Document Frequency)

Term Frequency (TF) counts how often a word occurs in a document. Words with higher frequencies do not necessarily carry significant information about the document: a word that appears many times is not always relevant, and words with lower frequencies often carry more meaning. One way to correct for this is TF-IDF. Inverse Document Frequency (IDF) gives more weight to words that appear in few documents and less weight to words that appear in most of them. The TF-IDF score of a word is the product of the two.
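
To make the idea concrete, here is a small sketch that computes TF-IDF by hand. It assumes the common textbook form, tf = count / document length and idf = log(N / df); libraries such as scikit-learn use smoothed variants, so their numbers will differ. The three sentences and the helper name tf_idf are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus, invented purely for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(term, tokens):
    """Textbook TF-IDF of `term` within one tokenized document."""
    tf = tokens.count(term) / len(tokens)   # relative frequency in this document
    idf = math.log(n_docs / df[term])       # rarer across the corpus -> larger
    return tf * idf


first = tokenized[0]
scores = {term: round(tf_idf(term, first), 3) for term in set(first)}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

In this corpus "the" is the most frequent word in the first document, yet it appears in every document, so its IDF is log(3/3) = 0 and its score drops to zero. "mat" appears only once, but in a single document, so it receives the highest score.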

TF-IDF is used because it can help find and remove stop words in textual data, since words that occur in almost every document receive low weights. It also helps to find terms that uniquely identify a text, and to understand the importance of a word in a document relative to the whole collection, which in turn can help in text summarization.

