Wednesday, December 7, 2022

N-grams for text processing


N-gram is an N sequence of words. It can be a unigram (one word), bigram (sequence of two words), trigram (sequence of three words), and so on. It focuses on a sequence of words. Such a method is very useful in speech recognition and predicting input text. It helps us to predict the next words that could occur in a given sequence. Search engines also use the n-gram technique to predict the next word while a search query is typed in a search bar. 

Let us consider a sentence: 

This is going to be an amazing experience.


Unigram for the above sentence can be written as:
  • "This"
  • "is"
  • "going"
  • "to"
  • "be"
  • "an"
  • "amazing"
  • "experience"

Bigram for the above sentence can be written as:

  • “This is”
  • “is going”
  • “going to”
  • "to be"
  • "be an"
  • "an amazing"
  • "amazing experience"

Trigram for the above sentence can be written as:

  • "This is going"
  • "is going to "
  • "going to be"
  • "to be an"
  • "be an amazing"
  • "an amazing experience"

With n-grams, we can use a bag of n-grams (for example, a bag of bigrams) instead of only using a bag of words. A bag of bigrams or trigrams is more powerful than using just a bag of words as it takes the context into consideration as well.


Related Posts:

  • N-grams for text processingN-gram is an N sequence of words. It can be a unigram (one word), bigram (sequence of two words), trigram (sequence of three words), and so on. It focuses on a sequence of words. Such a method is very useful in speech recogni… Read More
  • Word embedding techniques for text processing It is difficult to perform analysis on the text so we use word embedding techniques to convert the texts into numerical representation. This is also called vectorization of the words. It is a representation technique fo… Read More
  • Bi-LSTM in simple words In traditional neural networks, inputs and outputs were independent of each other. To predict next word in a sentence, it is difficult for such model to give correct output, as previous words are required to be remember… Read More
  • Topic Modeling: Working of LSA (Latent Semantic Analysis) in simple terms LSA (Latent Semantic Analysis) is another technique used for topic modeling. The main concept behind topic modeling is that the meaning behind any document is based on some latent variables so we use various topic model… Read More
  • Easy understanding of Confusion matrix with examplesConfusion matrix is an important metric to evaluate the performance of classifier. The performance of a classifier depends on their capability predict the class correctly against new or unseen data. It is one of the easiest m… Read More

0 comments:

Post a Comment