Wednesday, December 7, 2022

N-grams for text processing


An n-gram is a contiguous sequence of n words. It can be a unigram (one word), a bigram (a sequence of two words), a trigram (a sequence of three words), and so on. N-grams are very useful in speech recognition and predictive text input: by counting which words tend to follow which, we can estimate the next word likely to occur in a given sequence. Search engines also use the n-gram technique to suggest the next word while a search query is typed in a search bar.
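To make the prediction idea concrete, here is a minimal sketch of next-word prediction from bigram counts. The three-sentence corpus and the names `following` and `predict_next` are made up for illustration; a real system would train on far more text and usually apply smoothing.

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus (an assumption, not real training data).
corpus = [
    "this is going to be fun",
    "this is going to be an amazing experience",
    "this is an amazing day",
]

# For every word, count the words that immediately follow it (bigram counts).
following = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1

def predict_next(word):
    """Return the word seen most often after `word`, or None if `word` is unseen."""
    candidates = following.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("is"))  # going
```

Here "is" is followed by "going" twice and by "an" once, so "going" is the suggestion.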

Let us consider a sentence: 

This is going to be an amazing experience.


The unigrams for the above sentence are:
  • "This"
  • "is"
  • "going"
  • "to"
  • "be"
  • "an"
  • "amazing"
  • "experience"

The bigrams for the above sentence are:

  • "This is"
  • "is going"
  • "going to"
  • "to be"
  • "be an"
  • "an amazing"
  • "amazing experience"

The trigrams for the above sentence are:

  • "This is going"
  • "is going to"
  • "going to be"
  • "to be an"
  • "be an amazing"
  • "an amazing experience"
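The lists above can be generated with a few lines of Python. This is a minimal sketch: the helper name `ngrams` is our own, and tokenization is simply splitting on whitespace after dropping the final period, which is an assumption that real text would need to handle more carefully.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "This is going to be an amazing experience."
# Naive tokenization: strip the trailing period, then split on whitespace.
tokens = sentence.rstrip(".").split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print([" ".join(g) for g in bigrams])
# ['This is', 'is going', 'going to', 'to be', 'be an', 'an amazing', 'amazing experience']
```

A sentence of 8 words yields 8 unigrams, 7 bigrams, and 6 trigrams. In general, a sequence of k words has k - n + 1 n-grams.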

With n-grams, we can use a bag of n-grams (for example, a bag of bigrams) instead of only using a bag of words. A bag of bigrams or trigrams is more expressive than a bag of words because each feature keeps some local word order, and therefore some context, that single-word counts throw away.
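A small sketch shows the difference. The helper `bag_of_ngrams` and the two example sentences are made up for illustration. Both sentences contain exactly the same words, so their bags of words are identical, while their bags of bigrams differ.

```python
from collections import Counter

def bag_of_ngrams(text, n):
    """Count the n-grams in `text` (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    # zip over n shifted copies of the token list to form contiguous n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

doc_a = "the dog bit the man"
doc_b = "the man bit the dog"

words_a, words_b = bag_of_ngrams(doc_a, 1), bag_of_ngrams(doc_b, 1)
bigrams_a, bigrams_b = bag_of_ngrams(doc_a, 2), bag_of_ngrams(doc_b, 2)

print(words_a == words_b)      # True: same words, same counts
print(bigrams_a == bigrams_b)  # False: "dog bit" vs "man bit"
```

The bag of words cannot tell who bit whom, while the bag of bigrams can, at the cost of a larger and sparser set of features.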

