Text Representation in NLP

Text to Number transformation | TfIdf | N-grams | Bi-grams | Uni-grams

Introduction

  1. What is Feature Extraction from text?

    • For text representation

    • For text recognition

  2. Why do we need it?

  3. Why is it difficult?

  4. What is the core idea?

  5. What are the techniques?

    • OHE (One Hot Encoding)

    • BOW (Bag of Words)

    • N-grams

    • TfIdf

    • Custom features

    • Word2Vec (Word Embeddings)


Common Terms used in NLP

  1. What is Corpus (C)?

    A collection of authentic texts or audio that’s organized into datasets.

    In other words, a corpus is a collection of machine-readable texts that’s representative of a specific language or language variety.

  2. What is Vocabulary (V)?

    Vocabulary is the collection of all unique words or linguistic units that appear in a given dataset.

    In other words, the set of unique words used in the text corpus.

  3. What is Document (D)?

    A text object, the collection of which makes up your corpus.

    If you are doing work on Search or Topics, the documents will be the objects which you will be finding similarities between in order to group them topically.

  4. What is Word (W)?

    In NLP, a word is represented as a vector of real numbers, called a word embedding, to help computers understand the meaning of words.
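To make these terms concrete, here is a minimal sketch in plain Python (the sentences are the same example documents used in the next section):

# Each string is a Document (D); the list of all documents is the Corpus (C).
corpus = [
    "people watch CodeA2Z",    # D1
    "CodeA2Z watch CodeA2Z",   # D2
    "people write comment",    # D3
    "CodeA2Z write comment",   # D4
]

# The Vocabulary (V) is the set of unique words (W) appearing in the corpus.
vocabulary = sorted({word for doc in corpus for word in doc.split()})
print(vocabulary)   # ['CodeA2Z', 'comment', 'people', 'watch', 'write']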


One Hot Encoding (OHE)

To understand this topic, we will use an example,

D1: people watch CodeA2Z
D2: CodeA2Z watch CodeA2Z
D3: people write comment
D4: CodeA2Z write comment

Here Corpus is shown below,

people watch CodeA2Z CodeA2Z watch CodeA2Z people write comment CodeA2Z write comment

Here Vocabulary is shown below,

people watch CodeA2Z write comment

By using this technique, we create as many features as there are words in the vocabulary. For each word, the corresponding feature is marked as true (1) if the word is present and false (0) otherwise. A minimal sketch is shown below.
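Here is a minimal sketch of OHE for this example (the one_hot helper below is illustrative, not a standard library function):

import numpy as np

vocabulary = ["people", "watch", "CodeA2Z", "write", "comment"]

def one_hot(word, vocab):
    # Binary vector of length |V| with a single 1 at the word's index
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab.index(word)] = 1
    return vec

# D1: "people watch CodeA2Z" -> one vector per word
d1 = [one_hot(w, vocabulary) for w in "people watch CodeA2Z".split()]
print(np.array(d1))
# [[1 0 0 0 0]
#  [0 1 0 0 0]
#  [0 0 1 0 0]]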

Pros
  • Intuitive (Easy to understand)

  • Easy to implement

Flaws
  • Sparsity

    • It means that representing a single word from a vocabulary of 5,000,000 words needs an array of length 5,000,000.

    • There is a single true (1), while the rest of the array is filled with false (0), which makes such a large array inefficient to store and process.

    • Chances of over-fitting.

  • Out of Vocabulary (OOV)

    • If the user inputs a word that is not present in the vocabulary, it cannot be converted into numbers, so the model will not be able to predict on that input.

  • No fixed size (documents of different lengths produce different numbers of one-hot vectors)

  • No capturing of semantic meaning

    • Words lose their semantic meaning in OHE.
      For e.g., Walk and Run are close in meaning while Bottle is not, yet OHE encodes all three as equally distant vectors.

Bag of Words

In Bag of Words, we count the frequency of each vocabulary word in a document and build the array.

The core intuition behind this technique is that documents of the same type contain the same kind of words with similar frequencies.

Word order and context do not matter (it does not capture the meaning of words, i.e., semantic meaning).

We turn each sentence into a vector and, much like k-means clustering, sentences whose vectors are closest are assigned to the same class. In this loose sense, we can say that Bag of Words captures some of the semantic meaning of a sentence.

Code Implementation,

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

# Learn the vocabulary and build the document-term count matrix
bow = cv.fit_transform(df['text'])
vocabulary = cv.vocabulary_

# To print any sentence after encoding,
# print(bow[0].toarray())
# To encode a new sentence with the learned vocabulary,
# cv.transform(["CodeA2Z watch and write comment of CodeA2Z"])

Unlike OHE, a new sentence can still be transformed even if it contains unseen words: CountVectorizer simply ignores words that are not in the vocabulary (although the information in those words is lost).

CountVectorizer counts words, i.e., word frequencies. If you only want to know whether a word is present in the content or not (for example, in a spam detection problem), you can use binary=True in CountVectorizer; a small sketch follows below.

You can read more about this on the scikit-learn documentation page.
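A small sketch of the binary option, using two of the example documents:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["CodeA2Z watch CodeA2Z", "people write comment"]

# binary=True records only presence (1) or absence (0) instead of counts
cv_binary = CountVectorizer(binary=True)
print(cv_binary.fit_transform(docs).toarray())
# 'CodeA2Z' occurs twice in the first document, but its entry is still 1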

Advantages
  • Simple and Intuitive

  • Semantic relation setup (documents that use similar words get similar vectors)

Dis-Advantages
  • Sparsity

  • Overfitting

  • OOV

  • Words order is ignored.

  • Different sentence meanings are not always conveyed differently in BOW. For e.g.,
    D1: This is a very good movie.
    D2: This is not a very good movie.
    Here, the only extra word is "not", which changes the meaning of the sentence, but BOW does not take this into account and keeps those 2 vectors under the same class.


N-grams (Bag of n-grams)

In N-grams, N is the number of consecutive words used to build the vocabulary. If N=2, it is called bi-grams; if N=3, it is called tri-grams, and so on.

For e.g., D: people watch CodeA2Z

Bi-grams

Vocabulary: people watch, watch CodeA2Z.

Tri-grams

Vocabulary: people watch CodeA2Z

Quad-grams

It is not possible for the provided example, since the document has only three words.

NOTE: BOW is also called uni-grams, as we consider only one word at a time.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1,1)) # only uni-gram
cv = CountVectorizer(ngram_range=(1,2)) # uni & bi-gram
cv = CountVectorizer(ngram_range=(2,2)) # only bi-gram
cv = CountVectorizer(ngram_range=(3,3)) # only tri-gram
cv = CountVectorizer(ngram_range=(1,3)) # uni & bi & tri-gram

bow = cv.fit_transform(df['text'])

What is the benefit of using N-grams?

It can be easily understood with an example,
For e.g., D1: This movie is very good.
D2: This movie is not good.
These 2 sentences have completely different meanings, yet BOW keeps their vectors under the same class.

With bi-grams, every 2 consecutive words are treated as a single phrase and the vocabulary is built from phrases. This creates a large difference between the two vectors, so they are no longer kept under the same class. In this way, we capture more of the meaning of the sentences; a small sketch is shown below.
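A small sketch of this effect on the two sentences above:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["This movie is very good", "This movie is not good"]

# Uni-grams: the two vectors differ only in the 'not' and 'very' columns
uni = CountVectorizer(ngram_range=(1, 1))
print(uni.fit_transform(docs).toarray())
print(uni.get_feature_names_out())

# Bi-grams: phrases like 'is very', 'very good', 'is not', 'not good'
# appear in only one of the two documents, so the vectors differ much more
bi = CountVectorizer(ngram_range=(2, 2))
print(bi.fit_transform(docs).toarray())
print(bi.get_feature_names_out())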

Advantages
  • Able to capture semantic meaning of the sentence.

  • Easy to implement.

Dis-Advantages
  • Dimensionality of the dataset increases as N in N-grams increases.

  • OOV

  • Slows down model building because of the increase in the number of features.


Tf-Idf

The core intuition of Tf-Idf is that if a word appears frequently in a particular document but rarely in the rest of the corpus, that word is very important for that document.

In this way, Tf-Idf will assign a particular weightage to that word by calculating Tf (Term Frequency) and Idf (Inverse document frequency).

Weightage of any word = Tf (Term frequency) * Idf (Inverse document frequency)

Tf (t, d) = (Number of occurrences of term t in document d) / (Total number of terms in the document d)

Idf (t) = 1 + ln ((Total number of documents in the corpus) / (Number of documents with term t in them ))

Range of Tf: 0 ≤ Tf ≤ 1 (it behaves like a probability)

Range of Idf: 1 ≤ Idf ≤ 1 + ln(total number of documents)
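As a hand-calculated example with the formulas above, take the word CodeA2Z in D2 ("CodeA2Z watch CodeA2Z") from the earlier four-document corpus (note that scikit-learn's TfidfVectorizer uses a slightly different smoothed and normalized formula, so its numbers will not match this exactly):

import math

tf  = 2 / 3                  # 'CodeA2Z' occurs 2 times out of the 3 terms in D2
idf = 1 + math.log(4 / 3)    # 4 documents in total, 'CodeA2Z' appears in 3 of them
print(tf * idf)              # ~0.86 -> the weightage of 'CodeA2Z' in D2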

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

# Learn the vocabulary and Idf weights, then print the weighted document vectors
print(tfidf.fit_transform(df['text']).toarray())

# Idf weight learned for each word, and the corresponding feature names
print(tfidf.idf_)
print(tfidf.get_feature_names_out())

Advantages
  • Information retrieval (search engines rank documents for a query using Tf-Idf style weights)
Dis-advantages
  • Sparsity

  • OOV

  • Dimensionality increases

  • Semantic relation is not captured

    For e.g., D1: This is beautiful. D2: This is gorgeous. Both documents are considered completely different even though their meanings are similar.


Custom features

These are also called hand-crafted features.

Features like positive word count, negative word count, the ratio between positive and negative words, word count, char count, etc.

It mainly depends on what type of problem you are solving and how you want to represent the text to the user. A minimal sketch is shown below.
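Here is a minimal sketch of a few hand-crafted features (the positive/negative word lists are purely illustrative, not a standard lexicon):

import pandas as pd

POSITIVE = {"good", "great", "beautiful", "gorgeous"}   # illustrative word lists
NEGATIVE = {"bad", "boring", "poor"}

df = pd.DataFrame({"text": ["This is a very good movie", "This is a boring movie"]})

def custom_features(text):
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pd.Series({
        "word_count": len(words),
        "char_count": len(text),
        "positive_words": pos,
        "negative_words": neg,
        "pos_neg_ratio": pos / (neg + 1),   # +1 avoids division by zero
    })

print(df["text"].apply(custom_features))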

Thanks for giving the time to this blog!
If you find it useful, please upvote it.