Text Representation in NLP
Text to Number transformation | TfIdf | N-grams | Bi-grams | Uni-grams
Introduction
What is Feature Extraction from text?
To text representation
To text recognition
Why do we need it?
Why is it difficult?
What is the core idea?
What are the techniques?
OHE (One Hot Encoding)
BOW (Bag of Words)
N-grams
TfIdf
Custom features
Word2Vec (Word Embeddings)
Common Terms used in NLP
What is Corpus (C)?
A collection of authentic texts or audio that’s organized into datasets.
In other words, A corpus is a collection of machine-readable texts that’s representative of a specific language or language variety.
What is Vocabulary (V)?
Vocabulary is the collection of all unique words or linguistic units that appear in a given dataset.
In other words, the set of unique words used in the text corpus.
What is Document (D)?
A text object; the collection of documents makes up your corpus.
If you are doing work on Search or Topics, the documents will be the objects which you will be finding similarities between in order to group them topically.
What is Word (W)?
In NLP, a word is represented as a vector of real numbers, called a word embedding, to help computers understand the meaning of words.
One Hot Encoding (OHE)
To understand this topic, we will use an example,
D1 | people watch CodeA2Z |
D2 | CodeA2Z watch CodeA2Z |
D3 | people write comment |
D4 | CodeA2Z write comment |
Here Corpus is shown below,
people watch CodeA2Z CodeA2Z watch CodeA2Z people write comment CodeA2Z write comment
Here Vocabulary is shown below,
people watch CodeA2Z write comment
By using this technique, we create as many features as there are words in the vocabulary: each word is represented by a vector with a 1 (true) at its own position and 0 (false) everywhere else, as in the sketch below.
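A minimal sketch of this idea in plain Python, assuming we build the vocabulary directly from the four example documents (the one_hot helper is just for illustration):

corpus = [
    "people watch CodeA2Z",
    "CodeA2Z watch CodeA2Z",
    "people write comment",
    "CodeA2Z write comment",
]
# Vocabulary = set of unique words in the corpus
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def one_hot(word, vocabulary):
    # 1 at the word's own position, 0 everywhere else
    return [1 if word == v else 0 for v in vocabulary]

print(vocabulary)                     # ['CodeA2Z', 'comment', 'people', 'watch', 'write']
print(one_hot("people", vocabulary))  # [0, 0, 1, 0, 0]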
Pros
Intuitive (Easy to understand)
Easy to implement
Flaws
Sparsity
For example, to represent a single word from a vocabulary of 5,000,000 words, we need an array of length 5,000,000.
There is a single true (1) in that array while the rest is filled with false (0), so handling such large, mostly empty arrays is inefficient.
Chances of over-fitting.
Out of Vocabulary (OOV)
- If the user inputs a word that is not present in the vocabulary, it cannot be converted into numbers, so the model will not be able to predict on that input.
No fixed size (documents with different numbers of words produce representations of different sizes)
No capturing of semantic meaning
- Words lose their semantic meaning in OHE.
For e.g., walk and run are close to each other in meaning while bottle is not, but their one-hot vectors are all equally distant from each other.
Bag of Words
In bag of words, we count how many times each vocabulary word occurs in a document and build an array of those counts.
The core intuition is that documents of the same type tend to contain the same words with similar frequencies.
The order of the words and their context is ignored, so the semantic meaning of individual words is not captured.
Each sentence is turned into a vector and, just like clustering with k-means, sentences whose vectors are closest are assigned to the same class. In this way, we can say that bag of words captures some semantic meaning of the sentence as a whole.
Code Implementation,
from sklearn.feature_extraction.text import CountVectorizer

# df is a pandas DataFrame with a 'text' column holding the documents
cv = CountVectorizer()
bow = cv.fit_transform(df['text'])
vocabulary = cv.vocabulary_

# To print any sentence after encoding:
# print(bow[0].toarray())
# cv.transform(["CodeA2Z watch and write comment of CodeA2Z"])
Unlike the manual one hot encoding above, transforming new text does not fail on out-of-vocabulary words; words that are not in the vocabulary are simply ignored, as shown below.
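For example, transforming a sentence that contains unseen words (assuming df['text'] holds the four example documents D1-D4 and cv is the fitted CountVectorizer from above):

# 'and' and 'of' were never seen during fit, so they are simply ignored
print(cv.transform(["CodeA2Z watch and write comment of CodeA2Z"]).toarray())
# [[2 1 0 1 1]]  -> counts for codea2z, comment, people, watch, write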
CountVectorizer counts word frequencies by default. If you only want to know whether a word is present in the content or not (for example, in a simple spam-detection setup), you can pass binary = True to CountVectorizer, as in the sketch below.
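A minimal sketch of that option (same df assumption as above):

cv = CountVectorizer(binary=True)
bow_binary = cv.fit_transform(df['text'])
# Every non-zero count becomes 1: the vector only records presence or absence of each word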
You can read more about CountVectorizer in the scikit-learn documentation.
Advantages
Simple and Intuitive
Captures some document-level similarity (documents that share words get similar vectors)
Disadvantages
Sparsity
Overfitting
OOV
Word order is ignored.
Sentences with different meanings can still get nearly identical BOW vectors. For e.g.,
D1: This is a very good movie.
D2: This is not a very good movie.
Here, not is the only extra word and it flips the meaning of the sentence, but BOW barely registers the difference and keeps the two vectors in the same class, as shown in the sketch below.
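A quick sketch of this limitation using the two sentences above:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is a very good movie", "This is not a very good movie"]
cv = CountVectorizer()
vectors = cv.fit_transform(docs).toarray()
print(cv.get_feature_names_out())  # ['good' 'is' 'movie' 'not' 'this' 'very']
print(vectors)
# [[1 1 1 0 1 1]
#  [1 1 1 1 1 1]]
# The two vectors differ in a single column, so BOW treats opposite sentences as nearly identical.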
N-grams (Bag of n-grams)
In n-grams, N is the number of consecutive words used to build the vocabulary. If N=2, it is called bi-grams; if N=3, it is called tri-grams, and so on.
For e.g., D: people watch CodeA2Z
Bi-grams
Vocabulary: people watch, watch CodeA2Z
Tri-grams
Vocabulary: people watch CodeA2Z
Quad-grams
Not possible for this example, since the document contains only three words.
NOTE: BOW is also called uni-grams, because we consider only one word at a time.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1,1)) # only uni-gram
cv = CountVectorizer(ngram_range=(1,2)) # uni & bi-gram
cv = CountVectorizer(ngram_range=(2,2)) # only bi-gram
cv = CountVectorizer(ngram_range=(3,3)) # only tri-gram
cv = CountVectorizer(ngram_range=(1,3)) # uni & bi & tri-gram
bow = cv.fit_transform(df['text'])
What is the benefit of using N-grams?
It can be easily understood by using an example,
For e.g., D1: This movie is very good.
D2: This movie is not good.
These 2 sentences have completely different meanings, yet BOW keeps their vectors close together and puts them in the same class.
If we use bi-grams, every pair of consecutive words is treated as a single phrase and the vocabulary is built from those phrases. This creates a large difference between the two vectors, so they are no longer grouped into the same class. In this way, we capture more of the meaning of the sentences, as shown in the sketch below.
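A minimal sketch of the same two sentences encoded with bi-grams:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["This movie is very good", "This movie is not good"]
cv = CountVectorizer(ngram_range=(2, 2))  # bi-grams only
vectors = cv.fit_transform(docs).toarray()
print(cv.get_feature_names_out())
# ['is not' 'is very' 'movie is' 'not good' 'this movie' 'very good']
print(vectors)
# [[0 1 1 0 1 1]
#  [1 0 1 1 1 0]]
# The bi-gram vectors differ in four of the six columns, while the uni-gram
# vectors of these sentences would differ in only two.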
Advantages
Able to capture semantic meaning of the sentence.
Easy to implement.
Disadvantages
The dimensionality of the dataset increases as N in N-grams increases.
OOV
Model building slows down because of the increase in the number of features.
Tf-Idf
The core intuition of Tf-Idf is that if a word occurs frequently in a particular document but rarely in the rest of the corpus, that word is especially important for that document.
Tf-Idf therefore assigns a weight to each word by multiplying Tf (term frequency) and Idf (inverse document frequency).
Weightage of any word = Tf (Term frequency) * Idf (Inverse document frequency)
Tf (t, d) = (Number of occurrences of term t in document d) / (Total number of terms in the document d)
Idf (t) = 1 + ln ((Total number of documents in the corpus) / (Number of documents with term t in them ))
Range of Tf: 0 <= Tf <= 1 (it is a proportion)
Range of Idf: 1 <= Idf < infinity
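A quick worked example with the earlier corpus (D1: people watch CodeA2Z, 4 documents in total), following the formulas above: the term people appears once among the 3 terms of D1, so Tf(people, D1) = 1/3 ≈ 0.33; people appears in 2 of the 4 documents, so Idf(people) = 1 + ln(4/2) ≈ 1.69; its weight in D1 is therefore about 0.33 * 1.69 ≈ 0.56. (Note that scikit-learn's TfidfVectorizer applies additional smoothing and normalization, so the values it prints will differ slightly.)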
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(df['text']).toarray())
print(tfidf.idf_)
print(tfidf.get_feature_names_out())
Advantages
- Widely used in information retrieval (ranking documents against a query)
Disadvantages
Sparsity
OOV
Dimensionality increases
Semantic relation is not captured
For e.g., D1: This is beautiful. D2: This is gorgeous. The two documents are treated as completely different even though their meanings are almost the same.
Custom features
These are also called hand-crafted features.
Features like positive word count, negative word count, the ratio between positive and negative words, word count, character count, etc.
Which features you craft mainly depends on the type of problem you are solving and how you want to represent the text, as in the sketch below.
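A minimal sketch of such hand-crafted features (the positive and negative word lists here are tiny illustrative placeholders, not a real lexicon):

# Illustrative word lists; a real project would use a proper sentiment lexicon
positive_words = {"good", "great", "beautiful", "love"}
negative_words = {"bad", "boring", "hate", "worst"}

def custom_features(text):
    words = text.lower().split()
    pos = sum(w in positive_words for w in words)
    neg = sum(w in negative_words for w in words)
    return {
        "word_count": len(words),
        "char_count": len(text),
        "positive_words": pos,
        "negative_words": neg,
        "pos_neg_ratio": pos / (neg + 1),  # +1 avoids division by zero
    }

print(custom_features("This is a very good movie"))
# {'word_count': 6, 'char_count': 25, 'positive_words': 1, 'negative_words': 0, 'pos_neg_ratio': 1.0}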
Thanks for giving the time to this blog!
If you find it useful, please upvote it.