NATURAL LANGUAGE PROCESSING

NLP — Zero to Hero with Python

A handbook for learning the basic ideas of NLP

Sonam

--

Photo by Sincerely Media on Unsplash

Topics to be covered:

Section 1: NLP Introduction, Installation guide for spaCy and NLTK

Section 2: Basic ideas about a text, Regular expression

Section 3: Tokenization and Stemming

Section 4: Lemmatization and Stop words

Section 5: Part of Speech (POS) and Named Entity Recognition (NER)

Let’s talk about these topics one by one.


Section 1:

Introduction to NLP

Natural language processing comes under the umbrella of the artificial intelligence domain. Computers are good at processing numerical data; this field deals with text data, so that we can analyze the different languages of the world.

In this article, we will do a morphological study of language processing with Python, using libraries such as spaCy and NLTK.


If we look at raw text data, the human eye can pick out some points. But with Python, we can build a mechanism to analyze the text and extract the maximum information from it.

We will use a Jupyter notebook for all our processing and analysis. Jupyter comes with the Anaconda distribution.

Installation guide

First, install the Anaconda distribution from this link. After installing Anaconda, add the spaCy and NLTK libraries to your environment.

To install spaCy, the link is here.

To install NLTK, the link is here.

To download the English language model for spaCy, run:

python -m spacy download en_core_web_sm   # the small English model (older spaCy versions used the shortcut 'en')

Section 2:

Basic Concept

To work on language processing, we need some text data, so we will first learn how to handle text in Python.

Start with basic strings and variables. Let’s see how to print a normal string.

print('Amit')
# output: Amit

Take an example:

Take the string GURUGRAM, the name of my city. When we need to select a specific range of characters, we use the slicing and indexing methods. Going from left to right, indexing starts at 0; when we want characters from right to left, it starts at minus one (-1), not zero.

Photo created by the author

With python

# first, insert the string into a variable
string = "GURUGRAM"
# get the first character with its index
print(string[0])             # output: G
# print multiple characters
print(string[2], string[5])  # output: R R
# get a character with negative indexing
print(string[-4])            # output: G

Now get characters with slicing.

print(string[0:2])  # output: GU
print(string[1:4])  # output: URU

Let’s do some basics with sentences, for example cleaning a sentence that has stars in it.

A handy function for this is strip(). It removes characters from the start and the end of a string, but it cannot remove characters in the middle. If we don’t specify a character to remove, it removes whitespace by default.

# a sentence and the character to remove from it
sentence = "****Hello World! I am Amit Chauhan****"
removing_character = "*"
# using the strip function to remove the stars (*)
sentence.strip(removing_character)
# output: 'Hello World! I am Amit Chauhan'

In the output above, the stars are removed from the sentence. So strip() is a basic way to remove characters, but it is not reliable for thorough cleaning.
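To make this limitation concrete, here is a small sketch (the example string is made up) comparing strip(), which only trims the ends, with a regular-expression substitution that removes the character everywhere:

```python
import re

sentence = "**Hello *World* I am Amit**"

# strip() only trims the leading and trailing stars; the inner ones survive
print(sentence.strip("*"))           # Hello *World* I am Amit

# re.sub() removes every star, wherever it appears
print(re.sub(r"\*", "", sentence))   # Hello World I am Amit
```

Regular expressions, covered below, are the more reliable tool for this kind of cleaning.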

Like the strip function, another operation I came across is the join operation.

Example:

str1 = "Happy"
str2 = "Home"
" Good ".join([str1, str2])
# output: 'Happy Good Home'

Regular Expression

A regular expression, often called RegEx, is used for character or string matching and, in many cases, for finding and replacing characters or strings.

Let’s see how to work with strings and patterns in regular expressions. First, here is how to import the module:

# to use a regular expression, we need to import re
import re

How to use re on a simple string

Example:

Let’s have a sentence in which we have to find the string and some operations on the string.

sentence = "My computer gives a very good performance in a very short time."
string = "very"

How to search a string in a sentence

str_match = re.search(string, sentence)
str_match
# output: <re.Match object; span=(20, 24), match='very'>

We can also perform operations on this match object. To see all of them, type str_match and then press Tab; it will show all the available operations.

All operations on a string. Photo by author

str_match.span()
# output: (20, 24)

This shows the span of the first occurrence of “very”: it starts at index 20 and ends just before index 24 in the sentence. What if we want to find a word that occurs multiple times? For that, we use the findall operation.
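Since the span is a half-open interval, slicing the sentence with it recovers the matched text exactly. A quick sketch using the same sentence as above:

```python
import re

sentence = "My computer gives a very good performance in a very short time."
str_match = re.search("very", sentence)

# unpack the span and slice the sentence with it
start, end = str_match.span()
print(sentence[start:end])   # very
print(start, end)            # 20 24
```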

find_all = re.findall("very", sentence)
find_all
# output: ['very', 'very']

The above operation just finds and prints a string that occurs multiple times. But if we want to know the span of each occurrence in the sentence, so we can get an idea of the word’s placement, we use the iterating finditer operation.

for word in re.finditer("very", sentence):
    print(word.span())
# output:
# (20, 24)
# (47, 51)

Some common regular expression patterns are character classes such as [a-z], [A-Z], and [0-9], escaped characters such as \- and \., and literal symbols such as @, #, $, and %. These patterns are used to find text and, if necessary, remove it to clean the data. Quantifiers can be added to a pattern to specify how many occurrences we expect.
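As a sketch of these patterns and quantifiers in action (the example text here is made up for illustration):

```python
import re

text = "Order #42 shipped to amit@example.com on 2020-11-30"

# [0-9]+ matches one or more digits (+ is a quantifier)
print(re.findall(r"[0-9]+", text))                 # ['42', '2020', '11', '30']

# a crude lowercase e-mail pattern built from character classes
print(re.findall(r"[a-z]+@[a-z]+\.[a-z]+", text))  # ['amit@example.com']

# remove everything that is not a letter or a space, to clean the data
print(re.sub(r"[^A-Za-z ]", "", text))
```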

Section 3:

Tokenization

When a sentence is broken up into small individual words, these pieces are known as tokens, and the process is known as tokenization.

The sentence is split on prefixes, infixes, suffixes, and exceptions. For tokenization, we will use the spaCy library.

# import library
import spacy
# load spaCy's English library
load_en = spacy.load('en_core_web_sm')

# take an example string
example_string = "I'm going to meet M.S. Dhoni."
# pass the string to the library
words = load_en(example_string)

# get the token pieces with a for loop
for tokens in words:
    print(tokens.text)

# output:
# I
# 'm
# going
# to
# meet
# M.S.
# Dhoni
# .

We can get tokens from indexing and slicing.

str1 = load_en(u"This laptop belongs to Amit Chauhan")
# getting a token with an index
str1[1]      # output: laptop
# getting tokens with slicing
str1[2:6]    # output: belongs to Amit Chauhan

Stemming

Stemming is a process in which words are reduced to their root form.

Types of stemmer

  1. Porter Stemmer
  2. Snowball Stemmer

spaCy doesn’t include a stemmer, so we will use the NLTK library for the stemming process.

The Porter stemmer was developed in 1980. It is used to reduce a word to its stem, or root word.

# import the nltk library
import nltk
# import the Porter stemmer from nltk
from nltk.stem.porter import PorterStemmer
pot_stem = PorterStemmer()

# random words to test the Porter stemmer
words = ['happy', 'happier', 'happiest', 'happiness', 'breathing', 'fairly']
for word in words:
    print(word + '----->' + pot_stem.stem(word))

# output:
# happy----->happi
# happier----->happier
# happiest----->happiest
# happiness----->happi
# breathing----->breath
# fairly----->fairli

As we see above, the words are reduced to their stems, but notice that the Porter stemmer does not give very good results. That is why the Snowball stemmer, an improved method, is used.

from nltk.stem.snowball import SnowballStemmer
snow_stem = SnowballStemmer(language='english')

for word in words:
    print(word + '----->' + snow_stem.stem(word))

# output:
# happy----->happi
# happier----->happier
# happiest----->happiest
# happiness----->happi
# breathing----->breath
# fairly----->fair

Section 4:

Lemmatization

Lemmatization is better than stemming and more informative: it looks beyond the word to its stem and also determines the part of speech of the word. That is why spaCy includes lemmatization rather than stemming, so we will do lemmatization with spaCy.

# import library
import spacy
# load spaCy's English library
load_en = spacy.load('en_core_web_sm')

# take an example string
example_string = load_en(u"I'm happy in this happiest place with all happiness. It feels how happier we are")
for lem_word in example_string:
    print(lem_word.text, '\t', lem_word.pos_, '\t', lem_word.lemma, '\t', lem_word.lemma_)

Description of words in the lemmatization process. Photo by Author

In the above lemmatization code, the description gives all the information about each word: its part of speech, the number that identifies its lemma in the English language library, and the lemma itself. We can observe that happiest and happier are both reduced to happy, a better result than stemming gives.

Stop Words

Stop words are words that repeat often and give little information about the text, so they are filtered out. spaCy has a built-in list of stop words.

# import library
import spacy
# load spaCy's English library
load_en = spacy.load('en_core_web_sm')
print(load_en.Defaults.stop_words)

Some Default Stop Words. Photo by Author
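To illustrate the filtering idea without loading a model, here is a minimal sketch using a small hand-written stop word set (spaCy’s real list is much larger, and each spaCy token also exposes an is_stop attribute you would use in practice):

```python
# a tiny illustrative stop word set, not spaCy's full list
stop_words = {"i", "am", "in", "this", "with", "all"}

sentence = "I am happy in this happiest place with all happiness"

# keep only the words that are not stop words
filtered = [word for word in sentence.split()
            if word.lower() not in stop_words]
print(filtered)   # ['happy', 'happiest', 'place', 'happiness']
```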

Section 5:

Part of Speech (POS)

Part of speech (POS) tagging is a process to get grammatical information about the words of a text as tokens. This deeper information is very important for natural language processing. There are two types of tags: coarse tags for categories such as noun and verb, and fine-grained tags for details such as plural noun or past tense.

# import library
import spacy
# load spaCy's English library
load_en = spacy.load('en_core_web_sm')
str1 = load_en(u"This laptop belongs to Amit Chauhan")

Check tokens with index position.

print(str1[1])
# output: laptop

How to call various operations on this token:

# pos_ tag operation
print(str1[1].pos_)
# output: NOUN

# to get fine-grained information
print(str1[1].tag_)
# output: NN

So the coarse tag is NOUN, and the fine-grained tag is NN, which says this noun is singular. Let’s also get the POS counts with spaCy.

pos_count = str1.count_by(spacy.attrs.POS)
pos_count
# output: {90: 1, 92: 1, 100: 1, 85: 1, 96: 2}

Oh! You may be confused about what these numbers are; let me clear up the confusion.

Let’s check what this number 90 means.

str1.vocab[90].text
# output: DET

DET means that the number 90 corresponds to a determiner, and its value of 1 means the determiner appears one time in the sentence.

Named Entity Recognition (NER)

Named entity recognition is very useful for identifying entities in text and tagging them, whether the text is raw or unstructured. Readers often don’t know the entity types in a text, so NER helps tag them and give meaning to the text.

We will do NER examples with spacy.

# import library
import spacy
# load spaCy's English library
load_en = spacy.load('en_core_web_sm')

# let's label the entities in the text
doc = load_en(u"I am living in India, Studying in IIT")
if doc.ents:
    for ner in doc.ents:
        print(ner.text + ' - ' + ner.label_ + ' - ' + str(spacy.explain(ner.label_)))
else:
    print("No Entity Found")

# output:
# India - GPE - Countries, cities, states

In the above code, the text is analyzed with NER, and India is found and tagged as a GPE, that is, a country, city, or state name. So the tagging is done with entity annotation.

Conclusion:

These concepts are a very good starting point for learners to get an idea of natural language processing.

Reach me on my LinkedIn
