Data Cleaning, Splitting, Normalizing, & Stemming – NLP COURSE 01

Natural Language Processing

Welcome to AIHUB’s new tutorial series on Natural Language Processing. We have already completed tutorials on “Python 3 For AI” and our followers’ favorite course, “Machine Learning From Scratch“. If you haven’t gone through those courses, you can check them out now. In today’s session, we will cover a basic course on Natural Language Processing.


  • 1. Introduction to Natural Language Processing
  • 2. Text Cleaning, Splitting & Normalization
  • 3. NLTK – Splitting, Filtering & Stemming


Language is the most important communication tool invented by human civilization. It is either spoken or written, and consists of using words in a structured and conventional way. Language helps us share our thoughts and understand others.

Natural Language Processing, a form of artificial intelligence, is all about analyzing and understanding written or spoken language and the context in which it is used. The ultimate objective of NLP is to read, decipher, understand, and make sense of human languages in a manner that is valuable.

Wikipedia defines NLP as “a subfield of AI concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.”


In this session, we will show you how to clean text when preparing a dataset for NLP. We will use built-in Python functions in this section, and we will introduce the NLTK library in the next. Data preprocessing in NLP involves steps like splitting documents into sentences and words, and there are various ways to split text. Here we will go through some of them:

1) Split by White Spaces

Splitting by white spaces refers to splitting a document or text into words at white-space boundaries. Calling split() with no arguments splits the text on white space only. It does not treat apostrophes specially, so a contraction like “who’s” stays intact as a single token.
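A minimal sketch of white-space splitting with Python’s built-in split() (the sample sentence is our own):

```python
# A short sentence containing contractions.
text = "Who's on first? Let's check the output."

# split() with no arguments splits on runs of white space only.
tokens = text.split()
print(tokens)
# ["Who's", 'on', 'first?', "Let's", 'check', 'the', 'output.']
```

Note that "Who's" survives as one token, but trailing punctuation such as "first?" also sticks to its word.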

2) Split by Words

As the title suggests, this approach splits text into individual words, typically using a regular expression that treats punctuation and apostrophes as boundaries. Do you know the difference between splitting by words and splitting by white space? Notice the difference in “who’s”.
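One common way to split by words is the re module’s \w+ pattern (a sketch with our own sample sentence; other word patterns are possible):

```python
import re

text = "Who's on first? Let's check the output."

# \w+ matches runs of letters, digits, and underscores, so punctuation
# and apostrophes act as word boundaries.
words = re.findall(r"\w+", text)
print(words)
# ['Who', 's', 'on', 'first', 'Let', 's', 'check', 'the', 'output']
```

Unlike white-space splitting, "Who's" is now broken into "Who" and "s", and the punctuation is gone.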

3) Normalization

In NLP, we convert all uppercase characters to lowercase. We don’t recommend applying this step to every dataset: normalizing words can change their meaning entirely. For example, Orange is a French telecom company, whereas orange is a fruit.
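Lowercasing is a one-liner in Python; a sketch over a hypothetical token list:

```python
tokens = ["Orange", "launched", "a", "new", "Plan"]

# str.lower() maps every uppercase character to lowercase.
lowered = [t.lower() for t in tokens]
print(lowered)
# ['orange', 'launched', 'a', 'new', 'plan']
```

After this step, "Orange" the company is indistinguishable from "orange" the fruit, which is exactly the caveat above.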


NLTK, the Natural Language ToolKit, is an open-source Python platform for working on Natural Language Processing. This library requires Python 3.5, 3.6, 3.7, or 3.8.


The tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences. It should be trained on a large collection of plain text in the target language before use.

The NLTK data package includes a pre-trained Punkt tokenizer for the English language.


Make sure you check out the output and spot the differences in “who’s”.


Python includes the built-in string method isalpha(), which determines whether a string consists only of alphabetic characters (as opposed to digits, punctuation, special characters, etc.). Make sure you check out the output and spot the differences.
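A sketch of filtering tokens with isalpha(), applied to a hypothetical token list like the one word_tokenize produced above:

```python
tokens = ["Who", "'s", "on", "first", "?", "42"]

# Keep only tokens made up entirely of letters; "'s", "?" and "42" are dropped.
words = [t for t in tokens if t.isalpha()]
print(words)
# ['Who', 'on', 'first']
```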


Stopwords are words that do not add much meaning to a sentence and can usually be ignored without sacrificing the meaning of the sentence. The most common are short function words such as the, is, at, which, and on.

That said, removing stopwords can cause problems when searching for phrases that include them, particularly in names such as “The Who” or “Take That”.

Treating the word “not” as a stopword can also invert the entire meaning of a sentence if it is removed (try “this code is not good”).

As you can see, the stopwords are all lower case and don’t have punctuation. If we’re to compare them with our tokens, we need to make sure that our text is prepared the same way.

This cell recaps all that we have previously learned in this Colab: tokenizing, lowercasing, and checking for alphabetic words.


Stemming refers to the process of reducing each word to its root or base. NLTK provides two common suffix-stripping stemmers, Porter and Lancaster; each has its own algorithm, and they sometimes produce different outputs.
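A sketch comparing the two stemmers on a few words of our choosing (neither stemmer needs downloaded data):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Lancaster is more aggressive than Porter, so the outputs can differ.
for word in ["running", "maximum", "connection"]:
    print(word, "->", porter.stem(word), "/", lancaster.stem(word))
```

For example, both stemmers reduce "running" to "run", but Porter leaves "maximum" untouched while Lancaster strips it further.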

This is all for today’s tutorial. We will cover more details in coming tutorials. Stay Safe & Happy Coding.

About Diwas

🚀 I'm Diwas Pandey, a Computer Engineer with an unyielding passion for Artificial Intelligence, currently pursuing a Master's in Computer Science at Washington State University, USA. As a dedicated blogger at AIHUBPROJECTS.COM, I share insights into the cutting-edge developments in AI, and as a Freelancer, I leverage my technical expertise to craft innovative solutions. Join me in bridging the gap between technology and healthcare as we shape a brighter future together! 🌍🤖🔬
