LOGISTIC REGRESSION FROM SCRATCH

machine learning from scratch

Welcome to a new machine learning from scratch tutorial. In today’s tutorial, we are going to cover Logistic Regression from scratch. First of all, we would like to thank Bishal Boss, who helped us make this content possible. Before we begin, we assume you are already familiar with the following topics:

  • Classification and Regression in Machine Learning.
  • What is Binary Classification?
  • Basic Geometry of Line, Plane, and Hyper-Plane in 2D, 3D, and n-D space respectively.
  • What is Maxima & Minima?
  • Loss Function.

INTRODUCTION TO REGRESSION

Regression analysis studies the nature of the relationship between two or more variables so that it can be used to predict the most likely value of the dependent variable for a given value of the independent variables.

Someone said, “We only learn great things by questioning them.” So let’s start with questions and their respective answers. In this blog, we will use LR as shorthand for Logistic Regression.

WHAT IS LOGISTIC REGRESSION ?

In simple terms, Logistic Regression is a classification technique, best suited for binary classification problems. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more independent variables.

Let’s say we have a binary classification problem with only 2 classes, true or false. Imagine we want to detect whether a credit card transaction is genuine or fraudulent.

  • +1 = Genuine
  • -1 = Fraudulent

First, let’s plot our data points in 2D space, since 2D is the easiest to visualize.

HOW DO WE DIFFERENTIATE DATA POINTS ?

If we show this image to a little child with no knowledge of maths or graphs and ask them to differentiate between the two kinds of points, I am pretty sure they will use common sense and draw a line between the two groups that separates them. That line is called a Decision Boundary, as shown below:

[Figure: a decision boundary drawn between the two groups of points]

Anything on the left side of the boundary is Fraudulent and anything on the right side of the boundary is Genuine. This is a common-sense approach.

HOW TO BUILD BOUNDARY ?

Now, if we look at it from a geometric perspective, the decision boundary is nothing but a simple line in 2D space. And we know that a line has an equation that can represent it, using 2 unique variables, “m” and “b”:

In two dimensions, the equation for non-vertical lines is often given in the slope-intercept form: y = mx + b where:

  • m is the slope or gradient of the line.
  • b is the y-intercept of the line.
  • x is the independent variable of the function y = f(x).

If our line passes through the origin, then b = 0 and the equation reduces to y = mx.
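As a quick sanity check, we can evaluate this equation in code, with hypothetical values m = 2 and b = 1 purely for illustration:

```python
# hypothetical slope and intercept, purely for illustration
m, b = 2.0, 1.0

def line(x):
    """Evaluate y = mx + b at a given x."""
    return m * x + b

print(line(0.0))  # at x = 0 we get the y-intercept: 1.0
print(line(3.0))  # 2*3 + 1 = 7.0
```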

In 3D space, we have the equation of a plane as:

Ax + By + Cz = D

If the plane passes through the origin, then D = 0 and (assuming C ≠ 0) z = -(A/C)x - (B/C)y

Now, this covers 2D and 3D space; in n-D space we have a Hyper-Plane instead of a line or plane.

HOW DO WE REPRESENT HYPERPLANE IN EQUATION ?

Imagine we have n dimensions in our Hyper-Plane; then each dimension will have its own slope value, let’s call them

w1, w2, w3, …, wn

and let the dimensions be represented as

x1, x2, x3, …, xn

and lets have an intercept term as “b”. So, the Hyper-Plane equation holds as:

w1x1 + w2x2 + w3x3 + …. + wnxn + b = 0

We can represent it as:

b + ∑ (wixi) = 0     s.t. (‘i’ ranges from 1 to n)

Now, if we collect all the wi into a vector W and all the xi into a vector X, then we can represent ∑ (wixi) as the product of 2 vectors, WᵀX, where Wᵀ is the transpose of the vector of all wi values and X is the vector of all xi values. Also, b here is a scalar value.
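The summation above is just a dot product, which is exactly how we will compute it in code later. A tiny sketch, with made-up values for w, x, and b:

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])   # hypothetical normal vector (w1, w2, w3)
x = np.array([1.0, 2.0, 0.5])    # hypothetical data point (x1, x2, x3)
b = 0.25                         # scalar intercept

# b + sum(wi * xi) is exactly b plus the dot product of w and x
value = b + w @ x
print(value)                     # 0.25 + (0.5 - 2.0 + 1.0) = -0.25
```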

For this multiplication to make geometric sense, W must be normal to our Hyper-Plane. This means that W is the Normal to our Hyper-Plane and, as we all know, a Normal is always perpendicular to its surface.

So now, coming back to our model, after all that maths, we conclude that our model needs to learn the Decision Boundary, using 2 important things, 

  1. All the wi values.
  2. The intercept b.

For simplicity, let’s consider our plane passes through origin. Then b=0.

Now, our model only needs to figure out the values of wi, which form the normal to our Hyper-Plane. This normal is a vector, a set of values that must be found to best suit our data. So the main task in LR boils down to a simple problem: given a dataset of (+ve, -ve) points, find a decision boundary, a hyperplane represented by (W, b), that best separates the points.

Let’s imagine that W has been derived; even then, how are we going to find out whether a point belongs to the +1 or the -1 class? For that, let’s consider the diagram below:

[Figure: points Xi and Xj on either side of the hyperplane, at distances di and dj, with normal vector W]

So, if you know the basics of ML problems, this can be framed as: given a set of Xi, we have to predict Yi, where Yi belongs to the set {+1, -1}. Now, if we want to calculate the value of di or dj (the signed distance of a point from the hyperplane), we can do it with the formula below:

di = (WᵀXi) / ||W||

where W is the normal vector to our hyperplane; let’s assume it is a unit vector for simplicity, so ||W|| = 1 and di = WᵀXi. Similarly, dj = WᵀXj. Since Xi lies on the positive side of the Hyper-Plane (the side W points toward) and Xj lies on the negative side, we have (WᵀXi) > 0 and (WᵀXj) < 0.

Basically, it means di belongs to +1 class and dj belongs to -1 class. And, this is how we can classify our data points using the Hyper-Plane.
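Putting this together in code: with a unit normal W, classification reduces to the sign of WᵀX. A small sketch, with made-up points on either side of the boundary:

```python
import numpy as np

w = np.array([1.0, 0.0])      # unit normal vector, ||w|| = 1
xi = np.array([2.0, 3.0])     # a point on the positive side
xj = np.array([-1.5, 0.5])    # a point on the negative side

def classify(x, w):
    # with ||w|| = 1 the signed distance is just w.T @ x,
    # so the predicted class is the sign of that product
    return 1 if w @ x > 0 else -1

print(classify(xi, w))   # 1
print(classify(xj, w))   # -1
```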

HOW DO WE CALCULATE OUR HYPERPLANE ?

[Figure: drawing a hyperplane (source: dataspoof.info)]

Well, if you have heard about optimization problems, our model finds the Hyper-Plane by solving one. Before that, we have to create an optimization equation. Let’s consider a few cases of yi:

  1. Case 1 : If a point is Positive, and we predict it as Positive, then
    • yi = +1  and  WᵀXi > 0, so
    • yi * (WᵀXi) > 0
  2. Case 2 : If a point is Negative, and we predict it as Negative, then
    • yi = -1  and  WᵀXi < 0, so
    • yi * (WᵀXi) > 0
  3. Case 3 : If a point is Positive, and we predict it as Negative, then
    • yi = +1  and  WᵀXi < 0, so
    • yi * (WᵀXi) < 0
  4. Case 4 : If a point is Negative, and we predict it as Positive, then
    • yi = -1  and  WᵀXi > 0, so
    • yi * (WᵀXi) < 0

Now, if we look closely, whenever we make a correct prediction the quantity [yi * (WᵀXi)] is always positive, irrespective of the class of the data point. Hence our optimization equation holds, as such:

(Max W) ∑ [yi * (WᵀXi)]     s.t. (‘i’ ranges from 1 to n)

Let’s try to understand what the equation has to offer. It says: find me a W (the vector normal to our Hyper-Plane) that maximizes the sum of [yi * (WᵀXi)], where “i” ranges from 1 to n and “n” is the total number of data points we have.

It means that whichever Hyperplane correctly predicts the maximum number of points, we will choose that one.
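To see the objective in action, here is a small sketch (with made-up points and labels) that counts how many points a candidate W classifies correctly, using the sign of yi * (WᵀXi):

```python
import numpy as np

X = np.array([[ 2.0,  1.0],
              [ 1.0,  3.0],
              [-2.0, -1.0],
              [-1.0, -2.0]])   # made-up data points
y = np.array([1, 1, -1, -1])   # their labels

def num_correct(w):
    # a point is correctly classified when yi * (w.T @ xi) > 0
    return int(np.sum(y * (X @ w) > 0))

print(num_correct(np.array([ 1.0, 1.0])))  # this W separates all 4 points
print(num_correct(np.array([-1.0, 0.0])))  # this W gets all of them wrong
```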

HOW DO WE SOLVE THE OPTIMIZATION PROBLEM TO FIND THE OPTIMAL W WHICH MAXIMIZES THE CORRECTLY CLASSIFIED POINTS?

Logistic Regression uses the Logistic Function. The logistic function, also called the sigmoid function, is an S-shaped curve that takes any real-valued number and maps it to a value between 0 and 1, but never exactly at those limits.

[Figure: the logistic (sigmoid) curve (https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/320px-Logistic-curve.svg.png)]

So we substitute our margin term in place of “t”:

t = yi * (WᵀXi)     s.t. (‘i’ ranges from 1 to n)
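The sigmoid itself is a one-liner in code; note how it squashes any real t into the open interval (0, 1):

```python
import numpy as np

def sigmoid(t):
    # maps any real number into (0, 1), never reaching the limits
    return 1.0 / (1.0 + np.exp(-t))

print(sigmoid(0.0))    # 0.5, the midpoint of the curve
print(sigmoid(10.0))   # very close to 1
print(sigmoid(-10.0))  # very close to 0
```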

And when we plug this into the sigmoid function and take the negative log-likelihood, we get another optimization problem, which is computationally easier to solve. The end optimization equation becomes as below:

    W* = (Argmin W) ∑ log(1 + e^(-t))

So, our equation changes from finding a Max to finding a Min, and we can solve it using an optimizer such as Gradient Descent. Intuitively, Gradient Descent tries to find the minima of a function; in our case, it tries to find the minima of our loss function.

HOW DOES IT MINIMIZE A FUNCTION OR FIND THE MINIMA ?

Our Optimizer tries to minimize the loss function of our model; by loss function I mean it tries to minimize the error made by our model, and eventually finds a Hyper-Plane which has the lowest error. The loss function (binary cross-entropy, written here with class labels y ∈ {0, 1}) has the equation below:

-[y*log(yp) + (1-y)*log(1-yp)]

  • y = actual class value of a data point
  • yp = predicted class value of data point
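This loss translates directly into code; the small epsilon clip is an implementation detail to avoid log(0), and the values here are made up for illustration:

```python
import numpy as np

def log_loss(y, yp, eps=1e-12):
    # binary cross-entropy, averaged over the data points
    yp = np.clip(yp, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y * np.log(yp) + (1 - y) * np.log(1 - yp))

y_true = np.array([1, 0, 1])           # actual classes
y_pred = np.array([0.9, 0.1, 0.8])     # predicted probabilities
print(log_loss(y_true, y_pred))        # small loss: predictions are good
```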

And so this is what Logistic Regression is, and that is how we get our best Decision Boundary for classification. In a broader sense, Logistic Regression tries to find the decision boundary that best separates the data points of different classes.

CODE FROM SCRATCH

Before that, let’s reiterate a few key points, so the code will make more sense to us:

  • X is the set of data points, with m rows and n dimensions.
  • y is the set of class labels, one for every data point from X, as 0 or 1.
  • z = WᵀXi
  • W = the vector of values that forms the Normal to our Hyper-Plane.
  • b = the scalar intercept term; not required if our Hyper-Plane passes through the origin.
  • yp = the predicted value for Xi, from the sigmoid function.

Intuitively speaking, our model tries to learn in each iteration using a learning rate and a gradient value. Think of it this way: once we predict values using the sigmoid function, we get yp, and we already have y. We calculate the error, use it to compute a new set of W values, and repeat the cycle until we finally find the best values possible.

In today’s code from scratch, we will be working on the Iris dataset. So let’s dive into the code:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn import linear_model

iris = datasets.load_iris()
X = iris.data[:, :2]     # use only the first 2 features
y = (iris.target != 0) * 1   # binary labels: 0 vs 1

Let’s plot our data to see how it lies, and whether it can be separated using a decision boundary.

plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='b', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='r', label='1')
plt.legend();
[Figure: scatter plot of the two classes]

Let’s train the model by creating a class for it; we will use a learning rate of 0.1 and 300,000 iterations.
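The class definition itself did not survive in this post, so here is a minimal sketch consistent with the constructor call below (learning_rate, num_iterations): batch gradient descent on the cross-entropy loss, with the intercept term folded into the weight vector.

```python
import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.01, num_iterations=100000):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations

    def _add_intercept(self, X):
        # prepend a column of ones so b becomes part of the weight vector
        return np.concatenate([np.ones((X.shape[0], 1)), X], axis=1)

    def _sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        X = self._add_intercept(X)
        self.theta = np.zeros(X.shape[1])       # start with W = 0
        for _ in range(self.num_iterations):
            yp = self._sigmoid(X @ self.theta)  # current predictions
            gradient = X.T @ (yp - y) / y.size  # gradient of the log loss
            self.theta -= self.learning_rate * gradient

    def predict_prob(self, X):
        return self._sigmoid(self._add_intercept(X) @ self.theta)

    def predict(self, X, threshold=0.5):
        # class 1 when the predicted probability crosses the threshold
        return (self.predict_prob(X) >= threshold) * 1
```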

model = LogisticRegression(learning_rate=0.1, num_iterations=300000)
model.fit(X, y)

Let us see how well our prediction works:

preds = model.predict(X)
(preds == y).mean()

Output: 1.0

LOGISTIC REGRESSION FROM SCIKIT LEARN


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
model = LogisticRegression(solver='liblinear', random_state=0)
model.fit(X, y)
model.predict_proba(X)

It’s often useful to summarize performance with a classification report:

print(classification_report(y, model.predict(X)))

Hello! I am Bishal Bose. You can mail me at [email protected] if you have any queries regarding the project. Connect with me on Instagram.

About Diwas Pandey

Highly motivated, strong drive with excellent interpersonal, communication, and team-building skills. Motivated to learn, grow and excel in Data Science, Artificial Intelligence, SEO & Digital Marketing

