Linear Regression from Scratch

Regression is a method for measuring the average relationship between two or more continuous variables, expressed in terms of a response variable and one or more feature variables. In other words, regression analysis identifies the nature of the relationship between variables so that it can be used to predict the most likely value of the dependent variable for given values of the independent variables. Linear regression is one of the most widely used regression algorithms.

For a more concrete example, suppose there is a high correlation between daytime temperature and sales of tea and coffee. A salesman who knows the expected temperature for the next day can then decide how much tea and coffee to stock. Regression makes this possible by estimating the likely sales from the temperature.

The variable whose value is estimated, predicted, or influenced is called the dependent variable. The variable that is known and used for prediction is called the independent variable; it is also called the explanatory, regressor, or predictor variable.


Linear regression is a supervised learning method that tries to find a linear relationship between continuous variables in a given dataset. The problem it solves is to fit the best line, plane, or hyperplane (as the number of dimensions increases) to the data.

The algorithm uses statistics computed from the training data to find the best-fit straight-line relationship between the input variable (X) and the output variable (Y). The equation of a simple linear regression model can be written as:

Y = mX + c, where m and c are calculated from the training data

In the above equation, m is the slope (the scale factor or coefficient), c is the intercept (the bias coefficient), Y is the dependent variable, and X is the independent variable. Once the coefficients m and c are known, the equation can be used to predict the output value Y for a given input X.

Mathematically, coefficients m and c can be calculated as:

m = sum((X(i) - mean(X)) * (Y(i) - mean(Y))) / sum( (X(i) - mean(X))^2 )
c = mean(Y) - m * mean(X)
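The two formulas above translate almost line for line into code. The following is a minimal sketch in plain Python; the function name fit_line and the toy data are invented for illustration and are not part of the salary example.

```python
def fit_line(xs, ys):
    """Return (m, c) for the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of the products of the deviations from the means.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: sum of the squared deviations of X from its mean.
    denominator = sum((x - mean_x) ** 2 for x in xs)
    m = numerator / denominator
    c = mean_y - m * mean_x
    return m, c


# Toy data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, c = fit_line(xs, ys)
print(m, c)
```

For this toy data the numerator is 6 and the denominator is 10, so m comes out at about 0.6 and c at about 2.2.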

In the accompanying figure, the red point lies very near the regression line, so its error of prediction is small. By contrast, the yellow point is much higher than the regression line, so its error of prediction is large. The best-fitting line is the line that minimizes the sum of the squared errors of prediction.
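To make the "sum of the squared errors" idea concrete, here is a small sketch that compares three candidate lines on toy data. The data and the helper name sum_squared_errors are assumptions made for illustration; the line with m = 0.6 and c = 2.2 is what the formulas for m and c give for this data.

```python
def sum_squared_errors(xs, ys, m, c):
    """Sum of squared vertical distances between the points and y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Toy data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

best = sum_squared_errors(xs, ys, 0.6, 2.2)     # least-squares line for this data
steeper = sum_squared_errors(xs, ys, 0.8, 2.2)  # same intercept, slope nudged upward
shifted = sum_squared_errors(xs, ys, 0.6, 2.7)  # same slope, intercept nudged upward
print(best, steeper, shifted)
```

Moving either the slope or the intercept away from the least-squares values increases the total squared error, which is what "best fit" means here.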


We will build a linear regression model from scratch to predict a person's salary from their years of experience. You can download the dataset from the link given below. Let’s start by importing the required libraries:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

We are using a dataset of 30 records, each with two columns: years of experience and salary. Let’s visualize the dataset first.

dataset = pd.read_csv('salaries.csv')

# Scatter plot of salary against years of experience
X = dataset['Years of Experience']
Y = dataset['Salary']

plt.scatter(X, Y)
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Salary Prediction Curves')
plt.show()


Finally, we have calculated the unknown coefficients, with m stored as b1 and c stored as b0. Here we have b1 = 9449.962321455077 and b0 = 25792.20019866869.
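The calculation that produces b1 and b0 is not shown in the text. A minimal NumPy sketch of how it could be done is given below. The toy arrays are invented so that the block runs on its own; they stand in for the 'Years of Experience' and 'Salary' columns and are not taken from salaries.csv, so the printed numbers will differ from the values quoted above.

```python
import numpy as np

# Toy stand-ins for dataset['Years of Experience'] and dataset['Salary'];
# these values are invented, not taken from salaries.csv.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([30000.0, 41000.0, 49000.0, 62000.0, 68000.0])

x_mean, y_mean = X.mean(), Y.mean()

# b1 is the slope (m) and b0 is the intercept (c) from the formulas given earlier.
b1 = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
b0 = y_mean - b1 * x_mean

# Cross-check against NumPy's own degree-1 least-squares fit.
slope, intercept = np.polyfit(X, Y, 1)
print(b1, b0)
```

The np.polyfit call is only a sanity check: for a degree-1 fit it should return a slope and intercept that agree with b1 and b0 up to floating-point error.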

Let’s visualize the best-fit line from scratch. Code is available below.


Now let’s predict the salary Y by providing years of experience as X:

def predict(x):
    # Apply the fitted line: intercept b0 plus slope b1 times x
    return b0 + b1 * x

y_pred = predict(6.5)
print(y_pred)

Output: 87216.95528812669


from sklearn.linear_model import LinearRegression

X = dataset.drop(['Salary'],axis=1)                
Y = dataset['Salary'] 

reg = LinearRegression()  # creating object reg
reg.fit(X, Y)             # fitting the data set

Let’s visualize the best fit line using Linear Regression from sklearn. Code is available below.


Fig: Best fit line using sklearn

Now let’s predict the salary Y by providing years of experience as X:

y_pred = reg.predict([[6.5]])  

Output: 87216.95528812669


We need to be able to measure how well our model fits the data. There are many ways to do this, but we will use root mean squared error (RMSE) and the coefficient of determination (R² score).
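As a rough sketch of these two metrics, assuming their standard definitions (RMSE is the square root of the mean squared residual, and R² is 1 minus the ratio of the residual sum of squares to the total sum of squares), the calculation could look like this. The function names and the toy values are invented for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, invented for illustration only.
y_true = [2, 4, 5, 4, 5]
y_pred = [2.8, 3.4, 4.0, 4.6, 5.2]  # the line y = 0.6x + 2.2 evaluated at x = 1..5
print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

A lower RMSE means smaller typical errors, in the same units as Y. An R² closer to 1 means the line explains more of the variation in Y; for this toy data it is about 0.6.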

  1. Try evaluating the model with different error metrics for linear regression, such as mean absolute error and root mean squared error.
  2. Try the algorithm on larger datasets, both balanced and imbalanced, so that you see a range of regression scenarios.
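As a starting point for the first suggestion, the sketch below compares mean absolute error with root mean squared error on toy predictions where a single point is badly off. The function names and the numbers are invented for illustration.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


# Three perfect predictions and one that is off by 8 (invented values).
y_true = [10, 10, 10, 10]
y_pred = [10, 10, 10, 18]

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
print(mae, rmse)  # prints 2.0 4.0
```

Because RMSE squares each residual before averaging, one large miss raises it more than it raises MAE, which is worth keeping in mind when choosing between the two.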


  • Diwas Pandey
  • Sunil Ghimire
  • Abhishek chougule

About Diwas Pandey

Highly motivated, strong drive with excellent interpersonal, communication, and team-building skills. Motivated to learn, grow and excel in Data Science, Artificial Intelligence, SEO & Digital Marketing
