Random Forest from Scratch

As the name suggests, a random forest is a “forest” of trees, i.e. decision trees. It is a tree-based machine learning algorithm that builds multiple decision trees, each from a random sample of the data and a random subset of the features. The random forest then combines the outputs of the individual decision trees to generate the final output. Now, let’s start today’s topic: random forest from scratch.

Decision trees greedily select the best split point from the dataset at each step. We can use random forest for classification as well as regression problems. If the total number of columns in the training dataset is denoted by p:

  1. For classification, we take sqrt(p) columns.
  2. For regression, we take p/3 columns.

Random forest is a good choice:

  1. When we care more about accuracy than interpretability
  2. When we want better accuracy on unseen validation data

The algorithm works as follows:

  • Select random samples from the given dataset.
  • Construct a decision tree from every sample and obtain its output.
  • Perform a vote over the predicted results.
  • Select the most voted prediction as the final prediction.
Fig: Random Forest in Picture
Source: javapoint
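The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used later in this post: the helper names `fit_forest` and `predict_forest` are hypothetical, the data is synthetic, and scikit-learn's `DecisionTreeClassifier` stands in for a hand-written tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, seed=0):
    """Train n_trees decision trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Step 1: draw a random sample of rows, with replacement.
        rows = rng.integers(0, n_samples, size=n_samples)
        # Step 2: build a tree on that sample; max_features="sqrt" makes the
        # tree consider about sqrt(p) randomly chosen columns at each split.
        tree = DecisionTreeClassifier(
            max_features="sqrt", random_state=int(rng.integers(0, 1_000_000))
        )
        tree.fit(X[rows], y[rows])
        trees.append(tree)
    return trees


def predict_forest(trees, X):
    """Return the most voted class for every row of X."""
    # Step 3: collect one vote per tree, shape (n_trees, n_rows).
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)
    # Step 4: the class with the most votes wins.
    return np.array([np.bincount(column).argmax() for column in votes.T])


# Synthetic data, used only so that the sketch runs end to end.
X, y = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)
forest = fit_forest(X[:200], y[:200])
predictions = predict_forest(forest, X[200:])
print("held-out accuracy:", (predictions == y[200:]).mean())
```

For regression, the same structure applies, with the vote replaced by the average of the trees' predictions.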


A decision tree is essentially a series of if-then statements that, when applied to a record in a dataset, result in the classification of that record. We have covered the mathematical concepts and a from-scratch project, with a detailed explanation, in the decision tree tutorial.
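To see the if-then structure directly, a fitted tree can be printed as text rules. The sketch below is an illustration only: it uses scikit-learn's built-in iris dataset (not the dataset of this post) and a shallow tree so that the output stays short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: the iris dataset is used purely as a small, convenient example.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the fitted tree as nested if-then rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is a condition on one feature, and each leaf line names the class assigned to records that satisfy all the conditions above it.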


Since random forest is built entirely on decision trees, this tutorial does not implement the tree itself from scratch again. You can go through the decision tree from scratch tutorial for that.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Import the model we are using
from sklearn.ensemble import RandomForestRegressor

data = pd.read_csv('data.csv')

Here, we use a dataset (available below) with seven columns: date, open, high, low, close, volume, and the name of the company. In this case, Google is the only company included. Open is the price at which the stock first trades when the exchange opens. High and Low are the highest and lowest prices the stock reaches during the trading period. Close is the last price at which the stock trades before the exchange closes. Volume is the number of shares traded during the period.

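# Assumption: the original never defines `abc`; here it is taken to be each date string split on '-'.
abc = data['Date'].str.split('-')
# Join the parts back into one digit string per row, without chained assignment.
data['Date'] = abc.str.join('')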

Using the above dataset, we now try to predict the ‘Close’ value from the other attributes. Let’s split the data into train and test datasets.
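The split itself is not shown in this post, so the following is only a sketch of one common way to do it. The data is a synthetic stand-in for data.csv, the 80/20 ratio and the choice to drop the constant `Name` column are assumptions, and the variable names follow the `X_test_1` and `y_test_1` names used further down.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for data.csv (assumption: same column names, made-up values).
rng = np.random.default_rng(42)
n_rows = 200
close = 500 + np.cumsum(rng.normal(0, 2, n_rows))
dates = pd.date_range("2020-01-01", periods=n_rows)
data = pd.DataFrame({
    "Date": [int(d.strftime("%Y%m%d")) for d in dates],
    "Open": close + rng.normal(0, 1, n_rows),
    "High": close + np.abs(rng.normal(0, 2, n_rows)),
    "Low": close - np.abs(rng.normal(0, 2, n_rows)),
    "Close": close,
    "Volume": rng.integers(1_000_000, 5_000_000, n_rows),
    "Name": "Google",
})

# Features: every column except the target and the constant company name.
X = data.drop(columns=["Close", "Name"])
y = data["Close"]

# Hold out 20% of the rows for testing (the ratio is an assumption).
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train_1.shape, X_test_1.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing models on the same held-out rows.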

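Now, let’s instantiate the model and train it on the training dataset:

rfg = RandomForestRegressor(n_estimators=10, random_state=42)
# Assumption: X_train_1 and y_train_1 are the training features and target from the split above.
rfg.fit(X_train_1, y_train_1)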

Let’s rank the features by importance by calculating the numerical feature importances:
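The original code for this step is not shown, so here is a minimal sketch of how the importances could be read from a fitted model. The training data below is synthetic and only mimics the column names of the dataset; the actual values on the real data will differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the training features (assumption: made-up values).
rng = np.random.default_rng(0)
close = 500 + np.cumsum(rng.normal(0, 2, 300))
X_train_1 = pd.DataFrame({
    "Open": close + rng.normal(0, 1, 300),
    "High": close + np.abs(rng.normal(0, 2, 300)),
    "Low": close - np.abs(rng.normal(0, 2, 300)),
    "Volume": rng.integers(1_000_000, 5_000_000, 300).astype(float),
})
y_train_1 = pd.Series(close, name="Close")

rfg = RandomForestRegressor(n_estimators=10, random_state=42)
rfg.fit(X_train_1, y_train_1)

# feature_importances_ holds one non-negative score per column; the scores sum to 1.
importances = pd.Series(rfg.feature_importances_, index=X_train_1.columns)
print(importances.sort_values(ascending=False).round(3))
```

A column with a score close to zero contributes little to the splits the forest learned.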

rfg.score(X_test_1, y_test_1)

Output: 0.9997798214978976
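For a regressor, `score` returns the coefficient of determination R², not a classification accuracy. The sketch below computes R² by hand on a few made-up numbers and checks it against scikit-learn's `r2_score`; the values are illustrative only and are unrelated to the stock dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

# Residual sum of squares: how far the predictions are from the truth.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: how far the truth is from its own mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

A value of 1 means the predictions match the targets exactly, while a value of 0 means the model does no better than always predicting the mean.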

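We are getting an R² score of about 0.9998 on the test set. Note that for a regressor, score returns the coefficient of determination (R²), not a classification accuracy. We then display the predicted values next to the original values.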

pd.concat([pd.Series(rfg.predict(X_test_1)), y_test_1.reset_index(drop=True)], axis=1)


Advantages of random forest:

  • It reduces overfitting because the prediction is based on majority voting (or averaging, for regression) across many trees.
  • Random forest can be used for classification as well as regression.
  • It works well on a wide range of datasets.
  • Random forest generally gives good accuracy on unseen data and can remain accurate when some data is missing.
  • Data normalization isn’t required, as it is a rule-based approach.


Disadvantages of random forest:

  1. Random forest requires much more computational power and memory to build numerous decision trees.
  2. Because it is an ensemble of decision trees, it is harder to interpret than a single tree. Feature importances give only a summary of each variable's influence, not an explanation of individual predictions.
  3. Random forests become less intuitive as the collection of decision trees grows.
  4. The trees are not fully independent of each other. With bagging, every tree is built by the same greedy algorithm, so the trees are likely to use the same or very similar split points and give similar predictions, which limits the variance reduction the ensemble is meant to provide. Random feature selection reduces this correlation but does not remove it.


About Diwas

🚀 I'm Diwas Pandey, a Computer Engineer with an unyielding passion for Artificial Intelligence, currently pursuing a Master's in Computer Science at Washington State University, USA. As a dedicated blogger at AIHUBPROJECTS.COM, I share insights into the cutting-edge developments in AI, and as a Freelancer, I leverage my technical expertise to craft innovative solutions. Join me in bridging the gap between technology and healthcare as we shape a brighter future together! 🌍🤖🔬



