An intrusion detection system (IDS) is a software application that monitors a network or systems for malicious activity or policy violations. A host-based intrusion detection system (HIDS) is an IDS capable of monitoring and analyzing the internals of a computing system as well as the network packets on its network interfaces. A HIDS analyzes the traffic to and from the specific computer on which the intrusion detection software is installed. A host-based system can also monitor key system files and detect any attempt to overwrite them.

Over the years, technology has made great progress. Today the internet is a major source of entertainment, work, classes, communication and much more. This comes at a cost, because our data can be compromised in the process, which raises the question of how secure our data is and how it is being used. Traditional security systems can no longer reliably detect intrusions because intrusion behavior has become complex. Data mining is the process of finding important information in a large dataset, and it can be combined with machine learning techniques to build an efficient model. In this project we use the NSL KDD dataset. We apply several classifiers to the NSL KDD dataset and compare them; the classifier with the greatest accuracy is considered the best.

Dataset Description

Various drawbacks of the KDD CUP 99 dataset, which were the main cause of decreased performance in various IDS [7], led to the creation of the NSL KDD dataset. NSL KDD is a refined version of, and the successor to, the KDD CUP dataset. It contains all the needed attributes from the KDD CUP dataset. It is openly available and can be downloaded easily [2]. The advantage of this dataset is that redundant records have been removed and a sufficient number of records is available for both training and testing. It consists of 41 attributes, which are classified as Nominal, Binary or Numeric.

One more attribute, the class, is added as the 42nd attribute. There are two classes, Normal and Anomaly. The Anomaly class can be further divided into DOS, PROBE, R2L and U2R. For this experiment only the two classes Normal and Anomaly are considered.
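The two-class setup can be sketched in a few lines. This is a minimal illustration, assuming each record is a comma-separated line with 41 attribute values followed by the class label as the 42nd field; the function names, the placeholder feature values and the example attack label `neptune` are chosen here for illustration and are not taken from the project code.

```python
# Sketch: reduce NSL KDD style records to the two classes Normal and Anomaly.
# Assumption: 41 comma-separated attribute values, then the class label.

NUM_FEATURES = 41

def parse_record(line):
    """Split one comma-separated record into (features, label)."""
    fields = line.strip().split(",")
    if len(fields) < NUM_FEATURES + 1:
        raise ValueError("expected at least 42 comma-separated fields")
    features = fields[:NUM_FEATURES]
    label = fields[NUM_FEATURES]          # the 42nd attribute (class)
    return features, label

def to_binary_class(label):
    """Treat every label other than 'normal' as the Anomaly class."""
    return "Normal" if label.strip().lower() == "normal" else "Anomaly"

# Hypothetical records built from placeholder feature values.
normal_line = ",".join(["0"] * NUM_FEATURES + ["normal"])
attack_line = ",".join(["0"] * NUM_FEATURES + ["neptune"])

for line in (normal_line, attack_line):
    features, label = parse_record(line)
    print(len(features), label, "->", to_binary_class(label))
```

Collapsing all attack labels into one Anomaly class keeps the task binary, which matches the experiment described above.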


The objective of this project is to compare and analyze the accuracy of different algorithms for intrusion detection.


In 2015, the paper Computer Network Security and Technology Research was published at the Seventh International Conference on Measuring Technology and Mechatronics Automation. It describes the main network security technologies in detail, including authentication, data encryption, firewalls, intrusion detection systems (IDS), antivirus technology and virtual private networks (VPN). Network security concerns every network user, so we should place a high value on it, try to prevent hostile attacks and keep the network secure [1].

In 2017, the research article Different Type Network Security Threats and Solutions, A Review was published. It illustrates a few existing secure routing protocols to identify how to deal with a malicious node in the network and find a secure data path [2].

In 2016, the research article Effect of Genetic Algorithm on Artificial Neural Network for Intrusion Detection System was published. It investigated attack detection in an intrusion detection system and the effect of a GA on the ANN result: an artificial neural network (ANN) was used for recognition and a genetic algorithm (GA) was used to optimize the ANN result [3].

A research paper, Unknown Network Detection in Packet Sniffing, has been published using a machine learning method. It worked toward a solution for secure network traffic detection and monitoring. The NIMA and MAWI datasets were used to analyze network traffic with machine learning classifiers such as SVM, Naive Bayes and many more [4].

In 2016, the article Predicting Unlabeled Traffic for Intrusion Detection Using Semi-Supervised Machine Learning was published. In this paper, a semi-supervised machine learning technique was used for intrusion detection on both labeled and unlabeled data. A machine learning tool with a semi-supervised classifier was used to build the model, which was then integrated into Pentaho, where it provided the expected output with the help of Weka Scoring [5].

These papers performed intrusion detection to ensure network security, but each used only one type of classifier or algorithm for detection. As a result, we cannot be sure which classifier should be used, or which is most beneficial, for a given dataset. In our project, we compare and analyze the accuracy of different machine learning algorithms for intrusion detection. The algorithm with the highest accuracy can then be used for intrusion detection. In this way, we can place a high value on host and network security.
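The comparison described here comes down to scoring each classifier's predictions against the true labels and keeping the highest score. A minimal sketch, assuming accuracy means the fraction of correctly classified records; the classifier names and label lists are made-up placeholders and are not results from this project.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted class equals the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def best_classifier(y_true, predictions):
    """Return (name of the most accurate classifier, all accuracy scores)."""
    scores = {name: accuracy(y_true, y_pred)
              for name, y_pred in predictions.items()}
    return max(scores, key=scores.get), scores

# Made-up labels and predictions for illustration only, NOT project results.
y_true = ["Normal", "Anomaly", "Normal", "Anomaly"]
predictions = {
    "classifier_a": ["Normal", "Anomaly", "Normal", "Normal"],
    "classifier_b": ["Normal", "Anomaly", "Normal", "Anomaly"],
}

winner, scores = best_classifier(y_true, predictions)
print(winner, scores)
```

Accuracy is a reasonable single number when the two classes are roughly balanced; on heavily skewed data a per-class measure would be worth reporting alongside it.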

Software Requirement

1. Python

Python is an interpreted, high-level, general-purpose programming language. Its simple and clean structure, modular design and extensive libraries make it well suited to security applications. Cyber security experts rely on the ability to code programs quickly and to implement new strategies, techniques and features.

2. Java

Java is a general-purpose programming language that is concurrent, class-based, object-oriented and specifically designed to have as few implementation dependencies as possible. It is intended to let application developers “write once, run anywhere” (WORA), meaning that compiled Java code can run on all platforms that support Java without the need for recompilation. Java applications are typically compiled to bytecode that can run on any Java virtual machine (JVM) regardless of computer architecture.

Hardware Requirement

Our project uses machine learning to classify threats, so it does not require any specific hardware. We only need a computer for software development, which is the only hardware used.



1 Logistic Regression

Logistic Regression is a machine learning classification algorithm used to predict the probability of a categorical dependent variable. The dependent variable in logistic regression is a binary variable coded as either 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X. Logistic Regression is used when the dependent variable (target) is categorical.

For example, the target may be whether an email is spam (1) or not (0), or whether a tumor is malignant (1) or not (0).

Consider a scenario where we need to classify whether a tumor is malignant or not. If we use linear regression for this problem, we need to set a threshold on which the classification is based. Say the actual class is malignant, the predicted continuous value is 0.4 and the threshold value is 0.5: the data point will be classified as not malignant, which can lead to serious consequences in real life.

From this example, it can be inferred that linear regression is not suitable for classification problems. The output of linear regression is unbounded, and this brings logistic regression into the picture: its output always lies strictly between 0 and 1.
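The bounded output comes from the logistic (sigmoid) function, which squashes any real number into the interval between 0 and 1. A small sketch of that behavior follows; the function names and the sample inputs are chosen for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(probability, threshold=0.5):
    """Turn a predicted probability into a class label (1 or 0)."""
    return 1 if probability >= threshold else 0

# However large or small the raw score z is, the output stays between 0 and 1.
for z in (-6.0, -1.0, 0.0, 1.0, 6.0):
    p = sigmoid(z)
    print(f"z={z:+.1f}  sigmoid={p:.4f}  class={classify(p)}")
```

Because the output can be read as a probability, the 0.5 threshold has a clear meaning, unlike a threshold applied to an unbounded linear prediction.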


  1. Initialize $a = [1, \ldots, 1]^T$
  2. Perform feature scaling on the examples' attributes
  3. Repeat until convergence:
  4. for each $j = 0, \ldots, n$: $a_j' = a_j + \alpha \sum_i \left(y^{(i)} - h_a(x^{(i)})\right) x_j^{(i)}$
  5. for each $j = 0, \ldots, n$: $a_j = a_j'$
  6. The output is $a$.
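The six steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: min-max scaling is assumed for step 2, convergence is taken to mean that the largest coefficient change falls below `tol`, and the learning rate, iteration cap and one-attribute toy data are made up for the example.

```python
import math

def sigmoid(z):
    """Logistic function, computed in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def scale_features(rows):
    """Step 2: min-max scale every attribute (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [[(v - lo) / span for v, lo, span in zip(row, lows, spans)]
            for row in rows]

def hypothesis(a, x):
    """h_a(x): predicted probability that x belongs to class 1."""
    return sigmoid(sum(a_j * x_j for a_j, x_j in zip(a, x)))

def train(rows, labels, alpha=0.5, max_iters=5000, tol=1e-6):
    X = [[1.0] + list(row) for row in rows]   # x_0 = 1 pairs with a_0
    a = [1.0] * len(X[0])                     # step 1: a = [1, ..., 1]
    for _ in range(max_iters):                # step 3: repeat
        errors = [y - hypothesis(a, x) for x, y in zip(X, labels)]
        # Step 4: compute every a_j' from the current a.
        new_a = [a_j + alpha * sum(err * x[j] for err, x in zip(errors, X))
                 for j, a_j in enumerate(a)]
        step = max(abs(n - o) for n, o in zip(new_a, a))
        a = new_a                             # step 5: a_j = a_j'
        if step < tol:                        # assumed convergence test
            break
    return a                                  # step 6

def predict(a, row, threshold=0.5):
    return 1 if hypothesis(a, [1.0] + list(row)) >= threshold else 0

# Hypothetical toy data: one attribute, low values are class 0, high values class 1.
raw = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
labels = [0, 0, 0, 1, 1, 1]
scaled = scale_features(raw)
a = train(scaled, labels)
print([predict(a, row) for row in scaled])
```

Steps 4 and 5 are kept separate so that every coefficient is updated from the same old values of `a`, which is what the primed notation in the list expresses.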
2 Naive Bayes Algorithm

Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values. The algorithm is popular because of its robustness to noise, outliers and irrelevant attributes, and because missing values are easily handled. Before going into the Naïve Bayes theorem, we have to understand conditional probability and Bayes theorem. Conditional probability is a measure of the probability of an event given that another event has occurred. If the event of interest is ‘A’ and the event ‘B’ is known or assumed to have occurred, “the conditional probability of ‘A’ given ‘B’”, or “the probability of ‘A’ under the condition ‘B’”, is usually written as P(A|B).

For example, if we pick a card from a deck, we can ask for the probability of getting a king given that the card is a heart. We already have the condition that the card is a heart, so the denominator is 13 and not 52. Since there is only one king of hearts, the probability that the card is a king given that it is a heart is 1/13 ≈ 0.077. Mathematically, the conditional probability of A given B can be computed as:

$P(A|B) = \frac{P(A \cap B)}{P(B)}$ …. (1)

$P(B|A) = \frac{P(A \cap B)}{P(A)}$ …. (2)
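The card example can be checked against equation (1) by enumerating a standard 52-card deck. The sketch below uses exact fractions; the helper names are chosen for illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = list(product(RANKS, SUITS))        # 52 (rank, suit) pairs

def probability(event, given=None):
    """P(event), or P(event | given) when a condition is supplied."""
    space = [card for card in DECK if given is None or given(card)]
    hits = [card for card in space if event(card)]
    return Fraction(len(hits), len(space))

def is_king(card):
    return card[0] == "K"

def is_heart(card):
    return card[1] == "hearts"

def is_king_of_hearts(card):
    return is_king(card) and is_heart(card)

# Direct count: restrict the deck to the 13 hearts, then count kings.
direct = probability(is_king, given=is_heart)

# Equation (1): P(A|B) = P(A and B) / P(B), with A = king and B = heart.
via_formula = probability(is_king_of_hearts) / probability(is_heart)

print(direct, via_formula, float(direct))
```

Both routes give 1/13, which is the value quoted in the example.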

The Bayes theorem used for Naïve Bayes is derived from the equations given above. Bayes theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. The Bayes rule is a way of going from P(A|B), known from the training dataset, to P(B|A).
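Equations (1) and (2) share the term for the joint probability of A and B, and eliminating it gives Bayes theorem. The standard derivation, written out as a short sketch:

```latex
\begin{align*}
P(A \cap B) &= P(A \mid B)\,P(B) && \text{from (1)} \\
P(A \cap B) &= P(B \mid A)\,P(A) && \text{from (2)} \\
\Rightarrow\quad P(B \mid A) &= \frac{P(A \mid B)\,P(B)}{P(A)}
\end{align*}
```

With the card example, taking A = king and B = heart gives P(A|B) = 1/13, P(B) = 1/4 and P(A) = 1/13, so P(B|A) = (1/13 × 1/4) / (1/13) = 1/4: given that the card is a king, it is a heart with probability 1/4.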


  1. Scan the data set.
  2. Calculate the probability of each attribute value, using the quantities n, n_c, m and p.
  3. Apply the formula $P(a_i | v_j) = \frac{n_c + m p}{n + m}$ for attribute value $a_i$ and subject (class) value $v_j$, where n = the number of training examples for which v = v_j, n_c = the number of examples for which v = v_j and a = a_i, p = the a priori estimate for P(a_i | v_j), and m = the equivalent sample size.
  4. For each class, multiply the probabilities of the attribute values together with the prior probability of that class.
  5. Compare the resulting values and classify the record into the predefined class with the highest value.
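The steps above can be sketched for categorical attributes as follows. This is an illustration under stated assumptions: the a priori estimate p is taken to be uniform over the distinct values of each attribute, m = 3 is an arbitrary choice, and the two-attribute toy records (a protocol and a connection flag) are made up for the example.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Step 1: scan the data set and collect the counts n and n_c."""
    class_counts = Counter(labels)        # n for every class v_j
    value_counts = defaultdict(Counter)   # (attribute index, class) -> n_c per value
    distinct = defaultdict(set)           # attribute index -> values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            distinct[i].add(value)
    return class_counts, value_counts, distinct

def m_estimate(n_c, n, m, p):
    """Step 3: P(a_i | v_j) = (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

def predict(model, row, m=3):
    class_counts, value_counts, distinct = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total                             # class prior P(v_j)
        for i, value in enumerate(row):
            p = 1.0 / len(distinct[i])                # assumed uniform prior estimate
            n_c = value_counts[(i, label)][value]     # 0 if never seen with this class
            score *= m_estimate(n_c, n, m, p)         # step 4: multiply
        scores[label] = score
    return max(scores, key=scores.get), scores        # step 5: compare

# Hypothetical toy records: (protocol, connection flag).
rows = [("tcp", "SF"), ("tcp", "SF"), ("udp", "SF"),
        ("tcp", "S0"), ("tcp", "S0"), ("icmp", "SF")]
labels = ["Normal", "Normal", "Normal", "Anomaly", "Anomaly", "Anomaly"]

model = train(rows, labels)
print(predict(model, ("tcp", "S0")))
```

The m-estimate keeps a probability from collapsing to zero when an attribute value never occurs with a class in the training data, which would otherwise wipe out the whole product in step 4.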
3 Decision Trees

A decision tree is a tree where each node represents a feature (attribute), each link (branch) represents a decision (rule) and each leaf represents an outcome (a categorical or continuous value). The whole idea is to create such a tree for the entire data set and arrive at a single outcome at every leaf (or minimize the error in every leaf).


  1. Take a training dataset.
  2. Determine the attribute that best classifies the training data and use it at the root of the tree. Repeat this process for each branch. The best attribute is found as follows:
  3. Compute the entropy of the data set.
  4. For every attribute/feature:
  5. calculate the entropy for each of its categorical values,
  6. take the average information entropy for the current attribute, weighting each value by its share of the records,
  7. calculate the gain for the current attribute.
  8. Pick the attribute with the highest gain.
  9. Repeat until we get the desired tree.
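The entropy and gain steps can be sketched as below, in the style of the ID3 procedure. This is an illustration on made-up categorical records, assuming entropy is measured in bits (log base 2) and that a leaf is created when a branch holds a single class or no attributes remain; the function and attribute names are chosen for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Step 3: entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Steps 5-7: data set entropy minus the weighted average entropy per value."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    average = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - average

def best_attribute(rows, labels, attributes):
    """Step 8: pick the attribute with the highest gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

def build_tree(rows, labels, attributes):
    """Step 9: repeat on every branch until the leaves are pure."""
    if len(set(labels)) == 1:
        return labels[0]                                  # pure leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]       # majority leaf
    best = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        branches[value] = build_tree([r for r, _ in subset],
                                     [l for _, l in subset], remaining)
    return {"attribute": best, "branches": branches}

# Hypothetical toy records with two categorical attributes.
rows = [{"protocol": "tcp", "flag": "SF"}, {"protocol": "tcp", "flag": "SF"},
        {"protocol": "udp", "flag": "SF"}, {"protocol": "tcp", "flag": "S0"},
        {"protocol": "tcp", "flag": "S0"}, {"protocol": "icmp", "flag": "SF"}]
labels = ["Normal", "Normal", "Normal", "Anomaly", "Anomaly", "Anomaly"]

tree = build_tree(rows, labels, ["protocol", "flag"])
print(tree)
```

On this toy data the `flag` attribute has the higher gain, so it becomes the root, and the remaining attribute is used only on the branch that is still mixed.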
4 KNN Algorithm

In KNN (K-nearest neighbors), K is the number of neighbors used to classify a new data point. As an analogy, consider building a friend circle in a new neighborhood: we select the 3 neighbors who are closest to us in thinking or hobbies to be our close friends. In this case K is 3. KNN takes the K nearest neighbors to decide which class the new data point will belong to. This decision is based on feature similarity. The choice of K has a drastic impact on the results we obtain from KNN.

We can take the test set and plot the accuracy rate or F1 score against different values of K. We see a high error rate on the test set when K=1, so we can conclude that the model overfits when K=1. For high values of K, the F1 score starts to drop. The test set reaches its minimum error rate when K=5. This is very similar to the elbow method used in K-means: the value of K at the elbow of the test error curve gives us the optimal value of K.


  1. Choose a value for K.
  2. Find the distance from the new point to each of the training data points using the Euclidean distance formula. The Euclidean distance is the square root of the sum of squared differences between the coordinates of two points. It is also known as the L2 norm.
  3. Find the K nearest neighbors of the new data point.
  4. For classification, count the number of data points in each category among the K neighbors. The new data point will belong to the class that has the most neighbors.
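These steps translate almost directly into code. A minimal sketch using only the standard library; the two-dimensional toy points and the function names are made up for illustration.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences (L2 norm)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_points, train_labels, new_point, k=3):
    # Distance from the new point to every training point.
    distances = sorted((euclidean(p, new_point), label)
                       for p, label in zip(train_points, train_labels))
    # Keep the K nearest neighbors and take a majority vote over their classes.
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Hypothetical toy data: two well separated groups of points.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
train_labels = ["Normal", "Normal", "Normal",
                "Anomaly", "Anomaly", "Anomaly"]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))
print(knn_predict(train_points, train_labels, (4.8, 5.0), k=3))
```

An odd K avoids tied votes in a two-class problem. Because the distance mixes all attributes, features on very different scales should be scaled before this is applied to real records.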


In this project, we have used four classifiers, namely Logistic Regression, Naïve Bayes, Decision Tree and the KNN algorithm. After coding them, we measured the model accuracy of each classifier, which is shown in the table below. We found that Decision Tree achieved the highest accuracy, 99.49%.


Among the four algorithms used, the Decision Tree algorithm has the highest accuracy, which suggests that it is the most effective of the tested classifiers for detecting network intrusions on this dataset. Hence, the decision tree algorithm should be used for intrusion detection.


  1. Nanchang Jiangxi “Seventh International Conference on Measuring Technology and Mechatronics Automation, Computer Network Security and Technology Research” 2016
  2. Shilpa Pareek “Different Type Network Security Threats and Solutions, A
  3. Amind Dastanpour “Energy Sector Cybersecurity Framework Implementation Guidance.” Energy Sector Cybersecurity Framework Implementation Guidance. Office of Electricity Delivery & Energy Reliability. November 2016.
  4. Annu Ailawalthi “Has the number of successful cyber-attacks your organization has experienced increased in the past 12 months?” November 2016.
  5. Anku Jaiswal “Predicting Unlabeled Traffic for Intrusion Detection Using Semi-Supervised Machine Learning”. November 2016.
  6. “The Rising Tide of Cybersecurity Concern.” Blackhat USA 2016. July 2016. November 2016.
  7. “News Recap: Energy Industry Concerned about Cyber Security.” CSID. 15

This page is contributed by NABITA & her team.
