K-Modes Clustering Algorithm: Mathematical & Scratch Implementation

k-modes clustering

Welcome to the new tutorial session on “Machine Learning From Scratch“. In our previous article, we had covered about K-means algorithm. In this tutorial, we will cover the K-modes clustering algorithm from Scratch with a mathematical implementation example.

K-MODES CLUSTERING ALGORITHM

Before entering the tutorial on k-modes, let’s revisit the k-means clustering algorithm. K- means is an unsupervised partitional clustering algorithm that is based on grouping data into k – numbers of clusters by determining centroid using the Euclidean or Manhattan method for distance calculation. It updates or calculates the new centroid by calculating the mean of data points in that cluster. We repeat the process for a number of iterations till successive iterations cluster data items into the same group.

Similar to K-means, k-modes clustering is also an unsupervised algorithm used for clustering categorical variables. This algorithm groups data points in clusters based on the number of matching categories between centroid data points and observation data points. Unmatched data points are counted as +1 and similar data points are counted as zero. Centroids are updated based on the maximum frequency of categorical value i.e mode.

THEORY ?? WE PREFER EXAMPLE

Let’s kick off K-Means Clustering Scratch with a simple example. Suppose we have categorical data which should be clustered in 2 groups. Let’s suppose, on column 1 (C1) we have categories (M, N, O), on column 2 (C2) we have categories (P, Q, R), and on column 3 (C3) we have categories (W, X, Y, Z). Categorical distribution in the dataset is shown in the table below:

S.N.C1C2C3D1D2
1MQX
2MQZ
3NPZ
4OQW
5ORY
6NPX
7NPW
8NRY
9MOX
10NPY

Step 1: It is already defined that k = 2 for this problem

Step-2:  Since k = 2, we are randomly selecting two centroid as S.N = 2 & 6 i.e (M,Q,Z) & (N,P,X).

Step 3: In k-means, we calculate the distance of each point to each centroid using the euclidean distance calculation method (see here). But it’s different in the case of k-modes. We compare each data points to centroid to each observation data point. If any elements aren’t equal, we add one (+1) in each unmatching case. But, if the elements aren’t equal, we add zero i.e we do nothing. 

S.N.C1C2C3D1
1MQX0+0+1 = 1
Centroid 1MQZ

Here, we are comparing the first data points with centroid 1. Since the third column isn’t matching, the total distance point is 1.

S.N.C1C2C3D2
1MQX1+1+0 = 2
Centroid 2NPX

Here, we are comparing the first data points with centroid 2. Since the first & the second column aren’t matching, the total distance point is 2.

Similarly, we have calculated and completed the data points table. Data points are assigned to the closest centroid with a minimum difference value.

S.N.C1C2C3D1(M,Q,Z)D2(N,P,X)
1MQX12
2MQZ03
3NPZ21
4OQW23
5ORY33
6NPX30
7NPW31
8NRY32
9MOX22
10NPY31

From the table, we can see that data points 1, 2, 4, & 5 are assigned to cluster 1st, and the rest are assigned to the second cluster. 

Step 4:- Here comes the actual use of modes in the algorithm. Maximum categorical occurrence (mode) is assigned to the updated centroid elements of each cluster.

S.N.C1C2C3
1MQX
2MQZ
4OQW
5ORY
9MOX
Centroid 1(updated)MQX
3NPZ
6NPX
7NPW
8NRY
10NPY
Centroid 2(updated)NPY

We have obtained the new cluster centroid as (M, Q, X) and (N, P, Y). Now we repeat the same process from step 3 till successive iterations cluster data items into the same group.

FINAL CLUSTER: K-MODES CLUSTERING

After repeating step 3 & 4 three times i.e after the 3rd Iteration we get the cluster as:-

S.N.C1C2C3Cluster
1MQX1
2MQZ1
3NPZ2
4OQW1
5ORY2
6NPX2
7NPW2
8NRY2
9MOX1
10NPY2
Clusters are finalized when cluster elements don’t change in successive iteration

CONCLUSION

The k-modes clustering algorithm is an extension of the k-means clustering algorithm. The k-means algorithm is the most widely used centre based partitional clustering algorithm. Huang extends the k-means clustering algorithm to k-modes clustering algorithm to group the categorical data. The modifications done in the k-means are

  • (i) using a simple matching dissimilarity measure for categorical objects,
  • (ii) replacing means of clusters by modes, and
  • (iii) using a frequency-based method to update the modes.

Do you know??

UDEMY Free Course are on available on Allcoursefree.com

About Diwas Pandey

Highly motivated, strong drive with excellent interpersonal, communication, and team-building skills. Motivated to learn, grow and excel in Data Science, Artificial Intelligence, SEO & Digital Marketing

View all posts by Diwas Pandey →

Leave a Reply

Your email address will not be published. Required fields are marked *