Welcome to the new tutorial session on "Machine Learning From Scratch." In our previous article, we covered the K-means algorithm. This tutorial covers the K-modes clustering algorithm from scratch, with a worked mathematical example.
K-MODES CLUSTERING ALGORITHM
Before entering the tutorial on k-modes, let's revisit the k-means clustering algorithm. K-means is an unsupervised partitional clustering algorithm that groups data into k clusters by assigning each point to its nearest centroid, using Euclidean or Manhattan distance. It then updates each centroid to the mean of the data points in that cluster. We repeat the process for several iterations, until successive iterations assign the data items to the same groups.
Like k-means, k-modes is an unsupervised clustering algorithm, but for categorical variables. It groups data points into clusters based on the number of matching categories between each observation and each centroid: every unmatched attribute adds one (+1) to the distance, and every matching attribute adds zero. Centroids are updated using the most frequent categorical value in each column, i.e., the mode.
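To make the matching rule concrete, here is a minimal Python sketch of that dissimilarity measure (the helper name `matching_dissimilarity` is our own choice):

```python
def matching_dissimilarity(a, b):
    """Simple matching dissimilarity between two categorical vectors:
    each mismatched attribute contributes 1, each match contributes 0."""
    return sum(x != y for x, y in zip(a, b))
```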
THEORY? WE PREFER AN EXAMPLE
Let's kick off k-modes clustering from scratch with a simple example. Suppose we have categorical data that should be clustered into two groups. In column 1 (C1), the categories are (M, N, O); in column 2 (C2), they are (O, P, Q, R); and in column 3 (C3), they are (W, X, Y, Z). The categorical distribution of the dataset is shown in the table below (D1 and D2 will hold the distances to the two centroids):
| S.N. | C1 | C2 | C3 | D1 | D2 |
|------|----|----|----|----|----|
| 1    | M  | Q  | X  |    |    |
| 2    | M  | Q  | Z  |    |    |
| 3    | N  | P  | Z  |    |    |
| 4    | O  | Q  | W  |    |    |
| 5    | O  | R  | Y  |    |    |
| 6    | N  | P  | X  |    |    |
| 7    | N  | P  | W  |    |    |
| 8    | N  | R  | Y  |    |    |
| 9    | M  | O  | X  |    |    |
| 10   | N  | P  | Y  |    |    |
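For the code sketches used throughout this tutorial, we can encode the table above as a plain Python list (the variable name `data` is our own choice):

```python
# The ten observations from the table above, one (C1, C2, C3) tuple each.
data = [
    ("M", "Q", "X"),  # 1
    ("M", "Q", "Z"),  # 2
    ("N", "P", "Z"),  # 3
    ("O", "Q", "W"),  # 4
    ("O", "R", "Y"),  # 5
    ("N", "P", "X"),  # 6
    ("N", "P", "W"),  # 7
    ("N", "R", "Y"),  # 8
    ("M", "O", "X"),  # 9
    ("N", "P", "Y"),  # 10
]
```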
Step 1: For this problem, it is already given that k = 2.
Step 2: Since k = 2, we randomly select two observations as the initial centroids: S.N. 2 and 6, i.e., (M, Q, Z) and (N, P, X).
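A minimal sketch of this initialisation step, hard-coding the same random picks as the walkthrough so the numbers below can be reproduced:

```python
import random

k = 2

# Step 2: pick k distinct observations at random as the initial centroids.
centroids = random.sample(data, k)

# To follow the walkthrough exactly, fix the same picks (S.N. 2 and 6):
centroids = [data[1], data[5]]  # (M, Q, Z) and (N, P, X)
```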
Step 3: In k-means, we calculate the distance of each data point to each centroid using the Euclidean distance (see the previous article). It's different in k-modes: we compare each data point with each centroid, element by element. Whenever two elements are not equal, we add one (+1); whenever they are equal, we add zero, i.e., we do nothing.
| S.N.       | C1 | C2 | C3 | D1        |
|------------|----|----|----|-----------|
| 1          | M  | Q  | X  | 0+0+1 = 1 |
| Centroid 1 | M  | Q  | Z  |           |
Here, we compare the first data point with centroid 1. Since only the third column doesn't match, the total distance is 1.
| S.N.       | C1 | C2 | C3 | D2        |
|------------|----|----|----|-----------|
| 1          | M  | Q  | X  | 1+1+0 = 2 |
| Centroid 2 | N  | P  | X  |           |
Here, we compare the first data point with centroid 2. Since the first and the second columns don't match, the total distance is 2.
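Using the `matching_dissimilarity` helper from earlier, the same two comparisons can be reproduced like this:

```python
point = data[0]  # observation 1: (M, Q, X)

d1 = matching_dissimilarity(point, centroids[0])  # vs (M, Q, Z)
d2 = matching_dissimilarity(point, centroids[1])  # vs (N, P, X)
print(d1, d2)  # 1 2
```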
Similarly, we calculate the distances for every data point and complete the table below. Each data point is then assigned to the centroid with the minimum distance.
| S.N. | C1 | C2 | C3 | D1 (M,Q,Z) | D2 (N,P,X) |
|------|----|----|----|------------|------------|
| 1    | M  | Q  | X  | 1          | 2          |
| 2    | M  | Q  | Z  | 0          | 3          |
| 3    | N  | P  | Z  | 2          | 1          |
| 4    | O  | Q  | W  | 2          | 3          |
| 5    | O  | R  | Y  | 3          | 3          |
| 6    | N  | P  | X  | 3          | 0          |
| 7    | N  | P  | W  | 3          | 1          |
| 8    | N  | R  | Y  | 3          | 2          |
| 9    | M  | O  | X  | 2          | 2          |
| 10   | N  | P  | Y  | 3          | 1          |
The table shows that data points 1, 2, 4, 5, & 9 are assigned to the first cluster, and the rest to the second. (Points 5 and 9 are equidistant from both centroids; such ties are broken in favor of the first cluster here.)
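A short sketch of this assignment step, which reproduces the table above (Python's `min`/`index` keeps the first minimum, matching the tie-breaking used for points 5 and 9):

```python
def assign_clusters(data, centroids):
    """Assign each observation to its nearest centroid by matching
    dissimilarity; ties go to the lower-indexed centroid."""
    labels = []
    for point in data:
        distances = [matching_dissimilarity(point, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

labels = assign_clusters(data, centroids)
print(labels)  # [0, 0, 1, 0, 0, 1, 1, 1, 0, 1] -> clusters 1 and 2
```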
Step 4: Here comes the actual use of modes in the algorithm. For each cluster, the most frequent category in each column (the mode) becomes the corresponding element of the updated centroid.
Cluster 1:

| S.N.                 | C1 | C2 | C3 |
|----------------------|----|----|----|
| 1                    | M  | Q  | X  |
| 2                    | M  | Q  | Z  |
| 4                    | O  | Q  | W  |
| 5                    | O  | R  | Y  |
| 9                    | M  | O  | X  |
| Centroid 1 (updated) | M  | Q  | X  |

Cluster 2:

| S.N.                 | C1 | C2 | C3 |
|----------------------|----|----|----|
| 3                    | N  | P  | Z  |
| 6                    | N  | P  | X  |
| 7                    | N  | P  | W  |
| 8                    | N  | R  | Y  |
| 10                   | N  | P  | Y  |
| Centroid 2 (updated) | N  | P  | Y  |
We obtain the new cluster centroids (M, Q, X) and (N, P, Y). We repeat the process from step 3 until successive iterations assign the data items to the same groups.
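Putting steps 3 and 4 together, a minimal convergence loop built on the helpers sketched above (the name `update_centroids` is our own) could look like this:

```python
from collections import Counter

def update_centroids(data, labels, k):
    """Step 4: each centroid element becomes the mode of its column
    within the cluster (Counter ties go to the first value counted)."""
    new_centroids = []
    for cluster in range(k):
        members = [p for p, lab in zip(data, labels) if lab == cluster]
        mode = tuple(Counter(column).most_common(1)[0][0]
                     for column in zip(*members))
        new_centroids.append(mode)
    return new_centroids

# Repeat steps 3 & 4 until successive iterations give the same clusters.
labels = assign_clusters(data, centroids)
while True:
    centroids = update_centroids(data, labels, k)
    new_labels = assign_clusters(data, centroids)
    if new_labels == labels:
        break
    labels = new_labels

print(centroids)  # converges to [('M', 'Q', 'X'), ('N', 'P', 'Y')]
print(labels)     # final cluster index for each observation
```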
FINAL CLUSTER: K-MODES CLUSTERING
After repeating steps 3 & 4 three times, i.e., after the 3rd iteration, we get the final clusters:
| S.N. | C1 | C2 | C3 | Cluster |
|------|----|----|----|---------|
| 1    | M  | Q  | X  | 1       |
| 2    | M  | Q  | Z  | 1       |
| 3    | N  | P  | Z  | 2       |
| 4    | O  | Q  | W  | 1       |
| 5    | O  | R  | Y  | 2       |
| 6    | N  | P  | X  | 2       |
| 7    | N  | P  | W  | 2       |
| 8    | N  | R  | Y  | 2       |
| 9    | M  | O  | X  | 1       |
| 10   | N  | P  | Y  | 2       |
CONCLUSION
The k-modes clustering algorithm is an extension of k-means, the most widely used center-based partitional clustering algorithm. Huang extended k-means to k-modes in order to cluster categorical data (a library-based sketch follows the list below). The modifications made to k-means are:
- (i) using a simple matching dissimilarity measure for categorical objects,
- (ii) replacing means of clusters by modes, and
- (iii) using a frequency-based method to update the modes.
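In practice, you rarely need to hand-roll this: the third-party `kmodes` Python package implements Huang's algorithm. A minimal sketch, assuming that package is installed (`pip install kmodes`) and reusing the `data` list from the worked example:

```python
import numpy as np
from kmodes.kmodes import KModes

X = np.array(data)  # the 10x3 categorical table from above

km = KModes(n_clusters=2, init="Huang", n_init=5, random_state=0)
labels = km.fit_predict(X)

print(labels)                 # cluster index for each observation
print(km.cluster_centroids_)  # the two mode vectors
```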
Key Points to Remember:
- 1. K-Modes is well-suited for datasets with categorical attributes.
- 2. It uses a different distance metric (simple matching distance) compared to K-Means.
- 3. Initialization can impact the final clustering results, and different strategies can be employed.
- 4. Like K-Means, K-Modes is sensitive to the choice of K, the number of clusters. You may need to experiment with different K values and use validation techniques such as the silhouette score or the elbow method to determine the optimal K (a minimal elbow sketch follows this list).
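For example, one simple way to compare K values with the `kmodes` package is its `cost_` attribute (the sum of distances from points to their assigned centroids); plot it against K and look for the elbow:

```python
import numpy as np
from kmodes.kmodes import KModes

X = np.array(data)  # the categorical table from the worked example

# Clustering cost for several K values; the K where the decrease
# levels off (the "elbow") is a reasonable choice.
costs = []
for n_clusters in range(1, 6):
    km = KModes(n_clusters=n_clusters, init="Huang", n_init=5, random_state=0)
    km.fit(X)
    costs.append(km.cost_)

print(costs)
```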
K-Modes has found applications in various domains, including market segmentation, customer profiling, text mining, and more, where categorical attributes are prevalent. However, it’s crucial to consider the impact of initialization, the choice of K, and the assessment of clustering quality when applying K-Modes in practice. Additionally, as with any clustering algorithm, the interpretability and meaningfulness of the clusters depend on the nature of the data and the problem at hand.
In summary, K-Modes is a clustering algorithm designed for categorical data: it iteratively assigns data points to clusters based on categorical attribute similarity and updates the cluster modes until convergence, producing distinct clusters of categorical data.