Scikit-learn Clustering Algorithms: A Comprehensive Comparison
In the realm of machine learning, clustering stands out as a powerful unsupervised learning technique. Unlike supervised learning methods like regression and classification, clustering algorithms operate on unlabeled data, seeking to uncover inherent patterns and group similar data points together. This makes clustering invaluable for exploratory data analysis, pattern recognition, image analysis, customer analytics, market segmentation, social network analysis, and various other applications.
Clustering is not a specific algorithm but a general task. It arranges objects so that those in the same group (cluster) are more similar to each other than to those in other groups. Data professionals often employ clustering during Exploratory Data Analysis to discover new information and patterns.
Understanding Clustering
Formally, clustering arranges a set of objects so that objects within the same group (a cluster) are more similar to one another than to objects in any other group.
Imagine you have 10 million customers and want to develop customized or focused marketing campaigns. You are unlikely to build 10 million separate campaigns, so what can you do? Clustering lets you group customers into a manageable number of segments and design one campaign per segment.
Applications Across Industries
Clustering is a very powerful technique and has broad applications in various industries ranging from media to healthcare, manufacturing to service industries, and anywhere you have large amounts of data.
Retail
There are many opportunities for clustering in retail businesses, for example at the category level. In the diagram below, we have eight stores, with different colors representing different clusters. Notice that the deodorants category in Store 1 belongs to the red cluster, whereas the same category in Store 2 belongs to the blue cluster.
Healthcare and Clinical Science
Healthcare and clinical science is another area full of impactful opportunities for clustering. One example is research published by Komaru & Yoshida et al., in which each cluster of patients was characterized by different conditions: cluster 1 contains patients with low WBC & CRP, cluster 2 patients with high BMP & serum, and cluster 3 patients with low serum.
Image Segmentation
Image segmentation is the partitioning of an image into different groups of pixels. Much research has been done on image segmentation using clustering. In the example below, the left-hand side shows the original image, and the right-hand side shows the result of the clustering algorithm.
The Challenge of Evaluating Clustering Performance
Clustering, unlike supervised learning use-cases such as classification or regression, cannot be completely automated end-to-end. Most importantly, because clustering is unsupervised learning and doesn’t use labeled data, we cannot calculate performance metrics like accuracy, AUC, RMSE, etc., to compare different algorithms or data preprocessing techniques.
Scikit-learn's Clustering Arsenal
Scikit-learn, a popular machine learning library in Python, implements more than ten unsupervised clustering algorithms.
In the diagram below, each column represents the output of a different clustering algorithm, such as KMeans, Affinity Propagation, and MeanShift. Some algorithms yield the same output, but if you compare the output of KMeans with that of MeanShift, you will notice the two algorithms produce different results. Unfortunately (or fortunately), there is no right or wrong answer in clustering.
Key Clustering Algorithms in Scikit-learn
Let’s focus on the output of these algorithms.
K-Means Clustering
The K-Means clustering algorithm is easily the most popular and widely used algorithm for clustering tasks, primarily because of its intuitiveness and ease of implementation.
K-Means is an unsupervised machine learning algorithm used to cluster data that has no labels assigned to it. In K-Means, clusters are defined by centroids, and each data point is assigned to its nearest centroid. A common way to choose the number of clusters is the elbow method: the within-cluster sum of squares (WCSS) keeps decreasing as the number of clusters increases, so plotting WCSS against the number of clusters produces an elbow shape, and the bend of the elbow suggests a good value for K.
K-Means is an iterative algorithm that creates non-overlapping clusters, meaning each instance in your dataset can belong to exactly one cluster. The easiest way to build intuition for K-Means is to follow its steps along with the example diagram below: initialize the centroids randomly based on the chosen number of clusters, assign each point to its nearest centroid, recompute each centroid as the mean of the points assigned to it, and repeat. Iteration continues until the centroid means stop changing or until max_iter, the maximum number of iterations defined by the user during training, is reached.
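The steps above, together with the elbow method, can be sketched in scikit-learn. The dataset, parameter values, and cluster range below are illustrative assumptions, not from any real project:

```python
# Illustrative sketch: fit K-Means and record WCSS (inertia_) for a range
# of cluster counts; the bend in the WCSS curve suggests a good K.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for a real dataset
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, max_iter=300, random_state=42).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this k
```

Plotting `wcss` against `range(1, 9)` (for example with matplotlib's `plt.plot`) produces the elbow curve described above.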
Advantages of K-Means:
- Simplicity and ease of implementation
- Efficiency in terms of computational cost
Disadvantages of K-Means:
- Requires pre-defining the number of clusters (K)
- Sensitive to initial centroid placement, potentially leading to different results on different runs
- Assumes clusters are spherical and equally sized, which may not be true for all datasets
- Highly affected by outliers, which can pull centroids away from the true cluster centers
Applying K-Means:
Once K-Means has created clusters, a new data point can be assigned to one of them with the predict function, which maps the point to its nearest existing centroid without refitting the model.
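A minimal sketch of this behavior, using made-up points chosen only so the two groups are obvious:

```python
# Sketch: predict() assigns a new point to its nearest existing centroid
# without refitting the model. The data values are illustrative.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

new_point = np.array([[8.2, 8.4]])
label = km.predict(new_point)[0]  # lands in the same cluster as the nearby points
```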
Mean Shift Clustering
Unlike K-Means, the MeanShift algorithm does not require specifying the number of clusters. MeanShift is also centroid-based and is built on kernel density estimation: candidate centroids are initialized, each one is iteratively shifted toward the region of space where the most points lie, i.e., toward the nearest mode of the estimated density, and each data point is assigned to the closest resulting centroid. This is why MeanShift is also known as the mode-seeking algorithm.
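As a sketch, MeanShift can be run without choosing a cluster count up front; the bandwidth estimation and synthetic data below are illustrative assumptions:

```python
# Sketch: MeanShift discovers the number of clusters on its own; the kernel
# bandwidth is estimated from the data instead of being hand-picked.
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=42)
ms = MeanShift(bandwidth=bandwidth).fit(X)

n_clusters = len(ms.cluster_centers_)  # inferred by the algorithm, not specified
```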
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is an unsupervised machine learning algorithm that does not require pre-specifying the number of clusters to form. It starts from a distance matrix containing the distances between neighboring points. Each data point is surrounded by a circle with a radius of epsilon, and DBSCAN labels it as a core point, a border point, or a noise point: a point is a core point if at least the minimum required number of points falls within its epsilon radius, a border point if it has fewer than the minimum but lies within epsilon of a core point, and a noise point if no core point lies within an epsilon radius of it.
Advantages of DBSCAN:
- Does not require specifying the number of clusters beforehand
- Effective at identifying clusters of arbitrary shapes
- Robust to outliers
Disadvantages of DBSCAN:
- Sensitive to parameter settings (epsilon and minimum number of points)
- May struggle with clusters of varying densities
Comparison with K-Means:
We cannot assign a new point to clusters created by DBSCAN, because adding a point would require recalculating the clustering from scratch; hence the predict function is not available for DBSCAN, and fit_predict is used on the full dataset instead. The diagram below shows the clusters formed on the iris data.
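A minimal sketch on the iris data; the eps and min_samples values here are illustrative choices, not tuned ones:

```python
# Sketch: DBSCAN on the iris measurements. Note there is no predict():
# labels come from fitting the full dataset.
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X = load_iris().data
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# -1 marks noise points; the remaining values are cluster indices
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```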
Hierarchical Clustering
Hierarchical clustering is a method of clustering that builds a hierarchy of clusters. It is a very common choice for analyzing data from social networks: the nodes (branches) in the graph are compared to each other according to the degree of similarity between them. The biggest advantage of hierarchical clustering is that it is easy to understand and implement. The output of this clustering method is usually analyzed as a dendrogram image such as the one below.
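A short sketch of building and cutting such a hierarchy; the synthetic data and Ward linkage are assumptions for illustration:

```python
# Sketch: agglomerative (hierarchical) clustering. SciPy's linkage() builds
# the merge hierarchy that dendrogram images are drawn from.
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

agg = AgglomerativeClustering(n_clusters=3).fit(X)  # flat cut of the tree
Z = linkage(X, method="ward")  # full hierarchy: 50 - 1 = 49 merges
# scipy.cluster.hierarchy.dendrogram(Z) would render the tree with matplotlib
```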
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies. It is used on very large datasets where K-Means cannot practically scale. BIRCH condenses large data into small subclusters while trying to retain as much information as possible, and it is often used to supplement other clustering algorithms by generating a summary that those algorithms can then work from. One of its benefits is that it can progressively and dynamically cluster multi-dimensional data points, producing the highest-quality clusters possible under given memory and time constraints.
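A hedged sketch of that workflow, with BIRCH summarizing the data and a final clustering step merging the subclusters; the threshold and cluster count below are illustrative:

```python
# Sketch: Birch builds a compact CF-tree summary of the data; with
# n_clusters set, a final clustering step merges the subclusters.
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=3, cluster_std=0.7, random_state=7)

birch = Birch(threshold=0.5, n_clusters=3).fit(X)
labels = birch.predict(X)  # Birch also supports partial_fit for streaming data
```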
Considerations for Choosing a Clustering Algorithm
Clustering is a very useful machine learning technique, but it is not as straightforward as some of the supervised learning use-cases like classification and regression.
Accuracy and Performance
One of the key takeaways from this comparison is the importance of selecting an algorithm that not only fits the data well but also offers stability and intuitive parameter settings. For instance, while K-Means is known for its simplicity and speed, it sometimes struggles with noisy data points.
Parameter Sensitivity
Each algorithm differs in how sensitive it is to its parameters: K-Means requires choosing the number of clusters K up front, DBSCAN depends heavily on epsilon and the minimum number of points, and MeanShift depends on the kernel bandwidth. Intuitive, easy-to-tune parameters make an algorithm far easier to apply in practice.
Stability of Clusters
Stability matters as well: K-Means can produce different clusters on different runs because of its random centroid initialization, whereas density-based methods such as DBSCAN give the same result for a given parameter setting. Preferring stable algorithms, or fixing random seeds, makes results easier to reproduce and compare.
tags: #scikit-learn #clustering #algorithms #comparison

