Using clustering algorithms such as K-means is one of the most popular starting points for machine learning. K-means clustering is an unsupervised machine learning technique that sorts similar data into groups, or clusters. Observations within a given cluster are more similar to one another than they are to observations outside the cluster.
The K in K-means represents the user-defined number of clusters, k. K-means clustering works by searching for the best k centroid positions within the data, such that each data point is closer to its own cluster's centroid than to any other centroid. Ideally, the resulting clusters maximize similarity amongst the data within each unique cluster.
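To make the centroid idea concrete, here is a minimal from-scratch sketch of the two alternating steps the algorithm repeats (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is illustrative only and ignores edge cases such as empty clusters; the rest of this guide uses scikit-learn's KMeans instead.
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    # X is an (n_samples, n_features) array; k is the number of clusters.
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct points from the data:
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # Centroids stopped moving - converged.
        centroids = new_centroids
    return labels, centroids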
Note that various methods for clustering exist; this article will focus on one of the most popular techniques: K-means.
This guide consists of two parts:
- A K-means clustering introduction using generated data.
- An application of K-means clustering to an automotive dataset.
Code:
All code is available at the GitHub page linked here. Feel free to download the notebook (click CODE and Download Zip) and run it alongside this article!
For this guide, we will use the following imports from the scikit-learn library [1]:
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.datasets import make_blobs
To demonstrate K-means clustering, we first need data. Conveniently, the sklearn library includes the ability to generate data blobs [2]. The code is rather simple:
# Generate sample data:
X, y = make_blobs(n_samples=150,
                  centers=3,
                  cluster_std=0.45,
                  random_state=0)
The parameters of the make_blobs() function allow the user to specify the number of centers (which could correlate to potential cluster centroids) and how messy the “blobs” are (cluster_std, which adjusts the cluster’s standard deviation). The above code generates the blobs; the below code gets it into a dataframe and a plotly scatter plot:
# Import required libraries:
import plotly.express as px
import pandas as pd

# Convert to dataframe:
dfBlobs = pd.DataFrame(X, columns = ['X','Y'])
# Plot data:
plot = px.scatter(dfBlobs, x="X", y="Y")
plot.update_layout(
    title={'text': "Randomly Generated Data",
           'xanchor': 'center',
           'yanchor': 'top',
           'x': 0.5})
plot.show()
Here’s the output:
Due to picking a low “cluster_std” value in the make_blobs() function, the resulting graph has three very clearly defined data blobs that should be easy work for a K-means clustering algorithm.
How Many Clusters?
The K in K-means is the number of clusters, a user-defined figure. For a given dataset, there is typically an optimal number of clusters. In the generated data seen above, it’s probably three.
To mathematically determine the optimal number of clusters, use the "Elbow Method." This method calculates the within-cluster sum of squares (WCSS) for a range of values of k, with lower values generally being better. The WCSS is the sum of the squared distances of each data point from its cluster's centroid. The Elbow Method plots the WCSS against the number of clusters; eventually an "elbow" appears, where adding further clusters produces only small reductions in WCSS. That elbow marks the optimal number of clusters.
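For reference, the quantity being minimized (scikit-learn exposes it as the inertia_ attribute of a fitted KMeans model) can be written as
$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$
where C_j is the set of points assigned to cluster j and μ_j is that cluster's centroid.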
The following code generates an Elbow Chart for the above data:
# Determine optimal number of clusters:
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, max_iter=5000, random_state=42)
    kmeans.fit(dfBlobs)
    wcss.append(kmeans.inertia_)

# Prepare data for visualization:
wcss = pd.DataFrame(wcss, columns=['Value'])
wcss.index += 1
When plotted, this yields:
# Plot the elbow curve:
plot = px.line(wcss, y="Value")
plot.update_layout(
    title={'text': "Within Cluster Sum of Squares or 'Elbow Chart'",
           'xanchor': 'center',
           'yanchor': 'top',
           'x': 0.5},
    xaxis_title='Clusters',
    yaxis_title='WCSS')
plot.show()
Note how the plot of WCSS has a sharp "elbow" at 3 clusters. The WCSS drops sharply as clusters are added up to three, while adding clusters beyond three yields only marginal further reductions. Thus, the optimal cluster value is k = 3.
Generating Clusters
The next step is to run the K-means clustering algorithm. In the below code, the line kmeans = KMeans(3) is where the value for k is input:
# Cluster the data:
kmeans = KMeans(3)
clusters = kmeans.fit_predict(dfBlobs)

# Add the cluster labels to the dataframe:
labels = pd.DataFrame({'Cluster':clusters})
labeledDF = pd.concat((dfBlobs, labels), axis = 1)
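As a quick check (not part of the original notebook), the fitted centroid coordinates and the cluster sizes can be inspected directly; cluster_centers_ is a standard attribute of a fitted KMeans model:
# Inspect the fitted centroid coordinates (one row per cluster):
print(kmeans.cluster_centers_)

# Count how many points landed in each cluster:
print(labeledDF['Cluster'].value_counts())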
The result is a labeled dataframe with a “Cluster” column:
Plotting this yields:
# Change Cluster column to strings for cluster visualization:
labeledDF["Cluster"] = labeledDF["Cluster"].astype(str)
# Generate plot:
plot = px.scatter(labeledDF, x="X", y="Y", color="Cluster")
plot.update_layout(
    title={'text': "Clustered Data",
           'xanchor': 'center',
           'yanchor': 'top',
           'x': 0.5})
plot.show()
The K-means method has successfully clustered the data into three distinct clusters. Now let’s see what happens with more realistic data.
The Python library Seaborn provides various datasets, including one on automobile fuel efficiency from cars built during the oil crisis era. To aid in learning clustering, we will filter the dataset down to 4- and 8-cylinder cars, which represent the smallest and largest engines typically available during that period.
Conveniently, this mirrors a real-world analysis scenario. In the 1970s, rapid increases in fuel prices made 8-cylinder cars less desirable; as a result, 4-cylinder cars became increasingly common. But how do 4- and 8-cylinder cars actually compare when it comes to fuel consumption?
Specifically, we will explore 4- and 8-cylinder cars with regard to their weight and fuel efficiency measured in miles per gallon (MPG).
Preparing the data is straightforward:
import seaborn as sns

# Load in the data - Seaborn's mpg data:
df = sns.load_dataset('mpg')
# Filter for 4 and 8 cylinder cars:
df = df[(df['cylinders'] == 4) | (df['cylinders'] == 8)]
df = df.reset_index(drop=True)
# Display dataframe head:
df.head(3)
The dataset looks like this:
Visualizing the weight and MPG of the cars yields the following:
# Plot the mpg and weight data:
plot = px.scatter(df, x='weight', y='mpg',
                  hover_data=['name', 'model_year'])
plot.update_layout(
    title={'text': "Vehicle Fuel Efficiency",
           'xanchor': 'center',
           'yanchor': 'top',
           'x': 0.5})
plot.show()
Scaling the Data
Note how the x-axis represents weight in pounds, while the y-axis represents MPG. An increase of 10 pounds is not nearly as significant as an increase of 10 MPG, yet K-means works on raw distances, so the much larger weight values would dominate the clustering results.
There are various options available to combat this issue; Jeff Hale has an excellent article linked here providing a technical overview of the various methods and use cases for scaling, standardizing, and normalizing [3]. For this exercise, we will use scikit-learn's StandardScaler().
# Create DF copy for standardizing:
dfCluster = df.copy()

# Set the scaler:
scaler = preprocessing.StandardScaler()
# Standardize the two variables of interest:
dfCluster[['weight', 'mpg']] = scaler.fit_transform(dfCluster[['weight', 'mpg']])
# Create dataframe for clustering:
dfCluster = dfCluster[['weight', 'mpg']]
# View dataframe head:
dfCluster.head(3)
This returns:
This two-column dataframe is now ready to pass through the Elbow Method.
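As an optional sanity check (a small addition, assuming the dfCluster dataframe built above), the standardized columns should now have a mean near 0 and a standard deviation near 1:
# Verify the scaling: means ~0 and standard deviations ~1:
print(dfCluster.describe().loc[['mean', 'std']])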
How Many Clusters?
As in section 1, we follow the Elbow Method with the same code to return the following chart:
# Determine optimal number of clusters:
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, max_iter=5000, random_state=42)
    kmeans.fit(dfCluster)
    wcss.append(kmeans.inertia_)
# Prepare data for visualization:
wcss = pd.DataFrame(wcss, columns=['Value'])
wcss.index += 1
# Plot the elbow curve:
plot = px.line(wcss)
plot.update_layout(
    title={'text': "Within Cluster Sum of Squares or 'Elbow Chart'",
           'xanchor': 'center',
           'yanchor': 'top',
           'x': 0.5},
    xaxis_title='Clusters',
    yaxis_title='WCSS')
plot.update_layout(showlegend=False)
plot.show()
The elbow appears to be at two clusters. However, the curve does not flatten out after the elbow as sharply as it did in the data blobs example of section 1.
Note: The elbow method may not always provide the clearest results, necessitating trying a few values for k in the region that appears to be the elbow. While we are going with 2 clusters, it might be worth trying 3. For more advanced analysis on the topic, Satyam Kumar provides an alternative to the Elbow Method, referred to as the Silhouette Method, in this Towards Data Science article [4].
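For a second opinion beyond the elbow, one option is scikit-learn's silhouette_score, where a higher average score indicates better-separated clusters. A minimal sketch, assuming the dfCluster dataframe from above, might look like this:
from sklearn.metrics import silhouette_score

# Compare candidate values of k by average silhouette score (higher is better):
for k in (2, 3, 4):
    kmeans = KMeans(n_clusters=k, max_iter=5000, random_state=42)
    labels = kmeans.fit_predict(dfCluster)
    print(f"k={k}: silhouette = {silhouette_score(dfCluster, labels):.3f}")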
Generating Clusters
The code to generate clusters remains similar to the earlier example:
# Cluster the data:
kmeans = KMeans(2)
clusters = kmeans.fit_predict(dfCluster)

# Add the cluster labels to the dataframe:
labels = pd.DataFrame({'Cluster': clusters})
labeledDF = pd.concat((df, labels), axis=1)
labeledDF['Cluster'] = labeledDF['Cluster'].astype(str)
This gets the dataframe back to where we started, but with one addition — a column identifying the Cluster for each observation:
Plotting this reveals the clusters:
# Generate plot:
plot = px.scatter(labeledDF, x="weight", y="mpg", color="Cluster",
                  hover_data=['name', 'model_year', 'cylinders'])
plot.update_yaxes(categoryorder='category ascending')
plot.update_layout(
    title={'text': "Clustered Data",
           'xanchor': 'center',
           'yanchor': 'top',
           'x': 0.5})
plot.show()
Further Analysis
Remember that K-means clustering attempts to group data based on similarities within the data, with those similarities determined by distance to a cluster centroid. Having the cluster value added to the original dataframe allows additional analysis on the original dataset. For example, the above cluster visualization shows a split between the clusters around 3000 pounds and about 20 MPG.
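For instance, a quick per-cluster summary (a small sketch using the labeledDF dataframe built above, not part of the original notebook) makes that split explicit:
# Summarize each cluster's average weight and fuel efficiency:
print(labeledDF.groupby('Cluster')[['weight', 'mpg']].mean().round(1))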
Additional visualizations may yield more insights. Consider the following strip plots (code available at linked Git page):
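As a rough stand-in for the notebook's version, a minimal plotly.express sketch of one such strip plot (cylinder counts within each cluster) might look like this:
# Strip plot of cylinder counts within each cluster:
plot = px.strip(labeledDF, x='Cluster', y='cylinders', color='Cluster',
                hover_data=['name', 'mpg', 'weight'])
plot.show()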
First, K-means clustering managed to sort vehicles based only on weight and MPG into clusters that almost perfectly align with the cylinder counts. One cluster is 100% 4-cylinder cars, while the other is almost all 8-cylinder.
Second, the mostly 8-cylinder cluster has four cars with 4-cylinder engines. Despite having smaller engines, these cars appear to have MPG performance closer to larger, 8-cylinder cars. The 4-cylinder cars in cluster zero are all close to 3000 pounds in weight while performing at or below 20 MPG.
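A cross-tabulation (a small sketch, not from the original article) confirms the membership counts behind these observations:
# Cross-tabulate cluster membership against cylinder count:
print(pd.crosstab(labeledDF['Cluster'], labeledDF['cylinders']))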
Finally, there appear to be some 8-cylinder cars that achieve an MPG score closer to that of the 4-cylinder cars. An additional chart reveals these examples are considered outliers (code available at the Git page):
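One way to produce that kind of outlier view (a sketch; the article's own chart code is at the Git page) is a box plot of MPG by cylinder count, where plotly draws points beyond the whiskers as outliers:
# Box plot of MPG by cylinder count; points beyond the whiskers are flagged as outliers:
plot = px.box(labeledDF, x='cylinders', y='mpg', hover_data=['name', 'model_year'])
plot.show()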
This is an example of how clustering can help understand data while guiding follow-on analysis and data exploration. An engineer may find it worth analyzing why certain 4-cylinder cars seem to perform poorly for efficiency and get clustered with 8-cylinder cars.
For additional practice and learning: take the full notebook at the linked Git page, re-load the seaborn MPG dataset, but do not filter for 4 and 8-cylinder cars. You will find the Elbow Method reveals an ambiguous divide between 2 versus 3 clusters — both may be worth trying. Or, attempt the notebook using one of the other seaborn datasets or other features within the MPG dataset [5].
Is K-means clustering the best technique for this data?
Recall that for the example with blobs, the K-means Elbow Method had a very clear optimal point and the resultant clustering analysis easily identified the distinct blobs. K-means tends to perform better when the data is more spherical in nature, as was the case with the data blobs. Alternative methods may work better with complicated data and perhaps even the automotive data used in this example. For further reading on some alternative clustering methods, Victor Roman provides a great overview in this Towards Data Science article [6].
K-means clustering is a powerful machine learning tool for identifying similarities within data. The technique can provide insights or enhance data understanding in a way that guides further analysis questions and improves data visualization. This article and code provide a guide on K-means clustering, but other clustering techniques exist, some of which may be more appropriate for a given type of data. Even so, K-means is an excellent method to know and a great place to start getting familiar with machine learning. Furthermore, K-means clustering can serve as a baseline for comparison with other clustering methods, so it may still prove useful even when it is not the ideal algorithm for a given dataset.
References:
[1] Scikit learn, scikit-learn: machine learning in Python (2023).
[2] Scikit learn, sklearn.datasets.make_blobs (2023).
[3] J. Hale, Scale, Standardize, or Normalize with Scikit-Learn (2019), Towards Data Science.
[4] S. Kumar, Silhouette Method — Better than Elbow Method to Find Optimal Clusters (2020), Towards Data Science.
[5] M. Alam, Seaborn essentials for data visualization in Python (2020), Towards Data Science.
[6] V. Roman, Machine Learning: Clustering Analysis (2019), Towards Data Science.