K-means Clustering: An Introductory Guide and Practical Application


Using clustering algorithms such as K-means is one of the most popular starting points for machine learning. K-means clustering is an unsupervised machine learning technique that sorts similar data into groups, or clusters. Observations within a cluster are more similar to one another than they are to observations outside the cluster.

The K in K-means is the user-defined number of clusters, k. K-means clustering works by searching for the cluster centroid positions that best fit the data for k clusters, ensuring each data point is closer to its own cluster's centroid than to any other centroid. Ideally, the resulting clusters maximize similarity among the data within each cluster.
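
To make the mechanics concrete, below is a minimal from-scratch sketch of a single K-means iteration using NumPy; the function and variable names here are illustrative, not part of the article's notebook:

import numpy as np

def kmeans_step(X, centroids):
    # Assign each point to its nearest centroid:
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Move each centroid to the mean of its assigned points:
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids

Repeating this assign-and-update step until the labels stop changing is, in essence, the whole algorithm; scikit-learn's KMeans adds smarter initialization (k-means++) and multiple restarts on top.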

Note that various methods for clustering exist; this article will focus on one of the most popular techniques: K-means.

This guide consists of two parts:

  1. A K-means clustering introduction using generated data.
  2. An application of K-means clustering to an automotive dataset.

Code:

All code is available at the GitHub page linked here. Feel free to download the notebook (click CODE and Download Zip) and run it alongside this article!

For this guide, we will use the following scikit-learn modules [1]:

from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.datasets import make_blobs

To demonstrate K-means clustering, we first need data. Conveniently, the sklearn library includes the ability to generate data blobs [2]. The code is rather simple:

# Generate sample data:
X, y = make_blobs(n_samples=150,
                  centers=3,
                  cluster_std=0.45,
                  random_state=0)

The parameters of the make_blobs() function let the user specify the number of centers (which correspond to potential cluster centroids) and how messy the “blobs” are (cluster_std, which adjusts each cluster’s standard deviation). The code above generates the blobs; the code below loads them into a dataframe and renders a plotly scatter plot:

# Import required libraries:
import plotly.express as px
import pandas as pd

# Convert to dataframe:
dfBlobs = pd.DataFrame(X, columns=['X', 'Y'])

# Plot data:
plot = px.scatter(dfBlobs, x="X", y="Y")
plot.update_layout(
    title={'text': "Randomly Generated Data",
           'xanchor': 'center',
           'yanchor': 'top',
           'x': 0.5})
plot.show()

Here’s the output:

[Figure: scatter plot of the randomly generated data blobs]

Because we chose a low cluster_std value in the make_blobs() function, the resulting graph shows three clearly defined data blobs that should be easy work for a K-means clustering algorithm.

How Many Clusters?

The K in K-means is the number of clusters, a user-defined figure. For a given dataset, there is typically an optimal number of clusters. In the generated data seen above, it’s probably three.

To mathematically determine the optimal number of clusters, use the “Elbow Method.” This method calculates the within-cluster sum of squares (WCSS) for various values of k, with lower values generally being better. The WCSS is the sum of the squared distances of each data point from its cluster’s centroid. The Elbow Method plots WCSS against the number of clusters; an “elbow” appears where additional clusters stop producing meaningful drops in WCSS, revealing the optimal number of clusters.
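
To make the metric concrete, here is a minimal sketch of recomputing WCSS by hand; it assumes a fitted KMeans instance named kmeans (like the ones produced in the loop below) and should match the model's inertia_ attribute up to floating-point error:

# Squared distance from each point to its assigned centroid, summed:
assigned = kmeans.cluster_centers_[kmeans.labels_]
wcss_manual = ((dfBlobs.to_numpy() - assigned) ** 2).sum()
print(wcss_manual, kmeans.inertia_)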

The following code generates an Elbow Chart for the above data:

# Determine optimal number of clusters:
wcss = []

for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, max_iter=5000, random_state=42)
    kmeans.fit(dfBlobs)
    wcss.append(kmeans.inertia_)

# Prepare data for visualization:
wcss = pd.DataFrame(wcss, columns=['Value'])
wcss.index += 1

When plotted, this yields:

# Plot the elbow curve:
plot = px.line(wcss, y="Value")
plot.update_layout(
    title={'text': "Within Cluster Sum of Squares or 'Elbow Chart'",
           'xanchor': 'center',
           'yanchor': 'top',
           'x': 0.5},
    xaxis_title='Clusters',
    yaxis_title='WCSS')
plot.show()

[Figure: Elbow chart (WCSS vs. number of clusters) for the generated blobs]

Note how the plot of WCSS has a sharp “elbow” at 3 clusters: WCSS drops sharply with each added cluster up to three, while adding clusters beyond that yields only minimal further reductions. Thus, the optimal cluster value is k = 3.

Generating Clusters

The next step is to run the K-means clustering algorithm. In the below code, the line kmeans = KMeans(3) is where the value for k is input:

# Cluster the data:
kmeans = KMeans(3)
clusters = kmeans.fit_predict(dfBlobs)

# Add the cluster labels to the dataframe:
labels = pd.DataFrame({'Cluster': clusters})
labeledDF = pd.concat((dfBlobs, labels), axis=1)

The result is a labeled dataframe with a “Cluster” column:

[Figure: head of the labeled dataframe with the new Cluster column]
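
It can also be helpful to see where the fitted centroids landed; cluster_centers_ is a standard attribute of a fitted KMeans model:

# Each row is one centroid, in the same (X, Y) feature space:
print(pd.DataFrame(kmeans.cluster_centers_, columns=['X', 'Y']))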

Plotting this yields:

# Change Cluster column to strings for cluster visualization:
labeledDF["Cluster"] = labeledDF["Cluster"].astype(str)

# Generate plot:
plot = px.scatter(labeledDF, x="X", y="Y", color="Cluster")
plot.update_layout(
    title={'text': "Clustered Data",
           'xanchor': 'center',
           'yanchor': 'top',
           'x': 0.5})
plot.show()

[Figure: the three clusters identified by K-means, colored by label]

The K-means method has successfully clustered the data into three distinct clusters. Now let’s see what happens with more realistic data.

The Python library Seaborn provides various datasets, including one on automobile fuel efficiency for cars built during the oil-crisis era. To keep this clustering example clear, we will filter the dataset to 4- and 8-cylinder cars; these represent the smallest and largest engines typically available during that period.

Conveniently, this resembles a real-world analysis scenario. In the 1970s, rapid increases in fuel prices made 8-cylinder cars less desirable, and 4-cylinder cars became increasingly common. But how do 4- and 8-cylinder cars actually compare in fuel consumption?

Specifically, we will explore 4- and 8-cylinder cars in terms of their weight and fuel efficiency, measured in miles per gallon (MPG).

Preparing the data is straightforward:

import seaborn as sns

# Load in the data - Seaborn's mpg data:
df = sns.load_dataset('mpg')

# Filter for 4 and 8 cylinder cars:
df = df[(df['cylinders'] == 4) | (df['cylinders'] == 8)]
df = df.reset_index(drop=True)

# Display dataframe head:
df.head(3)

The dataset looks like this:

[Figure: head of the mpg dataframe]

Visualizing the weight and MPG of the cars yields the following:

# Plot the mpg and weight data:
plot = px.scatter(df, x='weight', y='mpg',
                  hover_data=['name', 'model_year'])
plot.update_layout(
    title={'text': "Vehicle Fuel Efficiency",
           'xanchor': 'center',
           'yanchor': 'top',
           'x': 0.5})
plot.show()

[Figure: scatter plot of vehicle weight vs. MPG]

Scaling the Data

Note how the x-axis represents weight, while the y-axis represents MPG. An increase of 10 pounds is not as significant as an increase of 10 MPG; because K-means measures similarity by distance, the raw weight values (in the thousands) would dominate the much smaller MPG values and skew the clustering results.

There are various options available to combat this issue; Jeff Hale has an excellent article linked here providing a technical overview of the various methods and use cases for scaling, standardizing, and normalizing [3]. For this exercise, we will use scikit-learn’s StandardScaler() function.
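
For intuition, StandardScaler applies the familiar z-score transform: it subtracts each column's mean and divides by the column's standard deviation. A quick sketch of the equivalent manual calculation for the weight column (StandardScaler uses the population standard deviation, hence ddof=0):

# z-score by hand: subtract the column mean, divide by its std:
z_weight = (df['weight'] - df['weight'].mean()) / df['weight'].std(ddof=0)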

# Create DF copy for standardizing:
dfCluster = df.copy()

# Set the scaler:
scaler = preprocessing.StandardScaler()

# Standardize the two variables of interest:
dfCluster[['weight', 'mpg']] = scaler.fit_transform(dfCluster[['weight', 'mpg']])

# Create dataframe for clustering:
dfCluster = dfCluster[['weight', 'mpg']]

# View dataframe head:
dfCluster.head(3)

This returns:

[Figure: head of the standardized weight/mpg dataframe]

This two-column dataframe is now ready to pass through the Elbow Method.

How Many Clusters?

As in section 1, we follow the Elbow Method with the same code to return the following chart:

# Determine optimal number of clusters:
wcss = []

for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, max_iter=5000, random_state=42)
    kmeans.fit(dfCluster)
    wcss.append(kmeans.inertia_)

# Prepare data for visualization:
wcss = pd.DataFrame(wcss, columns=['Value'])
wcss.index += 1

# Plot the elbow curve:
plot = px.line(wcss)
plot.update_layout(
    title={'text': "Within Cluster Sum of Squares or 'Elbow Chart'",
           'xanchor': 'center',
           'yanchor': 'top',
           'x': 0.5},
    xaxis_title='Clusters',
    yaxis_title='WCSS',
    showlegend=False)
plot.show()

[Figure: Elbow chart for the automotive data]

The elbow appears to be at two clusters. However, the drop from 2 to 3 clusters does not flatten out as sharply as it did in the data blobs example of section 1.

Note: The Elbow Method may not always give the clearest result, so it can be worth trying a few values of k around the apparent elbow. While we are going with 2 clusters here, 3 might also be worth trying. For a more advanced take on the topic, Satyam Kumar covers an alternative to the Elbow Method, the Silhouette Method, in this Towards Data Science article [4].
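
As a quick second opinion on k, here is a short sketch using scikit-learn's silhouette_score (higher is better), run against the same standardized dataframe:

from sklearn.metrics import silhouette_score

# Compare candidate cluster counts; higher silhouette is better:
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(dfCluster)
    print(k, silhouette_score(dfCluster, labels))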

Generating Clusters

The code to generate clusters remains similar to the earlier example:

# Cluster the data:
kmeans = KMeans(2)
clusters = kmeans.fit_predict(dfCluster)

# Add the cluster labels to the dataframe:
labels = pd.DataFrame({'Cluster': clusters})
labeledDF = pd.concat((df, labels), axis=1)
labeledDF['Cluster'] = labeledDF['Cluster'].astype(str)

This returns the dataframe to where we started, with one addition: a column identifying the cluster for each observation:

[Figure: head of the automotive dataframe with the Cluster column]

Plotting this reveals the clusters:

# Generate plot:
plot = px.scatter(labeledDF, x="weight", y="mpg", color="Cluster",
                  hover_data=['name', 'model_year', 'cylinders'])
plot.update_yaxes(categoryorder='category ascending')
plot.update_layout(
    title={'text': "Clustered Data",
           'xanchor': 'center',
           'yanchor': 'top',
           'x': 0.5})
plot.show()

[Figure: the two automotive clusters, colored by label]

Further Analysis

Remember that K-means clustering groups data by similarity, with similarity measured as distance to a cluster centroid. Adding the cluster labels back to the original dataframe enables further analysis of the original dataset. For example, the cluster visualization above shows a split between the clusters at around 3000 pounds and about 20 MPG.
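
Because the model was fitted on standardized data, the centroids can be mapped back into pounds and MPG with the scaler's inverse_transform; a minimal sketch, assuming the scaler and kmeans objects from above, which should land near that 3000-pound, 20-MPG split:

# Convert the standardized centroids back to (weight, mpg) units:
print(pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_),
                   columns=['weight', 'mpg']))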

Additional visualizations may yield more insights. Consider the following strip plots (code available at the linked Git page):

[Figure: strip plots of cylinder count by cluster]
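
A rough plotly approximation of such a strip plot, assuming the labeledDF built above, might look like this:

# Strip plot of cylinder counts within each cluster:
plot = px.strip(labeledDF, x='Cluster', y='cylinders', color='Cluster')
plot.show()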

First, K-means clustering managed to sort vehicles based only on weight and MPG into clusters that almost perfectly align with the cylinder counts. One cluster is 100% 4-cylinder cars, while the other is almost all 8-cylinder.

Second, the mostly 8-cylinder cluster has four cars with 4-cylinder engines. Despite having smaller engines, these cars appear to have MPG performance closer to larger, 8-cylinder cars. The 4-cylinder cars in cluster zero are all close to 3000 pounds in weight while performing at or below 20 MPG.

Finally, some 8-cylinder cars appear to achieve MPG figures closer to those of 4-cylinder cars. An additional chart reveals these examples are outliers (code available at the Git page):

[Figure: chart highlighting the outlier 8-cylinder cars]
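
Again, the exact chart is in the notebook; one simple way to surface such outliers is a box plot of MPG per cylinder count:

# High-MPG 8-cylinder cars appear as outlier points above the box:
plot = px.box(labeledDF, x='cylinders', y='mpg')
plot.show()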

This is an example of how clustering can help in understanding data while guiding follow-on analysis and exploration. An engineer may find it worth analyzing why certain 4-cylinder cars perform poorly on fuel efficiency and end up clustered with 8-cylinder cars.

For additional practice and learning: take the full notebook at the linked Git page, re-load the seaborn MPG dataset, but do not filter for 4- and 8-cylinder cars. You will find the Elbow Method gives an ambiguous result between 2 and 3 clusters; both may be worth trying (a starting point is sketched below). Or run the notebook against one of the other seaborn datasets, or against other features within the MPG dataset [5].
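
As a hypothetical starting point for that exercise, only the filtering step changes:

# Re-load the data and skip the cylinder filter, then re-run the
# scaling, elbow, and clustering cells unchanged:
df = sns.load_dataset('mpg')
dfCluster = df[['weight', 'mpg']].copy()
dfCluster[['weight', 'mpg']] = scaler.fit_transform(dfCluster)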

Is K-means clustering the best technique for this data?

Recall that for the blobs example, the Elbow Method had a very clear optimal point and the resulting clustering easily identified the distinct blobs. K-means tends to perform better when the clusters are roughly spherical, as was the case with the generated blobs. Alternative methods may work better with more complicated data, perhaps even the automotive data used in this example. For further reading on alternative clustering methods, Victor Roman provides a great overview in this Towards Data Science article [6].

K-means clustering is a powerful machine learning tool for identifying similarities within data. The technique can provide insights or deepen data understanding in ways that guide further analysis and improve data visualization. This article and its code provide a guide to K-means clustering, but other clustering techniques exist, some of which may be more appropriate for a given type of data. Even when K-means is not the ideal method for the data at hand, it is an excellent technique to know and a great place to start with machine learning, and it can also serve as a baseline against which to compare other clustering methods.

References:

[1] Scikit-learn, scikit-learn: machine learning in Python (2023).

[2] Scikit-learn, sklearn.datasets.make_blobs (2023).

[3] J. Hale, Scale, Standardize, or Normalize with Scikit-Learn (2019), Towards Data Science.

[4] S. Kumar, Silhouette Method — Better than Elbow Method to Find Optimal Clusters (2020), Towards Data Science.

[5] M. Alam, Seaborn essentials for data visualization in Python (2020), Towards Data Science.

[6] V. Roman, Machine Learning: Clustering Analysis (2019), Towards Data Science.
