K Means Clustering in Python | Step-by-Step Tutorials for Clustering in Data Analysis (2024)

Introduction

K-Means is one of the most popular unsupervised machine learning algorithms for solving clustering problems in data science, making it a crucial skill for those aspiring to excel in a data scientist role. K-Means segregates unlabeled data into various groups, known as clusters, by identifying similar features and common patterns within the dataset. This tutorial aims to provide a comprehensive understanding of clustering, with a specific focus on the K-Means clustering algorithm and its implementation in Python. By delving into the nuances of K-Means clustering in Python, you will gain valuable insights into how to effectively organize and analyze data. Additionally, the tutorial will guide you on determining the optimal number of clusters for a dataset, enhancing your ability to apply K-Means clustering in practical scenarios.


Learning Objectives

  • Understand what the K-means clustering algorithm is.
  • Develop a good understanding of the steps involved in implementing the K-Means algorithm and finding the optimal number of clusters.
  • Implement K-Means clustering in Python with the scikit-learn library.

This article was published as a part of the Data Science Blogathon.

Table of contents

  • Introduction
  • What Is Clustering?
  • What Is K-Means Clustering Algorithm?
  • What Is the K-Means Clustering Method in Python?
  • How Does K-Means Clustering in Python Work?
  • Diagrammatic Implementation of K-Means Clustering
  • Choosing the Optimal Number of Clusters
  • Python Code for K-Means Clustering
  • WCSS and Elbow Method
  • Conclusion
  • Frequently Asked Questions

What Is Clustering?

Suppose we have an unlabeled multivariate dataset containing N observations of various animals, such as dogs, cats, and birds. The technique of segregating these data points into various groups on the basis of similar features and characteristics is called clustering.

The groups being formed are known as clusters. Clustering techniques are used in various fields, such as image recognition, spam filtering, etc. They are also used in unsupervised learning algorithms in machine learning, as they can segregate multivariate data into various groups, without any supervisor, on the basis of common patterns hidden inside the datasets.

What Is K-Means Clustering Algorithm?

The k-means clustering algorithm is an iterative algorithm that divides a set of n data points into k different clusters based on their similarity and their distance from the centroid of the cluster they are assigned to.

Here, K is the pre-defined number of clusters to be formed by the algorithm. If K=3, the number of clusters to be formed from the dataset is 3.

Implementation of the K-Means Algorithm

The implementation and working of the K-Means algorithm are explained in the steps below:

Step 1: Select the value of K to decide the number of clusters (n_clusters) to be formed.

Step 2: Select random K points that will act as cluster centroids (cluster_centers).

Step 3: Assign each data point, based on its distance from the randomly selected points (centroids), to the nearest/closest centroid, which will form the predefined clusters.

Step 4: Compute a new centroid for each cluster, i.e., the mean of the data points currently assigned to it.

Step 5: Repeat step 3, which reassigns each data point to the new closest centroid of each cluster.

Step 6: If any reassignment occurs, then go to step 4; else, go to step 7.

Step 7: Finish
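To make these steps concrete, here is a minimal from-scratch sketch in NumPy. This is illustrative code rather than the article's implementation; names such as kmeans_from_scratch, n_clusters, and max_iters are our own, and the sketch does not handle edge cases like empty clusters.

import numpy as np

def kmeans_from_scratch(data, n_clusters, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random data points as the initial centroids
    centroids = data[rng.choice(len(data), n_clusters, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign every point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([data[labels == k].mean(axis=0) for k in range(n_clusters)])
        # Steps 5-7: stop once the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

data = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])
centroids, labels = kmeans_from_scratch(data, n_clusters=2)
print(centroids, labels)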

What Is the K-Means Clustering Method in Python?

K-Means clustering is a method in Python for grouping a set of data points into distinct clusters. The goal is to partition the data in such a way that points in the same cluster are more similar to each other than to points in other clusters. Here’s a breakdown of how to use K Means clustering in Python:

Import Libraries:

  • First, you need to import the necessary libraries. In Python, the popular scikit-learn library provides an implementation of K-Means.
from sklearn.cluster import KMeans

Prepare Your Data:

  • Organize your data into a format that the algorithm can understand. In many cases, you’ll have a 2D array or a pandas DataFrame.
import numpy as np

data = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])

Choose the Number of Clusters (K):

  • Decide on the number of clusters you want the algorithm to find. This is often based on your understanding of the data or through techniques like the elbow method.
kmeans = KMeans(n_clusters=2)

Fit the Model:

  • Train the K-Means model on your data.
kmeans.fit(data)

Get Results:

  • Once the model is trained, you can get information about the clusters.
# Get the cluster centers
centroids = kmeans.cluster_centers_

# Get the labels (cluster assignments for each data point)
labels = kmeans.labels_

In this example, n_clusters=2 indicates that we want the algorithm to find two clusters. The fit method trains the model, and then you can access information about the clusters, such as the cluster centers and labels. Visualizing the results can be helpful to see how well the algorithm grouped your data points.
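For example, a quick way to inspect the result of the snippet above is a scatter plot coloured by cluster label, with the centroids marked. This matplotlib sketch is only illustrative and reuses the data, labels, and centroids variables defined above:

import matplotlib.pyplot as plt

# Colour each point by its cluster label and mark the centroids with black crosses
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='rainbow')
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='x', s=100)
plt.show()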

How Does K-Means Clustering in Python Work?

Here is a step-by-step explanation of how K-Means clustering works in Python:

Initialize Centroids:

  • Randomly choose K data points from the dataset to be the initial centroids. K is the number of clusters you want to create.

Assign Data Points to Nearest Centroid:

  • For each data point in the dataset, calculate the distance to each centroid.
  • Assign the data point to the cluster whose centroid is the closest (usually using Euclidean distance).

Update Centroids:

  • Recalculate the centroids of the clusters by taking the mean of all the data points assigned to each cluster.

Repeat:

  • Repeat steps 2 and 3 until convergence. Convergence occurs when the centroids no longer change significantly or after a predefined number of iterations.

Final Result:

  • The algorithm converges, and each data point is assigned to one of the K clusters.

Here’s a simple example using Python with the popular machine learning library, scikit-learn:

from sklearn.cluster import KMeans
import numpy as np

# Sample data
data = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])

# Specify the number of clusters (K)
kmeans = KMeans(n_clusters=2)

# Fit the data to the algorithm
kmeans.fit(data)

# Get the cluster centroids and labels
centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print("Centroids:")
print(centroids)
print("Labels:")
print(labels)

Diagrammatic Implementation of K-Means Clustering

Step 1: Let's choose the number k of clusters, i.e., K=2, to segregate the dataset into different respective clusters. We will choose 2 random points which will act as centroids to form the clusters.

Step 2: Now, we will assign each data point on the scatter plot to the closest centroid. This is done by drawing a median line (perpendicular bisector) between the two centroids.

Step 3: Points on the left side of the line are nearer to the blue centroid, and points to the right of the line are nearer to the yellow centroid. The left ones form a cluster with the blue centroid, and the right ones with the yellow centroid.

Step 4: Repeat the process by choosing new centroids. The new centroid of each cluster is the center of gravity of the points assigned to it.

Step 5: Next, we will reassign each data point to the new closest centroid. We will repeat the same process as above (using a median line). Any yellow data point that falls on the blue side of the median line will be included in the blue cluster.


Step 6: As reassignment has occurred, we will repeat the above step of finding new centroids for the clusters.


Step 7: We will repeat the above process of finding the center of gravity of the clusters to obtain the new k centroids.


Step 8: After finding the new k centroids, we will again draw the median line and reassign the data points, as in the steps above.


Step 9: We will finally segregate the points based on the median line, so that two groups are formed and no dissimilar point is included in either group.


This completes the formation of the two final clusters.


Choosing the Optimal Number of Clusters

The number of clusters we choose for the algorithm shouldn't be arbitrary. Each cluster is formed by calculating and comparing the distances of the data points within it from its centroid.

We can choose the right number of clusters with the help of the Within-Cluster Sum of Squares (WCSS) method. WCSS is the sum of the squared distances between each data point and the centroid of the cluster it belongs to.

The main idea is to minimize this distance (e.g., the Euclidean distance) between the data points and the centroids of the clusters. The process is iterated until we reach a minimum value for the sum of distances.
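Written out using the standard formulation (added here for clarity, not part of the original article), where C_k denotes the k-th cluster and mu_k its centroid:

\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

In scikit-learn, this quantity is exposed as the fitted model's inertia_ attribute, which the code later in this article uses.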

Elbow Method

Here are the steps to follow in order to find the optimal number of clusters using the elbow method:

Step 1: Execute K-means clustering on a given dataset for different values of K (ranging from 1 to 10).

Step 2: For each value of K, calculate the WCSS value.

Step 3: Plot a graph/curve between WCSS values and the respective number of clusters K.

Step 4: The sharp point of bend in the plot (which looks like the elbow joint of an arm) is considered the best/optimal value of K.

Python Code for K-Means Clustering

Importing relevant libraries

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans

Loading the data

data = pd.read_csv('Countryclusters.csv')
data

Plotting the data
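A minimal sketch of this step, assuming the dataset has the 'Longitude' and 'Latitude' columns used later in this article (the exact plotting code from the original is not reproduced here):

plt.scatter(data['Longitude'], data['Latitude'])
plt.xlim(-180, 180)   # longitude range
plt.ylim(-90, 90)     # latitude range
plt.show()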


Selecting the feature

x = data.iloc[:, 1:3]  # first index for rows, second for columns
x

Clustering

kmeans = KMeans(3)
kmeans.fit(x)

Clustering results

identified_clusters = kmeans.fit_predict(x)
identified_clusters

array([1, 1, 0, 0, 0, 2])

data_with_clusters = data.copy()
data_with_clusters['Clusters'] = identified_clusters
plt.scatter(data_with_clusters['Longitude'], data_with_clusters['Latitude'], c=data_with_clusters['Clusters'], cmap='rainbow')

WCSS and Elbow Method

wcss = []
for i in range(1, 7):
    kmeans = KMeans(i)
    kmeans.fit(x)
    wcss_iter = kmeans.inertia_
    wcss.append(wcss_iter)

number_clusters = range(1, 7)
plt.plot(number_clusters, wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')

This method shows that 3 is a good number of clusters.

Conclusion

To summarize everything that has been stated so far, K-Means clustering in Python is a widely used unsupervised machine learning technique that enables the grouping of data into clusters based on similarity. It is a simple algorithm that can be applied to various domains and data types, including image and text data. K-Means can be used for a variety of purposes. We can also use it for dimensionality reduction, where each transformed feature is the distance of the point from a cluster center.

Key Takeaways

  • K-means is a widely used unsupervised machine learning algorithm for clustering data into groups (also known as clusters) of similar objects.
  • The objective is to minimize the sum of squared distances between the objects and their respective cluster centroids.
  • The k-means clustering algorithm is limited in that it cannot handle complex, non-linear data.

Frequently Asked Questions

Q1. What is meant by n_init in k-means clustering?

A. n_init is an integer that specifies how many times the k-means algorithm will be run with different centroid initializations; the best run (in terms of inertia/WCSS) is kept as the final result.

Q2. What are the advantages and disadvantages of K-means Clustering?

A. Advantages of K-means Clustering include its simplicity, scalability, and versatility, as it can be applied to a wide range of data types. Disadvantages include its sensitivity to the initial placement of centroids and its limitations in handling complex, non-linear data. k-means is also sensitive to outliers.

Q3. What is meant by random_state in k-means clustering?

A. In K-Means, random_state controls the random number generation used for centroid initialization. We can pass an integer value to make the randomness fixed or constant, which helps when we want to produce the same clusters every time.
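As a small illustration of both parameters discussed in Q1 and Q3 (the values here are arbitrary examples, not recommendations):

from sklearn.cluster import KMeans

# Run 10 independent centroid initializations and keep the best (lowest-inertia) result;
# fixing random_state makes that result reproducible across runs.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)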

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


Pranshu Sharma, 17 Jan 2024



FAQs

How to do k-means clustering in Python?

The recipe for k-means is quite straightforward.
  1. Decide how many clusters you want, i.e. choose k.
  2. Randomly assign a centroid to each of the k clusters.
  3. Calculate the distance of all observations to each of the k centroids.
  4. Assign observations to the closest centroid.
  5. Recompute the centroids and repeat steps 3 and 4 until the assignments stop changing.

How to solve the k-means clustering problem?

How to Apply K-Means Clustering Algorithm?
  1. Choose the number of clusters k. The first step in k-means is to pick the number of clusters, k.
  2. Select k random points from the data as centroids. ...
  3. Assign all the points to the closest cluster centroid. ...
  4. Recompute the centroids of newly formed clusters. ...
  5. Repeat steps 3 and 4.

What is k-means clustering?

K-Means clustering is an unsupervised learning algorithm. There is no labeled data for this clustering, unlike in supervised learning. K-Means performs the division of objects into clusters that share similarities and are dissimilar to the objects belonging to another cluster. The term 'K' is a number.

What is the cluster center in Kmeans?

The k-means algorithm searches for a pre-determined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like: The "cluster center" is the arithmetic mean of all the points belonging to the cluster.

How to calculate accuracy of k-means clustering in Python?

The quality of the cluster assignments is determined by computing the sum of the squared error (SSE) after the centroids converge, or match the previous iteration's assignment. The SSE is defined as the sum of the squared Euclidean distances of each point to its closest centroid.
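As an illustrative sketch (not from the original article), scikit-learn exposes this SSE as the fitted model's inertia_ attribute; the toy array below mirrors the example used earlier:

import numpy as np
from sklearn.cluster import KMeans

data = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# inertia_ is the SSE: the sum of squared distances of each point to its closest centroid
print("SSE (inertia):", kmeans.inertia_)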

What is k-means clustering for beginners?

K-means is a centroid-based clustering algorithm, where we calculate the distance between each data point and a centroid to assign it to a cluster. The goal is to identify the K number of groups in the dataset.

What are the two main problems of the k-means clustering algorithm?

There are two challenges that need to be handled wisely in order to get the most out of the k-means clustering algorithm: defining the number of clusters and determining the initial centroids.

How do you analyze k-means clustering results?

Interpreting the meaning of k-means clusters boils down to characterizing the clusters. A Parallel Coordinates Plot allows us to see how individual data points sit across all variables. By looking at how the values for each variable compare across clusters, we can get a sense of what each cluster represents.
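A minimal sketch of such a plot using pandas' built-in parallel_coordinates helper, assuming a small DataFrame of numeric features plus a cluster-label column (all names here are illustrative):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.cluster import KMeans

data = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

# One row per data point: numeric feature columns plus the cluster assignment
df = pd.DataFrame(data, columns=['feature_1', 'feature_2'])
df['Cluster'] = labels

parallel_coordinates(df, class_column='Cluster', colormap='rainbow')
plt.show()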

What is the formula for k-means clustering?

Algorithmic steps for k-means clustering

Let X = {x1, x2, x3, ..., xn} be the set of data points and V = {v1, v2, ..., vc} be the set of cluster centers. 1) Randomly select 'c' cluster centers. 2) Calculate the distance between each data point and each cluster center.

What is the rule of thumb for K clustering?

The first way is a rule of thumb that sets the number of clusters to the square root of half the number of objects. If we want to cluster 200 objects, the number of clusters would be √(200/2)=10.
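For example, the same back-of-the-envelope calculation in Python (illustrative only):

import math

n_objects = 200
k = round(math.sqrt(n_objects / 2))  # sqrt(100) = 10 clusters
print(k)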

How many clusters are generated by the k-means algorithm?

The k-means algorithm in data clustering generates a number of clusters equivalent to k. The 'k' value is predefined by the user and dictates the number of clusters that the algorithm will create. For instance, if you set k=3 in the k-means algorithm, it will categorize the data points into three clusters.

What does K stand for in data analysis?

The k-means clustering method is used in non-hierarchical cluster analysis. The goal is to divide the whole set of objects into a predefined number (k) of clusters.

Is k-means sensitive to outliers?

The K-means clustering algorithm is sensitive to outliers, because a mean is easily influenced by extreme values. K-medoids clustering is a variant of K-means that is more robust to noises and outliers.

How do you determine how many clusters to use in k-means?

Using Silhouette Score: The silhouette score is particularly helpful in determining the optimal number of clusters (k) for K-means. You can calculate the silhouette score for different values of k and choose the k that results in the highest average silhouette score.
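A sketch of that procedure with scikit-learn, reusing the toy data from earlier answers (silhouette_score is only defined for 2 <= k <= n_samples - 1, so k starts at 2):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])

best_k, best_score = None, -1.0
for k in range(2, 6):  # try k = 2..5 for this 6-point toy dataset
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

print("Best k by silhouette score:", best_k)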

How to train the k-means model?

Step-1: Select the number K to decide the number of clusters. Step-2: Select K random points or centroids. (These can be points other than those in the input dataset.) Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

How to determine the k value in the K-means clustering algorithm?

The elbow method is a graphical representation of finding the optimal 'K' in K-means clustering. It works by finding the WCSS (Within-Cluster Sum of Squares), i.e., the sum of the squared distances between the points in a cluster and the cluster centroid.

How does K-means clustering work?

K-means assigns every data point in the dataset to the nearest centroid, meaning that a data point is considered to be in a particular cluster if it is closer to that cluster's centroid than any other centroid.
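In scikit-learn terms, this assignment rule is exactly what predict applies to new points after fitting; a small illustrative example with toy numbers (not from the article):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Each new point goes to the cluster whose centroid is nearest
new_points = np.array([[0, 0], [10, 10]])
print(kmeans.predict(new_points))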

What is the difference between KNN and K-means?

KNN is a predictive algorithm, which means that it uses the existing data to make predictions or classifications for new data. K-means is a descriptive algorithm, which means that it uses the data to find patterns or structure within it.
