Model Interpretability
Understand your K-Means clusters by extracting each cluster’s most important features.
Machine learning models go through many stages before they can be considered production-ready. One critical stage is the moment of truth where the model is given a scientific green light: model evaluation. Many evaluation metrics are designed for different purposes and problem specifications, but none of them is flawless. Hence, data scientists carry a heavy burden when choosing the evaluation metric that best fits the problem at hand and serves as a functional, decision-inducing value for a machine learning project.
However, some evaluation metric values can seem unintuitive and cannot be explained in layman’s terms, especially when evaluating unsupervised clustering algorithms using internal indexes (where no external information, such as true labels, is available). Although these measurements are of considerable benefit for comparative benchmarking, combining them with an interpretation is crucial for independent, intuitive validation and for communicating results to stakeholders more easily.
When it comes to internal validation indexes, data scientists tend to lose a focal point of the evaluation process: the intuitive, “human” understanding of the model’s performance and its explanation. To elaborate by counterexample, supervised classification evaluation metrics such as Accuracy, Precision, or Recall have very intuitive explanations for both laymen and experts. For instance:
- Accuracy: “We have correctly classified this percentage of samples out of all samples we have.”
- Precision: “We have correctly classified this percentage of Positives out of all predicted positives.”
- Recall: “We have correctly classified this percentage of Positives out of all actual positives.”
On the other hand, internal validation indexes such as the widely used Silhouette Coefficient, Calinski-Harabasz Index, Davies-Bouldin Index, and many others are, more often than not, used comparatively rather than as an inherent and independent performance evaluation. The values of these measurements can be broken down into how compact each cluster is (how similar the cluster’s data points are to each other), how well-separated the clusters are (how dissimilar each cluster’s data points are from other clusters’ data points), or both. But these measurements are not as easy to explain and connect to the clustering problem you are solving.
Notwithstanding how valuable these measures are, obtaining an interpretation of the model alongside an internal evaluation index is crucial, both when you as a data scientist are trying to understand how good your model is and when you are communicating your results in depth to stakeholders, where most internal validation indexes do not pass the ELI5 (Explain It Like I’m 5) test :). This article focuses on one of the most popular unsupervised clustering algorithms, K-Means, and presents two possible techniques to extract the most important features for each cluster. The article’s outline is as follows:
- K-Means Business Value
- How K-Means Works
- Interpretation Techniques
- Real-Life Application
If you already know how K-Means works, jump to the Interpretation Techniques section. If you would like to visit the repository for this article and use the code directly, visit kmeans-feature-importance.
Say you are running a business with thousands of customers and you want to know more about them, however many you have. You cannot, for example, study each customer and cater a marketing campaign specifically to them. Yet you know that each customer will probably have different demographic (e.g., age), geographic (e.g., region), psychographic (e.g., spending habits), and behavioral (e.g., engagement) properties. Segmenting them into multiple similar groups simplifies understanding who they are through the most prevalent properties of each group. The most popular algorithm to solve this problem is, I am sure you guessed it, K-Means: “K” refers to the number of possible customer segments/groups/clusters, and “Means” can be thought of as imaginary persons who are the center of each of the K groups (most similar to their group members).
K-Means is an unsupervised clustering algorithm that groups similar data samples together, away from dissimilar data samples. Precisely, it aims to minimize the Within-Cluster Sum of Squares (WCSS) and consequently maximize the Between-Cluster Sum of Squares (BCSS). The K-Means algorithm has different implementations and conceptual variations, but for the sake of this article, we will focus on the most common method, namely Lloyd’s algorithm (naive K-Means), which follows an iterative approach to find a sub-optimal solution. Before starting, we will prepare a simple dataset for explanation and see what it looks like.
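The article shows this toy dataset as a figure; as a stand-in, here is a minimal, hypothetical 2-D dataset (the exact values are my own assumption) that the short sketches below will reuse:

```python
import pandas as pd

# A small, hypothetical 2-D dataset (Feature 1, Feature 2) used only to
# illustrate the steps below; any set of points works the same way.
df = pd.DataFrame({
    "Feature 1": [1.0, 1.5, 5.0, 5.5, 9.0, 9.5],
    "Feature 2": [1.2, 1.8, 5.1, 5.6, 9.2, 9.8],
})
print(df)
```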
The steps we need to do to cluster the data points above into K groups using K-Means are:
Step 1 — Choosing Initial Number of Groups/Clusters (K)
Each cluster is represented by a centroid: the mean of all data points assigned to that cluster. Choosing an initial number of groups is synonymous with choosing an initial number of centroids K. We can decide on K = 3 and choose 3 distinct data points randomly from the dataset as our initial centroids (there are better methods to choose the initial coordinates of each centroid).
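A minimal sketch of this random initialization, reusing the hypothetical df above (the seed and the plain random selection are assumptions, not the article’s exact choice):

```python
import numpy as np

K = 3
rng = np.random.default_rng(42)

# Pick K distinct rows of the dataset as the initial centroids.
initial_idx = rng.choice(len(df), size=K, replace=False)
centroids = df.to_numpy()[initial_idx]
print(centroids)
```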
Step 2 — Assigning Data Points to Clusters Centroids
We have to assign each data point in our dataset to the closest centroid. We will use the most common distance metric as our definition of “close” through the following steps:
1. Calculate the Euclidean distance between each centroid and all data points, using the equation below for the j-th cluster centroid $C_j$ and a point $p$ in a dataset that has $d$ dimensions (my usage of $C_j = \mu_{c_j}$, where $\mu$ denotes the average, is only a simplification, since the equations I list operate at the single-cluster level):

$$\lVert p - C_j \rVert = \sqrt{\sum_{i=1}^{d} \left( p_i - C_{j,i} \right)^2}$$
2. To minimize the WCSS, we assign each data point to its closest centroid (most similar / least distant). The reason this is a WCSS minimization step follows from the equation below for a single cluster’s WCSS, with $m$ points $p_1, \dots, p_m$ assigned to the cluster centroid $C_j$: the shorter the distances of the assigned points to their cluster centroid, the lower its WCSS. (A NumPy sketch of both steps follows this list.)

$$\mathrm{WCSS}_j = \sum_{i=1}^{m} \lVert p_i - C_j \rVert^2$$
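A minimal NumPy sketch of Steps 1 and 2, continuing the toy example above (X, centroids, and K come from the earlier snippets):

```python
X = df.to_numpy()

# Step 1: Euclidean distance from every point to every centroid, shape (n_points, K).
distances = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))

# Step 2: assign each point to its nearest centroid (the WCSS minimization step).
labels = distances.argmin(axis=1)

# WCSS of a single cluster j: sum of squared distances of its points to C_j.
wcss_per_cluster = [((X[labels == j] - centroids[j]) ** 2).sum() for j in range(K)]
print(labels, wcss_per_cluster)
```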
Step 3 — Updating Clusters Centroids’ Positions
The last step is to update the cluster centroids' positions since different data points were assigned to each cluster. We have to move each centroid to the mean of all data points assigned to it by:
- Calculating the mean of each cluster’s data points
- Setting the new cluster centroid to the new mean for each cluster
- Repeating Step 2 and Step 3 until the cluster centroids (the new means) no longer change, as sketched in the code below
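Here is a minimal sketch of the full naive Lloyd loop described above, continuing the NumPy example (it assumes no cluster ever becomes empty and is an illustration, not scikit-learn’s implementation):

```python
# Repeat assignment (Step 2) and centroid update (Step 3) until convergence.
for _ in range(100):
    distances = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))
    labels = distances.argmin(axis=1)

    # New centroid of each cluster = mean of the points assigned to it
    # (assumes every cluster keeps at least one point).
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(K)])

    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```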
Using sklearn.cluster.KMeans, we will end up with the following, where you can take, for example, the features of Sample 0 and Sample 1, calculate their mean, and then check whether it equals Cluster Centroid D1 and Cluster Centroid D2:
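A minimal sketch of that check using the toy DataFrame above (parameter values such as random_state are assumptions):

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(df)

# The centroid of the cluster containing Sample 0 should equal the
# per-feature mean of all samples assigned to that cluster.
cluster_of_sample_0 = km.labels_[0]
members = df[km.labels_ == cluster_of_sample_0]
print(members.mean().to_numpy())                 # mean of the assigned samples
print(km.cluster_centers_[cluster_of_sample_0])  # should match the centroid
```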
The plot below shows how K-Means clustered the dataset where each centroid is positioned at the mean of the points assigned to it:
Approach 1: WCSS Minimizers
Disclaimer: The author has developed this method. Please do not hesitate to point out any existing references for it or to criticize the methodology.
This approach is a direct analysis of each centroid’s sub-optimal position. Since K-Means aims to minimize the Within-Cluster Sum of Squares, and assuming that the distance metric used is the Euclidean distance, we can find the dimensions responsible for the highest amount of WCSS minimization (the sum of squares of each data point’s distance to its cluster centroid) for each cluster by finding the maximum absolute centroid dimensional movement.
Using the same explanation example above, we can access cluster_centers_ from the fitted sklearn.cluster.KMeans model, i.e., the final cluster centroids’ positions, and then show the feature names (dimensions d):
Plotting the clusters along with their positions would yield the following:
Now let’s take each centroid and sort it along the dimensions axis, remembering to take the absolute value if you have features with negative values. After that, we can get the first centroid’s dimensions with sorted weights (the distance the centroid traveled along the feature plane) and map them to feature names, which gives us the hypothetical feature importances below:
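A short sketch of the idea just described, reusing the fitted km model above; this is my own illustration of the ranking step, not the author’s exact implementation:

```python
feature_names = np.array(df.columns)

# For each cluster, rank dimensions by the centroid's absolute coordinate,
# interpreted here as how far the centroid travelled along that feature.
for j, centroid in enumerate(km.cluster_centers_):
    order = np.argsort(-np.abs(centroid))  # descending by absolute value
    print(f"Cluster {j}:", list(zip(feature_names[order], np.abs(centroid)[order])))
```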
Since the centroid has moved an equal distance along both dimensions, the features are of equal importance. Let’s give the last two samples (belonging to cluster 0) a nudge along the first dimension, repeat the same process, and then plot the new centroids.
We will now apply the same approach again and extract the feature importances. Notice that cluster 0’s centroid has moved along feature 1 much more than along feature 2, so feature 1 has had a higher impact on the WCSS minimization.
Approach 2: Unsupervised to Supervised
This approach is model-agnostic (not exclusive to K-Means): we convert the unsupervised clustering problem into a One-vs-All supervised classification problem using an easily interpretable classifier such as a tree-based model. The steps are as follows:
- Change the cluster labels into One-vs-All binary labels for each cluster
- Train a classifier to discriminate between each cluster and all other clusters
- Extract the feature importances from the model (we will be using sklearn.ensemble.RandomForestClassifier)
Following step 1, let us set the cluster 0 label to 1 and set all other cluster labels to 0:
We have converted the problem into a binary classification problem. What is left is to train a classifier and use its feature_importances_ attribute, implemented in scikit-learn, to get the features that have the most discriminatory power between the targeted cluster and all other clusters. We also need to map them to their feature names, sorted by the weights.
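A minimal sketch of the whole conversion for cluster 0, reusing the toy example above (the classifier hyperparameters are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier

# One-vs-All: cluster 0 becomes the positive class, all other clusters negative.
binary_labels = (km.labels_ == 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(df, binary_labels)

# Features with the most discriminatory power for cluster 0 vs. the rest,
# mapped to their names and sorted by weight.
order = np.argsort(-clf.feature_importances_)
print(list(zip(feature_names[order], clf.feature_importances_[order])))
```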
One important note is that this approach finds what discriminates one cluster from the others rather than what is inherent to the targeted cluster. Furthermore, there are many other sophisticated methods to extract the most important features from tree-based models, as well as model-agnostic approaches, which you can try.
I have chosen to apply the interpretation techniques to an NLP problem since we can easily relate to the feature importances (English words). This can also be considered a group-based keyword extraction technique, where we aim to cluster similar documents together using K-Means and then apply the techniques above. The dataset I will be using can be found here: Kaggle BBC-News, which presents a classification problem. We will exclude the category column at first (Sport, Business, Politics, Entertainment, and Technology news articles) and use it later as a proof of concept.
The distribution of news categories in the dataset is:
Since this article is not NLP-specific (Natural Language Processing), I won’t go intensively through any NLP-related tasks, so let’s quickly move toward preprocessing the text to prepare the features. The code below normalizes the words and filters out tokens (words and digits) that do not provide any discriminatory power and are thus redundant. It then counts the most frequently mentioned words in the whole dataset and plots them (you can skip the code):
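The original preprocessing code is embedded in the article and not reproduced here; the following is only a rough, hypothetical stand-in that assumes the news data sits in a CSV with a text column named "text" (file and column names are assumptions):

```python
import re
import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

news = pd.read_csv("bbc-news.csv")  # hypothetical file name

def preprocess(doc: str) -> str:
    # Lowercase, strip digits and punctuation, and drop English stop words
    # and very short tokens that carry little discriminatory power.
    doc = re.sub(r"[^a-z\s]", " ", doc.lower())
    tokens = [t for t in doc.split() if t not in ENGLISH_STOP_WORDS and len(t) > 2]
    return " ".join(tokens)

news["processed"] = news["text"].apply(preprocess)
```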
Using TF-IDF (a text representation technique), we can convert the categorical variables (words) into numeric representations. We do not need to scale the features here, as TF-IDF normalizes features within its equation and its output should be used in its raw form. Remember to scale the features if you apply K-Means to a dataset with features of different units or ranges where this difference is not relevant to the problem.
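A minimal sketch of the TF-IDF step with scikit-learn, continuing the hypothetical news DataFrame above (the vocabulary cap is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)       # vocabulary cap is an assumption
tfidf = vectorizer.fit_transform(news["processed"])   # sparse (n_docs, n_terms) matrix
feature_names = vectorizer.get_feature_names_out()    # words, in column order
```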
Finally, now comes the step we care about the most. I have wrapped up the methods above in a class that inherits from sklearn.cluster.KMeans and can be used in the same way, with the difference of having a feature_importances_ property added. When initializing the modified class, you will have to provide the feature names, ordered in the same way as the X parameter of the fit method, to the ordered_feature_names parameter.
You can find the code here: kmeans-feature-importance. Simply clone it like this in your favorite CLI, or follow along by accessing the Colab example in the repository:
git clone https://github.com/YousefGh/kmeans-feature-importance.git
And then run the modified KMeans class with k = the number of news dataset categories, so that we can compare the results later with the actual categories.
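A hypothetical usage sketch follows; the import path, class name, and exact signature should be checked against the repository rather than taken from this snippet:

```python
# Import path and signature are assumptions; see the kmeans-feature-importance repo.
from kmeans_interp.kmeans_feature_imp import KMeansInterp

kms = KMeansInterp(
    n_clusters=5,                               # number of news categories
    ordered_feature_names=list(feature_names),  # same order as the columns of X
    feature_importance_method="wcss_min",       # or "unsup2sup"
).fit(tfidf.toarray())
```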
Let’s check if K-Means has produced a cluster distribution similar to the category distribution in the news dataset.
That’s a close-enough category distribution similarity to get us going. Don’t forget to ensure that K-Means has produced accurate results by using different internal validation indexes (I won’t go through them, as that would be out of scope). Also, you probably will not have true labels, so you will need to choose the best K in K-Means if K is not known from the problem’s domain knowledge.
Moving on to interpretation, we can access the feature_importances_ for the second cluster, cluster 1, like this (the example is for WCSS Minimizers):
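A sketch of that access, assuming feature_importances_ maps each cluster index to its (feature, weight) pairs sorted by weight:

```python
# Top 10 most important features (words) for cluster 1 under WCSS Minimizers;
# the exact structure of feature_importances_ is an assumption, see the repo.
print(kms.feature_importances_[1][:10])
```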
Approaches Comparison
We will compare both the WCSS Minimizers method and the Unsupervised-to-Supervised problem conversion method using the feature_importance_method parameter of the KMeansInterp class. The flow will be as follows:
- Plot the categories distribution for comparison, with unique colors
- Set the feature_importance_method parameter to wcss_min and plot the feature importances
- Set the feature_importance_method parameter to unsup2sup and plot the feature importances
- Infer the category of each cluster using its most important features
WCSS Minimizers
Unsupervised to Supervised
Clustering interpretability becomes crucial when true labels are not available at development time. Their absence not only prevents data scientists from directly evaluating clustering validity, due to the nature of internal validation indexes, but also obstructs a simple and intuitive explanation of cluster performance to stakeholders. We have presented two possible approaches that aim to tackle this by extracting cluster-based feature importances, which let us know why the K-Means algorithm has shaped each cluster the way it has. The approach extends itself to stakeholder communication, simple and intuitive evaluation, cluster-based keyword extraction in NLP, and a general feature selection technique.
The notebook for this article and the KMeansInterp class, along with a direct usage example on Colab, can be found here. Happy Interpretation!
References:
- Y. Liu, Z. Li, H. Xiong, X. Gao and J. Wu, “Understanding of Internal Clustering Validation Measures,” 2010 IEEE International Conference on Data Mining, 2010, pp. 911–916, doi: 10.1109/ICDM.2010.35.
- Kriegel, HP., Schubert, E. & Zimek, A. The (black) art of runtime evaluation: Are we comparing algorithms or implementations?. Knowl Inf Syst 52, 341–378 (2017). https://doi.org/10.1007/s10115-016-1004-2
- Ng, A., & Piech, C. (2021). CS221. Retrieved 18 July 2021, from https://stanford.edu/~cpiech/cs221/handouts/kmeans.html
- Ismaili, Oumaima, Lemaire, Vincent & Cornuéjols, Antoine (2014). A Supervised Methodology to Measure the Variables Contribution to a Clustering. 159–166. doi: 10.1007/978-3-319-12637-1_20.
- “2.3. Clustering — scikit-learn 0.24.2 documentation”, 2021