One of the most common clustering algorithms in machine learning is known as k-means clustering.
K-means clustering is a technique in which we place each observation in a dataset into one of K clusters.
The end goal is to have K clusters in which the observations within each cluster are quite similar to each other while the observations in different clusters are quite different from each other.
In practice, we use the following steps to perform K-means clustering:
1. Choose a value for K.
- First, we must decide how many clusters we’d like to identify in the data. Often we have to simply test several different values for K and analyze the results to see which number of clusters seems to make the most sense for a given problem.
2. Randomly assign each observation to an initial cluster, from 1 to K.
3. Perform the following procedure until the cluster assignments stop changing.
- For each of the K clusters, compute the cluster centroid. This is simply the vector of the p feature means for the observations in the kth cluster.
- Assign each observation to the cluster whose centroid is closest. Here, closest is defined using Euclidean distance.
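The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation that scikit-learn uses: the function name kmeans_sketch, its seed and max_iter parameters, the toy data, and the handling of empty clusters (re-seeding with a random observation) are all assumptions made for this example, and clusters are numbered from 0 rather than from 1.

```python
import numpy as np

def kmeans_sketch(X, k, seed=0, max_iter=100):
    """Illustrative k-means following the steps above (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Step 2: randomly assign each observation to one of the k clusters
    labels = rng.integers(0, k, size=n)

    for _ in range(max_iter):
        # Step 3a: centroid = vector of feature means for each cluster
        centroids = np.empty((k, X.shape[1]))
        for j in range(k):
            members = X[labels == j]
            # Assumption: an empty cluster is re-seeded with a random observation
            centroids[j] = members.mean(axis=0) if len(members) else X[rng.integers(n)]

        # Step 3b: reassign each observation to the nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)

        # Stop once the cluster assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

    return labels, centroids

# Toy data: two well-separated groups of three points each
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

On data this clearly separated, the first three points should end up in one cluster and the last three in the other. Which group receives label 0 depends on the random initial assignment.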
The following step-by-step example shows how to perform k-means clustering in Python by using the KMeans class from the sklearn.cluster module of scikit-learn.
Step 1: Import Necessary Modules
First, we’ll import all of the modules that we will need to perform k-means clustering:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
Step 2: Create the DataFrame
Next, we’ll create a DataFrame that contains the following three variables for 20 different basketball players:
- points
- assists
- rebounds
The following code shows how to create this pandas DataFrame:
#create DataFrame
df = pd.DataFrame({'points': [18, np.nan, 19, 14, 14, 11, 20, 28, 30, 31,
                              35, 33, 29, 25, 25, 27, 29, 30, 19, 23],
                   'assists': [3, 3, 4, 5, 4, 7, 8, 7, 6, 9,
                               12, 14, np.nan, 9, 4, 3, 4, 12, 15, 11],
                   'rebounds': [15, 14, 14, 10, 8, 14, 13, 9, 5, 4,
                                11, 6, 5, 5, 3, 8, 12, 7, 6, 5]})

#view first five rows of DataFrame
print(df.head())

   points  assists  rebounds
0    18.0      3.0        15
1     NaN      3.0        14
2    19.0      4.0        14
3    14.0      5.0        10
4    14.0      4.0         8
We will use k-means clustering to group together players that are similar based on these three metrics.
Step 3: Clean & Prep the DataFrame
Next, we’ll perform the following steps:
- Use dropna() to drop rows with NaN values in any column
- Use StandardScaler() to scale each variable to have a mean of 0 and a standard deviation of 1
The following code shows how to do so:
#drop rows with NA values in any column
df = df.dropna()

#create scaled DataFrame where each variable has mean of 0 and standard dev of 1
scaled_df = StandardScaler().fit_transform(df)

#view first five rows of scaled DataFrame
print(scaled_df[:5])

[[-0.86660275 -1.22683918  1.72722524]
 [-0.72081911 -0.96077767  1.45687694]
 [-1.44973731 -0.69471616  0.37548375]
 [-1.44973731 -0.96077767 -0.16521285]
 [-1.88708823 -0.16259314  1.45687694]]
Note: We use scaling so that each variable has equal importance when fitting the k-means algorithm. Otherwise, the variables with the widest ranges would have too much influence.
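To make the scaling step concrete, the sketch below standardizes a small sample by hand and compares the result with StandardScaler. The name df_demo and the choice of five rows (the first five complete rows of the data above) are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small illustrative sample: the first five complete rows of the tutorial's data
df_demo = pd.DataFrame({
    "points": [18.0, 19.0, 14.0, 14.0, 11.0],
    "assists": [3.0, 4.0, 5.0, 4.0, 7.0],
    "rebounds": [15, 14, 10, 8, 14],
})

# Manual z-score: subtract the column mean, divide by the population standard deviation
manual = (df_demo - df_demo.mean()) / df_demo.std(ddof=0)

# StandardScaler performs the same calculation and returns a NumPy array
scaled = StandardScaler().fit_transform(df_demo)

# Check that both approaches agree, then inspect the column means and standard deviations
print(np.allclose(manual.to_numpy(), scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Note that StandardScaler divides by the population standard deviation (ddof=0), whereas the pandas std() method defaults to the sample standard deviation (ddof=1), which is why ddof=0 is passed explicitly here.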
Step 4: Find the Optimal Number of Clusters
To perform k-means clustering in Python, we can use the KMeans class from the sklearn.cluster module.
This class uses the following basic syntax:
KMeans(init='random', n_clusters=8, n_init=10, random_state=None)
where:
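- init: Controls the initialization technique. With 'random', the initial centroids are observations chosen at random from the data; the scikit-learn default is 'k-means++'.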
- n_clusters: The number of clusters to place observations in.
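- n_init: The number of initializations to perform. With a value of 10, the k-means algorithm is run 10 times and the run with the lowest SSE is returned. Recent versions of scikit-learn default to 'auto', which also performs 10 runs when init='random'.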
- random_state: An integer value you can pick to make the results of the algorithm reproducible.
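As a small illustration of the random_state argument, the sketch below fits the same model twice with identical settings and checks that the results match. The toy data and the names X, fit_a and fit_b are hypothetical and are not part of the basketball example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three loose groups in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(20, 2)),
    rng.normal(loc=(4, 4), scale=0.5, size=(20, 2)),
    rng.normal(loc=(0, 4), scale=0.5, size=(20, 2)),
])

# Two fits with the same settings and the same random_state
fit_a = KMeans(init="random", n_clusters=3, n_init=10, random_state=1).fit(X)
fit_b = KMeans(init="random", n_clusters=3, n_init=10, random_state=1).fit(X)

# Same seed, same data: the cluster assignments and SSE should agree
print(np.array_equal(fit_a.labels_, fit_b.labels_))
print(np.isclose(fit_a.inertia_, fit_b.inertia_))
```

Leaving random_state=None would instead draw a different initialization on each run, so cluster numbering (and occasionally the clusters themselves) could differ between runs.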
The most important argument is n_clusters, which specifies how many clusters to place the observations in.
However, we usually don’t know beforehand how many clusters is optimal, so we can create a plot that displays the SSE (sum of squared errors) of the model for each number of clusters.
Typically when we create this type of plot we look for an “elbow” where the sum of squares begins to “bend” or level off. This is typically the optimal number of clusters.
The following code shows how to create this type of plot that displays the number of clusters on the x-axis and the SSE on the y-axis:
#initialize kmeans parameters
kmeans_kwargs = {
    "init": "random",
    "n_init": 10,
    "random_state": 1,
}

#create list to hold SSE values for each k
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(scaled_df)
    sse.append(kmeans.inertia_)

#visualize results
plt.plot(range(1, 11), sse)
plt.xticks(range(1, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.show()
In this plot it appears that there is an elbow or “bend” at k = 3 clusters.
Thus, we will use 3 clusters when fitting our k-means clustering model in the next step.
Note: In practice, it’s recommended to combine this plot with domain expertise when choosing how many clusters to use.
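To clarify what the SSE on the y-axis measures, the sketch below computes it by hand and compares it with the inertia_ attribute of a fitted model. The toy data and the names X, assigned_centroids and manual_sse are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two loose groups in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(15, 2)),
    rng.normal(loc=(3, 3), scale=0.5, size=(15, 2)),
])

kmeans = KMeans(init="random", n_clusters=2, n_init=10, random_state=1).fit(X)

# Look up the centroid assigned to each observation,
# then sum the squared distances between observations and their centroids
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_sse = ((X - assigned_centroids) ** 2).sum()

# The two values should agree up to floating-point rounding
print(manual_sse, kmeans.inertia_)
```

Because adding clusters can only bring observations closer to their nearest centroid, the SSE tends to keep falling as k grows, which is why we look for the point where the decrease levels off rather than for the minimum.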
Step 5: Perform K-Means Clustering with Optimal K
The following code shows how to perform k-means clustering on the dataset using the optimal value for k of 3:
#instantiate the k-means class, using optimal number of clusters
kmeans = KMeans(init="random", n_clusters=3, n_init=10, random_state=1)

#fit k-means algorithm to data
kmeans.fit(scaled_df)

#view cluster assignments for each observation
kmeans.labels_

array([1, 1, 1, 1, 1, 1, 2, 2, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0])
The resulting array shows the cluster assignments for each observation in the DataFrame.
To make these results easier to interpret, we can add a column to the DataFrame that shows the cluster assignment of each player:
#append cluster assignments to original DataFrame
df['cluster'] = kmeans.labels_

#view updated DataFrame
print(df)

    points  assists  rebounds  cluster
0     18.0      3.0        15        1
2     19.0      4.0        14        1
3     14.0      5.0        10        1
4     14.0      4.0         8        1
5     11.0      7.0        14        1
6     20.0      8.0        13        1
7     28.0      7.0         9        2
8     30.0      6.0         5        2
9     31.0      9.0         4        0
10    35.0     12.0        11        0
11    33.0     14.0         6        0
13    25.0      9.0         5        0
14    25.0      4.0         3        2
15    27.0      3.0         8        2
16    29.0      4.0        12        2
17    30.0     12.0         7        0
18    19.0     15.0         6        0
19    23.0     11.0         5        0
The cluster column contains the cluster number (0, 1, or 2) that each player was assigned to.
Players that belong to the same cluster have roughly similar values for the points, assists, and rebounds columns.
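One way to check this is to average each metric within each cluster. The sketch below reruns the steps from this tutorial and then summarizes the clusters; the names labeled and summary and the n_players column are introduced here for illustration, and the exact cluster numbering may differ between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Recreate the tutorial's data and drop rows with missing values
df = pd.DataFrame({
    "points": [18, np.nan, 19, 14, 14, 11, 20, 28, 30, 31,
               35, 33, 29, 25, 25, 27, 29, 30, 19, 23],
    "assists": [3, 3, 4, 5, 4, 7, 8, 7, 6, 9,
                12, 14, np.nan, 9, 4, 3, 4, 12, 15, 11],
    "rebounds": [15, 14, 14, 10, 8, 14, 13, 9, 5, 4,
                 11, 6, 5, 5, 3, 8, 12, 7, 6, 5],
}).dropna()

# Scale the variables and fit k-means with the same settings as above
scaled_df = StandardScaler().fit_transform(df)
kmeans = KMeans(init="random", n_clusters=3, n_init=10, random_state=1).fit(scaled_df)

# Attach the cluster labels without modifying df in place
labeled = df.assign(cluster=kmeans.labels_)

# Mean of each metric per cluster, plus the number of players in each cluster
summary = labeled.groupby("cluster")[["points", "assists", "rebounds"]].mean().round(1)
summary["n_players"] = labeled.groupby("cluster").size()
print(summary)
```

Reading the summary row by row gives a quick profile of each group, for example whether a cluster is characterized by high rebounds and low points or the reverse.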
Note: You can find the complete documentation for the KMeans class in the official scikit-learn documentation.