K-Means Clustering in Python: Step-by-Step Example (2024)

One of the most common clustering algorithms in machine learning is known as k-means clustering.

K-means clustering is a technique in which we place each observation in a dataset into one of K clusters.

The end goal is to have K clusters in which the observations within each cluster are quite similar to each other while the observations in different clusters are quite different from each other.

In practice, we use the following steps to perform K-means clustering:

1. Choose a value for K.

  • First, we must decide how many clusters we’d like to identify in the data. Often we have to simply test several different values for K and analyze the results to see which number of clusters seems to make the most sense for a given problem.

2. Randomly assign each observation to an initial cluster, from 1 to K.

3. Perform the following procedure until the cluster assignments stop changing.

  • For each of the K clusters, compute the cluster centroid. This is simply the vector of the p feature means for the observations in the kth cluster.
  • Assign each observation to the cluster whose centroid is closest. Here, closest is defined using Euclidean distance.
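
The steps above can be sketched directly in NumPy. This is a minimal illustration, not how scikit-learn implements KMeans. The function name kmeans_sketch, its parameters, the toy data, and the handling of an empty cluster (re-seeding it with a random observation) are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, seed=0, max_iter=100):
    """Illustrative k-means that follows the steps listed above."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly assign each observation to one of k clusters
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Step 3a: compute each cluster's centroid (vector of feature means)
        # Assumption: an empty cluster is re-seeded with a random observation
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j)
            else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # Step 3b: reassign each observation to the closest centroid
        # (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # stop once the cluster assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# hypothetical toy data: two small groups of points
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

Because the starting assignment is random, which cluster receives the label 0 or 1 depends on the seed.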

The following step-by-step example shows how to perform k-means clustering in Python by using the KMeans class from the sklearn module.

Step 1: Import Necessary Modules

First, we’ll import all of the modules that we will need to perform k-means clustering:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Step 2: Create the DataFrame

Next, we’ll create a DataFrame that contains the following three variables for 20 different basketball players:

  • points
  • assists
  • rebounds

The following code shows how to create this pandas DataFrame:

#create DataFrame
df = pd.DataFrame({'points': [18, np.nan, 19, 14, 14, 11, 20, 28, 30, 31,
                              35, 33, 29, 25, 25, 27, 29, 30, 19, 23],
                   'assists': [3, 3, 4, 5, 4, 7, 8, 7, 6, 9,
                               12, 14, np.nan, 9, 4, 3, 4, 12, 15, 11],
                   'rebounds': [15, 14, 14, 10, 8, 14, 13, 9, 5, 4,
                                11, 6, 5, 5, 3, 8, 12, 7, 6, 5]})

#view first five rows of DataFrame
print(df.head())

   points  assists  rebounds
0    18.0      3.0        15
1     NaN      3.0        14
2    19.0      4.0        14
3    14.0      5.0        10
4    14.0      4.0         8

We will use k-means clustering to group together players that are similar based on these three metrics.

Step 3: Clean & Prep the DataFrame

Next, we’ll perform the following steps:

  • Use dropna() to drop rows with NaN values in any column
  • Use StandardScaler() to scale each variable to have a mean of 0 and a standard deviation of 1

The following code shows how to do so:

#drop rows with NA values in any columns
df = df.dropna()

#create scaled DataFrame where each variable has mean of 0 and standard dev of 1
scaled_df = StandardScaler().fit_transform(df)

#view first five rows of scaled DataFrame
print(scaled_df[:5])

[[-0.86660275 -1.22683918  1.72722524]
 [-0.72081911 -0.96077767  1.45687694]
 [-1.44973731 -0.69471616  0.37548375]
 [-1.44973731 -0.96077767 -0.16521285]
 [-1.88708823 -0.16259314  1.45687694]]

Note: We use scaling so that each variable has equal importance when fitting the k-means algorithm. Otherwise, the variables with the widest ranges would have too much influence.
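
To make the scaling step concrete, the sketch below reproduces what StandardScaler does by hand: it subtracts each column's mean and divides by the column's population standard deviation. The toy array is hypothetical; its second column is deliberately on a much larger scale than the first.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical toy data: the second column has a much wider range
X = np.array([[18.0, 3000.0],
              [19.0, 4500.0],
              [14.0, 5200.0],
              [11.0, 7100.0]])

# manual standardization: (value - column mean) / column standard deviation
# X.std() uses the population formula (ddof=0) by default
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))  # True
print(scaled.mean(axis=0))          # approximately 0 for each column
print(scaled.std(axis=0))           # approximately 1 for each column
```

After scaling, both columns contribute on the same footing to the Euclidean distances that k-means relies on.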

Step 4: Find the Optimal Number of Clusters

To perform k-means clustering in Python, we can use the KMeans class from the sklearn module.

Its constructor uses the following basic syntax:

KMeans(init='random', n_clusters=8, n_init=10, random_state=None)

where:

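  • init: Controls the initialization technique. 'random' picks K observations at random as the initial centroids; the scikit-learn default is 'k-means++'.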
  • n_clusters: The number of clusters to place observations in.
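  • n_init: The number of initializations to perform. With n_init=10, the k-means algorithm runs 10 times from different starting centroids and returns the run with the lowest SSE. In older scikit-learn releases 10 is the default; since version 1.4 the default is 'auto'.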
  • random_state: An integer value you can pick to make the results of the algorithm reproducible.
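
As a small illustration of the random_state argument, the sketch below fits the same model twice with the same seed and checks that the cluster assignments agree. The randomly generated toy data and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical toy data: 30 random points in two dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

# fitting twice with the same random_state reuses the same initializations
fit_a = KMeans(init="random", n_clusters=3, n_init=10, random_state=1).fit(X)
fit_b = KMeans(init="random", n_clusters=3, n_init=10, random_state=1).fit(X)

# compare the cluster assignments of the two fits
print(np.array_equal(fit_a.labels_, fit_b.labels_))
```

With random_state=None, repeated fits may start from different centroids and can number or even form the clusters differently.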

The most important argument in this function is n_clusters, which specifies how many clusters to place the observations in.

However, we don’t know beforehand how many clusters are optimal, so one common approach is to create a plot that displays the number of clusters along with the SSE (sum of squared errors) of the model.

Typically when we create this type of plot we look for an “elbow” where the SSE begins to “bend” or level off. The number of clusters at this point is often a good choice.

The following code shows how to create this type of plot that displays the number of clusters on the x-axis and the SSE on the y-axis:

#initialize kmeans parameters
kmeans_kwargs = {
    "init": "random",
    "n_init": 10,
    "random_state": 1,
}

#create list to hold SSE values for each k
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(scaled_df)
    sse.append(kmeans.inertia_)

#visualize results
plt.plot(range(1, 11), sse)
plt.xticks(range(1, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.show()

[Figure: elbow plot showing SSE on the y-axis against the number of clusters on the x-axis]

In this plot it appears that there is an elbow or “bend” at k = 3 clusters.

Thus, we will use 3 clusters when fitting our k-means clustering model in the next step.

Note: In practice, it’s recommended to combine this plot with domain expertise when picking how many clusters to use.
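
The SSE used in the elbow plot is the value scikit-learn stores in the inertia_ attribute: the sum of squared distances from each observation to the centroid of its assigned cluster. The sketch below recomputes it by hand on hypothetical random data to show what the number measures. The data and variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical toy data: 40 random points with three features
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))

kmeans = KMeans(init="random", n_clusters=3, n_init=10, random_state=1).fit(X)

# look up the centroid of the cluster each observation was assigned to
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]

# sum of squared Euclidean distances to those centroids
manual_sse = np.sum((X - assigned_centroids) ** 2)

print(manual_sse, kmeans.inertia_)
```

The two printed values should agree up to floating-point rounding.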

Step 5: Perform K-Means Clustering with Optimal K

The following code shows how to perform k-means clustering on the dataset using the optimal value for k of 3:

#instantiate the k-means class, using optimal number of clusters
kmeans = KMeans(init="random", n_clusters=3, n_init=10, random_state=1)

#fit k-means algorithm to data
kmeans.fit(scaled_df)

#view cluster assignments for each observation
kmeans.labels_

array([1, 1, 1, 1, 1, 1, 2, 2, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0])

The resulting array shows the cluster assignments for each observation in the DataFrame.

To make these results easier to interpret, we can add a column to the DataFrame that shows the cluster assignment of each player:

#append cluster assignments to original DataFrame
df['cluster'] = kmeans.labels_

#view updated DataFrame
print(df)

    points  assists  rebounds  cluster
0     18.0      3.0        15        1
2     19.0      4.0        14        1
3     14.0      5.0        10        1
4     14.0      4.0         8        1
5     11.0      7.0        14        1
6     20.0      8.0        13        1
7     28.0      7.0         9        2
8     30.0      6.0         5        2
9     31.0      9.0         4        0
10    35.0     12.0        11        0
11    33.0     14.0         6        0
13    25.0      9.0         5        0
14    25.0      4.0         3        2
15    27.0      3.0         8        2
16    29.0      4.0        12        2
17    30.0     12.0         7        0
18    19.0     15.0         6        0
19    23.0     11.0         5        0

The cluster column contains a cluster number (0, 1, or 2) that each player was assigned to.

Players that belong to the same cluster have roughly similar values for the points, assists, and rebounds columns.
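
One way to check this similarity is to average each variable within each cluster. The sketch below does so with groupby on a six-row excerpt of the clustered DataFrame (rows 0, 2, 7, 8, 9 and 10 from the output above), rebuilt here so the example is self-contained. The variable name cluster_means is an assumption made for this example.

```python
import pandas as pd

# six rows taken from the clustered DataFrame shown above
df = pd.DataFrame({
    'points': [18.0, 19.0, 28.0, 30.0, 31.0, 35.0],
    'assists': [3.0, 4.0, 7.0, 6.0, 9.0, 12.0],
    'rebounds': [15, 14, 9, 5, 4, 11],
    'cluster': [1, 1, 2, 2, 0, 0],
})

# average of each variable within each cluster
cluster_means = df.groupby('cluster').mean()
print(cluster_means)
```

Running the same groupby on the full DataFrame gives a compact profile of each cluster, which makes it easier to describe what distinguishes one group of players from another.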

Note: The complete documentation for KMeans is available in the official scikit-learn documentation.

Additional Resources

The following tutorials explain how to perform other common tasks in Python:

How to Perform Linear Regression in Python
How to Perform Logistic Regression in Python
How to Perform K-Fold Cross Validation in Python

Article information

Author: Van Hayes
