Definitive Guide to K-Means Clustering with Scikit-Learn (2024)

Introduction

K-Means clustering is one of the most widely used unsupervised machine learning algorithms. It forms clusters of data based on the similarity between data instances.

In this guide, we will first take a look at a simple example to understand how the K-Means algorithm works before implementing it using Scikit-Learn. Then, we'll discuss how to determine the number of clusters (Ks) in K-Means, and also cover distance metrics, variance, and K-Means pros and cons.

Motivation

Imagine the following situation. One day, when walking around the neighborhood, you noticed there were 10 convenience stores and started to wonder which stores were similar - closer to each other in proximity. While searching for ways to answer that question, you've come across an interesting approach that divides the stores into groups based on their coordinates on a map.

For instance, if one store was located 5 km West and 3 km North - you'd assign (5, 3) coordinates to it, and represent it in a graph. Let's plot this first point to visualize what's happening:

import matplotlib.pyplot as plt

plt.title("Store With Coordinates (5, 3)")
plt.scatter(x=5, y=3)

This is just the first point, so we can get an idea of how we can represent a store. Say we already have 10 coordinates to the 10 stores collected. After organizing them in a numpy array, we can also plot their locations:

import numpy as np

points = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 45],
                   [85, 70], [71, 80], [60, 78], [55, 52], [80, 91]])

xs = points[:, 0]  # Selects all xs from the array
ys = points[:, 1]  # Selects all ys from the array

plt.title("10 Stores Coordinates")
plt.scatter(x=xs, y=ys)

How to Manually Implement K-Means Algorithm

Now we can look at the 10 stores on a graph, and the main problem is to find out whether they can be divided into different groups based on proximity. Just by taking a quick look at the graph, we'll probably notice two groups of stores - one is the lower points to the bottom-left, and the other one is the upper-right points. Perhaps we can even treat the two points in the middle as a separate group - therefore creating three different groups.

In this section, we'll go over the process of manually clustering points - dividing them into the given number of groups. That way, we'll essentially carefully go over all steps of the K-Means clustering algorithm. By the end of this section, you'll gain both an intuitive and practical understanding of all steps performed during the K-Means clustering. After that, we'll delegate it to Scikit-Learn.

What would be the best way of determining if there are two or three groups of points? One simple way would be to simply choose one number of groups - for instance, two - and then try to group points based on that choice.

Let's say we have decided there are two groups of our stores (points). Now, we need to find a way to understand which points belong to which group. This could be done by choosing one point to represent group 1 and one to represent group 2. Those points will be used as a reference when measuring the distance from all other points to each group.

Say point (5, 3) represents group 1, and point (79, 60) represents group 2. To assign a new point (6, 3) to a group, we need to measure its distance to those two points. The point (6, 3) is closer to (5, 3), so it belongs to the group represented by that point - group 1. This way, we can easily assign all points to their corresponding groups.
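As a minimal sketch of this nearest-reference rule, the snippet below measures the straight-line distance from the new point to each reference point and picks the closer one. The variable names (`references`, `new_point`, `closest`) are only illustrative and are not used elsewhere in this guide.

```python
import numpy as np

# Reference points for group 1 and group 2, taken from the example above
references = np.array([[5, 3], [79, 60]])
new_point = np.array([6, 3])

# Straight-line distance from the new point to each reference point
distances = np.sqrt(((references - new_point) ** 2).sum(axis=1))

# Index 0 corresponds to group 1, index 1 to group 2
closest = int(np.argmin(distances))
print(f"distances: {distances}, assigned to group {closest + 1}")
```

The new point is 1 unit away from (5, 3) and far from (79, 60), so it lands in group 1.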

In this example, besides determining the number of groups (clusters) - we are also choosing some points to be a reference of distance for new points of each group.

That is the general idea to understand similarities between our stores. Let's put it into practice - we can first choose the two reference points at random. The reference point of group 1 will be (5, 3) and the reference point of group 2 will be (10, 15). We can select both points of our numpy array by [0] and [1] indexes and store them in g1 (group 1) and g2 (group 2) variables:

g1 = points[0]
g2 = points[1]

After doing this, we need to calculate the distance from all other points to those reference points. This raises an important question - how to measure that distance. We can essentially use any distance measure, but, for the purpose of this guide, let's use the Euclidean distance.

Advice: If you want to learn more about Euclidean distance, you can read our "Calculating Euclidean Distances with NumPy" guide.

It can be useful to know that Euclidean distance measure is based on Pythagoras' theorem:

$$
c^2 = a^2 + b^2
$$

When adapted to points in a plane - (a1, b1) and (a2, b2), the previous formula becomes:

$$
c^2 = (a_2 - a_1)^2 + (b_2 - b_1)^2
$$

The distance is c, the square root of c^2, so we can also write the formula as:

$$
euclidean_{dist} = \sqrt{(a_2 - a_1)^2 + (b_2 - b_1)^2}
$$

Note: You can also generalize the Euclidean distance formula for multi-dimensional points. For example, in a three-dimensional space, points have three coordinates - our formula reflects that in the following way:

$$
euclidean_{dist} = \sqrt{(a_2 - a_1)^2 + (b_2 - b_1)^2 + (c_2 - c_1)^2}
$$

The same principle is followed no matter the number of dimensions of the space we are operating in.
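To make the formula concrete, here is a small sketch of a distance function that works for any number of dimensions. The helper name `euclidean_distance` is an assumption introduced for illustration; it is not part of NumPy or Scikit-Learn.

```python
import numpy as np

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Sum the squared coordinate differences, then take the square root
    return np.sqrt(np.sum((q - p) ** 2))

d2 = euclidean_distance([5, 3], [10, 15])      # two dimensions
d3 = euclidean_distance([0, 0, 0], [1, 2, 2])  # three dimensions
print(d2, d3)
```

For the two store coordinates (5, 3) and (10, 15), the differences are 5 and 12, so the distance is 13. NumPy's `np.linalg.norm(q - p)` computes the same quantity.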

So far, we have picked the points to represent groups, and we know how to calculate distances. Now, let's put the distances and groups together by assigning each of our collected store points to a group.

To better visualize that, we will declare three lists. The first one to store points of the first group - points_in_g1. The second one to store points from the group 2 - points_in_g2, and the last one - group, to label the points as either 1 (belongs to group 1) or 2 (belongs to group 2):

points_in_g1 = []
points_in_g2 = []
group = []

We can now iterate through our points and calculate the Euclidean distance between them and each of our group references. Each point will be closer to one of two groups - based on which group is closest, we'll assign each point to the corresponding list, while also adding 1 or 2 to the group list:

for p in points:
    x1, y1 = p[0], p[1]
    # Distance from the current point to each group reference
    euclidean_distance_g1 = np.sqrt((g1[0] - x1)**2 + (g1[1] - y1)**2)
    euclidean_distance_g2 = np.sqrt((g2[0] - x1)**2 + (g2[1] - y1)**2)
    if euclidean_distance_g1 < euclidean_distance_g2:
        points_in_g1.append(p)
        group.append(1)
    else:
        points_in_g2.append(p)
        group.append(2)

Let's look at the results of this iteration to see what happened:

print(f'points_in_g1:{points_in_g1}\n\npoints_in_g2:{points_in_g2}\n\ngroup:{group}')

Which results in:

points_in_g1:[array([5, 3])]

points_in_g2:[array([10, 15]), array([15, 12]), array([24, 10]), array([30, 45]), array([85, 70]), array([71, 80]), array([60, 78]), array([55, 52]), array([80, 91])]

group:[1, 2, 2, 2, 2, 2, 2, 2, 2, 2]

We can also plot the clustering result, with different colors based on the assigned groups, using Seaborn's scatterplot() with the group as a hue argument:

import seaborn as sns

sns.scatterplot(x=points[:, 0], y=points[:, 1], hue=group)

It's clearly visible that only our first point is assigned to group 1, and all other points were assigned to group 2. That result differs from what we had envisioned in the beginning. Considering the difference between our results and our initial expectations - is there a way we could change that? It seems there is!

One approach is to repeat the process and choose different points to be the references of the groups. This will change our results, hopefully, more in line with what we've envisioned in the beginning. This second time, we could choose them not at random as we previously did, but by getting a mean of all our already grouped points. That way, those new points could be positioned in the middle of corresponding groups.

For instance, if the second group had only the points (10, 15) and (30, 45), the new central point would be ((10 + 30)/2, (15 + 45)/2) - which is equal to (20, 30).
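The same calculation can be sketched with NumPy by averaging along the rows of an array. The variable names here (`group_points`, `center`) are illustrative only.

```python
import numpy as np

# The two points from the example above
group_points = np.array([[10, 15], [30, 45]])

# axis=0 averages each column: all xs together, all ys together
center = group_points.mean(axis=0)
print(center)
```

Using `axis=0` gives the mean x and the mean y in one call, which is the new central point (20, 30).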

Since we have put our results in lists, we can convert them first to numpy arrays, select their xs, ys and then obtain the mean:

g1_center = [np.array(points_in_g1)[:, 0].mean(), np.array(points_in_g1)[:, 1].mean()]
g2_center = [np.array(points_in_g2)[:, 0].mean(), np.array(points_in_g2)[:, 1].mean()]
g1_center, g2_center

Advice: Try to use NumPy and NumPy arrays as much as possible. They are optimized for better performance and simplify many linear algebra operations. Whenever you are trying to solve some linear algebra problem, you should definitely take a look at the NumPy documentation to check if there is any NumPy method designed to solve your problem. Chances are that there is!

To help repeat the process with our new center points, let's transform our previous code into a function, execute it and see if there were any changes in how the points are grouped:

def assigns_points_to_two_groups(g1_center, g2_center):
    points_in_g1 = []
    points_in_g2 = []
    group = []

    for p in points:
        x1, y1 = p[0], p[1]
        euclidean_distance_g1 = np.sqrt((g1_center[0] - x1)**2 + (g1_center[1] - y1)**2)
        euclidean_distance_g2 = np.sqrt((g2_center[0] - x1)**2 + (g2_center[1] - y1)**2)
        if euclidean_distance_g1 < euclidean_distance_g2:
            points_in_g1.append(p)
            group.append(1)
        else:
            points_in_g2.append(p)
            group.append(2)
    return points_in_g1, points_in_g2, group

Note: If you notice you keep repeating the same code over and over again, you should wrap that code into a separate function. It is considered a best practice to organize code into functions, especially because they facilitate testing. It is easier to test an isolated piece of code than a full code without any functions.

Let's call the function and store its results in points_in_g1, points_in_g2, and group variables:

points_in_g1, points_in_g2, group = assigns_points_to_two_groups(g1_center, g2_center)
points_in_g1, points_in_g2, group

And also plot the scatter plot with the colored points to visualize the groups division:

sns.scatterplot(x=points[:, 0], y=points[:, 1], hue=group)

It seems the clustering of our points is getting better. But still, there are two points in the middle of the graph that could be assigned to either group when considering their proximity to both groups. The algorithm we've developed so far assigns both of those points to the second group.

This means we can probably repeat the process once more by taking the means of the Xs and Ys, creating two new central points (centroids) to our groups and re-assigning them based on distance.

Let's also create a function to update the centroids. The whole process now can be reduced to multiple calls of that function:

def updates_centroids(points_in_g1, points_in_g2):
    g1_center = np.array(points_in_g1)[:, 0].mean(), np.array(points_in_g1)[:, 1].mean()
    g2_center = np.array(points_in_g2)[:, 0].mean(), np.array(points_in_g2)[:, 1].mean()
    return g1_center, g2_center

g1_center, g2_center = updates_centroids(points_in_g1, points_in_g2)
points_in_g1, points_in_g2, group = assigns_points_to_two_groups(g1_center, g2_center)
sns.scatterplot(x=points[:, 0], y=points[:, 1], hue=group)

Notice that after this third iteration, the two points in the middle now belong to different clusters: (30, 45) has moved to the first group, while (55, 52) stays in the second. It seems the results are getting better - let's do it once again. Now going to the fourth iteration of our method:

g1_center, g2_center = updates_centroids(points_in_g1, points_in_g2)
points_in_g1, points_in_g2, group = assigns_points_to_two_groups(g1_center, g2_center)
sns.scatterplot(x=points[:, 0], y=points[:, 1], hue=group)

This fourth time we got the same result as the previous one. So it seems our points won't change groups anymore. Our result has reached a stable state - it has converged. Besides that, we have exactly the same result as we had envisioned for the 2 groups. We can also check whether this division makes sense.

Let's just quickly recap what we've done so far. We've divided our 10 stores geographically into two sections - ones in the lower southwest regions and others in the northeast. It can be interesting to gather more data besides what we already have - revenue, the daily number of customers, and many more. That way we can conduct a richer analysis and possibly generate more interesting results.

Clustering studies like this can be conducted when an already established brand wants to pick an area to open a new store. In that case, there are many more variables taken into consideration besides location.

What Does All This Have To Do With K-Means Algorithm?

While following these steps you might have wondered what they have to do with the K-Means algorithm. The process we've conducted so far is the K-Means algorithm. In short, we've determined the number of groups/clusters, randomly chosen initial points, and updated centroids in each iteration until clusters converged. We've basically performed the entire algorithm by hand - carefully conducting each step.
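The steps we performed by hand can be condensed into one loop. The following is a minimal sketch rather than a production implementation: the function name `k_means` and its parameters are assumptions made for illustration, it takes the initial centroids as input instead of drawing them at random, and it assumes no cluster ever becomes empty (which holds for our store data).

```python
import numpy as np

def k_means(points, initial_centroids, max_iterations=100):
    """Assign points to the nearest centroid, update centroids, repeat until stable."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(initial_centroids, dtype=float)
    labels = None
    for _ in range(max_iterations):
        # Distance from every point to every centroid, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Converged: no point changed its cluster since the last iteration
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it
        centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, centroids

points = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 45],
                   [85, 70], [71, 80], [60, 78], [55, 52], [80, 91]])

# Same starting references as in the manual walk-through: the first two points
labels, centroids = k_means(points, initial_centroids=points[:2])
print(labels)
print(centroids)
```

Starting from the same two reference points, this sketch ends with the first five stores in one cluster and the last five in the other, with centroids at (16.8, 17.0) and (70.2, 74.2). Note that the labels here are 0 and 1 rather than 1 and 2.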

The K in K-Means comes from the number of clusters that need to be set prior to starting the iteration process. In our case K = 2. This characteristic is sometimes seen as negative considering there are other clustering methods, such as Hierarchical Clustering, which don't need to have a fixed number of clusters beforehand.

Due to its use of means, K-means also becomes sensitive to outliers and extreme values - they enhance the variability and make it harder for our centroids to play their part. So, be conscious of the need to perform extreme values and outlier analysis before conducting a clustering using the K-Means algorithm.
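A quick sketch shows how strongly a single extreme value can pull a centroid. The outlier (200, 200) below is a hypothetical point added purely for illustration; it is not part of our store data.

```python
import numpy as np

# Three nearby stores from the lower-left group
cluster = np.array([[10, 15], [15, 12], [24, 10]])
center_without_outlier = cluster.mean(axis=0)

# Hypothetical extreme value, added only to illustrate the effect
with_outlier = np.vstack([cluster, [200, 200]])
center_with_outlier = with_outlier.mean(axis=0)

print(center_without_outlier)
print(center_with_outlier)
```

With the outlier included, the centroid moves from roughly (16.3, 12.3) to (62.25, 59.25), far away from the three original points it is supposed to represent.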

Also, notice that our points were segmented in straight parts, there aren't curves when creating the clusters. That can also be a disadvantage of the K-Means algorithm.

Note: When you need the clustering to be more flexible and adaptable to ellipses and other shapes, try using a Gaussian Mixture Model, which can be seen as a generalization of K-Means. This model can adapt to elliptical clusters.
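As a rough sketch of what that could look like, Scikit-Learn provides `GaussianMixture` in `sklearn.mixture`. The choice of two components and `random_state=42` below is an assumption made to mirror the rest of this guide, and ten points is a very small sample for this kind of model.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

points = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 45],
                   [85, 70], [71, 80], [60, 78], [55, 52], [80, 91]])

# Two components, analogous to K=2 in K-Means
gmm = GaussianMixture(n_components=2, random_state=42)
gmm.fit(points)

labels = gmm.predict(points)               # hard assignment per point
probabilities = gmm.predict_proba(points)  # soft assignment: one probability per component
print(labels)
```

Unlike K-Means, the model also returns a membership probability for every component, which can be useful for points that sit between clusters.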

K-Means also has many advantages! It performs well on large datasets, which can become difficult to handle if you are using some types of hierarchical clustering algorithms. It is also guaranteed to converge (although possibly to a local optimum rather than the best possible solution), and can easily generalize and adapt. Besides that, it is probably the most used clustering algorithm.

Now that we've gone over all the steps performed in the K-Means algorithm, and understood all its pros and cons, we can finally implement K-Means using the Scikit-Learn library.

How to Implement K-Means Algorithm Using Scikit-Learn

To double check our result, let's do this process again, but now using 3 lines of code with sklearn:

from sklearn.cluster import KMeans

# The random_state needs to be the same number to get reproducible results
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(points)
kmeans.labels_

Here, the labels describe the same grouping as before, although Scikit-Learn numbers the clusters starting from 0 rather than 1. Let's just quickly plot the result:

sns.scatterplot(x = points[:,0], y = points[:,1], hue=kmeans.labels_)

The resulting plot is the same as the one from the previous section.
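Once a model is fitted, it can also be inspected and reused. The sketch below reads the final centroids from `cluster_centers_` and assigns a new store with `predict()`. The new store at (6, 3) and the explicit `n_init=10` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 45],
                   [85, 70], [71, 80], [60, 78], [55, 52], [80, 91]])

# n_init=10 runs the algorithm from 10 different initializations and keeps the best
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(points)

centers = kmeans.cluster_centers_  # final centroids, one row per cluster
new_store = np.array([[6, 3]])     # hypothetical new store
new_label = kmeans.predict(new_store)[0]

print(centers)
print(new_label)
```

The centroids should match the ones we reached by hand, (16.8, 17.0) and (70.2, 74.2), though the order of the rows depends on how Scikit-Learn happened to number the clusters.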

Note: Just looking at how we've performed the K-Means algorithm using Scikit-Learn might give you the impression that this is a no-brainer and that you don't need to worry too much about it. Just 3 lines of code perform all the steps we've discussed in the previous section when we've gone over the K-Means algorithm step-by-step. But, the devil is in the details in this case! If you don't understand all the steps and limitations of the algorithm, you'll most likely face the situation where the K-Means algorithm gives you results you were not expecting.

With Scikit-Learn, you can also initialize K-Means for faster convergence by setting the init='k-means++' argument, which is the default in Scikit-Learn. In broader terms, K-Means++ chooses only the first cluster center uniformly at random from the data points. Then, each subsequent cluster center is chosen from the remaining data points with a probability proportional to its squared distance from the nearest center already chosen. This spreads the initial centers out, which tends to speed up convergence and is helpful when dealing with very large datasets.
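The seeding idea can be sketched in a few lines of NumPy. This is a simplified illustration of K-Means++ seeding, not Scikit-Learn's actual implementation (which differs in its details); the function name `k_means_pp_init` and the `seed` parameter are assumptions introduced here.

```python
import numpy as np

def k_means_pp_init(points, k, seed=0):
    """Pick k initial centers: the first uniformly, the rest weighted by squared distance."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # First center: chosen uniformly at random from the data points
    centers = [points[rng.integers(len(points))]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest already-chosen center
        diffs = points[:, None, :] - np.array(centers)[None, :, :]
        sq_dist = (diffs ** 2).sum(axis=2).min(axis=1)
        # Far-away points are more likely to be picked; chosen points have probability 0
        probabilities = sq_dist / sq_dist.sum()
        centers.append(points[rng.choice(len(points), p=probabilities)])
    return np.array(centers)

points = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 45],
                   [85, 70], [71, 80], [60, 78], [55, 52], [80, 91]])
initial_centers = k_means_pp_init(points, k=3)
print(initial_centers)
```

Because a point that is already a center has zero distance to itself, it can never be selected twice, so the initial centers are always distinct data points.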

Advice: You can learn more about K-Means++ details by reading the "K-Means++: The Advantages of Careful Seeding" paper, proposed in 2007 by David Arthur and Sergei Vassilvitskii.

The Elbow Method - Choosing the Best Number of Groups

So far, so good! We've clustered 10 stores based on the Euclidean distance between points and centroids. But what about those two points in the middle of the graph that are a little harder to cluster? Couldn't they form a separate group as well? Did we actually make a mistake by choosing K=2 groups? Maybe we actually had K=3 groups? We could even have more than three groups and not be aware of it.

The question being asked here is how to determine the number of groups (K) in K-Means. To answer that question, we need to understand if there would be a "better" cluster for a different value of K.

The naive way of finding that out is by clustering points with different values of K, so, for K=2, K=3, K=4, and so on:

for number_of_clusters in range(1, 11):
    kmeans = KMeans(n_clusters = number_of_clusters, random_state = 42)
    kmeans.fit(points)

But, clustering points for different Ks alone won't be enough to understand if we've chosen the ideal value for K. We need a way to evaluate the clustering quality for each K we've chosen.

Manually Calculating the Within Cluster Sum of Squares (WCSS)

Here is the ideal place to introduce a measure of how close our clustered points are to each other. It essentially describes how much variance we have inside a single cluster. This measure is called Within Cluster Sum of Squares, or WCSS for short. The smaller the WCSS is, the closer our points are to their centroid, and therefore the more well-formed the cluster is. The WCSS formula can be used for any number of clusters:

$$
WCSS = \sum(Pi_1 - Centroid_1)^2 + \cdots + \sum(Pi_n - Centroid_n)^2
$$

Note: In this guide, we are using the Euclidean distance to measure how far points are from the centroids, but other distance measures, such as Manhattan, could also be used.

Now we can assume we've opted to have two clusters and try to implement the WCSS to understand better what the WCSS is and how to use it. As the formula states, we need to sum up the squared differences between all cluster points and centroids. So, if our first point from the first group is (5, 3) and our last centroid (after convergence) of the first group is (16.8, 17.0), the WCSS will be:

$$
WCSS = \sum((5,3) - (16.8, 17.0))^2
$$

$$
WCSS = (5-16.8)^2 + (3-17.0)^2
$$

$$
WCSS = (-11.8)^2 + (-14.0)^2
$$

$$
WCSS = 139.24 + 196.00
$$

$$
WCSS = 335.24
$$

This example illustrates how we calculate the WCSS for one point from the cluster. But the cluster usually contains more than one point, and we need to take all of them into consideration when calculating the WCSS. We'll do that by defining a function that receives a cluster of points and its centroid, and returns the sum of squares:

def sum_of_squares(cluster, centroid):
    squares = []
    for p in cluster:
        squares.append((p - centroid)**2)
    ss = np.array(squares).sum()
    return ss

Now we can get the sum of squares for each cluster:

g1 = sum_of_squares(points_in_g1, g1_center)
g2 = sum_of_squares(points_in_g2, g2_center)

And sum up the results to obtain the total WCSS:

g1 + g2

This results in:

2964.3999999999996

So, in our case, when K is equal to 2, the total WCSS is approximately 2964.40. Now, we can switch Ks and calculate the WCSS for all of them. That way, we can get an insight into what K we should choose to make our clustering perform the best.

Calculating WCSS Using Scikit-Learn

Fortunately, we don't need to manually calculate the WCSS for each K. After performing the K-Means clustering for the given number of clusters, we can obtain its WCSS by using the inertia_ attribute. Now, we can go back to our K-Means for loop, use it to switch the number of clusters, and list corresponding WCSS values:

wcss = []
for number_of_clusters in range(1, 11):
    kmeans = KMeans(n_clusters = number_of_clusters, random_state = 42)
    kmeans.fit(points)
    wcss.append(kmeans.inertia_)
wcss

Notice that the second value in the list is exactly the same as the one we've calculated before for K=2:

[18272.9, # For k=1
 2964.3999999999996, # For k=2
 1198.75, # For k=3
 861.75,
 570.5,
 337.5,
 175.83333333333334,
 79.5,
 17.0,
 0.0]

To visualize those results, let's plot our Ks along with the WCSS values:

ks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
plt.plot(ks, wcss)

The line drops sharply up to x = 2, bends again at x = 3, and flattens out afterwards. Notice that it reminds us of the shape of an elbow. By plotting the Ks along with the WCSS, we are using the Elbow Method to choose the number of clusters. The chosen K is the last elbow point, after which adding more clusters reduces the WCSS only slightly, so, in our case, it would be 3 instead of 2:

ks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
plt.plot(ks, wcss)
plt.axvline(3, linestyle='--', color='r')
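Reading the elbow off a plot is a judgment call. One rough way to put numbers behind it, offered here only as a heuristic sketch and not as a standard rule, is to look at how much of the initial WCSS each additional cluster removes. The WCSS values are copied (rounded) from the output listed earlier.

```python
import numpy as np

# WCSS values for K = 1..10, copied (rounded) from the output listed earlier
wcss = np.array([18272.9, 2964.4, 1198.75, 861.75, 570.5,
                 337.5, 175.83, 79.5, 17.0, 0.0])
ks = np.arange(1, len(wcss) + 1)

# How much WCSS each additional cluster removes, as a share of the K=1 value
drops = -np.diff(wcss)
share_of_total = drops / wcss[0]

for k, share in zip(ks[1:], share_of_total):
    print(f"K={k}: removes {share:.1%} of the initial WCSS")
```

Going from one to two clusters removes about 84% of the initial WCSS and the third cluster about 10% more, while every cluster after that removes less than 2%. That is consistent with placing the elbow at K=3.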

We can run the K-Means cluster algorithm again, to see how our data would look like with three clusters:

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(points)
sns.scatterplot(x = points[:,0], y = points[:,1], hue=kmeans.labels_)

We were already happy with two clusters, but according to the elbow method, three clusters would be a better fit for our data. In this case, we would have three kinds of stores instead of two. Before using the elbow method, we thought about southwest and northeast clusters of stores, now we also have stores in the center. Maybe that could be a good location to open another store since it would have less competition nearby.

Alternative Cluster Quality Measures

There are also other measures that can be used when evaluating cluster quality:

  • Silhouette Score - analyzes not only the distance between intra-cluster points but also between clusters themselves
  • Between Clusters Sum of Squares (BCSS) - metric complementary to the WCSS
  • Sum of Squares Error (SSE)
  • Maximum Radius - measures the largest distance from a point to its centroid
  • Average Radius - the sum of the largest distance from a point to its centroid divided by the number of clusters.

It's recommended to experiment and get to know each of them since depending on the problem, some of the alternatives can be more applicable than the most widely used metrics (WCSS and Silhouette Score).
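To illustrate how BCSS complements WCSS, the sketch below computes both for the K=2 grouping of our stores. It relies on the fact that, when centroids are cluster means, the total sum of squares around the overall mean splits into WCSS plus BCSS. The variable names (`tss`, `bcss`) are illustrative.

```python
import numpy as np

points = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 45],
                   [85, 70], [71, 80], [60, 78], [55, 52], [80, 91]])
# K=2 grouping found earlier: first five stores vs. last five stores
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

overall_mean = points.mean(axis=0)
tss = ((points - overall_mean) ** 2).sum()  # total sum of squares

wcss = 0.0
bcss = 0.0
for k in np.unique(labels):
    members = points[labels == k]
    centroid = members.mean(axis=0)
    # Spread of the points around their own centroid
    wcss += ((members - centroid) ** 2).sum()
    # Spread of the centroid around the overall mean, weighted by cluster size
    bcss += len(members) * ((centroid - overall_mean) ** 2).sum()

print(wcss, bcss, tss)
```

The WCSS comes out at about 2964.4, the same value we calculated for K=2, and the BCSS at about 15308.5. Together they add up to the total of 18272.9, which is also the WCSS for K=1.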

In the end, as with many data science algorithms, we want to reduce the variance inside each cluster and maximize the variance between different clusters, so that we have more defined and separable clusters.

Applying K-Means on Another Dataset

Let's use what we have learned on another dataset. This time, we will try to find groups of similar wines.

Note: You can download the dataset here.

We begin by importing pandas to read the wine-clustering CSV (Comma-Separated Values) file into a Dataframe structure:

import pandas as pd

df = pd.read_csv('wine-clustering.csv')

After loading it, let's take a peek at the first five records of data with the head() method:

df.head()

This results in:

   Alcohol  Malic_Acid   Ash  Ash_Alcanity  Magnesium  Total_Phenols  Flavonoids  Nonflavanoid_Phenols  Proanthocyanidins  Color_Intensity   Hue  OD280  Proline
0    14.23        1.71  2.43          15.6        127           2.80        3.06                  0.28               2.29             5.64  1.04   3.92     1065
1    13.20        1.78  2.14          11.2        100           2.65        2.76                  0.26               1.28             4.38  1.05   3.40     1050
2    13.16        2.36  2.67          18.6        101           2.80        3.24                  0.30               2.81             5.68  1.03   3.17     1185
3    14.37        1.95  2.50          16.8        113           3.85        3.49                  0.24               2.18             7.80  0.86   3.45     1480
4    13.24        2.59  2.87          21.0        118           2.80        2.69                  0.39               1.82             4.32  1.04   2.93      735

We have many measurements of substances present in wines. Here, we also won't need to transform categorical columns because all of them are numerical. Now, let's take a look at the descriptive statistics with the describe() method:

df.describe().T # T is for transposing the table

The describe table:

                      count        mean         std     min       25%      50%       75%      max
Alcohol               178.0   13.000618    0.811827   11.03   12.3625   13.050   13.6775    14.83
Malic_Acid            178.0    2.336348    1.117146    0.74    1.6025    1.865    3.0825     5.80
Ash                   178.0    2.366517    0.274344    1.36    2.2100    2.360    2.5575     3.23
Ash_Alcanity          178.0   19.494944    3.339564   10.60   17.2000   19.500   21.5000    30.00
Magnesium             178.0   99.741573   14.282484   70.00   88.0000   98.000  107.0000   162.00
Total_Phenols         178.0    2.295112    0.625851    0.98    1.7425    2.355    2.8000     3.88
Flavonoids            178.0    2.029270    0.998859    0.34    1.2050    2.135    2.8750     5.08
Nonflavanoid_Phenols  178.0    0.361854    0.124453    0.13    0.2700    0.340    0.4375     0.66
Proanthocyanidins     178.0    1.590899    0.572359    0.41    1.2500    1.555    1.9500     3.58
Color_Intensity       178.0    5.058090    2.318286    1.28    3.2200    4.690    6.2000    13.00
Hue                   178.0    0.957449    0.228572    0.48    0.7825    0.965    1.1200     1.71
OD280                 178.0    2.611685    0.709990    1.27    1.9375    2.780    3.1700     4.00
Proline               178.0  746.893258  314.907474  278.00   500.500  673.500  985.0000  1680.00

By looking at the table it is clear that the columns differ a lot in scale and variability - for instance, Proline has a standard deviation of about 315, while for Hue it is only about 0.23. Now we can check if there are any null, or NaN values in our dataset:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Alcohol               178 non-null    float64
 1   Malic_Acid            178 non-null    float64
 2   Ash                   178 non-null    float64
 3   Ash_Alcanity          178 non-null    float64
 4   Magnesium             178 non-null    int64
 5   Total_Phenols         178 non-null    float64
 6   Flavonoids            178 non-null    float64
 7   Nonflavanoid_Phenols  178 non-null    float64
 8   Proanthocyanidins     178 non-null    float64
 9   Color_Intensity       178 non-null    float64
 10  Hue                   178 non-null    float64
 11  OD280                 178 non-null    float64
 12  Proline               178 non-null    int64
dtypes: float64(11), int64(2)
memory usage: 18.2 KB

There's no need to drop or impute data, considering there aren't empty values in the dataset. We can use a Seaborn pairplot() to see the data distribution and to check if the dataset forms pairs of columns that can be interesting for clustering:

sns.pairplot(df)

By looking at the pair plot, two columns seem promising for clustering purposes - Alcohol and OD280 (which is a method for determining the protein concentration in wines). It seems that there are 3 distinct clusters on plots combining two of them.

There are other columns that seem to be in correlation as well. Most notably Alcohol and Total_Phenols, and Alcohol and Flavonoids. They have great linear relationships that can be observed in the pair plot.

Since our focus is clustering with K-Means, let's choose one pair of columns, say Alcohol and OD280, and test the elbow method for this dataset.

Note: When using more columns of the dataset, there will be a need for either plotting in 3 dimensions or reducing the data to principal components (use of PCA). This is a valid, and more common approach, just make sure to choose the principal components based on how much they explain and keep in mind that when reducing the data dimensions, there is some information loss - so the plot is an approximation of the real data, not how it really is.

Let's plot the scatter plot with those two columns set to be its axis to take a closer look at the points we want to divide into groups:

sns.scatterplot(data=df, x='OD280', y='Alcohol')

Now we can define our columns and use the elbow method to determine the number of clusters. We will also initialize the algorithm with k-means++ just to make sure it converges more quickly:

values = df[['OD280', 'Alcohol']]

wcss_wine = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(values)
    wcss_wine.append(kmeans.inertia_)

We have calculated the WCSS, so we can plot the results:

clusters_wine = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
plt.plot(clusters_wine, wcss_wine)
plt.axvline(3, linestyle='--', color='r')

According to the elbow method we should have 3 clusters here. For the final step, let's cluster our points into 3 clusters and plot those clusters identified by colors:

kmeans_wine = KMeans(n_clusters=3, random_state=42)
kmeans_wine.fit(values)
sns.scatterplot(x = values['OD280'], y = values['Alcohol'], hue=kmeans_wine.labels_)

We can see clusters 0, 1, and 2 in the graph. Based on our analysis, group 0 has wines with higher protein content and lower alcohol, group 1 has wines with higher alcohol content and low protein, and group 2 has both high protein and high alcohol in its wines.

This is a very interesting dataset and I encourage you to go further into the analysis by clustering the data after normalization and PCA - also by interpreting the results and finding new connections.
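As a starting point for that exercise, here is a small NumPy-only sketch of normalization followed by PCA. To stay self-contained it uses only the Alcohol, OD280, and Proline values of the first five wines shown by head() earlier, so the numbers are purely illustrative. With the full dataset you would apply the same steps to the whole DataFrame, for example with Scikit-Learn's StandardScaler and PCA classes.

```python
import numpy as np

# Alcohol, OD280 and Proline for the first five wines, from the head() output earlier
X = np.array([[14.23, 3.92, 1065],
              [13.20, 3.40, 1050],
              [13.16, 3.17, 1185],
              [14.37, 3.45, 1480],
              [13.24, 2.93, 735]], dtype=float)

# Normalization: zero mean and unit standard deviation per column,
# so that Proline's large values don't dominate the distances
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA through the singular value decomposition of the standardized data
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained_variance_ratio = S ** 2 / np.sum(S ** 2)

# Project every wine onto the first two principal components
components_2d = Z @ Vt[:2].T

print(explained_variance_ratio)
print(components_2d.shape)
```

The explained variance ratio tells you how much information each principal component keeps, which helps decide how many components to pass on to K-Means.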

Conclusion

K-Means clustering is a simple yet very effective unsupervised machine learning algorithm for data clustering. It clusters data based on the Euclidean distance between data points. K-Means clustering algorithm has many uses for grouping text documents, images, videos, and much more.

Article information

Author: Nicola Considine CPA
