## Introduction

*K-Means clustering* is one of the most widely used unsupervised machine learning algorithms. It forms clusters of data based on the similarity between data instances.

In this guide, we will first take a look at a simple example to understand how the K-Means algorithm works before implementing it using Scikit-Learn. Then, we'll discuss how to determine the number of clusters (Ks) in K-Means, and also cover distance metrics, variance, and K-Means pros and cons.

## Motivation

Imagine the following situation. One day, when walking around the neighborhood, you noticed there were 10 convenience stores and started to wonder which stores were similar - closer to each other in proximity. While searching for ways to answer that question, you've come across an interesting approach that divides the stores into groups based on their coordinates on a map.

For instance, if one store was located 5 km West and 3 km North - you'd assign `(5, 3)` coordinates to it, and represent it in a graph. Let's plot this first point to visualize what's happening:

```python
import matplotlib.pyplot as plt

plt.title("Store With Coordinates (5, 3)")
plt.scatter(x=5, y=3)
```

This is just the first point, so we can get an idea of how we can represent a store. Say we have already collected the coordinates of all 10 stores. After organizing them in a `numpy` array, we can also plot their locations:

```python
import numpy as np

points = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 45],
                   [85, 70], [71, 80], [60, 78], [55, 52], [80, 91]])

xs = points[:, 0]  # Selects all xs from the array
ys = points[:, 1]  # Selects all ys from the array

plt.title("10 Stores Coordinates")
plt.scatter(x=xs, y=ys)
```

## How to Manually Implement K-Means Algorithm

Now we can look at the 10 stores on a graph, and the main problem is to find out whether they could be divided into different groups based on proximity. Just by taking a quick look at the graph, we'll probably notice *two groups of stores* - one is the lower points to the bottom-left, and the other one is the upper-right points. Perhaps, we can even differentiate those two points in the middle as a separate group - therefore creating *three different groups*.

In this section, we'll go over the process of manually clustering points - dividing them into the given number of groups. That way, we'll essentially carefully go over all steps of the **K-Means clustering algorithm**. By the end of this section, you'll gain both an intuitive and practical understanding of all steps performed during the K-Means clustering. After that, we'll delegate it to Scikit-Learn.

What would be the best way of determining if there are two or three groups of points? One simple way would be to simply choose one number of groups - for instance, two - and then try to group points based on that choice.

Let's say we have decided there are *two groups* of our stores (points). Now, we need to find a way to understand which points belong to which group. This could be done by choosing one point to represent *group 1* and one to represent *group 2*. Those points will be used as a reference when measuring the distance from all other points to each group.

In that manner, say point `(5, 3)` ends up belonging to group 1, and point `(79, 60)` to group 2. When trying to assign a new point `(6, 3)` to groups, we need to measure its distance to those two points. The point `(6, 3)` is *closer* to `(5, 3)`, therefore it belongs to the group represented by that point - *group 1*. This way, we can easily group all points into corresponding groups.

In this example, besides determining the number of groups (clusters) - we are also choosing some points to be a *reference* of distance for new points of each group.

That is the general idea to understand similarities between our stores. Let's put it into practice - we can first choose the two reference points at *random*. The reference point of *group 1* will be `(5, 3)` and the reference point of *group 2* will be `(10, 15)`. We can select both points of our `numpy` array by `[0]` and `[1]` indexes and store them in `g1` (group 1) and `g2` (group 2) variables:

```python
g1 = points[0]
g2 = points[1]
```

After doing this, we need to calculate the distance from all other points to those reference points. This raises an important question - how to measure that distance. We can essentially use any distance measure, but, for the purpose of this guide, let's use the *Euclidean distance*.

**Advice:** If you want to learn more about Euclidean distance, you can read our "Calculating Euclidean Distances with NumPy" guide.

It can be useful to know that Euclidean distance measure is based on Pythagoras' theorem:

$$

c^2 = a^2 + b^2

$$

When adapted to points in a plane - `(a1, b1)` and `(a2, b2)`, the previous formula becomes:

$$

c^2 = (a2-a1)^2 + (b2-b1)^2

$$

The distance is `c` itself, that is, the square root of `c^2`, so we can also write the formula as:

$$

euclidean_{dist} = \sqrt{(a2 - a1)^2 + (b2 - b1)^2}

$$

**Note:** You can also generalize the Euclidean distance formula for multi-dimensional points. For example, in a three-dimensional space, points have three coordinates - our formula reflects that in the following way:

$$

euclidean_{dist} = \sqrt{(a2 - a1)^2 + (b2 - b1)^2 + (c2 - c1)^2}

$$

The same principle is followed no matter the number of dimensions of the space we are operating in.
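As a minimal sketch of that generalization, the helper below (the name `euclidean_distance` is our own, not something defined elsewhere in this guide) works for points with any number of coordinates. It is applied to the earlier `(6, 3)` example and to a pair of three-dimensional points:

```python
import numpy as np

def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Square the per-coordinate differences, add them up, take the square root
    return np.sqrt(np.sum((q - p) ** 2))

# The new point from the earlier example and the two reference points
new_point = (6, 3)
references = {"group 1": (5, 3), "group 2": (79, 60)}

distances = {name: euclidean_distance(new_point, ref) for name, ref in references.items()}
closest = min(distances, key=distances.get)
print(distances)
print(closest)

# The same function handles three coordinates without any change
print(euclidean_distance((0, 0, 0), (1, 2, 2)))
```

Because the function only relies on element-wise subtraction and a sum, nothing in it depends on the number of dimensions.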

So far, we have picked the points to represent groups, and we know how to calculate distances. Now, let's put the distances and groups together by assigning each of our collected store points to a group.

To better visualize that, we will declare three lists. The first one to store points of the first group - `points_in_g1`. The second one to store points from the group 2 - `points_in_g2`, and the last one - `group`, to *label* the points as either `1` (belongs to group 1) or `2` (belongs to group 2):

```python
points_in_g1 = []
points_in_g2 = []
group = []
```

We can now iterate through our points and calculate the Euclidean distance between them and each of our group references. Each point will be *closer* to one of two groups - based on which group is closest, we'll assign each point to the corresponding list, while also adding `1` or `2` to the `group` list:

```python
for p in points:
    x1, y1 = p[0], p[1]
    euclidean_distance_g1 = np.sqrt((g1[0] - x1)**2 + (g1[1] - y1)**2)
    euclidean_distance_g2 = np.sqrt((g2[0] - x1)**2 + (g2[1] - y1)**2)
    if euclidean_distance_g1 < euclidean_distance_g2:
        points_in_g1.append(p)
        group.append(1)
    else:
        points_in_g2.append(p)
        group.append(2)
```

Let's look at the results of this iteration to see what happened:

```python
print(f'points_in_g1:{points_in_g1}\npoints_in_g2:{points_in_g2}\ngroup:{group}')
```

Which results in:

```
points_in_g1:[array([5, 3])]
points_in_g2:[array([10, 15]), array([15, 12]), array([24, 10]), array([30, 45]), array([85, 70]), array([71, 80]), array([60, 78]), array([55, 52]), array([80, 91])]
group:[1, 2, 2, 2, 2, 2, 2, 2, 2, 2]
```

We can also plot the clustering result, with different colors based on the assigned groups, using Seaborn's `scatterplot()` with the `group` as a `hue` argument:

```python
import seaborn as sns

sns.scatterplot(x=points[:, 0], y=points[:, 1], hue=group)
```

It's clearly visible that only our first point is assigned to group 1, and all other points were assigned to group 2. That result differs from what we had envisioned in the beginning. Considering the difference between our results and our initial expectations - is there a way we could change that? It seems there is!

One approach is to repeat the process and choose different points to be the references of the groups. This will change our results, hopefully, more in line with what we've envisioned in the beginning. This second time, we could choose them not at random as we previously did, but by getting a *mean* of all our already grouped points. That way, those new points could be positioned in the middle of corresponding groups.

For instance, if the second group had only the points `(10, 15)` and `(30, 45)`, the new *central* point would be `((10 + 30)/2, (15 + 45)/2)` - which is equal to `(20, 30)`.

Since we have put our results in lists, we can convert them first to `numpy` arrays, select their xs, ys and then obtain the *mean*:

```python
g1_center = [np.array(points_in_g1)[:, 0].mean(), np.array(points_in_g1)[:, 1].mean()]
g2_center = [np.array(points_in_g2)[:, 0].mean(), np.array(points_in_g2)[:, 1].mean()]
g1_center, g2_center
```
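The per-column means can also be obtained in a single call. The sketch below assumes the points of one group are stacked in a two-dimensional array (the variable name `group_points` is only illustrative) and reproduces the `(20, 30)` example from the text:

```python
import numpy as np

# The two example points of the hypothetical second group
group_points = np.array([[10, 15], [30, 45]])

# axis=0 averages down the rows, giving one mean per coordinate (x, y)
centroid = group_points.mean(axis=0)
print(centroid)
```

Using `axis=0` keeps working unchanged if the points later gain a third or fourth coordinate.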

**Advice:** Try to use `numpy` and NumPy arrays as much as possible. They are optimized for better performance and simplify many linear algebra operations. Whenever you are trying to solve some linear algebra problem, you should definitely take a look at the `numpy` documentation to check if there is any `numpy` method designed to solve your problem. The chance is that there is!

To help repeat the process with our new center points, let's transform our previous code into a function, execute it and see if there were any changes in how the points are grouped:

```python
def assigns_points_to_two_groups(g1_center, g2_center):
    points_in_g1 = []
    points_in_g2 = []
    group = []

    for p in points:
        x1, y1 = p[0], p[1]
        euclidean_distance_g1 = np.sqrt((g1_center[0] - x1)**2 + (g1_center[1] - y1)**2)
        euclidean_distance_g2 = np.sqrt((g2_center[0] - x1)**2 + (g2_center[1] - y1)**2)
        if euclidean_distance_g1 < euclidean_distance_g2:
            points_in_g1.append(p)
            group.append(1)
        else:
            points_in_g2.append(p)
            group.append(2)
    return points_in_g1, points_in_g2, group
```

**Note:** If you notice you keep repeating the same code over and over again, you should wrap that code into a separate function. It is considered a best practice to organize code into functions, especially because they facilitate testing. It is easier to test an isolated piece of code than a full code without any functions.

Let's call the function and store its results in `points_in_g1`, `points_in_g2`, and `group` variables:

```python
points_in_g1, points_in_g2, group = assigns_points_to_two_groups(g1_center, g2_center)
points_in_g1, points_in_g2, group
```

And also plot the scatter plot with the colored points to visualize the groups division:

```python
sns.scatterplot(x=points[:, 0], y=points[:, 1], hue=group)
```

It seems the clustering of our points is *getting better*. But still, there are two points in the middle of the graph that could be assigned to either group when considering their proximity to both groups. The algorithm we've developed so far assigns both of those points to the second group.

This means we can probably repeat the process once more by taking the means of the Xs and Ys, creating two new central points *(centroids)* for our groups and re-assigning the points based on distance.

Let's also create a function to update the centroids. The whole process now can be reduced to multiple calls of that function:

```python
def updates_centroids(points_in_g1, points_in_g2):
    g1_center = np.array(points_in_g1)[:, 0].mean(), np.array(points_in_g1)[:, 1].mean()
    g2_center = np.array(points_in_g2)[:, 0].mean(), np.array(points_in_g2)[:, 1].mean()
    return g1_center, g2_center

g1_center, g2_center = updates_centroids(points_in_g1, points_in_g2)
points_in_g1, points_in_g2, group = assigns_points_to_two_groups(g1_center, g2_center)
sns.scatterplot(x=points[:, 0], y=points[:, 1], hue=group)
```

Notice that after this third iteration, the two points in the middle now belong to different clusters. It seems the results are getting better - let's do it once again. Now going to the *fourth iteration* of our method:

```python
g1_center, g2_center = updates_centroids(points_in_g1, points_in_g2)
points_in_g1, points_in_g2, group = assigns_points_to_two_groups(g1_center, g2_center)
sns.scatterplot(x=points[:, 0], y=points[:, 1], hue=group)
```

This fourth time we got *the same result* as the previous one. So it seems our points won't change groups anymore, our result has reached some kind of stability - it has got to an unchangeable state, or **converged**. Besides that, we have exactly the same result as we had envisioned for the 2 groups. We can also see if this reached division makes sense.

Let's just quickly recap what we've done so far. We've divided our 10 stores geographically into two sections - ones in the lower southwest regions and others in the northeast. It can be interesting to gather more data besides what we already have - revenue, the daily number of customers, and many more. That way we can conduct a richer analysis and possibly generate more interesting results.

Clustering studies like this can be conducted when an already established brand wants to pick an area to open a new store. In that case, there are many more variables taken into consideration besides location.

### What Does All This Have To Do With K-Means Algorithm?

While following these steps you might have wondered what they have to do with the K-Means algorithm. The process we've conducted so far is the **K-Means algorithm**. In short, we've determined the number of groups/clusters, randomly chosen initial points, and updated centroids in each iteration until clusters converged. We've basically performed the entire algorithm by hand - carefully conducting each step.
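The steps we performed by hand can be condensed into one loop. The following is only a sketch under our own assumptions: the function name `k_means`, the `max_iterations` cap, and the rule of keeping the old centroid when a cluster ends up empty are illustrative choices, not part of the walkthrough above or of Scikit-Learn's implementation. It uses the same 10 store points and the same two starting references:

```python
import numpy as np

def k_means(points, initial_centroids, max_iterations=100):
    """Assign points to the nearest centroid and update centroids until labels stop changing."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(initial_centroids, dtype=float)
    labels = None
    for _ in range(max_iterations):
        # distances[i, j] is the Euclidean distance from point i to centroid j
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Converged: no point changed its group since the previous iteration
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each centroid to the mean of its points (assumption: keep it if the cluster is empty)
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
    return labels, centroids

points = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 45],
                   [85, 70], [71, 80], [60, 78], [55, 52], [80, 91]])

labels, centroids = k_means(points, initial_centroids=points[:2])
print(labels)
print(centroids)
```

Here the groups are labeled `0` and `1` instead of `1` and `2`, and the loop stops as soon as an iteration leaves every label unchanged, which is the convergence we observed in the fourth iteration.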

The *K* in K-Means comes from the *number of clusters* that need to be set prior to starting the iteration process. In our case *K = 2*. This characteristic is sometimes seen as **negative** considering there are other clustering methods, such as Hierarchical Clustering, which don't need to have a fixed number of clusters beforehand.

Due to its use of means, K-means also becomes *sensitive to outliers and extreme values* - they enhance the variability and make it harder for our centroids to play their part. So, be conscious of the need to perform *extreme values and outlier analysis* before conducting a clustering using the K-Means algorithm.
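To get a feeling for that sensitivity, here is a small sketch that adds one made-up extreme point, `(200, 200)`, to the five stores of the lower-left group and compares the centroid before and after. The outlier is purely hypothetical and is not part of the store data:

```python
import numpy as np

# The five stores that ended up in the lower-left group
cluster = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 45]])

# Hypothetical extreme value, used only for illustration
outlier = np.array([[200, 200]])

centroid_without = cluster.mean(axis=0)
centroid_with = np.vstack([cluster, outlier]).mean(axis=0)

# How far the single extra point drags the centroid
shift = np.linalg.norm(centroid_with - centroid_without)
print(centroid_without, centroid_with, shift)
```

A single point is enough to pull the centroid far away from the stores it is supposed to represent, which is why checking for extreme values first is worthwhile.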

Also, notice that our points were segmented in straight parts, there aren't curves when creating the clusters. That can also be a disadvantage of the K-Means algorithm.

**Note:** When you need it to be more flexible and adaptable to ellipses and other shapes, try using a *Gaussian Mixture Model*, which can be seen as a generalization of K-Means. This model can adapt to elliptical cluster shapes.

K-Means also has many **advantages**! It performs well on *large datasets*, which can become difficult to handle if you are using some types of hierarchical clustering algorithms. It also *guarantees convergence* (although possibly to a local optimum rather than the best possible clustering), and can easily *generalize* and *adapt*. Besides that, it is probably the most used clustering algorithm.

Now that we've gone over all the steps performed in the K-Means algorithm, and understood all its pros and cons, we can finally implement K-Means using the Scikit-Learn library.

## How to Implement K-Means Algorithm Using *Scikit-Learn*

To double check our result, let's do this process again, but now using 3 lines of code with `sklearn`:

```python
from sklearn.cluster import KMeans

# The random_state needs to be the same number to get reproducible results
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(points)
kmeans.labels_
```

Here, the labels are the same as our previous groups. Let's just quickly plot the result:

```python
sns.scatterplot(x = points[:,0], y = points[:,1], hue=kmeans.labels_)
```

The resulting plot is the same as the one from the previous section.


**Note:** Just looking at how we've performed the K-Means algorithm using Scikit-Learn might give you the impression that this is a no-brainer and that you don't need to worry too much about it. Just 3 lines of code perform all the steps we've discussed in the previous section when we've gone over the K-Means algorithm step-by-step. But, *the devil is in the details* in this case! If you don't understand all the steps and limitations of the algorithm, you'll most likely face the situation where the K-Means algorithm gives you results you were not expecting.

With Scikit-Learn, you can also initialize K-Means for faster convergence by setting the `init='k-means++'` argument (this is already the default value of `init`). In broader terms, *K-Means++* chooses only the *first* cluster center uniformly at random from the data points. Each subsequent center is then sampled from the remaining data points with a probability proportional to its squared distance from the closest center chosen so far, so the initial centers tend to be spread out. Better-spread starting points usually mean fewer iterations until convergence, which is helpful when dealing with very large datasets.

**Advice:** You can learn more about *K-Means++* details by reading the "K-Means++: The Advantages of Careful Seeding" paper, published in 2007 by David Arthur and Sergei Vassilvitskii.
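To make the seeding idea more concrete, below is a rough sketch of K-Means++ style seeding written with NumPy. The function name `kmeans_plus_plus_seeds` and the `seed` parameter are our own, the sketch assumes the data points are distinct, and Scikit-Learn's actual implementation may differ in its details:

```python
import numpy as np

def kmeans_plus_plus_seeds(points, k, seed=0):
    """Pick k initial centers: the first uniformly, the rest weighted by squared distance."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # First center: chosen uniformly at random among the data points
    centers = [points[rng.integers(len(points))]]
    while len(centers) < k:
        # Squared distance from every point to its closest already-chosen center
        diffs = points[:, None, :] - np.array(centers)[None, :, :]
        squared_distances = (diffs ** 2).sum(axis=2).min(axis=1)
        # Far-away points get a proportionally higher chance of being picked
        probabilities = squared_distances / squared_distances.sum()
        centers.append(points[rng.choice(len(points), p=probabilities)])
    return np.array(centers)

points = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 45],
                   [85, 70], [71, 80], [60, 78], [55, 52], [80, 91]])

seeds = kmeans_plus_plus_seeds(points, k=3)
print(seeds)
```

A point that has already been chosen has a squared distance of zero to itself, so it cannot be picked a second time.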

## The Elbow Method - Choosing the Best Number of Groups

So far, so good! We've clustered 10 stores based on the Euclidean distance between points and centroids. But what about those two points in the middle of the graph that are a little harder to cluster? Couldn't they form a separate group as well? Did we actually make a mistake by choosing *K=2* groups? Maybe we actually had *K=3* groups? We could even have more than three groups and not be aware of it.

The question being asked here is *how to determine the number of groups (K) in K-Means*. To answer that question, we need to understand if there would be a "better" cluster for a different value of K.

The naive way of finding that out is by clustering points with different values of *K*, so, for *K=2, K=3, K=4, and so on*:

```python
for number_of_clusters in range(1, 11):
    kmeans = KMeans(n_clusters = number_of_clusters, random_state = 42)
    kmeans.fit(points)
```

But, clustering points for different *Ks* alone *won't be enough* to understand if we've chosen the ideal value for *K*. We need a way to evaluate the clustering quality for each *K* we've chosen.

### Manually Calculating the *Within Cluster Sum of Squares (WCSS)*

Here is the ideal place to introduce a measure of how much our clustered points are close to each other. It essentially describes how much *variance* we have inside a single cluster. This measure is called **Within Cluster Sum of Squares**, or *WCSS* for short. The smaller the WCSS is, the closer our points are, therefore we have a more well-formed cluster. The WCSS formula can be used for any number of clusters:

$$

WCSS = \sum(Pi_1 - Centroid_1)^2 + \cdots + \sum(Pi_n - Centroid_n)^2

$$

**Note:** In this guide, we are using the *Euclidean distance* to measure how far points are from the centroids, but other distance measures, such as Manhattan, could also be used.

Now we can assume we've opted to have two clusters and try to implement the WCSS to understand better what the WCSS is and how to use it. As the formula states, we need to sum up the squared differences between all cluster points and centroids. So, if our first point from the first group is `(5, 3)` and our last centroid (after convergence) of the first group is `(16.8, 17.0)`, the contribution of that point to the WCSS will be:

$$

WCSS = \sum((5,3) - (16.8, 17.0))^2

$$

$$

WCSS = (5-16.8)^2 + (3-17.0)^2

$$

$$

WCSS = (-11.8)^2 + (-14.0)^2

$$

$$

WCSS = 139.24 + 196.00

$$

$$

WCSS = 335.24

$$

This example illustrates how we calculate the WCSS for the one point from the cluster. But the cluster usually contains more than one point, and we need to take all of them into consideration when calculating the WCSS. We'll do that by defining a function that receives a cluster of points and centroids, and returns the sum of squares:

```python
def sum_of_squares(cluster, centroid):
    squares = []
    for p in cluster:
        squares.append((p - centroid)**2)
    ss = np.array(squares).sum()
    return ss
```

Now we can get the sum of squares for each cluster:

```python
g1 = sum_of_squares(points_in_g1, g1_center)
g2 = sum_of_squares(points_in_g2, g2_center)
```

And sum up the results to obtain the total *WCSS*:

```python
g1 + g2
```

This results in:

```
2964.3999999999996
```

So, in our case, when *K* is equal to 2, the total WCSS is approximately *2964.40*. Now, we can switch Ks and calculate the WCSS for all of them. That way, we can get an insight into what *K* we should choose to make our clustering perform the best.
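As a cross-check of the manual calculation, the same quantity can be computed without an explicit loop. This is a sketch under our own naming (`total_wcss`, and labels `0`/`1` for the two groups), using the converged groups and centroids from the walkthrough:

```python
import numpy as np

def total_wcss(points, labels, centroids):
    """Sum of squared differences between each point and the centroid of its own cluster."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.asarray(labels)
    # centroids[labels] lines up every point with the centroid it was assigned to
    return float(((points - centroids[labels]) ** 2).sum())

points = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 45],
                   [85, 70], [71, 80], [60, 78], [55, 52], [80, 91]])
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
centroids = np.array([[16.8, 17.0], [70.2, 74.2]])

# Contribution of the single point (5, 3), as in the worked example
single_point = total_wcss(points[:1], labels[:1], centroids)
# Total over both clusters
total = total_wcss(points, labels, centroids)
print(single_point, total)
```

The first value matches the 335.24 from the worked example, and the second matches the total obtained with `sum_of_squares`, up to floating point rounding.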

### Calculating *WCSS* Using *Scikit-Learn*

Fortunately, we don't need to manually calculate the WCSS for each *K*. After performing the K-Means clustering for the given number of clusters, we can obtain its WCSS by using the `inertia_` attribute. Now, we can go back to our K-Means `for` loop, use it to switch the number of clusters, and list corresponding WCSS values:

```python
wcss = []
for number_of_clusters in range(1, 11):
    kmeans = KMeans(n_clusters = number_of_clusters, random_state = 42)
    kmeans.fit(points)
    wcss.append(kmeans.inertia_)
wcss
```

Notice that the second value in the list is exactly the same as the one we calculated before for *K=2*:

```
[18272.9, # For k=1
 2964.3999999999996, # For k=2
 1198.75, # For k=3
 861.75,
 570.5,
 337.5,
 175.83333333333334,
 79.5,
 17.0,
 0.0]
```

To visualize those results, let's plot our *Ks* along with the WCSS values:

```python
ks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
plt.plot(ks, wcss)
```

There is a sharp bend in the plot at `x = 2`, where the line reaches a low point, and an even lower one at `x = 3`. Notice that it reminds us of the *shape of an elbow*. By plotting the Ks along with the WCSS, we are using the **Elbow Method** to choose the number of Ks. And the *chosen K is exactly the lowest elbow point*, so, it would be `3` instead of `2`, in our case:

```python
ks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
plt.plot(ks, wcss)
plt.axvline(3, linestyle='--', color='r')
```

We can run the K-Means clustering algorithm again, to see what our data would look like with *three clusters*:

```python
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(points)
sns.scatterplot(x = points[:,0], y = points[:,1], hue=kmeans.labels_)
```

We were already happy with two clusters, but according to the elbow method, three clusters would be a better fit for our data. In this case, we would have three kinds of stores instead of two. Before using the elbow method, we thought about southwest and northeast clusters of stores, now we also have stores in the center. Maybe that could be a good location to open another store since it would have less competition nearby.

### Alternative Cluster Quality Measures

There are also other measures that can be used when evaluating cluster quality:

- **Silhouette Score** - analyzes not only the distance between intra-cluster points but also between clusters themselves
- **Between Clusters Sum of Squares** *(BCSS)* - metric complementary to the WCSS
- **Sum of Squares Error** *(SSE)*
- **Maximum Radius** - measures the largest distance from a point to its centroid
- **Average Radius** - the sum of the largest distance from a point to its centroid divided by the number of clusters.

It's recommended to experiment and get to know each of them since depending on the problem, some of the alternatives can be more applicable than the most widely used metrics *(WCSS and Silhouette Score)*.
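As one way to start experimenting, the sketch below computes the Silhouette Score for several values of *K* on the 10 store points, using Scikit-Learn's `silhouette_score`. Setting `n_init=10` explicitly is our own choice, and the scores are printed rather than quoted here since they depend on the fitted clusters:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

points = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 45],
                   [85, 70], [71, 80], [60, 78], [55, 52], [80, 91]])

silhouette_by_k = {}
# The silhouette needs at least 2 clusters and at most n_samples - 1 clusters
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points)
    silhouette_by_k[k] = silhouette_score(points, labels)

for k, score in silhouette_by_k.items():
    print(k, round(score, 3))
```

Scores range from -1 to 1, and values closer to 1 suggest points that sit well inside their own cluster and far from the neighboring ones.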

In the end, as with many data science algorithms, we want to reduce the variance inside each cluster and maximize the variance between different clusters. So we have more defined and separable clusters.
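The relationship between the variance inside and between clusters can be checked numerically: for Euclidean distance and mean centroids, the total sum of squares splits into WCSS plus BCSS. The sketch below is our own illustration on the 10 stores with the two converged groups (variable names are illustrative):

```python
import numpy as np

points = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 45],
                   [85, 70], [71, 80], [60, 78], [55, 52], [80, 91]], dtype=float)
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

overall_mean = points.mean(axis=0)
centroids = np.array([points[labels == j].mean(axis=0) for j in range(2)])
counts = np.bincount(labels)

# Total sum of squares: spread of all points around the overall mean
tss = ((points - overall_mean) ** 2).sum()
# Within clusters: spread of points around their own centroid
wcss_total = ((points - centroids[labels]) ** 2).sum()
# Between clusters: spread of centroids around the overall mean, weighted by cluster size
bcss = (counts * ((centroids - overall_mean) ** 2).sum(axis=1)).sum()

print(tss, wcss_total, bcss)
```

Since the total is fixed for a given dataset, lowering the WCSS necessarily raises the BCSS, which is why the two metrics are described as complementary.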

## Applying K-Means on Another Dataset

Let's use what we have learned on another dataset. This time, we will try to find groups of similar wines.

**Note:** You can download the dataset here.

We begin by importing `pandas` to read the `wine-clustering` CSV *(Comma-Separated Values)* file into a `DataFrame` structure:

```python
import pandas as pd

df = pd.read_csv('wine-clustering.csv')
```

After loading it, let's take a peek at the first five records of data with the `head()` method:

```python
df.head()
```

This results in:

```
   Alcohol  Malic_Acid   Ash  Ash_Alcanity  Magnesium  Total_Phenols  Flavonoids  Nonflavanoid_Phenols  Proanthocyanidins  Color_Intensity   Hue  OD280  Proline
0    14.23        1.71  2.43          15.6        127           2.80        3.06                  0.28               2.29             5.64  1.04   3.92     1065
1    13.20        1.78  2.14          11.2        100           2.65        2.76                  0.26               1.28             4.38  1.05   3.40     1050
2    13.16        2.36  2.67          18.6        101           2.80        3.24                  0.30               2.81             5.68  1.03   3.17     1185
3    14.37        1.95  2.50          16.8        113           3.85        3.49                  0.24               2.18             7.80  0.86   3.45     1480
4    13.24        2.59  2.87          21.0        118           2.80        2.69                  0.39               1.82             4.32  1.04   2.93      735
```

We have many measurements of substances present in wines. Here, we also won't need to transform categorical columns because all of them are numerical. Now, let's take a look at the descriptive statistics with the `describe()` method:

```python
df.describe().T # T is for transposing the table
```

The describe table:

```
                      count        mean         std     min       25%      50%       75%      max
Alcohol               178.0   13.000618    0.811827   11.03   12.3625   13.050   13.6775    14.83
Malic_Acid            178.0    2.336348    1.117146    0.74    1.6025    1.865    3.0825     5.80
Ash                   178.0    2.366517    0.274344    1.36    2.2100    2.360    2.5575     3.23
Ash_Alcanity          178.0   19.494944    3.339564   10.60   17.2000   19.500   21.5000    30.00
Magnesium             178.0   99.741573   14.282484   70.00   88.0000   98.000  107.0000   162.00
Total_Phenols         178.0    2.295112    0.625851    0.98    1.7425    2.355    2.8000     3.88
Flavonoids            178.0    2.029270    0.998859    0.34    1.2050    2.135    2.8750     5.08
Nonflavanoid_Phenols  178.0    0.361854    0.124453    0.13    0.2700    0.340    0.4375     0.66
Proanthocyanidins     178.0    1.590899    0.572359    0.41    1.2500    1.555    1.9500     3.58
Color_Intensity       178.0    5.058090    2.318286    1.28    3.2200    4.690    6.2000    13.00
Hue                   178.0    0.957449    0.228572    0.48    0.7825    0.965    1.1200     1.71
OD280                 178.0    2.611685    0.709990    1.27    1.9375    2.780    3.1700     4.00
Proline               178.0  746.893258  314.907474  278.00   500.500  673.500  985.0000  1680.00
```

By looking at the table it is clear that there is some *variability in the data* - columns such as `Proline` and `Magnesium` vary a lot in absolute terms (standard deviations of about 314.9 and 14.3), while columns such as `Hue` and `Ash` vary very little (about 0.23 and 0.27). Now we can check if there are any `null`, or `NaN` values in our dataset:

```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Alcohol               178 non-null    float64
 1   Malic_Acid            178 non-null    float64
 2   Ash                   178 non-null    float64
 3   Ash_Alcanity          178 non-null    float64
 4   Magnesium             178 non-null    int64
 5   Total_Phenols         178 non-null    float64
 6   Flavonoids            178 non-null    float64
 7   Nonflavanoid_Phenols  178 non-null    float64
 8   Proanthocyanidins     178 non-null    float64
 9   Color_Intensity       178 non-null    float64
 10  Hue                   178 non-null    float64
 11  OD280                 178 non-null    float64
 12  Proline               178 non-null    int64
dtypes: float64(11), int64(2)
memory usage: 18.2 KB
```

There's no need to drop or impute data, considering there aren't any empty values in the dataset. We can use a Seaborn `pairplot()` to see the data distribution and to check if the dataset forms pairs of columns that can be interesting for clustering:

```python
sns.pairplot(df)
```

By looking at the pair plot, two columns seem promising for clustering purposes - `Alcohol` and `OD280` (which is a method for determining the protein concentration in wines). It seems that there are 3 distinct clusters on the plots that combine the two of them.

There are other columns that seem to be correlated as well. Most notably `Alcohol` and `Total_Phenols`, and `Alcohol` and `Flavonoids`. They appear to have linear relationships that can be observed in the pair plot.

Since our focus is clustering with K-Means, let's choose one pair of columns, say `Alcohol` and `OD280`, and test the elbow method for this dataset.

**Note:** When using more columns of the dataset, there will be a need for either plotting in 3 dimensions or reducing the data to principal components (use of PCA). This is a valid, and more common approach, just make sure to choose the principal components based on how much they explain and keep in mind that when reducing the data dimensions, there is some information loss - so the plot is an **approximation** of the real data, not how it really is.

Let's plot the scatter plot with those two columns set to be its axes to take a closer look at the points we want to divide into groups:

```python
sns.scatterplot(data=df, x='OD280', y='Alcohol')
```

Now we can define our columns and use the elbow method to determine the number of clusters. We will also initialize the algorithm with `k-means++` just to make sure it converges more quickly:

```python
values = df[['OD280', 'Alcohol']]

wcss_wine = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(values)
    wcss_wine.append(kmeans.inertia_)
```

We have calculated the WCSS, so we can plot the results:

```python
clusters_wine = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
plt.plot(clusters_wine, wcss_wine)
plt.axvline(3, linestyle='--', color='r')
```

According to the elbow method we should have 3 clusters here. For the final step, let's cluster our points into 3 clusters and plot those clusters identified by colors:

```python
kmeans_wine = KMeans(n_clusters=3, random_state=42)
kmeans_wine.fit(values)
sns.scatterplot(x = values['OD280'], y = values['Alcohol'], hue=kmeans_wine.labels_)
```

We can see clusters `0`, `1`, and `2` in the graph. Based on our analysis, *group 0* has wines with higher protein content and lower alcohol, *group 1* has wines with higher alcohol content and low protein, and *group 2* has both high protein and high alcohol in its wines.

This is a very interesting dataset and I encourage you to go further into the analysis by clustering the data after normalization and PCA - also by interpreting the results and finding new connections.
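As a starting point for that exercise, here is a hedged sketch of a normalization, PCA, and K-Means sequence. It runs on randomly generated stand-in data with the same shape as the wine table (178 rows, 13 columns), so its clusters carry no meaning; with the real data you would pass `df.values` instead of `features`. The choice of 2 components, 3 clusters, and `n_init=10` are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Stand-in for the 13 numeric wine columns, with deliberately different scales per column
features = rng.normal(size=(178, 13)) * rng.uniform(0.1, 300, size=13)

# Scale every column to zero mean and unit variance, then keep 2 principal components
preprocess = make_pipeline(StandardScaler(), PCA(n_components=2))
components = preprocess.fit_transform(features)

# Share of the (scaled) variance that the 2 components retain
explained = preprocess.named_steps["pca"].explained_variance_ratio_
print(explained.sum())

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(components)
print(np.bincount(labels))
```

Scaling first matters because columns measured in large units, such as `Proline`, would otherwise dominate the Euclidean distances that K-Means relies on.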

## Conclusion

*K-Means* clustering is a simple yet very effective unsupervised machine learning algorithm for data clustering. It clusters data based on the Euclidean distance between data points. The K-Means clustering algorithm has many uses, such as grouping text documents, images, videos, and much more.