K-Means Tutorial

1- Overview

K-Means is an unsupervised learning algorithm used for clustering. That means it works on data with no labels and doesn’t require a supervised training process with labeled examples. Clustering algorithms such as K-Means can be used to create clusters and extract meaning from unstructured data.

When you combine these two traits, K-Means becomes a fantastic method for obtaining insight where other machine learning algorithms can’t. From that perspective, K-Means doesn’t compete with popular supervised machine learning algorithms (such as kNN, linear models, SVM, decision trees, random forests etc.) and stays in its own lane.

K-Means is also very easy to implement, understand and tweak if necessary, which makes it a very popular and useful unsupervised machine learning algorithm. That being said, every machine learning algorithm has its own appeal.

In practice, the K-Means algorithm is very fast (one of the fastest clustering algorithms available), but it tends to converge to local minima. That’s why it can be useful to run it several times with different initial centroids, which is feasible since it’s fast.
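The idea of restarting K-Means to escape poor local minima can be sketched as follows. This is a minimal illustration, assuming Scikit-Learn is installed; the synthetic dataset from make_blobs, the cluster count of 5 and the ten seeds are arbitrary choices made for the example, not values from this tutorial.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (assumption for illustration): 500 points around 5 centers.
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1.5, random_state=0)

# Run K-Means once per seed (n_init=1) with purely random initial centroids,
# so each run may settle in a different local minimum.
runs = []
for seed in range(10):
    model = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X)
    runs.append(model)
    print(f"seed={seed}  inertia={model.inertia_:.1f}")

# Keep the run with the lowest inertia
# (within-cluster sum of squared distances).
best = min(runs, key=lambda m: m.inertia_)
print("best inertia:", round(best.inertia_, 1))
```

Scikit-Learn can do the same thing internally through the n_init parameter, which is covered in the optimization section.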

Why K-Means Algorithm?

2- K-Means Benefits

The main advantage of K-Means is the opportunity to gain knowledge from datasets without any labels. Normally, a human would have difficulty making sense of many columns of random-looking numbers.

But when analyzed with clustering techniques, such datasets become more valuable and meaningful. Additionally, K-Means is quite fast compared to other clustering algorithms, and it’s also very easy to use and interpret.

Also, K-Means doesn’t assume a linear relationship between features, so it can be applied to both linear and non-linear data, although it works best when the clusters are compact and roughly spherical.

K-Means Pros

  • Fast.
  • Simple.
  • Convenient.
  • Doesn’t assume a linear relationship between features.
For more detail check out K-Means Advantages.

K-Means Cons

  • Can get stuck in local minima.
  • Can’t cluster overlapping sub-categories.
  • Results can vary based on initiating cluster points.
For more detail check out K-Means Disadvantages.

Application Areas

3- Key Industries

Some of the common applications with K-Means are:

  • Social Network Analysis
  • Customer Segmentation
  • Insight Extraction from data (such as web analytics, consumer behavior, geolocation data or app analytics etc.)
  • Pattern Recognition
  • Data Science
  • Compression

Industries where K-Means is commonly applied:

  1. Finance
  2. Web
  3. Medicine
  4. Retail
  5. Data Science
  6. E-commerce
  7. Social Network Analysis
  8. Computer Science
  9. Computer Vision

Who Invented K-Means?

4- K-Means History

K-Means emerged during the 1950s and 1960s in the work of multiple independent researchers from multiple domains. However, James MacQueen from the University of California was the first to use the term K-Means, in his 1967 research paper.

You can find the original paper in this article:

Is K-Means Fast?

5- K-Means Computational Complexity

K-Means has O(N*P*K) complexity for each iteration, where N is the number of observations (rows), P is the number of columns (features) and K is the number of centroids. This means K-Means time complexity is linear in N when P and K are fixed, and it can approach quadratic complexity when K or P grows along with N.

For a full K-Means run, the per-iteration complexity above is multiplied by the number of iterations, so the total complexity can be expressed as O(N*P*K*i), where i is the number of iterations.
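To see where the N*P*K term comes from, here is a minimal NumPy sketch of a single K-Means iteration. It is a simplified illustration rather than Scikit-Learn’s actual implementation; the function name kmeans_iteration and the sizes N, P and K are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, K = 1000, 4, 3                      # rows, columns, centroids (illustrative)
X = rng.normal(size=(N, P))
centroids = X[rng.choice(N, size=K, replace=False)]

def kmeans_iteration(X, centroids):
    """One assignment step plus one update step (hypothetical helper)."""
    # Assignment: one difference per (row, centroid, column) triple,
    # i.e. N * K * P elementary operations.
    diffs = X[:, None, :] - centroids[None, :, :]    # shape (N, K, P)
    sq_dists = (diffs ** 2).sum(axis=2)              # shape (N, K)
    labels = sq_dists.argmin(axis=1)                 # nearest centroid per row

    # Update: move each centroid to the mean of its assigned rows.
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        members = X[labels == k]
        if len(members) > 0:                         # keep empty clusters in place
            new_centroids[k] = members.mean(axis=0)
    return labels, new_centroids

n_iterations = 10
for _ in range(n_iterations):
    labels, centroids = kmeans_iteration(X, centroids)

print("operations per iteration ~", N * P * K)
print("operations for the full run ~", N * P * K * n_iterations)
```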

Runtime Speed Performances:

Test setup: 56 columns, max_iter=300, init=k-means++

  • K-Means (50K): 3.14 seconds
  • K-Means (500K): 26.48 seconds
  • K-Means (1M): 27.23 seconds

You can see a more comprehensive analysis of K-Means Complexity and Runtime Performances in this article:

How to Use K-Means?

6- Scikit-Learn K-Means Implementation

Using Scikit-Learn’s cluster module, you can create K-Means clustering models very easily. K-Means clustering is a very intuitive and straightforward process, and it offers great insight into unlabeled, unstructured datasets.

Even if the data is structured, K-Means can be used to complement the findings of supervised machine learning algorithms and to create hybrid projects that combine both techniques.

You can check out this tutorial to see how you can create and use a K-Means model using the Scikit-Learn library and Python:

In some situations it can be very helpful to create a more customized K-Means model by tuning the parameters of the KMeans class in Scikit-Learn. This can help you create a clustering model that better fits the needs of your project. For tuning K-Means models and K-Means optimization, please refer to the next section.
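The basic Scikit-Learn workflow described in this section can be sketched as follows. This is a minimal example on synthetic data; the dataset, the choice of 3 clusters and the new_points values are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled data (the true labels from make_blobs are discarded).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Create the model and assign every row to one of 3 clusters.
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(X)

print(labels[:10])               # cluster index of the first 10 rows
print(model.cluster_centers_)    # one centroid per cluster, shape (3, 2)

# A fitted model can also assign previously unseen points to the nearest centroid.
new_points = [[0.0, 0.0], [5.0, 5.0]]
print(model.predict(new_points))
```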

How Can I Improve K-Means?

7- K-Means Optimization

Scikit-Learn’s K-Means implementation comes with sensible default settings. However, you can still tune a few parameters and adapt the K-Means algorithm to your project.

Another benefit of tuning K-Means is that it helps you understand the algorithm and how it is constructed. For example, the init parameter defines the centroid initialization algorithm, which makes you think about what centroids are and how they work. By default, init is set to “k-means++”, a popular centroid initialization algorithm that tries to pick well-spread initial positions for the cluster centers.

Some of the most commonly adjusted K-Means parameters and hyperparameters are:

  • init: Centroid initialization algorithm
  • n_init: Number of times the algorithm is run with different initial centroids
  • max_iter: Maximum number of iterations for a single K-Means run
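The three parameters listed above can be set explicitly when constructing the model. The sketch below compares two configurations on synthetic data; the specific values (n_init=3, max_iter=50 and so on) are arbitrary assumptions for illustration, not recommendations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=4, random_state=1)

# Configuration A: k-means++ initialization, 10 restarts, up to 300 iterations.
model_a = KMeans(n_clusters=4, init="k-means++", n_init=10,
                 max_iter=300, random_state=1).fit(X)

# Configuration B: random initialization, 3 restarts, up to 50 iterations.
model_b = KMeans(n_clusters=4, init="random", n_init=3,
                 max_iter=50, random_state=1).fit(X)

for name, m in [("k-means++", model_a), ("random", model_b)]:
    # inertia_: within-cluster sum of squared distances of the best run
    # n_iter_:  iterations the best run actually needed
    print(f"{name:10s} inertia={m.inertia_:.1f} iterations={m.n_iter_}")
```

Comparing inertia_ and n_iter_ across configurations is one simple way to see how the initialization affects both the quality of the result and how quickly it is reached.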

You can read more about K-Means Optimization in the article below:

Is there a K-Means Implementation Example?

8- K-Means Example

We prepared a K-Means Implementation example where you can see how K-Means can be used to create clusters with unlabeled data. You can also find useful K-Means Visualization and some K-Means optimization techniques in the same example. Please see page below:

How Do Clustering Algorithms Compare?

K-Means vs Hierarchical Clustering vs DBSCAN

K-Means, Hierarchical Clustering and DBSCAN all approach data clustering differently. These clustering algorithms complement each other, and they are suitable for different cases.

For example, K-Means can be used to create well-separated, spherical clusters, while DBSCAN can create clusters with arbitrary shapes, such as elongated or interlocking ones.

Hierarchical Clustering works in a way that’s similar to decision tree structures, and it can create clusters at multiple levels of depth. You can then pick any cluster depth level you’d like. Also, Hierarchical Clustering doesn’t have initialization parameters such as the number of centroids or a density threshold, which sometimes makes it easier to use.

In addition, DBSCAN can be useful for identifying outliers, since it works based on density and doesn’t force every single data point into a cluster like K-Means does.
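Picking a depth level from a hierarchy can be sketched with SciPy’s hierarchical clustering functions. This is a minimal illustration, assuming SciPy and NumPy are installed; the four synthetic groups, the Ward linkage method and the two cut levels are choices made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Four synthetic groups: two pairs that sit close together (assumption).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [0, 5], [10, 0], [10, 5]])
X = np.vstack([c + rng.normal(scale=0.5, size=(25, 2)) for c in centers])

# Build the full merge tree once; no number of clusters is needed here.
Z = linkage(X, method="ward")

# Cut the same tree at two different depths.
coarse = fcluster(Z, t=2, criterion="maxclust")   # at most 2 clusters
fine = fcluster(Z, t=4, criterion="maxclust")     # at most 4 clusters

print("coarse level clusters:", len(set(coarse)))
print("fine level clusters:  ", len(set(fine)))
```

Because both cuts come from the same tree, every cluster at the fine level is fully contained in one cluster at the coarse level.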
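The difference in how the two algorithms treat outliers can be sketched as follows. This is a minimal example, assuming Scikit-Learn is installed; the two-moons dataset, the two hand-placed outliers and the DBSCAN settings (eps=0.3, min_samples=5) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interlocking half-moons plus two isolated points far from the data.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
outliers = np.array([[4.0, 4.0], [-4.0, -4.0]])
X_all = np.vstack([X, outliers])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_all)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_all)

# K-Means assigns every point, including the outliers, to some cluster.
print("K-Means labels for the outliers:", km_labels[-2:])

# DBSCAN marks points in low-density regions as noise with the label -1.
print("DBSCAN labels for the outliers: ", db_labels[-2:])
```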

K-Means vs Hierarchical Clustering

  • K-Means is more performant and scales well.
  • Hierarchical can be slow with large datasets.
  • K-Means only creates one partition, based on the K parameter.
  • Hierarchical creates multiple partitions at multiple levels.
  • K-Means needs the K parameter (n_clusters, the number of clusters) to be specified up front.
  • Hierarchical doesn’t need such parameters, so it can be simpler to use.
  • K-Means output can vary, since it starts from randomly chosen initial centroids and iterates from there.
  • Hierarchical always returns the same result.