Skip to content

DBSCAN

DBSCAN Tutorial

1- Overview

DBSCAN is a very famous Density Based Clustering algorithm and it stands for Density Based Spatial Clustering of Applications with Noise.

Although K-Means and Hierarchical Clustering are different and unique unsupervised clustering methods, there can be situations where they are not enough or suitable to handle data in best possible way.

For example, sometimes data can have clusters within clusters, think of it like a circle inside a circle. But K-Means no Hierarchical Clustering are not suitable to identify these clusters. Once K-Means establishes a cluster it will continue with it but it won’t be able to identify a cluster inside a cluster.

Another situation is when there are a few outliers in the data both K-Means and Hierarchical Clustering will attempt to process these data points with the rest of the clusters. This inclusive approach can be inappropriate in some tasks such as fraud detection, outlier detection or attempts to forecast black swan events.

In those cases we have another clustering tool that can be very useful and those are “Density Based Clustering” methods. What Density Based algorithms offer is density controlling mechanisms such as a radius parameter which allow more control on how underlying clustering mechanism works or what to include or exclude. This also offers a great way to avoid inclusion of noise in the cluster and can be very useful when data is noisy.

DBSCAN has two main parameters. These are:

R: Radius of neighborhood

M: Minimum number of neighbors

Why DBSCAN Algorithm?

2- DBSCAN Benefits

  • Works well with noisy data
  • Can detect outliers
  • Works well with special clusters (such as cluster inside a cluster or an unusual shape)
  • Tackles non-linear data very well.
  • It doesn’t require assumptive parameters such as K of K-Means

DBSCAN Pros

  • Clusters arbitrary shapes successfully
  • No need to specify amount of clusters
  • Robust and Stable, doesn’t get affected by outliers like K-Means does
  • lots of parameters:minpts, s, eps?

DBSCAN Cons

  • Parameters can be difficult to master and optimize
  • Bordering points can be assigned randomly to clusters depending on processing order of data, usually a minor point but must be kept in mind
  • data with large density difference won’t be clustered well

Application Areas

3- Key Industries

Abstract— Data Mining is all about data analysis techniques. It is useful for extracting hidden and interesting patterns from large datasets. 

Clustering techniques are important when it comes to extracting knowledge from large amount of spatial data collected from various applications  ncluding GIS, satellite images, X-ray crystallography, remote sensing and environmental assessment and planning etc. 

To extract useful pattern from these complex data sources several popular spatial
data clustering techniques have been proposed. 

DBSCAN (Density Based Spatial Clustering of Applications with Noise) is a pioneer density based algorithm. It can discover clusters of any
arbitrary shape and size in databases containing even noise and outliers. 

DBSCAN however are known to have a number of problems such as: (a) it requires user’s input to specify parameter values for executing the algorithm; (b) it is prone to dilemma in deciding meaningful clusters from datasets with varying densities; (c) and it incurs certain computational
complexity. 

VDBSCAN, FDBSCAN, DD_DBSCAN, and IDBSCAN.

DBSCAN can be used in a number of specific applications such as

  • Geographic Information Systems
  • Satellite data analysis,
  • X-ray crystallography,
  • remote sensing and
  • environmental assessment
  • pattern recognition
  • financial data clustering
  • customer clustering in an
  • product clusters in e-commerce

This application shows a DBSCAN application on welding:
“https://europepmc.org/article/PMC/PMC7827984”

Who Invented DBSCAN?

4- DBSCAN History

DBSCAN is a fairly new algorithm which was invented by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. DBSCAN name stands for density-based spatial algorithm of applications with noise.

Thanks to its competitive edge in density based clustering of arbitrarily shaped regions DBSCAN is heavily used in unsupervised clustering projects based on unlabeled data in the industry, academia and scientific research.

You can read more details about DBSCAN history in the article below and access related material about the original paper as well as “Test of Time Award” that was given to the DBSCAN algorithm.

Is DBSCAN Fast?

5- DBSCAN Complexity

DBSCAN’s worst case computational complexity is Quadratic Complexity and it can be shown with Big O notation as O(N^2). Below you can see a summary of our runtime performance tests with DBSCAN algorithm:

56 columns, eps = 0.5, min_samples = 5, n_jobs=1
DBSCAN (100K): 57.06 seconds
DBSCAN (1M): 521.29 seconds

56 columns, eps = 0.5, min_samples = 5, n_jobs = –1
DBSCAN (100K): 9.84 seconds
DBSCAN (1M): 81.58 seconds

Please note: the tests were done with non-dimensional data (only 2 features) using a fairly fast laptop (i7 8th Gen processor, 16GB RAM)

You can visit the article below where you can find a more detailed explanation of the DBSCAN complexity as well as  full list of DBSCAN performance results and supplementary materials:

How to Use DBSCAN?

6- Scikit-Learn DBSCAN Implementation

Here are a couple of examples that might be useful to understand ensembles better. Additionally you can use some

How can I improve DBSCAN?

7- DBSCAN Optimization

DBSCAN offers quite a few hyperparameter for tuning and optimization and a good command on them can take a mediocre machine learning implementation to the next level of advanced algorithm mastery.

Fundamentally, eps and min_samples are the most major DBSCAN parameters that will most likely require your sophisticated input based on domain expertise or analytical savvy. These parameters define the distance and minimum sample parameters used in the calculation of density-based region clusters.

Additionally, distance calculations can be masterfully optimized by metric and p parameters when needed. On top of that, dataset is handled and mapped via a core algorithm such as brute, kd_tree or ball_tree and this selection can be made using algorithm parameter. 

leaf_size and n_jobs are other significant hyperparameters which can be optimized through a scikit-learn implementation to improve the performance or accuracy of the DBSCAN machine learning model. You can find a detailed tutorial about how to optimize DBSCAN below:

Is there a DBSCAN Implementation Example?

8- DBSCAN Example

Practice is key to mastering any subject and machine learning is no different. Particularly, we have prepared a particular example with plenty of explanations, optimizations and visualizations based on a DBSCAN implementation which can help you get familiar with the practical side of this algorithm and its implementations.

How Do Clustering Algorithms Compare?