DBSCAN

DBSCAN Benefits | DBSCAN Performance
DBSCAN Tuning | DBSCAN Examples

DBSCAN Tutorial

1- Overview

DBSCAN is a very famous Density Based Clustering algorithm and it stands for Density Based Spatial Clustering of Applications with Noise.

Although K-Means and Hierarchical Clustering are different and unique unsupervised clustering methods, there can be situations where they are not enough or suitable to handle data in best possible way.

For example, sometimes data can have clusters within clusters, think of it like a circle inside a circle. But K-Means no Hierarchical Clustering are not suitable to identify these clusters. Once K-Means establishes a cluster it will continue with it but it won’t be able to identify a cluster inside a cluster.

Another situation is when there are a few outliers in the data both K-Means and Hierarchical Clustering will attempt to process these data points with the rest of the clusters. This inclusive approach can be inappropriate in some tasks such as fraud detection, outlier detection or attempts to forecast black swan events.

In those cases we have another clustering tool that can be very useful and those are “Density Based Clustering” methods. What Density Based algorithms offer is density controlling mechanisms such as a radius parameter which allow more control on how underlying clustering mechanism works or what to include or exclude. This also offers a great way to avoid inclusion of noise in the cluster and can be very useful when data is noisy.

DBSCAN has two main parameters. These are:

R: Radius of neighborhood

M: Minimum number of neighbors

Why DBSCAN Algorithm?

2- DBSCAN Benefits

Works well with noisy data
Can detect outliers
Works well with special clusters (such as cluster inside a cluster or an unusual shape)
Tackles non-linear data very well.
It doesn’t require assumptive parameters such as K of K-Means

DBSCAN Pros

Clusters arbitrary shapes successfully
No need to specify amount of clusters
Robust and Stable, doesn’t get affected by outliers like K-Means does
lots of parameters:minpts, s, eps?

See: Advantages of DBSCAN

DBSCAN Cons

Parameters can be difficult to master and optimize
Bordering points can be assigned randomly to clusters depending on processing order of data, usually a minor point but must be kept in mind
data with large density difference won’t be clustered well

See: Disadvantages of DBSCAN

Application Areas

3- Key Industries

Abstract— Data Mining is all about data analysis techniques. It is useful for extracting hidden and interesting patterns from large datasets.

Clustering techniques are important when it comes to extracting knowledge from large amount of spatial data collected from various applications ncluding GIS, satellite images, X-ray crystallography, remote sensing and environmental assessment and planning etc.

To extract useful pattern from these complex data sources several popular spatial
data clustering techniques have been proposed.

DBSCAN (Density Based Spatial Clustering of Applications with Noise) is a pioneer density based algorithm. It can discover clusters of any
arbitrary shape and size in databases containing even noise and outliers.

DBSCAN however are known to have a number of problems such as: (a) it requires user’s input to specify parameter values for executing the algorithm; (b) it is prone to dilemma in deciding meaningful clusters from datasets with varying densities; (c) and it incurs certain computational
complexity.

VDBSCAN, FDBSCAN, DD_DBSCAN, and IDBSCAN.

DBSCAN can be used in a number of specific applications such as

Geographic Information Systems
Satellite data analysis,
X-ray crystallography,
remote sensing and
environmental assessment
pattern recognition
financial data clustering
customer clustering in an
product clusters in e-commerce

This application shows a DBSCAN application on welding:
“https://europepmc.org/article/PMC/PMC7827984”

Who Invented DBSCAN?

4- DBSCAN History

DBSCAN is a fairly new algorithm which was invented by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. DBSCAN name stands for density-based spatial algorithm of applications with noise.

Thanks to its competitive edge in density based clustering of arbitrarily shaped regions DBSCAN is heavily used in unsupervised clustering projects based on unlabeled data in the industry, academia and scientific research.

You can read more details about DBSCAN history in the article below and access related material about the original paper as well as “Test of Time Award” that was given to the DBSCAN algorithm.

History of DBSCAN

Is DBSCAN Fast?

5- DBSCAN Complexity

DBSCAN’s worst case computational complexity is Quadratic Complexity and it can be shown with Big O notation as O(N^2). Below you can see a summary of our runtime performance tests with DBSCAN algorithm:

56 columns, eps = 0.5, min_samples = 5, n_jobs=1
DBSCAN (100K): 57.06 seconds
DBSCAN (1M): 521.29 seconds

56 columns, eps = 0.5, min_samples = 5, n_jobs = –1
DBSCAN (100K): 9.84 seconds
DBSCAN (1M): 81.58 seconds

Please note: the tests were done with non-dimensional data (only 2 features) using a fairly fast laptop (i7 8th Gen processor, 16GB RAM)

You can visit the article below where you can find a more detailed explanation of the DBSCAN complexity as well as full list of DBSCAN performance results and supplementary materials:

DBSCAN Complexity

How to Use DBSCAN?

6- Scikit-Learn DBSCAN Implementation

Here are a couple of examples that might be useful to understand ensembles better. Additionally you can use some

How can I improve DBSCAN?

7- DBSCAN Optimization

DBSCAN offers quite a few hyperparameter for tuning and optimization and a good command on them can take a mediocre machine learning implementation to the next level of advanced algorithm mastery.

Fundamentally, eps and min_samples are the most major DBSCAN parameters that will most likely require your sophisticated input based on domain expertise or analytical savvy. These parameters define the distance and minimum sample parameters used in the calculation of density-based region clusters.

Additionally, distance calculations can be masterfully optimized by metric and p parameters when needed. On top of that, dataset is handled and mapped via a core algorithm such as brute, kd_tree or ball_tree and this selection can be made using algorithm parameter.

leaf_size and n_jobs are other significant hyperparameters which can be optimized through a scikit-learn implementation to improve the performance or accuracy of the DBSCAN machine learning model. You can find a detailed tutorial about how to optimize DBSCAN below:

Tuning DBSCAN

Is there a DBSCAN Implementation Example?

8- DBSCAN Example

Practice is key to mastering any subject and machine learning is no different. Particularly, we have prepared a particular example with plenty of explanations, optimizations and visualizations based on a DBSCAN implementation which can help you get familiar with the practical side of this algorithm and its implementations.

DBSCAN Example 1

AiFinesse.com