# How does DBSCAN clustering algorithm work?

## The biggest secrets behind a clustering algorithm

Cluster analysis is an important problem in data analysis, and DBSCAN is nowadays one of the most popular cluster analysis techniques. **Density-based spatial clustering of applications with noise** (**DBSCAN**) is a data clustering algorithm proposed in 1996 by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu, a group from the database and data mining community.

Now before we begin, let’s understand the core philosophy of DBSCAN.

Given a dataset of points (dataset D = {xi}), we have to partition the points into dense regions, which we will call *clusters*, and sparse regions, which may contain *noise*.

Before we can discuss density-based clustering, we first need to cover a few topics.

**Parameters:**

DBSCAN algorithm basically requires 2 parameters:

**eps**: specifies how close points should be to each other to be considered part of a cluster. If the distance between two points is lower than or equal to this value (eps), these points are considered neighbors.

**minPoints**: the minimum number of points required to form a dense region. For example, if we set the minPoints parameter to 5, then we need at least 5 points to form a dense region. In sklearn it is known as `min_samples`.

In this algorithm, we have 3 types of data points. Given **minPoints** and **eps**, we can categorize every data point in our dataset D = {xi} as a *core point*, a *border point*, or a *noise point*. Let me define each of them.

**Core point:**

- A point is a core point if it has at least *minPoints* points within an *eps* radius around it.
- A core point always belongs to a dense region.
- For example, a point *p* is said to be a core point if *p* has ≥ *minPoints* points in an *eps* radius around it.

**Border point:**

- A point is a border point if it has fewer than *minPoints* points within *eps*, but lies in the neighborhood of a *core point*.
- For example, *p* is said to be a border point if *p* is not a core point (i.e., *p* has < *minPoints* points in its *eps* radius), but *p* belongs to the neighborhood of some point *q*, where *q* is a core point.
- That is, *p* ∈ neighborhood of *q* and distance(*p*, *q*) ≤ *eps*.

**Noise point:**

- A noise point is any point that is not a core point or a border point.
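Putting the three definitions together, a naive classification helper (a toy sketch of my own, not code from any library) might look like this:

```python
import numpy as np

def classify_point(i, X, eps, min_points):
    """Return 'core', 'border', or 'noise' for point X[i]."""
    dist = np.linalg.norm(X - X[i], axis=1)
    neighbors = np.where(dist <= eps)[0]  # includes the point itself
    if len(neighbors) >= min_points:
        return "core"
    # Border: not a core point itself, but within eps of at least one core point
    for j in neighbors:
        dist_j = np.linalg.norm(X - X[j], axis=1)
        if np.sum(dist_j <= eps) >= min_points:
            return "border"
    return "noise"

# Toy data: a small dense group, one fringe point, one far-away point
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [10.0, 10.0]])
for i in range(len(X)):
    print(i, classify_point(i, X, eps=1.5, min_points=3))
# points 0-2 come out as core, point 3 as border, point 4 as noise
```

Note that the point itself is counted among its own neighbors here, matching sklearn's convention for `min_samples`.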

There are two more important ideas that we should be aware of before we go and learn the DBSCAN algorithm.

**Density Edge:**

If *p* and *q* are both core points and distance(*p*, *q*) ≤ *eps*, then we can connect the vertices *p* and *q* with an edge in a graph and call it a "*density edge*".

**Density Connected Points:**

Two points *p* and *q* are said to be density-connected if both *p* and *q* are core points and there exists a path formed by density edges connecting point *p* to point *q*.

**DBSCAN algorithm:**

So, given all these terms and ideas, let's go into the core of the DBSCAN algorithm to see how it works.

*Algorithm:*

**step 1**: ∀ xi ∈ D, label each point as a core, border, or noise point.

**step 2**: Remove all noise points (they belong to sparse regions, i.e., they do not belong to any cluster).

**step 3**: For every core point *p* that has not yet been assigned to a cluster

a) Create a new cluster with the point *p* and

b) add all the points that are density-connected to *p*.

**step 4**: Assign each border point to the cluster of its closest core point.
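The four steps above can be sketched from scratch in a few lines (a didactic sketch, not the library implementation; it counts the point itself among its neighbors, as sklearn does). Step 4 happens implicitly: border points are absorbed by the first cluster that reaches them, while only core points continue the expansion.

```python
import numpy as np

def dbscan(X, eps, min_points):
    """Label each point with a cluster id; -1 marks noise."""
    n = len(X)
    # Pairwise Euclidean distances and eps-neighborhoods
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]

    # Step 1: a point is core if it has >= min_points neighbors (itself included)
    core = np.array([len(nb) >= min_points for nb in neighbors])

    labels = np.full(n, -1)  # Step 2: noise points simply stay at -1
    cluster_id = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Step 3: start a new cluster from an unassigned core point ...
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id   # ... and add density-connected points
                    if core[q]:
                        frontier.append(q)   # Step 4: border points join but don't expand
        cluster_id += 1
    return labels

# Two dense groups plus one isolated point
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [50, 50.0]])
print(dbscan(X, eps=1.5, min_points=3))
```

The O(n²) distance matrix keeps the sketch short; real implementations use a spatial index instead.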

After knowing each DBSCAN concept, The question now arises:

## How to choose Min Points?

- Well, one of the most important things minPoints does is filter out outliers.
- Typically, the rule of thumb is to take minPoints greater than or equal to the dimensionality (d) of your dataset.
- People who work a lot with DBSCAN typically take minPoints to be twice the dimensionality of the data, i.e., minPoints ≈ 2*d.
- If the dataset is noisier, we should choose a larger value of minPoints.
- Choosing minPoints also depends a lot on domain knowledge.
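As a quick illustration of the rule of thumb (the dataset here is a synthetic stand-in):

```python
import numpy as np

# Toy dataset: 500 points in 3 dimensions
X = np.random.default_rng(42).normal(size=(500, 3))

d = X.shape[1]        # dimensionality of the data
min_points = 2 * d    # rule of thumb: minPoints ≈ 2 * d
print(min_points)     # -> 6
```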

## How to determine eps?

- Once you have chosen your minPoints, you can proceed with eps.
- Let us choose minPoints = 4. For each point *p*, I'll compute *di*, where *di* = distance from *p* to the 4th nearest neighbor of *p*. If *di* is high, then the chance that *p* is noisy is also high.
- For each point in the dataset, I'll have my *di*'s. Now sort all *di*'s in increasing order.
- Since we have sorted our points in increasing order of *di*, we can plot the sorted point index against *di* and pick eps at the elbow (knee) of the curve.
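The k-distance computation above can be sketched with sklearn's `NearestNeighbors` (the blob data is a synthetic example of mine; in practice you would plot `d` and eyeball the elbow):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distance(X, min_points=4):
    """Sorted distances from each point to its min_points-th nearest neighbor."""
    # kneighbors on the training set returns the point itself first,
    # so ask for min_points + 1 neighbors
    nn = NearestNeighbors(n_neighbors=min_points + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])

# Two dense blobs plus one far-away outlier
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2)),
               [[20.0, 20.0]]])

d = k_distance(X, min_points=4)
# The curve stays flat for points in dense regions and shoots up at the
# outlier; eps is read off just before the jump, e.g.:
# import matplotlib.pyplot as plt; plt.plot(d); plt.show()
```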

# When does DBSCAN work well?

- It can handle Noise very well.
- DBSCAN can handle clusters of different shapes and sizes.

# When not!

- DBSCAN is powerful and well suited to many applications, but there is no single algorithm that always works; every pro has its con.
- If your dataset has varying densities, DBSCAN tends to fail; it does not work very well in such cases.
- It’s extremely sensitive to the hyperparameters. A slight change in hyperparameters can lead to a drastic change in the outcome.
- As we know, a concept like density tends to break down in high-dimensional data, so we should avoid using DBSCAN for data such as high-dimensional text vectors.

# Time and Space complexity:

**Time Complexity**: O(n log n) with an efficient spatial index; O(n²) in the worst case.

**Space Complexity**: O(n)

**Implementation:**

We don't need to worry about implementing it ourselves; we can use one of the many available libraries/packages. In particular, sklearn ships a ready-made DBSCAN implementation. Let's see how to use it.

**Simple Overview:**

```python
from sklearn.cluster import DBSCAN
from sklearn import metrics
import numpy as np

X = ...  # load the data

clustering = DBSCAN(eps=3, min_samples=2).fit(X)

# Storing the labels formed by the DBSCAN
labels = clustering.labels_

# Identifying which points make up our "core points"
core_samples = np.zeros_like(labels, dtype=bool)
core_samples[clustering.core_sample_indices_] = True
print(core_samples)

# Calculating "the number of clusters"
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters_)

# Computing "the Silhouette Score" to measure the performance of the DBSCAN algo
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))
```

**Silhouette Coefficient Score:** The Silhouette Coefficient is calculated using the mean intra-cluster distance (`a`) and the mean nearest-cluster distance (`b`) for each sample. The Silhouette Coefficient for a sample is `(b - a) / max(a, b)`. To clarify, `b` is the distance between a sample and the nearest cluster that the sample is not a part of. Note that the Silhouette Coefficient is only defined if the number of labels satisfies 2 <= n_labels <= n_samples - 1. This function returns the mean Silhouette Coefficient over all samples.

A silhouette score ranges from -1 to 1, with -1 being the worst score possible and 1 being the best score. Silhouette scores of 0 suggest overlapping clusters.
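The formula `(b - a) / max(a, b)` can be checked by hand on a toy example (my own synthetic 1-D data; with only two clusters, the "nearest other cluster" for every sample is simply the other cluster):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight 1-D clusters
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

s = []
for i in range(len(X)):
    same = [j for j in range(len(X)) if labels[j] == labels[i] and j != i]
    other = [j for j in range(len(X)) if labels[j] != labels[i]]
    a = np.mean([abs(X[j, 0] - X[i, 0]) for j in same])   # mean intra-cluster distance
    b = np.mean([abs(X[j, 0] - X[i, 0]) for j in other])  # mean nearest-cluster distance
    s.append((b - a) / max(a, b))

manual = np.mean(s)
print(manual)                            # close to 1: well-separated clusters
print(silhouette_score(X, labels))       # matches the manual computation
```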

## Implementation on real-world data:

So, now we are going to apply DBSCAN to one of the most popular real-world datasets, i.e., the Amazon Fine Food Reviews dataset.

First, we import the required packages and set up the clustering.
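The original notebook code is not reproduced here, so this is only a sketch of the pipeline: a few toy review strings stand in for the Amazon dataset, TF-IDF vectorization is an assumption on my part, and `eps`/`min_samples` are illustrative, not tuned values.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# Toy stand-in for the review corpus
reviews = [
    "great taste will buy again",
    "great taste would buy again",
    "arrived broken and stale",
    "arrived damaged and stale",
    "completely unrelated gardening tools",
]

# TF-IDF rows are sparse and high-dimensional, which is exactly where
# density estimates get unreliable (see the caveats above)
X = TfidfVectorizer().fit_transform(reviews)

# Cosine distance is the usual choice for TF-IDF vectors
clustering = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit(X)
print(clustering.labels_)  # the dissimilar review comes out as noise (-1)
```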

# And now, your journey begins!

I hope you are now convinced to apply the DBSCAN on some datasets. It’s time to get your hands on it!

**References :**

Wikipedia: https://en.wikipedia.org/wiki/DBSCAN

Sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

Applied AI Course: https://www.appliedaicourse.com/