DBSCAN

Density-based clustering that groups densely packed points and marks sparse regions as noise — no need to specify k in advance.

DBSCAN finds dense connected regions and labels isolated points as noise


Definition

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters as dense regions of points, separated by sparse regions.

Two parameters:

$\epsilon$ (eps): radius for defining "neighborhood"
minPts: minimum number of points that must lie within a point's $\epsilon$-neighborhood for that neighborhood to count as dense

Point types:

Core point: has at least minPts points within distance $\epsilon$ (many implementations, scikit-learn among them, count the point itself)
Border point: within $\epsilon$ of a core point but not a core point itself
Noise point (outlier): neither core nor border
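The three point types can be computed directly from their definitions. The following is a minimal sketch, not a reference implementation: the function name `classify_points` and the toy data are hypothetical, distances are Euclidean, and a point is assumed to count as its own neighbor.

```python
from math import dist


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'.

    Assumption: a point counts as its own neighbor, so min_pts
    includes the point itself.
    """
    # Indices of all points within eps of each point (self included).
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighbors]

    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            # Not dense itself, but within eps of a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels


# Hypothetical toy data: a tight square, one point on its edge, one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)
```

With these values the four corners of the square come out as core points, (2, 1) as a border point (it is within eps of the core point (1, 1) but has only two points in its own neighborhood), and (10, 10) as noise.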

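Clusters are formed by connecting core points that lie within $\epsilon$ of each other; each border point joins the cluster of a core point it is within $\epsilon$ of. Unlike k-means, DBSCAN can find arbitrarily shaped clusters and labels outliers as noise instead of forcing them into a cluster.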
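The cluster-forming step can be sketched as a breadth-first expansion from each unvisited core point. This is a simplified illustration under stated assumptions (Euclidean distance, brute-force neighbor search, a point counted as its own neighbor); the names `dbscan` and `NOISE` are chosen here for illustration, and production libraries use spatial indexes instead of the quadratic neighbor scan.

```python
from collections import deque
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    n = len(points)
    # Brute-force eps-neighborhoods (each point is its own neighbor).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may be relabeled as a border point later
            continue
        # i is an unvisited core point: start a new cluster and expand it.
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # j is core: keep expanding
                queue.extend(neighbors[j])
    return labels


# Hypothetical toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
          (10, 10), (10, 11), (11, 10), (20, 20)]
labels = dbscan(points, eps=1.0, min_pts=3)
print(labels)
```

Border points are assigned to whichever cluster reaches them first, so their label can depend on the order in which points are visited; core and noise labels do not.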

Geographic clustering

Given GPS coordinates of taxi pickups in a city, DBSCAN with $\epsilon = 100$ m and minPts = 50 finds dense pickup clusters (airport, stadium, downtown) and marks isolated pickup locations as noise. It can recover non-circular clusters and does not require knowing the number of clusters in advance.
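Because GPS coordinates are in degrees, an $\epsilon$ of 100 m only makes sense with a distance measured in meters. One common choice is the haversine (great-circle) distance. The sketch below assumes a spherical Earth with a mean radius of 6,371 km; the helper names and the coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters; an approximation


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


def eps_neighbors(pickups, i, eps_m=100.0):
    """Indices of pickups within eps_m meters of pickup i (itself included)."""
    return [j for j, p in enumerate(pickups)
            if haversine_m(pickups[i], p) <= eps_m]


# Hypothetical pickup locations as (latitude, longitude) in degrees.
pickups = [(40.7580, -73.9855), (40.7583, -73.9857), (40.7700, -73.9700)]
print(eps_neighbors(pickups, 0))
```

The first two pickups are a few tens of meters apart and fall in each other's neighborhood; the third is more than a kilometer away and does not. A pickup would be a core point once its neighborhood holds at least minPts pickups.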

Try it

DBSCAN doesn't require you to specify $k$ (number of clusters). Why is this an advantage over k-means? What do you need to specify instead, and how do you choose these parameters?

Solution

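The number of clusters is often unknown and hard to estimate. With k-means, a poorly chosen $k$ forces the data into too few or too many groups, so real clusters get merged or split arbitrarily. DBSCAN instead lets the number of clusters emerge from the density structure of the data.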

Parameters to choose: $\epsilon$ and minPts. A practical approach for $\epsilon$: for each point, compute the distance to its $k$-th nearest neighbor (with $k$ typically set close to minPts), sort these distances, and plot them. The "elbow" in this plot suggests where density transitions from cluster to noise, and the distance at the elbow is a candidate $\epsilon$. For minPts, a common rule of thumb is $2 \times \text{dimensions}$ for low-dimensional data. Sensitivity to these parameters is DBSCAN's main weakness.
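The sorted $k$-th-nearest-neighbor distances can be computed as follows. This is a brute-force sketch with a hypothetical function name and toy data; here $k$ counts neighbors other than the point itself, and plotting the returned list is left to whatever plotting tool is at hand.

```python
from math import dist


def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(d[k - 1])
    return sorted(result)


# Hypothetical toy data: a tight square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
curve = k_distances(points, k=2)
print(curve)
```

For this data the curve is flat at 1.0 for the four square corners and then jumps above 13 for the distant point. That jump is the elbow, so an $\epsilon$ slightly above 1.0 would keep the square together and leave the distant point as noise.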

Related concepts

Machine Learning · Unsupervised Learning

K-Means Clustering: Partitioning data into k clusters by iteratively assigning points to the nearest centroid and re-computing centroids until convergence.

Hierarchical Clustering: Building a tree of nested clusters, either by merging (agglomerative) or splitting (divisive), visualized as a dendrogram.

Cluster Evaluation: Measuring clustering quality without labels using the elbow method, silhouette score, and Davies-Bouldin index.
