DBSCAN

Density-based clustering that groups densely packed points and marks sparse regions as noise — no need to specify k in advance.

DBSCAN finds dense connected regions and labels isolated points as noise
eps neighborhoodcluster 1cluster 2noisecoreborder
Selected point has 3 points within eps; core requires at least 4.
Definition

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters as dense regions of points, separated by sparse regions.

Two parameters:

  • ϵ\epsilon (eps): radius for defining "neighborhood"
  • minPts: minimum number of points to form a dense region

Point types:

  • Core point: has at least minPts neighbors within distance ϵ\epsilon
  • Border point: within ϵ\epsilon of a core point but not a core point itself
  • Noise point (outlier): neither core nor border

Clusters are formed by connecting core points within ϵ\epsilon of each other. Unlike k-means, DBSCAN finds arbitrarily-shaped clusters and identifies outliers automatically.

Geographic clustering

GPS coordinates of taxi pickups in a city: DBSCAN with ϵ=100\epsilon=100m, minPts=50 finds dense pickup clusters (airport, stadium, downtown) and marks isolated pickup locations as noise. It correctly identifies non-circular clusters and doesn't require knowing the number of clusters.

Try it

DBSCAN doesn't require you to specify kk (number of clusters). Why is this an advantage over k-means? What do you need to specify instead, and how do you choose these parameters?

Solution

The number of clusters is often unknown and hard to estimate. With k-means, a wrong kk gives meaningless clusters. DBSCAN discovers the number naturally from density.

Parameters to choose: ϵ\epsilon and minPts. A practical approach for ϵ\epsilon: plot the distance to the kk-th nearest neighbor (sorted) for each point — look for the "elbow" in this plot. The elbow suggests where density transitions from cluster to noise. minPts: commonly 2×dimensions2 \times \text{dimensions} for low-dimensional data. Sensitivity to these parameters is DBSCAN's main weakness.

Related concepts

Related reading