DBSCAN
Density-based clustering that groups densely packed points and marks sparse regions as noise — no need to specify k in advance.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters as dense regions of points, separated by sparse regions.
Two parameters:
- (eps): radius for defining "neighborhood"
- minPts: minimum number of points to form a dense region
Point types:
- Core point: has at least minPts neighbors within distance
- Border point: within of a core point but not a core point itself
- Noise point (outlier): neither core nor border
Clusters are formed by connecting core points within of each other. Unlike k-means, DBSCAN finds arbitrarily-shaped clusters and identifies outliers automatically.
GPS coordinates of taxi pickups in a city: DBSCAN with m, minPts=50 finds dense pickup clusters (airport, stadium, downtown) and marks isolated pickup locations as noise. It correctly identifies non-circular clusters and doesn't require knowing the number of clusters.
DBSCAN doesn't require you to specify (number of clusters). Why is this an advantage over k-means? What do you need to specify instead, and how do you choose these parameters?
Solution
The number of clusters is often unknown and hard to estimate. With k-means, a wrong gives meaningless clusters. DBSCAN discovers the number naturally from density.
Parameters to choose: and minPts. A practical approach for : plot the distance to the -th nearest neighbor (sorted) for each point — look for the "elbow" in this plot. The elbow suggests where density transitions from cluster to noise. minPts: commonly for low-dimensional data. Sensitivity to these parameters is DBSCAN's main weakness.