Cluster Evaluation

Measuring clustering quality without labels — the elbow method, silhouette score, and Davies-Bouldin index.

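Silhouette

The silhouette compares how close a point is to its own cluster with how close it is to the nearest other cluster:

$s(i) = \dfrac{b(i) - a(i)}{\max(a(i),\, b(i))}$

Here $a(i)$ is the mean distance from point $i$ to the other points in its own cluster, and $b(i)$ is the smallest mean distance from point $i$ to the points of any other cluster. The score lies between $-1$ and $1$. High score: the point's own cluster is close and the nearest other cluster is far away. Negative score: the point is probably assigned to the wrong cluster.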
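To make the formula concrete, here is a minimal sketch in plain Python. It takes a as the mean distance from a point to the other members of its own cluster and b as the smallest mean distance to the members of any other cluster. The function name `silhouette` and the four toy points are made up for this illustration, and the code assumes Euclidean distance and at least two clusters.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette(points, labels, i):
    """Silhouette s(i) = (b - a) / max(a, b) for the point at index i."""
    own = labels[i]
    # Collect distances from point i to every other point, grouped by cluster.
    by_cluster = {}
    for j, (p, lab) in enumerate(zip(points, labels)):
        if j != i:
            by_cluster.setdefault(lab, []).append(dist(points[i], p))
    if own not in by_cluster:
        return 0.0  # singleton cluster: s(i) is set to 0 by convention
    # a: mean distance to the other members of the point's own cluster.
    a = sum(by_cluster[own]) / len(by_cluster[own])
    # b: smallest mean distance to the members of any other cluster.
    b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
    return (b - a) / max(a, b)


# Hypothetical toy data: two tight pairs, five units apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = [0, 0, 1, 1]  # each pair is its own cluster
bad = [0, 1, 1, 1]   # the second point is placed in the far cluster

print(round(silhouette(points, good, 0), 3))  # 0.802
print(round(silhouette(points, bad, 1), 3))   # -0.802
```

The well-assigned point scores close to 1, while the point pushed into the distant cluster scores negative, matching the interpretation above.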

Definition

Cluster evaluation assesses the quality of a clustering result. Unlike classification, clustering is unsupervised: there are usually no ground-truth labels to compare against, so evaluation is harder.

Two main settings:

External evaluation: you have ground-truth labels (e.g., known species, known document categories). Compare the clustering to the labels.
Internal evaluation: no ground-truth. Assess quality using only the data and cluster assignments.

Key properties

Internal metrics depend only on the data and cluster assignments — useful when no labels exist
External metrics require ground-truth labels and measure agreement, not "correctness" in an absolute sense
A good internal metric must balance compactness against the number of clusters, not reward more clusters unconditionally
No single metric is universally "correct" — different metrics emphasize different notions of cluster quality

Common mistakes

Using raw within-cluster sum of squares to pick $K$: it always improves as $K$ increases, trivially reaching zero at $K = n$. It must be balanced against complexity (elbow method) or replaced with a metric like silhouette
Comparing raw agreement counts across datasets with very different cluster-size distributions: chance agreement varies with cluster balance, so use adjusted or normalized scores such as the adjusted Rand index (ARI) and normalized mutual information (NMI) rather than raw agreement counts
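The effect of chance agreement can be seen in a small pair-counting sketch. It computes the raw Rand index (the fraction of point pairs on which two labelings agree) and the adjusted Rand index, which subtracts the agreement expected by chance. The function names and the 25-point example are assumptions made for this illustration, written in plain Python rather than taken from any particular library.

```python
from collections import Counter
from math import comb


def pair_counts(truth, pred):
    """Pair counts shared by the raw and the adjusted Rand index."""
    both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    in_truth = sum(comb(c, 2) for c in Counter(truth).values())
    in_pred = sum(comb(c, 2) for c in Counter(pred).values())
    return both, in_truth, in_pred, comb(len(truth), 2)


def rand_index(truth, pred):
    both, in_truth, in_pred, total = pair_counts(truth, pred)
    # Agreeing pairs: together in both labelings, or apart in both.
    return (total + 2 * both - in_truth - in_pred) / total


def adjusted_rand_index(truth, pred):
    both, in_truth, in_pred, total = pair_counts(truth, pred)
    expected = in_truth * in_pred / total  # agreement expected by chance
    max_index = (in_truth + in_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (both - expected) / (max_index - expected)


# Hypothetical example: 5 true groups of 5 points each.
truth = [i // 5 for i in range(25)]
unrelated = [i % 5 for i in range(25)]    # cuts across every true group
relabeled = [(t + 1) % 5 for t in truth]  # same grouping, different names

print(round(rand_index(truth, unrelated), 3))           # 0.667
print(round(adjusted_rand_index(truth, unrelated), 3))  # -0.2
print(adjusted_rand_index(truth, relabeled))            # 1.0
```

The unrelated labeling still agrees on two thirds of all pairs, simply because most pairs are apart in both labelings. The adjusted score drops below zero for it, and it stays at 1 when only the cluster names change.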

When to use each

External: you're clustering handwritten digits. You have true digit labels (0–9). Use them to check if the clusters correspond to digits.

Internal: you're clustering customer segments with no pre-defined groups. You must evaluate using compactness (do points within a cluster resemble each other?) and separation (are clusters far from each other?).
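As a rough illustration of scoring compactness and separation together, the sketch below computes the Davies-Bouldin index in plain Python. For each cluster it takes the worst ratio of combined scatter (mean distance of points to their centroid) to the distance between centroids, then averages those ratios, so lower is better. The helper names and toy points are made up for this example, and the code assumes at least two clusters with distinct centroids.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(cluster):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (lists of points)."""
    cents = [centroid(c) for c in clusters]
    # Compactness: mean distance from each point to its cluster centroid.
    scatter = [sum(dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    # For each cluster, find the most similar (worst-separated) other cluster.
    worst = [
        max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
            for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst) / k


# Hypothetical toy data: the same four points grouped two ways.
tight = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 0.0), (5.0, 1.0)]]
loose = [[(0.0, 0.0), (0.0, 1.0), (5.0, 0.0)], [(5.0, 1.0)]]

print(davies_bouldin(tight) < davies_bouldin(loose))  # True
```

The grouping that keeps nearby points together gets the lower (better) index, using nothing but the points and their assignments.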

Try it

Why can't you always just pick the clustering with the smallest total within-cluster sum of squares as the best clustering?

Solution

Because within-cluster SS keeps decreasing as $K$ increases, reaching zero when $K = n$ (each point is its own cluster). Minimizing within-cluster SS alone would always suggest using as many clusters as possible, which is meaningless.

Any good internal metric must balance compactness against the number of clusters or against separation. Metrics like the silhouette score and Davies-Bouldin index do this; raw within-cluster SS does not.
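A small numeric check makes the argument concrete. The sketch below computes the within-cluster sum of squares for a few hand-picked partitions of six one-dimensional points. The data and the partitions are invented for illustration and are not the output of any clustering algorithm.

```python
def wcss(clusters):
    """Total within-cluster sum of squared deviations from each cluster mean."""
    total = 0.0
    for cluster in clusters:
        mean = sum(cluster) / len(cluster)
        total += sum((x - mean) ** 2 for x in cluster)
    return total


# Hypothetical data with two obvious groups, partitioned by hand for each K.
partitions = {
    1: [[1.0, 2.0, 3.0, 11.0, 12.0, 13.0]],
    2: [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]],
    3: [[1.0, 2.0], [3.0], [11.0, 12.0, 13.0]],
    6: [[1.0], [2.0], [3.0], [11.0], [12.0], [13.0]],
}

results = {k: wcss(p) for k, p in partitions.items()}
for k, value in results.items():
    print(k, value)
# 1 154.0
# 2 4.0
# 3 2.5
# 6 0.0
```

The score keeps improving all the way to zero at six singleton clusters, even though two clusters is the natural answer. The large drop from one to two clusters followed by small gains is the bend that the elbow method looks for.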

Related concepts

K-Means Clustering: Partitioning data into k clusters by iteratively assigning points to the nearest centroid and re-computing centroids until convergence.

Hierarchical Clustering: Building a tree of nested clusters, either by merging (agglomerative) or splitting (divisive), visualized as a dendrogram.

DBSCAN: Density-based clustering that groups densely packed points and marks sparse regions as noise, with no need to specify k in advance.

Model Evaluation: Confusion matrices, accuracy, precision, recall, F1 score, ROC curves, and AUC, the toolkit for measuring classifier and regressor performance.
