Cluster Evaluation

Measuring clustering quality without labels — the elbow method, silhouette score, and Davies-Bouldin index.

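Silhouette compares closeness to your own cluster with closeness to the nearest other cluster. For a point i, a(i) is the mean distance to the other points in its own cluster, and b(i) is the mean distance to the points in the nearest other cluster:

s(i) = (b(i) - a(i)) / max(a(i), b(i))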

High score: own cluster is close and the nearest other cluster is far away. Negative score: probably assigned badly.
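
The silhouette formula can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: it assumes Euclidean distance, at least two clusters, and the common convention that a point alone in its cluster scores 0. The function name `silhouette_values` and the toy data are made up for this example.

```python
from math import dist


def silhouette_values(points, labels):
    """Return s(i) for every point (Euclidean distance, >= 2 clusters assumed)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the point's own cluster
        # (the distance from p to itself is 0, so it does not affect the sum)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Two tight pairs of points, far apart from each other (toy data).
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
good = silhouette_values(points, [0, 0, 1, 1])  # pairs kept together
bad = silhouette_values(points, [0, 1, 0, 1])   # pairs split across clusters

print(sum(good) / len(good), sum(bad) / len(bad))
```

With the natural grouping every point gets a clearly positive score; when each pair is split across clusters, every score turns negative, which is the "probably assigned badly" case described above.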

Definition

Cluster evaluation assesses the quality of a clustering result. Unlike classification, clustering is unsupervised — there are no ground-truth labels to compare against, so evaluation is harder.

Two main settings:

  • External evaluation: you have ground-truth labels (e.g., known species, known document categories). Compare the clustering to the labels.
  • Internal evaluation: no ground-truth. Assess quality using only the data and cluster assignments.

Key properties

  • Internal metrics depend only on the data and cluster assignments — useful when no labels exist
  • External metrics require ground-truth labels and measure agreement, not "correctness" in an absolute sense
  • A good internal metric must balance compactness against the number of clusters, not reward more clusters unconditionally
  • No single metric is universally "correct" — different metrics emphasize different notions of cluster quality

Common mistakes

  • Using raw within-cluster sum of squares to pick K: it always improves as K increases, trivially reaching zero at K = n, so it must be balanced against complexity (elbow method) or replaced with a metric like silhouette
  • Comparing ARI/NMI scores across datasets with very different cluster-size distributions: chance agreement varies with cluster balance, so always check the adjusted/normalized versions rather than raw agreement counts

When to use each

External: you're clustering handwritten digits. You have true digit labels (0–9). Use them to check if the clusters correspond to digits.
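
One way to make the external check concrete is the adjusted Rand index (ARI), which counts pairs of points that the labels and the clustering treat the same way and then corrects for chance. The sketch below is an assumption-laden illustration rather than a reference implementation: the function name and the six-point example are invented here, and it assumes at least two points.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, cluster_labels):
    """Pair-counting ARI between two labelings of the same points (n >= 2)."""
    n = len(true_labels)
    pair_counts = Counter(zip(true_labels, cluster_labels))
    true_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)

    # Pairs of points grouped together in both labelings
    both = sum(comb(c, 2) for c in pair_counts.values())
    # Pairs grouped together in the ground truth / in the clustering
    same_true = sum(comb(c, 2) for c in true_sizes.values())
    same_cluster = sum(comb(c, 2) for c in cluster_sizes.values())

    expected = same_true * same_cluster / comb(n, 2)  # agreement expected by chance
    max_index = (same_true + same_cluster) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything together
    return (both - expected) / (max_index - expected)


truth = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]  # same grouping, different cluster ids
mixed = [0, 1, 0, 1, 0, 1]      # grouping unrelated to the truth

print(adjusted_rand_index(truth, relabeled))
print(adjusted_rand_index(truth, mixed))
```

The first call scores a perfect match even though the cluster ids are swapped, because cluster ids are arbitrary and only the grouping matters. The second lands slightly below zero, meaning the agreement is no better than chance.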

Internal: you're clustering customer segments with no pre-defined groups. You must evaluate using compactness (do points within a cluster resemble each other?) and separation (are clusters far from each other?).

Try it

Why can't you simply pick the clustering with the smallest total within-cluster sum of squares as the best clustering?

Solution

Because within-cluster SS always decreases as K increases, reaching zero when K = N (each point is its own cluster). Minimizing within-cluster SS alone would always suggest using as many clusters as possible, which is meaningless.

Any good internal metric must balance compactness against the number of clusters or against separation. Metrics like the silhouette score and Davies-Bouldin index do this; raw within-cluster SS does not.
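
The monotone decrease is easy to observe on a toy example. The sketch below assumes one-dimensional data, where the optimal clusters are contiguous runs of the sorted values, so a small dynamic program can find the exact minimum within-cluster SS for each K. The helper names and the data are hypothetical, chosen only for illustration.

```python
def sse(values):
    """Sum of squared deviations from the mean of a non-empty list."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_wcss(data, k):
    """Smallest achievable within-cluster SS for 1-D data split into k clusters."""
    xs = sorted(data)
    n = len(xs)
    inf = float("inf")
    # best[c][j]: minimal within-cluster SS for the first j sorted points
    # using exactly c non-empty clusters
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            # The last cluster is xs[i:j]; the first i points use c - 1 clusters
            best[c][j] = min(
                best[c - 1][i] + sse(xs[i:j]) for i in range(c - 1, j)
            )
    return best[k][n]


# Three visually obvious groups (toy data).
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.1]
curve = [best_wcss(data, k) for k in range(1, len(data) + 1)]

for k, value in enumerate(curve, start=1):
    print(k, round(value, 4))
```

The printed values fall at every step and hit zero at K = 8, one cluster per point. The large drops end at K = 3, which is the "elbow" that matches the three groups in the data; everything after that is the metric rewarding extra clusters for their own sake.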
