Multiple Testing

The inflation of false positives when running many tests simultaneously — and corrections like Bonferroni and the false discovery rate.

[Figure: Hypothesis test, rejection region and test statistic. Two-sided test at α = 0.05 with critical values z = ±1.96 (α/2 = 0.025 in each tail). The observed statistic z = 1.80 gives a p-value ≈ 0.0719 > α = 0.05, so H₀ is not rejected.]

Definition

Multiple testing refers to performing many statistical tests simultaneously. The more tests you run, the more likely you are to find "significant" results by chance.

At $\alpha = 0.05$, a single test of a true $H_0$ has a 5% chance of a false positive. If you run 20 such tests, you expect $20 \times 0.05 = 1$ false positive on average.

The probability of at least one false positive among $m$ independent tests at level $\alpha$, when every null hypothesis is true, is the family-wise error rate (FWER):

$$\text{FWER} = 1 - (1-\alpha)^m$$

For $m = 20$, $\alpha = 0.05$: $\text{FWER} = 1 - 0.95^{20} \approx 0.64$. A 64% chance of at least one false positive!
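The formula is easy to check numerically. The short Python sketch below evaluates it for a few values of $m$; the function name `fwer` is chosen here for illustration, and the calculation assumes independent tests with every null hypothesis true.

```python
def fwer(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among m independent
    tests at level alpha, assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** m


# How quickly the family-wise error rate grows with the number of tests.
for m in (1, 5, 10, 20, 100):
    print(f"m = {m:>3}: FWER = {fwer(m):.3f}")
```

With 100 tests, at least one false positive is close to certain.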

Key properties
  • FWER grows toward 1 as the number of tests increases, even though each individual test's error rate stays fixed at $\alpha$
  • Correction procedures trade power (ability to detect true effects) for tighter error control
  • FDR control is strictly weaker than FWER control, which is exactly why FDR procedures can reject more hypotheses at the same nominal level
  • All standard corrections assume the uncorrected per-test error rate is valid to begin with — they don't fix a biased test, only the multiplicity problem
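To make the power trade-off concrete, here is a minimal Python sketch comparing Bonferroni (FWER control) with the Benjamini-Hochberg step-up procedure, which is one common way to control FDR and is assumed here, since the text does not name a specific FDR method. The function names and the p-values are made up for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls FWER at alpha)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls FDR at alpha for
    independent tests): find the largest rank k with
    p_(k) <= k * alpha / m, then reject the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# Hypothetical p-values, invented for this example only.
p_values = [0.001, 0.008, 0.012, 0.030, 0.040,
            0.200, 0.350, 0.600, 0.750, 0.900]

print("Bonferroni rejections:", sum(bonferroni_reject(p_values)))
print("BH rejections:        ", sum(benjamini_hochberg_reject(p_values)))
```

On these made-up p-values, Bonferroni rejects only the smallest one, while Benjamini-Hochberg rejects the three smallest. Every hypothesis rejected by Bonferroni is also rejected by Benjamini-Hochberg at the same nominal level.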
Common mistakes
  • Running many tests and reporting only the significant ones without correction: this is a major source of false "discoveries" in science and a form of p-hacking
  • Confusing FWER and FDR control: controlling FDR at 5% does not mean there is only a 5% chance of any false positive. It means that, on average, at most 5% of the rejected hypotheses are false positives
GWAS study

A genome-wide association study tests 1 million genetic variants for association with a disease. At $\alpha = 0.05$, we would expect $10^6 \times 0.05 = 50{,}000$ false positives if every $H_0$ were true. The standard threshold is $\alpha = 5 \times 10^{-8}$ (Bonferroni for 1 million tests: $0.05 / 10^6$), which keeps the expected number of false positives at or below 0.05.
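The arithmetic behind these figures can be reproduced in a few lines of Python. This is only a sketch of the calculation, with variable names chosen for illustration, and it treats every null hypothesis as true.

```python
m = 1_000_000   # number of variants tested
alpha = 0.05    # target family-wise error rate

# Expected false positives if each variant is tested at alpha itself.
expected_fp_uncorrected = m * alpha

# Bonferroni: divide the target level by the number of tests.
bonferroni_threshold = alpha / m

# Expected false positives at the corrected per-test threshold.
expected_fp_corrected = m * bonferroni_threshold

print(f"Uncorrected expected false positives: {expected_fp_uncorrected:,.0f}")
print(f"Bonferroni per-test threshold:        {bonferroni_threshold:.1e}")
print(f"Corrected expected false positives:   {expected_fp_corrected:.2f}")
```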

Try it

A researcher tests whether any of 10 drugs improves outcomes, running 10 t-tests. She finds 2 with $p < 0.05$. Why is this not strong evidence of drug efficacy?

Solution

By chance alone, she would expect $10 \times 0.05 = 0.5$ false positives per experiment. If no drug works and the tests are independent, the probability of 2 or more false positives is about 8.6%, which is non-negligible. With 10 tests, the FWER is $1 - 0.95^{10} \approx 40\%$: about two times in five, she'd find at least one false positive even if no drug works.

Without correction, the two significant results could easily be chance findings. She needs to apply a Bonferroni correction (test each drug at $\alpha = 0.05/10 = 0.005$) or a similar method.
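The chance probabilities in this solution can be checked with a binomial calculation. The Python sketch below assumes the 10 tests are independent and that no drug works, so the number of false positives follows a Binomial(10, 0.05) distribution; the helper name is hypothetical.

```python
from math import comb


def prob_at_least_k_false_positives(k: int, m: int, alpha: float = 0.05) -> float:
    """P(X >= k) for X ~ Binomial(m, alpha): the chance of k or more
    false positives among m independent tests when every H0 is true."""
    p_fewer_than_k = sum(
        comb(m, j) * alpha**j * (1 - alpha) ** (m - j) for j in range(k)
    )
    return 1.0 - p_fewer_than_k


m, alpha = 10, 0.05
p_one_or_more = prob_at_least_k_false_positives(1, m, alpha)  # this is the FWER
p_two_or_more = prob_at_least_k_false_positives(2, m, alpha)
bonferroni_level = alpha / m  # per-test level after Bonferroni correction

print(f"P(at least 1 false positive)  = {p_one_or_more:.3f}")
print(f"P(at least 2 false positives) = {p_two_or_more:.3f}")
print(f"Bonferroni per-test level     = {bonferroni_level:.3f}")
```

Two nominally significant results out of ten would arise by chance alone in roughly one experiment in twelve under these assumptions, so they only become persuasive if they survive the corrected threshold.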

Related concepts