Multiple Testing

The inflation of false positives when running many tests simultaneously — and corrections like Bonferroni and the false discovery rate.

[Figure: Multiple testing: small error rates accumulate across a family. With m = 20 tests at alpha = 0.05, the expected number of false positives is 1.0 and the probability of at least one false positive is 64% (FWER = 1 - 0.95^m).]

Definition

Multiple testing refers to performing many statistical tests simultaneously. The more tests you run, the more likely you are to find "significant" results by chance.

At $\alpha = 0.05$, a single test run under $H_0$ has a 5% chance of a false positive. If you run 20 such tests and every null hypothesis is true, you expect $20 \times 0.05 = 1$ false positive on average.

The probability of at least one false positive among $m$ independent tests at level $\alpha$, when every null hypothesis is true, is the family-wise error rate (FWER):

$\text{FWER} = 1 - (1-\alpha)^m$

For $m=20$, $\alpha=0.05$: FWER $= 1 - 0.95^{20} \approx 0.64$. A 64% chance of at least one false positive!
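The formula above is easy to tabulate. The following is a minimal sketch that assumes independent tests with every null hypothesis true; the function names `fwer` and `expected_false_positives` are illustrative, not taken from any library.

```python
def fwer(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among m independent tests under H0."""
    return 1.0 - (1.0 - alpha) ** m


def expected_false_positives(m: int, alpha: float = 0.05) -> float:
    """Expected number of false positives when all m null hypotheses are true."""
    return m * alpha


# Tabulate both quantities for a few family sizes.
for m in (1, 5, 20, 100):
    print(f"m={m:>3}  expected FP={expected_false_positives(m):.2f}  FWER={fwer(m):.3f}")
```

The expected count grows linearly in $m$, while the FWER saturates toward 1, so the two numbers answer different questions about the same family of tests.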

Key properties

FWER grows toward 1 as the number of tests increases, even though each individual test's error rate stays fixed at $\alpha$
Correction procedures trade power (ability to detect true effects) for tighter error control
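FDR control is strictly weaker than FWER control (any procedure that controls FWER at level $\alpha$ also controls FDR at level $\alpha$, but not the reverse), which is exactly why FDR procedures can reject more hypotheses at the same nominal level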
All standard corrections assume the uncorrected per-test error rate is valid to begin with — they don't fix a biased test, only the multiplicity problem
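To make the power trade-off concrete, here is a hedged sketch comparing Bonferroni (FWER control) with the Benjamini-Hochberg step-up procedure, one common FDR method that the text above does not name explicitly. The p-values are made up for illustration, and the function names are hypothetical.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls FWER at level alpha)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls FDR for independent tests)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices sorted by p-value
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
        if pvalues[idx] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for idx in order[:k_max]:  # reject the k_max smallest p-values
        reject[idx] = True
    return reject


# Illustrative p-values only, not data from any study.
pvals = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.7, 0.9, 0.95]
print("Bonferroni rejections:", sum(bonferroni_reject(pvals)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg_reject(pvals)))
```

On these p-values Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects three, which is the extra power bought by accepting the weaker FDR guarantee.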

Common mistakes

Running many tests and reporting only the significant ones without correction: this is a major source of false "discoveries" in science (a form of p-hacking)
Confusing FWER and FDR control: a method controlling FDR at 5% does not mean only a 5% chance of any false positive — it means false positives make up at most 5% of rejected hypotheses on average
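The first mistake can be illustrated with a small simulation. This sketch assumes every null hypothesis is true and the tests are independent, so each p-value is drawn from Uniform(0, 1); the function name, seed, and number of simulated experiments are arbitrary choices for illustration.

```python
import random


def simulate_any_false_positive(m=20, alpha=0.05, n_experiments=10_000, seed=0):
    """Fraction of simulated experiments with at least one p < alpha when all m nulls are true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # Under H0, a valid p-value is uniformly distributed on (0, 1).
        pvalues = [rng.random() for _ in range(m)]
        if min(pvalues) < alpha:
            hits += 1  # this experiment has something "significant" to report
    return hits / n_experiments


rate = simulate_any_false_positive()
print(f"share of all-null experiments with a reportable result: {rate:.3f}")
print(f"theoretical FWER: {1 - 0.95 ** 20:.3f}")
```

A researcher who reports only the smallest p-value from each such experiment would have a "finding" in roughly two out of three experiments despite there being no real effect.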

GWAS study

A genome-wide association study tests 1 million genetic variants for association with a disease. At $\alpha = 0.05$, we expect $10^6 \times 0.05 = 50{,}000$ false positives if every null hypothesis is true. The standard threshold is $\alpha = 5 \times 10^{-8}$ (Bonferroni for 1 million tests: $0.05 / 10^6$), which keeps the expected number of false positives at or below 0.05.
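The arithmetic in the GWAS example can be checked directly. This is a sketch assuming all 1 million null hypotheses are true; the FWER line additionally assumes independent tests, which real genetic variants generally are not, so treat that figure as a rough guide.

```python
import math

m = 1_000_000      # number of variants tested
alpha = 0.05       # nominal per-test level

expected_fp_uncorrected = m * alpha               # false positives expected without correction
bonferroni_threshold = alpha / m                  # per-test level after Bonferroni
expected_fp_corrected = m * bonferroni_threshold  # false positives expected at that level

# FWER under independence, computed with log1p/expm1 to avoid rounding error at tiny thresholds.
fwer_corrected = -math.expm1(m * math.log1p(-bonferroni_threshold))

print(f"expected false positives at alpha=0.05: {expected_fp_uncorrected:,.0f}")
print(f"Bonferroni threshold: {bonferroni_threshold:.1e}")
print(f"expected false positives after correction: {expected_fp_corrected:.3f}")
print(f"FWER after correction (independence assumed): {fwer_corrected:.4f}")
```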

Try it

A researcher tests whether any of 10 drugs improves outcomes, running 10 t-tests. She finds 2 with $p < 0.05$. Why is this not strong evidence of drug efficacy?

Solution

By chance alone, she would expect $10 \times 0.05 = 0.5$ false positives per experiment. If no drug works and the tests are independent, the probability of 2 or more false positives is about 8.6%, which is non-negligible. With 10 tests, the FWER is $1-0.95^{10} \approx 40\%$: about 40% of the time, she'd find at least one false positive even if no drug works.

Without correction, the two significant results could easily be chance findings. She needs to apply a Bonferroni correction (test each drug at $\alpha = 0.05/10 = 0.005$) or a similar method.
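The probabilities in this solution follow from a binomial model. The sketch below assumes the 10 tests are independent and that no drug works, so the number of false positives is Binomial(10, 0.05); `prob_at_least` is an illustrative helper, not a library function.

```python
from math import comb


def prob_at_least(k: int, m: int, alpha: float) -> float:
    """P(X >= k) for X ~ Binomial(m, alpha), the count of false positives under H0."""
    below_k = sum(comb(m, j) * alpha**j * (1 - alpha) ** (m - j) for j in range(k))
    return 1.0 - below_k


m, alpha = 10, 0.05
p_at_least_one = prob_at_least(1, m, alpha)  # the FWER for this family
p_at_least_two = prob_at_least(2, m, alpha)  # chance of seeing 2+ "hits" with no real effect
bonferroni_level = alpha / m                 # per-test level after correction

print(f"P(at least 1 false positive) = {p_at_least_one:.3f}")
print(f"P(at least 2 false positives) = {p_at_least_two:.3f}")
print(f"Bonferroni per-test level = {bonferroni_level:.3f}")
```

Whether either of her two results survives correction depends on the actual p-values, which the exercise does not give: each would need $p \le 0.005$ under Bonferroni.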

Related concepts

Needs first

Hypothesis Testing, Probability

ANOVA, Chi-Square Test
