Multiple Testing
The inflation of false positives when running many tests simultaneously — and corrections like Bonferroni and the false discovery rate.
Multiple testing refers to performing many statistical tests simultaneously. The more tests you run, the more likely you are to find "significant" results by chance.
At $\alpha = 0.05$, if you run 1 test under $H_0$, you have a 5% chance of a false positive. If you run 20 tests, you expect 1 false positive on average.
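The probability of at least one false positive (the family-wise error rate, FWER) among $m$ independent tests at level $\alpha$:
$$\text{FWER} = 1 - (1 - \alpha)^m$$
For $m = 20$, $\alpha = 0.05$: FWER $= 1 - 0.95^{20} \approx 0.64$. A 64% chance of at least one false positive!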
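As a quick numerical check of the FWER formula, the sketch below evaluates $1 - (1 - \alpha)^m$ for a few values of $m$. The function name `fwer` is chosen here for illustration, and the calculation assumes independent tests with every null hypothesis true.

```python
def fwer(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among m independent
    tests at level alpha, assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** m


# Show how quickly the family-wise error rate grows with the number of tests.
for m in (1, 5, 10, 20, 100):
    print(f"m = {m:>3}   FWER = {fwer(m):.3f}")
```

With $m = 20$ this reproduces the 64% figure, and by $m = 100$ a false positive is nearly certain.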
- FWER grows toward 1 as the number of tests increases, even though each individual test's error rate stays fixed at $\alpha$
- Correction procedures trade power (ability to detect true effects) for tighter error control
- FDR control is strictly weaker than FWER control, which is exactly why FDR procedures can reject more hypotheses at the same nominal level
- All standard corrections assume the uncorrected per-test error rate is valid to begin with — they don't fix a biased test, only the multiplicity problem
- Running many tests and reporting only the significant ones without correction: this is one of the most common sources of false "discoveries" in science (a form of p-hacking)
- Confusing FWER and FDR control: a method controlling FDR at 5% does not mean only a 5% chance of any false positive — it means false positives make up at most 5% of rejected hypotheses on average
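To make the FWER/FDR distinction concrete, here is a minimal sketch comparing Bonferroni (which controls FWER) with the Benjamini-Hochberg step-up procedure (a common choice for FDR control, used here as an assumption since the text does not name a specific FDR method). The p-values are made up for illustration, and the function names are hypothetical.

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls FWER at level alpha)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: find the largest rank k such that
    p_(k) <= (k / m) * alpha, then reject the k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# Hypothetical p-values from 10 tests (illustrative only, not real data).
pvals = [0.001, 0.008, 0.012, 0.041, 0.049, 0.20, 0.35, 0.50, 0.74, 0.91]

print("Bonferroni rejections:", sum(bonferroni_reject(pvals)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg_reject(pvals)))
```

On these inputs the FDR procedure rejects more hypotheses than Bonferroni at the same nominal level, which is the power trade-off described in the list above.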
A genome-wide association study tests 1 million genetic variants for association with a disease. At $\alpha = 0.05$, we expect 50,000 false positives under $H_0$. The standard threshold is $p < 5 \times 10^{-8}$ (Bonferroni for 1 million tests: $0.05 / 10^6$), which keeps the expected number of false positives at no more than 0.05.
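The arithmetic behind these numbers can be sketched in a few lines. The variable names are illustrative, and the calculation assumes the worst case in which every variant is truly null.

```python
m = 1_000_000   # number of variants tested
alpha = 0.05    # nominal per-test level

# Uncorrected: each null test has probability alpha of a false positive.
expected_fp_uncorrected = m * alpha

# Bonferroni: divide the nominal level by the number of tests.
bonferroni_threshold = alpha / m

# Expected false positives when every test uses the corrected threshold.
expected_fp_corrected = m * bonferroni_threshold

print(f"Expected false positives, uncorrected: {expected_fp_uncorrected:,.0f}")
print(f"Bonferroni threshold: {bonferroni_threshold:.1e}")
print(f"Expected false positives, corrected: {expected_fp_corrected:.2f}")
```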
A researcher tests whether any of 10 drugs improves outcomes, running 10 t-tests. She finds 2 with $p < 0.05$. Why is this not strong evidence of drug efficacy?
Solution
By chance alone, she would expect $10 \times 0.05 = 0.5$ false positives per experiment if no drug works, and the probability of 2 or more false positives by chance is non-negligible (about 9% for independent tests). With 10 tests, the FWER is $1 - 0.95^{10} \approx 0.40$: about 40% of the time, she'd find at least one false positive even if no drug works.
Without correction, the two significant results could easily be chance findings. She needs to apply Bonferroni correction (test each drug at $\alpha / 10 = 0.005$) or a similar method.
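The figures in this solution can be checked with a short binomial calculation. This is a sketch under two assumptions not stated in the problem: the 10 tests are independent, and no drug has any real effect. The helper `binom_pmf` is defined here for illustration.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials with success prob p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_tests, alpha = 10, 0.05

# Expected number of false positives if every null hypothesis is true.
expected_fp = n_tests * alpha

# FWER: probability of at least one false positive.
fwer_10 = 1 - binom_pmf(0, n_tests, alpha)

# Probability of two or more false positives.
p_two_or_more = 1 - binom_pmf(0, n_tests, alpha) - binom_pmf(1, n_tests, alpha)

# Bonferroni-corrected per-test level.
bonferroni_alpha = alpha / n_tests

print(f"Expected false positives: {expected_fp:.2f}")
print(f"P(at least 1 false positive): {fwer_10:.3f}")
print(f"P(at least 2 false positives): {p_two_or_more:.3f}")
print(f"Bonferroni per-test level: {bonferroni_alpha:.3f}")
```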