Benjamini-Hochberg FDR Correction

The Benjamini-Hochberg (BH) procedure controls the False Discovery Rate (FDR) — the expected proportion of false positives among all rejected hypotheses — when testing multiple metrics simultaneously. It is less conservative than Bonferroni while still providing strong statistical guarantees.

Why FDR is Better Than Bonferroni for Many Metrics

When an experiment measures 10 or more metrics, Bonferroni correction becomes extremely conservative: it controls the Family-Wise Error Rate (FWER) — the probability of even one false positive — at the cost of almost never detecting true effects.

| Property | Bonferroni | BH FDR |
|---|---|---|
| Controls | FWER (any false positive) | FDR (proportion of false positives) |
| Threshold per metric | $\alpha / m$ | $k/m \times \alpha$ (rank-dependent) |
| Recommended for | 1–4 metrics | 5+ metrics |
| Power (sensitivity) | Low for many tests | Higher |
| False positives | Nearly zero | At most $\alpha$ proportion |

Rule of thumb: Use Bonferroni for $\leq 4$ metrics; use BH FDR for $\geq 5$ metrics.
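
To see why, compare the per-metric significance bars each method sets; a minimal sketch for $m = 10$ metrics at $\alpha = 0.05$:

```python
# Per-metric significance thresholds implied by each correction,
# for m = 10 metrics at alpha = 0.05.
m, alpha = 10, 0.05

bonferroni = alpha / m                           # flat 0.005 for every metric
bh = [k / m * alpha for k in range(1, m + 1)]    # 0.005 up to 0.05, by rank

print(f"Bonferroni (all ranks): {bonferroni:.4f}")
print("BH by rank:", [f"{t:.4f}" for t in bh])
```

Bonferroni holds every metric to 0.005, while BH relaxes the bar as rank increases; that relaxation is where the extra power comes from.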

The BH Procedure

Given $m$ raw p-values $p_1, p_2, \ldots, p_m$ and a desired FDR threshold $\alpha$:

  1. Sort p-values ascending: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$.
  2. Compute the BH threshold for rank $k$: $T_k = \frac{k}{m} \times \alpha$.
  3. Find the largest rank $k^*$ such that $p_{(k^*)} \leq T_{k^*}$.
  4. Reject all hypotheses with rank $\leq k^*$ (declare significant).
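
The four steps above can be sketched as a small standalone function (an illustration only, not the service's implementation):

```python
def bh_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a reject/accept mask aligned with the input order.
    """
    m = len(p_values)
    # Step 1: indices sorted by ascending p-value (rank k = position + 1).
    order = sorted(range(m), key=lambda i: p_values[i])
    # Steps 2-3: find the largest rank k* with p_(k) <= (k/m) * alpha.
    k_star = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_star = rank
    # Step 4: reject every hypothesis at rank <= k*.
    reject = [False] * m
    for idx in order[:k_star]:
        reject[idx] = True
    return reject
```

Running it on the ten p-values from the example below at $\alpha = 0.05$ rejects only the two smallest, matching the hand calculation.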

Example

10 p-values with $\alpha = 0.10$:

| Rank | p-value | BH threshold $k/10 \times 0.10$ | Reject? |
|---|---|---|---|
| 1 | 0.001 | 0.010 | Yes |
| 2 | 0.008 | 0.020 | Yes |
| 3 | 0.039 | 0.030 | No |
| 4 | 0.041 | 0.040 | No |
| 5 | 0.042 | 0.050 | Yes |
| 6 | 0.060 | 0.060 | Yes ← largest $k^*$ |
| 7 | 0.074 | 0.070 | No |
| 8 | 0.205 | 0.080 | No |
| 9 | 0.212 | 0.090 | No |
| 10 | 0.391 | 0.100 | No |

Result: Reject 6 hypotheses (all with rank $\leq 6$), even though ranks 3 and 4 individually fail their per-rank thresholds. The step-up property searches for the largest passing rank, so ranks 1, 2, 5, and 6 passing brings ranks 3 and 4 along.

At $\alpha = 0.05$ with the same p-values, only 2 are significant (k=1: 0.001≤0.005 ✓, k=2: 0.008≤0.010 ✓, k=3: 0.039>0.015 ✗).

Adjusted P-values (BH Step-Up Formula)

The BH-adjusted p-value for rank $k$ is:

$$p_{\text{adj}}(k) = \min\left(\frac{p_{(k)} \cdot m}{k},\ 1.0\right)$$

with monotonicity enforced from right to left, starting from $p_{\text{adj,mono}}(m) = p_{\text{adj}}(m)$: $$p_{\text{adj,mono}}(k) = \min\left(p_{\text{adj}}(k),\ p_{\text{adj,mono}}(k+1)\right)$$

A metric is significant when $p_{\text{adj,mono}}(k) \leq \alpha$, consistent with the $\leq$ comparisons in the step-up procedure.
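
Both formulas together can be sketched as a hypothetical helper (not the service's actual code); it assumes the raw p-values are already sorted ascending:

```python
def bh_adjusted(p_sorted):
    """BH-adjusted p-values for raw p-values sorted ascending.

    Applies p * m / k capped at 1.0, then a right-to-left pass so
    adjusted values never decrease with rank.
    """
    m = len(p_sorted)
    adj = [min(p * m / k, 1.0) for k, p in enumerate(p_sorted, start=1)]
    for k in range(m - 2, -1, -1):   # enforce adj[k] <= adj[k + 1]
        adj[k] = min(adj[k], adj[k + 1])
    return adj
```

On the five p-values from the API example below, [0.001, 0.015, 0.032, 0.08, 0.41], this reproduces the adjusted values 0.005, 0.0375, ≈0.0533, 0.1, and 0.41.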

When to Use BH vs. Bonferroni

| Situation | Recommendation |
|---|---|
| 1–4 primary metrics (confirmatory) | Bonferroni or no correction |
| 5–20 metrics (exploratory) | Benjamini-Hochberg FDR |
| Dimensional breakdown (per-segment) | Bonferroni (small number of segments) |
| 20+ metrics or large-scale analysis | Benjamini-Hochberg FDR |
| Sequential / always-valid testing | mSPRT (no multiple-testing correction needed) |

REST API Reference

POST /api/v1/results/{experiment_id}/fdr-correction

Apply Benjamini-Hochberg FDR correction to per-metric p-values.

Request body:

```json
{
  "p_values": {
    "revenue": 0.001,
    "click_through_rate": 0.032,
    "session_duration": 0.08,
    "churn_rate": 0.41,
    "engagement_score": 0.015
  },
  "fdr_threshold": 0.05
}
```
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| p_values | dict[str, float] | Yes | — | Metric name → raw p-value. All values in [0,1]. Non-empty. |
| fdr_threshold | float | No | 0.05 | FDR control level α in [0,1]. |
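
A client call might look like the sketch below, using only the Python standard library; BASE_URL and the apply_fdr_correction helper name are placeholders for your deployment, not part of the API:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # placeholder; point at your deployment


def apply_fdr_correction(experiment_id, p_values, fdr_threshold=0.05):
    """POST raw p-values to the fdr-correction endpoint and return the
    parsed per-metric results (sorted by rank ascending)."""
    payload = json.dumps(
        {"p_values": p_values, "fdr_threshold": fdr_threshold}
    ).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/api/v1/results/{experiment_id}/fdr-correction",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Filtering the response with `[r["metric_name"] for r in results if r["is_significant"]]` then yields the significant metrics.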

Response (200):

```json
[
  {
    "metric_name": "revenue",
    "raw_p_value": 0.001,
    "adjusted_p_value": 0.005,
    "rank": 1,
    "is_significant": true
  },
  {
    "metric_name": "engagement_score",
    "raw_p_value": 0.015,
    "adjusted_p_value": 0.0375,
    "rank": 2,
    "is_significant": true
  },
  {
    "metric_name": "click_through_rate",
    "raw_p_value": 0.032,
    "adjusted_p_value": 0.0533,
    "rank": 3,
    "is_significant": false
  },
  {
    "metric_name": "session_duration",
    "raw_p_value": 0.08,
    "adjusted_p_value": 0.1,
    "rank": 4,
    "is_significant": false
  },
  {
    "metric_name": "churn_rate",
    "raw_p_value": 0.41,
    "adjusted_p_value": 0.41,
    "rank": 5,
    "is_significant": false
  }
]
```

The response is sorted by rank ascending (smallest raw p-value first).

Error codes:

| Status | Condition |
|---|---|
| 404 | Experiment not found |
| 422 | Empty p_values, any value outside [0,1], or fdr_threshold outside [0,1] |
| 500 | Internal computation error |

Example Python Usage

```python
from backend.app.services.fdr_correction_service import BenjaminiHochbergService

svc = BenjaminiHochbergService()

p_values = {
    "revenue": 0.001,
    "ctr": 0.032,
    "session_duration": 0.08,
    "churn": 0.41,
    "engagement": 0.015,
}

results = svc.correct(p_values, fdr_threshold=0.05)

for r in results:
    status = "SIGNIFICANT" if r.is_significant else "not significant"
    print(
        f"Rank {r.rank:2d}: {r.metric_name:<20s} "
        f"p={r.raw_p_value:.4f}  adj_p={r.adjusted_p_value:.4f}  {status}"
    )
```

Output:

```text
Rank  1: revenue              p=0.0010  adj_p=0.0050  SIGNIFICANT
Rank  2: engagement           p=0.0150  adj_p=0.0375  SIGNIFICANT
Rank  3: ctr                  p=0.0320  adj_p=0.0533  not significant
Rank  4: session_duration     p=0.0800  adj_p=0.1000  not significant
Rank  5: churn                p=0.4100  adj_p=0.4100  not significant
```

Integration with the Existing Results API

The GET /api/v1/results/{experiment_id} endpoint already accepts a correction_method query parameter:

GET /api/v1/results/{experiment_id}?correction_method=benjamini_hochberg

This applies BH correction to per-metric p-values returned in the main results response when multiple metrics are defined.

The POST /api/v1/results/{experiment_id}/fdr-correction endpoint provides finer control: you supply any set of p-values (from post-stratification, CUPED, or raw analysis) and receive BH-corrected results immediately.

References

  • Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. JRSS-B, 57(1), 289–300.
  • Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genomewide studies. PNAS, 100(16), 9440–9445.