factrix.multi_factor.bhy_hierarchical
bhy_hierarchical(profiles: Iterable[FactorProfile], *, group: str, estimator: Estimator | None = None, q: float = 0.05) -> Survivors
Hierarchical Benjamini-Hochberg-Yekutieli (BHY): control false discovery rate (FDR) across groups then within groups.
For factor sets with natural group structure (momentum / value /
quality families; cross-region universes), the Yekutieli (2008)
two-stage procedure controls group-level FDR ≤ q on the outer
layer (Simes group representative + BHY) and within-group FDR
≤ q on the inner layer (BHY restricted to passing groups). Flat
BHY across the whole input loses group-level interpretability and
pays full m-correction even when most groups are dead.
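To make the cost of the full m-correction concrete, the sketch below compares the smallest step-up threshold of a flat BHY over all candidates with the threshold inside a single group. It assumes the standard Benjamini-Yekutieli form, rank * q / (m * c(m)) with c(m) the m-th harmonic number. The helper name `bhy_threshold` and the sizes (40 candidates, groups of 5) are illustrative assumptions, not part of factrix.

```python
def bhy_threshold(rank: int, m: int, q: float = 0.05) -> float:
    """Step-up rejection threshold for the rank-th smallest p-value among m tests."""
    # Harmonic penalty c(m) that makes the procedure valid under arbitrary dependence.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    return rank * q / (m * c_m)


# Flat BHY: all 40 candidate factors form one family.
flat = bhy_threshold(rank=1, m=40)

# Inner layer of the hierarchical procedure: one 5-factor group (illustrative size).
within = bhy_threshold(rank=1, m=5)

print(f"flat: {flat:.6f}  within-group: {within:.6f}")
```

The within-group threshold applies only to groups that already passed the outer layer, so the comparison shows where the correction is spent, not a free gain in power.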
Pick this over the alternatives by survivor unit and claim shape:
| Claim | Survivor unit | Function |
|---|---|---|
| "factor X significant in each universe / horizon" | (factor, context) pair | bhy(expand_over=) |
| "factor X significant in ≥ k of m conditions" | factor identity | partial_conjunction |
| "which families have signal, and within those, which factors" | factor identity (group-then-within FDR) | bhy_hierarchical |
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| profiles | Iterable[FactorProfile] | Iterable of FactorProfile objects. | required |
| group | str | Single context key naming the group axis (e.g. "family"). | required |
| estimator | Estimator \| None | Optional inference-method override. | None |
| q | float | Nominal FDR target shared by both layers. | 0.05 |
Returns:
| Type | Description |
|---|---|
| Survivors | Survivors container. adj_p is the max-of-layers fold, which preserves the universal duality survivor[i] iff adj_p[i] <= q. n_tests maps each group to its size (covers all input groups, not just survivors, counter to partial_conjunction, which keeps surviving identities only). Per-survivor group labels are recoverable via profile.context[group]. |
Raises:
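| Type | Description |
|---|---|
| UserInputError | group shadows an identity field (factor_id / forward_periods), is missing from a profile's context, takes only one distinct value across the input, is near-unique (every profile its own group at \(n \ge 3\)), or yields a duplicate (identity, group_value) partition key. |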
Warns:
| Type | Description |
|---|---|
| RuntimeWarning | More than half of input groups contain a single profile: inner BHY on n=1 is a raw cutoff and the outer Simes representative equals that single p, so those groups get no FDR correction at either layer. |
Notes
- Simes as the outer representative: dominates Bonferroni m * min(p) and is the Yekutieli (2008) recommended choice. The kwarg is not exposed at v1 (Edgington-style mean p has no valid null; Bonferroni-min is strictly worse than Simes under the procedure's positive regression dependence on a subset (PRDS) assumption).
- PRDS within group: Simes is valid under positive regression dependence (typical for factors within one family, which share style exposure). If a group mixes structurally opposite factors (e.g. momentum and reversal in one bucket), the within-group PRDS assumption can fail; split the bucket or pre-orthogonalize.
- Pre-filtered input: bhy_hierarchical assumes the input is the candidate family. If profiles came from upstream pre-filtering (e.g. top-50 of 500 candidates), the FDR claim does not cover the full screening pipeline; count K accordingly per the Haircut Sharpe / experiment-log discipline.
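The dominance of Simes over Bonferroni-min can be checked numerically. The following is a minimal sketch in plain Python, not the library's implementation; the helper names `simes` and `bonferroni_min` and the sample p-values are illustrative assumptions.

```python
import random


def simes(p_values):
    """Simes representative: min over k of (m / k) * p_(k)."""
    ordered = sorted(p_values)
    m = len(ordered)
    return min(m / k * p_k for k, p_k in enumerate(ordered, start=1))


def bonferroni_min(p_values):
    """Bonferroni representative m * min(p), capped at 1."""
    return min(1.0, len(p_values) * min(p_values))


# One hand-picked group: Simes uses the whole ordered sample, Bonferroni only the minimum.
example = [0.02, 0.03, 0.04]
print(simes(example), bonferroni_min(example))

# Random groups of 1 to 8 p-values: Simes should never exceed Bonferroni-min.
rng = random.Random(0)
samples = [[rng.random() for _ in range(rng.randint(1, 8))] for _ in range(1000)]
never_worse = all(simes(p) <= bonferroni_min(p) for p in samples)
print(never_worse)
```

The k = 1 term of the Simes minimum is exactly the Bonferroni value, which is why Simes can only tie or improve on it.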
References
Yekutieli, D. (2008). "Hierarchical false discovery rate-controlling methodology." JASA 103(481), 309-316.
Examples:
Six candidate factors split into two family groups; FDR is controlled across groups (outer) and within each group (inner):
>>> import dataclasses
>>> import factrix as fx
>>> from factrix.preprocess import compute_forward_return
>>> cfg = fx.AnalysisConfig.individual_continuous(forward_periods=5)
>>> profiles = [
... dataclasses.replace(
... fx.evaluate(
... compute_forward_return(
... fx.datasets.make_cs_panel(
... n_assets=100, n_dates=250, seed=i,
... ),
... forward_periods=5,
... ),
... cfg,
... ),
... factor_id=f"f_{family}_{i}",
... context={"family": family},
... )
... for family in ("momentum", "value")
... for i in range(3)
... ]
>>> survivors = fx.multi_factor.bhy_hierarchical(
... profiles, group="family"
... )
Two-stage false discovery rate (FDR) for factor sets with natural group structure (factor families, regions, sectors). Outer Benjamini-Hochberg-Yekutieli (BHY) on Simes (1986) group representatives + inner BHY within each passing group, per Yekutieli (2008).
import factrix as fx

# "Which factor families have signal, and within those, which factors?"
# The panel_* frames and cfg are assumed to be prepared beforehand.
profiles = [
    fx.evaluate(panel_mom_1m, cfg, factor_col="mom_1m",
                context={"family": "momentum"}),
    fx.evaluate(panel_mom_12m, cfg, factor_col="mom_12m",
                context={"family": "momentum"}),
    fx.evaluate(panel_pb, cfg, factor_col="pb",
                context={"family": "value"}),
    fx.evaluate(panel_pe, cfg, factor_col="pe",
                context={"family": "value"}),
    # ... + quality, low-vol, etc.
]
survivors = fx.multi_factor.bhy_hierarchical(profiles, group="family", q=0.05)
Which function fits this question?
Same input shape (one profile per (factor, condition)), three different claims:
| Claim | Survivor unit | Function |
|---|---|---|
| "Factor X significant in each condition / universe" | (factor, condition) pair | bhy(expand_over=) |
| "Factor X significant in \(\ge k\) of \(m\) conditions" | factor identity | partial_conjunction |
| "Which families have signal, and within those, which factors?" | factor identity (group-then-within) | bhy_hierarchical |
bhy_hierarchical is the only one of the three that keeps the
family-level answer first-class — readers learn both "5 of 8
families showed signal" and "within those, factors A / B / C survived"
from a single Survivors container.
How the math works
Per group \(g\) with \(m_g\) member p-values:
- Compute the group representative
  \[ p_{\text{Simes},g} = \min_{k=1,\ldots,m_g} \frac{m_g}{k} \cdot p_{(k),g} \]
  where \(p_{(k),g}\) is the \(k\)-th smallest p-value in group \(g\). Simes dominates the Bonferroni representative \(m_g \cdot \min(p)\) and is the Yekutieli 2008 recommended choice.
- Outer BHY across the \(G\) group representatives gives \(p_{\text{outer},g}^{\text{adj}}\).
- Inner BHY within each group gives \(p_{\text{inner},i}^{\text{adj}}\) for member \(i\) of group \(g(i)\).
- The cell-level adjusted p is the max-of-layers fold
  \[ p_i^{\text{adj}} = \max\bigl(p_{\text{outer},g(i)}^{\text{adj}},\; p_{\text{inner},i}^{\text{adj}}\bigr) \]
This preserves the universal Survivors duality
survivor[i] iff adj_p[i] <= q while encoding the two-layer logic:
a cell can fail because its group failed outer, because the cell
itself failed inner, or both.
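The four steps above can be sketched with NumPy as follows. This is an illustrative re-implementation under stated assumptions, not factrix's own code: BHY is taken to be the Benjamini-Yekutieli step-up with the harmonic penalty c(m), and the helper names and p-values are made up for the example.

```python
import numpy as np


def simes(p):
    """Simes group representative: min over k of (m / k) * p_(k)."""
    p = np.sort(np.asarray(p, dtype=float))
    m = p.size
    return float(np.min(m / np.arange(1, m + 1) * p))


def bhy_adjust(p):
    """Benjamini-Yekutieli adjusted p-values (assumed form of BHY)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    c_m = np.sum(1.0 / ranks)                      # harmonic penalty c(m)
    scaled = c_m * m / ranks * p[order]
    # Step-up: running minimum from the largest p downwards, capped at 1.
    adj_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    adj = np.empty(m)
    adj[order] = adj_sorted                        # back to input order
    return adj


def hierarchical_adj_p(groups):
    """groups: dict of label -> raw p-values. Returns label -> folded adjusted p."""
    labels = list(groups)
    outer = bhy_adjust([simes(groups[g]) for g in labels])       # steps 1 and 2
    return {
        g: np.maximum(outer[i], bhy_adjust(groups[g]))           # steps 3 and 4
        for i, g in enumerate(labels)
    }


groups = {"momentum": [0.001, 0.004, 0.30], "value": [0.40, 0.55, 0.90]}
adj = hierarchical_adj_p(groups)
survivors = {g: (a <= 0.05).tolist() for g, a in adj.items()}
print(survivors)
```

In this toy input the value family fails the outer layer, so all of its cells fail regardless of their inner adjusted p; in the momentum family the third factor fails on the inner layer alone.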
Survivors output
| Field | Meaning |
|---|---|
| profiles | Surviving profiles in input order |
| adj_p | Max-of-layers \(\text{adj}_p\); survivor iff adj_p <= q |
| q | The q you passed (single target, both layers) |
| expand_over | (group,), a single-element tuple |
| n_tests | Mapping (group_value,) -> m_group for every input group (covers dead families too, so "N of M families survived" claims are computable directly). Counter to partial_conjunction, which keeps surviving identities only. |
Per-survivor group label: profile.context[group].
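Because n_tests covers dead families as well, an "N of M families survived" summary needs only the fields listed above. The sketch below uses `SimpleNamespace` stand-ins that mirror those fields; a real container would come from bhy_hierarchical, and the "quality" family and group sizes are made-up values for illustration.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical stand-in for a Survivors container; only the documented fields are mimicked.
survivors = SimpleNamespace(
    profiles=[
        SimpleNamespace(context={"family": "momentum"}),
        SimpleNamespace(context={"family": "momentum"}),
    ],
    expand_over=("family",),
    n_tests={("momentum",): 3, ("value",): 3, ("quality",): 2},
)

# expand_over is a single-element tuple holding the group key.
(group,) = survivors.expand_over

# Count surviving profiles per group label.
alive = Counter(p.context[group] for p in survivors.profiles)

# Pair each input group with (survivors, members), including groups with no survivors.
summary = {key[0]: (alive.get(key[0], 0), m) for key, m in survivors.n_tests.items()}
n_alive = sum(1 for n_surv, _ in summary.values() if n_surv > 0)

print(f"{n_alive} of {len(summary)} families survived")
for label, (n_surv, m) in summary.items():
    print(f"{label}: {n_surv} of {m} factors")
```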
When not to reach for bhy_hierarchical
| Real intent | Reach for | Why |
|---|---|---|
| No natural group structure | bhy | The grouping is real or it isn't; faking a group axis trivializes the procedure. |
| "Factor X passes in every condition" | partial_conjunction with min_pass == m | Hierarchical is "group-then-within", not "joint across conditions". |
| Flat BHY split by family for display only | bhy(expand_over=["family"]) | Independent step-ups per bucket, no group-level inference. Use when you do not need a "this family has signal" answer. |
| Mixed-sign factors in one bucket | Split the bucket / pre-orthogonalize | Within-group Simes assumes positive regression dependence on a subset (PRDS); structurally opposite factors (e.g. momentum + reversal in one group) can violate it. |
Validation summary
| Trigger | Outcome |
|---|---|
| group shadows an identity field (factor_id / forward_periods) | UserInputError. |
| group key missing from a profile's context | UserInputError. |
| Only one distinct group value across input | UserInputError; points at bhy. |
| Every profile is its own group at \(n \ge 3\) (group axis near-unique) | UserInputError; pick a coarser categorical. |
| Duplicate (identity, group_value) partition key | UserInputError. |
| More than half of input groups contain a single profile | RuntimeWarning: inner BHY on \(n=1\) is a raw cutoff. |
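The checks in the table can be read as the following sketch. It is a simplified stand-in, not factrix's validator: profiles are reduced to (factor_id, context) pairs, `UserInputError` is redefined locally, \(n\) is read as the number of input profiles, and identity is approximated by factor_id alone. All of these are assumptions made for illustration.

```python
import warnings
from collections import Counter


class UserInputError(ValueError):
    """Local stand-in for the library's error type (assumption)."""


IDENTITY_FIELDS = {"factor_id", "forward_periods"}


def validate_groups(rows, group):
    """rows: list of (factor_id, context) pairs. Returns the group sizes if valid."""
    if group in IDENTITY_FIELDS:
        raise UserInputError(f"group={group!r} shadows an identity field")
    missing = [fid for fid, ctx in rows if group not in ctx]
    if missing:
        raise UserInputError(f"group key {group!r} missing from context of {missing}")
    keys = [(fid, ctx[group]) for fid, ctx in rows]
    if len(set(keys)) != len(keys):
        raise UserInputError("duplicate (identity, group_value) partition key")
    sizes = Counter(ctx[group] for _, ctx in rows)
    if len(sizes) == 1:
        raise UserInputError("only one distinct group value; flat bhy fits this input")
    if len(rows) >= 3 and len(sizes) == len(rows):
        raise UserInputError("every profile is its own group; pick a coarser categorical")
    singletons = sum(1 for n in sizes.values() if n == 1)
    if singletons > len(sizes) / 2:
        # Singleton groups receive no correction at either layer.
        warnings.warn("more than half of groups hold a single profile", RuntimeWarning)
    return dict(sizes)


rows = [
    ("mom_1m", {"family": "momentum"}),
    ("mom_12m", {"family": "momentum"}),
    ("pb", {"family": "value"}),
    ("pe", {"family": "value"}),
]
sizes = validate_groups(rows, group="family")
print(sizes)
```

Ordering the hard errors before the warning means a mostly-singleton input is only flagged once it has passed every structural check.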
Caveats
- Simes outer representative: not exposed as a kwarg. Dominates Bonferroni-min under PRDS; Edgington-style mean-p has no valid null distribution and is rejected.
- PRDS within group: Simes is valid under positive regression dependence, typical for factors within one family that share style exposure. If a group mixes structurally opposite factors (e.g. momentum + reversal in one bucket), the within-group PRDS assumption can fail; split the group or pre-orthogonalize.
- Pre-filtered input: bhy_hierarchical assumes the input is the candidate family. If profiles came from upstream pre-filtering (e.g. top-50 of 500 candidates), the FDR claim does not cover the full screening pipeline; track \(K\) per the experiment-log discipline.
- Composed FDR is approximate at exact \(q\): Yekutieli 2008 bounds group-level FDR \(\le q\) and within-group FDR \(\le q\) conditional on group passing; the composed per-hypothesis FDR under PRDS is bounded but not exactly \(q\). Researcher claims should be "FDR-controlled at \(q\) in each layer", not "joint FDR \(= q\)".
References
- [S1986] Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73(3), 751–754.
- [Y2008] Yekutieli, D. (2008). Hierarchical false discovery rate-controlling methodology. JASA, 103(481), 309–316.
- [NBER34050] NBER WP 34050 (2025). Hierarchical Multiple Testing in Empirical Asset Pricing.