Multi-factor screening
Apply Benjamini-Hochberg-Yekutieli (BHY) false discovery rate (FDR) control to a batch of candidate factors. Demonstrates the explicit-family contract and the duplicate-identity defense — the behaviour that is impossible to learn from docstrings alone.
Runnable notebook: examples/multi_factor_screening.ipynb.
Factor type¶
This recipe uses multi_factor.bhy(...) over a list of FactorProfile
objects. Each profile is produced by evaluate(panel, cfg,
factor_col=<name>) for any registered cell — screening is
factor-type-agnostic.
The input list is the family. Each profile must carry a unique
identity = (factor_id, forward_periods) — the anti-shopping defense.
The recommended path: name the factor column distinctly per candidate
panel and pass it via factor_col=; evaluate auto-stamps factor_id
from that name. dataclasses.replace(profile, factor_id=...) is an
escape hatch for cases where you cannot rename the column.
Pass expand_over=[<context key>] to declare per-bucket independent
step-ups (Benjamini & Bogomolov 2014 selective inference). Mixing
forward_periods without expand_over emits RuntimeWarning —
different horizons carry different null distributions and pooling
silently inflates FDR.
Literature: Benjamini & Yekutieli (2001).
Use this when¶
- You have ≥2 candidate factors and want type-I error control under multiple testing.
- Candidates are evaluable under the same
AnalysisConfig(same scope, signal, metric, forward horizon). Mixed cells / horizons in one call are caller's responsibility — see § 4 below. - You prefer FDR control (BHY) over family-wise error (Bonferroni) — appropriate when discoveries can tolerate a known false-positive rate and power matters.
What it tests¶
For an input family of size N, BHY step-up keeps profiles whose
ranked p-values satisfy p_(k) ≤ q · k / (N · c(N)) where
c(N) = Σ 1/i (Benjamini-Yekutieli adjustment for arbitrary
dependence). Pass your nominal q directly — the c(N) correction
is baked in. Cross-family aggregation is not performed; that is the
user's responsibility and is deliberately out of scope.
Output to read¶
len(survivors)(=len(survivors.profiles)) vslen(profiles)— coarse hit rate.survivors.profiles[i].primary_pand.factor_id— which factor cleared the family-adjusted bar.survivors.adj_p[i]— the BHY-adjusted p-value driving the survivor decision (survivor[i] iff adj_p[i] <= q). Computed bucket-locally whenexpand_overis set.survivors.n_tests— per-bucketmfed into each step-up; audit trail for two-stage screening.- Any emitted
RuntimeWarning— flags mixedforward_periodswithoutexpand_over(FDR-inflation foot-gun) or singletonexpand_overbuckets (BHY on n=1 is a raw cutoff).
1. Setup¶
import factrix as fx
import numpy as np
import polars as pl
from factrix.preprocess import compute_forward_return
2. Build a single-family batch¶
Five candidate factors, all under the same
individual_continuous(IC, forward_periods=5) cell — a valid BHY
input where the step-up actually controls FDR.
We start from one ground-truth factor and add increasing independent and identically distributed (IID) noise
to produce variants with varying signal strengths. Each variant is
materialised under its own column name (variant_0 … variant_4)
so evaluate(..., factor_col=name) auto-stamps a distinct
factor_id per profile — no post-hoc identity surgery needed.
raw = fx.datasets.make_cs_panel(
n_assets=100,
n_dates=500,
ic_target=0.08,
seed=2024,
)
panel = compute_forward_return(raw, forward_periods=5)
cfg = fx.AnalysisConfig.individual_continuous(
metric=fx.Metric.IC,
forward_periods=5,
)
def variant_panel(
base: pl.DataFrame, *, name: str, scale: float, seed: int
) -> pl.DataFrame:
"""Add IID noise on top of the ground-truth factor and store under a fresh column name."""
rng = np.random.default_rng(seed)
noisy = base["factor"].to_numpy() + scale * rng.standard_normal(base.height)
return base.drop("factor").with_columns(pl.Series(name, noisy))
candidates = {
f"variant_{i}": variant_panel(
panel, name=f"variant_{i}", scale=0.5 + 0.3 * i, seed=100 + i
)
for i in range(5)
}
profiles = [fx.evaluate(p, cfg, factor_col=name) for name, p in candidates.items()]
# Raw primary_p only — FDR-controlled decisions wait for BHY in §3.
# Per-factor `primary_p < 0.05` thresholding on N candidates is the
# spec-search anti-pattern factrix explicitly avoids.
for prof in profiles:
print(f" {prof.factor_id:12s} primary_p={prof.primary_p:.4g}")
3. Apply BHY¶
The input list is the family. bhy runs one Benjamini-Yekutieli
step-up over all profiles and returns a Survivors container —
.profiles in input order, .adj_p aligned, .q / .expand_over /
.n_tests for audit. Survivor membership is definitionally
adj_p <= q. The c(N) correction is applied internally; pass your
nominal q.
survivors = fx.multi_factor.bhy(profiles, q=0.05)
print(f"BHY survivors: {len(survivors)} / {len(profiles)}")
for prof, adj in zip(survivors.profiles, survivors.adj_p, strict=True):
print(f" {prof.factor_id:12s} primary_p={prof.primary_p:.4g} adj_p={adj:.4g}")
Illustrative output:
BHY survivors: 5 / 5
variant_0 primary_p=2.136e-39 adj_p=2.439e-38
variant_1 primary_p=4.052e-26 adj_p=2.313e-25
variant_2 primary_p=6.682e-22 adj_p=2.543e-21
variant_3 primary_p=3.987e-17 adj_p=1.138e-16
variant_4 primary_p=7.407e-14 adj_p=1.691e-13
In Jupyter, survivors additionally renders as a three-column
identity | primary_p | adj_p table via Survivors._repr_html_.
4. Duplicate-identity defense¶
evaluate() defaults factor_col="factor", which means every
profile built via the lazy path lands at identity = ("factor",
forward_periods). Pass two such profiles to bhy() and the
family-resolution layer raises UserInputError
rather than silently treating distinct candidates as one hypothesis.
The canonical fix is what § 2 already does: distinct column name per
panel + evaluate(..., factor_col=name). When you cannot rename the
column (e.g. because the panel comes from an upstream loader you do
not own), dataclasses.replace(profile, factor_id=name) is the
escape hatch.
Below we deliberately take the lazy path to surface the error.
import dataclasses
cfg_fm = fx.AnalysisConfig.individual_continuous(
metric=fx.Metric.FM,
forward_periods=5,
)
# Lazy path: both calls default factor_col="factor" → identity collide.
unstamped = [
fx.evaluate(panel, cfg), # IC, factor_id="factor"
fx.evaluate(panel, cfg_fm), # FM, factor_id="factor" — same identity
]
try:
fx.multi_factor.bhy(unstamped, q=0.05)
except fx.UserInputError as exc:
print("UserInputError raised as expected:")
print(str(exc))
# Canonical fix: pass factor_col= on each evaluate call. Requires distinct
# column names; here we rename the same panel's "factor" column upfront.
stamped = [
fx.evaluate(panel.rename({"factor": "ic_var"}), cfg, factor_col="ic_var"),
fx.evaluate(panel.rename({"factor": "fm_var"}), cfg_fm, factor_col="fm_var"),
]
print(f"\ncanonical fix → identities: {[p.factor_id for p in stamped]}")
# Escape hatch: when you cannot rename the column, post-hoc stamp via
# dataclasses.replace. Identical end state, slightly more imperative.
escape = [
dataclasses.replace(fx.evaluate(panel, cfg), factor_id="ic_var"),
dataclasses.replace(fx.evaluate(panel, cfg_fm), factor_id="fm_var"),
]
print(f"escape hatch → identities: {[p.factor_id for p in escape]}")
Illustrative output:
UserInputError raised as expected:
bhy(): invalid profiles=('factor', 5)
Expected: unique partition key across input; duplicate first seen at index 0, again at 1. Stamp distinct factor_id per profile via `evaluate(..., factor_col=<name>)` (canonical) or `dataclasses.replace(profile, factor_id=<name>)` (escape hatch when the column cannot be renamed); or pass `expand_over=[<context key>]` to declare per-bucket families
Docs: https://awwesomeman.github.io/factrix/api/bhy#partition-key
canonical fix → identities: ['ic_var', 'fm_var']
escape hatch → identities: ['ic_var', 'fm_var']
5. Where to go next¶
For the broader n_assets × factory behaviour matrix see Guides §
Panel vs timeseries; for the BHY
contract specifically see Guides § Batch
screening.