factrix.metrics.spanning ¶
Spanning regression: single-factor test and multi-factor selection.
- spanning_alpha: does a single factor have alpha after controlling for base factors? Standard factor-research tool (Barillas & Shanken, 2017).
- greedy_forward_selection: given a pool of PASS factors, iteratively select those with incremental alpha (Stage 2).
Both operate on factor-return time series (quantile spread series), not on the information coefficient (IC).
Notes
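Pipeline. Regress the factor-return time series on the base-factor returns (time-series step), then t-test alpha. spanning_alpha reports the plain OLS t on alpha and leaves Newey-West (NW) heteroskedasticity-and-autocorrelation-consistent (HAC) standard errors to the caller when the input series overlap. The greedy stepwise selection variant inflates t-stats and is not for inference.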
References
- Barillas & Shanken (2017), "Which Alpha?"
- Feng, Giglio & Xiu (2020), "Taming the Factor Zoo: A Test of New Factors."
factrix.metrics.spanning.SpanningResult dataclass ¶
Result of a single spanning regression for one candidate factor.
factrix.metrics.spanning.ForwardSelectionResult dataclass ¶
ForwardSelectionResult(selected_factors: list[SpanningResult] = list(), eliminated_factors: list[SpanningResult] = list(), all_candidates: list[SpanningResult] = list())
Output of greedy forward selection.
t_stats_inference_invalid is fixed at True. Stepwise selection
searches over the candidate pool and picks by |alpha|, so the
t-statistics on selected_factors and eliminated_factors are
conditioned on the selection step. They do not follow a
t-distribution under the null and must not be used for inference
(White (2000); Harvey-Liu-Zhu (2016)).
For post-selection significance, re-evaluate
survivors on a held-out sample or with a bootstrap.
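The held-out re-evaluation suggested above can be sketched with NumPy alone. This is an illustration under stated assumptions, not factrix code: the candidates are synthetic pure-noise series, selection is by the largest |t| on the first half of the sample, and `mean_t` is a hypothetical helper.

```python
import numpy as np

def mean_t(x):
    """One-sample t-stat of the mean (hypothetical helper, not a factrix function)."""
    return x.mean() / (x.std(ddof=1) / np.sqrt(x.shape[0]))

rng = np.random.default_rng(0)
n_trials, pool_size, n_dates = 200, 50, 500
in_sample, held_out = [], []
for _ in range(n_trials):
    # Null world: no candidate has any true alpha.
    pool = rng.normal(0.0, 0.01, size=(pool_size, n_dates))
    train, test = pool[:, : n_dates // 2], pool[:, n_dates // 2 :]
    t_train = np.array([mean_t(row) for row in train])
    winner = np.argmax(np.abs(t_train))            # selection uses the first half only
    in_sample.append(abs(t_train[winner]))         # t-stat conditioned on being chosen
    held_out.append(abs(mean_t(test[winner])))     # same factor, untouched second half

print(f"mean |t| in-sample: {np.mean(in_sample):.2f}")
print(f"mean |t| held-out:  {np.mean(held_out):.2f}")
```

The in-sample figure is inflated by the search over the pool; the held-out figure is not, because the second half played no part in choosing the winner.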
factrix.metrics.spanning.spanning_alpha ¶
spanning_alpha(factor_spread: DataFrame, base_spreads: dict[str, DataFrame] | None = None) -> MetricOutput
Test whether a factor has alpha after controlling for base factors.
Runs: factor_spread = alpha + beta_1 * base_1 + ... + epsilon
If alpha is significantly different from zero, the factor provides incremental value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| factor_spread | DataFrame | Spread-series DataFrame with (date, spread) columns. | required |
| base_spreads | dict[str, DataFrame] \| None | Mapping of base factor name → spread-series DataFrame with (date, spread) columns. | None |
Returns:
| Type | Description |
|---|---|
| MetricOutput | MetricOutput with value=alpha, t_stat, significance. |
Notes
Run ordinary least squares (OLS) r_t = alpha + sum_k beta_k * base_k(t) + eps_t on
common-date intersected spread series. Test H0: alpha = 0 via
t = alpha / SE(alpha) from the OLS covariance.
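As a rough illustration of the regression and t-test just described, here is a NumPy-only sketch. It assumes the spread series are already date-aligned arrays; `ols_alpha_t` and the synthetic data are hypothetical and do not reproduce the factrix implementation.

```python
import numpy as np

def ols_alpha_t(y, X):
    """Return (alpha, t_stat) from OLS of y on an intercept plus the columns of X."""
    n = y.shape[0]
    design = np.column_stack([np.ones(n), X])         # first column is the intercept
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = n - design.shape[1]
    sigma2 = resid @ resid / dof                       # unbiased residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)    # classical OLS covariance
    alpha = coef[0]
    return alpha, alpha / np.sqrt(cov[0, 0])

# Synthetic, already date-aligned series (assumption: real inputs are intersected first).
rng = np.random.default_rng(0)
n_dates = 500
base = rng.normal(0.0, 0.01, size=(n_dates, 3))        # three base-factor spreads
true_alpha = 0.003
candidate = true_alpha + base @ np.array([0.5, -0.2, 0.1]) + rng.normal(0.0, 0.01, n_dates)

alpha, t_stat = ols_alpha_t(candidate, base)
print(f"alpha={alpha:.4f}  t={t_stat:.2f}")
```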
factrix uses plain OLS standard errors here rather than Newey-West (NW) heteroskedasticity-and-autocorrelation-consistent (HAC) standard errors: the inputs are non-overlapping quantile spreads (single-period stride), so the MA(h-1) autocorrelation that overlapping h-period returns would induce is absent. Callers feeding overlapping series should either pre-resample to a non-overlapping stride or wrap the call with their own HAC SE.
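For callers who do feed overlapping series, one possible way to compute a HAC t-stat by hand is sketched below. It is a NumPy implementation of the Bartlett-kernel (Newey-West) estimator without small-sample correction; `newey_west_alpha_t` is a hypothetical helper, not part of factrix, and the overlapping series is synthetic.

```python
import numpy as np

def newey_west_alpha_t(y, X, lags):
    """OLS alpha with a Bartlett-kernel (Newey-West) HAC t-stat; lags=0 gives White SE."""
    n = y.shape[0]
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    scores = design * resid[:, None]                   # x_t * e_t, one row per date
    meat = scores.T @ scores                           # lag-0 term
    for lag in range(1, lags + 1):
        weight = 1.0 - lag / (lags + 1.0)              # Bartlett weight
        gamma = scores[lag:].T @ scores[:-lag]         # lag-`lag` autocovariance of scores
        meat += weight * (gamma + gamma.T)
    bread = np.linalg.inv(design.T @ design)
    cov = bread @ meat @ bread                         # sandwich covariance
    return coef[0], coef[0] / np.sqrt(cov[0, 0])

# Overlapping h-period sums of i.i.d. one-period returns have MA(h-1) autocorrelation.
rng = np.random.default_rng(1)
h = 5
one_period = rng.normal(0.0, 0.01, 1000)
overlapping = np.convolve(one_period, np.ones(h), mode="valid")   # rolling h-period sums
no_controls = np.empty((overlapping.shape[0], 0))                 # intercept-only regression

_, t_hac = newey_west_alpha_t(overlapping, no_controls, lags=h - 1)
_, t_naive = newey_west_alpha_t(overlapping, no_controls, lags=0)
print(f"|t| ignoring overlap: {abs(t_naive):.2f}   |t| with HAC: {abs(t_hac):.2f}")
```

The alternative named in the text, pre-resampling, amounts to keeping every h-th observation (for example `overlapping[::h]`) so that plain OLS standard errors apply again.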
References
- Barillas & Shanken (2017). "Which Alpha?" Review of Financial Studies, 30(4), 1316–1338. Spanning-test framework for nested factor models.
Examples:
Build a spread series via
factrix.metrics.quantile.compute_spread_series, then
test its alpha standalone:
>>> import factrix as fx
>>> from factrix.preprocess import compute_forward_return
>>> from factrix.metrics.quantile import compute_spread_series
>>> from factrix.metrics.spanning import spanning_alpha
>>> panel = compute_forward_return(
... fx.datasets.make_cs_panel(n_assets=80, n_dates=180, seed=0),
... forward_periods=5,
... )
>>> spread = compute_spread_series(panel, forward_periods=5)
>>> result = spanning_alpha(spread)
>>> result.name
'spanning_alpha'
factrix.metrics.spanning.greedy_forward_selection ¶
greedy_forward_selection(factor_spreads: dict[str, DataFrame], base_spreads: dict[str, DataFrame] | None = None, significance_threshold: float = 2.0, max_factors: int = 20, suppress_snooping_warning: bool = False) -> ForwardSelectionResult
Greedy forward selection with backward elimination.
WARNING — data snooping / selection bias:
Stepwise selection over a candidate pool of K factors inflates
the per-selected-factor t-stat by an order-statistic factor
(typical estimates 2-4× on K=10-100 pools). The t-stats on
selected_factors are NOT valid for hypothesis testing —
they are conditional on survival, not draws from the t-null.
Use this function as a model-construction helper, not as
an inference tool. For post-selection significance, re-evaluate
the surviving set on a held-out window, or use a White (2000)
Reality Check / Hansen (2005) SPA procedure on the pre-selection
stage. The returned t_stats_inference_invalid=True flag
encodes this contract.
Algorithm
- Start with base factor set (e.g., Size, Value, Momentum spreads).
- For each candidate PASS factor, compute spanning alpha.
- Select the candidate with the largest |alpha|, provided its |t-stat| >= threshold.
- After each addition, backward-check all selected factors: re-run spanning regression for each against all others. Remove any whose alpha becomes insignificant.
- Repeat until no candidate has significant alpha.
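The steps above can be sketched in plain NumPy as follows. This is a simplified illustration, not the factrix implementation: `alpha_and_t` and `greedy_select` are hypothetical names, inputs are date-aligned arrays, and the sketch assumes a factor dropped by the backward check never re-enters the pool (the source does not say how factrix handles that case).

```python
import numpy as np

def alpha_and_t(y, controls):
    """OLS intercept and its t-stat for y regressed on the given control series."""
    n = y.shape[0]
    design = np.column_stack([np.ones(n)] + list(controls))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (n - design.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(design.T @ design)[0, 0])
    return coef[0], coef[0] / se

def greedy_select(candidates, base, threshold=2.0, max_factors=20):
    selected, dropped = [], set()
    while len(selected) < max_factors:
        controls = list(base.values()) + [candidates[s] for s in selected]
        # Forward step: score each remaining candidate against base + selected.
        scored = {}
        for name, series in candidates.items():
            if name in selected or name in dropped:
                continue
            alpha, t = alpha_and_t(series, controls)
            if abs(t) >= threshold:
                scored[name] = abs(alpha)
        if not scored:
            break
        selected.append(max(scored, key=scored.get))
        # Backward step: re-test every selected factor against base + the others.
        for name in list(selected):
            others = list(base.values()) + [candidates[s] for s in selected if s != name]
            _, t = alpha_and_t(candidates[name], others)
            if abs(t) < threshold:
                selected.remove(name)
                dropped.add(name)          # assumption: dropped factors do not re-enter
    return selected

rng = np.random.default_rng(0)
n = 500
mkt = rng.normal(0.0, 0.01, n)
strong = 0.003 + 0.5 * mkt + rng.normal(0.0, 0.005, n)
second = 0.0015 + rng.normal(0.0, 0.005, n)
noise = rng.normal(0.0, 0.002, n)
clone = 0.5 * strong + (noise - noise.mean())   # spanned by `strong`: no alpha of its own

selected = greedy_select({"strong": strong, "second": second, "clone": clone}, {"mkt": mkt})
print(selected)
```

As with the real function, the t-stats computed along the way are selection-conditioned and serve only to build the set.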
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| factor_spreads | dict[str, DataFrame] | Mapping of factor name → spread-series DataFrame with (date, spread) columns. | required |
| base_spreads | dict[str, DataFrame] \| None | Mapping of base factor name → spread-series DataFrame with (date, spread) columns. | None |
| significance_threshold | float | Minimum \|t-stat\| for selection (default 2.0). | 2.0 |
| max_factors | int | Maximum number of factors to select. | 20 |
| suppress_snooping_warning | bool | Silence the one-shot data-snooping warning. | False |
Returns:
| Type | Description |
|---|---|
| ForwardSelectionResult | ForwardSelectionResult with selected factors in order. |
Notes
Iteratively run spanning_alpha(candidate, base ∪ selected);
add the candidate with the largest |alpha| whose |t| >=
threshold; after each add, re-test all selected factors against
the others and drop any that lose significance. Repeat until the
pool dries up or max_factors is hit.
factrix flags t_stats_inference_invalid = True because the
retained t-stats are conditional on selection — they are
order-statistic inflated and must not be read as draws from the
t-null. Use the result as a model-construction helper; verify
survivors on a held-out window.
References
- White (2000). "A Reality Check for Data Snooping." Econometrica, 68(5), 1097–1126. Bootstrap reality check for data-snooping bias: the canonical correction, which this function does not apply (its t-stats are left inflated by design).
- Harvey, Liu & Zhu (2016). "…and the Cross-Section of Expected Returns." Review of Financial Studies, 29(1), 5–68. Empirical case for raising t-thresholds; section on stepwise-selection bias.
Examples:
Greedy stepwise selection across two candidate spread series.
suppress_snooping_warning=True acknowledges the inflated-t
contract documented on ForwardSelectionResult:
>>> import factrix as fx
>>> from factrix.preprocess import compute_forward_return
>>> from factrix.metrics.quantile import compute_spread_series
>>> from factrix.metrics.spanning import greedy_forward_selection
>>> seeds = [0, 1]
>>> spreads = {
... f"cand_{s}": compute_spread_series(
... compute_forward_return(
... fx.datasets.make_cs_panel(n_assets=80, n_dates=180, seed=s),
... forward_periods=5,
... ),
... forward_periods=5,
... )
... for s in seeds
... }
>>> result = greedy_forward_selection(
... spreads,
... suppress_snooping_warning=True,
... )
>>> result.t_stats_inference_invalid
True
Inputs are factor-return series, not raw panels
spanning is a post-PANEL consumer: both callables operate on
spread-series DataFrames (date, spread) produced by
compute_spread_series (or any equivalent factor-mimicking-portfolio
return series), not the raw (date, asset_id, factor, forward_return)
panel consumed by the other metrics in this cell.
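To make the input shape concrete, here is a minimal sketch of how a top-minus-bottom quantile spread series might be derived from panel data. It is not the compute_spread_series implementation: `quantile_spread` is a hypothetical helper, and the panel is assumed to be held as two wide (n_dates, n_assets) arrays rather than the long (date, asset_id, factor, forward_return) layout.

```python
import numpy as np

def quantile_spread(factor, forward_return, n_groups=5):
    """Top-minus-bottom mean forward return per date; inputs are (n_dates, n_assets)."""
    n_assets = factor.shape[1]
    size = n_assets // n_groups                         # assets in each extreme group
    order = np.argsort(factor, axis=1)                  # ascending factor rank per date
    sorted_ret = np.take_along_axis(forward_return, order, axis=1)
    bottom = sorted_ret[:, :size].mean(axis=1)          # lowest-factor group
    top = sorted_ret[:, -size:].mean(axis=1)            # highest-factor group
    return top - bottom

rng = np.random.default_rng(0)
factor = rng.normal(size=(180, 80))
forward_return = 0.002 * factor + rng.normal(0.0, 0.02, size=(180, 80))

spread = quantile_spread(factor, forward_return)        # one value per date
print(spread.shape, f"{spread.mean():.4f}")
```

The result is one spread value per date, which is the kind of time series both callables in this module consume.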
Use cases¶
- Single-factor incremental alpha: spanning_alpha regresses the candidate spread series on a set of base-factor spread series and tests \(H_0: \alpha = 0\) via the ordinary least squares (OLS) \(t\)-stat. Standard tool for "does this factor add anything beyond the existing model?" (Barillas-Shanken 2017).
- Mean-return test when no base factors: with base_spreads=None (or empty), spanning_alpha collapses to a plain mean-return \(t\)-test on the candidate's spread series, convenient when the question is "is there any alpha here" before pulling in controls.
- Greedy model construction over a pool: greedy_forward_selection iteratively adds the candidate with the largest \(|\alpha|\) above a \(|t|\) threshold, then backward-eliminates any selected factor that loses significance. Use as a model-construction helper only; the returned \(t\)-stats are selection-conditioned and not valid for inference.
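The second use case rests on a simple identity: an intercept-only OLS regression yields the same t-stat as a one-sample t-test on the mean. A short NumPy check on synthetic data (an illustration only, not factrix code) makes this visible.

```python
import numpy as np

rng = np.random.default_rng(0)
spread = rng.normal(0.001, 0.01, 300)          # synthetic candidate spread series
n = spread.shape[0]

# Intercept-only OLS: the design matrix is a single column of ones.
design = np.ones((n, 1))
coef, *_ = np.linalg.lstsq(design, spread, rcond=None)
resid = spread - design @ coef
sigma2 = resid @ resid / (n - 1)               # residual variance with one parameter
t_ols = coef[0] / np.sqrt(sigma2 / n)

# One-sample t-test on the mean.
t_mean = spread.mean() / (spread.std(ddof=1) / np.sqrt(n))

print(np.isclose(t_ols, t_mean))
```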
Stepwise selection inflates t-stats
greedy_forward_selection searches the candidate pool and picks
by \(|\alpha|\); the per-selected-factor \(t\)-stat is order-statistic
inflated (typically 2-4x on pools of 10-100 candidates) and is
not a draw from the \(t\)-null (White 2000; Harvey-Liu-Zhu 2016).
The returned t_stats_inference_invalid=True flag encodes this
contract. For post-selection significance, re-evaluate survivors
on a held-out window, or use a Hansen (2005) SPA / White (2000)
Reality Check on the pre-selection stage.
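The order-statistic effect can be seen in a small Monte Carlo sketch. It makes simplifying assumptions: candidates are independent pure-noise series with no base factors, and the winner is the one with the largest |t| (for equally scaled noise this is close to picking by |alpha|). `mean_max_abs_t` is a hypothetical helper, not a factrix function.

```python
import numpy as np

def mean_max_abs_t(pool_size, n_dates=250, n_trials=200, seed=0):
    """Average over trials of the largest |t| in a pool of pure-noise spreads."""
    rng = np.random.default_rng(seed)
    # Every candidate has a true mean of zero, so any large |t| is luck.
    draws = rng.normal(size=(n_trials, pool_size, n_dates))
    t = draws.mean(axis=2) / (draws.std(axis=2, ddof=1) / np.sqrt(n_dates))
    return np.abs(t).max(axis=1).mean()

baseline = mean_max_abs_t(1)                   # typical |t| with no search at all
inflation = {k: mean_max_abs_t(k) / baseline for k in (10, 100)}
for k, ratio in inflation.items():
    print(f"K={k}: max |t| is about {ratio:.1f}x a single-candidate |t|")
```

The ratio grows with the pool size, which is why a threshold calibrated for a single test is too lenient after a search.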
Choosing a function¶
| Goal | Function |
|---|---|
| Single-factor spanning regression — incremental alpha vs base factors | spanning_alpha |
| Greedy multi-factor selection over a candidate pool (model-construction only) | greedy_forward_selection |
Worked example — single-factor spanning then greedy selection¶
compute_spread_series → spanning_alpha → greedy_forward_selection
import factrix as fx
from factrix.metrics.quantile import compute_spread_series
from factrix.metrics.spanning import (
    spanning_alpha, greedy_forward_selection,
)
from factrix.preprocess import compute_forward_return

# Build a spread series for each factor on the same panel dates.
panels = {
    name: compute_forward_return(
        fx.datasets.make_cs_panel(
            n_assets=200, n_dates=500, ic_target=ic, seed=seed,
        ),
        forward_periods=5,
    )
    for name, ic, seed in [
        ("size", 0.05, 1),
        ("value", 0.06, 2),
        ("momentum", 0.08, 3),
        ("candidate", 0.04, 4),
    ]
}
spreads = {
    name: compute_spread_series(p, forward_periods=5, n_groups=5)
    for name, p in panels.items()
}

# Single-factor: candidate vs the base set
out = spanning_alpha(
    factor_spread=spreads["candidate"],
    base_spreads={k: spreads[k] for k in ("size", "value", "momentum")},
)
print(out.value, out.stat, out.metadata["p_value"], out.metadata["r_squared"])
# 0.0011 1.83 0.068 0.21 (approximate)

# Multi-factor: greedily build a parsimonious set
pool = {k: spreads[k] for k in ("size", "value", "momentum", "candidate")}
sel = greedy_forward_selection(
    factor_spreads=pool,
    significance_threshold=2.0,
    max_factors=4,
    suppress_snooping_warning=True,  # acknowledged: construction-only
)
for s in sel.selected_factors:
    print(s.factor_name, s.alpha, s.t_stat)
# momentum 0.0028 4.10
# value 0.0019 2.51 (approximate; t_stats inflated, do not infer)
See also¶
- compute_spread_series / quantile_spread: produces the per-date spread series consumed here.
- compute_factor_returns (preprocess): any factor-return / spread time series with (date, spread) shape is a valid input; this is the upstream pipeline for the post-PANEL cell.
- slice_pairwise_test / slice_joint_test: cross-slice inference on spanning alphas (Wald \(\chi^2\) + Holm / Romano-Wolf adjusted \(p\)).
- Statistical methods: OLS \(t\) on the alpha; when overlap is added, swap to Newey-West (NW) heteroskedasticity-and-autocorrelation-consistent (HAC) SE via the same kernel discipline used elsewhere in factrix.
- Metric applicability reference: when this metric applies and the post-selection-inference contracts that govern greedy_forward_selection.
- Individual × Continuous landing: adjacent metrics in the same cell.