factrix.metrics.spanning

Spanning regression — single-factor test and multi-factor selection.

spanning_alpha: does a single factor have alpha after controlling for base factors? Standard factor research tool (Barillas-Shanken (2017)).

greedy_forward_selection: given a pool of PASS factors, iteratively select those with incremental alpha (Stage 2).

Both use factor return time series (quantile spread series), not information coefficient (IC).

Notes

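Pipeline. Regress the candidate's factor-return time series on the base-factor returns (time-series step) and test the intercept (alpha) with a t-statistic. spanning_alpha documents plain OLS standard errors for non-overlapping spreads; overlapping series call for a Newey-West (NW) heteroskedasticity-and-autocorrelation-consistent (HAC) standard error. The greedy stepwise selection variant inflates t-stats and is not for inference.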

References

Barillas & Shanken (2017), "Which Alpha?"
Feng, Giglio & Xiu (2020), "Taming the Factor Zoo: A Test of New Factors."

factrix.metrics.spanning.SpanningResult `dataclass`

SpanningResult(factor_name: str, alpha: float, t_stat: float, selected: bool)

Result of a single spanning regression for one candidate factor.

factrix.metrics.spanning.ForwardSelectionResult `dataclass`

ForwardSelectionResult(selected_factors: list[SpanningResult] = list(), eliminated_factors: list[SpanningResult] = list(), all_candidates: list[SpanningResult] = list())

Output of greedy forward selection.

t_stats_inference_invalid: always True. Stepwise selection searches over the candidate pool and picks by |alpha|, so the t-statistics on selected_factors and eliminated_factors are conditioned on having been chosen. They do not follow the t-distribution under the null and must not be used for inference (White (2000); Harvey-Liu-Zhu (2016)). For post-selection significance, re-evaluate survivors on a held-out sample or with a bootstrap.

factrix.metrics.spanning.spanning_alpha

spanning_alpha(factor_spread: DataFrame, base_spreads: dict[str, DataFrame] | None = None) -> MetricOutput

Test whether a factor has alpha after controlling for base factors.

Runs the regression factor_spread = alpha + beta_1 * base_1 + ... + epsilon.

If alpha is significantly different from zero, the factor provides incremental value.

Parameters:

Name	Type	Description	Default
`factor_spread`	`DataFrame`	DataFrame with `date, spread` for the candidate factor.	required
`base_spreads`	`dict[str, DataFrame] \| None`	Mapping of base factor name → DataFrame with `date, spread`. If None or empty, tests whether the factor has nonzero mean return.	`None`

Returns:

Type	Description
`MetricOutput`	MetricOutput with value=alpha, t_stat, significance.

Notes

Run ordinary least squares (OLS) r_t = alpha + sum_k beta_k * base_k(t) + eps_t on common-date intersected spread series. Test H0: alpha = 0 via t = alpha / SE(alpha) from the OLS covariance.
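The regression above can be sketched in a few lines of NumPy. This is a minimal illustration, not factrix's implementation: `ols_alpha` is a hypothetical helper, and it assumes the candidate and base spreads have already been aligned on common dates and converted to arrays.

```python
import numpy as np

def ols_alpha(y, X=None):
    """Return (alpha, t_stat) for y = alpha + X @ beta + eps under plain OLS.

    Hypothetical helper for illustration; inputs are pre-aligned arrays.
    """
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    ones = np.ones((n, 1))
    # With no base factors the design is intercept-only (mean-return test).
    if X is None:
        Z = ones
    else:
        Z = np.column_stack([ones, np.asarray(X, dtype=float).reshape(n, -1)])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    # Classical OLS covariance: sigma^2 * (Z'Z)^{-1}, with n - k degrees of freedom.
    sigma2 = resid @ resid / (n - Z.shape[1])
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    return coef[0], coef[0] / np.sqrt(cov[0, 0])

# Synthetic data: the candidate loads on two base factors and carries a
# per-period alpha of 0.002 (all numbers are made up for the example).
rng = np.random.default_rng(0)
base = rng.normal(0.0, 0.01, size=(500, 2))
cand = 0.002 + base @ np.array([0.5, -0.3]) + rng.normal(0.0, 0.005, size=500)
alpha, t_stat = ols_alpha(cand, base)
```

The first coefficient is the intercept, so `cov[0, 0]` is the variance of alpha and the t-stat is alpha divided by its square root.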

factrix uses plain OLS standard errors here rather than Newey-West (NW) heteroskedasticity-and-autocorrelation-consistent (HAC) ones: the inputs are non-overlapping quantile spreads (single-period stride), so the MA(h-1) autocorrelation induced by overlapping h-period returns is absent. Callers feeding overlapping series, where HAC matters, should either pre-resample to non-overlapping observations or wrap the call with their own HAC standard errors.
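For callers who do feed overlapping series, one way to compute a HAC standard error for alpha is the Newey-West sandwich estimator with Bartlett weights. The sketch below is an assumption about how such a wrapper might look, not part of factrix: `hac_alpha` and its `lags` parameter are hypothetical, and no small-sample correction is applied.

```python
import numpy as np

def hac_alpha(y, X=None, lags=4):
    """Return (alpha, se_alpha) using a Newey-West (Bartlett kernel) covariance.

    Hypothetical helper; lags=0 drops all autocovariance terms.
    """
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    ones = np.ones((n, 1))
    if X is None:
        Z = ones
    else:
        Z = np.column_stack([ones, np.asarray(X, dtype=float).reshape(n, -1)])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    # Per-date score contributions z_t * e_t.
    scores = Z * resid[:, None]
    # Lag-0 term plus Bartlett-weighted autocovariances of the scores.
    S = scores.T @ scores
    for lag in range(1, lags + 1):
        weight = 1.0 - lag / (lags + 1.0)
        gamma = scores[lag:].T @ scores[:-lag]
        S += weight * (gamma + gamma.T)
    bread = np.linalg.inv(Z.T @ Z)
    cov = bread @ S @ bread
    return coef[0], np.sqrt(cov[0, 0])

# Overlapping 5-day returns built from synthetic daily returns:
# adjacent observations share four daily returns, so they are MA(4).
rng = np.random.default_rng(0)
daily = rng.normal(0.0005, 0.01, size=1004)
overlapping = np.convolve(daily, np.ones(5), mode="valid")
_, se_no_lags = hac_alpha(overlapping, lags=0)
alpha, se_hac = hac_alpha(overlapping, lags=4)
```

On positively autocorrelated input like this, `se_hac` comes out larger than `se_no_lags`, which is why a t-stat that ignores the overlap overstates significance.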

References

Barillas & Shanken (2017). "Which Alpha?" Review of Financial Studies, 30(4), 1316–1338. Spanning-test framework for nested factor models.

Examples:

Build a spread series via `factrix.metrics.quantile.compute_spread_series`, then test its alpha standalone:

>>> import factrix as fx
>>> from factrix.preprocess import compute_forward_return
>>> from factrix.metrics.quantile import compute_spread_series
>>> from factrix.metrics.spanning import spanning_alpha
>>> panel = compute_forward_return(
...     fx.datasets.make_cs_panel(n_assets=80, n_dates=180, seed=0),
...     forward_periods=5,
... )
>>> spread = compute_spread_series(panel, forward_periods=5)
>>> result = spanning_alpha(spread)
>>> result.name
'spanning_alpha'

factrix.metrics.spanning.greedy_forward_selection

greedy_forward_selection(factor_spreads: dict[str, DataFrame], base_spreads: dict[str, DataFrame] | None = None, significance_threshold: float = 2.0, max_factors: int = 20, suppress_snooping_warning: bool = False) -> ForwardSelectionResult

Greedy forward selection with backward elimination.

WARNING — data snooping / selection bias: Stepwise selection over a candidate pool of K factors inflates the per-selected-factor t-stat by an order-statistic factor (typical estimates 2-4× on K=10-100 pools). The t-stats on selected_factors are NOT valid for hypothesis testing — they are conditional on survival, not draws from the t-null. Use this function as a model-construction helper, not as an inference tool. For post-selection significance, re-evaluate the surviving set on a held-out window, or use a White (2000) Reality Check / Hansen (2005) SPA procedure on the pre-selection stage. The returned t_stats_inference_invalid=True flag encodes this contract.

Algorithm

Start with base factor set (e.g., Size, Value, Momentum spreads).
For each candidate PASS factor, compute spanning alpha.
Select the candidate with the largest |alpha|, provided its |t-stat| >= threshold.
After each addition, backward-check all selected factors: re-run spanning regression for each against all others. Remove any whose alpha becomes insignificant.
Repeat until no candidate has significant alpha.
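The steps above can be sketched with NumPy as follows. This is an illustrative reading of the algorithm, not factrix's code: `ols_alpha` and `greedy_select` are hypothetical names, series are assumed to be pre-aligned arrays, and the sketch assumes a factor dropped in the backward step is not returned to the candidate pool (the source does not say either way).

```python
import numpy as np

def ols_alpha(y, controls):
    """Intercept and its plain-OLS t-stat from regressing y on the controls."""
    Z = np.column_stack([np.ones(len(y)), *controls])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    cov = (resid @ resid / (len(y) - Z.shape[1])) * np.linalg.inv(Z.T @ Z)
    return coef[0], coef[0] / np.sqrt(cov[0, 0])

def greedy_select(candidates, base=None, threshold=2.0, max_factors=20):
    base_series = list((base or {}).values())
    pool = list(candidates)
    selected, eliminated = [], []
    while pool and len(selected) < max_factors:
        controls = base_series + [candidates[s] for s in selected]
        # Forward step: among candidates with |t| >= threshold, take the largest |alpha|.
        best_name, best_alpha = None, 0.0
        for name in pool:
            alpha, t = ols_alpha(candidates[name], controls)
            if abs(t) >= threshold and abs(alpha) > abs(best_alpha):
                best_name, best_alpha = name, alpha
        if best_name is None:
            break  # no remaining candidate has significant alpha
        pool.remove(best_name)
        selected.append(best_name)
        # Backward step: re-test each selected factor against all the others.
        for name in list(selected):
            others = base_series + [candidates[s] for s in selected if s != name]
            _, t = ols_alpha(candidates[name], others)
            if abs(t) < threshold:
                selected.remove(name)
                eliminated.append(name)
    return selected, eliminated

# Synthetic pool: "copy" is spanned by "strong"; "noise" has zero mean.
# Noise terms are demeaned so the example does not hinge on a lucky draw.
rng = np.random.default_rng(0)
n = 600

def centered(scale):
    x = rng.normal(0.0, scale, size=n)
    return x - x.mean()

strong = 0.003 + centered(0.01)
candidates = {
    "strong": strong,
    "copy": 0.5 * strong + centered(0.002),
    "noise": centered(0.01),
}
selected, eliminated = greedy_select(candidates)
```

In this toy pool only `strong` should survive: `copy` has a sizeable standalone mean but no alpha once `strong` is a control.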

Parameters:

Name	Type	Description	Default
`factor_spreads`	`dict[str, DataFrame]`	Mapping of factor name → DataFrame with `date, spread`.	required
`base_spreads`	`dict[str, DataFrame] \| None`	Mapping of base factor name → DataFrame with `date, spread`. If None, starts with an empty base.	`None`
`significance_threshold`	`float`	Minimum \|t-stat\| for selection (default 2.0).	`2.0`
`max_factors`	`int`	Maximum number of factors to select.	`20`
`suppress_snooping_warning`	`bool`	Silence the one-shot `UserWarning`. Only set when the caller has explicitly acknowledged that the returned t-stats are for model-construction, not inference.	`False`

Returns:

Type	Description
`ForwardSelectionResult`	ForwardSelectionResult with selected factors in order.

Notes

Iteratively run spanning_alpha(candidate, base ∪ selected); add the candidate with the largest |alpha| whose |t| >= threshold; after each add, re-test all selected factors against the others and drop any that lose significance. Repeat until the pool dries up or max_factors is hit.

factrix flags t_stats_inference_invalid = True because the retained t-stats are conditional on selection — they are order-statistic inflated and must not be read as draws from the t-null. Use the result as a model-construction helper; verify survivors on a held-out window.
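A small simulation illustrates both the inflation and the held-out check. It is a sketch under simplifying assumptions: every candidate is pure noise (true alpha is zero), there are no base factors, and the winner is picked by |t| rather than |alpha| for brevity. `mean_tstat` is a hypothetical helper.

```python
import numpy as np

def mean_tstat(x):
    """t-stat of the sample mean down each column (intercept-only test)."""
    return x.mean(axis=0) / (x.std(axis=0, ddof=1) / np.sqrt(x.shape[0]))

rng = np.random.default_rng(0)
n_trials, n_dates, n_candidates = 200, 400, 20
in_sample, held_out = [], []
for _ in range(n_trials):
    # All candidates are zero-mean noise, so no factor has real alpha.
    spreads = rng.normal(0.0, 0.01, size=(n_dates, n_candidates))
    train, test = spreads[: n_dates // 2], spreads[n_dates // 2:]
    t_train = mean_tstat(train)
    winner = np.argmax(np.abs(t_train))             # selection step
    in_sample.append(abs(t_train[winner]))          # conditioned on being picked
    held_out.append(abs(mean_tstat(test)[winner]))  # same factor, unseen dates

avg_in_sample = float(np.mean(in_sample))
avg_held_out = float(np.mean(held_out))
```

Comparing `avg_in_sample` with `avg_held_out` shows the gap between a selection-conditioned |t| and the same factor's |t| on dates the selection never saw.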

References

White (2000). "A Reality Check for Data Snooping." Econometrica, 68(5), 1097–1126. Bootstrap reality check for data-snooping bias; the canonical correction, which this function does not apply (its t-stats are inflated by design).
Harvey, Liu & Zhu (2016). "…and the Cross-Section of Expected Returns." Review of Financial Studies, 29(1), 5–68. Empirical case for raising t-thresholds; section on stepwise-selection bias.

Examples:

Greedy stepwise selection across two candidate spread series. suppress_snooping_warning=True acknowledges the inflated-t contract documented on `ForwardSelectionResult`:

>>> import factrix as fx
>>> from factrix.preprocess import compute_forward_return
>>> from factrix.metrics.quantile import compute_spread_series
>>> from factrix.metrics.spanning import greedy_forward_selection
>>> seeds = [0, 1]
>>> spreads = {
...     f"cand_{s}": compute_spread_series(
...         compute_forward_return(
...             fx.datasets.make_cs_panel(n_assets=80, n_dates=180, seed=s),
...             forward_periods=5,
...         ),
...         forward_periods=5,
...     )
...     for s in seeds
... }
>>> result = greedy_forward_selection(
...     spreads,
...     suppress_snooping_warning=True,
... )
>>> result.t_stats_inference_invalid
True

Inputs are factor-return series, not raw panels

spanning is a post-PANEL consumer: both callables operate on spread-series DataFrames (date, spread) produced by compute_spread_series (or any equivalent factor-mimicking-portfolio return series), not the raw (date, asset_id, factor, forward_return) panel consumed by the other metrics in this cell.
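To make the input shape concrete, the sketch below aligns a candidate series with base series on their common dates before any regression is run. It uses plain dicts keyed by date instead of DataFrames, purely as an assumption to keep the example dependency-free; `align_on_common_dates` is a hypothetical helper, not a factrix function.

```python
from datetime import date

def align_on_common_dates(candidate, bases):
    """Keep only dates present in the candidate and in every base series.

    candidate: dict mapping date -> spread
    bases: dict mapping base name -> (dict mapping date -> spread)
    """
    common = set(candidate)
    for series in bases.values():
        common &= set(series)
    dates = sorted(common)
    y = [candidate[d] for d in dates]
    # One row per date, one column per base factor (in dict order).
    X = [[bases[name][d] for name in bases] for d in dates]
    return dates, y, X

candidate = {
    date(2024, 1, 1): 0.010,
    date(2024, 1, 2): -0.004,
    date(2024, 1, 3): 0.006,
}
bases = {
    "size": {
        date(2024, 1, 2): 0.002,
        date(2024, 1, 3): -0.001,
        date(2024, 1, 4): 0.003,
    },
}
dates, y, X = align_on_common_dates(candidate, bases)
```

Only the two dates shared by both series survive, so `y` and `X` have matching rows.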

Use cases

Single-factor incremental alpha

spanning_alpha regresses the candidate spread series on a set of base-factor spread series and tests \(H_0: \alpha = 0\) via the ordinary least squares (OLS) \(t\)-stat. It is the standard tool for asking "does this factor add anything beyond the existing model?" (Barillas-Shanken 2017).

Mean-return test when no base factors

With base_spreads=None (or empty), spanning_alpha collapses to a plain mean-return \(t\)-test on the candidate's spread series, which is convenient when the question is "is there any alpha here" before pulling in controls.

Greedy model construction over a pool

greedy_forward_selection iteratively adds the candidate with the largest \(|\alpha|\) above a \(|t|\) threshold, then backward-eliminates any selected factor that loses significance. Use it as a model-construction helper only: the returned \(t\)-stats are selection-conditioned and not valid for inference.
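The mean-return case (no base factors) can be checked by hand: an intercept-only OLS regression has alpha equal to the sample mean and a standard error of s / sqrt(n). The standard-library sketch below shows that arithmetic; `mean_return_tstat` is a hypothetical helper and the spread values are made up.

```python
import math
import statistics

def mean_return_tstat(spread):
    """Return (alpha, t_stat) for the intercept-only regression spread = alpha + eps."""
    n = len(spread)
    alpha = statistics.fmean(spread)
    # Sample standard deviation (n - 1 denominator) over sqrt(n).
    se = statistics.stdev(spread) / math.sqrt(n)
    return alpha, alpha / se

# Eight made-up per-period spread returns.
spread = [0.004, -0.001, 0.003, 0.002, -0.002, 0.005, 0.001, 0.000]
alpha, t_stat = mean_return_tstat(spread)
```

With only eight observations the t-stat is far from conclusive, which is the usual reason to run this test on a full spread history.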

Stepwise selection inflates t-stats

greedy_forward_selection searches the candidate pool and picks by \(|\alpha|\); the per-selected-factor \(t\)-stat is order-statistic inflated (typically 2-4x on pools of 10-100 candidates) and is not a draw from the \(t\)-null (White 2000; Harvey-Liu-Zhu 2016). The returned t_stats_inference_invalid=True flag encodes this contract. For post-selection significance, re-evaluate survivors on a held-out window, or use a Hansen (2005) SPA / White (2000) Reality Check on the pre-selection stage.

Choosing a function

Goal	Function
Single-factor spanning regression — incremental alpha vs base factors	`spanning_alpha`
Greedy multi-factor selection over a candidate pool (model-construction only)	`greedy_forward_selection`

Worked example: single-factor spanning then greedy selection

compute_spread_series → spanning_alpha → greedy_forward_selection

import factrix as fx
from factrix.metrics.quantile import compute_spread_series
from factrix.metrics.spanning import (
    spanning_alpha, greedy_forward_selection,
)
from factrix.preprocess import compute_forward_return

# Build a spread series for each factor on the same panel dates.
panels = {
    name: compute_forward_return(
        fx.datasets.make_cs_panel(
            n_assets=200, n_dates=500, ic_target=ic, seed=seed,
        ),
        forward_periods=5,
    )
    for name, ic, seed in [
        ("size",     0.05, 1),
        ("value",    0.06, 2),
        ("momentum", 0.08, 3),
        ("candidate",0.04, 4),
    ]
}
spreads = {
    name: compute_spread_series(p, forward_periods=5, n_groups=5)
    for name, p in panels.items()
}

# Single-factor: candidate vs the base set
out = spanning_alpha(
    factor_spread = spreads["candidate"],
    base_spreads  = {k: spreads[k] for k in ("size", "value", "momentum")},
)
print(out.value, out.stat, out.metadata["p_value"], out.metadata["r_squared"])
# 0.0011  1.83  0.068  0.21   (approximate)

# Multi-factor: greedy build a parsimonious set
pool = {k: spreads[k] for k in ("size", "value", "momentum", "candidate")}
sel = greedy_forward_selection(
    factor_spreads          = pool,
    significance_threshold  = 2.0,
    max_factors             = 4,
    suppress_snooping_warning = True,  # acknowledged: construction-only
)
for s in sel.selected_factors:
    print(s.factor_name, s.alpha, s.t_stat)
# momentum  0.0028  4.10
# value     0.0019  2.51   (approximate; t_stats inflated, do not infer)
