factrix.metrics.fama_macbeth

Fama-MacBeth regression — FM-canonical metric for the Individual × Continuous cell.

  • compute_fm_betas: per-date cross-sectional ordinary least squares (OLS) → (date, beta) DataFrame.
  • fama_macbeth: Newey-West t-test on the beta series.
  • pooled_ols: pooled OLS with SE clustered by date.
  • beta_sign_consistency: fraction of periods in which beta carries the expected sign.

Notes

Pipeline. Stage 1 runs a per-date cross-sectional OLS and keeps the slope \(\lambda_t\). Stage 2 takes the time series of \(\lambda_t\) and applies a Newey-West (NW) heteroskedasticity-and-autocorrelation-consistent (HAC) \(t\)-test to its mean. The pooled OLS variant instead fits one regression on the stacked panel and clusters SE by date.

References
  • Fama & MacBeth (1973), "Risk, Return, and Equilibrium: Empirical Tests."
  • Newey & West (1987), "A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix."
  • Petersen (2009), "Estimating Standard Errors in Finance Panel Data Sets: Comparing Approaches."

factrix.metrics.fama_macbeth.compute_fm_betas

compute_fm_betas(df: DataFrame, *, factor_col: str = 'factor', return_col: str = 'forward_return') -> DataFrame

Per-date cross-sectional ordinary least squares (OLS): \(R_i = \alpha + \beta \cdot \text{Signal}_i + \varepsilon\).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | Long panel with date, asset_id, factor, forward_return. | required |
| factor_col | str | Column carrying the factor exposure. | 'factor' |
| return_col | str | Column carrying the forward return. | 'forward_return' |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | DataFrame with date, beta (one row per date that admits a finite OLS solution; dates with fewer than 3 observations or a singular design are dropped). |

Notes

Per date \(t\), solve the cross-sectional OLS \(R_{i,t} = \alpha_t + \beta_t \cdot \text{Signal}_{i,t} + \varepsilon_{i,t}\) and emit the slope \(\beta_t\). The output series feeds the stage-2 Newey-West (NW) heteroskedasticity-and-autocorrelation-consistent (HAC) \(t\)-test in fama_macbeth.

factrix drops dates with fewer than 3 cross-sectional observations or a singular design rather than coercing to NaN — this keeps stage-2 a clean t-test on a finite, well-defined series with no NaN propagation in the NW kernel.
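
The stage-1 step and its drop rule can be sketched in plain Python. This is a minimal illustration, not factrix's implementation: the function name `per_date_slopes`, the tuple-based input, and the `min_obs` parameter are all hypothetical, and the sketch does not handle missing values. With a single regressor plus intercept, the OLS slope is the closed form \(\beta_t = S_{xy} / S_{xx}\), and a singular design is simply a constant signal on that date.

```python
from collections import defaultdict


def per_date_slopes(rows, min_obs=3):
    """Per-date OLS slope of forward return on signal (hypothetical helper).

    rows: iterable of (date, signal, forward_return) tuples.
    Returns [(date, beta), ...] for dates that admit a finite slope.
    Dates with fewer than `min_obs` rows or a constant signal are
    skipped outright instead of being emitted as NaN.
    """
    by_date = defaultdict(list)
    for date, x, y in rows:
        by_date[date].append((x, y))

    out = []
    for date in sorted(by_date):
        obs = by_date[date]
        n = len(obs)
        if n < min_obs:
            continue  # too few cross-sectional observations
        xs = [x for x, _ in obs]
        if max(xs) == min(xs):
            continue  # constant signal: design [1, x] is singular
        x_bar = sum(xs) / n
        y_bar = sum(y for _, y in obs) / n
        sxx = sum((x - x_bar) ** 2 for x in xs)
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in obs)
        out.append((date, sxy / sxx))
    return out


rows = [
    ("d1", 1.0, 0.02), ("d1", 2.0, 0.04), ("d1", 3.0, 0.06),  # exact line, slope 0.02
    ("d2", 1.0, 0.01), ("d2", 2.0, 0.03),                      # 2 rows: dropped
    ("d3", 5.0, 0.01), ("d3", 5.0, 0.02), ("d3", 5.0, 0.03),  # constant signal: dropped
]
betas = per_date_slopes(rows)  # only "d1" survives
```

Dropping rather than emitting NaN means the list handed to stage 2 can be averaged and autocorrelated without any masking step.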

References
  • Fama & MacBeth (1973). "Risk, Return, and Equilibrium: Empirical Tests." Journal of Political Economy, 81(3), 607–636. The per-date cross-sectional regression at stage 1 of the FM procedure.

Examples:

>>> import factrix as fx
>>> from factrix.preprocess import compute_forward_return
>>> from factrix.metrics.fama_macbeth import compute_fm_betas
>>> panel = compute_forward_return(
...     fx.datasets.make_cs_panel(n_assets=80, n_dates=180, seed=0),
...     forward_periods=5,
... )
>>> beta_df = compute_fm_betas(panel)
>>> set(beta_df.columns) >= {"date", "beta"}
True

factrix.metrics.fama_macbeth.fama_macbeth

fama_macbeth(beta_df: DataFrame, *, newey_west_lags: int | None = None, forward_periods: int | None = None, is_estimated_factor: bool = False, factor_return_var: float | None = None) -> MetricOutput

Newey-West t-test on FM beta series. \(H_0: \mathrm{mean}(\beta) = 0\).

Parameters:

Name Type Description Default
beta_df DataFrame

DataFrame with date, beta columns (from compute_fm_betas).

required
newey_west_lags int | None

Number of Newey-West (NW) lags. Defaults to \(\lfloor T^{1/3} \rfloor\).

None
forward_periods int | None

Overlap horizon of the regression's forward return. When set, the NW bandwidth is floored at forward_periods - 1 so the kernel is consistent under the MA(\(h-1\)) overlap structure of \(h\)-period returns.

None
is_estimated_factor bool

Set True when the Signal_i column used by compute_fm_betas is itself an estimated quantity (rolling ordinary least squares (OLS) \(\beta\) to another factor, PCA score, ML-predicted score, residual from a first-stage regression). Shanken (1992) shows the naive FM SE ignores sampling error in the regressor, inflating \(t\)-stats. Do NOT set this on raw characteristics (book-to-market, momentum price signal, accounting ratios) — those are observed, not estimated, and enabling the correction will spuriously deflate \(t\)-stats.

Implementation: Shanken (1992) single-factor special case — the NW SE is scaled by \(\sqrt{1 + \hat\lambda^2/\sigma^2_f}\) (Shanken's general multi-factor multiplicative term \(1 + \lambda'\Sigma_f^{-1}\lambda\) collapses to \(1 + \hat\lambda^2/\sigma^2_f\) when there is one factor). factrix's simplification omits the additive \(+\sigma^2_f/T\) term of the full Shanken variance and is therefore only honest for large \(T\).

Note: is_estimated_factor corrects the sampling-error dimension of using an estimated regressor. A separate failure mode — the estimated factor itself being weak or unidentified — produces its own spuriously-significant FM \(t\)-stats and is not addressed by this scaling; see Kan-Zhang (1999) for the useless-factor diagnostic literature.

False
factor_return_var float | None

\(\sigma^2_f\), the time-series variance of the factor-mimicking portfolio return. Prefer supplying this when you have a spread-portfolio return series (the long-short spread actually traded on the signal). When None and is_estimated_factor=True, falls back to \(\mathrm{var}(\hat\beta_t)\) as a rough placeholder: \(\hat\beta_t\) is not the factor-mimicking return but is usually the only readily available series. Because \(\mathrm{var}(\hat\beta_t)\) already absorbs upstream estimation noise, it inflates the denominator of the EIV factor and so deflates the SE correction. Treat the betas_timeseries_proxy result as a lower bound on the true SE inflation, which makes the reported \(t\)-stat an upper bound on the fully corrected one, not a precise estimate.

None

Notes

Stage 2 of FM: \(\overline{\beta} = \mathrm{mean}_t\,\beta_t\); \(t = \overline{\beta} / \mathrm{SE}_{\mathrm{NW}}(\beta)\) with kernel lag \(L = \max(\lfloor T^{1/3} \rfloor,\, h - 1)\). With is_estimated_factor=True, the Shanken (1992) single-factor correction scales SE by \(\sqrt{1 + \overline{\beta}^2 / \sigma^2_f}\).
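
The stage-2 statistic above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not factrix's code: the helper names `floor_cbrt`, `nw_bandwidth`, and `nw_se_of_mean` are hypothetical, the autocovariances are normalized by \(1/T\) with Bartlett weights \(1 - j/(L+1)\) as in Newey & West (1987), and any finite-sample adjustment factrix may apply is not reproduced.

```python
import math


def floor_cbrt(n):
    """Largest integer k with k**3 <= n (guards against float error at cubes)."""
    k = round(n ** (1.0 / 3.0))
    while k ** 3 > n:
        k -= 1
    while (k + 1) ** 3 <= n:
        k += 1
    return k


def nw_bandwidth(T, forward_periods=None):
    """L = max(floor(T^(1/3)), h - 1); the floor on h - 1 applies only if h is given."""
    lag = floor_cbrt(T)
    if forward_periods is not None:
        lag = max(lag, forward_periods - 1)
    return lag


def nw_se_of_mean(series, lag):
    """Newey-West (Bartlett kernel) standard error of the sample mean."""
    T = len(series)
    mean = sum(series) / T
    e = [b - mean for b in series]
    long_run_var = sum(v * v for v in e) / T            # gamma_0
    for j in range(1, lag + 1):
        gamma_j = sum(e[t] * e[t - j] for t in range(j, T)) / T
        weight = 1.0 - j / (lag + 1.0)                  # Bartlett weight
        long_run_var += 2.0 * weight * gamma_j
    return math.sqrt(long_run_var / T)


# Illustrative beta series (made-up numbers, not factrix output).
betas = [0.012, 0.008, -0.003, 0.010, 0.007, 0.011, -0.001, 0.009]
lag = nw_bandwidth(len(betas), forward_periods=3)       # max(floor(8^(1/3)), 3 - 1) = 2
se = nw_se_of_mean(betas, lag)
t_stat = (sum(betas) / len(betas)) / se
```

With `lag=0` the estimator reduces to the population standard deviation divided by \(\sqrt{T}\), which is a convenient sanity check.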

factrix uses the Andrews (1991) \(T^{1/3}\) bandwidth floored against the Hansen-Hodrick overlap horizon rather than the Newey-West (1994) data-adaptive plug-in — simpler, deterministic, and adequate at typical research \(T\). factrix's simplification of the Shanken variance omits the additive \(+\sigma^2_f / T\) term, so the correction is honest only for large \(T\).
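
The single-factor Shanken scaling is a one-line multiplier on the NW SE. The sketch below is a hypothetical helper (`shanken_scaled_se` is not a factrix function) that applies \(\sqrt{1 + \overline{\beta}^2 / \sigma^2_f}\) and shows why an overstated \(\sigma^2_f\) weakens the correction; the numbers are made up for illustration.

```python
import math


def shanken_scaled_se(nw_se, beta_bar, factor_return_var):
    """Scale a NW SE by sqrt(1 + beta_bar^2 / sigma_f^2) (single-factor case)."""
    if factor_return_var <= 0.0:
        raise ValueError("factor_return_var must be positive")
    return nw_se * math.sqrt(1.0 + beta_bar ** 2 / factor_return_var)


# Illustrative inputs only.
nw_se, beta_bar = 2.0, 0.3

# Smaller sigma_f^2 gives a larger multiplier (about 1.25 here).
se_tight = shanken_scaled_se(nw_se, beta_bar, factor_return_var=0.16)

# A noisier variance proxy (4x larger) shrinks the multiplier toward 1,
# so the corrected SE is smaller and the resulting t-stat is larger.
se_noisy = shanken_scaled_se(nw_se, beta_bar, factor_return_var=0.64)
```

This is the mechanism behind treating the \(\mathrm{var}(\hat\beta_t)\) fallback as a lower bound on the SE inflation: any noise that enlarges the variance input pulls the multiplier toward 1.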

References
  • Fama & MacBeth (1973). "Risk, Return, and Equilibrium: Empirical Tests." Journal of Political Economy, 81(3), 607–636. Two-stage λ procedure underlying this test.
  • Newey & West (1987). "A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix." Econometrica, 55(3), 703–708. HAC variance estimator.
  • Andrews (1991). "Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation." Econometrica, 59(3), 817–858. Optimal Bartlett growth rate.
  • Hansen & Hodrick (1980). "Forward Exchange Rates as Optimal Predictors of Future Spot Rates." Journal of Political Economy, 88(5), 829–853. Overlap horizon flooring the kernel.
  • Shanken (1992). "On the Estimation of Beta-Pricing Models." Review of Financial Studies, 5(1), 1–33. Errors-in-variables correction for FM stage-2 t when the regressor is itself estimated.
  • Kan & Zhang (1999). "Two-Pass Tests of Asset Pricing Models with Useless Factors." Journal of Finance, 54(1), 203–235. Useless-factor diagnostic; cited as cautionary background on factor validity beyond the EIV inflation that is_estimated_factor addresses.

Examples:

Chain from compute_fm_betas output:

>>> import factrix as fx
>>> from factrix.preprocess import compute_forward_return
>>> from factrix.metrics.fama_macbeth import compute_fm_betas, fama_macbeth
>>> panel = compute_forward_return(
...     fx.datasets.make_cs_panel(n_assets=80, n_dates=180, seed=0),
...     forward_periods=5,
... )
>>> beta_df = compute_fm_betas(panel)
>>> result = fama_macbeth(beta_df, forward_periods=5)
>>> result.name
'fm_beta'

factrix.metrics.fama_macbeth.pooled_ols

pooled_ols(df: DataFrame, *, factor_col: str = 'factor', return_col: str = 'forward_return', cluster_col: str = 'date', two_way_cluster_col: str | None = None) -> MetricOutput

Pooled ordinary least squares (OLS) with clustered SE — robustness check against FM.

Clustering on date alone catches contemporaneous cross-sectional dependence but misses asset-level persistence; on asset alone the reverse. Petersen (2009) shows panel data usually has both — single-way clusters understate SE by 20-50% in that regime.

FM and pooled OLS with single-way clustering share the same point estimate under a balanced panel but typically disagree on SE; when pooled \(\hat\beta\) and FM \(\hat\lambda\) have opposite signs, profile.diagnose() flags an FM/pooled sign mismatch, a red flag for misspecification.

Short-circuits when \(N < 10\) (no regression is run). Returns stat=None with \(p=1.0\) when the effective number of clusters \(G < 3\) (the clustered SE is undefined with fewer than 3 clusters).

Formula

Point estimate:

\[ [\hat\alpha, \hat\beta] = (X'X)^{-1} X'R \]

where \(X = [1, \text{Signal}]\) stacked across all \((\text{date}, \text{asset})\) observations.

Single-way clustered sandwich SE (default, cluster on cluster_col):

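\[ \mathrm{meat} = \sum_g (X_g' e_g)(X_g' e_g)', \quad V = c \cdot (X'X)^{-1} \cdot \mathrm{meat} \cdot (X'X)^{-1}, \]

where \(X_g\) and \(e_g\) are the regressor rows and residuals of cluster \(g\), with finite-sample correction \(c = \tfrac{G}{G-1} \cdot \tfrac{N-1}{N-K}\) (\(G\) clusters, \(N\) observations, \(K = 2\) regressors), \(\mathrm{SE}(\hat\beta) = \sqrt{V_{1,1}}\) (the slope's diagonal entry, indexing from 0), \(t = \hat\beta / \mathrm{SE}\), \(\mathrm{df} = G - 1\).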

Two-way clustered sandwich SE (when two_way_cluster_col is set — Cameron-Gelbach-Miller (2011) / Petersen (2009)):

\[ V_{\text{two-way}} = V_A + V_B - V_{A \cap B} \]

where \(V_A\), \(V_B\), \(V_{A \cap B}\) are single-way variances clustered on \(A\), on \(B\), and on the intersection cells \((A, B)\). Each component uses its own finite-sample correction. \(\mathrm{df} = \min(G_A, G_B) - 1\) (Thompson (2011)).
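
For the slope alone, the sandwich above has a closed form that is easy to sketch without matrix code: the slope row of \((X'X)^{-1}\) is \([-\bar{x}, 1] / S_{xx}\), so \(V_{1,1} = c \cdot \sum_g \big(\sum_{i \in g} (x_i - \bar{x}) e_i\big)^2 / S_{xx}^2\). The helpers below (`clustered_slope_var`, `two_way_slope_var`) are hypothetical names for illustration and are not factrix's implementation; the data are made up.

```python
from collections import defaultdict


def clustered_slope_var(x, y, clusters, K=2):
    """Pooled OLS slope and its single-way cluster-robust variance.

    Returns (beta, variance, G). Uses c = G/(G-1) * (N-1)/(N-K).
    """
    N = len(x)
    x_bar, y_bar = sum(x) / N, sum(y) / N
    xt = [xi - x_bar for xi in x]                        # demeaned signal
    sxx = sum(v * v for v in xt)
    beta = sum(a * (yi - y_bar) for a, yi in zip(xt, y)) / sxx
    alpha = y_bar - beta * x_bar
    resid = [yi - alpha - beta * xi for xi, yi in zip(x, y)]

    score = defaultdict(float)                           # per-cluster sum of x_tilde * e
    for a, e, g in zip(xt, resid, clusters):
        score[g] += a * e
    G = len(score)
    c = G / (G - 1) * (N - 1) / (N - K)
    var = c * sum(s * s for s in score.values()) / sxx ** 2
    return beta, var, G


def two_way_slope_var(x, y, a, b):
    """V_A + V_B - V_{A and B}, each part with its own correction.

    Returns (beta, variance, df) with df = min(G_A, G_B) - 1. The variance
    can come out negative in small samples, so it is returned as-is
    rather than square-rooted here.
    """
    beta, v_a, g_a = clustered_slope_var(x, y, a)
    _, v_b, g_b = clustered_slope_var(x, y, b)
    _, v_ab, _ = clustered_slope_var(x, y, list(zip(a, b)))
    return beta, v_a + v_b - v_ab, min(g_a, g_b) - 1


# Illustrative 3-date x 4-asset panel (made-up numbers).
dates = [1] * 4 + [2] * 4 + [3] * 4
assets = ["a", "b", "c", "d"] * 3
x = [0.5, -1.2, 0.3, 1.1, -0.7, 0.9, 1.4, -0.2, 0.1, -1.0, 0.8, 0.6]
y = [0.010, -0.020, 0.004, 0.015, -0.012, 0.011,
     0.025, -0.001, 0.006, -0.018, 0.009, 0.013]

beta, var_date, G = clustered_slope_var(x, y, dates)
beta_2w, var_2w, df_2w = two_way_slope_var(x, y, dates, assets)
```

Two quick checks on the sketch: clustering every observation on its own reproduces the HC1 heteroskedasticity-robust variance, and passing the same labels for both dimensions collapses the two-way formula back to the single-way result.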

Notes

Pool (date, asset) rows and run a single OLS \(R = \alpha + \beta \cdot \text{Signal} + \varepsilon\) with the appropriate cluster-robust sandwich covariance described above. Single-way: \(\mathrm{df} = G - 1\) with \(G\) the number of clusters; two-way: \(\mathrm{df} = \min(G_A, G_B) - 1\) per Thompson (2011).

factrix reports stat = None (rather than 0) when G < 3 because the cluster-robust variance is undefined with too few clusters; falling back to a homoskedastic SE in that regime would silently break the panel-correlation guarantee that motivated using clustered SE in the first place.

References
  • Petersen (2009). "Estimating Standard Errors in Finance Panel Data Sets: Comparing Approaches." Review of Financial Studies, 22(1), 435–480. Comparison of FM, clustered, and two-way SE under firm/time correlation.
  • Cameron, Gelbach & Miller (2011). "Robust Inference With Multiway Clustering." Journal of Business & Economic Statistics, 29(2), 238–249. Two-way clustering formula V_AB = V_A + V_B − V_{A∩B}.
  • Thompson (2011). "Simple Formulas for Standard Errors that Cluster by Both Firm and Time." Journal of Financial Economics, 99(1), 1–10. Finite-sample df correction min(G_A, G_B) − 1.

Examples:

>>> import factrix as fx
>>> from factrix.preprocess import compute_forward_return
>>> from factrix.metrics.fama_macbeth import pooled_ols
>>> panel = compute_forward_return(
...     fx.datasets.make_cs_panel(n_assets=80, n_dates=180, seed=0),
...     forward_periods=5,
... )
>>> result = pooled_ols(panel)
>>> result.name
'pooled_beta'

factrix.metrics.fama_macbeth.beta_sign_consistency

beta_sign_consistency(beta_df: DataFrame, *, expected_sign: int = 1) -> MetricOutput

Fraction of FM per-date \(\beta\)s carrying the expected sign — value \(= \mathrm{mean}_t \mathbb{1}\{\mathrm{sign}(\beta_t) = s^\star\}\).

\(\beta_t\) is the per-date ordinary least squares (OLS) \(\beta\) from compute_fm_betas. Range \([0, 1]\); \(1.0\) = \(\beta\) always has the expected sign across periods. Unlike ts_beta_sign_consistency (which symmetrizes via \(\max(p, 1-p)\) where \(p\) is the positive-sign fraction), this one is directional — you must supply the a-priori expected sign. Typical use: paired with a prior on factor direction to check stability.

Short-circuits to NaN when no non-null \(\beta\) observations exist.

Notes

value \(= \mathrm{mean}_t \mathbb{1}\{\mathrm{sign}(\beta_t) = s^\star\}\) over the FM per-date beta series. Range \([0, 1]\); \(1.0\) = beta always agrees with the prior. Descriptive (no formal \(H_0\)); pair with fama_macbeth for inferential significance.

factrix splits this directional check from the symmetric ts_beta_sign_consistency so the two answer different questions: this one requires the caller to commit to a prior sign; the symmetric variant tests cross-asset agreement only.
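
The directional statistic is small enough to sketch directly. The helper below is hypothetical (`sign_consistency` is not the factrix function) and makes one assumption the text does not settle: a \(\beta_t\) of exactly zero counts as a miss for either expected sign.

```python
import math


def sign_consistency(betas, expected_sign=1):
    """Fraction of non-null betas whose sign equals expected_sign (+1 or -1)."""
    if expected_sign not in (1, -1):
        raise ValueError("expected_sign must be +1 or -1")
    valid = [b for b in betas if b is not None and not math.isnan(b)]
    if not valid:
        return float("nan")                  # nothing to score
    # Assumption: beta == 0 matches neither sign and counts as a miss.
    hits = sum(1 for b in valid if b * expected_sign > 0)
    return hits / len(valid)


betas = [0.1, -0.2, 0.3, 0.4]
p_pos = sign_consistency(betas, expected_sign=1)    # 3 of 4 positive -> 0.75
p_neg = sign_consistency(betas, expected_sign=-1)   # 1 of 4 negative -> 0.25

# The symmetric variant mentioned above would report max(p, 1 - p),
# which is 0.75 regardless of which sign the caller expected.
symmetric = max(p_pos, 1.0 - p_pos)
```

The directional and symmetric numbers coincide only when the prior sign matches the majority sign, which is why the directional check needs the caller's prior.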

Examples:

Chain from compute_fm_betas output:

>>> import factrix as fx
>>> from factrix.preprocess import compute_forward_return
>>> from factrix.metrics.fama_macbeth import (
...     compute_fm_betas,
...     beta_sign_consistency,
... )
>>> panel = compute_forward_return(
...     fx.datasets.make_cs_panel(n_assets=80, n_dates=180, seed=0),
...     forward_periods=5,
... )
>>> beta_df = compute_fm_betas(panel)
>>> result = beta_sign_consistency(beta_df, expected_sign=1)
>>> result.name
'beta_sign_consistency'

Use cases

  • Compute per-date FM beta series


    Stage 1 of Fama-MacBeth: per-date cross-sectional ordinary least squares (OLS) slope \(\beta_t\) in \(R_{i,t} = \alpha_t + \beta_t \cdot \text{Signal}_{i,t} + \varepsilon_{i,t}\). Pre-step for fama_macbeth and the descriptive beta_sign_consistency check.

  • Mean-\(\beta\) significance with Newey-West (NW) heteroskedasticity-and-autocorrelation-consistent (HAC) SE


    Stage 2 of Fama-MacBeth: \(t\)-test of \(H_0: \mathbb{E}[\beta_t] = 0\) with Newey-West HAC SE, bandwidth \(\max(\lfloor T^{1/3} \rfloor, h-1)\). Default inferential test for the Individual × Continuous cell.

  • Errors-in-variables correction for estimated signals


    Set is_estimated_factor=True (with factor_return_var= where the factor-mimicking-portfolio return series is available) to apply the Shanken (1992) single-factor EIV correction (the multi-factor multiplicative term collapses to \(1 + \hat\lambda^2/\sigma^2_f\)). Required when the Signal column is itself estimated — rolling beta, PCA score, ML prediction.

  • Pooled OLS robustness check


    pooled_ols runs a single regression across the stacked panel with cluster-robust SE (one-way on date, or two-way with two_way_cluster_col). When pooled \(\hat\beta\) and FM \(\hat\lambda\) disagree in sign, profile.diagnose() flags the FM/pooled sign mismatch, a red flag for misspecification.

Choosing a function

| Goal | Function |
| --- | --- |
| Per-date FM beta table for downstream inspection / slicing | compute_fm_betas |
| Mean-\(\beta\) significance with NW HAC SE (default Stage 2) | fama_macbeth |
| Pooled OLS with cluster-robust SE (one-way on date, or two-way) | pooled_ols |
| Directional stability: fraction of periods with the expected \(\beta\) sign | beta_sign_consistency |

Worked example — per-date FM beta then NW HAC significance

compute_fm_betas → fama_macbeth on a synthetic cross-sectional panel

import factrix as fx
from factrix.metrics.fama_macbeth import compute_fm_betas, fama_macbeth
from factrix.preprocess import compute_forward_return

raw   = fx.datasets.make_cs_panel(
    n_assets=100, n_dates=500, ic_target=0.08, seed=2024,
)
panel = compute_forward_return(raw, forward_periods=5)

beta_df = compute_fm_betas(panel)
print(beta_df.head())
# ┌────────────┬───────────┐
# │ date       ┆ beta      │
# ├────────────┼───────────┤
# │ 2024-01-01 ┆  0.0091   │
# │ 2024-01-02 ┆  0.0077   │
# │ ...        ┆  ...      │
# └────────────┴───────────┘

out = fama_macbeth(beta_df, forward_periods=5)
print(out.value, out.stat, out.metadata["p_value"])
# 0.0084  6.10  1.3e-09   (approximate)

See also