Identity / context

Every FactorProfile carries two structured fields that describe "what hypothesis was tested" separately from "under what sample conditions":

| Field | Type | Meaning |
|---|---|---|
| identity | tuple[str, int] | (factor_id, forward_periods): the hypothesis tuple. |
| context | Mapping[str, Any] | Sample restriction / conditioning dimensions (universe_id, regime_id, future axes). |

Convenience accessors on the profile:

  • profile.factor_id → identity[0]
  • profile.forward_periods → identity[1]
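
To make the shape concrete, here is a minimal stand-in for such a profile. It is a sketch under assumptions, not factrix's class: the name ProfileSketch, the frozen dataclass, and the empty-dict default are illustrative only. Only the two fields and the two accessors come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Mapping


@dataclass(frozen=True)
class ProfileSketch:
    """Hypothetical stand-in for FactorProfile (illustration only)."""

    # The hypothesis tuple: (factor_id, forward_periods).
    identity: tuple[str, int]
    # Sample restriction / conditioning dimensions; empty by default.
    context: Mapping[str, Any] = field(default_factory=dict)

    @property
    def factor_id(self) -> str:
        # Read-only view of identity[0].
        return self.identity[0]

    @property
    def forward_periods(self) -> int:
        # Read-only view of identity[1].
        return self.identity[1]


p = ProfileSketch(identity=("momentum_12_1", 5))
print(p.factor_id, p.forward_periods, dict(p.context))
```

Keeping the accessors as properties means there is exactly one stored copy of the hypothesis tuple, so the two views cannot drift apart.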

Why split them?

The split is the v1 anti-shopping defense for multi-horizon and multi-universe factor research. Putting the wrong axis in identity silently changes which hypotheses are considered "the same family" by multiple-testing correction — and lets researchers walk the family boundary until something looks significant.

| Spec-search variant | API-level guard | Status |
|---|---|---|
| Estimator shopping (swap SE method until significant) | Cell → procedure 1:1 (registry SSOT) | shipped |
| Stat shopping (swap p_stat per factor) | Study-level p_stat= only; per-factor not allowed | shipped |
| Universe shopping (swap large-cap → small-cap) | context["universe_id"] is a sample restriction, not a hypothesis dimension; promotion to family member requires explicit expand_over= | partial: context split shipped; expand_over= lands in #161 |
| Family-scope shopping (multiplicative → per-slice) | No implicit default; expand_over= must be explicit | #161 |
| Horizon shopping (run every forward_periods ∈ {1d, 5d, 1m, 3m, 6m, 12m}, report the smallest p) | forward_periods is part of identity; bhy(profiles) over a horizon sweep auto-forms the full family | shipped (Benjamini-Hochberg-Yekutieli (BHY) family already partitions on forward_periods) |

The path of least resistance — [evaluate(panel, cfg) for cfg in horizon_grid] followed by bhy(profiles) — is also the statistically correct one. Shopping has to actively shrink the profile list, which is visible in code review.
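
The effect of shrinking the profile list can be illustrated without factrix at all. The sketch below is a generic textbook Benjamini-Yekutieli step-up written from scratch; it is not factrix's bhy, the function name by_reject is hypothetical, and the p-values are made up for illustration. It treats the whole horizon sweep as one family, as the table above describes.

```python
def by_reject(p_values, q=0.05):
    """Benjamini-Yekutieli step-up: return one reject flag per p-value."""
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction that makes the procedure valid under dependence.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * q / (m * c_m).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / (m * c_m):
            k_max = rank
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Made-up p-values for one factor, keyed by forward_periods (illustrative).
horizon_p = {1: 0.40, 5: 0.012, 21: 0.22, 63: 0.35, 126: 0.51, 252: 0.64}

# Full sweep as one family versus reporting only the best-looking horizon.
full_family = by_reject(list(horizon_p.values()), q=0.05)
cherry_picked = by_reject([horizon_p[5]], q=0.05)

print(any(full_family), cherry_picked)
```

With these numbers nothing survives correction over the six-horizon family, while the same p = 0.012 passes when it is presented as a family of one. That gap is exactly what the shrunken profile list exposes in review.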

How identity is populated

evaluate() stamps identity from two sources:

profile = evaluate(panel, cfg, factor_col="momentum_12_1")
profile.identity         # ("momentum_12_1", cfg.forward_periods)
profile.factor_id        # "momentum_12_1"
profile.forward_periods  # cfg.forward_periods
  • factor_id ← the factor_col argument (the column name on panel)
  • forward_periods ← cfg.forward_periods

Procedures themselves stay schema-agnostic; the stamp happens once at the dispatch boundary inside _evaluate. factor_id and forward_periods are read-only properties that proxy identity[0] / identity[1]. To override them, replace the whole tuple with dataclasses.replace(p, identity=(new_id, fwd)); replace(p, factor_id=...) does not work, because factor_id is not a field.
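
The override rule follows from how dataclasses.replace works: it only accepts names of dataclass fields. A minimal sketch, assuming a frozen dataclass stand-in (ProfileSketch is hypothetical and omits every other FactorProfile attribute):

```python
import dataclasses
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileSketch:
    """Hypothetical stand-in for FactorProfile (illustration only)."""

    identity: tuple  # (factor_id, forward_periods)

    @property
    def factor_id(self):
        return self.identity[0]

    @property
    def forward_periods(self):
        return self.identity[1]


p = ProfileSketch(identity=("momentum_12_1", 5))

# Replacing the real field works, and the properties follow it.
renamed = dataclasses.replace(p, identity=("mom_12_1_v2", p.forward_periods))

# factor_id is a property, not a dataclass field, so replace() rejects it.
try:
    dataclasses.replace(p, factor_id="mom_12_1_v2")
except TypeError as exc:
    error = exc
else:
    error = None

print(renamed.factor_id, type(error).__name__)
```

The original profile is left untouched; replace returns a new object, which is what makes the override safe on a frozen instance.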

How context is populated

context ships empty by default. Higher-level functions that operate on a filtered or sliced panel populate it via dataclasses.replace:

import dataclasses

p = evaluate(panel_large_cap, cfg, factor_col="momentum_12_1")
p = dataclasses.replace(p, context={"universe_id": "us_large_cap"})

The by_slice consumer and the upcoming run_metrics function populate context automatically — manual replace is the escape hatch for callers who run their own slicing.
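
One thing to watch with the manual route: passing a fresh dict to replace discards any context keys already present. The sketch below shows a merge-style helper for callers who stamp more than one dimension. The helper with_context, the stand-in ProfileSketch, and the "high_vol" value are all hypothetical and not part of factrix.

```python
import dataclasses
from dataclasses import dataclass, field
from typing import Any, Mapping


@dataclass(frozen=True)
class ProfileSketch:
    """Hypothetical stand-in for FactorProfile (illustration only)."""

    identity: tuple
    context: Mapping[str, Any] = field(default_factory=dict)


def with_context(profile, **extra):
    """Return a copy whose context is the old context plus `extra`."""
    merged = {**profile.context, **extra}  # later keys win on conflict
    return dataclasses.replace(profile, context=merged)


p = ProfileSketch(identity=("momentum_12_1", 5))
p = with_context(p, universe_id="us_large_cap")
p = with_context(p, regime_id="high_vol")  # universe_id is preserved

print(dict(p.context))
```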

Querying context as a sample restriction

Treating universe / regime as sample restriction (the common case) is a plain comprehension before the screening function:

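import factrix as fx

profiles = [
    evaluate(panel, cfg, factor_col=name)
    for name in factor_cols
]
# NOTE: evaluate() leaves context empty by default, so this filter only matches
# profiles whose context was stamped upstream (by_slice or dataclasses.replace).
large_cap = [p for p in profiles if p.context.get("universe_id") == "us_large_cap"]
# Screen the restricted family only.
fx.multi_factor.bhy(large_cap, q=0.05)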

When the universe / regime axis IS a hypothesis dimension (e.g., "is this factor significant in some universe?"), promote it via expand_over= (see multi_factor.bhy). Mixing the two paths is the single most common screening-loop bug; the split makes the choice explicit at the call site.
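
Because context ships empty by default, a .get()-based filter silently returns an empty list when the profiles were never stamped. A stricter filter can make that failure loud. The sketch below is one possible shape, not a factrix function: restrict is a hypothetical helper, and SimpleNamespace objects stand in for profiles.

```python
from types import SimpleNamespace


def restrict(profiles, **wanted):
    """Keep profiles whose context matches every key=value in `wanted`.

    Raises KeyError when a profile lacks a requested key, instead of
    silently dropping it from the family.
    """
    kept = []
    for p in profiles:
        missing = [k for k in wanted if k not in p.context]
        if missing:
            raise KeyError(f"profile {p.identity!r} has no context keys {missing}")
        if all(p.context[k] == v for k, v in wanted.items()):
            kept.append(p)
    return kept


# Stand-in profiles with made-up identities and contexts.
stamped = [
    SimpleNamespace(identity=("momentum_12_1", 5), context={"universe_id": "us_large_cap"}),
    SimpleNamespace(identity=("value_bm", 5), context={"universe_id": "us_small_cap"}),
]
large_cap = restrict(stamped, universe_id="us_large_cap")

# An unstamped profile triggers the error rather than an empty result.
unstamped = [SimpleNamespace(identity=("momentum_12_1", 5), context={})]
try:
    restrict(unstamped, universe_id="us_large_cap")
    raised = False
except KeyError:
    raised = True

print([p.identity for p in large_cap], raised)
```

The restricted list would then be passed to the screening function exactly as in the comprehension above.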

Reading the rendered profile

repr(profile) lists identity, mode, primary_p, sample sizes, and omits context / warnings when empty:

FactorProfile(factor_id='momentum_12_1', forward_periods=5, mode=panel,
primary_p=0.0312, n_obs=240, n_assets=500)

In Jupyter, _repr_html_ renders the same fields as a table and unfolds non-empty context entries as context.<key> rows so universe / regime restrictions are visible without calling diagnose().