Preparing data

This page walks through the path from a raw price / signal dataset to the (date, asset_id, factor, forward_return) panel that evaluate consumes. For the column-level contract of those four columns, see Panel schema; this page is the task-oriented walk-through.

At a glance

| Step | What you do | Function | Output added |
| --- | --- | --- | --- |
| 1 | Reshape raw inputs to long format with price | manual / Polars ops | (date, asset_id, price, factor) |
| 2 | Ensure regular spacing per asset on the time axis | manual / Polars ops | spacing-regular panel |
| 3 | Attach forward return | compute_forward_return | adds forward_return |
| 4 | (Optional) drop / impute NaN, align frequencies | manual | clean panel |

1. Long-format shape with price and the factor column

factrix expects long-format panel data — one row per (date, asset_id) pair. Wide-format (one column per asset) is not accepted by any entry point.

compute_forward_return computes the look-ahead return from a price column; the factor column is a parallel signal you construct yourself (factor construction is outside factrix's scope — see Where factrix fits § 1).

The factor column name is user-defined: evaluate() / run_metrics() accept a factor_col= kwarg that binds an arbitrary column to the canonical role at dispatch time. The examples below use momentum to make this binding visible; you can equally pick alpha, value_score, or whatever is meaningful for the strategy.

For per-asset factors (INDIVIDUAL scope), each (date, asset_id) carries its own factor value alongside the price:

import polars as pl
from datetime import date

raw = pl.DataFrame({
    "date":          [date(2024, 1, 1), date(2024, 1, 1),
                      date(2024, 1, 2), date(2024, 1, 2)],
    "asset_id":      ["AAPL", "MSFT", "AAPL", "MSFT"],
    "price":         [185.0, 372.0, 186.5, 374.5],
    "momentum":      [0.42, -0.15, 0.51, -0.08],
})

For market-wide factors (COMMON scope, e.g. VIX, DXY), the factor value is identical across asset_id on a given date. Verify with the one-liner from Concepts § scope (swap the column name for whichever the panel carries):

raw.group_by("date").agg((pl.col("vix").n_unique() == 1).alias("is_common"))["is_common"].all()

2. Regular spacing per asset is load-bearing

compute_forward_return sorts the input by (asset_id, date) itself, so an unsorted panel is fine. What it does not inspect is the calendar gap between successive rows — the function shifts by row count, not by date.

If asset A has daily rows but asset B is missing two trading days in the middle, asset B's row-shift skips the gap silently and the forward return on the row before the gap measures the wrong horizon. Verify per-asset spacing before calling:

gaps = raw.sort(["asset_id", "date"]).with_columns(
    (pl.col("date").diff().over("asset_id")).alias("gap")
)
# Inspect gaps.group_by("asset_id").agg(pl.col("gap").drop_nulls().n_unique()):
# a single unique gap per asset is the goal. drop_nulls() excludes the
# first row of each asset, whose diff is null.

If the panel is sparse by design (event series, irregular trading days), see step 6 on sparse and event signals.

3. Attach forward return

from factrix.preprocess import compute_forward_return

panel = compute_forward_return(raw, forward_periods=5)

The function computes a per-period normalized forward return:

forward_return[t] = (price[t + 1 + N] / price[t + 1] - 1) / N

Three things to know about this formula:

  • Entry at t + 1, not t — the function assumes you trade on the bar after the signal is observed, preserving a strict signal-then-trade causal boundary.
  • Exit at t + 1 + N — the holding horizon spans N rows of the asset's own date series, where N = forward_periods.
  • Divided by N — returns are normalized to a per-period basis, so forward_periods=5 and forward_periods=20 are directly comparable. This differs from the cumulative-return convention used by qlib (Ref($close, -N)/$close - 1) and alphalens.
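
To make the three points concrete, here is a plain-Python sketch of the formula for a single asset. It mirrors the arithmetic stated above and is not factrix's implementation; the helper name forward_return and the price series are illustrative. Where the function drops tail rows, this sketch returns None instead.

```python
from typing import List, Optional

def forward_return(prices: List[float], t: int, n: int) -> Optional[float]:
    """Per-period forward return at row t: enter at t + 1, exit at t + 1 + n."""
    entry, exit_ = t + 1, t + 1 + n
    if exit_ >= len(prices):
        return None  # the horizon runs off the end of the series
    return (prices[exit_] / prices[entry] - 1) / n

prices = [100.0, 102.0, 103.0, 105.0, 107.1, 108.0]

# Entry at row 1 (102.0), exit at row 3 (105.0), divided by n = 2.
per_period = forward_return(prices, t=0, n=2)

# Cumulative convention for comparison: entry at row t, no division by n.
cumulative = prices[2] / prices[0] - 1

# Row 4 has no exit row at index 7, so no value is produced.
tail = forward_return(prices, t=4, n=2)
```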

The horizon counts rows of the asset's own date series, not calendar days. forward_periods=5 on a daily panel is a five-trading-day lookahead; on a monthly panel it is five months. Frequency is the user's responsibility — see step 4.
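
The difference between row count and calendar time shows up as soon as an asset has a gap. The following plain-Python sketch compares the calendar span of the same row-based horizon on a complete and a gappy date series; the dates and the helper horizon_days are illustrative only.

```python
from datetime import date

# Illustrative trading dates for one asset; Jan 4 and Jan 5 are absent in `gappy`.
complete = [date(2024, 1, d) for d in (2, 3, 4, 5, 8, 9, 10)]
gappy = [d for d in complete if d.day not in (4, 5)]

def horizon_days(dates, t, n):
    """Calendar days between the entry row (t + 1) and the exit row (t + 1 + n)."""
    return (dates[t + 1 + n] - dates[t + 1]).days

regular_span = horizon_days(complete, t=0, n=2)  # Jan 3 to Jan 5
gappy_span = horizon_days(gappy, t=0, n=2)       # Jan 3 to Jan 9
```

Both calls use the same n, yet the gappy series holds the position three times as long in calendar terms.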

The forward_periods you pass here must match the AnalysisConfig.forward_periods you later pass to evaluate. Bind the custom factor column via factor_col=:

import factrix as fx

cfg = fx.AnalysisConfig.individual_continuous(forward_periods=5)
profile = fx.evaluate(panel, cfg, factor_col="momentum")

If the column is already named factor (the default), factor_col= can be omitted. See Panel schema § factor_col= for the in-place rename contract and the conflict rule (a panel cannot carry both factor and a non-default factor_col at once).

4. Frequency alignment is the caller's job

factrix is calendar-agnostic — it shifts rows, not calendar time. Three responsibilities sit upstream of compute_forward_return:

  • Same date axis for factor and price source. If the factor is monthly and the price source is daily, downsample (or upsample) one side before joining. A frequency mismatch will not raise; it will silently mean the wrong thing.
  • Same forward_periods interpretation. Five rows on a daily panel is one week of trading days; five rows on a monthly panel is five months. Pick the horizon against your panel's actual cadence.
  • Slice / regime labels aligned by date. If you attach a regime_id or universe column for downstream slicing, align it on the same date axis the panel uses; mismatched labels propagate silently into by_slice and screening calls.

5. Missing data

| Source | factrix behaviour | Caller action |
| --- | --- | --- |
| NaN in factor | Not auto-imputed; flows through to the procedure, where it depresses n_obs and may trip sample-size guards. | Drop or impute before compute_forward_return. |
| NaN in price | compute_forward_return produces NaN forward_return for the row and then drops it from the output. Tail rows where t + 1 + N runs off the end of the series are dropped by the same filter. | If a daily NaN reflects a true gap (suspended trading, holiday), the drop is correct. If imputable (forward-fill from previous close), impute before calling. |
| Single-asset panel (N = 1 asset) | Mode auto-switches to TIMESERIES. individual_continuous at N = 1 raises ModeAxisError with suggested_fix=common_continuous(...). | Either pass a panel with N ≥ 2 assets or use the *_sparse / common_* factories. |
| T < MIN_PERIODS_HARD (= 20) periods | Raises InsufficientSampleError; procedures never silently produce a result on under-sampled data. | Extend the window or accept the procedure's refusal. |

6. Sparse and event signals

For (INDIVIDUAL, SPARSE) or (COMMON, SPARSE) factors — buy/sell flags, FOMC dummies, event magnitudes — the factor column is the {0, R} event vector:

  • 0 on non-event rows.
  • any real value on event rows (R is unrestricted — positive, negative, or any magnitude). Common forms: {0, 1} for a pure event flag and {0, R} for an event carrying signed or unsigned magnitude.
  • expect ≥ 50% zeros.

Sorting and forward-return attachment work exactly as in steps 2 and 3; the dispatch routes sparse signals to event-study procedures (caar, ts_beta on dummies). See Concepts § signal for the contract.
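
A {0, R} vector of this kind can be assembled from a list of events and checked against the zero-share expectation before it goes into the panel. The plain-Python sketch below uses hypothetical event magnitudes keyed by row index; the names events and zero_share are illustrative.

```python
# Hypothetical signed event magnitudes keyed by row index; other rows are non-events.
events = {2: 1.5, 7: -0.8}
n_rows = 10

# 0 on non-event rows, the event magnitude R on event rows.
factor = [events.get(i, 0.0) for i in range(n_rows)]

# Share of zero rows, to compare against the 50% expectation.
zero_share = sum(1 for x in factor if x == 0.0) / n_rows
is_sparse_enough = zero_share >= 0.5
```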

Helpers not yet public

factrix.preprocess currently re-exports only compute_forward_return. Submodule code under factrix/preprocess/ carries normalization (mad_winsorize, cross_sectional_zscore), forward-return cleaning (winsorize_forward_return, compute_abnormal_return), and orthogonalization (orthogonalize_factor); publicization is tracked under #323. Until then, treat the submodule paths as internal — they may be renamed or re-shaped before they land in __all__.

See also