Overfit

Cross-validation for time series, and why k-fold betrays you


Ordinary k-fold assumes exchangeable observations. Financial time series are ordered, autocorrelated, and often labelled with overlapping future returns. Use time splits, purging, and embargoes instead.

Random k-fold cross-validation is a small machine for laundering tomorrow into yesterday. It was built for settings where observations can be treated as exchangeable. Financial time series do not give you that luxury. Order matters. Neighbouring samples share information. Labels often use future returns. Regimes arrive in clumps and leave without forwarding addresses.

If you shuffle that structure away, the score you get back may be measuring leakage rather than predictive power.

This does not mean cross-validation is useless for time series. It means the splitter is part of the model. A bad splitter can make a mediocre signal look like a discovery.

Why ordinary k-fold fails

In standard k-fold, the data is split into k blocks, the model trains on k minus one blocks, and it tests on the held-out block. If rows are assigned to blocks at random, a model evaluated on observations from 2015 may have trained on observations from 2016, 2018, and 2020. That is already fatal for a trading rule whose only legitimate information set is the past.
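To see how much of the future ends up in training, here is a minimal sketch using scikit-learn's KFold with shuffling. The data is a synthetic index of 100 rows that I assume to be in chronological order, and the name future_share is my own. For each fold it measures the fraction of training rows that come after the earliest test row.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 100  # assumption: row i is observed before row i + 1
X = np.arange(n_samples).reshape(-1, 1)

splitter = KFold(n_splits=5, shuffle=True, random_state=0)

future_share = []
for train_idx, test_idx in splitter.split(X):
    # A training row is "from the future" if it comes after the earliest test row.
    n_future = int((train_idx > test_idx.min()).sum())
    future_share.append(n_future / len(train_idx))

print([round(s, 2) for s in future_share])
```

The fold that happens to contain the first row trains entirely on later data, and the other folds are rarely much better.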

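Even unshuffled k-fold can leak. Every fold except the last still trains on data that comes after the test block, and the fold boundaries leak too. Suppose each sample at date t uses a 20-day feature window and has a label equal to the next five-day return. A training sample dated just before the test fold has a label that runs into the test period, and one dated just after it has a feature window that reaches back into it. The model is not seeing the exact answer, but it is close enough to make the score optimistic.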
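The boundary problem can be counted. The sketch below uses the 20-day window and five-day label from the example, on 500 synthetic daily rows split into five contiguous folds. It uses a deliberately broad definition that is my assumption, not a standard one: a training sample is contaminated if the span of days it touches, from the first feature day to the last label day, intersects the span touched by any test sample.

```python
import numpy as np

n_samples, n_folds = 500, 5
feature_window, label_horizon = 20, 5  # values from the example in the text

t = np.arange(n_samples)
feat_start = t - feature_window + 1  # first day used by the features of sample t
label_end = t + label_horizon        # last day used by the label of sample t

folds = np.array_split(t, n_folds)   # contiguous, unshuffled folds
test = folds[2]                      # a middle fold: rows 200..299
train = np.concatenate([f for i, f in enumerate(folds) if i != 2])

# Span of days touched by any test sample (features and labels together).
test_lo, test_hi = feat_start[test].min(), label_end[test].max()

# A training sample is contaminated if its own span intersects the test span.
mask = (feat_start[train] <= test_hi) & (label_end[train] >= test_lo)
contaminated = train[mask]

print(len(contaminated), contaminated.min(), contaminated.max())
```

Under these assumptions 48 training rows touch the test span, 24 on each side of the fold. A narrower definition that only counts overlapping label intervals would flag fewer rows, but never zero.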

Autocorrelation adds another problem. Adjacent returns, volumes, spreads, and volatility estimates are not independent draws from a tidy urn. If a model trains on points immediately beside the test fold, it may exploit local continuity that will not survive a real deployment gap.
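Overlapping feature windows create this dependence even when the underlying returns have none. The sketch below is an illustration on synthetic i.i.d. returns, not market data, and lag1_autocorr is a helper name I made up. It compares the lag-1 autocorrelation of the raw returns with that of a 20-day rolling volatility built from them.

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.01, size=2000)  # synthetic i.i.d. returns

window = 20
# Each row of `windows` is one 20-day slice; consecutive rows share 19 days.
windows = np.lib.stride_tricks.sliding_window_view(returns, window)
rolling_vol = windows.std(axis=1)

def lag1_autocorr(x):
    """Correlation between a series and itself shifted by one step."""
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

print(round(lag1_autocorr(returns), 3))
print(round(lag1_autocorr(rolling_vol), 3))
```

The raw returns come out close to uncorrelated, while the rolling volatility is strongly autocorrelated because neighbouring estimates share most of their inputs. A model trained on the row next to a test row has effectively seen most of that test row's feature.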

TimeSeriesSplit is a start

Scikit-learn's TimeSeriesSplit is the first tool many people reach for, and for ordinary forecasting examples it is a sane start. By default the training set grows with each split, or rolls forward if you cap it with max_train_size, and each test set lies after its training set. The class also takes a gap parameter, which drops a fixed number of samples from the end of each training set, just before the test set.

That is already better than random k-fold. It respects chronology. It gives you several out-of-sample periods instead of one final holdout. It makes the evaluation shape explicit in code:

from sklearn.model_selection import TimeSeriesSplit

splitter = TimeSeriesSplit(n_splits=5, gap=5, test_size=60)
for train_idx, test_idx in splitter.split(X):
    # Each test block lies after its training block, with 5 rows dropped in between.
    fit_model(X.iloc[train_idx], y.iloc[train_idx])
    score_model(X.iloc[test_idx], y.iloc[test_idx])

The gap is not a superstition. It is an admission that adjacent samples may contaminate each other. For some strategies five rows is too much. For others it is laughably little.
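It helps to print what the splitter actually does before trusting it. This sketch runs the same TimeSeriesSplit settings on a synthetic index of 500 rows, a length I picked for illustration, and records the boundaries of each split in a list I call bounds.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples, gap, test_size = 500, 5, 60
X = np.arange(n_samples).reshape(-1, 1)  # stand-in for a chronologically sorted frame

splitter = TimeSeriesSplit(n_splits=5, gap=gap, test_size=test_size)

bounds = []
for train_idx, test_idx in splitter.split(X):
    bounds.append((train_idx.min(), train_idx.max(), test_idx.min(), test_idx.max()))

for train_lo, train_hi, test_lo, test_hi in bounds:
    print(f"train {train_lo}..{train_hi}  test {test_lo}..{test_hi}")
```

Every training block ends exactly gap rows short of its test block. Note that the gap is counted in rows, not in calendar time or label horizon, so it only matches your leakage window when sampling is regular.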

Purging removes overlapping labels

Marcos López de Prado's Advances in Financial Machine Learning gives the finance-specific version: purged cross-validation with embargoing. The key observation is that many financial labels span intervals. If a sample's label uses returns from t to t + h, any training sample whose label interval overlaps a test label can leak information.

Purging removes those overlapping training samples. It is not the same as leaving a fixed gap by row count, although a gap can approximate it when horizons are fixed and sampling is regular. Purging uses the actual event start and end times. That matters for event-driven bars, triple-barrier labels, trades with variable holding periods, and anything where the label is not a neat one-day step.
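A minimal sketch of purging by event times, assuming each sample carries a label start and a label end and that intervals are closed at both ends. The function name purge_train_indices and the toy intervals are mine, not taken from the book or any library.

```python
import numpy as np

def purge_train_indices(label_start, label_end, train_idx, test_idx):
    """Drop training samples whose label interval overlaps any test label interval."""
    label_start = np.asarray(label_start)
    label_end = np.asarray(label_end)
    keep = []
    for i in train_idx:
        # Closed intervals [a, b] and [c, d] overlap when a <= d and c <= b.
        overlaps = (label_start[i] <= label_end[test_idx]) & (
            label_start[test_idx] <= label_end[i]
        )
        if not overlaps.any():
            keep.append(i)
    return np.array(keep, dtype=int)

# Toy events with variable holding periods (made-up numbers).
label_start = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
label_end = [2, 6, 3, 4, 9, 6, 7, 8, 10, 12]
test_idx = [4, 5]
train_idx = [0, 1, 2, 3, 6, 7, 8, 9]

kept = purge_train_indices(label_start, label_end, train_idx, test_idx)
print(kept.tolist())
```

Only samples 0 and 2 survive. Sample 1 starts three rows before the test fold but holds until time 6, so it is purged, while sample 2 starts later and is kept. A fixed row gap cannot make that distinction.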

The purged cross-validation overview is a useful map of the idea, but the implementation should be driven by your own label horizons. If you do not know the start and end time of each label, you do not have enough metadata to prove the split is clean.

Embargoing handles the afterglow

An embargo removes training observations just after the test fold. The reason is not mystical. Market information bleeds. A large event can affect features and labels after the nominal test period. A model trained immediately after the test fold may learn from the aftermath of the same event it is being scored on.

In daily equity data, an embargo might be a few days. In slower fundamental data, it might be a filing cycle. In intraday futures, it might be a session boundary. The right value depends on the signal, the holding period, and the way features are built.
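As a sketch, an embargo can be written as one more filter on the training indices. The version below assumes a single contiguous test block and measures the embargo in rows. The function name apply_embargo and the choice of five rows are illustrative only.

```python
import numpy as np

def apply_embargo(train_idx, test_idx, embargo):
    """Drop training rows that fall within `embargo` rows after the end of the test block."""
    train_idx = np.asarray(train_idx)
    test_end = int(np.max(test_idx))
    in_embargo = (train_idx > test_end) & (train_idx <= test_end + embargo)
    return train_idx[~in_embargo]

n_samples = 100
test_idx = np.arange(40, 60)                              # test block: rows 40..59
train_idx = np.setdiff1d(np.arange(n_samples), test_idx)  # everything else
clean_train = apply_embargo(train_idx, test_idx, embargo=5)

print(len(train_idx), len(clean_train))
```

Rows 60 to 64 are dropped and training resumes at row 65. With irregular sampling, the same filter would be better expressed on timestamps than on row counts.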

Embargoes are expensive because they throw data away. That is the cost of not lying to yourself. If a model only works when trained on observations pressed directly against the test set, I would rather know that before live trading supplies the lesson.

Combinatorial splits give a distribution, not a trophy

The final step is to stop worshipping one path. A single walk-forward sequence tells you how the strategy did through one historical ordering. Combinatorial purged cross-validation builds many train-test combinations while still purging and embargoing, then reports a distribution of performance estimates.
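The enumeration step can be sketched in a few lines. The data is cut into contiguous groups, every choice of test groups becomes one split, and the rows around each test group are removed from training. This version approximates purging with a fixed row count, which only holds for fixed horizons and regular sampling. The function name combinatorial_splits and all parameter values are my own choices for illustration.

```python
from itertools import combinations
from math import comb

import numpy as np

def combinatorial_splits(n_samples, n_groups, n_test_groups, purge=0, embargo=0):
    """Yield (train_idx, test_idx) for every choice of test groups."""
    groups = np.array_split(np.arange(n_samples), n_groups)
    for test_groups in combinations(range(n_groups), n_test_groups):
        test_idx = np.concatenate([groups[g] for g in test_groups])
        is_train = np.ones(n_samples, dtype=bool)
        for g in test_groups:
            lo, hi = groups[g][0], groups[g][-1]
            # Remove the test block, `purge` rows before it, and `embargo` rows after it.
            is_train[max(lo - purge, 0): hi + embargo + 1] = False
        yield np.flatnonzero(is_train), test_idx

splits = list(
    combinatorial_splits(n_samples=600, n_groups=6, n_test_groups=2, purge=5, embargo=5)
)

# With N groups and k test groups there are C(N, k) splits, and each group
# is tested in k/N of them, which gives C(N, k) * k / N backtest paths.
n_paths = comb(6, 2) * 2 // 6
print(len(splits), n_paths)
```

Six groups with two held out give 15 splits and 5 full backtest paths. Stitching the test predictions into those paths and scoring each one is what produces the distribution.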

That distribution is more useful than one trophy Sharpe. It can show that the median result is mediocre even though the best path looks excellent. It can show that drawdown risk is concentrated in a few regimes. It can feed directly into the probability of backtest overfitting and the deflated Sharpe ratio from earlier posts.

There is a price. The computation grows quickly, and the code is less convenient than calling cross_val_score. Convenience is not the objective. A clean split is.

My default rule

For plain one-step forecasting, I start with TimeSeriesSplit, set an explicit gap, and make the execution timing match the production rule. For trade labels with overlapping horizons, I use purging. If the signal or market reaction can bleed beyond the fold, I add an embargo. If the strategy will be selected from a large research set, I want a distribution across multiple purged paths, not one lucky walk.

The main question is always the same: could the training process have known anything about the test outcome at the decision time? If the answer is maybe, the split is not clean enough.

K-fold does not betray you because it is badly designed. It betrays you because it was designed for a world your backtest does not live in.