Overfit

The backtest that survived because I stopped looking

A personal post-mortem on the strategy that made it through research only because the search stopped before the curve broke. Iterative overfitting in plain sight.

The strategy looked best on the Tuesday when I stopped touching it. That is the embarrassing part. It was not blessed by a clean experiment. It was not endorsed by a patient statistician in a linen jacket. It survived because I got bored, closed the notebook, and stopped giving myself chances to ruin it.

This was a US equity overnight reversal system. Buy the previous day's worst liquid names near the close, short the previous day's best liquid names, flatten the next morning, keep it roughly sector neutral, and try not to get filled like an idiot. The first pass covered 2005 to 2018 and showed an after-cost Sharpe a little above 1. The final research version showed something closer to 1.4. That sounds like progress. It was mostly me shopping.
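
For concreteness, here is a minimal sketch of the portfolio construction described above. It is not the production code. The function name, the input dictionaries, and the one-name-per-side default are all illustrative assumptions, and it ignores liquidity filters, costs, and fills entirely.

```python
from collections import defaultdict


def reversal_weights(prev_returns, sectors, per_side=1, gross=1.0):
    """Equal-dollar long/short weights from the previous day's returns.

    prev_returns: dict of ticker -> previous day's return (hypothetical input)
    sectors:      dict of ticker -> sector label (hypothetical input)
    per_side:     names to buy and names to short within each sector
    gross:        total gross exposure, split evenly across all positions
    """
    by_sector = defaultdict(list)
    for ticker, ret in prev_returns.items():
        by_sector[sectors[ticker]].append((ret, ticker))

    longs, shorts = [], []
    for members in by_sector.values():
        if len(members) < 2 * per_side:
            continue  # too few names to form both legs, so skip the sector
        members.sort()  # ascending, so yesterday's worst performers come first
        longs += [t for _, t in members[:per_side]]    # buy the losers
        shorts += [t for _, t in members[-per_side:]]  # short the winners

    n_positions = len(longs) + len(shorts)
    if n_positions == 0:
        return {}
    w = gross / n_positions
    weights = {t: w for t in longs}
    weights.update({t: -w for t in shorts})
    return weights
```

Because every sector contributes the same number of equally sized longs and shorts, each sector nets to zero, which is one simple way to read "roughly sector neutral".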

The earlier posts in this series covered why a Sharpe estimate has error bars and why the best strategy out of many trials needs a multiple-testing haircut. This is the same problem from the inside of the research process, where every little change feels defensible and the totality is not.

The first version was dull enough to be interesting

The original test had four moving parts: a universe filter, a one-day ranking variable, equal dollar weights, and a simple close-to-open holding period. It made money in the places you would expect. It did well after broad sell-offs, worse during slow grind-up markets, and terribly around single-name news where yesterday's loser was not temporarily cheap but permanently impaired.

That first version did not feel publishable to the part of my brain that likes clean charts. The return stream was lumpy. Borrow costs mattered. The short book carried more pain than the long book paid for in a few regimes. So I did what researchers do when they are pretending to be prudent. I improved it.

I excluded earnings dates. Then I excluded the day before earnings as well. I added a minimum price filter. I tried average daily volume (ADV) buckets. I capped single-name volatility. I removed names with very large closing auction imbalances. I changed the sector neutralisation from GICS sector to industry group. I tried a stop-loss even though the holding period was measured in hours, which should have made me stand up and go outside.

None of those edits was absurd in isolation. That is how iterative overfitting gets in. It does not arrive wearing a sign that says data snooping. It arrives as a series of sensible improvements.

A research log is not the same as honesty

I had a log, which made me feel virtuous. It was worse than useless for a while because I only logged the versions I thought I might keep. The dead ends were scattered through notebook checkpoints, git commits, filenames like overnight_reversal_new_new_final.ipynb, and one shameful spreadsheet.

A more honest reconstruction looked roughly like this:

| Change family | Approximate trials | Comment |
| --- | --- | --- |
| Universe and liquidity filters | 30 | Mostly defensible, heavily correlated |
| Ranking transforms | 18 | Winsorisation, z-scores, raw ranks |
| Neutralisation choices | 12 | Sector, industry, beta, nothing |
| Cost assumptions | 10 | The easiest place to lie by accident |
| Event exclusions | 20 | Earnings, corporate actions, index changes |
| Holding window | 16 | Open, VWAP, first hour, next close |

That is already north of 100 trials, before counting combinations. The selected strategy was not one draw from the research distribution. It was the variant still standing after I had made a mess and then laundered the mess into a single equity curve.
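
A rough sketch of what that trial count implies, under loud assumptions: every trial has zero true edge, the trials are independent (mine were not; they were heavily correlated, so this overstates the effect), and the annualised Sharpe estimated over roughly 14 years of data has a standard deviation of about 1/sqrt(14). The formula is the standard extreme-value approximation for the expected maximum of N independent normal draws, as used in the deflated Sharpe ratio literature. The function and variable names are illustrative.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate expected best Sharpe among n_trials independent, zero-edge trials."""
    if n_trials < 2:
        raise ValueError("need at least two trials")
    inv = NormalDist().inv_cdf
    a = inv(1.0 - 1.0 / n_trials)
    b = inv(1.0 - 1.0 / (n_trials * math.e))
    return sharpe_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


# Trial counts taken from the reconstructed table above.
trial_counts = {
    "Universe and liquidity filters": 30,
    "Ranking transforms": 18,
    "Neutralisation choices": 12,
    "Cost assumptions": 10,
    "Event exclusions": 20,
    "Holding window": 16,
}
n_trials = sum(trial_counts.values())  # 106

years = 14  # 2005 to 2018 inclusive (assumption about the sample length)
sharpe_std = 1.0 / math.sqrt(years)  # assumed: iid returns, zero true Sharpe

print(n_trials, round(expected_max_sharpe(n_trials, sharpe_std), 2))
```

Under those assumptions, the best of 106 worthless variants would be expected to show an annualised Sharpe of roughly 0.7. Correlation between trials shrinks that number, and counting combinations grows it, so treat it as an order of magnitude rather than a haircut to apply mechanically.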

Bailey, Borwein, López de Prado and Zhu's paper on the probability of backtest overfitting is uncomfortable reading because it describes this exact workflow without the romance. Alpha Architect's write-up on overfitting bias in multiple-signal strategies makes the same point from the factor side: once you choose and combine signals after seeing their in-sample behaviour, normal significance tests stop meaning what you want them to mean.

The holdout was not clean

I had a holdout period. Of course I did. Everyone has a holdout period in the same way everyone has a gym membership in January.

Mine was 2019 to 2020. I did not use it in the code until the end, but I knew what had happened in those years. I knew about the Q4 2018 sell-off that led into them. I knew what COVID had done to overnight gaps, liquidity, and single-name reversals. I knew, in broad terms, what kind of strategy would have looked foolish. That knowledge leaked into design choices long before I ran the holdout cell.

This is why the holdout is not a magic compartment. If the researcher has already seen the period, traded through it, or read a dozen papers using it, the wall is thinner than it looks. Bailey and his co-authors are blunt on this. A holdout can be useful, but it does not account for the number of configurations you tried, and it does not erase what you already know.

Carr and López de Prado go further in Determining Optimal Trading Rules without Backtesting, arguing that calibrating trading rules through repeated historical simulation is itself a source of overfitting. I do not think that means we stop backtesting. I think it means we stop pretending the last untouched column in a notebook is a confessional booth.

The part that survived

The strategy went into a paper-trading account for nine months with one rule: research changes were frozen. Bugs could be fixed. Data corrections could be applied. Parameters could not move.

That period did not reproduce the research Sharpe. It made money, but less neatly. The short book hurt more than expected. Most of the edge came from names I would rather not discuss at dinner. Capacity was worse than the spreadsheet said, because the spreadsheet assumed the closing auction was a polite venue where liquidity appears when summoned.

I put a small amount of capital behind the stripped-down version anyway. The live result over the next year looked like a Sharpe around 0.6, with a lot of uncertainty around that number. That was not glorious. It was also not zero. More importantly, it matched the version I had stopped improving, not the fantasy version I could have produced by continuing to stare at the data.
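
To put a number on "a lot of uncertainty", here is a back-of-envelope sketch using the textbook large-sample standard error for a Sharpe ratio. It assumes iid, roughly normal daily returns and 252 trading days a year, neither of which an overnight reversal book strictly satisfies, so the true error bars are probably wider. The function name is illustrative.

```python
import math


def sharpe_standard_error(annual_sharpe, years, periods_per_year=252):
    """Approximate standard error of an annualised Sharpe ratio (iid normal returns)."""
    n = years * periods_per_year
    per_period = annual_sharpe / math.sqrt(periods_per_year)
    # Large-sample variance of the per-period Sharpe estimate: (1 + SR^2 / 2) / n
    se_per_period = math.sqrt((1.0 + 0.5 * per_period ** 2) / n)
    return se_per_period * math.sqrt(periods_per_year)


live_sharpe = 0.6
se = sharpe_standard_error(live_sharpe, years=1)
low, high = live_sharpe - 1.96 * se, live_sharpe + 1.96 * se
print(f"standard error ~ {se:.2f}, rough 95% interval ~ [{low:.2f}, {high:.2f}]")
```

With a single year of data the standard error is about 1.0 in annualised Sharpe units, so the interval around 0.6 comfortably includes zero. One year of live trading says "not obviously broken", not much more.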

The change in my process was simple. Every trial gets logged, including the stupid ones. A parameter touched after seeing an out-of-sample result makes that result in-sample. Research changes and implementation fixes get separate tickets. When I cannot reconstruct the trial count, I assume the count is large and discount the result hard.
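
A minimal sketch of what "every trial gets logged" can look like in practice, assuming an append-only JSON Lines file. The field names, the `seen_holdout` flag, and both function names are hypothetical choices for illustration, not a description of any particular tool.

```python
import json
import time


def log_trial(path, change_family, description, sharpe, kept, seen_holdout=False):
    """Append one trial to a JSON Lines log, whether or not the variant was kept."""
    record = {
        "timestamp": time.time(),
        "change_family": change_family,
        "description": description,
        "sharpe": sharpe,
        "kept": kept,
        # True if the change was made after looking at any out-of-sample result,
        # in which case that result should be treated as in-sample from here on.
        "seen_holdout": seen_holdout,
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def count_trials(path):
    """Count logged trials per change family, including the discarded ones."""
    counts = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                family = json.loads(line)["change_family"]
                counts[family] = counts.get(family, 0) + 1
    return counts
```

The point of appending rather than editing is that dead ends cannot quietly disappear, and `count_trials` gives the number that any multiple-testing discount needs as an input.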

A backtest can survive because it is good. It can also survive because you stopped asking questions at the lucky moment. On a chart, those two cases look annoyingly similar.