How many strategies did you try before this one?
Your reported Sharpe is the best of everything you tested, and the maximum of many noisy backtests is inflated before any skill enters. How selection bias works, and how to deflate it.
There is a question I ask before I look at anyone's backtest, including my own. How many strategies did you try before you found this one? Nobody enjoys answering it, because the honest number is usually large and it wrecks the result. The Sharpe you are showing me is not a random strategy's Sharpe. It is the best of everything you tested, and the best of a pile of noisy numbers is biased upward before any skill enters the picture.
This follows an earlier post here, Your Sharpe ratio is a random variable, which covered the error bars on a single estimate. Selection bias is the same problem wearing a worse hat. Once you are choosing the maximum over many trials, the error bars stop being symmetric and start working entirely against you.
The maximum is not the average
Run one strategy with no real edge and its backtested Sharpe will scatter around zero, sometimes positive, sometimes negative. Run fifty and keep the best and you are no longer sampling the average outcome, you are sampling the extreme. Fifty independent tests at a 5 percent significance level give you roughly two and a half hits by chance alone, and you will present whichever one looks best. This is the multiple comparisons problem, and it is the default state of quantitative research rather than a fringe worry.
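The arithmetic is easy to check with a toy simulation. This is a minimal sketch, assuming independent tests and p-values that are uniform under the null; the function names are mine, not from any library.

```python
import random

def count_false_positives(n_tests=50, alpha=0.05, rng=None):
    """Count how many of n_tests no-edge strategies pass at level alpha."""
    rng = rng or random.Random()
    # Assumption: under the null, each p-value is uniform on [0, 1].
    return sum(rng.random() < alpha for _ in range(n_tests))

def prob_at_least_one(n_tests=50, alpha=0.05):
    """Chance that at least one of n_tests independent null tests passes."""
    return 1.0 - (1.0 - alpha) ** n_tests

if __name__ == "__main__":
    rng = random.Random(0)
    runs = [count_false_positives(rng=rng) for _ in range(20_000)]
    print("average chance hits per batch of 50:", sum(runs) / len(runs))
    print("P(at least one chance hit):", prob_at_least_one())
```

The second function is the uncomfortable one. With fifty independent tests at the 5 percent level, the chance of at least one spurious winner is about 92 percent.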
The correction statisticians reach for first is crude but instructive. Under a Bonferroni adjustment, if you considered ten strategies a result needs to clear the 0.5 percent level rather than the 5 percent level to earn the same confidence. Alpha Architect summarised the sharper version from Robert Novy-Marx: combining the best k out of n candidate signals inflates your statistics almost as much as picking the single best of n to the power k candidates. Combine the best three of ten signals and you are in the territory of picking the best of a thousand. The bias compounds faster than intuition suggests.
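The Bonferroni adjustment takes two lines to compute. The sketch below also converts the adjusted level into the z-score a result would need, assuming a two-sided test on a normally distributed statistic. That assumption and the function names are mine.

```python
from statistics import NormalDist

def bonferroni_alpha(alpha, n_trials):
    """Per-test significance level after a Bonferroni adjustment."""
    return alpha / n_trials

def required_z(alpha, n_trials):
    """Two-sided z-score needed to pass at the adjusted level.

    Assumption: the test statistic is standard normal under the null.
    """
    adjusted = bonferroni_alpha(alpha, n_trials)
    return NormalDist().inv_cdf(1.0 - adjusted / 2.0)

for n in (1, 10, 100):
    print(f"{n:>4} trials: alpha = {bonferroni_alpha(0.05, n):.4f}, "
          f"z needed = {required_z(0.05, n):.2f}")
```

With one trial the bar is the familiar 1.96. With ten trials it rises to roughly 2.81.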
How big the free Sharpe gets
Put a number on it. In the earlier post, five years of monthly returns gave a standard error on the annualised Sharpe of about 0.45. Under the null of zero skill, each strategy's annualised Sharpe is roughly a normal variable centred on zero with that 0.45 spread. The question becomes what the maximum of many such variables looks like.
The expected maximum of a set of standard normal draws is a known quantity. For 100 draws it is about 2.5, and for 1,000 draws it is about 3.2, measured in standard errors. Multiply through by the 0.45 spread and the result is bleak. Test 100 worthless strategies over five years and the best is expected to show an annualised Sharpe near 1.1. Test 1,000 and the best lands near 1.5. None of them earn anything in truth. The Sharpe was manufactured by the act of choosing the winner.
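You can reproduce those figures without a lookup table. The maximum of n independent standard normals has density n * pdf(x) * cdf(x)^(n-1), so its mean is one numerical integral. This is a minimal sketch using a plain trapezoid rule. The grid bounds and step count are my choices, and the 0.45 is the standard error quoted above.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def expected_max_z(n_trials, lo=-10.0, hi=10.0, steps=8000):
    """E[max of n_trials independent standard normals], trapezoid rule.

    The maximum has density n * pdf(x) * cdf(x) ** (n - 1).
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        density = n_trials * STD_NORMAL.pdf(x) * STD_NORMAL.cdf(x) ** (n_trials - 1)
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end points
        total += weight * x * density
    return total * h

SHARPE_SE = 0.45  # standard error of the annualised Sharpe, from the text

for n in (1, 10, 100, 1000):
    z = expected_max_z(n)
    print(f"{n:>5} trials: best z = {z:.2f}, free Sharpe = {z * SHARPE_SE:.2f}")
```

For 100 trials the integral comes out near 2.51 standard errors and for 1,000 near 3.24, which is where the 1.1 and 1.5 Sharpe figures come from.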
That assumes the strategies are independent, which they rarely are. Heavily correlated variants behave like a smaller effective number of trials, so the real inflation sits below these figures. It does not sit at zero, and it is never in your favour.
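To see how much correlation helps, here is a toy Monte Carlo. It assumes every pair of strategies shares the same correlation rho through one common shock, which is a convenience of mine rather than a claim about real strategy families. In that model the best z-score is the common shock plus the best idiosyncratic draw, each scaled by its weight.

```python
import math
import random

def mean_best_z(n_trials, rho, n_sims, rng):
    """Monte Carlo mean of the best z-score among equicorrelated null trials.

    Assumption: z_i = sqrt(rho) * common + sqrt(1 - rho) * own_i, so every
    pair of trials has correlation rho.
    """
    w_common, w_own = math.sqrt(rho), math.sqrt(1.0 - rho)
    total = 0.0
    for _ in range(n_sims):
        common = rng.gauss(0.0, 1.0)  # shock shared by all trials
        best_own = max(rng.gauss(0.0, 1.0) for _ in range(n_trials))
        # The common term shifts every trial equally, so the maximum is
        # the shared shift plus the scaled best idiosyncratic draw.
        total += w_common * common + w_own * best_own
    return total / n_sims

rng = random.Random(42)
for rho in (0.0, 0.5, 0.9):
    z = mean_best_z(100, rho, 2000, rng)
    print(f"rho = {rho:.1f}: best z = {z:.2f}, free Sharpe = {z * 0.45:.2f}")
```

In this model the expected best shrinks by a factor of sqrt(1 - rho). Even at a pairwise correlation of 0.5, a hundred variants still leave roughly 1.8 standard errors of free Sharpe on the table.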
Count the trials honestly, including the ones you forgot
The number of trials is almost always larger than the number you would admit to. Every lookback window you swept, every entry threshold you nudged, every stop you added and later removed, every universe you swapped in, each of those is a trial. A recent practical writeup put it plainly: you test fifty variations, three look good, you keep the best and evaluate it as though it were the only thing you ever ran. The forty-nine you discarded still count.
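Counting a parameter sweep is just a product, and writing it down tends to be sobering. The search space below is hypothetical: every parameter name and value is made up for illustration.

```python
from itertools import product

# Hypothetical search space; names and values are illustrative only.
search_space = {
    "lookback_days": [20, 60, 120, 250],
    "entry_threshold": [0.5, 1.0, 1.5, 2.0, 2.5],
    "stop_loss": [None, 0.05, 0.10],
    "universe": ["large_cap", "small_cap", "all"],
}

def count_trials(space):
    """Number of distinct configurations in a full grid sweep."""
    return sum(1 for _ in product(*space.values()))

print("configurations in the grid:", count_trials(search_space))
```

Four short lists already give 180 configurations, and that is before any of the manual nudges that never made it into a grid.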
This is why the honest trial count matters more than any single result. If you cannot say how many configurations you tried, you cannot say what your Sharpe means, and neither can I.
The deflated Sharpe ratio
The tool built for this is the Deflated Sharpe Ratio from David Bailey and Marcos López de Prado. It takes your reported Sharpe, adjusts the significance threshold for the number of trials you ran and for the skewness and kurtosis of the returns, and returns the probability that the true Sharpe is above zero. A free PDF sits on Bailey's site. The companion idea is the probability of backtest overfitting, worked out in a separate paper, which estimates how likely your selected strategy is to underperform the median out of sample.
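For a feel of the moving parts, here is a minimal sketch of the deflated Sharpe calculation as I read the published formula. Check it against the paper before relying on it. Inputs are per-period (not annualised) Sharpe ratios, kurtosis is raw (3 means normal tails), and the cross-trial Sharpe variance is something you have to estimate from your own trials. In the usage lines I assume it equals the null sampling variance 1 / (n_obs - 1), which is my simplification.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649015329

def expected_max_sharpe(n_trials, trial_sharpe_var):
    """Approximate expected best Sharpe among n_trials zero-skill trials."""
    if n_trials < 2:
        return 0.0  # a single trial involves no selection
    z = ((1.0 - EULER_GAMMA) * STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
         + EULER_GAMMA * STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e)))
    return math.sqrt(trial_sharpe_var) * z

def deflated_sharpe(sharpe, n_obs, n_trials, trial_sharpe_var,
                    skew=0.0, kurt=3.0):
    """Probability that the true Sharpe beats the selection-adjusted bar.

    sharpe and trial_sharpe_var are per period; kurt is raw kurtosis.
    """
    benchmark = expected_max_sharpe(n_trials, trial_sharpe_var)
    denom = math.sqrt(1.0 - skew * sharpe + (kurt - 1.0) / 4.0 * sharpe ** 2)
    return STD_NORMAL.cdf((sharpe - benchmark) * math.sqrt(n_obs - 1) / denom)

# Usage: an annualised Sharpe of 1.1 from 60 monthly observations.
n_obs = 60
monthly_sharpe = 1.1 / math.sqrt(12)
null_var = 1.0 / (n_obs - 1)  # assumption: trials differ only by noise
for n in (1, 10, 100, 1000):
    dsr = deflated_sharpe(monthly_sharpe, n_obs, n, null_var)
    print(f"{n:>5} trials: deflated Sharpe probability = {dsr:.2f}")
```

Under these assumptions, a Sharpe of 1.1 that looks close to certain as a single trial drops to roughly a coin flip once it is the best of a hundred.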
If you would rather feel the mechanism than read the algebra, a free calculator implements the probability of backtest overfitting. Feed it your best Sharpe, your sample length and an honest count of the parameters you tuned, and watch the deflated number drop. The exercise is uncomfortable, which is the point.
The 73 percent haircut
None of this is theoretical. A study by Suhonen, Lennkh and Perez, covered by Alpha Architect, examined 215 commercially promoted alternative beta strategies across five asset classes and found a median deterioration of 73 percent in Sharpe ratio between the backtested period and the live period. Half of these products shed roughly three quarters of their apparent edge or more once real money was watching, and these were strategies polished enough to sell. A garage backtest is not exempt from the same physics.
The academic literature says it from another angle. Campbell Harvey and Yan Liu, surveying the factor zoo, argue that with hundreds of factors already tested a new one should clear a t-statistic of 3 rather than the conventional 2 before you believe it. That bar is higher than almost anyone applies.
What I actually do about it
I keep a log of every configuration I test, so the trial count is a real number rather than a flattering guess. I hold out a slice of data I do not touch until the strategy is otherwise finished, and I treat a failure there as fatal. I deflate the Sharpe for the trial count before I get attached to it, and I raise the threshold as the search space grows. When I cannot reconstruct how many things I tried, I assume the worst and trust the result less.
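The log does not need to be sophisticated. Here is a minimal sketch of one way to keep it, as an append-only JSON Lines file keyed by a hash of the parameters. The class name, file layout and field names are illustrative, not a description of any particular tool.

```python
import hashlib
import json
from pathlib import Path

class TrialLog:
    """Append-only JSON Lines log of every configuration tested."""

    def __init__(self, path):
        self.path = Path(path)

    def record(self, config, sharpe):
        """Append one trial; config is a dict of JSON-serialisable parameters."""
        # Canonical JSON so the same parameters always hash to the same id.
        canonical = json.dumps(config, sort_keys=True)
        entry = {
            "config_id": hashlib.sha256(canonical.encode()).hexdigest()[:12],
            "config": config,
            "sharpe": sharpe,
        }
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")

    def n_trials(self):
        """Number of distinct configurations logged so far."""
        if not self.path.exists():
            return 0
        with self.path.open(encoding="utf-8") as fh:
            return len({json.loads(line)["config_id"]
                        for line in fh if line.strip()})
```

The habit matters more than the code: call record after every backtest, including the ones you throw away, and use n_trials() as the trial count when you deflate.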
If you read one aggregator to stay honest about this, Quantocracy collects the practitioners who keep returning to the problem.
Nothing here is advice to trade or avoid any particular strategy. It is a warning about a selection effect that quietly inflates almost every backtest, including the ones I run on myself.