Dr. G&W Math of Sports Study Hall: The Spread vs. Probability of Victory

Dr. Green and White

Of all the sports/math posts I have made over the years, this one might be the "mathiest." You have been warned. But I think this topic is really interesting, and I hope at least some of you find it interesting as well.

A few years back, I developed a sports-math fascination that quickly turned into a bit of an obsession. The topic? The relationship between the Vegas spread and the probability that the favored team will be victorious. I am not sure exactly where this came from, but its origin is probably linked to ESPN's FPI metric, which attempts to do something very similar to what I try to do, but with seemingly much more dubious methodology (such as an undying reliance on recruiting rankings over on-field results). In any event, I became a bit obsessed with finding the answer to the question of how spreads correlate to the probability of victory. I have tried a few Google searches to find out if someone else has written on the subject, and so far I have not come up with much at all. But, I think I have found that answer, or at least I think I am very close.

At first glance, you would think that this question should be quite simple to solve. After all, you just need to plot the data and fit a line to it, right? Well, as it turns out, spread-versus-victory data is pretty scattered, and you need quite a bit of data to make sense of it. I have been logging spread data since 2009, which now includes over 4,000 data points. I have not yet added the 2017 data, but for those eight full years of college football data, this is what the raw data looks like:

[Figure: raw probability of victory for the favored team vs. opening Vegas spread, 2009-2016]

To a first approximation, it looks like it might be a roughly linear trend that reaches 100% somewhere in the mid-20s. But, this is not a terribly satisfying result. So, I decided to see what would happen if I tried to smooth the raw spread data using a 7-point boxcar / moving average method. If I do that, the data now looks like this:

[Figure: 7-point boxcar-smoothed probability of victory vs. opening spread, with quadratic fit]

To me, this plot makes a bit more sense, as it has the shape that one would generally expect. That is, when the spread is a pick'em (zero), the probability of victory is naturally 50%, and the curve approaches 100% asymptotically as the spread increases. The curve does get a little wonky at the higher spreads, however, mainly because there is generally less data in that region and a single upset can cause a visible spike, as is clear from the raw data.
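
As an aside, the smoothing step is easy to reproduce. Here is a minimal sketch in Python, where the raw win-rate numbers are made up for illustration rather than pulled from my actual log:

```python
import numpy as np

def boxcar_smooth(values, window=7):
    """Centered moving average (boxcar) with edge correction.

    Points near the ends are averaged over however much of the
    window actually fits, so the output matches the input length.
    """
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window)
    sums = np.convolve(values, kernel, mode="same")
    counts = np.convolve(np.ones_like(values), kernel, mode="same")
    return sums / counts

# Illustrative raw win fractions at successive spread values (not real data).
raw_win_rate = [0.50, 0.55, 0.52, 0.61, 0.58, 0.66, 0.63, 0.71, 0.69, 0.74]
print(boxcar_smooth(raw_win_rate))
```

A centered 7-point window means each smoothed point averages the raw win rate at that spread and the three spread values on either side.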

In an attempt to fit this data, I decided to use a simple quadratic equation, also shown above. In this case, setting the parameters was pretty straightforward, as the function should be 0.50 at x = 0, and I somewhat arbitrarily set the probability of victory to reach 100% at a spread equal to 30. This is based on the general observation in the data that upsets do tend to happen up until the spread reaches 30. After that, they are very rare (in the 8-year span, there are only two upsets out of almost 250 games with spreads this large). I found that a very simple equation fits the data reasonably well. It is:

P(win) = 1 − (30 − x)^2 / 1800

where x is the opening Vegas spread. Once the spread exceeds 30, I just set the probability of victory to 100%, which is not strictly true, but was "close enough at the time." I used this formula in my various mathematical calculations for several years.
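
In code form, that piecewise rule (using the vertex-form quadratic above, which is one simple way to satisfy both constraints) looks like this:

```python
def win_probability_quadratic(spread):
    """Empirical win probability for the favorite from the opening spread.

    Quadratic pinned at P(0) = 0.50, leveling off at P(30) = 1.00,
    and clamped at 100% for spreads of 30 or more.
    """
    x = abs(spread)
    if x >= 30.0:
        return 1.0
    return 1.0 - (30.0 - x) ** 2 / 1800.0

print(win_probability_quadratic(0))   # 0.50 for a pick'em
print(win_probability_quadratic(14))  # ~0.86
print(win_probability_quadratic(35))  # 1.00 past a 30-point spread
```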

But, it was not very satisfying. I felt like there should be a more mathematical answer to this question rather than just a pretty-looking empirical formula. Then, earlier this year, I was having a related discussion on this board when someone (and I am sorry, but I forget who) gave me a hint that allowed me to derive a solution to this problem. That hint was the observation that teams that are favored by x points will win by an average of... x points. This seems somewhat obvious, but I had never actually checked this fact. So, the first thing I did was to plot this data, which looks like this:

[Figure: average margin of victory vs. opening spread]

As you can see, based on eight years of data, the statement above is true. There is obviously more scatter in the data once the spread gets above that magical value of 30, but once again, that is due to increasingly sparse data at the higher spreads. But this got me thinking: if I can take the average value of the margin of victory at each value of the spread, I can perform other mathematical / statistical operations on the data. The general problem with the raw data is that there is just not enough of it to see the "real" correlation easily. But, what if I assume that the distribution of game outcomes fits a Gaussian / Normal Distribution which is centered on the spread? As we will see in a moment, if this is true, it is relatively simple to calculate the function that had eluded me for several years. But, the first question is: is the data at each value of the spread normally distributed?
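
For what it is worth, the "average margin at each value of the spread" step behind the plot above is a simple grouped mean if the game log lives in a table. The column names below are placeholders for however you store the data, and the numbers are illustrative:

```python
import pandas as pd

# One row per game: opening spread and the favorite's final margin
# (negative margin = the favorite lost). Illustrative numbers only.
games = pd.DataFrame({
    "spread": [3.0, 3.0, 7.0, 7.0, 7.0, 14.0, 14.0],
    "margin": [10, -4, 7, 21, -3, 17, 9],
})

# Average actual margin at each value of the spread. If Vegas is
# unbiased, this should track the spread itself.
print(games.groupby("spread")["margin"].mean())
```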

As it turns out, there is still not enough data, even after eight years, to see a true distribution at any given value of the spread. But, what you can do is take all of the data together and plot the distribution of the difference between the final margin of victory and the opening spread for all games, and see what it looks like. That data is shown here, including a best fit to a Normal Distribution curve:

[Figure: distribution of (final margin of victory minus opening spread) with best-fit normal curve]

While it is not a perfect fit to a bell / Gaussian curve, it is pretty close. Furthermore, this plot gives us another useful piece of information, which is the standard deviation of the margin of victory relative to the spread. When all the data across all spreads is taken into account, the standard deviation is 15.84 points, just slightly over 2 TDs.
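
The fit itself is only a couple of lines once you compute (final margin minus opening spread) for every game; the short list below stands in for the 4,000+ real data points:

```python
import numpy as np
from scipy.stats import norm

# (final margin - opening spread) for each game, pooled across all
# spreads. Illustrative values standing in for the full data set.
residuals = np.array([7.0, -7.0, 0.0, 14.0, -10.0, 3.0, -17.0, 10.0])

# Maximum-likelihood fit of a normal distribution to the residuals.
# On the full data set, mu comes out near 0 and sigma near 15.84.
mu, sigma = norm.fit(residuals)
print(f"mean = {mu:.2f}, std dev = {sigma:.2f}")
```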

From a pure football standpoint, this information in itself is pretty interesting. It speaks to the overall variance / chaos that is inherent to college football. Based on the math of the normal distribution, 68% of the population should fall within one standard deviation of the mean (in this case, the opening spread). That means that Vegas only gets within one standard deviation (a bit over 2 TDs) of the actual margin of victory about two-thirds of the time. That is extraordinary. I think that most football fans would intuitively guess this number was less than a TD, but it is not. Chaos rules in reality.
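
The one-standard-deviation figure is easy to double check:

```python
from scipy.stats import norm

# Fraction of a normal population within one standard deviation of the
# mean; with sigma = 15.84, that band is the spread +/- ~16 points.
print(f"{norm.cdf(1) - norm.cdf(-1):.1%}")  # 68.3%
```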

Getting back to the main point, it turns out that this same math behind the normal distribution allows us to solve the initial puzzle using basic statistics. This is because all statistical distributions have an associated cumulative distribution function that allows you to calculate the percentage of the population above and below any value. For the normal distribution, all you need is the mean (the opening spread) and the standard deviation (around 15.84), and you can calculate the probability of the population (in this case, the final point differential) being above or below a fixed number. Since we care about winning and losing, this value is simply zero. Excel even has a simple formula, "NORMDIST", that can be used. Literally all you need to generate this curve is to make a column in Excel with the possible spreads (usually 1 to however high you want to go, in increments of 0.5) and, in a second column, use the formula 1-NORMDIST(0,spread,15.84,TRUE), where "spread" is the value in the "spread" column.
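
If you would rather not do this in Excel, the same column drops out of Python's scipy, where norm.cdf is the equivalent of NORMDIST with the cumulative flag set to TRUE:

```python
import numpy as np
from scipy.stats import norm

SIGMA = 15.84  # pooled standard deviation of (margin - spread)

# Possible spreads in half-point increments.
spreads = np.arange(0.5, 30.5, 0.5)

# Probability that the favorite's final margin lands above zero when the
# margin is normally distributed around the spread. This is the Python
# version of 1 - NORMDIST(0, spread, 15.84, TRUE).
p_win = 1.0 - norm.cdf(0.0, loc=spreads, scale=SIGMA)

for s, p in zip(spreads[:4], p_win[:4]):
    print(f"spread {s:4.1f} -> {p:.3f}")
```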

Unfortunately, it is not quite that simple. The problem is the standard deviation. The assumption in the paragraph above is that the standard deviation is fixed for all values of the spread. In reality, this does not seem to be quite true. If you actually plot the standard deviation as a function of the opening spread, you get this:

[Figure: standard deviation of the margin of victory vs. opening spread, with linear regression]

So, while the standard deviation does hover around 15 in cases where the spread is small, it generally trends down as the spread grows (again, with significant scatter). So, one could simply use the linear regression shown above as the effective standard deviation in the formula, which would seem to make the most sense.
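
Folding a spread-dependent standard deviation into the same calculation only changes one line. The intercept and slope below are illustrative stand-ins for whatever the regression in the plot actually gives:

```python
import numpy as np
from scipy.stats import norm

def win_probability(spread, intercept=15.8, slope=-0.1):
    """Win probability with a standard deviation that shrinks as the
    spread grows: sigma = intercept + slope * spread.

    The default intercept and slope are placeholders, not my fitted
    regression coefficients.
    """
    spread = np.asarray(spread, dtype=float)
    sigma = intercept + slope * spread
    return 1.0 - norm.cdf(0.0, loc=spread, scale=sigma)

print(win_probability([3.0, 10.0, 21.0]))
```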

But, as it turns out, both the fixed value and the linear fit above give curves that do not quite fit the data. In both cases, the curve under-predicts the probability of the favored team winning, as seen below:

[Figure: probability of victory curves from the fixed and linear-fit standard deviations, compared to the smoothed data]

So, with some amount of hesitation, I decided to use my boxcar-smoothed data to perform a regression on the standard deviation data to find an "optimized" line. That optimized fit to the standard deviation data is shown here in red.

[Figure: optimized linear fit (red) to the standard deviation data]

Clearly, this line doesn't fit the standard deviation data as well, but if you use this line and generate the probability of victory curve, you get this:

[Figure: probability of victory vs. opening spread using the optimized standard deviation fit]

I think this fits the data pretty damn well. Yes, I did have to use one adjustable parameter instead of zero, but I am pretty happy with it. It also has the benefit that the probability of victory reaches 99% when a team is favored by 31.5 points, which seems in line with reality. The fact that I had to use an empirical fit on the standard deviation data is a bit weird, and I am not sure why it is needed, but it might suggest that the population distribution is not exactly Gaussian, or it might point to a difference between the variance of the true population and the variance of my (still very large) sample. I did not make an exhaustive search of other distributions, so there may be a better one out there. As I accumulate more data, I might be able to figure that out.
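
As a sanity check on the finished model, the sigma line can be anchored so that it reproduces the 99%-at-31.5-points landmark. The coefficients below are back-solved from that landmark and the pooled 15.84 value rather than copied from my actual regression, but the resulting curve behaves the same way:

```python
import numpy as np
from scipy.stats import norm

# Require P(win) = 0.99 at a 31.5-point spread: 31.5 / sigma(31.5) = z_99.
z99 = norm.ppf(0.99)          # ~2.326
sigma_at_31_5 = 31.5 / z99    # ~13.5 points

# Anchor the sigma line at the pooled 15.84 value at a pick'em and run it
# through sigma_at_31_5. Both anchors are assumptions, not my fitted line.
intercept = 15.84
slope = (sigma_at_31_5 - intercept) / 31.5

def win_probability(spread):
    spread = np.asarray(spread, dtype=float)
    sigma = intercept + slope * spread
    return 1.0 - norm.cdf(0.0, loc=spread, scale=sigma)

for s in (0.0, 7.0, 14.0, 21.0, 31.5):
    print(f"spread {s:4.1f} -> {win_probability(s):.3f}")
```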

Quite honestly, I am not sure if this is the method used by ESPN or Nate Silver, etc. When they show probabilities of victory, their numbers are similar to mine, but not quite the same. Regardless, I do believe that this is the correct methodology, even if the exact behavior of the standard deviation is still a bit unclear.

That is all for now. Thus ends the lesson. Enjoy!