It has been almost three weeks since Villanova cut down the nets in San Antonio, and with that, three weeks since the beginning of the great annual five-month gap between meaningful college sports. While we patiently pass the time this summer until the real fun begins again this fall, I thought that I would reflect back on the NCAA basketball tournament with a series of my math-infused musings. Today's topic: why did it take 34 years and 136 attempts before a 1-seed finally lost to a 16-seed, while a 2-seed has fallen on average about once every four years? The answer, as it turns out, is the same answer that I often find when I dig deep into these topics: because, math.
My journey to solve this mystery began a few years ago when I made the observation that the probability of an upset in the NCAA tournament scales roughly linearly with the difference between the two seeds. The correlation is quite strong for the most common seed pairs (i.e., those found in first-round games), but it also holds loosely for all seed combinations. This correlation is shown here (where the size of each marker is scaled to the relative frequency of that pairing in past tournaments):
Although this plot is a pretty good rule of thumb, it slightly troubled me that there is no clear reason for the correlation to be as linear as it is. A hint toward understanding it came earlier this year, when a quite different sports math question was rattling around in my brain: what is the correlation between the Vegas spread of a given basketball game and the probability of an upset? Fortunately, I was able to use properties of the normal distribution to answer that question, which I summarized in a bit more detail here.
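If you want to see the math, here is a minimal sketch of that normal-distribution trick (my reconstruction, not code from that earlier post). Assume the final margin is normally distributed around the Vegas spread, with a standard deviation of roughly 10.7 points; that value is my assumption, picked because it roughly reproduces the percentages quoted later in this post, and published estimates for college basketball are generally in the 10 to 11 point range. The favorite's win probability is then just the normal CDF evaluated at the spread divided by that standard deviation.

```python
from statistics import NormalDist

# Assumption: final margins are roughly normal around the Vegas spread, with a
# standard deviation of about 10.7 points (my choice; published estimates for
# college basketball are generally in the 10-11 point range).
SIGMA = 10.7

def favorite_win_prob(spread: float, sigma: float = SIGMA) -> float:
    """Probability that a team laying `spread` points wins the game outright."""
    return NormalDist().cdf(spread / sigma)

def upset_prob(spread: float, sigma: float = SIGMA) -> float:
    """Probability that the underdog wins outright."""
    return 1.0 - favorite_win_prob(spread, sigma)

# Example: under this model, a 10-point favorite wins about 82-83% of the time.
print(f"{favorite_win_prob(10):.1%}")
```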
It then occurred to me to compare the probability of an upset based on the spread to the probability of an upset based on the seed differential. It was a bit tricky to find good historical data on NCAA tournament lines, but I did find one website with enough data to look at the numbers. When I plotted the seed differential against the spread, the data looked like this:
Once again, the data are a bit scattered, but there is a decent linear correlation. The correlation gets even stronger when you compare the probability of an upset derived from the average spread for a particular seed combination to the actual upset rate for that combination in tournament play. That correlation is shown here (where the probability of the favorite winning is plotted rather than the upset rate):
Similar to the other plots shown above, most of the data fall on or near the central line, with a few notable exceptions. Namely, 1-seeds have an oddly good record against 9-seeds, while 2-seeds have a surprisingly hard time with 10-seeds.
But I was still a bit troubled by the fairly thin nature of the spread data that I could find. The website I found listed only the 30 most recent instances of a given seed pair. That works fine for the 2 vs. 6 match-up, but is a bit sparse for the 5 vs. 12 match-up. However, my previous study of the Vegas spread also suggested that the average margin of victory in a given match-up is very highly correlated with the final Vegas spread. That data, covering over 45,000 college games, is shown here:
With this in mind, I was able to predict the average spread of each seed combination using the data for all games back to 1979, when seeding began. Using this implied "spread" (as opposed to the actual spread) to calculate the upset rate results in this plot, which does have a better R² than the sparse spread data.
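As a rough sketch of how that implied-spread step can be wired up (my own reconstruction under the same assumptions as above, not the actual code or data behind the plot): group tournament results by seed pair, use the average margin of victory as a stand-in for the line, and push it through the same normal-CDF conversion to get an implied upset rate next to the observed one.

```python
from collections import defaultdict
from statistics import NormalDist, mean

SIGMA = 10.7  # assumed margin standard deviation, as in the earlier sketch

def implied_upset_rates(games):
    """
    games: iterable of (favorite_seed, underdog_seed, margin) tuples, where
    `margin` is the better-seeded team's final scoring margin (negative if it lost).
    Returns {(fav_seed, dog_seed): (implied_spread, implied_upset_prob, actual_upset_rate)}.
    """
    by_pair = defaultdict(list)
    for fav, dog, margin in games:
        by_pair[(fav, dog)].append(margin)

    norm = NormalDist()
    results = {}
    for pair, margins in by_pair.items():
        implied_spread = mean(margins)                              # average margin as a proxy for the line
        implied_upset = norm.cdf(-implied_spread / SIGMA)           # model's upset probability
        actual_upset = sum(m < 0 for m in margins) / len(margins)   # observed upset rate
        results[pair] = (implied_spread, implied_upset, actual_upset)
    return results
```

Comparing the implied upset probability to the actual upset rate for each seed pair is essentially what the plot above shows.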
With all of this in mind, it is time to come back to the original question: why do 2-seeds get upset so much more often than 1-seeds? The answer: because games in the NCAA tournament behave exactly the same way regular-season games do when you treat upsets as a function of the Vegas line.
When it comes to the 2 vs. 15 match-up, both the spread data that I could find and the average margin of victory data suggest 2-seeds are, on average, 16.5-point favorites over 15-seeds. This corresponds to a 93.8% chance that the 2-seed advances. Based on this probability, one would expect a total of 8.5 upsets in the 136 times this match-up has occurred. In reality, there have been 8 since the tournament expanded to 64 teams in 1985. Score one for math.
As for 1-seeds, it is slightly less clear: the sparse actual spread data that I found suggests 1-seeds are favored by an average of 22.3 points, while the margin of victory data suggests the spread should be a bit higher, at 24.5 points. Taken together, this suggests that a 16-seed has between a 1.8% and a 1.1% chance of an upset. Over 34 years and 136 attempts, we should have observed between 1.5 and 2.5 16-over-1 upsets. We have, of course, observed only one: this year's UMBC upset over the University of Virginia. Those probabilities suggest that a 16-over-1 upset should occur somewhere between once every 14 years and once every 23 years. In other words, we were a bit overdue before this year. That said, it will likely be at least another decade (or two, or maybe even three) until we experience a UMBC-sized upset in March again. But the next 15-over-2 upset is basically due any year now. After all, it's just math.
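For completeness, the back-of-the-envelope arithmetic behind those last two paragraphs looks like this (the spreads and game counts come straight from the numbers above; the 10.7-point standard deviation is my assumed value from the first sketch):

```python
from statistics import NormalDist

SIGMA = 10.7   # assumed margin standard deviation (see the first sketch)
GAMES = 136    # 1-vs-16 and 2-vs-15 games played since the 1985 expansion
PER_YEAR = 4   # four of each match-up every March

def upset_prob(spread: float) -> float:
    return NormalDist().cdf(-spread / SIGMA)

for label, spread in [("2 vs. 15", 16.5),
                      ("1 vs. 16 (spread data)", 22.3),
                      ("1 vs. 16 (margin data)", 24.5)]:
    p = upset_prob(spread)
    print(f"{label}: upset chance {p:.1%}; "
          f"expected upsets in {GAMES} games ≈ {GAMES * p:.1f}; "
          f"about one every {1 / (PER_YEAR * p):.0f} years")
```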