Benford’s Law in the NFL
Posted by Doug on November 30, 2006
Benford's Law is a fascinating bit of mathematical trivia that has nothing to do with football. Yesterday's post was superficially related to it, so I'm using that as an excuse to introduce it to those of you who haven't seen it before.
Yesterday's post was about the yards that get rounded out of a player's fantasy point total in a lot of leagues. The number of yards a player loses to rounding depends on the last digit of his rushing yardage total for each game. In the comments, someone asked whether the distribution of final digits on rushing totals is uniform leaguewide. Well, it doesn't appear to be exactly uniform, but it's pretty close.
Final
digit Freq PCT
======================
0 1521 0.077
1 2334 0.118
2 2472 0.125
3 2223 0.113
4 2123 0.108
5 1994 0.101
6 1883 0.095
7 1838 0.093
8 1703 0.086
9 1644 0.083
Now, a very different thing happens if you take a look at the first digits of rushing totals:
First
digit Freq PCT
======================
1 5982 0.303
2 3146 0.159
3 2229 0.113
4 1923 0.097
5 1712 0.087
6 1439 0.073
7 1218 0.062
8 1133 0.057
9 953 0.048
Now that's clearly far from uniform. And that's just what you'd expect if you know about Benford's Law. Here is the Wikipedia description:
Benford's law, also called the first-digit law, states that in lists of numbers from many real-life sources of data, the leading digit is 1 almost one-third of the time, and further, larger numbers occur as the leading digit with less and less frequency as they grow in magnitude, to the point that 9 is the leading digit less than one time in twenty.
That's almost exactly what we see with the NFL rushing data. Now you may be thinking at this point that the distribution of NFL rushing yardage leading digits is an artifact of the game itself. People rush for between 100 and 199 yards all the time, but hardly ever between 200 and 299. Maybe that's the explanation. But maybe not. Benford's Law is pretty pervasive. It applies to populations of cities and countries, to lengths of rivers, to stock prices, and even to the collection of numbers --- from whatever source --- that appear on the front page of the newspaper over a long period of time. In a whole lot of real-life data sets, you'll find numbers with a leading digit of 1 much, much more often than numbers with a leading digit of 9.
You won't find the same pattern in all sets of data. If we did this with yards per rush instead of yardage totals, we would not get a similar distribution. If we looked at the heights of NFL players, the distribution of first digits would not follow Benford's Law. But it is remarkable that it applies to so many data sets including, at least roughly, rushing yardage totals.
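To put a number on "at least roughly": Benford's Law gives the expected share of leading digit d as log10(1 + 1/d). Here is a minimal Python sketch that compares that formula with the first-digit counts from the yardage table above. The variable and function names are made up for illustration, and the counts are simply copied from the table.

```python
import math

# First-digit counts copied from the rushing-yardage table above.
observed_counts = {1: 5982, 2: 3146, 3: 2229, 4: 1923, 5: 1712,
                   6: 1439, 7: 1218, 8: 1133, 9: 953}


def benford_share(d):
    """Expected share of leading digit d under Benford's Law."""
    return math.log10(1 + 1 / d)


total = sum(observed_counts.values())
observed_share = {d: n / total for d, n in observed_counts.items()}

# Print the observed and expected shares side by side.
print("digit  observed  Benford")
for d in range(1, 10):
    print(f"{d:>5}  {observed_share[d]:8.3f}  {benford_share(d):7.3f}")
```

The two columns should track each other fairly closely, with 1 leading about 30% of the time in both.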
Now let's investigate whether the Benford phenomenon, as observed in this case, is merely an artifact of the structure of NFL football games.
What if we measured rushing totals in feet instead of yards? LaDainian Tomlinson gained 327 feet rushing last week, Rudi Johnson notched 192 feet, and so on. Here is what the leading digits look like:
First
digit Freq PCT
======================
1 5600 0.284
2 3539 0.179
3 3603 0.183
4 1366 0.069
5 1013 0.051
6 2012 0.102
7 587 0.030
8 547 0.028
9 1468 0.074
Not exactly the same pattern. But still far from uniform and still skewed in essentially the same way. Did you know that Rudi Johnson rushed for 5852 centimeters last week? Here is the distribution of leading digits of rushing yardage totals measured in cm:
First
digit Freq PCT
======================
1 5780 0.293
2 2901 0.147
3 2193 0.111
4 1886 0.096
5 1602 0.081
6 1348 0.068
7 1211 0.061
8 1012 0.051
9 1802 0.091
Rudi also rushed for .0364 miles last week (that counts as a leading digit of 3, not zero). Here is the distribution of leading digits for the "rushing miles" totals of all games played in the NFL since 1995:
First
digit Freq PCT
======================
1 5537 0.281
2 3442 0.174
3 2778 0.141
4 1591 0.081
5 2664 0.135
6 1398 0.071
7 1053 0.053
8 534 0.027
9 738 0.037
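The unit conversions behind these tables are easy to reproduce. Below is a minimal Python sketch of the idea, not the code actually used for the tables. It assumes Rudi Johnson's game was 64 yards (inferred from the 192 feet quoted above), and the helper names `leading_digit` and `in_units` are hypothetical.

```python
FEET_PER_YARD = 3
CM_PER_YARD = 91.44
YARDS_PER_MILE = 1760


def leading_digit(x):
    """Return the first nonzero digit of a positive number."""
    if x <= 0:
        raise ValueError("only positive numbers are handled here")
    while x >= 10:   # shift large values down into [1, 10)
        x /= 10
    while x < 1:     # shift small values up, so 0.0364 -> 3.64
        x *= 10
    return int(x)


def in_units(yards):
    """Express a rushing total (in yards) in several units."""
    return {
        "yards": yards,
        "feet": yards * FEET_PER_YARD,
        "cm": yards * CM_PER_YARD,
        "miles": yards / YARDS_PER_MILE,
    }


# Assumed value: 64 yards, i.e. the 192 feet mentioned in the post.
rudi = in_units(64)
digits = {unit: leading_digit(value) for unit, value in rudi.items()}
print(digits)  # {'yards': 6, 'feet': 1, 'cm': 5, 'miles': 3}
```

The same 64-yard game leads with a 6, 1, 5, or 3 depending on the unit. Running every game through the same two functions and tallying the digits would give tables like the ones above.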
So it really doesn't have much to do with the fact that 100--199 yards is a more common total than 200--299, or anything like that. If that were the cause of the distribution of leading digits, then the pattern would likely disappear if we measured in some other units.
And that's actually the key to why Benford's Law works. For sets of data that have units, the distribution has to be (subject to a few caveats) one that is invariant to changes of units. It just so happens that the Benford distribution has that property.
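To see the invariance in action, here is a small Python simulation. It uses synthetic data, not the NFL data: the values are spread evenly on a log scale across six orders of magnitude, which is an assumption chosen because such data follows Benford's Law closely. Multiplying every value by 3 (like converting yards to feet) should leave the leading-digit shares essentially unchanged.

```python
import math
import random


def leading_digit(x):
    """Return the first nonzero digit of a positive number."""
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)


def digit_shares(values):
    """Share of values whose leading digit is 1, 2, ..., 9."""
    counts = [0] * 10
    for v in values:
        counts[leading_digit(v)] += 1
    return [c / len(values) for c in counts[1:]]


random.seed(2006)  # fixed seed so the run is repeatable

# Synthetic, log-uniform data between 1 and 1,000,000 (an assumption).
data = [10 ** random.uniform(0, 6) for _ in range(100_000)]

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
original = digit_shares(data)
rescaled = digit_shares([v * 3 for v in data])  # change of units

print("digit  Benford  original  rescaled")
for d in range(1, 10):
    print(f"{d:>5}  {benford[d - 1]:7.3f}  {original[d - 1]:8.3f}  "
          f"{rescaled[d - 1]:8.3f}")
```

All three columns should agree to within sampling noise. Real data like the rushing totals is lumpier, since the totals are whole numbers of yards covering a limited range, which is one of the caveats.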
If you find this interesting, the previously cited Wikipedia writeup has more information. If you want something more hardcore, check out the MathWorld entry.

November 30th, 2006 at 7:44 am
Very interesting stuff. Question--are your distributions based on all players' single game rushing totals? all running backs?
On a Benford-unrelated but statgeekery note, I have a question about WR TD's (and perhaps even other positions). Does the distribution of TD's follow a normal or a Poisson distribution, or some other distribution pattern?
For all WR's who score exactly 5 TD's in a season, do we have the expected number of single 3 TD games from that population as a whole, or are there more or fewer players with exactly 1 TD in 5 different games than we might otherwise expect?
November 30th, 2006 at 9:25 am
Fascinating.
You mentioned measuring feet instead of yards, and I'm wondering if you think the NFL should do that. Right now if it's third down with one inch to go and the running back gains two inches, it's called a yard. If the running back gains 36 inches, that's also called a yard. Would it make more sense for the official scorers to call the first one a gain of one foot and the second one a gain of three feet?
November 30th, 2006 at 3:48 pm
Actually, I'm more intrigued by the last-digit pattern. It seems skewed to the low end, which indicates to me that coaches let RBs attain milestones, going just over 100 yards, and then yank them.
November 30th, 2006 at 7:39 pm
All nonzero rushing totals for players at all positions from 1995--present.
I'd bet Poisson, but that's a good question. It is officially on the to-blog-about list. Thanks for the idea.
November 30th, 2006 at 7:43 pm
If they had it to do over again from scratch, I'd be in favor of that. But I wouldn't advocate a change at this point. People (including me) are too set in their ways to change.
I doubt there's any real difference between 3rd-and-15 (feet) and 3rd-and-16 (feet), so it's not a pressing need except when you get under a couple of yards.
I guess it would be nice if they could somehow make the "and inches" concept jibe with the official stats. If a guy runs for 9.94 yards on first down and then the QB sneaks for .06 on second, it would be nice if they could be credited with 10 and 0 yards instead of 9 and 1.
December 6th, 2006 at 3:21 pm
Vince said:
"Actually, I’m more intrigued by the last-digit pattern. It seems skewed to the low end, which indicates to me that coaches let RBs attain milestones, going just over 100 yards, and then yank them. "
I think a more likely reason is that backup RBs getting limited carries skew the digit distribution to the low end.
December 8th, 2006 at 2:15 pm
This is baseball.
"OB" is times on base. Based on all non-pitchers in baseball history, using seasonal totals (Abreu gets one record for 2006). "H" is hits, based on all players in baseball history, using team-seasonal totals (Abreu gets two records for 2006).
n OB H
1 41% 43%
2 21% 13%
3 7% 9%
4 6% 8%
5 5% 7%
6 5% 6%
7 5% 5%
8 5% 5%
9 4% 5%
Here's the SQL for the Lahman database, if someone wants to have fun:
SELECT IIf([H]>99,Int([H]/100),IIf([H]>9,Int([H]/10),[H])) AS firstDigit, Sum(1) AS n
FROM Batting
WHERE [H]<>0
GROUP BY IIf([H]>99,Int([H]/100),IIf([H]>9,Int([H]/10),[H]));
Other databases would use CASE WHEN or DECODE, instead of IIF.
December 8th, 2006 at 2:17 pm
The where clause should be
H NE 0
where NE is a less than sign followed by greater than sign
January 4th, 2007 at 7:09 am
I provide free software to develop a variety of Benford analysis statistics, such as first 1-3 digits, last 1-2 digits, second digit, etc. Includes charts. Also, conformity with the Benford distribution can be tested with the statistical approach used by Kolmogorov-Smirnov, called a "d-statistic". There are quite a variety of applications for Benford's law, including detection of insurance fraud, tax fraud, "curb-stoning" in surveys, etc.
The software is available at http://www.ezrstats.com/Downloads.htm.
January 10th, 2008 at 12:26 pm
Touchdowns do not follow a normal distribution. That distribution is continuous (fractional touchdowns don't have any meaning) and can take negative values (there are no such things as negative touchdowns).
If touchdowns followed some sort of well-known distribution, I would expect it to be either a "zero inflated" Poisson or "zero inflated" negative binomial distribution.
If someone can give me a reference to a database with the information, I will be happy to give the board a more definitive answer.
July 1st, 2009 at 11:56 am
"So it really doesn’t have much to do with the fact that 100–199 yards is a more common total than 200–299, or anything like that."
It does actually. It's 2x harder to rush for 20 yards/feet/cm/miles as 10 (or 200 and 100). It's only 1.1x harder to rush for 90 units as 80 units.