## Benford’s Law in the NFL

Benford's Law is a fascinating bit of mathematical trivia that has nothing to do with football. Yesterday's post was superficially related to it, so I'm using that as an excuse to introduce it to those of you who haven't seen it before.

Yesterday's post was about the yards that get rounded out of a player's fantasy point total in a lot of leagues. The number of yards a player loses to rounding depends on the last digit of his rushing yardage total for each game. In the comments, someone asked whether the distribution of final digits on rushing totals is uniform leaguewide. Well, it doesn't appear to be exactly uniform, but it's pretty close.

| Final digit | Freq | PCT |
|---|---|---|
| 0 | 1521 | 0.077 |
| 1 | 2334 | 0.118 |
| 2 | 2472 | 0.125 |
| 3 | 2223 | 0.113 |
| 4 | 2123 | 0.108 |
| 5 | 1994 | 0.101 |
| 6 | 1883 | 0.095 |
| 7 | 1838 | 0.093 |
| 8 | 1703 | 0.086 |
| 9 | 1644 | 0.083 |
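
For anyone who wants to run this kind of tally on their own numbers, here is a minimal sketch of the last-digit count. The function names and the handful of sample totals are made up for illustration; they are not the data behind the table.

```python
from collections import Counter


def last_digit(yards: int) -> int:
    """Final digit of a whole-number yardage total (sign ignored)."""
    return abs(yards) % 10


def last_digit_shares(totals):
    """Map each final digit 0-9 to its share of the given totals."""
    counts = Counter(last_digit(y) for y in totals)
    n = len(totals)
    return {d: counts.get(d, 0) / n for d in range(10)}


# Hypothetical single-game rushing totals, for illustration only.
sample_totals = [109, 64, 112, 7, 83, 21, 142, 58]
shares = last_digit_shares(sample_totals)

for digit, share in shares.items():
    print(f"{digit}  {share:.3f}")
```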

Now, a very different thing happens if you take a look at the *first* digits of rushing totals:

| First digit | Freq | PCT |
|---|---|---|
| 1 | 5982 | 0.303 |
| 2 | 3146 | 0.159 |
| 3 | 2229 | 0.113 |
| 4 | 1923 | 0.097 |
| 5 | 1712 | 0.087 |
| 6 | 1439 | 0.073 |
| 7 | 1218 | 0.062 |
| 8 | 1133 | 0.057 |
| 9 | 953 | 0.048 |

That's clearly far from uniform, and it's just what you'd expect if you know about Benford's Law. Here is the Wikipedia description:

"Benford's law, also called the first-digit law, states that in lists of numbers from many real-life sources of data, the leading digit is 1 almost one-third of the time, and further, larger numbers occur as the leading digit with less and less frequency as they grow in magnitude, to the point that 9 is the leading digit less than one time in twenty."
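
The law also has a simple closed form: the probability that the leading digit is d equals log10(1 + 1/d). As a rough sketch, the snippet below prints those theoretical shares next to the observed shares from the first-digit table of rushing totals in yards. The function name is my own choice; the counts are copied from that table.

```python
import math


def benford_probability(d: int) -> float:
    """Benford's Law probability that the leading digit equals d (1-9)."""
    return math.log10(1 + 1 / d)


# First-digit counts for rushing totals in yards, copied from the table.
observed = {1: 5982, 2: 3146, 3: 2229, 4: 1923, 5: 1712,
            6: 1439, 7: 1218, 8: 1133, 9: 953}
total = sum(observed.values())

for d in range(1, 10):
    print(f"{d}  observed {observed[d] / total:.3f}  "
          f"Benford {benford_probability(d):.3f}")
```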

That's almost exactly what we see with the NFL rushing data. Now you may be thinking at this point that the distribution of NFL rushing yardage leading digits is an artifact of the game itself. People rush for between 100 and 199 yards all the time, but hardly ever between 200 and 299. Maybe that's the explanation. But maybe not. Benford's Law is pretty pervasive. It applies to populations of cities and countries, to lengths of rivers, to stock prices, and even to the collection of numbers (from whatever source) that appear on the front page of the newspaper over a long period of time. In a whole lot of real-life data sets, you'll find numbers with a leading digit of 1 much, much more often than numbers with a leading digit of 9.

You won't find the same pattern in *all* sets of data. If we did this with yards per rush instead of yardage totals, we would not get a similar distribution. If we looked at the heights of NFL players, the distribution of first digits would not follow Benford's Law. But it is remarkable that it applies to so many data sets including, at least roughly, rushing yardage totals.

Now let's investigate whether the Benford phenomenon, as observed in this case, is merely an artifact of the structure of NFL football games.

What if we measured rushing totals in feet instead of yards? LaDainian Tomlinson gained 327 feet rushing last week, Rudi Johnson notched 192 feet, and so on. Here is what the leading digits look like:

| First digit | Freq | PCT |
|---|---|---|
| 1 | 5600 | 0.284 |
| 2 | 3539 | 0.179 |
| 3 | 3603 | 0.183 |
| 4 | 1366 | 0.069 |
| 5 | 1013 | 0.051 |
| 6 | 2012 | 0.102 |
| 7 | 587 | 0.030 |
| 8 | 547 | 0.028 |
| 9 | 1468 | 0.074 |

Not exactly the same pattern. But still far from uniform and still skewed in essentially the same way. Did you know that Rudi Johnson rushed for 5852 centimeters last week? Here is the distribution of leading digits of rushing yardage totals measured in cm:

| First digit | Freq | PCT |
|---|---|---|
| 1 | 5780 | 0.293 |
| 2 | 2901 | 0.147 |
| 3 | 2193 | 0.111 |
| 4 | 1886 | 0.096 |
| 5 | 1602 | 0.081 |
| 6 | 1348 | 0.068 |
| 7 | 1211 | 0.061 |
| 8 | 1012 | 0.051 |
| 9 | 1802 | 0.091 |

Rudi also rushed for .0364 miles last week (that counts as a leading digit of 3, not zero). Here is the distribution of leading digits for the "rushing miles" totals of all games played in the NFL since 1995:

| First digit | Freq | PCT |
|---|---|---|
| 1 | 5537 | 0.281 |
| 2 | 3442 | 0.174 |
| 3 | 2778 | 0.141 |
| 4 | 1591 | 0.081 |
| 5 | 2664 | 0.135 |
| 6 | 1398 | 0.071 |
| 7 | 1053 | 0.053 |
| 8 | 534 | 0.027 |
| 9 | 738 | 0.037 |

So it really doesn't have much to do with the fact that 100 to 199 *yards* is a more common total than 200 to 299, or anything like that. If that were the cause of the distribution of leading digits, then the pattern would likely disappear if we measured in some other units.
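
To make the unit-switching concrete, here is a small sketch that converts a yardage total to feet, centimeters, and miles and pulls off the leading digit in each. The helper names are mine, and it uses exact decimal arithmetic so that a value like .0364 miles counts as a leading 3 rather than a 0. Whether the tables above rounded the converted totals is not something I know, so treat this as an illustration rather than a reconstruction.

```python
from decimal import Decimal

FEET_PER_YARD = Decimal(3)
CM_PER_YARD = Decimal("91.44")   # exact by definition of the yard
YARDS_PER_MILE = Decimal(1760)


def leading_digit(value: Decimal) -> int:
    """First significant digit of a positive number, e.g. 0.0364 -> 3."""
    if value <= 0:
        raise ValueError("leading digit is only defined here for positive values")
    # Decimal stores its coefficient without leading zeros.
    return value.as_tuple().digits[0]


def leading_digits_by_unit(yards: int) -> dict:
    """Leading digit of one rushing total expressed in four units."""
    y = Decimal(yards)
    return {
        "yards": leading_digit(y),
        "feet": leading_digit(y * FEET_PER_YARD),
        "cm": leading_digit(y * CM_PER_YARD),
        "miles": leading_digit(y / YARDS_PER_MILE),
    }


# 64 yards is Rudi Johnson's 192 feet; 109 yards is Tomlinson's 327 feet.
print(leading_digits_by_unit(64))
print(leading_digits_by_unit(109))
```

The same 64-yard game shows up with a leading 6, 1, 5, or 3 depending on the unit, which is why the per-digit counts shift around from table to table even though the overall skew survives.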

And that's actually the key to why Benford's Law works. For sets of data that have units, the distribution has to be (subject to a few caveats) one that is invariant to changes of units. It just so happens that the Benford distribution has that property.
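
Here is a quick simulation of that invariance, as a sketch under a stated assumption: the sample is synthetic, drawn so that its base-10 logarithm is uniform over four full decades, which makes its leading digits follow the Benford distribution by construction. It is not NFL data, and it says nothing about how close real rushing totals come to that ideal. The point is only that multiplying every value by a constant (3 for feet, 91.44 for cm, 1/1760 for miles) leaves the digit shares where they were.

```python
import math
import random
from collections import Counter


def leading_digit(x: float) -> int:
    """First significant digit of a positive float."""
    exponent = math.floor(math.log10(x))
    digit = int(x / 10.0 ** exponent)
    # Clamp to 1-9 in case floating-point rounding nudges a value
    # sitting right at a power of ten.
    return min(max(digit, 1), 9)


def digit_shares(values):
    """Share of values whose leading digit is 1, 2, ..., 9."""
    counts = Counter(leading_digit(v) for v in values)
    return [counts[d] / len(values) for d in range(1, 10)]


rng = random.Random(2006)  # fixed seed so the run is repeatable
sample = [10 ** rng.uniform(0, 4) for _ in range(100_000)]

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
scale_factors = {"yards": 1.0, "feet": 3.0, "cm": 91.44, "miles": 1 / 1760}
shares_by_unit = {
    unit: digit_shares([v * factor for v in sample])
    for unit, factor in scale_factors.items()
}

for unit, shares in shares_by_unit.items():
    worst = max(abs(s - b) for s, b in zip(shares, benford))
    print(f"{unit:6s} largest gap from Benford: {worst:.4f}")
```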

If you find this interesting, the previously-cited wikipedia writeup has more information. If you want something more hardcore, check out the Mathworld entry.

This entry was posted on Thursday, November 30th, 2006 at 5:05 am and is filed under Statgeekery. You can follow any responses to this entry through the RSS 2.0 feed. Both comments and pings are currently closed.

Very interesting stuff. Question--are your distributions based on all players' single game rushing totals? all running backs?

On a Benford-unrelated but statgeekery note, I have a question about WR TDs (and perhaps even other positions). Does the distribution of TDs follow a normal or a Poisson distribution, or some other distribution pattern?

For all WRs who score exactly 5 TDs in a season, do we have the expected number of single 3-TD games from that population as a whole, or are there more or fewer players with exactly 1 TD in 5 different games than we might otherwise expect?
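
One way to get a baseline for that question, sketched under assumptions that are mine and not the commenter's: a 16-game season, every game played, and touchdowns arriving independently at the same rate each game (the Poisson picture). Under that model, given exactly 5 TDs for the season, each TD is equally likely to have come in any of the 16 games, so the answer reduces to counting.

```python
from fractions import Fraction
from math import comb, perm

GAMES = 16  # assumed season length
TDS = 5     # season touchdown total from the question

# Chance that all five TDs came in five different games.
p_all_different = Fraction(perm(GAMES, TDS), GAMES ** TDS)

# Chance that a particular game holds 3 or more of the five TDs.
p_specific_game = sum(
    Fraction(comb(TDS, k) * (GAMES - 1) ** (TDS - k), GAMES ** TDS)
    for k in range(3, TDS + 1)
)
# With only five TDs, at most one game can hold 3+, so the events for
# different games cannot overlap and the probabilities simply add.
p_some_game_3plus = GAMES * p_specific_game

print(f"5 TDs in 5 different games: {float(p_all_different):.4f}")
print(f"some game with 3+ TDs:      {float(p_some_game_3plus):.4f}")
```

Under these assumptions, about half of the 5-TD seasons would be five separate 1-TD games, and roughly 3.5% would include a game with three or more. Real data departing clearly from those figures would suggest that TDs are clumpier (or more evenly spread) than the simple model allows.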

Fascinating.

You mentioned measuring feet instead of yards, and I'm wondering if you think the NFL should do that. Right now if it's third down with one inch to go and the running back gains two inches, it's called a yard. If the running back gains 36 inches, that's also called a yard. Would it make more sense for the official scorers to call the first one a gain of one foot and the second one a gain of three feet?

Actually, I'm more intrigued by the last-digit pattern. It seems skewed to the low end, which indicates to me that coaches let RBs attain milestones, going just over 100 yards, and then yank them.

All nonzero rushing totals for players at all positions from 1995 to present.

I'd bet Poisson, but that's a good question. It is officially on the to-blog-about list. Thanks for the idea.

If they had it to do over again from scratch, I'd be in favor of that. But I wouldn't advocate a change at this point. People (including me) are too set in their ways to change.

I doubt there's any real difference between 3rd-and-15 (feet) and 3rd-and-16 (feet), so it's not a pressing need except when you get under a couple of yards.

I guess it would be nice if they could somehow make the "and inches" concept jibe with the official stats. If a guy runs for 9.94 yards on first down and then the QB sneaks for .06 on second, it would be nice if they could be credited with 10 and 0 yards instead of 9 and 1.

Vince said:

"Actually, I’m more intrigued by the last-digit pattern. It seems skewed to the low end, which indicates to me that coaches let RBs attain milestones, going just over 100 yards, and then yank them. "

I think a more likely reason is that backup RBs getting limited carries skew the digit distribution to the low end.

This is baseball.

"OB" is times on base. Based on all non-pitchers in baseball history, using seasonal totals (Abreu gets one record for 2006). "H" is hits, based on all players in baseball history, using team-seasonal totals (Abreu gets two records for 2006).

| n | OB | H |
|---|---|---|
| 1 | 41% | 43% |
| 2 | 21% | 13% |
| 3 | 7% | 9% |
| 4 | 6% | 8% |
| 5 | 5% | 7% |
| 6 | 5% | 6% |
| 7 | 5% | 5% |
| 8 | 5% | 5% |
| 9 | 4% | 5% |

Here's the SQL for the Lahman database, if someone wants to have fun:

    SELECT IIf([H]>99,Int([H]/100),IIf([H]>9,Int([H]/10),[H])) AS firstDigit, Sum(1) AS n
    FROM Batting
    WHERE [H]<>0
    GROUP BY IIf([H]>99,Int([H]/100),IIf([H]>9,Int([H]/10),[H]));

Other databases would use CASE WHEN or DECODE, instead of IIF.

The WHERE clause should be H <> 0, that is, H followed by a less-than sign and then a greater-than sign ("not equal to").
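
For readers without Access, here is a sketch of the CASE WHEN version of that query, run through Python's built-in sqlite3 module. The two-column Batting table and its eight rows are a made-up stand-in so the snippet runs on its own; the real Lahman table has many more columns, and you would point the query at that instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Batting (playerID TEXT, H INTEGER)")

# Hypothetical rows, for illustration only (not real Lahman records).
rows = [("a", 0), ("b", 7), ("c", 18), ("d", 143),
        ("e", 204), ("f", 96), ("g", 1), ("h", 112)]
conn.executemany("INSERT INTO Batting VALUES (?, ?)", rows)

# Same logic as the IIf query: integer division strips trailing digits.
query = """
SELECT CASE WHEN H > 99 THEN H / 100
            WHEN H > 9  THEN H / 10
            ELSE H END AS firstDigit,
       COUNT(*) AS n
FROM Batting
WHERE H <> 0
GROUP BY firstDigit
ORDER BY firstDigit
"""
result = conn.execute(query).fetchall()

for first_digit, n in result:
    print(first_digit, n)
```

Like the original, this only handles hit totals below 1000, which is safe for single-season hits.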

I provide free software to develop a variety of Benford analysis statistics, such as first 1-3 digits, last 1-2 digits, second digit, etc. Includes charts. Also, conformity with the Benford distribution can be tested with the Kolmogorov-Smirnov approach, which uses a "d-statistic". There are quite a variety of applications for Benford's law, including detection of insurance fraud, tax fraud, "curb-stoning" in surveys, etc.

The software is available at http://www.ezrstats.com/Downloads.htm.
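
As a rough idea of what a d-statistic looks like, the sketch below computes the largest gap between the cumulative observed first-digit shares and the cumulative Benford shares, using the first-digit counts for rushing totals in yards from the post. This is the textbook Kolmogorov-Smirnov distance applied to digit counts; I am assuming, not claiming, that the software mentioned above does something along these lines, and no significance test is attempted here.

```python
import math
from itertools import accumulate

# First-digit counts (digits 1-9) for rushing totals in yards, from the post.
observed_counts = [5982, 3146, 2229, 1923, 1712, 1439, 1218, 1133, 953]
total = sum(observed_counts)

# Cumulative share of totals with leading digit <= d, for d = 1..9.
observed_cdf = [running / total for running in accumulate(observed_counts)]

# Cumulative Benford probability: the sum of log10(1 + 1/k) for k <= d
# telescopes to log10(d + 1).
benford_cdf = [math.log10(d + 1) for d in range(1, 10)]

d_statistic = max(abs(o - b) for o, b in zip(observed_cdf, benford_cdf))
print(f"d-statistic: {d_statistic:.4f}")
```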

Touchdowns do not follow a normal distribution. That distribution is continuous (fractional touchdowns don't have any meaning) and can take negative values (there are no such things as negative touchdowns).

If touchdowns followed some sort of well-known distribution, I would expect it to be either a "zero-inflated" Poisson or a "zero-inflated" negative binomial distribution.

If someone can give me a reference to a database with the information, I will be happy to give the board a more definitive answer.
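
For anyone unfamiliar with the term, a zero-inflated Poisson mixes an ordinary Poisson count with an extra lump of probability at zero (think of games where a receiver had no real chance to score). A minimal sketch of its probability function follows; the parameter values are hypothetical and not fitted to any touchdown data.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Ordinary Poisson probability of exactly k events at rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zero_inflated_poisson_pmf(k: int, lam: float, pi: float) -> float:
    """Probability of k events when a share pi of cases are forced zeros."""
    base = (1 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


# Hypothetical parameters, for illustration only.
lam, pi = 0.4, 0.3
for k in range(4):
    print(f"{k} TDs: plain {poisson_pmf(k, lam):.3f}  "
          f"zero-inflated {zero_inflated_poisson_pmf(k, lam, pi):.3f}")
```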

"So it really doesn’t have much to do with the fact that 100–199 yards is a more common total than 200–299, or anything like that."

It does, actually. It's 2x as hard to rush for 20 yards/feet/cm/miles as for 10 (or 200 as for 100). It's only about 1.1x as hard to rush for 90 units as for 80 units.