
Archive for the 'Statgeekery' Category

Rebuilding the Favorite Toy again

20th April 2006

A while back I posted a few entries (I, II, III) about estimating a player's chances at reaching a career milestone using a mathematical gadget called a Markov chain.

I figured that some baseball stathead had probably attempted something similar, so I did some googling to see if they had any luck. I did not find any Markov models, but what I did find was this interesting article at baseballthinkfactory. It was written by a guy named Jesse Frey and it's a neat idea. I'll run through the basic gist of it using --- guess who --- Clinton Portis and the rushing record as an example.

We start by collecting all 25-year-old running backs throughout NFL history (subject to some fine print). We then record how many yards they gained at age 23 and at age 24, and how many yards they gained in the rest of their careers. So we've got a list that looks something like this:

Player           Age23RshYD  Age24RshYD  RestOfCareer
Robert Smith            632         692          4989
Ricky Ervins            680         495           939
Terrell Davis          1117        1538          4952
Barry Foster            488        1690          1562
[... another hundred-or-so guys ...]

There is, of course, no exact formula that tells you the RestOfCareer rushing yards based on the age 23 and age 24 rushing yards, but using a technique called regression we can estimate the formula that works "best."

Given the above data, what we end up with is this:

Rest-of-career yards ~= -943 + 2.64*(age24yards) + 2.39*(age23yards)

Plugging Clinton Portis' 1516 age 24 yards and 1315 age 23 yards into that formula gives an estimate of 6202 yards for the remainder of his career.
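As a quick check of that arithmetic, here is a minimal Python sketch that just evaluates the fitted formula quoted above. The function name is made up for illustration; the coefficients and Portis' yardage come straight from the post.

```python
def rest_of_career_yards(age24_yards, age23_yards):
    """Evaluate the regression formula quoted in the post (coefficients as given)."""
    return -943 + 2.64 * age24_yards + 2.39 * age23_yards

# Clinton Portis: 1516 yards at age 24, 1315 yards at age 23.
PORTIS_ESTIMATE = rest_of_career_yards(1516, 1315)
print(round(PORTIS_ESTIMATE))  # 6202
```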

That tells us that we expect Portis to gain about 6202 more yards in the rest of his career. But of course we're not saying he'll end up with exactly that. What we're saying is that we don't know, but our best guess is that it'll be somewhere in the neighborhood of 6202. But how big is that neighborhood? Obviously there is some chance of him exceeding that by a thousand yards. There is some chance of him exceeding that by 5000 yards. How big are those chances? To answer these questions in a mathematically justifiable way is beyond the scope of this post, but we can get pretty close with the data and our intuition.

Of the 106 running backs that made up this data set, 20 of them (about 19%) doubled the rest-of-career rushing yards estimate provided by this formula. Doubling his expected rest-of-career rushing yards is almost exactly what Portis needs to do to break Emmitt Smith's record. So this calculation indicates that Portis has about a 19% chance of retiring as the rushing king. That's pretty close to the original Favorite Toy estimate and generally agrees with my gut feeling.
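The "how many doubled their estimate" count can be sketched in a few lines of Python. This uses only the four sample rows shown in the table earlier, not the full 106-player data set (which isn't reproduced in the post), so the share it prints is purely illustrative. The function and variable names are hypothetical.

```python
# (player, age-23 yards, age-24 yards, actual rest-of-career yards)
# Only the four rows shown in the post; the real data set had 106 backs.
SAMPLE = [
    ("Robert Smith", 632, 692, 4989),
    ("Ricky Ervins", 680, 495, 939),
    ("Terrell Davis", 1117, 1538, 4952),
    ("Barry Foster", 488, 1690, 1562),
]

def estimate(age23_yards, age24_yards):
    """Regression formula quoted in the post."""
    return -943 + 2.64 * age24_yards + 2.39 * age23_yards

def doubled_estimate(rows):
    """Names of players whose actual rest-of-career yards were at least twice the estimate."""
    return [name for name, y23, y24, actual in rows
            if actual >= 2 * estimate(y23, y24)]

DOUBLED = doubled_estimate(SAMPLE)
SHARE = len(DOUBLED) / len(SAMPLE)
print(DOUBLED, SHARE)
```

Run over the full data set, the same share would be the 20-of-106 figure quoted above.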

Neat, huh? If I get some time, I'll run this for some other players.

Posted in Statgeekery | 4 Comments »

Strength of schedule

10th April 2006

The NFL schedule was released last week. Like most people who are neither season ticket holders nor executives for FOX or CBS, I like the new flexible scheduling plan that will allow more interesting games to be shown on Sunday nights.

As has been noted elsewhere, the toughest schedules (based on last year's records) belong to the Giants and Bengals, whose 2006 opponents were a combined 139-117 in 2005. The Bears have the easiest slate; their opponents were 114-142 last year.

But as we all know, some teams that were bad in 2005 will be good in 2006 and vice versa. And some schedules that look easy right now will actually be tough and vice versa. The question is: to what extent, if any, do the Bears have an advantage over the Giants because of their schedules? Two games? One game? Half a game?

To investigate this, I went back to 1990 and recorded three bits of data about every team.

  1. their own record in Year N-1
  2. their preseason estimated strength of schedule. I.e. the combined Year N-1 records of the team's Year N opponents.
  3. their record in Year N

For the 2005 New York Jets, for example, I have

  1. .625 (their 2004 record was 10-6)
  2. .535 (the combined 2004 record of their 2005 opponents)
  3. .250 (their 2005 record ended up being 4-12)

I then labeled every team, based on their Year N-1 performance, as either Very Bad (less than 5 wins), Bad (5 or 6 wins), Mediocre (7 to 9 wins), Good (10 or 11 wins), or Very Good (12 or more wins). I also labeled each team's projected schedule as either Easy (combined opponents' record under .500) or Hard (over .500).
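The labeling and averaging step might look something like the following Python sketch. The function names are invented here, whole-number win totals are assumed, and treating a schedule of exactly .500 as Hard is my own assumption since the post doesn't say how that case was handled. The only record included is the 2005 Jets example from above.

```python
from collections import defaultdict

def label_team(prior_wins):
    """Bucket a team by its Year N-1 win total, using the post's cutoffs."""
    if prior_wins < 5:
        return "Very Bad"
    if prior_wins <= 6:
        return "Bad"
    if prior_wins <= 9:
        return "Mediocre"
    if prior_wins <= 11:
        return "Good"
    return "Very Good"

def label_schedule(opponent_pct):
    """Easy if opponents were under .500 last year; exactly .500 counts as Hard (assumption)."""
    return "Easy" if opponent_pct < 0.5 else "Hard"

def average_wins(records):
    """records: (Year N-1 wins, opponents' Year N-1 win pct, Year N wins) tuples."""
    groups = defaultdict(list)
    for prior_wins, opponent_pct, wins in records:
        groups[(label_team(prior_wins), label_schedule(opponent_pct))].append(wins)
    return {group: sum(wins) / len(wins) for group, wins in groups.items()}

JETS_2005 = (10, 0.535, 4)  # 10-6 in 2004, .535 projected schedule, 4-12 in 2005
AVERAGES = average_wins([JETS_2005])
print(AVERAGES)
```

Feeding in every team-season since 1990 would produce one average per (team label, schedule label) pair.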

Take a look at the Very Bad teams, for example. The Very Bad teams with a projected Easy schedule averaged 6.44 wins the next year. The Very Bad teams with a projected Hard schedule averaged 6.63 wins. The difference is not significant, and that's the point. Here is the complete breakdown:

Average Wins in Year N
                         Easy Sched   Hard Sched
Very Bad in Year N-1           6.44         6.63
Bad in Year N-1                7.67         7.26
Mediocre in Year N-1           7.82         8.27
Good in Year N-1               8.94         8.57
Very Good in Year N-1          8.78        10.06
TOTAL                          7.73         8.27

An eyeballing of this table indicates that the estimated schedule strength is essentially irrelevant and official statistical tests confirm that. [For example, a regression of Year N record on Year N-1 record and projected Year N schedule strength produces a not-even-close-to-significant coefficient for schedule strength.]

Note that I'm not saying that schedule strength isn't important. Some teams will have harder schedules than others in 2006 and it will make a difference. The point is that these strength-of-schedule estimates that are being thrown around right now seem to have no role at all in determining teams' 2006 records.

Posted in General, Statgeekery | 7 Comments »

Rebuilding the Favorite Toy II

4th April 2006

Let's start by keeping this as simple as possible. Clinton Portis last year was a 24-year-old running back with 1516 rushing yards. We'll ignore those last two digits and place him in the 1500--1600 category, which we'll abbreviate '15.' Clinton Portis was a 24-15 last year.

The next step is to sift through the historical data to find out what other 24-15s have done. What percentage turned into 25-17s? What percentage turned into 25-9s? And so on. As it turns out, 100% of all 24-15s --- yep, all one of them --- turned into 25-17s. We need to widen the net a bit, and that introduces the usual problems. As we widen it, we increase the sample (which is good), but we also introduce more runners who are not truly comparable to Portis (which is bad). There is no right answer. We just play around until we get something that appears to pass the eyeball test.

Here are all the runners aged 23--25 with between 1400 and 1699 rushing yards, along with how they did the next year:

Runner                 YR    YD  NextYrYd
Thurman Thomas       1991  1407      1487
O.J. Simpson         1972  1251      2003
Deuce McAllister     2003  1641      1074
Terrell Davis        1996  1538      1750
LaDainian Tomlinson  2003  1645      1335
Franco Harris        1975  1246      1128
Wilbert Montgomery   1979  1512       778
Walter Payton        1979  1610      1460
Earl Campbell        1979  1697      1934
Barry Foster         1992  1690       711
Gerald Riggs         1984  1486      1719
Mark van Eeghen      1977  1273      1080
Travis Henry         2002  1438      1356
Emmitt Smith         1994  1484      1773
Otis Armstrong       1974  1407       155
George Rogers        1981  1674       535
Earl Campbell        1978  1450      1697
Barry Sanders        1991  1548      1352
Curt Warner          1986  1481       985
Rudi Johnson         2004  1454      1458
Jerome Bettis        1997  1665      1185
Stephen Davis        1999  1405      1318
Emmitt Smith         1993  1486      1484
Jerome Bettis        1996  1431      1665
LaDainian Tomlinson  2002  1683      1645

Note that, e.g., Mark van Eeghen did not fall into the 1400--1699 yard range, but if you pro-rate his season to 16 team games he did.
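For the curious, the pro-rating is simple scaling. Here is a small Python sketch. The 14-game figure for van Eeghen's 1977 season is not stated in the post; it is supplied here on the basis that NFL teams played 14 regular-season games that year. The function name is made up.

```python
def prorate(yards, team_games, full_season=16):
    """Scale a season's yardage up to a 16-game schedule."""
    return yards * full_season / team_games

# Mark van Eeghen, 1977: 1273 yards in a 14-game season (game count is an assumption).
VAN_EEGHEN = prorate(1273, 14)
print(round(VAN_EEGHEN))  # 1455
```

That lands him inside the 1400--1699 window, which is why he shows up in the list.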

Which leads to the following probabilities for Portis next year:

Yardage      Probability (%)
   0--  99     0.0
 100-- 199     4.0
 200-- 299     0.0
 300-- 399     0.0
 400-- 499     0.0
 500-- 599     0.0
 600-- 699     0.0
 700-- 799     8.0
 800-- 899     0.0
 900-- 999     4.0
1000--1099    12.0
1100--1199     4.0
1200--1299     4.0
1300--1399    16.0
1400--1499    16.0
1500--1599     0.0
1600--1699    12.0
1700--2200    20.0
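Turning a list of comparable seasons into a table like this is just counting. The Python sketch below bins the raw NextYrYd column from the list of runners above into 100-yard buckets, with everything from 1700 up lumped together. Note the hedge: it uses the raw yardage exactly as listed, whereas the table in the post may reflect pro-rated totals for shorter seasons, so a few of the lower bins will not match it exactly. The function name and bucket handling are my own.

```python
# NextYrYd column from the list of comparable runners, in the order shown.
NEXT_YEAR_YARDS = [
    1487, 2003, 1074, 1750, 1335, 1128, 778, 1460, 1934, 711, 1719, 1080, 1356,
    1773, 155, 535, 1697, 1352, 985, 1458, 1185, 1318, 1484, 1665, 1645,
]

def bin_percentages(yards_list, width=100, top=1700):
    """Percent of seasons in each bucket, keyed by the bucket's lower bound."""
    counts = {}
    for yards in yards_list:
        lower = min(yards // width * width, top)  # lump 1700+ into one bucket
        counts[lower] = counts.get(lower, 0) + 1
    return {lower: 100 * n / len(yards_list) for lower, n in sorted(counts.items())}

PCT = bin_percentages(NEXT_YEAR_YARDS)
for lower, pct in PCT.items():
    print(lower, pct)
```

With 25 comps, each runner is worth 4 percentage points, which is why every entry in the table is a multiple of 4.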

Although it looks choppier than it ought to, this has the right general feel. It sets the over-under for Portis' rushing yards next year at about 1400. It gives him a respectable chance of breaking out for a huge year, a slim chance of a catastrophic injury, and also a chance of a minor injury or a major decline.

So we roll a die to determine how many yards Portis will have next year. Based on what he gets, we estimate his probabilities for the following year using the same technique, roll another die, and so on.
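One "roll of the die" can be sketched with Python's standard `random` module. The weights below are the nonzero rows of the probability table above; the function name and the seed are arbitrary choices for illustration. This only covers a single year's roll; a full career simulation would need a separate table for every age and yardage bucket, and the post doesn't list those.

```python
import random

# Nonzero rows of the table above: bucket lower bound -> probability in percent.
NEXT_YEAR_PCT = {
    100: 4.0, 700: 8.0, 900: 4.0, 1000: 12.0, 1100: 4.0,
    1200: 4.0, 1300: 16.0, 1400: 16.0, 1600: 12.0, 1700: 20.0,
}

def roll(distribution, rng):
    """Pick one yardage bucket at random, weighted by its probability."""
    buckets = list(distribution)
    weights = [distribution[b] for b in buckets]
    return rng.choices(buckets, weights=weights, k=1)[0]

rng = random.Random(2006)  # fixed seed so the sketch is repeatable
ROLLS = [roll(NEXT_YEAR_PCT, rng) for _ in range(10000)]
SHARE_1700_PLUS = ROLLS.count(1700) / len(ROLLS)
print(SHARE_1700_PLUS)  # should land near 0.20
```

Repeating the roll year after year, each time with the table for the state just landed in, and then tallying career totals over many simulated careers gives the kind of career-yardage chances shown next.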

OK, here we go. According to this method, here is the probability of Portis reaching various career yardage levels.

Yardage   PctChance
18000+      0.3%
17000+      0.6%
16000+      1.4%
15000+      2.9%
14000+      5.8%
13000+     10.8%
12000+     19.7%
11000+     32.3%
10000+     48.2%
 9000+     66.0%
 8000+     81.7%
 7000+     94.1%

The original Favorite Toy said Portis was about a 3-to-1 shot to break Smith's record. This one says he's a 300-to-1 shot. This method depends heavily on real historical data. Records, by their very nature, are historically very rare accomplishments. So we shouldn't be too surprised to see that this method thinks Portis is a longshot. But I am surprised at just how much of a longshot it thinks he is.

The problem is that we're only using one year's worth of data to estimate the following year's production. Go back to that second table at the top of the page. It says that Portis has a 4% chance of gaining between 100 and 200 yards this year. I don't think that's unreasonable. What is unreasonable is projecting the rest of Portis' career under the assumption that he is morally a 100-yard-per-year running back. If Portis gains only 150 yards in 2006, it will be because he got hurt. But the mathematical model doesn't know that. It thinks Portis is just another Heath Evans or Shaud Williams who will be out of football shortly.

So when the simulated Portis suffers a major injury, he has almost no chance of coming back. The model needs more information. [Markov chain fans will note at this point that we're up against the "memoryless" assumption of Markov chains that I glossed over in my last post.] I can think of two ways to provide this information:

  • We could take into account more than one year's worth of statistics when determining the historical probabilities. In other words, instead of calling a hypothetically injured Portis a 25-year-old back who gained 150 yards last year, we could call him a 25-year-old back who gained 150 yards last year and 1500 the year before. This would certainly allow the model to distinguish between a hypothetical injured Clinton Portis and a healthy Shaud Williams. But it drastically cuts down the pool of available comps.
  • We could measure everything in terms of yards per game instead of raw yards, and then independently assess the probability of injuries at each age. Under this scheme, our hypothetically injured Portis would be called a 25-year-old back who averaged 78 yards per game and played two games. It's just a guess at this point but I think this plan, while not without its problems, might actually yield some reasonable probabilities.

Either way, it's going to take more programming, which takes more time, which I don't think I have right now. For now we will have to file this under "crazy ideas that don't work and may or may not be salvageable." I will throw it on the to-do list and hope to attempt to salvage it sometime.

Posted in Statgeekery | 1 Comment »

Rebuilding the Favorite Toy with Markov chains

3rd April 2006

As promised in this post, I'm going to create a more sophisticated way to estimate players' chances of reaching records or milestones. I'll spend this post describing the mathematics behind my method.

A Markov chain is what I'll be using and you can think of a Markov chain in terms of a random walk. Imagine you are hiking on the following collection of trails, starting at location 1:

[Trail diagram: location 1 at the top forks down to locations 2, 3, and 4, and the trails from those fork again down to the ending points 5 through 9.]

Assume that, whenever you come to a fork in the road, you are equally likely to take any of the options available to you. So for example, from location 1, you have a 1/3 probability of going to 2, a 1/3 probability of going to 3, and a 1/3 probability of going to 4. If you happen to choose 3, then you now have a 1/2 probability of proceeding to 6 and a 1/2 probability of landing at 7.

There are five possible ending points for your hike, locations 5 through 9. For a given ending point, you can compute your probability of ending up there by (1) multiplying the probabilities of the choices you had to make to end up there for each path and (2) adding the product for each path. For example, your probability of landing at location 7 via location 3 is 1/3 * 1/2 = 1/6. The probability of landing at location 7 via location 4 is 1/3 * 1/3 = 1/9. So the probability of landing at location 7 would be 1/6 + 1/9, which is 5/18 or roughly 28%.
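The multiply-then-add recipe is easy to sketch in Python with exact fractions. The text pins down only part of the trail map: 1 leads to 2, 3, and 4; 3 leads to 6 and 7; and 4 has three trails, one of which goes to 7. The rest of the map below (2 leading to 5 and 6, and 4's other trails leading to 8 and 9) is my assumption to make the example complete. It doesn't affect the answer for location 7.

```python
from fractions import Fraction

# Trail map: location -> locations reachable from it.
# 2 -> (5, 6) and 4 -> (..., 8, 9) are assumptions; the rest is from the text.
TRAILS = {1: [2, 3, 4], 2: [5, 6], 3: [6, 7], 4: [7, 8, 9]}

def ending_probabilities(trails, start):
    """Probability of finishing at each ending point, choosing forks uniformly."""
    probs = {}

    def walk(node, p):
        options = trails.get(node)
        if not options:  # no outgoing trails: this is an ending point
            probs[node] = probs.get(node, Fraction(0)) + p  # add across paths
            return
        for nxt in options:
            walk(nxt, p / len(options))  # multiply along a path

    walk(start, Fraction(1))
    return probs

ENDING = ending_probabilities(TRAILS, 1)
print(ENDING[7])  # 5/18
```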

Now, there is no reason why you, the hiker at a given fork in the road, have to choose all roads with equal probability. From location 1, you might go to 2, 3, and 4 with probabilities .2, .7, and .1 respectively. This would, of course, lead to different probabilities of landing at each of the ending points. But as long as we know the probabilities of choosing each road (and as long as one mathematical technicality --- that we'll deal with later --- is satisfied), we can multiply and add to compute the probability of landing at each of the ending points.

You can use a Markov chain to model any system that moves from state to state with known probabilities. For example...

  • A plinko board - Plinko is an old Price Is Right game where the contestant drops a game piece into a sequence of pegs. The piece randomly bounces down through the pegs and finally lands somewhere. For a javascript demo, go here and click the plinko link. Anyway, this can be modeled as a random walk. At each level the plinko chip has some probability of going right and some probability of going left. As it works its way down the board, it makes several of these "choices" and eventually lands at one of its ending points.
  • A game of tennis - every game starts at 0-0. From there it can go to either 15-0 or 0-15, with probabilities depending on the abilities of the competitors. From 15-0, it can go to 15-15 or it can go to 30-0, and so on. Eventually it will land on either "server wins" or "returner wins."
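To make the tennis example concrete, here is a Python sketch that walks the score states of a single game. It assumes the server wins each point independently with the same probability p, which is a simplification of my own and not something the post specifies. Points are counted 0, 1, 2, 3 for 0, 15, 30, 40, and deuce is handled with the standard closed form (win two points in a row before losing two in a row).

```python
from functools import lru_cache

def server_wins_game(p):
    """Probability the server wins a game, given probability p of winning each point."""
    q = 1.0 - p
    # From deuce, the server must win two straight points before losing two straight.
    deuce = p * p / (p * p + q * q)

    @lru_cache(maxsize=None)
    def win(server_pts, returner_pts):
        if server_pts == 3 and returner_pts == 3:
            return deuce
        if server_pts == 4:
            return 1.0  # landed on "server wins"
        if returner_pts == 4:
            return 0.0  # landed on "returner wins"
        return (p * win(server_pts + 1, returner_pts)
                + q * win(server_pts, returner_pts + 1))

    return win(0, 0)

print(server_wins_game(0.5))
print(server_wins_game(0.6))
```

A modest edge on each point compounds into a much bigger edge over the whole game, which is the kind of thing the chain makes easy to see.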

If this were a math class, what we'd do now is talk about how to put all the various probabilities of moving from one state to another into a big matrix. Then we'd learn how to massage that matrix into various other matrices that tell us what we want to know: what is the probability of ending up in each state? How long will it take (on average) before we land there? How much time can we expect to spend in each state?
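For a taste of the matrix version, here is a plain-Python sketch for the hiking trails. It builds the transition matrix and pushes a starting distribution through it. As before, only part of the trail map is given in the text; 2 leading to 5 and 6, and 4 leading to 7, 8, and 9, are assumptions of mine. Ending points are modeled as looping back to themselves, which is the usual way to represent "you stay put once you arrive."

```python
from fractions import Fraction

# Trail map (partly assumed, see above) and the full list of states.
TRAILS = {1: [2, 3, 4], 2: [5, 6], 3: [6, 7], 4: [7, 8, 9]}
STATES = list(range(1, 10))

def transition_matrix(trails, states):
    """matrix[i][j] = probability of moving from states[i] to states[j] in one step."""
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    matrix = [[Fraction(0)] * n for _ in range(n)]
    for s in states:
        options = trails.get(s, [s])  # ending points loop back to themselves
        for nxt in options:
            matrix[index[s]][index[nxt]] += Fraction(1, len(options))
    return matrix

def step(dist, matrix):
    """One step of the chain: row vector times matrix."""
    n = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]

P = transition_matrix(TRAILS, STATES)
dist = [Fraction(1)] + [Fraction(0)] * 8  # start at location 1 with certainty
for _ in range(2):  # two forks and the hike is over
    dist = step(dist, P)
FINAL = dict(zip(STATES, dist))
print(FINAL[7])  # 5/18
```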

Now let's talk about Clinton Portis. In 2005 he was a 24-year-old running back with 1516 rushing yards. Next year, he might be a 25-year-old running back with 1900 rushing yards. Or he might be a 25-year-old running back with 800 rushing yards. Or he might be a 25-year-old running back with 1600 yards. Any of those numbers, as well as several others, is possible and there is a certain probability of each. Using historical data from other 24-year-old running backs with around 1500 yards might provide us with a decent starting point for those probabilities.

If Portis gets, say, 1300 rushing yards in 2006 at age 25, then based on that we can estimate his chances of getting any given number of yards in 2007 at age 26. And so on. In other words, we can view Portis as a hiker on a system of trails not unlike the one pictured above. Or we can view him as a Plinko chip and the different pegs that he hits correspond to different yardage totals.

Eventually, Portis will retire, thus landing at some ending point in his random walk, and we can compute the probability of his landing at each of them. At some of those ending points, he will be the rushing champ. At others he won't. If we add up the probabilities of all the ending points where he's the champ, we've got our estimate of Portis' chances of breaking the record.

I claimed that this was a more mathematically sophisticated way of estimating Portis' chances of breaking Smith's rushing record, but that was a lie. Yes, there is some very serious mathematics involved in working with Markov chains. But for our purposes, we don't need any of the heavy stuff. What we're doing is no different from simulating Portis' career a gazillion times and observing how frequently he breaks the record. It can be put into a Markov chain context, but that's really not necessary. In other words, I've just tricked you into learning some math. Just so it's not a total loss, I'll point out that there is a football-related application that does rely on some mathematically deep results about Markov chains. Namely, some of college football's computer ranking systems have Markov chains at their core. I'll blog on that sometime.

The next post will describe my process for converting historical data into probabilities, and then I'll get to the results.

Posted in Statgeekery | Comments Off on Rebuilding the Favorite Toy with Markov chains

Milestones and the favorite toy

26th March 2006

In the 1980s, legendary baseball author Bill James developed a quick-and-dirty method of estimating a player's chance of eclipsing a particular milestone. I'll describe it while working through LaDainian Tomlinson's chance of breaking Emmitt Smith's rushing record:

  1. Compute the "need yards." Tomlinson has 7361 yards and needs 18355 to catch Emmitt, so his need yards is 10994

  2. Compute the years remaining. James' formula for this was 24 - .6(age). Tomlinson is 27, so this would give him 7.8 remaining seasons. Clearly this part of the formula needs a tweak; running backs don't stick around as long as left fielders do. We'll investigate this further at some point, but as a first guess, let's change the .6 to a .7, which gives Tomlinson 5.1 more seasons.

  3. Compute the established yardage level. James used the usual three-year weighted average: three times last year's yards, plus twice the yards from the year before, plus the yards from the year before that, all divided by 6. For Tomlinson, that estimate would be 1450 yards, which seems reasonable.

  4. Compute the projected remaining yards. 5.1 times 1450 = 7395

  5. The probability of reaching the milestone is estimated at
    (ProjectedRemainingYards / NeedYards) - .5.
    For Tomlinson this is about 17%.
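The five steps above can be sketched as a short Python function. The function and parameter names are invented for illustration. The .7 age factor is the post's tweak, not James' original .6. Tomlinson's three most recent seasons are entered as 1462, 1335, and 1645 yards; the last two appear elsewhere in this archive, while 1462 for 2005 is my inference, consistent with the 1450 established level quoted above. Clamping the result to between 0 and 1 is my own addition so the function never returns a nonsensical probability.

```python
def favorite_toy(need, age, last_three, age_factor=0.7):
    """Favorite Toy estimate as described in the post.

    need       -- yards still required to reach the milestone
    age        -- player's age
    last_three -- yards in the last three seasons, most recent first
    """
    years_left = 24 - age_factor * age
    recent, prior, oldest = last_three
    established = (3 * recent + 2 * prior + oldest) / 6
    projected = years_left * established
    chance = projected / need - 0.5
    return max(0.0, min(1.0, chance))  # clamp to [0, 1] (my addition)

TOMLINSON = favorite_toy(need=18355 - 7361, age=27, last_three=(1462, 1335, 1645))
print(round(100 * TOMLINSON, 1))  # 17.3
```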

Does that feel right? Would you take five-to-one odds on Tomlinson breaking Emmitt's record? Would you bet against it at one-to-five? Bill James called this method The Favorite Toy, which conveys both that it is fun to play around with and that it shouldn't be taken too seriously.

In subsequent posts I'll investigate some more mathematically elaborate --- but not necessarily more accurate --- methods of estimating these sorts of things. For now I'll leave you with the short list of runners who, according to The Favorite Toy, have a shot at Emmitt Smith's rushing record.

Runner               Pct Chance
Clinton Portis             26.5
Edgerrin James             21.3
LaDainian Tomlinson        17.3
Shaun Alexander            11.3

If these estimates are to be believed, there is about a 57% chance that one of these four guys will break Emmitt's record.
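The combined figure appears to come from treating the four runners' chances as independent: multiply together the chances that each one misses, and subtract from 1. That reading is my assumption, since the post doesn't spell out the calculation. Here is a Python sketch with the four estimates from the table; the function name is made up.

```python
# Favorite Toy estimates from the table above, as probabilities.
CHANCES = {
    "Clinton Portis": 0.265,
    "Edgerrin James": 0.213,
    "LaDainian Tomlinson": 0.173,
    "Shaun Alexander": 0.113,
}

def chance_at_least_one(probabilities):
    """1 minus the chance that everyone misses, assuming independence."""
    all_miss = 1.0
    for p in probabilities:
        all_miss *= 1.0 - p
    return 1.0 - all_miss

COMBINED = chance_at_least_one(CHANCES.values())
print(round(COMBINED, 3))  # about 0.576
```

Simply adding the four percentages would overstate things, since that double-counts the scenarios in which more than one of them gets there.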

Posted in Statgeekery | 14 Comments »