Posted by Doug on April 4, 2006
Let's start by keeping this as simple as possible. Clinton Portis last year was a 24-year-old running back with 1516 rushing yards. We'll ignore those last two digits and place him in the 1500--1599 category, which we'll abbreviate '15.' Clinton Portis was a 24-15 last year.
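As a minimal sketch of that labeling (the function names here are mine, not part of any existing code), dropping the last two digits is just integer division by 100:

```python
def yardage_bucket(yards: int) -> int:
    """Drop the last two digits: 1516 -> 15 (the 1500--1599 category)."""
    return yards // 100


def state(age: int, yards: int) -> str:
    """Combine age and yardage bucket into a label like '24-15'."""
    return f"{age}-{yardage_bucket(yards)}"


print(state(24, 1516))  # Portis last year
```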
The next step is to sift through the historical data to find out what other 24-15s have done. What percentage turned into 25-17s? What percentage turned into 25-9s? And so on. As it turns out, 100% of all 24-15s --- yep, all one of them --- turned into 25-17s. We need to widen the net a bit, and that introduces the usual problems. As we widen it, we increase the sample (which is good), but we also introduce more runners who are not truly comparable to Portis (which is bad). There is no right answer. We just play around until we get something that appears to pass the eyeball test.
Here are all the runners aged 23--25 with 1400--1699 rushing yards, along with how they did the next year:
Runner YR YD NextYrYd
Thurman Thomas 1991 1407 1487
O.J. Simpson 1972 1251 2003
Deuce McAllister 2003 1641 1074
Terrell Davis 1996 1538 1750
LaDainian Tomlinson 2003 1645 1335
Franco Harris 1975 1246 1128
Wilbert Montgomery 1979 1512 778
Walter Payton 1979 1610 1460
Earl Campbell 1979 1697 1934
Barry Foster 1992 1690 711
Gerald Riggs 1984 1486 1719
Mark van Eeghen 1977 1273 1080
Travis Henry 2002 1438 1356
Emmitt Smith 1994 1484 1773
Otis Armstrong 1974 1407 155
George Rogers 1981 1674 535
Earl Campbell 1978 1450 1697
Barry Sanders 1991 1548 1352
Curt Warner 1986 1481 985
Rudi Johnson 2004 1454 1458
Jerome Bettis 1997 1665 1185
Stephen Davis 1999 1405 1318
Emmitt Smith 1993 1486 1484
Jerome Bettis 1996 1431 1665
LaDainian Tomlinson 2002 1683 1645
Note that some runners, Mark van Eeghen for example, did not fall into the 1400--1699 yard range on raw yardage, but they do once their seasons are pro-rated to 16 team games.
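Here is a rough sketch of that filter. The function names and the default ranges are my own framing of the rule described above; the 14-game figure is the length of the 1977 schedule, and the age passed in the example is just a placeholder somewhere in the 23--25 window.

```python
FULL_SEASON_GAMES = 16


def prorate(yards, team_games):
    """Scale a season's yardage to a 16-game team schedule."""
    return yards * FULL_SEASON_GAMES / team_games


def is_comp(age, yards, team_games,
            min_age=23, max_age=25, min_yards=1400, max_yards=1699):
    """True if the pro-rated season falls inside the comparison window."""
    adjusted = prorate(yards, team_games)
    in_age_range = min_age <= age <= max_age
    in_yard_range = min_yards <= adjusted < max_yards + 1
    return in_age_range and in_yard_range


# van Eeghen's 1273 yards came in a 14-game season.
print(round(prorate(1273, 14)))
print(is_comp(25, 1273, 14))  # age 25 is a placeholder within the window
```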
That leads to the following probabilities (in percent) for Portis next year:
0-- 99 0.0
100-- 199 4.0
200-- 299 0.0
300-- 399 0.0
400-- 499 0.0
500-- 599 0.0
600-- 699 0.0
700-- 799 8.0
800-- 899 0.0
900-- 999 4.0
Although it looks choppier than it ought to, this has the right general feel. It sets the over-under for Portis' rushing yards next year at about 1400. It gives him a respectable chance of breaking out for a huge year, a slim chance of a catastrophic injury, and also a chance of a minor injury or a major decline.
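For anyone who wants to play along, here is a sketch of how a table like that can be built from the list of comps. It uses the NextYrYd column exactly as printed, with no pro-rating or other adjustment, so a row or two may not match the figures in the post.

```python
from collections import Counter

# NextYrYd column from the table of comps, as printed (no pro-rating).
NEXT_YEAR_YARDS = [
    1487, 2003, 1074, 1750, 1335, 1128, 778, 1460, 1934, 711,
    1719, 1080, 1356, 1773, 155, 535, 1697, 1352, 985, 1458,
    1185, 1318, 1484, 1665, 1645,
]


def bucket_percentages(yards_list, width=100):
    """Percentage of comps landing in each yardage bucket of the given width."""
    counts = Counter(y // width for y in yards_list)
    total = len(yards_list)
    return {b: 100 * c / total for b, c in sorted(counts.items())}


dist = bucket_percentages(NEXT_YEAR_YARDS)
for bucket, pct in dist.items():
    low = bucket * 100
    print(f"{low:>4}--{low + 99:>4} {pct:.1f}")
```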
So we roll a die to determine how many yards Portis will have next year. Based on what he gets, we estimate his probabilities for the following year using the same technique, roll another die, and so on.
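The die-rolling loop might look something like the sketch below. To be clear about what is made up: `toy_comps` is a stand-in for the real historical lookup and its outcomes are invented, the starting career total of 6000 is a round placeholder rather than Portis' actual figure, and the cutoff age is arbitrary. Only the shape of the loop is meant to reflect the method.

```python
import random


def simulate_career(age, last_yards, career_yards, comps_for, rng, max_age=40):
    """One simulated career: each season is drawn from the comps' next-year yards."""
    while age < max_age:
        outcomes = comps_for(age, last_yards)
        if not outcomes:  # no comparable players: treat the career as over
            break
        last_yards = rng.choice(outcomes)  # "roll a die"
        career_yards += last_yards
        age += 1
    return career_yards


def prob_reaching(target, trials, seed, **start):
    """Fraction of simulated careers that reach the target career yardage."""
    rng = random.Random(seed)
    hits = sum(simulate_career(rng=rng, **start) >= target for _ in range(trials))
    return hits / trials


def toy_comps(age, last_yards):
    """Invented stand-in for the historical lookup (NOT real data)."""
    if age >= 32 or last_yards < 300:
        return []
    return [last_yards + 200, last_yards, last_yards - 300, 150]


start = dict(age=24, last_yards=1516, career_yards=6000, comps_for=toy_comps)
p_low = prob_reaching(10000, trials=2000, seed=1, **start)
p_high = prob_reaching(15000, trials=2000, seed=1, **start)
print(p_low, p_high)
```

Note that each roll looks only at the current age and last season's yards, which is the whole method in a nutshell.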
OK, here we go. According to this method, here is the probability of Portis reaching various career yardage levels.
The original Favorite Toy said Portis was about a 3-to-1 shot to break Smith's record. This one says he's a 300-to-1 shot. This method depends heavily on real historical data. Records, by their very nature, are historically very rare accomplishments. So we shouldn't be too surprised to see that this method thinks Portis is a longshot. But I am surprised at just how much of a longshot it thinks he is.
The problem is that we're only using one year's worth of data to estimate the following year's production. Go back to that second table at the top of the page. It says that Portis has a 4% chance of gaining between 100 and 200 yards this year. I don't think that's unreasonable. What is unreasonable is projecting the rest of Portis' career under the assumption that he is morally a 100-yard-per-year running back. If Portis gains only 150 yards in 2006, it will be because he got hurt. But the mathematical model doesn't know that. It thinks Portis is just another Heath Evans or Shaud Williams who will be out of football shortly.
So when the simulated Portis suffers a major injury, he has almost no chance of coming back. The model needs more information. [Markov chain fans will note at this point that we're up against the "memoryless" assumption of Markov chains that I glossed over in my last post.] I can think of two ways to provide this information:
- We could take into account more than one year's worth of statistics when determining the historical probabilities. In other words, instead of calling a hypothetically injured Portis a 25-year-old back who gained 150 yards last year, we could call him a 25-year-old back who gained 150 yards last year and 1500 the year before. This would certainly allow the model to distinguish between a hypothetical injured Clinton Portis and a healthy Shaud Williams. But it drastically cuts down the pool of available comps.
- We could measure everything in terms of yards per game instead of raw yards, and then independently assess the probability of injuries at each age. Under this scheme, our hypothetically injured Portis would be called a 25-year-old back who averaged 78 yards per game and played two games. It's just a guess at this point, but I think this plan, while not without its problems, might actually yield some reasonable probabilities.
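To make the two options concrete, here is a sketch of what the lookup key might become under each one. Neither is implemented anywhere; the function names and the 10-yards-per-game bucket width are guesses on my part.

```python
def two_year_state(age, last_yards, prev_yards):
    """Option 1: key on the last two seasons' yardage buckets."""
    return (age, last_yards // 100, prev_yards // 100)


def per_game_state(age, yards, games):
    """Option 2: key on yards per game, in 10-yard buckets (assumed width).

    Games played would be modeled separately, as an age-based injury risk.
    Requires games > 0.
    """
    return (age, int(yards / games) // 10)


# Option 1: the injured back (150 after 1500) gets a different key
# than a back who gained 150 yards two years running.
print(two_year_state(25, 150, 1500), two_year_state(25, 150, 150))

# Option 2: 78 yards per game over two games (156 yards) gets the same key
# as 78 yards per game over a full 16 (1248 yards).
print(per_game_state(25, 156, 2), per_game_state(25, 1248, 16))
```

Either key carries the information the one-year version throws away. The cost of the first is a much smaller pool of comps per key; the cost of the second is a separate injury model that has to be estimated.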
Either way, it's going to take more programming, which takes more time, which I don't think I have right now. For now we will have to file this under "crazy ideas that don't work and may or may not be salvageable." I will throw it on the to-do list and hope to attempt to salvage it sometime.