As promised in this post, I'm going to create a more sophisticated way to estimate players' chances of reaching records or milestones. I'll spend this post describing the mathematics behind my method.
I'll be using a Markov chain, which you can think of in terms of a random walk. Imagine you are hiking on the following collection of trails, starting at location 1:
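[Trail map: location 1 at the top forks to locations 2, 3, and 4. From there, 2 forks to 5 and 6, 3 forks to 6 and 7, and 4 forks to 7, 8, and 9. Locations 5 through 9 are the ending points.]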
Assume that, whenever you come to a fork in the road, you are equally likely to take any of the options available to you. So for example, from location 1, you have a 1/3 probability of going to 2, a 1/3 probability of going to 3, and a 1/3 probability of going to 4. If you happen to choose 3, then you now have a 1/2 probability of proceeding to 6 and a 1/2 probability of landing at 7.
There are five possible ending points for your hike, locations 5 through 9. For a given ending point, you can compute your probability of ending up there by (1) taking each path that leads there and multiplying the probabilities of the choices you had to make along it, and then (2) adding up those products over all such paths. For example, your probability of landing at location 7 via location 3 is 1/3 * 1/2 = 1/6. The probability of landing at location 7 via location 4 is 1/3 * 1/3 = 1/9. So the probability of landing at location 7 is 1/6 + 1/9, which is 5/18 or roughly 28%.
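Here is a minimal Python sketch of that multiply-and-add rule, in case it helps to see it spelled out. The trails out of 3 and 4 come from the description above; the trails out of 2 (to 5 and 6) are my reading of the map, so treat them as an assumption. The names `TRAILS` and `ending_probabilities` are made up for this illustration.

```python
from fractions import Fraction

# Trail map as adjacency lists. The forks at 3 and 4 are described in the text;
# the fork at 2 (to 5 and 6) is an assumption read off the map.
TRAILS = {
    1: [2, 3, 4],
    2: [5, 6],
    3: [6, 7],
    4: [7, 8, 9],
}

def ending_probabilities(trails, start):
    """Return {ending point: probability}, choosing uniformly at every fork."""
    result = {}

    def walk(location, prob):
        options = trails.get(location)
        if not options:
            # No trails lead out of here, so the hike ends.
            # Add this path's product to the running total for this ending point.
            result[location] = result.get(location, Fraction(0)) + prob
            return
        for nxt in options:
            # Multiply by the probability of this choice and keep walking.
            walk(nxt, prob / len(options))

    walk(start, Fraction(1))
    return result

probs = ending_probabilities(TRAILS, 1)
for spot in sorted(probs):
    print(spot, probs[spot])
```

Using `Fraction` keeps the answers exact, so location 7 comes out as 5/18 rather than a rounded decimal.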
Now, there is no reason why you, the hiker at a given fork in the road, have to choose all roads with equal probability. From location 1, you might go to 2, 3, and 4 with probabilities .2, .7, and .1 respectively. This would, of course, lead to different probabilities of landing at each of the ending points. But as long as we know the probabilities of choosing each road (and as long as one mathematical technicality, which we'll deal with later, is satisfied), we can multiply and add to compute the probability of landing at each of the ending points.
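As a quick worked example, here is the location 7 calculation redone with those new probabilities at location 1. It assumes the forks at 3 and 4 are still chosen with equal probability, which the text above does not specify.

```latex
P(\text{land at } 7)
  = \underbrace{0.7 \cdot \tfrac{1}{2}}_{\text{via } 3}
  + \underbrace{0.1 \cdot \tfrac{1}{3}}_{\text{via } 4}
  = \tfrac{21}{60} + \tfrac{2}{60}
  = \tfrac{23}{60} \approx 0.38
```

So under that assumption the chance of ending at 7 rises from roughly 28% to roughly 38%, simply because location 3 is now the favored first step.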
You can use a Markov chain to model any system that moves from state to state with known probabilities. For example...
- A game of tennis - every game starts at 0-0. From there it can go to either 15-0 or 0-15, with probabilities depending on the abilities of the competitors. From 15-0, it can go to 15-15 or it can go to 30-0, and so on. Eventually it will land on either "server wins" or "returner wins."
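The tennis example can be sketched in a few lines of Python. This is only an illustration under a strong simplifying assumption that is mine, not part of the example above: the server wins every point independently with the same probability `p`. The function name `server_wins_game` is hypothetical.

```python
from functools import lru_cache

def server_wins_game(p):
    """Probability the server wins a game, if the server wins each point
    independently with probability p (a simplifying assumption)."""
    q = 1.0 - p

    @lru_cache(maxsize=None)
    def win(s, r):
        # s and r count points won by server and returner:
        # 0, 1, 2, 3 stand for love, 15, 30, 40.
        if s == 3 and r == 3:
            # Deuce can repeat indefinitely, so use the closed form:
            # the server must win two points in a row before the returner does.
            return p * p / (p * p + q * q)
        if s == 4:
            return 1.0  # ending state: server wins
        if r == 4:
            return 0.0  # ending state: returner wins
        # Multiply by the chance of each next point and add the two branches.
        return p * win(s + 1, r) + q * win(s, r + 1)

    return win(0, 0)

print(server_wins_game(0.6))
```

Deuce is the one place where the score can loop back on itself, which is why it gets a formula instead of more branching.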
If this were a math class, what we'd do now is talk about how to put all the various probabilities of moving from one state to another into a big matrix. Then we'd learn how to massage that matrix into various other matrices that tell us what we want to know: what is the probability of ending up in each state? How long will it take (on average) before we land there? How much time can we expect to spend in each state?
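For the curious, here is roughly what that matrix massaging looks like in standard textbook notation. None of these symbols appear elsewhere in this post; they are the usual ones for a chain whose ending states, once reached, are never left, and where every other state can eventually reach an ending state.

```latex
% Order the states so the non-ending (transient) ones come first.
% Q: probabilities of moving between non-ending states
% R: probabilities of moving from a non-ending state to an ending state
P = \begin{pmatrix} Q & R \\ 0 & I \end{pmatrix}, \qquad
N = (I - Q)^{-1}, \qquad
B = N R, \qquad
t = N \mathbf{1}
```

Entry (i, j) of N is the expected number of visits to non-ending state j when starting from state i, entry (i, j) of B is the probability of finishing in ending state j, and entry i of t is the expected number of steps before the walk ends. Those three objects answer the three questions above.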
Now let's talk about Clinton Portis. In 2005 he was a 24-year-old running back with 1516 rushing yards. Next year, he might be a 25-year-old running back with 1900 rushing yards. Or he might be a 25-year-old running back with 800 rushing yards. Or he might be a 25-year-old running back with 1600 yards. Any of those numbers, as well as several others, is possible and there is a certain probability of each. Using historical data from other 24-year-old running backs with around 1500 yards might provide us with a decent starting point for those probabilities.
If Portis gets, say, 1300 rushing yards in 2006 at age 25, then based on that we can estimate his chances of getting any given number of yards in 2007 at age 26. And so on. In other words, we can view Portis as a hiker on a system of trails not unlike the one pictured above. Or we can view him as a Plinko chip, with the different pegs he hits corresponding to different yardage totals.
Eventually, Portis will retire, thus landing at some ending point in his random walk, and we can compute the probability of his landing at each of them. At some of those ending points, he will be the rushing champ. At others he won't. If we add up the probabilities of all the ending points where he's the champ, we've got our estimate of Portis' chances of breaking the record.
I claimed that this was a more mathematically sophisticated way of estimating Portis' chances of breaking Smith's rushing record, but that was a lie. Yes, there is some very serious mathematics involved in working with Markov chains. But for our purposes, we don't need any of the heavy stuff. What we're doing is no different from simulating Portis' career a gazillion times and observing how frequently he breaks the record. It can be put into a Markov chain context, but that's really not necessary. In other words, I've just tricked you into learning some math. Just so it's not a total loss, I'll point out that there is a football-related application that does rely on some mathematically deep results about Markov chains. Namely, some of college football's computer ranking systems have Markov chains at their core. I'll blog on that sometime.
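To see that simulating really does amount to the same thing as multiplying and adding, here is a small Python sketch that runs the hiking example many times and counts how often the hiker lands at location 7. The trail map is the one from earlier in the post, with the trails out of 2 (to 5 and 6) assumed from the map; the function names, trial count, and seed are arbitrary choices for illustration.

```python
import random

# Same trail map as the hiking example; the fork at 2 is an assumption.
TRAILS = {1: [2, 3, 4], 2: [5, 6], 3: [6, 7], 4: [7, 8, 9]}

def simulate_hike(trails, start, rng):
    """Walk from start, picking uniformly at each fork, until the trail ends."""
    location = start
    while location in trails:
        location = rng.choice(trails[location])
    return location

def estimate_landing_probability(target, trials=200_000, seed=0):
    """Fraction of simulated hikes that end at the target location."""
    rng = random.Random(seed)
    hits = sum(simulate_hike(TRAILS, 1, rng) == target for _ in range(trials))
    return hits / trials

# Should come out close to the exact answer of 5/18, roughly 0.278.
print(estimate_landing_probability(7))
```

Swap the hiker for a running back and the trail map for year-to-year yardage probabilities, and this is the whole idea.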
The next post will describe my process for converting historical data into probabilities, and then I'll get to the results.