Another rating system: maximum likelihood

Posted by Doug on December 14, 2006

Several months ago, I spent two posts (1, 2) talking about mathematical algorithms for ranking teams. All the chatter that comes along with the BCS standings has gotten me inspired to write up another one.

This one does not take into account margin of victory, and it is very similar to one of the BCS computer polls. I'll tell you about that at the end of the post.

Let's start with a 3-team league:

A beat B
B beat C
C beat A
A beat C

So A is 2-1, B is 1-1, and C is 1-2. We want to give each team a rating R_A, R_B, and R_C. And we want all those ratings to satisfy the following property:

Prob. of team i beating team j = R_i / (R_i + R_j)

What if we just arbitrarily picked some numbers? Say R_A = 10, R_B = 5, and R_C = 1. If those are the ratings, then (assuming the games are independent) the probability of seeing the results we actually saw would be:

(Prob of A beating B) * (Prob of B beating C) * (Prob of C beating A) * (Prob of A beating C)

which would be

10/(10+5) * 5/(5+1) * 1/(1+10) * 10/(10+1) ≈ .0459

To summarize: if 10, 5, and 1 represented the "true" strengths of the three teams, then there would be a 4.59% chance of seeing the results we actually saw. That number (4.59%) is a measure of how well our ratings (10, 5, and 1) explain what actually happened. If we could find a trio of numbers that explained the actual data better, it would be reasonable to say that the new trio is a better estimate of the teams' true strengths. So let's try 10, 6, and 2. That gives the real life data a 6.51% chance of happening, so 10, 6, and 2 is a better set of ratings than 10, 5, and 1.
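If you want to check that arithmetic yourself, here is a minimal Python sketch. The function name `likelihood` and the list-of-pairs layout for the games are just choices made for this illustration.

```python
# Each game is (winner, loser), in the order listed above.
GAMES = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]

def likelihood(ratings, games=GAMES):
    """Probability of seeing exactly these results, assuming independent
    games and P(i beats j) = R_i / (R_i + R_j)."""
    p = 1.0
    for winner, loser in games:
        p *= ratings[winner] / (ratings[winner] + ratings[loser])
    return p

first_guess = likelihood({"A": 10, "B": 5, "C": 1})
second_guess = likelihood({"A": 10, "B": 6, "C": 2})
print(f"{first_guess:.4f}")   # 0.0459
print(f"{second_guess:.4f}")  # 0.0651
```

Any other trio of ratings can be scored the same way by passing a different dictionary.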

What we want to do is find the set of ratings that best explains the data. That is, find the set of ratings that produces the maximum likelihood of seeing the results that actually happened. Hence the name; this is called the method of maximum likelihood. Imagine you have three dials you can control: one marked A, one B, and one C. You're trying to maximize this quantity:

(R_A / (R_A + R_B)) * (R_B / (R_B + R_C)) * (R_C / (R_A + R_C)) * (R_A / (R_A + R_C))

One way to increase the product might be to turn up the A dial; that will increase the first and fourth of those numbers. But there are diminishing returns to cranking the A dial. Once it's been turned up pretty high, then turning it up further doesn't increase the first and fourth terms much. Furthermore, turning up the A dial decreases the third number in the product, because A lost that third game. So you want to stop turning when the increases in the first and fourth terms are balanced by the decreases in the third.
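To see the dial argument in action, here is a rough Python sketch that holds the B and C dials fixed and sweeps only the A dial. The fixed values (5 and 1, borrowed from the first guess above), the sweep range, and the 0.01 step size are all assumptions made for illustration.

```python
def product(r_a, r_b, r_c):
    """Likelihood of the four results: A beat B, B beat C, C beat A, A beat C."""
    return ((r_a / (r_a + r_b)) * (r_b / (r_b + r_c))
            * (r_c / (r_a + r_c)) * (r_a / (r_a + r_c)))

# Hold the B and C dials fixed (assumed values, taken from the first guess).
R_B, R_C = 5.0, 1.0

# Try A-dial settings 0.01, 0.02, ..., 20.00 and keep the best one.
settings = [k / 100 for k in range(1, 2001)]
best_a = max(settings, key=lambda a: product(a, R_B, R_C))

print(best_a)                     # 3.7
print(product(best_a, R_B, R_C))  # the peak of the product for this sweep
print(product(10.0, R_B, R_C))    # smaller: the A dial was turned too far
```

With B and C stuck at 5 and 1, the best A setting lands near 3.7, well short of 10, because the loss to a weak C gets more costly the higher A goes. That answer only holds for those particular B and C settings; it changes once all three dials are allowed to move.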

The game is to simultaneously set all three dials at the place that maximizes the product. How exactly we find that maximum is a bit math-y, so I'll skip it. If people are interested, I can post it as an appendix in the comments [UPDATE: here it is]. But the point is, it can be done.
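For the curious, one standard way to find the maximum is a simple fixed-point iteration: repeatedly reset each team's rating to its win total divided by the sum of 1/(R_i + R_j) over all of its games. The Python below is a sketch of that general technique, not necessarily the procedure used for the numbers in this post. The function name `fit_ratings`, the starting values, and the iteration count are arbitrary choices, and it assumes every team has at least one win and one loss and that the schedule connects all the teams.

```python
GAMES = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]  # (winner, loser)

def fit_ratings(games, iterations=500):
    """Approximate maximum likelihood ratings by fixed-point iteration."""
    teams = sorted({team for game in games for team in game})
    wins = {team: 0 for team in teams}
    for winner, _ in games:
        wins[winner] += 1

    ratings = {team: 1.0 for team in teams}  # arbitrary starting point
    for _ in range(iterations):
        new = {}
        for team in teams:
            # Sum 1 / (R_i + R_j) over every game this team played.
            denom = 0.0
            for winner, loser in games:
                if team in (winner, loser):
                    denom += 1.0 / (ratings[winner] + ratings[loser])
            new[team] = wins[team] / denom
        # Only ratios matter, so rescale to keep the average rating at 1.
        total = sum(new.values())
        ratings = {team: len(teams) * r / total for team, r in new.items()}
    return ratings

r = fit_ratings(GAMES)
print(f"{r['A'] / r['B']:.3f}")             # 1.521
print(f"{r['B'] / r['C']:.3f}")             # 1.521
print(f"{r['A'] / (r['A'] + r['B']):.3f}")  # 0.603
```

The sketch prints ratios rather than raw ratings because the overall scale of the ratings is arbitrary.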

If we do it in this simplified example, we get this:

Team A: 8.37
Team B: 5.50
Team C: 3.62

[Of course, if you multiplied or divided all those numbers by the same constant, you'd have an equivalent set of ratings. It's the ratios and the order that matter, not the numbers themselves.]
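A quick way to convince yourself of this is to rescale the ratings and confirm that every head-to-head probability stays put. This is a small Python sketch; the helper name `win_prob` and the scaling constant are arbitrary choices for illustration.

```python
def win_prob(r_i, r_j):
    """P(i beats j) = R_i / (R_i + R_j)."""
    return r_i / (r_i + r_j)

ratings = {"A": 8.37, "B": 5.50, "C": 3.62}

# Divide every rating by the same constant (here, C's rating).
scaled = {team: r / 3.62 for team, r in ratings.items()}

for i, j in [("A", "B"), ("B", "C"), ("A", "C")]:
    before = win_prob(ratings[i], ratings[j])
    after = win_prob(scaled[i], scaled[j])
    print(i, j, abs(before - after) < 1e-12)  # True for every pair
```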

Using these numbers we could estimate, for example, that the probability of A beating B is 8.37/(8.37+5.5), which is approximately 60.3%. I've never seen these predictions actually tested on future games. That is, if you look at all games where this method estimates a 60% chance of one team beating another, does the predicted winner actually win 60% of the time? Maybe I'll test that in a future post, but for now it's beside the point. Perhaps the best way to interpret the 60.3% figure is not: this method predicts that A has a 60.3% chance of beating B tomorrow. Rather it's this: assigning a 60.3% probability to A beating B is most consistent with the past data.

This distinction is reinforced when we look at the rankings produced by this method through week 14 of the 2006 NFL season:

TM Rating Record (W-L-T)
sdg 4.790 11- 2- 0
ind 3.716 10- 3- 0
chi 3.617 11- 2- 0
bal 3.469 10- 3- 0
nwe 2.439 9- 4- 0
cin 1.714 8- 5- 0
nor 1.666 9- 4- 0
jax 1.617 8- 5- 0
dal 1.256 8- 5- 0
den 1.232 7- 6- 0
nyj 1.209 7- 6- 0
nyg 1.097 7- 6- 0
ten 1.056 6- 7- 0
buf 0.976 6- 7- 0
kan 0.887 7- 6- 0
phi 0.851 7- 6- 0
pit 0.777 6- 7- 0
mia 0.764 6- 7- 0
atl 0.753 7- 6- 0
sea 0.712 8- 5- 0
car 0.603 6- 7- 0
min 0.469 6- 7- 0
cle 0.448 4- 9- 0
hou 0.395 4- 9- 0
gnb 0.391 5- 8- 0
was 0.362 4- 9- 0
stl 0.312 5- 8- 0
sfo 0.306 5- 8- 0
tam 0.278 3-10- 0
ari 0.192 4- 9- 0
oak 0.134 2-11- 0
det 0.101 2-11- 0

The Colts' probability of beating the Lions, according to this method, is 3.72/(3.72+.101), which is about 97.4%. That's a bit higher than my intuition says it ought to be. Part of that, remember, is that the method doesn't take into account margin of victory and therefore does not know that the Colts have squeaked by in a lot of games and were destroyed by the Jaguars. All it sees is a team that has played a very tough schedule and still has nearly the best record in the league. But the other part is that this isn't designed to predict the future, it's designed to explain the past.

I told you that this method is similar to one of those actually in use by the BCS. That method is Peter Wolfe's, and he describes it as follows:

The method we use is called a maximum likelihood estimate. In it, each team i is assigned a rating value R_i that is used in predicting the expected result between it and its opponent j, with the likelihood of i beating j given by:

R_i / (R_i + R_j)

The probability P of all the results happening as they actually did is simply the product of multiplying together all the individual probabilities derived from each game. The rating values are chosen in such a way that the number P is as large as possible.

That is precisely the system we've described above, but if you load up all the games and run the numbers, you won't get numbers that match up with the ones Wolfe publishes. I'll explain why in the next post.

This entry was posted on Thursday, December 14th, 2006 at 5:37 am and is filed under BCS, Statgeekery.