
## Another rating system: maximum likelihood

Several months ago, I spent two posts (1, 2) talking about mathematical algorithms for ranking teams. All the chatter that comes along with the BCS standings has gotten me inspired to write up another one.

This one does not take into account margin of victory, and it is very similar to one of the BCS computer polls. I'll tell you about that at the end of the post.

Let's start with a 3-team league:

A beat B

B beat C

C beat A

A beat C

So A is 2-1, B is 1-1, and C is 1-2. We want to give each team a rating R_A, R_B, and R_C. And we want all those ratings to satisfy the following property:

Prob. of team i beating team j = R_i / (R_i + R_j)

What if we just arbitrarily picked some numbers? Say R_A = 10, R_B = 5, and R_C = 1. If those are the ratings, then (assuming the games are independent) the probability of seeing the results we actually saw would be:

(Prob of A beating B) * (Prob of B beating C) * (Prob of C beating A) * (Prob of A beating C)

which would be

10/(10+5) * 5/(5+1) * 1/(1+10) * 10/(1+10) ≈ 0.0459

To summarize: **if 10, 5, and 1 represented the "true" strengths of the three teams, then there would be a 4.59% chance of seeing the results we actually saw.** That number (4.59) is a measure of how well our ratings (10, 5, and 1) explain what actually happened. If we could find a trio of numbers that explained the actual data better, it would be reasonable to say that that trio of numbers is a better estimate of the teams' true strengths. So let's try 10, 6, and 2. That gives the real life data a 6.51% chance of happening, so 10, 6, and 2 is a better set of ratings than 10, 5, and 1.
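That calculation is easy to sketch in a few lines of Python (the function names are mine, not part of any published implementation):

```python
def win_prob(r_winner, r_loser):
    # Model assumption: P(i beats j) = R_i / (R_i + R_j)
    return r_winner / (r_winner + r_loser)

def likelihood(ra, rb, rc):
    # Probability of the observed results: A beat B, B beat C, C beat A, A beat C
    return (win_prob(ra, rb) *
            win_prob(rb, rc) *
            win_prob(rc, ra) *
            win_prob(ra, rc))

print(round(likelihood(10, 5, 1), 4))  # 0.0459
print(round(likelihood(10, 6, 2), 4))  # 0.0651
```

Trying different triples by hand like this is exactly the "dial-turning" described next; the method of maximum likelihood just automates the search.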

What we want to do is find the set of ratings that *best* explain the data. That is, find the set of ratings that produce the maximum likelihood of seeing the results that actually happened. Hence the name; this is called the *method of maximum likelihood*. Imagine you have three dials you can control: one marked A, one B, and one C. You're trying to maximize this quantity:

(R_A / (R_A + R_B)) * (R_B / (R_B + R_C)) * (R_C / (R_A + R_C)) * (R_A / (R_A + R_C))

One way to increase the product might be to turn up the A dial; that will increase the first and fourth of those numbers. But there are diminishing returns to cranking the A dial. Once it's been turned up pretty high, then turning it up further doesn't increase the first and fourth terms much. Furthermore, turning up the A dial decreases the third number in the product, because A lost that third game. So you want to stop turning when the increases in the first and fourth terms are balanced by the decreases in the third.

The game is to simultaneously set all three dials at the place that maximizes the product. How exactly we find that maximum is a bit math-y, so I'll skip it. If people are interested, I can post it as an appendix in the comments [UPDATE: here it is]. But the point is, it can be done.

If we do it in this simplified example, we get this:

Team A: 8.37

Team B: 5.50

Team C: 3.62

[Of course, if you multiplied or divided all those numbers by the same constant, you'd have an equivalent set of ratings. It's the ratios and the order that matter, not the numbers themselves.]

Using these numbers we could estimate, for example, that the probability of A beating B is 8.37/(8.37+5.5), which is approximately 60.3%. I've never seen these predictions actually tested on future games. That is, if you look at all games where this method estimates a 60% chance of one team beating another, does the predicted winner actually win 60% of the time? Maybe I'll test that in a future post, but for now it's beside the point. Perhaps the best way to interpret the 60.3% figure is not: this method predicts that A has a 60.3% chance of beating B tomorrow. Rather it's this: assigning a 60.3% probability to A beating B is most consistent with the past data.
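The arithmetic behind that 60.3% figure, as a quick sketch using the fitted ratings above:

```python
# Fitted ratings from the 3-team example above (arbitrary scale).
r = {"A": 8.37, "B": 5.50, "C": 3.62}

# Probability assignment most consistent with the past data: P(A beats B)
p = r["A"] / (r["A"] + r["B"])
print(round(p, 3))  # 0.603
```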

This distinction is reinforced when we look at the rankings produced by this method through week 14 of the 2006 NFL season:

| Team | Rating | Record |
|------|--------|--------|
| sdg | 4.790 | 11-2-0 |
| ind | 3.716 | 10-3-0 |
| chi | 3.617 | 11-2-0 |
| bal | 3.469 | 10-3-0 |
| nwe | 2.439 | 9-4-0 |
| cin | 1.714 | 8-5-0 |
| nor | 1.666 | 9-4-0 |
| jax | 1.617 | 8-5-0 |
| dal | 1.256 | 8-5-0 |
| den | 1.232 | 7-6-0 |
| nyj | 1.209 | 7-6-0 |
| nyg | 1.097 | 7-6-0 |
| ten | 1.056 | 6-7-0 |
| buf | 0.976 | 6-7-0 |
| kan | 0.887 | 7-6-0 |
| phi | 0.851 | 7-6-0 |
| pit | 0.777 | 6-7-0 |
| mia | 0.764 | 6-7-0 |
| atl | 0.753 | 7-6-0 |
| sea | 0.712 | 8-5-0 |
| car | 0.603 | 6-7-0 |
| min | 0.469 | 6-7-0 |
| cle | 0.448 | 4-9-0 |
| hou | 0.395 | 4-9-0 |
| gnb | 0.391 | 5-8-0 |
| was | 0.362 | 4-9-0 |
| stl | 0.312 | 5-8-0 |
| sfo | 0.306 | 5-8-0 |
| tam | 0.278 | 3-10-0 |
| ari | 0.192 | 4-9-0 |
| oak | 0.134 | 2-11-0 |
| det | 0.101 | 2-11-0 |

The Colts' probability of beating the Lions, according to this method, is 3.716/(3.716+0.101), which is about 97.4%. That's a bit higher than my intuition says it ought to be. Part of that, remember, is that the method doesn't take into account margin of victory and therefore does not know that the Colts have squeaked by in a lot of games and were destroyed by the Jaguars. All it sees is a team that has played a very tough schedule and still has nearly the best record in the league. But the other part is that this isn't designed to predict the future; it's designed to explain the past.

I told you that this method is similar to one of those actually in use by the BCS. That method is Peter Wolfe's, and he describes the method here.

> The method we use is called a maximum likelihood estimate. In it, each team *i* is assigned a rating value *R_i* that is used in predicting the expected result between it and its opponent *j*, with the likelihood of *i* beating *j* given by:
>
> R_i / (R_i + R_j)
>
> The probability *P* of all the results happening as they actually did is simply the product of multiplying together all the individual probabilities derived from each game. The rating values are chosen in such a way that the number *P* is as large as possible.

That is precisely the system we've described above, but if you load up all the games and run the numbers, you won't get numbers that match up with the ones Wolfe publishes. I'll explain why in the next post.

This entry was posted on Thursday, December 14th, 2006 at 5:37 am and is filed under BCS, Statgeekery.

I like this kind of exercise because 1) it gives us some sort of logical ranking, while 2) also showing the limitations in some of the BCS style rankings.

Just how likely is the maximum likelihood, for example? It may be the best fit, but that may not mean our confidence in the fit is particularly strong.

And on the margin of victory thing, while we can dispute the value of winning by 17 rather than 24, I think the evidence is fairly strong that "clutch" ability in close games is mostly random and does not continue. Exhibit #2,542: the Colts vs. Dallas and Tennessee after early performance. I would be interested in seeing the same method used, but counting games decided by 6 points or less as ties for purposes of the ranking method. I select 6 rather than 7 because at 6, the game is one event away from the outcome being changed, while at 7, it is 2 events away.

Oh, and in light of this, Colts = Gators? We all know that the Gators have been "clutch", and that certainly is not due to small sample size and randomness, but rather due to them having "the heart of a champion" and being "clutch."

And final random thought--can we go ahead and kick the NFC West out of the BCS? Maybe we can make the Seahawks finish in the top 12 in the BCS rankings to qualify for a bowl, er, playoffs, despite their record against the NFC West and non-Chicago NFC North.

Somewhat related to what someone does at this site: http://www.beatpaths.com.

Only takes into account who beats who. Removes "loops" where A beat B, B beat C and C beat A.

Great stuff, Doug! I'd be interested in reading the math-y details if you can do that.

Since this method is designed to explain the past rather than predict the future, we would call it retrodictive rather than predictive. This distinction fascinates me and makes me wonder how much overlap there can possibly be between these two types of systems, mathematically speaking.

One thought I had is what if you replaced A beat B with A is favored to beat B. Would that make the system more predictive?

Jim A,

I will see if I can't find the time to write down the math today or tomorrow. Prerequisites will be multivariable calculus (although folks who are familiar with single-variable calculus might be able to follow along) and the ability to work comfortably with logarithms.

And yes, it is a retrodictive system. I need to devote a whole post (or two) to the difference between predictive and retrodictive systems. I think that about 95% of the time, when someone looks at the Sagarin PREDICTOR rankings or one of the BCS computer rankings and complains about where some particular team is ranked, it's because the person is thinking predictively and the system is retrodictive (or vice versa). Understanding the distinction is the first step to understanding computer ratings.

The more generic name for this system is Bradley-Terry, and there are some college hockey types (where this method is called KRACH - Ken's Ratings About College Hockey) with considerable exposition on the math.

The confidence intervals for this in college hockey are very large. There was a discussion of that confidence interval on a Cornell message board for those who want to dig into some numbers. Summary: of 60 teams you can pin any one down with 95% confidence to a range of about 16 places in the rankings. Not too good if you want to separate #2 from #3. And that's with a 30 game season, rather than 12 or 16 in football.

I've wondered how Wolfe modifies his rankings, as this method would leave Boise and OSU tied, since undefeated teams all get infinite ratings. If you have the answer, I'll be looking forward to reading it. I have partial guesses, but they never seemed like they'd match how far down a team like Boise ends up.

**Technical appendix:** Continuing the same example:

A beat B

B beat C

C beat A

A beat C

Just to make the notation a bit less cumbersome, I'll use A, B, and C as the ratings instead of R_A, R_B, and R_C.

The quantity to be maximized is

P = A/(A+B) * B/(B+C) * C/(A+C) * A/(A+C)

P is a function of three variables (A, B, C). When you want to maximize such a thing, one strategy is to take the partial derivatives with respect to A, B, and C, set them all to zero, and try to solve that system of three equations in three unknowns. Now, since this is a product of quotients, the derivatives are going to get very ugly. So instead of working with P, let's work with ln(P). Since the natural log is an increasing function, the derivative of P always has the same sign as the derivative of ln(P) and therefore P and ln(P) will be maximized at the same place.

ln(P) = ln(A) - ln(A+B) + ln(B) - ln(B+C) + ln(C) - ln(A+C) + ln(A) - ln(A+C)

or

ln(P) = 2ln(A) + ln(B) + ln(C) - (ln(A+B) + ln(B+C) + 2ln(A+C))

So we take the partial derivative of ln(P) with respect to A:

d(ln(P)) / dA = 2/A - (1/(A+B) + 2/(A+C))

We want that to be zero, so we want 2/A = 1/(A+B) + 2/(A+C)

or

**A = 2 / ( 1/(A+B) + 2/(A+C) )**

So that's one equation. Taking the partials with respect to B and C and setting them to zero will give you two more equations in A, B, and C. Here they are:

**B = 1 / ( 1/(A+B) + 1/(B+C) )**

**C = 1 / ( 1/(B+C) + 2/(A+C) )**

Our goal is to solve the three bold equations for the three unknowns A, B, and C. We will do so via iteration.

Start with A=B=C=1.

Now find a new A via the first bold equation above: newA = 2 / ( 1/(1+1) + 2/(1+1) ) = 4/3

Likewise, newB = 1 / ( 1/(1+1) + 1/(1+1) ) = 1 and newC = 1 / ( 1/(1+1) + 2/(1+1) ) = 2/3

Now set A=newA, B=newB, and C=newC and do the whole thing over again. Then again. Then again. Keep doing it until the numbers stop changing. It can be proven (this appeared in the mathematical literature as early as the 1920s) that, as long as certain conditions are met, they will eventually stop changing. And the numbers that you get will solve the three equations.

Now, if you've got 32 teams and 208 games --- or 119 teams and 800ish games --- it's going to get ugly. But if you trace back through the bold equations and how they came to be, you'll be able to convince yourself that the general formula for a particular team, call them Team A, is:

A = (A's wins) / ( (games against B)/(A+B) + (games against C)/(A+C) + ... + (games against Z) / (A+Z) )

Using this, it's actually not too tough to program a computer to compute these rankings.
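For instance, here is a minimal Python sketch of that iteration on the four-game example (the variable names and structure are mine, not Doug's actual program):

```python
# Fixed-point iteration for maximum likelihood (Bradley-Terry) ratings.
# games lists (winner, loser) pairs from the 3-team example in the post.
games = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]
teams = ["A", "B", "C"]

ratings = {t: 1.0 for t in teams}  # start with A = B = C = 1
for _ in range(200):
    new = {}
    for t in teams:
        wins = sum(1 for w, _ in games if w == t)
        # Each game t played adds 1/(R_t + R_opponent) to the denominator;
        # since t is one of (w, l), that denominator term is 1/(R_w + R_l).
        denom = sum(1.0 / (ratings[w] + ratings[l])
                    for w, l in games if t in (w, l))
        new[t] = wins / denom
    ratings = new  # update all three "dials" simultaneously

# The scale is arbitrary; only the ratios (and hence the probabilities) matter.
p_a_beats_b = ratings["A"] / (ratings["A"] + ratings["B"])
print(round(p_a_beats_b, 3))  # ~0.603, matching the 60.3% figure in the post
```

The same loop handles a full 32-team, 208-game schedule unchanged; only the `games` list grows.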

A few notes...

1) Very good post, Doug.

2) Logarithms are easy!

3) The Jets rank 11th in your system

4) Yes, you should do a post on the predictive/retrodictive stuff. :popcorn:

5) Here's what interests me right now: I think this system will be able to test whether a team "has another team's number". For example, the Texans will rank below the Jags every year, I'm sure, in this system, but they're now 6-4 against them. How about this one: what's Team A's record when playing @ Team B, when it is ranked several spots ahead of Team B, but lost earlier in the season at home to Team B? My guess is it will depend a bit on how big of a cut-off you set, but mostly the answer will be a factor of two things: the cut-off difference, and the appropriate weight given to home-field. Negligible, at best, will be the knowledge that Team A already lost. (Yes, I'm looking at you Willie McGahee; one team does not own another team. I think.)

Very nice article. My NFL-Forecast software uses a similar maximum likelihood method. I use it predictively as Jim A suggests. To overcome some of the early season limitations of predictive methods, I use Vegas over-under lines for season wins to calibrate my ratings for the early weeks (1-4), followed by a mixture of the preseason and ML ratings up to week 8, then only the ML ratings for the rest of the season.

I'm probably going to devote most of the off-season development cycle to predictive rating systems. Specifically, I'm thinking of Bayesian methods and artificial neural networks. Football Outsiders' ratings seem to have very strong predictive power and I may do something with those as well. If anyone has any insight on any of the above, I'd love to hear it, and would even consider some kind of collaboration.

Doug-

I've always worried that a gradient search would leave you at a local optimum, and I have used hysteretic search algorithms to find global optima. I haven't looked closely at your rating function, but is it guaranteed to find a global optimum?

Sorry for the serial posting, but this is my bread and butter. My objective function is actually a best fit to the histogram of past results, as you describe below. In other words, I evaluate each guess of power rankings by calculating the probability of each game, putting them in bins according to % chance of the home team winning, then comparing that to the actual % for that bin.

In limited testing, my method does do a decent job of predicting the percent wins of future games. Over the off-season I'll check it over multi-year data.

Doug wrote:

"Using these numbers we could estimate, for example, that the probability of A beating B is 8.37/(8.37+5.5), which is approximately 60.3%. I’ve never seen these predictions actually tested on future games. That is, if you look at all games where this method estimates a 60% chance of one team beating another, does the predicted winner actually win 60% of the time? Maybe I’ll test that in a future post, but for now it’s beside the point. Perhaps the best way to interpret the 60.3% figure is not: this method predicts that A has a 60.3% chance of beating B tomorrow. Rather it’s this: assigning a 60.3% probability to A beating B is most consistent with the past data."

Larry, thanks for pointing me to that hockey site. That link covers a lot of what I was going to talk about tomorrow. (But I'm still going to talk about it, since the post is already written. )

Larry:

One way to deal with the infinity problem is to use a penalized likelihood method. Here is a link to an application of this technique to college football:

members.accesstoledo.com/measefam/paper.pdf

You guys are freakin' killing me with the spoilers of tomorrow's post!

Doug, I'll enjoy your telling of the story regardless. You have a gift of explaining very complicated material in a straight-forward and easy to understand manner.

I purposely didn't actually reveal the spoiler, and since the College Hockey version hasn't used the suspected spoiler in a long time, I didn't think I was revealing that either. I'll save comments on said method until tomorrow's post.

This is my favorite ranking system, for its elegance and simplicity. Glad to see it discussed here.

Anybody interested in the math of rating systems might be interested in any of Arpad Elo's books or articles. (Interesting is an over-bid. Put it this way: I've slept through many of his articles.)

As with the system Doug presents here, Elo doesn't take margin of victory into account.

And it's worth noting that he seriously rounds (as a measure of his lack of confidence in the estimates) any ratings based on fewer than 30 trials.

I used a similar system for ranking players for a beer pong tournament. We spent a couple days playing random games, then needed seeding for the tournament. An interesting difference was that each team consisted of two individuals, so while each game couldn't distinguish between the two players on a team, different combinations of teams brought out differences in ratings. I also incorporated margin of victory. I'm a math guy, but used Excel's solver function to minimize the error between what the model predicted and the actual outcomes. The good news is that people thought it was awesome. The bad news is that three of the four top seeds were upset in the first round. I guess I can now blame my model's lack of predictiveness versus retrodictiveness.

In my school days, I had only derived a mathematical expression for the MLE of a parameter; I did not know a direct application of the method in daily life until I came across this blog.

Having said that, may I request that the blog owner, who I assume is a statistician, discuss a few case studies on the methods of BLUE and MME?