
Elo ratings explained

Posted by Doug on December 10, 2008

About two and a half years ago, I wrote this:

As you probably know, the participants in the BCS championship game are determined in part by a collection of computer rankings. Those computer rankings are implementing algorithms that “work” because of various mathematical theorems. At some point, I’m going to use this blog to write down everything I know about the topic (which by the way is a drop in the bucket compared to what many other people know; I am not an expert, just a fan) in language that a sufficiently interested and patient non-mathematician can understand.

Since then, I have only written a handful of posts about the mathematics of ranking systems. Here they are:

Simple Ranking System

Another way to derive the simple ranking system

The Maximum Likelihood Method

Some discussion of the technical difficulties involved with the Maximum Likelihood method

Incorporating home-field and/or margin of victory into the Maximum Likelihood Method

I'm going to add another post to this list today by writing about a method that Jeff Sagarin cryptically calls ELO_CHESS. Sagarin's ELO_CHESS method is one of the six computer ranking systems that figure into the BCS, although as we'll soon see, we don't have quite enough information to reproduce his rankings exactly. That's OK. The point of this post is to understand the theory behind it.

First, a bit of background.

The strange-sounding name isn't as mysterious as it sounds. It comes from the name of its creator, Hungarian physicist Arpad Elo, who developed the system for the purpose of rating chess players in the 1950s. The world of competitive chess has many of the same elements that cause the need for computer rating systems in college football: lots of "teams" who play schedules of widely varying degrees of quality. It also has plenty of sharp analytical minds, so it's not surprising that lots of good work on rating systems has originated there. The mechanics of implementing Elo's system are different in college football than in chess, but the underlying theory is the same.

Before describing the Elo system, let me explain why I think it's an aesthetically pleasing one, and before I do that, I'll remind you why I like the Simple Rating System so much.

The reason I like the Simple Rating System is that the final rankings have a sort of balancing property. If you run the SRS on any league for any year, and then pick out any team, you can look at a table of opponents, ratings, and scores, and see precisely why that team's rating is what it is. Here, for example, are the data for the 2005 Colts, whom I used as an example in my original writeup on SRS:

                   OPP    Adj
WK  OPP  Margin  Rating Margin
 1  bal    17    -1.83   15.17
 2  jax     7     4.76   11.76
 3  cle     7    -4.22    2.78
 4  ten    21    -7.57   13.43
 5  sfo    25   -11.15   13.85
 6  ram    17    -5.15   11.85
 7  hou    18   -10.03    7.97
 9  nwe    19     3.14   22.14
10  hou    14   -10.03    3.97
11  cin     8     3.82   11.82
12  pit    19     7.81   26.81
13  ten    32    -7.57   24.43
14  jax     8     4.76   12.76
15  sdg    -9     9.94    0.94
16  sea   -15     9.11   -5.89
17  ari     4    -4.98   -0.98
AVERAGE  12.0    -1.20   10.80

The Colts' rating is +10.8, which is exactly their average point margin plus the average rating of their opponents. In looking at their schedule, you can see exactly how each game contributed to that rating. You don't have to agree that this is the best way to describe the Colts' performance, but nothing is hidden. It's not a black box. It's clear where everything came from. The SRS is nothing more and nothing less than the unique set of ratings that creates a table like this for every team in such a way that all the teams' tables agree with each other.
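If you want to check the balancing for yourself, here is a small Python sketch. The margins and opponent ratings are copied from the table above; the variable names are mine and are only for illustration.

```python
# (point margin, opponent SRS rating) for each 2005 Colts game, from the table above
games = [
    (17, -1.83), (7, 4.76), (7, -4.22), (21, -7.57),
    (25, -11.15), (17, -5.15), (18, -10.03), (19, 3.14),
    (14, -10.03), (8, 3.82), (19, 7.81), (32, -7.57),
    (8, 4.76), (-9, 9.94), (-15, 9.11), (4, -4.98),
]

avg_margin = sum(margin for margin, _ in games) / len(games)
avg_opp_rating = sum(rating for _, rating in games) / len(games)

# The balancing property: rating = average margin + average opponent rating
colts_rating = avg_margin + avg_opp_rating

print(f"{avg_margin:.2f} {avg_opp_rating:.2f} {colts_rating:.2f}")
```

This prints 12.00, -1.20 and 10.80, matching the AVERAGE row of the table.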

Now, the process of finding these ratings isn't always easy. But once found, they balance. They make sense. The system explains itself. That, to me, is a property that a good rating system ought to have.

With that in mind, I'm going to skip to the end of the Elo story before circling back around to explain how I ended up there.

The Elo method also has a balancing property. Instead of balancing points, it balances expected wins. In particular, its goal is to give every team a rating, then use those ratings to compute the probability of any team beating any other, then fine-tune those ratings so that every team's expected wins exactly match its actual wins.

As an example let's consider the 2008 Pitt Panthers. After all the work is done, Pitt's rating will turn out to be 11.75. Now take one of their opponents, say Rutgers, who has a rating of 5.57. This kind of rating difference means that Pitt should have about a .717 probability of beating Rutgers (I'll explain how I got that shortly; just go with it for now). So Pitt's game against Rutgers should result in .717 expected wins for the Panthers. Pitt lost, so they got zero actual wins for that game. Against Syracuse, the ratings indicate .940 expected wins for Pitt, and they got one actual win. Against Navy, the ratings indicate .692 expected wins for Pitt, and again they got one actual win.

When you look at the entire schedule, you see this:

Opponent                 OppRating RatingDiff ExpWins Wins
BowlingGreenState           -9.13    20.88    0.958    0
Buffalo                     -1.14    12.90    0.874    1
Iowa                         6.70     5.05    0.681    1
Syracuse                    -6.65    18.40    0.940    1
SouthFlorida                 4.91     6.85    0.736    1
Navy                         6.34     5.41    0.692    1
Rutgers                      5.57     6.19    0.717    0
NotreDame                    2.28     9.47    0.806    1
Louisville                  -4.02    15.77    0.914    1
Cincinnati                  17.47    -5.72    0.298    0
WestVirginia                 7.02     4.73    0.670    1
Connecticut                  5.72     6.03    0.712    1
TOTAL                                         9.00     9

Again, it's no small amount of work to get here. But once you're here, you can see what the method "means."

In discussions of both the SRS and the Maximum Likelihood Method, I likened the process of getting these ratings to tuning a bunch of dials. Whenever you twiddle the Pittsburgh dial, you affect the calculation for all of Pitt's opponents, which forces you to twist their dials one way or the other just a little. And then that affects the ratings of all their opponents (which include Pitt), forcing you to readjust those dials a little. The Elo ratings are nothing more and nothing less than the collection of ratings that are tuned just right so that there is a chart like the above for every team and they're all consistent with each other.

That's the big picture. Now how did we get there? As you can probably guess, we iterate, just as we do for the SRS.

STEP 1: give every team a rating of zero

STEP 2: using those ratings, check to see if every team's expected wins equal their actual wins

If so, you're done. If not, then...

STEP 3: if a team had more wins than expected, then it's better than those ratings thought it was. So twist its dial a bit to the right. If a team had fewer wins than expected, twist its dial a smidge to the left. How much you twist depends on how far apart the expected and actual wins are.

STEP 4: go to STEP 2

In order to actually do this, you need (in step 2) to be able to turn ratings into expected wins (i.e. probabilities). You've got lots of choices for this, but here is the function I'm using:

Win Probability = 1 / (1 + e^(-.15*(rating difference)))

Here is a picture: [figure: win probability plotted against rating difference]

Because its output is supposed to be a probability, you have to pick a function that always returns numbers between 0 and 1. And because the two teams' probabilities have to add to one in a given game, the function has to be symmetric about the point (0, 1/2). But you could use any increasing function that satisfies those two properties and get a reasonable set of ratings. There are some philosophical/theoretical reasons why you might want to pick one such function over another such function, but I'm going to skip that discussion.
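For anyone who wants to play along, here is that function as a minimal Python sketch. The .15 constant and the rating differences are the ones used in this post; the function name and the constant name are my own choices.

```python
import math

K = 0.15  # scale constant from the formula above

def win_probability(rating_diff: float) -> float:
    """Probability that a team rated `rating_diff` points above its opponent wins."""
    return 1.0 / (1.0 + math.exp(-K * rating_diff))

# Evenly rated teams get a coin flip.
print(win_probability(0.0))

# The two teams' probabilities in any one game add up to one
# (up to floating-point rounding).
print(win_probability(6.19) + win_probability(-6.19))

# Pitt's rating differences against Rutgers, Syracuse and Navy, from the table above
for rating_diff in (6.19, 18.40, 5.41):
    print(round(win_probability(rating_diff), 3))
```

The last three lines print 0.717, 0.94 and 0.692, the same expected wins that show up in Pitt's table.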

So we start with every team having a rating of zero. Then we run through every team's schedule. Here is Pitt again:

Opponent                 OppRating RatingDiff ExpWins Wins
BowlingGreenState               0        0     .5      0
Buffalo                         0        0     .5      1
Iowa                            0        0     .5      1
Syracuse                        0        0     .5      1
SouthFlorida                    0        0     .5      1
Navy                            0        0     .5      1
Rutgers                         0        0     .5      0
NotreDame                       0        0     .5      1
Louisville                      0        0     .5      1
Cincinnati                      0        0     .5      0
WestVirginia                    0        0     .5      1
Connecticut                     0        0     .5      1
                                              6.0      9

Pitt won more games (9) than this set of ratings thinks it should have (6), so we need to increase Pitt's rating. In particular, let's go ahead and increase Pitt's rating by the difference between the expected and actual wins. So Pitt's new rating is +3.0. Meanwhile, of course, Syracuse got a new rating (which is negative), Cincinnati got a new rating (which is positive), and so did every other team. So we work back through the process with the new set of ratings:

Opponent                 OppRating RatingDiff ExpWins Wins
BowlingGreenState            0.00     3.00    0.611    0
Buffalo                      1.50     1.50    0.556    1
Iowa                         2.00     1.00    0.537    1
Syracuse                    -3.00     6.00    0.711    1
SouthFlorida                 1.00     2.00    0.574    1
Navy                         2.00     1.00    0.537    1
Rutgers                      1.00     2.00    0.574    0
NotreDame                    0.00     3.00    0.611    1
Louisville                  -1.00     4.00    0.646    1
Cincinnati                   4.50    -1.50    0.444    0
WestVirginia                 2.00     1.00    0.537    1
Connecticut                  1.00     2.00    0.574    1
                                              6.914    9

What this says is that, at this point, it looks like giving Pitt a rating of +3.0 isn't high enough. It underestimates their wins by about 2.086. So we increase Pitt's rating by that much, to +5.086. Then we go through the same process for every other team, and we evaluate Pitt yet again in light of the revised estimates:

Opponent                 OppRating RatingDiff ExpWins Wins
BowlingGreenState           -0.21     5.30    0.689    0
Buffalo                      2.37     2.72    0.601    1
Iowa                         2.89     2.19    0.582    1
Syracuse                    -4.68     9.76    0.812    1
SouthFlorida                 1.30     3.79    0.638    1
Navy                         2.77     2.32    0.586    1
Rutgers                      1.45     3.64    0.633    0
NotreDame                   -0.05     5.13    0.684    1
Louisville                  -1.82     6.91    0.738    1
Cincinnati                   6.71    -1.62    0.440    0
WestVirginia                 2.88     2.21    0.582    1
Connecticut                  1.30     3.79    0.638    1
                                              7.622    9
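The second pass for Pitt can be reproduced from the numbers already shown. This is only a sketch of a single dial twist for a single team: the opponent ratings are copied from the table where Pitt is rated +3.0, and the helper name is mine.

```python
import math

def win_probability(rating_diff, k=0.15):
    """Logistic win probability with the post's .15 constant."""
    return 1.0 / (1.0 + math.exp(-k * rating_diff))

pitt_rating = 3.0
actual_wins = 9

# Opponent ratings after the first pass, in schedule order
opp_ratings = [0.00, 1.50, 2.00, -3.00, 1.00, 2.00,
               1.00, 0.00, -1.00, 4.50, 2.00, 1.00]

expected_wins = sum(win_probability(pitt_rating - r) for r in opp_ratings)

# Twist the dial by the gap between actual and expected wins
new_rating = pitt_rating + (actual_wins - expected_wins)

print(f"expected wins: {expected_wins:.2f}")
print(f"new rating:    {new_rating:.2f}")
```

It prints about 6.91 expected wins and a new rating of about 5.09, which is where Pitt sits in the third table.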

Repeat this process another couple of hundred times, and the ratings will eventually stop changing. When that happens, Pitt's rating will be at about 11.75 and its expected wins will exactly equal 9.
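Here is a rough Python sketch of the whole loop, STEP 1 through STEP 4. I don't have room for the full 2008 schedule here, so it runs on a made-up four-team round robin; the league, the function names, the tolerance and the iteration cap are all assumptions of mine, not anything Sagarin has published.

```python
import math

def win_probability(rating_diff, k=0.15):
    """Logistic win probability with the post's .15 constant."""
    return 1.0 / (1.0 + math.exp(-k * rating_diff))

def expected_wins(ratings, games):
    """Add up each team's win probabilities over a list of (winner, loser) games."""
    expected = {team: 0.0 for team in ratings}
    for winner, loser in games:
        p = win_probability(ratings[winner] - ratings[loser])
        expected[winner] += p
        expected[loser] += 1.0 - p
    return expected

def fit_ratings(games, tol=1e-9, max_iter=10_000):
    teams = sorted({team for game in games for team in game})
    wins = {team: 0 for team in teams}
    for winner, _ in games:
        wins[winner] += 1

    ratings = {team: 0.0 for team in teams}          # STEP 1
    for _ in range(max_iter):                        # STEP 4 is this loop
        expected = expected_wins(ratings, games)     # STEP 2
        gaps = {team: wins[team] - expected[team] for team in teams}
        if max(abs(gap) for gap in gaps.values()) < tol:
            break                                    # expected wins match actual wins
        for team in teams:                           # STEP 3
            ratings[team] += gaps[team]
    return ratings

# Made-up round robin: every team has at least one win and one loss.
games = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("D", "B"), ("C", "D")]
ratings = fit_ratings(games)
for team, rating in ratings.items():
    print(team, round(rating, 2))
```

In this toy league the two 2-1 teams end up with the same positive rating and the two 1-2 teams with the same negative one, and every team's expected wins match its actual wins to within the tolerance.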

You can see how this method implicitly adjusts for strength of schedule. If two teams, like say Ball State and Florida, had the same record, then they must have the same number of expected wins also. But Florida played a collection of teams with much higher ratings. So the only way to expect Florida to have as many wins as Ball State against a tougher schedule is to give Florida a higher rating. So even though there's no point in the process where you can say, THIS is where I'm adjusting for strength of schedule, the adjustment is built into the system.

What's more, I think this method does a better job of adjusting for strength of schedule than some other methods do. The SRS, for example, is based on averages, so as far as the SRS is concerned, if you play a +10 and a -10, that's the same as playing two zeros. But in terms of win expectancies, that's far from true. If you're a +8, then a +10 and a -10 is a much tougher schedule than two zeros. If you're a -8, then you'd probably rather play the +10 and the -10.
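You can put numbers on that with the win probability function from earlier in the post. A quick sketch, with helper names of my own choosing:

```python
import math

def win_probability(rating_diff, k=0.15):
    """Logistic win probability with the post's .15 constant."""
    return 1.0 / (1.0 + math.exp(-k * rating_diff))

def expected_wins(team_rating, opp_ratings):
    """Expected wins for one team against a list of opponent ratings."""
    return sum(win_probability(team_rating - r) for r in opp_ratings)

for team in (8, -8):
    split = expected_wins(team, [10, -10])
    even = expected_wins(team, [0, 0])
    print(f"team {team:+d}: vs +10/-10 = {split:.2f}, vs 0/0 = {even:.2f}")
```

The +8 team expects about 1.36 wins against the +10 and the -10 but about 1.54 against two zeros. The -8 team expects about 0.64 against the +10 and the -10 but only about 0.46 against two zeros.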

And the Elo method captures this. A relevant example is that of Oklahoma and Texas just prior to the Big XII championship game. Both teams were 11-1, and the average rating of Texas's opponents was higher than the average rating of OU's opponents. Yet Oklahoma was still rated higher than Texas by this method. Why? Because OU's nonconference slate featured two good teams and two awful teams, whereas Texas's featured four fairly average teams. If you're as good as OU and UT were this year, the difference between Arkansas and Washington is almost negligible, but the difference between Rice and TCU is not. Even though the average of Rice's and Arkansas's ratings is higher than the average of Washington's and TCU's, TCU/Washington is the tougher slate. Elo's ability to measure strength of schedule in a team-specific way is one of its strengths.

But the Elo method, as described above, has a weakness. It's the same issue that causes problems with the Maximum Likelihood Method. Namely, undefeated teams will inevitably end up with unreasonably high ratings. Think back to the iteration process above. No matter how high a rating you give Utah or Boise State, they are always going to have more wins than expected wins, so you're going to keep turning their dial up. And once you've turned Utah's dial way, way up, then it wants to start turning up the dials of all of Utah's opponents to keep things in balance. And eventually, inevitably, BYU is rated ahead of Florida and USC.
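To see why the dial never stops turning, here is a tiny sketch with a hypothetical undefeated team that went 12-0 against opponents who are all rated zero. The schedule and the ratings tried are made up for illustration.

```python
import math

def win_probability(rating_diff, k=0.15):
    """Logistic win probability with the post's .15 constant."""
    return 1.0 / (1.0 + math.exp(-k * rating_diff))

games_played = 12  # hypothetical 12-0 team, every opponent rated 0
shortfalls = []
for rating in (10, 30, 60):
    expected = games_played * win_probability(rating - 0)
    shortfall = games_played - expected  # actual wins minus expected wins
    shortfalls.append(shortfall)
    print(f"rating {rating}: expected wins {expected:.4f}, shortfall {shortfall:.4f}")
```

The shortfall shrinks as the rating climbs but stays positive, because the win probability never quite reaches 1. So STEP 3 always has a reason to push the rating a little higher.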

This disease can be cured. There are lots of ways to deal with it. But all the cures rob the method of its mathematical elegance. I was going to talk a bit about the fixes, but then I realized I was repeating virtually this entire post, where I discussed the same issue as it pertains to the Maximum Likelihood Method.