## Elo ratings explained

Posted by Doug on December 10, 2008

About two and a half years ago, I wrote this:

As you probably know, the participants in the BCS championship game are determined in part by a collection of computer rankings. Those computer rankings are implementing algorithms that “work” because of various mathematical theorems. At some point, I’m going to use this blog to write down everything I know about the topic (which by the way is a drop in the bucket compared to what many other people know; I am not an expert, just a fan) in language that a sufficiently interested and patient non-mathematician can understand.

Since then, I have only written a handful of posts about the mathematics of ranking systems. Here they are:

Another way to derive the simple ranking system

Some discussion of the technical difficulties involved with the Maximum Likelihood method

Incorporating home-field and/or margin of victory into the Maximum Likelihood Method

I'm going to add another post to this list today by writing about a method that Jeff Sagarin cryptically calls ELO_CHESS. Sagarin's ELO_CHESS method is one of the six computer ranking systems that figures into the BCS, although as we'll soon see, we don't have quite enough information to reproduce his rankings exactly. That's OK. The point of this post is to understand the theory behind it.

First, a bit of background.

The strange-sounding name isn't as mysterious as it sounds. It comes from the name of its creator, Hungarian physicist Arpad Elo, who developed the system for the purpose of rating chess players in the 1950s. The world of competitive chess has many of the same elements that cause the need for computer rating systems in college football: lots of "teams" who play schedules of widely varying degrees of quality. It also has plenty of sharp analytical minds, so it's not surprising that lots of good work on rating systems has originated there. The mechanics of implementing Elo's system are different in college football than in chess, but the underlying theory is the same.

Before describing the Elo system, let me explain why I think it's an aesthetically pleasing one, and before I do that, I'll remind you why I like the Simple Rating System so much.

I like the Simple Rating System because the final rankings have a sort of balancing property. If you run the SRS on any league for any year, and then pick out any team, you can look at a table of opponents, ratings, and scores, and see precisely why that team's rating is what it is. Here for example are the data for the 2005 Colts, whom I used as an example in my original writeup on SRS:

| WK | OPP | Margin | OPP Rating | Adj Margin |
|---|---|---|---|---|
| 1 | bal | 17 | -1.83 | 15.17 |
| 2 | jax | 7 | 4.76 | 11.76 |
| 3 | cle | 7 | -4.22 | 2.78 |
| 4 | ten | 21 | -7.57 | 13.43 |
| 5 | sfo | 25 | -11.15 | 13.85 |
| 6 | ram | 17 | -5.15 | 11.85 |
| 7 | hou | 18 | -10.03 | 7.97 |
| 9 | nwe | 19 | 3.14 | 22.14 |
| 10 | hou | 14 | -10.03 | 3.97 |
| 11 | cin | 8 | 3.82 | 11.82 |
| 12 | pit | 19 | 7.81 | 26.81 |
| 13 | ten | 32 | -7.57 | 24.43 |
| 14 | jax | 8 | 4.76 | 12.76 |
| 15 | sdg | -9 | 9.94 | 0.94 |
| 16 | sea | -15 | 9.11 | -5.89 |
| 17 | ari | 4 | -4.98 | -0.98 |
| | AVERAGE | 12.0 | -1.20 | 10.80 |

The Colts' rating is +10.8, which is exactly their average point margin plus the average rating of their opponents. In looking at their schedule, you can see exactly how each game contributed to that rating. You don't have to agree that this is the best way to describe the Colts' performance, but nothing is hidden. It's not a black box. It's clear where everything came from. The SRS is nothing more and nothing less than the unique set of ratings that create a table like this for every team in such a way that all the teams' tables agree with each other.
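The balance described above is easy to check by hand or in a few lines of code. Here is a minimal Python sketch that recomputes the Colts' rating from the table; the names `colts_games` and `srs_rating` are made up for this illustration and are not from the original writeup.

```python
# (opponent, point margin, opponent SRS rating) for the 2005 Colts,
# copied from the table above.
colts_games = [
    ("bal", 17, -1.83), ("jax", 7, 4.76), ("cle", 7, -4.22), ("ten", 21, -7.57),
    ("sfo", 25, -11.15), ("ram", 17, -5.15), ("hou", 18, -10.03), ("nwe", 19, 3.14),
    ("hou", 14, -10.03), ("cin", 8, 3.82), ("pit", 19, 7.81), ("ten", 32, -7.57),
    ("jax", 8, 4.76), ("sdg", -9, 9.94), ("sea", -15, 9.11), ("ari", 4, -4.98),
]

def srs_rating(games):
    """Average point margin plus average opponent rating."""
    n = len(games)
    avg_margin = sum(margin for _, margin, _ in games) / n
    avg_opp_rating = sum(opp_rating for _, _, opp_rating in games) / n
    return avg_margin + avg_opp_rating

colts_rating = srs_rating(colts_games)
print(f"Colts SRS rating: {colts_rating:.2f}")
```

This only verifies the balance for one team given its opponents' ratings. Finding ratings that balance for every team at once is the harder part.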

Now, the process of finding these ratings isn't always easy. But once found, they balance. They make sense. The system explains itself. That, to me, is a property that a good rating system ought to have.

With that in mind, I'm going to skip to the end of the Elo story before circling back around to explain how I ended up there.

The Elo method also has a balancing property. Instead of balancing points, it balances **expected wins**. In particular, its goal is to give every team a rating, then use those ratings to compute the probability of any team beating any other, then fine-tune those ratings so that every team's expected wins exactly match its actual wins.

As an example let's consider the 2008 Pitt Panthers. After all the work is done, Pitt's rating will turn out to be 11.75. Now take one of their opponents, say Rutgers, who has a rating of 5.57. This kind of rating difference means that Pitt should have about a .717 probability of beating Rutgers (I'll explain how I got that shortly; just go with it for now). So Pitt's game against Rutgers should result in .717 expected wins for the Panthers. Pitt lost, so they got zero actual wins for that game. Against Syracuse, the ratings indicate .940 expected wins for Pitt, and they got one actual win. Against Navy, the ratings indicate .692 expected wins for Pitt, and again they got one actual win.

When you look at the entire schedule, you see this:

| Opponent | OppRating | RatingDiff | ExpWins | Wins |
|---|---|---|---|---|
| BowlingGreenState | -9.13 | 20.88 | 0.958 | 0 |
| Buffalo | -1.14 | 12.90 | 0.874 | 1 |
| Iowa | 6.70 | 5.05 | 0.681 | 1 |
| Syracuse | -6.65 | 18.40 | 0.940 | 1 |
| SouthFlorida | 4.91 | 6.85 | 0.736 | 1 |
| Navy | 6.34 | 5.41 | 0.692 | 1 |
| Rutgers | 5.57 | 6.19 | 0.717 | 0 |
| NotreDame | 2.28 | 9.47 | 0.806 | 1 |
| Louisville | -4.02 | 15.77 | 0.914 | 1 |
| Cincinnati | 17.47 | -5.72 | 0.298 | 0 |
| WestVirginia | 7.02 | 4.73 | 0.670 | 1 |
| Connecticut | 5.72 | 6.03 | 0.712 | 1 |
| TOTAL | | | 9.00 | 9 |

Again, it's no small amount of work to get here. But once you're here, you can see what the method "means."
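To see the balance for yourself, here is a rough Python sketch that recomputes Pitt's expected wins from the ratings in the table. It assumes the logistic win-probability function with the 0.15 constant that the post introduces further down; the names `win_probability` and `pitt_schedule` are invented for this example. Because the ratings in the table are rounded to two decimals, the total comes out very close to 9 rather than exactly 9.

```python
import math

def win_probability(rating_diff, scale=0.15):
    """Logistic win probability; scale=0.15 is the constant used in this post."""
    return 1.0 / (1.0 + math.exp(-scale * rating_diff))

PITT_RATING = 11.75

# (opponent, opponent rating, 1 if Pitt won else 0), copied from the table above.
pitt_schedule = [
    ("BowlingGreenState", -9.13, 0), ("Buffalo", -1.14, 1), ("Iowa", 6.70, 1),
    ("Syracuse", -6.65, 1), ("SouthFlorida", 4.91, 1), ("Navy", 6.34, 1),
    ("Rutgers", 5.57, 0), ("NotreDame", 2.28, 1), ("Louisville", -4.02, 1),
    ("Cincinnati", 17.47, 0), ("WestVirginia", 7.02, 1), ("Connecticut", 5.72, 1),
]

pitt_expected = sum(win_probability(PITT_RATING - opp) for _, opp, _ in pitt_schedule)
pitt_actual = sum(won for _, _, won in pitt_schedule)
print(f"expected wins: {pitt_expected:.2f}, actual wins: {pitt_actual}")
```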

In discussions of both the SRS and the Maximum Likelihood Method, I likened the process of getting these ratings to tuning a bunch of dials. Whenever you twiddle the Pittsburgh dial, you affect the calculation of all Pitt's opponents, which forces you to twist their dials one way or the other just a little. And then that affects all *their* opponents' (which include Pitt) ratings, forcing you to re-adjust their dials a little. The Elo ratings are nothing more and nothing less than the collection of ratings that are tuned just right so that there is a chart like the above for every team and they're all consistent with each other.

That's the big picture. Now how did we get there? As you can probably guess, we iterate, just as we do for the SRS.

STEP 1: give every team a rating of zero

STEP 2: using those ratings, check to see if every team's expected wins equal their actual wins

If so, you're done. If not, then...

STEP 3: if a team had more wins than expected, then it's better than those ratings thought it was. So twist its dial a bit to the right. If a team had fewer wins than expected, twist its dial a smidge to the left. How much you twist depends on how far apart the expected and actual wins are.

STEP 4: go to STEP 2

In order to actually do this, you need (in step 2) to be able to turn ratings into expected wins (i.e. probabilities). You've got lots of choices for this, but here is the function I'm using:

Win Probability = 1 / (1 + e^(-.15*(rating difference)))

[Figure: plot of win probability against rating difference]

Because its output is supposed to be a probability, you have to pick a function that always returns numbers between 0 and 1. And because the two teams' probabilities have to add to one in a given game, the function has to be symmetric about the point (0, 1/2). But you could use any function that satisfies those two properties and get a reasonable set of ratings. There are some philosophical/theoretical reasons why you might want to pick one such function over another such function, but I'm going to skip that discussion.
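The two requirements are easy to check numerically. The sketch below is just the formula above written as a Python function (the name `win_probability` and the `scale` parameter are my own labels); it prints a team's probability next to its opponent's for a few rating differences taken from the Pitt table.

```python
import math

def win_probability(rating_diff, scale=0.15):
    """1 / (1 + e^(-scale * rating_diff)), with the post's constant of 0.15."""
    return 1.0 / (1.0 + math.exp(-scale * rating_diff))

# Rating differences taken from the Pitt table (Rutgers, Syracuse, Cincinnati),
# plus an even matchup.
for diff in (0.0, 6.19, 18.40, -5.72):
    p_team = win_probability(diff)
    p_opponent = win_probability(-diff)
    print(f"diff={diff:+6.2f}  team={p_team:.3f}  opponent={p_opponent:.3f}  "
          f"sum={p_team + p_opponent:.3f}")
```

An even matchup gives 0.5, every output stays strictly between 0 and 1, and the two sides of any game always sum to 1.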

So we start with every team having a rating of zero. Then we run through every team's schedule. Here is Pitt again:

| Opponent | OppRating | RatingDiff | ExpWins | Wins |
|---|---|---|---|---|
| BowlingGreenState | 0 | 0 | .5 | 0 |
| Buffalo | 0 | 0 | .5 | 1 |
| Iowa | 0 | 0 | .5 | 1 |
| Syracuse | 0 | 0 | .5 | 1 |
| SouthFlorida | 0 | 0 | .5 | 1 |
| Navy | 0 | 0 | .5 | 1 |
| Rutgers | 0 | 0 | .5 | 0 |
| NotreDame | 0 | 0 | .5 | 1 |
| Louisville | 0 | 0 | .5 | 1 |
| Cincinnati | 0 | 0 | .5 | 0 |
| WestVirginia | 0 | 0 | .5 | 1 |
| Connecticut | 0 | 0 | .5 | 1 |
| TOTAL | | | 6.0 | 9 |

Pitt won more games (9) than this set of ratings thinks it should have (6), so we need to increase Pitt's rating. In particular, let's go ahead and increase Pitt's rating by the difference between the expected and actual wins. So Pitt's new rating is +3.0. Meanwhile, of course, Syracuse got a new rating (which is negative), Cincinnati got a new rating (which is positive), and so did every other team. So we work back through the process with the new set of ratings:

| Opponent | OppRating | RatingDiff | ExpWins | Wins |
|---|---|---|---|---|
| BowlingGreenState | 0.00 | 3.00 | 0.611 | 0 |
| Buffalo | 1.50 | 1.50 | 0.556 | 1 |
| Iowa | 2.00 | 1.00 | 0.537 | 1 |
| Syracuse | -3.00 | 6.00 | 0.711 | 1 |
| SouthFlorida | 1.00 | 2.00 | 0.574 | 1 |
| Navy | 2.00 | 1.00 | 0.537 | 1 |
| Rutgers | 1.00 | 2.00 | 0.574 | 0 |
| NotreDame | 0.00 | 3.00 | 0.611 | 1 |
| Louisville | -1.00 | 4.00 | 0.646 | 1 |
| Cincinnati | 4.50 | -1.50 | 0.444 | 0 |
| WestVirginia | 2.00 | 1.00 | 0.537 | 1 |
| Connecticut | 1.00 | 2.00 | 0.574 | 1 |
| TOTAL | | | 6.914 | 9 |

What this says is that, at this point, it looks like giving Pitt a rating of +3.0 isn't high enough. It underestimates their wins by about 2.086. So we increase Pitt's rating by that much, to +5.086. Then we go through the same process for every other team, and we evaluate Pitt yet again in light of the revised estimates:

| Opponent | OppRating | RatingDiff | ExpWins | Wins |
|---|---|---|---|---|
| BowlingGreenState | -0.21 | 5.30 | 0.689 | 0 |
| Buffalo | 2.37 | 2.72 | 0.601 | 1 |
| Iowa | 2.89 | 2.19 | 0.582 | 1 |
| Syracuse | -4.68 | 9.76 | 0.812 | 1 |
| SouthFlorida | 1.30 | 3.79 | 0.638 | 1 |
| Navy | 2.77 | 2.32 | 0.586 | 1 |
| Rutgers | 1.45 | 3.64 | 0.633 | 0 |
| NotreDame | -0.05 | 5.13 | 0.684 | 1 |
| Louisville | -1.82 | 6.91 | 0.738 | 1 |
| Cincinnati | 6.71 | -1.62 | 0.440 | 0 |
| WestVirginia | 2.88 | 2.21 | 0.582 | 1 |
| Connecticut | 1.30 | 3.79 | 0.638 | 1 |
| TOTAL | | | 7.622 | 9 |

Repeat this process another couple of hundred times, and the ratings will eventually stop changing. When that happens, Pitt's rating will be at about 11.75 and its expected wins will exactly equal 9.
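The four steps and the passes worked through above can be sketched in Python roughly as follows. This is an illustration on a made-up four-team league (`toy_games`), not the 2008 season data, and the function names, the stopping tolerance, and the iteration cap are all my own choices. As in the walkthrough, every team's rating is updated from the same set of old ratings, and each rating moves by exactly the gap between actual and expected wins.

```python
import math

def win_probability(rating_diff, scale=0.15):
    """Logistic win probability with the post's 0.15 constant."""
    return 1.0 / (1.0 + math.exp(-scale * rating_diff))

def expected_wins(ratings, games):
    """Sum each team's win probabilities over all of its games."""
    totals = {team: 0.0 for team in ratings}
    for winner, loser in games:
        p = win_probability(ratings[winner] - ratings[loser])
        totals[winner] += p
        totals[loser] += 1.0 - p
    return totals

def fit_ratings(games, tolerance=1e-9, max_iterations=100_000):
    """games is a list of (winner, loser) pairs; returns {team: rating}."""
    teams = sorted({team for game in games for team in game})
    actual = {team: 0 for team in teams}
    for winner, _ in games:
        actual[winner] += 1
    ratings = {team: 0.0 for team in teams}            # STEP 1
    for _ in range(max_iterations):                    # STEP 4: loop back
        expected = expected_wins(ratings, games)       # STEP 2
        gaps = {team: actual[team] - expected[team] for team in teams}
        if max(abs(gap) for gap in gaps.values()) < tolerance:
            break                                      # everything balances
        for team in teams:                             # STEP 3
            ratings[team] += gaps[team]
    return ratings

# Hypothetical round robin: every team has at least one win and one loss.
toy_games = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "A"), ("B", "D")]
toy_ratings = fit_ratings(toy_games)
for team, rating in sorted(toy_ratings.items()):
    print(f"{team}: {rating:+.2f}")
```

The toy league deliberately contains no undefeated or winless team. With one, the loop would never reach the tolerance and would simply run until it hits `max_iterations`.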

You can see how this method implicitly adjusts for strength of schedule. If two teams, like say Ball State and Florida, had the same record, then they must have the same number of expected wins also. But Florida played a collection of teams with much higher ratings. So the only way to **expect** Florida to have as many wins as Ball State against a tougher schedule is to give Florida a higher rating. So even though there's no point in the process where you can say, *THIS is where I'm adjusting for strength of schedule*, the adjustment is built into the system.

What's more, I think this method does a better job of adjusting for strength of schedule than some other methods do. The SRS, for example, is based on averages, so as far as the SRS is concerned, if you play a +10 and a -10, that's the same as playing two zeros. But in terms of win expectancies, that's far from true. If you're a +8, then a +10 and a -10 is a much tougher schedule than two zeros. If you're a -8, then you'd probably rather play the +10 and the -10.
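Here is a quick numerical check of that claim, as a Python sketch using the same logistic function and 0.15 constant as the rest of the post. The helper `slate_expected_wins` is a name invented for this example.

```python
import math

def win_probability(rating_diff, scale=0.15):
    """Logistic win probability with the post's 0.15 constant."""
    return 1.0 / (1.0 + math.exp(-scale * rating_diff))

def slate_expected_wins(team_rating, opponent_ratings):
    """Expected wins for one team against a list of opponent ratings."""
    return sum(win_probability(team_rating - opp) for opp in opponent_ratings)

split_slate = [10.0, -10.0]   # one strong opponent, one weak opponent
even_slate = [0.0, 0.0]       # two average opponents, same average rating

for team_rating in (8.0, -8.0):
    split = slate_expected_wins(team_rating, split_slate)
    even = slate_expected_wins(team_rating, even_slate)
    print(f"team {team_rating:+.0f}: split slate {split:.3f} wins, "
          f"even slate {even:.3f} wins")
```

For the +8 team the split slate yields fewer expected wins than the two zeros, so it is the tougher schedule. For the -8 team the comparison flips.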

And the Elo method captures this. A relevant example is that of Oklahoma and Texas just prior to the Big XII championship game. Both teams were 11-1, and the average rating of Texas' opponents was higher than the average rating of OU's opponents. Yet Oklahoma was still rated higher than Texas by this method. Why? Because OU's nonconference slate featured two good teams and two awful teams, whereas Texas's featured four fairly average teams. If you're as good as OU and UT were this year, the difference between Arkansas and Washington is almost negligible, but the difference between Rice and TCU is not. Even though the average of Rice's and Arkansas's rating is higher than the average of Washington's and TCU's, TCU/Washington is the tougher slate. Elo's ability to measure strength of schedule in a team-specific way is one of its strengths.

But the Elo method, as described above, has a weakness. It's the same issue that causes problems with the Maximum Likelihood Method. Namely, undefeated teams will inevitably end up with unreasonably high ratings. Think back to the iteration process above. No matter how high a rating you give Utah or Boise State, they are always going to have more wins than expected wins, so you're going to keep turning their dial up. And once you've turned Utah's dial way, way up, then it wants to start turning up the dials of all of Utah's opponents to keep things in balance. And eventually, inevitably, BYU is rated ahead of Florida and USC.
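A stripped-down Python sketch shows the runaway. It makes a big simplifying assumption that the real procedure does not: all twelve opponents of a hypothetical 12-0 team are pinned at a rating of zero, so only the undefeated team's dial moves. The function name `undefeated_rating` is made up for this illustration.

```python
import math

def win_probability(rating_diff, scale=0.15):
    """Logistic win probability with the post's 0.15 constant."""
    return 1.0 / (1.0 + math.exp(-scale * rating_diff))

def undefeated_rating(num_games, iterations):
    """Rating of a team that won every game, after the given number of updates.

    Simplifying assumption: every opponent stays fixed at a rating of 0.
    """
    rating = 0.0
    for _ in range(iterations):
        expected = num_games * win_probability(rating - 0.0)
        rating += num_games - expected   # expected < num_games, so always positive
    return rating

checkpoints = [10, 100, 1_000, 10_000]
history = [undefeated_rating(12, n) for n in checkpoints]
for n, rating in zip(checkpoints, history):
    print(f"after {n:>6} iterations: rating {rating:.1f}")
```

The rating climbs more and more slowly but never settles, because the win probability never reaches 1 and so expected wins never catch up to 12.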

This disease can be cured. There are lots of ways to deal with it. But all the cures rob the method of its mathematical elegance. I was going to talk a bit about the fixes, but then I realized I was repeating virtually this entire post, where I discussed the same issue as it pertains to the Maximum Likelihood Method.

Great stuff, Doug.

A link to the '08 Mease rankings: http://www.davemease.com/football/

Jeff Sonas (the guy who ended up making ChessMetrics) did some work on "fixing" Elo by number crunching actual chess results, and the result was simpler and more predictive than Elo ratings. I found the article I was thinking of here. Football may have its own set of rules...


In football, you could attempt to measure the quality of the win by the score:

score = floor( (winningScore-losingScore)/21 )

or something like that. That'd get around the whole perfect record issue, unless a team won or lost all games by more than the denominator...

And of course you can complicate things to your heart's content, giving different weights to different weeks, different values for away games across the country, or in Denver, or in cold weather stadiums, and so on.


I did some playing around with pro football data and it was interesting, the sort of things you can find. I intentionally made the ratings very slow changing and it was like a football dynasty timeline. I broke points scored and points allowed into two separate ratings (creating two virtual games for each real game), giving something like an offense rating and a defense rating. The possibilities are endless!

err, I screwed up the formula, but what I intended was a score near 0.5 for narrow victories and near 1 for more impressive victories. 🙂

MattieShoes:

That's what I've been doing for my ELO NFL ratings program. A win is Minimum(1, 0.7 + (Winner Score - Loser Score)/(28/0.3)). And it comes out as pretty accurate: in the later weeks of the year the higher rated team wins about 70% of the time.
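For anyone who wants to play with that formula, here is a minimal Python sketch of it as written in the comment above. The function name `win_value` is made up, and nothing here says how the commenter's program scores the losing side.

```python
def win_value(winner_score, loser_score):
    """Credit for a win: 0.7 plus a margin bonus, capped at 1.

    The bonus reaches its full 0.3 at a 28-point margin.
    """
    margin = winner_score - loser_score
    return min(1.0, 0.7 + margin / (28 / 0.3))

for winner_score, loser_score in [(20, 17), (31, 17), (35, 7), (42, 0)]:
    value = win_value(winner_score, loser_score)
    print(f"{winner_score}-{loser_score}: {value:.3f}")
```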

A big problem with including win margin is that your result ends up biased against teams that play at a slower pace. Margin of victory is not independent of game pace. Having a biased rating system in place is a problem, especially when the bias is tied to a nearly invisible variable (there aren't many high profile places that track and report meaningful possessions per game for all the D1 teams). The bias issue is the main reason MOV was removed from the computer polls.

Sorry, didn't notice another "Nick" posting. The 7:10 am post is a different "Nick" from the 8:34 post.

Nick #2 makes a good point, although there are plenty of ways to deal with that bias (those ways have problems of their own, of course).


Also, I'm not sure about this.

I am not convinced that anyone with decision-making authority within the BCS is even capable of understanding your comment #5. I always assumed MOV was removed to prevent running up the score, or the appearance thereof.

I've actually been thinking about this for a bit, and I believe the ideal ranking method would be a mix of the ELO_CHESS and the SRS. I think the "averaging" component of the SRS that makes facing a +10.0 and a -10.0 equal to facing two 0.0s is a legitimate problem. However, ELO obviously misses the boat by ignoring MOV. I agree that +/- differential is biased. I like the sliding scale method (huge win = .95 wins, regular win = .7 wins, etc.), but think we should use Pythagorean differentials instead of straight differentials. That way we don't penalize the Alabamas of the world. We can also cap the Pythagorean method to avoid running up the score issues.

I don't think the decision makers understand the statement at all, but in 2002 or so the statistical advisers to those decision makers definitely recommended that MOV be removed from the BCS computer algorithms because of the biasing issue. I'm certain there were other considerations (fear of running up the score for one, and the results of the previous years' polls were also important, though both are related to the biasing issue), but the BCS committee was told about the biasing issue and their advisers recommended MOV be removed for simplicity's sake.

Running up the score is related to the biasing issue, as fast paced teams are much more capable of creating larger margins of victory and thus creating the perception of running up the score. So a system that accounts for MOV without accounting for the number of possessions in the game will naturally encourage teams to play at a faster pace.

I think ratings wonks (self included) use the term "margin of victory" as a shorthand for "game score."


There are lots of ways to take the score of the game into account other than simply looking at the difference, and not all of them are subject to the same kind of bias that straight margin is. When I say MOV, it's just my lazy way of saying "some sort of function that measures the degree of victory."