Three kinds of liars

There are three kinds of liars:
  1. liars,
  2. damn liars,
  3. people who lie with statistics.
I write a lot of material that makes use of statistics, so I've heard this old quote (which, you'll note, I've altered slightly) many times. But I'm here to tell you that statistics do not lie. Statistics cannot lie. Statistics are nothing more than recorded observations. That's all.

People, on the other hand, can and do lie. And they often use statistics as an aid, which is why the "damn lies" quote is so popular. I thoroughly approve of heavy doses of skepticism when reading through statistical material (or anything else, for that matter), but there's no reason to throw out the good with the bad. What I'd like to do in this article is to make you aware of some of the ways people can mislead you -- sometimes intentionally and sometimes not -- using statistics. Armed with this knowledge, you can distinguish the damn liars from the people who (while they may be boring and even a bit geeky) are legitimately attempting to convey useful information through the use of numbers.

Like a drunk uses a lamp post

My favorite quote about statistics (I wish I could remember where I read this):
"The problem arises when people use statistics like a drunk uses a lamp post: for support instead of illumination."
This is by far the most important thing to keep in mind when reading through a statistical study. Illumination means looking at all available evidence to help answer a question. Support means finding and citing particular statistics that support a point that's already been decided upon by the author.

If, when reading through an article, you get the idea that the author had made up his/her mind on the issue before ever looking at the numbers, you should proceed with extreme skepticism. If statistics are not brought in until after a conclusion has been drawn, there is a good chance that you're getting the truth but not the whole truth.

For example, suppose I were trying to sell you on the theory that RBs who get a lot of carries one season are likely to get hurt the following year. I could rant and rave for a few paragraphs about how the human body simply isn't made to withstand the punishment that workhorse NFL backs get, and that they are therefore more susceptible to future injury. Then, to drive home the point, I'd produce this:
"Over the last two years, 17 RBs have gotten 275 or more carries. 11 of those 17 missed time due to injury the following season. 6 of the 17 suffered serious season-ending and/or career-threatening injuries the next year."

Pretty convincing, huh? (It's true, by the way.) But I've misled you in several ways:

  1. It was technically 17 different RBs, but some of them had 275+ carries twice. So I was actually looking at 22 pairs of seasons. But 11 of 17 sounds more convincing than 11 of 22. And of course, I didn't show you the data, making it extremely difficult for you to check it out for yourself.

  2. I haven't told you what percentage of all RBs get injured. In fact, it turns out that 11 of 22 is a very typical injury rate among RBs (see this article for more details). So my stat above actually shows that RBs who carried the ball a lot in one year got injured at roughly the same rate as other RBs. It directly contradicts the point I was making.

  3. Why did I choose 275 as a cutoff instead of 300 or 250 or some other number? Well, if I choose 300 as my cutoff, I lose several of my injured players (Robert Edwards, Olandis Gary, Dorsey Levens, Stephen Davis). If I choose 250 as my cutoff, I let in too many extra healthy players. 275 was chosen specifically because it best serves my point. What's more, I chose 275 after looking at the data. This technique is called multiple endpoints (here is a more extensive article about multiple endpoints).

  4. Why did I look at just two years? Why not three or four or five? In situations like this, more data is almost always better. But I chose not to use more data. Why? Because more data, in this case, might have hurt my argument.

If you don't have to tell the whole story, you can cook up a stat or two that will support almost any point you want to make. If you want to honestly answer a question using statistics, you have to look at all available evidence. Or, failing that, at least acknowledge that more could be done.
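
To put numbers on points 1 and 3, here is a minimal Python sketch. The 11, 17, and 22 come straight from the example above. The SEASONS list, on the other hand, is invented purely for illustration (it is not the real carry and injury data), and it was built so that 275 comes out as the most flattering cutoff, which is exactly the trick being described.

```python
# Point 1: same 11 injuries, two different denominators.
injured = 11
distinct_backs = 17    # what I advertised
player_seasons = 22    # what I actually looked at

rate_as_quoted = injured / distinct_backs
rate_per_season = injured / player_seasons
print(f"11 of 17: {rate_as_quoted:.1%}")
print(f"11 of 22: {rate_per_season:.1%}")

# Point 3: shopping for a cutoff after seeing the data.
# HYPOTHETICAL (carries, injured the next year) pairs, made up for this sketch.
SEASONS = [
    (340, False), (325, False), (310, True), (305, False),
    (295, True), (290, True), (280, True), (276, True),
    (270, False), (265, False), (260, False), (255, True), (252, False),
]

def injury_rate(seasons, cutoff):
    """Share of seasons with at least `cutoff` carries that preceded an injury."""
    qualifying = [hurt for carries, hurt in seasons if carries >= cutoff]
    return sum(qualifying) / len(qualifying)

rates = {cutoff: injury_rate(SEASONS, cutoff) for cutoff in (250, 275, 300)}
best_cutoff = max(rates, key=rates.get)

for cutoff, rate in rates.items():
    print(f"cutoff {cutoff}: {rate:.1%} injured the next year")
print("most convenient cutoff for my argument:", best_cutoff)
```

The honest version of this analysis picks the cutoff before looking at the injuries, or reports every cutoff it tried.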

Greatness by association

"In the last 15 years, the only RBs to amass 1000 rushing yards, 700 receiving yards, and 9 TDs in a season are Marshall Faulk and Tiki Barber."

It's true. And it sure makes Tiki look good. Think of all the truly great RBs that have come and gone in 15 years, but none of them (except Faulk) could do what Tiki did last year.

There's really nothing wrong with this comment, as long as you recognize it for what it is: essentially meaningless trivia. Did Barber have a fine season last year? Yes. But this blurb somehow implies that it was a truly special season, which it wasn't.

Again, notice that the cutoffs (15, 1000, 700, 9) are specifically crafted to allow Tiki in while keeping others out.

  1. Roger Craig did it 16 years ago.
  2. If you reduce the 700 receiving yards to 600, you let in James Brooks, Ricky Watters, Charlie Garner, and three different Thurman Thomas seasons.
  3. If you only require 1700 total yards (not specifically 1000 and 700) along with the 9 TDs, then the feat has been accomplished 55 times in the last 15 years.

The important thing to realize is that you can put almost anyone in a class with elite players if you choose just the right categories and just the right cutoffs.
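
Here is a rough Python sketch of how such a blurb gets manufactured. Everything in it is an assumption made for illustration: the player names and stat lines are invented (they are not real seasons), and the grid of candidate cutoffs is arbitrary. The code simply tries every combination of cutoffs and keeps the one that lets "Target" in alongside the fewest other players.

```python
from itertools import product

# HYPOTHETICAL (rushing yards, receiving yards, TDs) lines, made up for this sketch.
SEASONS = {
    "Star A": (1400, 800, 20),
    "Star B": (1600, 300, 15),
    "Star C": (1300, 650, 12),
    "Target": (1010, 720, 9),
    "Back D": (1100, 710, 8),
    "Back E": (900, 750, 10),
}

def club(rush_min, rec_min, td_min):
    """Names of the players who clear all three cutoffs."""
    return [name for name, (rush, rec, td) in SEASONS.items()
            if rush >= rush_min and rec >= rec_min and td >= td_min]

# Try every combination of candidate cutoffs, keep the ones that include
# the player being promoted, then pick the most exclusive-looking club.
candidates = product((900, 1000, 1100), (500, 600, 700), (8, 9, 10))
flattering = [(cut, club(*cut)) for cut in candidates if "Target" in club(*cut)]
best_cutoffs, best_club = min(flattering, key=lambda pair: len(pair[1]))

# A less contrived standard: total yards instead of a rushing/receiving split.
combined = [name for name, (rush, rec, td) in SEASONS.items()
            if rush + rec >= 1700 and td >= 9]

print("hand-picked cutoffs:", best_cutoffs, "->", best_club)
print("1700 total yards and 9 TDs ->", combined)
```

With this made-up data, the search lands on cutoffs that leave the target paired with the one genuine star, while the simpler combined-yardage standard lets in a longer and less flattering list.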

Ed McCaffrey?

"The only WRs with at least 1000 yards and 7 TDs each of the last three seasons are Randy Moss, Cris Carter, and Ed McCaffrey."

Ed McCaffrey is a very good WR, but he's not in a class with Moss and Carter as the above quote implies. Tinker with the cutoffs a little, and you'll get guys that are actually more comparable to McCaffrey. Make it two years, 1000 yards, and 6 TDs, and now you've just added Jimmy Smith, Tim Brown, Isaac Bruce, Marvin Harrison, Amani Toomer, and Muhsin Muhammad to the list. Doesn't sound quite so impressive anymore, but by setting cutoffs so that McCaffrey is in the middle, rather than at the bottom, of the list, we get a more realistic assessment of McCaffrey's achievements.

Keenan McCardell?

"The only players with 60 receptions and 850 yards in each of the last 5 seasons are Jimmy Smith, Tim Brown, Cris Carter, and Keenan McCardell."

Hell, if I get a little creative, I can even make Kevin Faulk look good:

"The only players under 25 years old last year who led their team in rushing and had over 450 yards receiving were Edgerrin James, Ahman Green, and Kevin Faulk."

My dad can beat up your sister

Another common ploy for making a player look good is to selectively compare his accomplishments to those of several other players at the same time.

"Last year, Tyrone Wheatley had more rushing yards than Ricky Williams, more rushing TDs than Robert Smith, and more receiving yards than Emmitt Smith."

The key is to select backs who were better than Wheatley last year, but then pick the weakest part of each of their games before comparing with Wheatley. Robert Smith had a great year last year, but only had seven rushing TDs. That's his weakest link. Ricky Williams' yardage total was suppressed by an injury. Emmitt Smith had only 79 receiving yards.

It also doesn't hurt that Williams and Emmitt have a great deal of name recognition. If you're not paying close attention, you might read that and think, "wow, I didn't realize Wheatley is right up there with all those great backs," which is the intended effect.

Correlation and causation

This one is made up, but you've heard things like it many times before:

"The Cowboys are 63-1 when Emmitt Smith carries the ball 25 or more times."

First, note that you're only getting half the story. What's the Cowboys' record when Emmitt doesn't get 25 carries? But that's not the main issue here.

The author of the (fictitious) quote above is trying to convince you that giving Emmitt a lot of carries helps the Cowboys win. In pictures:

Emmitt gets lots of carries  ========>  Cowboys win

But isn't it possible that the arrow is pointing in the wrong direction? Maybe what's actually happening is that, whenever the Cowboys have the game wrapped up, they give Emmitt a lot of carries at the end to kill the clock. That is,

Cowboys win  ========>  Emmitt gets lots of carries

Which is it? I don't know, and the quote above gives you no way to tell. In short, just because two things (like Cowboys wins and big Emmitt games) are related -- even strongly related -- does not necessarily mean that one causes the other.

The classic (non-football) example of this is that ice cream sales are correlated with violent crime. It's a fact. In months where ice cream sales are high, violent crime rates are also high. When ice cream sales are down, violent crime is down. Does this mean that ice cream causes crime? Maybe, but probably not. The more likely explanation is that some third factor (the weather, say) drives both.
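
A small simulation makes the point concrete. This is only a sketch under assumptions I've made up: "temperature" is random noise, and ice cream sales and crime are each just temperature plus their own independent noise, so by construction neither one affects the other. The helper names (pearson, residuals) are my own, not from any library.

```python
import random

def mean(values):
    return sum(values) / len(values)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing the best-fit linear effect of xs."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# SIMULATED data: temperature drives both series; they never touch each other.
temperature = [rng.gauss(0, 1) for _ in range(5000)]
ice_cream = [t + rng.gauss(0, 1) for t in temperature]
crime = [t + rng.gauss(0, 1) for t in temperature]

raw = pearson(ice_cream, crime)
adjusted = pearson(residuals(ice_cream, temperature),
                   residuals(crime, temperature))

print(f"ice cream vs. crime, raw:                  {raw:.2f}")
print(f"ice cream vs. crime, temperature removed:  {adjusted:.2f}")
```

The raw correlation comes out clearly positive even though the two series are unrelated by design, and it drops to roughly zero once the shared temperature effect is subtracted out.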

To bring this back to football, suppose I produced irrefutable evidence that players who changed teams were more likely to have their numbers drop than players who didn't (I don't know if this is true or not, but suppose it is). Does this mean that changing teams hurts a player's stats? Maybe, but maybe not. Ask yourself if there might be another factor at work affecting both. Age might be such a factor. Maybe players who switch teams are more likely to be old and players whose numbers drop are also more likely to be old. It's possible that age is what's causing the correlation and that the team-switching has absolutely nothing to do with it.

Another example: suppose it were true that players with high salaries are more likely to be injured than players with low salaries (I don't know if this is true or not, but suppose it is). Does this mean that Eric Moulds and his new contract should be avoided? Maybe, but probably not. More plausible, I think, is that quarterbacks are more likely to have high salaries and quarterbacks are more likely to get injured. That's probably where the correlation is coming from.

Wrapup

These are just a few of the ways that statistics can become damn lies. Often, the misuse of statistics is much more subtle than in the examples I've shown above. But if you keep your eyes peeled, you'll spot some of these techniques at work.

I wish I could say I'd never perpetrated a damn lie, but I can't. I wish I could say I'll never do it again, but I probably will. Like all human beings, I have biases -- some that I'm aware of and some that I'm not -- and these can creep into the work I do. What I can say is that I've never knowingly told you a damn lie. And the best way to make sure that I never tell you one in the future is to let you know how to spot them and invite you to question everything I do and my reasons for doing it.