
Sample size vs. sample relevance

Posted by Doug on January 12, 2009

On Sunday morning, we posted some thoughts about the Titans' decision to kick a field goal instead of going for it on 4th-and-inches. Among the many considerations, perhaps the most important was this: what was the Ravens' chance of scoring on the ensuing drive? There was discussion of that in the post, and more in the comments, and the main disagreement centered on whether the Ravens' previous ten drives from that very game were a better barometer of their chances of scoring than, say, the few hundred drives in which the Ravens offense and Titans defense took part during the 2008 season.

And this is an issue with broader applicability. You run into it all the time when you're trying to assess probabilities having to do with football (and probably other things too, but I wouldn't know anything about that).

Take Chris Johnson, for example. How is his career going to develop? The obvious thing to try is to find guys from the past who look like Chris Johnson looks now --- first round pick, very good rookie season, obscenely fast, playing on a successful team with a relatively conservative offense --- and see how they turned out.

The problem is trying to figure out exactly which guys "looked like Chris Johnson looks now." You might find a small handful of guys who resemble Johnson very closely. But estimates based on small handfuls tend not to be very reliable. So you're tempted to expand the sample size. And you can do that. But the price you pay is that you have to stretch the definition of what it means to be similar to Chris Johnson. If you want a bigger sample, the players in it will necessarily be, on average, less similar to Chris Johnson. Big sample is good, but less similar is bad. More similar is good, but small sample is bad. Where is the sweet spot? There is rarely an obvious answer.

And it's the same with estimating the Ravens' chances of scoring a field goal when starting from their own 30ish yard line at about the 4:00 mark of the fourth quarter on January 10th, 2009, against the Titans. Just as Chris Johnson is a special snowflake, so too is that Ravens drive at the 4:00 mark of the fourth quarter on January 10th, 2009.

The small-but-highly-relevant sample would be the ten Raven drives from earlier in that game. A bigger-but-less-relevant sample would be all Raven drives from the entire year. But there are a whole lot of drives in there that took place in better weather, against worse defenses, without hostile crowds that, you know, make it more difficult to snap the ball before the play clock expires.

You could limit your comparison to Raven drives against good defenses, on the road, in a tie game. If you did so, even if your sample was big enough (which it wouldn't be at that point), you'd still be missing the possibility that there was something special about this good defense in these field/crowd/weather conditions on this day.

You get the point.

I'm not going to attempt to estimate the Ravens' chances of scoring on that drive. Instead, I'm going to look at a similar but broader question, and I'll let you decide how or if it applies to this situation. The question is this: when a team plays bad offense for three quarters, does it play bad offense in the fourth?

Here's what I did.

I looked at every game from 1983 to 2007 that was tied after three quarters. I restricted the sample to tied-after-three-quarters games to eliminate, as much as possible, the possibility that garbage-time points would pollute the results. Since I've got a quarter-century's worth of games, I can do that and still have a decent-sized sample: more than 300 games.

To establish an expectation for a given team's offense, I used the Vegas line and over-under to generate a projected point total for each team. For example, the Ravens were a 3-point underdog against the Titans and the over-under was 33.5. Putting these together implies an expectation of about 15.25 points for Baltimore (and 18.25 for the Titans). [yes, we now have Vegas lines back to 1983, and yes, they will be integrated into the site soon.]
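The arithmetic behind those implied totals can be sketched in a few lines. This is just the standard line-and-total algebra described above, not code from the site; the function name is mine.

```python
# Sketch of the implied-total arithmetic: for a game with over/under T and
# a point spread S (S > 0 meaning the team in question is the underdog),
# the two teams' implied totals are (T - S)/2 and (T + S)/2.

def implied_totals(over_under, spread):
    """Return (underdog_points, favorite_points) implied by the Vegas line."""
    underdog = (over_under - spread) / 2
    favorite = (over_under + spread) / 2
    return underdog, favorite

# Ravens +3 vs. Titans, over/under 33.5:
ravens, titans = implied_totals(33.5, 3)
# ravens = 15.25, titans = 18.25
```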

In all those games, the teams scored, on average, 68.8% of their points in the first three quarters, and 31.2% in the fourth (I ignored overtimes).

Therefore I declared that an offense was having a Bad Day if, during the first three quarters of their game, they scored less than 68.8% of their projected point total. Otherwise, they were having a Good Day.
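The Good Day / Bad Day rule can be sketched as follows. This is my reconstruction of the classification just described, not the actual study code; the function and the example inputs are hypothetical.

```python
# Sketch of the Good Day / Bad Day classification. The constant 0.688 is
# the league-wide share of points scored in the first three quarters of
# tied-after-three games (from above); a team scoring less than that share
# of its projected total through three quarters is having a Bad Day.

FIRST_THREE_SHARE = 0.688

def classify(projected_total, points_through_q3):
    """Label an offense's day from its first-three-quarters output."""
    expected_q1_q3 = FIRST_THREE_SHARE * projected_total
    return "bad" if points_through_q3 < expected_q1_q3 else "good"

# Baltimore projected 15.25 points; expected through three quarters is
# 0.688 * 15.25, about 10.5. So, say, 7 points through three quarters
# would count as a Bad Day, and 12 as a Good Day.
```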

Then I looked at the actual fourth quarter points for each team, and compared that to the projected fourth quarter points (which would be 31.2% of the overall projected points). Here are the averages for each group:

Teams that were having a bad day through three quarters
First three quarters: 5.00 points below their projected first-3-quarters total. (1.67 points per quarter)
Fourth quarter: 0.41 points below projected 4th-quarter total.

Teams that were having a good day through three quarters
First three quarters: 6.28 points above their projected first-3-quarters total. (2.09 points per quarter)
Fourth quarter: 0.64 points above projected 4th-quarter total.

The Bad Day teams went from 1.67 points per quarter below expectation to 0.41 points per quarter below expectation. The Good Day teams went from 2.09 points per quarter above expectation to 0.64 points per quarter above. In both cases, the teams' fourth-quarter performances, on average, landed roughly 25% to 30% of the way between their pre-game expectation and their in-that-game pace, much closer to the pre-game expectation.
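A quick arithmetic check of that claim, using the averages reported above: the fourth-quarter deviation, expressed as a fraction of the first-three-quarters per-quarter deviation.

```python
# Fourth-quarter deviation as a fraction of the through-three-quarters
# per-quarter deviation, using the averages reported above.

bad_day_fraction = 0.41 / 1.67    # Bad Day teams, about 0.25
good_day_fraction = 0.64 / 2.09   # Good Day teams, about 0.31
```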

So the pre-game estimate seems to be a bit better than the game-so-far estimate, if you had to pick just one of them. But it might be wrong not to adjust it at all based on the results of the first three quarters.