Volume 24: The 2018 Election, Who Projected It Best?
A log-loss comparative analysis of quantitative and qualitative 2018 U.S. House of Representatives election projections
“Well, how did your projections do?” – Dale Cohodes.
It will come as a shock to nobody that I maintained a personal set of projections for the recently completed elections to the House of Representatives. It may surprise you more to know that reviewing my projections alongside the so-called “professionals” gives us an excellent opportunity to think through one of our favorite topics—probability. Elections are an interesting class of random event: probabilistic with a single trial and a discrete outcome. The tools we have to predict their outcome—polls, demographics, past voting patterns—result in distributions that include deviations from a mean. But no matter how much we’d like to, we cannot re-run the recent election in Georgia’s 7th or North Carolina’s 9th Congressional district, even though each was decided by fewer than 1,000 votes. And no matter how small the margin, the candidate with a plurality of the votes wins; a margin of 10,000 votes or 1,000,000 votes results in the same practical outcome. Elections are fundamentally different from random processes like flipping a coin or tomorrow’s high temperature.
Because of this, a simple question about forecast quality can be extended to provide insight into the general nature of probabilistic forecasts.
• What’s a good probabilistic forecast?
• Whose House projections were the best?
What’s a good probabilistic forecast?
Let’s start with the basics. We define a probabilistic forecast as a statement of the likelihood of the occurrence of a discrete event, made by a person (the forecaster) before the event is decided.(1) When a sports handicapper says the Wolverines have a 75% chance of winning their next game, that is a probabilistic forecast. When your local weatherman says there is a 50% chance of rain tomorrow, that is also a probabilistic forecast.(2)
We already know that probabilistic thinking is a skill the human mind does not necessarily possess. We are not good at translating concepts like “possible,” “likely,” and “almost certain” into quantitative likelihoods of occurrence. If we are told that the probability of something happening is 80%, and it doesn’t occur, we are frequently quite distraught. And maybe we should be. But a forecaster who places an 80% probability on an event that always happens is also not doing a very good job. Saying that there is an 80% chance of the sun rising tomorrow is not a show of forecasting skill, but rather a lack thereof.
So how do we know a good probabilistic forecast when we see one? Consider a weatherman(3) who says that there is a 50% chance of rain on Tuesday. If it rains, then the weatherman wasn’t wrong; it was clearly something in the realm of possibilities. But the rub is that, if the same prediction is made the following day, and the sun in fact comes out, the forecast is equally good—and equally bad. Over the two-day span, the forecasts did not add any informational value. A weather forecast that says day after day after day that the chance of rain is 50% is useless. Such a weatherman would soon be exited from your local television station, and they should be.
But let’s move to Phoenix, where it rains only 10% of the time on average.(4) Now a forecast showing a 50% chance of rain that is borne out is a fantastic one. On the other hand, if it doesn’t rain, then it isn’t such a bad prediction, as it almost never rains in Phoenix. A 2-day forecast showing a 50% chance of rain each day, one day of which is borne out, has a lot of value in the desert. Which brings us to a principle: the quality of a forecast depends on how different it is from the probability that would be assigned in its absence.(5) Showing a few different sets of Phoenix predictions gives us more information on which weathermen should keep their jobs.
First, let’s check against our prior. It rains in Phoenix 10% of the time, and we had one shower in ten days. Check; our expectations for long-run rain held out.(6)
Let’s start with Weatherman Ugly. These were some bad predictions. Not only did he think rain was likely on five dry days, but he also put a probability of 0% on the one day where it did rain.(7) This man is bad at his job; listening to him is literally worse than just going with the long-term average of 10% chance of rain every day.
Which is precisely what our Bad Weatherman did. These predictions were not so bad as his Uglier brother-in-forecasting, but they are also essentially useless. You don’t need a degree in meteorology or fancy weather radar to make these predictions. He should still be fired.
On the other hand, our Good Weatherman in fact did some strong work. It rained on one of the three days on which he thought it might rain; 33% realization on a 40% prediction isn’t bad. He also confidently predicted no rain on seven days and was correct on each. Using these predictions is far superior to simply relying on the long-run average.
Before we finally describe our metric for the quality of a probabilistic forecast,(8) let’s run through one more set of forecasters. For this, we go back to a wet sub-tropical climate where we can expect rain 50% of the time.
Our Bad Forecaster…well, he’s still doing his thing, going with the historical average. I’m getting tired of this guy. Our Good Weatherman puts up a reasonable showing. When he predicts a 60% chance of rain, it rains; when he predicts a 40% chance of rain, it doesn’t. It almost seems that he is better at this job than he thinks he is. The days are segregated properly, but he lacks confidence. And this is what makes our Better Forecaster better. Even though each individual day is still far from certain, these predictions are clearly better than the previous set. Perhaps these predictions should also be more confident; after all, on days where rain was likely, she was right 100% of the time, not 80%. But we’ve come far enough to state two principles of probabilistic forecasting:
A probabilistic forecast is “good” if it is better than a relevant, uninformed estimate.
A more certain forecast is better than a less certain forecast if it is correct.
Now for our very simple metric of the quality of a probabilistic forecast: log-loss.(9) For a probabilistic forecast with probability p,
If the event occurs,
Log-loss = -1 * log ( p )
If the event does not occur,
Log-loss = -1 * log ( 1 – p )
That’s it. Just be sure that for your “log,” you use the natural log (base e). Also, don’t try to use a probability of exactly 0% or 100%, since the log of zero is undefined; nudge the forecast away from the extreme by a very small number of your choice.(10) Rather than describe what this looks like, let’s visualize it:
The first thing that we see is that log-loss is a penalty. See those big numbers for low probability events that occur (and high probability events that don’t)? You don’t want to be out there. Don’t make especially confident predictions that don’t come true. The two lines intersect at a probability of exactly 50%. A 50% forecast, like calling a coin flip, draws the same penalty whichever way the event turns out.
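For readers who would rather poke at numbers than squint at a chart, here is a minimal Python sketch of the two-branch penalty defined above. The function name `log_loss` and the clipping value `eps = 0.001` are my own illustrative assumptions, not anything taken from the analysis itself; the clipping just keeps a forecast of exactly 0% or 100% from producing an infinite penalty.

```python
import math

def log_loss(p, occurred, eps=0.001):
    """Natural-log penalty for a single forecast of probability p.

    eps is an assumed clipping value that keeps p away from exactly 0 or 1.
    """
    p = min(max(p, eps), 1.0 - eps)
    # -log(p) if the event occurred, -log(1 - p) if it did not
    return -math.log(p) if occurred else -math.log(1.0 - p)

# Penalties at a few forecast levels, for both possible outcomes
for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(f"p={p:.2f}  occurs: {log_loss(p, True):.3f}  "
          f"does not occur: {log_loss(p, False):.3f}")
```

At p = 0.50 both columns show 0.693, the natural log of 2, which is the crossing point of the two curves. Moving toward either extreme shrinks one penalty and inflates the other.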
Log-loss is especially useful when you sum it over several probabilistic forecasts.(11) We can do so for the first set of weather forecasters we considered earlier.
As we expected, our Good Forecaster did the best. One drawback of log-loss is that the number has little objective meaning. Our Good Weatherman had a total log-loss of 1.945, but that means nothing on its own. However, when we compare multiple sets of forecasts,(12) we can start to have some qualitative and quantitative opinions. Our Good Forecaster is much better than our Bad Forecaster, and Ugly is way off base.
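The summed scores can be rebuilt from the descriptions in the text. The sketch below makes several assumptions, so treat it as an illustration rather than the original spreadsheet: the Good Weatherman is taken to have forecast 40% on three days and 0% on seven, the Bad Weatherman 10% every day, the order of the days is arbitrary, and 0% is replaced by an assumed 0.1%. Ugly is left out because his day-by-day numbers are not spelled out in the text. The names `EPS`, `log_loss`, and `total_log_loss` are hypothetical.

```python
import math

EPS = 0.001  # assumed stand-in for 0% and 100%; the choice is left to the reader

def log_loss(p, occurred, eps=EPS):
    """Natural-log penalty for one forecast, with p clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -math.log(p) if occurred else -math.log(1.0 - p)

def total_log_loss(forecasts, outcomes):
    """Sum the penalty over paired daily forecasts and outcomes."""
    return sum(log_loss(p, o) for p, o in zip(forecasts, outcomes))

# Ten Phoenix days with a single shower (placing it on day one is an assumption)
rained = [True] + [False] * 9

bad = [0.10] * 10                        # long-run average, every day
good = [0.40, 0.40, 0.40] + [0.0] * 7    # three "maybe" days, seven confident dry days

bad_total = total_log_loss(bad, rained)
good_total = total_log_loss(good, rained)

print(f"Bad Weatherman:  {bad_total:.3f}")
print(f"Good Weatherman: {good_total:.3f}")
```

Under these assumptions the Good Weatherman's total comes out at 1.945, matching the figure quoted above, while the always-10% forecast scores about 3.251. The gap between the two is the value the Good Weatherman adds over the long-run average.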
Whose House projections were the best?
And now we apply what we’ve learned.
Elections to the United States House of Representatives are a dream for forecasters and statisticians alike. A large natural experiment, with 435 simultaneous trials, quantitative results, a large data set, local specifics to learn, and many other things that just can’t be quantified; each of these was present on November 6, 2018.
Even forgetting those running for office, managing the two political parties, and spending the money, there was a lot of interest in the results of these elections. Therefore, many people (and groups) attempted to forecast the results. Not only does forecasting provide a public service of some value, it also provides a lot of hits on one’s website.
Broadly speaking, there are two types of election forecasters. Quantitative forecasters look for publicly available information like polls and fundraising, as well as endemic variables like economic conditions. Using some type of fitting, they decide which of these variables are predictive of upcoming House election results. They also attempt to determine the best way to “mix” these variables together to predict results. Because of the nature of their forecasting, they typically offer numerical, probabilistic predictions: Candidate A has a 75% chance of winning. They might also predict vote shares: Candidate B is projected to win 45% of the total 2-party vote.
Qualitative forecasters use the same variables but add in other factors, such as knowledge of the candidate or the district that isn’t quantifiable. Typically, they offer qualitative forecasts; for example, “Candidate C is likely to win the election”. I created my own set of qualitative forecasts.
If you didn’t skip the first section, you are probably getting excited, because this is the perfect setup for a log-loss analysis. The only slight hitch is that we are forced to change the qualitative rankings into probabilities. I used the following mapping, expressed as the probability of the Democrat winning the given seat:(13)
Figure 1 - Race Rating to Probabilities
Recall that the log-loss analysis runs into trouble with probabilities of 0% and 100%, hence the small deviations for Safe predictions.
As a first cut of the data, we can look at each forecaster to see their distribution of projections. (See the appendix for descriptions of the professional forecasters shown in Figure 2.) Note that only Inside Elections and I used the “Lean” designation. Sabato’s Crystal Ball took a stand just before the election, forbidding the use of the “Toss Up” designation.(14)
Figure 2 - Seat Projection Distributions
We can already glean a few interesting pieces of data. IE and CNN (and RCP to a lesser extent) had many more seats viewed as Safe R; other forecasters took seriously the polls showing at least the possibility of a truly massive Democratic wave. Similarly, RCP had very few Safe D seats, while Crosstab, 538, and I had many. I’m not quite sure why RCP considered the result in New York’s 18th district or Iowa’s 2nd to be in doubt; this hurt their performance. Like the quantitative forecasters, I was not afraid to call some seats previously held by Republicans as “Safe D.” For example, there was no doubt in my mind that Democrats would win two seats in Pennsylvania that had changed due to redistricting. Also, Sabato’s proscription of the Toss Up rating greatly increased the number of Lean D seats in his projection.
Using this raw data, as well as the probabilities in Figure 1, we can translate into a projected number of Democratic seats won for each forecaster.(15) We can also see the number of seats in which each party was favored by a forecaster. I call the former metric the Mean and the latter the Median.(16)
Figure 3 - Projected Seats Won
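To make the Mean and Median tallies concrete, here is a small Python sketch. Everything in it is hypothetical: the probabilities in `RATING_TO_P_DEM` are placeholders of my own choosing and are not the values in Figure 1, and the eight seat ratings are invented. It also assumes that a Toss Up counts toward neither party in the Median tally, which may not match the treatment in the actual analysis.

```python
# Hypothetical rating-to-probability mapping (NOT the values from Figure 1)
RATING_TO_P_DEM = {
    "Safe D": 0.999, "Likely D": 0.90, "Lean D": 0.70,
    "Toss Up": 0.50,
    "Lean R": 0.30, "Likely R": 0.10, "Safe R": 0.001,
}

# Invented ratings for eight imaginary seats
ratings = ["Safe D", "Safe D", "Likely D", "Lean D",
           "Toss Up", "Lean R", "Likely R", "Safe R"]

probs = [RATING_TO_P_DEM[r] for r in ratings]

# "Mean": expected Democratic seats, the sum of the win probabilities
mean_seats = sum(probs)

# "Median": seats in which the Democrat is favored (probability above 50%)
median_seats = sum(1 for p in probs if p > 0.5)

print(f"Mean:   {mean_seats:.3f}")
print(f"Median: {median_seats}")
```

With these made-up inputs the Mean is 4.499 and the Median is 4. The Mean exceeds the Median here only because of the Toss Up seat and the residual chances in Republican-leaning seats; a real forecaster's gap depends entirely on their own ratings.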
The projected totals fell within a reasonably tight range: a spread of 7 seats on the median and 9 on the mean. It bears noting that each forecaster predicted a skewed distribution, with the mean Democratic seats won greater than the median. This is again due to the potential wipeout of the GOP House caucus; evidence of this possibility was seen by both the qualitative and quantitative forecasters. A GOP enthusiasm plunge could have put normally safe seats at risk. Quantitatively, the aggressive gerrymanders created by the GOP state governments in 2010 also carry a tradeoff. A greater likelihood of gaining a majority in the House is balanced somewhat by the potential of losing more seats than expected in extreme scenarios.(17)
With the description out of the way, we can look and see how everybody did. Remember that log-loss analysis is like golf: a lower score is better.