Saturday, March 13, 2021

Forecasting the 2020 election: a retrospective

What I did  

One of my hobbies is forecasting US presidential elections using opinion poll data. The election is over and Joe Biden has been sworn in, so this seems like a good time to look back on what I got right and what I got wrong. 

I built a computer model that ingests state-level opinion poll data and outputs a state-level forecast of the election results. My model aggregates polling data, using the previous election results as a starting point. It's written in Python and you can get it from my GitHub page. The polling data comes from the ever-wonderful 538.
(This pole works, unlike some other polls. Image source: Wikimedia Commons, License: Creative Commons, Author: Daniel FR.)
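
To give a flavor of what "aggregating polls, starting from the previous result" can look like, here is a minimal sketch. It is not the code from my GitHub repository: the `Poll` class, the `forecast_state` function, the `prior_weight` pseudo-sample-size, and the 14-day half-life are all hypothetical choices made for illustration. The idea is simply a weighted average in which the previous election's result counts as one more (old) poll, and recent, larger polls count for more.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Poll:
    end_date: date     # last day the poll was in the field
    sample_size: int   # number of respondents
    dem_share: float   # Democratic share of the two-party vote, 0..1


def forecast_state(prior_dem_share, polls, as_of,
                   prior_weight=500.0, half_life_days=14.0):
    """Blend the previous election result with polls for one state.

    Assumptions (illustrative only): the prior behaves like a poll of
    `prior_weight` respondents, and a poll's weight halves every
    `half_life_days` days of age.
    """
    weighted_sum = prior_dem_share * prior_weight
    total_weight = prior_weight
    for poll in polls:
        age_days = (as_of - poll.end_date).days
        if age_days < 0:
            continue  # skip polls that finished after the forecast date
        weight = poll.sample_size * 0.5 ** (age_days / half_life_days)
        weighted_sum += poll.dem_share * weight
        total_weight += weight
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Made-up numbers, not real polling data.
    polls = [Poll(date(2020, 10, 30), 800, 0.51),
             Poll(date(2020, 10, 20), 600, 0.53)]
    print(forecast_state(0.49, polls, as_of=date(2020, 11, 2)))
```

With no polls the forecast is just the previous result; as polls accumulate, the prior's influence shrinks. A real model also has to estimate uncertainty, which this sketch ignores.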

What I got right

My final model correctly predicted the winner in 49 of the 51 contests (the 50 states plus Washington, D.C.).

What I got wrong

The two states my model got wrong were Florida and North Carolina, and these were big misses: the actual results fell outside my confidence intervals. The cause in both cases was the polling data. In both states, the polls were consistently wrong and substantially overstated Biden's vote share.

My model also overstated Biden's margin of victory in many of the states he won. This is hidden because my model forecast a Biden victory and Biden won, but in several cases, his margin of victory was less than my model predicted - and significantly so.

The cause of the problem was opinion polls overstating Biden's vote share.
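
The difference between calling the winner and getting the margin right is easy to make concrete. The sketch below is illustrative only: the `summarize` function is hypothetical, and the three "states" and their margins are made-up placeholders, not my forecasts or the actual results. It reports how many winners were called correctly alongside the mean signed error in the margin, and shows that the first can look good while the second is consistently off in one direction.

```python
def summarize(forecast_margins, actual_margins):
    """Compare forecast and actual margins, keyed by state.

    Margins are (Biden minus Trump) in percentage points, so a positive
    value means a Biden win. Returns (correct_calls, states_compared,
    mean_signed_error), where a positive error means the forecast
    overstated Biden's margin.
    """
    states = forecast_margins.keys() & actual_margins.keys()
    if not states:
        raise ValueError("no states in common")
    correct = sum(
        (forecast_margins[s] > 0) == (actual_margins[s] > 0) for s in states
    )
    mean_signed_error = sum(
        forecast_margins[s] - actual_margins[s] for s in states
    ) / len(states)
    return correct, len(states), mean_signed_error


if __name__ == "__main__":
    # Placeholder values for three imaginary states.
    forecast = {"A": 8.0, "B": 3.0, "C": -5.0}
    actual = {"A": 4.0, "B": -1.0, "C": -9.0}
    print(summarize(forecast, actual))
```

In this toy example, every forecast is off by the same amount in the same direction, yet only the close state flips to a wrong call.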

The polling industry and 2020

The polling industry as a whole overstated Biden's support by several percentage points across many states. This is disguised because they got most states directionally correct, but it's still a wide miss. 

In the aftermath of 2016, the industry did a self-examination and promised it would do better next time, but 2020 was still way off. The industry is going to do a retrospective to find out what went wrong in 2020.

I've read a number of explanations of the polling misses in the press, but the press's motivation is selling advertising, not getting to the root cause. Polling is hard and 2020 was very different from previous years; there was a pandemic and Donald Trump was a highly polarizing candidate. This led to higher voter turnout and many, many more absentee ballots. If the cause were easy to find, we'd have found it by now.

The 2020 investigation needs to be thorough and credible, which means it will be several months at least before we hear anything. My best guess is that there will be an industry paper in about six months, with independent research papers starting to appear in a few months. I'm looking forward to the analysis: I'm convinced I'm going to learn something new.

Where next?

There are lots of tweaks I could make to my model, but I'm not going to do any of them until the underlying polling data improves. In other words, I'm going to forget about it all for three years. In fact, I'd quite like to forget about politics for a while.
