
Monday, August 2, 2021

Poleaxed opinion polls: the ongoing 2020 disaster

Why the polls failed in the US Presidential Election of 2020

In the wake of the widespread failure of opinion polls to accurately predict the outcome of the 2020 US Presidential election, the American Association for Public Opinion Research (AAPOR) commissioned a study to investigate the causes and make recommendations. Their findings were recently released.

(This is the key question for 2020 opinion pollsters. The answer is yes, but they don't know why. Image source: Wikimedia)

Summary of AAPOR's findings

I've read the report and I've dug through the findings. Here's my summary:

  1. The polls overstated support for Democratic candidates.
  2. We don't really know why.
  3. Er... that's it.

Yes, I'm being harsh, but I'm underwhelmed by the report and I find some of the statements in it unconvincing. I'll present some of their main findings and talk through them. I encourage you to read the report for yourself and reach your own conclusions.

(We don't know why we didn't get the results right.)

Factors they ruled out for 2020

  • Late-breaking changes in favor of Republican candidates. This happened in 2016 but didn't happen in 2020. The polls were directionally consistent throughout the campaign.
  • Weighting for education. In 2016, most polls didn't weight for education, and education did seem to be a factor. In 2020, most polls did weight for education, so educational weighting wasn't a factor.
  • Pollsters got the demographics wrong. Pollsters don't use purely random sampling; they often use stratified sampling based on demographics. There's no evidence that errors in demographics led to widespread polling errors in 2020.
  • People were afraid to say they voted for Trump. In races not involving Trump, the opinion polls were still wrong and still favored Democratic candidates. Trump wasn't the cause.
  • Intention to vote vs. actually voting. The error can't be explained by people who told pollsters they would vote but then didn't. For example, if Democratic supporters had said they were going to vote Democratic and then stayed home, that would explain the error, but it didn't happen.
  • Proportion of early voters or election day voters. Early voting/election day voting didn't make a difference to the polling error.

Factors they couldn't rule out

  • Republican voters chose not to take part in surveys at a higher rate than Democratic voters.
  • The weighting model used to adjust sampling may have been wrong. Pollsters use models of the electorate to adjust their results. If these models are wrong, the results will be biased.
  • Many more people voted in 2020 than in 2016 ("new voters" in the report) - maybe pollsters couldn't model these new voters very well.

Here's a paragraph from the report:

"Unfortunately, the ability to determine the cause or causes of polling error in 2020 is limited by the available data. Unless the composition of the overall electorate is known, looking only at who responded says nothing about who did not respond. Not knowing if the Republicans (or unaffiliated voters, or new voters) who responded to polls were more supportive of Biden than those who did not respond, for example, it is impossible to identify the primary source of polling error."

Let me put that paragraph another way: we don't have enough data to investigate the problem so we can't say what went wrong.

Rinse and repeat - or just don't

I'm going to quote some sentences from the report's conclusions and comments:

  • "Considering that the average margin of error among the state-level presidential polls in 2020 was 3.9 points, that means candidate margins smaller than 7.8 points would be difficult to statistically distinguish from zero using conventional levels of statistical significance. Furthermore, accounting for uncertainty of statistical adjustments and other factors, the total survey error would be even larger."
  • "Most pre-election polls lack the precision necessary to predict the outcome of semi-close contests."
  • "Our investigation reveals a systemic overstatement of the Democratic-Republican margin in nearly every contest, regardless of mode or proximity to the election. This overstatement is largest in states with more Republican supporters"

Some of the report's statements are extraordinary if you stop and think for a moment. I want you to ponder the key question: what use are polls?

The people paying for polls are mostly (but not completely) political campaigns and the media. The media want to report on an accurate snapshot of where the election is now and make an assessment of who will win. Political campaigns largely want the same thing. 

Polls aren't very useful in places that vote overwhelmingly one way, like Alaska (Republican) or Hawaii (Democratic). Wyoming is overwhelmingly a Republican stronghold, and Washington D.C. a Democratic stronghold. My forecast for 2024 is simple: Wyoming will vote Republican and Washington D.C. Democratic.

Polls are useful where the race is close, or, in the words of the report "semi-close". But, according to the report, polls in semi-close states don't have sufficient accuracy to predict the result.

So, if polls aren't useful in strongly Democratic or Republican states, and they lack predictive power in "semi-close" races, what use are they? Why should anyone pay for them?

There's an even deadlier issue for polling organizations. Political opinion polls are one of the few products whose accuracy can be clearly judged, because elections provide the answer. Opinion poll companies run all kinds of polls on all kinds of topics, not just elections. If the polls miss in the one area where we can check them, how accurate are they in areas where success is harder to assess?

Where to next?

The polling industry has an existential credibility crisis. It can't continue to sell a product that doesn't work. It's extraordinary that an industry that's been around for nearly 100 years doesn't have the data to diagnose its failures. The industry needs to come together to fix its problems as soon as possible - or face irrelevancy in the near future.

Monday, February 1, 2021

What do Presidential approval polls really tell us?

This is a technical piece about what a particular type of polling means. It is not a political piece for or against President Trump, and I will remove any political comments.

What are presidential approval polls?

Presidential approval polls are a simple concept to grasp: do you approve or disapprove of President X? Because newspapers and TV channels can always use them for a headline or an on-air segment, they love to commission them. During President Trump's presidency, I counted 16,500 published approval polls.

But what do these polls mean and how should we interpret them? As it turns out, understanding what they're telling us is slippery. I'm going to offer you my guide for understanding what they mean.

(Image source: Wikimedia Commons. License: Public domain.)

My data comes from the ever-wonderful 538 which has a page showing the approval ratings for President Trump. Not only can you download the data from the page, but you can also compare President Trump's approval ratings with many previous presidents' approval ratings.

Example approval results

On 2020-10-29, Fox News ran an approval poll for President Trump. Of the 1,246 people surveyed:

  • 46% approved of President Trump
  • 54% disapproved of President Trump

which seems fairly conclusive that the majority disapproves. But not so fast. On the same day, Rasmussen Reports/Pulse Opinion Research also ran an approval poll, this time of 1,500 people. Their results were:

  • 51% approved of President Trump
  • 48% disapproved of President Trump.

These were both fairly large surveys. How could they be so different?

Actually, it gets worse because these other surveys were taken on the same day too:

  • Gravis Marketing, 1,281 respondents, 52% approve, 47% disapprove
  • Morning Consult, 31,920 respondents, 42% approve, 53% disapprove
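
How far apart are these same-day results relative to pure sampling error? Here's a minimal sketch in Python, assuming simple random sampling (which, as we'll see later, real polls don't use), using the same-day numbers listed above.

    import math

    def approve_se(p, n):
        """Standard error of an approval proportion, assuming simple random sampling."""
        return math.sqrt(p * (1 - p) / n)

    # Same-day (2020-10-29) approval results quoted above: (proportion approving, sample size).
    fox = (0.46, 1246)
    rasmussen = (0.51, 1500)
    morning_consult = (0.42, 31920)

    def gap_in_standard_errors(poll_1, poll_2):
        """How many standard errors separate two polls' approval numbers?"""
        (p1, n1), (p2, n2) = poll_1, poll_2
        se_diff = math.sqrt(approve_se(p1, n1)**2 + approve_se(p2, n2)**2)
        return abs(p1 - p2) / se_diff

    print(f"Fox vs. Rasmussen: {gap_in_standard_errors(fox, rasmussen):.1f} standard errors")
    print(f"Rasmussen vs. Morning Consult: {gap_in_standard_errors(rasmussen, morning_consult):.1f} standard errors")

The gaps come out at roughly 2.6 and 6.8 standard errors; the second in particular is far more than random sampling alone would explain, which is the puzzle the rest of this post digs into.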

Let's plot out the data and see what the spread is, but as with everything with polls, this is harder than it seems.

Plotting approval and disapproval over time

Plotting out the results of approval polls seems simple: the x-axis is the day of the poll and the y-axis is the approval or disapproval percentage. But polls are typically conducted over several days, and there's uncertainty in the results.

To take a typical example, Global Marketing Research Services conducted a poll over several days (2020-10-23 to 2020-10-27). It's misleading to plot just the last day of the poll; we should plot the results over all the days the poll was conducted.

The actual approval or disapproval number is subject to sampling error. If we assume random sampling (I'm going to come back to this later), we can work out the uncertainty in the results; more formally, we can work out a confidence interval. Here's how this works out in practice. YouGov did a poll over three days (2020-10-25 to 2020-10-27) and recorded 42% approval and 56% disapproval for 1,365 respondents. Using some math I won't explain here (though there's a short code sketch after the list), we can write these results as:

  • 2020-10-25, approval 42 ± 2.6%, disapproval 56 ± 2.6%, undecided 2 ± 0.7%
  • 2020-10-26, approval 42 ± 2.6%, disapproval 56 ± 2.6%, undecided 2 ± 0.7%
  • 2020-10-27, approval 42 ± 2.6%, disapproval 56 ± 2.6%, undecided 2 ± 0.7%
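
For the curious, here's the math I skipped, as a short Python sketch. It's the standard 95% confidence interval for a proportion under simple random sampling, and it reproduces the ±2.6% and ±0.7% figures above.

    import math

    def margin_of_error(p, n, z=1.96):
        """95% margin of error for a proportion, assuming simple random sampling."""
        return z * math.sqrt(p * (1 - p) / n)

    n = 1365  # YouGov poll, 2020-10-25 to 2020-10-27
    for label, p in [("approval", 0.42), ("disapproval", 0.56), ("undecided", 0.02)]:
        print(f"{label}: {p:.0%} ± {margin_of_error(p, n):.1%}")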

We can plot this poll result like this:

Before we get to the plot of all approval ratings, let's do one last thing. If you're plotting large amounts of data, it's helpful to set a transparency level for the points you're plotting (often called alpha). There are 16,500 polls and we'll be plotting approve, disapprove, and undecided, which is a lot of data. By setting the transparency level appropriately, the plot will have the property where the more intense the color is, the more the poll results overlap. With this addition, let's see the plot of approval, disapproval, and undecided over time.
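
As a rough sketch of the technique (not my actual plotting code), here's how the alpha setting works with matplotlib. The DataFrame below is synthetic stand-in data, not the real 538 download.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Synthetic stand-in for the 538 data: one row per poll per day in the field.
    rng = np.random.default_rng(0)
    days = rng.integers(0, 4 * 365, 5000)
    polls = pd.DataFrame({
        "date": pd.to_datetime("2017-01-20") + pd.to_timedelta(days, unit="D"),
        "approve": rng.normal(42, 3, 5000),
        "disapprove": rng.normal(54, 3, 5000),
    })
    polls["undecided"] = 100 - polls["approve"] - polls["disapprove"]

    # A low alpha makes each point faint; overlapping results build up into
    # more intense color, which is what shows the density of the polls.
    fig, ax = plt.subplots(figsize=(12, 6))
    for column, color in [("approve", "green"), ("disapprove", "red"), ("undecided", "gray")]:
        ax.scatter(polls["date"], polls[column], color=color, alpha=0.05, s=10, label=column)
    ax.set_xlabel("Date")
    ax.set_ylabel("%")
    ax.legend()
    plt.show()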

Wow. There's quite a lot going on here. It's hard to get a sense of changes over time. I've added a trend line for approval, disapproval, and undecided so you can get a better sense of the aggregate behavior of the data.

Variation between pollsters

There's wide variation between opinion pollsters. I've picked out just two, Rasmussen Reports/Pulse Opinion Research and Morning Consult. To see the variation more clearly, I'll show only approvals for President Trump, restricted to these two pollsters plus the average across all polls.

To state the obvious, the difference is huge and way above random sampling error. Who's right, Rasmussen Reports or Morning Consult? How can we tell?

To understand what this chart means, we have to know a little bit more about how these polls are conducted.

How might you run an approval poll?

There are two types of approval polls.

  • One-off polls. You select your sample of subjects and ask them your questions. You only do it once.
  • Tracking polls. Technically, this is also called a longitudinal study. You select your population sample and ask them questions. You then ask the same group the same questions at a later date. The idea is, you can see how opinions change over time using the same group.

Different polling organizations use different methods for population sampling. It's almost never entirely random sampling. Bear in mind, subjects can say no to being involved, and can in principle drop out any time they choose. 

It's very, very easy to introduce bias through the people you select; slight differences in selection can give big differences in results. Let's say you're trying to measure President Trump's approval. Some people will approve of everything he does, while others will disapprove of everything he does. There's very little point in tracking how either of these groups approves or disapproves over time because their answers won't change. If your sample contains a large proportion of either group, you're not going to see much variation. This raises a tension: are you selecting for population representation or selecting to measure change over time?

For these reasons, the sampling error in the polls is likely to be larger than random sampling error alone and may have different characteristics.

How accurate are approval polls?

This is the big question. For polls related to voting intention, you can compare what the polls said and the election result. But there's no such moment of truth for approval polls. I might disapprove of a President, but vote for them anyway (because of party affiliations or because I hate the other candidate more), so election results are a poor indicator of success.

One measure of accuracy might be agreement among approval polls from a number of organizations, but it's possible that the other pollsters could be wrong too. There's a polling industry problem called herding which has been a big issue in UK political polls. Herding means pollsters choose methodologies similar to other pollsters to avoid being outliers, which leads to polling results from different pollsters herding together. In a couple of notorious cases in the UK, they herded together and herded wrongly. A poll's similarity to other polls does not mean it's more accurate.

What about averaging?

What about aggregating polls? Even this isn't simple. In your aggregation:

  • Do you include tracking polls or all polls?
  • Do you weight polls by their size?
  • Do you weight polls by accuracy or partisan bias?
  • Do you remove 'don't knows'?
  • If a poll took place over more than one day, do you average results over each day the poll took place?

I'm sure you could add your own factors. The bottom line is, even aggregation isn't straightforward.
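
To make that last point concrete, here's a toy sketch using the same-day polls from earlier in this post. Just deciding whether to weight by sample size moves the aggregate by about five points; every other choice in the list above pulls the number around too.

    import pandas as pd

    # Same-day (2020-10-29) approval polls from earlier in the post.
    polls = pd.DataFrame({
        "pollster": ["Fox News", "Rasmussen/Pulse", "Gravis Marketing", "Morning Consult"],
        "approve": [46, 51, 52, 42],
        "n": [1246, 1500, 1281, 31920],
    })

    simple_mean = polls["approve"].mean()
    size_weighted = (polls["approve"] * polls["n"]).sum() / polls["n"].sum()
    print(f"Simple average: {simple_mean:.1f}%")
    print(f"Sample-size-weighted average: {size_weighted:.1f}%")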

What all this means

Is Rasmussen Reports more accurate than Morning Consult? I can't say. There is no external source of truth for measuring who's more correct.

Even worse, we can see changes in the Rasmussen Reports approval that don't occur in the Morning Consult data (and vice versa). Was the effect Rasmussen Reports saw real and Morning Consult missed it, or was Morning Consult correct? I can't say.

It's not just these two pollsters. The Pew Research Center claims their data, showing a decline in President Trump's approval rating at the end of his presidency, is real. This may well be correct, but what external sources can we use to say so for sure?

What can I conclude for President Trump's approval rating?

Here's my takeaway story after all this. 

President Trump had an approval rating above 50% from most polling organizations when he took office. Most, but not all, polling organizations reported a drop below 50% soon after the start of his presidency. After that, his approval ratings stayed pretty flat throughout his entire presidency, except for a drop at the very end. 

The remarkable story is how steady his approval ratings were. For most presidents, there are ups and downs throughout their presidency, but not so much for President Trump. It seems that people made their minds up very quickly and didn't change their opinions much. 

Despite the large number of approval polls, the headline for most of the last four years should have been: "President Trump's approval rating: very little change".

What about President Biden?

At a guess, the polls will start positive and decline. I'm not going to get excited about any one poll. I want to see averages, and I want to see a sustained trend over time. Only then do I think the polls might tell us something worth listening to.


Tuesday, September 8, 2020

Can you believe the polls?

Opinion polls have known sin

Polling companies have run into trouble over the years in ways that render some poll results doubtful at best. Here are just a few of the problems:

  • Fraud allegations.
  • Leading questions.
  • Choosing not to publish results/picking methodologies so that polls agree.

Running reliable polls is hard work that takes a lot of expertise and commitment. Sadly, companies sometimes get it wrong for several reasons:

  • Ineptitude.
  • Lack of money. 
  • Telling people what they want to hear. 
  • Fakery.

In this blog post, I'm going to look at some high-profile cases of dodgy polling and I'm going to draw some lessons from what happened.

(Are some polls real or fake? Image source: Wikimedia Commons. Image credit: Basile Morin. License: Creative Commons.)

Allegations of fraud part 1 - Research 2000

Backstory

Research 2000 started operating around 1999 and gained some solid early clients. In 2008, The Daily Kos contracted with Research 2000 for polling during the upcoming US elections. In early 2010, Nate Silver at FiveThirtyEight rated Research 2000 as an F and stopped using their polls. As a direct result, The Daily Kos terminated their contract and later took legal action to reclaim fees, alleging fraud.

Nate Silver's and others' analysis

After the 2010 Senate elections, Nate Silver analyzed polling results for 'house effects' and found a bias towards the Democratic party for Research 2000. These kinds of biases appear all the time and vary from election to election. The Research 2000 bias was large (at 4.4%), but not crazy; the Rasmussen Republican bias was larger for example. Nonetheless, for many reasons, he graded Research 2000 an F and stopped using their polling data.

In June of 2010, The Daily Kos publicly dismissed Research 2000 as their pollster based on Nate Silver's ranking and more detailed discussions with him. Three weeks later, The Daily Kos sued Research 2000 for fraud. After the legal action was public, Nate Silver blogged some more details of his misgivings about Research 2000's results, which led to a cease and desist letter from Research 2000's lawyers. Subsequent to the cease-and-desist letter, Silver published yet more details of his misgivings. To summarize his results, he was seeing data inconsistent with real polling - the distribution of the numbers was wrong. As it turned out, Research 2000 was having financial trouble around the time of the polling allegations and was negotiating low-cost or free polling with The Daily Kos in exchange for accelerated payments. 

Others were onto Research 2000 too. Three statisticians analyzed some of the polling data and found patterns inconsistent with real polling - again, real polls tend to have results distributed in certain ways and some of the Research 2000 polls did not.

The result

The lawsuit progressed with strong evidence in favor of The Daily Kos. Perhaps unsurprisingly, the parties reached a settlement, with Research 2000 agreeing to pay The Daily Kos a fee. Research 2000 effectively shut down after the agreement.

Allegations of fraud part 2 - Strategic Vision, LLC

Backstory

This story requires some care in the telling. At the time of the story, there were two companies called Strategic Vision: one is well-respected and wholly innocent, the other not so much. The innocent and well-respected company is Strategic Vision based in San Diego. They have nothing to do with this story. The other company is Strategic Vision, LLC based in Atlanta. When I talk about Strategic Vision, LLC from now on, it will be solely about the Atlanta company.

To maintain trust in the polling industry, the American Association for Public Opinion Research (AAPOR) has guidelines and asks polling companies to disclose some details of their polling methodologies. They rarely censure companies, and their censures don't have the force of law, but public shaming is effective as we'll see. 

What happened

In 2008, the AAPOR asked 21 polling organizations for details of their 2008 pre-election polling, including polling for the New Hampshire Democratic primary. Their goal was to quality-check the state of polling in the industry.

One polling company didn't respond for a year, despite repeated requests to do so. As a result, in September 2009, the AAPOR published a public censure of Strategic Vision, LLC, which you can read here.

It's very unusual for the AAPOR to issue a censure, so the story was widely reported at the time, for example in the New York Times, The Hill, and The Wall Street Journal. Strategic Vision LLC's public response to the press coverage was that they were complying but didn't have time to submit their data. They denied any wrongdoing.

Subsequent to the censure, Nate Silver looked more closely at Strategic Vision LLC's results. Initially, he asked some very pointed and blunt questions. In a subsequent post, Nate Silver used Benford's Law to investigate Strategic Vision LLC's data, and based on his analysis he stated there was a suggestion of fraud - more specifically, that the data had been made up. In a post the following day, Nate Silver offered some more analysis and a great example of using Benford's Law in practice. Again, Strategic Vision LLC vigorously denied any wrongdoing.
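
For readers unfamiliar with Benford's Law: in many naturally occurring datasets, the leading digit 1 turns up about 30% of the time, 2 about 18%, and so on, so a leading-digit distribution far from this pattern is a red flag. Here's a generic sketch of that kind of check (not Nate Silver's actual analysis; the numbers below are made up for illustration, and bounded percentages are only an approximate fit for a first-digit test).

    import math
    from collections import Counter

    def benford_expected(digit):
        """Expected frequency of a leading digit under Benford's Law."""
        return math.log10(1 + 1 / digit)

    # Hypothetical reported figures to check - not Strategic Vision's real data.
    reported = [52, 47, 23, 43, 51, 48, 46, 54, 42, 53, 11, 19, 31, 17, 62, 8, 29, 35]

    leading = Counter(int(str(value)[0]) for value in reported)
    total = sum(leading.values())
    for digit in range(1, 10):
        observed = leading.get(digit, 0) / total
        print(f"digit {digit}: observed {observed:.2f}, Benford expects {benford_expected(digit):.2f}")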

One of the most entertaining parts of this story is a citizenship poll conducted by Strategic Vision, LLC among high school students in Oklahoma. The poll was commissioned by the Oklahoma Council on Public Affairs, a think tank. The poll asked eight straightforward questions, for example:

  • who was the first US president? 
  • what are the two main political parties in the US?  

and so on. The results were dismal: only 23% of students answered George Washington and only 43% of students knew Democratic and Republican. Not one student in 1,000 got all questions correct - which is extraordinary. These types of polls are beloved of the press; there are easy headlines to be squeezed from students doing poorly, especially on issues around citizenship. Unfortunately, the poll results looked odd at best. Nate Silver analyzed the distribution of the results and concluded that something didn't seem right - the data was not distributed as you might expect. To their great credit, when the Oklahoma Council on Public Affairs became aware of problems with the poll, they removed it from their website and put up a page explaining what happened. They subsequently terminated their relationship with Strategic Vision, LLC.

In 2010, a University of Cincinnati professor awarded Strategic Vision LLC the "Phantom of the Soap Opera" award on the Media Ethics site. That site has a little more background on the odd story of Strategic Vision LLC's offices, or lack of them.

The results

Strategic Vision, LLC continued to deny any wrongdoing. They never supplied their data to the AAPOR and they stopped publishing polls in late 2009. They've disappeared from the polling scene.

Other polling companies

Nate Silver rated other pollsters an F and stopped using them. Not all of the tales are as lurid as the ones I've described here, but there are accusations of fraud and fakery in some cases, and in other cases, there are methodology disputes and no suggestion of impropriety. Here's a list of pollsters Nate Silver rates an F.

Anarchy in the UK

It's time to cross the Atlantic and look at polling shenanigans in the UK. The UK hasn't seen the rise and fall of dodgy polling companies, but it has seen dodgy polling methodologies.

Herding

Let's imagine you commission a poll on who will win the UK general election. You get a result different from the other polls. Do you publish your result? Now imagine you're a polling analyst, you have a choice of methodologies for analyzing your results, do you do what everyone else does and get similar results, or do you do your own thing and maybe get different results from everyone else?

Sadly, there are many cases when contrarian polls weren't published and there is evidence that polling companies made very similar analysis choices to deliberately give similar results. This leads to the phenomenon called herding where published poll results tend to herd together. Sometimes, this is OK, but sometimes it can lead to multiple companies calling an election wrongly.

In 2015, the UK polls predicted a hung parliament, but the result was a working majority for the Conservative party. The subsequent industry poll analysis identified herding as one of the causes of the polling miss. 

This isn't the first time herding has been an issue with UK polling and it's occasionally happened in the US too.

Leading questions

The old British TV show 'Yes, Prime Minister' has a great piece of dialog neatly showing how leading questions work in surveys. 'Yes, Prime Minister' is a comedy, but UK polls have suffered from leading questions for a while.

The oldest example I've come across dates from the 1970's and the original European Economic Community membership referendum. Apparently, one poll asked the following questions to two different groups:

  • France, Germany, Italy, Holland, Belgium and Luxembourg approved their membership of the EEC by a vote of their national parliaments. Do you think Britain should do the same?
  • Ireland, Denmark and Norway are voting in a referendum to decide whether to join the EEC. Do you think Britain should do the same?

These questions are highly leading and unsurprisingly elicited the expected positive result in both (contradictory) cases.

Moving forward in time to 2012, leading questions (or at least artful question wording) came up again. The background is press regulation. After a series of scandals where the press behaved shockingly badly, the UK government considered press regulation to curb abuses. Various parties were for or against various aspects of press regulation, and they commissioned polls to support their viewpoints.

The polling company YouGov published a poll, paid for by The Media Standards Trust, that showed 79% of people thought there should be an independent government-sanctioned regulator to investigate complaints against the press. Sounds comprehensive and definitive. 

But there was another poll at about the same time, this time paid for by The Sun newspaper,  that found that only 24% of the British public wanted a government regulator for the press - the polling company here was also YouGov! 

The difference between the 79% and 24% came through careful question wording - a nuance that was lost in the subsequent press reporting of the results. You can listen to the story on the BBC's More Or Less program that gives the wording of the question used.

What does all this mean?

The quality of the polling company is everything

The established, reputable companies got that way through high-quality reliable work over a period of years. They will make mistakes from time to time, but they learn from them. When you're considering whether or not to believe a poll,  you should ask who conducted the poll and consider the reputation of the company behind it.

With some exceptions, the press is unreliable

None of the cases of polling impropriety were caught by the press. In fact, the press has a perverse incentive to promote the wild and outlandish, which favors results from dodgy pollsters. Be aware that a newspaper that paid for a poll is not going to criticize its own paid-for product, especially when it's getting headlines out of it.

Most press coverage of polls focuses on discussing what the poll results mean, not how accurate they are or what sources of bias they might carry. If these things are discussed at all, they're discussed in a partisan manner (disagreeing with a poll because the writer holds a different political view). I've never seen the kind of analysis Nate Silver does elsewhere - and this is to the great detriment of the press and their credibility.

Vested interests

A great way to get press coverage is by commissioning polls and publishing the results; especially if you can ask leading questions. Sometimes, the press gets very lazy and doesn't even report who commissioned a poll, even when there's plainly a vested interest.

Anytime you read a survey, ask who paid for it and what the exact questions were.

Outliers are outliers, not trends

Outlier poll results get more play than results in line with other pollsters. As I write this in early September 2020, Biden is about 7% ahead in the polls. Let's imagine two survey results coming in early September:

  • Biden ahead by 8%.
  • Trump ahead by 3%

Which do you think would get more space in the media? Probably the shocking result, even though the dull result may be more likely. Trump-supporting journalists might start writing articles on a campaign resurgence while Biden-supporting journalists might talk about his lead slipping and losing momentum. In reality, the 3% poll might be an anomaly and probably doesn't justify consideration until it's backed by other polls. 

Bottom line: outlier polls are probably outliers and you shouldn't set too much store by them.

There's only one Nate Silver

Nate Silver seems like a one-man army, rooting out false polling and pollsters. He's stood up to various legal threats over the years. It's a good thing that he exists, but it's a bad thing that there's only one of him. It would be great if the press could take inspiration from him and take a more nuanced, skeptical, and statistical view of polls.

Can you believe the polls?

Let me close by answering my own question: yes you can believe the polls, but within limits and depending on who the pollster is.

Reading more

This blog post is one of a series of blog posts about opinion polls. 

Wednesday, August 19, 2020

President Hillary Clinton: what the polls got wrong in 2016 and why they got it wrong

What the pollsters got wrong

Had the US presidential polls been correct in 2016, Nate Silver and other forecasters would be anointed oracles and the polling companies would be viewed as soothsayers revealing fundamental truths about society. None of these things happened. Instead, forecasters were embarrassed and polling companies got a bloody nose. If we want to understand if things will go any differently in 2020, we have to understand what happened in 2016 and why.

What happened in 2016

The simple narrative is: "the polls got it wrong in 2016", but this is a gross oversimplification. Let's look at what actually happened.

Generally speaking, there are two types of US presidential election opinion polls: national and state. National polls are conducted across the US and are intended to give a sense of national intentions. Prediction-wise, they are most closely related to the national electoral vote. State polls are conducted within a state and are meant to predict the election in the state.

All pollsters recognize uncertainty in their measurement and most of them quote a margin of error, which is usually a 95% confidence interval. For example, I might say candidate 'cat' has 49% and candidate 'dog' has 51% with a 4% margin of error. This means you should read my results as 'cat': 49±4% and 'dog': 51±4%, or more simply, that I think candidate 'dog' will get between 47% and 55% of the vote and candidate 'cat' between 45% and 53%. If the actual results are 'cat' 52% and 'dog' 48%, technically, that's within the margin of error and is a successful forecast. You can also work out a probability of a candidate winning based on opinion poll data.
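
As a quick illustration of that last sentence, here's a sketch that turns the 'cat' vs. 'dog' example into a win probability using a normal approximation for the spread (the full derivation is in the "Who will win the election?" post further down this page). The 2-point lead and 4-point margin of error come straight from the example above.

    import math

    margin_of_error = 0.04   # 95% margin of error on each candidate's share
    dog, cat = 0.51, 0.49    # poll shares from the example above

    sigma_share = margin_of_error / 1.96   # standard error of one candidate's share
    sigma_spread = 2 * sigma_share         # standard error of the spread
    spread = dog - cat

    # P(spread > 0), i.e. the probability 'dog' is really ahead.
    p_dog_wins = 0.5 * (1 + math.erf(spread / (sigma_spread * math.sqrt(2))))
    print(f"P('dog' wins) = {p_dog_wins:.0%}")   # roughly 69%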

The 2016 national polling was largely correct. Clinton won the popular vote with a 2.1% margin over Trump. Wikipedia has a list of 2016 national polls, and it's apparent that the polls conducted closer to the election gave better results than those conducted earlier (unsurprisingly) as I've shown in the chart below. Of course, the US does not elect presidents on the popular vote, so this point is of academic interest.

(Based on data from Wikipedia.)

The state polls are a different matter. First off, we have to understand that polls aren't conducted in every state. Wyoming is very, very Republican and as a result, few people would pay for a poll there - no newspaper is going to get a headline from "Republican leads in Wyoming". Obviously, the same thing applies to very, very Democratic states. Polls are conducted more often in hotly contested areas with plenty of electoral college votes. So how did the state polls actually do in 2016? To keep things simple, I'll look at the results from the poll aggregator Sam Wang and compare them to the actual results. The poll aggregation missed in these states:


State            Election spread (Trump - Clinton)    Poll aggregator spread (Trump - Clinton)
Florida          1.2%                                  -1.5%
North Carolina   3.66%                                 -1%
Pennsylvania     0.72%                                 -2.5%
Michigan         0.23%                                 -2.5%
Wisconsin        0.77%                                 < -5%

Poll aggregators use different error models for calculating their aggregated margin of error, but typically they'll vary from 2-3%. A few of these results are outside of the margin of error, but more tellingly, they're all in the same direction.  A wider analysis looking at all the state results shows the same pattern. The polls were biased in favor of Clinton, but why?
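
Here's a small sketch that makes the same point from the table above: take a typical aggregated margin of error of about 3 points, compute each state's polling error (actual spread minus aggregated poll spread), and look at the signs.

    # Trump-minus-Clinton spreads from the table above: (actual result, poll aggregate).
    states = {
        "Florida": (1.2, -1.5),
        "North Carolina": (3.66, -1.0),
        "Pennsylvania": (0.72, -2.5),
        "Michigan": (0.23, -2.5),
        "Wisconsin": (0.77, -5.0),   # the aggregate showed Clinton by more than 5
    }

    AGGREGATED_MARGIN = 3.0  # a typical aggregated margin of error, in points

    for state, (actual, polled) in states.items():
        error = actual - polled
        verdict = "outside" if abs(error) > AGGREGATED_MARGIN else "within"
        print(f"{state}: error {error:+.1f} points ({verdict} ±{AGGREGATED_MARGIN} points)")

A couple of the misses fall inside a ±3-point margin and a few fall outside it, but all five errors have the same sign (they all understate Trump), which is what points to bias rather than bad luck.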

Why they got it wrong

In the aftermath of the election, the American Association for Public Opinion Research created an ad-hoc commission to understand what went wrong. The AAPOR published their findings and I'm going to provide a summary here.

Quoting directly from the report, late changes in voter decisions led earlier polls to overestimate Clinton's support: 

"Real change in vote preference during the final week or so of the campaign. About 13 percent of voters in Wisconsin, Florida and Pennsylvania decided on their presidential vote choice in the final week, according to the best available data. These voters broke for Trump by near 30 points in Wisconsin and by 17 points in Florida and Pennsylvania."

The polls oversampled those with college degrees and undersampled those without: "In 2016 there was a strong correlation between education and presidential vote in key states. Voters with higher education levels were more likely to support Clinton. Furthermore, recent studies are clear that people with more formal education are significantly more likely to participate in surveys than those with less education. Many polls – especially at the state level – did not adjust their weights to correct for the over-representation of college graduates in their surveys, and the result was over-estimation of support for Clinton."

The report also suggests that the "shy Trump voter" effect may have played a part.

Others also investigated the result, and a very helpful paper by Kennedy et al provides some key supporting data. Kennedy also states that voter education was a key factor, and shows charts illustrating the connection between education and voting in 2012 and 2016. As you might expect, education had little influence in 2012, but in 2016 it was a strong influence. In 2016, most state-level polls did not adjust for education.

Although the polls in New Hampshire called the results correctly, they predicted a much larger win for Clinton. Kennedy quotes Andrew Smith, a UNH pollster, and I'm going to repeat the quote here because it's so important: "We have not weighted by level of education in our election polling in the past and we have consistently been the most accurate poll in NH (it hasn’t made any difference and I prefer to use as few weights as possible), but we think it was a major factor this year. When we include a weight for level of education, our predictions match the final number."

Kennedy also found good evidence of a late swing to Trump that was not caught by polls conducted earlier in the campaign.

On the whole, there does seem to be agreement that two factors were important in 2016:

  • Voter education. In previous elections, it didn't matter; in this one, it did. State-level polls on the whole didn't control for it.
  • Late swing to Trump missed by earlier polls.

2020 and beyond

The pollsters' business depends on making accurate forecasts and elections are the ultimate high-profile test of the predictive power of polls. There's good evidence that at least some pollsters will correct for education in this election, but what if there's some other factor that's important, for example, housing type, or diet, or something else? How will we be able to spot bias during an election campaign? The answer is, we can't. What we can do is assume the result is a lot less certain than the pollsters, or the poll aggregators, claim.

Commentary

In the run-up to the 2016 election, I created an opinion poll-aggregation model. My model was based on the work of Sam Wang and used election probabilities. I was disturbed by how quickly a small spread in favor of a candidate gave a very high probability of winning; the election results always seemed more uncertain to me. Textbook poll aggregation models reduced the uncertainty still further.

The margin of error quoted by pollsters is just the sampling error assuming random sampling. But sampling isn't wholly random and there may be house effects or election-specific effects that bias the results. Pollsters and others make the assumption that these effects are zero, which isn't the case. Of course, pollsters change their methodology with each election to avoid previous mistakes. The upshot is, it's almost impossible to assess the size of these non-random bias effects during an election. My feeling is, opinion poll results are a lot less certain than the quoted margin of error, and a 'real' margin of error may be much greater.
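
One way to act on that feeling is to add an explicit bias term to the error model. The sketch below is purely illustrative: the 3-point bias scale is an assumption I've picked to make the point, not an estimate, but it shows how quickly an allowance for systematic bias deflates an apparently confident forecast.

    import math

    def win_probability(spread, sigma_spread):
        """P(spread > 0) under a normal model for the spread."""
        return 0.5 * (1 + math.erf(spread / (sigma_spread * math.sqrt(2))))

    spread = 0.02           # candidate ahead by 2 points
    sigma_sampling = 0.02   # sampling-only standard error of the spread
    sigma_bias = 0.03       # assumed scale of systematic (house/election) bias

    print(f"Sampling error only: {win_probability(spread, sigma_sampling):.0%}")
    sigma_total = math.sqrt(sigma_sampling**2 + sigma_bias**2)
    print(f"With the assumed bias term: {win_probability(spread, sigma_total):.0%}")

With sampling error alone, the 2-point lead translates to roughly an 84% chance of winning; with the assumed bias term added, it drops to roughly 71%.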

The lesson for poll aggregators like me is to allow for other biases and uncertainty in our models. To his great credit, Nate Silver is ahead here as he is in so many other areas.


Wednesday, August 12, 2020

Who will win the election? Election victory probabilities from opinion polls

Polls to probabilities

How likely is it that your favorite candidate will win the election? If your candidate is ahead of their opponent by 5%, are they certain to win? What about 10%? Or if they're down by 2%, are they out of the race? Victory probabilities are related to how far ahead or behind a candidate is in the polls, but the relationship isn't a simple one and has some surprising consequences as we'll see.

Opinion poll example

Let's imagine there's a hard-fought election between candidates A and B. A newspaper publishes an opinion poll a few days before the election:

  • Candidate A: 52%
  • Candidate B: 48%
  • Sample size: 1,000

Should candidate A's supporters pop the champagne and candidate B's supporters start crying?

The spread and standard error

Let's use some standard notation. From the theory of proportions, the mean and standard error for the proportion of respondents who chose A is:

\[ p_a = {n_a \over n} \] \[ \sigma_a = { \sqrt {{p_a(1-p_a)} \over n}} \]

where \( n_a \) is the number of respondents who chose A and \( n \) is the total number of respondents. If the proportion of people who answered candidate B is \(p_b\), then obviously, \( p_a + p_b = 1\).

Election probability theory usually uses the spread, \(d\), which is the difference between the candidates: \[d = p_a - p_b = 2p_a - 1 \] From statistics theory, the standard error of \( d \) is: \[\sigma_d = 2\sigma_a\] (these relationships are easy to prove but a bit tedious; if anyone asks, I'll show the proof).

Obviously, for a candidate to win, their spread, \(d\), must be > 0.

Everything is normal

From the central limit theorem (CLT), we know \(p_a\) and \(p_b\) are normally distributed, and also from the CLT, we know \(d\) is normally distributed. The next step to probability is viewing the normal distribution for candidate A's spread. The chart below shows the normal distribution with mean \(d\) and standard error \(\sigma_d\).

As with most things with the normal distribution, it's easier if we transform everything to the standard normal using the transformation: \[z = {(x - d) \over \sigma_d}\] The chart below is the standard normal representation of the same data.

The standard normal form of this distribution is a probability density function. We want the probability that \(d>0\) which is the light green shaded area, so it's time to turn to the cumulative distribution function (CDF), and its complement, the complementary cumulative distribution function (CCDF).

CDF and CCDF

The CDF gives us the probability that we will get a result less than or equal to some value I'll label \(z_c\). We can write this as: \[P(z \leq z_c) = CDF(z_c) = \phi(z_c) \] The CCDF is defined so that: \[1 = P(z \leq z_c) + P(z > z_c)= CDF(z_c) + CCDF(z_c) = \phi(z_c) + \phi_c(z_c)\] This is a long-winded way of saying the CCDF is defined as: \[CCDF(z_c) = P(z \gt z_c) = \phi_c(z_c)\]

The CDF is the integral of the PDF, and from standard textbooks: \[ \phi(z_c) = {1 \over 2} \left( 1 + erf\left( {z_c \over \sqrt2} \right) \right) \] We want the CCDF,  \(P(z > z_c)\), which is simply 1 - CDF.

Our critical value occurs when the spread is zero. The transformation to the standard normal in this case is: \[z_c = {(0 - d) \over \sigma_d} = {-d \over \sigma_d}\] We can write the CCDF as: \[\phi_c(z_c) = 1 - \phi(z_c) = 1 - {1 \over 2} \left( 1 + erf\left( {z_c \over \sqrt2} \right) \right) \] \[= 1 - {1 \over 2} \left( 1 + erf\left( {-d \over {\sigma_d\sqrt2}} \right) \right)\] We can easily show that: \[erf(x) = -erf(-x)\] Using this relationship, we can rewrite the above equation as: \[ P(d > 0) = {1 \over 2} \left( 1 + erf\left( {d \over {\sigma_d\sqrt2}} \right) \right)\]

What we have is an equation that takes data we've derived from an opinion poll and gives us a probability of a candidate winning.
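
Here's that equation as a short Python function, assuming simple random sampling; feeding in the example poll reproduces the probabilities listed in the next section.

    import math

    def win_probability(p_a, n):
        """P(candidate A wins), i.e. P(d > 0), from a single poll of size n,
        assuming random sampling and the normal approximation above."""
        sigma_a = math.sqrt(p_a * (1 - p_a) / n)
        d = 2 * p_a - 1        # the spread
        sigma_d = 2 * sigma_a  # standard error of the spread
        return 0.5 * (1 + math.erf(d / (sigma_d * math.sqrt(2))))

    print(f"Candidate A: {win_probability(0.52, 1000):.0%}")  # about 90%
    print(f"Candidate B: {win_probability(0.48, 1000):.0%}")  # about 10%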

Probabilities for our example

For candidate A:

  • \(n=1000\)
  • \( p_a = {520 \over 1000} = 0.52 \)
  • \(\sigma_a = 0.016 \)
  • \(d = {{520 - 480} \over 1000} = 0.04\)
  • \(\sigma_d = 0.032\)
  • \(P(d > 0) = 90\%\)

For candidate B:

  • \(n=1000\)
  • \( p_b = {480 \over 1000} = 0.48 \)
  • \(\sigma_b = 0.016 \)
  • \(d = {{480 - 520} \over 1000} = -0.04\)
  • \(\sigma_d = 0.032\)
  • \(P(d > 0) = 10\%\)

Obviously, the two probabilities add up to 1. But note the probability for candidate A. Did you expect a number like this? A 4% point lead in the polls giving a 90% chance of victory?

Some consequences

Because the probability is based on \( erf \), you can quite quickly get to highly probable events as I'm going to show in an example. I've plotted the probability for candidate A for various leads (spreads) in the polls. Most polls nowadays tend to have about 800 or so respondents (some are more and some are a lot less), so I've taken 800 as my poll size. Obviously, if the spread is zero, the election is 50%:50%. Note how quickly the probability of victory increases as the spread increases.

What about the size of the poll, how does that change things? Let's fix the spread to 2% and vary the size of the poll from 200 to 2,000 (the usual upper and lower bounds on poll sizes). Here's how the probability varies with poll size for a spread of 2%.
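
If you want to reproduce those two charts, here's a sketch that sweeps the spread for a poll of 800 and then the poll size for a fixed 2% spread, using the same formula as before, just parameterized by the spread.

    import math

    def win_probability_from_spread(d, n):
        """P(d > 0) for a poll of size n with spread d, assuming random sampling."""
        p_a = (1 + d) / 2
        sigma_d = 2 * math.sqrt(p_a * (1 - p_a) / n)
        return 0.5 * (1 + math.erf(d / (sigma_d * math.sqrt(2))))

    # Probability of victory for various leads, poll of 800 respondents.
    for d in [0.0, 0.02, 0.04, 0.06, 0.08, 0.10]:
        print(f"spread {d:.0%}: P(win) = {win_probability_from_spread(d, 800):.0%}")

    # Fix the spread at 2% and vary the poll size.
    for n in [200, 500, 800, 1200, 2000]:
        print(f"n = {n}: P(win) = {win_probability_from_spread(0.02, n):.0%}")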

Now imagine you're a cynical and seasoned poll analyst working on candidate A's campaign. The young and excitable intern comes rushing in, shouting to everyone that A is ahead in the polls! You ask the intern two questions, and then, like the Oracle at Delphi, you predict happiness or not. What two questions do you ask?

  • What's the spread?
  • What's the size of the poll?

What's missing

There are two elephants in the room, and I've been avoiding talking about them. Can you guess what they are?

All of this analysis assumes the only source of error is random noise. In other words, there's no systemic bias. In the real world, that's not true. Polls aren't wholly based on random sampling, and the sampling method can introduce bias. I haven't modeled it at all in this analysis. There are at least two systemic biases:

  • Pollster house effects arising from house sampling methods
  • Election effects arising from different population groups voting in different ways compared to previous elections.

Understanding and allowing for bias is key to making a successful election forecast. This is an advanced topic for another blog post.

The other missing item is more subtle. It's undecided voters. Imagine there are two elections and two opinion polls. Both polls have 1,000 respondents.

Election 1:

  • Candidate A chosen by 20%
  • Candidate B chosen by 10%
  • Undecided voters are 70%
  • Spread is 10%

Election 2:

  • Candidate A chosen by 55%
  • Candidate B chosen by 45%
  • Undecided voters are 0%
  • Spread is 10%

In both elections, the spread from the polls is 10%, so candidate A has the same probability of winning in both, but this doesn't seem right. Intuitively, we should be less certain about an election with a high number of undecided voters. Modeling undecided voters is a topic for another blog post!

Reading more

The best source of election analysis I've read is in the book "Introduction to data science" and the associated edX course "Inference and modeling", both by Rafael Irizarry. The analysis in this blog post was culled from multiple books and websites, each of which only gave part of the story.


Monday, August 3, 2020

Sampling the goods: how opinion polls are made

How opinion polls work on the ground

I worked as a street interviewer for an opinion polling organization and I know how opinion polls are made and executed. In this blog post, I'm going to explain how opinion polls were run on the ground,  educate you on why polls can go wrong, and illustrate how difficult it is to run a valid poll. I'm also going to tell you why everything you learned from statistical textbooks about polling is wrong.


(Image Credit: Wikimedia Commons, License: Public Domain)

Random sampling is impossible

In my experience, this is something that's almost never mentioned in statistics textbooks but is a huge issue in polling. If they talk about sampling at all, textbooks assume random sampling, but that's not what happens.

Random sampling sounds wonderful in theory, but in practice, it can be very hard; people aren't beads in an urn. How do you randomly select people on the street or on the phone - what's the selection mechanism? How do you guard against bias? Let me give you some real examples.

Imagine you're a street interviewer. Where do you stand to take your random sample? If you take your sample outside the library, you'll get a biased sample. If you take it outside the factory gates, or outside a school, or outside a large office complex, or outside a playground, you'll get another set of biases. What about time of day? The people out on the streets at 7am are different from the people at 10am and different from the people at 11pm.

Similar logic applies to phone polls. If you call landlines only, you'll get one set of biases. If you call people during working hours, your sample will be biased (is the mechanic fixing a car going to put down their power tool to talk to you?). But calling outside of office hours means you might not get shift workers or parents putting their kids to bed. The list goes on.

You might be tempted to say, do all the things: sample at 10am, 3pm, and 11pm; sample outside the library, factory, and school; call on landlines and mobile phones, and so on, but what about the cost? How can you keep opinion polls affordable? How do you balance calls at 10am with calls at 3pm?

Because there are very subtle biases in "random" samples, most of the time, polling organizations don't do wholly 'random' sampling.

Sampling and quotas

If you can't get a random sample, you'd like your sample to be representative of a population. Here, representative means that it will behave in the same way as the population for the topics you're interested in, for example, voting in the same way or buying butter in the same way. The most obvious way of sampling is demographics: age and gender etc.

Let's say you were conducting a poll in a town to find out residents' views on a tax increase. You might find out the age and gender demographics of the town and sample people in a representative way so that the demographics of your sample match the demographics of the town. In other words, the proportion of men and women in your sample matches that of the town, the age distribution matches that of the town, and so on.


(US demographics. Image credit: Wikimedia Commons. License: Public domain)

In practice, polling organizations use a number of sampling factors depending on the survey. They might include sampling by:

  • Gender
  • Age
  • Ethnicity
  • Income
  • Social class or employment category 
  • Education 

but more likely, some combination of them.

In practice, interviewers may be given a sheet outlining the people they should interview, for example, so many women aged 45-50, so many people with degrees, so many people earning over $100,000, and so on. This is often called a quota. Phone interviews might be conducted on a pre-selected list of numbers, with guidance on how many times to call back, etc.

Some groups of people can be very hard to reach, and of course, not everyone answers questions. When it comes to analysis time, the results are weighted to correct bias.  For example, if the survey could only reach 75% of its target for men aged 20-25, the results for men in this category might be weighted by 4/3.
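
Here's a toy sketch of that weighting step. The target and achieved shares below are made-up numbers: the men aged 20-25 group came in at 75% of its target, so its responses get a weight of 4/3, exactly as in the example above.

    # Made-up demographic shares for illustration.
    target_share   = {"men 20-25": 0.08, "women 20-25": 0.08, "everyone else": 0.84}
    achieved_share = {"men 20-25": 0.06, "women 20-25": 0.08, "everyone else": 0.86}

    # Weight each group by how far short (or over) the sample fell.
    weights = {group: target_share[group] / achieved_share[group] for group in target_share}
    for group, weight in weights.items():
        print(f"{group}: weight {weight:.2f}")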

Who do you talk to?

Let's imagine you're a street interviewer: you have your quota to fulfill and you're interviewing people on the street. Who do you talk to? Let me give you a real example from my polling days; I needed a man aged 20-25 for my quota. On the street, I saw what looked like a typical and innocuous student, but I also saw an aggressive-looking skinhead in full skinhead clothing and boots. Who would you choose to interview?

(Image credit: XxxBaloooxxx via Wikimedia Commons. License: Creative Commons.)

Most people would choose the innocuous student, but that's introducing bias. You can imagine multiple interviewers making similar decisions resulting in a heavily biased sample. To counter this problem, we were given guidance on who to select, for example, we were told to sample every seventh person or to take the first person who met our quota regardless of their appearance. This at least meant we were supposed to ask the skinhead, but of course, whether he chose to reply or not is another matter.

The rules sometimes led to absurdity. I did a survey where I was supposed to interview every 10th person who passed by. One man volunteered, but I said no because he was the 5th person. He hung around so long that eventually, he became the 10th person to pass me by. Should I have interviewed him? He met the rules and he met my sampling quota.

I came across a woman who was exactly what I needed for my quota. She was a care worker who had been on a day trip with severely mentally handicapped children and was in the process of moving them from the bus to the care home. Would you take her time to interview her? What about the young parent holding his child when I knocked on the door? The apartment was clearly used for recent drug-taking. Would you interview him? 

As you might expect, interviewers interpreted the rules more flexibly as the deadline approached and as it got later in the day. I once interviewed a very old man whose wife answered all the questions for him. This is against the rules, but he agreed with her answers, it was getting late, and I needed his gender/age group/employment status for my quota.

The company sent out supervisors to check our work on the streets, but of course, supervisors weren't there all the time, and they tended to vanish after 5pm anyway.

The point is, when it comes to it, there's no such thing as random sampling. Even with quotas and other guided selection methods, there are a thousand ways for bias to creep into sampling and the biases can be subtle. The sampling methodology one company uses will be different from another company's, which means their biases will not be the same.

What does the question mean?

One of the biggest lessons I learned was the importance of clear and unambiguous questions, and the unfortunate creativity of the public. All of the surveys I worked on had clearly worded questions, and to me, they always seemed unambiguous. But once you hit the streets, it's a different world. I've had people answer questions with the most astonishing levels of interpretation and creativity; regrettably, their interpretations were almost never what the survey wanted. 

What surprised me was how willing people were to answer difficult questions about salary and other topics. If the question is worded well (and I know all the techniques now!), you can get strangers to tell you all kinds of things. In almost all cases, I got people to tell me their age, and when required, I got salary levels from almost everyone.

A well-worded question led to a revelation that shocked me and shook me out of my complacency.  A candidate had unexpectedly just lost an election in the East End of London and the polling organization I worked for had been contracted to find out why. To help people answer one of the questions, I had a card with a list of reasons why the candidate lost, including the option: "The candidate was not suitable for the area." A lot of people chose that as their reason. I was naive and didn't know what it meant, but at the end of the day, I interviewed a white man in pseudo-skinhead clothes, who told me exactly what it meant. He selected "not suitable for the area" as his answer and added: "She was black, weren't she?".

The question setters weren't naive. They knew that people would hesitate before admitting racism was the cause, but by carefully wording the question and having people choose from options, they provided a socially acceptable way for people to answer the question.

Question setting requires real skill and thought.

(Oddly, there are very few technical resources on wording questions well. The best I've found is "The Art of Asking Questions" by Stanley Le Baron Payne, but the book has been out of print for a long time.)

Order, order

Question order isn't accidental either; you can bias a survey by the order in which you ask questions. Of course, you also have to avoid leading questions. The textbook example is survey questions on gun control. Let's imagine two surveys with these questions:

Survey 1:
  • Are you concerned about violent crime in your neighborhood?
  • Do you think people should be able to protect their families?
  • Do you believe in gun control?

Survey 2:
  • Are you concerned about the number of weapons in society?
  • Do you think all gun owners secure their weapons?
  • Do you believe in gun control?

What answers do you think you might get?

As well as avoiding bias, question order is important to build trust, especially if the topic is a sensitive one. The political survey I did in the East End of London was very carefully constructed to build the respondent's trust to get to the key 'why' question. This was necessary for other surveys too. I did a survey on police recruitment, but as I'm sure you're aware, some people are very suspicious of the police. Once again, the survey was constructed so the questions that revealed it was about police recruitment came later on after the interviewer (me!) had built some trust with the respondent.

How long is the survey?

This is my favorite story from my polling days. I was doing a survey on bus transport in London and I was asked to interview people waiting for a bus. The goal of the survey was to find out where people were going so London could plan for new or changed bus routes. For obvious reasons, the set of questions was shorter than usual, but in practice, not short enough; a big fraction of my interviews were cut short because the bus turned up! In several cases, I was asking questions as people were getting on the bus, and in a few cases, we had a shouted back-and-forth to finish the survey before the bus pulled away out of earshot.


(Image credit: David McKay via Wikimedia Commons. License: Creative Commons)

To avoid exactly this sort of problem, most polling organizations use pilot surveys. These are test surveys done on a handful of people to debug the survey. In this case, the pilot should have uncovered the fact that the survey was too long, but regrettably, it didn't.

(Sometime later, I designed and executed a survey in Boston. I did a pilot survey and found that some of my questions were confusing and I could shorten the survey by using a freeform question rather than asking for people to choose from a list. In any survey of more than a handful of respondents, I strongly recommend running a pilot - especially if you don't have a background in polling.)

The general lesson for any survey is to keep it as short as possible and understand the circumstances people will be in when you're asking them questions.

What it all means - advice for running surveys

Surveys are hard. It's hard to sample right, it's hard to write questions well, and it's hard to order questions to avoid bias. 

Over the years, I've sat in meetings when someone has enthusiastically suggested a survey. The survey could be a HR survey of employees, or a marketing survey of customers, or something else. Usually, the level of enthusiasm is inversely related to survey experience. The most enthusiastic people are often very resistant to advice about question phrasing and order, and most resistant of all to the idea of a pilot survey. I've seen a lot of enthusiastic people come to grief because they didn't listen. 

If you're thinking about running a survey, here's my advice.

  • Make your questions as clear and unambiguous as you can. Get someone who will tell you you're wrong to review them.
  • Think about how you want the questions answered. Do you want freeform text, multiple choice, or a scale? Surprisingly, in some cases, free form can be faster than multiple choice.
  • Keep it short.
  • Always run a pilot survey. 

What it means - understanding polling results

Once you understand that polling organizations use customized sampling methodologies, you can understand why polling organizations can get the results wrong. To put it simply, if their sampling methodology misses a crucial factor, they'll get biased results. The most obvious example is state-level polling in the US 2016 Presidential Election, but there are a number of other polls that got very different results from the actual election. In a future blog post, I'll look at why the 2016 polls were so wrong and why polls were wrong in other cases too.
