
Monday, October 12, 2020

Fundamentally wrong? Using economic data as an election predictor

What were you thinking?

Think back to the last time you voted. Why did you vote the way you did? Here are some popular reasons; how many apply to you?

  • The country's going in the wrong direction, we need something new.
  • My kind of people vote for party X, or my kind of people never vote for party Y.
  • I'm a lifelong party X voter.
  • Candidate X or party X is best suited to running the country right now.
  • Candidate Y or party Y will ruin the country.
  • Candidate X or party X is the best for defense/the economy/my children's education, and that's what's important to me right now.

(Ballot drop box. Image Source: Wikimedia Commons. Author: Paul Sableman. License: Creative Commons.)

Using fundamentals to forecast elections

In political science circles, there's been a movement to use economic data to forecast election results. The idea is, homo economicus is a rational being whose voting behavior depends on his or her economic conditions. If the economy is going well, then incumbents (or incumbent parties) are reelected; if things are going badly, then challengers are elected instead. If this assertion is true, then people will respond rationally and predictably to changing economic circumstances, and if we understand how the economy is changing, we can forecast who will win elections.

Building models based on fundamentals follows a straightforward process:

  1. Choose an economic indicator (e.g. inflation, unemployment, GDP) and see how well it forecasts elections.
  2. Get it wrong for an election.
  3. Add another economic indicator so the model correctly predicts the election it previously got wrong.
  4. Get it wrong for an election.
  5. Either re-adjust the model weights or go to 3.

These models can get very sophisticated. In the United States, some of the models include state-level data and make state-level forecasts of results.
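To make the process concrete, here's a minimal sketch of a fundamentals-style model in Python. Everything here - the indicators, the historical numbers, and the fitted weights - is made up for illustration; real models use far more data and more sophisticated fitting.

```python
import numpy as np

# Hypothetical training data: one row per past election.
# Columns: GDP growth (%), inflation (%), unemployment (%).
X = np.array([[2.5, 1.9, 5.0],
              [0.8, 3.2, 7.4],
              [3.1, 2.1, 4.6],
              [1.2, 4.0, 6.8],
              [2.9, 1.5, 5.3]])

# Incumbent party's share of the two-party vote in those elections (made up).
y = np.array([53.2, 46.1, 54.8, 47.5, 52.0])

# Fit a weight for each indicator (plus an intercept) by least squares.
A = np.column_stack([X, np.ones(len(X))])
weights, *_ = np.linalg.lstsq(A, y, rcond=None)

# Forecast the next election from this year's (made-up) indicators.
this_year = np.array([1.5, 2.8, 6.1, 1.0])  # final 1.0 is the intercept term
print(f"Forecast incumbent vote share: {this_year @ weights:.1f}%")
```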

What happens in practice

Two University of Colorado professors, Berry and Bickers, followed this approach to forecast the 2012 presidential election. They carefully analyzed elections back to 1980 using state-level economic data. Their model was detailed and thorough, and they helpfully included various statistical metrics to guide the reader through the model uncertainties. Their forecast was very clear: Romney would win 330 electoral college votes - a very strong victory. As a result, they became darlings of the Republican party.

Unfortunately for them, things didn't work out that way. The actual result was 332 electoral college votes for Obama and 206 for Romney, an almost complete reversal of their forecast.

In a subsequent follow-up (much shorter than their original paper), the professors argued in essence that although the economy had performed poorly, voters didn't blame Obama for it. In other words, the state of the economy was not a useful indicator for the 2012 election, even considering state-level effects.

This kind of failure is very common for fundamentals. While Nate Silver was at the New York Times, he published a long piece on why and how these models fail. To cut to the chase, there is no evidence voters are homo economicus when it comes to voting. All kinds of factors affect how someone votes, not just economic ones. There are cultural, social class, educational, and many other factors at work.

Why these models fail - post hoc ergo propter hoc and spurious correlations

The post hoc fallacy is to assume that because X follows Y, Y must cause X. In election terms, the fundamentalists assume that an improving or declining economy leads to certain types of election results. However, as we've said, there are many factors that affect voting. Take George W. Bush's approval rating: in the aftermath of 9/11, it peaked at around 88%, and he won re-election in 2004. Factors other than the economy were clearly at work.

A related phenomenon is spurious correlations, which I've blogged about before. Spurious correlations occur when two unrelated phenomena show the same trend and are correlated, but one does not cause the other. Tyler Vigen has a great website that shows many spurious correlations.

Let's imagine you're a political science researcher. You have access to large amounts of economic data and you can direct your graduate students to find more. What you can do is trawl through your data set to find economic or other indicators that correlate with election results. To build your model, you weight each factor differently; for example, inflation might have a weighting of 0.7 and unemployment 0.9. Or you could even have time-varying weights. You can then test your model against existing election results and publish your forecast for the next election cycle. This process is almost guaranteed to find spurious correlations and produce models that don't forecast very accurately.
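You can demonstrate the danger to yourself. This sketch (all data randomly generated, purely illustrative) trawls 1,000 meaningless 'indicators' for the one that best correlates with a short series of election results - and chance alone almost always produces an impressively strong correlation:

```python
import numpy as np

rng = np.random.default_rng(42)

# A made-up 'target': incumbent vote share over 10 elections.
vote_share = rng.normal(50, 3, size=10)

# Trawl 1,000 random 'economic indicators' and keep the best correlate.
best_r = 0.0
for _ in range(1000):
    indicator = rng.normal(0, 1, size=10)
    r = np.corrcoef(indicator, vote_share)[0, 1]
    if abs(r) > abs(best_r):
        best_r = r

# With this many tries, |r| > 0.8 is typical - despite zero causal link.
print(f"Best correlation found: {best_r:.2f}")
```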

Forecasting using odd data happens elsewhere, but usually, more entertainingly. Paul the Octopus had a good track record of forecasting the 2010 World Cup and other football results - Wikipedia says he had an 85.7% success rate. How was he so successful? Probably dumb luck. Bear in mind, many animals have been used for forecasting and we only hear about the successful ones.



(Paul the Octopus at work. Image source: Wikimedia Commons. License: Creative Commons.)

To put it simply, models built with economic data alone are highly susceptible to error because there is no evidence voters consider economic factors in the way that proponents of these models suggest. 

All models are wrong - some are useful

The statistician George Box is supposed to have said, "all models are wrong, some are useful". The idea is simple: the simplifications involved in model building often reduce fidelity, but some models still produce useful (actionable) results. All election forecast models are just that, forecast models that may be right or wrong. The question is, how useful are they?

Let's imagine that a fundamentals model were an accurate forecaster. We would have to accept that campaigns have little or no effect on the outcome. But this is clearly at odds with reality. The polling data indicates that the 2016 US presidential election changed course in the closing weeks of the campaign. Perhaps most famously, the same thing happened in 1948. One of the key issues in the 2004 US presidential election was the 'war on terror'. This isn't an economic effect, and it's not at all clear how it could be reduced to a number.

In other words, election results depend on more than economic effects and may depend on factors that are hard to quantify.

To attempt to quantify these effects, we could turn to opinion polls. In 2004, we could have asked voters about their view of the war on terror and we could have factored that into a fundamentals model. But why not just ask them how they intend to vote?


(Paul the Octopus died and was memorialized by a statue. How many other forecasters will get statues? Image Source: Wikimedia Commons. Author: Christophe95. License: Creative Commons.)

Where I stand

I'm reluctant to throw the baby out with the bathwater. I think fundamentals may have some effect, but it's heavily moderated by other factors and by what happens during the campaign. Maybe their best use is to give politicians some idea of the factors that might be important in a campaign. But as the UK Highway Code says of the green traffic light, it doesn't mean go, it means "proceed with caution".


Wednesday, October 7, 2020

Opinion polling blog posts

Why a 'greatest hits' polling blog post?

Over the past few months, I've blogged about elections and opinion polling several times. On October 8th, 2020, I gave a talk at PyData Boston on forecasting US presidential elections, and I thought I would bring these blog posts together into one convenient place so the people at the talk could more easily find them.

(Mexican bird men dancing on a pole. I subtitled my talk on opinion polls 'poll dancing' - and I'm sure I disappointed my audience as a result. Image credit: Wikimedia Commons. License: Creative Commons. Author: Juan Felipe Rios.)

Polling

Can you believe the polls? - fake polls, leading questions, and other sins of opinion polling.

President Hillary Clinton: what the polls got wrong in 2016 and why they got it wrong - why the polls said Clinton would win and why Trump did.

Poll-axed: disastrously wrong opinion polls - a brief romp through some disastrously wrong opinion poll results.

Sampling the goods: how opinion polls are made - my experiences working for an opinion polling company as a street interviewer.

Probability theory

Who will win the election? Election victory probabilities from opinion polls - a quick derivation of a key formula and an explanation of why random sampling alone underestimates the uncertainty.

US democracy

These blog posts provide some background on US presidential elections.

The Electoral College for beginners - the post explains how the electoral college works and how it came to be.

Finding electoral fraud - the democracy data deficit - the post looks at the evidence (or the lack of it) for vote fraud and suggests a way citizen-analysts can contribute to American democracy.

Silkworm - lessons learned from a BI app in Python

Faster Python BI app development through code generation - how I generated the code for the Silkworm project and why I did it.

Monday, August 3, 2020

Sampling the goods: how opinion polls are made

How opinion polls work on the ground

I worked as a street interviewer for an opinion polling organization, so I know how opinion polls are made and executed. In this blog post, I'm going to explain how opinion polls are run on the ground, educate you on why polls can go wrong, and illustrate how difficult it is to run a valid poll. I'm also going to tell you why everything you learned from statistical textbooks about polling is wrong.


(Image Credit: Wikimedia Commons, License: Public Domain)

Random sampling is impossible

In my experience, this is something that's almost never mentioned in statistics textbooks but is a huge issue in polling. If they talk about sampling at all, textbooks assume random sampling, but that's not what happens.

Random sampling sounds wonderful in theory, but in practice, it can be very hard; people aren't beads in an urn. How do you randomly select people on the street or on the phone - what's the selection mechanism? How do you guard against bias? Let me give you some real examples.

Imagine you're a street interviewer. Where do you stand to take your random sample? If you take your sample outside the library, you'll get a biased sample. If you take it outside the factory gates, or outside a school, or outside a large office complex, or outside a playground, you'll get another set of biases. What about time of day? The people out on the streets at 7am are different from the people at 10am and different from the people at 11pm.

Similar logic applies to phone polls. If you call landlines only, you'll get one set of biases. If you call people during working hours, your sample will be biased (is the mechanic fixing a car going to put down their power tool to talk to you?). But calling outside of office hours means you might not get shift workers or parents putting their kids to bed. The list goes on.

You might be tempted to say, do all the things: sample at 10am, 3pm, and 11pm; sample outside the library, factory, and school; call on landlines and mobile phones, and so on, but what about the cost? How can you keep opinion polls affordable? How do you balance calls at 10am with calls at 3pm?

Because there are very subtle biases in "random" samples, most of the time, polling organizations don't do wholly 'random' sampling.

Sampling and quotas

If you can't get a random sample, you'd like your sample to be representative of the population. Here, representative means that it will behave in the same way as the population for the topics you're interested in, for example, voting in the same way or buying butter in the same way. The most obvious way of sampling is by demographics: age, gender, and so on.

Let's say you were conducting a poll in a town to find out residents' views on a tax increase. You might find out the age and gender demographics of the town and sample people in a representative way so that the demographics of your sample match the demographics of the town. In other words, the proportion of men and women in your sample matches that of the town, the age distribution matches that of the town, and so on.
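As a sketch of the idea (the proportions below are invented), turning a town's demographics into sampling targets is just multiplication:

```python
# Hypothetical town demographics as proportions of the population.
town = {("F", "18-34"): 0.16, ("F", "35-64"): 0.22, ("F", "65+"): 0.12,
        ("M", "18-34"): 0.15, ("M", "35-64"): 0.23, ("M", "65+"): 0.12}

sample_size = 500

# Target for each demographic cell: its share of the town times the sample size.
targets = {cell: round(p * sample_size) for cell, p in town.items()}
print(targets)  # e.g. 80 women aged 18-34, 110 women aged 35-64, ...
```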


(US demographics. Image credit: Wikimedia Commons. License: Public domain)

In practice, polling organizations use a number of sampling factors depending on the survey. They might include sampling by:

  • Gender
  • Age
  • Ethnicity
  • Income
  • Social class or employment category 
  • Education 

but more likely, some combination of them.

In practice, interviewers may be given a sheet outlining the people they should interview, for example, so many women aged 45-50, so many people with degrees, so many people earning over $100,000, and so on. This is often called a quota. Phone interviews might be conducted on a pre-selected list of numbers, with guidance on how many times to call back, etc.

Some groups of people can be very hard to reach, and of course, not everyone answers questions. When it comes to analysis time, the results are weighted to correct bias.  For example, if the survey could only reach 75% of its target for men aged 20-25, the results for men in this category might be weighted by 4/3.
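Here's that weighting calculation as a minimal sketch, using made-up target and achieved counts; the men aged 20-25 cell reproduces the 4/3 example above:

```python
# Hypothetical quota targets and the counts actually achieved in the field.
target   = {("M", "20-25"): 40, ("F", "20-25"): 40}
achieved = {("M", "20-25"): 30, ("F", "20-25"): 40}

# Weight each group by target / achieved, so under-reached groups count
# for more: reaching only 75% of target gives a weight of 40/30 = 4/3.
weights = {cell: target[cell] / achieved[cell] for cell in target}
print(weights)  # {('M', '20-25'): 1.333..., ('F', '20-25'): 1.0}
```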

Who do you talk to?

Let's imagine you're a street interviewer: you have your quota to fulfill and you're interviewing people on the street. Who do you talk to? Let me give you a real example from my polling days; I needed a man aged 20-25 for my quota. On the street, I saw what looked like a typical and innocuous student, but I also saw an aggressive-looking skinhead in full skinhead clothing and boots. Who would you choose to interview?

(Image credit: XxxBaloooxxx via Wikimedia Commons. License: Creative Commons.)

Most people would choose the innocuous student, but that's introducing bias. You can imagine multiple interviewers making similar decisions resulting in a heavily biased sample. To counter this problem, we were given guidance on who to select, for example, we were told to sample every seventh person or to take the first person who met our quota regardless of their appearance. This at least meant we were supposed to ask the skinhead, but of course, whether he chose to reply or not is another matter.

The rules sometimes led to absurdity. I did a survey where I was supposed to interview every 10th person who passed by. One man volunteered, but I said no because he was the 5th person. He hung around so long that eventually, he became the 10th person to pass me by. Should I have interviewed him? He met the rules and he met my sampling quota.

I came across a woman who was exactly what I needed for my quota. She was a care worker who had been on a day trip with severely mentally handicapped children and was in the process of moving them from the bus to the care home. Would you take her time to interview her? What about the young parent holding his child when I knocked on the door? The apartment showed clear signs of recent drug-taking. Would you interview him?

As you might expect, interviewers interpreted the rules more flexibly as the deadline approached and as it got later in the day. I once interviewed a very old man whose wife answered all the questions for him. This is against the rules, but he agreed with her answers, it was getting late, and I needed his gender/age group/employment status for my quota.

The company sent out supervisors to check our work on the streets, but of course, supervisors weren't there all the time, and they tended to vanish after 5pm anyway.

The point is, when it comes to it, there's no such thing as random sampling. Even with quotas and other guided selection methods, there are a thousand ways for bias to creep into sampling and the biases can be subtle. The sampling methodology one company uses will be different from another company's, which means their biases will not be the same.

What does the question mean?

One of the biggest lessons I learned was the importance of clear and unambiguous questions, and the unfortunate creativity of the public. All of the surveys I worked on had clearly worded questions, and to me, they always seemed unambiguous. But once you hit the streets, it's a different world. I've had people answer questions with the most astonishing levels of interpretation and creativity; regrettably, their interpretations were almost never what the survey wanted. 

What surprised me was how willing people were to answer difficult questions about salary and other topics. If the question is worded well (and I know all the techniques now!), you can get strangers to tell you all kinds of things. In almost all cases, I got people to tell me their age, and when required, I got salary levels from almost everyone.

A well-worded question led to a revelation that shocked me and shook me out of my complacency.  A candidate had unexpectedly just lost an election in the East End of London and the polling organization I worked for had been contracted to find out why. To help people answer one of the questions, I had a card with a list of reasons why the candidate lost, including the option: "The candidate was not suitable for the area." A lot of people chose that as their reason. I was naive and didn't know what it meant, but at the end of the day, I interviewed a white man in pseudo-skinhead clothes, who told me exactly what it meant. He selected "not suitable for the area" as his answer and added: "She was black, weren't she?".

The question setters weren't naive. They knew that people would hesitate before admitting racism was the cause, but by carefully wording the question and having people choose from options, they provided a socially acceptable way for people to answer the question.

Question setting requires real skill and thought.

(Oddly, there are very few technical resources on wording questions well. The best I've found is "The Art of Asking Questions" by Stanley Le Baron Payne, but the book has been out of print for a long time.)

Order, order

Question order isn't accidental either; you can bias a survey by the order in which you ask questions. Of course, you have to avoid leading questions. The textbook example is survey questions on gun control. Let's imagine there were two surveys with these questions:

Survey 1:
  • Are you concerned about violent crime in your neighborhood?
  • Do you think people should be able to protect their families?
  • Do you believe in gun control?
Survey 2:
  • Are you concerned about the number of weapons in society?
  • Do you think all gun owners secure their weapons?
  • Do you believe in gun control?

What answers do you think you might get?

As well as avoiding bias, question order is important to build trust, especially if the topic is a sensitive one. The political survey I did in the East End of London was very carefully constructed to build the respondent's trust to get to the key 'why' question. This was necessary for other surveys too. I did a survey on police recruitment, but as I'm sure you're aware, some people are very suspicious of the police. Once again, the survey was constructed so the questions that revealed it was about police recruitment came later on after the interviewer (me!) had built some trust with the respondent.

How long is the survey?

This is my favorite story from my polling days. I was doing a survey on bus transport in London, and I was asked to interview people waiting for a bus. The goal of the survey was to find out where people were going so London could plan for new or changed bus routes. For obvious reasons, the set of questions was shorter than usual, but in practice, not short enough; a big fraction of my interviews were cut short because the bus turned up! In several cases, I was asking questions as people were getting on the bus, and in a few cases, we had a shouted back-and-forth to finish the survey before the bus pulled out of earshot.


(Image credit: David McKay via Wikimedia Commons. License: Creative Commons)

To avoid exactly this sort of problem, most polling organizations use pilot surveys. These are test surveys done on a handful of people to debug the survey. In this case, the pilot should have uncovered the fact that the survey was too long, but regrettably, it didn't.

(Sometime later, I designed and executed a survey in Boston. I ran a pilot survey and found that some of my questions were confusing and that I could shorten the survey by using a freeform question rather than asking people to choose from a list. In any survey of more than a handful of respondents, I strongly recommend running a pilot - especially if you don't have a background in polling.)

The general lesson for any survey is to keep it as short as possible and understand the circumstances people will be in when you're asking them questions.

What it all means - advice for running surveys

Surveys are hard. It's hard to sample right, it's hard to write questions well, and it's hard to order questions to avoid bias. 

Over the years, I've sat in meetings where someone has enthusiastically suggested a survey. The survey could be an HR survey of employees, a marketing survey of customers, or something else. Usually, the level of enthusiasm is inversely related to survey experience. The most enthusiastic people are often very resistant to advice about question phrasing and order, and most resistant of all to the idea of a pilot survey. I've seen a lot of enthusiastic people come to grief because they didn't listen.

If you're thinking about running a survey, here's my advice.

  • Make your questions as clear and unambiguous as you can. Get someone who will tell you you're wrong to review them.
  • Think about how you want the questions answered. Do you want freeform text, multiple choice, or a scale? Surprisingly, in some cases, free form can be faster than multiple choice.
  • Keep it short.
  • Always run a pilot survey. 

What it means - understanding polling results

Once you understand that polling organizations use customized sampling methodologies, you can understand why polling organizations can get the results wrong. To put it simply, if their sampling methodology misses a crucial factor, they'll get biased results. The most obvious example is state-level polling in the US 2016 Presidential Election, but there are a number of other polls that got very different results from the actual election. In a future blog post, I'll look at why the 2016 polls were so wrong and why polls were wrong in other cases too.


Monday, July 27, 2020

The Electoral College for beginners

The (in)famous electoral college

We're coming up to the US Presidential election so it's time for pundits and real people to discuss the electoral college. There's a lot of misunderstanding about what it is, its role, and its potential to undermine democracy. In this post, I'm going to tell you how it came to be, the role it serves, and some issues with it that may cause trouble.

(Ohio Electoral College 2012. Image credit: Wikimedia Commons. Contributor: Ibagli. License: Creative Commons.)

How it came to be

The thirteen original colonies had the desire for independence in common but had stridently different views on government. In the aftermath of independence, the US was a confederacy, a country with a limited and small (federal) government. After about ten years, it became obvious that this form of government wasn't working and something new was needed. So the states created a Constitutional Convention to discuss and decide on a new constitution and form of government.

Remember, the thirteen states were the size of European countries and had very different views on issues like slavery. The states with smaller populations were afraid they would be dominated by the more populous states, which was a major stumbling block to agreements. The issue was resolved by the Great Compromise (or Connecticut Compromise if you come from Connecticut). The Convention created a two-chamber congress and a more powerful presidency than before. Here's how they were to be elected:

  • The lower house, the House of Representatives, was to have representatives elected in proportion to the population of the state (bigger states get more representatives). 
  • The upper house, the Senate, was to have two Senators per state regardless of the population of the state.
  • Presidents were to be elected through an electoral college, with each elector having one vote. Each state would be allocated a number of electors (and hence votes) based on its seats in congress. The electors would meet and vote for the President. For example, in 1789, the state of New Hampshire had three representatives and two senators, which meant New Hampshire sent five electors (votes) to the electoral college. The states decided who the electoral college electors were.

Think for a minute about why this solution worked. The states were huge geographically, with low population densities and often poor communications. Travel was a big undertaking and mail was slow. It made sense to send electors to vote on your behalf at a college, and these delegates might have to change their vote depending on circumstances. In short, the electoral college was a way of deciding the presidency in a big country with slow communications.

Electoral college vote allocation is and was only partially representative of the underlying population size. Remember, each state gets two Senators (and therefore two electoral college votes) regardless of its population. This grants power disproportionately to lower-population states, which is a deliberate and intended feature of the system.

Early practice

Prior to the formation of modern political parties, the President was the person who got the largest number of electoral college votes and the Vice-President was the person who got the next highest number of votes. For example, in 1792, George Washington was re-elected President with 132 votes, and the runner-up, John Adams, who got 77 votes, became Vice-President. This changed when political parties made this arrangement impractical, and by 1804, the President and the Vice-President were on the same ticket.

Electoral college electors were originally selected by state legislators, not by the people. As time went on, more states started directly electing electoral college electors. In practice, this meant the people chose their Presidential candidate and the electoral college electors duly voted for them. 

By the late 19th century, all states were holding elections for the President and Vice-President through electoral college representation.

Modern practice

Each state has the following representation in congress:

  • Two Senators 
  • A number of House of Representative seats roughly related to the state's population.

The size of each state's congressional delegation is their number of electoral college votes. For example, California has 53 Representatives and 2 Senators giving 55 electoral college electors and 55 electoral college votes.

During a Presidential election, the people in each state vote for who they want for President (and by extension, Vice-President). Although it's a federal election, the voting is entirely conducted by each state; the ballot paper is different, the counting process is different, and the supervision is different.

Most states allocate their electoral college votes on a winner-takes-all basis: the person with the largest share of the popular vote gets all the electoral college votes. For example, in 2016, the voting in Pennsylvania was 2,926,441 votes for Hillary Clinton and 2,970,733 votes for Donald Trump, so Donald Trump was allocated all of Pennsylvania's electoral college votes.

Two states do things a little differently. Maine and Nebraska use the Congressional District method. They allocate one of their electoral college votes to each district used to elect a member of the House of Representatives. The winner of the statewide vote is then allocated the other two electoral college votes. In Maine in 2016, Hillary Clinton won three electoral college votes and Donald Trump one.

Washington D.C. isn't a state and doesn't have Senators; it has a non-voting delegate to the House of Representatives. However, it does have electoral college votes! Under the 23rd amendment to the Constitution, it has the same electoral college votes as the least populous state (currently 3). 

In total, there are 538 electoral college votes:

  • 100 Senators
  • 435 Representatives
  • 3 Electors for Washington D.C.

The electoral college does not meet as one body in person. Electors meet in their respective state capitols and vote for President.

How electoral college votes are decided

How are electoral college votes allocated to states? I've talked about the formula before: 2 Senators for every state plus the number of House of Representatives seats. House of Representatives seats are allocated on a population basis using census data. There are 435 seats, and they are reallocated every ten years based on census data. Growing states may get more seats and shrinking states fewer. This is why the census has been politicized from time to time - if you can influence the census, you can gain a ten-year advantage for your party.
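For the curious, the seat allocation itself uses the 'method of equal proportions' (the Huntington-Hill method). Here's a minimal sketch with made-up populations for three imaginary states; the real calculation runs over census data for all 50 states:

```python
import heapq

def apportion(populations, total_seats):
    """Allocate House seats using the Huntington-Hill method."""
    # Every state is guaranteed at least one seat.
    seats = {state: 1 for state in populations}
    # Priority for a state's next seat is population / sqrt(n * (n + 1)),
    # where n is the number of seats it currently holds.
    heap = [(-pop / (1 * 2) ** 0.5, state) for state, pop in populations.items()]
    heapq.heapify(heap)
    for _ in range(total_seats - len(populations)):
        _, state = heapq.heappop(heap)
        seats[state] += 1
        n = seats[state]
        heapq.heappush(heap, (-populations[state] / (n * (n + 1)) ** 0.5, state))
    return seats

# Made-up populations for three imaginary states.
populations = {"A": 8_000_000, "B": 3_000_000, "C": 1_000_000}
house_seats = apportion(populations, total_seats=12)

# Electoral college votes = House seats + 2 Senators per state.
ec_votes = {state: n + 2 for state, n in house_seats.items()}
print(house_seats, ec_votes)  # {'A': 8, 'B': 3, 'C': 1} {'A': 10, 'B': 5, 'C': 3}
```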

Faithless electors and the Supreme Court

Remember, the electors meet and vote for President. Let's imagine we have two Presidential candidates, cat and dog, and that the people of the state vote for cat. What's to stop the electors voting for dog instead? Nothing at all. For many states, there's nothing to stop electors voting for anyone regardless of who won the election in the state. This can and does happen, even as recently as 2016. It happens so often that there's a name for them: faithless electors.

In 2016, five electors who should have voted for Hillary Clinton didn't vote for her, and two who should have voted for Donald Trump didn't vote for him. These votes were officially accepted and counted.

Several states have laws that mandate that electors vote as instructed or provide punishment for electors who do not vote as instructed. These laws were challenged in the Supreme Court, which voted to uphold them.

On the face of it, faithless electors sound awful, but I do have to say a word in their defense. They do have some support from the original intent of the Constitutional Convention and they do have some support from the Federalist Papers. It's not entirely as black and white as it appears to be.

Have faithless electors ever swayed a Presidential election? No. Could they? Yes.

Gerrymandering

In principle, it's possible to gerrymander electoral college votes, but it hasn't been done in practice. Let me explain how a gerrymander could work.

First off, you'd move to Congressional District representation. Because the shape of congressional districts is under state control, you could gerrymander these districts to your heart's content. Next, you'd base your two senatorial electoral college votes on the congressional district winners on a winner-takes-all basis. Let's say you had 10 congressional districts and you'd gerrymandered them so your party could win 7. Because 7 of the 10 districts would be for one candidate, you'd award your other two votes to that candidate. In other words, a candidate could lose the popular vote but still gain the majority of the electoral college votes for a state.
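Here's the arithmetic of that scenario as a sketch (all numbers invented): party Y wins the statewide popular vote, but party X takes 9 of the state's 12 electoral college votes:

```python
from collections import Counter

# Winners of the 10 gerrymandered congressional districts.
district_winners = ["X"] * 7 + ["Y"] * 3

# One electoral college vote per district won.
ec = Counter(district_winners)

# Statewide popular vote: party Y actually wins it.
popular_vote = {"X": 480_000, "Y": 520_000}

# The two 'senatorial' votes go to the winner of the most districts,
# not to the popular-vote winner.
ec[max(ec, key=ec.get)] += 2

print(dict(ec))                                 # {'X': 9, 'Y': 3}
print(max(popular_vote, key=popular_vote.get))  # Y - the popular-vote winner
```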

The electoral college and representative democracy

Everyone knows that Hillary Clinton won the popular vote in 2016, but Donald Trump won the electoral college and became President. That was a close election, but it's theoretically possible for a candidate to lose the popular vote by a substantial margin, yet still win the presidency.

Bear in mind what I said at the beginning of this piece: electoral college votes are not entirely representative of the population, by design. Here's a chart of electoral college votes per 1,000,000 population for 2020. Note how skewed it is in favor of low-population (and rural) states. If you live in Wyoming, your vote is worth about 5 times that of a voter in Texas.

Obviously, some states are firmly Democratic and others firmly Republican. The distribution of electoral college votes pushes candidates to campaign more heavily in small swing states, giving them an outsize influence (for example, New Hampshire). Remember, your goal as a candidate is to win electoral college votes, your goal is not to win the popular vote. You need to focus your electoral spending so you get the biggest bang for your buck in terms of electoral college votes, which means small swing states.

Nightmare scenarios

Here are two scenarios that are quite possible with the current system:

  • Faithless electors ignore the results in their states and swing the outcome of a close election.
  • A state gerrymanders its electoral college votes using the Congressional District method described above, handing its votes to the statewide popular-vote loser.

Neither of these scenarios is good for democracy or stability. There is nothing to prevent them now.

Who else uses an electoral college?

Given the problems with an electoral college, it's not surprising that there aren't many other cases in the world of its use. According to Wikipedia, there are several other countries that use it for various elections, but they are a minority.

Could the electoral college be changed for another system?

Yes, but it would take a constitutional change, which is a major undertaking and would require widespread cross-party political support. Bear in mind, a more representative system (e.g. going with the popular vote) would increase the power of the more populous states and decrease the power of less populous states - which takes us all the way back to the Great Compromise and the Constitutional Convention.

What's next?

I hope you enjoyed this article. I intend to write more election-based pieces as November comes closer. I'm not going to endorse or support any candidate or party; I'm only interested in the process of democracy!


Saturday, May 23, 2020

Finding electoral fraud - the democracy data deficit

Why we need to investigate fraud

In July 2016, Fox News' Sean Hannity reported that Mitt Romney received no votes at all in 59 Philadelphia voting precincts in the 2012 Presidential Election. He claimed that this was evidence of vote-rigging - something that received a lot of commentary and on-air discussion at the time. On the face of it, this does sound like outright electoral fraud; in a fair election, how is it possible for a candidate to receive no votes at all? Since then, there have been other allegations of fraud and high-profile actual incidents of fraud. In this blog post, I’m going to talk about how a citizen-analyst might find electoral fraud. But I warn you, you might not like what I’m going to say.

(Image credit: National Museum of American History. Public domain, via Wikimedia Commons.)

Election organization - the smallest electoral units

In almost every country, the election process is organized in the same way; the electorate is split into geographical blocks small enough to be managed by a team on election day. The blocks might contain one or many polling stations and may have a few hundred to a few thousand voters. These blocks are called different things in different places, for example, districts, divisions, or precincts. Because precinct seems to be the most commonly used word, that's what I'm going to use here. The results from the precincts are aggregated to give results for the ward, county, city, state, or country. The precinct boundaries are set by different authorities in different places, but they're known. 

How to look for fraud

A good place to look for electoral shenanigans is at the precinct level, but what should we look for? There are several easy checks:

  • A large and unexplained increase or decrease in the number of voters compared to previous elections and compared to other nearby precincts. 
  • An unexpected change in voting behavior compared to previous elections/nearby precincts. For example, a precinct that ‘normally’ votes heavily for party Y suddenly voting for party X.
  • Changes in voting patterns for absentee voters, e.g. significantly more or fewer absentee votes, or absentee voting patterns that are very different from in-person votes.
  • Results that seem inconsistent with the party affiliation of registered voters in the precinct.
  • A result that seems unlikely given the demographics of the precinct.
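As an illustration of the first check, here's a minimal sketch (using a made-up results table) that flags precincts whose change in turnout is a statistical outlier relative to the others:

```python
import pandas as pd

# Hypothetical precinct-level turnout for two consecutive elections.
df = pd.DataFrame({
    "precinct":   ["P1", "P2", "P3", "P4", "P5"],
    "votes_2012": [612, 580, 640, 595, 620],
    "votes_2016": [630, 575, 655, 1180, 610],
})

# Percentage change in turnout between elections.
df["pct_change"] = (df["votes_2016"] - df["votes_2012"]) / df["votes_2012"]

# Flag precincts whose change is an outlier (robust z-score via the
# median absolute deviation, so one bad precinct can't hide itself).
median = df["pct_change"].median()
mad = (df["pct_change"] - median).abs().median()
df["flagged"] = (df["pct_change"] - median).abs() > 3 * 1.4826 * mad

print(df[df["flagged"]])  # P4's turnout nearly doubled - worth a look
```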

Of course, none of these checks is a smoking gun, either individually or collectively, but they might point to divisions that should be investigated. Let’s start with the Philadelphia case and go from there.

Electoral fraud - imagined and real

It’s true that some divisions (precincts) in Philadelphia voted overwhelmingly for Obama in 2012. These divisions were small (averaging about 600 voters) and almost exclusively (95%+) African-American. Obama was hugely popular with the African-American community in Philadelphia, polling 93%+. The same divisions also have a history of voting overwhelmingly Democratic. Given these facts, it’s not at all surprising to see no or very few votes for Mitt Romney. Similar arguments hold for allegations of electoral fraud in Cleveland, Ohio in 2012

In fact, there were some unbalanced results the other way too; in some Utah precincts, Obama received no votes at all - again not surprising given the voter population and voter history. 

Although on the face of it these lopsided results seem to strongly indicate fraud, the allegations don't stand up to analytical scrutiny.

Let’s look at another alleged case of electoral fraud, this time in 2018 in North Carolina. The congressional election was fiercely contested and appeared to be narrowly decided in favor of Mark Harris. However, investigators found irregularities in absentee ballots, specifically, missing ballots from predominantly African-American areas. The allegations were serious enough that the election was held again, and criminal charges have been made against a political operative in Mark Harris’ campaign. The allegation is ‘ballot harvesting’, where operatives persuade voters who might vote for their opposition to voting via an absentee ballot and subsequently make these ballots disappear.

My sources of information here are newspaper reports and analysis, but what if I wanted to do my own detective work and find areas where the results looked odd? How might I get the data? This is where things get hard.

Democracy’s data - official sources

To get the demographics of a precinct, I can try going to the US Census Bureau. The Census Bureau defines small geographic areas, called tracts, that they can supply data on. Tract data include income levels, population, racial makeup, etc. Sometimes, these tracts line up with voting districts (the Census term for precincts), but sometimes they don’t. If tracts don’t line up with voting districts, then automated analysis becomes much harder. In my experience, it takes a great investment of time to get any useful data from the Census Bureau; the data’s there, it’s just really hard finding out how to get it. In practice then, it’s extremely difficult for a citizen-analyst to link census data to electoral data.

What about voting results? Surely it’s easy to get electoral result data? As it turns out, this is surprisingly hard too. You might think the Federal Election Commission (FEC) will have detailed data, but it doesn’t. The data available from the FEC for the 2016 Presidential Election is less detailed than the 2016 Presidential Election Wikipedia page. The reason is, Presidential Elections are run by the states, so there are 51 (including Washington DC) separate authorities maintaining electoral results, which means 51 different ways of getting data, 51 different places to get it, and 51 different levels of detail available. The FEC sources its data from the states, so it's not surprising its reports are summary reports.  

If we need more detailed data, we need to go to the states themselves. 

Let's take Massachusetts as an example: Presidential Election data is available for 2016 down to the ward level (as a CSV). For Utah, data is only available at the county level (as an Excel file), and the same is true for Pennsylvania, where the data is only available from a web page. Getting detail below the county level may take freedom of information requests, if the information is available at all.

In effect, this puts precinct-level nationwide voting analysis from official sources beyond almost all citizen-analysts.

Democracy’s data - unofficial sources

In practice, voting data is hard to come by from official sources, but it is available from unofficial sources who've put the work into getting the data from the states and who make it available to everyone.

Dave Leip offers election data down to detailed levels; the 2016 results by county will cost you $92 and results by Congressional District will cost you $249; however, high-level results are on his website and available for free. He's even been kind enough to list his sources and URLs if you want to spend the time to duplicate his work. Leip's data is used by the media in their analysis, and probably by political campaigns too. He's put in a great deal of work to gather the data and he's asking for a return on his effort, which is fair enough.

The MIT Election Data and Science Lab (MEDSL) collects election data, down to the precinct level, and the data is available for the most recent Presidential Election (2016 at the time of writing). As usual with this kind of data, there are all kinds of notes to read before using it. MIT has also been kind enough to make tools available to analyze the data, and they make their web scraping tools available too.

The MIT project isn't the only project providing data; various other universities have collated electoral resources at various levels of detail.

Democracy’s data - electoral fraud

What about looking for cases of electoral fraud? There isn't a central repository of electoral fraud cases and there are multiple different court systems in the US (state and federal), each maintaining records in different ways. Fortunately, Google indexes a lot of cases, but often, court transcripts are only available for a fee, and of course, it's extremely time-consuming to trawl through cases.

The Heritage Foundation maintains a database of known electoral fraud cases. They don't claim their database is complete, but they have put a lot of effort into maintaining it and it's the most complete record I know of. 

In 2018, there were elections for the House of Representatives, the Senate, state elections, and of course county and city elections. Across the US, there must have been thousands of different elections in 2018. How many cases of electoral fraud do you think there were? What level of electoral fraud would undermine your faith in the system? In 2018, there were 65 cases. From the Heritage Foundation data, here’s a chart of fraud cases per year for the United States as a whole.

(Electoral fraud cases by year, from the Heritage Foundation electoral fraud database)

It does look like there was an increase in electoral fraud up to about 2010, but bear in mind the dataset covers the period of computerization and the rise of the internet. We might expect a rise in recorded cases simply because it's become easier to find case records.

Based on this data, there really doesn’t seem to be large-scale electoral fraud in the United States. In fact, in reading the cases on their website, most of them are small-scale frauds concerning local elections (e.g. mayoral elections) - in a lot of cases, the frauds are frankly pathetic. 

Realistic assessment of election data

Official data is either hard to come by or not available at the precinct level, which leaves us using unofficial data. Fortunately, unofficial data is high quality and comes from reputable sources. The problem is, data from unofficial sources isn't available immediately after an election; there may be a long delay between the election and the data. If one of the goals of electoral data analysis is finding fraud, then timely data availability is paramount.

Of course, the kind of analysis I'm talking about here won't find small-scale fraud, where a person votes more than once or impersonates someone else. But small-scale fraud will only affect the outcome of the very tightest of races. Democracy is most threatened by fraud that might affect results, which in most cases means larger-scale fraud like the North Carolina case. Statistical analysis might detect these kinds of fraud.

Sean Hannity's allegation of electoral fraud in Philadelphia didn't stand up to analysis, but it was worth investigating and is the kind of fraud we could detect using data - if only it were available in a timely way. 

How things could be - a manifesto

Imagine groups of researchers sitting by their computers on election night. As election results at the precinct level are posted online, they analyze the results for oddities. By the next morning, they may have spotted oddities in absentee ballots, or unexplained changes in voting behavior, or unexpected changes in voter turnout - any of which will feed into the news cycle. Greater visibility of anomalies will enable election officials to find and act on fraud more quickly.

To do this will require consistency of reporting at the state level and a commitment to post precinct results as soon as they're counted and accepted. This may sound unlikely, but there are federal standards the states must follow in many other areas, including deodorants, teddy bears, and apple grades, but also for highway construction, minimum drinking age, and the environment. Isn't the transparency of democracy at least as important as deodorants, teddy bears, and apples?
