Monday, September 14, 2020

The datasaurus: always visualize your data

The summary is not the whole picture

If you just use summary statistics to describe your data, you can miss the bigger picture, sometimes literally so. In this blog post, I'm going to show you how relying on summaries alone can lead you catastrophically astray and I'm going to tell you how you can avoid making career-damaging mistakes.

The datasaurus is why you need to visualize your data. Source: Alberto Cairo. Open source.

What are summary statistics?

Summary statistics are quantities like the mean, standard deviation, and correlation coefficient; they summarize the properties of the data and the relationships between variables. For example, if the correlation coefficient, r, is about 0.8 for two variables x and y, we might think there's a relationship between them, but if it's about 0, we might think there isn't.

The use of summary statistics is widely taught, every textbook emphasizes them, and almost everyone uses them. But if you use summary statistics in isolation from other methods, you might miss important relationships - you should always visualize your data, as we'll see.

Anscombe's Quartet

Take a look at the four plots below. They're obviously quite different, but they all have the same summary statistics!

Here are the summary statistics:

Property	Value
Mean of x	9
Sample variance of x, $s_x^{2}$	11
Mean of y	7.50
Sample variance of y, $s_y^{2}$	4.125
Correlation between x and y	0.816
Linear regression line	y = 3.00 + 0.500x
Coefficient of determination of the linear regression, $R^{2}$	0.67

These plots were developed in 1973 by the statistician Francis Anscombe to make exactly this point: you can't rely on summary statistics, you need to visualize your data. The graphical relationship between the x and y variables is different in each case and implies different things. By plotting the data out, we can see what the relationships are, but summary statistics hide what's going on.
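
If you want to check this for yourself, here's a minimal sketch that recomputes the summary statistics for the four data sets using only Python's standard library. The numbers are the published quartet values from Anscombe's 1973 paper; the helper names (`pearson_r`, `summarize`) are my own and aren't from any library.

```python
from statistics import mean, variance

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def summarize(xs, ys):
    """Return the summary statistics from the table above for one data set."""
    r = pearson_r(xs, ys)
    # Least-squares slope and intercept for y = intercept + slope * x
    slope = r * (variance(ys) / variance(xs)) ** 0.5
    intercept = mean(ys) - slope * mean(xs)
    return {
        "mean_x": mean(xs),
        "var_x": variance(xs),  # sample variance (n - 1 denominator)
        "mean_y": mean(ys),
        "var_y": variance(ys),
        "r": r,
        "slope": slope,
        "intercept": intercept,
        "r_squared": r ** 2,
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four rows agree with each other to two or three significant figures, which is the whole point: nothing in these numbers tells you which of the four shapes you're looking at.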

The datasaurus

Let's zoom forward to 2016. The justly famous Alberto Cairo tweeted about Anscombe's quartet and illustrated the point with this cool set of summary statistics. He later expanded on his tweet in a short blog post.

Property	Value
n	142
x mean	54.2633
x standard deviation	16.7651
y mean	47.8323
y standard deviation	26.9353
Pearson correlation	-0.0645

What might you conclude from these summary statistics? I might say the correlation coefficient is close to zero, so there's no interesting relationship between the x and y variables - but I would be wrong.

The summary might not mean anything to you, but the visualization surely will. This is the datasaurus data set: the x and y variables draw out a dinosaur.

The datasaurus dozen

Two researchers at Autodesk Research took things a stage further. They started with Alberto Cairo's datasaurus and created a dozen other charts with the same summary statistics as the datasaurus. Here they all are.

The summary statistics look like noise, but the charts reveal the underlying relationships between the x and y variables. Some of these relationships are obviously fun, like the star, but there are others that imply more meaningful relationships.

If all this sounds a bit abstract, let's think about how this might manifest itself in business. Let's imagine you're an analyst working for a large company. You have data on sales by store size for Europe and you've been asked to analyze the data to gain insights. You're under time pressure, so you fire up a Python notebook and get some quick summary statistics. You get summary statistics that look like the ones I showed you above. So you conclude there's nothing interesting in the data, but you might be very wrong.

You should plot the data out and look at the chart. You might see something that looks like the slanting charts above, maybe something like this:

The individual diagonal lines might correspond to different European countries (different regulations, different planning rules, different competition, etc.). There could be a very significant relationship that you would have missed by relying on summary data.
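
To make the store example concrete, here's a small sketch using entirely made-up data. The five "countries", the store sizes, and the sales figures are all hypothetical; I've chosen the numbers so that sales rise perfectly with store size inside each country while the pooled correlation cancels out to roughly zero.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Synthetic data: five hypothetical countries with ten stores each.
# Within a country, sales rise by 2 units per unit of store size.
# Each successive country has larger stores but a lower sales baseline,
# and the baseline drop (2.75) is picked so the pooled covariance is zero.
countries = ["A", "B", "C", "D", "E"]
stores = []  # (country, store_size, sales), arbitrary units
for k, country in enumerate(countries):
    for t in range(10):
        store_size = t + 3 * k
        sales = 20 + 2 * t - 2.75 * k
        stores.append((country, store_size, sales))

# Correlation when all countries are lumped together
overall_r = pearson_r([s[1] for s in stores], [s[2] for s in stores])

# Correlation within each country separately
within_r = {}
for country in countries:
    rows = [s for s in stores if s[0] == country]
    within_r[country] = pearson_r([r[1] for r in rows], [r[2] for r in rows])

print(f"pooled correlation: {overall_r:.3f}")
for country, r in within_r.items():
    print(f"country {country}: {r:.3f}")
```

A scatter plot of this data, colored by country, would show five parallel diagonal lines. The pooled summary statistic alone would tell you there's nothing to see.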

(The Autodesk Research team have posted their work as a paper you can read here.)

Lessons learned

The lessons you should take away from all this are simple:

summary statistics hide a lot
there are many relationships between variables that will give summary statistics that look like noise
always visualize your data!

Tuesday, September 8, 2020

Can you believe the polls?

Opinion polls have known sin

Polling companies have run into trouble over the years in ways that render some poll results doubtful at best. Here are just a few of the problems:

Fraud allegations.
Leading questions.
Choosing not to publish results/picking methodologies so that polls agree.

Running reliable polls is hard work that takes a lot of expertise and commitment. Sadly, companies sometimes get it wrong for several reasons:

Ineptitude.
Lack of money.
Telling people what they want to hear.
Fakery.

In this blog post, I'm going to look at some high-profile cases of dodgy polling and I'm going to draw some lessons from what happened.

(Are some polls real or fake? Image source: Wikimedia Commons. Image credit: Basile Morin. License: Creative Commons.)

Allegations of fraud part 1 - Research 2000

Backstory

Research 2000 started operating around 1999 and gained some solid early clients. In 2008, The Daily Kos contracted with Research 2000 for polling during the upcoming US elections. In early 2010, Nate Silver at FiveThirtyEight rated Research 2000 as an F and stopped using their polls. As a direct result, The Daily Kos terminated their contract and later took legal action to reclaim fees, alleging fraud.

Nate Silver's and others' analysis

After the 2010 Senate elections, Nate Silver analyzed polling results for 'house effects' and found a bias towards the Democratic party for Research 2000. These kinds of biases appear all the time and vary from election to election. The Research 2000 bias was large (at 4.4%), but not crazy; the Rasmussen Republican bias was larger for example. Nonetheless, for many reasons, he graded Research 2000 an F and stopped using their polling data.

In June of 2010, The Daily Kos publicly dismissed Research 2000 as their pollster based on Nate Silver's ranking and more detailed discussions with him. Three weeks later, The Daily Kos sued Research 2000 for fraud. After the legal action was public, Nate Silver blogged some more details of his misgivings about Research 2000's results, which led to a cease-and-desist letter from Research 2000's lawyers. After the cease-and-desist letter, Silver published yet more details of his misgivings. To summarize his results, he was seeing data inconsistent with real polling - the distribution of the numbers was wrong. As it turned out, Research 2000 was having financial trouble around the time of the polling allegations and was negotiating low-cost or free polling with The Daily Kos in exchange for accelerated payments.

Others were onto Research 2000 too. Three statisticians analyzed some of the polling data and found patterns inconsistent with real polling - again, real polls tend to have results distributed in certain ways and some of the Research 2000 polls did not.

The result

The lawsuit progressed with strong evidence in favor of The Daily Kos. Perhaps unsurprisingly, the parties settled, with Research 2000 agreeing to pay The Daily Kos a settlement fee. Research 2000 effectively shut down after the agreement.

Allegations of fraud part 2 - Strategic Vision, LLC

Backstory

This story requires some care in the telling. At the time of the story, there were two companies called Strategic Vision: one is well-respected and wholly innocent, the other not so much. The innocent and well-respected company is Strategic Vision, based in San Diego. They have nothing to do with this story. The other company is Strategic Vision, LLC, based in Atlanta. When I talk about Strategic Vision, LLC from now on, it will be solely about the Atlanta company.

To maintain trust in the polling industry, the American Association for Public Opinion Research (AAPOR) has guidelines and asks polling companies to disclose some details of their polling methodologies. They rarely censure companies, and their censures don't have the force of law, but public shaming is effective, as we'll see.

What happened

In 2008, the AAPOR asked 21 polling organizations for details of their 2008 pre-election polling, including polling for the New Hampshire Democratic primary. Their goal was to quality-check the state of polling in the industry.

One polling company didn't respond for a year, despite repeated requests to do so. As a result, in September 2009, the AAPOR published a public censure of Strategic Vision, LLC which you can read here.

It's very unusual for the AAPOR to issue a censure, so the story was widely reported at the time, for example in the New York Times, The Hill, and The Wall Street Journal. Strategic Vision LLC's public response to the press coverage was that they were complying but didn't have time to submit their data. They denied any wrongdoing.

Subsequent to the censure, Nate Silver looked more closely at Strategic Vision LLC's results. Initially, he asked some very pointed and blunt questions. In a subsequent post, Nate Silver used Benford's Law to investigate Strategic Vision LLC's data, and based on his analysis he stated there was a suggestion of fraud - more specifically, that the data had been made up. In a post the following day, Nate Silver offered some more analysis and a great example of using Benford's Law in practice. Again, Strategic Vision LLC vigorously denied any wrongdoing.
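
For readers who haven't met Benford's Law, it says that in many naturally occurring sets of numbers, the leading digit d turns up with probability log10(1 + 1/d), so about 30% of numbers start with a 1. The sketch below is not a reconstruction of Nate Silver's analysis and uses no polling data; it just shows the general idea of comparing observed leading digits with the Benford expectation. The function names and the two test data sets (powers of 2, and the integers 100 to 999) are my own illustrative choices.

```python
import math
from collections import Counter


def benford_expected():
    """Benford's Law probability for each leading digit 1-9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(n):
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])


def benford_chi_square(values):
    """Chi-square statistic comparing leading digits with Benford's Law."""
    counts = Counter(leading_digit(v) for v in values)
    n = len(values)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )


# Powers of 2 are a classic example of numbers that follow Benford's Law.
powers_of_two = [2 ** k for k in range(100)]

# The integers 100-999 have every leading digit equally often, so they don't.
evenly_spread = list(range(100, 1000))

stat_powers = benford_chi_square(powers_of_two)
stat_even = benford_chi_square(evenly_spread)

# With 9 digits there are 8 degrees of freedom; the 5% critical value is 15.507.
CRITICAL_5_PERCENT = 15.507
print(f"powers of 2: chi-square = {stat_powers:.2f}")
print(f"100 to 999:  chi-square = {stat_even:.2f}")
print(f"5% critical value (8 d.o.f.) = {CRITICAL_5_PERCENT}")
```

A statistic far above the critical value says the digits don't look Benford-like. That's a flag for closer inspection, not proof of anything: plenty of honest data sets don't follow Benford's Law.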

One of the most entertaining parts of this story is a citizenship poll conducted by Strategic Vision, LLC among high school students in Oklahoma. The poll was commissioned by the Oklahoma Council on Public Affairs, a think tank. The poll asked eight straightforward questions, for example:

who was the first US president?
what are the two main political parties in the US?

and so on. The results were dismal: only 23% of students answered George Washington and only 43% of students knew Democratic and Republican. Not one student in 1,000 got all questions correct - which is extraordinary. These types of polls are beloved of the press; there are easy headlines to be squeezed from students doing poorly, especially on issues around citizenship. Unfortunately, the poll results looked odd at best. Nate Silver analyzed the distribution of the results and concluded that something didn't seem right - the data was not distributed as you might expect. To their great credit, when the Oklahoma Council on Public Affairs became aware of problems with the poll, they removed it from their website and put up a page explaining what happened. They subsequently terminated their relationship with Strategic Vision, LLC.

In 2010, a University of Cincinnati professor awarded Strategic Vision, LLC the "Phantom of the Soap Opera" award on the Media Ethics site. This site has a little more backstory on the odd tale of Strategic Vision, LLC's offices, or lack of them.

The results

Strategic Vision, LLC continued to deny any wrongdoing. They never supplied their data to the AAPOR and they stopped publishing polls in late 2009. They've disappeared from the polling scene.

Other polling companies

Nate Silver rated other pollsters an F and stopped using them. Not all of the tales are as lurid as the ones I've described here, but there are accusations of fraud and fakery in some cases, and in other cases, there are methodology disputes and no suggestion of impropriety. Here's a list of pollsters Nate Silver rates an F.

Pharos Research Group
TCJ Research
Overtime Politics
Big Data Poll
OurProgress (The Progress Campaign)
KG Polling
Blumenthal Research Daily
CSP Polling

Anarchy in the UK

It's time to cross the Atlantic and look at polling shenanigans in the UK. The UK hasn't seen the rise and fall of dodgy polling companies, but it has seen dodgy polling methodologies.

Herding

Let's imagine you commission a poll on who will win the UK general election. You get a result different from the other polls. Do you publish your result? Now imagine you're a polling analyst and you have a choice of methodologies for analyzing your results. Do you do what everyone else does and get similar results, or do you do your own thing and maybe get different results from everyone else?

Sadly, there are many cases when contrarian polls weren't published and there is evidence that polling companies made very similar analysis choices to deliberately give similar results. This leads to the phenomenon called herding where published poll results tend to herd together. Sometimes, this is OK, but sometimes it can lead to multiple companies calling an election wrongly.

In 2015, the UK polls predicted a hung parliament, but the result was a working majority for the Conservative party. The subsequent industry poll analysis identified herding as one of the causes of the polling miss.
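
One rough way to spot herding is to compare how much published polls actually vary with how much they should vary from sampling noise alone. If every poll interviews about n people and a party's true share is p, independent polls should scatter with a standard deviation of roughly sqrt(p(1 - p)/n). The sketch below is a crude heuristic under those assumptions, not the method used by the 2015 inquiry; the poll numbers are invented for illustration and `herding_ratio` is a hypothetical helper name.

```python
import math
from statistics import mean, stdev


def herding_ratio(shares, sample_size):
    """Observed spread of poll results divided by the spread expected
    from simple random sampling. Values well below 1 suggest the polls
    agree with each other more than chance alone would allow."""
    p = mean(shares)
    expected_sd = math.sqrt(p * (1 - p) / sample_size)
    return stdev(shares) / expected_sd


# Invented vote shares for one party from eight polls of 1,000 people each.
spread_polls = [0.32, 0.35, 0.34, 0.36, 0.33, 0.34, 0.31, 0.35]
clustered_polls = [0.34, 0.34, 0.33, 0.34, 0.34, 0.35, 0.34, 0.34]

ratio_spread = herding_ratio(spread_polls, sample_size=1000)
ratio_clustered = herding_ratio(clustered_polls, sample_size=1000)

print(f"spread polls:    ratio = {ratio_spread:.2f}")
print(f"clustered polls: ratio = {ratio_clustered:.2f}")
```

Real polls differ in sample size, weighting, and fieldwork dates, so a low ratio is only a prompt to look harder, but it does capture the basic intuition: honest, independent polls should disagree with each other a bit.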

This isn't the first time herding has been an issue with UK polling and it's occasionally happened in the US too.

Leading questions

The old British TV show 'Yes, Prime Minister' has a great piece of dialog neatly showing how leading questions work in surveys. 'Yes, Prime Minister' is a comedy, but UK polls have suffered from leading questions for a while.

The oldest example I've come across dates from the 1970s and the original European Economic Community membership referendum. Apparently, one poll asked the following questions to two different groups:

France, Germany, Italy, Holland, Belgium and Luxembourg approved their membership of the EEC by a vote of their national parliaments. Do you think Britain should do the same?
Ireland, Denmark and Norway are voting in a referendum to decide whether to join the EEC. Do you think Britain should do the same?

These questions are highly leading and unsurprisingly elicited the expected positive result in both (contradictory) cases.

Moving forward in time to 2012, leading questions, or artful question wording, came up again. The background is press regulation. After a series of scandals where the press behaved shockingly badly, the UK government considered press regulation to curb abuses. Various parties were for or against various aspects of press regulation and they commissioned polls to support their viewpoints.

The polling company YouGov published a poll, paid for by The Media Standards Trust, that showed 79% of people thought there should be an independent government-sanctioned regulator to investigate complaints against the press. Sounds comprehensive and definitive.

But there was another poll at about the same time, this time paid for by The Sun newspaper, that found that only 24% of the British public wanted a government regulator for the press - the polling company here was also YouGov!

The difference between the 79% and 24% came through careful question wording - a nuance that was lost in the subsequent press reporting of the results. You can listen to the story on the BBC's More Or Less program that gives the wording of the question used.

What does all this mean?

The quality of the polling company is everything

The established, reputable companies got that way through high-quality reliable work over a period of years. They will make mistakes from time to time, but they learn from them. When you're considering whether or not to believe a poll, you should ask who conducted the poll and consider the reputation of the company behind it.

With some exceptions, the press is unreliable

None of the cases of polling impropriety were caught by the press. In fact, the press has a perverse incentive to promote the wild and outlandish, which favors results from dodgy pollsters. Be aware that a newspaper that paid for a poll is not going to criticize its own paid-for product, especially when it's getting headlines out of it.

Most press coverage of polls focuses on discussing what the poll results mean, not how accurate they are and sources of bias. If these things are discussed, they're discussed in a partisan manner (disagreeing with a poll because the writer holds a different political view). I've never seen the kind of analysis Nate Silver does elsewhere - and this is to the great detriment of the press and their credibility.

Vested interests

A great way to get press coverage is by commissioning polls and publishing the results, especially if you can ask leading questions. Sometimes, the press gets very lazy and doesn't even report who commissioned a poll, even when there's plainly a vested interest.

Anytime you read a survey, ask who paid for it and what the exact questions were.

Outliers are outliers, not trends

Outlier poll results get more play than results in line with other pollsters. As I write this in early September 2020, Biden is about 7% ahead in the polls. Let's imagine two survey results coming in early September:

Biden ahead by 8%.
Trump ahead by 3%.

Which do you think would get more space in the media? Probably the shocking result, even though the dull result may be more likely. Trump-supporting journalists might start writing articles on a campaign resurgence while Biden-supporting journalists might talk about his lead slipping and losing momentum. In reality, the 3% poll might be an anomaly and probably doesn't justify consideration until it's backed by other polls.

Bottom line: outlier polls are probably outliers and you shouldn't set too much store by them.

There's only one Nate Silver

Nate Silver seems like a one-man army, rooting out false polling and pollsters. He's stood up to various legal threats over the years. It's a good thing that he exists, but it's a bad thing that there's only one of him. It would be great if the press could take inspiration from him and take a more nuanced, skeptical, and statistical view of polls.

Can you believe the polls?

Let me close by answering my own question: yes, you can believe the polls, but within limits and depending on who the pollster is.

Reading more

This blog post is one of a series of blog posts about opinion polls.

Fundamentally wrong? Using economic data as an election predictor - why I distrust forecasting models built on economic and other data
Can you believe the polls? - fake polls, leading questions, and other sins of opinion polling.
President Hilary Clinton: what the polls got wrong in 2016 and why they got it wrong - why the polls said Clinton would win and why Trump did.
Poll-axed: disastrously wrong opinion polls - a brief romp through some disastrously wrong opinion poll results.
The dirty little secrets of opinion polling - my experiences working for an opinion polling company as a street interviewer.
The electoral college for beginners - how the electoral college works

Monday, August 17, 2020

Poll-axed: disastrously wrong opinion polls

Getting it really, really wrong

On occasion, election opinion polls have got it very, very wrong. I'm going to talk about some of their biggest blunders and analyze why they messed up so very badly. There are lessons about methodology, hubris, and humility in forecasting.

(Image credit: secretlondon123, Source: Wikimedia Commons, License: Creative Commons)

The Literary Digest - size isn't important

The biggest, baddest, and boldest polling debacle happened in 1936, but it still has lessons for today. The Literary Digest was a mass-circulation US magazine published from 1890 to 1938. In 1920, it started printing presidential opinion polls, which over the years acquired a good reputation for accuracy [Squire], so much so that they boosted the magazine's subscriptions. Unfortunately, its 1936 opinion poll sank the ship.

(Source: Wikimedia Commons. License: Public Domain)

The 1936 presidential election was fought between Franklin D. Roosevelt (Democrat), running for re-election, and his challenger Alf Landon (Republican). The backdrop was the ongoing Great Depression and the specter of war in Europe.

The Literary Digest conducted the largest-ever poll up to that time, sending surveys to 10 million people and receiving 2.3 million responses; even today, this is orders of magnitude larger than typical opinion polls. Through the Fall of 1936, they published results as their respondents returned surveys; the magazine didn't interpret or weight the surveys in any way [Squire]. After 'digesting' the responses, the Literary Digest confidently predicted that Landon would easily beat Roosevelt. Their reasoning was that the poll was so big it couldn't possibly be wrong; after all, the statistical margin of error was tiny.
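
The "tiny margin of error" claim is easy to reproduce. Here's a quick sketch using the textbook formula for the 95% margin of error of a proportion, z * sqrt(p(1 - p)/n), with the worst case p = 0.5. The formula assumes a simple random sample; I'm using it only to show what the statistical margin of error would be for a sample of this size. The comparison with a 1,000-person poll is my own addition.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion,
    assuming a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)


for label, n in [("typical modern poll", 1_000),
                 ("Literary Digest responses", 2_300_000)]:
    moe = margin_of_error(n)
    print(f"{label}: n = {n:,}, margin of error = +/- {100 * moe:.3f} percentage points")
```

For 2.3 million responses, the margin comes out well under a tenth of a percentage point, compared with about three points for a 1,000-person poll. The catch is that this number measures only random sampling noise; it says nothing about whether the people sampled resemble the people who vote.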

Unfortunately for them, Roosevelt won handily. In reality, 'handily' is putting it mildly: he won a landslide victory (523 electoral college votes to 8).

So what went wrong? The Literary Digest sampled its own readers, people who were on lists of car owners, and people who had telephones at home. In the Great Depression, this meant their sample was not representative of the US voting population; the people they sampled were much wealthier. The poll also suffered from non-response bias; the people in favor of Landon were enthusiastic and filled in the surveys and returned them, the Roosevelt supporters less so. Unfortunately for the Literary Digest, Roosevelt's supporters weren't so lethargic on election day and turned up in force for him [Lusinchi, Squire]. No matter what the size of the Literary Digest's sample, their methodology baked in bias, so it was never going to give an accurate forecast.

Bottom line: survey size can't make up for sampling bias.
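
Here's a small simulation of that bottom line. Everything in it is hypothetical: I've invented an electorate where 30% of voters are "wealthier" and favor candidate L 60% of the time, while the other 70% favor L only 30% of the time. A huge poll drawn mostly from the wealthier group is compared with a small poll drawn in the true proportions. It models only a skewed sampling frame, not non-response bias, and it's a sketch rather than a reconstruction of the 1936 poll.

```python
import random

rng = random.Random(1936)  # fixed seed so the run is repeatable

# Hypothetical electorate (all numbers invented for illustration)
WEALTHY_SHARE = 0.30   # fraction of voters who are wealthier
P_L_WEALTHY = 0.60     # chance a wealthier voter backs candidate L
P_L_OTHER = 0.30       # chance any other voter backs candidate L
TRUE_SHARE = WEALTHY_SHARE * P_L_WEALTHY + (1 - WEALTHY_SHARE) * P_L_OTHER


def poll(n, wealthy_fraction):
    """Simulate a poll of n people in which `wealthy_fraction` of the
    respondents come from the wealthier group. Returns L's share."""
    votes_for_l = 0
    for _ in range(n):
        is_wealthy = rng.random() < wealthy_fraction
        p = P_L_WEALTHY if is_wealthy else P_L_OTHER
        if rng.random() < p:
            votes_for_l += 1
    return votes_for_l / n


# Huge poll, but 90% of respondents come from the wealthier group
huge_biased = poll(200_000, wealthy_fraction=0.90)

# Small poll, but respondents mirror the electorate
small_representative = poll(1_000, wealthy_fraction=WEALTHY_SHARE)

print(f"true share for L:            {TRUE_SHARE:.3f}")
print(f"huge biased poll (200,000):  {huge_biased:.3f}")
print(f"small representative (1,000): {small_representative:.3f}")
```

The huge poll converges very precisely on the wrong answer (about 57% in this made-up setup), while the small poll lands within a few points of the true 39%. Adding more respondents to the biased poll only makes it more confidently wrong.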

Sampling bias is an ongoing issue for pollsters. Factors that matter a great deal in one election might not matter in another, and pollsters have to estimate what will be important for voting so they know who to select. For example, having a car or a phone might not correlate with voting intention for most elections, until for one election they do correlate very strongly. The Literary Digest's sampling method was crude, but worked fine in previous elections. Unfortunately, in 1936 the flaws in their methodology made a big difference and they called the election wrongly as a result. Fast-forwarding to 2016, flaws in sampling methodology led to pollsters underestimating support for Donald Trump.

Sadly, the Literary Digest never recovered from this misstep and folded two years later.

Dewey defeats Truman - or not

The spectacular implosion of the 1936 Literary Digest poll gave impetus to the more 'scientific' polling methods of George Gallup and others [Igo]. But even these scientific polls came undone in the 1948 US presidential election.

The election was held not long after the end of World War II and was between the incumbent, Harry S. Truman (Democrat), and his main challenger, Thomas E. Dewey (Republican). At the start of the election campaign, Dewey was the favorite over the increasingly unpopular Truman. While Dewey ran a low-key campaign, Truman led a high-energy, high-intensity campaign.

The main opinion polling companies of the time, Gallup, Roper, and Crossley, firmly predicted a Dewey victory. The Crossley Poll of 15 October 1948 put Dewey ahead in 27 states [Topping]. In fact, their results were so strongly in favor of Dewey that some polling organizations stopped polling altogether before the election.

The election result? Truman won convincingly.

A few newspapers were so convinced that Dewey had won that they went to press with a Dewey victory announcement, leading to one of the most famous election pictures of all time.

(Source: Truman Library)

What went wrong?

As far as I can tell, there were two main causes of the pollsters' errors:

Undecided voters breaking for Truman. Pollsters had assumed that undecided voters would split their votes evenly between the candidates, which wasn't true then, and probably isn't true today.
Voters changing their minds or deciding who to vote for later in the campaign. If you stop polling late in the campaign, you're not going to pick up last-minute electoral changes.

Just as in 1936, there was commercial fallout; for example, 30 newspapers threatened to cancel their contracts with Gallup.

As a result of this fiasco, the polling industry regrouped and moved towards random sampling and polling late into the election campaign.

US presidential election 2016

For the general public, this is the best-known example of polls getting the result wrong. There's a lot to say about what happened in 2016, so much, in fact, that I'm going to write a blog post on this topic alone. It's not the clear-cut case of wrongness it first appears to be.

(Image credit: Michael Vadon, Source: Wikimedia Commons, License: Creative Commons)

For now, I'll just give you some hints: like the Literary Digest example, sampling was one of the principal causes, exacerbated by late changes in the electorate's voting decisions. White voters without college degrees voted much more heavily for Donald Trump than for Hillary Clinton, and in 2016, opinion pollsters didn't control for education, leading them to underestimate Trump's support in key states. Polling organizations are learning from this mistake and changing their methodology for 2020. Back in 2016, a significant chunk of the electorate seemed to make up their minds in the last few weeks of the election, which was missed by earlier polling.

It seems the more things change, the more they remain the same.

Anarchy in the UK?

There are several properties of the US electoral system that make it very well suited for opinion polling but other electoral systems don't have these properties. To understand why polling is harder in the UK than in the US, we have to understand the differences between a US presidential election and a UK general election.

The US is a national two-party system; the UK is a multi-party democracy with regional parties. In some constituencies, there are three or more parties that could win.
In the US, the president is elected and there are only two candidates; in the UK, the electorate vote for Members of Parliament (MPs) who select the prime minister. This means the candidates are different in each constituency and local factors can matter a great deal.
In the US, there are 50 states plus Washington DC, meaning 51 geographical areas. In the UK, there are currently 650 constituencies, meaning 650 geographical areas to survey.

These factors make forecasting UK elections harder than US elections, so perhaps we should be a bit more forgiving. But before we forgive, let's have a look at some of the UK's greatest election polling misses.

General elections

The 1992 UK general election was a complete disaster for the opinion polling companies in the UK [Smith]. Every poll in the run-up to the election forecast either a hung parliament (meaning, no single party has a majority) or a slim majority for the Labour party. Even the exit polls forecast a hung parliament. Unfortunately for the pollsters, the Conservative party won a comfortable working majority of seats. Bob Worcester, the best-known UK pollster at the time, said the polls were more wrong "...than in any time since their invention 50 years ago" [Jowell].

Academics proposed several possible causes [Jowell, Smith]:

"Shy Tories". The idea here is that people were too ashamed to admit they intended to vote Conservative, so they lied or didn't respond at all.
Don't knows/won't say. In any poll, some people are undecided or won't reveal their preference. To predict an election, you have to model how these people will vote, or at least have a reliable way of dealing with them, and that wasn't the case in 1992 [Lynn].
Voter turnout. Different groups of people actually turn out to vote in different proportions. The pollsters didn't handle differential turnout very well, leading them to overstate the proportion of Labour votes.
Quota sampling methods. Polling organizations use quota-based sampling to try and get a representative sample of the population. If the sampling is biased, then the results will be biased [Lynn, Smith].

As in the US in 1948, the pollsters re-grouped, licked their wounds and revised their methodologies.

After the disaster of 1992, surely the UK pollsters wouldn't get it wrong again? Moving forward to 2015, the pollsters got it wrong again!

In the 2015 election, the Conservative party won a working majority. This was a complex, multi-party election with strong regional effects, all of which were well-known at the time. As in 1992, the pollsters predicted a hung parliament and their subsequent humiliation was very public. Once again, there were various inquiries into what went wrong [Sturgis]. Shockingly, the "official" post-mortem once again found that sampling was the cause of the problem. The polls over-represented Labour supporters and under-represented Conservative supporters, and the techniques used by pollsters to correct for sampling issues were inadequate [Sturgis]. The official finding was backed up by independent research which further suggested pollsters had under-represented non-voters and over-estimated support for the Liberal Democrats [Melon].

Once again, the industry had a rethink.

There was another election in 2019. This time, the pollsters got it almost exactly right.

It's nice to see the polling industry getting a big win, but part of me was hoping Lord Buckethead or Count Binface would sweep to victory in 2019.

(Count Binface. Source: https://www.countbinface.com/)

(Lord Buckethead. Source: https://twitter.com/LordBuckethead/status/1273601785094078464/photo/1. Not the hero we need, but the one we deserve.)

EU referendum

This was the other great electoral shock of 2016. The polls forecast a narrow 'Remain' victory, but the reality was a narrow 'Leave' win. Very little has been published on why the pollsters got it wrong in 2016, but what little was published suggests that the survey method may have been important. The industry didn't initiate a broad inquiry; instead, individual polling companies were asked to investigate their own processes.

Other countries

There have been a series of polling failures in other countries. Here are just a few:

Takeaways

In university classrooms around the world, students are taught probability theory and statistics. It's usually an antiseptic view of the world, and opinion poll examples are often presented as straightforward math problems, stripped of the complex realities of sampling. Unfortunately, this leaves students unprepared for the chaos and uncertainty of the real world.

Polling is a complex, messy issue. Sampling governs the success or failure of polls, but sampling is something of a dark art and it's hard to assess its accuracy during a campaign. In 2020, do you know the sampling methodologies used by the different polling companies? Do you know who's more accurate than who?

Every so often, the polling companies take a beating. They re-group, fix the issues, and survey again. They get more accurate, and after a while, the press forgets about the failures and talks in glowing terms about polling accuracy, and maybe even doing away with the expensive business of elections in favor of polls. Then another debacle happens. The reality is, the polls are both more accurate and less accurate than the press would have you believe.

As Yogi Berra didn't say, "it's tough to make predictions, especially about the future".

If you liked this post, you might like these ones

Forecasting the 2020 election: a retrospective
What do presidential approval polls really tell us?
Fundamentally wrong? Using economic data as an election predictor - why I distrust forecasting models built on economic and other data
Can you believe the polls? - fake polls, leading questions, and other sins of opinion polling.
President Hilary Clinton: what the polls got wrong in 2016 and why they got it wrong - why the polls said Clinton would win and why Trump did.
Poll-axed: disastrously wrong opinion polls - a brief romp through some disastrously wrong opinion poll results.
Who will win the election? Election victory probabilities from opinion polls
Sampling the goods: how opinion polls are made - my experiences working for an opinion polling company as a street interviewer.
The electoral college for beginners - how the electoral college works

References

[Igo] '"A gold mine and a tool for democracy": George Gallup, Elmo Roper, and the business of scientific polling,1935-1955', Sarah Igo, J Hist Behav Sci. 2006;42(2):109-134

[Jowell] "The 1992 British Election: The Failure of the Polls", Roger Jowell, Barry Hedges, Peter Lynn, Graham Farrant and Anthony Heath, The Public Opinion Quarterly, Vol. 57, No. 2 (Summer, 1993), pp. 238-263

[Lusinchi] '“President” Landon and the 1936 Literary Digest Poll: Were Automobile and Telephone Owners to Blame?', Dominic Lusinchi, Social Science History 36:1 (Spring 2012)

[Lynn] "How Might Opinion Polls be Improved?: The Case for Probability Sampling", Peter Lynn and Roger Jowell, Journal of the Royal Statistical Society. Series A (Statistics in Society), Vol. 159, No. 1 (1996), pp. 21-28

[Melon] "Missing Nonvoters and Misweighted Samples: Explaining the 2015 Great British Polling Miss", Jonathan Mellon, Christopher Prosser, Public Opinion Quarterly, Volume 81, Issue 3, Fall 2017, Pages 661–687

[Smith] "Public Opinion Polls: The UK General Election, 1992", T. M. F. Smith, Journal of the Royal Statistical Society. Series A (Statistics in Society), Vol. 159, No. 3 (1996), pp. 535-545

[Squire] "Why the 1936 Literary Digest poll failed", Peverill Squire, Public Opinion Quarterly, 52, 125-133, 1988

[Sturgis] "Report of the Inquiry into the 2015 British general election opinion polls", Patrick Sturgis, Nick Baker, Mario Callegaro, Stephen Fisher, Jane Green, Will Jennings, Jouni Kuha, Ben Lauderdale, Patten Smith

[Topping] '‘‘Never argue with the Gallup Poll’’: Thomas Dewey, Civil Rights and the Election of 1948', Simon Topping, Journal of American Studies, 38 (2004), 2, 179–198

Saturday, June 13, 2020

Death by rounding

Round, round, round, round, I get around

Rounding errors are one of those basic things that every technical person thinks they're on top of and won't happen to them, but the problem is, it can and does happen to good people, sometimes with horrendous consequences. In this blog post, I'm going to look at rounding errors, show you why they can creep in, and provide some guidelines you should follow to keep you and your employer safe. Let's start with some real-life cases of rounding problems.

(Rounding requires a lot of effort. Image credit: Wikimedia Commons. License: Public Domain)

Rounding errors in the real world

The wrong rounding method

In 1992, there was a state-level election in Schleswig-Holstein in Germany. The law stated that every party that received 5% or more of the vote got a seat, but there were no seats for parties with less than 5%. The software that calculated results rounded the results up (ceil) instead of rounding the results down (floor) as required by law. The Green party received 4.97% of the vote, which was rounded up to 5.0%, so it appeared the Green party had won a seat. The bug was discovered relatively quickly, and the seat was reallocated to the Social Democrats who gained a one-seat majority because of it [Link].

Cumulative rounding

The more serious issue is cumulative rounding errors in real-time systems. Here a very small error becomes very important when it's repeatedly or cumulatively added.

The Vancouver Stock Exchange set up a new index in January 1982, with a value set to 1,000. The index was updated with each trade, but the index was truncated to three decimal places (always rounded down) instead of being rounded to the nearest value at three decimal places. The index was calculated thousands of times a day, so the error was cumulative. Over time, the error built up from something not noticeable to something very noticeable indeed. The exchange had to correct the error; on Friday November 25th, 1983 the index closed at 524.811, the rounding error was fixed, and when the exchange reopened, the index was 1098.892 - the difference being solely due to the rounding error bug fix [Link].

The most famous case of cumulative rounding errors is the Patriot missile problem in Dhahran in 1991. A Patriot missile failed to intercept a Scud missile, which went on to kill 28 people and injure a further 98. The problem came from the effects of a cumulative rounding error. The Patriot system updated every 0.1s, but 0.1 can't be represented exactly in a binary fixed-point system, so it has to be rounded, which in this case meant rounding down. The processors used by the Patriot system were old 24-bit systems that truncated the binary representation of 0.1. Over time, the truncation error built up, resulting in the Patriot missile incorrectly responding to sensor data and missing the Scud missile [Link].

In-built bias

The way you were taught to round numbers in school leads to bias; it's small, but it's there. You were taught to always round up numbers ending in 0.5, so 3.5 gets rounded to 4, 4.5 gets rounded to 5, and so on. Trouble is, by definition, 0.5 is exactly halfway. If you always round up, you're slightly biasing the results. Over time, or with a large enough data set, this can lead to errors.

To get round this problem, most modern languages use something called banker's rounding (also known as Gaussian rounding or round-half-to-even), which rounds halfway values towards the nearest even number, so 3.5 gets rounded to 4 and 4.5 gets rounded to 4 too; however, 5.5 gets rounded to 6. This form of rounding is the one recommended by the IEEE and is the rounding method used in Python 3 and NumPy. However, not all libraries use it; some still use the school round-half-up method.

Here's what you should do. If you need to use rounding, check out what rounding algorithm your system uses. If you have any choice, use banker's rounding.
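To make the bias concrete, here's a minimal Python sketch that rounds a set of exactly-halfway values two ways using the standard decimal module. The data (0.5, 1.5, ..., 999.5) and the helper name total_bias are my own illustrative choices, not anything from a real system.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# Illustrative data: values that sit exactly halfway between two integers.
halves = [Decimal("0.5") + n for n in range(1000)]  # 0.5, 1.5, ..., 999.5


def total_bias(values, rounding):
    """Sum of (rounded - original) over all values for a given rounding mode."""
    return sum(v.quantize(Decimal("1"), rounding=rounding) - v for v in values)


# School rounding: every halfway value moves up by 0.5, so the bias accumulates.
half_up_bias = total_bias(halves, ROUND_HALF_UP)

# Banker's rounding: half the values move up, half move down, so they cancel.
half_even_bias = total_bias(halves, ROUND_HALF_EVEN)

print("round-half-up bias:  ", half_up_bias)
print("round-half-even bias:", half_even_bias)

# Python 3's built-in round() uses round-half-to-even.
print(round(3.5), round(4.5), round(5.5))  # 4 4 6
```

With this data, school rounding adds 0.5 to every value, while banker's rounding moves half of the values up and half down.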

Theoretical explanation of rounding errors

Cumulative errors

Fairly obviously, cumulative errors are a sum:

E = ∑e

where E is the cumulative error and e is the individual error. In the Vancouver Stock Exchange example, the mean individual error when truncating to three decimal places was 0.0005. From Wikipedia, there were about 3,000 transactions per day, and the period from January 1st 1982 when the index started to November 25th, 1983 when the index was fixed was about 473 working days. This gives an expected cumulative error of about 710 (0.0005 × 3,000 × 473), which is in the ballpark of what actually happened.

Of course, if the individual error can be positive or negative, this can make the problem better or worse. If the error is distributed evenly around zero, then the individual errors tend to cancel and the cumulative error should stay close to zero, so things should be OK in the long run. But even a slight bias will eventually result in a significant cumulative error - regrettably, one that might take a long time to show up.

Although the formula above seems trivial, the point is, it is possible to calculate the cumulative effect of rounding errors.
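Here's a rough Python sketch of that calculation, followed by a small simulation comparing truncation with rounding to nearest. The trade and day counts are the figures quoted above; the simulated index values, the seed, and the function names are assumptions made purely for illustration and don't reproduce the real index calculation.

```python
import math
import random

# Figures quoted in the text.
MEAN_TRUNCATION_ERROR = 0.0005  # mean loss per update when truncating to 3 places
TRADES_PER_DAY = 3_000
WORKING_DAYS = 473

expected_error = MEAN_TRUNCATION_ERROR * TRADES_PER_DAY * WORKING_DAYS
print(f"expected cumulative error: {expected_error:.1f}")  # about 709.5


def truncate3(x):
    """Drop everything after the third decimal place (always rounds down for x >= 0)."""
    return math.floor(x * 1000) / 1000


def round3(x):
    """Round to the nearest value at three decimal places."""
    return round(x, 3)


def cumulative_error(rounder, n_updates=100_000, seed=0):
    """Sum of (true value - stored value) over many hypothetical index updates."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_updates):
        value = 1000 + rng.random()  # hypothetical index value
        total += value - rounder(value)
    return total


truncation_total = cumulative_error(truncate3)
nearest_total = cumulative_error(round3)
print(f"truncation:       {truncation_total:.3f}")
print(f"round to nearest: {nearest_total:.3f}")
```

Truncation errors all have the same sign, so they add up at roughly 0.0005 per update, while round-to-nearest errors fall on both sides of zero and largely cancel.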

Combining errors

When we combine numbers, errors can really hurt depending on what the combination is. Let's start with a simple example. If:

z = x - y

and:

s_z is the standard error in z

s_x is the standard error in x

s_y is the standard error in y

then

s_z = [s_x² + s_y²]^0.5

If x and y are numerically close to one another, errors can quickly become very significant. My first large project involved calculating quantum states, which included a formula like z = x - y. Fortunately, the rounding was correct and not truncated, but the combination of machine precision errors and the formulae above made it very difficult to get a reliable result. We needed the full precision of the computer system and we had to check the library code our algorithms used to make sure rounding errors were correctly dealt with. We were fortunate in that the results of rounding errors were obvious in our calculations, but you might not be so fortunate.
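As a minimal illustration of why subtracting nearly equal numbers is dangerous, here's a Python sketch of the formula for the error in a difference. The input values are invented for the example and the function name is my own.

```python
import math


def difference_with_error(x, s_x, y, s_y):
    """Return z = x - y and its standard error s_z = sqrt(s_x^2 + s_y^2)."""
    z = x - y
    s_z = math.sqrt(s_x**2 + s_y**2)
    return z, s_z


# Hypothetical inputs: each is known to 0.1%, and they are close in value.
z, s_z = difference_with_error(100.0, 0.1, 99.0, 0.1)
relative_error = s_z / abs(z)

print(f"z = {z:.2f}, s_z = {s_z:.3f}, relative error = {relative_error:.1%}")
```

Each input has a relative error of about 0.1%, but the difference has a relative error of about 14%, because the absolute errors combine while the value itself shrinks.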

Ratios are more complex. Let's define:

z = x/y

with the s values defined as before, then:

s_z/z = [(s_x/x)² + (s_y/y)²]^0.5

This suffers from the same problem as before: under certain conditions (for example, when y is small compared to its error), the error can become very significant very quickly. In a system like the Patriot missile, sensor readings are used in some very complex equations. Rounding errors can combine to become very important.

The takeaway is very easy to state: if you're combining numbers using a ratio or subtracting them, rounding (or other errors) can hurt you very badly very quickly.

Insidious rounding errors

Cumulative rounding errors and the wrong type of rounding are widely discussed on the internet, but I've seen two other forms of rounding that have caught people out. They're hard to spot but can be damaging.

Rounding in the wrong places - following general advice too closely

Many technical degrees include some training on how to present errors and significant digits. For example, a quantity like 12.34567890 ± 0.12345678 is usually written 12.3 ± 0.1. We're told not to include more significant digits than the error analysis warrants. Unfortunately, this advice can lead you astray if you apply it unthinkingly.

Let's say we're taking two measurements:

x = 5.26 ± 0.14

y = 1.04 ± 0.12

Following the rules of representing significant digits, this gives us:

x = 5.3 ± 0.1

y = 1.0 ± 0.1

If:

z = x/y

then with the pre-rounded numbers:

z = 5.1 ± 0.6

but with the rounded numbers we have:

z = 5.3 ± 0.5

Whoops! This is a big difference. The problem occurred because we applied the advice unthinkingly. We rounded the numbers prematurely; in calculations, we should have kept the full precision and only shown rounded numbers for display to users.

The advice is simple: preserve full precision in calculations and reserve rounding for numbers shown to users.
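The worked example can be checked with a few lines of Python. This is a sketch that applies the ratio error formula from earlier in this post to the two sets of numbers; the function name is my own.

```python
import math


def ratio_with_error(x, s_x, y, s_y):
    """Return z = x / y and s_z, using s_z/z = sqrt((s_x/x)^2 + (s_y/y)^2)."""
    z = x / y
    s_z = abs(z) * math.sqrt((s_x / x) ** 2 + (s_y / y) ** 2)
    return z, s_z


# Full-precision measurements, rounded only when displayed.
full = ratio_with_error(5.26, 0.14, 1.04, 0.12)

# Measurements rounded before the calculation (premature rounding).
early = ratio_with_error(5.3, 0.1, 1.0, 0.1)

full_text = f"{full[0]:.1f} ± {full[1]:.1f}"
early_text = f"{early[0]:.1f} ± {early[1]:.1f}"
print("full precision, rounded for display:", full_text)   # 5.1 ± 0.6
print("rounded before calculating:         ", early_text)  # 5.3 ± 0.5
```

The only rounding in the first calculation happens in the f-string, at the point of display, which is where it belongs.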

Spreadsheet data

Spreadsheets are incredible sources of errors and bugs. One of the insidious things spreadsheets do is round numbers, which can result in numbers appearing not to add up.

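Let's have a look at an example. The left of the table shows numbers before rounding. The right of the table shows the same numbers as displayed with rounding (suppressing the decimal places). The numbers on the right don't appear to add up: the displayed monthly values sum to 1206, but the displayed total is 1208, because the total is calculated from the unrounded values and only then rounded.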

Month	Unrounded	Month	Rounded
Jan	121.4	Jan	121
Feb	251.4	Feb	251
Mar	311.4	Mar	311
Apr	291.4	Apr	291
May	141.4	May	141
Jun	91.4	Jun	91
TOTAL	1208.4	TOTAL	1208

An insidious problem occurs when rounded numbers are copied from spreadsheets and used in calculations - which is a manifestation of the premature rounding problem I discussed earlier.
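The spreadsheet behavior is easy to reproduce. Here's a small Python sketch using the monthly figures from the table; the variable names are my own.

```python
# Monthly figures from the table above.
monthly = {
    "Jan": 121.4,
    "Feb": 251.4,
    "Mar": 311.4,
    "Apr": 291.4,
    "May": 141.4,
    "Jun": 91.4,
}

# What the spreadsheet does: total the stored values, round only for display.
true_total = sum(monthly.values())
displayed_total = round(true_total)

# What you get if you copy the displayed (rounded) cells and add them up.
sum_of_displayed = sum(round(value) for value in monthly.values())

print(displayed_total)   # 1208
print(sum_of_displayed)  # 1206
```

The gap of 2 is the six discarded 0.4s (2.4 in total), which is exactly the information lost when copying the displayed values instead of the stored ones.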

1.999... = 2, why 2 != 2, and machine precision

Although it's not strictly a rounding error, I do have to talk about the fact that 1.999... = 2. This result often surprises people, but it's an easy thing to prove. Unfortunately, a machine with finite precision can only hold a finite number of digits, so a test like 1.9999 == 2 will give you False! Just because something is mathematically true doesn't mean it's true on your system.

I've seen a handful of cases when two numbers that ought to be the same fail an equality test, the equivalent of 2 == 2 evaluating to False. One of the numbers has been calculated through a repeated calculation in which machine precision errors propagate; the other number has been calculated directly. Here's a fun example from Python 3:

1 == (1/7) + (1/7) + (1/7) + (1/7) + (1/7) + (1/7) + (1/7)

evaluates to False!

To get round this problem, I've seen programmers do True/False difference evaluations like this:

abs(a - b) <= machine_precision

The machine precision constant is usually called epsilon.
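Here's a minimal Python sketch of tolerance-based comparison. One caveat that's my addition rather than something from the cases I've seen: a fixed absolute tolerance equal to epsilon only suits values near 1, because the gap between adjacent floating-point numbers grows with their magnitude. Python's math.isclose uses a relative tolerance by default, which sidesteps that. The helper nearly_equal_abs and the test values are illustrative.

```python
import math
import sys

epsilon = sys.float_info.epsilon  # gap between 1.0 and the next float up


def nearly_equal_abs(a, b, tolerance=epsilon):
    """Absolute-difference test, as in the expression above."""
    return abs(a - b) <= tolerance


# A classic case near 1: exact equality fails, tolerance tests pass.
print(0.1 + 0.2 == 0.3)                  # False
print(nearly_equal_abs(0.1 + 0.2, 0.3))  # True
print(math.isclose(0.1 + 0.2, 0.3))      # True

# The sum of seven sevenths, compared with a relative tolerance.
seventh_sum = sum([1 / 7] * 7)
print(math.isclose(seventh_sum, 1.0))    # True

# For large values, neighboring floats differ by far more than epsilon,
# so the absolute test is too strict while the relative test still passes.
big_b = 1e10
big_a = big_b * (1 + epsilon)
print(nearly_equal_abs(big_a, big_b))    # False
print(math.isclose(big_a, big_b))        # True
```

Which tolerance is right depends on the calculation; the point is to choose it deliberately rather than rely on ==.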

What to watch for

Cumulative errors in fixed-point systems

The Patriot missile case makes the point nicely: if you're using sensor data in a system using fixed-point arithmetic, or indeed in any computer system, be very careful how your system rounds its inputs. Bear in mind, the rounding might be done in an ADC (analog-to-digital converter) beyond your control - in which case, you need to know how it converts data. If you're doing the rounding, you might need to use some form of dithering.

Default rounding and rounding methods

There are several different rounding methods you can use; your choice should be a deliberate one and you should know their behavior. For example, in Python, you have:

floor
ceil
round - which uses banker's rounding, not the school textbook form of rounding; its behavior changed between Python 2 and Python 3.

You should be aware of the properties of each of these rounding methods. If you wanted to avoid the Vancouver Stock Exchange problem, what form of rounding would you choose and why? Are you sure?
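As a quick reference, here's a Python sketch comparing these functions on a few positive and negative values. I've added math.trunc to the comparison because it behaves differently from floor for negative numbers; the test values are arbitrary.

```python
import math

values = (2.5, -2.5, 2.7, -2.7)

# Map each value to (floor, ceil, trunc, round).
results = {
    value: (math.floor(value), math.ceil(value), math.trunc(value), round(value))
    for value in values
}

print(f"{'value':>6} {'floor':>6} {'ceil':>6} {'trunc':>6} {'round':>6}")
for value, (floored, ceiled, truncated, rounded) in results.items():
    print(f"{value:>6} {floored:>6} {ceiled:>6} {truncated:>6} {rounded:>6}")
```

Note that floor and trunc agree for positive numbers but not for negative ones: floor always moves towards minus infinity, while trunc moves towards zero.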

A more subtle form of rounding can occur when you mix integers and floating-point numbers in calculations. Depending on your system and the language you use, an expression like 7/2 can give different answers (3 with integer division, 3.5 with floating-point division). I've seen some very subtle bugs involving hidden type conversion, so be careful.
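Python 3 makes a handy illustration of how the operator and operand types change the answer. This is only a sketch of Python's behavior; other languages have different rules, so check yours.

```python
true_division = 7 / 2          # 3.5: "/" always returns a float in Python 3
floor_division = 7 // 2        # 3: "//" rounds towards minus infinity
negative_floor = -7 // 2       # -4: not -3, because of the floor
negative_trunc = int(-7 / 2)   # -3: int() truncates towards zero
float_floor = 7.5 // 2         # 3.0: floor division on floats returns a float

print(true_division, floor_division, negative_floor, negative_trunc, float_floor)
```

In Python 2, 7 / 2 with two integers gave 3, so the same source code can change behavior when it moves between language versions.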

Premature rounding

You were taught to only present numbers to an appropriate number of decimal places, but that was only for presentation. For calculations, use the full precision available.

Spreadsheets

Be extremely careful copying numbers from spreadsheets. The numbers may have been rounded for display, and you may need to look at the underlying cell values to get the extra digits of precision.

Closing thoughts

Rounding seems like a simple problem that happens to other people, but it can happen to you and it can have serious consequences. Take some time to understand the properties of the system and be especially careful if you're doing cumulative calculations, mixed floating-point and integer calculation, or if you're using a rounding function.

Saturday, May 30, 2020

Inventory: your job may depend on how it's managed

Why should you care about inventory?

For your own job security, you need to understand the financial position of your employer; their true financial position will govern your pay and promotion prospects. Long before they affect your job, you can spot looming signs of trouble in company financial statements. Learning a little accounting will also help you understand news stories, as we'll see. In this blog post, I'm going to talk about one of the easiest signs of trouble to spot: inventory problems. Because it's always fun, I'm going to include some cases of fraud. Bear in mind, innocent people lose their jobs because of inventory issues; I hope you won't be one of them.

(Is inventory good or bad? It depends. Image credit: Wikimedia Commons. License: public domain.)

Inventory concepts

Inventories are items held for sale or items that will be used to manufacture products. Good examples are the items retailers hold for sale (e.g. clothes, food, books) and the parts manufacturers hold (e.g. parts for use on a car assembly line). On a balance sheet, inventory is listed as a current asset, which means it's something that can be turned into cash 'quickly'. There are different types of accounting inventory, but I won't go into what they are.

Inventory changes can be benign, but they can also be a sign of trouble. Let's imagine a bookseller whose inventory is increasing. Is this good or bad?

If the bookseller is expanding (more sales, more shops), then increasing inventory is a sign of success.
If the bookseller is not expanding, then increasing inventory is deeply concerning. The bookseller is buying books it can't sell.

Inventory has to be valued, and the choice of valuation method opens the door to shenanigans. Let's imagine you're a coal-burning power station and you have a stockpile of coal. The price of coal fluctuates. Do you value your stockpile of coal at current market prices or the price that you paid for it? There are two common ways of valuing inventory: FIFO and LIFO.

FIFO is first-in, first-out - the first items purchased are the first items sold. Inventory is valued at current market prices.
LIFO is last-in, first-out - the last items purchased are the first items sold. Inventory is valued at historic market prices.

If prices are going up, then LIFO increases the cost of goods sold and reduces profitability; conversely, FIFO reduces the cost of goods sold and increases profitability. If prices are going down, the effects are reversed. There are also tax implications for the different inventory evaluation methods.
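To see the effect in numbers, here's a simplified Python sketch that computes the cost of goods sold under each method. The purchase quantities and prices are invented for illustration, and real inventory accounting has many more rules than this.

```python
def cost_of_goods_sold(purchases, units_sold, method):
    """Cost of the units sold.

    purchases: list of (quantity, unit_cost) tuples, oldest purchase first.
    method: "FIFO" uses the oldest purchases first, "LIFO" the newest first.
    """
    if method == "FIFO":
        layers = list(purchases)
    elif method == "LIFO":
        layers = list(reversed(purchases))
    else:
        raise ValueError(f"unknown method: {method}")

    remaining = units_sold
    cogs = 0.0
    for quantity, unit_cost in layers:
        used = min(quantity, remaining)
        cogs += used * unit_cost
        remaining -= used
        if remaining == 0:
            break
    if remaining > 0:
        raise ValueError("not enough inventory to cover the units sold")
    return cogs


# Hypothetical purchase histories, oldest first.
rising_prices = [(100, 10.0), (100, 12.0), (100, 14.0)]
falling_prices = [(100, 14.0), (100, 12.0), (100, 10.0)]

for label, purchases in (("rising", rising_prices), ("falling", falling_prices)):
    fifo = cost_of_goods_sold(purchases, 150, "FIFO")
    lifo = cost_of_goods_sold(purchases, 150, "LIFO")
    print(f"{label} prices: FIFO COGS = {fifo}, LIFO COGS = {lifo}")
```

With rising prices, LIFO gives the higher cost of goods sold (2000 against 1600 here); with falling prices, it gives the lower one, which is worth keeping in mind for the story that follows.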

Obviously, things are more complex than I've described here but we have enough of the basic ideas, so let's get to the fraud stories.

Inventory shenanigans

OM Group produced specialty chemicals from raw materials, including cobalt. In the early 2000s, cobalt was mostly sourced as a by-product from mines in the Democratic Republic of the Congo, a very unstable part of the world. The price of cobalt was going down and OM Group saw a way of making that work to their advantage. Their first step was to use the LIFO method of valuing their cobalt inventory. The next step was to buy up cheap cobalt and keep buying as the price dropped. Here's what that meant. Because they used LIFO, for accounting purposes, the cobalt they used in production was valued at the new (low) market price, so the cost of goods sold went down and profitability went up! The older (and more expensive) cobalt was kept in inventory. To keep the business profits increasing, they needed the price of cobalt to go down and they needed to buy more of it, regardless of their manufacturing needs. The minute prices went up, or they started eating into inventory, or they stopped buying more cobalt, profitability would fall. To put it simply, the boost to profits was an accounting shell game.

(OM Group logo at the time. Image credit: Wikimedia Commons. License: Public Domain.)

As you might expect, the music eventually stopped. The SEC charged some of the executives with fraud and reached a settlement with the company, and there was a class-action lawsuit from some company investors. Unsurprisingly, the company later changed its name when the dust settled. If you want to understand how you could spot something like this, there's a very readable description of the accounting fraud by Gary Mishuris, an analyst who spotted it.

Manufacturing plants run best when producing at a constant rate, but market demands fluctuate. If demand reduces, then inventory will increase, something that will be obvious in a company's financial statements. How can a company disguise a drop in demand? One illegal way is through something called 'channel stuffing', which is forcing your distributors and resellers to take your unsold inventory so you can record it as sales.

Semiconductor companies are manufacturers and typically have large distribution networks through resellers covering different geographies or markets. For example, a semiconductor company may sell directly to large electronics companies but may serve smaller electronics companies through resellers, who may, in turn, resell to other distributors and so on.

Between 1995 and 2006, Vitesse Semiconductor used channel stuffing extensively to manage its earnings. It had an arrangement with its distributors that they could unconditionally sell back any chips they had bought and not sold. Here's how channel stuffing worked: if Vitesse needed to increase profits in a quarter, they would require their distributors to buy Vitesse's inventory. This would show up as an increase in Vitesse's sales for that quarter. At some point in the future, the resellers could sell the chips back to Vitesse. The chips themselves might never leave the warehouse, but might have been 'owned' by several different companies. In other words, short-term increases in profitability were driven by an accounting scam.

(Vitesse Semiconductor chip. Image credit: Raimond Spekking via Wikimedia Commons. License: Creative Commons)

Of course, this is all illegal and the SEC took action against the company; the executives were indicted for fraud. Vitesse had to restate their earnings substantially downwards, which in turn triggered a class action lawsuit. This fraud has even made it into fraud textbooks.

I want to stop for a minute and ask you to think. These are entertaining stories, but what if you were an (innocent) employee of OM Group or Vitesse Semiconductor? When the SEC arrests the leadership, what are the implications for employees? When the accounts are restated and profitability takes a nose dive, what do you think the pay and job prospects are like for the rank-and-file workers?

Inventory and politics - Brexit

A while back, I was chatting to a Brexit supporter when a news report came on the TV: UK economic output had increased, but the increase had gone into inventory, not sales. Manufacturers and others were assuming Brexit would disrupt their supply chain, so they'd increased output to give them a buffer. I was horrified, but the Brexit supporter thought this was great news. After chatting some more, I realized they had no understanding of how inventory worked. Let's talk through some scenarios to understand why the news was bad.

Scenario 1: no Brexit supply chain disruption. UK firms have an excess of inventory. They can either keep the inventory indefinitely (and pay higher costs than their overseas competitors) or they can run down inventory which means fewer hours for their workers.

Scenario 2: Brexit supply chain disruption. UK firms can't get parts, so they run down inventory until supply chain issues are fixed. Selling inventory means fewer hours worked by the workers.

In both scenarios, UK firms have incurred production costs earlier than their overseas competitors, which reduces their cash flow.

(Image credit: Wikimedia Commons - Tim Reckmann. License: Creative Commons.)

This is obviously highly simplified, but you get the point. Neither of these scenarios is good for firms or for workers.

If increasing inventory without sales is so good, why don't firms do it all the time? In fact, why bother with the pesky business of selling at all when you can just produce for inventory? The question seems silly, but answering it leads you to consider what an optimum level of inventory is. The logic pushes you towards just-in-time inventory management, which in turn leads to an understanding of why supply chain interruptions are bad.

Closing thoughts

Your job security depends on the financial stability of your employer. If you work for a company that produces public accounts, you have an opportunity to make your own risk assessment. Inventory is one factor among many you should watch. Here are some things you should look out for:

Changes to inventory evaluation methods (LIFO, FIFO).
Increases in inventory not matched to growth.
Increasing sales to distributors not matched to underlying market demand (especially when the inventory never leaves the company).

Yes, some companies do produce fraudulent accounts, and yes, some do hide poor performance, but you can still take steps to protect your career based on the cold hard reality of financial statements, not on hype.

Saturday, May 23, 2020

Finding electoral fraud - the democracy data deficit

Why we need to investigate fraud

In July 2016, Fox News' Sean Hannity reported that Mitt Romney received no votes at all in 59 Philadelphia voting precincts in the 2012 Presidential Election. He claimed that this was evidence of vote-rigging - something that received a lot of commentary and on-air discussion at the time. On the face of it, this does sound like outright electoral fraud; in a fair election, how is it possible for a candidate to receive no votes at all? Since then, there have been other allegations of fraud and high-profile actual incidents of fraud. In this blog post, I’m going to talk about how a citizen-analyst might find electoral fraud. But I warn you, you might not like what I’m going to say.

(Image credit: National Museum of American History, via Wikimedia Commons. License: Public Domain.)

Election organization - the smallest electoral units

In almost every country, the election process is organized in the same way; the electorate is split into geographical blocks small enough to be managed by a team on election day. The blocks might contain one or many polling stations and may have a few hundred to a few thousand voters. These blocks are called different things in different places, for example, districts, divisions, or precincts. Because precinct seems to be the most commonly used word, that's what I'm going to use here. The results from the precincts are aggregated to give results for the ward, county, city, state, or country. The precinct boundaries are set by different authorities in different places, but they're known.

How to look for fraud

A good place to look for electoral shenanigans is at the precinct level, but what should we look for? There are several easy checks:

A large and unexplained increase or decrease in the number of voters compared to previous elections and compared to other nearby precincts.
An unexpected change in voting behavior compared to previous elections/nearby precincts. For example, a precinct that ‘normally’ votes heavily for party Y suddenly voting for party X.
Changes in voting patterns for absentee voters, e.g. significantly more or fewer absentee votes, or absentee voting patterns that are very different from in-person votes.
Results that seem inconsistent with the party affiliation of registered voters in the precinct.
A result that seems unlikely given the demographics of the precinct.

Of course, none of these checks is a smoking gun, either individually or collectively, but they might point to divisions that should be investigated. Let’s start with the Philadelphia case and go from there.
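The first two checks can be sketched in a few lines of Python. Everything here is hypothetical: the precinct names, the vote counts, the thresholds, and the function name are invented for illustration, and a flag only means "worth a closer look".

```python
def flag_precincts(previous, current, turnout_change=0.25, share_swing=0.20):
    """Flag precincts with a large turnout change or vote-share swing.

    previous, current: dicts mapping precinct name -> {"x": votes, "y": votes}.
    turnout_change, share_swing: arbitrary illustrative thresholds.
    """
    flagged = {}
    for precinct, now in current.items():
        before = previous.get(precinct)
        if before is None:
            continue  # no history to compare against
        turnout_before = before["x"] + before["y"]
        turnout_now = now["x"] + now["y"]
        if turnout_before == 0 or turnout_now == 0:
            continue  # avoid dividing by zero

        reasons = []
        change = (turnout_now - turnout_before) / turnout_before
        if abs(change) > turnout_change:
            reasons.append(f"turnout changed by {change:+.0%}")
        swing = now["x"] / turnout_now - before["x"] / turnout_before
        if abs(swing) > share_swing:
            reasons.append(f"party X share moved by {swing:+.0%}")
        if reasons:
            flagged[precinct] = reasons
    return flagged


# Hypothetical results for two elections.
election_1 = {
    "P1": {"x": 100, "y": 500},
    "P2": {"x": 120, "y": 480},
    "P3": {"x": 300, "y": 300},
}
election_2 = {
    "P1": {"x": 110, "y": 490},  # much the same as before
    "P2": {"x": 400, "y": 210},  # big swing to party X
    "P3": {"x": 150, "y": 150},  # turnout halved
}

flagged = flag_precincts(election_1, election_2)
for precinct, reasons in flagged.items():
    print(precinct, reasons)
```

A fuller version would also compare each precinct with its neighbors and with registration and demographic data, which is where getting hold of the data becomes the real problem.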

Electoral fraud - imagined and real

It’s true that some divisions (precincts) in Philadelphia voted overwhelmingly for Obama in 2012. These divisions were small (averaging about 600 voters) and almost exclusively (95%+) African-American. Obama was hugely popular with the African-American community in Philadelphia, polling 93%+. The same divisions also have a history of voting overwhelmingly Democratic. Given these facts, it’s not at all surprising to see no or very few votes for Mitt Romney. Similar arguments hold for allegations of electoral fraud in Cleveland, Ohio in 2012.

In fact, there were some unbalanced results the other way too; in some Utah precincts, Obama received no votes at all - again not surprising given the voter population and voter history.

Although on the face of it these lopsided results seem to strongly indicate fraud, the allegations don't stand up to analytical scrutiny.

Let’s look at another alleged case of electoral fraud, this time in 2018 in North Carolina. The congressional election was fiercely contested and appeared to be narrowly decided in favor of Mark Harris. However, investigators found irregularities in absentee ballots, specifically, missing ballots from predominantly African-American areas. The allegations were serious enough that the election was held again, and criminal charges have been brought against a political operative in Mark Harris’ campaign. The allegation is ‘ballot harvesting’, where operatives persuade voters who might vote for their opposition to vote via an absentee ballot and subsequently make these ballots disappear.

My sources of information here are newspaper reports and analysis, but what if I wanted to do my own detective work and find areas where the results looked odd? How might I get the data? This is where things get hard.

Democracy’s data - official sources

To get the demographics of a precinct, I can try going to the US Census Bureau. The Census Bureau defines small geographic areas, called tracts, that they can supply data on. Tract data include income levels, population, racial makeup, etc. Sometimes, these tracts line up with voting districts (the Census term for precincts), but sometimes they don’t. If tracts don’t line up with voting districts, then automated analysis becomes much harder. In my experience, it takes a great investment of time to get any useful data from the Census Bureau; the data’s there, it’s just really hard finding out how to get it. In practice then, it’s extremely difficult for a citizen-analyst to link census data to electoral data.

What about voting results? Surely it’s easy to get electoral result data? As it turns out, this is surprisingly hard too. You might think the Federal Election Commission (FEC) will have detailed data, but it doesn’t. The data available from the FEC for the 2016 Presidential Election is less detailed than the 2016 Presidential Election Wikipedia page. The reason is, Presidential Elections are run by the states, so there are 51 (including Washington DC) separate authorities maintaining electoral results, which means 51 different ways of getting data, 51 different places to get it, and 51 different levels of detail available. The FEC sources its data from the states, so it's not surprising its reports are summary reports.

If we need more detailed data, we need to go to the states themselves.

Let's take Massachusetts as an example. Presidential Election data is available for 2016 down to the ward level (as a CSV). For Utah, data is only available at the county level (as an Excel file). Pennsylvania also only provides county-level data, and only from a web page. Getting detail below the county level may take freedom of information requests, if the information is available at all.

In effect, this puts precinct-level nationwide voting analysis from official sources beyond almost all citizen-analysts.

Democracy’s data - unofficial sources

In practice, voting data is hard to come by from official sources, but it is available from unofficial sources who've put the work into getting the data from the states and make it available to everyone.

Dave Leip offers election data down to detailed levels; the 2016 results by county will cost you $92 and results by Congressional District will cost you $249. However, high-level results are on his website and available for free. He's even been kind enough to list his sources and URLs if you want to spend the time to duplicate his work. Leip’s data is used by the media in their analysis, and probably by political campaigns too. He’s put in a great deal of work to gather the data and he’s asking for a return on his effort, which is fair enough.

The MIT Election Data and Science Lab (MEDSL) collects election data, including down to the precinct level, and the data is available for the most recent Presidential Election (2016 at the time of writing). As usual with this kind of data, there are all kinds of notes to read before using the data. MIT has also been kind enough to make available tools to analyze the data, along with their website scraping tools.

The MIT project isn't the only project providing data. Various other universities have collated electoral resources at various levels of detail:

Harvard Kennedy School
University of Michigan Library
Princeton University Library
The University of Florida has a project to provide US election data at the precinct level and has also made their precinct data available online.

Democracy’s data - electoral fraud

What about looking for cases of electoral fraud? There isn't a central repository of electoral fraud cases and there are multiple different court systems in the US (state and federal), each maintaining records in different ways. Fortunately, Google indexes a lot of cases, but often, court transcripts are only available for a fee, and of course, it's extremely time-consuming to trawl through cases.

The Heritage Foundation maintains a database of known electoral fraud cases. They don't claim their database is complete, but they have put a lot of effort into maintaining it and it's the most complete record I know of.

In 2018, there were elections for the House of Representatives, the Senate, state elections, and of course county and city elections. Across the US, there must have been thousands of different elections in 2018. How many cases of electoral fraud do you think there were? What level of electoral fraud would undermine your faith in the system? In 2018, there were 65 cases. From the Heritage Foundation data, here’s a chart of fraud cases per year for the United States as a whole.

(Electoral fraud cases by year from the Heritage Foundation electoral fraud database)

It does look like there's been an increase in electoral fraud up to about 2010, but bear in mind the dataset covers the period of computerization and the rise of the internet. We might expect a rise in recorded fraud cases simply because it's become easier to find case records.
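If you want to build a per-year chart like this yourself, the counting step is simple. Here's a minimal sketch, assuming you've already exported the cases into a list of records with a year field; the `cases_per_year` function and the `year` field name are my own placeholders, not the Heritage Foundation's actual format.

```python
from collections import Counter


def cases_per_year(cases, year_field="year"):
    """Count case records per year.

    cases: an iterable of dicts, one per fraud case (hypothetical format).
    year_field: the key holding the year; an assumed name, adjust to your data.
    Returns a dict mapping year (int) -> number of cases, sorted by year.
    """
    # int() lets the year be stored as either a string or a number.
    counts = Counter(int(case[year_field]) for case in cases)
    return dict(sorted(counts.items()))
```

The result feeds straight into any plotting library as x (years) and y (counts). Remember that a count like this reflects the cases that were recorded, not necessarily the cases that occurred.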

Based on this data, there really doesn’t seem to be large-scale electoral fraud in the United States. In fact, in reading the cases on their website, most of them are small-scale frauds concerning local elections (e.g. mayoral elections) - in a lot of cases, the frauds are frankly pathetic.

Realistic assessment of election data

Official data is either hard to come by or not available at the precinct level, which leaves us using unofficial data. Fortunately, unofficial data is high quality and from reputable sources. The problem is, data from unofficial sources aren't available immediately after an election; there may be a long delay between the election and the data. If one of the goals of electoral data analysis is finding fraud, then timely data is paramount.

Of course, the kind of analysis I'm talking about here won't find small-scale fraud, where a person votes more than once or impersonates someone. But small-scale fraud will only affect the outcome of the very tightest of races. Democracy is most threatened by fraud that might affect the results, which in most cases is larger-scale fraud like the North Carolina case. Statistical analysis might detect these kinds of fraud.

Sean Hannity's allegation of electoral fraud in Philadelphia didn't stand up to analysis, but it was worth investigating and is the kind of fraud we could detect using data - if only it were available in a timely way.

How things could be - a manifesto

Imagine groups of researchers sitting by their computers on election night. As election results at the precinct level are posted online, they analyze the results for oddities. By the next morning, they may have spotted oddities in absentee ballots, or unexplained changes in voting behavior, or unexpected changes in voter turnout - any of which will feed into the news cycle. Greater visibility of anomalies will enable election officials to find and act on fraud more quickly.
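To give a flavor of what one of those election night checks might look like, here's a rough sketch that flags precincts whose turnout changed unusually compared with the previous election. It's one possible approach, not an established method for fraud detection: the function name, the input format (precinct ID mapped to turnout fraction), and the 3.5 cutoff are all assumptions I've made for illustration. It uses a robust z-score based on the median and the median absolute deviation, so a single extreme precinct doesn't distort the baseline.

```python
from statistics import median


def flag_turnout_anomalies(previous, current, threshold=3.5):
    """Return precinct IDs whose turnout change is an outlier.

    previous, current: dicts mapping precinct ID -> turnout fraction (0 to 1).
    These inputs are a hypothetical format, not any official data layout.
    threshold: cutoff on the robust z-score; 3.5 is a common rule of thumb.
    """
    # Only compare precincts that appear in both elections.
    shared = sorted(set(previous) & set(current))
    changes = {p: current[p] - previous[p] for p in shared}
    if not changes:
        return []

    centre = median(changes.values())
    # Median absolute deviation: a spread measure that resists outliers.
    mad = median(abs(c - centre) for c in changes.values())
    if mad == 0:
        return []  # No measurable spread, so the score is undefined.

    # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores.
    return [p for p in shared
            if 0.6745 * abs(changes[p] - centre) / mad > threshold]
```

A flagged precinct is only a prompt for a closer look. Boundary changes, a hotly contested local race, or a data entry error can all produce a big swing in turnout without any fraud at all.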

Doing this will require consistent reporting at the state level and a commitment to post precinct results as soon as they're counted and accepted. This may sound unlikely, but there are federal standards the states must follow in many other areas, including deodorants, teddy bears, and apple grades, as well as highway construction, the minimum drinking age, and the environment. Isn't the transparency of democracy at least as important as deodorants, teddy bears, and apples?

If you liked this post, you might like these ones

Forecasting the 2020 election: a retrospective

What do presidential approval polls really tell us?

Fundamentally wrong? Using economic data as an election predictor - why I distrust forecasting models built on economic and other data

Can you believe the polls? - fake polls, leading questions, and other sins of opinion polling.

President Hilary Clinton: what the polls got wrong in 2016 and why they got it wrong - why the polls said Clinton would win and why Trump did.

Poll-axed: disastrously wrong opinion polls - a brief romp through some disastrously wrong opinion poll results.

Who will win the election? Election victory probabilities from opinion polls

Sampling the goods: how opinion polls are made - my experiences working for an opinion polling company as a street interviewer.

The electoral college for beginners - how the electoral college works