Showing posts with label data analytics. Show all posts

## Saturday, February 27, 2021

### Simpson's paradox: a trap for the naive analyst

Let's imagine you're the Chief Revenue Officer at a manufacturing company that sells tubes and cylinders. You're having trouble with European sales reps discounting, so you offer a spif: the country team that sells at the highest price gets a week-long vacation somewhere warm and sunny with free food and drink. The Italian and German sales teams are raring to go.

At the end of the quarter, you have these results [Wang]:

| Sales team | Cylinder: no. of sales | Cylinder: average price | Tube: no. of sales | Tube: average price |
|---|---|---|---|---|
| German | 80 | €100 | 20 | €70 |
| Italian | 20 | €120 | 80 | €80 |

This looks like a clear victory for the Italians! They maintained a higher price for both cylinders and tubes! If they have a higher price for every item, then obviously, they've won. The Italians start packing their swimsuits.

Not so fast, say the Germans, let's look at the overall results.

| Sales team | Average price |
|---|---|
| German | €94 |
| Italian | €88 |

Despite having a lower selling price for both cylinders and tubes, the Germans have maintained a higher selling price overall!

How did this happen? It's an instance of Simpson's paradox.

# Why the results reversed

Here's how this happened: the Germans sold more of the expensive cylinders and the Italians sold more of the cheaper tubes. The average price is the ratio of the total monetary amount/total sales quantity. To put it very simply, ratios (prices) can behave oddly.
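To make the ratio explicit, here's a minimal Python sketch that recomputes the pooled averages from the table above. The `sales` dictionary and the function name are my own illustrative choices; only the numbers come from the table.

```python
# Sales figures from the table above: (number of sales, average price in euros)
sales = {
    "German": {"cylinder": (80, 100), "tube": (20, 70)},
    "Italian": {"cylinder": (20, 120), "tube": (80, 80)},
}


def pooled_average_price(team):
    """Total revenue divided by total units, across all product types."""
    revenue = sum(units * price for units, price in team.values())
    units = sum(units for units, _ in team.values())
    return revenue / units


pooled = {name: pooled_average_price(team) for name, team in sales.items()}

for name, team in sales.items():
    per_product = {product: price for product, (_, price) in team.items()}
    print(name, per_product, "pooled:", pooled[name])
```

The Italians have the higher price in every product row, but the pooled figure weights each price by how many units were sold at it, and that weighting is what flips the ranking.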

Let's look at a plot of the selling prices for the Germans and Italians.

The blue circles are tubes and the orange circles are cylinders. The size of the circles represents the number of sales. The little red dot in the center of the circles is the price.

Let's look at cylinders. Plainly, the Italians sold them at a higher price, but they're the most expensive item and the Germans sold more of them. Now, let's look at tubes. Once again, the Italians sold them at a higher price than the Germans, but they're cheaper than cylinders and the Italians sold more of them.

You can probably see where this is going. Because the Italians sold more of the cheaper items, their average (or pooled) price is dragged down, despite maintaining a higher price on a per-item basis. I've re-drawn the chart, but this time I've added a horizontal black line that represents the average.

The product type (cylinders or tubes) is known in statistics as a confounder because it confounds the results. It's also known as a conditioning variable.

# A disturbing example - does this drug work?

The sales example is simple and you can see the cause of the trouble immediately. Let's look at some data from a (pretend) clinical trial.

Imagine there's some disease that impacts men and women and that some people get better on their own without any treatment at all. Now let's imagine we have a drug that might improve patient outcomes. Here's the data [Lindley].

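| | Female: recovered | Female: not recovered | Female: recovery rate | Male: recovered | Male: not recovered | Male: recovery rate |
|---|---|---|---|---|---|---|
| Took drug | 8 | 2 | 80% | 12 | 18 | 40% |
| Did not take drug | 21 | 9 | 70% | 3 | 7 | 30% |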

Wow! The drug gives everyone an added 10% on their recovery rate. Surely we need to prescribe this for everyone? Let's have a look at the overall data.

| Everyone | Recovered | Not recovered | Recovery rate |
|---|---|---|---|
| Took drug | 20 | 20 | 50% |
| Did not take drug | 24 | 16 | 60% |

What this data is saying is, the drug reduces the recovery rate by 10%.

Let me say this again.

• For men, the drug improves recovery by 10%.
• For women, the drug improves recovery by 10%.
• For everyone, the drug reduces recovery by 10%.

Suppose I'm a clinician and I know you have the disease. If you're a woman, I would recommend you take the drug. If you're a man, I would recommend you take the drug. But if I don't know your gender, I would advise you not to take the drug. What!!!!!

This is exactly the same math as the sales example I gave you above. The explanation is the same. The only thing different is the words I'm using and the context.
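If you want to check the arithmetic yourself, here's a small Python sketch using the counts from the tables above. The data structure and function names are just my illustration.

```python
# (recovered, not recovered) counts from the tables above
trial = {
    "female": {"drug": (8, 2), "no drug": (21, 9)},
    "male": {"drug": (12, 18), "no drug": (3, 7)},
}


def recovery_rate(recovered, not_recovered):
    return recovered / (recovered + not_recovered)


def pooled_rate(arm):
    """Add up the counts across both groups before taking the ratio."""
    recovered = sum(trial[group][arm][0] for group in trial)
    not_recovered = sum(trial[group][arm][1] for group in trial)
    return recovery_rate(recovered, not_recovered)


by_group = {
    group: {arm: recovery_rate(*counts) for arm, counts in arms.items()}
    for group, arms in trial.items()
}
overall = {arm: pooled_rate(arm) for arm in ("drug", "no drug")}

print(by_group)
print(overall)
```

Notice where the weight sits: most of the people who took the drug were men (30 of 40), and men recover less often with or without it, so the pooled "took drug" rate is pulled towards the low male rates.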

# Simpson and COVID

In the United States, it's pretty well-established that black and Hispanic people have suffered disproportionately from COVID. Not only is their risk of getting COVID higher, but their health outcomes are worse too. This has been extensively covered in the press and on the TV news.

In the middle of 2020, the CDC published data that showed fatality rates by race/ethnicity. The fatality rate means the fraction of patients with COVID who die. The data showed a clear result: white people had the worst fatality rate of the racial groups they studied.

Doesn't this contradict the press stories?

No.

There are three factors at work:

• The fatality rate increases with age for all ethnic groups. It's much higher for older people (75+) than younger people.
• The white population is older than the black and Hispanic populations.
• Whites have lower fatality rates in almost all age groups.

This is exactly the same as the German and Italian sales team example I started with. As a fraction of their population, there are more old white people than old black and Hispanic people, so the fatality rates for the white population are dominated by the older age group in a way that doesn't happen for blacks and Hispanics.

In this case, the overall numbers are highly misleading and the more meaningful comparison is at the age-group level. Mathematically, we can remove the effect of different demographics to make an apples-to-apples comparison of fatality rates, and that's what the CDC has done.
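One common way to remove demographic differences is direct standardization: apply each group's age-specific rates to the same reference age mix. I'm not claiming this is the CDC's exact method, and the counts below are invented purely for illustration; they are not CDC data.

```python
# HYPOTHETICAL counts, invented for illustration only (not CDC data):
# group -> age band -> (cases, deaths)
cases = {
    "group A": {"under 75": (200, 2), "75+": (800, 160)},
    "group B": {"under 75": (800, 16), "75+": (200, 50)},
}


def crude_rate(bands):
    """Deaths divided by cases, ignoring age."""
    total_cases = sum(n for n, _ in bands.values())
    total_deaths = sum(d for _, d in bands.values())
    return total_deaths / total_cases


# Reference ("standard") population: the pooled case counts in each age band.
standard = {}
for bands in cases.values():
    for band, (n, _) in bands.items():
        standard[band] = standard.get(band, 0) + n
standard_total = sum(standard.values())


def adjusted_rate(bands):
    """Weight each age-specific rate by the standard population's age mix."""
    return sum(
        (deaths / n) * standard[band] / standard_total
        for band, (n, deaths) in bands.items()
    )


crude = {group: crude_rate(bands) for group, bands in cases.items()}
adjusted = {group: adjusted_rate(bands) for group, bands in cases.items()}
print(crude)
print(adjusted)
```

In this made-up example, group A has the higher crude rate only because most of its cases are in the older band; once both groups are given the same age mix, group A's rate is the lower one.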

# In pictures

Wikipedia has a nice article on Simpson's paradox and I particularly like the animation that's used to accompany it, so I'm copying it here.

Each of the dots represents a measurement, for example, it could be price. The colors represent categories, for example, German or Italian sales teams. If we look at the results overall, the trend is negative (shown by the black dots and black line). If we look at the individual categories, the trend is positive (colors). In other words, the aggregation reverses the individual trends.

# The classic example - sex discrimination at Berkeley

The Simpson's paradox example that's nearly always quoted is the Berkeley sex discrimination case [Bickel]. I'm not going to quote it here for two reasons: it's thoroughly discussed elsewhere, and the presentation of the results can be confusing. I've stuck to simpler examples to make my point.

# American politics

A version of Simpson's paradox can occur in American presidential elections, and it very nicely illustrates the cause of the problem.

In 2016, Hillary Clinton won the popular vote by 48.2% to 46.1%, but Donald Trump won the electoral college by 304 to 227. The reason for the reversal is simple: it's the population spread among the states and the relative electoral college votes allocated to the states. As in the case of the rollup with the sales and medical data I showed you earlier, exactly how the data rolls up can reverse the result.

The question, "who won the 2016 presidential election" sounds simple, but it can have several meanings:

• who was elected president
• who got the most votes
• who got the most electoral college votes

The most obvious meaning, in this case, is, "who was elected president". But when you're analyzing data, it's not always obvious what the right question really is.

# The root cause of the problem

The problem occurs because we're using an imprecise language (English) to interpret mathematical results. In the sales and medical data cases, we need to define what we want.

In the sales price example, do we mean the overall price or the price for each category? The contest was ambiguous, but to be fair to our CRO, this wasn't obvious initially. Probably, the fairest result is to take the overall price.

For the medical data case, we're probably better off taking the male and female data separately. A similar argument applies for the COVID example. The clarifying question is, what are you using the statistics for? In the drug data case, we're trying to understand the efficacy of a drug, and plainly, gender is a factor, so we should use the gendered data. In the COVID data case, if we're trying to understand the comparative impact of COVID on different races/ethnicities, we need to remove demographic differences.

If this was the 1980s, we'd be stuck. We can't use statistics alone to tell us what the answer is, we'd have to use data from outside the analysis to help us [Pearl]. But this isn't the 1980s anymore, and there are techniques to show the presence of Simpson's paradox. The answer lies in using something called a directed acyclic graph, usually called a DAG. But DAGs are a complex area and too complex for this blog post that I'm aiming at business people.

# What this means in practice

There's a very old sales joke that says, "we'll lose money on every sale but make it up in volume". It's something sales managers like to quote to their salespeople when they come asking for permission to discount beyond the rules. I laughed along too, but now I'm not so quick to laugh. Simpson's paradox has taught me to think before I speak. Things can get weird.

Interpreting large amounts of data is hard. You need training and practice to get it right and there's a reason why seasoned data scientists are sought after. But even experienced analysts can struggle with issues like Simpson's paradox and multi-comparison problems.

The red alert danger for businesses occurs when people who don't have the training and expertise start to interpret complex data. Let's imagine someone who didn't know about Simpson's paradox had the sales or medical data problem I've described here. Do you think they could reach the 'right' conclusion?

The bottom line is simple: you've got to know what you're doing when it comes to analysis.

# References

[Bickel] Sex Bias in Graduate Admissions: Data from Berkeley, By P. J. Bickel, E. A. Hammel, J. W. O'Connell, Science, 07 Feb 1975: 398-404
[Lindley] Lindley, D. and Novick, M. (1981). The role of exchangeability in inference. The Annals of Statistics 9 45–58.
[Pearl] Judea Pearl, Comment: Understanding Simpson’s Paradox, The American Statistician, 68(1):8-13, February 2014.
[Wang] Wang B, Wu P, Kwan B, Tu XM, Feng C. Simpson's Paradox: Examples. Shanghai Arch Psychiatry. 2018;30(2):139-143. doi:10.11919/j.issn.1002-0829.218026

# Why aren't 2D plots good enough?

Most data visualization problems involve some form of two-dimensional plotting, for example plotting sales by month. Over the last two hundred years, analysts have developed several different types of 2D plots, including scatter charts, line charts, and bar charts, so we have all the chart types we need for 2D data. But what happens if we have a 3D dataset?

The dataset I'm looking at is English Premier League (EPL) results. I want to know how the full-time scores are distributed, for example, are there more 1-1 results than 2-1 results? I have three numbers: the full-time home goals (FTHG), the full-time away goals (FTAG), and the number of games that had that score. How can I present this 3D data in a meaningful way?

(You can't rely on 3D glasses to visualize 3D data. Image source: Wikimedia Commons, License: Creative Commons, Author: Oliver Olschewski)

# Just the text

The easiest way to view the data is to create a table, so here it is. The columns are the away goals, the rows are the home goals, and the cell values are the number of matches with that result, so 778 is the number of matches with a score of 0-1.

This presentation is easy to do, and relatively easy to interpret. I can see 1-1 is the most popular score, followed by 1-0. You can also see that some scores just don't occur (9-9) and results with more than a handful of goals are very uncommon.

This is OK for a smallish dataset like this, but if there are hundreds of rows and/or columns, it's not really viable. So what can we do?
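If you're starting from a list of individual match results rather than a ready-made count, a table like this takes only a few lines of Python to build. This is a rough sketch: the `results` list is a hypothetical stand-in for the real EPL data.

```python
from collections import Counter

# HYPOTHETICAL results as (home goals, away goals); stand-ins for the real EPL data
results = [(1, 1), (1, 0), (2, 1), (1, 1), (0, 0), (0, 1), (1, 0), (1, 1)]

score_counts = Counter(results)
max_home = max(home for home, _ in results)
max_away = max(away for _, away in results)

# Rows are home goals, columns are away goals, cells are match counts.
table = [
    [score_counts.get((home, away), 0) for away in range(max_away + 1)]
    for home in range(max_home + 1)
]

print("FTHG\\FTAG", *range(max_away + 1))
for home, row in enumerate(table):
    print(home, *row)
```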

# Heatmaps

A heatmap is a 2D map where the 3rd dimension is represented as color. The more intense (or lighter) the color, the higher the value. For this kind of plot to work, you do have to be careful about your color map. Usually, it's best to choose the intensity of just one color (e.g. shades of blue). In a few cases, multiple colors can work (colors for political parties), but those are the exceptions.

Here's the same data plotted as a heatmap using the Brewer color palette "RdPu" (red-purple).

The plot does clearly show the structure. It's obvious there's a diagonal line beyond which no results occur. It's also obvious which scores are the most common. On the other hand, it's hard to get a sense of how quickly the frequency falls off because the human eye just isn't that sensitive to variations in color, but we could probably play around with the color scale to make the most important color variation occur over the range we're interested in.

This is an easy plot to make because it's part of R's ggplot package. Here's my code:

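    library(dplyr)    # provides the %>% pipe
    library(ggplot2)

    # Heatmap of match counts: home goals on x, away goals on y, color = matches
    plt_goal_heatmap <- goal_distribution %>%
      ggplot(aes(FTHG, FTAG, fill = Matches)) +
      geom_tile() +
      scale_fill_distiller(palette = "RdPu") +
      ggtitle("Home/Away goal heatmap")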

# Perspective scatter plot

Another alternative is the perspective plot, which in R, you can create using the 'persp' function. This is a surface plot as you can see below.

You can change your perspective on the plot and view it from other angles, but even from this perspective, it's easy to see the very rapid falloff in frequency as the scores increase.

However, I found this plot harder to use than the simple heatmap, and I found changing my viewing angle was awkward and time-consuming.

Here's my code in case it's useful to you:

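    library(tidyr)  # assumption: the pivot call missing from the listing was tidyr's spread()

    # Pivot to a home-goals by away-goals matrix of match counts, then draw the surface
    persp(x = seq(0, max(goal_distribution$FTHG)),
          y = seq(0, max(goal_distribution$FTAG)),
          z = as.matrix(
            unname(
              spread(goal_distribution, FTAG, Matches, fill = 0)[, -1])),
          xlab = "FTHG", ylab = "FTAG", zlab = "Matches",
          main = "Distribution of matches by score",
          theta = 60, phi = 20,
          expand = 1,
          col = "lightblue")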

# 3D scatter plot

We can go one stage further and create a 3D scatter chart. On this chart, I've plotted the x, y, and z values and color-coded them so you get a sense of the magnitude of the z values. I've also connected the points to the axis (the zero plane if you like) to emphasize the data structure a bit more.

As with the persp function, you can change your perspective on the plot and view it from another angle.

The downside with this approach is it requires the 'plot3D' library in R and it requires you to install a new graphics server (XQuartz). It's a chunk of work to get to a visualization. The function to draw the plot is 'scatter3D'. Here's my code:

    library(plot3D)

    # type = "h" draws a vertical line from each point down to the base plane
    scatter3D(x = goal_distribution$FTHG, y = goal_distribution$FTAG,
              z = goal_distribution$Matches,
              xlab = "FTHG", ylab = "FTAG", zlab = "Matches",
              phi = 5,
              theta = 40,
              bty = "g",
              type = "h",
              pch = 19,
              main = "Distribution of matches by score",
              cex = 0.5)

# What's my choice?

My goal was to understand the distribution of goals in the EPL, so what presentations of the data were most useful to me?

The simple table worked well and was the most informative, followed by the heatmap. I found both persp and scatter3D to be awkward to use and both consumed way more time than they were worth. The nice thing about the heatmap is that it's available as part of the wonderful ggplot library.

Bottom line: keep it simple.

# The summary is not the whole picture

If you just use summary statistics to describe your data, you can miss the bigger picture, sometimes literally so. In this blog post, I'm going to show you how relying on summaries alone can lead you catastrophically astray and I'm going to tell you how you can avoid making career-damaging mistakes.

The datasaurus is why you need to visualize your data. Source: Alberto Cairo. Open source.

# What are summary statistics?

Summary statistics are parameters like the mean, standard deviation, and correlation coefficient; they summarize the properties of the data and the relationship between variables. For example, if the correlation coefficient, r, is about 0.8 for two data sets x and y, we might think there's a relationship between them, but if it's about 0, we might think there isn't.

The use of summary statistics is widely taught, every textbook emphasizes them, and almost everyone uses them. But if you use summary statistics in isolation from other methods you might miss important relationships - you should always visualize your data as we'll see.

# Anscombe's Quartet

Take a look at the four plots below. They're obviously quite different, but they all have the same summary statistics!

Here are the summary statistics data:

| Property | Value |
|---|---|
| Mean of x | 9 |
| Sample variance of x: $\sigma^2$ | 11 |
| Mean of y | 7.50 |
| Sample variance of y: $\sigma^2$ | 4.125 |
| Correlation between x and y | 0.816 |
| Linear regression line | y = 3.00 + 0.500x |
| Coefficient of determination of the linear regression: $R^2$ | 0.67 |

These plots were developed in 1973 by the statistician Francis Anscombe to make exactly this point: you can't rely on summary statistics, you need to visualize your data. The graphical relationship between the x and y variables is different in each case and implies different things. By plotting the data out, we can see what the relationships are, but summary statistics hide what's going on.
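You can check the summary statistics yourself with a few lines of Python. The values below are the quartet as it's usually published; I've typed them in from memory of the standard tables, so treat them as an assumption and verify against Anscombe's paper before relying on them. The helper `pearson_r` is my own.

```python
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


summary = {
    name: {
        "mean_x": mean(xs),
        "var_x": variance(xs),   # sample variance (n - 1 denominator)
        "mean_y": mean(ys),
        "var_y": variance(ys),
        "r": pearson_r(xs, ys),
    }
    for name, (xs, ys) in quartet.items()
}

for name, stats in summary.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four rows of output agree to two or three significant figures, which is exactly why the numbers alone can't tell the four plots apart.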

# The datasaurus

Let's zoom forward to 2016. The justly famous Alberto Cairo tweeted about Anscombe's quartet and illustrated the point with this cool set of summary statistics. He later expanded on his tweet in a short blog post.

| Property | Value |
|---|---|
| n | 142 |
| x mean | 54.2633 |
| x standard deviation | 16.7651 |
| y mean | 47.8323 |
| y standard deviation | 26.9353 |
| Pearson correlation | -0.0645 |

What might you conclude from these summary statistics? I might say, the correlation coefficient is close to zero so there's not much of a relationship between the x and the y variables. I might conclude there's no interesting relationship between the x and y variables - but I would be wrong.

The summary might not mean anything to you, but the visualization surely will. This is the datasaurus data set; the x and the y variables draw out a dinosaur.

# The datasaurus dozen

Two researchers at Autodesk Research took things a stage further. They started with Alberto Cairo's datasaurus and created a dozen other charts with exactly the same summary statistics as the datasaurus. Here they all are.

The summary statistics look like noise, but the charts reveal the underlying relationships between the x and y variables. Some of these relationships are obviously fun, like the star, but there are others that imply more meaningful relationships.

If all this sounds a bit abstract, let's think about how this might manifest itself in business. Let's imagine you're an analyst working for a large company. You have data on sales by store size for Europe and you've been asked to analyze the data to gain insights. You're under time pressure, so you fire up a Python notebook and get some quick summary statistics. You get summary statistics that look like the ones I showed you above. So you conclude there's nothing interesting in the data, but you might be very wrong.

You should plot the data out and look at the chart. You might see something that looks like the slanting charts above, maybe something like this:

The individual diagonal lines might correspond to different European countries (different regulations, different planning rules, different competition, etc.). There could be a very significant relationship that you would have missed by relying on summary data.

(The Autodesk Research team have posted their work as a paper you can read here.)

# Lessons learned

The lessons you should take away from all this are simple:

• summary statistics hide a lot
• there are many relationships between variables that will give summary statistics that look like noise

# Serial numbers and losing business

Here's a story about how something innocuous and low-level like serial numbers can damage your reputation and lose you business. I have advice on how to avoid the problem too!

(Serial numbers can give away more than you think. Image source: Wikimedia Commons. License: Public Domain.)

# Numbered by design

Years ago, I worked for a specialty manufacturing company; its products were high precision, low-volume, and expensive. The industry was cut-throat competitive, and commentary in the press was that not every manufacturer would survive; as a consequence, customer confidence was critical.

An overseas customer team came to us to design a specialty item. The company spent a week training them and helping them design what they wanted. Of course, the design was all on a CAD system with some templated and automated features. That's where the trouble started.

One of the overseas engineers spotted that a customer-based serial number was automatically included in the design. Unfortunately, the serial number was 16, implying that the overseas team was only the 16th customer (which was true). This immediately set off their alarm bells - a company with only 16 customers was probably not going to survive the coming industry shake-out. The executive team had to smooth things over, which included lying about the serial numbers. As soon as the overseas team left, the company changed its system to start counting serial numbers from some high, but believable number (something like 857).

Here's the point: customers can infer a surprising amount from your serial numbers, especially your volume of business.

# Invoices

Years later, I was in a position where I was approving vendor invoices. Some of my vendors didn't realize what serial numbers could reveal, and I ended up gaining insight into their financial state. Here are the rules I used to figure out what was going on financially, which was very helpful when it came to negotiating contract renewals.

• If the invoice is unnumbered, the vendor is very small and they're likely to have only a handful of customers. All accounting systems offer invoice generation and they all number/identify individual invoices. If the invoice doesn't have a serial number, the vendor's business is probably too small to warrant buying an accounting system, which means a very small number of customers.
• Naive vendors will start invoice numbering from 1, or from a number like 1,000. You can infer size if they do this.
• Many accounting systems will increment invoice numbers by 1 by default. If you're receiving regular invoices from a vendor, you can use this to infer their size too. If this month's invoice is 123456 and next month's is 123466, this might indicate 10 invoices in a month and therefore 10 customers. You can do this for a while and spot trends in a vendor's customer base, for example, if you see invoices incrementing by 100 and later by 110, this may be because the vendor has added 10 customers.

The accounting tool suppliers are wise to this, and many tools offer options for invoice numbering that stop this kind of analysis (e.g. starting invoices from a random number, random invoice increments, etc.). But not all vendors use these features and serial number analysis works surprisingly often.
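The increment rule is simple enough to sketch in Python. The invoice numbers here are hypothetical, and the inference only holds under the assumptions in the rules above: numbers go up by 1 per invoice and each customer gets roughly one invoice a month.

```python
# HYPOTHETICAL invoice numbers received from one vendor, one per month
monthly_invoice_numbers = [123456, 123466, 123477, 123489]

# How many invoices the vendor issued between consecutive invoices sent to us
issued_per_month = [
    later - earlier
    for earlier, later in zip(monthly_invoice_numbers, monthly_invoice_numbers[1:])
]

# Month-on-month change in the implied number of customers
growth = [b - a for a, b in zip(issued_per_month, issued_per_month[1:])]

print("invoices per month:", issued_per_month)
print("change per month:", growth)
```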

(Destroyed German Tank. Image source: Wikimedia Commons. License: Public Domain)

# The German tank problem

Serial number analysis has been used in wartime too. In World War II, the Allied powers wanted to understand the capacity of Nazi industry to build tanks. Fortunately, German tanks were given consecutive serial numbers (this is a simplification, but it was mostly true). Allied troops were given the job of recording the serial numbers of captured or destroyed tanks, which they reported back. Statisticians were able to infer changes in Nazi tank production capabilities through serial number analysis, which after the war was found to be mostly correct. This is known as the German tank problem and you can read a lot more about it online.
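To give a flavor of the calculation, here's a sketch of the textbook estimator for this problem: if the largest serial number seen is m in a sample of k, the estimated total is m + m/k - 1. This assumes serial numbers start at 1, have no gaps, and are observed at random; I'm not claiming it's the exact method the wartime statisticians used, and the observed numbers are hypothetical.

```python
def estimate_total(serial_numbers):
    """Textbook estimate m + m/k - 1 (m = largest serial seen, k = sample size)."""
    k = len(serial_numbers)
    m = max(serial_numbers)
    return m + m / k - 1


observed = [19, 40, 42, 60]  # HYPOTHETICAL serial numbers from captured tanks
estimate = estimate_total(observed)
print(estimate)
```

The intuition: the largest number you've seen is a lower bound, and the average gap between the numbers you've seen tells you roughly how far beyond it the true total is likely to be.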

# Simple things say a lot

The bottom line is simple: serial numbers can give away more about your business than you think. They can tell your customers how big your customer base is, and whether it's expanding or contracting; crucial information when it comes to renegotiating contracts. Pay attention to your serial numbers and invoices!

# Round, round, round, round, I get around

Rounding errors are one of those basic things that every technical person thinks they're on top of and won't happen to them, but the problem is, it can and does happen to good people, sometimes with horrendous consequences. In this blog post, I'm going to look at rounding errors, show you why they can creep in, and provide some guidelines you should follow to keep you and your employer safe. Let's start with some real-life cases of rounding problems.

(Rounding requires a lot of effort. Image credit: Wikimedia Commons. License: Public Domain)

# Rounding errors in the real world

## The wrong rounding method

In 1992, there was a state-level election in Schleswig-Holstein in Germany. The law stated that every party that received 5% or more of the vote got a seat, but there were no seats for parties with less than 5%. The software that calculated results rounded the results up (ceil) instead of rounding the results down (floor) as required by law. The Green party received 4.97% of the vote, which was rounded up to 5.0%, so it appeared the Green party had won a seat. The bug was discovered relatively quickly, and the seat was reallocated to the Social Democrats who gained a one-seat majority because of it [Link].
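Here's a small Python sketch of how the choice of rounding mode alone decides the outcome. It uses the `decimal` module; the one-decimal display precision is taken from the story above, and the variable names are mine.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR

THRESHOLD = Decimal("5.0")     # percent of the vote needed for a seat
green_share = Decimal("4.97")  # percent of the vote, from the story above
one_decimal = Decimal("0.1")

rounded_up = green_share.quantize(one_decimal, rounding=ROUND_CEILING)
rounded_down = green_share.quantize(one_decimal, rounding=ROUND_FLOOR)

print("ceil: ", rounded_up, "seat" if rounded_up >= THRESHOLD else "no seat")
print("floor:", rounded_down, "seat" if rounded_down >= THRESHOLD else "no seat")
```

A safer design is to compare the unrounded share against the threshold and round only for display.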

## Cumulative rounding

The more serious issue is cumulative rounding errors in real-time systems. Here a very small error becomes very important when it's repeatedly or cumulatively added.

The Vancouver Stock Exchange set up a new index in January 1982, with a value set to 1,000. The index was updated with each trade, but the index was rounded down to three decimal places (truncated) instead of rounding to the nearest decimal place. The index was calculated thousands of times a day, so the error was cumulative. Over time, the error built up from something not noticeable to something very noticeable indeed. The exchange had to correct the error; on Friday November 25th, 1983 the exchange closed at 524.811, the rounding error was fixed, and when the exchange reopened, the index was 1098.892 - the difference being solely due to the rounding error bug fix [Link].

The most famous case of cumulative rounding errors is the Patriot missile problem in Dhahran in 1991. A Patriot missile failed to intercept a Scud missile, which went on to kill 28 people and injure a further 98. The problem came from the effects of a cumulative rounding error. The Patriot system updated every 0.1s, but 0.1 can't be represented exactly in a binary fixed-point system; it has to be rounded, and in this case it was rounded down. The processors used by the Patriot system were old 24-bit systems that truncated the binary representation of 0.1. Over time, the truncation error built up, resulting in the Patriot missile incorrectly responding to sensor data and missing the Scud missile [Link].
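To get a feel for the size of the effect, here's a rough Python sketch of the truncation. Two numbers in it are my assumptions rather than anything stated above: that 23 bits of the register held the fractional part, and that the system had been running for 100 hours. Both are figures commonly quoted in write-ups of the incident, but treat the result as an order-of-magnitude illustration only.

```python
from fractions import Fraction

FRACTION_BITS = 23      # ASSUMPTION: fractional bits kept in the 24-bit register
TICKS_PER_SECOND = 10   # the system counted time in 0.1 s ticks
HOURS_RUNNING = 100     # ASSUMPTION: uptime chosen for illustration

exact_tick = Fraction(1, 10)
scale = 2 ** FRACTION_BITS

# Chop (truncate) 0.1 to the available binary digits; int() drops the remainder.
stored_tick = Fraction(int(exact_tick * scale), scale)

error_per_tick = exact_tick - stored_tick
ticks = HOURS_RUNNING * 3600 * TICKS_PER_SECOND
clock_drift_seconds = float(error_per_tick * ticks)

print("error per tick (s):", float(error_per_tick))
print("clock drift (s):   ", clock_drift_seconds)
```

A per-tick error of under a ten-millionth of a second looks harmless, but multiplied by millions of ticks it becomes a timing error of a sizeable fraction of a second, and a Scud covers a lot of ground in that time.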

# Theoretical explanation of rounding errors

## Cumulative errors

Fairly obviously, cumulative errors are a sum:

E = ∑e

where E is the cumulative error and e is the individual error. In the Vancouver Stock Exchange example, the mean individual error from truncating to three decimal places was 0.0005. From Wikipedia, there were about 3,000 transactions per day, and the period from January 1st 1982 when the index started to November 25th, 1983 when the index was fixed was about 473 working days. This gives an expected cumulative error of about 3,000 × 473 × 0.0005 ≈ 710, which is in the ballpark of what actually happened.

Of course, if the individual error can be positive or negative, this can make the problem better or worse. If the error is distributed evenly around zero, then the expected cumulative error is zero, so things should be OK in the long run. But even a slight bias will eventually result in a significant cumulative error - regrettably, one that might take a long time to show up.

Although the formula above seems trivial, the point is, it is possible to calculate the cumulative effect of rounding errors.
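Here's a sketch in Python of both the back-of-the-envelope estimate and a toy simulation of the mechanism. The simulation is not the real index calculation: the random price changes, the seed, and the number of updates are all invented for illustration.

```python
import random
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

# Back-of-the-envelope estimate using the figures quoted above
expected_error = 3000 * 473 * 0.0005
print("expected cumulative error:", expected_error)


def simulate(n_updates, seed=0):
    """Update a toy index n_updates times, truncating vs rounding to 3 places."""
    rng = random.Random(seed)
    step = Decimal("0.001")
    exact = truncated = rounded = Decimal("1000")
    for _ in range(n_updates):
        # HYPOTHETICAL change of up to +/-0.05 index points, in millionths
        change = Decimal(rng.randint(-50_000, 50_000)) / Decimal(1_000_000)
        exact += change
        truncated = (truncated + change).quantize(step, rounding=ROUND_DOWN)
        rounded = (rounded + change).quantize(step, rounding=ROUND_HALF_EVEN)
    return exact, truncated, rounded


exact, truncated, rounded = simulate(10_000)
truncation_drift = float(truncated - exact)
rounding_drift = float(rounded - exact)
print("drift with truncation:", truncation_drift)
print("drift with rounding:  ", rounding_drift)
```

In this toy setup the truncated index loses about 0.0005 per update, so the drift grows in proportion to the number of updates, while the rounded index wanders around the exact value without a systematic bias.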

## Combining errors

When we combine numbers, errors can really hurt depending on what the combination is. Let's start with a simple example, if:

z = x - y

and:

• $s_z$ is the standard error in z
• $s_x$ is the standard error in x
• $s_y$ is the standard error in y

then

$s_z = \left[ s_x^2 + s_y^2 \right]^{1/2}$

If x and y are numerically close to one another, errors can quickly become very significant. My first large project involved calculating quantum states, which included a formula like z = x - y. Fortunately, the rounding was correct and not truncated, but the combination of machine precision errors and the formulae above made it very difficult to get a reliable result. We needed the full precision of the computer system and we had to check the library code our algorithms used to make sure rounding errors were correctly dealt with. We were fortunate in that the results of rounding errors were obvious in our calculations, but you might not be so fortunate.

Ratios are more complex, let's define:

z = x/y

with the s values defined as before, then:

$\frac{s_z}{z} = \left[ \left(\frac{s_x}{x}\right)^2 + \left(\frac{s_y}{y}\right)^2 \right]^{1/2}$

This suffers from the same problem as before, under certain conditions, the error can become very significant very quickly. In a system like the Patriot missile, sensor readings are used in some very complex equations. Rounding errors can combine to become very important.

The takeaway is very easy to state: if you're combining numbers using a ratio or subtracting them, rounding (or other errors) can hurt you very badly very quickly.
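Here's a minimal Python sketch of the two formulas. They assume the errors in x and y are independent, and the numbers are hypothetical, chosen to show what happens when you subtract two values that are close together.

```python
import math


def difference_error(s_x, s_y):
    """Standard error of z = x - y, assuming independent errors."""
    return math.hypot(s_x, s_y)


def ratio_error(x, y, s_x, s_y):
    """Standard error of z = x / y, assuming independent errors."""
    z = x / y
    return abs(z) * math.hypot(s_x / x, s_y / y)


# HYPOTHETICAL values: two close numbers, each known to about 0.1%
x, s_x = 1000.0, 1.0
y, s_y = 999.0, 1.0

z = x - y
s_z = difference_error(s_x, s_y)
relative_error = s_z / z

print(f"z = {z} with standard error {s_z:.3f}")
print(f"relative error in z: {relative_error:.0%}")
print(f"standard error in x / y: {ratio_error(x, y, s_x, s_y):.5f}")
```

Each input is good to about one part in a thousand, but the error in the difference is larger than the difference itself.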

# Insidious rounding errors

Cumulative rounding errors and the wrong type of rounding are widely discussed on the internet, but I've seen two other forms of rounding that have caught people out. They're hard to spot but can be damaging.

## Rounding in the wrong places - following general advice too closely

Many technical degrees include some training on how to present errors and significant digits. For example, a quantity like 12.34567890 ± 0.12345678 is usually written 12.3 ± 0.1. We're told not to include more significant digits than the error analysis warrants. Unfortunately, this advice can lead you astray if you apply it unthinkingly.

Let's say we're taking two measurements:

x = 5.26 ± 0.14
y = 1.04 ± 0.12

Following the rules of representing significant digits, this gives us

x = 5.3 ± 0.1
y = 1.0 ± 0.1

If:

z = x/y

then with the pre-rounded numbers:

z = 5.1 ± 0.6

but with the rounded numbers we have:

z = 5.3 ± 0.5

Whoops! This is a big difference. The problem occurred because we applied the advice unthinkingly. We rounded the numbers prematurely; in calculations, we should have kept the full precision and only shown rounded numbers for display to users.

The advice is simple: preserve full precision in calculations and reserve rounding for numbers shown to users.
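The comparison above is easy to reproduce in Python using the ratio error formula, assuming independent errors in x and y. The function name is my own.

```python
import math


def ratio_with_error(x, s_x, y, s_y):
    """Return z = x / y and its standard error, assuming independent errors."""
    z = x / y
    s_z = abs(z) * math.hypot(s_x / x, s_y / y)
    return z, s_z


full = ratio_with_error(5.26, 0.14, 1.04, 0.12)   # full-precision inputs
early = ratio_with_error(5.3, 0.1, 1.0, 0.1)      # inputs rounded before use

# Rounding happens only here, at display time.
print(f"full precision:    z = {full[0]:.1f} ± {full[1]:.1f}")
print(f"rounded too early: z = {early[0]:.1f} ± {early[1]:.1f}")
```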

Spreadsheets are incredible sources of errors and bugs. One of the insidious things spreadsheets do is round numbers, which can result in numbers appearing not to add up.

Let's have a look at an example. The left of the table shows numbers before rounding. The right of the table shows numbers with rounding (suppressing the decimal places). The numbers on the right don't add up because of rounding (they should sum to 1206).

| Month | No round | Round |
|---|---|---|
| Jan | 121.4 | 121 |
| Feb | 251.4 | 251 |
| Mar | 311.4 | 311 |
| Apr | 291.4 | 291 |
| May | 141.4 | 141 |
| Jun | 91.4 | 91 |
| TOTAL | 1208.4 | 1208 |

An insidious problem occurs when rounded numbers are copied from spreadsheets and used in calculations - which is a manifestation of the premature rounding problem I discussed earlier.
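Here's the same effect in a few lines of Python, using the monthly figures from the table above. The spreadsheet rounds each cell for display but computes the total from the unrounded values, so the two ways of getting a total disagree.

```python
monthly = {
    "Jan": 121.4, "Feb": 251.4, "Mar": 311.4,
    "Apr": 291.4, "May": 141.4, "Jun": 91.4,
}

# What each cell shows when decimal places are suppressed
displayed = {month: round(value) for month, value in monthly.items()}

displayed_total = round(sum(monthly.values()))  # what the TOTAL cell shows
sum_of_displayed = sum(displayed.values())      # what you get adding the visible cells

print("TOTAL cell:           ", displayed_total)
print("sum of displayed cells:", sum_of_displayed)
```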

# 1.999... = 2, why 2 != 2, and machine precision

Although it's not strictly a rounding error, I do have to talk about the fact that 1.999... = 2. This result often surprises people, but it's an easy thing to prove. Unfortunately, a machine with finite precision can't hold an infinite string of 9s, so a test like 1.9999999 == 2 will give you False! Just because something is mathematically true doesn't mean it's true on your system.

I've seen a handful of cases where two numbers that ought to be the same fail an equality test, the equivalent of 2 == 2 evaluating to False. Typically, one of the numbers has been calculated through a repeated calculation in which machine precision errors propagate, and the other has been calculated directly. Here's a fun example from Python 3:

1 == (1/7) + (1/7) + (1/7) + (1/7) + (1/7) + (1/7) + (1/7)

evaluates to False!

To get round this problem, I've seen programmers do True/False difference evaluations like this:

abs(a - b) <= machine_precision

The machine precision constant is usually called epsilon.
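Here is a short Python sketch of both the failing comparison and the tolerance test. The helper `nearly_equal` is a made-up name that mirrors the expression above; `sys.float_info.epsilon` and `math.isclose` are standard library features. Treat the fixed-epsilon version as an illustration rather than a recommendation: an absolute tolerance of epsilon only really suits values of magnitude around 1, which is why a relative tolerance is often the safer choice.

```python
import math
import sys

# Seven sevenths, added one at a time
total = (1/7) + (1/7) + (1/7) + (1/7) + (1/7) + (1/7) + (1/7)

print(total == 1)   # False
print(total)        # 0.9999999999999998

def nearly_equal(a, b, tolerance=sys.float_info.epsilon):
    """Absolute-difference test, as in the expression above."""
    return abs(a - b) <= tolerance

print(nearly_equal(total, 1.0))   # True, but only just: the gap equals epsilon

# Standard-library alternative using a relative tolerance (default 1e-09)
print(math.isclose(total, 1.0))   # True
```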

# What to watch for

## Cumulative errors in fixed-point systems

The Patriot missile case makes the point nicely: if you're using sensor data in a system using fixed-point arithmetic, or indeed in any computer system, be very careful how your system rounds its inputs. Bear in mind, the rounding might be done in an ADC (analog-to-digital converter) beyond your control - in which case, you need to know how it converts data. If you're doing the rounding, you might need to use some form of dithering.

## Default rounding and rounding methods

There are several different rounding methods you can use; your choice should be a deliberate one and you should know their behavior. For example, in Python, you have:

• floor
• ceil
• round - which uses banker's rounding (ties go to the nearest even number), not the school textbook form of rounding; this behavior changed between Python 2 and Python 3.

You should be aware of the properties of each of these rounding methods. If you wanted to avoid the Vancouver Stock Exchange problem, what form of rounding would you choose and why? Are you sure?
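A quick way to get a feel for these methods is to run them side by side on values that sit exactly on a tie. The sketch below uses only the standard library; `school_round` is a made-up helper that uses the `decimal` module's `ROUND_HALF_UP` mode to get the textbook behavior.

```python
import math
from decimal import Decimal, ROUND_HALF_UP

def school_round(value):
    """Textbook rounding: ties go away from zero (hypothetical helper)."""
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

ties = (0.5, 1.5, 2.5, 3.5)

print("value floor ceil round school")
for value in ties:
    print(value, math.floor(value), math.ceil(value),
          round(value), school_round(value))

# Python 3's round() sends ties to the nearest even integer
print([round(v) for v in ties])          # [0, 2, 2, 4]
print([school_round(v) for v in ties])   # [1, 2, 3, 4]
```

Always truncating (floor) biases results downward, while the two tie-breaking rules differ only on exact halves; which one you want depends on whether a systematic bias matters in your application.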

A more subtle form of rounding can occur when you mix integers and floating-point numbers in calculations. Depending on your system and the language you use, 7.5/2 can give different answers. I've seen some very subtle bugs involving hidden type conversion, so be careful.
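As an illustration in Python 3 only (other languages and systems have their own rules), here is how the same calculation changes depending on which operator is used and whether the value has been converted to an integer along the way. The dictionary and its labels are made up for the example.

```python
value = 7.5

results = {
    "true division":             value / 2,        # 3.75
    "floor division":            value // 2,       # 3.0 (still a float)
    "int() then division":       int(value) / 2,   # 3.5 (7.5 was truncated to 7)
    "int() then floor division": int(value) // 2,  # 3
}

for label, result in results.items():
    print(f"{label}: {result!r}")
```

If a conversion like `int(value)` happens somewhere out of sight, for example in a database column type or a function signature, the result quietly changes.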

## Premature rounding

You were taught to present numbers only to an appropriate number of decimal places, but that was only for presentation. For calculations, use the full precision available.

Be extremely careful copying numbers from spreadsheets: the numbers may have been rounded and you may need to look closer to get extra digits of precision.

# Closing thoughts

Rounding seems like a simple problem that happens to other people, but it can happen to you and it can have serious consequences. Take some time to understand the properties of the system and be especially careful if you're doing cumulative calculations, mixed floating-point and integer calculations, or if you're using a rounding function.

## Wednesday, March 11, 2020

### Benford's Law: finding fraud and data oddities

What links fraud detection, old-fashioned log tables, and error detection in data feeds? Benford’s Law provides the link and I'll show you what it is and how you might use it.

Imagine I gave you thousands of invoices and asked you to record the first digit of the amount. Out of say, 10,000 invoices, how many would you expect to start with the number 1, how many with the number 2, and so on? Naively, you might expect 1,111 to start with a 1; 1,111 to start with a 2 and so on. But that’s not what happens in the real world. 1 occurs more often than 2, which occurs more often than 3, and so on.

The Benford’s Law story starts in 1881, when Simon Newcomb, an astronomer, was using some mathematical log tables. For those of you too young to know, these are tables of the logarithms of numbers, very useful in pre-calculator days. Newcomb noticed that the pages for logarithms beginning 1 were more well-thumbed than the other pages, indicating that people were looking for the logarithms of some numbers more than others. Being an academic, he published a paper on it.

In 1938, a physicist called Frank Benford looked at a number of datasets and found the same relationship between the first digits. For example, he looked at the first digit of addresses and found that 1 occurred more frequently than 2, which occurred more frequently than 3 and so on. He didn't just look at addresses; he also looked at the first digit of physical constants, the surface area of rivers, numbers in the Reader's Digest, and so on. Although Benford was the second person to discover this relationship, the law is named after him and not Newcomb.

It turns out, we can mathematically describe Benford’s Law as:

P(d) = log10(1 + (1/d))

where d is a first digit from 1 to 9, P(d) is the probability of that digit occurring, and log10 is the base-10 logarithm. If we plot it out we get:

This means that for some datasets we expect the first digit to be one 30.1% of the time, the first digit to be two 17.6% of the time, the first digit to be three 12.5% of the time, etc.
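The percentages are easy to check. A minimal sketch using the formula with a base-10 logarithm (the name `benford` is just for illustration):

```python
import math

# Benford's Law: expected probability of each first digit
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for digit, probability in benford.items():
    print(f"{digit}: {probability:.1%}")

# The nine probabilities cover every possible first digit, so they sum to 1
print(sum(benford.values()))
```

The first three lines of output are 30.1%, 17.6%, and 12.5%, matching the figures above.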

The why of Benford’s Law is much too complex for this blog post. It was only recently (1998) proved by Hill [Hill] and involves digging into the central limit theorem and some very fundamental statistical and probability concepts.

Going back to my accounting example, it would seem all we have to do is plot the distribution for our invoice data and compare it to Benford’s Law. If there’s a difference, then there’s fraud. But the reality is, things are more complex than that.

Benford’s Law doesn’t apply everywhere; there are some conditions:

• The data set must vary over several orders of magnitude (e.g. from 1 to 1,000)
• The data set must have dimensions, or units. For example, Euros, or mm.
• The mean is greater than the median and the skew is positive.

Collins provides a nice overview of how it can be used to detect accounting fraud [Collins]. But Linville [Linville] has poked some practical holes in its use. He conducted an experiment using graduate students to create fake test invoices (this was a research exercise, not an attempt at fraud!) that were mixed in with simulated invoice data. He found that if the fake invoices were less than 10% or so of the total dataset, the deviations from Benford’s Law were too small to be reliably detected.

Benford’s Law actually applies to all digits, not just the first. We can plot out an expected distribution for two digits as I’ve shown below. This has also been used for fraud detection as you might expect.
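For the first two digits taken together, the standard generalization uses the same formula with the two-digit number n (10 to 99) in place of d, that is P(n) = log10(1 + 1/n). The sketch below builds that distribution; the variable names are made up for illustration.

```python
import math

# Expected probability of each leading two-digit combination, 10 to 99
benford_two_digit = {n: math.log10(1 + 1 / n) for n in range(10, 100)}

print(f"10: {benford_two_digit[10]:.1%}")   # most common leading pair
print(f"99: {benford_two_digit[99]:.1%}")   # least common leading pair

# Summing over the second digit recovers the first-digit law
first_digit_one = sum(benford_two_digit[n] for n in range(10, 20))
print(f"first digit 1: {first_digit_one:.1%}")   # 30.1%
```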

You can use Benford's Law to detect errors in incoming data. Let's say you have a datafeed of user addresses. You know the house numbers should obey Benford's Law, so you can work out the distribution the data actually has and compare it to the theoretical Benford's Law distribution. If the difference is above some threshold, you can set an alert. Bear in mind, it's not just addresses that follow the law; other properties of a data feed may too. A deviation from Benford's Law doesn't tell you which particular items are wrong, but you do get a clue about which category; for example, you might discover items starting with a 2 are too frequent. This is a special case of using the deviation of real data from an expected distribution as an error detection mechanism - a very useful data quality assurance method everyone should be using.
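Here is one way such a check might look in Python. Everything specific in it is an assumption made for illustration: the function name `benford_deviation`, the choice of total variation distance as the measure of difference, the `ALERT_THRESHOLD` value of 0.1, and the stand-in data. A real feed would need a threshold tuned to its own volume and noise.

```python
import math
from collections import Counter

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
ALERT_THRESHOLD = 0.1  # hypothetical value; tune for your own data

def benford_deviation(values):
    """Total variation distance between the observed first-digit
    distribution and Benford's Law. Assumes positive integers."""
    counts = Counter(int(str(value)[0]) for value in values)
    n = sum(counts.values())
    return sum(abs(counts.get(d, 0) / n - BENFORD[d]) for d in range(1, 10)) / 2

# Stand-in data: powers of 2 are known to follow Benford's Law closely,
# while the integers 100..999 have uniformly distributed first digits.
good_feed = [2 ** k for k in range(1000)]
bad_feed = list(range(100, 1000))

for name, feed in (("good", good_feed), ("bad", bad_feed)):
    deviation = benford_deviation(feed)
    status = "ALERT" if deviation > ALERT_THRESHOLD else "ok"
    print(f"{name} feed: deviation = {deviation:.3f} -> {status}")
```

To find which category is off, compare the per-digit observed and expected frequencies rather than the single summary number.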

To truly understand Benford’s Law, you’ll need to dig deeply into statistics and possibly number theory, but using it is relatively straightforward. You should be aware it exists and know its limitations - especially if you’re looking for fraud.

# References

[Collins] J. Carlton Collins, “Using Excel and Benford’s Law to detect fraud”, https://www.journalofaccountancy.com/issues/2017/apr/excel-and-benfords-law-to-detect-fraud.html
[Hill] T. P. Hill, “The First Digit Phenomenon”, American Scientist, 86, 358-363, 1998.
[Linville] Mark Linville, “The Problem Of False Negative Results In The Use Of Digit Analysis”, The Journal of Applied Business Research, Volume 24, Number 1

Wikipedia article https://en.wikipedia.org/wiki/Benford%27s_law
Mathworld article http://mathworld.wolfram.com/BenfordsLaw.html

## Saturday, February 22, 2020

### The Monty Hall Problem

Everyone thinks they understand probability, but every so often, something comes along that shows that maybe you don’t actually understand it at all. The Monty Hall problem is a great example of something that seems very counterintuitive and teaches us to be very wary of "common sense".

The problem got its fame from a 1990 column written by Marilyn vos Savant in Parade magazine. She posed the problem and provided the solution, but the solution seemed so counterintuitive that several math professors and many PhDs wrote to her saying she was incorrect. The discussion was so intense, it even reached the pages of the New York Times. But vos Savant was indeed correct.

(Monty Hall left (1976) - image credit: ABC Television - source Wikimedia Commons, no known copyright, Marilyn vos Savant right (2017) - image credit: Nathan Hill via Wikimedia Commons - Creative Commons License.  Note: the reason why the photos are from different years/ages is the availability of open-source images.)

The problem is loosely based on a real person and a real quiz show. In the US, there’s a long-running quiz show called ‘Let’s make a deal’, and its host for many years was Monty Hall, in whose honor the problem is named. Monty Hall was aware of the fame of the problem and had some interesting things to say about it.

Vos Savant posed the Monty Hall problem in this form:

• A quiz show host shows a contestant three doors. Behind two of them is a goat and behind one of them is a car. The goal is to win the car.
• The host asks the contestant to choose a door, but not open it.
• Once the contestant has chosen a door, the host opens one of the other doors and shows the contestant a goat. The contestant now knows that there’s a goat behind that door, but he or she doesn’t know which of the other two doors the car’s behind.
• Here’s the key question: the host asks the contestant "do you want to change doors?".
• Once the contestant has decided whether to switch or not, the host opens the contestant's chosen door and the contestant wins the car or a goat.
• Should the contestant change doors when asked by the host? Why?

What do you think the probability of winning is if the contestant does not change doors? What do you think the probability of winning is if they do?

Here are the results.

• If the contestant sticks with their choice, they have a ⅓ chance of winning.
• If the contestant changes doors, they have a ⅔ chance of winning.

What?

This is probably not what you expected, so let’s investigate what’s going on.

I’m going to start with a really simple version of the game. The host shows me three doors and asks me to choose one. There’s a ⅓ probability of the car being behind my door and ⅔ probability of the car being behind the other two doors.

Now, let’s add in the host opening one of the other doors I haven’t chosen, showing me a goat, and asking me if I want to change doors. If I don’t change doors, the probability of me winning is ⅓ because I haven’t taken into account the extra information the host has given me.

What happens if I change my strategy? When I made my initial choice of doors, there was a ⅔ probability the car was behind one of the other two doors. That can't change. Whatever happens, there are still three doors and the car must be behind one of them. There’s a ⅔ probability that the car is behind one of the two doors.

Here’s where the magic happens. When the host opens a door and shows me a goat, there’s now a 0 probability that the car’s behind that door. But there was a ⅔ probability the car was behind one of the two doors before, so this must mean there’s a ⅔ probability the car is behind the remaining door!

There are more formal proofs of the correctness of this solution, but I won’t go into them here. For those of you into Bayes' theorem, there’s a really nice formal proof.

I know some of you are probably completely unconvinced. I was at first too. Years ago, I wrote a simulator and did 1,000,000 simulations of the game. Guess what? Sticking gave a ⅓ probability and changing gave a ⅔ probability. You don’t even have to write a simulator anymore, there are many websites offering simulations of the game so you can try different strategies.
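If you'd like to try it yourself, a simulator takes only a few lines of Python. This is a fresh sketch, not the simulator mentioned above; the function names, the seed, and the number of trials are arbitrary choices.

```python
import random

def play(switch, rng):
    """Play one game; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one remaining unopened door
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, trials=100_000, seed=0):
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(trials))
    return wins / trials

print(f"stick:  {win_rate(switch=False):.3f}")   # close to 1/3
print(f"switch: {win_rate(switch=True):.3f}")    # close to 2/3
```

The key line is the one where the host chooses a door: the host never reveals the car, and that constraint is what carries the extra information.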

If you want to investigate the problem in-depth, read Rosenhouse's book. It's 174 pages on this problem alone, covering the media furor, basic probability theory, Bayes' theorem, and many variations of the game. It pretty much beats the problem to death.

The Monty Hall problem is a fun problem, but it does serve to illustrate a more serious point. Probability theory is often much more complex than it first appears and the truth can be counter-intuitive. The problem teaches us humility. If you’re making business decisions on multiple probabilities, are you sure you’ve correctly worked out the odds?

# References

• The Wikipedia article on the Monty Hall problem is a great place to start.
• New York Times article about the 1990 furor with some background on the problem.
• Washington Post article on the problem.
• 'The Monty Hall Problem', Jason Rosenhouse - is an entire book on various aspects of the problem. It's 174 pages long but still doesn't go into some aspects of it (e.g. the quantum variation).

# Correlation is not causation

Because they’ve misunderstood one of the main rules of statistical evidence, I’ve seen people make serious business mistakes and damage their careers. The rule is simple but subtle: correlation is not causation. I’m going to explain what this means and show you cases where it’s obviously true, and some cases where it’s less obvious. Let’s start with some definitions.

Clearly, causation means one thing causes another. For example, prolonged exposure to ultraviolet light causes sunburn, the Vibrio cholerae bacterium causes cholera, and recessions cause bankruptcies.

# What is correlation?

Correlation occurs when two things vary in the same way. For example, lung cancer rates vary with the level of smoking, commuting times vary with the state of the economy, and health and longevity are correlated with income and wealth. The relationship usually becomes clear when we plot the data out, but it’s very rarely perfect. To give you a sense of what I mean, I’ve taken the relationship between brain mass and body mass in mammals and plotted the data below; each dot is a different type of mammal [Rogel-Salazar].

The straight line on the chart is a fit to the data. As you can see, there’s a relationship between brain and body mass but the dots are spread.

We measure how well two things are correlated with something called the correlation coefficient, r. The closer r is to 1 (or -1), the better the correlation (this is a gross simplification). I typically look for r to be above 0.8 (or below -0.8). For the brain and body data above, r is 0.89, so the correlation is ‘good’.
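For reference, r here is the Pearson correlation coefficient: the covariance of the two variables divided by the product of their standard deviations. A minimal Python sketch follows; the function name `pearson_r` and the sample numbers are made up for illustration and are not the brain and body data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Made-up example data: y is roughly 2x with a little scatter
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

print(f"r = {pearson_r(xs, ys):.3f}")
```

A perfectly straight rising line gives r = 1, a perfectly straight falling line gives r = -1, and the more the dots spread, the closer r gets to 0.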

For causation to exist, to say that A causes B, we must be able to observe the correlation between A and B. If sunscreen is effective at reducing sunburn we should observe increased sunscreen use leading to reduced sunburn. However, we need more than correlation to prove causation (I’m skipping over details to keep it simple).

# Correlation does not imply causation

Here’s the important bit: correlation does not imply causation. Just because two things are correlated does not imply that one causes the other. Two things could be very well correlated and there could be no causal relationship between them at all. There could be a confounding factor that causes both variables to move in the same way. In my view, misunderstanding this is the single biggest problem in data analysis.

The excellent website Spurious Correlations shows the problem in a fun way; I’ve adapted an example from the website to illustrate my point. Here are two variables I've shown varying with time.

(Image credit: Spurious Correlations)

Imagine one of the variables was sales revenue and the other was the number of hours of sales effort. The correlation between them is very high (r=0.998). Would you say the amount of sales effort causes the sales revenue? If sales revenue was important to you, would you invest in more sales hours? If I presented this evidence to you in an executive meeting, what would you say?

Actually, I lied to you. The red line is US spending on science, space, and technology and the black line is suicides by hanging, strangulation, and suffocation. How can these things be related to each other? Because there’s some other variable or variables both of them depend on, or frankly, just by chance. Think for a minute what happens as an economy grows, all kinds of expenditure goes up; sales of expensive wine go up, and people spend more on their houses. Does that mean sales of expensive wine cause people to spend more on houses?
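The wine and housing argument can be sketched as a small simulation. All the numbers here are invented: a hidden "economy" variable drives both wine sales and housing spend, each with its own independent noise, and neither affects the other. The helper `pearson_r` and the variable names are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

rng = random.Random(42)
n = 5000

# The confounder: the state of the economy (simulated, arbitrary units)
economy = [rng.gauss(0, 1) for _ in range(n)]
# Both quantities depend on the economy, but not on each other
wine = [e + rng.gauss(0, 0.3) for e in economy]
housing = [e + rng.gauss(0, 0.3) for e in economy]

r_raw = pearson_r(wine, housing)

# Remove the economy's contribution and correlate what is left
wine_residual = [w - e for w, e in zip(wine, economy)]
housing_residual = [h - e for h, e in zip(housing, economy)]
r_controlled = pearson_r(wine_residual, housing_residual)

print(f"wine vs housing:                 r = {r_raw:.2f}")
print(f"after removing the economy term: r = {r_controlled:.2f}")
```

The raw correlation is strong even though there is no causal link in the simulation, and it vanishes once the confounder is accounted for. In real data you rarely get to observe the confounder this cleanly.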

(On the spurious correlations website there are a whole bunch of other examples, including: divorce rates in Maine correlated with per capita consumption of margarine, total revenue generated by arcades is correlated with the age of Miss America, and letters in the winning word of the Scripps National Spelling Bee are correlated with number of people killed by venomous spiders.)

The chart below shows the relationship between stork pairs and human births for several European locations 1980-1990 [Matthews]. Note r is high at 0.85.

Is this evidence that storks deliver babies? No. Remember correlation is not causation. There could well be many confounding variables here, for example, economic growth leading to more leisure time. Just because we don’t know what the confounding factors are doesn’t mean they don’t exist.

My other (possibly apocryphal) example concerns lice. In Europe in the middle ages, lice were considered beneficial (especially for children) because sick people didn’t have as many lice [Zinsser]. Technically, this type of causation mistake is known as the post hoc ergo propter hoc fallacy if you want to look it up.

# Correlation/causation offenders

The causation/correlation problem often rears its ugly head in sales and marketing. Here are two examples I’ve seen, with the details disguised to protect the guilty.

I’ve seen a business analyst present the results of detailed sales data modeling and make recommendations for change based on the correlation/causation confusion. The sales data set was huge and they’d found a large number of correlations in the data (with good r values). They concluded that these correlations were causation, for example, in area X sales scaled with the number of sales reps and they concluded that more reps = more sales. They made a series of recommendations based on their findings. Unfortunately, most of the relationships they found were spurious and most of their recommendations and forecasts were later found to be wrong. The problem was, there were other factors at play that they hadn’t accounted for. It doesn’t matter how complicated the model or how many hours someone has put in, the same rule applies; correlation does not imply causation.

The biggest career blunder I saw was a marketing person claiming that visits to the company website were driving all company revenue. I remember them talking about the correlation and making the causation claim to get more resources for their group. Unfortunately, later on, revenue went down for reasons (genuinely) unrelated to the website. The website wasn’t driving all revenue - it was just one of a number of factors, including the economy and the product. However, their claim to be driving all revenue wasn’t forgotten by the executive team and the marketing person paid the career price.

Here’s what I think you should take away from all this. Just because two things appear to be correlated doesn’t mean there’s causation. In business, we have to make decisions on the basis of limited evidence and that’s OK. What’s not OK is to believe there’s evidence when there isn’t - specifically to infer causation from correlation. Statistics and experience teach us humility. The UK Highway Code has some good advice here: a green light doesn’t mean go, it means ‘proceed with caution’.