
Monday, November 23, 2020

Chance encounters of the third kind: understanding probability

Why look back at basic probability?

Bayes' theorem lies at the heart of much of modern machine learning. Although it's relatively simple to understand, you do need some grounding in probability theory. This blog post is all about getting you up close and personal with probability theory so I can tell you all about Bayes in a later post.

(You can work out the probability aliens are on earth given that Elvis lives. Image source: Pixabay Author: Pete Linforth License: Pixabay.)

The very basics

Think of some event that might occur in the future, say winning the lottery, buying a new car, or England winning the World Cup. We can estimate the probability of these events happening; we can call the event A and the probability of the event occurring P(A). If the event is certain to occur, then \(P(A) = 1\); if it's certain not to occur, then \(P(A) = 0\); and in all cases \(0 \leq P(A) \leq 1\).

We'll consider the probability of several events I'm going to call A, B, C, etc. These can be any events at all, including aliens landing, Elvis making a comeback, or getting a pay raise at the end of the year.

The complementary rule

If the probability of an event A occurring is P(A), the probability of it not occurring is \(1 - P(A)\). This is called the complement and different authors use different notation for it:

\[1 - P(A) = P(A^c) = P(A-) = P(\bar A) = P(\sim A) \]

Let me give you an example using one notation. Imagine 1% of the population has a disease and 99% don't, then:

\[1 = P(D+) + P(D-) = 0.01 + 0.99\]

Independence

Independence is a huge issue in probability modeling and it can lead to big errors if not handled correctly. On the face of it, it's a simple idea, but there are subtleties.

Two events are independent if one does not affect or influence the other in any way (alternatively, one event does not give any information about the other). For example, the odds of Joe Biden winning the 2020 Presidential election do not depend on the odds of New Zealand opening its borders to international travelers. Looking at things the other way, the odds of me winning the lottery are dependent on my purchasing a ticket (I have to buy a ticket to stand any chance of winning) - these are dependent events. I'm sure you can think of many other examples.

Independent and dependent events are treated very differently mathematically; the big mistake comes when events that are not independent are treated as if they were. For example, an organization might run many opinion polls in an election. The errors in the polls will not be independent of one another because the organization may well have a systematic bias that affects all their polls. There are similar problems in epidemiology; if you and I live together, my probability of catching an infectious disease is not independent of your probability of catching an infectious disease. The most famous example of confusing independent and dependent events was the subprime mortgage crisis of 2008 onwards. The analysts who developed the subprime mortgage default models assumed that mortgage defaults were independent of one another. Unfortunately for all of us, that wasn't the case in 2008. Economic conditions led to many defaults, which in turn led to broader financial problems, which in turn led to more defaults. In 2008 and onwards, subprime mortgage defaults were dependent on one another.

Disjoint (mutually exclusive) events

Two events are disjoint if they're mutually exclusive; in other words, they can't both happen. For example, only one of Joe Biden or Donald Trump can win the election - they can't both be President. In notation I'll explain later: \( P(A  \ and \  B) = P(A \cap B) = 0\).

Probability of A and B occurring (intersection) - the multiplication rule

What's the probability of A and B occurring (also known as their joint or conjoint probability)? Here's where we run into some notation issues. Some sources write 'and' and some use the symbol '\(\cap\)' - both mean the same thing.

Here's the rule for dependent events:

\[P(A \ and \ B) = P(A  \cap B) = P(A) P(B | A)\]

Here's the rule for independent events:

\[P(A \ and \ B) = P(A  \cap B) = P(A) P(B)\]

Here's the rule for disjoint events:

\[ P(A \ and \ B) = P(A \cap B) = 0\]

The and relationship is commutative:

\[P(A \cap B) = P(B \cap A)\]
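To make the multiplication rule concrete, here's a minimal Python sketch using a card-drawing example of my own (not from the post): drawing two aces without replacement is a pair of dependent events, while drawing with replacement makes the two draws independent.

```python
from fractions import Fraction

# Dependent events: draw two cards from a 52-card deck without replacement.
# P(both aces) = P(first ace) * P(second ace | first ace)
p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)   # one ace has already gone
print(p_first_ace * p_second_ace_given_first)   # 1/221

# Independent events: put the first card back before drawing again.
# P(A and B) = P(A) * P(B)
print(Fraction(4, 52) * Fraction(4, 52))        # 1/169
```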

Probability of A or B occurring (union) - the addition rule

What's the probability of A or B occurring? Some sources write 'or' and some write '\(\cup\)'. Here's the rule:
\[P(A \ or \ B) = P(A  \cup B) = P(A) +  P(B) - P(A \cap B) \]

\[= P(A) + P(B) - P(A)P(B | A)\]

The or relationship is commutative:

\[P(A \cup B) = P(B \cup A)\]

For disjoint events, the addition rule simplifies to:

\[P(A \ or \ B) = P(A  \cup B) = P(A) +  P(B)  \]

because from before we have:

\[P(A \cap B) =  0\]
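Here's a quick sketch of the full addition rule in action, again using a playing-card example of my own: the events 'draw an ace' and 'draw a heart' are not disjoint (the ace of hearts belongs to both), so the \(P(A \cap B)\) term matters.

```python
from fractions import Fraction

p_ace = Fraction(4, 52)
p_heart = Fraction(13, 52)
p_ace_and_heart = Fraction(1, 52)     # the ace of hearts

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
print(p_ace + p_heart - p_ace_and_heart)   # 4/13

# Brute-force check by enumerating the deck (rank 0 = ace, suit 0 = hearts)
count = sum(1 for rank in range(13) for suit in range(4) if rank == 0 or suit == 0)
print(Fraction(count, 52))                 # 4/13
```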

Conditional probability - the conditional rule

What's the probability I have a disease given I've tested positive for the disease? We use the | symbol to mean "given that", so P(A | B) means the probability of A happening given that B has occurred.  Here are some examples from everyday life:

  • What's the probability I win the lottery given that I've bought a ticket?
  • What's the probability I will get a degree if I go to college?
  • What's the probability I will have an accident if I'm driving and if it's snowing and if it's dark?

The interesting thing about conditional probability is that it can be quite different from the 'raw' probability. For example, let's say you're from a poor family, you might only have a 10% chance of getting a degree, but if you get accepted to a college, the probability might shoot up to 50%, and if you actually go to college, the probability may get to 95%. The probability can change quite substantially depending on new information (as we'll see with Bayes' theorem).

The general rule is:

\[P(A | B) = \frac{P(B \cap A)}{P(B)}\]

If A and B are independent (A does not depend on B), then P(A | B) = P(A).
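Continuing my card illustration from earlier, here's a small sketch showing the conditional rule in action: dividing the joint probability by \(P(B)\) recovers the 'given that' probability we'd expect.

```python
from fractions import Fraction

# B = first card is an ace, A = second card is an ace (drawing without replacement)
p_B = Fraction(4, 52)
p_A_and_B = Fraction(4, 52) * Fraction(3, 51)   # joint probability from the multiplication rule

# Conditional rule: P(A | B) = P(A and B) / P(B)
print(p_A_and_B / p_B)    # 1/17, which is 3/51 - the intuitive answer
```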

The law of total probability

There's a general form of this law and a more specific form. Because the specific form will be useful for Bayesian work later, we'll start with that.

\[P(A+) = P(A+ \cap \ B+) + P(A+ \cap \ B-)\]

In words, the probability of an event A+ occurring is the probability of A+ occurring and B+ occurring, plus the probability of A+ occurring and B+ not occurring (B-). This might be clearer if we remember \(1 = P(B+) + P(B-)\) and we think of probabilities using a Venn diagram.

The more general form of this law is:

\[P(A) = \sum_i{P(A \cap B_i)} = \sum_i{P(A | B_i)P(B_i)}\]
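Here's a minimal sketch of the law of total probability using the disease example from earlier; the test's sensitivity and false positive rate are made-up numbers purely for illustration.

```python
# Prevalence from the earlier example; the test characteristics are assumptions.
p_D_pos, p_D_neg = 0.01, 0.99    # P(B+) and P(B-)
p_T_pos_given_D_pos = 0.90       # P(A+ | B+), assumed test sensitivity
p_T_pos_given_D_neg = 0.05       # P(A+ | B-), assumed false positive rate

# Law of total probability: P(A+) = P(A+ | B+) P(B+) + P(A+ | B-) P(B-)
p_T_pos = p_T_pos_given_D_pos * p_D_pos + p_T_pos_given_D_neg * p_D_neg
print(p_T_pos)    # 0.0585
```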

The law of total probability and conditional probabilities

One of the most useful forms of Bayes' theorem relies on the combination of the law of total probability and conditional probability. Here's the key relationship:

\[1 = P(A | B) + P(\bar A | B)\]

Let me put this into words. If event B happens, then either A or not A happens, there are no other options, so the two probabilities must sum to 1.

What use is probability theory?

I grew up hearing about the value of 'common sense', but probability theory often gives results that seem very counterintuitive and 'common sense' can lead you wildly astray. A fun example is the Monty Hall problem, but there are lots of other examples in the real world where the probability of something happening is not what it appears to be at first - and they're not so fun. The counter-intuitive example you find most often on the internet is the probability that you have a disease given a positive test result; it's mostly not what you think.

Bayes' theorem takes us into the world of the counter-intuitive and I'll talk about Bayes in a future blog post.

Monday, November 16, 2020

Geese or enemy aircraft? Receiver Operating Characteristic curves in machine learning

In a strange quirk of history, one of the ways of evaluating machine learning algorithms has its roots in World War II and was subsequently used in a range of disciplines, including psychiatry. Only much later was it used in machine learning, but it kept its original name: receiver operating characteristic (ROC). I'm going to look at the history of this technique and explain what it is and why it's so important.

Is it geese or is it enemy planes?

In 1940, the situation in Britain was dire; the country was engaged in a desperate stand against Hitler. To weaken the country and break the will of the people, Nazi aircraft heavily bombed British cities in what became known as the Blitz. I've seen estimates of over 43,000 people killed and, of course, huge damage to Britain's industrial and cultural infrastructure. Newsreel pictures and propaganda of the time give a view of the devastation. Britain stood alone against the Nazi threat; the Battle of Britain was an existential struggle.

(Office workers in London going to work through bomb damage. Image source: Wikimedia Commons, License: Public Domain.)

It was vital therefore to detect enemy aircraft as quickly as possible, so the British government used a new technology called radar. Radar receivers had a number of settings, for example, you could turn the gain (amplification) up, but what should the correct settings be? Obviously, you want to correctly identify enemy aircraft, but you don't want to identify a flock of geese as aircraft. If you divert limited resources to chasing wild geese, those resources aren't available to pursue the real threat. This is where the receiver operating characteristic curve comes in. It was a way of deciding the best operating point and/or deciding the best receiver.

Ways of being right and wrong

I've covered this before in a previous blog post about the confusion matrix, so I'll just briefly recap here. There are two ways to be right and two ways to be wrong if we're doing a binary classification (geese/enemy aircraft).


                   Actual
                   enemy aircraft     geese
Prediction
  enemy aircraft   True Positive      False Positive
  geese            False Negative     True Negative

From the counts of the True Positives, False Negatives, etc. we can define two quantities:

\[TPR = \frac{TP}{TP + FN} = 1 - FNR\]
\[FPR = \frac{FP}{FP + TN}\]

where TPR is the True Positive Rate (also known as sensitivity, recall, or hit rate) and FPR is the False Positive Rate (also known as fall-out).

There are many other quantities we can define to help us evaluate classification. But all of these quantities are point measures: they allow us to evaluate an algorithm at a single point, under a single operating condition.

A picture is worth a thousand words

The receiver operating characteristic is a plot of the True Positive Rate vs. the False Positive Rate for different settings. Generically, it looks something like this. 

We get a curve by varying a parameter and measuring FPR and TPR at each of the parameter values. In the case of our World War II radar receiver, the parameter could be gain; increasing the gain changes the trade-off between TPR and FPR.

Let's imagine a receiver that was just a random selector - choosing geese or enemy aircraft based on the toss of a coin. We would expect it to give us a straight line at \(45^\circ\). Over time, the random selector would tend to the 50-50 point on the straight line. A real receiver has to do better than chance, so it has to be above the random line. In the chart below, the chance line is the black dotted line.

An ideal receiver has very different properties from the chance line. I've indicated an ideal operating curve in red on the chart below - it always gives a 100% True Positive Rate.

The ROC chart allows us to compare the behavior of different algorithms or different receivers. We could draw out the ROC curve for two receivers for example and choose the best one (the highest line). Here's a graphical representation.

A more mathematical way of doing the same thing is to use the ROC curves but work out the area under the curve (AUC). An ideal receiver has an AUC of 1 (the red line); in general, the higher the AUC, the better.
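If you want to see this in code, here's a minimal scikit-learn sketch; the labels and scores are toy numbers I've invented (1 = enemy aircraft, 0 = geese), not anything from a real radar receiver.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy data: true labels and the receiver's confidence scores
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7, 0.6, 0.3])

# Each score threshold gives one (FPR, TPR) operating point on the curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))     # plot FPR on x and TPR on y to draw the ROC curve

# Area under the curve: 1.0 is ideal, 0.5 is the chance line
print(roc_auc_score(y_true, y_score))
```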

Machine learning

Classifiers enable us to make categorical decisions based on input data. For example, if a user types 'evening wear' into a shopping site, do you show them cocktail dresses or tuxedos? A machine learning algorithm might use the user's browsing behavior to make a guess about male or female clothing. But how correct is the algorithm? This is where ROC curves can be used to understand the degree of correctness and to choose the appropriate algorithmic settings.

Uses of ROC curves outside of machine learning

ROC curves are used in a wide range of disciplines beyond machine learning, including medicine and psychiatry.

More tongue-in-cheek, a group of medical researchers in Sydney, Australia used a ROC to find the optimal walking speed for men over 70 to avoid death. If you're interested, the optimal speed is 0.82m/s. 

Limitations of ROC curves - precision-recall

In a previous blog post, I looked at the confusion matrix and talked about prevalence. The idea is simple: a biased data set can give you a false sense of how accurate your algorithm is. If your data is biased, a precision-recall plot may be more appropriate.

Going back to the confusion matrix, here's how we define precision and recall.

\[Precision = \frac{TP}{TP + FP}\]
\[Recall = \frac{TP}{TP + FN}\]
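Here's the matching scikit-learn sketch for a precision-recall curve, using the same invented toy scores as the ROC example above.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7, 0.6, 0.3])

# One (recall, precision) point per score threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(list(zip(recall, precision)))   # plot recall on x and precision on y
```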

Here's a typical precision-recall curve.

Because precision gives us an indication of how relevant the results are, precision-recall curves are often used to evaluate information retrieval algorithms.

Despite the long track record for receiver operating characteristic curves, precision-recall curves may be a better evaluation method. However, old habits die hard and ROC curves still reign.

Don't lose sight of the end goal

ROC and precision-recall curves are all about the same thing: figuring out how useful an algorithm is. There are lots of different ways an algorithm can be wrong, which means different ways of investigating correctness. Don't lose sight of the fact that under the hood, machine learning algorithms are probabilistic.

Reading more

https://www.cambridge.org/core/services/aop-cambridge-core/content/view/S1481803500013336

https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.466.7628&rep=rep1&type=pdf

Monday, November 9, 2020

Dazed and confused: the confusion matrix and getting it right and wrong

How correct are my (machine learning) algorithms?

In machine learning, we're using algorithms to make predictions about outcomes based on input data. For example, given that a consumer at an online store views dog collars and dog leads, you might show them dog food if they search for 'pet food'. This is fairly obvious, but what if they then searched for evening wear, would you show them cocktail dresses or tuxedos? 

(The confusion matrix can be confusing. Image source: Pixabay. Author: Erika Wittlieb. License: Pixabay license)

The confusion matrix is about quantifying the correctness of algorithms, but it's not sufficient of itself. Fortunately, there are quantities we can derive from the confusion matrix that will show up certain types of error as we'll see.

The confusion matrix

I'm going to use the example of an online store that sells pet products. Imagine an algorithm that tries to decide if a consumer has a cat or not. There are two ways the algorithm can be right and two ways the algorithm can be wrong. I'll draw it out as a matrix so you can see it a bit more easily. In reality, we might put counts of false negatives, etc. in the matrix.


                   Actual
                   cat                not cat
Prediction
  cat              True Positive      False Positive
  not cat          False Negative     True Negative

All of this sounds great. It looks like we can define some rates and be done.  Let's start with some definitions and see where we get to.

We might want to know how often we said it was a cat when it actually was a cat; in other words, when it actually was positive, how often did we say it was positive? This is called the True Positive Rate (TPR), which is defined like this (where FNR is the False Negative Rate and is similarly defined):

\[TPR = \frac{TP}{TP + FN} = 1 - FNR = sensitivity, recall, hit rate\]

On the flip side, how often did we say not cat when it really was not cat (how often did we say negative when it really was negative):

\[TNR = \frac{TN}{TN + FP} = 1 - FPR = specificity, selectivity\]

There are a whole bunch of other metrics we can similarly define and I won't belabor the point by defining them all here (it seems as if every possible combination of true/false positive/negative has a name). I'm just going to show some of them in this table to give you a flavor.


                   Actual
                   cat                not cat
Prediction
  cat              True Positive      False Positive
  not cat          False Negative     True Negative

Metrics derived from the prediction rows:

  • Precision (positive predictive value): \(\frac{TP}{FP + TP}\)
  • False Discovery Rate: \(\frac{FP}{FP + TP}\)
  • False Omission Rate: \(\frac{FN}{FN + TN}\)

Metrics derived from the actual columns:

  • True Positive Rate (recall, sensitivity): \(\frac{TP}{TP + FN}\)
  • True Negative Rate (specificity): \(\frac{TN}{TN + FP}\)
  • False Positive Rate: \(\frac{FP}{TN + FP}\)

Be careful here; it's easy to get caught up on the names and definitions. You should focus on what this means for the correctness of your results.

We can use these metrics to help decide if our algorithms are good or not - but there are other things we need to consider.

Prevalence

One of the major issues in algorithmic bias has been prevalence. It's possible to get what seems like highly accurate results but for the results to be deeply biased by the underlying data. Again, the confusion matrix can help.

We can define the accuracy of an algorithm using this formula:

\[Accuracy = \frac{TP + TN}{TP + FP + TN + FN}\]

Let's imagine we're getting a really great accuracy. We're really good at saying it's a cat when it really is a cat. Doesn't this sound like a really great algorithm? Think about your answer before moving on.

The trouble is, it could be because almost all the underlying data is cat data. Imagine 95% of the data was cats and we said cat 100% of the time. Some of the metrics in the table would look wonderful. We'd get a 95% accuracy for example!
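Here's a tiny sketch of that scenario in Python (using the hypothetical 95%-cats numbers above): a classifier that always says 'cat' gets a flattering accuracy while never detecting a single non-cat.

```python
# 95% of the data is cats and the classifier says "cat" every single time
TP, FN = 950, 0    # every real cat is predicted as a cat
FP, TN = 50, 0     # every non-cat is also predicted as a cat

accuracy = (TP + TN) / (TP + FP + TN + FN)
tnr = TN / (TN + FP)                           # how often we correctly say "not cat"
prevalence = (TP + FN) / (TP + FP + TN + FN)   # fraction of the data that is cats

print(accuracy)     # 0.95 - looks wonderful
print(tnr)          # 0.0  - we never spot a non-cat
print(prevalence)   # 0.95 - the data set is heavily biased towards cats
```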

A version of this has happened in real life with awful consequences. Some of the human datasets that machine learning algorithms are trained on are biased: for example, they are disproportionately images of white people, or even worse, white males. In 2015, Google released a photo app that classified images. It misclassified pictures of black people as gorillas. This is just horrendous on multiple levels. The problem here might be that their training data set didn't include many pictures of non-white people. The labeling algorithms were accurate, just so long as you're white.

To test for bias in the dataset, we look at a number called prevalence, which represents the fraction of the data set that's in a category. In our example, the prevalence of cats would be 0.95 and non-cats 0.05, which reveals a huge bias towards cats. This might be OK if the site was aimed at cat lovers, but not so great if the site was trying to grow non-cat sales.

If you're doing any machine learning work for public consumption, you must consider prevalence.

One number to bind them all

Precision, recall, false discovery rate... there are lots of numbers here and it gets confusing. Why don't we create one metric that binds them all together? We would like a score of 1 for this metric to represent perfection, and 0 to represent total failure. Fortunately, there is such a metric and it's called the \(F_1\) score.

I won't go into the derivation here, but I will give you the formula:

\[F_1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)}\]

(for those of you who want a bit more, it's the harmonic mean of precision and recall). 

Even the \(F_1\) score isn't the end of it. It weighs precision and recall equally, but in reality, that might not be what we want. For example, we might consider a false positive much worse than a false negative (sending an innocent person to jail rather than setting a guilty person free for example). In these kinds of cases, there's a weighting factor \(\beta\) we can apply.

We can define \(\beta\) as:

\[\beta = \frac{TP + FP}{TP + FN}\] and we can create a revised F score as:

\[F_\beta =  \frac{(1 + \beta^2) TP}{(1 + \beta^2) TP + \beta^2FN + FP}\]
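Here's a small sketch that computes \(F_1\) and \(F_\beta\) straight from the counts using the formulas above; the TP/FP/FN numbers are made up for illustration.

```python
def f1(TP, FP, FN):
    # F1 = TP / (TP + 0.5 * (FP + FN)), the harmonic mean of precision and recall
    return TP / (TP + 0.5 * (FP + FN))

def f_beta(TP, FP, FN):
    # beta as defined in the post, then the weighted F score
    beta = (TP + FP) / (TP + FN)
    return (1 + beta**2) * TP / ((1 + beta**2) * TP + beta**2 * FN + FP)

# Made-up counts for illustration
TP, FP, FN = 80, 10, 30
print(f1(TP, FP, FN))       # 0.8
print(f_beta(TP, FP, FN))   # roughly 0.82
```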

All this looks a bit familiar

By the way, there are very obvious parallels here to statistics, specifically, \(\alpha\), \(\beta\), Type I, and Type II errors. We're getting quite close to statistical tests with some of these processes, which probably isn't surprising. Sadly, similar things are called by different names in different disciplines, a nice way to keep barriers to entry high.

Snakes and pirates

Both Python and R have libraries you can use that will give you the confusion matrix and quantities derived from it. In Python, you should look at confusion_matrix in scikit-learn. In R, you need confusionMatrix from the caret package.
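A minimal Python sketch of the scikit-learn route (the labels and predictions are toy values; note that scikit-learn puts actual classes on the rows and predictions on the columns, the transpose of the tables above):

```python
from sklearn.metrics import confusion_matrix

# 1 = cat, 0 = not cat; toy labels and predictions for illustration
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# Rows are actual classes, columns are predicted classes (class order: 0, 1)
print(confusion_matrix(y_true, y_pred))
# [[2 1]
#  [1 4]]
```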

What's next?

The confusion matrix is just the start. There are several techniques based on it that you can use to effectively evaluate algorithms. In a future blog post, I'm going to look at something called Receiver Operating Characteristic which has a very interesting history.  The thought I want to leave you with is a simple one: the confusion matrix is a means of representing different ways of being right and wrong. You can use quantities derived from the matrix to indicate bias and to indicate correctness.

Monday, November 2, 2020

The null hypothesis test

What's null hypothesis testing?

In business, as in many other fields, we have to make decisions in the face of uncertainty. Does this technology improve conversion? Is the new sales process working? Is the new machine tool improving quality? Almost never are the answers to these questions absolutely certain; there will be probabilities we have to trade off to make our decision.

(Two hypotheses battling it out for supremacy. Image source: Wikimedia Commons. Author: Pierdante Romei. License: Creative Commons.)

Null hypothesis tests are a set of techniques that enable us to reach probabilistic conclusions in an unbiased way. They provide a level playing field to decide if an effect is there or not.

Although null hypothesis tests are widely taught in statistics classes, many people who've come into data science from other disciplines aren't familiar with the core ideas. Similarly, people with business backgrounds sometimes end up evaluating A/B tests where the correct interpretation of null hypothesis tests is critical to understanding what's going on. 

I’m going to explain to you what null hypothesis testing is and some of the concepts needed to implement and understand it.

What result are you testing for?

To put it simply, a null hypothesis test is a test of whether there is an effect of a certain size present or not. The null hypothesis is that there is no effect, and the alternate hypothesis is that there is an effect. 

At its heart, the test is about probability and not certainty. We can’t say for sure if there is an effect or not, what we can say is the probability of there being an effect.  But probabilities are limited and we have to make binary go/no-go decisions - so null hypothesis tests include the idea of probability thresholds for deciding whether something is there or not.

To illustrate the use of a null hypothesis test, I’m going to use a famous example, that of the lady tasting tea. 

In a research lab, there was a woman who claimed she could tell the difference between cups of tea prepared in one of two ways:

  • The milk poured into the cup first and then the tea poured in
  • The tea poured first and then the milk poured in.
(Image source: Wikimedia Commons Artist: Ian Smith License: Creative Commons)

The researcher decided to do a test of her abilities by asking her to taste multiple cups of tea and state how she thought each cup had been prepared. Of course, it’s possible she could be 100% successful by chance alone. 

We can set up a null hypothesis test using these hypotheses:

  • The null hypothesis is the most conservative option. Here it’s that she can’t taste the difference. More specifically, her success rate is indistinguishable from random chance.
  • The alternative hypothesis is that she can tell the difference. More specifically, her success rate is significantly different from random chance.
Let's define some quantities:
  • \( p_T \) - the proportion of cups of tea she correctly identified
  • \( p_C \)  - the proportion of cups of tea she would be expected to get by chance alone (by guessing)
We can write the null and alternative hypotheses as:

  • \( H_0: p_T = p_C\) 
  • \( H_1: p_T \neq p_C\) 

But – the hypotheses in this form aren't enough. Will we insist she has to be correct every single time? Is there some threshold we expect her to reach before we accept her claim?

The null hypothesis is the first step in setting up a statistical test, but to make it useful, we have to go a step further and set up thresholds. To do this, we have to understand different types of errors.

Error types

To make things easy, we’ll call 'milk first' a positive and 'milk second' a negative.

For our lady testing tea, there are four possibilities:

  • She can say ‘milk first’ when it was 'milk first' – a true positive
  • She can say ‘milk first’ when it wasn’t 'milk first' – a false positive (also known as a Type I error)
  • She can say ‘milk second’ when it was 'milk second' – a true negative
  • She can say ‘milk second’ when it wasn’t 'milk second' – a false negative (also known as a Type II error)

This is usually expressed as a table; written out, the four outcomes and their probabilities are:

  • Fail to reject the null when the null hypothesis is true: a true negative, a correct inference. Probability = 1 - \( \alpha \).
  • Fail to reject the null when the null hypothesis is false: a false negative, a Type II error. Probability = \( \beta \).
  • Reject the null when the null hypothesis is true: a false positive, a Type I error. Probability = \( \alpha \).
  • Reject the null when the null hypothesis is false: a true positive, a correct inference. Probability = power = 1 - \( \beta \).

We can assign probabilities to each of these outcomes. As you can see, there are two numbers that are important here, \(\alpha\) and \(\beta\); however, in practice, we consider \(\alpha\) and 1-\(\beta\) as the numbers of importance. \(\alpha\) is called significance, and 1-\(\beta\) is called power. We can set values for each of them prior to the test. By convention, \(\alpha\) is usually 0.05, and 1-\(\beta \geq \) 0.80.

Test results, test size, and p-values

Our lady could guess correctly by chance alone. We have to set up the test so a positive conclusion due to randomness is unlikely, hence the use of thresholds. The easiest way to do this is to set the test size correctly, i.e. set the number of cups of tea. Through some math I won't go into, we can use \(\alpha\), (1-\(\beta\)), and the effect size to set the sample size. The effect size, in this case, is her ability to detect how the cup of tea was prepared above and beyond what would be expected by chance. For example, we might run a test to see if she was 20% better than chance.
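To give a feel for how \(\alpha\), power, and effect size drive the test size, here's a hedged sketch using the standard normal-approximation sample-size formula for a one-sample proportion test; the 0.7 success rate (20% better than chance) matches the assumption above, and the exact cup count depends on the approximation you use.

```python
from math import sqrt, ceil
from scipy.stats import norm

# Null: she's guessing (p0 = 0.5); alternative: 20% better than chance (p1 = 0.7)
p0, p1 = 0.5, 0.7
alpha, power = 0.05, 0.8

# Normal-approximation sample size for a one-sample, two-tailed proportion test
z_alpha = norm.ppf(1 - alpha / 2)
z_power = norm.ppf(power)
n = ((z_alpha * sqrt(p0 * (1 - p0)) + z_power * sqrt(p1 * (1 - p1))) / (p1 - p0)) ** 2
print(ceil(n))   # about 47 cups of tea
```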

To evaluate the test, we calculate a p-value from the test results. The p-value is the probability of getting a result at least as extreme as the one observed through chance alone. Because this is so important, I'm going to explain it again using other words. Let's imagine the lady tasting tea was guessing. By guessing alone, she could get between 0% and 100% correct. We know the probability for each percentage. We know it's very unlikely she'll get 100% or 0% by guesswork, but more likely she'll get 50%. For the score she got, we can work out the probability of her getting this score (or higher) through chance alone. Let's say there was a 3% chance she could have gotten her score by guessing alone. Is this proof she's not guessing?

We compare the p-value to our \( \alpha \) threshold to decide whether to reject the null hypothesis. Let's say our p-value was 0.03 and our \( \alpha \) value was 0.05; because 0.03 < 0.05, we reject the null hypothesis. In other words, we would conclude the lady was not guessing.
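To show where a p-value like this might come from, here's a sketch using SciPy's exact binomial test (requires SciPy 1.7 or later); the 40 cups and 27 correct calls are numbers I've invented for illustration.

```python
from scipy.stats import binomtest

# Hypothetical result: 40 cups of tea, 27 called correctly
n_cups, n_correct = 40, 27

# Null hypothesis: she's guessing, so p = 0.5; two-tailed alternative
result = binomtest(n_correct, n_cups, p=0.5, alternative='two-sided')
print(result.pvalue)

alpha = 0.05
print("reject the null" if result.pvalue < alpha else "fail to reject the null")
```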

False negatives, false positives

Using \(\alpha\) and a p-value, we can work out the chance of us saying there's an effect when there is none (a false positive). But what about a false negative? We could say there's no effect when there really is one. That might be as damaging to a business as a false positive. The quantity \(\beta\) gives us the probability of a false negative. By convention, statisticians talk about the power (1-\(\beta\)) of a test which is the probability of detecting an effect of the size you think is there.

Single tail or two-tail tests

Technically, the way the null hypothesis is set up in the case of the lady tasting tea is a two-tailed test. To ‘succeed’ she has to do a lot better than chance or she has to do a lot worse. That’s appropriate in this case because we’re trying to understand if she’s doing something other than guessing.

We could set up the test differently so she only has to be right more often than chance suggests. This would be a one-tail test. One-tail tests need fewer samples than two-tail tests to detect the same effect, but they’re more limited.

In business, we tend to do two-tailed tests rather than one-tailed tests.

Fail to reject the null or rejecting the null

Remember, we’re talking about probabilities and not certainties. Even if we gave our lady 100 cups to taste, there’s still a possibility she gets them all right due to chance alone. So we can’t say either the null or the alternate is true, all we can do is reject them at some threshold, or fail to reject them. In the case of a p-value of 0.03, a statistician wouldn’t say the alternate is true (the lady can taste the difference), but they would say ‘we reject the null hypothesis’. If the p-value was 0.1, it would be higher than the \( \alpha \) value and we would ‘fail to reject the null hypothesis’. This language is complex, but statisticians are trying to capture the idea that results are about probabilities, not certainties.

Choice of significance and power

Significance and power affect test size, so maybe we should choose them to make the test short? If you want to do a valid test, you're not free to choose any values of \(\alpha\) and (1-\(\beta\)) you choose. Convention dictates that you stick to these ranges:

  • \(\alpha \leq 0.05\) (in other words, a confidence level of at least 95%) - anything less stringent than this is usually considered a junk test.
  • (1-\(\beta) \geq 0.8\) - anything less than this is not worth doing. 

The why behind these values is the subject of another blog post.

The null hypothesis test summarized

This has been a very high-level summary of what happens in a null hypothesis test, for the sake of simplicity there are several steps I've left out and I've greatly summarized some ideas. Here's a simple summary of the steps I've discussed.

  1. Decide if the test is one-tail or two-tail.
  2. Create a null and alternate hypothesis.
  3. Set values for \(\alpha\) and (1-\(\beta\)) prior to the test.
  4. After the test, calculate a p-value.
  5. Compare the p-value to \(\alpha\) to decide whether to reject the null hypothesis (this controls the false positive probability).
  6. Check \(\beta\) to figure out the probability of a false negative.

I've left out topics like the z-test and the t-test and a bunch of other important ideas. 

Your takeaway should be that this process is complex and there are no shortcuts. At its heart, hypothesis testing is about deciding what's true when the data is uncertain and you need to do it without bias.


(Justice is supposed to be blind and balanced - like a null hypothesis test. Image source: Wikimedia Commons. License:  GNU Free Documentation License.)

Problems with the null hypothesis test

Mathematically, there's some controversy about the fundamentals of the procedure, but frankly, it's too complex to discuss here - and in any case, the controversy isn't over whether the procedures work or not.

A more serious problem is baked into the approach. At its heart, null hypothesis testing is about making a binary yes/no decision based on probabilistic data. The results are never certain. Unfortunately, test results are often taken as certain. For example, if we can't detect an effect in a test, it's often assumed there is no effect, but that's not true. This assumption that no detection = no effect has had tragic consequences in medical trials; there are high-profile cases where the negative side effects of a drug have been just below the threshold levels. Sadly, once the drugs have been released, the negative effects became well known, with disastrous consequences - Vioxx being a good example.

You must be aware that a test failure doesn't mean there isn't an effect. It could mean there's an effect hovering just below your acceptance threshold.

Using the null hypothesis in business

This is all a bit abstract, so let's bring it back to business. What are some examples of null hypothesis tests in the business world?

A/B testing

Most of the time, we choose a two-tail test because we're interested in the possibility a change might make conversion or other metrics worse. The hypothesis test we use is usually of this form:

\(H_0 : CR_B = CR_A\)

\(H_1 : CR_B \neq CR_A\)

where CR is the conversion rate (or revenue per user per branch, add-to-bag rate, etc.).
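Here's a hedged sketch of how such an A/B test might be evaluated with statsmodels' two-proportion z-test; the conversion counts and branch sizes are invented for illustration.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B test: conversions and users in branches A and B
conversions = [130, 162]
users = [2000, 2000]

# Two-tailed test of H0: CR_B = CR_A against H1: CR_B != CR_A
z_stat, p_value = proportions_ztest(conversions, users, alternative='two-sided')
print(z_stat, p_value)   # compare the p-value to your chosen alpha
```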

Manufacturing defects

Typically, these tests are one-tailed because we're only interested in an improvement. Here, the test might be:

\(H_0 : DR_B = DR_A\)

\(H_1 : DR_B < DR_A\)

where DR is the defect rate.

Closing thoughts

If all this seems a bit complex, arbitrary, and dependent on conventions, you're not alone. As it turns out, null hypothesis techniques are based on the shotgun marriage of two separate approaches to statistics. In a future blog post, I'll delve into this some more. 

For now, here's what you should take away:
  • You should understand that you need education and training to run these kinds of tests. A good grounding in statistics is vital.
  • The results are probabilistic and not certain. A negative test doesn't mean an effect isn't there, it might just be hovering underneath the threshold of detection.

Reading more

https://www.sagepub.com/sites/default/files/upm-binaries/40007_Chapter8.pdf

Monday, September 14, 2020

The datasaurus: always visualize your data

The summary is not the whole picture

If you just use summary statistics to describe your data, you can miss the bigger picture, sometimes literally so. In this blog post, I'm going to show you how relying on summaries alone can lead you catastrophically astray and I'm going to tell you how you can avoid making career-damaging mistakes.

The datasaurus is why you need to visualize your data. Source: Alberto Cairo. Open source.

What are summary statistics?

Summary statistics are parameters like the mean, standard deviation, and correlation coefficient; they summarize the properties of the data and the relationship between variables. For example, if the correlation coefficient, r, is about 0.8 for two data sets x and y, we might think there's a relationship between them, but if it's about 0, we might think there isn't.

The use of summary statistics is widely taught, every textbook emphasizes them, and almost everyone uses them. But if you use summary statistics in isolation from other methods you might miss important relationships - you should always visualize your data as we'll see.

Anscombe's Quartet

Take a look at the four plots below. They're obviously quite different, but they all have the same summary statistics!

Here are the summary statistics data:

  • Mean of x: 9
  • Sample variance of x: 11
  • Mean of y: 7.50
  • Sample variance of y: 4.125
  • Correlation between x and y: 0.816
  • Linear regression line: y = 3.00 + 0.500x
  • Coefficient of determination of the linear regression: 0.67

These plots were developed in 1973 by the statistician Francis Anscombe to make exactly this point: you can't rely on summary statistics, you need to visualize your data. The graphical relationship between the x and y variables is different in each case and implies different things. By plotting the data out, we can see what the relationships are, but summary statistics hide what's going on.

The datasaurus

Let's zoom forward to 2016. The justly famous Alberto Cairo tweeted about Anscombe's quartet and illustrated the point with this cool set of summary statistics. He later expanded on his tweet in a short blog post.

  • n: 142
  • x mean: 54.2633
  • x standard deviation: 16.7651
  • y mean: 47.8323
  • y standard deviation: 26.9353
  • Pearson correlation: -0.0645

What might you conclude from these summary statistics? I might say the correlation coefficient is close to zero, so there's not much of a relationship between the x and y variables, and conclude there's nothing interesting going on - but I would be wrong.

The summary might not mean anything to you, but the visualization surely will. This is the datasaurus data set, the x and the y variables draw out a dinosaur.

The datasaurus dozen

Two researchers at Autodesk Research took things a stage further. They started with Alberto Cairo's datasaurus and created a dozen other charts with exactly the same summary statistics as the datasaurus. Here they all are.

The summary statistics look like noise, but the charts reveal the underlying relationships between the x and y variables. Some of these relationships are obviously fun, like the star, but there are others that imply more meaningful relationships.

If all this sounds a bit abstract, let's think about how this might manifest itself in business. Let's imagine you're an analyst working for a large company. You have data on sales by store size for Europe and you've been asked to analyze the data to gain insights. You're under time pressure, so you fire up a Python notebook and get some quick summary statistics. You get summary statistics that look like the ones I showed you above. So you conclude there's nothing interesting in the data, but you might be very wrong.

You should plot the data out and look at the chart. You might see something that looks like the slanting charts above, maybe something like this:



The individual diagonal lines might correspond to different European countries (different regulations, different planning rules, different competition, etc.). There could be a very significant relationship that you would have missed by relying on summary data.
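Here's a minimal pandas/matplotlib sketch of that workflow; the file name and column names (store_sales.csv, store_size, sales) are hypothetical stand-ins for whatever data you actually have.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names - substitute your own data
df = pd.read_csv("store_sales.csv")

# The quick summary: means, standard deviations, correlation
print(df[["store_size", "sales"]].describe())
print(df["store_size"].corr(df["sales"]))

# The step that actually reveals the structure: plot it
df.plot.scatter(x="store_size", y="sales")
plt.show()
```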

(The Autodesk Research team have posted their work as a paper you can read here.)

Lessons learned

The lessons you should take away from all this are simple:

  • summary statistics hide a lot
  • there are many relationships between variables that will give summary statistics that look like noise
  • always visualize your data!

Tuesday, September 8, 2020

Can you believe the polls?

Opinion polls have known sin

Polling companies have run into trouble over the years in ways that render some poll results doubtful at best. Here are just a few of the problems:

  • Fraud allegations.
  • Leading questions.
  • Choosing not to publish results/picking methodologies so that polls agree.

Running reliable polls is hard work that takes a lot of expertise and commitment. Sadly, companies sometimes get it wrong for several reasons:

  • Ineptitude.
  • Lack of money. 
  • Telling people what they want to hear. 
  • Fakery.

In this blog post, I'm going to look at some high-profile cases of dodgy polling and I'm going to draw some lessons from what happened.

(Are some polls real or fake? Image source: Wikimedia Commons. Image credit: Basile Morin. License: Creative Commons.)

Allegations of fraud part 1 - Research 2000

Backstory

Research 2000 started operating around 1999 and gained some solid early clients. In 2008, The Daily Kos contracted with Research 2000 for polling during the upcoming US elections. In early 2010, Nate Silver at FiveThirtyEight rated Research 2000 as an F and stopped using their polls. As a direct result, The Daily Kos terminated their contract and later took legal action to reclaim fees, alleging fraud.

Nate Silver's and others' analysis

After the 2010 Senate elections, Nate Silver analyzed polling results for 'house effects' and found a bias towards the Democratic party for Research 2000. These kinds of biases appear all the time and vary from election to election. The Research 2000 bias was large (at 4.4%), but not crazy; the Rasmussen Republican bias was larger for example. Nonetheless, for many reasons, he graded Research 2000 an F and stopped using their polling data.

In June of 2010, The Daily Kos publicly dismissed Research 2000 as their pollster based on Nate Silver's ranking and more detailed discussions with him. Three weeks later, The Daily Kos sued Research 2000 for fraud. After the legal action was public, Nate Silver blogged some more details of his misgivings about Research 2000's results, which led to a cease and desist letter from Research 2000's lawyers. Subsequent to the cease-and-desist letter, Silver published yet more details of his misgivings. To summarize his results, he was seeing data inconsistent with real polling - the distribution of the numbers was wrong. As it turned out, Research 2000 was having financial trouble around the time of the polling allegations and was negotiating low-cost or free polling with The Daily Kos in exchange for accelerated payments. 

Others were onto Research 2000 too. Three statisticians analyzed some of the polling data and found patterns inconsistent with real polling - again, real polls tend to have results distributed in certain ways and some of the Research 2000 polls did not.

The result

The lawsuit progressed with strong evidence in favor of The Daily Kos. Perhaps unsurprisingly, the parties agreed to a settlement, with Research 2000 paying The Daily Kos a fee. Research 2000 effectively shut down after the agreement.

Allegations of fraud part 2 - Strategic Vision, LLC

Backstory

This story requires some care in the telling. At the time of the story, there were two companies called Strategic Vision, one company is well-respected and wholly innocent, the other not so much. The innocent and well-respected company is Strategic Vision based in San Diego. They have nothing to do with this story. The other company is Strategic Vision, LLC based in Atlanta. When I talk about Strategic Vision, LLC from now on it will be solely about the Atlanta company.

To maintain trust in the polling industry, the American Association for Public Opinion Research (AAPOR) has guidelines and asks polling companies to disclose some details of their polling methodologies. They rarely censure companies, and their censures don't have the force of law, but public shaming is effective as we'll see. 

What happened

In 2008, the AAPOR asked 21 polling organizations for details of their 2008 pre-election polling, including polling for the New Hampshire Democratic primary. Their goal was to quality-check the state of polling in the industry.

One polling company didn't respond for a year, despite repeated requests to do so. As a result, in September 2009, the AAPOR published a public censure of Strategic Vision, LLC, which you can read here.

It's very unusual for the AAPOR to issue a censure, so the story was widely reported at the time, for example in the New York Times, The Hill, and The Wall Street Journal. Strategic Vision LLC's public response to the press coverage was that they were complying but didn't have time to submit their data. They denied any wrongdoing.

Subsequent to the censure, Nate Silver looked more closely at Strategic Vision LLC's results. Initially, he asked some very pointed and blunt questions. In a subsequent post, Nate Silver used Benford's Law to investigate Strategic Vision LLC's data, and based on his analysis he stated there was a suggestion of fraud - more specifically, that the data had been made up. In a post the following day, Nate Silver offered some more analysis and a great example of using Benford's Law in practice. Again, Strategic Vision LLC vigorously denied any wrongdoing.

One of the most entertaining parts of this story is a citizenship poll conducted by Strategic Vision, LLC among high school students in Oklahoma. The poll was commissioned by the Oklahoma Council on Public Affairs, a think tank. The poll asked eight straightforward questions, for example:

  • who was the first US president? 
  • what are the two main political parties in the US?  

and so on. The results were dismal: only 23% of students answered George Washington and only 43% of students knew Democratic and Republican. Not one student in 1,000 got all questions correct - which is extraordinary. These types of polls are beloved of the press; there are easy headlines to be squeezed from students doing poorly, especially on issues around citizenship. Unfortunately, the poll results looked odd at best. Nate Silver analyzed the distribution of the results and concluded that something didn't seem right - the data was not distributed as you might expect. To their great credit, when the Oklahoma Council on Public Affairs became aware of problems with the poll, they removed it from their website and put up a page explaining what happened. They subsequently terminated their relationship with Strategic Vision, LLC.

In 2010, a University of Cincinnati professor awarded Strategic Vision LLC the "Phantom of the Soap Opera" award on the Media Ethics site. The site has a little more background on the odd story of Strategic Vision LLC's offices, or lack of them.

The results

Strategic Vision, LLC continued to deny any wrongdoing. They never supplied their data to the AAPOR and they stopped publishing polls in late 2009. They've disappeared from the polling scene.

Other polling companies

Nate Silver rated other pollsters an F and stopped using them. Not all of the tales are as lurid as the ones I've described here, but there are accusations of fraud and fakery in some cases, and in other cases, there are methodology disputes and no suggestion of impropriety. Here's a list of pollsters Nate Silver rates an F.

Anarchy in the UK

It's time to cross the Atlantic and look at polling shenanigans in the UK. The UK hasn't seen the rise and fall of dodgy polling companies, but it has seen dodgy polling methodologies.

Herding

Let's imagine you commission a poll on who will win the UK general election. You get a result different from the other polls. Do you publish your result? Now imagine you're a polling analyst, you have a choice of methodologies for analyzing your results, do you do what everyone else does and get similar results, or do you do your own thing and maybe get different results from everyone else?

Sadly, there are many cases when contrarian polls weren't published and there is evidence that polling companies made very similar analysis choices to deliberately give similar results. This leads to the phenomenon called herding where published poll results tend to herd together. Sometimes, this is OK, but sometimes it can lead to multiple companies calling an election wrongly.

In 2015, the UK polls predicted a hung parliament, but the result was a working majority for the Conservative party. The subsequent industry poll analysis identified herding as one of the causes of the polling miss. 

This isn't the first time herding has been an issue with UK polling and it's occasionally happened in the US too.

Leading questions

The old British TV show 'Yes, Prime Minister' has a great piece of dialog neatly showing how leading questions work in surveys. 'Yes, Prime Minister' is a comedy, but UK polls have suffered from leading questions for a while.

The oldest example I've come across dates from the 1970s and the original European Economic Community membership referendum. Apparently, one poll asked the following questions of two different groups:

  • France, Germany, Italy, Holland, Belgium and Luxembourg approved their membership of the EEC by a vote of their national parliaments. Do you think Britain should do the same?
  • Ireland, Denmark and Norway are voting in a referendum to decide whether to join the EEC. Do you think Britain should do the same?

These questions are highly leading and unsurprisingly elicited the expected positive result in both (contradictory) cases.

Moving forward in time to 2012, leading questions, or at least artful question wording, came up again. The background is press regulation. After a series of scandals where the press behaved shockingly badly, the UK government considered press regulation to curb abuses. Various parties were for or against various aspects of press regulation and they commissioned polls to support their viewpoints.

The polling company YouGov published a poll, paid for by The Media Standards Trust, that showed 79% of people thought there should be an independent government-sanctioned regulator to investigate complaints against the press. Sounds comprehensive and definitive. 

But there was another poll at about the same time, this time paid for by The Sun newspaper,  that found that only 24% of the British public wanted a government regulator for the press - the polling company here was also YouGov! 

The difference between the 79% and 24% came through careful question wording - a nuance that was lost in the subsequent press reporting of the results. You can listen to the story on the BBC's More Or Less program that gives the wording of the question used.

What does all this mean?

The quality of the polling company is everything

The established, reputable companies got that way through high-quality reliable work over a period of years. They will make mistakes from time to time, but they learn from them. When you're considering whether or not to believe a poll,  you should ask who conducted the poll and consider the reputation of the company behind it.

With some exceptions, the press is unreliable

None of the cases of polling impropriety were caught by the press. In fact, the press has a perverse incentive to promote the wild and outlandish, which favors results from dodgy pollsters. Be aware that a newspaper that paid for a poll is not going to criticize its own paid-for product, especially when it's getting headlines out of it.

Most press coverage of polls focuses on discussing what the poll results mean, not on how accurate they are or what their sources of bias might be. If these things are discussed at all, they're discussed in a partisan manner (disagreeing with a poll because the writer holds a different political view). I've never seen the kind of analysis Nate Silver does elsewhere - and this is to the great detriment of the press and their credibility.

Vested interests

A great way to get press coverage is by commissioning polls and publishing the results; especially if you can ask leading questions. Sometimes, the press gets very lazy and doesn't even report who commissioned a poll, even when there's plainly a vested interest.

Anytime you read a survey, ask who paid for it and what the exact questions were.

Outliers are outliers, not trends

Outlier poll results get more play than results in line with other pollsters. As I write this in early September 2020, Biden is about 7% ahead in the polls. Let's imagine two survey results coming in early September:

  • Biden ahead by 8%.
  • Trump ahead by 3%

Which do you think would get more space in the media? Probably the shocking result, even though the dull result may be more likely. Trump-supporting journalists might start writing articles on a campaign resurgence while Biden-supporting journalists might talk about his lead slipping and losing momentum. In reality, the 3% poll might be an anomaly and probably doesn't justify consideration until it's backed by other polls. 

Bottom line: outlier polls are probably outliers and you shouldn't set too much store by them.

There's only one Nate Silver

Nate Silver seems like a one-man army, routing out false polling and pollsters. He's stood up to various legal threats over the years. It's a good thing that he exists, but it's a bad thing that there's only one of him. It would be great if the press could take inspiration from him and take a more nuanced, skeptical, and statistical view of polls. 

Can you believe the polls?

Let me close by answering my own question: yes you can believe the polls, but within limits and depending on who the pollster is.

Reading more

This blog post is one of a series of blog posts about opinion polls. 

Wednesday, August 12, 2020

Who will win the election? Election victory probabilities from opinion polls

Polls to probabilities

How likely is it that your favorite candidate will win the election? If your candidate is ahead of their opponent by 5%, are they certain to win? What about 10%? Or if they're down by 2%, are they out of the race? Victory probabilities are related to how far ahead or behind a candidate is in the polls, but the relationship isn't a simple one and has some surprising consequences as we'll see.

Opinion poll example

Let's imagine there's a hard-fought election between candidates A and B. A newspaper publishes an opinion poll a few days before the election:

  • Candidate A: 52%
  • Candidate B: 48%
  • Sample size: 1,000

Should candidate A's supporters pop the champagne and candidate B's supporters start crying?

The spread and standard error

Let's use some standard notation. From the theory of proportions, the mean and standard error for the proportion of respondents who chose A is:

\[ p_a = {n_a \over n} \] \[ \sigma_a = { \sqrt {{p_a(1-p_a)} \over n}} \]

where \( n_a \) is the number of respondents who chose A and \( n \) is the total number of respondents. If the proportion of people who answered candidate B is \(p_b\), then obviously, \( p_a + p_b = 1\).

Election probability theory usually uses the spread, \(d\), which is the difference between the candidates: \[d = p_a - p_b = 2p_a - 1 \] From statistics theory, the standard error of \( d \) is: \[\sigma_d = 2\sigma_a\] (These relationships are easy to prove but a bit tedious; if anyone asks, I'll show the proof.)

Obviously, for a candidate to win, their spread, \(d\), must be > 0.

Everything is normal

From the central limit theorem (CLT), we know \(p_a\) and \(p_b\) are normally distributed, and also from the CLT, we know \(d\) is normally distributed. The next step to probability is viewing the normal distribution for candidate A's spread. The chart below shows the normal distribution with mean \(d\) and standard error \(\sigma_d\).

As with most things with the normal distribution, it's easier if we transform everything to the standard normal using the transformation: \[z = {(x - d) \over \sigma_d}\] The chart below is the standard normal representation of the same data.

The standard normal form of this distribution is a probability density function. We want the probability that \(d>0\) which is the light green shaded area, so it's time to turn to the cumulative distribution function (CDF), and its complement, the complementary cumulative distribution function (CCDF).

CDF and CCDF

The CDF gives us the probability that we will get a result less than or equal to some value I'll label \(z_c\). We can write this as: \[P(z \leq z_c) = CDF(z_c) = \phi(z_c) \] The CCDF is defined so that: \[1 = P(z \leq z_c) + P(z > z_c)= CDF(z_c) + CCDF(z_c) = \phi(z_c) + \phi_c(z_c)\] Which is a long-winded way of saying the CCDF is defined as: \[CCDF(z_c) = P(z > z_c) = \phi_c(z_c)\]

The CDF is the integral of the PDF, and from standard textbooks: \[ \phi(z_c) = {1 \over 2} \left( 1 + erf\left( {z_c \over \sqrt2} \right) \right) \] We want the CCDF,  \(P(z > z_c)\), which is simply 1 - CDF.

Our critical value occurs when the spread is zero. The transformation to the standard normal in this case is: \[z_c = {(x - d) \over \sigma_d} = {-d \over \sigma_d}\] We can write the CCDF as: \[\phi_c(z_c) = 1 - \phi(z_c) = 1- {1 \over 2} \left( 1 + erf\left( {z_c \over \sqrt2} \right) \right)\ \] \[= 1 - {1 \over 2} \left( 1 + erf\left( {-d \over {\sigma_d\sqrt2}} \right) \right)\] We can easily show that: \[erf(x) = -erf(-x)\] Using this relationship, we can rewrite the above equation as: \[ P(d > 0) = {1 \over 2} \left( 1 + erf\left( {d \over {\sigma_d\sqrt2}} \right) \right)\]

What we have is an equation that takes data we've derived from an opinion poll and gives us a probability of a candidate winning.
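Here's a small sketch that wraps the formula above into a function; it reproduces the worked example in the next section.

```python
from math import erf, sqrt

def victory_probability(n, n_a):
    """Probability candidate A wins, given n_a of n poll respondents chose A."""
    p_a = n_a / n
    sigma_a = sqrt(p_a * (1 - p_a) / n)
    d = 2 * p_a - 1              # the spread
    sigma_d = 2 * sigma_a
    # P(d > 0) = 0.5 * (1 + erf(d / (sigma_d * sqrt(2))))
    return 0.5 * (1 + erf(d / (sigma_d * sqrt(2))))

# The example poll: 520 of 1,000 respondents chose candidate A
print(victory_probability(1000, 520))   # about 0.90
```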

Probabilities for our example

For candidate A:

  • \(n=1000\)
  • \( p_a = {520 \over 1000} = 0.52 \)
  • \(\sigma_a = 0.016 \)
  • \(d = {{520 - 480} \over 1000} = 0.04\)
  • \(\sigma_d = 0.032\)
  • \(P(d > 0) = 90\%\)

For candidate B:

  • \(n=1000\)
  • \( p_b = {480 \over 1000} = 0.48 \)
  • \(\sigma_b = 0.016 \)
  • \(d = {{480 - 520} \over 1000} = -0.04\)
  • \(\sigma_d = 0.032\)
  • \(P(d > 0) = 10\%\)

Obviously, the two probabilities add up to 1. But note the probability for candidate A. Did you expect a number like this? A 4 percentage point lead in the polls giving a 90% chance of victory?

Some consequences

Because the probability is based on \( erf \), you can quite quickly get to highly probable events as I'm going to show in an example. I've plotted the probability for candidate A for various leads (spreads) in the polls. Most polls nowadays tend to have about 800 or so respondents (some are more and some are a lot less), so I've taken 800 as my poll size. Obviously, if the spread is zero, the election is 50%:50%. Note how quickly the probability of victory increases as the spread increases.

What about the size of the poll, how does that change things? Let's fix the spread to 2% and vary the size of the poll from 200 to 2,000 (the usual upper and lower bounds on poll sizes). Here's how the probability varies with poll size for a spread of 2%.

Now imagine you're a cynical and seasoned poll analyst working on candidate A's campaign. The young and excitable intern comes rushing in, shouting to everyone that A is ahead in the polls! You ask the intern two questions, and then, like the Oracle at Delphi, you predict happiness or not. What two questions do you ask?

  • What's the spread?
  • What's the size of the poll?

What's missing

There are two elephants in the room, and I've been avoiding talking about them. Can you guess what they are?

All of this analysis assumes the only source of error is random noise. In other words, there's no systemic bias. In the real world, that's not true. Polls aren't wholly based on random sampling, and the sampling method can introduce bias. I haven't modeled it at all in this analysis. There are at least two systemic biases:

  • Pollster house effects arising from house sampling methods
  • Election effects arising from different population groups voting in different ways compared to previous elections.

Understanding and allowing for bias is key to making a successful election forecast. This is an advanced topic for another blog post.

The other missing item is more subtle. It's undecided voters. Imagine there are two elections and two opinion polls. Both polls have 1,000 respondents.

Election 1:

  • Candidate A chosen by 20%
  • Candidate B chosen by 10%
  • Undecided voters are 70%
  • Spread is 10%
Election 2:

  • Candidate A chosen by 55%
  • Candidate B chosen by 45%
  • Undecided voters are 0%
  • Spread is 10%
In both elections, the spread from the polls is 10%, so the formula gives candidate A the same chance of winning in both elections - but this doesn't seem right. Intuitively, we should be less certain about an election with a high number of undecided voters. Modeling undecided voters is a topic for another blog post!

Reading more

The best source of election analysis I've read is in the book "Introduction to data science" and the associated edX course "Inference and modeling", both by Rafael Irizarry. The analysis in this blog post was culled from multiple books and websites, each of which only gave part of the story.
