Tuesday, July 20, 2021

We don't need no education: England's poor educational performance

Why is education important?

High-paying jobs tend to be knowledge-intensive. If an employer can't get the qualified workers it wants in the UK, it can move its operations to Poland, Spain, or Czechia. This obviously applies to IT jobs, but also to specialized manufacturing jobs and jobs in other areas. Governments are in a race, or a beauty contest, to attract employers with high-paying jobs to set up in their country. High-paying jobs support an ecosystem of other employment, from transportation workers to baristas.

(In the modern economy, those who train well and invest will win. Image source: Wikimedia. Author: Ub-K0G76A. License: Creative Commons.)

That's why the UK's flat educational performance is so worrying and why it's puzzling there isn't more concern about it in the country. To summarize what I'm going to tell you: educational achievement in the UK has been stagnant for over a decade and later-in-life learning is flat or declining. 

I'm going to focus on England because it's the largest of the UK countries, but the story is similar in Scotland, Wales, and Northern Ireland.

A slice of PISA

The OECD has run PISA (Programme for International Student Assessment) every three years since 2000. It's an assessment of 15-year-old students' achievement in science, math, and reading across several countries. The idea is you can compare performance across nations across time using the same yardstick. Currently, 79 countries take part.

Of course, not every 15-year-old in every country takes the test. In 2018 in the UK, around 13,000 students took the test - which means there was sampling. Sampling implies there will be some uncertainty in the results, so small year-to-year changes may not be statistically significant. How statistical significance is calculated for a sample like this is well-known and I won't go into the theory here.

The press (and governments) give a lot of attention to the country league table of results, but I'm going to focus on the scores themselves.

Standing still

The National Foundation for Education Research in the UK produces summary results and more detailed results for England, Northern Ireland, Scotland, and Wales. I'm just going to focus on the results for England.

Here are the results for math, science, and reading. Note the y-axis and the fact I've zoomed in on the scores.

With the exception of math in 2009 and 2012, none of the results show a statistically significant change from the 2018 results. This is so important I'm going to say it again in another way. The performance of English 15-year-olds in math, science, and reading has not measurably changed from 2006 to 2018.

Let that sink in. Despite 12 years of government policy, despite 12 years of research, despite 12 years of the widespread adoption of computing technology in the classroom, English students' performance has not measurably changed when measured with a fair and consistent international test.

Not learning later in life

We live in a world where change is a constant. New technology is re-making old industries all the time. Whatever qualifications you have, at some point you'll need retraining. Are people in the UK learning post-formal education?

The Learning and Work Institute tracks participation in learning post-formal education; they have estimates of the fraction of the UK adult population that is currently taking some form of training. Here are the results from 1996 to 2019. The blue line is the fraction of the population currently taking some form of training, and the red line is the fraction of the population who have never taken any form of training course since leaving full-time education.

The rates of participation in lifelong learning are at best steady-state and at worst declining. The Institute breaks the data down by social class (important in the UK) and age when someone left full-time education (a proxy for their level of education). Unsurprisingly, participation rates in lifelong learning are higher for those with more education and they're lower the further down the social class scale you go.

Younger people have worse skills

In 2016, the OECD published a study on adult skills in England. The study was worrying. To quote from the report:

  • "In most countries, but not in England, younger people have stronger basic skills than the generation of people approaching retirement."
  • "In England, one-third of those aged 16-19 have low basic skills."
  • "England has three times more low-skilled people among those aged 16-19 than the best-performing countries like Finland, Japan, Korea and the Netherlands. Much of this arises from weak numeracy (and to a lesser extent literacy) performance on average."
  • "Around one in ten of all university students in England have numeracy or literacy levels below level 2."
  • "Most low-skilled people of working age are in employment."

For context, people with level 2 skills or below "struggle to estimate how much petrol is left in the petrol tank from a sight of the gauge, or not be able to fully understand instructions on a bottle of aspirin".

Education at a glance

The OECD summarized a lot of data for the UK as a whole in a 2020 summary. The results are a mixed bag: some good things and a lot of bad things. Here are a few cherry-picked quotes:

  • "In United Kingdom, the proportion of adults employed in the private sector and participating in job-related non-formal education and training not sponsored by the employer is low compared to other OECD and partner countries. (3 %, rank 31/36 , 2016)"
  • "In United Kingdom, the number of annual hours of participation of adults in formal education and training is comparatively low (169 %, rank 26/26 , 2016)"
  • "In United Kingdom, the share of capital expediture on primary education is one of the smallest among OECD and partner countries with available data. (3.3 %, rank 27/32 , 2017)"

Politics and the press

Education is a long-term process. If a government invests in education for 5-year-olds, it will be 10 years or more before the effects are apparent in exam results. Most education ministers only last a few years in the job and they want some kind of results quickly. This tends to focus policy on the short term.

In the UK, the government has tinkered with the qualification system for 16 and 18-year-olds. The net effect is to make it hard to compare results over time. For many years, average grades went up while many commentators were convinced standards were slipping, but of course, rising grades delivered the results politicians wanted.

The PISA results cut through all this and expose a system that's not improving. The politicians' response was to point at the country league tables and make vaguely positive comments about pupil achievements.

What about the decrease in adult education and training and the OECD report on skills? As far as I can tell, silence. I couldn't even find a decent discussion in the press.

Is there a way forward?

I don't think there are any easy answers for 15-year-olds, and certainly none that are quick and cheap. What might help is a more mature discussion of what's going on, with some honesty and openness from politicians. Less tinkering with the exam system would be a good idea.

For adult learning, I'm very skeptical of a way forward without a significant cultural change. I can see a huge and widening educational divide in the UK. Graduates are training and retraining, while non-graduates are not. This is not good for society.

Monday, July 12, 2021

What is beta in statistical testing?

\(\beta\) is \(\alpha\) if there's an effect

In hypothesis testing, there are two kinds of errors:

  • Type I - we say there's an effect when there isn't. The threshold here is \(\alpha\).
  • Type II - we say there's no effect when there really is an effect. The threshold here is \(\beta\).

This blog post is all about explaining and calculating \(\beta\).


The null hypothesis

Let's say we do an A/B test to measure the effect of a change to a website. Our control branch is the A branch and the treatment branch is the B branch. We're going to measure the conversion rate \(C\) on both branches. Here are our null and alternative hypotheses:

  • \(H_0: C_B - C_A = 0\) there is no difference between the branches
  • \(H_1: C_B - C_A \neq 0\) there is a difference between the branches

Remember, we don't know if there really is an effect, we're using procedures to make our best guess about whether there is an effect or not, but we could be wrong. We can say there is an effect when there isn't (Type I error) or we can say there is no effect when there is (Type II error).

Mathematically, we're taking the mean of thousands of samples, so the central limit theorem (CLT) applies and we expect the quantity \(C_B - C_A\) to be normally distributed. If there is no effect, then \(C_B - C_A = 0\); if there is an effect, \(C_B - C_A \neq 0\).
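If you want to see the CLT at work here, below is a minimal simulation sketch in Python (the tooling is my choice, and the conversion rate and sample size are made-up numbers): both branches share the same underlying conversion rate, and the measured difference comes out approximately normal around zero.

```python
# Simulate the null hypothesis: A and B have the same true conversion rate,
# so any measured difference is pure noise. Rate and sample size are
# illustrative assumptions, not numbers from the post.
import numpy as np

rng = np.random.default_rng(42)
n, c = 10_000, 0.06   # visitors per branch, true conversion rate

# 100,000 simulated A/B tests under the null.
diffs = (rng.binomial(n, c, size=100_000) - rng.binomial(n, c, size=100_000)) / n

print(diffs.mean())   # ~0: no effect on average
print(diffs.std())    # ~sqrt(2 * c * (1 - c) / n)
```

A histogram of `diffs` looks like the bell curve described above, centered on zero.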

\(\alpha\) in a picture

Let's assume there is no effect. We can plot out our expected probability distribution and define an acceptance region (blue, 95% of the distribution) and two rejection regions (red, 5% of the distribution). If our measured \(C_B - C_A\) result lands in the blue region, we will accept the null hypothesis and say there is no effect. If our result lands in the red region, we'll reject the null hypothesis and say there is an effect. The red region is defined by \(\alpha\).

One way of looking at the blue area is to think of it as a confidence interval around the mean \(\bar x_0\):

\[\bar x_0 + z_{\frac{\alpha}{2}} s \quad \text{and} \quad \bar x_0 + z_{1-\frac{\alpha}{2}} s \]

In this equation, \(s\) is the standard error in our measurement. The probability of a measurement \(x\) lying in this range is:

\[0.95 = P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ] \]

If we transform our measurement \(x\) to the standard normal \(z\), and we're using a 95% acceptance region (boundaries given by \(z\) values of 1.96 and -1.96), then we have for the null hypothesis:

\[0.95 = P[-1.96 < z < 1.96]\]
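You can check those boundaries and the 95% figure directly; here's a minimal sketch using Python and scipy (my choice of tools, nothing the post depends on):

```python
# The 95% acceptance region of the standard normal distribution.
from scipy import stats

alpha = 0.05
lower = stats.norm.ppf(alpha / 2)        # about -1.96
upper = stats.norm.ppf(1 - alpha / 2)    # about +1.96

# Probability that a standard normal value lands between the bounds.
print(lower, upper)
print(stats.norm.cdf(upper) - stats.norm.cdf(lower))   # 0.95
```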

\(\beta\) in a picture

Now let's assume there is an effect. How likely is it that we'll say there's no effect when there really is an effect? That probability is \(\beta\).

To draw this in pictures, I want to take a step back. We have two hypotheses:

  • \(H_0: C_B - C_A = 0\) there is no difference between the branches
  • \(H_1: C_B - C_A \neq 0\) there is a difference between the branches

We can draw a distribution for each of these hypotheses. Only one distribution will apply, but we don't know which one.



If the null hypothesis is true, the blue region is where our true negatives lie and the red region is where the false positives lie. The boundaries of the red/blue regions are set by \(\alpha\). The value of \(\alpha\) gives us the probability of a false positive.

If the alternate hypothesis is true, the true positives will be in the green region and the false negatives will be in the orange region. The boundary of the green/orange regions is set by \(\beta\). The value of \(\beta\) gives us the probability of a false negative.

Calculating \(\beta\)

Calculating \(\beta\) is calculating the orange area of the alternative hypothesis chart. The boundaries are set by \(\alpha\) from the null hypothesis. This is a bit twisty, so I'm going to say it again with more words to make it easier to understand.

\(\beta\) is about false negatives. A false negative occurs when there is an effect, but we say there isn't. When we say there isn't an effect, we're saying the null hypothesis is true. For us to say there isn't an effect, the measured result must lie in the blue region of the null hypothesis distribution.

To calculate \(\beta\), we need to know what fraction of the alternate hypothesis distribution lies in the acceptance region of the null hypothesis distribution.

Let's take an example so I can show you the process step by step.

  1. Assuming the null hypothesis, set up the boundaries of the acceptance and rejection regions. Assuming a 95% acceptance region and an estimated mean of \(\bar x_0\), this gives the acceptance region as:
    \[P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ] \] which is the mean and 95% confidence interval for the null hypothesis. Our measurement \(x\) must lie between these bounds.
  2. Now assume the alternate hypothesis is true. If the alternate hypothesis is true, then our mean is \(\bar x_1\).
  3. We're still using this equation from before, but this time, our distribution is the alternate hypothesis.
    \[P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ] \]
  4. Transforming to the standard normal distribution using the formula \(z = \frac{x - \bar x_1}{s}\), we can write the probability \(\beta\) as:
    \[\beta = P \left [ \frac{\bar x_0 + z_\frac{\alpha}{2} s - \bar x_1}{s} < z < \frac{ \bar x_0 + z_{1-\frac{\alpha}{2}} s - \bar x_1}{s} \right ] \]

This time, let's put some numbers in. 

  • \(n = 200,000\) (100,000 per branch)
  • \(C_B = 0.062\)
  • \(C_A =  0.06\)
  • \(\bar x_0= 0\) - the null hypothesis
  • \(\bar x_1 = 0.002\) - the alternate hypothesis
  • \(s = 0.00107\)  - this comes from combining the standard errors of both branches, so \(s^2 = s_A^2 + s_B^2\), and I'm using the usual formula for the standard error of a proportion, for example, \(s_A = \sqrt{\frac{C_A(1-C_A)}{n} }\)

Plugging them all in, this gives:
\[\beta = P[ -3.829 < z < 0.090]\]
which gives \(\beta = 0.536\)
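Here's the same worked example as a short Python sketch (scipy is my choice of tool; the numbers are the ones in the bullet list above):

```python
# Reproduce the beta calculation for the worked example.
import numpy as np
from scipy import stats

alpha = 0.05
n = 100_000                 # samples per branch
c_a, c_b = 0.06, 0.062      # conversion rates

# Standard errors of each proportion, combined in quadrature.
s_a = np.sqrt(c_a * (1 - c_a) / n)
s_b = np.sqrt(c_b * (1 - c_b) / n)
s = np.sqrt(s_a**2 + s_b**2)             # ~0.00107

x0 = 0.0                    # mean under the null hypothesis
x1 = c_b - c_a              # mean under the alternate hypothesis (0.002)

# Acceptance region boundaries under the null hypothesis...
lo = x0 + stats.norm.ppf(alpha / 2) * s
hi = x0 + stats.norm.ppf(1 - alpha / 2) * s

# ...expressed as z-scores under the alternate hypothesis.
z_lo = (lo - x1) / s                     # ~ -3.83
z_hi = (hi - x1) / s                     # ~  0.09

beta = stats.norm.cdf(z_hi) - stats.norm.cdf(z_lo)
print(round(beta, 3))                    # ~0.536
```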

This is too hard

This process is complex and involves lots of steps. In my view, it's too complex. It feels to me that there must be an easier way of constructing tests. Bayesian statistics holds out the hope for a simpler approach, but widespread adoption of Bayesian statistics is probably a generation or two away. We're stuck with an overly complex process using very difficult language.

Reading more

Tuesday, July 6, 2021

Spritely fraud detection

Scientific fraud and business manipulation

Sadly, there's a long history of scientific fraud and misrepresentation of data. Modern computing technology has provided better tools for those trying to mislead, but the fortunate flip side is that modern tools also provide ways of exposing misrepresented data. It turns out, the right tools can indicate what's really going on.

(Author: Nick Youngson. License: Creative Commons. Source: Wikimedia)

In business, companies often say they can increase sales, or reduce costs, or do some other desirable thing. The evidence is sometimes in the form of summary statistics like means and standard deviations. Do you think you could assess the credibility of evidence based on the mean and standard deviation summary data alone?

In this blog post, I'm going to talk about how you can use one tool to investigate the credibility of mean and standard deviation evidence.

Discrete quantities

Discrete quantities are quantities that can only take discrete values. An example is a count, for example, a count of the number of sales. You can have 0, 1, 2, 3... sales, but you can't have -1 sales or 563.27 sales.

Some business quantities are measured on scales of 1 to 5 or 1 to 10, for example, net promoter scores or employee satisfaction scores. These scales are often called Likert scales.

For our example, let's imagine a company is selling a product on the internet and asks its customers how likely they are to recommend the product. The recommendation is on a scale of 0 to 10, where 0 is very unlikely to recommend and 10 is very likely to recommend. This is obviously based on the net promoter idea, but I'm simplifying things here.

Very unlikely to recommend                   Very likely to recommend
0 1 2 3 4 5 6 7 8 9 10


Imagine the salesperson for the company tells you the results of a 500-person study are a mean of 9 and a standard deviation of 2.5. They tell you that customers love the product, but obviously, there's some variation. The standard deviation shows you that not everyone's satisfied and that the numbers are therefore credible.

But are these numbers really credible?

Stop for a second and think about it. It's quite possible that their customers love the product. A mean of 9 on a scale of 10 isn't perfection, and the standard deviation of 2.5 suggests there is some variation, which you would expect. Would you believe these numbers?

Investigating credibility

We have three numbers: a mean, a standard deviation, and a sample size. Lots of different distributions could have given rise to these numbers, so how can we backtrack to the original data?

The answer is, we can't fully backtrack, but we can investigate possibilities.

In 2018, a group of academic researchers in the Netherlands and the US released software you can use to backtrack to possible distributions from mean and standard deviation data. Their goal was to provide a tool to help investigate academic fraud. They wrote up how their software works and published it online; you can read their write-up here. They called their software SPRITE (Sample Parameter Reconstruction via Iterative TEchniques) and made it open-source, even going so far as to make a version of it available online. The software will show you the possible distributions that could give rise to the summary statistics you have.

One of the online versions is here. Let's plug in the salesperson's numbers to see if they're credible. 

If you go to the SPRITE site, you'll see a menu on the left-hand side. In my screenshot, I've plugged in the numbers we have:

  • Our scale goes from 0 to 10.
  • Our mean is 9.
  • Our standard deviation is 2.5.
  • The number of samples is 500.
  • We'll choose 2 decimal places for now.
  • We'll just see the top 9 possible distributions.

Here are the top 9 results.

Something doesn't smell right. I would expect the data to show a more even distribution about the mean. For a mean of 9, I would expect there to be a number of 10s and a number of 8s too. These estimated distributions suggest that almost everyone is deliriously happy, with just a small handful of people unhappy. Is this credible in the real world? Probably not.

I don't have outright evidence of wrongdoing, but I'm now suspicious of the data. A good next step would be to ask for the underlying data. At the very least, I should view any other data the salesperson provides with suspicion. To be fair to the salesperson, they were probably provided with the data by someone else.

What if the salesperson had given me different numbers, for example, a mean of 8.5, a standard deviation of 1.2, and 100 samples? Looking at the results from SPRITE, the possible distributions seem much more likely. Yes, misrepresentation is still possible, but on the face of it, the data is credible.

Did you spot the other problem?

There's another, more obvious problem with the data. The scale is from 0 to 10, but the results are a mean of 9 and a standard deviation of 2.5, so one standard deviation either side of the mean covers the range 6.5 to 11.5. To state the obvious, the maximum score is 10, but the upper end of that range is 11.5. This type of mistake is very common and doesn't of itself indicate fraud. I'll blog more about this type of mistake later.
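You don't need SPRITE to spot this second problem; a couple of lines of arithmetic will do. Here's a minimal sketch (the check itself is mine; the scale, mean, and standard deviation are the numbers quoted above):

```python
# Quick plausibility check: does the quoted spread even fit on the scale?
scale_min, scale_max = 0, 10
mean, sd = 9.0, 2.5

low, high = mean - sd, mean + sd
print(low, high)   # 6.5, 11.5 - the top end is above the maximum possible score

if high > scale_max or low < scale_min:
    print("Implied spread runs off the end of the scale - worth a closer look")
```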

What does this mean?

Due diligence is about checking claims for veracity before spending money. If there's a lot of money involved, it behooves the person doing the due diligence to check the consistency of the numbers they've been given. Tools like SPRITE are very helpful for sniffing out areas to check in more detail. Just because a tool like SPRITE flags something, it doesn't mean there's fraud; people make mistakes with statistics all the time. However, if something is flagged, you need to get to the bottom of it.

Other ways of detecting dodgy numbers 

Finding out more

Monday, June 21, 2021

Unknown Pleasures: pulsars, pop, and plotting

The echoes of history

Sometimes, there are weird connections between very different cultural areas and we see the echoes of history playing out. I'm going to tell you how pulsars, Nobel Prizes, an iconic album cover, Nazi atrocities, and software chart plotting all came to be connected.

The discovery of pulsars

In 1967, Jocelyn Bell was working on her Ph.D. as a post-graduate researcher at the Mullard Radio Astronomy Observatory, near Cambridge in the UK. She had helped build a new radio telescope and now she was operating it. On November 28, 1967, she saw a strikingly unusual and regular signal, which the team nicknamed "little green men". The signal turned out to be a pulsar, a type of star new to science.

This was an outstanding discovery that shook up astronomy. The team published a paper in Nature, but that wasn't the end of it. In 1974, the Nobel committee awarded the Nobel Prize in Physics to the team - to everyone on the team except Jocelyn Bell.

Over the years, there's been a lot of controversy over the decision, with many people thinking she was robbed of her share of the prize, either because she was a Ph.D. student or because she was a woman. Bell herself has been very gracious about the whole thing; she is indeed a very classy lady.

The pulsar and early computer graphics

In the late 1960s, a group of Ph.D. students from Cornell University were analyzing data from the pulsar Bell discovered. Among them was Harold Craft, who used early computer systems to visualize the results. Here's what he said to the Scientific American in 2015: "I found that it was just too confusing. So then, I wrote the program so that I would block out when a hill here was high enough, then the stuff behind it would stay hidden."

Here are three pages from Craft's Ph.D. thesis. Take a close look at the center plot. If Craft had made every line visible, it would have been very difficult to see what was going on. Craft re-imagined the data as if he were looking at it at an angle, for example, as if it were a mountain range ridgeline he was looking down on.  With a mountain ridgeline, the taller peaks hide what's behind them. It was a simple idea, but very effective.

(Credit: JEN CHRISTIANSEN/HAROLD D. CRAFT)

The center plot is very striking. So striking in fact, that it found its way into the Cambridge Encyclopaedia of Astronomy (1977 edition):

(Cambridge Encyclopedia of Astronomy, 1977 edition, via Tim O'Riley)

Joy Division

England in the 1970s was not a happy place, especially in the de-industrialized north. Four young men in Manchester had formed a band and recorded an album. The story goes that one of them, Bernard Sumner, was working in central Manchester and took a break in the city library. He came across the pulsar image in the encyclopedia and liked it a lot.

The band needed an image for their debut album, so they selected this one. They gave it to a recently graduated designer called Peter Saville, with the instructions it was to be a black-on-white image. Saville felt the image would look better white-on-black, so he designed this cover.

This is the iconic Unknown Pleasures album from Joy Division.  

The starkness of the cover, without the band's name or the album's name, set it apart. The album itself was critically acclaimed, but it never rose high in the charts at the time. However, over time, the iconic status of the band and the album cover grew. In 1980, the lead singer, Ian Curtis, committed suicide. The remaining band members formed a new band, New Order, that went on to massive international fame.

By the 21st century, versions of the album cover were on beach towels, shoes, and tattoos.

Joy plots

In 2017, Claus Wilke created a new charting library for R, ggjoy.  His package enabled developers to create plots like the famous Unknown Pleasures album cover. In honor of the album cover, he called these plots joy plots.

Ridgeline plots

This story has a final twist to it. Although joy plots sound great, there's a problem.

Joy Division took their name from a real Nazi atrocity fictionalized in a book called House of Dolls. In some of their concentration camps, the Nazis forced women into prostitution. The camp brothels were called Joy Division.

The name joy plots was meant to be fun and a callback to an iconic data visualization, but there's little joy in evil. Given this history, Wilke renamed his package ggridges and the plots ridgeline plots. 

Here's an example of the great visualizations you can produce with it. 

If you search around online, you can find people who've re-created the pulsar image using ggridges.

It's not just R programmers who are playing with Unknown Pleasures, Python programmers have got into the act too. Nicolas P. Rougier created a great animation based on the pulsar data set using the venerable Matplotlib plotting package - you can see the animation here.

If you liked this post

You might like these ones:

Monday, June 14, 2021

Confidence, significance, and p-values

What is truth?

Statistical testing is ultimately all about probabilities and thresholds for believing an effect is there or not. These thresholds and associated ideas are crucial to the decision-making process but are widely misunderstood and misapplied. In this blog post, I'm going to talk about three testing concepts: confidence, significance, and p-values; I'll deal with the hugely important topic of statistical power in a later post.

(peas - not p-values. Author: Malyadri, Source: Wikimedia Commons, License: Creative Commons)

Types of error

To simplify, there are two kinds of errors in statistical testing:

  • Type I - false positive. You say there's an effect when there isn't. This is decided by a threshold \(\alpha\), usually set to 5%. \(\alpha\) is called significance.
  • Type II - false negative. You say there isn't an effect but there is. This is decided by a threshold \(\beta\) but is usually expressed as the statistical power which is \(1 - \beta\).

In this blog post, I'm going to talk about the first kind of error, Type I.

Distribution and significance

Let's imagine you're running an A/B test on a website and you're measuring conversion on the A branch (\(c_A\)) and on the B branch (\(c_B\)). The null hypothesis is that there's no effect, which we can write as:

\[H_0: c_A - c_B = 0\]

This next piece is a little technical, but bear with me. Tests of conversion are usually large tests (mostly > 10,000 samples in practice). The conversion rate is the mean conversion for all website visitors. Because there are a large number of samples, and we're measuring a mean, the Central Limit Theorem (CLT) applies, which means the mean conversion rates will be normally distributed. By extension from the CLT, the quantity \( c_A - c_B\) will also be normally distributed. If we could take many measurements of \( c_A - c_B\) and the null hypothesis were true, we would theoretically expect the results to look something like this.

Look closely at the chart. Although I've cut off the x-axis, the values go off to \(\pm \infty\). If all values of \( c_A - c_B\) are possible, how can we reject the null hypothesis and say there's an effect?

Significance - \(\alpha\)

To decide if there's an effect there, we use a threshold. This threshold is referred to as the level of significance and is called \(\alpha\). It's usually set at the 0.05 or 5% level. Confusingly, sometimes people refer to confidence instead, which is 1 - significance, so a 5% significance level corresponds to a 95% confidence level.

In the chart below, I've colored the 95% region around the mean value blue and the 5% region (2.5% at each end) red. The blue region is called the acceptance region and the red region is called the rejection region.

What we do is compare the measurement we actually make with the chart. If our measurement lands in the red zone, we decide there's an effect (reject the null); if our measurement lands in the blue zone, we decide there isn't an effect (fail to reject the null or 'accept' the null).

One-sided or two-sided tests

On the chart with the blue and the red region, there are two rejection (red) regions. This means we'll reject the null hypothesis if we get a value that's more than a threshold above or below our null value. In most tests, this is what we want; we're trying to detect if there's an effect there or not and the effect can be positive or negative. This is called a two-sided test because we can reject the null in two ways (too negative, too positive).

But sometimes, we only want to detect if the treatment effect is bigger than the control. This is called a one-sided test. Technically, the null hypothesis in this case is:

\[H_0: c_A - c_B \leq 0\]

Graphically, it looks like this:

So we'll reject the null hypothesis only if our measured value lands in the red region on the right. Because there's only one rejection region and it's on one side of the chart, we call this a one-sided test.

p-values

I've very blithely talked about measured values landing in the rejection region or not. In practice, that's not what we do; in practice, we use p-values.

Let's say we measured some value x. What's the probability we would measure this value if the null hypothesis were true (in other words, if there were no effect)? Technically, zero because the distribution is continuous, but that isn't helpful. Let's try a more helpful form of words. Assuming the null hypothesis is true, what's the probability we would see a value of x or a more extreme value? Graphically, this looks something like the green area on the chart below.

Let me say again what the p-value is. If there's no effect at all, it's the probability we would see the result we measured (or a more extreme result); in other words, it's how likely it is that our measurement could have been due to chance alone.

The chart below shows the distribution assuming the null hypothesis is true; it shows the acceptance region (blue), the rejection region (red), and the measurement p-value (green). The measured value lands in the rejection region, so we'll reject the null hypothesis in this case. If the green region overlapped with the blue region, we would accept (fail to reject) the null hypothesis.
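To make this concrete, here's a minimal sketch of the p-value calculation for the conversion example (the sample size and conversion rates are made-up numbers, and scipy is just my choice of tool):

```python
# Two-sided and one-sided p-values for a large-sample test of c_B - c_A.
import numpy as np
from scipy import stats

n = 10_000                   # visitors per branch (illustrative)
c_a, c_b = 0.060, 0.066      # measured conversion rates (illustrative)

# Standard error of the difference between two proportions.
se = np.sqrt(c_a * (1 - c_a) / n + c_b * (1 - c_b) / n)
z = (c_b - c_a) / se

# Two-sided: a result at least this extreme in either direction.
p_two_sided = 2 * stats.norm.sf(abs(z))
# One-sided: only "is B bigger than A?" counts against the null.
p_one_sided = stats.norm.sf(z)

print(z, p_two_sided, p_one_sided)
```

If the two-sided p-value comes out below \(\alpha\) (0.05), the measurement is in the rejection region and we reject the null.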

Misunderstandings

There are some common misunderstandings around testing that can have quite bad commercial effects.

  • 95% confidence is too high a bar - we should drop the threshold to 90%. In effect, this means you'll accept a lot of changes that have no effect. This will reduce the overall effectiveness of your testing program (see this prior blog post for an explanation).
  • One-sided tests are OK and give a smaller sample size, so we should use them. This is true, but it's often important to determine if a change is having a negative effect. In general, a hypothesis test tests a single hypothesis, but sadly, people try to read more into test results than they should and want to answer several questions with a single test.
  • p-values represent the probability of an effect being present. This is just not true at all.
  • A small p-value indicates a big effect. p-values do not make any indication about the size of an effect; a low p-value does not mean there's a big effect.

Practical tensions

In practice, there can be considerable tension between business and technical people over statistical tests. A lot of statistical practice (e.g. 5% significance levels, two-sided testing) is based on experience built up over a long time. Unfortunately, this all sounds very academic to the business person who needs results now and wants to take shortcuts. Sadly, in the long run, shortcuts always catch up with you. There's an old saying that's very true: "there ain't no such thing as a free lunch."

Tuesday, June 8, 2021

Management and technical career tracks: separate but not equal

Promotion paths for technical people

I’ve worked in technology companies and I’ve seen the same question arise several times: what to do with technical people who don’t want to be managers? What do you promote them to?

Managers and technical tracks are not equal
(Image credit: Louis-Henri de Rudder, source: Old Book Illustrations)

The traditional engineering career ladder emphasizes management as the desired end-goal and devalues anyone not in a management position. Not everyone wants to be a manager and not everyone is good at management. Some people are extremely technically competent and want to stay technical. What are they offered?

Separate, but not equal

Most companies deal with the problem by creating a parallel career path for engineers who don’t want to be managers. This is supposedly separate but equal, but it always ends up being very unequal in the management branch’s favor. The inequality is reflected in job titles. Director is a senior position in most companies and it comes with management responsibility. The equivalent technical role might be ‘Fellow’, which has overtones of putting someone out to grass. A popular alternative is ‘Technical Director’, but note the management equivalent is Director - the engineers get a qualifying word the managers don’t; it’s letting people know the person isn’t a real Director (they're technically a Director, but...). Until you get to VP or C-level, the engineering titles are always worse.

The management and technical tracks have power differences too. The managers get to decide pay raises and promotions and hiring and firing, the technical people don’t. Part of this is obviously why people choose management (and why some people don’t choose management), but often the technical path people aren’t even given a seat at the table. When there are business decisions to be made, the technical people are usually frozen out, even when the business decisions aren't people ones. Sometimes this is legitimate, but most of the time it’s a power thing. The message is clear: if you want the power to change things, you need to be a manager.

A way forward

Here’s what I suggest. The managerial/technical divide is a real one. Not everyone wants to be a manager and there should be a career path upward for them. I suggest having the same job titles for the managerial path and the technical path. There should be no Technical Directors and Directors, just Directors. People on the technical path should be given a seat at the power table and should be equals when it comes to making business decisions. This means managers will have to give up power and it will mean a cultural shift, but if we’re to give meaningful advancement to the engineering track, this is the way it has to be.

Sunday, May 23, 2021

Why A/B tests don't add up

All the executives laughed

A few years ago, I was at an industry event. The speaker was an executive talking about his A/B testing program. He joked that vendors and his team were unreliable because the overall result was less than the sum of the individual tests. Everyone laughed knowingly.

But we shouldn't have laughed.

The statistics are clear and he should have known better. By the rules of the statistical game, the benefits of an A/B program will be less than the sum of the parts and I'm going to tell you why.

Thresholds and testing

An individual A/B test is a null hypothesis test with thresholds that decide the result of the test. We don't know whether there is an effect or not; we're making a decision based on probability. There are two important threshold numbers:

  • \(\alpha\) - also known as significance and usually set around 5%. If there really is no effect, \(\alpha\) is the probability we will say there is an effect. In other words, it's the false positive rate (Type I errors).
  • \(\beta\) - is usually set around 20%. If there really is an effect, \(\beta\) is the probability we will say there is no effect. In other words, it's the false negative rate (Type II errors). In practice, power is used instead of \(\beta\), power is \(1-\beta\), so it's usual to set the power to 80%.

Standard statistical practice focuses on just a single test, but an organization's choice of \(\alpha\) and \(\beta\) affect the entire test program.

\(\alpha\), \(\beta\) and the test program

To see how the choice of \(\alpha\) and \(\beta\) affect the entire test program, let's run a simplified thought experiment. Imagine we choose  \(\alpha = 5\%\) and \(\beta = 20\%\), which are standard settings in most organizations. Now imagine we run 1,000 tests, in 100 of them there's a real effect and in 900 of them there's no effect. Of course, we don't know which tests have an effect and which don't.

Take a second to think about these questions before moving on:

  • How many positive test results will we measure?
  • How many false positives will we see?
  • How many true positives will we see?

At this stage, you should have numbers in mind. I'm asking you to do this so you understand the importance of what happens next.

The logic to answer these questions is straightforward. In the picture below, I've shown how it works, but I'll talk you through it so you can understand it in more detail.

Of the 1,000 tests, 100 have a real effect. These are the tests that \(\beta\) applies to and \(\beta=20\%\), so we'll end up with:

  • 20 false negatives, 80 true positives

Of the 1,000 tests, 900 have no effect. These are the tests that \(\alpha\) applies to and \(\alpha=5\%\), so we'll end up with:

  • 855 true negatives, 45 false positives

Overall we'll measure:

  • 125 positives made up of
  • 80 true positives
  • 45 false positives

Crucially, we won't know which of the 125 positives are true and which are false.

Because this is so important, I'm going to lay it out again: in this example, 36% of the test results we thought were positive are wrong, but we don't know which ones they are. They dilute the results of the program as a whole, so the overall benefit of the test program will be less than the sum of the individual test results.
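The arithmetic of the thought experiment fits in a few lines of Python; here's a sketch (the 1,000 tests and the 100/900 split are the assumptions above, not real data):

```python
# The thought experiment: how many measured "wins" are real?
alpha, beta = 0.05, 0.20
n_tests = 1_000
n_real_effect = 100
n_no_effect = n_tests - n_real_effect

true_positives = n_real_effect * (1 - beta)   # 80
false_negatives = n_real_effect * beta        # 20
false_positives = n_no_effect * alpha         # 45
true_negatives = n_no_effect * (1 - alpha)    # 855

measured_positives = true_positives + false_positives    # 125
print(false_positives / measured_positives)               # 0.36 - over a third of "wins" are noise
```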

What happens in reality

In reality, you don't know what proportion of test results are 'true'. It might be 10%, or 20%, or even 5%. Of course, the reason for the test is that you don't know the result. What this means is, it's hard to do this calculation on real data, but the fact that you can't easily do the calculation doesn't mean the limits don't apply.

Can you make things better?

To get a higher proportion of true positives, you can do at least three things.

  • Run fewer tests - selecting only tests where you have a good reason to believe there is a real effect. This would certainly work, but you would forgo a lot of the benefits of a testing program.
  • Run with a lower \(\alpha\) value. There's a huge debate in the scientific community about significance levels. Many authors are pushing for a 0.5% level instead of a 5% level. So why don't you just lower \(\alpha\)? Because the sample size will increase greatly (the sketch after this list shows by how much).
  • Run with a higher power (lower \(\beta\)). Using a power of 80% is "industry standard", but it shouldn't be - in another blog post I'll explain why. People don't do it because of test duration - increasing the power increases the sample size.
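To see why lowering \(\alpha\) or raising the power is expensive, here's a rough sketch using the standard large-sample approximation for the sample size of a two-proportion test (the baseline conversion rate and lift are made-up numbers for illustration):

```python
# Approximate sample size per branch for a two-sided test of two proportions.
from scipy import stats

def n_per_branch(p_a, p_b, alpha, power):
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return (z_alpha + z_power) ** 2 * variance / (p_b - p_a) ** 2

p_a, p_b = 0.060, 0.062   # hypothetical baseline and treatment conversion rates

print(round(n_per_branch(p_a, p_b, alpha=0.05,  power=0.80)))   # ~225,000
print(round(n_per_branch(p_a, p_b, alpha=0.005, power=0.80)))   # ~381,000
print(round(n_per_branch(p_a, p_b, alpha=0.05,  power=0.90)))   # ~301,000
```

Tightening either threshold pushes the required sample size (and so the test duration) up sharply.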

Are there other ways to get results? Maybe, but none that are simple. Everything I've spoken about so far uses a frequentist approach. Bayesian testing offers the possibility of smaller test sizes, meaning you could increase power and reduce \(\alpha\) while still maintaining workable sample sizes. Of course, A/B testing isn't the only testing method available and other methods offer higher power with lower sample sizes.

No such thing as a free lunch 

Like any discipline, statistical testing comes with its own rules and logic. There are trade-offs to be made and everything comes with a price. Yes, you can get great results from A/B testing programs, and yes companies have increased conversion, etc. using them, but all of them invested in the right people and technical resources to get there and all of them know the trade-offs. There's no such thing as a free lunch in statistical testing.