Monday, August 16, 2021

The seven dysfunctionalities of management books

The problems with popular management books

Over the years, I've read many management books ranging from the excellent to the terrible. I've noticed several dysfunctionalities that creep into even some of the best books. I'm going to list them out in what I think is their order of importance. See what you think.

The seven dysfunctionalities

My idea is worth 30 pages, I'll write 300

With few exceptions, most books fall into this trap. The author could express their ideas in a few pages and provide supporting evidence that would fill a few pages more. Of course, the economics of books means they can't. There's no market and no money in a 30-page pamphlet (when was the last time you paid $20 for 30 pages?) but there's a huge market for books. The logic is clear: spin out your idea to book-length and make some money.

This is a little odd for two reasons:

  • Business writing emphasizes brevity and getting to the point quickly - neither of which management books usually do.
  • No one has disrupted the market. Maybe our business culture and market economics mean disruption is impossible?

What I say is important, I worked with important people at important companies

This is a relatively new dysfunction. The author claims their work is important, not because of its widespread adoption, or because many people had success with it, but because they held senior positions at well-known companies in Silicon Valley. Usually, these books have lots of stories of famous people, some of which offer insight and some of which don't. In a few cases, the storytelling degenerates into name-dropping.

My evidence will be stories or bad data

The plural of anecdote is not data. Why should I believe your experience generalizes to me? Storytelling is important, but it doesn't amount to a coherent management framework. According to the esteemed Karl Popper, science is about making falsifiable statements - what falsifiable statements do stories make?

The other form of dysfunctional evidence is bad data. The problems here are usually regression to the mean, small sample sizes, or a misunderstanding of statistics. There are examples of management gurus developing theories of winning companies but whose theories were proved wrong almost as soon as the ink was dry on their books. This might be why newer books focus on storytelling instead.

I'll write a worse sequel and then an even worse sequel to that

Even the best authors fall prey to this trap. They publish a best-selling book and the temptation is there to write a sequel. The second book is usually so-so, but might sell well. So they write a third book which is even worse, and so on.

I'll create new words for old ideas

Here the author rediscovers psychology or sociology that's been known for decades. Sometimes, they'll admit it and provide a new twist on old ideas; but sometimes it's just known ideas repackaged. In any case, the author usually creates a trendy buzzy phrase for their idea, preferably one they can trademark for their consultancy practice.

I'll talk about my time in the military

The military does have some very interesting things to teach managers. Unfortunately, most of the military books for business management focus on events without providing much in the way of context for what happened and why. When they explain how it can be used in a civilian setting, it feels clunky and unconvincing. These military books also tend to focus on successes and brush over failures (if they mention any at all). This is sad because I've read some really great older military management books that have something to offer today's managers.

I'll push my consulting company

This is the original sin and the cause of many of the other sins. After the success of their book, the author forms a consultancy company. They create a 2nd edition that includes cherry-picked success stories from their consulting company, or maybe they write a second book with anecdotes from their consulting work. The book then becomes a 'subtle' promo for their consulting work.

Don't throw the baby out with the bathwater

I'm not saying that popular business management books have no value, I'm saying very few of them will have value in ten years' time when the hype has passed. Think back to the business books published ten or twenty years ago. How many stand up now? 

Despite the faddish nature of the genre, most business management books have the core of some good ideas; you just have to wade through the nonsense to get there.

What should you do?

Every manager needs a framework for decision-making. My suggestion is to get that framework from training and courses and not popular business books. Use quotes to get some extra insight. Management business books are useful for a refresher of core ideas, even if you have to wade through 300 pages instead of 30. If nothing else, the popular books are a handy guide to what your peers are reading now.

Monday, August 9, 2021

Criminal innovations: narco-subs

How do you transport lots of drugs internationally without getting caught?

The United States is one of the world's largest consumers of illegal drugs but the majority of the illegal drugs it consumes are manufactured in South America. Illegal drug producers need to transport their product northwards at the lowest price while evading detection. They've tried flying, but radar and aircraft have proved effective at stopping them, and they've tried boats, but coastguard patrols and radar have again stopped them. If you can't go over the water, and you can't go on the water, then how about going under the water? Drug cartels have turned to submarines and their variants for stealthy transportation. These submarines go by the generic name of narco-subs. As we'll see, it's not just the South Americans who are building submarines for illegal activities.

South American narco-subs

The experts on transporting drugs long distances by sea are the South American drug cartels; they've shown an amazing amount of innovative thinking over the years. Currently, they're using three main types of craft: low-profile vessels, submarines, and torpedoes. Low-profile vessels and submarines typically have small crews of 2-4 people, while torpedoes are uncrewed.

Low-profile vessels (LPVs)

To avoid radar and spotter planes, the cartels have turned to stealth technology; they've designed boats that have a very low radar cross-section with the smallest possible above-the-sea structures. 

(A low-profile vessel that was intercepted. Image source: US Customs and Border Protection.)
(Another low-profile vessel. Image source: US Customs and Border Protection.)

These vessels originally started as variations on existing commercial speedboats, with modifications to make them run lower in the water. Now, they're custom designs, typically long and thin, designed to pierce waves rather than ride over them. A typical newer LPV might be 3m wide by 30m long - quite a long vessel, but very narrow. H.I. Sutton describes several types of LPV in his Forbes article.

Submarines

There are various types of narco-subs, ranging from semi-submersibles to full-on submarines.

Semi-submersibles ride just below the surface, typically at snorkel depth. This image of a 2019 semi-submersible captured off Peru gives you the general idea.

(Semi-submersible narco-sub, Peru, 2019. Image source: Wikimedia Commons.)

The vessel is plainly based on a 'standard' boat and is designed to run just under the water. The very few above-surface structures make the vessel hard to spot with radar, or even from the air.

The Peruvian vessel is plainly a modified boat, but custom-built vessels exist. Here's an image of one custom semi-submersible used by Colombian drug smugglers just before its capture in 2007. The blue paint job is camouflage.

(Semi-submersible narco-sub caught in 2007. Image source: Wikimedia Commons.)

This September 2019 image shows USCG boarding a 12m semi-submersible in the eastern Pacific. It had a crew of 4 and was carrying $165mn in cocaine.

(Source: Navy Times)

The drug cartels have created true submarines capable of traveling under the water to depths of a few hundred feet. Some of these submarines have even reached the astonishing length of 22m, making them comparable to midget submarines used by the world's navies (see Covert Shores comparison). 

In 2010, this 22 m-long monster was discovered in the Ecuadorian jungle. NPR has a long segment on how it was found and what happened next. The sub is estimated to have a range of 6,800 nautical miles and a dive depth of 62 feet. These numbers aren't impressive by military standards but bear in mind, this sub is designed for stealth, not warfare.

(22m long, fully submersible narco-sub. Image source: Wikimedia Commons.)

This isn't even the largest sub found, Hannah Stone reports on one narco-sub with a length of 30m, a crew of 4, air conditioning, and a small kitchen!

In November 2019, a narco-sub was caught in Galicia in Spain. Although the design was nothing new, its origin was. Authorities believe it started its journey in Brazil, crossing the Atlantic ocean to get to Spain (Covert Shores). This vessel was a semi-submersible design.

Bear in mind, all these submarines were built surreptitiously, often far away from population centers, which means no cutting-edge machine tools or precision parts and limited material supply. The subs are often constructed using wood and fiberglass - not special-purpose alloys.

Torpedoes

This is a relatively new innovation. Torpedoes are submersible vessels typically towed behind fishing vessels or other ships. If the ship is intercepted, the torpedo is cut loose; after a period of time, it sends a camouflaged marker to the surface, allowing the torpedo to be retrieved once the authorities have gone.

This article on Insight Crime describes how torpedoes work in practice.

European narco-subs

It's not just the South Americans who are creating narco-subs, the Europeans are at it too. In February 2020, Spanish police raided a warehouse in Málaga where they found a very sophisticated narco-sub under construction. This is a well-constructed vessel, using hi-tech parts imported from countries around Europe. The paint job isn't accidental either - it's all about stealth.


(Image source: Europol)

Covert Shores reports that this is the fourth narco-sub caught in Spain.

Transporting cars illegally

So far, I've focused on narco-subs and drug trafficking, but similar technology has been used for other criminal activities. In China, Armored Stealth Boats have been used to traffic stolen luxury cars. The whole thing seems to be so James Bond, it can't be true, but it is. Covert Shores has an amazing article and images on the whole thing.

Some disturbing thoughts

There's a tremendous amount of risk-taking going on here; how many of these subs end up at the bottom of the sea? On the flip side, how many are getting through undetected? Of course, if large amounts of drugs can be transported this way, what about other contraband? Many of these subs are constructed with relatively primitive equipment and materials. What could a rogue nation-state do with up-to-date machine tools and modern materials?

Innovation - but for the wrong ends

All this innovation is amazing. The idea of constructing a submarine in the jungles of South America with limited materials and piloting it across the Atlantic is incredible. The sad thing is, all this creative effort is in support of criminal activity. It would be great if this get-up-and-go could be directed at something that benefits people instead. It seems to me that the fundamental problem is the economic incentive system - drugs pay well and there are few alternatives in the jungle. 

Reading more

The expert on narco-subs, and indeed on many OSINT aspects of naval warfare, is H.I. Sutton, who produces the website Covert Shores. If you want more detail on narco-subs, his site is the place to start.

USNI covers stories on narco-subs and other naval topics.

"Narco-Submarines: Specially Fabricated Vessels Used for Drug Smuggling Purposes" is a little old, but it's still good background reading.

Monday, August 2, 2021

Poleaxed opinion polls: the ongoing 2020 disaster

Why the polls failed in the US Presidential Election of 2020

In the wake of the widespread failure of opinion polls to accurately predict the outcome of the 2020 US Presidential election, the American Association for Public Opinion Research (AAPOR) commissioned a study to investigate the causes and make recommendations. Their findings were recently released.

(This is the key question for 2020 opinion pollsters. The answer is yes, but they don't know why. Image source: Wikimedia)

Summary of AAPOR's findings

I've read the report and I've dug through the findings. Here's my summary:

  1. The polls overstated support for Democratic candidates.
  2. We don't really know why.
  3. Er... that's it.

Yes, I'm being harsh, but I'm underwhelmed by the report and I find some of the statements in it unconvincing. I'll present some of their main findings and talk through them. I encourage you to read the report for yourself and reach your own conclusions.

(We don't know why we didn't get the results right.)

Factors they ruled out for 2020

  • Late-breaking changes in favor of Republican candidates. This happened in 2016 but didn't happen in 2020. The polls were directionally consistent throughout the campaign.
  • Weighting for education. In 2016, most polls didn't weight for education and education did seem to be a factor. In 2020, most polls did weight for education. Educational weighting wasn't a factor.
  • Pollsters got the demographics wrong. Pollsters don't use simple random sampling; they often use stratified sampling based on demographics. There's no evidence that errors in demographics led to widespread polling errors in 2020.
  • People were afraid to say they voted for Trump. In races not involving Trump, the opinion polls were still wrong and still favored Democratic candidates. Trump wasn't the cause.
  • Intention to vote vs. actually voting. The results can't be explained by people who said they were going to vote but didn't. For example, if Democratic supporters said they were going to vote Democratic and then didn't actually vote, this would explain the error, but it didn't happen.
  • Proportion of early voters or election day voters. Early voting/election day voting didn't make a difference to the polling error.

Factors they couldn't rule out

  • Republican voters chose not to take part in surveys at a higher rate than Democratic voters.
  • The weighting model used to adjust sampling may have been wrong. Pollsters use models of the electorate to adjust their results. If these models are wrong, the results will be biased.
  • Many more people voted in 2020 than in 2016 ("new voters" in the report) - maybe pollsters couldn't model these new voters very well.

Here's a paragraph from the report:

"Unfortunately, the ability to determine the cause or causes of polling error in 2020 is limited by the available data. Unless the composition of the overall electorate is known, looking only at who responded says nothing about who did not respond. Not knowing if the Republicans (or unaffiliated voters, or new voters) who responded to polls were more supportive of Biden than those who did not respond, for example, it is impossible to identify the primary source of polling error."

Let me put that paragraph another way: we don't have enough data to investigate the problem so we can't say what went wrong.

Rinse and repeat - or just don't

I'm going to quote some sentences from the report's conclusions and comments:

  • "Considering that the average margin of error among the state-level presidential polls in 2020 was 3.9 points, that means candidate margins smaller than 7.8 points would be difficult to statistically distinguish from zero using conventional levels of statistical significance. Furthermore, accounting for uncertainty of statistical adjustments and other factors, the total survey error would be even larger."
  • "Most pre-election polls lack the precision necessary to predict the outcome of semi-close contests."
  • "Our investigation reveals a systemic overstatement of the Democratic-Republican margin in nearly every contest, regardless of mode or proximity to the election. This overstatement is largest in states with more Republican supporters"

Some of the report's statements are extraordinary if you stop and think for a moment. I want you to ponder the key question: "what use are polls"?

The people paying for polls are mostly (but not completely) political campaigns and the media. The media want to report on an accurate snapshot of where the election is now and make an assessment of who will win. Political campaigns largely want the same thing. 

In places like Alaska or Hawaii, polls aren't very useful because voters reliably vote strongly Republican or Democratic. The same goes for Wyoming, which is overwhelmingly a Republican stronghold, and Washington D.C., a Democratic stronghold. My forecast for 2024 is simple: Wyoming will vote Republican and Washington D.C. Democratic.

Polls are useful where the race is close, or, in the words of the report "semi-close". But, according to the report, polls in semi-close states don't have sufficient accuracy to predict the result.

So, if polls aren't useful in strongly Democratic or Republican states, and they lack predictive power in "semi-close" races, what use are they? Why should anyone pay for them?

There's an even deadlier issue for polling organizations. You can judge the accuracy of political opinion polls very clearly: the election result tells you whether they were right. Opinion poll companies run all kinds of polls on all kinds of topics, not just elections. How accurate are they in other areas where their success is harder to assess?

Where to next?

The polling industry has an existential credibility crisis. It can't continue to sell a product that doesn't work. It's extraordinary that an industry that's been around for nearly 100 years doesn't have the data to diagnose its failures. The industry needs to come together to fix its problems as soon as possible - or face irrelevancy in the near future.

Monday, July 26, 2021

Reconstructing an unlabelled chart

What were the numbers?

Often in business, we're presented with charts where the y-axis is unlabeled because the presenter wants to conceal the numbers. Are there ways of reconstructing the labels and figuring out what the data is? Surprisingly, yes there are.

Given a chart like this:

you can often figure out what the chart values should be.

The great Evan Miller posted on this topic several years ago ("How To Read an Unlabeled Sales Chart"). He discussed two methods:

  • Greatest common divisor (gcd)
  • Poisson distribution

In this blog post, I'm going to take his gcd work a step further and present code and a process for reconstructing numbers under certain circumstances. In another blog post, I'll explain the Poisson method.

The process I'm going to describe here will only work:

  • Where the underlying data is integers
  • Where there's 'enough' range in the underlying data.
  • Where the maximum underlying data is less than about 200.
  • Where the y-axis includes zero. 

The results

Let's start with some results and the process.

I generated this chart without axis labels, the goal being to recreate the underlying data. I measured the screen y-coordinates of the top and bottom plot borders (187 and 677), and I measured the y-coordinates of the top of each of the bars. Using the process and code I describe below, I was able to correctly recreate the underlying data values, which were \([33, 30, 32, 23, 32, 26, 18, 59, 47]\).

How plotting packages work

To understand the method, we need to understand how a plotting package will render a set of integers on a chart.

Let's take the list of numbers \([1, 2, 3, 5, 7, 11, 13, 17, 19, 23]\) and call them \(y_o\). 

When a plotting package renders \(y_o\) on the screen, it will put them into a chart with screen x-y coordinates. It's helpful to think about the chart on the screen as a viewport with x and y screen dimensions. Because we only care about the y dimensions, that's what I'll talk about. On the screen, the viewport might go from 963 pixels to 30 pixels on the y-axis, a total range of 933 y-pixels.

Here's how the numbers \(y_o\) might appear on the screen and how they map to the viewport y-coordinates. Note the screen origin is top left, not bottom left; I'll "correct" for the different origin.

The plotting package will translate the numbers \(y_o\) to a set of screen coordinates I'll call \(y_s\). Assuming our viewport starts from 0, we have:

\[y_s = my_o\]

Let's just look at the longest bar that corresponds to the number 23. My measurements of the start and end are 563 and 27, which gives a length of 536. \(m\) in this case is 536/23, or 23.3.

There are three things to bear in mind:

  • The set of numbers \(y_o\) are integers
  • The set of numbers \(y_s\) are integers - we can't have half a pixel for example.
  • The scalar \(m\) is a real number

Integer only solutions for \(m\) 

In Evan Miller's original post, he only considered integer values of \(m\). If we restrict ourselves to integers, then most of the time:

\[m = gcd(y_s)\]

where gcd is the greatest common divisor.

To see how this works, let's take:

\[y_o = [1 , 2,  3]\]

and

\[m = 8\]

These numbers give us:

\[y_s = [8, 16, 24]\]

To find the gcd in Python:

import numpy as np
np.gcd.reduce([8, 16, 24])

which gives \(m = 8\), which is correct.

If we could guarantee \(m\) was an integer, we'd have an answer; we'd be able to reconstruct the original data just using the gcd function. But we can't do that in practice for three reasons:

  1. \(m\) isn't always an integer.
  2. There are measurement errors which means there will be some uncertainty in our \(y_s\) values.
  3. It's possible the original data set \(y_o\) has a gcd which is not 1.

In practice, we gather screen coordinates using a manual process that will introduce errors. At most, we're likely to be off by a few pixels for each measurement; however, even the smallest error will mean the gcd method won't work. For example, if the value on the screen should be 500 but we measure it as 499, this small error makes the method fail (there is a way around this failure that works for small measurement errors).
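Here's a quick illustration of that fragility (my example, following the 500 vs. 499 case above):

import numpy as np
np.gcd.reduce([100, 200, 500])   # 100 - the correct scaling factor
np.gcd.reduce([100, 200, 499])   # 1 - a single mis-measured pixel ruins the estimate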

If our original data set has a gcd greater than 1, the method won't work. Let's say our data was:

\[y_o = [2, 4, 6] \]

and:

\[m=8\]

we would have:

\[y_s = [16, 32, 48]\]

which has a gcd of 16, which is an incorrect estimate of \(m\). In practice, the odds of the original data set \(y_o\) having a gcd > 1 are low.
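You can check this quickly (my one-liner):

import numpy as np
np.gcd.reduce([16, 32, 48])   # returns 16, not the true scaling factor of 8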

The real killer for this approach is the fact that \(m\) is highly likely in practice to be a real number.

Real solutions for \(m\)

The only way I've found for solving for \(m\) is to try different values for \(m\) to see what succeeds. To get this to work, we have to constrain \(m\) because otherwise there would be an infinite number of values to try. Here's how I constrain \(m\):

  • I step through candidate \(m\) values in increments of 0.01.
  • I start my \(m\) values from just over 1 and stop at a maximum \(m\) value. I get the maximum by assuming the smallest value I measure on the screen corresponds to a data value of 1; for example, if the smallest measurement is 24 pixels, the smallest possible original data value is 1, so the maximum value for \(m\) is 24.

Now we've constrained \(m\), how do we evaluate candidate values of \(m\) in \(y_s = my_o\)? First off, we define an error function. We want our estimates of the original data \(y_o\) to be integers, so the further away we are from an integer, the worse the error. For the \(i\)th element of our estimate of \(y_o\), the error estimate is:

\[round \left ( \frac{y_{si}}{m_{estimate}} \right ) -  \frac{y_{si}}{m_{estimate}}\]

We choose the value of \(m\) with the least mean squared error, which means minimizing:

\[ \frac{1}{n} \sum  \left ( round \left ( \frac{y_{si}}{m_{estimate}} \right ) -  \frac{y_{si}}{m_{estimate}} \right )^2 \]

in code, this comes out as:

sum([(round(_y/div) - _y/div)**2 for _y in y])/len(y)

Our goal is to try different values of \(m\) and choose the solution that yields the lowest error estimate.

The solution in practice

Before I show you how this works, there are two practicalities. The first is that \(m=1\) is always a solution and will always give a zero error, but it's probably not the right solution, so we're going to ignore \(m=1\). Secondly, our measurements will contain some human error; I'm going to assume the maximum error is 3 pixels for any measurement. To calculate a length, we take measurements of the start and end of a bar (if it's a bar chart), which means our maximum uncertainty is 2*3 pixels. That's why I set my maximum \(m\) to be min(y) + 2*MAX_ERROR.

To show you how this works, I'll talk you through an example.

The first step is measurement. We need to measure the screen y-coordinates of the plot borders and the top of the bars (or the position of the points on a scatter chart). If the plot doesn't have borders, just measure the position of the bottom of the bars and the coordinate of the highest bar. Here are some measurements I took.

Here are the measurements of the top of the bars (_y_measured): \([482, 500, 489, 541, 489, 523, 571, 329, 399]\)

Here are the start and stop coordinates of the plot borders (_start, _stop):  \(677, 187\)

To convert these to lengths, the code is just: [_start - _y_m for _y_m in _y_measured]

The length of the screen from the top to the bottom is: _start - _stop = \(490\)

This gives us measured length (y_measured): \([195, 177, 188, 136, 188, 154, 106, 348, 278]\)

Now we run this code:

import numpy as np
import pandas as pd

MAX_ERROR = 3           # maximum measurement error in pixels
STEP = 0.01             # step size for candidate m values
ERROR_THRESHOLD = 0.01  # anything above this is treated as unreliable


def mse(y, div):
    """Mean square error calculation."""
    return sum([(round(_y/div) - _y/div)**2 for _y in y])/len(y)


def find_divider(y):
    """Return the (non-integer) divider that minimizes the error function."""
    error_list = []
    for _div in np.arange(1 + STEP,
                          min(y) + 2*MAX_ERROR,
                          STEP):
        error_list.append({"divider": _div,
                           "error": mse(y, _div)})
    df_error = pd.DataFrame(error_list)
    df_error.plot(x='divider', y='error', kind='scatter')
    _slice = df_error[df_error['error'] == df_error['error'].min()]
    divider = _slice['divider'].to_list()[0]
    error = _slice['error'].to_list()[0]
    if error > ERROR_THRESHOLD:
        raise ValueError('The estimated error is {0} which is '
                         'too large for a reliable result.'.format(error))
    return divider


def find_estimate(y, y_extent):
    """Make an estimate of the underlying data."""
    if (max(y) - min(y))/y_extent < 0.1:
        raise ValueError('Too little range in the data to make an estimate.')
    m = find_divider(y)
    return [round(_e/m) for _e in y], m


# Measurements taken from the chart (see above)
y_measured = [195, 177, 188, 136, 188, 154, 106, 348, 278]
y_extent = 490

estimate, m = find_estimate(y_measured, y_extent)

This gives us this output:

Original numbers: [33, 30, 32, 23, 32, 26, 18, 59, 47]

Measured y values: [195, 177, 188, 136, 188, 154, 106, 348, 278]

Divider (m) estimate: 5.900000000000004

Estimated original numbers: [33, 30, 32, 23, 32, 26, 18, 59, 47]

Which is correct.

Limitations of this approach

Here's when it won't work:

  • If there's little variation in the numbers on the chart, then measurement errors tend to overwhelm the calculations and the results aren't good.
  • In a similar vein, if the numbers are all close to the top or the bottom of the chart, measurement errors lead to poor results.
  • If \(m < 1\). Because the y viewport is usually in the range 500-900 pixels, this means the method won't work for underlying values greater than about 500.
  • I've found in practice that if \(m < 3\) the results can be unreliable. To protect against poor results, I arbitrarily treat any error greater than 0.01 as too high. Maybe I should also restrict the results to \(m > 3\).

I'm not entirely convinced my error function is correct; I'd like an error function that better discriminates between values. I tried a couple of alternatives, but they didn't give good results. Perhaps you can do better.

Notice that the error function is 'denser' closer to 1, suggesting I should use a variable step size or a different algorithm. It might be that the closer you get to 1, the more measurement errors and rounding effects overwhelm the calculation. I've played around with smaller step sizes and not had much luck.

Future work

If the data is Poisson distributed, there's an easier approach you can take. In a future blog post, I'll talk you through it.

Where to get the code

I've put the code on my Github page here: https://github.com/MikeWoodward/CodeExamples/blob/master/UnlabeledChart/approxrealgcd.py

Tuesday, July 20, 2021

We don't need no education: England's poor educational performance

Why is education important?

High-paying jobs tend to be knowledge-intensive. If you can't get the qualified workers you want in the UK, you could move your operations to Poland, Spain, or Czechia. This obviously applies to IT jobs, but also to specialized manufacturing jobs and jobs in other areas. Governments are in a race, or a beauty contest, to attract employers with high-paying jobs to set up in their country. High-paying jobs support an ecosystem of other employment, from transportation workers to baristas.

(In the modern economy, those who train well and invest will win. Image source: Wikimedia, Author:Ub-K0G76A. License: Creative Commons.)

That's why the UK's flat educational performance is so worrying and why it's puzzling there isn't more concern about it in the country. To summarize what I'm going to tell you: educational achievement in the UK has been stagnant for over a decade and later-in-life learning is flat or declining. 

I'm going to focus on England because it's the largest of the UK countries, but the story is similar in Scotland, Wales, and Northern Ireland.

A slice of PISA

The OECD has run PISA (Programme for International Student Assessment) every three years since 2000. It's an assessment of 15-year-old students' achievement in science, math, and reading across several countries. The idea is you can compare performance across nations across time using the same yardstick. Currently, 79 countries take part.

Of course, not every 15-year-old in every country takes the test. In 2018 in the UK, around 13,000 students took the test - which means there was sampling. Sampling implies there will be some uncertainty in the results, so small year-to-year changes are insignificant. How statistical significance is calculated for a sample like this is well-known and I won't go into the theory here.

The press (and governments) give a lot of attention to the country league table of results, but I'm going to focus on the scores themselves.

Standing still

The National Foundation for Education Research in the UK produces summary results and more detailed results for England, Northern Ireland, Scotland, and Wales. I'm just going to focus on the results for England.

Here are the results for math, science, and reading. Note the y-axis and the fact I've zoomed in on the scores.

With the exception of math in 2009 and 2012, none of the other years' results is statistically significantly different from the 2018 results. This is so important I'm going to say it again in another way. The performance of English 15-year-olds in math, science, and reading has not measurably changed from 2006 to 2018.

Let that sink in. Despite 12 years of government policy, despite 12 years of research, despite 12 years of the widespread adoption of computing technology in the classroom, English students' performance has not measurably changed when measured with a fair and consistent international test.

Not learning later in life

We live in a world where change is a constant. New technology is re-making old industries all the time. Whatever qualifications you have, at some point you'll need retraining. Are people in the UK learning post-formal education?

The Learning and Work Institute tracks participation in learning post-formal education; they have estimates of the fraction of the UK adult population that is currently taking some form of training. Here are the results from 1996 to 2019. The blue line is the fraction of the population currently taking some form of training, and the red line is the fraction of the population who have never taken any form of training course since leaving full-time education.

The rates of participation in lifelong learning are at best steady-state and at worst declining. The Institute breaks the data down by social class (important in the UK) and age when someone left full-time education (a proxy for their level of education). Unsurprisingly, participation rates in lifelong learning are higher for those with more education and they're lower the further down the social class scale you go.

Younger people have worse skills

In 2016, the OECD published a study on adult skills in England. The study was worrying. To quote from the report:

  • "In most countries, but not in England, younger people have stronger basic skills than the generation of people approaching retirement."
  • "In England, one-third of those aged 16-19 have low basic skills."
  • "England has three times more low-skilled people among those aged 16-19 than the best-performing countries like Finland, Japan, Korea and the Netherlands. Much of this arises from weak numeracy (and to a lesser extent literacy) performance on average."
  • "Around one in ten of all university students in England have numeracy or literacy levels below level 2."
  • "Most low-skilled people of working age are in employment."

For context, people with skills below level 2 may "struggle to estimate how much petrol is left in the petrol tank from a sight of the gauge" or "not be able to fully understand instructions on a bottle of aspirin".

Education at a glance

The OECD summarized a lot of data for the UK as a whole in a 2020 summary. The results are a mixed bag, some good things, and a lot of bad things. Here are a few cherry-picked quotes:

  • "In United Kingdom, the proportion of adults employed in the private sector and participating in job-related non-formal education and training not sponsored by the employer is low compared to other OECD and partner countries. (3 %, rank 31/36 , 2016)"
  • "In United Kingdom, the number of annual hours of participation of adults in formal education and training is comparatively low (169 %, rank 26/26 , 2016)"
  • "In United Kingdom, the share of capital expediture on primary education is one of the smallest among OECD and partner countries with available data. (3.3 %, rank 27/32 , 2017)"

Politics and the press

Education is a long-term process. If a government invests in education for 5-year-olds, it will be 10 years or more before the effects are apparent in exam results. Most education ministers only last a few years in the job and they want some kind of results quickly. This tends to focus policy on the short-term.

In the UK, the government has tinkered with the qualification system for 16 and 18-year-olds. The net effect is to make it hard to compare results over time. For many years, average grades went up while many commentators were convinced standards were slipping; but of course, rising grades were the results politicians wanted.

The PISA results cut through all this and expose a system that's not improving. The politicians' response was to point at the country league tables and make vaguely positive comments about pupil achievements.

What about the decrease in adult education and training and the OECD report on skills? As far as I can tell, silence. I couldn't even find a decent discussion in the press.

Is there a way forward?

I don't think there are any easy answers for 15-year-olds, and certainly, none that are quick and cheap. What might help is a more mature discussion of what's going on, with some honesty and openness from politicians. Less tinkering with the exam system would be a good idea.

For adult learning, I'm very skeptical of a way forward without a significant cultural change. I can see a huge and widening educational divide in the UK. Graduates are training and retraining, while non-graduates are not. This is not good for society.

Monday, July 12, 2021

What is beta in statistical testing?

\(\beta\) is \(\alpha\) if there's an effect

In hypothesis testing, there are two kinds of errors:

  • Type I - we say there's an effect when there isn't. The threshold here is \(\alpha\).
  • Type II - we say there's no effect when there really is an effect. The threshold here is \(\beta\).

This blog post is all about explaining and calculating \(\beta\).


The null hypothesis

Let's say we do an A/B test to measure the effect of a change to a website. Our control branch is the A branch and the treatment branch is the B branch. We're going to measure the conversion rate \(C\) on both branches. Here are our null and alternative hypotheses:

  • \(H_0: C_B - C_A = 0\) there is no difference between the branches
  • \(H_1: C_B - C_A \neq 0\) there is a difference between the branches

Remember, we don't know if there really is an effect, we're using procedures to make our best guess about whether there is an effect or not, but we could be wrong. We can say there is an effect when there isn't (Type I error) or we can say there is no effect when there is (Type II error).

Mathematically, we're taking the mean of thousands of samples so the central limit theorem (CLT) applies and we expect the quantity \(C_B - C_A\) to be normally distributed. If there is no effect, then \(C_B - C_A = 0\), if there is an effect \(C_B - C_A \neq 0\).

\(\alpha\) in a picture

Let's assume there is no effect. We can plot out our expected probability distribution and define an acceptance region (blue, 95% of the distribution) and two rejection regions (red, 5% of the distribution). If our measured \(C_B - C_A\) result lands in the blue region, we will accept the null hypothesis and say there is no effect. If our result lands in the red region, we'll reject the null hypothesis and say there is an effect. The red region is defined by \(\alpha\).

One way of looking at the blue area is to think of it as a confidence interval around the mean \(\bar x_0\):

\[\bar x_0 + z_\frac{\alpha}{2} s \; and \; \bar x_0 + z_{1-\frac{\alpha}{2}} s \]

In this equation, \(s\) is the standard error in our measurement. The probability of a measurement \(x\) lying in this range is:

\[0.95 = P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ] \]

If we transform our measurement \(x\) to the standard normal \(z\), and we're using a 95% acceptance region (boundaries given by \(z\) values of 1.96 and -1.96), then we have for the null hypothesis:

\[0.95 = P[-1.96 < z < 1.96]\]
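As a quick sanity check (my addition, assuming scipy is available), the 1.96 boundaries and the 95% probability drop straight out of the standard normal distribution:

from scipy.stats import norm
norm.ppf(0.975)                      # approximately 1.96
norm.cdf(1.96) - norm.cdf(-1.96)     # approximately 0.95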

\(\beta\) in a picture

Now let's assume there is an effect. How likely is it that we'll say there's no effect when there really is an effect? This is the threshold \(\beta\).

To draw this in pictures, I want to take a step back. We have two hypotheses:

  • \(H_0: C_B - C_A = 0\) there is no difference between the branches
  • \(H_1: C_B - C_A \neq 0\) there is a difference between the branches

We can draw a distribution for each of these hypotheses. Only one distribution will apply, but we don't know which one.



If the null hypothesis is true, the blue region is where our true negatives lie and the red region is where the false positives lie. The boundaries of the red/blue regions are set by \(\alpha\). The value of \(\alpha\) gives us the probability of a false positive.

If the alternate hypothesis is true, the true positives will be in the green region and the false negatives will be in the orange region. The boundary of the green/orange regions is set by \(\beta\). The value of \(\beta\) gives us the probability of a false negative.

Calculating \(\beta\)

Calculating \(\beta\) is calculating the orange area of the alternative hypothesis chart. The boundaries are set by \(\alpha\) from the null hypothesis. This is a bit twisty, so I'm going to say it again with more words to make it easier to understand.

\(\beta\) is about false negatives. A false negative occurs when there is an effect, but we say there isn't. When we say there isn't an effect, we're saying the null hypothesis is true. For us to say there isn't an effect, the measured result must lie in the blue region of the null hypothesis distribution.

To calculate \(\beta\), we need to know what fraction of the alternate hypothesis lies in the acceptance region of the null hypothesis distribution.

Let's take an example so I can show you the process step by step.

  1. Assuming the null hypothesis, set up the boundaries of the acceptance and rejection regions. Assuming a 95% acceptance region and an estimated mean of \(\bar x_0\), this gives the acceptance region as:
    \[P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ] \] which is the mean and 95% confidence interval for the null hypothesis. Our measurement \(x\) must lie between these bounds.
  2. Now assume the alternate hypothesis is true. If the alternate hypothesis is true, then our mean is \(\bar x_1\).
  3. We're still using this equation from before, but this time, our distribution is the alternate hypothesis.
    \[P \left [ \bar x_0 + z_\frac{\alpha}{2} s < x < \bar x_0 + z_{1-\frac{\alpha}{2}} s \right ] \]
  4. Transforming to the standard normal distribution using the formula \(z = \frac{x - \bar x_1}{s}\), we can write the probability \(\beta\) as:
    \[\beta = P \left [ \frac{\bar x_0 + z_\frac{\alpha}{2} s - \bar x_1}{s} < z < \frac{ \bar x_0 + z_{1-\frac{\alpha}{2}} s - \bar x_1}{s} \right ] \]

This time, let's put some numbers in. 

  • \(n = 200,000\) (100,000 per branch)
  • \(C_B = 0.062\)
  • \(C_A =  0.06\)
  • \(\bar x_0= 0\) - the null hypothesis
  • \(\bar x_1 = 0.002\) - the alternate hypothesis
  • \(s = 0.00107\)  - this comes from combining the standard errors of both branches, so \(s^2 = s_A^2 + s_B^2\), and I'm using the usual formula for the standard error of a proportion, for example, \(s_A = \sqrt{\frac{C_A(1-C_A)}{n} }\)

Plugging them all in, this gives:
\[\beta = P[ -3.829 < z < 0.090]\]
which gives \(\beta = 0.536\)
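If you want to check the arithmetic, here's a short script that reproduces the numbers (my sketch, assuming scipy is available; the variable names follow the text above):

from math import sqrt
from scipy.stats import norm

n = 100_000             # samples per branch
C_A, C_B = 0.06, 0.062  # conversion rates
alpha = 0.05

# Standard error of each proportion, combined in quadrature
s_A = sqrt(C_A*(1 - C_A)/n)
s_B = sqrt(C_B*(1 - C_B)/n)
s = sqrt(s_A**2 + s_B**2)          # approximately 0.00107

x_bar_0 = 0.0                      # null hypothesis mean
x_bar_1 = C_B - C_A                # alternate hypothesis mean, 0.002

# Acceptance region boundaries under the null hypothesis...
lower = x_bar_0 + norm.ppf(alpha/2)*s
upper = x_bar_0 + norm.ppf(1 - alpha/2)*s

# ...re-expressed as z-scores under the alternate hypothesis
z_lower = (lower - x_bar_1)/s      # approximately -3.83
z_upper = (upper - x_bar_1)/s      # approximately 0.09

beta = norm.cdf(z_upper) - norm.cdf(z_lower)
print(beta)                        # approximately 0.536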

This is too hard

This process is complex and involves lots of steps. In my view, it's too complex. It feels to me that there must be an easier way of constructing tests. Bayesian statistics holds out the hope for a simpler approach, but widespread adoption of Bayesian statistics is probably a generation or two away. We're stuck with an overly complex process using very difficult language.

Reading more

Tuesday, July 6, 2021

Spritely fraud detection

Scientific fraud and business manipulation

Sadly, there's a long history of scientific fraud and misrepresentation of data. Modern computing technology has provided better tools for those trying to mislead, but the fortunate flip side is, modern tools provide ways of exposing misrepresented data. It turns out, the right tools can indicate what's really going on.

(Author: Nick Youngson. License: Creative Commons. Source: Wikimedia)

In business, companies often say they can increase sales, reduce costs, or do some other desirable thing. The evidence is sometimes in the form of summary statistics like means and standard deviations. Do you think you could assess the credibility of evidence based on the mean and standard deviation summary data alone?

In this blog post, I'm going to talk about how you can use one tool to investigate the credibility of mean and standard deviation evidence.

Discrete quantities

Discrete quantities are quantities that can only take discrete values. An example is a count, for example, a count of the number of sales. You can have 0, 1, 2, 3... sales, but you can't have -1 sales or 563.27 sales.

Some business quantities are measured on scales of 1 to 5 or 1 to 10, for example, net promoter scores or employee satisfaction scores. These scales are often called Likert scales.

For our example, let's imagine a company is selling a product on the internet and asks its customers how likely they are to recommend the product. The recommendation is on a scale of 0 to 10, where 0 is very unlikely to recommend and 10 is very likely to recommend. This is obviously based on the net promoter idea, but I'm simplifying things here.

Scale: 0 (very unlikely to recommend) to 10 (very likely to recommend)


Imagine the salesperson for the company tells you the results of a 500-person study are a mean of 9 and a standard deviation of 2.5. They tell you that customers love the product, but obviously, there's some variation. The standard deviation, they say, shows that not everyone's satisfied, so the numbers must be credible.

But are these numbers really credible?

Stop for a second and think about it. It's quite possible that their customers love the product. A mean of 9 on a scale of 10 isn't perfection, and the standard deviation of 2.5 suggests there is some variation, which you would expect. Would you believe these numbers?

Investigating credibility

We have three numbers; a mean, a standard deviation, and a sample size. Lots of different distributions could have given rise to these numbers, how can we backtrack to the original data?

The answer is, we can't fully backtrack, but we can investigate possibilities.

In 2018, a group of academic researchers in The Netherlands and the US released software you can use to backtrack from mean and standard deviation data to possible distributions. Their goal was to provide a tool to help investigate academic fraud. They wrote up how their software works and published it online; you can read their writeup here. They called their software SPRITE (Sample Parameter Reconstruction via Iterative TEchniques) and made it open-source, even going so far as to make a version of it available online. The software will show you the possible distributions that could give rise to the summary statistics you have.
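To make the general idea concrete, here's a toy sketch of my own (it is not the published SPRITE algorithm, which is more sophisticated): start with every response equal to the mean, then repeatedly nudge one response up and another down, which keeps the mean fixed, accepting only nudges that move the standard deviation towards the target. Whatever it converges on is one integer distribution consistent with the reported summary statistics.

import random
from collections import Counter

def toy_sprite(n=500, target_mean=9.0, target_sd=2.5, lo=0, hi=10, iters=100_000):
    """Crude hill-climbing search for an integer sample on [lo, hi] with
    the requested mean and sample standard deviation."""
    values = [round(target_mean)]*n          # start with every response at the mean
    total = sum(values)                      # fixed: every accepted move preserves the mean
    sum_sq = sum(v*v for v in values)

    def sd_from(ss):
        """Sample standard deviation from the sum of squares (the sum is fixed)."""
        return ((ss - total*total/n)/(n - 1))**0.5

    gap = abs(sd_from(sum_sq) - target_sd)
    for _ in range(iters):
        i, j = random.randrange(n), random.randrange(n)
        if i == j or values[i] >= hi or values[j] <= lo:
            continue
        # Moving one response up and another down leaves the mean unchanged
        new_sum_sq = sum_sq + (2*values[i] + 1) + (1 - 2*values[j])
        new_gap = abs(sd_from(new_sum_sq) - target_sd)
        if new_gap <= gap:                   # keep moves that bring the sd closer to target
            values[i] += 1
            values[j] -= 1
            sum_sq, gap = new_sum_sq, new_gap
    return values

sample = toy_sprite()
print(sorted(Counter(sample).items()))       # one distribution consistent with the summary stats

SPRITE itself explores many candidate distributions rather than a single one, which is what makes its output so useful for spotting implausible data.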

One of the online versions is here. Let's plug in the salesperson's numbers to see if they're credible. 

If you go to the SPRITE site, you'll see a menu on the left-hand side. In my screenshot, I've plugged in the numbers we have:

  • Our scale goes from 0 to 10, 
  • Our mean is 9, 
  • Our standard deviation is 2.5, 
  • The number of samples is 500. 
  • We'll choose 2 decimal places for now
  • We'll just see the top 9 possible distributions.

Here are the top 9 results.

Something doesn't smell right. I would expect the data to show a more even spread about the mean; for a mean of 9, I would expect a number of 10s and a number of 8s too. These estimated distributions suggest that almost everyone is deliriously happy, with just a small handful of people unhappy. Is this credible in the real world? Probably not.

I don't have outright evidence of wrongdoing, but I'm now suspicious of the data. A good next step would be to ask for the underlying data. At the very least, I should view any other data the salesperson provides with suspicion. To be fair to the salesperson, they were probably provided with the data by someone else.

What if the salesperson had given me different numbers, for example, a mean of 8.5, a standard deviation of 1.2, and 100 samples? Looking at the results from SPRITE, the possible distributions seem much more likely. Yes, misrepresentation is still possible, but on the face of it, the data is credible.

Did you spot the other problem?

There's another, more obvious problem with the data. The scale is from 0 to 10, but the results are a mean of 9 and a standard deviation of 2.5, which implies a range (mean plus or minus one standard deviation) of 6.5 to 11.5. To state the obvious, the maximum score is 10, but the upper end of that range is 11.5. This type of mistake is very common and doesn't of itself indicate fraud. I'll blog more about this type of mistake later.

What does this mean?

Due diligence is about checking claims for veracity before spending money. If there's a lot of money involved, it behooves the person doing the due diligence to check the consistency of the numbers they've been given. Tools like SPRITE are very helpful for sniffing out areas to check in more detail. Just because a tool like SPRITE flags something, it doesn't mean there's fraud; people make mistakes with statistics all the time. But if something is flagged, you do need to get to the bottom of it.

Other ways of detecting dodgy numbers 

Finding out more