Sunday, May 23, 2021

Why A/B tests don't add up

All the executives laughed

A few years ago, I was at an industry event. The speaker was an executive talking about his A/B testing program. He joked that vendors and his team were unreliable because the overall result was less than the sum of the individual tests. Everyone laughed knowingly.

But we shouldn't have laughed.

The statistics are clear and he should have known better. By the rules of the statistical game, the benefits of an A/B program will be less than the sum of the parts and I'm going to tell you why.

Thresholds and testing

An individual A/B test is a null hypothesis test with thresholds that decide the result of the test. We don't know whether there is an effect or not, we're making a decision based on probability. There are two important threshold numbers:

  • \(\alpha\) - also known as significance and usually set around 5%. If there really is no effect, \(\alpha\) is the probability we will say there is an effect. In other words, it's the false positive rate (Type I errors).
  • \(\beta\) - is usually set around 20%. If there really is an effect, \(\beta\) is the probability we will say there is no effect. In other words, it's the false negative rate (Type II errors). In practice, power is used instead of \(\beta\), power is \(1-\beta\), so it's usual to set the power to 80%.

Standard statistical practice focuses on just a single test, but an organization's choice of \(\alpha\) and \(\beta\) affect the entire test program.

\(\alpha\), \(\beta\) and the test program

To see how the choice of \(\alpha\) and \(\beta\) affect the entire test program, let's run a simplified thought experiment. Imagine we choose  \(\alpha = 5\%\) and \(\beta = 20\%\), which are standard settings in most organizations. Now imagine we run 1,000 tests, in 100 of them there's a real effect and in 900 of them there's no effect. Of course, we don't know which tests have an effect and which don't.

Take a second to think about these questions before moving on:

  • How many many positive test results will we measure?
  • How many false positives will we see?
  • How many true positives will we see?

At this stage, you should have numbers in mind. I'm asking you to do this so you understand the importance of what happens next.

The logic to answer these questions is straightforward. In the picture below, I've shown how it works, but I'll talk you through it so you can understand it in more detail.

Of the 1,000 tests, 100 have a real effect. These are the tests that \(\beta\) applies to and \(\beta=20\%\), so we'll end up with:

  • 20 false negatives, 80 true positives

Of the 1,000 tests, 900 have no effect. These are the tests that \(\alpha\) applies to and \(\alpha=5\%\), so we'll end up with:

  • 855 true negatives, 45 false positives

Overall we'll measure:

  • 125 positives made up of
  • 80 true positives
  • 45 false positives

Crucially, we won't know which of the 125 positives are true and which are false.

Because this is so important, I'm going to lay it out again: in this example, 36% of all test results we thought were positive are wrong, but we don't know which ones they are. They will dilute the overall results of the overall program. The overall results of the test program will be less than the sum of the individual test results.

What happens in reality

In reality, you don't know what proportion of test results are 'true'. It might be 10%, or 20%, or even 5%. Of course, the reason for the test is that you don't know the result. What this means is, it's hard to do this calculation on real data, but the fact that you can't easily do the calculation doesn't mean the limits don't apply.

Can you make things better?

To get a higher proportion of true positives, you can do at least three things.

  • Run fewer tests - selecting only tests where you have a good reason to believe there is a real effect. This would certainly work, but you would forgo a lot of the benefits of a testing program.
  • Run with a lower \(\alpha\) value. There's a huge debate in the scientific community about significance levels. Many authors are pushing for a 0.5% level instead of a 5% level. So why don't you just lower \(\alpha\)? Because the sample size will increase greatly.
  • Run with a higher power (lower \(\beta\)). Using a power of 80% is "industry standard", but it shouldn't be - in another blog post I'll explain why. The reason people don't do it is because of test duration - increasing the power increases the sample size.

Are there other ways to get results? Maybe, but none that are simple. Everything I've spoken about so far uses a frequentist approach. Bayesian testing offers the possibility of smaller test sizes, meaning you could increase power and reduce \(\alpha\) while still maintaining workable sample sizes. Of course, A/B testing isn't the only testing method available and other methods offer higher power with lower sample sizes.

No such thing as a free lunch 

Like any discipline, statistical testing comes with its own rules and logic. There are trade-offs to be made and everything comes with a price. Yes, you can get great results from A/B testing programs, and yes companies have increased conversion, etc. using them, but all of them invested in the right people and technical resources to get there and all of them know the trade-offs. There's no such thing as a free lunch in statistical testing.

Monday, May 17, 2021

Counting on Poisson

Why use the Poisson distribution?

Because it has properties that make it great to work with, data scientists use the Poisson distribution to model different kinds of counting data. But these properties can be seductive, and sometimes people model data using the Poisson distribution when they shouldn't. In this blog post, I'll explain why the Poisson distribution is so popular and why you should think twice before using it.

(Siméon-Denis Poisson by E. Marcellot, Public domain, via Wikimedia Commons)

Poisson processes

The Poisson distribution is a discrete event probability distribution used to model events created using a Poisson process. Drilling down a level, a Poisson process is a series of events that have these properties:

  • They occur at random but at a constant mean rate,
  • They are independent of one another, 
  • Two (or more) events can't occur at the same time

Good examples of Poisson processes are website visits, radioactive decay, and calls to a help center. 

The properties of a Poisson distribution

Mathematically, the Poisson probability mass function looks like this:

\[ P_r (X=k) = \frac{\lambda^k e^{- \lambda}}{k!} \]

where 

  • k is the number of events (always an integer)
  • \(\lambda\) is the mean value (or expected rate)

It's a discrete distribution, so it's only defined for integer values of \(k\).

Graphically, it looks like this for \(\lambda=6\). Note that it isn't symmetrical and it stops at 0, you can't have -1 events.

(Let's imagine we were modeling calls per hour in a call center. In this case, \(k\) is the measured calls per hour, \(P\) is their frequency of occurrence, and \(\lambda\) is the mean number of calls per hour).

Here are some of the Poisson distribution's properties:

  • Mean: \(\lambda\)
  • Variance: \(\lambda\)
  • Mode: floor(\(\lambda\))

The fact that some of the key properties are given by \(\lambda\) alone makes using it easy. If your data follows a Poisson distribution, once you know the mean value, you've got the variance (and standard deviation), and the mode too. In fact, you've pretty much got a full description of your data's distribution with just a single number.

When to use it and when not to use it

Because you can describe the entire distribution with just a single number, it's very tempting to assume that any data that involves counting follows a Poisson distribution because it makes analysis easier.  Sadly, not all counts follow a Poisson distribution. In the list below, which counts do you think might follow a Poisson distribution and which might not?

  • The number of goals in English Premier League soccer matches.
  • The number of earthquakes of at least a given size per year around the world.
  • Bus arrivals.
  • The number of web pages a person visits before they make a purchase.

Bus arrivals are not well modeled by a Poisson distribution because in practice they're not independent of one another and don't occur at a constant rate. Bus operators change bus frequencies throughout the day, with more buses scheduled at busy times; they may also hold buses at stops to even out arrival times. Interestingly, bus arrivals are one of the textbook examples of a Poisson process, which shows that you need to think before applying a model.

The number of web pages a person visits before they make a purchase is better modeled using a negative binomial distribution

Earthquakes are well-modeled by a Poisson distribution. Earthquakes in different parts of the world are independent of one another and geological forces are relatively constant, giving a constant mean rate for quakes. It's possible that two earthquakes could happen simultaneously in different parts of the world, which shows that even if one of the criteria might not apply, data can still be well-modeled by Poisson.

What about soccer matches? We know two goals can't happen at the same time. The length of matches is fixed and soccer is a low-scoring game, so the assumption of a constant rate for goals is probably OK. But what about independence? If you've watched enough soccer, you know that the energy level in a game steps up as soon as a goal is scored. Is this enough to violate the independence requirement? Apparently not, scores in soccer matches are well-modeled by a Poisson distribution.

What should a data scientist do?

Just because the data you're modeling is a count doesn't mean it follows a Poisson distribution. More generally, you should be wary of making choices motivated by convenience. If you have count data, look at the properties of your data before deciding on a distribution to model it with. 

If you liked this blog post you might like

Monday, May 10, 2021

Soldiers on the moon: Project Horizon

The moon and the cold war

In 1959, Cold War rivalries were intense and drove geopolitics; the protagonists had already fought several proxy wars and the nuclear arms race was well underway. The Soviet Union put the first satellite into space in 1957, which was a wake-up call to the United States; if the Soviet Union could put a satellite in orbit, they could put a missile in orbit. By extension, if the Soviet Union got to the moon first, they could build a lunar military base and dominate the moon. The Soviet Union had announced plans to celebrate its 50th anniversary (1967) with a lunar landing. The race was on with a clear deadline.

(Phadke09, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons)

In response to the perceived threat, the US Army developed an audacious plan to set up a military base on the moon by 1965 and beat the Soviets. The plan, Project Horizon, was 'published' in 1959 but only declassified in 2014. The plan is very extensive and covers living arrangements, spacesuits, power, and transport, it was published in two volumes with illustrations (Volume I, Volume II).

In some alternative history, something like this could have happened. The ideas in it still have some relevance, so let's dive in and take a look. 

Getting there

In 1959, the enormous Saturn V rockets didn't exist and the army planners knew that heavy-lift rockets would take years to develop. To meet the 1965 deadline, they needed to get going with current and near-future technology, which meant Saturn I and Saturn II rockets. The plan called for 61 Saturn I rockets and 88 Saturn IIs, with a launch rate of 5.3 per month. In reality, only 19 Saturn Is were ever launched (the Apollo project used Saturn Vs, of which 13 were launched). 

To state the obvious, smaller rockets carry smaller payloads. To maximize payloads to orbit, you need to think about launch sites; the closer you are to the equator the bigger boost you get from the Earth's rotation. Project Horizon considered several launch sites on the equator, none of which were in US territory. The map below comes from the report and I've indicated prospective launch sites in red.

  • Somalia. Rejected because of remoteness.
  • Manus Island. Rejected because of remoteness.
  • Christmas Island. Remote, but given serious consideration because of logistics.
  • Brazil. Closer to the US and given serious consideration.

The report doesn't decide between Christmas Island and Brazil but makes a good case for both. The launch site would obviously be huge and would have several gantries for multiple launches to hit the 5.3 launches per month target.

The next question is: how do you get to the moon? Do you go directly or do you attempt equatorial orbit refueling? In 1969, Apollo 11 went directly to the moon, but it was launched by the far larger Saturn V rocket. With just Saturn I and Saturn IIs, the team needed to take a different approach. They settled on orbital refueling from a 10-person space station followed by a direct approach once more powerful rockets became available.

Landing on the moon

The direct and indirect methods of getting to the moon led to two different lander designs, one of which I've shown below. The obvious question is, why is it streamlined? The upper stage is the return-to-earth vehicle so it's shaped for re-entry. In the Apollo missions, the reentry vehicle was part of the command module that stayed in lunar orbit, so the lunar lander could be any shape.

Staffing

The plan was for an initial landing by two astronauts followed by construction teams to build the base. The role of the first astronauts was scouting and investigating possible base sites, and they were to stay on the moon for 30-90 days, living in their lander. The construction crews would build the base in 18 months but the maximum tour of duty for a construction crew member was 12 months. By late 1965, the base would be staffed by a team of 12 (all men, of course, this was planned in 1959 after all).

The moon base

The moon crew was to have two different sorts of rides: a lunar bulldozer for construction and a rover for exploration (and 'surveillance'). The bulldozer was to dig trenches and drop the living quarter into them (the trenches would also be partially excavated by high explosives). The living quarters themselves were 10ft x 20ft cylinders. 

Burying living quarters helps with temperature regulation and provides protection against radiation. The cylinders themselves were to be giant double-walled thermos flasks (vacuum insulated) for thermal stability.

The finished base was to be l-shaped.

Lunar living

The initial construction camp looks spartan at best and things only improve marginally when the l-shaped base is completed.

In the finished base, toilets were to be 'bucket-type' with activated charcoal for odor control; urine was to be stored on the moon surface for later recycling.

The men were to be rationed to 6lb of water (2.7 liters) and 4lb of food per day - not starvation or dehydration, but not generous either. The initial plan was for all meals to be pre-cooked, but the soldiers would later set up a hydroponics farm for fresh fruit and vegetables.

Spacesuits

Curiously, there isn't as much as you'd expect about spacesuits, only a few pages. They knew that spacesuits would be restrictive and went so far as defining the body measurements of a standard man, including details like palm length. The idea seems straightforward, if technology restricts your design flexibility, then select your crew to fit what you can build. 

Power

Perhaps unsurprisingly for the 1950s, power for the base was to come from two nuclear reactors. both of which needed to be a safe distance from the base and recessed into the regolith in case of accidents. It seems like the lunar bulldozer was going to be very busy. 

Weapons

Soldiers mean guns or at least weapons. The report is surprisingly coy about weapons; it alludes to R&D work necessary to develop lunar weapons, but that's about it.

Costs

$6 billion total in 1959 dollars. Back then, this was an awful lot of money. The real Apollo program cost $25.4 billion and it's highly likely $6 billion was a substantial underestimate, probably by an order of magnitude.

Project Horizon's impact

As far as I can tell, very little. The plan was put to Eisenhower, who rejected it. Instead, NASA was created and the race to the moon as we know it started. But maybe some of the Project Horizon ideas might come back.

Burying habitats in the lunar regolith is an idea the Soviets used in their lunar base plans and has been used several times in science fiction. It's a compelling idea because it insulates the base from temperature extremes and from radiation. However, we now know lunar regolith is a difficult substance to work with.

Nuclear power makes sense but has obvious problems, and transporting nuclear power systems to orbit has risks. The 1970s British TV science fiction series "Space:1999" had a nuclear reactor explosion knocking the moon out of orbit, which is far-fetched, but a nuclear problem on the moon would be severe.

The ideas of in-flight re-fueling and lunar waystations have come up again in NASA's future lunar exploration plans.

What may have dealt a project like Project Horizon a final death blow is the 1967 Outer Space Treaty which bans weapons of mass destruction and the militarization of space.

Project Horizon is a footnote in the history of space exploration but an interesting one. It gives insight into the mind of the military planners of the time and provides a glimpse into one alternative path the world might have taken.

If you liked this blog post

You might like these other ones:

Monday, May 3, 2021

How to hire well

How I've learned to hire

I’ve done a lot of hiring and I’ve learned what works and what doesn’t work to make a good hire (someone who performs well and stays). I’ve come to trust my judgment but only within the confines of a hiring process that covers my blind spots. Here’s a description of what I typically like to do, but bear in mind this is an amalgamation of processes from different employers.

To be clear: what I say in this blog post might not reflect current or previous hiring processes at my current or former employers. I'm presenting a mix of processes with the goal of giving you insight into one amalgamated hiring process and how one hiring manager thinks.

Principles - caution, excitement, 'no', and decency

The hiring process is fraught for both parties. We're both trying to decide if we want to spend extended amounts of time with each other. The hiring manager wants someone who will fit in, perform well, and will stay. The applicant wants to work in an environment that suits them and rewards them appropriately. No one enjoys the interviewing process and everyone wants to get what they want quickly. This suggests the first principle: caution. It's easy to make a mistake when the pressure is on and the thing that will save you is having a good process.

Once the process starts, I try and follow an ‘excite and select’ approach. I want to excite candidates by meeting the team and by the whole interview process and I want them to feel energized by what they experience. I then select from enthusiastic and excited candidates.

My default position is always ‘no’ at all stages. If I’m in doubt, I sleep on it and say ‘no’ the next day. On occasions, I’ve been under a great deal of pressure to make a hire, but this attitude has saved me from hiring the wrong person. Even in the US, unwinding a bad hiring decision is extremely painful, and in Europe, it can be almost impossible. It’s far better to be sure than take a risk. I’ve only changed my mind after a ‘no’ once, and that turned out to be a good decision that I stand by.

My next principle is being humane. The interview process is stressful and I want to treat candidates well and with respect at every stage. Even if they’ve been rejected, I want them to feel good about the process. Let's be honest, sometimes there's just a mismatch of skills - I've said no to some really great people.


(Interviews should be friendly and humane, not an interrogation or a stress test. Image credit: Noh Mun Duek, license: Creative Commons, source: Wikimedia Commons.)

The hiring process

The job ad

I like to think very carefully about the wording of the job ad. It has to excite and attract candidates, but it also has to be honest and clear about the job.

I've had a few candidates who've misunderstood the job and that's become clear at the screening interview. To stop this from happening, I've sometimes created a longer form job description I've sent to candidates we've selected for screening. The longer form description describes more about the role and provides some background about the company. Some candidates have withdrawn from the process after seeing the longer form description and that's OK - better for everyone to stop the process sooner if there's no match.

Resume selection

This is an art. Here are some of the factors I consider for technical positions.

  • A Github page is real plus. I check out the content.
  • Blog posts (personal or company) or content for marketing is a plus.
  • Mention of methods and languages. Huge shopping lists of languages are a bad sign. I also want to know what they've done with languages and methods.
  • Clear descriptions of what they've done, with a focus on the technical piece. I prefer straightforward language.
  • Training courses. Huge shopping lists are again a no for me.

I don't tend to select on the college someone went to but I know lots of organizations that do.

The screening interview

The first interview is a screening interview with me as the hiring manager. I do this via video call so I can get a sense of the person’s responses and their ability to interact. I always have a script for these calls and always follow the same process. I work out the areas I want to talk about and create the best questions I can to differentiate between candidates. The script gives me a more consistent (and fairer) way to compare candidates and also enables me to learn what works and what doesn’t. For example, if candidates find a question confusing, or everyone answers a question well, I can change the question. For behavioral questions, I ask for examples of the behavior and my technical questions are usually about experience. Here are some examples:

  • Can you give me an example of how you dealt with conflicting demands?
  • Can you tell me about a time you managed an underperforming employee?
  • What’s the largest program you’ve written?
  • What are the biggest limitations of Python?

These questions are launch points for deeper discussions.

The technical screen

Next comes a technical screening. Again, this must be the same for all candidates. It must be fair and allow for nervousness. 

I'm very careful about the technical questions that my interview team asks. I make sure that people are asked relevant questions that reveal the extent of their knowledge and skills. For example, if my team were interviewing someone for a machine learning position, I would ask about their use of key libraries (e.g. caret), but I wouldn't ask them about building SVMs or random forest models from scratch unless that's something they'd be doing.

Cultural fit and add

Finally, there are in-person interviews. I like to use teams of two where I can so two people can get a read. Any more than two and it starts to feel like an interrogation. Each team has a brief for the areas they want to probe and a list of questions they want to ask. 

Team selection is something of an art; I’ve known interviewers who are unable to say ‘no’ to any candidate, no matter how bad. If I have to include someone like this on the interview team, I’ll balance them with someone who can say no.  

I’ve heard of companies doing all-day interviews, but this seems like overkill to me and it stresses the interviewee; there’s a balance here between thoroughness and being human. For in-person interviews, I ensure that every team offers the candidate a drink or time out to visit the restroom. 

Where I can, I have the very last interview as a discussion with the candidate, asking them what went well and what went badly in the process. Sometimes candidates answer a question badly and use the discussion opportunity to better answer the question. Everyone makes mistakes and interviews are stressful, it seems like a good opportunity to offer the candidate a pause for reflection and an opportunity to correct errors.

I always look for the ability to work well with others and I value that over technical skills. A good technical person can always learn new technical skills, but it's very difficult to train someone not to be a jerk.

Decision making

Before we go to a decision, I find people the candidate may have interacted with who are not on the interview team. Many times, I’ve asked the receptionist how the candidate treated them. On one occasion, a candidate upset the receptionist so badly, they came to me and told me what had happened. It was an instant ‘no’ from that point.

To decide hire or no hire, I gather the interview teams together and we have a discussion about the candidate. If consensus exists to hire, most of the time I go ahead and make an offer but only after probing to make sure this is a considered opinion of everyone in the room. On a few occasions, I’ve overruled the group and said no. This happens when I think some factor is very important but the group hasn’t considered it well enough. If the decision is a uniform no, I don’t hire. I reserve the right to overrule the group, but it’s almost inconceivable I’d overrule a uniform no. If the view of the group is mixed, I probe those in favor and those against. In almost all cases where views are mixed, I say no - this is part of my default ‘no’ position.

The benefits

I know this process sounds regimented, but there are important benefits. The first is fairness for candidates; everyone is treated the same and there’s a consistent set of filters. The second is learning; if the process is wrong or has failed in some respect, we can fix it. Thirdly, the process is inclusive - the team has a huge say in who gets hired and who doesn’t.

If hiring and retaining good staff is important, then it’s important to have a fair, decent, and thorough hiring process. Through years of experience, I’ve honed my process and I’ve been pleased that the companies I’ve worked for have all had similar underlying processes and similar principles.

Good luck

If you"re searching for a job, I hope this post has given you some insight into a hiring process and what you have to do to succeed. Good luck to you.

Sunday, April 25, 2021

How to be creative like David Bowie, R.E.M. and Coldplay

Technical people are creative too

It's obvious that songwriters are creative, but technologists have to be creative problem solvers too. Many business technical problems are poorly defined at best and require a great deal of imagination just to get started.

(It's not just artists who have to be creative. Michelangelo, Public domain, via Wikimedia Commons.)

One of the big stumbling blocks for truly solving technology problems is local optimization. Imagine you're trying to reduce the processing time for a machine learning algorithm. You can focus on optimizing calculations, but would you be better off using a different algorithm altogether? Graphically, it looks something like this.

(Local and global optimization. Adapted from Wikimedia Commons. Author: Martynas Patasius. License: Creative Commons)

You end up focusing a great deal of effort for little gain when a more original approach might yield dividends.

This begs the question, how can you be more creative, how can you find more original approaches? How do songwriters and other creatives jump-start the creative process?

There's a technique I've found very useful: Oblique Strategies, but you have to be creative about how you use it.

Oblique Strategies

The painter Peter Schmidt and the musician Brian Eno first met in the late 1960s. They had a lot in common, including an interest in the creative process. Schmidt created a list of 55 quotes to overcome artistic blocks and Eno had been working on something similar, so in 1974 they combined their efforts to produce "Oblique Strategies".

Oblique Strategies was originally a set of phrases printed on cards. The idea was simple. If you're stuck in the creative process, view one of the random cards. Try to relate what's on the card to the problem at hand. Obviously, you'll have to jump through some creative hoops to do so, which is the whole point of the exercise.

Here are some example phrases from the cards:

  • Lowest common denominator
  • Turn it upside down
  • Always give yourself credit for having more than personality
(A card from the Oblique Strategies deck. Author: Bastiaan Terhorst, Source: Flickr, License: Attribution-NonCommercial-NoDerivs 2.0 Generic)

Who used it and what were the results?

In the 1970s, David Bowie moved to Berlin and, working with Brian Eno, recorded three albums, Low, “Heroes” and Lodger. You might be more familiar with some of the songs: Sound and Vision, Heroes, and Boys Keep Swinging. To keep the songwriting process going, the team used the Oblique Strategies deck to find new ways of thinking about music and lyrics.

(David Bowie, CBS Television, Public domain, via Wikimedia Commons)

R.E.M. even went so far as quoting some of the phrases on the cards in their music, in their song "What's the frequency, Kenneth?" they quote the card "withdrawal in disgust is not the same as apathy".

Moving to more recent times, Coldplay used the cards when recording their album Viva la Vida or Death and All His Friends, but that may be because Brian Eno was their producer. 

The best descriptions I've found for how musicians use the cards are blog bosts by Rosie Cass and Dave Dyment.

My use of oblique strategies

Oblique Strategies are not a daily tool for me. I only use the cards when I'm stuck and brainstorming with colleagues doesn't work. Because of commercial confidentiality, I'm not going to give you a detailed work problem example, instead, I'm going to give you a similar problem and how I might use the cards.

Imagine I'm trying to find missing data in a data set. I've used a statistical approach to detect missing data, but I'm stuck finding more, I know 30% of the data is missing, and I can characterize 45% of that 30%, but I don't know how to find the remaining 65%. What can I do? This is where I would turn to Oblique Strategies. The art is to use the words on the card as a starting point for your thinking.

My starting point is one of the online card generators, let's start with this one. I pull a card online:

  • It is quite possible (after all)

This makes me think about what the desired end state might be. Can I identify 100% of the missing data? What 70% be enough? Maybe 45% is the best I can hope for? Do the success criteria change over time? Are there different categories of missing data? Maybe I can only get a few categories?

You can see how the train of thought continues. Let's pull another card.  

  • In total darkness, or in a very large room, very quietly

This is harder. Darkness makes me think of the lack of data here. Maybe there are hints about the missing data in the data I have? The large room makes me think of the data collection process. How efficient is it and how does it work?

This card probably isn't as helpful, so let's pull in another card.

  • State the problem in words as clearly as possible

Hmmm. This is more direct, but it's probably good advice to write down the problem clearly. So clearly that someone else could work on the problem. Maybe by clarifying the problem I could see a solution.

I could go on, but I think you can see the point. The cards don't provide a solution, but they're a kind of shock to the mind to think about the problem in different ways. It breaks me out of the rut of thinking-as-usual. 

I've used the cards to solve several business-technical problems over the last few years. They've enabled me to tame some very difficult problems. Once you get the idea of how to use them, the process is straightforward - it can even turn a frustrating problem into an enjoyable thinking session.

Where to get the cards

There are several online card generators, chose the one you like best:

If you prefer the tactile feel of real cards, you can buy cards from these vendors, but the prices are high:

Creativity is a must

If you're quite early in your career, then turning to your manager and saying "I'm stuck" is acceptable and even expected behavior. But as the senior person, there might be no one else to turn to, and even worse, you have to solve the "I'm stuck, help me" requests from others. Obviously, experience is a great tutor, but even experience lets us down from time to time. We all need an occasional creativity boost and Oblique Strategies is one of the methods I use.

Sunday, April 18, 2021

A/B testing basics: ways of being right and wrong (frequentist version)

What are we trying to achieve?

In a typical A/B test, we're trying to find out if a change has a (positive) effect. For example, does changing the page layout increase the clickthrough rate? Despite what you've been told, we can't answer these types of questions with absolute certainty: the best we can do is provide a probable answer.  We use statistical best practices to map a probability to a pass/fail answer. 

In this blog post, I'm going to lay out some fundamentals to help you understand the process a statistician follows to translate a probabilistic result into a pass/fail result. 

A typical A/B test

To provide some focus for discussion, let's imagine we're testing to see if a discount on a website increases the rate of purchase. We'll have a control branch that doesn't have the discount and a treatment branch that has the discount. We'll measure conversion for both branches: \(c_T\) for the conversion for the treatment branch and \(c_C\) for the conversion for the control branch.

This kind of test is called a null hypothesis test. The null hypothesis here is that there is no difference, the alternate hypothesis is that there is a difference. We can write this as:
\[H_0: c_T  - c_C = 0\]
\[H_1: c_T - c_C \neq 0\]
There's something subtle here you need to know. The conversion rate we measure is an average conversion rate over many visitors, probably several thousand. Because of this, some very important mathematics kicks in, specifically something called the Central Limit Theorem. This theorem tells us our results will be normally distributed, in other words, \(c_T - c_C\) will be normally distributed, which is important as we'll see in a minute.

Types of error

I've blogged about null hypothesis tests before, so I'm only going to summarize things here. We can assume there's some underlying truth: either \(H_0\) or \(H_1\) is true. We don't know which is true and we're making an educated true/false guess. This gives us two ways of being right and two ways of being wrong. I've shown this in the table below.

    Null Hypothesis is
    True False
Decision about null hypothesis  Fail to reject True negative
Correct inference
Probability threshold= 1 - \( \alpha \)
False negative
Type II error
Probability threshold= \( \beta \)
Reject False positive
Type I error
Probability threshold = \( \alpha \)
True positive
Correct inference
Probability threshold = Power = 1 - \( \beta \)

We can't know for certain what the truth is, but we can define limits on our uncertainty. We can also define thresholds that will enable us to make reasonable pass/fail estimates. I'll show you how this works.

Assuming the null is true

The first step is to assume the null hypothesis is true, which means \( c_T  - c_C = 0\). As I explained earlier, the quantity \(c_T - c_C\) is normally distributed (this is a probability distribution, which I've blogged about before). We can compare our actual measurement of  \( c_T  - c_C\) to the theoretical distribution and ask how likely it is that the underlying value really is zero (in other words, what's the probability of the null being true?). 

Let me take a second to explain this some more. Imagine I'm trying to find out if a coin is biased. I throw it ten times and see six heads. Does this prove the coin is biased? No. It could be biased, but I don't have enough throws to say. Now imagine I've thrown the coin 100,000 times and I see 60,000 heads, does this prove bias? It's not absolutely sure, but it's highly likely the coin is biased. With statistics, we quantify this kind of analysis and set ground rules for what we consider evidence.

We can take our hypothetical A/B test and map the expected result to a standard normal distribution (very easy to do). Let's look at the standard normal distribution below, which plots a probability vs. a measurement value \(z\). Although it's true that all values are possible, the likelihood of some of them occurring is very low. For example, the probability of measuring a \(z\) value in the range \(-1  \leq z \leq 1\) is 0.68, but the probability of measuring a \(z\) value in the range \(1  \leq z \leq 3\) is only 0.16.



Certainty is impossible, but what we want to do is say whether a measurement means the null hypothesis is true or the alternate is. Put it another way, for a given measurement, how likely is it that the null is true or not? What's our threshold for acceptance/rejection? The standard procedure is to compare our measurement to the chart above. If our measurement falls in the blue zone on the chart we'll consider it means the null hypothesis is true. Anything that falls in the red zone, we'll consider the alternate is true. But we might be wrong - we can never have certainty. The size of the red area gives us the limits on our certainty. By convention, the red zones are 5% of the probability.

The standard limits we use are that we have to be in the 95% probability (blue) zone around zero to accept the null, and in the red 5% area to accept the alternate. This 5% threshold is usually called significance level and is given the symbol \(\alpha\). 

Using a threshold of 5% crudely speaking means we'll be wrong 5% of the time. Let's imagine a company running 100 tests in a year, this threshold means they'll be wrong in about 5 cases.

Surely this is enough? Surely we can now do this calculation and use \(\alpha\) to say pass/fail? No. We have assumed the null is true. But we also need to do the opposite and assume the alternate is true. 

Assuming the alternate is true

Now, we assume the alternate is true, that \( c_T - c_C \neq 0\). We can plot this out as a normal distribution too, but there's a difference. When we considered the null hypothesis to be true, we considered both sides of the normal curve, but here we only care about one side of the distribution. Remember, we're looking at the difference \( c_T - c_C \), so one side of the curve 'points' towards zero (no difference), and the other side points towards a bigger difference. We only care about the side that 'points' towards zero.

If there really is a difference, we expect a probability distribution like this below. We'll consider the alternate hypothesis to be true if our measurement lands in the blue zone, if it lands in the red zone, we'll reject the alternate. As before, the alternate could be true, and by chance, we could land in the red zone. The threshold value we'll use here is called \(\beta\). 


For reasons I won't go into, the threshold value is called the power of a test and is given by \(1-\beta\). Typical values of power range from 80% to 95%, but 80% is considered a minimum threshold. I'll have a lot more to say about power in another blog post.

Putting it together

Usually, the two charts I've shown you are shown looking like this. The sample sizes are chosen so that \(\alpha\) and \(\beta\) line up.



For our A/B test, here are the simplified steps in the process.

  1. Note the number of samples in each branch, in this case, the number of samples is the number of website visitors.
  2. Work out the conversion rate for the two branches and work out \( c_T - c_C \).
  3. Work out the probability of observing \( c_T - c_C \) if the null is true. (This is a simplification, we work out a p-value, which is the probability of observing a measurement greater than or equal to the measurement we're seeing).
  4. Compare the p-value to \(\alpha\). If \(p < \alpha\) then we reject the null hypothesis (we believe the treatment had an effect). If \(p > \alpha\) we accept the null hypothesis (we believe the treatment had no effect).
  5. Work out the probability of observing \( c_T - c_C \) if the alternate is true. This is the observed power. The observed power should be greater than about 80%. An observed power lower than about 80% means the test is unreliable.

How to fail

When people new to statistics get involved in A/B testing, they sometimes make the mistake of focusing only on confidence (and p-values). This gives them insight into false positives, but it says nothing at all about false negatives. To put it bluntly, this incorrect process puts all the emphasis on the risk of doing something, but none at all on the risk of doing nothing. This kind of focus also leads to tests that are too short to be reliable.

Let me put this another way. Significance is about protecting you from buying something that doesn't work. Power is about protecting you from not buying something that works.

Why not just set the thresholds higher?

The widths of the normal distributions I've shown depend on the number of samples. The more samples there are, the narrower the curve. The thresholds depend on the narrowness of the curve. To put it simply, increasing confidence and power mean increasing the number of samples in the test, which means a longer test. So all we need to do is increase the length of the test? Not so fast, the relationship isn't a linear one. Increasing power or significance by a few percentage points could double the length of the test depending on what the power and significance levels are.

Where do these thresholds come from?

The choice of a confidence value of 95% is arbitrary and comes from statistical standard practice. There's a fierce ongoing debate in the social sciences about whether this threshold is appropriate; an emerging view is that it's too lax a standard. In a recent paper in Nature, Benjamin et al [Benjamin] argued passionately that 99.5% is a better threshold. 

Something similar applies to power. The 'industry standard' is 80%, a figure with a far murkier background [Cohen]. In my view, using this figure of 80% is wrong in almost all cases. 80% is a minimum. I'll have a lot more to say about power in another blog post.

Eye of newt and toe of frog...

I've talked glibly about accepting and rejecting hypothesis. This is a deliberate simplification on my part. The true statistical language is "fail to reject the null hypothesis" and "reject the null hypothesis". There are good fundamental reasons for using this language, but if you're not a statistical person, it's very confusing. I've chosen a simplified version to make my point.

The process for deciding an A/B test reads like a witches' brew recipe rather than a scientific process. It's reliant on arbitrary thresholds, some difficult concepts, and confusing language. The null hypothesis test itself is a shot-gun marriage of techniques. Unsurprisingly, p-values are widely misinterpreted and misunderstood [Amrhein]. 

Fundamentally, the whole process is a witches' brew; it works, but it's not satisfying. 

Fortunately, there is an alternative view using a Bayesian approach which is simpler, and more enlightening. I'll talk about the Bayesian approach in another blog post. If the Bayesian approach is more satisfying, why did I show this (frequentist) approach here? Because this approach is what people are taught.

References

[Amrhein] Valentin Amrhein, Sander Greenland, Blake McShane, Scientists rise up against statistical significance, Nature 567, 305-307 (2019)

[Benjamin] Benjamin, D.J., Berger, J.O., Johannesson, M. et al. Redefine statistical significance. Nat Hum Behav 2, 6–10 (2018). https://doi.org/10.1038/s41562-017-0189-z

[Cohen] Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillside, NJ: Lawrence Erlbaum Associates.

Monday, April 12, 2021

Goodies and baddies: how a poor model of history lets us down

The wrong model to understand history

Some of my history teachers taught me the wrong model to understand history. To be fair to them, they were simplifying complex events for children. But as an adult, I've seen journalists use the same simple model to stoke outrage and twist the meaning of historical discussions.

I'm going to tell you what the broken model of history is, show you how a simple local legend blasts it apart, and how seductive and damaging the model can be.

The heroes or villains model

There's a lot of English history, a good deal of it is bloody, complex, and hard to understand. To simplify subtle stories for schoolchildren, books and teachers often boil down stories to a few core elements, reducing historical figures to stereotypical heroes or villains. Sometimes, this works well, for example, Hitler and Mussolini fall neatly into the villain category, but in most cases, it doesn't work at all as we'll see. Even worse, the heroes or villains model lends itself to a kind of uncritical patriotism: "Britons good, foreigners bad". 

A good example of this simplification is Winston Churchill, who's often uncritically portrayed by the British press as a hero. Some writers consider any criticism of Churchill as unpatriotic and an attempt to portray him as a villain (hero or villain - no space for something else). My contention is, the hero or villain model is the wrong model to understand Churchill, or indeed any other historical figure.

Because the Churchill story is so charged, I'm going to use another historical example and try and apply the heroes or villains model to it. My story involves the English Civil War, regicide, death squads, and a local Massachusetts legend.

(The English Civil War - who were the heroes and who were the villains? Unknown author, Public domain, via Wikimedia Commons)

The backstory

The regicides of Charles I of England

By the late 1630s, Charles I and Parliament were at loggerheads over who governed the country with money, power, and religion as the key issues. The disagreement broke into armed conflict starting in 1642 and the country fought a long and bloodthirsty civil war that the Parliamentary forces eventually won in 1648. 

(Three views of Charles I, Royal Collection, Public domain, via Wikimedia Commons)

The victorious Parliamentarians put Charles I on trial for his life. Of course, the guilty verdict was obvious. In January 1649, 59 commissioners (judges) signed the death warrant for Charles I, who was executed soon after. These 59 commissioners became known as the regicides.

Oliver Cromwell

After Charles I's execution, the country became a republic, run by Oliver Cromwell who took the title "Lord Protector". Cromwell was a transformative figure in British history, his rule was effective but bloody. In particular, his Irish military campaign was brutal, even for the time, and involved multiple massacres. He violently suppressed Catholicism in both England and Ireland. Even to this day, Cromwell's name is cursed in Ireland for what he did.

(Oliver Cromwell, After Samuel Cooper, Public domain, via Wikimedia Commons)

The Restoration

Cromwell died of natural causes in 1658. Soon after, Charles' I son, Charles II, swept back into power and England's experiment with republicanism was at an end. Expediency meant that Charles II pardoned many Parliamentarians, but there was no forgiveness for the 59 commissioners who signed Charles I's death warrant. Understandably, Charles II wanted vengeance on those who'd killed his father. The fate that awaited the regicides was worse than torture and execution; they were to be hung, drawn, and quartered. Knowing this, many of them fled, some to Europe, but some fled to the new colonies in America.

(Charles II, Peter Lely, Public domain, via Wikimedia Commons)

One of the regicides, William Goffe, escaped to the New England colonies, starting off in Boston, then moving to Connecticut and then to central Massachusetts. Charles II sent secret agents to track all the regicides down, in effect, they had a license to kill. If Goffe was caught, he might be killed on the spot if he was lucky, if he was unlucky, he could expect unbearable torture before he was executed. To avoid Charles's agents, Goffe went into hiding in the town of Hadley, Massachusetts.

The angel of Hadley

In 1675, Hadley, Massachusetts was a border town. English settlers were displacing Native American tribes which led to an armed conflict called King Philip's war. On one side were the English settlers, and on the other were Native American tribes.

Legend has it that on September 1st, 1675, Hadley was attacked by Native American forces. The villagers responded to fight off the attack, but having no military experience, their defense was ineffective. It looked as if the village would be lost and destroyed.

Out of the confusion of battle, an old white-haired man appeared and took charge - none of the villagers had seen him before. He rallied and organized the villagers into an effective battle formation, and together, they managed to fend off the attack. The white-haired angel saved the village, but he vanished as soon as the battle was won. He became known as the 'angel of Hadley' for saving the town.

(The angel of Hadley, Frederick Chapman (1818-1891), Public domain, via Wikimedia Commons)

After the battle, Goffe went back into hiding with good reason. Charles' agents were still looking for him. He's rumored to have died in New Haven in 1680, though there are other accounts of him living and dying elsewhere in New England.

Goodies and baddies

Let's apply the heroes or villains model to the actors in this story and see how it holds up.

Goffe saved Hadley from destruction, therefore he's a hero. But only from the point of view of the white settlers in Hadley. If you were a Native American opponent, Goffe stopped your forces from retaking land that was rightfully yours, so he's a villain. If you were a Royalist, then Goffe was a villain because he signed Charles I's death warrant, but if you were a Parliamentarian, he was a hero.

What about Cromwell? To many in England, he's a hero for his strong leadership and military prowess, but to the Irish, he's a villain; a bloodthirsty tyrant who massacred the population and violently suppressed Catholicism.

The heroes or villains model doesn't work at all for this story. In fact, it doesn't work for almost all of history. In most cases, it's a reductio ad absurdum, suitable only for young children. Heroes or villains might be too serious a name - I should really call it the goodies and baddies model of history.

History and patriotism

All countries have their national myths and national heroes. The goodies or baddies model is often uncritically applied to historical figures, with parts of the press bolstering the goodies or baddies model, shouting down those who disagree and accusing them of a lack of patriotism.

Let's take another English example. On June 7th, 2020, protestors in the English city of Bristol pulled down the statue of Edward Colston (1636-1721) and threw it into the harbor. They were objecting to Colston's involvement in the slave trade where he made most of his money. Later in life, Colston became a philanthropist who donated large sums of money to support schools, hospitals, and almshouses, especially in the Bristol area. Because of his philanthropy, the people of Bristol erected a statue of him in 1895. 

(Edward Colston statue, Simon Cobb, CC0, via Wikimedia Commons)

Is Colston a goodie or a baddie? If you look at his involvement in the slave trade, he's definitely a baddie. If you look only at his philanthropy, he's a goodie. But as you can tell, I think the goodies or baddies model doesn't work at all. Edward Colston was both and neither - reducing him to wholly goodie or baddie is absurd.

Similarly, the model breaks down for historical greats like Churchill, who did both good things and questionable things. By unthinkingly applying a goodie and baddie model, we're preventing ourselves from reaching a deeper and richer understanding of historical people and events.

The goodies and baddies model has a use though. It generates outrage which in turn helps sell newspapers and increase ratings. It's a handy culture war tool to beat your opposition with. Most of the press coverage of the Colston statue saga focused on politicians condemning the protestors (goodies and baddies again). But what about discussing whether the statue should be there at all in the 21st century? Any criticism of Churchill is often met with a fierce response but was Churchill always and in every decision and action a goodie? Is anyone? Outrage displaces critical thinking, which may be the point.

A better model

Rather than label people goodies or baddies, it's better to ask what and why. What caused the English Civil War? Why did the victorious Parliamentarians execute Charles I? Why did Goffe believe what he believed? Why was the Colston statue still in place in 2021 and why was it erected in the first place when his role in slavery was known? 

These are deeper and less emotional topics, but if we want to truly understand history, we need to move away from a simplistic goodies and baddies model to an understanding of the times people lived in and the rich complexities of their actions. No one is wholly good or bad, and we shouldn't expect them to be.