Showing posts with label survivor bias. Show all posts
Showing posts with label survivor bias. Show all posts

Sunday, April 5, 2020

Sherlock Holmes, business books, Nazi fighters, and survivor bias

In the Sherlock Holmes story, Silver Blaze, Holmes solved the case by a deduction from something that didn’t happen. In the story, the dog didn’t bark, implying the dog knew the horse thief. This a neat twist on something called survivor bias, the best example of which is Abraham Wald’s analysis of surviving bombers. I’ll talk about how survivor bias rears its ugly head and tell you about Wald and what he did.

Survivor bias occurs when we look at the survivors of some process and we try to deduce something about their commonality without considering external factors. An obvious example might be collating details on lottery winners’ lives in an attempt to figure out what factors might lead to winning the lottery. For example, I might study what day of the week and time of day winners bought their tickets, I might look at where winning tickets were bought, and I might look at the age and gender of winners. From this, I might conclude that to improve my chances of winning the lottery I need to be a 45-year-old man who buys tickets at 3:40pm on Wednesday afternoon at a gas station. But the lottery is a random process and all we've done is analyze who's playing, not the causes of winning. Put like this, it seems almost incredible that anyone could have problems with survivor bias, but survivor bias doesn’t always manifest itself in obvious ways.

Let’s imagine I want to write a bestselling business book unraveling the secrets of how to win at business. I could select businesses that have been successful over several years and look for factors they have in common. I might call my book “Building excellent businesses that last”. As you surely know, there have been several bestselling books based on this premise. Unfortunately, they age like milk; it turns out that most of the companies these books identify as winners subsequently performed poorly - which is a regression to the mean. The problem is, other factors may have contributed to these businesses' success, for example, the competitive environment, new product innovation, a favorable economy, and so on. Any factors I derived from commonalities between winning companies today are just like an analysis of the common factors of lottery winners. By focusing on (current) winners, the door is open to survivor bias [Shermer, Jones].

The most famous example of survivor bias is Wald and the bombers. It is a little cliched to tell the story, but it’s such a great story, I’m going to tell it again, but my way.

Abraham Wald (1902-1950) was a great mathematician who made contributions to many fields, including econometrics, statistics, and geometry. A Hungarian Jew, he suffered discrimination and persecution while looking for work in Vienna, Austra in 1938, and so emigrated with his family to New York. During World War II, he worked in the Statistical Research Group at Columbia University. This is where he was given the task of improving bomber survivability; where should armor go to best protect bombers given that armor is heavy and not everywhere can be protected [Wallis]?

Not all bombers came home after bombing runs over Nazi-occupied Europe. Nazi fighter planes attacked bombers on the way out and the way back, and of course, they shot down many planes. To help his analysis, Wald had data on where surviving planes were hit. The image below is a modern simulation of the kind of data he had to work with; to be clear, this is not the exact data Wald had, it’s simulated data. The visualization shows where the bullet holes were on returning planes. If you had this data, where would you put the extra armor to ensure more planes came home?


(Simulated data on bomber aircraft bullet holes. Source: Wikipedia - McGeddon, License: Creative Commons)

Would you say, put the extra armor where the most bullet holes are? Doesn’t that seem the most likely answer?

Wrong.

This is the most famous example of survivor bias - and it’s literally about survival. Wald made the reasonable assumption that bullets would hit the plane randomly, remember, this is 1940’s technology and aerial combat was not millimeter precision. This means the distribution of bullet holes should be more or less even on planes. The distribution he saw was not even - there were few bullet holes in the engine and cockpit, but he was looking at surviving planes. His conclusion was, planes that were hit in key places did not survive. Look at the simulated visualization above - did you notice the absence of bullet holes in the engine areas? If you got hit in an engine, you didn’t come home. This is the equivalent of the dog that didn’t bark in the night. The conclusion was of course to armor the places where there were not bullet holes.

A full appreciation of survivor bias will mean you're more skeptical of many self-help books. A lot of them proceed on the same lines: let's take some selected group of people, for example, successful business people or sports people, and find common factors or habits. By implication, you too can become a winning athlete or business person or politician just by adopting these habits. But how many people had all these habits or traits and didn't succeed? All Presidents breathe, but if you breathe, does this mean you'll become President? Of course, this is ludicrous, but many self-help books are based on similar assumptions, it's just harder to spot.

Survivor bias manifests itself on the web with e-commerce. Users visit websites and make purchases or not. We can view those who make purchases as survivors. One way of increasing the number of buyers (survivors) is to focus on their commonalities, but as we’ve seen, this can give us biased results, or even the wrong result. A better way forward may be to focus on the selection process (the web page) and understand how that’s filtering users; in other words, understanding why people didn't buy.

One of the things I like about statistics in business is that correctly applying what seems like esoteric ideas can lead to real monetary impact, and survivor bias is a great example.

Saturday, February 8, 2020

The Anna Karenina bias

Russian novels and business decisions

What has the opening sentence of a 19th-century Russian novel got to do with quantitative business decisions in the 21st century? Read on and I'll tell you what the link is and why you should be aware of it when you're interpreting business data.

Anna Karenina

The novel is Leo Tolstoy's 'Anna Karenina' and the opening line is: "All happy families are alike; each unhappy family is unhappy in its own way". Here's my take on what this means. For a family to be happy, many conditions have to be met, which means that happy families are all very similar. Many things can lead to unhappiness, either on their own or in combination, which means there's more diversity in unhappy families. So how does this apply to business?

Leo Tolstoy's family
(Leo Tolstoy's family. Do you think they were happy? Image source: Wikimedia Commons. License: Public Domain)

Survivor bias

The Anna Karenina bias is a form of survivor bias, which is, in turn, a form of selection bias. Survivor bias is the bias introduced by concentrating on the survivors of some selection process and ignoring those that did not. The famous story of Wald and the bombers is, in my view, the best example of survivor bias. If Wald had focused on the surviving bombers, he would have recommended putting armor in the wrong place.

When we look at the survivors of some selection process, they will necessarily be more alike than non-survivors because of the selection process (unhappy families vs. happy families).  Let me give you an example, buying groceries on the web. Imagine a group of people surfing a grocery store. Some won't buy (unhappy families), but some will (happy families). To buy, you have to find an item you want to buy, you have to have the money, you have to want to buy now, and so on. This selection process will give a group of people who are very similar in a number of dimensions - they will exhibit less variability than the non-purchasers.

Some factors will be important to a purchaser's decision and other factors might not be. In the purchaser group, we might expect to see more variation in factors that aren't important to the buying decision and less variation in factors that are. To quote Shugan [Shugan]:

"Moreover, variables exhibiting the highest levels of variance in survivors might be unimportant for survival because all observed levels of those variables have resulted in survival. One implication is a possible inverse correlation between the importance of a variable for survival and the variable’s observed variability"

In the opinion poll world, the Anna Karenina bias rears its ugly head too. Pollsters often use robocalls to try and reach voters. To successfully record an opinion, the call has to go through, it has to be answered, and the person has to respond to the survey questions. This is a selection process. Opinion pollsters try and correct for biases, but sometimes they miss them. If the people who respond to polls exhibit less variability than the general population on some key factor (e.g. education), then the poll may be biased.

In my experience, most forms of B2C data analysis can be viewed as a selection process, and the desired outcomes of most analysis is figuring out the factors that lead to survival (in other words, what made people buy). The Anna Karenina bias warns us that some of the observed factors might be unimportant for survival and gives us a way of trying to understand which factors are relevant.



Leo Tolstoy in 1897. (Image credit: Wikipedia. Public domain image.)

The takeaways

If you're analyzing business data, here's what to be aware of:

  • Don't just focus on the survivors, you need to look at the non-survivors too.
  • Survivors will all tend to look the same - there will be less variability among survivors than among non-survivors. 
  • Survivors may look the same on many factors, only some of which may be relevant.
  • The factors that vary the most among survivors might be the least important.

References

[Shugan] "The Anna Karenina Bias: Which Variables to Observe?", Marketing Science, Vol. 26, No. 2, March–April 2007, pp. 145–148