Survivor bias occurs when we look at the survivors of some process and we try to deduce something about their commonality without considering external factors. An obvious example might be collating details on lottery winners’ lives in an attempt to figure out what factors might lead to winning the lottery. For example, I might study what day of the week and time of day winners bought their tickets, I might look at where winning tickets were bought, and I might look at the age and gender of winners. From this, I might conclude that to improve my chances of winning the lottery I need to be a 45-year-old man who buys tickets at 3:40pm on Wednesday afternoon at a gas station. But the lottery is a random process and all we've done is analyze who's playing, not the causes of winning. Put like this, it seems almost incredible that anyone could have problems with survivor bias, but survivor bias doesn’t always manifest itself in obvious ways.
Let’s imagine I want to write a bestselling business book unravelling the secrets of how to win at business. I could select businesses that have been successful over several years and look for factors they have in common. I might call my book “Building excellent businesses that last”. As you surely know, there have been several bestselling books based on this premise. Unfortunately, they age like milk, it turns out that most of the companies these books identify as winners subsequently performed poorly - which is a regression to the mean. The problem is, other factors may have contributed to these businesses' success, for example, the competitive environment, new product innovation, a favorable economy and so on. Any factors I derived from commonalities between winning companies today are just like an analysis of the common factors of lottery winners. By focusing on (current) winners, the door is open to survivor bias [Shermer, Jones].
The most famous example of survivor bias is Wald and the bombers. It is a little cliched to tell the story, but it’s such a great story, I’m going to tell it again, but my way.
Abraham Wald (1902-1950) was a great mathematician who made contributions to many fields, including econometrics, statistics, and geometry. A Hungarian Jew, he suffered discrimination and persecution while looking for work in Vienna, Austra in 1938, and so emigrated with his family to New York. During World War II, he worked in the Statistical Research Group at Columbia University. This is where he was given the task of improving bomber survivability; where should armor go to best protect bombers given that armor is heavy and not everywhere can be protected [Wallis]?
Not all bombers came home after bombing runs over Nazi-occupied Europe. Nazi fighter planes attacked bombers on the way out and the way back, and of course, they shot down many planes. To help his analysis, Wald had data on where surviving planes were hit. The image below is a modern simulation of the kind of data he had to work with; to be clear, this is not the exact data Wald had, it’s simulated data. The visualization shows where the bullet holes were on returning planes. If you had this data, where would you put the extra armor to ensure more planes came home?
(Simulated data on bomber aircraft bullet holes. Source: Wikipedia - McGeddon, License: Creative Commons)
Would you say, put the extra armor where the most bullet holes are? Doesn’t that seem the most likely answer?
This is the most famous example of survivor bias - and it’s literally about survival. Wald made the reasonable assumption that bullets would hit the plane randomly, remember, this is 1940’s technology and aerial combat was not millimeter precision. This means the distribution of bullet holes should be more or less even on planes. The distribution he saw was not even - there were few bullet holes in the engine and cockpit, but he was looking at surviving planes. His conclusion was, planes that were hit in key places did not survive. Look at the simulated visualization above - did you notice the absence of bullet holes in the engine area? If you got hit in the engine, you didn’t come home. This is the equivalent of the dog that didn’t bark in the night. The conclusion was of course to armor the places where there were not bullet holes.
A full appreciation of survivor bias will mean you're more skeptical of many self-help books. A lot of them proceed on the same lines: let's take some selected group of people, for example, successful business people or sports people, and find common factors or habits. By implication, you too can become a winning athlete or business person or politician just by adopting these habits. But how many people had all these habits or traits and didn't succeed? All Presidents breathe, but if you breathe, does this mean you'll become President? Of course, this is ludicrous, but many self-help books are based on similar assumptions, it's just harder to spot.
Survivor bias manifests itself on the web with ecommerce. Users visit websites and make purchases or not. We can view those who make purchases as survivors. One way of increasing the number of buyers (survivors) is to focus on their commonalities, but as we’ve seen, this can give us biased results - or even the wrong result. A better way forward may be to focus on the selection process (the web page) and understand how that’s filtering users - in other words, understanding why people did not buy.
One of the things I like about statistics in business is that correctly applying what seems like esoteric ideas can lead to real monetary impact, and survivor bias is a great example.