Russian novels and business decisions
What has the opening sentence of a 19th-century Russian novel got to do with quantitative business decisions in the 21st century? Read on and I'll tell you what the link is and why you should be aware of it when you're interpreting business data.
The novel is Leo Tolstoy's 'Anna Karenina' and the opening line is: "All happy families are alike; each unhappy family is unhappy in its own way". Here's my take on what this means. For a family to be happy, many conditions have to be met, which means that happy families are all very similar. Many things can lead to unhappiness, either on their own or in combination, which means there's more diversity in unhappy families. So how does this apply to business?
The Anna Karenina bias is a form of survivor bias, which is, in turn, a form of selection bias. Survivor bias is the bias introduced by concentrating on the survivors of some selection process and ignoring those that did not. The famous story of Wald and the bombers is, in my view, the best example of survivor bias. If Wald had focused on the surviving bombers, he would have recommended putting armor in the wrong place.
When we look at the survivors of some selection process, they will necessarily be more alike than non-survivors because of the selection process (unhappy families vs. happy families). Let me give you an example, buying groceries on the web. Imagine a group of people surfing a grocery store. Some won't buy (unhappy families), but some will (happy families). To buy, you have to find an item you want to buy, you have to have the money, you have to want to buy now, and so on. This selection process will give a group of people who are very similar in a number of dimensions - they will exhibit less variability than the non-purchasers.
Some factors will be important to a purchaser's decision and other factors might not be. In the purchaser group, we might expect to see more variation in factors that aren't important to the buying decision and less variation in factors that are. To quote Shugan [Shugan]:
"Moreover, variables exhibiting the highest levels of variance in survivors might be unimportant for survival because all observed levels of those variables have resulted in survival. One implication is a possible inverse correlation between the importance of a variable for survival and the variable’s observed variability"
In the opinion poll world, the Anna Karenina bias rears its ugly head too. Pollsters often use robocalls to try and reach voters. To successfully record an opinion, the call has to go through, it has to be answered, and the person has to respond to the survey questions. This is a selection process. Opinion pollsters try and correct for biases, but sometimes they miss them. If the people who respond to polls exhibit less variability than the general population on some key factor (e.g. education), then the poll may be biased.
In my experience, most forms of B2C data analysis can be viewed as a selection process, and the desired outcomes of most analysis is figuring out the factors that lead to survival (in other words, what made people buy). The Anna Karenina bias warns us that some of the observed factors might be unimportant for survival and gives us a way of trying to understand which factors are relevant.
If you're analyzing business data, here's what to be aware of:
- Don't just focus on the survivors, you need to look at the non-survivors too.
- Survivors will all tend to look the same - there will be less variability among survivors than among non-survivors.
- Survivors may look the same on many factors, only some of which may be relevant.
- The factors that vary the most among survivors might be the least important.