Independence in probability

Why should I care about independence?

Many models in the finance industry and elsewhere assume events are independent. When this assumptions fails, catastrophic losses can occur as we saw in 2008 and 1992. The problem is, developers and data scientists assume independence because it greatly simplifies problems, but the executive team often don't know this has happened, or even worse, don't understand what it means. As a result, the company ends up being badly caught out when circumstances change and independence no longer applies.

(Sergio Boscaino from Busseto, Italy, CC BY 2.0 , via Wikimedia Commons)

In this post, I'm going to explain what independence is, why people assume it, and how it can go spectacularly wrong. I'll provide a some guidance for managers so they know the right questions to ask to avoid disaster. I've pushed the math to the end, so if math isn't your thing, you can leave early and still get benefit.

What is independence?

Two events are independent if the outcome of one doesn't affect the other in any way. For example, if I throw two dice, the probability of me throwing a six on the second die isn't affected in any way by what I throw on the first die.

Here are some examples of independent events:

Throwing a coin and getting a head, throwing a dice and getting a two.
Drawing a king from a deck of cards, winning the lottery having bought a ticket.
Stopping at at least one red light on my way to the store, rain falling two months from now.

By contrast, some events are not independent (they're dependent):

Raining today and raining tomorrow. Rain today increases the chances of rain tomorrow.
Heavy snow today and a football match being played. Heavy snow will cause the match to be postponed.
Drawing a king from a deck of cards, then without replacing the card, drawing a king on the second draw.

Why people assume independence

People assume independence because the math is a lot, lot simpler. If two events are dependent, the analyst has to figure out the relationship between them, something that can be very challenging and time consuming to do. Other than knowing there's a relationship, the analyst might have no idea what it is and there may be no literature to guide them. For example, we know that smoking increases the risk of lung cancer (and therefore a life insurance claim), so how should an actuary price that risk? If they price it too low, the insurance company will pay out too much in claims, if they price it too high, the insurance company will lose business to cheaper competitors. In the early days when the link between smoking and cancer was discovered, how could an actuary know how to model the relationship?

Sometimes, analysts assume independence because they don't know any better. If they're not savvy about probability theory, they may do a simple internet search on combining probabilities that will suggest all they have to do is multiple probabilities, which is misleading at best. I believe people are making this mistake in practice because I've interviewed candidates with MS degrees in statistics who made this kind of blunder.

Money and fear can also drive the choice to assume independence. Imagine you're an analyst. Your manager is on your back to deliver a model as soon as possible. If you assume independence, your model will be available on time and you'll get your bonus, if you don't, you won't hit your deadline and you won't get your bonus. Now imagine the bad consequences of assuming independence won't be visible for a while. What would you do?

Harder examples

Do you think the following are independent?

Two unrelated people in different towns defaulting on their mortgage at the same time

Houses in different towns suffering catastrophic damage (e.g. fire, flood, etc.)

Most of the time, these events will be independent. For example, a house burning down because of poor wiring doesn't tell you anything about the risk of a house burning down in a different town (assuming a different electrician!). But there are circumstances when the independence assumption fails:

A hurricane hits multiple towns at once causing widespread catastrophic damage in different insurance categories (e.g. hurricane Andrew in 1992).

A recession hits, causing widespread lay-offs and mortgage defaults, especially for sub-prime mortgages (2008).

Why independence fails

Prior to 1992, the insurance industry had relatively simple risk models. They assumed independence; an assumption that worked well for some time. In an average year, they knew roughly how many claims there would be for houses, cars etc. Car insurance claims were independent of house insurance claims that in turn were independent of municipal and corporate insurance claims and so on.

When hurricane Andrew hit Florida in 1992, it destroyed houses, cars, schools, hospitals etc. across multiple towns. The assumption of independence just wasn't true in this case. The insurance claims were sky high and bankrupted several companies.

(Hurricane Andrew, houses destroyed in Dade County, Miami. Image from FEMA. Source: https://commons.wikimedia.org/wiki/File:Hurricane_andrew_fema_2563.jpg)

To put it simply, the insurance computer models didn't adequately model the risk because they had independence baked in.

Roll forward 15 years and something similar happened in the financial markets. Sub-prime mortgage lending was build on a set of assumptions, including default rates. The assumption was, mortgage defaults were independent of one another. Unfortunately, as the 2008 financial crisis hit, this was no longer valid. As more people were laid-off, the economy went down, so more people were laid-off. This was often called contagion but perhaps a better analogy is the reverse of a well known saying: "a rising tide floats all boats".

(Image credit: Secret London 123, CC BY-SA 2.0, via Wikimedia Commons)

The assumption of independence simplified the analysis of sub-prime mortgages and gave the results that people wanted. The incentives weren't there to price in risk. Imagine your company was making money hand over fist and you stood up and warned people of the risks of assuming independence. Would you put your bonus and your future on the line to do so?

What to do - recommendations

Let's live in the real world and accept that assuming independence gets us to results that are usable by others quickly.

If you're a developer or a data scientist, you must understand the consequences of assuming independence and you must recognize that you're making that assumption. You must also make it clear what you've done to your management.

If you're a manager, you must be aware that assuming independence can be dangerous but that it gets results quickly. You need to ask your development team about the assumptions they're making and when those assumptions fail. It also means accepting your role as a risk manager; that means investing in development to remove independence.

To get results quickly, it may well be necessary for an analyst to assume independence. Once they've built the initial model (a proof of concept) and the money is coming in, then the task is to remove the independence assumption piece-by-piece. The mistake is to stop development.

The math

Let's say we have two events, A and B, with probabilities of occurring P(A) and P(B).

If the events are independent, then the probability of them both occurring is:

\[P(A \ and \ B) = P(A \cap B) = P(A) P(B)\]

This equation serves as both a definition of independence and test of independence as we'll see next.

Let's take two cases and see if they're independent:

Rolling a dice and getting a 1 and a 2
Rolling a dice and getting a (1 or 2) and (2, 4, or 6)

For case 1, here are the probabilities:

\(P(A) = 1/6\)
\(P(B) = 1/6\)
\(P(A \cap B) = 0\), it's not possible to get 1 and 2 at the same time
\(P(A )P(B) = (1/6) * (1/6)\)

So the equation \(P(A \ and \ B) = P(A \cap B) = P(A) P(B)\) isn't true, therefore the events are not independent.

For case 2, here are the probabilities:

\(P(A) = 1/3\)
\(P(B) = 1/2\)
\(P(A \cap B) = 1/6\)
\(P(A )P(B) = (1/2) * (1/3)\)

So the equation is true, therefore the events are independent.

Dependence uses conditional probability, so we have this kind of relationship:

\[P(A \ and \ B) = P(A \cap B) = P(A | B) P(B)\]

The expression \(P(A | B)\) means the probability of A given that B has occurred (e.g the probability the game is canceled given that it's snowed). There are a number of ways to approach finding \(P(A | B)\), the most popular over the last few years has been Bayes' Theorem which states:

\[P(A | B) = \frac{P(B | A) P(A)}{P(B)}\]

There's a whole methodology that goes with the Bayesian approach and I'm not going to go into it here, except to say that it's often iterative; we make an initial guess and progressively refine it in the light of new evidence. The bottom line is, this process is much, much harder and much more expensive than assuming independence.

Engora Data Blog

Thursday, February 13, 2025

Why assuming independence can be very, very bad for business