Why should you care about probability distributions?

Using the wrong probability distribution can be extremely expensive for businesses:

For businesses using machinery (factories, vehicles, aircraft, etc.), it can lead to parts being changed too frequently or too infrequently.
For businesses relying on returning customers, it can lead to substantial under- or overestimates of revenue and/or targeting the wrong customers with promotions.
For businesses forecasting future sales by territory and/or product, it can lead to poor territory allocation or poor product resource allocation.

Given that it's so important, what is a probability distribution, and what are some examples?

What's a probability distribution?

At its simplest, a probability distribution tells you how likely an outcome is given some input. For example, how does the probability of a sale vary with price, or how likely is a component to fail in the next month?

If something is certain to occur, the probability is 1; if it's certain not to occur, the probability is 0. Let's imagine a component lasts a maximum of 6 months before failure. Our probability distribution might show the probability of failure on each of days 1 to 180. The failure probabilities for all days must sum to 1.
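To make the "must sum to 1" rule concrete, here's a minimal Python sketch. The linear shape (failure getting more likely as the component ages) is purely my assumption for illustration; it isn't real component data.

```python
import math

DAYS = 180  # the component lasts at most 6 months (180 days)

# Hypothetical shape: the chance of failure grows linearly with age.
weights = list(range(1, DAYS + 1))
total = sum(weights)

# Normalize so the probabilities over all days sum to 1.
failure_prob = [w / total for w in weights]

# Probability the component fails at some point in the first 30 days.
p_first_30 = sum(failure_prob[:30])

print(f"Sum of all probabilities: {sum(failure_prob):.6f}")
print(f"P(failure in days 1-30):  {p_first_30:.4f}")
```

Whatever shape you pick, the normalization step is what turns a set of relative weights into a probability distribution.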

In the real world, data is noisy and we don't expect real data to exactly follow theoretical distributions, but given enough data, the match should be close enough for us to reason about what's going on.

Uniform distribution - gambling and manufacturing

If the probability is the same for all input values, the distribution is uniform.

Let's imagine we're manufacturing candy, and we want to have equal numbers of red, blue, green, black, and white sweets in a packet. In theory, each of the five colors should make up 20% of the sweets.

But in reality, there's random noise, so the observed counts will differ a little from color to color. We can quantify the difference between the expected distribution and the actual distribution, which tells us something about the variability in the manufacturing process.
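One common way to quantify that difference is a chi-squared statistic. The sketch below simulates a batch of sweets and compares the counts with the uniform expectation; the batch size and the choice of chi-squared are my assumptions, not something a particular factory uses.

```python
import random
from collections import Counter

COLORS = ["red", "blue", "green", "black", "white"]
N_SWEETS = 1000  # hypothetical number of sweets sampled

rng = random.Random(42)  # fixed seed so the run is repeatable
observed = Counter(rng.choices(COLORS, k=N_SWEETS))

expected = N_SWEETS / len(COLORS)  # uniform: same count for every color

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
chi_squared = sum((observed[c] - expected) ** 2 / expected for c in COLORS)

for color in COLORS:
    print(f"{color:>5}: observed {observed[color]:4d}, expected {expected:.0f}")
print(f"Chi-squared statistic: {chi_squared:.2f}")
```

With five colors there are four degrees of freedom, so a statistic above about 9.49 would be unusual (at the 5% level) if the process really were uniform.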

The uniform distribution also occurs in gambling, for example, lotteries or dice games.

Reading more

Uniform distribution description by NIST

Binomial distribution - pass/fail and conversion

Each customer who comes into a store or who visits a website will either buy or not buy, which we can turn into a conversion rate. We can model these kinds of pass/fail processes using the binomial distribution. Here's the probability distribution.

The binomial distribution shows us the probability of measuring different results given an underlying 'truth'. Let's imagine the 'true' conversion rate is 0.04. Because of sampling error, we might not measure exactly 0.04; instead, we might measure 0.045 or 0.055, depending on how many samples we take. It's important to understand what this means:

There is uncertainty in our measurement.
The smaller the sample, the bigger the uncertainty.

Although many technical people understand this, most non-technical people do not, which can lead to tension.
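Here's a small Python sketch of the "smaller sample, bigger uncertainty" point, using the 0.04 conversion rate from above. The sample sizes are hypothetical, and the "plus or minus two standard errors" rule is a normal approximation that is only rough for small samples.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

TRUE_RATE = 0.04  # the 'true' conversion rate from the text

def standard_error(n, p=TRUE_RATE):
    """Standard deviation of the measured conversion rate for n visitors."""
    return math.sqrt(p * (1 - p) / n)

# Roughly 95% of measurements fall within two standard errors of the truth.
for n in (100, 1_000, 10_000):
    se = standard_error(n)
    print(f"n={n:>6}: measured rate is usually within +/-{2 * se:.4f} of {TRUE_RATE}")

print(f"P(exactly 4 buyers out of 100 visitors) = {binomial_pmf(4, 100, TRUE_RATE):.3f}")
```

The uncertainty shrinks with the square root of the sample size, so you need 100 times as many visitors to get a measurement that is 10 times tighter.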

Reading more

Yale stats

Poisson distribution - waiting in line

Imagine you're a bank serving customers with ATMs at a location. ATMs are expensive, but you don't want to keep people waiting in long lines to do their transactions because it's bad for business. So how do you balance the cost of an ATM against its use? By modeling how many people use the ATM over a time period.

It turns out, the number of people who visit an ATM over a time period can be modeled using the Poisson distribution, which I've shown below. This gives us a way of assessing how much variation there might be in usage and therefore how many machines we might want to install.

The Poisson distribution is often used to model counting processes. It's very attractive because it has an unusual feature: the standard deviation of the distribution is \(\sqrt{\gamma}\), where \(\gamma\) is the mean. Unfortunately, this property makes it a little too attractive; it's sometimes used when it shouldn't be.
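A quick Python check of that property: the sketch below builds the Poisson probabilities directly from the formula and confirms that the standard deviation matches the square root of the mean. The figure of 12 visits per hour is a made-up number for illustration.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events when the average count is `mean`."""
    return math.exp(-mean) * mean**k / math.factorial(k)

MEAN_VISITS = 12  # hypothetical average ATM visits per hour

ks = range(0, 100)  # 100 terms is ample for a mean of 12
probs = [poisson_pmf(k, MEAN_VISITS) for k in ks]

mean = sum(k * p for k, p in zip(ks, probs))
variance = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))

print(f"mean = {mean:.3f}, standard deviation = {math.sqrt(variance):.3f}")
print(f"sqrt(mean) = {math.sqrt(MEAN_VISITS):.3f}")

# Chance of a busy hour with more than 20 visits.
p_busy = 1 - sum(probs[:21])
print(f"P(more than 20 visits in an hour) = {p_busy:.4f}")
```

The last number is the kind of figure you'd want for the ATM decision: how often demand exceeds what the installed machines can comfortably handle.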

Reading more

The Poisson Distribution and Poisson Process Explained

Exponential distribution

How long does a car battery last? How long do phone calls last? When will the next earthquake occur? Durations like these are often modeled with the exponential distribution, which is closely related to the Poisson distribution: if events arrive according to a Poisson process, the waiting time between them is exponentially distributed. I've shown this distribution below.
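The link between the exponential and Poisson distributions can be seen in a short simulation. This is a sketch under assumed numbers (3 events per hour is hypothetical): it draws exponentially distributed gaps between events, then counts how many events land in each hour.

```python
import random
import statistics

RATE = 3.0        # hypothetical average of 3 events per hour
N_EVENTS = 30_000

rng = random.Random(7)

# Gaps between events are exponentially distributed with mean 1 / RATE.
gaps = [rng.expovariate(RATE) for _ in range(N_EVENTS)]
print(f"mean gap = {statistics.fmean(gaps):.3f} hours (theory: {1 / RATE:.3f})")

# Turn the gaps into arrival times, then count events in each whole hour.
counts = {}
clock = 0.0
for gap in gaps:
    clock += gap
    hour = int(clock)
    counts[hour] = counts.get(hour, 0) + 1

n_hours = int(clock)  # only complete hours
per_hour = [counts.get(h, 0) for h in range(n_hours)]

# For Poisson counts, the mean and the variance should be about equal.
print(f"events per hour: mean = {statistics.fmean(per_hour):.2f}, "
      f"variance = {statistics.pvariance(per_hour):.2f}")
```

The hourly counts come out with mean and variance both close to the rate, which is the Poisson signature.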

Reading more

The exponential distribution

Power law distribution - finding fraud

How are incomes distributed in a population? How might you find fraud in the pattern of digits in expenses? It turns out that the first digits of invoice amounts are far from evenly spread: low digits such as 1 are much more common than high digits such as 9. This pattern is known as Benford's law and is closely related to power-law distributions. The chart below shows a generic power-law distribution - for fraud detection, it's 'flipped'.
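The usual description of first-digit frequencies is Benford's law, which says digit d should lead with probability log10(1 + 1/d). Here's a minimal sketch of a first-digit check; the invoice amounts are invented for illustration, and a real check would need far more data and a proper statistical test.

```python
import math

def benford_prob(digit):
    """Expected share of numbers whose first digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)

def first_digit(amount):
    """Leading non-zero digit of a positive number."""
    return int(f"{amount:e}"[0])

# Hypothetical invoice amounts, invented for illustration only.
invoices = [1204.50, 187.20, 96.00, 2310.75, 154.99, 3120.00, 118.40, 742.10]

observed = {d: 0 for d in range(1, 10)}
for amount in invoices:
    observed[first_digit(amount)] += 1

for d in range(1, 10):
    share = observed[d] / len(invoices)
    print(f"digit {d}: Benford {benford_prob(d):.3f}, observed {share:.3f}")
```

Made-up numbers tend to have first digits that are too evenly spread, which is why a large mismatch with the expected shares is a flag worth investigating.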

Reading more

Power law distribution

Normal distribution - almost everywhere, but not quite

What's the probability distribution for male soldiers' chest measurements? How are the results of A/B tests distributed? What about the distribution of measurement errors? All these, and many, many more, follow the normal distribution, which is also called the Gaussian distribution or the bell curve. If you only learn one distribution, this is the one to learn.

The properties of this distribution are extremely well-known, and every student of statistics and probability theory will know them. It's ubiquitous because of something called the Central Limit Theorem, which, simplifying a great deal, says that the sum (or average) of a large number of independent samples from almost any distribution is approximately normally distributed.
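You can watch the Central Limit Theorem at work with a few lines of Python. In this sketch (the sample sizes are arbitrary choices of mine), each value is the sum of 50 draws from a flat, decidedly non-normal distribution, yet the sums behave like a normal distribution.

```python
import random
import statistics

rng = random.Random(1)

SAMPLES_PER_SUM = 50   # how many uniform draws go into each sum
N_SUMS = 20_000

sums = [sum(rng.random() for _ in range(SAMPLES_PER_SUM)) for _ in range(N_SUMS)]

mean = statistics.fmean(sums)
stdev = statistics.pstdev(sums)

# For a normal distribution, about 68.3% of values lie within one
# standard deviation of the mean and about 95.4% within two.
within_1 = sum(abs(s - mean) <= stdev for s in sums) / N_SUMS
within_2 = sum(abs(s - mean) <= 2 * stdev for s in sums) / N_SUMS

print(f"mean = {mean:.2f} (theory: {SAMPLES_PER_SUM / 2:.2f})")
print(f"stdev = {stdev:.2f} (theory: {(SAMPLES_PER_SUM / 12) ** 0.5:.2f})")
print(f"within 1 stdev: {within_1:.3f}, within 2 stdev: {within_2:.3f}")
```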

Because it's everywhere, for some people, it's the only distribution they know. As the old saying goes, if you only have a hammer, every problem is a nail. This distribution can be over-used, with bad consequences.

Here's the distribution. It ought to look familiar.

Reading more

The normal distribution

Lognormal distribution

How long do visitors spend on web pages? What about the distribution of internet traffic? Or the distribution of city sizes? These all approximately follow a lognormal distribution that looks like the example below. The lognormal distribution is quite common in business.

Note the 'fat tail' or 'long tail' on the right-hand side. Many businesses have been caught out because they assumed sales or market risk followed a normal distribution when in fact they followed a lognormal distribution.

There's a variation of the Central Limit Theorem that yields lognormal distributions instead of normal distributions: it applies when many independent positive factors are multiplied together rather than added.
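The multiplicative version of the Central Limit Theorem is easy to sketch: multiply many independent positive factors together and the result is approximately lognormal, because the log of a product is a sum of logs. The factor range and counts below are hypothetical choices for illustration.

```python
import math
import random
import statistics

rng = random.Random(3)

N_FACTORS = 40      # hypothetical number of multiplicative growth steps
N_VALUES = 20_000

def random_product():
    """Multiply together many independent positive factors."""
    value = 1.0
    for _ in range(N_FACTORS):
        value *= rng.uniform(0.8, 1.25)
    return value

values = [random_product() for _ in range(N_VALUES)]
logs = [math.log(v) for v in values]

# The raw values are right-skewed: the long tail pulls the mean above the median.
print(f"values: mean = {statistics.fmean(values):.3f}, "
      f"median = {statistics.median(values):.3f}")

# The logs are roughly symmetric, as a normal distribution would be.
print(f"logs:   mean = {statistics.fmean(logs):.3f}, "
      f"median = {statistics.median(logs):.3f}")
```

The gap between the mean and the median of the raw values is the 'fat tail' showing up in the numbers.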

Reading more

Limpert, Eckhard, Werner A. Stahel, and Markus Abbt. "Log-normal distributions across the sciences: keys and clues" BioScience 51.5 (2001): 341-352.

Other distributions

There are lots and lots of different distributions. I saw a list of 90 the other day. Almost all of them are esoteric and apply in a very limited set of cases. You don't have to know all of them, but you should be aware that choosing the right distribution is important for making correct estimates. The distributions I've listed in this blog post are probably the most important, and you should know them and their properties.

As you asked nicely, here is a list of some distributions.

Alpha Distribution

Anglit Distribution

Arcsine Distribution

Beta Distribution

Beta Prime Distribution

Bradford Distribution

Burr Distribution

Burr12 Distribution

Cauchy Distribution

Chi Distribution

Chi-squared Distribution

Cosine Distribution

Double Gamma Distribution

Double Weibull Distribution

Erlang Distribution

Exponential Distribution

Exponentiated Weibull Distribution

Exponential Power Distribution

Fatigue Life (Birnbaum-Saunders) Distribution

Fisk (Log Logistic) Distribution

Folded Cauchy Distribution

Folded Normal Distribution

Fratio (or F) Distribution

Gamma Distribution

Generalized Logistic Distribution

Generalized Pareto Distribution

Generalized Exponential Distribution

Generalized Extreme Value Distribution

Generalized Gamma Distribution

Generalized Half-Logistic Distribution

Generalized Inverse Gaussian Distribution

Generalized Normal Distribution

Gibrat Distribution

Gompertz (Truncated Gumbel) Distribution

Gumbel (LogWeibull, Fisher-Tippett, Type I Extreme Value) Distribution

Gumbel Left-skewed (for minimum order statistic) Distribution

HalfCauchy Distribution

HalfNormal Distribution

Half-Logistic Distribution

Hyperbolic Secant Distribution

Gauss Hypergeometric Distribution

Inverted Gamma Distribution

Inverse Normal (Inverse Gaussian) Distribution

Inverted Weibull Distribution

Johnson SB Distribution

Johnson SU Distribution

KSone Distribution

KStwo Distribution

KStwobign Distribution

Laplace (Double Exponential, Bilateral Exponential) Distribution

Left-skewed Lévy Distribution

Lévy Distribution

Logistic (Sech-squared) Distribution

Log Double Exponential (Log-Laplace) Distribution

Log Gamma Distribution

Log Normal (Cobb-Douglas) Distribution

Log-Uniform Distribution

Maxwell Distribution

Mielke’s Beta-Kappa Distribution

Nakagami Distribution

Noncentral chi-squared Distribution

Noncentral F Distribution

Noncentral t Distribution

Normal Distribution

Normal Inverse Gaussian Distribution

Pareto Distribution

Pareto Second Kind (Lomax) Distribution

Power Log Normal Distribution

Power Normal Distribution

Power-function Distribution

R-distribution

Rayleigh Distribution

Rice Distribution

Reciprocal Inverse Gaussian Distribution

Semicircular Distribution

Student t Distribution

Trapezoidal Distribution

Triangular Distribution

Truncated Exponential Distribution

Truncated Normal Distribution

Tukey-Lambda Distribution

Uniform Distribution

Von Mises Distribution

Wald Distribution

Weibull Maximum Extreme Value Distribution

Weibull Minimum Extreme Value Distribution

Wrapped Cauchy Distribution

Continuous or discrete - shaken or stirred?

Some quantities are discrete and some are continuous. A discrete quantity is something like a sales territory (e.g. Germany, Ireland, Spain) or customer count (you can't have 0.5 of a customer). A continuous quantity can take any value in a range; for example, speed can be 45.2 kph, 120.01 kph, and so on. Some distributions apply to both continuous and discrete quantities, and some apply only to one or the other. To muddy the waters, sometimes continuous distributions are used to approximately model discrete quantities.
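As a sketch of a continuous distribution standing in for a discrete one, the Python below compares an exact binomial calculation (a count of buyers) with a normal approximation. The visitor count and threshold are hypothetical; the 0.04 rate is the one used in the conversion example earlier.

```python
import math
from statistics import NormalDist

N_VISITORS = 1_000   # hypothetical number of website visitors
RATE = 0.04          # conversion rate from the binomial example
MAX_BUYERS = 50      # hypothetical threshold of interest

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Exact (discrete): add up the probability of 0, 1, ..., 50 buyers.
exact = sum(binomial_pmf(k, N_VISITORS, RATE) for k in range(MAX_BUYERS + 1))

# Approximate (continuous): a normal curve with the same mean and spread.
mean = N_VISITORS * RATE
stdev = math.sqrt(N_VISITORS * RATE * (1 - RATE))
# The extra 0.5 is a continuity correction for the discrete count.
approx = NormalDist(mean, stdev).cdf(MAX_BUYERS + 0.5)

print(f"P(at most {MAX_BUYERS} buyers): exact = {exact:.4f}, normal approx = {approx:.4f}")
```

The two answers are close but not identical, which is the price of the approximation.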

Business examples

Vehicles

Imagine you're running a delivery vehicle fleet. You need to keep your vehicles on the road, but you need to keep an eye on maintenance costs. You decide to use math to guide your decisions, so you work out the average lifetime for different components. You have two components, A and B, with the same average lifetime in miles. If either component fails, you have to tow the vehicle, which is very expensive.

Component A. Average lifetime is 150,000 miles.
Component B. Average lifetime is 150,000 miles.

A vehicle comes in for maintenance with 149,000 miles on the odometer. Should you replace components A and B?

As you might expect, there's a gotcha. Without knowing the probability distribution for failures, we can't make these decisions. For example, a windshield might have a uniform failure rate distribution, with the probability of failure for miles 0-100 the same as the probability of failure for miles 100,000-100,100. A clutch may have a failure rate that increases with mileage, so the probability of failure at miles 100,000-100,100 is much higher than the probability of failure at miles 0-100. Because we know what a clutch and a windshield are, we might decide to replace the clutch and leave the windshield. But what if A and B were a serpentine belt and a heat shield?

The only way to make rational decisions is to understand what distribution the probability of failure follows, which may well be very different for different components (e.g. car seats vs. tires).
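Here's a sketch of how two components with the same 150,000-mile average lifetime can call for different decisions. I'm using the Weibull distribution as a stand-in because its shape parameter switches between a constant failure rate and a wear-out pattern; the shape values and the 10,000-mile service interval are my assumptions, not measured data.

```python
import math

MEAN_LIFE = 150_000   # average lifetime in miles, from the example
ODOMETER = 149_000    # current mileage
NEXT_MILES = 10_000   # hypothetical distance until the next service

def weibull_scale(shape, mean=MEAN_LIFE):
    """Scale parameter that gives a Weibull distribution the requested mean."""
    return mean / math.gamma(1 + 1 / shape)

def survival(miles, shape):
    """Probability that the part is still working after `miles`."""
    return math.exp(-((miles / weibull_scale(shape)) ** shape))

def prob_fail_before_next_service(shape):
    """P(fails within NEXT_MILES | it has survived to ODOMETER)."""
    return 1 - survival(ODOMETER + NEXT_MILES, shape) / survival(ODOMETER, shape)

# shape = 1: constant failure rate (the exponential distribution).
# shape = 4: failure rate climbs steeply with mileage (wear-out).
for shape in (1, 4):
    print(f"shape={shape}: P(fail in next {NEXT_MILES:,} miles) = "
          f"{prob_fail_before_next_service(shape):.3f}")
```

Under these assumptions, the constant-rate part has roughly a 6% chance of failing before the next service, while the wear-out part has roughly an 18% chance, even though both have the same average lifetime.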

Marketing

A new analyst is studying the market for luxury goods in Germany. They have partial data for the fraction of the population that has a certain income. Using what they have, they assume their data is normally distributed, and they make a forecast for the fraction of the population that will have an income high enough to afford luxury items. Do you think their forecast will be too low, just right, or too high?

Incomes are usually log-normally distributed, so the analyst, in this case, has chosen the wrong distribution. Because the lognormal has a very long right tail, the analyst's estimate is likely to be an underestimate and may be substantially out. A competitor might not make the same mistake.
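To put rough numbers on the analyst's mistake, the sketch below compares the high-income fraction under a lognormal distribution with the fraction under a normal distribution that has the same mean and standard deviation. Every figure here (median income, spread, luxury threshold) is hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist

# Hypothetical income distribution, invented for illustration.
MEDIAN_INCOME = 40_000
LOG_SIGMA = 0.6            # spread of log(income)
LUXURY_THRESHOLD = 150_000

mu = math.log(MEDIAN_INCOME)

# Mean and standard deviation of this lognormal distribution.
mean = math.exp(mu + LOG_SIGMA**2 / 2)
stdev = mean * math.sqrt(math.exp(LOG_SIGMA**2) - 1)

# Lognormal: log(income) is normal, so work on the log scale.
lognormal_tail = 1 - NormalDist(mu, LOG_SIGMA).cdf(math.log(LUXURY_THRESHOLD))

# The analyst's mistake: a normal distribution with the same mean and stdev.
normal_tail = 1 - NormalDist(mean, stdev).cdf(LUXURY_THRESHOLD)

print(f"lognormal: {lognormal_tail:.4%} of people earn above the threshold")
print(f"normal:    {normal_tail:.4%} of people earn above the threshold")
```

With these made-up inputs, the normal assumption gives a luxury market many times smaller than the lognormal one.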

Takeaways

I've interviewed people who claim data science on their resumes, but only know the normal distribution. If you assume your data is normal, when in reality it's log-normal or Poisson, things are going to go badly wrong for you. Any analyst in business needs to be very comfortable with different distributions and needs to know which may be applicable and when.

Engora Data Blog

Sunday, December 13, 2020

What's a probability distribution?
