Monday, May 17, 2021

Counting on Poisson

Why use the Poisson distribution?

Because it has properties that make it great to work with, data scientists use the Poisson distribution to model different kinds of counting data. But these properties can be seductive, and sometimes people model data using the Poisson distribution when they shouldn't. In this blog post, I'll explain why the Poisson distribution is so popular and why you should think twice before using it.

(Siméon-Denis Poisson by E. Marcellot, Public domain, via Wikimedia Commons)

Poisson processes

The Poisson distribution is a discrete event probability distribution used to model events created using a Poisson process. Drilling down a level, a Poisson process is a series of events that have these properties:

  • They occur at random but at a constant mean rate,
  • They are independent of one another, 
  • Two (or more) events can't occur at the same time

Good examples of Poisson processes are website visits, radioactive decay, and calls to a help center. 

The properties of a Poisson distribution

Mathematically, the Poisson probability mass function looks like this:

\[ P_r (X=k) = \frac{\lambda^k e^{- \lambda}}{k!} \]

where 

  • k is the number of events (always an integer)
  • \(\lambda\) is the mean value (or expected rate)

It's a discrete distribution, so it's only defined for integer values of \(k\).

Graphically, it looks like this for \(\lambda=6\). Note that it isn't symmetrical and it stops at 0, you can't have -1 events.

(Let's imagine we were modeling calls per hour in a call center. In this case, \(k\) is the measured calls per hour, \(P\) is their frequency of occurrence, and \(\lambda\) is the mean number of calls per hour).

Here are some of the Poisson distribution's properties:

  • Mean: \(\lambda\)
  • Variance: \(\lambda\)
  • Mode: floor(\(\lambda\))

The fact that some of the key properties are given by \(\lambda\) alone makes using it easy. If your data follows a Poisson distribution, once you know the mean value, you've got the variance (and standard deviation), and the mode too. In fact, you've pretty much got a full description of your data's distribution with just a single number.

When to use it and when not to use it

Because you can describe the entire distribution with just a single number, it's very tempting to assume that any data that involves counting follows a Poisson distribution because it makes analysis easier.  Sadly, not all counts follow a Poisson distribution. In the list below, which counts do you think might follow a Poisson distribution and which might not?

  • The number of goals in English Premier League soccer matches.
  • The number of earthquakes of at least a given size per year around the world.
  • Bus arrivals.
  • The number of web pages a person visits before they make a purchase.

Bus arrivals are not well modeled by a Poisson distribution because in practice they're not independent of one another and don't occur at a constant rate. Bus operators change bus frequencies throughout the day, with more buses scheduled at busy times; they may also hold buses at stops to even out arrival times. Interestingly, bus arrivals are one of the textbook examples of a Poisson process, which shows that you need to think before applying a model.

The number of web pages a person visits before they make a purchase is better modeled using a negative binomial distribution

Earthquakes are well-modeled by a Poisson distribution. Earthquakes in different parts of the world are independent of one another and geological forces are relatively constant, giving a constant mean rate for quakes. It's possible that two earthquakes could happen simultaneously in different parts of the world, which shows that even if one of the criteria might not apply, data can still be well-modeled by Poisson.

What about soccer matches? We know two goals can't happen at the same time. The length of matches is fixed and soccer is a low-scoring game, so the assumption of a constant rate for goals is probably OK. But what about independence? If you've watched enough soccer, you know that the energy level in a game steps up as soon as a goal is scored. Is this enough to violate the independence requirement? Apparently not, scores in soccer matches are well-modeled by a Poisson distribution.

What should a data scientist do?

Just because the data you're modeling is a count doesn't mean it follows a Poisson distribution. More generally, you should be wary of making choices motivated by convenience. If you have count data, look at the properties of your data before deciding on a distribution to model it with. 

If you liked this blog post you might like

No comments:

Post a Comment