Showing posts with label data science. Show all posts

Thursday, December 18, 2025

The Skellam distribution

Distributions, distributions everywhere

There are a ton of distributions out there; SciPy alone has several hundred and that's nowhere near a complete set. I'm going to talk about one of the lesser known distributions, the Skellam distribution, and what it's useful for. My point is a simple one: it's not enough for data scientists to know the main distributions, they must be aware that other distributions exist and have real-world uses.

Overview of the Skellam distribution

It's easy to define the Skellam distribution: it's the difference between two Poisson distributions, or more formally, the distribution of the difference between two independent Poisson distributed random variables.

So we don't get lost in the math, here's a picture of a Skellam distribution.

If you really must know, here's how the PMF is defined mathematically:

\[ P(Z = k; \mu_1, \mu_2) = e^{-(\mu_1 + \mu_2)} \left(\frac{\mu_1}{\mu_2}\right)^{k/2} I_k(2\sqrt{\mu_1 \mu_2}) \] where \(I_k(x)\) is the modified Bessel function of the first kind: \[ I_k(x) = \sum_{j=0}^{\infty} \frac{1}{j!(j+|k|)!} \left(\frac{x}{2}\right)^{2j+|k|} \]

This all looks very complicated, but by now (2025) it's easy to code up. Here's the SciPy code to calculate the PMF:

probabilities = stats.skellam.pmf(k=k_values, mu1=mu1, mu2=mu2)
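If you want something you can run as-is, here's a minimal sketch built around that one line. The parameter values (mu1 = 3, mu2 = 2) and the range of k are my own illustrative choices, not taken from any data set. As a sanity check, it also evaluates the formula above directly, using scipy.special.iv for the modified Bessel function, and compares the two answers.

```python
import numpy as np
from scipy import special, stats

# Illustrative parameters (assumed, not from any data set)
mu1, mu2 = 3.0, 2.0
k_values = np.arange(-10, 11)

# PMF straight from SciPy
probabilities = stats.skellam.pmf(k=k_values, mu1=mu1, mu2=mu2)

# The same PMF from the formula, using the modified Bessel function I_k
by_formula = (
    np.exp(-(mu1 + mu2))
    * (mu1 / mu2) ** (k_values / 2)
    * special.iv(k_values, 2 * np.sqrt(mu1 * mu2))
)

print(np.allclose(probabilities, by_formula))  # do the two methods agree?
print(probabilities.sum())                     # should be very close to 1
```

The sum isn't exactly 1 because the Skellam distribution runs over all integers and this sketch only covers k from -10 to 10.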

What use is it?

Here are just a few uses I found:

  • Finance: modeling price changes between trades.
  • Medicine: modeling the change in the number of beds in an ICU, epileptic seizure counts during drug trials, differences in reported AIDS cases, and so on.
  • Sports: differences in home and away team football or hockey scores.
  • Technology: modeling sensor noise in cameras.
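To make the sports case concrete, here's a rough sketch of how a goal difference could be turned into win, draw, and loss probabilities. It assumes home and away goals are independent Poisson variables, and the two averages (1.6 and 1.1 goals per match) are numbers I made up for illustration.

```python
from scipy import stats

# Hypothetical average goals per match (assumed for illustration only)
home_mu, away_mu = 1.6, 1.1

# Goal difference = home goals - away goals, modeled as Skellam
p_draw = stats.skellam.pmf(0, home_mu, away_mu)        # P(difference == 0)
p_home_win = stats.skellam.sf(0, home_mu, away_mu)     # P(difference > 0)
p_away_win = stats.skellam.cdf(-1, home_mu, away_mu)   # P(difference < 0)

print(f"home win {p_home_win:.3f}, draw {p_draw:.3f}, away win {p_away_win:.3f}")
```

The three probabilities cover every possible outcome, so they should add up to 1.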

Where did it come from?

Skellam published the original paper on this distribution in 1946. There isn't a lot of background on why he did the work and, as far as I can tell, it wasn't related to World War II research work in any way. It's only really been discussed more widely since people discovered its use for modeling sports scores. It's been available as an off-the-shelf distribution in SciPy for over a decade now.

As an analyst, what difference does this make to you?

I worked in a place where the data we analyzed wasn't normally distributed (which isn't uncommon, a lot of data sets aren't normally distributed), so it was important that everyone knew at least something about non-normal statistics. I interviewed job candidates for some senior positions and asked them how they would analyze some obviously non-normal data. Far too many of them suggested using methods only suitable for normally distributed data. Some candidates had Master's degrees in relevant areas and told me they had never been taught how to analyze non-normal data, and even worse, they had never looked into it themselves. This was a major warning sign for us when recruiting.

Let's imagine you're given a new data set in a new area and you want to model it. It's obviously not normal, so what do you do? In these cases, you need to have an understanding of what other distributions are out there and their general shape and properties. You should just be able to look at data and guess a number of distributions that could work. You don't need to have an encyclopedic knowledge of them all, you just need to know they exist and you should know how to use a few of them. 

Monday, December 15, 2025

Poisson to predict football results?

Goals are Poisson distributed?

I've read a lot of literature that suggests that goals in games like football (soccer) and hockey (ice hockey) are Poisson distributed. But are they? I've found out that it's not as simple as some of the papers and articles out there suggest. To dig into it, I'm going to define some terms and show you some analysis.

The Poisson distribution

The Poisson distribution is a discrete distribution that shows the probability distribution of the number of independent events occurring over a fixed time period or interval. Examples of its use include: the number of calls in a call center per hour, website visits per day, and manufacturing defects per batch. Here's what it looks like:

If this were a chart of defects per batch, the x-axis would be the number of defects and the y-axis would be the probability of that number of defects, so the probability of 2 defects per batch would be 0.275 (or 27.5%).

Here's its probability mass function (PMF):

\[PMF = \frac{ \lambda^{k}e^{-\lambda}}{k!} \]
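Here's a small sketch that evaluates the formula by hand and checks it against SciPy's built-in version. The value of lambda (2 defects per batch on average) is my own assumption for illustration; it isn't read off the chart.

```python
import math
from scipy import stats

lam = 2.0   # assumed average number of defects per batch (illustrative)
k = 2       # number of defects we want the probability for

# Directly from the formula
by_formula = lam**k * math.exp(-lam) / math.factorial(k)

# The same thing using SciPy
by_scipy = stats.poisson.pmf(k, lam)

print(by_formula, by_scipy)
```

Both numbers should agree to within floating point rounding.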

Modeling football goals - leagues and seasons

A lot of articles, blogs, and papers suggest that football scores are well-modeled by the Poisson distribution. This is despite the fact that goals are not wholly independent of one another; it's well-known that scoring a goal changes a game's dynamics. 

To check if the Poisson distribution models scores well, here's what I did.

  1. Collected all English football league match results from 1888 to the present. This data includes the following fields: league_tier, season, home_club, home_goals, away_club, away_goals.
  2. Calculated a field total_goals (away_goals + home_goals).
  3. For each league_tier and each season, calculated relative frequency for total_goals, away_goals, and home_goals.
  4. Curve fit a Poisson distribution to the data.
  5. Calculated \(\chi^2\) and the associated p-value.

This gives me a dataframe of \(\chi^2\)  and p for each league_tier and season. In other words, I know how good a model the Poisson distribution is for goals scored in English league football.
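For anyone who wants to try steps 3 to 5 themselves, here's a rough sketch for a single league_tier and season. It isn't my actual analysis code: the goal totals are synthetic (drawn from a Poisson distribution with an assumed mean of 2.7 over 380 matches), and using scipy.optimize.curve_fit for the fit, with the degrees of freedom set to the number of bins minus two, are choices I've made for this example.

```python
import numpy as np
from scipy import optimize, stats

# Synthetic stand-in for one league_tier and season (assumed, not real results)
rng = np.random.default_rng(seed=0)
total_goals = rng.poisson(lam=2.7, size=380)

# Step 3: counts and relative frequency of each goal total
k = np.arange(total_goals.max() + 1)
observed = np.bincount(total_goals, minlength=k.size)
relative_frequency = observed / observed.sum()

# Step 4: curve fit the Poisson PMF to the relative frequencies
def poisson_pmf(k, lam):
    return stats.poisson.pmf(k, lam)

(lam_fit,), _ = optimize.curve_fit(
    poisson_pmf, k, relative_frequency,
    p0=[total_goals.mean()], bounds=(0, np.inf),
)

# Step 5: chi-squared statistic and p-value
expected = observed.sum() * poisson_pmf(k, lam_fit)
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = k.size - 1 - 1  # bins, minus 1, minus 1 fitted parameter
p_value = stats.chi2.sf(chi2, dof)

print(f"lambda = {lam_fit:.2f}, chi2 = {chi2:.2f}, p = {p_value:.3f}")
```

One caveat: bins in the tail with very small expected counts can inflate \(\chi^2\), so a more careful version would pool the sparse high-scoring bins before calculating it.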

This is the best fit (lowest \(\chi^2\) for total_goals). It's for league_tier 2 (the EFL Championship) and season 2022-2023. The Poisson fit here is very good. There are a lot of league_tiers and seasons with pretty similar fits.

Here's the worst fit (highest \(\chi^2\) for total_goals). It's for league_tier 2 (the Second Division) and the 1919-1920 season (the first one after the First World War). By eye, it's still a reasonable approximation. It's an outlier though; there aren't many league_tiers and seasons with fits this bad.


Overall, it's apparent that the Poisson distribution is a very good way of modeling football results at the league_tier and season level. The papers and articles are right. But what about at the team level?

Modeling goals at the club level

Each season, a club faces a new set of opponents. If they change league tier (promotion, relegation), their opponents will be pretty much all new. If they stay in the same league, some opponents will be different (again due to promotion and relegation). If we want to test how good the Poisson distribution is at modeling results at the club level, we need to look season-by-season. This immediately introduces a noise problem; there are many more matches played in a league tier in a season than an individual club will play.

Following the same sort of process as before, I looked at how well the Poisson models goals at the club level. The answer is: not well.

The best performing fit has a low \(\chi^2\) = 0.05, the worst has a value of 98643. This is a bit misleading though, a lot of the fits are bad. Rather than show you the best and the worst, I'll just show you the results for one team and one season: Liverpool in 2024-2025.

(To clarify, total goals is the total number of goals scored in a season by a club, it's the sum of their home goals and their away goals.)

I checked the literature for club results modeling and I found that some authors found a Poisson distribution at the club level if they modeled the data over several seasons. I have mixed feelings about this. Although conditions vary within a season, they're more consistent than across different seasons. Over a period of several years, a majority of the players might have changed and of course, the remaining players will have aged. Is the Arsenal 2019 team the same as the Arsenal 2024 team? Where do you draw the line? On the other hand, the authors did find the Poisson distribution fit team results when aggregating over multiple seasons. As with all things in modeling sports results, there are deeper waters here and more thought and experimentation is required.

Although my season-by-season club fit \(\chi^2\) values aren't crazy, I think you'll agree with me that the fit isn't great and not particularly useful. Sadly, this is the consistent story with this data set. The bottom line is, I'm not sure how useful the Poisson distribution is for predicting scores at the club level for a single season.

Some theory that didn't work

It could be noise driving the poor fit at the club level, which is a variant of the "law of small numbers", but it could be something else. Looking at these results, I'm wondering if this is a case of the Poisson Limit Theorem. The Poisson Limit Theorem is simple: it states that as the number of trials in a Binomial distribution increases towards infinity, while the probability of success shrinks so that the mean stays fixed, the distribution tends to the Poisson distribution. In other words, Binomial distributions look like Poisson distributions if you have enough trials, each with a small chance of success.
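Here's a quick numerical illustration of the theorem. It holds the mean fixed (2.7 is an assumed value, chosen only for illustration), increases the number of Binomial trials, and measures the biggest gap between the Binomial and Poisson PMFs.

```python
import numpy as np
from scipy import stats

lam = 2.7              # assumed mean, for illustration only
k = np.arange(0, 15)
poisson = stats.poisson.pmf(k, lam)

gaps = {}
for n in (10, 100, 1000, 10000):
    # A Binomial with n trials and success probability lam / n has mean lam
    binomial = stats.binom.pmf(k, n, lam / n)
    gaps[n] = np.abs(binomial - poisson).max()
    print(n, gaps[n])
```

The gap shrinks as the number of trials goes up, which is the theorem in action.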

The obvious thing to do is to try fitting the data using the Binomial distribution instead. If the Binomial doesn't fit any better, it's not the Poisson Limit Theorem. 

I tried fitting the club data using the Binomial distribution and I got fractionally better results, but not enough that I would use the Binomial distribution for any real predictions. In other words, this isn't the Poisson Limit Theorem at work.

I went back to all the sources that spoke about using the Poisson distribution to predict goals. All of them used data aggregated to the league or season level. One or two used the Poisson to try and predict who would end up at the top of a league at the end of the season. No one showed results at the club level for a single season or wrote about club-level predictions. I guess I know why now.

Some thoughts on next steps

There are four things I'm mulling over:

  • The Poisson distribution is a good fit for a league tier for a season.
  • I don't see the Poisson distribution as a good fit for a club for a season.
  • Some authors report the Poisson distribution is a fit for a club over several (5 or more) seasons. But clubs change over time, sometimes radically over short periods.
  • The Poisson Limit Theorem kicks in if you have enough data.

A league tier consists of several clubs; right now, there are 20 clubs in the Premier League. By aggregating the results over a season for 20 unrelated clubs, I get data that's well-fitted by the Poisson distribution. I'm wondering if the authors who modeled club data over five or more seasons got it right for the wrong reason. What if they aggregated the results of 5 unrelated clubs in the same season, or even in different seasons? In other words, did they see a fit to multi-season club data because of aggregation alone?

Implications for predicting results

The Poisson distribution is a great distribution to use to model the goals scored at the league and season level, but not so much at the club level. The Binomial distribution doesn't really work at the club level either. It may well be that each team plays too few matches in a season for us to fit their results using an off-the-shelf distribution. Or put another way, randomness is too big an element of the game to let us make quick and easy predictions.

Monday, November 17, 2025

Data scientists need to learn JavaScript

Moving quickly

Over the last few months, I've become very interested in rapid prototype development for data science projects. Here's the key question I asked myself: how can a data scientist build their own app as quickly as possible? Nowadays, speed means code gen, but that's only part of the solution.

The options

The obvious quick development path is using Streamlit; that doesn't require any new skills because it's all in Python. Streamlit is great, and I've used it extensively, but it only takes you so far and it doesn't really scale. Streamlit is really for internal demos, and it's very good at that.

The more sustainable solution is using Django. It's a bigger and more complex beast, but it's scalable. Django requires Python skills, which is fine for most data scientists. Of course, Django apps are deployed on the web and users access them as web pages.

The UI is one place code gen breaks down under pressure

Where things get tricky is adding widgets to Django apps. You might want your app to take some action when the user clicks a button, or have widgets controlling charts etc. Code gen will nicely provide you with the basics, but once you start to do more complicated UI tasks, like updating chart data, you need to write JavaScript or be able to correct code gen'd JavaScript.

(As an aside, for my money, the reason why a number of code gen projects stall is because code gen only takes you so far. To do anything really useful, you need to intervene, providing detailed guidance, and writing code where necessary. This means JavaScript code.)

JavaScript != Python

JavaScript is very much not Python. Even a cursory glance will tell you the JavaScript syntax is unlike Python. More subtly, and more importantly, some of the underlying ideas and approaches are quite different. The bottom line is, a Python programmer is not going to write good enough JavaScript without training.

To build even a medium complexity data science app, you need to know how JavaScript callbacks work, how arrays work, how to debug in the browser, and so on. Because code gen is doing most of the heavy lifting for you, you don't need to be a craftsman, but you do need to be a journeyman.

What data scientists need to do

The elevator pitch is simple:

  • If you want to build a scalable data science app, you need to use Django (or something like it).
  • To make the UI work properly, code gen needs adult supervision and intervention.
  • This means knowing JavaScript.
(Data Scientist becoming JavaScript programmer. Gemini.)

In my view, all that's needed here is a short course, a good book, and some practice. A week should be enough time for an experienced Python programmer to get to where they need to be.

What skillset should data scientists have?

AI is shaking everything up, including data science. In my view, data scientists will have to do more than their "traditional" role. Data scientists who can turn their analysis into apps will have an advantage. 

For me, the skillset a data scientist will need looks a lot like the skillset of a full-stack developer. This means data scientists knowing a bit of JavaScript, code gen, deployment technologies, and so on. They won't need to be experts, but they will need "good enough" skills.

Friday, August 15, 2025

Probability playground: a great app/website

What a great website!

I've played around with probability distributions for more than two decades and I'm still finding out new things. Wikipedia helps of course, but it's sometimes obscure and misses things out.

Recently, I came across some wonderful webpages at the University at Buffalo. The pages were created by Adam Cunn (https://www.acsu.buffalo.edu/~adamcunn/) and are all about probability distributions. I've learned some interesting things from his pages and I thought I would share them on my blog.

Probability Playground

The Probability Playground pages are really an app. The home page (https://www.acsu.buffalo.edu/~adamcunn/probability/probability.html) shows you the "core" probability distributions and how they're related. Clicking on any distribution takes you into the app proper.

Let's click on the beta distribution and I'll show you some interesting stuff: https://www.acsu.buffalo.edu/~adamcunn/probability/beta.html

  • Click on the top right box labeled "Transformation". The dropdown box tells you that the limiting case for the beta distribution is the normal distribution (click it and see) and that a special case of the beta distribution is the uniform distribution.
  • Look on the bottom left of the page and you'll see several examples of the beta distribution in the real world. These are all great examples.
  • You can explore the effect of changing the distribution parameters on the PDF and CDF. It's a vivid demonstration that the beta distribution is a family of distributions.
As you can see, there's more on the page to explore and you can really get a sense of how distributions differ from one another. 

Of course, there are hundreds of different distributions and this app only has the "greatest hits", but that's fine. It was created as a teaching tool and it serves its purpose very well.

Stop reading this and go try it yourself

I like this app a lot, so I'm going to stop writing and tell you to go off and click on the site for yourself. Here's the link again: https://www.acsu.buffalo.edu/~adamcunn/probability/probability.html

Monday, June 30, 2025

Win, lose, or draw: trends in English football match results

Is the game getting more exciting?

Football (soccer) fans like to see exciting matches. Draws are boring but wins or losses are interesting; fans want to see teams give their all on the pitch. Which raises the question, is the game getting more or less thrilling over time? One way to answer this question is to look at the fraction of matches in a league that end in a draw. The most boring extreme is that every game is a draw (draw fraction = 1). The most engaging extreme is that every game ends in a win/loss (draw fraction = 0). How does the proportion of drawn games change over time?

(User:Aloba, Public domain, via Wikimedia Commons)

Draws by league and by season

From multiple sources, I put together a file containing all English national league games from the foundation of the league system in 1888 to the end of the 2024-2025 season. The different leagues started in different seasons, with the National League (tier 5) being the most recent. The top tier (tier 1) is currently called the Premier League, though, like the other leagues, it has undergone a number of confusing name changes.

From this data set, I calculated the fraction of all matches in a season and a league that ended in a draw. I also calculated the standard deviations so you can get a sense of the spread of the data. (Because the draw fractions aren’t close to 0 or 1, I don’t need to use the Wilson Score Interval approach here; the “usual” way of calculating the standard deviation or standard error of a proportion is good enough.)
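To show why the simpler calculation is good enough here, this sketch compares the "usual" standard error of a proportion with the Wilson Score Interval. The numbers (95 draws in 380 matches) are hypothetical, and the 95% confidence level (z = 1.96) is my choice for this example.

```python
import math

# Hypothetical season: 380 matches, 95 draws (illustrative numbers only)
matches, draws = 380, 95
p = draws / matches

# "Usual" standard error of a proportion (normal approximation)
standard_error = math.sqrt(p * (1 - p) / matches)

# Wilson score interval at roughly 95% confidence, for comparison
z = 1.96
center = (p + z**2 / (2 * matches)) / (1 + z**2 / matches)
half_width = (
    z * math.sqrt(p * (1 - p) / matches + z**2 / (4 * matches**2))
    / (1 + z**2 / matches)
)

print(f"draw fraction = {p:.3f} +/- {standard_error:.3f}")
print(f"normal 95% interval: {p - z * standard_error:.3f} to {p + z * standard_error:.3f}")
print(f"Wilson 95% interval: {center - half_width:.3f} to {center + half_width:.3f}")
```

With a draw fraction in the middle of the range and a few hundred matches, the two intervals come out almost the same. They only start to differ noticeably when the fraction is near 0 or 1 or the number of matches is small.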

This chart shows the fraction of draws by league by season. The salmon-colored blocks are World War I and World War II. I’ll explain the blue lines later. The chart is interactive; click on the legend to turn the leagues off and on.

The standard deviation makes this chart hard to understand, so I’ve re-drawn the chart without it (below). Again, it’s interactive.

Let’s look at the top tier (tier 1 – currently, the Premier League). The fraction of draws started off low (around 0.167 in 1888) increasing up to the start of the First World War (0.274 in 1914). Things were more or less stable in the interwar period and the immediate post-war years. In 1968, the draw fraction shot up to 0.303, remained more or less steady, before starting a slow decline after 1993 (0.216 in 2023). The other leagues show a similar pattern, except they show no decline post 1993. How do we explain what’s going on?

Are rule changes the cause?

Let’s start by looking at significant rule changes in the game. The blue lines represent the first season significant rule changes were introduced:

  • In 1968, clubs were allowed to make substitutions for any reason (rather than just replace injured players). 
  • In 1981, the points system was changed from 2 for a win, 1 for a draw, 0 for a loss, to 3 for a win, 1 for a draw, and 0 for a loss. The thinking here was that this would encourage clubs to take more risks and go for a win when they might otherwise accept a draw.
  • Of course, in 1993, the Premier League was founded.

You can judge for yourself the impact of the 1968 and 1981 changes.

The foundation of the Premier League (the new top tier) marks the start of a decline in the draw fraction for the top tier only; none of the other leagues show a similar sustained drop. The question is, why?

More or less equal?

There are at least two reasons why the fraction of draws in a league might change: 

  • The clubs are becoming more (less) equal. If all the teams in a league were equally skilled, we might expect every match to be a draw. 
  • The style of play changes so that wins/losses become more prevalent than draws. The clubs could still be equally skilled though; in the case of equally matched clubs, they might go from drawing all games, to winning and losing in equal measure. 

For the Premier League case, we have two competing explanations for why the draw fraction changed: unequal clubs vs. playing style. Fortunately, there is a way of analyzing the data for evidence supporting one of these explanations.

If all clubs are equal, then the fraction of matches each team wins will be about the same for all clubs. If clubs are very unequal, some clubs will win way more matches than others, so the win fraction for the top performing clubs will be higher than the win fraction for the low performing clubs. In other words, equality means little variance in the win fraction and inequality means high variance in the win fraction.

For each league tier and each season, I calculated the standard deviation of the win fraction (which is the square root of the variance). The chart below shows the results. Bear in mind, the lower the standard deviation of the win fraction, the more equal the teams are; the higher it is, the less equal the teams are.
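Here's a toy version of that calculation for two made-up leagues, one evenly matched and one top-heavy. The win counts are hypothetical, and I've used the population standard deviation here, which is an assumption for this sketch.

```python
import statistics

# Hypothetical wins per club over a 38-match season (illustrative only)
matches_per_club = 38
wins_equal_league = [15, 14, 16, 13, 15, 14]
wins_unequal_league = [29, 27, 12, 9, 7, 5]

def win_fraction_std(wins, matches):
    """Standard deviation of the clubs' win fractions for one league season."""
    win_fractions = [w / matches for w in wins]
    return statistics.pstdev(win_fractions)

equal_std = win_fraction_std(wins_equal_league, matches_per_club)
unequal_std = win_fraction_std(wins_unequal_league, matches_per_club)

print(f"evenly matched league: {equal_std:.3f}")
print(f"top-heavy league: {unequal_std:.3f}")
```

The top-heavy league gives the larger number, which is the signal I'm looking for in the real data.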

Prior to 1914, tier 1 and tier 2 show the same trend, a decline in win fraction standard deviation, suggesting the leagues are becoming more equal. Post-World War II, tiers 2, 3, 4 and 5 show no change over time with a low win fraction standard deviation, again suggesting equally matched teams. The Premier League is different: it shows an increasing win fraction standard deviation over time, implying this league is becoming less and less equal; the difference between the winners and losers is getting bigger.

To be clear, these results suggest that the Premier League's declining draw fraction is not due to a change in the style of play; it’s due to the league tier becoming more unequal.

Supporting evidence

If you’re a football fan, inequality in the Premier League comes as no surprise. There’s been a lot of discussion about the Premier League having a league-within-a-league of top clubs (currently Liverpool, Manchester City, Chelsea, and Arsenal). A few years ago, the top clubs in Europe talked about forming a breakaway European super-league which lends credence to the idea that top clubs really are different. On the flip side, it’s also true that most clubs that get promoted into the league get relegated soon after (for example, Luton).

Money might be the cause, but the picture is more complicated than it seems.

A Deloitte analysis [https://www.deloitte.com/uk/en/services/consulting/research/annual-review-of-football-finance-premier-league-clubs.html] for the 2023-2024 season shows Manchester City had a revenue of £719 million, compared to Luton's £132 million. Any guesses how these teams finished the season? The drop from most revenue to least isn't linear either.

For the same 2023-2024 season, revenue for the entire English Championship League (tier 2) was £958 million, which is less than the revenue from just the top two Premier League clubs. Championship club revenue is also not evenly distributed, showing the same kind of "top heavy" pattern the Premier League shows (see [https://swissramble.substack.com/p/english-clubs-by-revenue-grouping] for a chart). If the Championship also has unequal revenue distributions, why doesn't it show the same win fraction standard deviation as the Premier League? I'm not sure, but I can offer a couple of ideas. Money buys talent, but maybe it isn't even. For example, a £100 million player may be twice as good as a £50 million player, but a £10 million player might only be 1.5 times as good as a £5 million player (there are many more £5 million players than there are £100 million players). This would mean lower leagues become more equitable because the money difference matters less as the amount of money goes down.

The bottom line is, the Premier League is becoming more winner-takes-all while the lower leagues are more equitable.

What does this mean?

It may well be true that draws are boring, but having dominant teams is also boring. If you can reliably predict who will win a match, it’s not as interesting. If your team always wins, why do you care?

Inequality has implications for promotion and relegation. Inequality suggests there may be a revolving door of the same few clubs moving between the Premier League and the Championship. If newly promoted clubs mostly get relegated, that's pretty dull. If it's the "usual suspects" for promotion/relegation then things start to look the same season to season.

Given that the Premier League was founded over money sharing issues, it's hard to see any changes that would more equitably distribute money. 

For the foreseeable future, we may have a very unequal Premier League with much more equal lower leagues.

Other football posts: