Friday, December 19, 2025

Small adventures with small language models

Small is the new large

I've been talking to people about small language models (SLMs) for a little while now. They've told me they've got great results and they're saving money compared to using LLMs; these are people running businesses so they know what they're talking about. At an AI event, someone recommended I read the recent and short NVIDIA SLM paper, so I did. The paper was compelling; it gave the simple message that SLMs are useful now and you can save time and money if you use them instead of LLMs. 

(If you want to use SLMs, you'll be using Ollama and HuggingFace. They work together really well.)

As a result of what I've heard and read, I've looked into SLMs and I'm going to share with you what I've found. The bottom line is: they're worth using, but with strong caveats.

What is an SLM?

The boundary between an SLM and an LLM is a bit blurry, but to put it simply, an SLM is any model small enough to run on a single computer (even a laptop). In reality, SLMs require quite a powerful machine (developer spec) as we'll see, but nothing special, and certainly nothing beyond the budget of almost all businesses. Many (but not all) SLMs are open-source.

(If your laptop is "business spec", e.g., a MacBook Air, you probably don't have enough computing power to test out SLMs.) 

How to get started

To really dive into SLMs, you need to be able to use Python, but you can get started without coding. Let's start with the non-coder's path because it's the easiest way for everyone to get going.

The first port of call is visiting ollama.com and downloading their software for your machine. Install the software and run it. You should see a UI like this.

Out-of-the-box, Ollama doesn't install any SLMs, so I'm going to show you how to install a model. From the drop-down menu on the bottom right, select llama3.2. This will install the model on your machine, which will take a minute or so. Remember, these models are resource hogs and using them will slow down your machine.

Once you've installed a model, ask it a question. For example, "Who is the Prime Minister of Canada?". The answer doesn't really matter, this is just a simple proof that your installation was successful. 

(By the way, the Ollama logo is very cute and they make great use of it. It shows you the power of good visual design.)

So many models!

The UI drop-down list shows a number of models, but these are a fraction of what's available. Go to this page to see a few more: https://ollama.com/library. This is a nice list, but you actually have access to thousands more. HuggingFace has a repository of models in the GGUF format; you can see the list here: https://huggingface.co/models?library=gguf

Some models are newer than others and some are better than others at certain tasks. HuggingFace has a leaderboard that's useful here: https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard. It does say LLM, but it includes SLMs too, and you can select an SLM-only view of the models. There are also model cards you can explore that give you insight into the performance of each model for different types of tasks.

To select the right models for your project, you'll need to define your problem and look for a model metric that most closely aligns with what you're trying to do. That's a lot of work, but to get started, you can install the popular models like mistral, llama3.2, and phi3 and get testing.

Who was the King of England in 1650?

You can't just generically evaluate an SLM; you have to evaluate it for the task you want to do. For example, if you want a chatbot to talk about the stock you have in your retail company, it's no use testing the model on questions like "who was King of England in 1650?". It's nice if the model knows Kings & Queens, but not really very useful to you. So your first task is defining your evaluation criteria.

(England didn't have a King in 1650, it was a republic. Parliament had executed the previous King in 1649. This is an interesting piece of history, but why do you care if your SLM knows it?)

Text analysis: data breaches

For my evaluation, I chose a project analyzing press reports on data breaches. I selected nine questions I wanted answers to from a press report. Here are my questions:

  • "Does the article discuss a data breach - answer only Yes or No"
  • "Which entity was breached?"
  • "How many records were breached?"
  • "What date did the breach occur - answer using dd-MMM-YYYY format, if the date is not mentioned, answer Unknown, if the date is approximate, answer with a range of dates"
  • "When was the breach discovered, be as accurate as you can"
  • "Is the cause of the breach known - answer Yes or No only"
  • "If the cause of the breach is known state it"
  • "Were there any third parties involved - answer only Yes or No"
  • "If there were third parties involved, list their names"

The idea is simple: give the SLM a number of press reports, get it to answer the questions on each article, and check the accuracy of the results for each SLM.

As it turns out, my questions need some work, but they're good enough to get started.

Where to run your SLM?

The first choice you face is which computer to run your SLM on. Your choices boil down to evaluating it on the cloud or on your local machine. If you evaluate on the cloud, you need to choose a machine that's powerful enough but also works with your budget. Of course, the advantage of cloud deployment is you can choose any machine you like. If you choose your local machine, it needs to be powerful enough for the job. The advantage of local deployment is that it's easier and cheaper to get started.

To get going quickly, I chose my local machine, but as it turned out, it wasn't quite powerful enough.

The code

This is where we part ways with the Ollama app and turn to coding. 

The first step is installing the Ollama Python module (https://github.com/ollama/ollama-python). Unfortunately, the documentation isn't great, so I'm going to help you through it.

We need to install the SLMs on our machine. This is easy to do: you can either do it via the command line or via the API. I'll just show you the command line way to install the model llama3.2:

ollama pull llama3.2
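
For completeness, the API route looks something like this (a minimal sketch using the Ollama Python module; check the module's documentation for the current signature):

import ollama

ollama.pull('llama3.2')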

Because we have the same nine questions we want to ask of each article, I'm going to create a 'custom' SLM. This means selecting a model (e.g. Llama3.2) and customizing it with my questions. Here's my code.

for progress in ollama.create(
    model='breach_analyzer',
    from_='llama3.2',
    system=system_prompt,
    stream=True,
):
    # stream progress updates while the custom model is created
    print(progress)

The system_prompt is the nine questions I showed you earlier plus a general prompt; model is the name I'm giving my custom model, in this case breach_analyzer.

Now I've customized my model, here's how I call it:

response = ollama.generate(
    model='breach_analyzer',
    prompt=prompt,
    format=BreachAnalysisResponse.model_json_schema(),
)

The prompt is the text of the article I want to analyze. The format argument tells the model to return JSON that matches the schema of BreachAnalysisResponse, so the response comes back as structured JSON rather than free text.
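
BreachAnalysisResponse is a Pydantic model whose fields mirror my nine questions. I'm not reproducing my exact class here, but a minimal sketch looks something like this (the field names are illustrative, not the ones in my repository):

from pydantic import BaseModel

class BreachAnalysisResponse(BaseModel):
    # one field per question; the SLM fills these in as JSON
    is_data_breach: str          # "Yes" or "No"
    breached_entity: str
    records_breached: str
    breach_date: str             # dd-MMM-YYYY, a range, or "Unknown"
    discovery_date: str
    cause_known: str             # "Yes" or "No"
    cause: str
    third_parties_involved: str  # "Yes" or "No"
    third_party_names: str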

Note I'm using generate here and not chat. My queries are "one-off" and there's no sense of a continuing dialog. If I'd wanted a continuing dialog, I'd have used the chat function.

Here's how my code works overall:

  1. Read in the text from six online articles.
  2. Load the model the user has selected (either mistral, llama3.2, or phi3).
  3. Customize the model.
  4. Run all six online articles through the customized model.
  5. Collect the results and analyze them.

I created two versions of my code, a command line version for testing and a Streamlit version for proper use. You can see both versions here: https://github.com/MikeWoodward/SLM-experiments/tree/main/Ollama
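
Stripped of the error handling and the Streamlit UI, the core loop looks something like this (a simplified sketch, not my exact code; article_urls is the list of six article URLs, and fetch_article_text is a helper sketched further down):

import json
import ollama

results = []
for url in article_urls:
    text = fetch_article_text(url)  # returns the text of the article's web page
    response = ollama.generate(
        model='breach_analyzer',
        prompt=text,
        format=BreachAnalysisResponse.model_json_schema(),
    )
    # the model's answer is JSON conforming to the schema above
    results.append(json.loads(response['response']))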

The results

The first thing I discovered is that these models are resource hogs! They hammered my machine and took 10-20 minutes to run each evaluation of six articles. My laptop is a 2020 developer-spec MacBook Pro, but it isn't really powerful enough to evaluate SLMs. The first lesson is: you need a powerful, recent machine to make this work, one with built-in GPUs that the SLM can access. I've heard from other people that running SLMs on high-spec machines leads to fast (usable) response times.

The second lesson is accuracy. Of the three models I evaluated, not all of them answered my questions correctly. One of the articles was about tennis, not data breaches, but one of the models incorrectly said it was about a data breach. Another of the models told me it was unclear whether there were third parties involved in a breach and then told me the name of the third party!

On reflection, I needed to tweak my nine questions to get clearer answers. But this was difficult because of the length of time it took to analyze each article. This is a general problem; it took so long to run the models that any tweaking of code or settings took too much time.

The overall winner in terms of accuracy was Phi-3, but this was also the slowest to run on my machine, taking nearly 20 minutes to analyze six articles. From commentary I've seen elsewhere, this model runs acceptably fast on a more powerful machine.

Here's the key question: could I replace paid-for LLMs with SLMs? My answer is: almost certainly yes, if you deploy your SLMs on a high-spec computer. There's certainly enough accuracy here to warrant a serious investigation.

How could I have improved the results?

The most obvious thing is a faster machine. A brand new top-of-the-range MacBook Pro with lots of memory and built-in GPUs. Santa, if you're listening, this is what I'd like. Alternatively, I could have gone onto the cloud and used a GPU machine.

My prompts could be better. They need some tweaking.

I get the text of these articles using requests, which returns everything on the page, including a lot of irrelevant navigation and boilerplate. A good next step would be to strip out the extraneous and distracting text before passing the article to the model. There are lots of ways to do that, and it's a job any competent programmer could do; one possibility is sketched below.
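
For example, a minimal cleanup pass could use BeautifulSoup (my choice for this sketch; it's not in the current code) to drop scripts, styles, and navigation before extracting the text:

import requests
from bs4 import BeautifulSoup

def fetch_article_text(url):
    """Fetch a page and return its visible text, minus obvious boilerplate."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # remove elements that are never part of the article body
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)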

If I could solve the speed problem, it would be good to investigate using multiple models. This could take several forms:

  • asking the same questions using multiple models and voting on the results
  • using different models for different questions.

What's notable about these ways of improving the results is how simple they are.
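
The voting idea in particular is only a few lines of code. A rough sketch, assuming each model's answer to a given question has already been collected into a list:

from collections import Counter

def majority_vote(answers):
    # e.g. ['Yes', 'Yes', 'No'] -> 'Yes'
    return Counter(answers).most_common(1)[0][0]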

Some musings

  • Evaluating SLMs is firmly in the technical domain. I've heard of non-technical people trying to play with these models, but they end up going nowhere because it takes technical skills to make them do anything useful.
  • There are thousands of models and selecting the right one for your use case can be a challenge. I suggest going with the most recent and/or ones that score most highly on the HuggingFace leaderboard.
  • It takes a powerful machine to run these models. A new high-end machine with GPUs would probably run these models "fast enough". If you have a very recent and powerful local machine, it's worth playing around with SLMs locally to get started, but for serious evaluation, you need to get on the cloud and spend money.
  • Some US businesses are allergic to models developed in certain countries, some European businesses want models developed in Europe. If the geographic origin of your model is important, you need to check before you start evaluating.
  • You can get cost savings compared to LLMs, but there's hard work to be done implementing SLMs.

I have a lot more to say about evaluations and SLMs that I'm not saying here. If you want to hear more, reach out to me.

Next steps

Ian Stokes-Rees gave an excellent tutorial at PyData Boston on this topic and that's my number one choice for where to go next.

After that, I suggest you read the Ollama docs and join their Discord server. After that, the Hugging Face Community is a good place to go. Lastly, look at the YouTube tutorials out there.

Thursday, December 18, 2025

The Skellam distribution

Distributions, distributions everywhere

There are a ton of distributions out there; SciPy alone implements well over a hundred, and that's nowhere near a complete set. I'm going to talk about one of the lesser-known distributions, the Skellam distribution, and what it's useful for. My point is a simple one: it's not enough for data scientists to know the main distributions; they must be aware that other distributions exist and have real-world uses.

Overview of the Skellam distribution

It's easy to define the Skellam distribution: it's the difference between two Poisson distributions, or more formally, the difference between two Poisson distributed random variables. 

So we don't get lost in the math, here's a picture of a Skellam distribution.

If you really must know, here's how the PMF is defined mathematically:

\[ P(Z = k; \mu_1, \mu_2) = e^{-(\mu_1 + \mu_2)} \left(\frac{\mu_1}{\mu_2}\right)^{k/2} I_k(2\sqrt{\mu_1 \mu_2}) \] where \(I_k(x)\) is given by the modified Bessel function: \[ I_k(x) = \sum_{j=0}^{\infty} \frac{1}{j!(j+|k|)!} \left(\frac{x}{2}\right)^{2j+|k|} \]

This all looks very complicated, but by now (2025) it's easy to code up. Here's the SciPy code to calculate the PMF:

from scipy import stats
probabilities = stats.skellam.pmf(k=k_values, mu1=mu1, mu2=mu2)
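
If you want to convince yourself it really is the difference of two Poisson variables, a quick simulation check (my own sketch, with arbitrary rates) lines up nicely with the PMF:

import numpy as np
from scipy import stats

mu1, mu2 = 3.0, 2.0
rng = np.random.default_rng(42)

# simulate the difference of two independent Poisson random variables
z = rng.poisson(mu1, 100_000) - rng.poisson(mu2, 100_000)

# compare the simulated frequency of each difference with the Skellam PMF
for k in range(-3, 4):
    simulated = np.mean(z == k)
    theoretical = stats.skellam.pmf(k, mu1=mu1, mu2=mu2)
    print(f"k={k:+d}  simulated={simulated:.4f}  skellam={theoretical:.4f}")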

What use is it?

Here are just a few uses I found:

  • Finance: modeling price changes between trades.
  • Medicine: modeling the change in the number of beds in an ICU, epileptic seizure counts during drug trials, differences in reported AIDS cases, and so on.
  • Sports: differences in home and away team football or hockey scores.
  • Technology: modeling sensor noise in cameras.

Where did it come from?

Skellam published the original paper on this distribution in 1946. There isn't a lot of background on why he did the work and, as far as I can tell, it wasn't related to World War II research in any way. It only really started to be discussed more widely once people discovered its use for modeling sports scores. It's been available as an off-the-shelf distribution in SciPy for over a decade now.

As an analyst, what difference does this make to you?

I worked in a place where the data we analyzed wasn't normally distributed (which isn't uncommon, a lot of data sets aren't normally distributed), so it was important that everyone knew at least something about non-normal statistics. I interviewed job candidates for some senior positions and asked them how they would analyze some obviously non-normal data. Far too many of them suggested using methods only suitable for normally distributed data. Some candidates had Master's degrees in relevant areas and told me they had never been taught how to analyze non-normal data and, even worse, they had never looked into it themselves. This was a major warning sign for us when recruiting.

Let's imagine you're given a new data set in a new area and you want to model it. It's obviously not normal, so what do you do? In these cases, you need to have an understanding of what other distributions are out there and their general shape and properties. You should just be able to look at data and guess a number of distributions that could work. You don't need to have an encyclopedic knowledge of them all, you just need to know they exist and you should know how to use a few of them. 
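
To make that concrete, here's a toy example of the kind of quick triage I have in mind: fit a handful of candidate distributions and compare their log-likelihoods (the data here is synthetic, just to illustrate the idea):

import numpy as np
from scipy import stats

# synthetic, skewed, obviously non-normal data
data = np.random.default_rng(0).gamma(shape=2.0, scale=3.0, size=1000)

# fit a few candidate distributions and compare by log-likelihood
for dist in (stats.norm, stats.gamma, stats.lognorm, stats.expon):
    params = dist.fit(data)
    loglik = np.sum(dist.logpdf(data, *params))
    print(f"{dist.name:8s} log-likelihood = {loglik:.1f}")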

Monday, December 15, 2025

Poisson to predict football results?

Goals are Poisson distributed?

I've read a lot of literature that suggests that goals in games like football (soccer) and hockey (ice hockey) are Poisson distributed. But are they? I've found out that it's not as simple as some of the papers and articles out there suggest. To dig into it, I'm going to define some terms and show you some analysis.

The Poisson distribution

The Poisson distribution is a discrete distribution that shows the probability distribution of the number of independent events occurring over a fixed time period or interval. Examples of its use include: the number of calls in a call center per hour, website visits per day, and manufacturing defects per batch. Here's what it looks like:

If this were a chart of defects per batch, the x-axis would be the number of defects and the y-axis would be the probability of that number of defects, so the probability of 2 defects per batch would be 0.275 (or 27.5%).

Here's its probability mass function:

\[ P(X = k; \lambda) = \frac{\lambda^{k}e^{-\lambda}}{k!} \]
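
In SciPy this is a one-liner. For instance, with an average rate of two defects per batch (an illustrative number, not the one behind the chart):

from scipy import stats

# probability of seeing exactly k defects when the average is 2 per batch
for k in range(6):
    print(k, stats.poisson.pmf(k, mu=2))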

Modeling football goals - leagues and seasons

A lot of articles, blogs, and papers suggest that football scores are well-modeled by the Poisson distribution. This is despite the fact that goals are not wholly independent of one another; it's well-known that scoring a goal changes a game's dynamics. 

To check if the Poisson distribution models scores well, here's what I did.

  1. Collected all English football league match results from 1888 to the present. This data includes the following fields: league_tier, season, home_club, home_goals, away_club, away_goals.
  2. Calculated a field total_goals (away_goals + home_goals).
  3. For each league_tier and each season, calculated relative frequency for total_goals, away_goals, and home_goals.
  4. Curve fit a Poisson distribution to the data.
  5. Calculated \(\chi^2\) and the associated p-value.

This gives me a dataframe of \(\chi^2\)  and p for each league_tier and season. In other words, I know how good a model the Poisson distribution is for goals scored in English league football.
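
Steps 4 and 5 boil down to a few lines of SciPy. Here's a simplified sketch of the idea rather than my exact pipeline: the Poisson rate is estimated by the sample mean, and the tail of the distribution is lumped into the last bin so the expected counts sum to the number of matches.

import numpy as np
from scipy import stats

def poisson_fit_quality(goals):
    # goals is a 1-D array of integer per-match goal counts
    # (total_goals, home_goals, or away_goals)
    goals = np.asarray(goals)
    lam = goals.mean()  # maximum-likelihood estimate of the Poisson rate
    k = np.arange(goals.max() + 1)
    observed = np.bincount(goals, minlength=k.size)
    expected = stats.poisson.pmf(k, lam) * goals.size
    # fold the tail beyond max(goals) into the last bin
    expected[-1] += stats.poisson.sf(goals.max(), lam) * goals.size
    chi2, p = stats.chisquare(observed, expected, ddof=1)  # one fitted parameter
    return chi2, p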

This is the best fit (lowest \(\chi^2\) for total_goals). It's for league_tier 2 (the EFL Championship) and season 2022-2023. The Poisson fit here is very good. There are a lot of league_tiers and seasons with pretty similar fits.

Here's the worst fit (highest \(\chi^2\) for total_goals). It's for league_tier 2 (the Second Division) and the 1919-1920 season (the first one after the First World War). By eye, it's still a reasonable approximation. It's an outlier though; there aren't many league_tiers and seasons with fits this bad.


Overall, it's apparent that the Poisson distribution is a very good way of modeling football results at the league_tier and season level. The papers and articles are right. But what about at the team level?

Modeling goals at the club level

Each season, a club faces a new set of opponents. If they change league tier (promotion, relegation), their opponents will be pretty much all new. If they stay in the same league, some opponents will be different (again due to promotion and relegation). If we want to test how good the Poisson distribution is at modeling results at the club level, we need to look season-by-season. This immediately introduces a noise problem; there are many more matches played in a league tier in a season than an individual club will play.

Following the same sort of process as before, I looked at how well the Poisson models goals at the club level. The answer is: not well.

The best-performing fit has a low \(\chi^2\) of 0.05; the worst has a value of 98643. This is a bit misleading though: a lot of the fits are bad. Rather than show you the best and the worst, I'll just show you the results for one team and one season: Liverpool in 2024-2025.

(To clarify, total goals is the total number of goals a club scored in a season: the sum of its home goals and its away goals.)

I checked the literature for club results modeling and I found that some authors found a Poisson distribution at the club level if they modeled the data over several seasons. I have mixed feelings about this. Although conditions vary within a season, they're more consistent than across different seasons. Over a period of several years, a majority of the players might have changed and of course, the remaining players will have aged. Is the Arsenal 2019 team the same as the Arsenal 2024 team? Where do you draw the line? On the other hand, the authors did find the Poisson distribution fit team results when aggregating over multiple seasons. As with all things in modeling sports results, there are deeper waters here and more thought and experimentation is required.

Although my season-by-season club fit \(\chi^2\) values aren't crazy, I think you'll agree with me that the fit isn't great and not particularly useful. Sadly, this is the consistent story with this data set. The bottom line is, I'm not sure how useful the Poisson distribution is for predicting scores at the club level for a single season.

Some theory that didn't work

It could be noise driving the poor fit at the club level, which is a variant of the "law of small numbers", but it could be something else. Looking at these results, I'm wondering if this is a case of the Poisson Limit Theorem. The Poisson Limit Theorem is simple: it states that as the number of trials in a Binomial distribution grows very large and the probability of success becomes very small (with the expected number of successes held constant), the Binomial distribution tends to the Poisson distribution. In other words, a Binomial distribution looks like a Poisson distribution when there are many trials, each with a small chance of success.

The obvious thing to do is to try fitting the data using the Binomial distribution instead. If the Binomial doesn't fit any better, it's not the Poisson Limit Theorem. 

I tried fitting the club data using the Binomial distribution and I got fractionally better results, but not enough that I would use the Binomial distribution for any real predictions. In other words, this isn't the Poisson Limit Theorem at work.

I went back to all the sources that spoke about using the Poisson distribution to predict goals. All of them used data aggregated to the league or season level. One or two used the Poisson to try and predict who would end up at the top of a league at the end of the season. No one showed results at the club level for a single season or wrote about club-level predictions. I guess I know why now.

Some thoughts on next steps

There are four things I'm mulling over:

  • The Poisson distribution is a good fit for a league tier for a season.
  • I don't see the Poisson distribution as a good fit for a club for a season.
  • Some authors report the Poisson distribution is a fit for a club over several (5 or more) seasons. But clubs change over time, sometimes radically over short periods.
  • The Poisson Limit Theorem kicks in when there are enough trials, each with a small probability of success.

A league tier consists of several clubs; right now, there are 20 clubs in the Premier League. By aggregating the results over a season for 20 unrelated clubs, I get data that's well fitted by the Poisson distribution. I'm wondering if the authors who modeled club data over five or more seasons got it right for the wrong reason. What if they had aggregated the results of 5 unrelated clubs in the same season, or even in different seasons? In other words, did they see a fit to multi-season club data because of aggregation alone?

Implications for predicting results

The Poisson distribution is a great way to model goals scored at the league and season level, but not so much at the club level. The Binomial distribution doesn't really work at the club level either. It may well be that each team plays too few matches in a season for us to fit their results using an off-the-shelf distribution. Or, put another way, randomness is too big an element of the game to let us make quick and easy predictions.

Sunday, December 14, 2025

Ozymandias

Some poetic background 

'Ozymandias' is one of my favorite poems; I find it easily accessible and the imagery very evocative. As you might expect, I've dug into the background a bit. There are some interesting stories behind the poem and I'm going to tell you one or two.

(Gemini)

Here's the background. The poem is the result of a friendly wager between Horace Smith and Percy Bysshe Shelley in 1817. The wager was to write a poem on the theme of an ancient Egyptian monumental sculpture that was then on its way to Britain. The statue was one of the spoils of war; Britain and France had been fighting for control in Egypt, with Egypt's antiquities one of the great prizes. The Younger Memnon statue was one of these antiquities, and after some adventures, it was successfully looted and brought to London (the French had tried to take it and failed, so it's another of those Anglo-French rivalries). British society had big expectations for the statue, hence the bet to write a poem about the remains of a large statue in a desert. Shelley published his poem in 1818 and 'won' the bet for the better poem.

(Younger Memnon statue - of Rameses II. British Museum. Creative Commons License.)

The titular Ozymandias is the Greek-language version of the name of the Egyptian pharaoh Ramesses II (1279–1213 BCE). During his 66-year reign, Egypt built many cities and temples and successfully waged war against old rivals; scholars regard him as one of the great pharaohs. The Younger Memnon sculpture depicts Ramesses II in his youth. So in 1817, we have a statue of a once-great pharaoh whose empire has crumbled into dust, leaving only statues and ruins behind.

The poem

Here's the entire poem.

I met a traveller from an antique land

Who said: Two vast and trunkless legs of stone

Stand in the desert. Near them, on the sand,

Half sunk, a shattered visage lies, whose frown,

And wrinkled lip, and sneer of cold command,

Tell that its sculptor well those passions read

Which yet survive, stamped on these lifeless things,

The hand that mocked them and the heart that fed:

And on the pedestal these words appear:

"My name is Ozymandias, King of Kings:

Look on my works, ye Mighty, and despair!"

Nothing beside remains. Round the decay

Of that colossal wreck, boundless and bare

The lone and level sands stretch far away.

(Here's a page of literary criticism/analysis of the poem.) 

The other entry

Here's Horace Smith's poem on the same theme.

In Egypt's sandy silence, all alone,

Stands a gigantic Leg, which far off throws

The only shadow that the Desert knows:—

"I am great OZYMANDIAS," saith the stone,

"The King of Kings; this mighty City shows

The wonders of my hand."— The City's gone,—

Naught but the Leg remaining to disclose

The site of this forgotten Babylon.


We wonder — and some Hunter may express

Wonder like ours, when thro' the wilderness

Where London stood, holding the Wolf in chace,

He meets some fragment huge, and stops to guess

What powerful but unrecorded race

Once dwelt in that annihilated place.

I'm not a poetry critic, but even I can see that Shelley's entry is greatly superior.

Best readings

There are many, many readings of Shelley's poem on the internet. A lot of people like John Gielgud's reading (maybe because he's English and was a classical actor's actor), but for me, Bryan Cranston's reading is the best.

Shelley's life story

Shelley's life story is worth disappearing into a wiki-hole for. It's a lurid tale of political radicalism, lust, and poetry. Even today, some of Shelley's exploits seem wild, and it's easy to see why they would have been shocking two hundred years ago. Of course, it would be remiss of me not to say that Shelley had a hand in the creation of Frankenstein (his wife, Mary Shelley, was the author). Like other great cultural icons, he died young, at 29.

Friday, December 12, 2025

Data sonification: a curious oddity that may have some uses

What is sonification?

The concept is simple: you turn data into sound. Obviously, you can play with frequency and volume, but there are more subtle sonic things you can play with to represent data. Let's imagine you had sales data for different countries that went up and down over time. You could assign a different instrument to each country (e.g., drum for the US, piano for Germany, violin for France), and represent different sales volumes as different notes. The hope, of course, is that the notes get higher as sales increase.

If you have more musical experience, you could turn data sets into more interesting music, for example, mapping ups and downs in the data to shifts in tone and speed. 
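
To make this concrete, here's a minimal sketch of the pitch-mapping idea using NumPy and Python's built-in wave module (the function and the toy sales figures are made up for illustration):

import wave
import numpy as np

def sonify(values, filename="sales.wav", rate=44100, note_seconds=0.25):
    # map each data point to a pitch: higher value -> higher note
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    # scale the data onto two octaves, 220 Hz to 880 Hz
    freqs = 220 + (values - lo) / (hi - lo + 1e-9) * (880 - 220)
    t = np.arange(int(rate * note_seconds)) / rate
    tones = [0.5 * np.sin(2 * np.pi * f * t) for f in freqs]
    pcm = (np.concatenate(tones) * 32767).astype(np.int16)
    with wave.open(filename, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(pcm.tobytes())

sonify([10, 12, 15, 11, 20, 25, 22])  # e.g. monthly sales figures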

(Gemini)

Examples

Perhaps the simplest sonification example is the one you've probably seen in movies: using a Geiger counter to measure radiation. The more it clicks, the more radiation there is. Because it's noise rather than a dial, the user can focus their eyes on where they point the detector and use their ears to detect radiation. It's so simple, even James Bond has used a Geiger counter. In a similar vein, metal detectors use sound to alert the user to the presence of metal.

Perhaps the best example I've heard of sonification is Brian Foo mapping income inequality along the New York Subway's 2 line. You can watch video and music here: https://vimeo.com/118358642?fl=pl&fe=sh. He's turned a set of data into a story and you can see how this could be taken further into a full-on multi-media presentation.

Sometimes, our ears can tell us things our eyes can't. Steve Mould's video on "The planets are weirdly in sync" has a great sonification example starting here: https://youtu.be/Qyn64b4LNJ0?t=1110; the sonification shows up data relationships that charts or animations can't. The whole video is worth a watch too (https://www.youtube.com/watch?v=Qyn64b4LNJ0).

There are two other related examples of sonification I want to share. 

In a nuclear facility, you sometimes hear a background white noise sound. That signifies that all is well. If the sound goes away, that signifies something very bad has happened and you need to get out fast. Why not sound an alarm if something bad happens? Because if something really bad happens, there might not be power for the alarm. Silence is a fail-safe.

In a similar vein, years ago I worked on an audio processing system. We needed to know the system was reliable, so we played a CD of music over and over through the system. If we ever heard a break or glitch in the music, we knew the audio system had failed and we needed to intervene to catch the bug. This was a kind of ongoing sonic quality assurance system.

What use is it?

Frankly, sonification isn't something I would see people use every day. It's a special purpose thing, but it's handy to know about. Here are two use cases.

  • The obvious one is presenting company data. This could be sales, or clicks, or conversion etc. With a bit of effort and musical ability, you could do the kind of thing that Brian Foo did. Imagine an investor presentation (or even an all-hands meeting) with a full-on multi-media presentation with charts, video, and sound.
  • The other use is safety and alerting. Imagine a company selling items on a website. It could pipe music into common areas (e.g., restrooms and lunch areas). If sales are going well, it plays fast music; if they're slow, it plays slow music. If there are no sales at all, you get silence. This is a way of alerting everyone to the rhythm of sales, and to when something goes wrong. Obviously, this could go too far, but you get the idea.

Finding out more

Sonification: the music of data - https://www.youtube.com/watch?v=br_8wXKgtkg

The planets are weirdly in sync - https://www.youtube.com/watch?v=Qyn64b4LNJ0

Brian Foo's sonifications - https://datadrivendj.com/

NASA's astronomical data sonifications - https://science.nasa.gov/mission/hubble/multimedia/sonifications/

The sound of science - https://pmc.ncbi.nlm.nih.gov/articles/PMC11387736/

Monday, December 1, 2025

Some musings on code generation: kintsugi

Hype and reality

I've been using AI code generation (Claude, Gemini, Cursor...) for months and I'm familiar with its strengths and weaknesses. It feels like I've gone through the whole hype cycle (see https://en.wikipedia.org/wiki/Gartner_hype_cycle) and now I'm firmly on the Plateau of Productivity. Here are some musings covering benefits, disappointments, and a way forward.

(The Japanese art of Kintsugi. Image by Gemini.)

Benefits

Elsewhere, people have waxed lyrical about the benefits of code generation, so I'm just going to add in a few novel points.

It's great when you're unfamiliar with an area of a language; it acts as a prompt or tutorial. In the past, you'd have to wade through pages of documentation and write code to experiment. Alternatively, you could search to see if anyone's tackled your problem and has a solution. If you were really stuck, you could try and ask a question on Stack Overflow and deal with the toxicity. Now, you can get something to get you going quickly.

Modern code development requires properly commenting code, making sure code is "linted" and PEP8 compliant, and creating test cases etc. While these things are important, they can consume a lot of time. Code generation steps on the accelerator pedal and makes them go much faster. In fact, code gen makes it quite reasonable to raise the bar on code quality.

Disappointments

Pandas dataframes

I've found code gen really doesn't do well manipulating Pandas dataframes. Several times, I've wanted to transform dataframes or do something non-trivial, for example, aggregating data, merging dataframes, transforming a column in some complex way and so on. I've found the generated code to either be wrong or really inefficient. In a few cases, the code was wrong, but in a way that was hard to spot; subtle bugs are costly to fix.

Bloated code

This is something other people have commented to me too: sometimes generated code is really bloated. I've had cases where what should have been a single line of code gets turned into 20 or more lines. Some of it is "well-intentioned", meaning lots of error trapping. But sometimes it's just a poor implementation. Bloated code is harder to maintain and slower to run.

Django

It took me a while to find the problems with Django code gen. On the whole, code gen for Django works astonishingly well; it's one of the huge benefits. But I've found the generated code to be inefficient in several ways:

  • The model manipulations have sometimes been odd or poor implementations. A more thoughtful approach to aggregation can make the code more readable and faster.
  • If the network connection is slow or backend computations take some time, a page can take a long time to even start to render. A better approach involves building the page so the user sees something quickly and then adding other elements as they become available. Code gen doesn't do this "out of the box".
  • UI layout can sometimes take a lot of prompting to get right. Mostly, it works really well, but occasionally, code gen finds something it really, really struggles with. Oddly, I've found it relatively easy to fix these issues by hand.

JavaScript oddities

Most of my work is in Python, but occasionally, I've wandered into JavaScript to build apps. I don't know a lot of JavaScript, and that's been the problem: I've been slow to spot when the generated code is wrong.

My projects have widgets and charts and I found the JavaScript callbacks and code were overcomplicated and bloated. I re-wrote the code to be 50% shorter and much clearer. It cost me some effort to come up to speed with JavaScript to spot and fix things.

Oddly, I found hallucination more of a problem for JavaScript than Python. My code gen system hallucinated the need to include an external CSS file that didn't exist and wasn't needed. Code gen also hallucinated "standard" functions that weren't available (that was a nice one to debug!).

Similar to my Python experience, I found code gen to be really bad at manipulating data objects. In a few cases, it would give me code that was flat out wrong.

'Unpopular' code

If you're using libraries that have been extensively used by others (e.g. requests, Django, etc.), code gen is mostly good. But when you're using libraries that are a little "off the beaten path", I've found code generation really drops down in quality. In a few cases, it's pretty much unusable.

A way forward through the trough of disappointment

It's possible that more thorough prompting might solve some of these problems, but I'm not entirely convinced. I've found that code generation often doesn't do well with very, very detailed and long prompting. Here's what I think is needed.

Accepting that code generation is flawed and needs adult supervision. It's a tool, not a magic wand. The development process must include checks that the code is correct.

Proper training. You need to spot when it's gone wrong and you need to intervene. This means knowing the languages you're code generating. I didn't know JavaScript well enough and I paid the price.

Libraries to learn from and use. Code gen learns from your codebase, but this isn't enough, especially if you're doing something new, and it can also mean code gen is learning the wrong things. Having a library means code gen isn't re-inventing the wheel each time.

In a corporate setting, all this means having thoughtful policies and practices for code gen and code development. Code gen is changing rapidly, which means policies and practices will need to be updated every six months, or when you learn something new.

Kintsugi

Kintsugi is the Japanese art of taking something broken (e.g., a pot or a vase) and mending it in a way that both acknowledges its brokenness and makes it more beautiful. Code generation isn't broken, but it can be made a lot more useful with some careful thought and acknowledging its weaknesses.