Friday, December 19, 2025

Small adventures with small language models

Small is the new large

I've been talking to people about small language models (SLMs) for a little while now. They've told me they've got great results and they're saving money compared to using LLMs; these are people running businesses so they know what they're talking about. At an AI event, someone recommended I read the recent and short NVIDIA SLM paper, so I did. The paper was compelling; it gave the simple message that SLMs are useful now and you can save time and money if you use them instead of LLMs. 

(If you want to use SLMs, you'll be using Ollama and HuggingFace. They work together really well.)

As a result of what I've heard and read, I've looked into SLMs and I'm going to share with you what I've found. The bottom line is: they're worth using, but with strong caveats.

What is an SLM?

The boundary between an SLM and an LLM is a bit blurry, but to put it simply, an SLM is any model small enough to run on a single computer (even a laptop). In reality, SLMs require quite a powerful machine (developer spec) as we'll see, but nothing special, and certainly nothing beyond the budget of almost all businesses. Many (but not all) SLMs are open-source.

(If your laptop is "business spec", e.g., a MacBook Air, you probably don't have enough computing power to test out SLMs.) 

How to get started

To really dive into SLMs, you need to be able to use Python, but you can get started without coding. Let's start with the non-coder's path because it's the easiest way for everyone to get going.

The first port of call is visiting ollama.com and downloading their software for your machine. Install the software and run it. You should see a UI like this.

Out-of-the-box, Ollama doesn't install any SLMs, so I'm going to show you how to install a model. From the drop-down menu on the bottom right, select llama3.2. This will install the model on your machine, which will take a minute or so. Remember, these models are resource hogs and using them will slow down your machine.

Once you've installed a model, ask it a question. For example, "Who is the Prime Minister of Canada?". The answer doesn't really matter, this is just a simple proof that your installation was successful. 

(By the way, the Ollama logo is very cute and they make great use of it. It shows you the power of good visual design.)

So many models!

The UI drop-down list shows a number of models, but these are a fraction of what's available. Go to this page to see a few more: https://ollama.com/library. This is a nice list, but you actually have access to thousands more. HuggingFace has a repository of models in the GGUF format; you can see the list here: https://huggingface.co/models?library=gguf

Some models are newer than others and some are better than others at certain tasks. HuggingFace has a leaderboard that's useful here: https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard. It does say LLM, but it includes SLMs too, and you can select an SLM-only view of the models. There are also model cards you can explore that give you insight into the performance of each model for different types of tasks.

To select the right models for your project, you'll need to define your problem and look for a model metric that most closely aligns with what you're trying to do. That's a lot of work, but to get started, you can install the popular models like mistral, llama3.2, and phi3 and get testing.

Who was the King of England in 1650?

You can't just generically evaluate an SLM; you have to evaluate it for the task you want to do. For example, if you want a chatbot to talk about the stock you have in your retail company, it's no use testing the model on questions like "who was King of England in 1650?". It's nice if the model knows Kings & Queens, but not really very useful to you. So your first task is defining your evaluation criteria.

(England didn't have a King in 1650, it was a republic. Parliament had executed the previous King in 1649. This is an interesting piece of history, but why do you care if your SLM knows it?)

Text analysis: data breaches

For my evaluation, I chose a project analyzing press reports on data breaches. I selected nine questions I wanted answers to from a press report. Here are my questions:

  • "Does the article discuss a data breach - answer only Yes or No"
  • "Which entity was breached?"
  • "How many records were breached?"
  • "What date did the breach occur - answer using dd-MMM-YYYY format, if the date is not mentioned, answer Unknown, if the date is approximate, answer with a range of dates"
  • "When was the breach discovered, be as accurate as you can"
  • "Is the cause of the breach known - answer Yes or No only"
  • "If the cause of the breach is known state it"
  • "Were there any third parties involved - answer only Yes or No"
  • "If there were third parties involved, list their names"

The idea is simple: give the SLM a number of press reports, get it to answer the questions on each article, and check the accuracy of the results for each SLM.

As it turns out, my questions need some work, but they're good enough to get started.

Where to run your SLM?

The first choice you face is which computer to run your SLM on. Your choices boil down to evaluating it on the cloud or on your local machine. If you evaluate on the cloud, you need to choose a machine that's powerful enough but also works with your budget. Of course, the advantage of cloud deployment is you can choose any machine you like. If you choose your local machine, it needs to be powerful enough for the job. The advantage of local deployment is that it's easier and cheaper to get started.

To get going quickly, I chose my local machine, but as it turned out, it wasn't quite powerful enough.

The code

This is where we part ways with the Ollama app and turn to coding. 

The first step is installing the Ollama Python module (https://github.com/ollama/ollama-python). Unfortunately, the documentation isn't great, so I'm going to help you through it.

We need to install the SLMs on our machine. This is easy to do: you can either do it via the command line or via the API. I'll just show you the command line way to install the model llama3.2:

ollama pull llama3.2
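
For completeness, the API route looks something like this (a minimal sketch using the Ollama Python module; check the module's documentation for the current signature):

import ollama

ollama.pull('llama3.2')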

Because we have the same nine questions we want to ask of each article, I'm going to create a 'custom' SLM. This means selecting a model (e.g. Llama3.2) and customizing it with my questions. Here's my code.

for progress in ollama.create(
    model='breach_analyzer',
    from_='llama3.2',
    system=system_prompt,
    stream=True,
):
    # stream progress updates while the custom model is created
    print(progress)

The system_prompt is the nine questions I showed you earlier plus a general prompt; model is the name I'm giving my custom model, in this case breach_analyzer.

Now I've customized my model, here's how I call it:

response = ollama.generate(
    model='breach_analyzer',
    prompt=prompt,
    format=BreachAnalysisResponse.model_json_schema(),
)

The prompt is the text of the article I want to analyze. The format argument tells the model to return JSON that matches the schema of BreachAnalysisResponse, so the response comes back as structured JSON rather than free text.
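
BreachAnalysisResponse is a Pydantic model whose fields mirror my nine questions. I'm not reproducing my exact class here, but a minimal sketch looks something like this (the field names are illustrative, not the ones in my repository):

from pydantic import BaseModel

class BreachAnalysisResponse(BaseModel):
    # one field per question; the SLM fills these in as JSON
    is_data_breach: str          # "Yes" or "No"
    breached_entity: str
    records_breached: str
    breach_date: str             # dd-MMM-YYYY, a range, or "Unknown"
    discovery_date: str
    cause_known: str             # "Yes" or "No"
    cause: str
    third_parties_involved: str  # "Yes" or "No"
    third_party_names: str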

Note I'm using generate here and not chat. My queries are "one-off" and there's no sense of a continuing dialog. If I'd wanted a continuing dialog, I'd have used the chat function.

Here's how my code works overall:

  1. Read in the text from six online articles.
  2. Load the model the user has selected (either mistral, llama3.2, or phi3).
  3. Customize the model.
  4. Run all six online articles through the customized model.
  5. Collect the results and analyze them.

I created two versions of my code, a command line version for testing and a Streamlit version for proper use. You can see both versions here: https://github.com/MikeWoodward/SLM-experiments/tree/main/Ollama
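
Stripped of the error handling and the Streamlit UI, the core loop looks something like this (a simplified sketch, not my exact code; article_urls is the list of six article URLs, and fetch_article_text is a helper sketched further down):

import json
import ollama

results = []
for url in article_urls:
    text = fetch_article_text(url)  # returns the text of the article's web page
    response = ollama.generate(
        model='breach_analyzer',
        prompt=text,
        format=BreachAnalysisResponse.model_json_schema(),
    )
    # the model's answer is JSON conforming to the schema above
    results.append(json.loads(response['response']))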

The results

The first thing I discovered is that these models are resource hogs! They hammered my machine and took 10-20 minutes to run each evaluation of six articles. My laptop is a 2020 developer-spec MacBook Pro, but it isn't really powerful enough to evaluate SLMs. The first lesson is: you need a powerful, recent machine to make this work, one with built-in GPUs that the SLM can access. I've heard from other people that running SLMs on high-spec machines leads to fast (usable) response times.

The second lesson is accuracy. Of the three models I evaluated, not all of them answered my questions correctly. One of the articles was about tennis, not data breaches, but one of the models incorrectly said it was about a data breach. Another of the models told me it was unclear whether there were third parties involved in a breach and then told me the name of the third party!

On reflection, I needed to tweak my nine questions to get clearer answers. But this was difficult because of the length of time it took to analyze each article. This is a general problem; it took so long to run the models that any tweaking of code or settings took too much time.

The overall winner in terms of accuracy was Phi-3, but this was also the slowest to run on my machine, taking nearly 20 minutes to analyze six articles. From commentary I've seen elsewhere, this model runs acceptably fast on a more powerful machine.

Here's the key question: could I replace paid-for LLMs with SLMs? My answer is: almost certainly yes, if you deploy your SLMs on a high-spec computer. There's certainly enough accuracy here to warrant a serious investigation.

How could I have improved the results?

The most obvious thing is a faster machine. A brand new top-of-the-range MacBook Pro with lots of memory and built-in GPUs. Santa, if you're listening, this is what I'd like. Alternatively, I could have gone onto the cloud and used a GPU machine.

My prompts could be better. They need some tweaking.

I get the text of these articles using requests, which returns everything on the page, including a lot of irrelevant navigation and boilerplate. A good next step would be to strip out the extraneous and distracting text before passing the article to the model. There are lots of ways to do that, and it's a job any competent programmer could do; one possibility is sketched below.
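
For example, a minimal cleanup pass could use BeautifulSoup (my choice for this sketch; it's not in the current code) to drop scripts, styles, and navigation before extracting the text:

import requests
from bs4 import BeautifulSoup

def fetch_article_text(url):
    """Fetch a page and return its visible text, minus obvious boilerplate."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # remove elements that are never part of the article body
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)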

If I could solve the speed problem, it would be good to investigate using multiple models. This could take several forms:

  • asking the same questions using multiple models and voting on the results
  • using different models for different questions.

What's notable about these ways of improving the results is how simple they are.
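
The voting idea in particular is only a few lines of code. A rough sketch, assuming each model's answer to a given question has already been collected into a list:

from collections import Counter

def majority_vote(answers):
    # e.g. ['Yes', 'Yes', 'No'] -> 'Yes'
    return Counter(answers).most_common(1)[0][0]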

Some musings

  • Evaluating SLMs is firmly in the technical domain. I've heard of non-technical people trying to play with these models, but they end up going nowhere because it takes technical skills to make them do anything useful.
  • There are thousands of models and selecting the right one for your use case can be a challenge. I suggest going with the most recent and/or ones that score most highly on the HuggingFace leaderboard.
  • It takes a powerful machine to run these models. A new high-end machine with GPUs would probably run these models "fast enough". If you have a very recent and powerful local machine, it's worth playing around with SLMs locally to get started, but for serious evaluation, you need to get on the cloud and spend money.
  • Some US businesses are allergic to models developed in certain countries, some European businesses want models developed in Europe. If the geographic origin of your model is important, you need to check before you start evaluating.
  • You can get cost savings compared to LLMs, but there's hard work to be done implementing SLMs.

I have a lot more to say about evaluations and SLMs that I'm not saying here. If you want to hear more, reach out to me.

Next steps

Ian Stokes-Rees gave an excellent tutorial at PyData Boston on this topic and that's my number one choice for where to go next.

After that, I suggest you read the Ollama docs and join their Discord server. After that, the Hugging Face Community is a good place to go. Lastly, look at the YouTube tutorials out there.

Thursday, December 18, 2025

The Skellam distribution

Distributions, distributions everywhere

There are a ton of distributions out there; SciPy alone implements well over a hundred, and that's nowhere near a complete set. I'm going to talk about one of the lesser-known distributions, the Skellam distribution, and what it's useful for. My point is a simple one: it's not enough for data scientists to know the main distributions; they must be aware that other distributions exist and have real-world uses.

Overview of the Skellam distribution

It's easy to define the Skellam distribution: it's the difference between two Poisson distributions, or more formally, the difference between two Poisson distributed random variables. 

So we don't get lost in the math, here's a picture of a Skellam distribution.

If you really must know, here's how the PMF is defined mathematically:

\[ P(Z = k; \mu_1, \mu_2) = e^{-(\mu_1 + \mu_2)} \left(\frac{\mu_1}{\mu_2}\right)^{k/2} I_k(2\sqrt{\mu_1 \mu_2}) \] where \(I_k(x)\) is given by the modified Bessel function: \[ I_k(x) = \sum_{j=0}^{\infty} \frac{1}{j!(j+|k|)!} \left(\frac{x}{2}\right)^{2j+|k|} \]

This all looks very complicated, but by now (2025) it's easy to code up. Here's the SciPy code to calculate the PMF:

from scipy import stats
probabilities = stats.skellam.pmf(k=k_values, mu1=mu1, mu2=mu2)
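
If you want to convince yourself it really is the difference of two Poisson variables, a quick simulation check (my own sketch, with arbitrary rates) lines up nicely with the PMF:

import numpy as np
from scipy import stats

mu1, mu2 = 3.0, 2.0
rng = np.random.default_rng(42)

# simulate the difference of two independent Poisson random variables
z = rng.poisson(mu1, 100_000) - rng.poisson(mu2, 100_000)

# compare the simulated frequency of each difference with the Skellam PMF
for k in range(-3, 4):
    simulated = np.mean(z == k)
    theoretical = stats.skellam.pmf(k, mu1=mu1, mu2=mu2)
    print(f"k={k:+d}  simulated={simulated:.4f}  skellam={theoretical:.4f}")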

What use is it?

Here are just a few uses I found:

  • Finance: modeling price changes between trades.
  • Medicine: modeling the change in the number of beds in an ICU, epileptic seizure counts during drug trials, differences in reported AIDS cases, and so on.
  • Sports: differences in home and away team football or hockey scores.
  • Technology: modeling sensor noise in cameras.

Where did it come from?

Skellam published the original paper on this distribution in 1946. There isn't a lot of background on why he did the work and, as far as I can tell, it wasn't related to World War II research in any way. It only really started to be discussed more widely once people discovered its use for modeling sports scores. It's been available as an off-the-shelf distribution in SciPy for over a decade now.

As an analyst, what difference does this make to you?

I worked in a place where the data we analyzed wasn't normally distributed (which isn't uncommon, a lot of data sets aren't normally distributed), so it was important that everyone knew at least something about non-normal statistics. I interviewed job candidates for some senior positions and asked them how they would analyze some obviously non-normal data. Far too many of them suggested using methods only suitable for normally distributed data. Some candidates had Master's degrees in relevant areas and told me they had never been taught how to analyze non-normal data and, even worse, they had never looked into it themselves. This was a major warning sign for us when recruiting.

Let's imagine you're given a new data set in a new area and you want to model it. It's obviously not normal, so what do you do? In these cases, you need to have an understanding of what other distributions are out there and their general shape and properties. You should just be able to look at data and guess a number of distributions that could work. You don't need to have an encyclopedic knowledge of them all, you just need to know they exist and you should know how to use a few of them. 
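
To make that concrete, here's a toy example of the kind of quick triage I have in mind: fit a handful of candidate distributions and compare their log-likelihoods (the data here is synthetic, just to illustrate the idea):

import numpy as np
from scipy import stats

# synthetic, skewed, obviously non-normal data
data = np.random.default_rng(0).gamma(shape=2.0, scale=3.0, size=1000)

# fit a few candidate distributions and compare by log-likelihood
for dist in (stats.norm, stats.gamma, stats.lognorm, stats.expon):
    params = dist.fit(data)
    loglik = np.sum(dist.logpdf(data, *params))
    print(f"{dist.name:8s} log-likelihood = {loglik:.1f}")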

Monday, December 15, 2025

Poisson to predict football results?

Goals are Poisson distributed?

I've read a lot of literature that suggests that goals in games like football (soccer) and hockey (ice hockey) are Poisson distributed. But are they? I've found out that it's not as simple as some of the papers and articles out there suggest. To dig into it, I'm going to define some terms and show you some analysis.

The Poisson distribution

The Poisson distribution is a discrete distribution that shows the probability distribution of the number of independent events occurring over a fixed time period or interval. Examples of its use include: the number of calls in a call center per hour, website visits per day, and manufacturing defects per batch. Here's what it looks like:

If this were a chart of defects per batch, the x-axis would be the number of defects and the y-axis would be the probability of that number of defects, so the probability of 2 defects per batch would be 0.275 (or 27.5%).

Here's its probability mass function:

\[ P(X = k; \lambda) = \frac{\lambda^{k}e^{-\lambda}}{k!} \]
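
In SciPy this is a one-liner. For instance, with an average rate of two defects per batch (an illustrative number, not the one behind the chart):

from scipy import stats

# probability of seeing exactly k defects when the average is 2 per batch
for k in range(6):
    print(k, stats.poisson.pmf(k, mu=2))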

Modeling football goals - leagues and seasons

A lot of articles, blogs, and papers suggest that football scores are well-modeled by the Poisson distribution. This is despite the fact that goals are not wholly independent of one another; it's well-known that scoring a goal changes a game's dynamics. 

To check if the Poisson distribution models scores well, here's what I did.

  1. Collected all English football league match results from 1888 to the present. This data includes the following fields: league_tier, season, home_club, home_goals, away_club, away_goals.
  2. Calculated a field total_goals (away_goals + home_goals).
  3. For each league_tier and each season, calculated relative frequency for total_goals, away_goals, and home_goals.
  4. Curve fit a Poisson distribution to the data.
  5. Calculated \(\chi^2\) and the associated p-value.

This gives me a dataframe of \(\chi^2\)  and p for each league_tier and season. In other words, I know how good a model the Poisson distribution is for goals scored in English league football.
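
Steps 4 and 5 boil down to a few lines of SciPy. Here's a simplified sketch of the idea rather than my exact pipeline: the Poisson rate is estimated by the sample mean, and the tail of the distribution is lumped into the last bin so the expected counts sum to the number of matches.

import numpy as np
from scipy import stats

def poisson_fit_quality(goals):
    # goals is a 1-D array of integer per-match goal counts
    # (total_goals, home_goals, or away_goals)
    goals = np.asarray(goals)
    lam = goals.mean()  # maximum-likelihood estimate of the Poisson rate
    k = np.arange(goals.max() + 1)
    observed = np.bincount(goals, minlength=k.size)
    expected = stats.poisson.pmf(k, lam) * goals.size
    # fold the tail beyond max(goals) into the last bin
    expected[-1] += stats.poisson.sf(goals.max(), lam) * goals.size
    chi2, p = stats.chisquare(observed, expected, ddof=1)  # one fitted parameter
    return chi2, p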

This is the best fit (lowest \(\chi^2\) for total_goals). It's for league_tier 2 (the EFL Championship) and season 2022-2023. The Poisson fit here is very good. There are a lot of league_tiers and seasons with pretty similar fits.

Here's the worst fit (highest \(\chi^2\) for total_goals). It's for league_tier 2 (the Second Division) and the 1919-1920 season (the first one after the First World War). By eye, it's still a reasonable approximation. It's an outlier though; there aren't many league_tiers and seasons with fits this bad.


Overall, it's apparent that the Poisson distribution is a very good way of modeling football results at the league_tier and season level. The papers and articles are right. But what about at the team level?

Modeling goals at the club level

Each season, a club faces a new set of opponents. If they change league tier (promotion, relegation), their opponents will be pretty much all new. If they stay in the same league, some opponents will be different (again due to promotion and relegation). If we want to test how good the Poisson distribution is at modeling results at the club level, we need to look season-by-season. This immediately introduces a noise problem; there are many more matches played in a league tier in a season than an individual club will play.

Following the same sort of process as before, I looked at how well the Poisson models goals at the club level. The answer is: not well.

The best-performing fit has a low \(\chi^2\) of 0.05; the worst has a value of 98643. This is a bit misleading though: a lot of the fits are bad. Rather than show you the best and the worst, I'll just show you the results for one team and one season: Liverpool in 2024-2025.

(To clarify, total goals is the total number of goals a club scored in a season: the sum of its home goals and its away goals.)

I checked the literature for club results modeling and I found that some authors found a Poisson distribution at the club level if they modeled the data over several seasons. I have mixed feelings about this. Although conditions vary within a season, they're more consistent than across different seasons. Over a period of several years, a majority of the players might have changed and of course, the remaining players will have aged. Is the Arsenal 2019 team the same as the Arsenal 2024 team? Where do you draw the line? On the other hand, the authors did find the Poisson distribution fit team results when aggregating over multiple seasons. As with all things in modeling sports results, there are deeper waters here and more thought and experimentation is required.

Although my season-by-season club fit \(\chi^2\) values aren't crazy, I think you'll agree with me that the fit isn't great and not particularly useful. Sadly, this is the consistent story with this data set. The bottom line is, I'm not sure how useful the Poisson distribution is for predicting scores at the club level for a single season.

Some theory that didn't work

It could be noise driving the poor fit at the club level, which is a variant of the "law of small numbers", but it could be something else. Looking at these results, I'm wondering if this is a case of the Poisson Limit Theorem. The Poisson Limit Theorem is simple: it states that as the number of trials in a Binomial distribution grows very large and the probability of success becomes very small (with the expected number of successes held constant), the Binomial distribution tends to the Poisson distribution. In other words, a Binomial distribution looks like a Poisson distribution when there are many trials, each with a small chance of success.

The obvious thing to do is to try fitting the data using the Binomial distribution instead. If the Binomial doesn't fit any better, it's not the Poisson Limit Theorem. 

I tried fitting the club data using the Binomial distribution and I got fractionally better results, but not enough that I would use the Binomial distribution for any real predictions. In other words, this isn't the Poisson Limit Theorem at work.

I went back to all the sources that spoke about using the Poisson distribution to predict goals. All of them used data aggregated to the league or season level. One or two used the Poisson to try and predict who would end up at the top of a league at the end of the season. No one showed results at the club level for a single season or wrote about club-level predictions. I guess I know why now.

Some thoughts on next steps

There are four things I'm mulling over:

  • The Poisson distribution is a good fit for a league tier for a season.
  • I don't see the Poisson distribution as a good fit for a club for a season.
  • Some authors report the Poisson distribution is a fit for a club over several (5 or more) seasons. But clubs change over time, sometimes radically over short periods.
  • The Poisson Limit Theorem kicks in when there are enough trials, each with a small probability of success.

A league tier consists of several clubs; right now, there are 20 clubs in the Premier League. By aggregating the results over a season for 20 unrelated clubs, I get data that's well fitted by the Poisson distribution. I'm wondering if the authors who modeled club data over five or more seasons got it right for the wrong reason. What if they had aggregated the results of 5 unrelated clubs in the same season, or even in different seasons? In other words, did they see a fit to multi-season club data because of aggregation alone?

Implications for predicting results

The Poisson distribution is a great way to model goals scored at the league and season level, but not so much at the club level. The Binomial distribution doesn't really work at the club level either. It may well be that each team plays too few matches in a season for us to fit their results using an off-the-shelf distribution. Or, put another way, randomness is too big an element of the game to let us make quick and easy predictions.

Sunday, December 14, 2025

Ozymandias

Some poetic background 

'Ozymandias' is one of my favorite poems; I find it easily accessible and the imagery very evocative. As you might expect, I've dug into the background a bit. There are some interesting stories behind the poem and I'm going to tell you one or two.

(Gemini)

Here's the background. The poem is the result of a friendly wager between Horace Smith and Percy Bysshe Shelley in 1817. The wager was to write a poem on the theme of an ancient Egyptian monumental sculpture that was then on its way to Britain. The statue was one of the spoils of war; Britain and France had been fighting for control in Egypt, with Egypt's antiquities one of the great prizes. The Younger Memnon statue was one of these antiquities, and after some adventures, it was successfully looted and brought to London (the French had tried to take it and failed, so it's another of those Anglo-French rivalries). British society had big expectations for the statue, hence the bet to write a poem about the remains of a large statue in a desert. Shelley published his poem in 1818 and 'won' the bet for the better poem.

(Younger Memnon statue - of Rameses II. British Museum. Creative Commons License.)

The titular Ozymandias is the Greek-language version of the name of the Egyptian pharaoh Ramesses II (1279–1213 BCE). During his 66-year reign, Egypt built many cities and temples and successfully waged war against old rivals; scholars regard him as one of the great pharaohs. The Younger Memnon sculpture depicts Ramesses II in his youth. So in 1817, we have a statue of a once-great pharaoh whose empire has crumbled into dust, leaving only statues and ruins behind.

The poem

Here's the entire poem.

I met a traveller from an antique land

Who said: Two vast and trunkless legs of stone

Stand in the desert. Near them, on the sand,

Half sunk, a shattered visage lies, whose frown,

And wrinkled lip, and sneer of cold command,

Tell that its sculptor well those passions read

Which yet survive, stamped on these lifeless things,

The hand that mocked them and the heart that fed:

And on the pedestal these words appear:

"My name is Ozymandias, King of Kings:

Look on my works, ye Mighty, and despair!"

Nothing beside remains. Round the decay

Of that colossal wreck, boundless and bare

The lone and level sands stretch far away.

(Here's a page of literary criticism/analysis of the poem.) 

The other entry

Here's Horace Smith's poem on the same theme.

In Egypt's sandy silence, all alone,

Stands a gigantic Leg, which far off throws

The only shadow that the Desert knows:—

"I am great OZYMANDIAS," saith the stone,

"The King of Kings; this mighty City shows

The wonders of my hand."— The City's gone,—

Naught but the Leg remaining to disclose

The site of this forgotten Babylon.


We wonder — and some Hunter may express

Wonder like ours, when thro' the wilderness

Where London stood, holding the Wolf in chace,

He meets some fragment huge, and stops to guess

What powerful but unrecorded race

Once dwelt in that annihilated place.

I'm not a poetry critic, but even I can see that Shelley's entry is greatly superior.

Best readings

There are many, many readings of Shelley's poem on the internet. A lot of people like John Gielgud's reading (maybe because he's English and was a classical actor's actor), but for me, Bryan Cranston's reading is the best.

Shelley's life story

Shelley's life story is worth disappearing into a wiki-hole for. It's a lurid tale of political radicalism, lust, and poetry. Even today, some of Shelley's exploits seem wild, and it's easy to see why they would have been shocking two hundred years ago. Of course, it would be remiss of me not to say that Shelley had a hand in the creation of Frankenstein (his wife, Mary Shelley, was the author). Like other great cultural icons, he died young, at 29.

Friday, December 12, 2025

Data sonification: a curious oddity that may have some uses

What is sonification?

The concept is simple: you turn data into sound. Obviously, you can play with frequency and volume, but there are more subtle sonic things you can play with to represent data. Let's imagine you had sales data for different countries that went up and down over time. You could assign a different instrument to each country (e.g., drum for the US, piano for Germany, violin for France), and represent different sales volumes as different notes. The hope, of course, is that the notes get higher as sales increase.

If you have more musical experience, you could turn data sets into more interesting music, for example, mapping ups and downs in the data to shifts in tone and speed. 
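
To make this concrete, here's a minimal sketch of the pitch-mapping idea using NumPy and Python's built-in wave module (the function and the toy sales figures are made up for illustration):

import wave
import numpy as np

def sonify(values, filename="sales.wav", rate=44100, note_seconds=0.25):
    # map each data point to a pitch: higher value -> higher note
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    # scale the data onto two octaves, 220 Hz to 880 Hz
    freqs = 220 + (values - lo) / (hi - lo + 1e-9) * (880 - 220)
    t = np.arange(int(rate * note_seconds)) / rate
    tones = [0.5 * np.sin(2 * np.pi * f * t) for f in freqs]
    pcm = (np.concatenate(tones) * 32767).astype(np.int16)
    with wave.open(filename, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(pcm.tobytes())

sonify([10, 12, 15, 11, 20, 25, 22])  # e.g. monthly sales figures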

(Gemini)

Examples

Perhaps the simplest sonification example is the one you've probably seen in movies: using a Geiger counter to measure radiation. The more it clicks, the more radiation there is. Because it's noise rather than a dial, the user can focus their eyes on where they point the detector and use their ears to detect radiation. It's so simple, even James Bond has used a Geiger counter. In a similar vein, metal detectors use sound to alert the user to the presence of metal.

Perhaps the best example I've heard of sonification is Brian Foo mapping income inequality along the New York Subway's 2 line. You can watch video and music here: https://vimeo.com/118358642?fl=pl&fe=sh. He's turned a set of data into a story and you can see how this could be taken further into a full-on multi-media presentation.

Sometimes, our ears can tell us things our eyes can't. Steve Mould's video on "The planets are weirdly in sync" has a great sonification example starting here: https://youtu.be/Qyn64b4LNJ0?t=1110; the sonification shows up data relationships that charts or animations can't. The whole video is worth a watch too (https://www.youtube.com/watch?v=Qyn64b4LNJ0).

There are two other related examples of sonification I want to share. 

In a nuclear facility, you sometimes hear a background white noise sound. That signifies that all is well. If the sound goes away, that signifies something very bad has happened and you need to get out fast. Why not sound an alarm if something bad happens? Because if something really bad happens, there might not be power for the alarm. Silence is a fail-safe.

In a similar vein, years ago I worked on an audio processing system. We needed to know the system was reliable, so we played a CD of music over and over through the system. If we ever heard a break or glitch in the music, we knew the audio system had failed and we needed to intervene to catch the bug. This was a kind of ongoing sonic quality assurance system.

What use is it?

Frankly, sonification isn't something I would see people use every day. It's a special purpose thing, but it's handy to know about. Here are two use cases.

  • The obvious one is presenting company data. This could be sales, or clicks, or conversion etc. With a bit of effort and musical ability, you could do the kind of thing that Brian Foo did. Imagine an investor presentation (or even an all-hands meeting) with a full-on multi-media presentation with charts, video, and sound.
  • The other use is safety and alerting. Imagine a company selling items on a website. It could pipe music into common areas (e.g., restrooms and lunch areas). If sales are going well, it plays fast music; if they're slow, it plays slow music. If there are no sales at all, you get silence. This is a way of alerting everyone to the rhythm of sales, and to when something goes wrong. Obviously, this could go too far, but you get the idea.

Finding out more

Sonification: the music of data - https://www.youtube.com/watch?v=br_8wXKgtkg

The planets are weirdly in sync - https://www.youtube.com/watch?v=Qyn64b4LNJ0

Brian Foo's sonifications - https://datadrivendj.com/

NASA's astronomical data sonifications - https://science.nasa.gov/mission/hubble/multimedia/sonifications/

The sound of science - https://pmc.ncbi.nlm.nih.gov/articles/PMC11387736/

Monday, December 1, 2025

Some musings on code generation: kintsugi

Hype and reality

I've been using AI code generation (Claude, Gemini, Cursor...) for months and I'm familiar with its strengths and weaknesses. It feels like I've gone through the whole hype cycle (see https://en.wikipedia.org/wiki/Gartner_hype_cycle) and now I'm firmly on the Plateau of Productivity. Here are some musings covering benefits, disappointments, and a way forward.

(The Japanese art of Kintsugi. Image by Gemini.)

Benefits

Elsewhere, people have waxed lyrical about the benefits of code generation, so I'm just going to add in a few novel points.

It's great when you're unfamiliar with an area of a language; it acts as a prompt or tutorial. In the past, you'd have to wade through pages of documentation and write code to experiment. Alternatively, you could search to see if anyone's tackled your problem and has a solution. If you were really stuck, you could try and ask a question on Stack Overflow and deal with the toxicity. Now, you can get something to get you going quickly.

Modern code development requires properly commenting code, making sure code is "linted" and PEP8 compliant, and creating test cases etc. While these things are important, they can consume a lot of time. Code generation steps on the accelerator pedal and makes them go much faster. In fact, code gen makes it quite reasonable to raise the bar on code quality.

Disappointments

Pandas dataframes

I've found code gen really doesn't do well manipulating Pandas dataframes. Several times, I've wanted to transform dataframes or do something non-trivial, for example, aggregating data, merging dataframes, transforming a column in some complex way and so on. I've found the generated code to either be wrong or really inefficient. In a few cases, the code was wrong, but in a way that was hard to spot; subtle bugs are costly to fix.

Bloated code

This is something other people have commented to me too: sometimes generated code is really bloated. I've had cases where what should have been a single line of code gets turned into 20 or more lines. Some of it is "well-intentioned", meaning lots of error trapping. But sometimes it's just a poor implementation. Bloated code is harder to maintain and slower to run.

Django

It took me a while to find the problems with Django code gen. On the whole, code gen for Django works astonishingly well; it's one of the huge benefits. But I've found the generated code to be inefficient in several ways:

  • The model manipulations have sometimes been odd or poor implementations. A more thoughtful approach to aggregation can make the code more readable and faster.
  • If the network connection is slow or backend computations take some time, a page can take a long time to even start to render. A better approach involves building the page so the user sees something quickly and then adding other elements as they become available. Code gen doesn't do this "out of the box".
  • UI layout can sometimes take a lot of prompting to get right. Mostly, it works really well, but occasionally, code gen finds something it really, really struggles with. Oddly, I've found it relatively easy to fix these issues by hand.

JavaScript oddities

Most of my work is in Python, but occasionally, I've wandered into JavaScript to build apps. I don't know a lot of JavaScript, and that's been the problem: I've been slow to spot when the generated code is wrong.

My projects have widgets and charts and I found the JavaScript callbacks and code were overcomplicated and bloated. I re-wrote the code to be 50% shorter and much clearer. It cost me some effort to come up to speed with JavaScript to spot and fix things.

Oddly, I found hallucination more of a problem for JavaScript than Python. My code gen system hallucinated the need to include an external CSS file that didn't exist and wasn't needed. Code gen also hallucinated "standard" functions that weren't available (that was a nice one to debug!).

Similar to my Python experience, I found code gen to be really bad at manipulating data objects. In a few cases, it would give me code that was flat out wrong.

'Unpopular' code

If you're using libraries that have been extensively used by others (e.g. requests, Django, etc.), code gen is mostly good. But when you're using libraries that are a little "off the beaten path", I've found code generation really drops down in quality. In a few cases, it's pretty much unusable.

A way forward through the trough of disappointment

It's possible that more thorough prompting might solve some of these problems, but I'm not entirely convinced. I've found that code generation often doesn't do well with very, very detailed and long prompting. Here's what I think is needed.

Accepting that code generation is flawed and needs adult supervision. It's a tool, not a magic wand. The development process must include checks that the code is correct.

Proper training. You need to spot when it's gone wrong and you need to intervene. This means knowing the languages you're code generating. I didn't know JavaScript well enough and I paid the price.

Libraries to learn from and use. Code gen learns from your codebase, but this isn't enough, especially if you're doing something new, and it can also mean code gen is learning the wrong things. Having a library means code gen isn't re-inventing the wheel each time.

In a corporate setting, all this means having thoughtful policies and practices for code gen and code development. Code gen is changing rapidly, which means policies and practices will need to be updated every six months, or when you learn something new.

Kintsugi

Kintsugi is the Japanese art of taking something broken (e.g., a pot or a vase) and mending it in a way that both acknowledges its brokenness and makes it more beautiful. Code generation isn't broken, but it can be made a lot more useful with some careful thought and acknowledging its weaknesses.