Friday, December 19, 2025

Small adventures with small language models

Small is the new large

I've been talking to people about small language models (SLMs) for a little while now. They've told me they've got great results and they're saving money compared to using LLMs; these are people running businesses so they know what they're talking about. At an AI event, someone recommended I read the recent and short NVIDIA SLM paper, so I did. The paper was compelling; it gave the simple message that SLMs are useful now and you can save time and money if you use them instead of LLMs. 

(If you want to use SLMs, you'll be using Ollama and HuggingFace. They work together really well.)

As a result of what I've heard and read, I've looked into SLMs and I'm going to share with you what I've found. The bottom line is: they're worth using, but with strong caveats.

What is an SLM?

The boundary between an SLM and an LLM is a bit blurry, but to put it simply, an SLM is any model small enough to run on a single computer (even a laptop). In reality, SLMs require quite a powerful machine (developer spec) as we'll see, but nothing special, and certainly nothing beyond the budget of almost all businesses. Many (but not all) SLMs are open-source.

(If your laptop is "business spec", e.g., a MacBook Air, you probably don't have enough computing power to test out SLMs.) 

How to get started

To really dive into SLMs, you need to be able to use Python, but you can get started without coding. Let's start with the no-code path because it's the easiest way for everyone to get going.

The first port of call is visiting ollama.com and downloading their software for your machine. Install the software and run it. You should see a UI like this.

Out-of-the-box, Ollama doesn't install any SLMs, so I'm going to show you how to install a model. From the drop-down menu on the bottom right, select llama3.2. This will install the model on your machine, which will take a minute or so. Remember, these models are resource hogs and using them will slow down your machine.

Once you've installed a model, ask it a question. For example, "Who is the Prime Minister of Canada?". The answer doesn't really matter; this is just a simple check that your installation was successful.

(By the way, the Ollama logo is very cute and they make great use of it. It shows you the power of good visual design.)

So many models!

The UI drop-down list shows a number of models, but these are a fraction of what's available. Go to this page to see a few more: https://ollama.com/library. This is a nice list, but you actually have access to thousands more. HuggingFace has a repository of models in the GGUF format; you can see the list here: https://huggingface.co/models?library=gguf

Some models are newer than others and some are better than others at certain tasks. HuggingFace has a leaderboard that's useful here: https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard. It does say LLM, but it includes SLMs too, and you can select an SLM-only view of the models. There are also model cards you can explore that give you insight into the performance of each model for different types of tasks.

To select the right models for your project, you'll need to define your problem and look for a model metric that most closely aligns with what you're trying to do. That's a lot of work, but to get started, you can install the popular models like mistral, llama3.2, and phi3 and get testing.

Who was the King of England in 1650?

You can't just generically evaluate an SLM, you have to evaluate it for the task you want to do. For example, if you want a chatbot to talk about the stock you have in your retail company, it's no use testing the model on questions like "who was King of England in 1650?". It's nice if the model knows Kings & Queens, but not really very useful to you. So your first task is defining your evaluation criteria.

(England didn't have a King in 1650, it was a republic. Parliament had executed the previous King in 1649. This is an interesting piece of history, but why do you care if your SLM knows it?)

Text analysis: data breaches

For my evaluation, I chose a project analyzing press reports on data breaches. I selected nine questions I wanted answers to from a press report. Here are my questions:

  • "Does the article discuss a data breach - answer only Yes or No"
  • "Which entity was breached?"
  • "How many records were breached?"
  • "What date did the breach occur - answer using dd-MMM-YYYY format, if the date is not mentioned, answer Unknown, if the date is approximate, answer with a range of dates"
  • "When was the breach discovered, be as accurate as you can"
  • "Is the cause of the breach known - answer Yes or No only"
  • "If the cause of the breach is known state it"
  • "Were there any third parties involved - answer only Yes or No"
  • "If there were third parties involved, list their names"

The idea is simple: give the SLM a number of press reports, get it to answer the questions on each article, and check the accuracy of the results for each SLM.

As it turns out, my questions need some work, but they're good enough to get started.

Where to run your SLM?

The first choice you face is which computer to run your SLM on. Your choices boil down to evaluating it on the cloud or on your local machine. If you evaluate on the cloud, you need to choose a machine that's powerful enough but also works with your budget. Of course, the advantage of cloud deployment is you can choose any machine you like. If you choose your local machine, it needs to be powerful enough for the job. The advantage of local deployment is that it's easier and cheaper to get started.

To get going quickly, I chose my local machine, but as it turned out, it wasn't quite powerful enough.

The code

This is where we part ways with the Ollama app and turn to coding. 

The first step is installing the Ollama Python module (https://github.com/ollama/ollama-python). Unfortunately, the documentation isn't great, so I'm going to help you through it.

We need to install the SLMs on our machine. This is easy to do: you can do it either via the command line or via the Python API. Here's the command line way to install the model llama3.2:

ollama pull llama3.2
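
If you'd rather stay in Python, the equivalent call through the Ollama module is a one-liner. A minimal sketch:

import ollama

# Download the model through the Python API - equivalent to "ollama pull llama3.2".
ollama.pull('llama3.2')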

Because we have the same nine questions we want to ask of each article, I'm going to create a 'custom' SLM. This means selecting a model (e.g. Llama3.2) and customizing it with my questions. Here's my code.

for progress in ollama.create(
    model='breach_analyzer',
    from_='llama3.2',
    system=system_prompt,
    stream=True,
):
    # stream=True yields progress updates while the model is being created.
    print(progress)

The system_prompt is my nine questions I showed you earlier plus a general prompt. model is the name I'm giving my custom model; in this case I'm calling it breach_analyzer.
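
For reference, system_prompt is just a string. Here's roughly how it could be assembled from the questions (the wording of the general instruction is illustrative, not my exact prompt):

questions = [
    "Does the article discuss a data breach - answer only Yes or No",
    "Which entity was breached?",
    # ...and the rest of the nine questions shown above...
]
# Illustrative general instruction followed by the numbered questions.
system_prompt = (
    "You are analyzing a press report about a possible data breach. "
    "Answer each of the following questions using only the article text.\n"
    + "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
)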

Now I've customized my model, here's how I call it:

response = ollama.generate(
    model='breach_analyzer',
    prompt=prompt,
    format=BreachAnalysisResponse.model_json_schema(),
)

The prompt is the text of the article I want to analyze. The format is the JSON schema I want the results to follow. The response comes back from the model in the JSON format defined by BreachAnalysisResponse.model_json_schema().
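
BreachAnalysisResponse is a Pydantic model with one field per question. Here's a hedged sketch of the kind of thing it looks like (the field names are illustrative, not my exact definitions):

from pydantic import BaseModel

# Illustrative only - one field per question, all returned as strings.
class BreachAnalysisResponse(BaseModel):
    is_data_breach: str         # "Yes" or "No"
    breached_entity: str
    records_breached: str
    breach_date: str            # dd-MMM-YYYY, a date range, or "Unknown"
    breach_discovered: str
    cause_known: str            # "Yes" or "No"
    cause: str
    third_parties_involved: str # "Yes" or "No"
    third_party_names: str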

Note I'm using generate here and not chat. My queries are "one-off" and there's no sense of a continuing dialog. If I'd wanted a continuing dialog, I'd have used the chat function.
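
For comparison, here's roughly what a chat call looks like; it keeps a running list of messages so the model sees the dialog history (this reuses the prompt variable from above, and the follow-up handling is illustrative):

messages = [{'role': 'user', 'content': prompt}]
response = ollama.chat(
    model='breach_analyzer',
    messages=messages,
)
# Keep the reply so the next question carries the full dialog history.
messages.append({'role': 'assistant', 'content': response['message']['content']})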

Here's how my code works overall:

  1. Read in the text from six online articles.
  2. Load the model the user has selected (either mistral, llama3.2, or phi3).
  3. Customize the model.
  4. Run all six online articles through the customized model.
  5. Collect the results and analyze them.

I created two versions of my code, a command line version for testing and a Streamlit version for proper use. You can see both versions here: https://github.com/MikeWoodward/SLM-experiments/tree/main/Ollama

The results

The first thing I discovered is that these models are resource hogs! They hammered my machine and took 10-20 minutes to run each evaluation of six articles. My laptop is a 2020 developer spec MacBook Pro, but it isn't really powerful enough to evaluate SLMs. The first lesson is: you need a powerful, recent machine to make this work, one with built-in GPUs that the SLM can access. I've heard from other people that running SLMs on high-spec machines leads to fast (usable) response times.

The second lesson is about accuracy. Not all of the three models I evaluated answered my questions correctly. One of the articles was about tennis, not data breaches, but one of the models incorrectly said it was about a data breach. Another of the models told me it was unclear whether there were third parties involved in a breach and then told me the name of the third party!

On reflection, I needed to tweak my nine questions to get clearer answers. But this was difficult because of the length of time it took to analyze each article. This is a general problem; it took so long to run the models that any tweaking of code or settings took too much time.

The overall winner in terms of accuracy was Phi-3, but this was also the slowest to run on my machine, taking nearly 20 minutes to analyze six articles. From commentary I've seen elsewhere, this model runs acceptably fast on a more powerful machine.

Here's the key question: could I replace paid-for LLMs with SLMs? My answer is: almost certainly yes, if you deploy your SLMs on a high-spec computer. There's certainly enough accuracy here to warrant a serious investigation.

How could I have improved the results?

The most obvious thing is a faster machine: a brand new top-of-the-range MacBook Pro with lots of memory and built-in GPUs. Santa, if you're listening, this is what I'd like. Alternatively, I could have gone onto the cloud and used a GPU machine.

My prompts could be better. They need some tweaking.

I get the text of these articles using requests. This gives me all of the text on the page, which includes a lot of irrelevant stuff. A good next step would be to get rid of some of the extraneous and distracting text before passing it to the model. There are lots of ways to do that and it's a job any competent programmer could do.
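
Here's a hedged sketch of the kind of clean-up I mean, using requests plus BeautifulSoup (assuming BeautifulSoup is an acceptable extra dependency); it strips scripts, styles, and navigation before handing the text to the model:

import requests
from bs4 import BeautifulSoup

def get_article_text(url):
    """Fetch a page and return its visible text with obvious clutter removed."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, 'html.parser')
    # Drop tags that rarely contain article content.
    for tag in soup(['script', 'style', 'nav', 'header', 'footer', 'aside']):
        tag.decompose()
    return ' '.join(soup.get_text(separator=' ').split())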

If I could solve the speed problem, it would be good to investigate using multiple models. This could take several forms:

  • asking the same questions using multiple models and voting on the results (there's a sketch of this below)
  • using different models for different questions.
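
Here's a minimal sketch of the voting idea, assuming each model's answer to a question has already been collected (the model names and answers are illustrative):

from collections import Counter

def majority_answer(answers):
    """Return the most common answer across models, e.g. for a Yes/No question."""
    votes = Counter(answer.strip().lower() for answer in answers)
    return votes.most_common(1)[0][0]

# Illustrative: one question, three models' answers.
answers_by_model = {'mistral': 'Yes', 'llama3.2': 'Yes', 'phi3': 'No'}
print(majority_answer(answers_by_model.values()))  # -> "yes"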

What's notable about these ways of improving the results is how simple they are.

Some musings

  • Evaluating SLMs is firmly in the technical domain. I've heard of non-technical people trying to play with these models, but they end up going nowhere because it takes technical skills to make them do anything useful.
  • There are thousands of models and selecting the right one for your use case can be a challenge. I suggest going with the most recent and/or ones that score most highly on the HuggingFace leaderboard.
  • It takes a powerful machine to run these models. A new high-end machine with GPUs would probably run these models "fast enough". If you have a very recent and powerful local machine, it's worth playing around with SLMs locally to get started, but for serious evaluation, you need to get on the cloud and spend money.
  • Some US businesses are allergic to models developed in certain countries, some European businesses want models developed in Europe. If the geographic origin of your model is important, you need to check before you start evaluating.
  • You can get cost savings compared to LLMs, but there's hard work to be done implementing SLMs.

I have a lot more to say about evaluations and SLMs that I'm not saying here. If you want to hear more, reach out to me.

Next steps

Ian Stokes-Rees gave an excellent tutorial at PyData Boston on this topic and that's my number one choice for where to go next.

After that, I suggest you read the Ollama docs and join their Discord server. After that, the Hugging Face Community is a good place to go. Lastly, look at the YouTube tutorials out there.

Thursday, December 18, 2025

The Skellam distribution

Distributions, distributions everywhere

There are a ton of distributions out there; SciPy alone has well over a hundred and that's nowhere near a complete set. I'm going to talk about one of the lesser known distributions, the Skellam distribution, and what it's useful for. My point is a simple one: it's not enough for data scientists to know the main distributions, they must be aware that other distributions exist and have real-world uses.

Overview of the Skellam distribution

It's easy to define the Skellam distribution: it's the difference between two Poisson distributions, or more formally, the difference between two Poisson distributed random variables. 

So we don't get lost in the math, here's a picture of a Skellam distribution.

If you really must know, here's how the PMF is defined mathematically:

\[ P(Z = k; \mu_1, \mu_2) = e^{-(\mu_1 + \mu_2)} \left(\frac{\mu_1}{\mu_2}\right)^{k/2} I_k(2\sqrt{\mu_1 \mu_2}) \]

where \(I_k(x)\) is the modified Bessel function of the first kind:

\[ I_k(x) = \sum_{j=0}^{\infty} \frac{1}{j!(j+|k|)!} \left(\frac{x}{2}\right)^{2j+|k|} \]

This all looks very complicated, but by now (2025) it's easy to code up. Here's the SciPy code to calculate the PMF:

probabilities = stats.skellam.pmf(k=k_values, mu1=mu1, mu2=mu2)
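
Here's a slightly fuller, runnable version with the imports and some illustrative parameter values (the choice of mu1, mu2, and the range of k is arbitrary):

import numpy as np
from scipy import stats

mu1, mu2 = 5.0, 3.0                    # means of the two Poisson variables
k_values = np.arange(-10, 16)          # the difference can be negative
probabilities = stats.skellam.pmf(k_values, mu1, mu2)

# Sanity check: over a wide enough range of k the probabilities sum to ~1.
print(probabilities.sum())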

What use is it?

Here are just a few uses I found:

  • Finance: modeling price changes between trades.
  • Medicine: modeling the change in the number of beds in an ICU, epileptic seizure counts during drug trials, differences in reported AIDS cases, and so on.
  • Sports: differences in home and away team football or hockey scores.
  • Technology: modeling sensor noise in cameras.

Where did it come from?

Skellam published the original paper on this distribution in 1946. There isn't a lot of background on why he did the work and, as far as I can tell, it wasn't related to World War II research work in any way. It's only really been discussed more widely since people discovered its use for modeling sports scores. It's been available as an off-the-shelf distribution in SciPy for over a decade now.

As an analyst, what difference does this make to you?

I worked in a place where the data we analyzed wasn't normally distributed (which isn't uncommon, a lot of data sets aren't normally distributed), so it was important that everyone knew at least something about non-normal statistics. I interviewed job candidates for some senior positions and asked them how they would analyze some obviously non-normal data. Far too many of them suggested using methods only suitable for normally distributed data. Some candidates had Master's degrees in relevant areas and told me they had never been taught how to analyze non-normal data and, even worse, they never looked into it themselves. This was a major warning sign for us when recruiting.

Let's imagine you're given a new data set in a new area and you want to model it. It's obviously not normal, so what do you do? In these cases, you need to have an understanding of what other distributions are out there and their general shape and properties. You should just be able to look at data and guess a number of distributions that could work. You don't need to have an encyclopedic knowledge of them all, you just need to know they exist and you should know how to use a few of them. 

Monday, December 15, 2025

Poisson to predict football results?

Goals are Poisson distributed?

I've read a lot of literature that suggests that goals in games like football (soccer) and hockey (ice hockey) are Poisson distributed. But are they? I've found out that it's not as simple as some of the papers and articles out there suggest. To dig into it, I'm going to define some terms and show you some analysis.

The Poisson distribution

The Poisson distribution is a discrete distribution that gives the probability of a number of independent events occurring over a fixed time period or interval. Examples of its use include: the number of calls in a call center per hour, website visits per day, and manufacturing defects per batch. Here's what it looks like:

If this were a chart of defects per batch, the x-axis would be the number of defects and the y-axis would be the probability of that number of defects, so the probability of 2 defects per batch would be 0.275 (or 27.5%).

Here's its probability mass function formula:

\[ P(X = k; \lambda) = \frac{ \lambda^{k}e^{-\lambda}}{k!} \]
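
In SciPy, computing this is a one-liner. A small runnable sketch (the value of the rate \(\lambda\) is just an example):

import numpy as np
from scipy import stats

lam = 2.0                              # example rate, e.g. mean defects per batch
k_values = np.arange(0, 11)
probabilities = stats.poisson.pmf(k_values, mu=lam)
print(probabilities[2])                # probability of exactly 2 events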

Modeling football goals - leagues and seasons

A lot of articles, blogs, and papers suggest that football scores are well-modeled by the Poisson distribution. This is despite the fact that goals are not wholly independent of one another; it's well-known that scoring a goal changes a game's dynamics. 

To check if the Poisson distribution models scores well, here's what I did.

  1. Collected all English football league match results from 1888 to the present. This data includes the following fields: league_tier, season, home_club, home_goals, away_club, away_goals.
  2. Calculated a field total_goals (away_goals + home_goals).
  3. For each league_tier and each season, calculated relative frequency for total_goals, away_goals, and home_goals.
  4. Curve fit a Poisson distribution to the data.
  5. Calculated \(\chi^2\) and the associated p-value.

This gives me a dataframe of \(\chi^2\)  and p for each league_tier and season. In other words, I know how good a model the Poisson distribution is for goals scored in English league football.
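
Here's a hedged sketch of the fitting step for a single league_tier and season, assuming goals is an array of total_goals values for the matches in that season (the function and variable names are mine, not my exact code):

import numpy as np
from scipy import stats

def poisson_fit_quality(goals):
    """Fit a Poisson to observed goal counts and return chi-squared and its p-value."""
    goals = np.asarray(goals)
    lam = goals.mean()                                   # MLE of the Poisson rate
    k_values = np.arange(goals.max() + 1)
    observed = np.bincount(goals, minlength=len(k_values))
    expected = stats.poisson.pmf(k_values, mu=lam) * len(goals)
    chi2 = ((observed - expected) ** 2 / expected).sum()
    # Degrees of freedom: number of bins, minus 1, minus 1 for the fitted rate.
    p_value = stats.chi2.sf(chi2, df=len(k_values) - 2)
    # In practice you'd also lump sparse tail bins so expected counts aren't tiny.
    return chi2, p_value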

This is the best fit (lowest \(\chi^2\) for total_goals). It's for league_tier 2 (the EFL Championship) and season 2022-2023. The Poisson fit here is very good. There are a lot of league_tiers and seasons with pretty similar fits.

Here's the worst fit (highest \(\chi^2\) for total_goals). It's for league_tier 2 (the Second Division) and the 1919-1920 season (the first one after the First World War). By eye, it's still a reasonable approximation. It's an outlier though; there aren't many league_tiers and seasons with fits this bad.


Overall, it's apparent that the Poisson distribution is a very good way of modeling football results at the league_tier and season level. The papers and articles are right. But what about at the team level?

Modeling goals at the club level

Each season, a club faces a new set of opponents. If they change league tier (promotion, relegation), their opponents will be pretty much all new. If they stay in the same league, some opponents will be different (again due to promotion and relegation). If we want to test how good the Poisson distribution is at modeling results at the club level, we need to look season-by-season. This immediately introduces a noise problem; there are many more matches played in a league tier in a season than an individual club will play.

Following the same sort of process as before, I looked at how well the Poisson models goals at the club level. The answer is: not well.

The best-performing fit has a low \(\chi^2\) of 0.05; the worst has a value of 98,643. This is a bit misleading though: a lot of the fits are bad. Rather than show you the best and the worst, I'll just show you the results for one team and one season: Liverpool in 2024-2025.

(To clarify, total goals is the total number of goals scored in a season by a club, it's the sum of their home goals and their away goals.)

I checked the literature for club results modeling and I found that some authors found a Poisson distribution at the club level if they modeled the data over several seasons. I have mixed feelings about this. Although conditions vary within a season, they're more consistent than across different seasons. Over a period of several years, a majority of the players might have changed and of course, the remaining players will have aged. Is the Arsenal 2019 team the same as the Arsenal 2024 team? Where do you draw the line? On the other hand, the authors did find the Poisson distribution fit team results when aggregating over multiple seasons. As with all things in modeling sports results, there are deeper waters here and more thought and experimentation is required.

Although my season-by-season club fit \(\chi^2\) values aren't crazy, I think you'll agree with me that the fit isn't great and not particularly useful. Sadly, this is the consistent story with this data set. The bottom line is, I'm not sure how useful the Poisson distribution is for predicting scores at the club level for a single season.

Some theory that didn't work

It could be noise driving the poor fit at the club level, which is a variant of the "law of small numbers", but it could be something else. Looking at these results, I'm wondering if this is a case of the Poisson Limit Theorem. The Poisson Limit Theorem is simple: it states that as the number of trials in a Binomial distribution increases towards infinity (with the expected number of successes held fixed), the distribution tends to the Poisson distribution. In other words, Binomial distributions look like Poisson distributions if you have enough trials.
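
You can check the theorem numerically in a few lines; n and the mean here are just illustrative values:

import numpy as np
from scipy import stats

lam, n = 2.5, 1000                     # fixed mean, large number of trials
k_values = np.arange(0, 12)
binom_pmf = stats.binom.pmf(k_values, n, lam / n)
poisson_pmf = stats.poisson.pmf(k_values, mu=lam)
print(np.abs(binom_pmf - poisson_pmf).max())   # a very small difference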

The obvious thing to do is to try fitting the data using the Binomial distribution instead. If the Binomial doesn't fit any better, it's not the Poisson Limit Theorem. 

I tried fitting the club data using the Binomial distribution and I got fractionally better results, but not enough that I would use the Binomial distribution for any real predictions. In other words, this isn't the Poisson Limit Theorem at work.

I went back to all the sources that spoke about using the Poisson distribution to predict goals. All of them used data aggregated to the league or season level. One or two used the Poisson to try and predict who would end up at the top of a league at the end of the season. No one showed results at the club level for a single season or wrote about club-level predictions. I guess I know why now.

Some thoughts on next steps

There are four things I'm mulling over:

  • The Poisson distribution is a good fit for a league tier for a season.
  • I don't see the Poisson distribution as a good fit for a club for a season.
  • Some authors report the Poisson distribution is a fit for a club over several (5 or more) seasons. But clubs change over time, sometimes radically over short periods.
  • The Poisson Limit Theorem kicks in if you have enough data.

A league tier consists of several clubs; right now, there are 20 clubs in the Premier League. By aggregating the results over a season for 20 unrelated clubs, I get data that's well-fitted by the Poisson distribution. I'm wondering if the authors who modeled club data over five or more seasons got it right for the wrong reason. What if they had instead aggregated the results of 5 unrelated clubs in the same season, or even different seasons? In other words, did they see a fit to multi-season club data because of aggregation alone?

Implications for predicting results

The Poisson distribution is a great way to model goals scored at the league and season level, but not so much at the club level. The Binomial distribution doesn't really work at the club level either. It may well be that each team plays too few matches in a season for us to fit their results using an off-the-shelf distribution. Or put another way, randomness is too big an element of the game to let us make quick and easy predictions.

Sunday, December 14, 2025

Ozymandias

Some poetic background 

'Ozymandias' is one of my favorite poems; I find it easily accessible and the imagery very evocative. As you might expect, I've dug into the background a bit. There are some interesting stories behind the poem and I'm going to tell you one or two.

(Gemini)

Here's the background. The poem is the result of a friendly wager between Horace Smith and Percy Bysshe Shelley in 1817. The wager was to write a poem on the theme of an ancient Egyptian monumental sculpture that was then on its way to Britain. The statue was one of the spoils of war; Britain and France had been fighting for control in Egypt, with Egypt's antiquities one of the great prizes. The Younger Memnon statue was one of these antiquities, and after some adventures, it was successfully looted and brought to London (the French had tried to take it and failed, so it's another of those Anglo-French rivalries). British society had big expectations for the statue, hence the bet to write a poem about the remains of a large statue in a desert. Shelley published his poem in 1818 and 'won' the bet for the better poem.

(Younger Memnon statue - of Rameses II. British Museum. Creative Commons License.)

The titular Ozymandias is the Greek-language version of the name of the Egyptian pharaoh Ramesses II (1279–1213 BCE). During his 66-year reign, Egypt built many cities and temples and successfully waged war against old rivals; scholars regard him as one of the great pharaohs. The Younger Memnon sculpture depicts Ramesses II in his youth. So in 1817, we have a statue of a once-great pharaoh whose empire has crumbled into dust, leaving only statues and ruins behind.

The poem

Here's the entire poem.

I met a traveller from an antique land

Who said: Two vast and trunkless legs of stone

Stand in the desert. Near them, on the sand,

Half sunk, a shattered visage lies, whose frown,

And wrinkled lip, and sneer of cold command,

Tell that its sculptor well those passions read

Which yet survive, stamped on these lifeless things,

The hand that mocked them and the heart that fed:

And on the pedestal these words appear:

"My name is Ozymandias, King of Kings:

Look on my works, ye Mighty, and despair!"

Nothing beside remains. Round the decay

Of that colossal wreck, boundless and bare

The lone and level sands stretch far away.

(Here's a page of literary criticism/analysis of the poem.) 

The other entry

Here's Horace Smith's poem on the same theme.

In Egypt's sandy silence, all alone,

Stands a gigantic Leg, which far off throws

The only shadow that the Desert knows:—

"I am great OZYMANDIAS," saith the stone,

"The King of Kings; this mighty City shows

The wonders of my hand."— The City's gone,—

Naught but the Leg remaining to disclose

The site of this forgotten Babylon.


We wonder — and some Hunter may express

Wonder like ours, when thro' the wilderness

Where London stood, holding the Wolf in chace,

He meets some fragment huge, and stops to guess

What powerful but unrecorded race

Once dwelt in that annihilated place.

I'm not a poetry critic, but even I can see that Shelley's entry is greatly superior.

Best readings

There are many, many readings of Shelley's poem on the internet. A lot of people like John Gielgud's reading (maybe because he's English and was a classical actor's actor), but for me, Bryan Cranston's reading is the best.

Shelley's life story

Shelley's life story is worth disappearing into a wiki-hole for. It's a lurid tale of political radicalism, lust, and poetry. Even today, some of Shelley's exploits seem wild, and it's easy to see why they would have been shocking two hundred years ago. Of course, it would be remiss of me not to say that Shelley had a hand in the creation of Frankenstein (his wife, Mary Shelley, was the author). Like other great cultural icons, he died young, at 29.

Friday, December 12, 2025

Data sonification: a curious oddity that may have some uses

What is sonification?

The concept is simple: you turn data into sound. Obviously, you can play with frequency and volume, but there are more subtle sonic things you can play with to represent data. Let's imagine you had sales data for different countries that went up and down over time. You could assign a different instrument to each country (e.g. drum for the US, piano for Germany, violin for France), and represent different sales volumes as different notes. The hope, of course, is that the notes get higher as sales increase.

If you have more musical experience, you could turn data sets into more interesting music, for example, mapping ups and downs in the data to shifts in tone and speed. 
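
To make the idea concrete, here's a hedged sketch that maps a small data series onto pitches and writes the result to a WAV file (the data, note length, and frequency mapping are all arbitrary choices):

import numpy as np
from scipy.io import wavfile

sales = [3, 5, 8, 6, 10, 4]            # illustrative data series
rate = 44100                           # samples per second
note_length = 0.4                      # seconds per data point

tones = []
for value in sales:
    # Map the value to a frequency: bigger value -> higher pitch.
    freq = 220 + 40 * value
    t = np.linspace(0, note_length, int(rate * note_length), endpoint=False)
    tones.append(0.3 * np.sin(2 * np.pi * freq * t))

audio = np.concatenate(tones)
wavfile.write('sonification.wav', rate, audio.astype(np.float32))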

(Gemini)

Examples

Perhaps the simplest sonification example is the one you've probably seen in movies: using a Geiger counter to measure radiation. The more it clicks, the more radiation there is. Because it's noise rather than a dial, the user can focus their eyes on where they point the detector and use their ears to detect radiation. It's so simple, even James Bond has used a Geiger counter. In a similar vein, metal detectors use sound to alert the user to the presence of metal.

Perhaps the best example I've heard of sonification is Brian Foo mapping income inequality along the New York Subway's 2 line. You can watch the video and hear the music here: https://vimeo.com/118358642?fl=pl&fe=sh. He's turned a set of data into a story and you can see how this could be taken further into a full-on multi-media presentation.

Sometimes, our ears can tell us things our eyes can't. Steve Mould's video on "The planets are weirdly in sync" has a great sonification example starting here: https://youtu.be/Qyn64b4LNJ0?t=1110; the sonification shows up data relationships that charts or animations can't. The whole video is worth a watch too (https://www.youtube.com/watch?v=Qyn64b4LNJ0).

There are two other related examples of sonification I want to share. 

In a nuclear facility, you sometimes hear a background white noise sound. That signifies that all is well. If the sound goes away, that signifies something very bad has happened and you need to get out fast. Why not sound an alarm if something bad happens? Because if something really bad happens, there might not be power for the alarm. Silence is a fail-safe.

In a similar vein, years ago I worked on an audio processing system. We needed to know the system was reliable, so we played a CD of music over and over through the system. If we ever heard a break or glitch in the music, we knew the audio system had failed and we needed to intervene to catch the bug. This was a kind of ongoing sonic quality assurance system.

What use is it?

Frankly, sonification isn't something I would see people use every day. It's a special purpose thing, but it's handy to know about. Here are two use cases.

  • The obvious one is presenting company data. This could be sales, or clicks, or conversion etc. With a bit of effort and musical ability, you could do the kind of thing that Brian Foo did. Imagine an investor presentation (or even an all-hands meeting) with a full-on multi-media presentation with charts, video, and sound.
  • The other use is safety and alerting. Imagine a company selling items on a website. It could pipe music into common areas (e.g. restrooms and lunch areas). If sales are going well, it plays fast music; if they're slow, it plays slow music. If there are no sales at all, you get silence. This is a way of alerting everyone to the rhythm of sales and to when something goes wrong. Obviously, this could go too far, but you get the idea.

Finding out more

Sonification: the music of data - https://www.youtube.com/watch?v=br_8wXKgtkg

The planets are weirdly in sync - https://www.youtube.com/watch?v=Qyn64b4LNJ0

Brian Foo's sonifications - https://datadrivendj.com/

NASA's astronomical data sonifications - https://science.nasa.gov/mission/hubble/multimedia/sonifications/

The sound of science - https://pmc.ncbi.nlm.nih.gov/articles/PMC11387736/

Monday, December 1, 2025

Some musings on code generation: kintsugi

Hype and reality

I've been using AI code generation (Claude, Gemini, Cursor...) for months and I'm familiar with its strengths and weaknesses. It feels like I've gone through the whole hype cycle (see https://en.wikipedia.org/wiki/Gartner_hype_cycle) and now I'm firmly on the Plateau of Productivity. Here are some musings covering benefits, disappointments, and a way forward.

(The Japanese art of Kintsugi. Image by Gemini.)

Benefits

Elsewhere, people have waxed lyrical about the benefits of code generation, so I'm just going to add in a few novel points.

It's great when you're unfamiliar with an area of a language; it acts as a prompt or tutorial. In the past, you'd have to wade through pages of documentation and write code to experiment. Alternatively, you could search to see if anyone's tackled your problem and has a solution. If you were really stuck, you could try and ask a question on Stack Overflow and deal with the toxicity. Now, you can get something to get you going quickly.

Modern code development requires properly commenting code, making sure code is "linted" and PEP8 compliant, and creating test cases etc. While these things are important, they can consume a lot of time. Code generation steps on the accelerator pedal and makes them go much faster. In fact, code gen makes it quite reasonable to raise the bar on code quality.

Disappointments

Pandas dataframes

I've found code gen really doesn't do well manipulating Pandas dataframes. Several times, I've wanted to transform dataframes or do something non-trivial, for example, aggregating data, merging dataframes, transforming a column in some complex way and so on. I've found the generated code to either be wrong or really inefficient. In a few cases, the code was wrong, but in a way that was hard to spot; subtle bugs are costly to fix.

Bloated code

This is something other people have commented to me too: sometimes generated code is really bloated. I've had cases where what should have been a single line of code gets turned into 20 or more lines. Some of it is "well-intentioned", meaning lots of error trapping. But sometimes it's just a poor implementation. Bloated code is harder to maintain and slower to run.

Django

It took me a while to find the problems with Django code gen. On the whole, code gen for Django works astonishingly well; it's one of the huge benefits. But I've found the generated code to be inefficient in several ways:

  • The model manipulations have sometimes been odd or poor implementations. A more thoughtful approach to aggregation can make the code more readable and faster.
  • If the network connection is slow or backend computations take some time, a page can take a long time to even start to render. A better approach involves building the page so the user sees something quickly and then adding other elements as they become available. Code gen doesn't do this "out of the box".
  • UI layout can sometimes take a lot of prompting to get right. Mostly, it works really well, but occasionally, code gen finds something it really, really struggles with. Oddly, I've found it relatively easy to fix these issues by hand.

JavaScript oddities

Most of my work is in Python, but occasionally, I've wandered into JavaScript to build apps. I don't know a lot of JavaScript, and that's been the problem: I've been slow to spot code gen wrongness.

My projects have widgets and charts and I found the JavaScript callbacks and code were overcomplicated and bloated. I re-wrote the code to be 50% shorter and much clearer. It cost me some effort to come up to speed with JavaScript to spot and fix things.

Oddly, I found hallucination more of a problem for JavaScript than Python. My code gen system hallucinated the need to include an external CSS file that didn't exist and wasn't needed. Code gen also hallucinated "standard" functions that weren't available (that was a nice one to debug!).

Similar to my Python experience, I found code gen to be really bad at manipulating data objects. In a few cases, it would give me code that was flat out wrong.

'Unpopular' code

If you're using libraries that have been extensively used by others (e.g. requests, Django, etc.), code gen is mostly good. But when you're using libraries that are a little "off the beaten path", I've found code generation really drops down in quality. In a few cases, it's pretty much unusable.

A way forward through the trough of disappointment

It's possible that more thorough prompting might solve some of these problems, but I'm not entirely convinced. I've found that code generation often doesn't do well with very, very detailed and long prompting. Here's what I think is needed.

Accepting that code generation is flawed and needs adult supervision. It's a tool, not a magic wand. The development process must include checks that the code is correct.

Proper training. You need to spot when it's gone wrong and you need to intervene. This means knowing the languages you're code generating. I didn't know JavaScript well enough and I paid the price.

Libraries to learn from and use. Code gen learns from your codebase, but this isn't enough, especially if you're doing something new, and it can also mean code gen is learning the wrong things. Having a library means code gen isn't re-inventing the wheel each time.

In a corporate setting, all this means having thoughtful policies and practices for code gen and code development. Code gen is changing rapidly, which means policies and practices will need to be updated every six months, or when you learn something new.

Kintsugi

Kintsugi is the Japanese art of taking something broken (e.g., a pot or a vase) and mending it in a way that both acknowledges its brokenness and makes it more beautiful. Code generation isn't broken, but it can be made a lot more useful with some careful thought and acknowledging its weaknesses.

Monday, November 24, 2025

Caching and token reduction

This is a short blog post to share some thoughts on how to reduce AI token consumption and improve user response times.

I was at the AI Tinkerers event in Boston and I saw a presentation on using AI report generation for quant education. The author was using a generic LLM to create multiple choice questions on different themes. Similarly, I've been building an LLM system that produces a report  based on data pulled from the internet. In both cases, there are a finite number of topics to generate reports on. My case was much larger, but even so, it was still finite.

The obvious thought is, if you're only generating a few reports or questions & answers, why not generate them in batch? There's no need to keep the user waiting and of course, you can schedule your LLM API calls in the middle of the night when there's less competition for resources. 

(Canva)

In my case, there are potentially thousands of reports, but some reports will be pulled more often than others. A better strategy in my case is something like this:

  1. Take a guess at the most popular reports (or use existing popularity data) and generate those reports overnight (or at a time when competition for resources is low). Cache them.
  2. If the user wants a report that's been cached, return the cached copy.
  3. If the user wants an uncached report:
    • Tell the user there will be a short wait for the LLM
    • Call the LLM API and generate the report
    • Display the report
    • Cache the report
  4. For each cached report, record the LLM and its creation timestamp.

You can start to do some clever things here, like refreshing the reports every 30 days or when the LLM is upgraded, etc.
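
Here's a hedged sketch of the cache-then-generate logic in Python; the cache here is just a dictionary and generate_report stands in for the real LLM API call:

from datetime import datetime, timedelta

CACHE = {}                                  # topic -> (report, model_name, created_at)
MAX_AGE = timedelta(days=30)

def get_report(topic, model_name, generate_report):
    """Return a cached report if it's fresh, otherwise generate and cache it."""
    cached = CACHE.get(topic)
    if cached and cached[1] == model_name and datetime.now() - cached[2] < MAX_AGE:
        return cached[0]
    report = generate_report(topic)         # the slow LLM API call
    CACHE[topic] = (report, model_name, datetime.now())
    return report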

I know this isn't rocket science, but I've been surprised how few LLM demos I've seen use any form of batch processing and caching.

Monday, November 17, 2025

Data scientists need to learn JavaScript

Moving quickly

Over the last few months, I've become very interested in rapid prototype development for data science projects. Here's the key question I asked myself: how can a data scientist build their own app as quickly as possible? Nowadays, speed means code gen, but that's only part of the solution.

The options

The obvious quick development path is using Streamlit; that doesn't require any new skills because it's all in Python. Streamlit is great, and I've used it extensively, but it only takes you so far and it doesn't really scale. Streamlit is really for internal demos, and it's very good at that.

The more sustainable solution is using Django. It's a bigger and more complex beast, but it's scalable. Django requires Python skills, which is fine for most data scientists. Of course, Django apps are deployed on the web and users access them as web pages.

The UI is one place code gen breaks down under pressure

Where things get tricky is adding widgets to Django apps. You might want your app to take some action when the user clicks a button, or have widgets controlling charts etc. Code gen will nicely provide you with the basics, but once you start to do more complicated UI tasks, like updating chart data, you need to write JavaScript or be able to correct code gen'd JavaScript.

(As an aside, for my money, the reason why a number of code gen projects stall is that code gen only takes you so far. To do anything really useful, you need to intervene, provide detailed guidance, and write code where necessary. This means JavaScript code.)

JavaScript != Python

JavaScript is very much not Python. Even a cursory glance will tell you the JavaScript syntax is unlike Python. More subtly, and more importantly, some of the underlying ideas and approaches are quite different. The bottom line is, a Python programmer is not going to write good enough JavaScript without training.

To build even a medium complexity data science app, you need to know how JavaScript callbacks work, how arrays work, how to debug in the browser, and so on. Because code gen is doing most of the heavy lifting for you, you don't need to be a craftsman, but you do need to be a journeyman.

What data scientists need to do

The elevator pitch is simple:

  • If you want to build a scalable data science app, you need to use Django (or something like it).
  • To make the UI work properly, code gen needs adult supervision and intervention.
  • This means knowing JavaScript.

(Data Scientist becoming JavaScript programmer. Gemini.)

In my view, all that's needed here is a short course, a good book, and some practice. A week should be enough time for an experienced Python programmer to get to where they need to be.

What skillset should data scientists have?

AI is shaking everything up, including data science. In my view, data scientists will have to do more than their "traditional" role. Data scientists who can turn their analysis into apps will have an advantage. 

For me, the skillset a data scientist will need looks a lot like the skillset of a full-stack developer. This means data scientists knowing a bit of JavaScript, code gen, deployment technologies, and so on. They won't need to be experts, but they will need "good enough" skills.

Wednesday, November 12, 2025

How to rapidly build and deploy data science apps using code gen

Introduction

If you want to rapidly build and deploy apps with a data science team, this blog post is written for you.

(Canva)

I’ve seen how small teams of MIT and Harvard students at the sundai.club in Boston are able to produce functioning web apps in twelve hours. I want to understand how they’re doing it, adapt what they’re doing for business, and create data science heavy apps very quickly. This blog post is about what I’ve learned.

Almost all of the sundai.club projects use an LLM as part of their project (e.g., using agentic systems to analyze health insurance denials), but that’s not how they’re able to build so quickly. They get development speed through code generation, the appropriate use of tools, and the use of deployment technologies like Vercel or Render. 

(Building prototypes in 12 hours: the inspiration for this blog post.)

Inspired by what I’ve seen, I developed a pathfinder project to learn how to do rapid development and deployment using AI code gen and deployment tools. My goal was to find out:

  • The skills needed and the depth to which they’re needed.
  • Major stumbling blocks and coping strategies.
  • The process to rapidly build apps.

I'm going to share what I've learned in this blog post. 

Summary of findings

Process is key

Rapid development relies on having three key elements in place:

  • Using the right tools.
  • Having the right skill set.
  • Using AI code gen correctly.

Tools

Fast development must use these tools:

  • AI-enabled IDE.
  • Deployment platform like Render or Vercel.
  • Git.

Data scientists tend to use notebooks and that’s a major problem for rapid development; notebook-based development isn’t going to work. Speed requires the consistent use of AI-enabled IDEs like Cursor or Lovable. These IDEs use AI code generation at the project and code block level, and can generate code in different languages (Python, SQL, JavaScript etc.). They have the ability to generate test code, comment code, and make code PEP8 compliant. It’s not just one-off code gen, it’s applying AI to the whole code development process.

(Screen shot of Cursor used in this project.)

Using a deployment platform like Render or Vercel means deployment can be extremely fast. Data scientists don’t have deployment skills, but these products are straightforward enough that some written guidance should be enough. 

Deployment platforms retrieve code from Git-based systems (e.g., GitHub, GitLab etc.), so data scientists need some familiarity with them. Desktop tools (like GitHub Desktop) make it easier, but they have to be used, which is a process and management issue.

Skillsets and training

The skillset needed is the same as a full-stack engineer with a few tweaks, which is a challenge for data scientists who mostly lack some of the key skills. Here are the skillsets, level needed, and training required for data scientists.

  • Hands-on experience with AI code generation and AI-enabled IDE.
    • What’s needed:
      • Ability to appropriately use code gen at the project and code-block levels. This could be with Cursor, Claude Code, or something similar.
      • Understanding code gen strengths and weaknesses and when not to use it.
      • Experience developing code using an IDE.
    • Training: 
      • To get going, an internal training session plus a series of exercises would be a good choice.
      • At the time of writing, there are no good off-the-shelf courses.
  • Python
    • What’s needed:
      • Decent Python coding skills, including the ability to write functions appropriately (data scientists sometimes struggle here).
      • Django uses inheritance and function decorators, so understanding these properties of Python is important. 
      • Use of virtual environments.
    • Training:
      • Most data scientists have “good enough” Python.
      • The additional knowledge should come from a good advanced Python book. 
      • Consider using experienced software engineers to train data scientists in missing skills, like decomposing tasks into functions, PEP8 and so on.
  • SQL and building a database
    • What’s needed:
      • Create databases, create tables, insert data into tables, write queries.
    • Training:
      • Most data scientists have “good enough” SQL.
      • Additional training could be books or online tutorials.
  • Django
    • What’s needed:
      • An understanding of Django’s architecture and how it works.
      • The ability to build an app in Django.
    • Training:
      • On the whole, data scientists don’t know Django.
      • The training provided by a short course or a decent text book should be enough.
      • Writing a couple of simple Django apps by hand should be part of the training.
      • This may take 40 hours.
  • JavaScript
    • What’s needed:
      • Ability to work with functions (including callbacks), variables, and arrays.
      • Ability to debug JavaScript in the browser.
      • These skills are needed to add and debug UI widgets. Code generation isn't enough.
    • Training:
      • A short course (or a reasonable text book) plus a few tutorial examples will be enough.
  • HTML and CSS
    • What’s needed:
      • A low level of familiarity is enough.
    • Training:
      • Tutorials on the web or a few YouTube videos should be enough.
  • Git
    • What’s needed:
      • The ability to use Git-based source control systems. 
      • It's needed because deployment platforms rely on code being on Git.
    • Training:
      • Most data scientists have a weak understanding of Git. 
      • A hands-on training course would be the most useful approach.

Code gen is not one-size-fits-all

AI code gen is a tremendous productivity boost and enabler in many areas but not all. For key tasks, like database design and app deployment, AI code gen doesn’t help at all. In other areas, for example, complex database/dataframe manipulations and handling some advanced UI issues, AI helps somewhat but it needs substantial guidance. The AI coding productivity benefit is a range from negative to greatly positive depending on the task. 

The trick is to use AI code gen appropriately and provide adult supervision. This means reviewing what AI produces and intervening. It means knowing when to stop prompting and when to start coding.

Recommendations before attempting rapid application development

  • Make sure your team have the skills I’ve outlined above, either individually or collectively.
  • Use the right tools in the right way.
  • Don’t set unreasonable expectation, understand that your first attempts will be slow as you learn.
  • Run a pilot project or two with loose deadlines. From the pilot project, codify the lessons and ways of working. Focus especially on AI code gen and deployment.

How I learned rapid development: my pathfinder app

For this project, I chose to build an app that analyzes the results of English League Football (soccer) games from when the league began in 1888 to the most recently completed season (2024-2025).

The data set is quite large, which means a database back end. The database will need multiple tables.

It’s a very chart-heavy app. Some of the charts are violin plots that need kernel density estimation, and I’ve added curve fitting and confidence intervals on some line plots. That’s not the most sophisticated data analysis, but it’s enough to prove a point about the use of data science methods in apps. Notably, charts are not covered in most Django texts.

(Just one of the plots from my app. Note the year slider at the bottom.)

In several cases, the charts need widgets: sliders to select the year and radio buttons to select different leagues. This means either using ‘native’ JavaScript or libraries specific to the charting tool (Bokeh). I chose to use native JavaScript for greater flexibility.

To get started, I roughly drew out what I wanted the app to look like. This included different themed analysis (trends over time, goal analysis, etc.) and the charts I wanted. I added widgets to my design where appropriate.

The stack

Here’s the stack I used for this project.

Django was the web framework, which means it handles incoming and outgoing data, manages users, and manages data. Django is very mature, and is very well supported by AI code generation (in particular, Cursor). Django is written in Python.

Postgres. “Out of the box”, Django supports SQLite, but Render (my deployment solution) requires Postgres. 

Bokeh for charts. Bokeh is a Python plotting package that renders its charts in a browser (using HTML and JavaScript). This makes it a good choice for this project. An alternative is Altair, but my experience is that Bokeh is more mature and more amenable to being embedded in web pages.
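
This is also why Bokeh embeds cleanly in Django templates. A minimal sketch of the embedding step (the figure itself is just a placeholder):

from bokeh.embed import components
from bokeh.plotting import figure

plot = figure(title="Goals per season")    # placeholder chart
plot.line([1, 2, 3], [2, 5, 3])

# components() returns a <script> block and a <div> to drop into a Django template.
script, div = components(plot)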

JavaScript for widgets. I need to add drop down boxes, radio buttons, sliders, and tabs etc. I’ll use whatever libraries are appropriate, but I want code gen to do most of the heavy lifting.

Render.com for deployment. I wanted to deploy my project quickly, which means I don’t want to build out my own deployment solution on AWS etc., I want something more packaged.

I used Cursor for the entire project.

The build process and issues

Building the database

My initial database format gave highly complicated Django models that broke Django’s ORM. I rebuilt the database using a much simpler schema. The lesson here is to keep the database reasonably close to the format in which it will be displayed. 

My app design called for violin plots of attendance by season and by league tier. This is several hundred plots. Originally, I was going to calculate the kernel density estimates for the violin plots at run time, but I decided it would slow the application down too much, so I calculated them beforehand and saved them to a database table. This is a typical trade-off.
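
Here's a hedged sketch of that pre-computation step using SciPy's Gaussian KDE (the function name and grid size are mine; in the real app the grid and density values are written to a database table):

import numpy as np
from scipy import stats

def precompute_violin(attendances, points=100):
    """Evaluate a kernel density estimate on a fixed grid so it can be stored."""
    attendances = np.asarray(attendances, dtype=float)
    kde = stats.gaussian_kde(attendances)
    grid = np.linspace(attendances.min(), attendances.max(), points)
    return grid, kde(grid)                 # both columns get saved to the database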

For this part of the process, I didn’t find code generation useful.

The next stage was uploading my data to the database. Here, I found code generation very useful. It enabled me to quickly create a Python program to upload data and check the database for consistency.

Building Django

Code gen was a huge boost here. I gave Cursor a markdown file specifying what I wanted and it generated the project very quickly. The UI wasn’t quite what I wanted, but by prompting Cursor, I was able to get it there. It let me create and manipulate dropdown boxes, tabs, and widgets very easily – far, far faster than hand coding. I did try and create a more detailed initial spec, but I found that after a few pages of spec, code generation gets worse; I got better results by an incremental approach.

(One part of the app, a dropdown box and menu. Note the widget and the entire app layout was AI code generated.)

The simplest part of the project is a view of club performance over time. Using a detailed prompt, I was able to get all of the functionality working using only code gen. This functionality included dropdown selection box, club history display, league over time, matches played by season. It needed some tweaks, but I did the tweaks using code gen. Getting this simple functionality running took an hour or two.

Towards the end of the project, I added an admin panel for admin users to create, edit, and delete "ordinary" users. With code gen, this took less than half an hour, including bug fixes and UI tweaks.

For one UI element, I needed to create an API interface to supply JSON rather than HTML. Code gen let me create it in seconds.
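
The Django pattern here is very simple; a hedged sketch (the app, model, and field names are made up for illustration):

from django.http import JsonResponse

from myapp.models import Match             # hypothetical app and model

def club_goals_json(request, club_id):
    """Return chart data as JSON instead of rendering an HTML template."""
    rows = (Match.objects
            .filter(club_id=club_id)
            .values('season', 'goals'))
    return JsonResponse({'results': list(rows)})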

However, there were problems.

Code gen didn’t do well with generating Bokeh code for my plots and I had to intervene to re-write the code.

It did even worse with retrieving data from Django models. Although I aligned my data as closely as I could to the app, it was still necessary to aggregate data. I found code generation did a really poor job and the code needed to be re-written. Code gen was helpful to figure out Django’s model API though.

In one complex case, I needed to break Django’s ORM and make a SQL call directly to the database. Here, code gen worked correctly on the first pass, creating good-quality SQL immediately.
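
The Django pattern for dropping down to SQL is worth knowing. Here's a hedged sketch (the table and column names are illustrative, not the query from my app):

from django.db import connection

def goals_per_season(league_tier):
    """Bypass the ORM and run SQL directly for a complex aggregation."""
    sql = """
        SELECT season, SUM(home_goals + away_goals) AS total_goals
        FROM matches
        WHERE league_tier = %s
        GROUP BY season
        ORDER BY season
    """
    with connection.cursor() as cursor:
        cursor.execute(sql, [league_tier])
        return cursor.fetchall()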

My use of code gen was not one-and-done, it was an interactive process. I used code generation to create code at the block and function level.

Bokeh

My app is very chart-heavy, with more than 10 charts, and there aren't that many examples of this type of app that I could find. This means that AI code gen doesn't have much to learn from.

(One of the Bokeh charts. Note the interactive controls on the right of the plot and the fact the plot is part of a tabbed display.)

As I said above, code gen didn’t do well generating Bokeh code for my plots and I had to intervene and re-write it.

I needed to access the Bokeh chart data from the widget callbacks and update the charts with new data (in JavaScript). This involved building a JSON API, which code gen created very easily. Sadly, code gen had a much harder time with the JavaScript callback. Its first pass was gibberish and refining the prompt didn’t help. I had to intervene and ask for code gen on a code block-by-block basis. Even then, I had to re-write some lines of code. Unless the situation changes, my view is, code generation for this kind of problem is probably limited to function definition and block-by-block code generation, with hand coding to correct/improve issues.

(Some of the hand-written code. Code gen couldn't create this.)

Render

By this stage, I had an app that worked correctly on my local machine. The final step was deployment so it would be accessible on the public internet. The sundai.club and others use Render.com and similar services to rapidly deploy their apps, so I decided to use the free tier of Render.com.

Render’s free tier is good enough for demo purposes, but it isn’t powerful enough for a commercial deployment (which is fair); that's why I’m not linking to my app in this blog post: too much traffic will consume my free allowance.

Unlike some of its competitors, Render uses Postgres rather than SQLite as its database, hence my choice of Postgres. This means deployment is in two stages:

  • Deploy the database.
  • Link the Django app to the database and deploy it.

The process was more complicated than I expected and I ran into trouble. The documentation wasn’t as clear as it needed to be, which didn’t help. The consistent advice in the Render documentation was to turn off debug. This made diagnosing problems almost impossible. I turned debug on and fixed my problems quickly. 

To be clear: code gen was of no help whatsoever.

(Part of Render's deployment screen.)

However, it’s my view this process could be better documented and subsequent deployments could go very smoothly.

General comments about AI code generation

  • Typically, many organizations require code to pass checks (linting, PEP8, test cases etc.) before the developer can check it into source control. Code generation makes it easier and faster to pass these checks. Commenting and code documentation is also much, much faster. 
  • Code generation works really well for “commodity” tasks and is really well-suited to Django. It mostly works well with UI code generation, provided there’s not much complexity.
  • It doesn’t do well with complex data manipulations, although its SQL can be surprisingly good.
  • It doesn’t do well with Bokeh code.
  • It doesn’t do well with complex UI callbacks where data has to be manipulated in particular ways.

Where my app ended up

End-to-end, it took about two weeks, including numerous blind alleys, restarts, and time spent digging up answers. Knowing what I know now, I could probably create an app of this complexity in less than 5 days, fewer still with more people.

My app has multiple pages, with multiple charts on each page (well over 10 charts in total). The chart types include violin plots, line charts, and heatmaps. Because they're Bokeh charts, my app has built-in chart interactivity. I have widgets (e.g., sliders, radio buttons) controlling some of the charts, which communicate back to the database to update the plots. Of course, I also have Django's user management features.

Discussion

There were quite a few surprises along the way in this project: I had expected code generation to do better with Bokeh and callback code, I’d expected Render to be easier to use, and I thought the database would be easier to build. Notably, the Render and database issues are learning issues; it’s possible to avoid these costs on future projects. 

I’ve heard some criticism of code generated apps from people who have produced 70% or even 80% of what they want, but are unable to go further. I can see why this happens. Code gen will only take you so far, and will produce junk under some circumstances that are likely to occur with moderately complex apps. When things get tough, it requires a human with the right skills to step in. If you don’t have the right skills, your project stalls. 

My goal with this project was to figure out the skills needed for rapid application development and deployment. I wanted to figure out the costs of enabling a data science team to build their own apps. What I found is the skill set needed is the skill set of a full-stack engineer. In other words, rapid development and deployment is firmly in the realm of software engineers and not data scientists. If data scientists want to build apps, there's a learning curve and a learning cost. Frankly, I'm coming round to the opinion that data scientists need a broader software skill set.

For a future version of this project, I would be tempted to split off the UI entirely. The Django code would be entirely a JSON server, accessed through the API. The front end would be in Next.js. This would mean having charting software entirely in JavaScript. Obviously, there's a learning curve cost here, but I think it would give more consistency and ultimately an easier to maintain solution. Once again, it points to the need for a full-stack skill set.

To make this project go faster next time, here's what I would do:

  • Make the database structure reasonably close to how data is to be displayed. Don't get too clever and don't try to optimize it before you begin.
  • Figure out a way to commoditize creating charts and updating them through a JavaScript callback. The goal is of course to make the process more amenable to code generation. 
  • Related to charts, figure out a better way of using the ORM to avoid using SQL for more complex queries. Figure out a way to get better ORM code generation results.
  • Document the Render deployment process and have a simple checklist or template code.

Bottom line: it’s possible to do rapid application development and deployment with the right approach, the right tools, and using code gen correctly. Training is key.

Using the app

I want to tinker with my app, so I don't want to exhaust my Render free tier. If you'd like to see my app, drop me a line (https://www.linkedin.com/in/mikewoodward/) and I'll grant you access.

If you want to see my app code, that's easier. You can see it here: https://github.com/MikeWoodward/English-Football-Forecasting/tree/main/5%20Django%20app