Why data is often bad and what you can do about it
In this blog post, I'm going to talk through what you can do to improve data quality. This covers three areas: automated issue detection, some techniques you can use to find errors, and most importantly, the people processes behind them.
Before we get going, it's important to note that bad data has a number of causes, some of which aren't under a company's control, e.g. ingesting 3rd party data. This means automated fixes aren't always possible. My belief is, even if you can't fix errors quickly, you need to know about them because your customers will.
Automate error detection
Automated error detection is the key to fixing data issues at a reasonable cost. The idea is that an automated system checks incoming data for a range of problems and flags errors or concerns. You can adapt these systems to give you error counts over time; the goal is to measure progress on reducing data issues, for example by producing a daily data quality score.
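To make that concrete, here's a minimal sketch of one way a daily score could be computed; the check names and the scoring scheme (the fraction of checks that passed) are illustrative assumptions, not a prescription.

```python
# A minimal sketch of turning check results into a daily quality score.
# The check names and scoring scheme are hypothetical placeholders.
from dataclasses import dataclass
from datetime import date

@dataclass
class CheckResult:
    check_name: str
    passed: bool

def daily_quality_score(results: list[CheckResult]) -> float:
    """Fraction of checks that passed today -- one simple scoring scheme."""
    if not results:
        return 1.0
    return sum(r.passed for r in results) / len(results)

# Example: 2 of 3 checks passed on a given day -> score of ~0.67
results = [
    CheckResult("sales_volume_in_range", True),
    CheckResult("no_unmatched_returns", False),
    CheckResult("conversion_rate_valid", True),
]
print(date.today(), daily_quality_score(results))
```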
The obvious objection is: if you can spot the errors, you should fix them in your data ingestion process, so there's no need for an error detection system. Sadly, in the real world, things aren't so simple:
- If your data ingestion team was doing this already, there would be no data issues. The fact that there are errors tells you that you need to do something new.
- Ingestion systems focus on stability and handling known errors. Very rarely do they report on errors they can't fix. Frankly, for most dev teams, finding new data errors isn't a priority.
- The lead time to add new data quality checks to ingestion systems can be weeks or months. I've seen people add new checks to standalone automated error checking systems in a day.
If possible, an error detection system should integrate with a company's error ticket system, for example, automatically creating and assigning Jira tickets. This has some consequences as we'll see.
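To illustrate, here's a hedged sketch of what that integration could look like against the Jira Cloud REST API (v2); the base URL, project key, and credentials are placeholders, and your own ticketing system may need a different approach.

```python
# A sketch of creating a Jira ticket for a detected data issue.
# Assumes the Jira Cloud REST API v2; URL, project key, and credentials are placeholders.
import requests

def create_data_issue_ticket(summary: str, description: str) -> str:
    payload = {
        "fields": {
            "project": {"key": "DATAQ"},          # hypothetical project key
            "summary": summary,
            "description": description,
            "issuetype": {"name": "Bug"},
        }
    }
    resp = requests.post(
        "https://your-company.atlassian.net/rest/api/2/issue",
        json=payload,
        auth=("bot@your-company.com", "API_TOKEN"),  # placeholder credentials
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["key"]  # e.g. "DATAQ-123"
```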
The people process
We can introduce as much error detection automation as we wish, but ultimately it's down to people to fix data issues. The biggest problem that occurs in practice is the split of responsibilities. In a company, it's often true that one team creates an automated system to find errors ('spotter' team) while another team is responsible for fixing them ('fixer' team). This sets up the potential for conflict right from the start. To win, you have to manage the issues.
- The 'spotter' team is creating more work for the 'fixer' team. The more productive the 'spotter' team is, the harder the 'fixer' team has to work.
- The 'spotter' team has to create meaningful error messages that the 'fixer' team has to be able to interpret. Frankly, technical people are often very bad at writing understandable error messages.
- For reasons we'll go into later, sometimes automated systems produce a tsunami of error messages, flooding the 'fixer' team.
- The 'fixer' team may have to work out of hours and resolve issues quickly, whereas the 'spotter' team works office hours and is under much less stress.
- The 'fixer' team bears the consequences of any 'spotter' team failures.
- Goals (meaning, OKRs etc.) aren't aligned. One team may have an incentive to reduce errors, while a team they are reliant on does not.
I could go on, but I think you get the point.
Here's how I've approached this problem.
- I've made sure I know the complete process for resolving data issues. This means knowing who is responsible for fixing errors, how they do it, and the level of effort. It's important to know any external constraints, for example, if data is externally sourced it may take some time to resolve issues.
- I make the 'spotter' team and the 'fixer' team sit down together to discuss the project and make sure that they understand each other's goals. To be clear, it's not enough to get the managers talking, the people doing the work have to talk. Managers sometimes have other goals that get in the way.
- The 'spotter' team must realize that the 'fixer' team is their customer. That means error messages must be in plain English (and must give easily understood steps for resolution) and the system mustn't flood the 'fixer' team with errors. More generally, the 'spotter' team must adapt their system to the needs of the 'fixer' team.
- Everyone must understand that there will be teething problems.
- Where possible, I've aligned incentives (e.g. objectives, OKRs) to make sure everyone is focused on the end goal. If you can't align incentives, this may well sink your project.
As I've hinted, the biggest impediment to success is company culture as represented by people issues. I've run into situations where managers (and teams) have been completely resistant to error detection (even when the company has known data quality issues) and/or resistant to specific checks. I'm going to be frank: if you can't get the 'fixer' team to buy in, the project won't work.
Simplicity, plain English, error levels, and flooding
It's important to start automated error detection with easy errors, like absent or impossible data. For example, a conversion rate is a number between 0 and 1, so a conversion rate of 1.2 is an error.
For each check, make sure the text of the error message uses the simplest language you can. Ideally, bring in the 'fixer' team to work on the text. Error messages should clearly explain the problem, give enough information to locate it, and where possible, give next steps.
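As a sketch, here's what one of these simple checks might look like; the field names and the wording of the messages are purely illustrative and would ideally be written with the 'fixer' team.

```python
# A minimal sketch of a simple range check with a plain-English error message.
# Field names and message text are illustrative only.
def check_conversion_rate(record_id: str, conversion_rate: float | None) -> str | None:
    """Return an error message if the conversion rate is missing or impossible, else None."""
    if conversion_rate is None:
        return (f"Record {record_id}: the conversion rate is missing. "
                "Next step: check whether the upstream feed delivered this field.")
    if not 0.0 <= conversion_rate <= 1.0:
        return (f"Record {record_id}: the conversion rate is {conversion_rate}, "
                "but it must be between 0 and 1. "
                "Next step: check the source system for a unit or calculation error.")
    return None
```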
You should prioritize issues found by the error detection system using an easy-to-understand scheme, for example:
- "FATAL" means an error will cause the system to fail in some way
- "ERROR" means the results of the system will be affected negatively
- "WARNING" means something isn't right and needs further investigation.
- "INFO" means this is FYI and you don't need to take action.
In reality, INFO-type messages will be ignored and you may even receive complaints for generating them. However, they're often a good way to introduce new checks: a new check might start at the "INFO" level while you make sure the 'fixer' team knows how to handle it, then be promoted to "WARNING" to give the team time to adjust, and finally become "ERROR" or "FATAL". The actual process you use is up to you, but you get the point.
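One way to support this kind of promotion (a sketch only; the check names and the idea of keeping severity in configuration are assumptions) is to keep each check's severity outside the check itself, so promoting it is a one-line change.

```python
# A sketch of severity levels kept in configuration so checks can be promoted
# from INFO -> WARNING -> ERROR/FATAL over time. Check names are illustrative.
from enum import IntEnum

class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    ERROR = 3
    FATAL = 4

# Severity lives in config, not in the check itself, so promoting a check
# is a one-line change.
CHECK_SEVERITY = {
    "conversion_rate_valid": Severity.ERROR,
    "return_rate_spike": Severity.WARNING,    # recently added, still being tuned
    "new_supplier_feed_gaps": Severity.INFO,  # brand new check, FYI only for now
}
```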
Sometimes, a single problem can trigger multiple error tests. For example, imagine we're dealing with e-commerce data and we have a feed of sales and returns. The sales feed suffers a partial failure. Here's what might happen:
- A sales volume error might be triggered.
- There will be some returns unmatched with sales, so this may trigger another error.
- The return rate figure might spike (because we're missing sales, not missing returns).
- ...
So one failure might cause multiple error messages. This can flood the 'fixer' team with error messages without providing any helpful context. There are two things you have to do as a 'spotter' team:
- You can't flood the 'fixer' team with error messages. It's unhelpful and causes confusion; they won't know where to start to fix the problem. You need to figure out how to meaningfully throttle messages.
- You have to provide higher level diagnosis and fixes. If multiple tests are triggered it may be because there's one cause. Where you can, consolidate messages and indicate next steps (e.g. "I'm seeing sales volume failures, unmatched return failures, and a return rate spike. This may be caused by missing sales data. The next step is to investigate if all sales data is present. Here are details of the failures I've found...")
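Here's a sketch of one way that consolidation could work; the rule that maps these particular low-level failures onto a "missing sales data" diagnosis is an assumption you'd tailor to your own checks.

```python
# A sketch of consolidating related low-level failures into one diagnosis.
# The specific rule (missing sales data) mirrors the e-commerce example above.
def consolidate(triggered_checks: set[str]) -> list[str]:
    messages = []
    missing_sales_signature = {"sales_volume_low", "unmatched_returns", "return_rate_spike"}
    if missing_sales_signature <= triggered_checks:
        messages.append(
            "I'm seeing sales volume failures, unmatched return failures, and a "
            "return rate spike. This may be caused by missing sales data. "
            "Next step: investigate whether all sales data is present."
        )
        triggered_checks = triggered_checks - missing_sales_signature
    # Anything left over is reported individually.
    messages.extend(f"Check failed: {name}" for name in sorted(triggered_checks))
    return messages
```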
These requirements have implications for how an automated error detection system is built.
I'm going to turn now to some math and some checks you should consider.
Floating checks
Let's say you're checking sales data from an automotive retailer. You know that sales go up and down over time, and these fluctuations can play out over months or years. You want to detect sudden upward or downward spikes.
The simplest way of detecting anomalies like this is to use maximum and minimum threshold checks. The problem is, with business changes over time, you can end up falsely triggering errors or missing failures.
Let's imagine that sales are currently $1 million a day and you set an upper error detection threshold of $10 million and a lower threshold of $0.25 million. If you see sales numbers above $10 million or below $0.25 million, you flag an error. As the company grows, it may reach $10 million naturally, falsely triggering an alert. On the flip side, if sales are usually $10 million, a drop to $2 million should trigger an alert, but with a $0.25 million threshold, it won't. The solution is to use floating minimum and maximum values: for any day, we look at the previous 10 days, work out a mean, set thresholds based on that mean (e.g. 2× the mean, 0.5× the mean), and then compare the day's sales to these floating thresholds. In reality, the process is more involved, but you get the point.
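Here's a sketch of that idea using pandas; the 10-day window and the 2×/0.5× multipliers are just the illustrative values from above.

```python
# A sketch of floating min/max thresholds based on a trailing 10-day mean.
import pandas as pd

def flag_sales_anomalies(daily_sales: pd.Series) -> pd.DataFrame:
    """daily_sales: indexed by date, one value per day."""
    # Trailing mean of the previous 10 days (shift(1) excludes the current day).
    trailing_mean = daily_sales.shift(1).rolling(window=10, min_periods=10).mean()
    upper = 2.0 * trailing_mean
    lower = 0.5 * trailing_mean
    return pd.DataFrame({
        "sales": daily_sales,
        "upper_threshold": upper,
        "lower_threshold": lower,
        "flagged": (daily_sales > upper) | (daily_sales < lower),
    })
```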
In practice, most error checks will use some form of floating check.
Deviations from expected distributions
This is a more advanced topic, but you can sometimes use statistics to find when something is wrong. A couple of examples will help.
Let's imagine you're a large online retailer. You have occasional problems with your system falsely duplicating order data, for example, a customer buys a pen but sometimes the data shows it as two pens. The problem with deduplicating this data is that some customers will really buy two pens. Given this, how might you detect the presence of duplicate data in your system?
The answer lies in analyzing the distribution of your data.
Often, with this kind of data there's an expected distribution, let's say it's a Poisson distribution. In the absence of duplication, your order size distribution might look like this.
With 100% order duplication, it looks like this. Although the distributions look the same, if you look more closely you'll see there are no odd values and the maximum value is twice what it was with no duplication.
With 25% order duplication, it looks like this. Note the characteristic zig-zag pattern.
The nice thing is, you don't even need to know what the "real" distribution should look like. All you need to detect is the zig-zag pattern introduced by duplication, or even the absence of odd values. In fact, you can even attempt to quantify how much duplication is present.
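Here's a sketch of one very simple signal you could compute; a production version would be more careful, but the odd/even imbalance is the heart of it. The baseline value in the usage comment is hypothetical.

```python
# A sketch: if orders are being duplicated, odd order sizes become rarer than
# they should be, so the odd/even balance of the distribution shifts.
from collections import Counter

def odd_even_ratio(order_sizes: list[int]) -> float:
    """Fraction of orders with an odd item count. A sudden drop versus the
    historical baseline suggests duplicated orders."""
    counts = Counter(n % 2 for n in order_sizes)
    total = counts[0] + counts[1]
    return counts[1] / total if total else 0.0

# Usage sketch: alert if today's ratio is far below the historical baseline.
# baseline = 0.43  # hypothetical value measured on known-good data
# if odd_even_ratio(todays_order_sizes) < 0.5 * baseline:
#     raise_warning("Possible order duplication: odd order sizes have collapsed.")
```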
Sometimes, you can use expected distributions more directly. Let's say you're maintaining data on company size as measured by the number of employees. You have data on the number of companies with different numbers of employees. Theoretically, this should follow a power law distribution. When you plot the distribution, you see something like this (comparing theoretical (line) and actual (dots)).
This plot tells you that you have some potential anomalies at 10, 100, 1,000 etc., with a huge outlier at 100,000. It's often the case that data is based on estimates and that people use round numbers as estimates. The anomalies at 10, 100, 1,000 might be perfectly OK (you don't need to alert on everything you find), but the count of companies with 100,000 employees seems way off. This kind of extreme discrepancy from an expected distribution may well be worth alerting on.
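As a sketch, you could fit a power law (a straight line on a log-log plot) and flag points sitting far above the fit; the 10× threshold is an arbitrary illustrative choice.

```python
# A sketch: fit a power law (a straight line in log-log space) to the
# company-size distribution and flag counts far above the fitted value.
import numpy as np

def flag_distribution_outliers(sizes: np.ndarray, counts: np.ndarray, factor: float = 10.0):
    """sizes: number of employees; counts: number of companies of that size."""
    log_x, log_y = np.log10(sizes), np.log10(counts)
    slope, intercept = np.polyfit(log_x, log_y, 1)       # power-law fit
    expected = 10 ** (intercept + slope * log_x)
    ratio = counts / expected
    return sizes[ratio > factor], ratio[ratio > factor]  # e.g. the spike at 100,000

# outlier_sizes, how_far_off = flag_distribution_outliers(sizes, counts)
```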
Anomaly detection
It will probably come as no surprise to you to hear that data scientists have applied machine learning to spot anomalies, creating a sub-discipline of "anomaly detection" techniques. The most commonly used method is something called an "isolation forest" which is available from the popular scikit-learn library.
I'm going to suggest some caution here. This approach can take some time to develop and deploy: the model has to be trained and you have to be extremely careful about false positives. You also have to consider the action you want the 'fixer' team to take; it can't be "go look into this". For example, imagine a model that flags something as an anomaly. Without an explanation of why it's an anomaly, it's very difficult for a 'fixer' team to know what to do.
My suggestion is, detect obvious errors first, then develop a machine-learning based anomaly detector and see what it finds. It might be that you only run the anomaly detector on data that you think is clean.
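For completeness, here's a minimal sketch using scikit-learn's IsolationForest; the features and the contamination rate are assumptions you'd have to tune, and as discussed, you still need to translate any flags into something the 'fixer' team can act on.

```python
# A minimal sketch of anomaly detection with scikit-learn's IsolationForest.
# Feature choice and the contamination rate are assumptions to be tuned.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical features: e.g. daily sales, return rate, average order size.
X_train = rng.normal(size=(1000, 3))          # data believed to be clean
X_today = rng.normal(size=(50, 3))

model = IsolationForest(contamination=0.01, random_state=0)
model.fit(X_train)

labels = model.predict(X_today)               # -1 = anomaly, 1 = normal
anomalies = np.where(labels == -1)[0]
print(f"{len(anomalies)} of {len(X_today)} records flagged for review")
```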
Putting it all together
You can use error detection systems to find data errors in your system. Because these systems aren't tied to your production system, you can move very fast and add new data checks very quickly. You can also use error detection systems to create data quality metrics.
The main problem you'll face is people issues. These can be severe, so plan accordingly. Make sure goals are aligned and communication and trust are excellent.
Get started with an MVP using simple checks. Debug the people and technical process. Make sure people are resolving issues.
Add new checks as the system becomes more accepted. Make sure your system never produces a tsunami of tickets, and consolidate your findings where you can.
Statistical analysis can reveal errors that other forms of error check can't. Consider using these methods later on in the process.
Use advanced data science methods, like an isolation forest, sparingly and only when the rest of the system is up and running.