Wednesday, November 12, 2025

How to rapidly build and deploy data science apps using code gen

Introduction

If you want to rapidly build and deploy apps with a data science team, this blog post is written for you.

(Canva)

I’ve seen how small teams of MIT and Harvard students at the sundai.club in Boston are able to produce functioning web apps in twelve hours. I want to understand how they’re doing it, adapt what they’re doing for business, and create data science heavy apps very quickly. This blog post is about what I’ve learned.

Almost all of the sundai.club projects use an LLM as part of their project (e.g., using agentic systems to analyze health insurance denials), but that’s not how they’re able to build so quickly. They get development speed through code generation, the appropriate use of tools, and the use of deployment technologies like Vercel or Render. 

(Building prototypes in 12 hours: the inspiration for this blog post.)

Inspired by what I’ve seen, I developed a pathfinder project to learn how to do rapid development and deployment using AI code gen and deployment tools. My goal was to find out:

  • The skills needed and the depth to which they’re needed.
  • Major stumbling blocks and coping strategies.
  • The process to rapidly build apps.

I'm going to share what I've learned in this blog post. 

Summary of findings

Process is key

Rapid development relies on having three key elements in place:

  • Using the right tools.
  • Having the right skill set.
  • Using AI code gen correctly.

Tools

Fast development must use these tools:

  • AI-enabled IDE.
  • Deployment platform like Render or Vercel.
  • Git.

Data scientists tend to use notebooks, and that’s a major problem for rapid development; notebook-based development isn’t going to work. Speed requires the consistent use of AI-enabled IDEs like Cursor or Lovable. These tools apply AI code generation at the project and code-block level, and can generate code in different languages (Python, SQL, JavaScript, etc.). They can generate test code, comment code, and make code PEP8 compliant. It’s not just one-off code gen, it’s applying AI to the whole code development process.

(Screen shot of Cursor used in this project.)

Using a deployment platform like Render or Vercel means deployment can be extremely fast. Data scientists typically don’t have deployment skills, but these products are straightforward enough that some written guidance should suffice. 

Deployment platforms retrieve code from Git-based systems (e.g., GitHub, GitLab etc.), so data scientists need some familiarity with them. Desktop tools (like GitHub Desktop) make it easier, but they have to be used, which is a process and management issue.

Skillsets and training

The skillset needed is the same as a full-stack engineer’s, with a few tweaks, which is a challenge because data scientists mostly lack some of the key skills. Here are the skills, the level needed, and the training required for data scientists.

  • Hands-on experience with AI code generation and AI-enabled IDE.
    • What’s needed:
      • Ability to appropriately use code gen at the project and code-block levels. This could be with Cursor, Claude Code, or something similar.
      • Understanding code gen strengths and weaknesses and when not to use it.
      • Experience developing code using an IDE.
    • Training: 
      • To get going, an internal training session plus a series of exercises would be a good choice.
      • At the time of writing, there are no good off-the-shelf courses.
  • Python
    • What’s needed:
      • Decent Python coding skills, including the ability to write functions appropriately (data scientists sometimes struggle here).
      • Django uses inheritance and function decorators, so understanding these features of Python is important (see the sketch after this list). 
      • Use of virtual environments.
    • Training:
      • Most data scientists have “good enough” Python.
      • The additional knowledge should come from a good advanced Python book. 
      • Consider using experienced software engineers to train data scientists in missing skills, like decomposing tasks into functions, PEP8 and so on.
  • SQL and building a database
    • What’s needed:
      • Create databases, create tables, insert data into tables, write queries.
    • Training:
      • Most data scientists have “good enough” SQL.
      • Additional training could be a book or online tutorials.
  • Django
    • What’s needed:
      • An understanding of Django’s architecture and how it works.
      • The ability to build an app in Django.
    • Training:
      • On the whole, data scientists don’t know Django.
      • The training provided by a short course or a decent text book should be enough.
      • Writing a couple of simple Django apps by hand should be part of the training.
      • This may take 40 hours.
  • JavaScript
    • What’s needed:
      • Ability to work with functions (including callbacks), variables, and arrays.
      • Ability to debug JavaScript in the browser.
      • These skills are needed to add and debug UI widgets. Code generation isn't enough.
    • Training:
      • A short course (or a reasonable text book) plus a few tutorial examples will be enough.
  • HTML and CSS
    • What’s needed:
      • A low level of familiarity is enough.
    • Training:
      • Tutorials on the web or a few YouTube videos should be enough.
  • Git
    • What’s needed:
      • The ability to use Git-based source control systems. 
      • It's needed because deployment platforms rely on code being on Git.
    • Training:
      • Most data scientists have a weak understanding of Git. 
      • A hands-on training course would be the most useful approach here.
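
To make the Python and Django skills above concrete, here's a minimal sketch of the kind of code data scientists will be reading and writing: Django models rely on class inheritance, and views are often wrapped in decorators. The model and view names are hypothetical examples, not code from my app.

```python
# Minimal Django sketch: models rely on inheritance, views on decorators.
# The names (Match, match_list) are hypothetical, not from my app.
from django.contrib.auth.decorators import login_required
from django.db import models
from django.shortcuts import render


class Match(models.Model):  # inherits behavior from models.Model
    season = models.IntegerField()
    home_goals = models.IntegerField()
    away_goals = models.IntegerField()


@login_required  # decorator: only logged-in users can see this view
def match_list(request):
    matches = Match.objects.filter(season=2024)
    return render(request, "matches/list.html", {"matches": matches})
```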

Code gen is not one-size-fits-all

AI code gen is a tremendous productivity boost and enabler in many areas, but not all. For some key tasks, like database design and app deployment, AI code gen doesn’t help at all. In other areas, for example complex database/dataframe manipulations and some advanced UI issues, AI helps somewhat but needs substantial guidance. The productivity benefit of AI coding ranges from negative to strongly positive depending on the task. 

The trick is to use AI code gen appropriately and provide active (adult) supervision. This means reviewing what AI produces and intervening. It means knowing when to stop prompting and when to start coding.

Recommendations before attempting rapid application development

  • Make sure your team have the skills I’ve outlined above, either individually or collectively.
  • Use the right tools in the right way.
  • Don’t set unreasonable expectations; understand that your first attempts will be slow as you learn.
  • Run a pilot project or two with loose deadlines. From the pilot project, codify the lessons and ways of working. Focus especially on AI code gen and deployment.

How I learned rapid development: my pathfinder app

For this project, I chose to build an app that analyzes the results of English League Football (soccer) games from the league’s beginning in 1888 through the most recently completed season (2024-2025). 

The data set is quite large, which means a database back end. The database will need multiple tables.

It’s a very chart-heavy app. Some of the charts are violin plots that need kernel density estimation, and I’ve added curve fitting and confidence intervals on some line plots. That’s not the most sophisticated data analysis, but it’s enough to prove a point about the use of data science methods in apps. Notably, charts are not covered in most Django texts.

(Just one of the plots from my app. Note the year slider at the bottom.)
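
To give a flavor of the data science involved, here's a hedged sketch of fitting a curve with an approximate 95% confidence band using SciPy; the linear model and made-up data are illustrative stand-ins for my actual analysis.

```python
# Sketch: fit a line to noisy data and compute an approximate 95% confidence
# band for the fit. The data here is made up; my app fits real football data.
import numpy as np
from scipy.optimize import curve_fit


def linear(x, a, b):
    return a * x + b


x = np.linspace(1888, 2024, 50)
y = 2.5 + 0.01 * (x - 1888) + np.random.normal(0, 0.3, x.size)  # fake data

params, cov = curve_fit(linear, x, y)
fit = linear(x, *params)

# Delta-method 95% confidence band for the fitted line.
J = np.vstack([x, np.ones_like(x)]).T        # Jacobian of a*x + b w.r.t. (a, b)
band = 1.96 * np.sqrt(np.einsum("ij,jk,ik->i", J, cov, J))
upper, lower = fit + band, fit - band
```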

In several cases, the charts need widgets: sliders to select the year and radio buttons to select different leagues. This means either using ‘native’ JavaScript or libraries specific to the charting tool (Bokeh). I chose to use native JavaScript for greater flexibility.

To get started, I roughly drew out what I wanted the app to look like. This included different themed analysis (trends over time, goal analysis, etc.) and the charts I wanted. I added widgets to my design where appropriate.

The stack

Here’s the stack I used for this project.

Django was the web framework, which means it handles incoming and outgoing data, manages users, and manages data. Django is very mature, and is very well supported by AI code generation (in particular, Cursor). Django is written in Python.

Postgres. “Out of the box”, Django supports SQLite, but Render (my deployment solution) requires Postgres. 

Bokeh for charts. Bokeh is a Python plotting package that renders its charts in a browser (using HTML and JavaScript). This makes it a good choice for this project. An alternative is Altair, but my experience is that Bokeh is more mature and more amenable to being embedded in web pages.

JavaScript for widgets. I need to add drop down boxes, radio buttons, sliders, and tabs etc. I’ll use whatever libraries are appropriate, but I want code gen to do most of the heavy lifting.

Render.com for deployment. I wanted to deploy my project quickly, which meant I didn’t want to build out my own deployment solution on AWS etc.; I wanted something more packaged.

I used Cursor for the entire project.

The build process and issues

Building the database

My initial database schema gave highly complicated Django models that broke Django’s ORM. I rebuilt the database using a much simpler schema. The lesson here is to keep the database reasonably close to the format in which the data will be displayed. 

My app design called for violin plots of attendance by season and by league tier. This is several hundred plots. Originally, I was going to calculate the kernel density estimates for the violin plots at run time, but I decided it would slow the application down too much, so I calculated them beforehand and saved them to a database table. This is a typical trade-off.
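
Here's a minimal sketch of the kind of pre-computation I mean, using SciPy's gaussian_kde. The attendance numbers and the idea of storing (x, density) rows are illustrative, not my exact schema.

```python
# Sketch: pre-compute a kernel density estimate for one violin plot so the
# (x, density) pairs can be written to a database table. Illustrative data.
import numpy as np
from scipy.stats import gaussian_kde

attendances = np.array([21000, 18500, 30250, 12000, 45500, 27300, 33100])
kde = gaussian_kde(attendances)
grid = np.linspace(attendances.min(), attendances.max(), 100)
density = kde(grid)

# Rows ready for a bulk insert into a pre-computed KDE table.
rows = list(zip(grid.tolist(), density.tolist()))
```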

For this part of the process, I didn’t find code generation useful.

The next stage was uploading my data to the database. Here, I found code generation very useful. It enabled me to quickly create a Python program to upload data and check the database for consistency.
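
The upload program was along these lines: read the source files and bulk-insert rows through Django's ORM. This is a simplified sketch written as a Django management command; the app, model, and column names are hypothetical.

```python
# Sketch: load match results from a CSV file and bulk-insert them via Django's
# ORM. Runs as a management command; all names here are hypothetical.
import csv

from django.core.management.base import BaseCommand

from results.models import Match  # hypothetical app and model


class Command(BaseCommand):
    help = "Upload match results from a CSV file"

    def add_arguments(self, parser):
        parser.add_argument("csv_path")

    def handle(self, *args, **options):
        with open(options["csv_path"], newline="") as f:
            rows = [
                Match(
                    season=int(r["season"]),
                    home_goals=int(r["home_goals"]),
                    away_goals=int(r["away_goals"]),
                )
                for r in csv.DictReader(f)
            ]
        Match.objects.bulk_create(rows, batch_size=1000)
        self.stdout.write(f"Inserted {len(rows)} matches")
```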

Building Django

Code gen was a huge boost here. I gave Cursor a markdown file specifying what I wanted and it generated the project very quickly. The UI wasn’t quite what I wanted, but by prompting Cursor, I was able to get it there. It let me create and manipulate dropdown boxes, tabs, and widgets very easily – far, far faster than hand coding. I did try and create a more detailed initial spec, but I found that after a few pages of spec, code generation gets worse; I got better results by an incremental approach.

(One part of the app, a dropdown box and menu. Note the widget and the entire app layout was AI code generated.)

For one UI element, I needed to create an API interface to supply JSON rather than HTML. Code gen let me create it in seconds.
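
For context, a Django view that returns JSON instead of HTML is only a few lines, which is part of why code gen handles it so easily. Here's a hedged sketch with hypothetical names:

```python
# Sketch: a Django view that returns JSON for a chart widget to consume.
# The model and field names are hypothetical.
from django.http import JsonResponse

from results.models import Match


def goals_by_season(request, season):
    rows = Match.objects.filter(season=season).values("home_goals", "away_goals")
    return JsonResponse({"season": season, "results": list(rows)})
```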

However, there were problems.

Code gen didn’t do well with generating Bokeh code for my plots and I had to intervene to re-write the code.

It did even worse with retrieving data from Django models. Although I aligned my data as closely as I could to the app, it was still necessary to aggregate data. I found code generation did a really poor job and the code needed to be re-written. Code gen was helpful to figure out Django’s model API though.
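
By aggregation, I mean queries along these lines (a sketch with hypothetical model and field names); this is the kind of code that tended to need re-writing:

```python
# Sketch: aggregate goals per season using Django's ORM. Hypothetical names.
from django.db.models import Avg, F, Sum

from results.models import Match

per_season = (
    Match.objects.values("season")
    .annotate(
        total_goals=Sum(F("home_goals") + F("away_goals")),
        mean_home_goals=Avg("home_goals"),
    )
    .order_by("season")
)
```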

In one complex case, I needed to break Django’s ORM and make a SQL call directly to the database. Here, code gen worked correctly on the first pass, creating good-quality SQL immediately.
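
Bypassing the ORM looks something like this sketch (hypothetical table and column names); this is the kind of SQL code gen got right first time:

```python
# Sketch: drop down to raw SQL when the ORM gets in the way. The table and
# columns are hypothetical; parameters are passed separately to avoid injection.
from django.db import connection


def top_scoring_clubs(season):
    sql = """
        SELECT club, SUM(goals_for) AS total_goals
        FROM results_match
        WHERE season = %s
        GROUP BY club
        ORDER BY total_goals DESC
        LIMIT 10
    """
    with connection.cursor() as cursor:
        cursor.execute(sql, [season])
        return cursor.fetchall()
```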

My use of code gen was not one-and-done, it was an interactive process. I used code generation to create code at the block and function level.

Bokeh

My app is very chart heavy, with more than 10 charts, and there aren't many examples of this type of app that I could find. This means that AI code gen doesn't have much to learn from. 

(One of the Bokeh charts. Note the interactive controls on the right of the plot and the fact the plot is part of a tabbed display.)

Code gen didn’t do well with generating Bokeh code for my plots and I had to intervene to re-write code.

I needed to access the Bokeh chart data from the widget callbacks and update the charts with new data (in JavaScript). This involved building a JSON API, which code gen created very easily. Sadly, code gen had a much harder time with the JavaScript callback. Its first pass was gibberish, and refining the prompt didn’t help. I had to intervene and ask for code gen on a block-by-block basis. Even then, I had to re-write some lines of code. Unless the situation changes, my view is that code generation for this kind of problem is limited to function definitions and block-by-block code generation, with hand coding to correct and improve the results.

(Some of the hand-written code. Code gen couldn't create this.)
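
To give a sense of what's involved, here's a hedged sketch of the pattern: a Bokeh slider whose CustomJS callback fetches JSON from a (hypothetical) Django endpoint and updates the chart's data source. It's a simplified illustration, not my production code.

```python
# Sketch: a Bokeh slider whose JavaScript callback fetches JSON from a
# (hypothetical) Django API endpoint and updates the plot's data source.
from bokeh.models import ColumnDataSource, CustomJS, Slider
from bokeh.plotting import figure

source = ColumnDataSource(data=dict(x=[], y=[]))
plot = figure(title="Goals per season")
plot.line("x", "y", source=source)

slider = Slider(start=1888, end=2024, value=2024, step=1, title="Season")
slider.js_on_change("value", CustomJS(args=dict(source=source), code="""
    // Fetch data for the selected season and push it into the chart.
    const season = cb_obj.value;
    fetch(`/api/goals-by-season/${season}/`)   // hypothetical endpoint
        .then(response => response.json())
        .then(json => {
            source.data = {x: json.x, y: json.y};
            source.change.emit();
        });
"""))
```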

Render

By this stage, I had an app that worked correctly on my local machine. The final step was deployment so it would be accessible on the public internet. The sundai.club and others use Render.com and similar services to rapidly deploy their apps, so I decided to use the free tier of Render.com.

Render’s free tier is good enough for demo purposes, but it isn’t powerful enough for a commercial deployment (which is fair); that's why I’m not linking to my app in this blog post: too much traffic will consume my free allowance.

Unlike some of its competitors, Render uses Postgres rather than SQLite as its database, hence my choice of Postgres. This means deployment is in two stages:

  • Deploy the database.
  • Link the Django app to the database and deploy it (a configuration sketch follows this list).
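
For the linking step, a common approach (and the one I'd point you towards) is to read the connection string from the DATABASE_URL environment variable that Render provides, for example with the dj-database-url package. Here's a hedged sketch of the relevant settings.py fragment, assuming dj-database-url is installed:

```python
# Sketch: settings.py fragment that reads Render's DATABASE_URL environment
# variable. Assumes the dj-database-url package is installed.
import os

import dj_database_url

DATABASES = {
    "default": dj_database_url.config(
        default=os.environ.get("DATABASE_URL", "sqlite:///db.sqlite3"),
        conn_max_age=600,
    )
}

# Keeping DEBUG switchable from the environment makes diagnosing deployment
# problems much easier (see the debug discussion below).
DEBUG = os.environ.get("DEBUG", "0") == "1"
```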

This process was more complicated than I expected, and I ran into trouble. The documentation wasn’t as clear as it needed to be, which didn’t help. The consistent advice in the Render documentation was to turn off debug, which made diagnosing problems almost impossible. I turned debug on and fixed my problems very quickly. 

To be clear: code gen was of no help whatsoever.

(Part of Render's deployment screen.)

However, it’s my view that with better documentation, subsequent deployments could go very smoothly.

General comments about AI code generation

  • Many organizations require code to pass checks (linting, PEP8, test cases, etc.) before the developer can check it into source control. Code generation makes it easier and faster to pass these checks. Commenting and code documentation are also much, much faster. 
  • Code generation works really well for “commodity” tasks and is really well-suited to Django. It mostly works well with UI code generation, provided there’s not much complexity.
  • It doesn’t do well with complex data manipulations, although its SQL can be surprisingly good.
  • It doesn’t do well with Bokeh code.
  • It doesn’t do well with complex UI callbacks where data has to be manipulated in particular ways.

Where my app ended up

End-to-end, it took about two weeks, including numerous blind alleys, restarts, and time spent digging up answers. Knowing what I know now, I could probably create an app of this complexity in under five days, and less with more people.

My app has multiple pages, with multiple charts on each page (well over 10 charts in total). The chart types include violin plots, line charts, and heatmaps. Because they're Bokeh charts, my app has built-in chart interactivity. I have widgets (e.g., sliders, radio buttons) controlling some of the charts, which communicate back to the database to update the plots. Of course, I also have Django's user management features.

Discussion

There were quite a few surprises along the way in this project: I had expected code generation to do better with Bokeh and callback code, I’d expected Render to be easier to use, and I thought the database would be easier to build. Notably, the Render and database issues are learning issues; it’s possible to avoid these costs on future projects. 

I’ve heard some criticism of code generated apps from people who have produced 70% or even 80% of what they want, but are unable to go further. I can see why this happens. Code gen will only take you so far, and will produce junk under some circumstances that are likely to occur with moderately complex apps. When things get tough, it requires a human with the right skills to step in. If you don’t have the right skills, your project stalls. 

My goal with this project was to figure out the skills needed for rapid application development and deployment. I wanted to figure out the cost of enabling a data science team to build their own apps. What I found is that the skill set needed is the skill set of a full-stack engineer. In other words, rapid development and deployment is firmly in the realm of software engineers and not data scientists. If data scientists want to build apps, there's a learning curve and a learning cost. Frankly, I'm coming round to the opinion that data scientists need a broader software skill set.

Bottom line: it’s possible to do rapid application development and deployment with the right approach, the right tools, and using code gen correctly. Training is key.

Using the app

I want to tinker with my app, so I don't want to exhaust my Render free tier. If you'd like to see my app, drop me a line (https://www.linkedin.com/in/mikewoodward/) and I'll grant you access.

If you want to see my app code, that's easier. You can see it here: https://github.com/MikeWoodward/English-Football-Forecasting/tree/main/5%20Django%20app 

Thursday, November 6, 2025

How to get data analysis very wrong: sample size effects

We're not reading the data right

In the real world, we’re under pressure to get results from data analysis. Sometimes, the pressure to deliver certainty means we forget some of the basics of analysis. In this blog post, I’m going to talk about one pitfall that can cause you to give wildly wrong answers. I’ll start with an example.

School size - smaller schools are better?

You’ve probably heard the statement that “small schools produce better results than large schools”. Small-school advocates point out that small schools disproportionately appear in the top-performing groups in an area. It sounds like small schools are the way to go, or are they? It’s also true that small schools disproportionately appear among the worst schools in an area. So which is it: are small schools better or worse?

The answer is: both. Small schools have a higher variation in results because they have fewer students. The results are largely due to “statistical noise” [1].

We can easily see the effect of sample size “statistical noise”, more properly called variance, in a very simple example. Imagine tossing a coin and scoring heads as 1 and tails as 0. You would expect the mean over many tosses to be close to 0.5, but how many tosses do you have to do? I wrote a simple program to simulate tossing a coin and tracked the running mean as I went along. The charts below show four simulations. The x axis of each chart is the number of tosses, the y axis is the running mean, the blue line is the simulation, and the red dotted line is 0.5.

The charts clearly show higher variance at low numbers of tosses. It takes a surprisingly large number of tosses for the mean to get close to 0.5. If we want more certainty, and less variance, we need bigger sample sizes.
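
If you want to reproduce the experiment, here's a minimal sketch of the simulation (illustrative, not my exact program):

```python
# Sketch: running mean of repeated coin tosses (heads = 1, tails = 0).
import numpy as np

rng = np.random.default_rng()
tosses = rng.integers(0, 2, size=1000)               # 1,000 fair coin tosses
running_mean = np.cumsum(tosses) / np.arange(1, tosses.size + 1)
print(running_mean[[9, 99, 999]])                    # mean after 10, 100, 1,000 tosses
```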

We can repeat the experiment, but this time with a six-sided die, and record the running mean. We’d see the same result: more variance for shorter simulations. Let’s try a more interesting example (you’ll see why in a minute). Let’s imagine a 100-sided die and run the experiment multiple times, recording the mean results after each simulation (I’ve shown a few runs here). 

Let’s change the terminology a bit. The 100-sided die is a percentage test result. Each student rolls the die. If there are 100 students in a school, there are 100 die rolls; if there are 1,500 students in the school, we roll the die 1,500 times. We now have a simulation of school test results and the effect of school size.
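
Here's a hedged sketch of that kind of school simulation; the uniform 1-100 scores follow the description above, but the code is illustrative rather than my exact program.

```python
# Sketch: simulate mean test scores for schools of different sizes, where each
# student's score is a roll of a 100-sided die (uniform 1 to 100).
import numpy as np

rng = np.random.default_rng()
sizes = rng.integers(500, 1501, size=500)      # 500 schools, 500-1,500 students
means = [rng.integers(1, 101, size=n).mean() for n in sizes]
# Plotting means against sizes shows a wider spread for smaller schools.
```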

I simulated 500 schools with 500 to 1,500 students. Here are the results.

As you can see, there’s more variance for smaller schools than for larger schools. This neatly explains why smaller schools are both the best and the worst in an area.

You might object to the simplicity of my analysis: surely real school results don't look like this. What does real-world data show? Wainer [1] did the work and got the real results (read his paper for more details). Here's a screenshot from his paper showing real-world school results. It looks a lot like my simple-minded simulation.

Sample size variation is not the full explanation for school results, but it is a factor. Any analysis has to take it into account. Problems occur because of simple (wrong) analysis and overly-simple conclusions.

The law of large numbers

The effect that the variance of the sample mean goes down with increasing sample size is closely tied to the law of large numbers. It’s widely taught and there’s a lot written about it online. Unfortunately, most of the discussions get lost in the weeds very quickly. These two references do a very good job of explaining what’s going on: [1] [2].

The law of large numbers has a substantial body of mathematical theory behind it. It has an informal counterpart that's a bit easier to understand, the "law of small numbers", which says that there's more variance in small samples than in large ones. Problems occur because people assume that small samples behave in the same way as large samples (for example, that small-school results have the same variance as large-school results).

So far, this sounds simple and obvious, but in reality, most data analysts aren’t fully aware of the effects of sample size. It doesn’t help that the language used in the real world doesn’t match the language used in the classroom.

Small sales territories are the best?

Let’s imagine you were given some sales data on rep performance for an American company and you were asked to find factors that led to better performance.

Most territories have about 15-20 reps, with a handful having 5 or fewer reps. The top 10 leaderboard for the end of the year shows you that the reps from the smaller territories are doing disproportionately well. The sales VP is considering changing her sales organization to create smaller territories, and she wants you to confirm what she’s seen in the data. Should she re-organize into smaller territories to get better results?

Obviously, I’ve prepped you with the answer, but if I hadn’t, would you have concluded smaller territories are the way to go?

Rural lives are healthier

Now imagine you’re an analyst at a health insurance company in the US. You’ve come across data on the prevalence of kidney cancer by US county. You’ve found that the lowest prevalence is in rural counties. Should you set company policy based on this data? It seems obvious that the rural lifestyle is healthier. Should health insurance premiums include a rural/urban cost difference?

I’ve taken this example from the paper by Wainer [1]. As you might have guessed, rural counties have both the lowest and the highest rates of kidney cancer because their populations are small, so the law of small numbers kicks in. I’ve reproduced Wainer’s chart here: the x axis is county population and the y axis is cancer rate; see his paper for more about the chart. It’s a really great example of the effect of sample size on variance.

A/B test hell

Let’s take a more subtle example. You’re running an A/B test that’s inconclusive. The results are really important to the company. The CMO is telling everyone that all the company needs to do is run the test for a bit longer. You are the analyst and you’ve been asked if running the test longer is the solution. What do you say?

The only time it's worth running the test a bit longer is if the test is on the verge of significance. Other than that, it's probably not worth it. Van Belle's book [3] has a nice chapter on sample size calculations that you can access for free online [4]. The bottom line is: the smaller the effect, the larger the sample size you need for significance, and the relationship isn't linear. I've seen A/B tests that would have to run for over a year to reach significance. 
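
For a standard two-proportion test, the estimate is only a few lines. Here's a hedged sketch using statsmodels; the baseline conversion rate and minimum detectable effect are made-up numbers for illustration.

```python
# Sketch: sample size per variant for an A/B test on conversion rates.
# The baseline and target values are made up for illustration.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.050      # 5% conversion today
target = 0.055        # the smallest lift worth detecting (a 10% relative lift)
effect = proportion_effectsize(target, baseline)

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, ratio=1.0
)
print(round(n_per_variant))   # roughly 15,000+ visitors per variant
```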

Surprisingly, I've seen analysts who don't know how to do a sample size/duration estimate for an A/B test. That really isn't a good place to be when the business is relying on you for answers.

The missing math

Because I’m aiming for a more general audience, I’ve been careful here not to include equations. If you’re an analyst, you need to know:

  • What variance is and how to calculate it.
  • How sample size can affect results - you need to look for it everywhere.
  • How to estimate how much of what you're seeing is due to sample size effects and how much due to something "real".

Unfortunately, references for the law of large numbers get overly technical very quickly. A good place to start is references that cover variance and standard deviation calculations. I like reference [5], but be aware it is technical.

The bottom line

The law of large numbers can be hidden in data; the language used and the data presentation can all confuse what’s going on. You need to be acutely aware of sample size effects: you need to know how to calculate them and how they can manifest themselves in data in surprising ways.

References

[1] Howard Wainer, “The Most Dangerous Equation”, https://www.americanscientist.org/article/the-most-dangerous-equation

[2] Jeremy Orloff, Jonathan Bloom, “Central Limit Theorem and the Law of Large Numbers”, https://math.mit.edu/~dav/05.dir/class6-prep.pdf 

[3] Gerald van Belle, "Statistical rules of thumb", http://www.vanbelle.org/struts.htm 

[4] Gerald van Belle, "Statistical rules of thumb chapter 2 - sample size", http://www.vanbelle.org/chapters/webchapter2.pdf

[5] Steven Miller, "The probability lifesaver"




Sunday, October 26, 2025

Context 7: code generation using the most recent libraries

The problem

One of my complaints about using AI code gen with Cursor has been its "backwardness"; it tends to use older versions of libraries. This sometimes means your generated code isn't as well-structured as it could be. It's why AI code gen has sometimes felt to me like working with a grumpy older senior engineer.

What we want is some way of telling Cursor (or AI code gen in general) to use the latest library versions and use the latest code samples. Of course, we could supply links to the library documentation ourselves as part of the prompt, but this is tedious, which means we're prone to forgetting it.

Wouldn't it be great to have a list of all the latest libraries and documentation and supply it directly to Cursor via an MCP server? That means, we'll always point to the latest version of the code and docs, Cursor will pick it up automatically, and someone else bears the cost of keeping the whole thing up to date. With such a service, we could always generate code using the latest version of libraries.

(Gemini)

The solution

As you've guessed, such a thing exists and it's called Context 7. Context 7 provides links to the latest version of over 49,000 libraries, everything from Next.js to Requests. It provides these links in a form that's usable by LLMs.

If you really wanted to, you could include these links via a prompt. For example, for Streamlit, you could use the results here in a prompt: https://context7.com/websites/streamlit_io.  But that's inconvenient. You're better off using the Context 7 MCP Server and telling Cursor to use it in code generation.

How to implement it

There's a lot of advice online about installing the Context 7 MCP server in Cursor, some of it misleading and some of it wrong or out of date. Here's the easiest way to do it:

  1. Have Cursor running in the background.
  2. Go to this page on GitHub: https://github.com/upstash/context7
  3. Go down to the section "Install in Cursor" and expand the section.
  4. Click on the "Add to Cursor" button: 

This should automatically add the Context 7 MCP server to your installed MCP servers. To check that it has, do the following in Cursor:

  1. Click on the "Cursor" menu option, then click "Settings...".
  2. Click "Cursor Settings".  
  3. Click "Tools & MCP".
You should see this:


Next, you need to tell Cursor to use Context 7 when generating code. You could do this on every prompt, but that's tedious. You're much better off adding a rule. The Context 7 GitHub page even tells you how to do it: https://github.com/upstash/context7?tab=readme-ov-file#-tips.

Using it

This is the best part: if you've added the MCP server and you've added the rule, there's nothing else to do; you'll be using the latest version of your libraries when you generate code. I've heard a few people comment that they needed to restart Cursor, but I found it worked just fine without a restart.

The cost

Using Context 7 will cost you more tokens, but in my view, it's a price worth paying for more up to date code.

Who's behind Context 7?

A company called Upstash created Context 7, and they provide it for free. To be clear: I have no affiliation of any kind with Upstash and have received no benefit or reward from them.

Bottom line

Use Context 7 in your code generation.

Tuesday, October 14, 2025

Competency porn

Over the last few weeks, I’ve increasingly heard the term “competency porn” used to describe movies or books. It’s a handy term, but I’m not sure I agree with how it’s been used. I’m going to give a little history of the term, give some examples, and tell you where I disagree with what’s been said online.

Leverage creator John Rogers created the phrase around 2009. He used it to describe an audience’s thrill at seeing (human) characters using specialist and well-developed skills to resolve some difficult situation. The situation might get very tough and evolve in ways the characters don’t expect, but they’re in control and it’s their calm use of their skills that saves the day. There are two main genres of competency porn: medical dramas and gangster/heist dramas.

A good example of medical competency porn is House. The titular character is a very talented (but deeply flawed) human being who uses his exceptional diagnostic skills to save lives. Things are never easy and there are plenty of diagnostic dead ends, but the show’s appeal lies in House’s ability to think through the situation and create opportunities for healing. Although the show mostly focuses on House, it's plain there's a (sometimes reluctant) team behind him.

(Canva)

Perhaps the best example of competency porn is the heist or gangster movie/TV series. In the heist movie, we see a group of highly-skilled (but flawed) individuals come together to overcome a series of challenges to steal something (a good example being Ocean’s Eleven). Of course, there are problems to solve along the way, but they never lose control of the situation and they overcome troubles through inventiveness borne from their skills.  The pleasure lies in watching the interaction between skilled people working together to execute a detailed plan under difficult circumstances.

(Gemini)

Competency porn characters are never “Mary Sue” types, meaning a character who has no character flaws or weaknesses. Famously, House is a seriously flawed individual, and most gangster characters have some weaknesses or problems. 

For me, superheroes can’t be competency porn. Their special powers mean they're super-human and the risk of failure is less. Their powers mean I empathize with them less; I could maybe be a safe cracker if I practiced for years, but there’s no way I could learn to fly, no matter how many buildings I jumped off. For the same reasons, I don’t think Dr Who is competency porn; famously, Dr Who isn’t human and has abilities and knowledge a human doesn’t have. Of course, superheroes always have a little “Mary Sue” tinge too.

By contrast, Law and Order is a great example of competency porn. The characters are human, highly-trained and experienced and they use their abilities to arrest and convict criminals. Almost all the time, they’re in control, and of course, they have personality flaws and weaknesses. 

Controversially, I don’t think Alien is a competency porn movie. The human characters are not in control of the situation and most of them don’t have relevant specialist skills. In fact, it's their ineptness and poor judgement that puts them at risk. There’s not a lot of calmness in the movie either. For the same reasons, horror movies can’t be competency porn.

Star Trek is usually competency porn. The characters are mostly well-trained, highly-skilled, and in control. They have a mission to accomplish, which they do through team work and the use of their complementary skill sets.

The online consensus is that Arthur C. Clarke’s Rendezvous with Rama is competency porn. It fits the definition: the characters are all human with specialist skills and they overcome challenges calmly. But for me, the characters are a little “Mary Sue” and there’s a whiff of super hero about one or two of them. Another problem for me is the pay-off. In the heist movie, the gangsters steal the money, in medical dramas, the doctor cures the patient, and in Law and Order, the criminal goes to jail. But in Rendezvous with Rama, there is no payoff: the crew leave Rama and that’s it. Other than exploration and preventing Rama being bombed, there’s no real sense the characters have achieved anything lasting.

The pleasure in competency porn is seeing a group of highly-skilled and in-control people collectively pull off something that would otherwise seem impossible. They’re not super-human in any way, so we can dream that we too could act and win as they do.

Friday, October 10, 2025

Regression to the mean

An unfortunate phrase with unfortunate consequences

"Regression to the mean" is a simple idea that has profound consequences. It's led people astray for decades, if not centuries. I'm going to explain what it is, the consequences of not understanding it, and what you can do to protect yourself and your organization.

Let's give a simple definition for now: it's the tendency, when sampling data, for more extreme values to be followed by values closer to the mean. Here's an example: if I give the same children IQ tests over time, I'll see very high scores followed by more average scores, and some very low scores followed by more average scores. It doesn't mean the children are improving or getting worse; it's just regression to the mean. The problems occur when people attach a deeper meaning, as we'll see.

(Francis Galton, popularizer of "Regression to the mean")

What it means - simple examples

I'm going to start with an easy example that everyone should be familiar with, a simple game with a pack of cards.

  • Take a standard pack of playing cards and label the cards in each suit 1 to 13 (Ace is 1, 2 is 2, Jack is 11, etc.). The mean card value is 7.5. 
  • Draw a card at random. 
  • Imagine it's a Queen (12). Now, replace the card and draw another card. Is it likely the card will have a lower value or a higher value? 
    • The probability is 11/13 that it will have a lower value. 
  • Now imagine you drew an ace (1), replace the card and draw again. 
    • The probability of drawing another ace is 1/13.
    • The probability of drawing a 2 or higher is 12/13. 

It's obvious in this example that "extreme" value cards are very likely to be followed by more "average" value cards. This is regression to the mean at work. It's nothing complex, just a probability distribution doing its job.

The cards example seems simple and obvious. Playing cards are very familiar, and we're comfortable with randomness (in fact, almost all card games rely on randomness). The problem occurs when we have real measurements: we tend to attach explanations to the data when randomness (and regression to the mean) is all that's there.

Let's say we're measuring the average speed of cars on a freeway. Here are 100 measurements of car speeds. What would you conclude about the freeway? What pattern can you see in the data and what does it tell you about driver behavior (e.g. lower speeds following higher speeds and vice versa)? What might cause it? 

['46.7', '63.3', '80.0', '71.7', '34.2', '55.0', '67.5', '34.2', '67.5', '67.5', '59.2', '63.3', '55.0', '34.2', '63.3', '63.3', '63.3', '59.2', '75.8', '71.7', '42.5', '42.5', '34.2', '34.2', '59.2', '67.5', '59.2', '71.7', '71.7', '67.5', '50.8', '63.3', '34.2', '63.3', '30.0', '38.3', '50.8', '34.2', '75.8', '75.8', '46.7', '80.0', '55.0', '46.7', '38.3', '38.3', '75.8', '59.2', '34.2', '42.5', '71.7', '71.7', '80.0', '80.0', '71.7', '34.2', '63.3', '71.7', '46.7', '42.5', '46.7', '46.7', '63.3', '80.0', '80.0', '38.3', '38.3', '46.7', '38.3', '34.2', '46.7', '75.8', '55.0', '30.0', '55.0', '75.8', '30.0', '42.5', '67.5', '30.0', '50.8', '67.5', '67.5', '71.7', '67.5', '67.5', '42.5', '75.8', '75.8', '34.2', '55.0', '50.8', '38.3', '71.7', '46.7', '71.7', '50.8', '71.7', '42.5', '42.5']

Let's imagine the authorities introduced a speed camera at the measurement I've indicated in red. What might you conclude about the effect of the speed camera?

You shouldn't conclude anything at all from this data. It's entirely random. In fact, it has the same probability distribution as the pack of cards example. I've used 13 different average speeds, each with the same probability of occurrence. What you're seeing is the result of me drawing cards from a pack and giving them floating point numbers like 71.7 instead of a number like 9. The speed camera had no effect in this case. The data set shows regression to the mean and nothing more.
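
If you're curious, data like this is easy to generate. Here's a sketch of the idea; the mapping from card values to speeds is a reconstruction, not my exact code.

```python
# Sketch: generate "car speed" data by drawing cards (values 1-13) and mapping
# each value to one of 13 evenly spaced speeds between 30 and 80.
import numpy as np

rng = np.random.default_rng()
cards = rng.integers(1, 14, size=100)                 # 100 draws, values 1-13
speeds = np.round(30 + (cards - 1) * 50 / 12, 1)      # 13 evenly spaced speeds
print(list(speeds))
```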

The pack of cards and the freeway examples are exactly the same example. In the pack of cards case, we understand randomness and we can intuitively see what regression to the mean actually means. Once we have a real-world problem, like the cars on the freeway, our tendency is to look for explanations that aren't there and to discount randomness. Looking for meaning in random data has had bad consequences, as we'll see.

Schools example

In the last few decades in the US, several states have introduced standardized testing to measure school performance. Students in the same year group take the same test and, based on the results, the state draws conclusions about the relative standing of schools; it may intervene in low performing schools. The question is, how do we measure the success of these interventions? Surely, we would expect to see an improvement in test scores taken the next year? In reality, it's not so simple.

The average test result for a group of students will obviously depend on things like teaching, prior attainment etc. But there are also random factors at work. Individual students might perform better or worse than expected due to sickness, or family issues, or a host of other random issues. Of course, different year groups in the same school might have a different mix of abilities. All of which means that regression to the mean should show up in consecutive tests. In other words, low performing schools might show an improvement and high performing schools might show a degradation entirely due to random factors.

This isn't a theoretical example: regression to the mean has been clearly shown in school scores in Massachusetts, California and in other states (see Haney, Smith & Smith). Sadly, state politicians and civil servants have intervened based on scores and drawn conclusions where they shouldn't.

Children's education evokes a lot of emotion and political interest, which is not a good mix. It's important to understand concepts like regression to the mean so we can better understand what's really going on.

Heights example

"Regression to the mean" was originally called "regression to mediocrity", and was based on the study of human heights. If regression to mediocrity sounds very disturbing, it should do. It's closely tied to eugenics through Francis Galton. I'm not going to dwell on the links between statistics and eugenics here, but you should know the origins of statistics aren't sin free.

In 1880s England, Galton studied the heights of parents and their children. I've reproduced some of his results below. He found that parents who were above average height tended to have children closer to the average height, and that parents who were below average height also tended to have children closer to the average height. This is the classic regression to the mean example. 

Think for a moment about the possible different outcomes of a study like this. If taller parents had taller children, and shorter parents had shorter children, then we might expect to see two population groups emerging (short people and tall people) and maybe the start of speciation. Conversely, if tall parents had short children, and short parents had tall children, this would be very noticeable and commented on. Regression to the mean turns out to be a good explanation of what we observe in nature.

Galton's height study was very influential for both the study of genetics and the creation of statistics as a discipline.

New sports players

Let's take a cohort of baseball players in their first season. Obviously, talent makes a difference, but there are random factors at play too. We might expect some players to do extremely well, others to do well, some to do OK, some to do poorly, and some to do very poorly. Regression to the mean tells us that some standout players may well perform worse the next year; other, lower-ranked players will perform better for the same reason. The phenomenon of new outstanding players performing worse in their second year is often called the "sophomore slump" and a lot has been written about it, but in reality, it can mostly be explained by regression to the mean.

You can read more about regression to the mean in sports here:

Business books

Popular business books often fall into the regression to the mean trap. Here's what happens. A couple of authors do an analysis of top performing businesses, usually measured by stock price, and find some commonalities. They develop these commonalities into a framework and write a best-selling business book whose thesis is, if you follow the framework, you'll be successful. They follow this with another book that's not quite as good. Then they write a third book that only the true believers read.

Unfortunately, the companies they select as winners don't do as well over a decade or more, and the longer the timescale, the worse the performance. Over the long-run, the authors' promise that they've found the elixir of success is shown to be not true. Their books go from the best seller list to the remainder bucket.

A company's stock price is determined by many factors, for example, its competitors, the state of the market, and so on. Only some of these factors are under the control of the company, and conditions change over time in unpredictable ways. Regression to the mean suggests that great stock price performers now might not be in the future, and low performers may do better. Regression to the mean neatly explains why picking winners today does not mean the same companies will be winners in the years to come. In other words, basic statistics makes a mockery of many business books.

Reading more:

  • The Halo Effect: . . . and the Eight Other Business Delusions That Deceive Managers - Phil Rosenzweig 

My experience

I've seen regression to the mean pop up in all kinds of business data sets and I've seen people make the classic mistake of trying to derive meaning from randomness. Here are some examples.

Sales data has a lot of random fluctuation, and of course, the smaller the sample, the greater the fluctuations. I've seen salespeople have a standout year followed by a very average year, and vice versa. I've seen the same pattern at the regional and country level too. Unfortunately, I've also seen analysts tie themselves in knots trying to explain these patterns. Even worse, they've made foolish predictions based on small sample sets and just a few years' worth of data.

I've seen very educated people get very excited by changes in company assessment data. They think they've spotted something significant because companies that performed well one year tended to perform a bit worse the next year, and so on. Regression to the mean explained all of the data.

How not to be fooled

Regression to the mean is hidden in lots of data sets and can lead you into making poor decisions. If you're analyzing a dataset, here are some questions to ask:

  • Is your data the result of some kind of sampling process? 
  • Does randomness play a part in your collection process or in the data?
  • Are there unknowns that might influence your data?

If the answer to any of these questions is yes, you should assume you'll find regression to the mean in your dataset. Be careful about your analysis and especially careful about explaining trends. Of course, the smaller your data set, the more vulnerable you are.

You can estimate the effect of regression to the mean on your data using a variety of methods. I'm not going to go into them in much detail here because I don't want to make this blog post too long. In the literature, you'll see references to running a randomized control trial (RCT), also known as an A/B test. That's great in theory, but the reality is that it's not appropriate for most business situations. In practice, you'll have to run simulations or do some straightforward estimation of the fractional regression to the mean, as sketched below.
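
As a starting point, if you can measure the correlation r between first and second measurements, an extreme first value is expected to regress towards the mean by a fraction of roughly (1 - r). Here's a hedged sketch with made-up paired data:

```python
# Sketch: estimate how much of an extreme first measurement you'd expect to
# "give back" on re-measurement, using the correlation between paired scores.
import numpy as np

first = np.array([82, 45, 67, 91, 38, 55, 73, 60])    # made-up paired scores
second = np.array([74, 52, 65, 80, 47, 56, 70, 62])

r = np.corrcoef(first, second)[0, 1]
fraction_regressed = 1 - r                  # rough regression-to-mean fraction

x = 91                                      # an extreme first score
expected_follow_up = second.mean() + r * (second.std() / first.std()) * (x - first.mean())
print(r, fraction_regressed, expected_follow_up)
```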

Friday, September 26, 2025

More money means more goals

Winner takes all?

Do clubs with the most expensive players score more goals in English league football? The answer is a strong yes.

In this blog post, I'll show an analysis of goals scored vs. club transfer value and you'll clearly see a strong correlation. Of course, it's not the only factor that affects goals scored, but it's a strong signal.

(Google Gemini. Note the Euro has three legs!)

The data

The data comes from TransferMarkt (https://www.transfermarkt.com/), who publish market values for clubs. The market value is the estimated transfer value of all the players in the club's squad. Obviously, transfer values change over time as players are bought, sold, or injured. TransferMarkt has club transfer values at the start of each season and also provides biweekly values. For this analysis, I've used the season start values. The dataset starts properly in 2010 for the top four tiers.

The charts

The charts below show goals for, against, and net (for - against) vs. total club transfer value for each club for each season for each league. The slider lets you change the year and the buttons let you change the league tier. The points on the charts are individual clubs and the line is a linear regression fit. The r2 and p-value for the fit are in the chart title. The blue band is the 95% confidence interval on the fit.
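
For the curious, each fit is an ordinary least squares regression. Here's a hedged sketch of how the r2 and p-value in the chart titles are produced, using SciPy's linregress and illustrative numbers:

```python
# Sketch: fit goals scored vs. club transfer value and report r^2 and p-value,
# as shown in each chart title. The data values are illustrative.
import numpy as np
from scipy.stats import linregress

value_millions = np.array([50, 120, 300, 80, 650, 210, 95, 400])
goals_for = np.array([45, 58, 72, 49, 90, 66, 51, 78])

fit = linregress(value_millions, goals_for)
r_squared = fit.rvalue ** 2
print(f"slope={fit.slope:.3f}, r2={r_squared:.2f}, p={fit.pvalue:.3g}")
```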

In addition to the buttons and slider, the charts are interactive:

  • You can hover over points and see their values.
  • You can zoom-in or zoom-out using the tool menu on the left.
  • You can save the charts using the tools menu on the left.

Take a while to play with the charts.

What the charts show

All leagues show the following trends:

  • Higher club value = more goals for
  • Higher club value = fewer goals against
  • Higher club value = more net goals

The strength of this correlation varies by league and by time, but it's there.

The r2 value varies in the range 0.4 to 0.91, suggesting a good correlation, but club value isn't the only factor; there are other factors we need to consider to fully model goals. The p-values are close to 0, indicating this correlation is very unlikely to have happened by chance.

Take a look at league tier 3 for 2024 (this tier is currently called "League One"). There's a huge outlier, and it's Birmingham City. These guys were in the Premier League not so long ago, but suffered a number of problems on and off the pitch which led to their relegation. They've recently had a big cash injection and are now owned (in part) by Tom Brady. Part of this big cash injection was new management and new players. As a result, they were promoted back to the EFL Championship (tier 2) in 2025. In other words, they're a big club temporarily fallen on hard times; they're an outlier.

If you take a look at tier 2, you'll see the top-valued clubs are pretty much all clubs recently relegated from the Premier League. To play in the Premier League, you need top-quality talent, and that's expensive. On the flip side, you get more gate revenue and TV money. Relegated teams face a number of issues: star players may leave and revenues drop precipitously. To stand any chance of being promoted, clubs need to retain top talent at the same time as their revenue has fallen. These conflicting requirements can and have led to financial instability. To ease the relegation transition, the Premier League provides "parachute" payments to relegated clubs. The upshot is that newly relegated teams are in a better place than the other clubs in the league; they have parachute money and good players.

Children's fiction, Ted Lasso, and Wrexham 

When I was growing up in England, there was a lot of football fiction aimed at kids. A staple of the genre was a struggling team that somehow makes it to the top, out-playing bigger and more expensive teams. Sadly, this just isn't the reality and probably never was; money is pretty much the only way up. Looking back, I'm not sure the financial underdog fantasy was helpful.

Both the fictional Ted Lasso and the real Wrexham are in the news. Notably, neither Ted Lasso nor Wrexham are rags-to-riches tales. 

In Ted Lasso, the fictional Richmond team owner brought in Ted Lasso to tank the team performance to spite her ex-husband. The team had plenty of money (lack of money was never a major story line). Perhaps the writers felt that having a cheap team rise to the top would be too unrealistic. 

Wrexham's upward path has been paid for by Hollywood money, and in fact Wrexham's club value is pretty typical of a League One team; they're very much not the financial underdog. 

The rags-to-riches fantasy, or maybe, the financial underdog-wins-all fantasy, is just a fantasy.

The bottom line

The bottom line is the bottom line. Money talks, and if you want to score the goals, you've got to spend the cash.