
Thursday, June 5, 2025

Cursor for data science - a scorecard

What is this scorecard?

I've been investigating how to use Cursor for data science. This means using it on a real project and finding out its strengths and weaknesses. This blog post is a summary of my experiences and I'm posting it as a guide to others.

(Gemini)

Things in this space are changing quickly. This post is up to date as of June 2025. I may update this post in the future, but if you're reading this six months in the future and it hasn't been updated, please contact me if you want to hear more (https://www.linkedin.com/in/mikewoodward/).

Cursor scorecard

General

Area                   Grade
Getting started        D
Usability              B
Debugging              C
Code generation        C
Code completion        A
Code commenting        A
Code tidying           D
PEP8 compliance        B
Documentation          A
GitHub integration     C
Error finding          B

Specific tasks

Area                             Grade
Pandas dataframe manipulation    C
Web scraping                     D
Data cleansing                   C
Prototyping                      A

Getting started with Cursor

Getting started is hard. This is very definitely an early adopter tool: 

  • Product documentation is sparse. 
  • There are very few online written tutorials. 
  • There are a handful of courses, but only on Udemy. 
  • Although there are many, many videos on YouTube, there are problems with them.

All of the YouTube videos I watched followed the same format, the development of a UI-based app. In all cases, the videos showed connections to LLMs to do some form of text processing, and in some cases, videos went through the process of connecting to databases, but none of the videos showed any significant (data science) computation in Python. On reflection, pretty much every Cursor demo I’ve seen has been focused on prototyping. That's fine if your application is a prototype, but not so great otherwise.

I got started by watching videos, talking to people at Meetup groups, and working on this project. That’s great for me, but it’s not scalable.

Although the Cursor free tier is useful, you very quickly exhaust your free tokens. To do any form of evaluation, you need a subscription. It’s cheap enough for that not to be a problem, but you should be aware you’ll need to spend some money.

Usability

The obvious problem is that Cursor isn’t a notebook. Given that most data scientists are addicted to notebooks (with good reason), it’s a major stumbling block any data science roll-out will have to deal with. In fact, it may well stop data science adoption dead in its tracks in some organizations.

Once you get round the notebook issue, usability is mostly good, but it's a mixed bag. There are settings like rules which should be easier and more obvious to set up; the fact you can specify rules in "natural" English feels like a benefit, but I'd rather have something more restrictive that's less open to interpretation. Rules have a bit of a voodoo flavor right now.

Debugging

Frankly, I found debugging harder than other environments. I missed having notebook-like features. There’s a variable explorer, but it’s weaker than in an IDE like Spyder. On the plus side, you can set breakpoints and step through the code.

Code generation

Very, very mixed results here.

Bottom line: code generation often can't be trusted for anything technical and requires manual review. However, for commodity tasks, it does very well.

Positives

It did outstandingly well at generating a UI in Streamlit. The code was a little old-fashioned and didn’t use the latest features, but it got me to a working solution astonishingly fast.

It produces 'framework' code really well and saves a lot of time. For example, I wanted to write results to a CSV file and save intermediate results. It generated that code for me in seconds. Similarly, I wanted to create 'commodity' functions to do relatively simple tasks, and it generated them very quickly. It can automate much of the 'boring' coding work.

It also did well on some low-level and obscure tasks that would otherwise have required some time on Stack Overflow, e.g. date conversion.

Negatives

Technical code generation is not a good story. With very careful prompting, it got me to an acceptable solution for statistics-oriented code. But I had to check the code carefully. Several times, it produced code that was either flat-out wrong or just a really bad implementation.

I found that code that required details instructions (e.g. specific dataframe joins) could be generated, but given how detailed the prompt needed to be, the cost savings for code generation were minimal.

On occasion, code generation gave overly complex solutions to simple tasks; for example, its solution for changing the text "an example" to "An Example" was a function using a loop.
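
For contrast, here's the one-liner you'd hope for, using Python's built-in str.title() method:

text = "an example"
print(text.title())  # prints "An Example"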

From a higher-level code structure perspective, code generation is not good. Persistently, it would create new functions rather than generalizing and re-using existing functions. For example, I had boilerplate code to open a CSV and read it into a Pandas dataframe with error checking. Code generation created a new function to read in data rather than re-use the existing code. Once I told it to consolidate all the read functions, it did. Overall, it's not good at generating well-structured code.

Although it’s a niche topic, it’s worth mentioning that code generation didn’t work at all well for web scraping.

Code completion

Excellent. Best I’ve come across.

There were several cases where code generation didn't work very well, but code completion did. Code completion works well if the context is good: for example, if you write a clear comment, the system will offer code completion based on it, and almost all the time it will do well.

I found code completion to be a very compelling feature.

Commenting code

This is definitely a Cursor superpower. It’s almost unbelievably good.

Code tidying

Some of the time, if you ask it to tidy your code, it will do the right thing. However, most of the time, I found it introduced errors.

PEP8 compliance

Surprisingly, generated code/completion code isn't PEP8 'out of the box', for example, it will happily give you code that's way over 79 characters. Even asking the AI to make the code PEP8 compliant sometimes takes multiple attempts. I had set a rule for PEP8 compliance, but it still didn't fully comply.

Documentation

This means creating markdown files that explain what the code is doing. It did a really great job here.

GitHub integration

Setup was really easy. Usage was mostly OK, but I ran into a few issues where Cursor needlessly tied itself in knots. More seriously, it deleted a bunch of data files. 

Contrasting the usability of GitHub in Cursor with the GitHub desktop app, the GitHub desktop app has the edge right now. 

GitHub integration needs some work.

Error finding

In most cases, it did really well finding and correcting run-time errors; however, I found a case where its error correction made the code much worse: this was processing a complex HTML table. Code generation couldn't give me a correct answer, and asking the engine (Claude) to correct the error just produced worse code.

Pandas dataframe manipulation

This means the ability to manipulate Pandas dataframes in non-trivial ways, for example, using groupby correctly.

Cursor can do it quite well for basic manipulations, but it fails for even moderately complicated tasks. For example, I asked it to find cases where a club only appeared as an away team or a home team. The generated code looked as if it might be correct, but it wasn't. In fact, the code didn't work at all and I had to write it by hand. This was by no means a one-off: Cursor consistently failed to produce correct code for dataframe manipulations.
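
To make the task concrete, here's a minimal sketch of the kind of code it should have produced; the column names and team names are hypothetical, not from the original project:

import pandas as pd

# Hypothetical match data with home and away team columns
matches = pd.DataFrame({
    "home_team": ["Albion", "Borough", "City"],
    "away_team": ["Borough", "Albion", "Rovers"],
})

home_clubs = set(matches["home_team"])
away_clubs = set(matches["away_team"])

print(home_clubs - away_clubs)  # clubs appearing only as a home team: {'City'}
print(away_clubs - home_clubs)  # clubs appearing only as an away team: {'Rovers'}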

Code generation for scraping data

On the plus side, it managed to give me the URLs for the pages I wanted to scrape purely on a prompt, which frankly felt a bit supernatural.

On the negative side, it really can’t generate code that works for anything other than a simple scrape. Even asking it to correct its errors doesn’t work very well. The general code structure was OK, but a little too restrictive and I had to remove some of its generated functions. It’s marginal to me whether it’s really worth using code generation here. However, code completion was helpful.

Data cleansing

Cleaning data with code generation ran into the Pandas dataframe problem I've discussed above. Code completion was helpful, but once the manipulations become more complex, I had to hand write them.

Prototyping

By prototyping, I mean creating a UI-based application, for example, a Streamlit app, or even a standalone web app using react.js with a Python backend.

The results were outstanding.

You can generate apps in a fraction of the time it takes to do it by hand.

There are some downsides:

  • Security is often not baked-in and has to be added later.
  • The code often uses structures that are a little behind the latest thinking, e.g. not using new features of libraries.

Wednesday, June 4, 2025

Recommendations for rolling out generative AI to data science and technical coding teams

Summary - proceed with caution

This report gives guidance for rolling out code generation to data science teams. One size doesn't fit all, so you should use the post as a guide to shape your thinking, not as a recipe that can't be changed.

There are substantial productivity gains to be had from rolling out generative AI for code generation to data science teams, but there are major issues to be managed and overcome. Without effective leadership, including expectation setting, roll-outs will fail. 

Replacing notebooks with an agentic AI like Cursor will not succeed. The most successful strategy is likely the combined use of notebooks and an agentic AI IDE which will give data scientists an understanding of the benefits of the technology and its limitations. This is in preparation for the probable appearance of agentic notebook products in the near future.

For groups that use IDEs (like software developers), I recommend immediate use of Cursor or one of its competitors. I'm covering this in a separate report.

(Perplexity.AI)

Introduction

Why, who, and how

This is a guide for rolling out generative AI (meaning code generation) for data science teams. It covers the benefits you might expect to see, the issues you'll encounter, and some suggestions for coping with them. 

My comments and recommendations are based on my use of Cursor (an agentic IDE) along with Claude, OpenAI, and other code generation LLMs. I'm using them on multiple data science projects.

As of June 2025, there are no data science agentic AI notebooks that have reached widespread adoption, however, in my opinion, that's likely to change later on in 2025. Data science teams that understand the use of agentic AI for code generation will have an advantage over teams that do not, so early adoption is important.

Although I'm focused on data science, all my comments apply to anyone doing technical coding, by which I mean code that's algorithmically complex or uses "advanced" statistics. This can include people with the job titles "Analyst" or "Software Engineer".

I'm aware that not everyone knows what Cursor and the other agentic AI-enabled IDEs are, so I'm writing a separate blog post about them.

(Gemini)

The situation for software engineers

For more traditional software engineering roles, agentic AI IDEs offer substantial advantages and don't suffer from the "not a notebook" problem. Despite some of the limitations and drawbacks of code generation, the gains are such that I recommend an immediate managed, and thoughtful roll-out. A managed and thoughtful roll-out means setting realistic goals, having proper training, and clear communications. 

  • Realistic goals means being honest about productivity gains; promising productivity gains of 100% or more is unrealistic.
  • Proper training means educating the team on when to use code gen and when not to use it. 
  • Clear communications means the team must be able to share their experiences and learn from one another during the roll-out phase.

I have written a separate report for software engineering deployment.

Benefits for data science

Cursor can automate a lot of the "boring" stuff that consumes data scientists' time, but isn't core algorithm development (the main thing they're paid to do). Here's a list:

  • Commenting code. This includes function commenting using, for example, the Google function documentation format.
  • Documentation. This means documenting how code works and how it's structured, e.g. create a markdown file explaining how the code base works.
  • Boilerplate code. This includes code like reading in data from a data source.
  • Test harnesses, test code, and test data. Code generation is excellent at generating regression test frameworks, including test data.
  • PEP8 compliance. Cursor can restructure code to meet PEP8 requirements.

There are other key advantages too:

  • Code completion. Given a comment or a specific prompt, Cursor can generate code blocks, including using the correct API parameters. This means less time looking up how to use APIs.
  • Code generation. Cursor can generate the outline of functions and much of the functionality, but this has to be well-managed.

Overall, if used correctly, Cursor can give a significant productivity boost for data science teams.

Problems for data science

It's not plain sailing; there are several issues to overcome to get the productivity benefits. You should be aware of them and have a plan to address them.

It's not a notebook

(Gemini)

On the whole, data scientists don't use IDEs, they use notebooks. Cursor, and all the other agentic IDEs, are not notebooks. This is the most important issue to deal with and it's probably going to be the biggest cause of roll-out failure.

Notebooks have features that IDEs don't, specifically the ability to do "data interactive" development and debugging; which is the key reason why data scientists use them. Unfortunately, none of the agentic AI systems have anything that comes close to a notebook's power. Cursor's debugging is not AI enabled and does not easily allow notebook cell-like data investigations. 

Getting data scientists to abandon notebooks and move wholesale to an agentic IDE like Cursor is an uphill task and is unlikely to succeed. 

A realistic view of code generation for data science

Complex code is not a good match

Cursor, and LLMs in general, are bad at generating technically complex code, e.g. code using "advanced" statistical methods. For example, asking for code to demonstrate random variable convolution can sometimes yield weird and wrong answers. The correctness of the solution depends sensitively on the prompt. It also needs the data scientist to closely review the generated code. Given that you need to know the answer and you need to experiment to get the right prompt, the productivity gain of using code generation in these cases is very low or even negative.

It's also worth pointing out that for Python code generation, code gen works very poorly for Pandas dataframe manipulation beyond simple transformations.

Code completion

Code completion is slightly different from code generation and suffers from fewer problems, but it can sometimes yield crazily wrong code.

Data scientists are not software engineers and neither is Cursor

Data scientists focus on building algorithms, not on complete systems. In my experience, data scientists are bad at structuring code (e.g. functional decomposition), a situation made worse by notebooks. Neither Cursor, nor any of its competitors or LLMs, will make up for this shortcoming. 

Refactoring is risky

Sometimes, code needs to be refactored. This means changing variable names, removing unused code, structuring code better, etc. From what I've seen, asking Cursor to do this can introduce serious errors. Although refactoring can be done successfully, it needs careful and limited AI prompting.

"Accept all" will lead to failure

I'm aware of real-world cases where junior staff have blindly accepted all generated code and it hasn't ended well. Bear in mind, generated code can sometimes be very wrong. All generated code (and code completion code) must be reviewed. 

Code generation roll-out recommendations

Run a pilot program first

A successful roll-out will require some experience, but where does this experience come from? There are two possibilities:

  • "Hidden" experience. It's likely that some staff have experimented with AI code gen, even if they're not data scientists. You can co-opt this experience.
  • Running a pilot program. Get a small number of staff to experiment intensively for a short period.

Where possible, I recommend a short pilot program prior to any widespread roll-out. The program should use a small number of staff and run for a month. Here are some guidelines for running a pilot program:

  • Goals:
    • To learn the strengths and weaknesses of agentic AI code generation for data science.
    • To learn enough to train others.
    • To produce a first-pass "rules of engagement".
  • Staff:
    • Use experienced/senior staff only. 
    • Use a small team, five people or fewer.
    • If you can, use people who have experimented with Cursor and/or code generation.
    • Don't use skeptics or people with a negative attitude.
  • Communication:
    • Frequent staff meetings to discuss learnings. Strong meeting leadership to ensure participation and sharing.
    • Slack (or the equivalent) channels.
  • Tasks:
    • Find a way of using agentic IDEs (e.g. Cursor) with notebooks. This is the most important task. The project will fail if you don't get a workable answer.
    • Work out "rules of engagement".
    • Work out how to train others.
  • Duration
    • Start to end, a month.

If you don't have any in-house experience, how do you "cold start" a pilot program? Here are my suggestions:

  • Go to local meetup.com events and see what others are doing.
  • Find people who have done this elsewhere (LinkedIn!) and pay them for advice.
  • Watch YouTube videos (but be aware, this is a low-productivity exercise).

Don't try to roll out AI code generation blind.

Expectation setting

There are some wild claims about productivity benefits for code generation. In some cases they're true: you really can substantially reduce the time and cost of some projects. But for other projects (especially data science projects) the savings are smaller. Overstating the benefits has several consequences:

  • Loss of credibility with company leadership.
  • Loss of credibility with staff and harm to morale.

You need to have a realistic sense of the impact on your projects. You need to set realistic expectations right from the start.

How can you get that realistic sense? Through a pilot program.

Clear goals and measuring success

All projects need clear goals and some form of success metric. The overall goal here is to increase productivity using code generation while avoiding the implementation issues. Direct measures of success here are hard as few organizations have measures of code productivity and data science projects vary wildly in complexity. Some measures might be:

  • Fraction of code with all functions documented correctly.
  • Fraction of projects with regression tests.
  • High levels of staff usage of agentic AI IDEs.

The ultimate measure is, of course, that projects are developed faster.

At an individual level, metrics might include:

  • Contributions to "rules of engagement".
  • Contributions to Slack channel (or the equivalent).

Initial briefing and on-going communications 


(Canva)

Everyone in the process must have a realistic sense of the benefits of this technology and the problems; this includes the staff doing the work, their managers, and all executive and C-level staff.

Here are my suggestions:

  • Written briefing on benefits and problems.
  • Briefing meetings for all stakeholders.
  • Written "rules of engagement" stating how code is to be used and not used. These rules will be updated as the project proceeds.
  • Regular feedback sessions for hands-on participants. These sessions are where people share their experiences.
  • Regular reports to executives on project progress.
  • On-going communications forum. This could be something like a Slack channel.
  • Documentation hub. This is a single known place where users can go to get relevant materials, e.g.
    • Set-up instructions
    • Cursor rules (or the equivalent)
    • "Rules of engagement"

Clear lines of responsibility

Assuming there are multiple people involved in an evaluation or roll-out, we need to define who does what. For this project, this means:

  • One person to act as the (Cursor) rules controller. The quality of generated code depends on rules; if everyone uses wildly different rules, the results will be inconsistent. The rules controller will provide recommended rules that everyone should use. Participants can experiment with rules, but they must keep the controller informed.
  • One person to act as recommendations controller. As I've explained, there are "dos" and "don'ts" for working with code generation; these are the "rules of engagement". One person should be responsible for continually keeping this up to date.

Limits on project scope

There are multiple IDEs on the market and there are multiple LLMs that will generate code. Evaluating all of them will take considerable time and be expensive. My recommendation is to choose one IDE (e.g. Cursor, Windsurf, Lovable, or one of the others) and one agentic AI. It's OK to have some experimentation at the boundaries, e.g. experimenting with different agentic AIs, but this needs to be managed - as always, project discipline is important.

Training

(Canva)

Just setting people up and telling them to get started won't work. Almost all data scientists won't be familiar with Cursor or the VS Code IDE it's based on. Cursor works differently from other IDEs, and there's little in the way of useful tutorials online. This raises the question: how do you get the expertise to train your team?

The answer is a pilot program as I've explained. This should enable you to bootstrap your initial training needs using in-house experience.

You should record the training so everyone can access it later if they run into trouble. Training must include what not to do, including pointing out failure modes (e.g. blindly accepting generated code); this is the "rules of engagement".

It may also be worth re-training people partway through the project with the knowledge gained so far.

(Don't forget, data scientists mostly don't use IDEs, so part of your training must cover basic IDE usage.)

Notebook and Cursor working together

This is the core problem for data science. Figuring out a way of using an agentic IDE and a notebook together will be challenging. Here are my recommendations.

  1. Find a way of ensuring the agentic IDE and the notebook can use the same code file. Most notebooks can read in Python files and there are sometimes ways of preserving cell boundaries in Python (e.g. using the "# %%" format; see the sketch after this list).
  2. Edit the same Python file in Cursor and in the notebook (this may mean refreshing the notebook so it picks up any changes, Cursor seems to pick up changes by itself).
  3. Use Cursor for comments, code completion etc. Use the notebook for live code development and debugging.
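
To illustrate point 1, here's a minimal sketch of a Python file using the "# %%" cell format (the contents are invented). VS Code, which Cursor is built on, as well as tools like Jupytext and Spyder, treat each "# %%" line as the start of a notebook-style cell:

# %% Load the data
import pandas as pd
df = pd.DataFrame({"sales": [100, 200, None]})  # stand-in for reading a real data file

# %% Explore the data
print(df.describe())

# %% Clean the data
df = df.dropna()  # each "# %%" line above starts a new notebook-style cell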

(Canva)

Precisely how to do this will depend on the exact choice of agentic IDE and notebook.

This process is awkward, but it's the best of the options right now.

(Cursor) Rules

Agentic IDEs rely on a set of rules that guide code generation. These are like settings but expressed in English prose. These rules will help govern the style of the generated code. What these rules are called will vary from IDE to IDE but in Cursor, they're called "Rules".

I suggest you start with a minimal set of Rules, perhaps 10 or so. Here are three to get you started:

"Act as an experienced data scientist creating robust, re-usable, and readable code.

Use the latest Python features, including the walrus operator. Use list comprehensions rather than loops where it makes sense.

Use meaningful variable names. Do not use df as the name of a dataframe variable."

There are several sites online that suggest Rules. Most suggest verbose and long Rules. My experience is that shorter and more concise works better.

Regression tests

As part of the development process, use Cursor to generate test cases for your code, which includes generating test data. This is one of Cursor's superpowers and one of the places where you can see big productivity improvements.

Cursor can occasionally introduce errors into existing code. Part of the "rules of engagement" must be running regression tests periodically or when the IDE has made substantial changes. In traditional development, this is expensive, but agentic IDEs substantially reduce the cost.

GitHub

Cursor integrates with GitHub and you can update Git repositories with a single prompt. However, it can occasionally mess things up. You should have a good set of tactics for GitHub integration, including having an in-house expert who can fix issues should they arise.

"Rules of engagement"

I've referred to this document a number of times. This is a written document that describes how to use code gen AI and how not to use it. Here are the kinds of things it should contain:

"Use code generation via the prompt to create function and code outlines, e.g. specifying that a file will contain 5 functions with a description of what the functions do. Most of the time, it's better to ask the agent to product code stubs. However, if a function is boilerplate, e.g. reading a CSV file into a dataframe, then you can prompt for full code generation for that function.
...
Do not use code generation or code completion for medium to complex dataframe manipulations. You can use it for simple dataframe manipulations. You can use code completion to get a hint, but you shouldn't trust it.
...
Use the prompt to comment your code, but be clear in your prompt that you want comments only and no other changes.
... 

Before running regression tests, prompt the AI to comment your code. 

"

You should periodically update the rules of engagement and make sure users know the rules have changed. As I stated earlier, one person should be responsible for maintaining and updating the rules of engagement.

Conclusions

Successfully rolling out agentic AI code generation to data scientists is not a trivial task. It will require a combination of business and technical savvy. As ever, there are political waters to navigate, both up and down the organization.

There are some key ideas I want to reiterate:
  • Agentic IDEs are not notebooks. You need to find a way of working that combines notebooks and IDEs. Success depends on this.
  • Pilot programs will let you bootstrap a roll-out, without them, you'll find roll-outs difficult to impossible.
  • Training, "rules of engagement", and communication are crucial.

Other resources

I'm in the process of developing a very detailed analysis of using Cursor for data science. This analysis would form the basis of the "rules of engagement". I'm also working on a document similar to this for more traditional software engineering. If you're interested in chatting, contact me on LinkedIn: https://www.linkedin.com/in/mikewoodward/.


Monday, May 19, 2025

What is a random variable?

Just because we can't predict something exactly doesn't mean we can't say anything about it at all

There are all kinds of problems where we can't say exactly what the value of something is, but we can still say useful things about it. Here are some examples.

  • The number of goals scored in a football or hockey match.  We might not be able to predict the number of goals scored in a particular match, but we can say something:
    • We know that the number of goals must be an integer greater than or equal to 0.
    • We know that the number of goals is likely to be low and that high scores are unlikely; seeing two goals is far more likely than seeing 100 goals.
  • The number of people buying tickets at a movie theater. We know this will depend on the time of year, the day of the week, the weather, the movies playing, etc., but even allowing for these factors, there's randomness. People might go on dates (or cancel them) or decide on a whim to see a movie. In this case, we know the minimum number of tickets is zero, the maximum is the number of seats, and that only an integer number of tickets can be sold.
  • The speed of a car on the freeway. Plainly, this is affected by a number of factors, but there's also randomness at play. We know the speed will be a real number greater than zero. We know that in the absence of traffic, it's more likely the car will be traveling at the speed limit than, say, 20 mph.
  • The score you get by rolling a dice.
(Dietmar Rabich / Wikimedia Commons / "Würfel, gemischt -- 2021 -- 5577" / CC BY-SA 4.0, https://commons.wikimedia.org/wiki/File:W%C3%BCrfel,_gemischt_--_2021_--_5577.jpg)

In all these cases, we're trying to measure something, but randomness is at play, which means we can't predict an exact result, but we can still make probabilistic predictions. We can also do math with these predictions, which means we can use them to build computer models and make predictions about how a system might behave.

The variables we're trying to measure are called random variables and I'm going to describe what they are in this blog post. I'm going to start by providing some background ideas we'll need to understand, then I'm going to show you why random variables are useful.

What is a mathematical function?

Functions are going to be important to this story, so bear with me.

In math, a function is some operation where you give it some input and it produces some output. The classic examples you may remember are the trigonometric functions like \(\sin(x)\), \(\cos(x)\), and \(\tan(x)\). A function could have several inputs, for example, this is a function: \(z = a_0 + a_1 x + a_2 y^3\).

Functions are very common in math, so much so that it can be a little hard to spot them, as we'll see.

Describing randomness - distributions

A probability distribution is a math function that tells you how likely each outcome of a process is. For example, a traffic light can be red, yellow, or green. How likely is it that the next traffic light I come to will be red, yellow, or green? It must be one of them, so the probabilities must sum to one, but we know that yellow is shorter than red or green, so yellow is less likely. We can reason about the relative likelihood of red and green in the same way.

Probability distributions can get very complicated, but many of them follow well-known patterns. For example, when rolling an unbiased dice, the probability distribution is a discrete uniform distribution that looks like this:

The number of goals scored in a hockey or football match is known to be well-modeled by a (discrete) Poisson distribution that looks like this:

Male (or female) heights are well-modeled by a (continuous) normal distribution that looks like this:

There are hundreds of known distributions, but in practice, only a few are "popular".

Discrete or continuous

There are two types of measurement we typically take: continuous and discrete.

Discrete measurements are things that come in discrete chunks, for example, the number of sheep in a flock, the number of goals in a match, the number of people in a movie theater, and so on. Categorical variables are "sort of" discrete, for example the colors of a traffic light, though they are a special case.

Continuous measurements are things that can take any value (including any number of digits after the decimal point). For example, the speed of a car on the freeway could be 72.15609... mph, someone's height might be 183.876... cm and so on. 

This seems clear, but sometimes we muddy the waters a bit. Let's say we're measuring height and we measure in whole cm. This transforms the measurement from a continuous one to a discrete one.

There are two types of probability distribution: continuous and discrete. We use continuous distributions for continuous quantities and discrete for discrete quantities. You should note that in the real world, it's often not this simple.

Random variables

A random variable is a math function the output of which depends on some random process. The values of the random variable follow a probability distribution. Here are some examples of observations that we can describe using random variables:

  • the lifetime of a lightbulb
  • goals scored
  • the result of rolling a dice
  • the speed of cars on a freeway
  • the height of a person
  • sales revenue

Dice are easy to understand, so I'll use them as an example. We don't know what the result of throwing the dice will be, but we know the probability distribution is uniform discrete, so the probability of throwing a 1 is \(\dfrac{1}{6}\), the probability of throwing a 2 is \(\dfrac{1}{6}\), and so on. Let's say we're gambling on dice, betting $1 and winning $6 if our number comes up. Using random variable math, we can work out what our gain or loss might be. In the dice example, it's trivial, but in other cases, it gets harder and we need some more advanced math.
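
As a quick sketch of that trivial calculation (assuming the $1 stake is lost when you lose, and the $6 is the total payout when your number comes up, so the net gain on a win is $5):

\[E[\text{net gain}] = \dfrac{1}{6} \times \$5 + \dfrac{5}{6} \times (-\$1) = 0\]

so under these assumptions, this particular game is exactly fair.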

Random variables have a set of all possible results, which can be finite or infinite; this set is called the sample space. The sample space is denoted by \(\Omega\). For the dice example, the sample space is simply:

\[\Omega = \{1,2,3,4,5,6\}\]

For a continuous quantity, like the lifetime of a bulb:

\[\Omega = \{x \mid x \in \mathbb{R}\}\]

which means an infinite sample space. 

Infinite sample spaces, or large discrete sample spaces, mean we can't work things out by hand; we need more powerful math to do anything useful, and that's where things get hard.

A measurement (or observation) is the process of selecting a value from the sample space. Remember, the random variable has a probability distribution that tells you how likely different values are to be selected. 

Arithmetic with random variables - doing something useful

In this section and the next, I'll start to show you some interesting things you can do with random variables. To illustrate a key idea, we'll use a simple example. We'll work out the probability distribution for the combined scores we get by throwing two unbiased dice. 

We know the distribution is uniform for both dice, so we could work it out by hand like this:

Table 1: combining the scores of two dice

Dice 1   Dice 2   Combined score   Probability
1        1        2                \(\dfrac{1}{36}\)
1        2        3                \(\dfrac{1}{36}\)
1        3        4                \(\dfrac{1}{36}\)
...
2        1        3                \(\dfrac{1}{36}\)
2        2        4                \(\dfrac{1}{36}\)
2        3        5                \(\dfrac{1}{36}\)
...

The next step is adding up the probabilities of the combined scores:

  • there's only one way of getting 2, so its probability is \(\dfrac{1}{36}\)
  • there are two ways of getting 3, so its probability is \(\dfrac{1}{36} + \dfrac{1}{36}\)
  • ...

This is really tedious, and obviously would be hugely expensive for a large sample space. There's a much faster way I'm going to show you.

To add two random variables, we use a process called convolution. This is a fancy way of saying we combine every element of one random variable with every element of the other and add up the probabilities of each combined result. Mathematically, it looks like this for discrete random variables, where \(f\) is the distribution for the first dice and \(g\) is the distribution for the second:

\[(f * g)[n] = \sum_{m} f[m] \, g[n-m]\]

In Python, we need to do it in two stages: work out the sample space and work out the probabilities. Here's some code to do it for two dice.  

import numpy as np

# Sample spaces: each dice scores 1 to 6
score1, score2 = np.arange(1, 7), np.arange(1, 7)
# Uniform probabilities: each score has probability 1/6
prob1, prob2 = np.ones(6) / 6, np.ones(6) / 6

# The combined sample space runs from 1 + 1 = 2 to 6 + 6 = 12
combo_score = list(range(score1[0] + score2[0], score1[-1] + score2[-1] + 1))
# Convolve the two probability distributions to get the distribution of the sum
combo_prob = np.convolve(prob1, prob2)

print(combo_score)
print(combo_prob)

This is easy to do by hand for two dice, but not when the data sets get a lot bigger; that's when we need computers.

The discrete case is easy enough, but the continuous case is harder and the math is more advanced. Let's take an example to make things more concrete. Let's imagine a company with two sales areas. An analyst is modeling them as continuous random variables. How do we work out the total sales? The answer is the continuous convolution of the two sales distributions:

\[(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t - \tau) \,d\tau\]

This is obviously a lot more complicated. It's so complicated, I'm going to spend a little time explaining how to do it.

Broadly speaking, there are three approaches to continuous convolution: special cases, symbolic calculation, and discrete approximations.

In a handful of cases, convolving two continuous random variables has known answers. For example, convolving normal distributions gives a normal distribution and convolving uniform distributions gives an Irwin-Hall distribution.

In almost all cases, it's possible to do a symbolic calculation using integration. You might think that something like SymPy could do it, but in practice, you need to do it by hand. Obviously, you need to be good at calculus. There are several textbooks that have some examples of the process and there are a number of discussions on StackOverflow. From what I've seen, college courses in advanced probability theory seem to have course questions on convolving random variables with different distributions and students have asked for help with them online. This should give you an inkling of the level of difficulty.

The final approach is to use discrete approximations to continuous functions and use discrete convolution. This tends to be the default in most cases.
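
Here's a minimal sketch of the discrete approximation approach, convolving two normal distributions on a grid; the means and standard deviations are invented for illustration. The finer the grid, the better the approximation:

import numpy as np
from scipy.stats import norm

dx = 0.01                             # grid spacing
x = np.arange(-10, 10, dx)            # grid covering both distributions
f = norm.pdf(x, loc=0, scale=1) * dx  # discretized probabilities, first variable
g = norm.pdf(x, loc=1, scale=2) * dx  # discretized probabilities, second variable

sum_prob = np.convolve(f, g)          # distribution of the sum
# The support of the sum starts at x[0] + x[0] and advances in steps of dx
sum_values = 2 * x[0] + dx * np.arange(len(sum_prob))

# Sanity check: the mean of the sum should be close to 0 + 1 = 1
print(np.sum(sum_values * sum_prob))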

Worked example with random variables: predicting revenue and net income

Let's say we want to model the total sales revenue (\(t\)) from several regions (\(s_0, s_1, \ldots, s_n\)) that are independent. We also have a model of expenses for the company as a whole (\(e\)). How can we model total revenue and net income?

Let's assume the sales revenue in each region is modeled by random variables, each having a normal distribution. We have mean values \(\mu_0, \mu_1, \ldots, \mu_n\) and standard deviations \(\sigma_0, \sigma_1, \ldots, \sigma_n\). To get total sales, we have to do convolution:

\[t = s_0 * s_1 * \ldots * s_n\]

This sounds complicated, but for the normal distribution, there's a short-cut. Convolving normal with normal gives normal: all we have to do is add the means and the variances. So the total sales number is a normal distribution with mean and variance:

\[\mu = \sum_{i=0}^{n}\mu_i\]

\[\sigma^2 = \sum_{i=0}^{n}\sigma_{i}^{2}\]

Getting net income is a tiny bit harder. If you remember your accountancy textbooks, net income \(ni\) is:

\[ni = t - e\]

If expenses are modeled by the normal distribution, the answer here is just a variation of the process I used for combining sales. But what if expenses are modeled by some other distribution? That's where things get tough. 
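
For completeness, here's that variation. If \(t\) and \(e\) are independent normals, their difference is also normal; the means subtract, but the variances still add:

\[ni = t - e \sim N\left(\mu_t - \mu_e, \; \sigma_t^2 + \sigma_e^2\right)\]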

Combining random variables with different probability distributions is hard. I couldn't find a good inventory of known solutions on the web. You can do the symbolic calculation by hand, but that requires a good grasp of calculus. You might think that something like SymPy would work, but at the time of writing, SymPy doesn't have a good way of doing it. The final way of doing it is using a discrete approximation, but that's time-consuming to do. Bottom line: there's no easy solution if the distributions aren't all normal or aren't all uniform.

Division and multiplication with random variables

Most problems using random variables seem to boil down to adding them. If you need to multiply or divide random variables, there are ways to do it. The book "The Probability Lifesaver" by Steven J. Miller explains how.

Minimum, maximum, and expected values

I said that convolving random variables can be very hard, but getting some values is pretty straightforward.

The maximum of the sum of two random variables \(f\) and \(g\) is simply \(max(f) + max(g)\).

The minimum of the sum of two random variables \(f\) and \(g\) is simply \(min(f) + min(g)\).

What about the mean? It turns out, getting the mean is easy too. The mean value of a random variable is often called the expectation value and is the result of a function called \(E\), so the mean of a random variable \(X\) is \(E(X)\). The formula for the mean of two random variables is:

\[E(X + Y) = E(X) + E(Y)\]

In simple words, we add the means. 

Note I didn't say what the underlying distributions were. That's because it doesn't matter.

What if we apply some function to a random variable? It turns out, you can calculate the mean of a function of a random variable fairly easily and the arithmetic for combining multiple means is well known. There are pages on Wikipedia that will show you how to do it (in general, search for "linear combinations of expectation values" to get started).
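
For reference, the key results are the linearity of expectation, which holds whether or not \(X\) and \(Y\) are independent:

\[E(aX + bY) = aE(X) + bE(Y)\]

and the formula for the mean of a function \(g\) of a discrete random variable:

\[E[g(X)] = \sum_x g(x) \, P(X = x)\]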

Bringing it all together

There are a host of business and technical problems where we can't give a precise answer, but we can model the distribution of answers using random variables. There's a ton of theory surrounding the properties and uses of random variables, but it does get hard. By combining random variables, we can build models of more complicated systems, for example, we could forecast the range of net incomes for a company for a year. In some cases (e.g. normal distributions), combining random variables is easy; in other cases, it takes us into the world of calculus or discrete approximations.

Yes, random variables are hard, but they're very powerful.

Tuesday, March 18, 2025

Data science jokes



(An OpenAI generated image of some data scientists laughing. There are two reasons why you know it's fake: they're all beautiful and they're all laughing at these jokes.)

Where do data scientists go unplanned camping?
In a random forest.

Who do they bring on their trip?
Their nearest neighbors.

What do zoo keepers and data scientists have in common?
They both import pandas.

Where do data scientists go camping to get away from it all?
In an isolation forest.

What's the difference between ML and AI?
If it's written in Python, then it's probably ML.
If it's written in PowerPoint, then it's probably AI.

A Machine Learning algorithm walks into a bar.
The bartender asks, "What'll you have?"
The algorithm says, "What's everyone else having?"

Data science is 80% preparing data, and 20% complaining about preparing data.

A SQL query walks into a bar, walks up to two tables, and asks, “Can I join you?”

How did the data scientist describe their favorite movie?
It had a great training set.

Why do data scientists love parks?
Because of all the natural logs!

What’s the difference between an entomologist and a data scientist?
Entomologists classify bugs. Data scientists remove bugs from their classifiers.

Why did the data set go to therapy?
It had too many issues with its relationships!

Why does Python live on land?
Because it's above C-level.

One of these jokes was generated by OpenAI. Can you tell which one?

Monday, March 10, 2025

Everything you wanted to know about the normal distribution but were afraid to ask

Normal is all around you, and so is not-normal

The normal distribution is the most important statistical distribution. In this blog post, I'm going to talk about its properties, where it occurs, and why it's so very important. I'm also going to talk about how using the normal distribution when you shouldn't can lead to disaster and what you can do about it.

(Ainali, CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons)

A rose by any other name

The normal distribution has a number of different names in different disciplines:

  • Normal distribution. This is the name used by statisticians and data scientists.
  • Gaussian distribution. This is what physicists call it.
  • The bell curve. The name used by social scientists and by people who don't understand statistics.

I'm going to call it the normal distribution in this blog post, and I'd advise you to call it this too. Even if you're not a data scientist, using the most appropriate name helps with communication.

What it is and what it looks like

When we're measuring things in the real world, we see different values. For example, if we measure the heights of 10 year old boys in a town, we'd see some tall boys, some short boys, and most boys around the "average" height. We can work out what fraction of boys are a certain height and plot a chart of frequency on the y axis and height on the x axis. This gives us a probability or frequency distribution. There are many, many different types of probability distribution, but the normal distribution is the most important.

(As an aside, you may remember making histograms at school. These are "sort-of" probability distributions. For example, you might have recorded the height of all the children in a class, grouped them into height ranges, counted the number of children in each height range, and plotted the chart. The y axis would have been a count of how many children in that height range. To turn this into a probability distribution, the y axis would become the fraction of all children in that height range. )

Here's what a normal probability distribution looks like. Yes, it's the classic bell curve shape, which is exactly symmetrical.


The formula describing the curve is quite complex, but all you need to know for now is that it's described by two numbers: the mean (often written \(\mu\)) and a standard deviation (often written \(\sigma\)). The mean tells you where the peak is and the standard deviation gives you a measure of the width of the curve. 

To greatly summarize: values near the mean are the most likely to occur and the further you go from the mean, the less likely they are. This lines up with our boys' heights example: there aren't many very short or very tall boys and most boys are around the mean height.

Obviously, if you change the mean or the standard deviation, you change the curve, for example, you can change the location of the mean or you can make the curve wider or narrower. It turns out changing the mean and standard deviation just scales the curve because of its underlying mathematical properties. Most distributions don't behave like this; changing parameters can greatly change the entire shape of the distribution (for example, the beta distribution wildly changes shape if you change its parameters). The normal scaling property has some profound consequences, but for now, I'll just focus on one. We can easily map all normal distributions to one standard normal distribution. Because the properties of the standard normal are known, we can easily do math on the standard normal. To put it another way, it greatly speeds up what we need to do.

Why the normal distribution is so important

Here are some normal distribution examples from the real world.

Let's say you're producing precision bolts. You need to supply 1,000 bolts of a precise specification to a customer. Your production process has some variability. How many bolts do you need to manufacture to get 1,000 good ones? If you can describe the variability using a normal distribution (which is the case for many manufacturing processes), you can work out how many you need to produce.

Imagine you're outfitting an army and you're buying boots. You want to buy the minimum number of boots while still fitting everyone. You know that many body dimensions follow the normal distribution (most famously, chest circumference), so you can make a good estimate of how many boots of different sizes to buy.

Finally, let's say you've bought some random stocks. What might the daily change in value be? Under usual conditions, the change in value follows a normal distribution, so you can estimate what your portfolio might be worth tomorrow.

It's not just these three examples, many phenomena in different disciplines are well described by the normal distribution.

The normal distribution is also common because of something called the central limit theorem (CLT). Let's say I'm taking measurement samples from a population, e.g. measuring the speed of cars on a freeway. The CLT says that the distribution of the sample means will approach a normal distribution as the sample size grows, regardless of the underlying distribution. In the car speed example, I don't know how the speeds are distributed, but I can calculate a mean and know how certain I am that the mean value is the true (population) mean. This sounds a bit abstract, but it has profound consequences in statistics and means that the normal distribution comes up time and time again.

Finally, it's important because it's so well-known. The math to describe and use the normal distribution has been known for centuries. It's been written about in hundreds of textbooks in different languages. More importantly, it's very widely taught; almost all numerate degrees will cover it and how to use it. 

Let's summarize why it's important:

  • It comes up in nature, in finance, in manufacturing etc.
  • It comes up because of the CLT.
  • The math to use it is standardized and well-known.

What useful things can I do with the normal distribution?

Let's take an example from the insurance world. Imagine an insurance company insures house contents and cars. Now imagine the claim distribution for cars follows a normal distribution and the claims distribution for house contents also follows a normal distribution. Let's say in a typical year the claims distributions look something like this (cars on the left, houses on the right).

(The two charts look identical except for the numbers on the x and y axis. That's expected. I said before that all normal distributions are just scaled versions of the standard normal. Another way of saying this is, all normal distribution plots look the same.)

What does the distribution look like for cars plus houses?

The long-winded answer is to use convolution (or even Monte Carlo). But because the house and car distributions are normal, we can just do:

\(\mu_{combined} = \mu_{houses} + \mu_{cars} \)

\(\sigma_{combined}^2 = \sigma_{houses}^2 + \sigma_{cars}^2\)

So we can calculate the combined distribution in a heartbeat. The combined distribution looks like this (another normal distribution, just with a different mean and standard deviation).

To be clear: this only works because the two distributions were normal.

It's not just adding distributions together. The normal distribution allows for shortcuts if we're multiplying or dividing etc. The normal distribution makes things that would otherwise be hard very fast and very easy.
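
As a sketch of how quick this is in practice (the claim figures below are invented; the post's charts aren't reproduced here):

from scipy.stats import norm

# Hypothetical yearly claims distributions
mu_cars, sigma_cars = 5_000_000, 1_000_000    # car claims
mu_houses, sigma_houses = 2_000_000, 500_000  # house contents claims

# Combined distribution: add the means, add the variances
mu_combined = mu_cars + mu_houses
sigma_combined = (sigma_cars**2 + sigma_houses**2) ** 0.5

combined = norm(loc=mu_combined, scale=sigma_combined)
print(combined.mean(), combined.std())
# e.g. the probability that total claims exceed $8m
print(1 - combined.cdf(8_000_000))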

Some properties of the normal distribution

I'm not going to dig into the math here, but I am going to point out a few things about the distribution you should be aware of.

The "standard normal" distribution goes from \(-\infty\) to \(+\infty\). The further away you get from the mean, the lower the probability, and once you go several standard deviations away, the probability is quite small, but never-the-less, it's still present. Of course, you can't show \(\infty\) on a chart, so most people cut off the x-axis at some convenient point. This might give the misleading impression that there's an upper or lower x-value; there isn't. If your data has upper or lower cut-off values, be very careful modeling it using a normal distribution. In this case, you should investigate other distributions like the truncated normal.

The normal distribution models continuous variables, e.g. variables like speed or height that can have any number of decimal places (but see my previous paragraph on \(\infty\)). However, it's often used to model discrete variables (e.g. number of sheep, number of runs scored, etc.). In practice, this is mostly OK, but again, I suggest caution.
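
If your data does have a hard cut-off, here's a minimal sketch using scipy's truncated normal; the numbers are invented. Note that truncnorm takes its bounds in standardized units:

from scipy.stats import truncnorm

mu, sigma = 100, 20             # hypothetical mean and standard deviation
lower, upper = 0, float("inf")  # the quantity can't be negative

# scipy expects the bounds expressed in standard deviations from the mean
a, b = (lower - mu) / sigma, (upper - mu) / sigma
rv = truncnorm(a, b, loc=mu, scale=sigma)

print(rv.mean())  # marginally above 100 because the left tail is cut off
print(rv.rvs(5))  # samples are guaranteed to be >= 0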

Abuses of the normal distribution and what you can do

Because it's so widely known and so simple to use, people have used it where they really shouldn't. There's a temptation to assume the normal when you really don't know what the underlying distribution is. That can lead to disaster.

In the financial markets, people have used the normal distribution to predict day-to-day variability. The normal distribution predicts that large changes will occur with very low probability; these are often called "black swan events". However, if the distribution isn't normal, "black swan events" can occur far more frequently than the normal distribution would predict. The reality is, financial market distributions are often not normal. This creates opportunities and risks. The assumption of normality has led to bankruptcies.

Assuming normality can lead to models making weird or impossible predictions. Let's say I assume the number of units sold for a product is normally distributed. Using previous years' sales, I forecast unit sales next year to be 1,000 units with a standard deviation of 500 units. I then create a Monte Carlo model to forecast next year's profits. Can you see what can go wrong here? Monte Carlo modeling uses random numbers. In the sales forecast example, there's a 2.28% chance the model will select a negative sales number, which is clearly impossible. Given that Monte Carlo models often use tens of thousands of simulations, it's extremely likely the final calculation will have been affected by impossible numbers. This kind of mistake is insidious and hard to spot, and even experienced analysts make it.
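
That 2.28% figure is easy to verify with a one-liner, using the forecast numbers from the example above:

from scipy.stats import norm

# P(sales < 0) with mean 1,000 units and standard deviation 500 units
print(norm.cdf(0, loc=1000, scale=500))  # ~0.0228, i.e. about 2.28%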

If you're a manager, you need to understand how your team has modeled data. 

  • Ask what distributions they've used to model their data. 
  • Ask them why they've used that distribution and what evidence they have that the data really is distributed that way. 
  • Ask them how they're going to check their assumptions. 
  • Most importantly, ask them if they have any detection mechanism in place to check for deviation from their expected distribution.

History - where the normal came from

Rather unsatisfactorily, there's no clear "Eureka!" moment for the discovery of the distribution; it seems to have been the accumulation of the work of a number of mathematicians. Abraham de Moivre kicked off the process in 1733 but didn't formalize the distribution, leaving Gauss to explicitly describe it in 1801 [https://medium.com/@will.a.sundstrom/the-origins-of-the-normal-distribution-f64e1575de29].

Gauss used the normal distribution to model measurement errors and so predict the path of the asteroid Ceres [https://en.wikipedia.org/wiki/Normal_distribution#History]. This sounds a bit esoteric, but there's a point here that's still relevant. Any measurement-taking process involves some form of error. Assuming no systematic bias, these errors are well-modeled by the normal distribution. So any unbiased measurement taken today (e.g. opinion polling, measurements of particle mass, measurement of precision bolts, etc.) uses the normal distribution to calculate uncertainty.

In 1810, Laplace placed the normal distribution at the center of statistics by formulating the Central Limit Theorem. 

The math

The probability distribution function is given by:

\[f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e ^ {-\frac{1}{2} ( \frac{x - \mu}{\sigma}) ^ 2  }\]

\(\sigma\) is the standard deviation and \(\mu\) is the mean. In the normal distribution, the mean is the same as the mode is the same as the median.

This formula is almost impossible to work with directly, but you don't need to. There are extensive libraries that will do all the calculations for you.
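
For example, here's a minimal sketch using scipy's norm; the parameters are invented (say, heights with a mean of 170 cm and a standard deviation of 10 cm):

from scipy.stats import norm

rv = norm(loc=170, scale=10)  # hypothetical mean 170 cm, sd 10 cm

print(rv.pdf(170))   # height of the curve at the mean
print(rv.cdf(180))   # fraction below 180 cm (~84%)
print(rv.ppf(0.975)) # 97.5th percentile (~189.6 cm)
print(rv.rvs(3))     # three random samples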

Adding normally distributed parameters is easy:

\(\mu_{combined} = \mu_{houses} + \mu_{cars} \)

\(\sigma_{combined}^2 = \sigma_{houses}^2 + \sigma_{cars}^2\)

Wikipedia has an article on how to combine normally distributed quantities, e.g. addition, multiplication etc. see  https://en.wikipedia.org/wiki/Propagation_of_uncertainty.

Monday, March 3, 2025

Outliers have more fun

What's an outlier and why should you care?

Years ago I worked for a company that gave me a t-shirt that said "Outliers have more fun". I've no idea what it meant, but outliers are interesting, and not in a good way. They'll do horrible things to your data and computing costs if you don't get a handle on them.

Simply put, an outlier is one or more data items that are extremely different from your other data items. Here's a joke that explains the idea:

There's a group of office workers drinking in a bar in Seattle. Bill Gates walks in and suddenly, the entire bar starts celebrating. Why? Because on average, they'd all become multi-millionaires.

Obviously, Bill Gates is the outlier in the data set. In this post, I'm going to explain what outliers do to data and what you can do to protect yourself.

(Jessica Tam, CC BY 2.0 <https://creativecommons.org/licenses/by/2.0>, via Wikimedia Commons)

Outliers and the mean

Let's start by explaining the joke to death, because everyone enjoys that.

Before Bill Gates walks in, there are 10 people in the bar drinking. Their salaries are $80,000, $81,000, $82,000, $83,000, $84,000, $85,000, $86,000, $87,000, $88,000, and $89,000, giving a mean of $84,500. Let's assume Bill Gates earns $1,000,000,000 a year. Once he walks into the bar, the new mean salary is $90,985,909, which is plainly not representative of the bar as a whole. Bill Gates is a massive outlier who's pulled the average way beyond what's representative.

How susceptible your data is to this kind of outlier effect depends on the type and distribution of your data. If your data is scores out of 10 and a "typical" score is 5, an outlier can't pull the average far away (the maximum is 10 and the minimum is 0, neither hugely different from the typical value of 5). If there's no upper or lower limit (e.g., salaries, house prices, amounts of debt), then you're vulnerable, and you may be even more vulnerable if your distribution is right-skewed (e.g., something like a log-normal distribution).

What can you do if this is the case? Use the median instead. The median is the middle value: in our Seattle bar example, it's $84,500 before Bill Gates walks in and $85,000 afterwards. That's not much of a change, and it's much more representative of everyone's salary. This is why government statistics report "median salaries" rather than "mean salaries".
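A few lines of NumPy reproduce the bar-room numbers:

    import numpy as np

    salaries = np.arange(80_000, 90_000, 1_000)      # the ten drinkers
    with_gates = np.append(salaries, 1_000_000_000)  # Bill Gates walks in

    print(np.mean(salaries), np.median(salaries))      # 84,500 and 84,500
    print(np.mean(with_gates), np.median(with_gates))  # ~90,985,909 and 85,000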

If you do use the median, please be aware that it has different mathematical properties from the mean. It's fine as a measure of the average, but if you're doing calculations based on medians, be careful.

Outliers and the standard deviation

The standard deviation is a measure of the spread of the data: the bigger the number, the wider the spread. In our bar example, before Bill Gates walks in, the standard deviation is $2,872. This seems reasonable as the salaries are pretty close together. After Bill Gates walks in, the standard deviation is $287,455,495, which is even bigger than the new mean. This number suggests all the salaries are quite different, which is not the case; only one is.

The standard deviation is susceptible to outliers in the same way the mean is, but for some reason people often overlook this. I've seen people be very aware of outliers when calculating an average, but forget all about them when calculating a standard deviation.

What can you do? The answer here isn't as clear-cut. A good choice is the interquartile range (IQR), but it's not the same measurement: the IQR spans the middle 50% of the data, whereas one standard deviation either side of the mean covers about 68% of normally distributed data. For the bar, the IQR is $4,500 before Bill Gates walks in and $5,000 afterwards. If you want a measure of dispersion, the IQR is a good choice; if you want a drop-in replacement for the standard deviation, you'll have to give it more thought.
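The same NumPy example verifies these numbers too:

    import numpy as np

    salaries = np.arange(80_000, 90_000, 1_000)      # the ten drinkers
    with_gates = np.append(salaries, 1_000_000_000)  # Bill Gates walks in

    print(np.std(salaries))    # about 2,872
    print(np.std(with_gates))  # about 287,455,495

    q75, q25 = np.percentile(salaries, [75, 25])
    print(q75 - q25)           # 4,500
    q75, q25 = np.percentile(with_gates, [75, 25])
    print(q75 - q25)           # 5,000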

Why the median and IQR are not drop-in replacements for the mean and standard deviation

The mean and median are subtly different measures and have different mathematical properties. The same applies to standard deviation and IQR. It's important to understand the trade-offs when you use them.

Combining means is easy; we can do it with formulas that have been understood for hundreds of years. But we can't combine medians in the same way; the math doesn't work like that. Here's an example. Imagine we have two bars, one with 10 drinkers earning a mean of $80,000 and the other with 10 drinkers earning a mean of $90,000. The mean across the two bars is $85,000. We can do addition, subtraction, multiplication, division, and other operations with means. But if we know the median of the first bar is $81,000 and the median of the second bar is $89,000, we can't combine them to get the overall median. The same is true of the standard deviation and IQR: there are formulas for combining standard deviations, but not IQRs.
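Here's the arithmetic for the two bars (counts and means from the example above):

    n1, m1 = 10, 80_000  # bar 1: number of drinkers, mean salary
    n2, m2 = 10, 90_000  # bar 2: number of drinkers, mean salary

    # Means combine exactly from summary statistics alone
    combined_mean = (n1 * m1 + n2 * m2) / (n1 + n2)
    print(combined_mean)  # 85,000

    # There's no analogous formula for medians: the two bar medians
    # ($81,000 and $89,000) don't determine the combined median
    # without the raw data.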

In the Seattle bar example, we wanted one number to represent the salaries of the people in the bar. Because of the outlier, the best average is the median and the best measure of spread is the IQR. However, if we wanted an average we could combine across multiple bars, or if we wanted to do calculations using the average and spread, we'd be better off with the mean and standard deviation.

Of course, it all comes down to knowing what you want and why. Like any job, you've got to know your tools.

The effect of more samples

Sometimes, more data will save you. This is especially true if your data is normally distributed and outliers are very rare. If your data distribution is skewed, more data might not help much: I've worked with data sets with massive skews where the mean varied widely depending on how many samples I took. Of course, if you have millions of samples, you'll mostly be OK.
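Here's an illustration of how unstable the mean can be for skewed data (the log-normal parameters are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)

    # The true mean of lognormal(0, 2) is exp(2), about 7.39, but sample
    # means wander a long way from it unless n is very large
    for n in (100, 10_000, 1_000_000):
        means = [rng.lognormal(0, 2, size=n).mean() for _ in range(5)]
        print(n, [round(m, 2) for m in means])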

Outliers and calculation costs

This warning won't apply to everyone. I've built systems where the computing cost depends on the range of the data (maximum minus minimum): the bigger the range, the higher the cost. In these systems, outliers can drive computation costs way up, especially if there's the possibility of a "Bill Gates" type value that massively distorts the data. If this applies to your system, you need to detect outliers and take action.

Final advice 

  • If you have a small sample size (10 or fewer), use the median and the IQR.
  • If your data is highly right-skewed, use the median and the IQR.
  • Remember that the median and the IQR are not the same as the mean and the standard deviation; be extremely careful using them in calculations.
  • If your computation time depends on the range of the data, check for outliers.

Monday, February 3, 2025

Using AI (LLM) to generate data science code

What AI offers data science code generation and what it doesn't

Using generative AI for coding support has become increasingly popular for good reason; the productivity gain can be very high. But what are its limits? Can you use code gen for real data science problems?

(I, for one, welcome our new AI overlords. D J Shin, CC BY-SA 3.0, via Wikimedia Commons)

To investigate, I decided to look at two cases: a 'simple' piece of code generation to build a Streamlit UI, and a technically complicated case that's more typical of data science work. I generated Python code and evaluated it for correctness, structure, and completeness. The results were illuminating, as we'll see, and I think I understand why they came out the way they did.

My setup is pretty standard: I'm using GitHub Copilot in Microsoft Visual Studio Code and GitHub Copilot directly from the website. In both cases, I chose the Claude model (more on why later).

Case 1: "commodity" UI code generation

The goal of this experiment was to see if I could automatically generate a good enough complete multi-page Streamlit app. The app was to have multiple dialog boxes on each page and was to be runnable without further modification.

Streamlit provides a simple UI framework for Python programs. It's several years old and extremely popular (meaning there are plenty of code examples on GitHub). I've built apps using Streamlit, so I'm familiar with it and its syntax.

The specification

The first step was a written English specification. I wrote a one-page Word document detailing what I wanted for every page of the app. I won't reproduce it here for brevity's sake, but here's a brief excerpt:

The second page is called “Load model”. This will allow the user to load an existing model from a file. The page will have some descriptive text on what the page does. There will be a button that allows a user to load a file. The user will only be able to load a single file with the extension “.mdl”. If the user successfully loads a model, the code will load it into a session variable that the other pages can access. The “.mdl” file will be a JSON file and the software will check that the file is valid and follows some rules. The page will tell the user if the file has been successfully loaded or if there’s an error. If there’s an error, the page will tell the user what the error is.

In practice, I had to iterate on the specification to get things right, but it only took a couple of passes.

What I got

Code generation was very fast and the results were excellent. I was able to run the application immediately without modification and it did what I wanted it to do.

(A screen shot of part of the generated Streamlit app.)

It produced the necessary Python files, but it also produced:

  • a requirements.txt file - which was correct
  • a dummy JSON file for my data, inferred from my description
  • data validation code
  • test code

I didn't ask for any of these things; it just produced them anyway.

There were several downsides though. 

I found the VS Code interface a little awkward to use; for me, the GitHub Copilot web page was a much better experience (except that you have to copy the code across).

Slight changes to my specification sometimes caused large changes to the generated code. For example, I added a sentence asking for a new dialog box and the code generation incorrectly dropped a page from my app. 

It seemed to struggle with long "if-then" type paragraphs, for example "If the user has loaded a model ...LONG TEXT... If the user hasn't loaded a model ...LONG TEXT...".

The code was quite old-fashioned in several ways. Code generation created the app pages in a pages folder and prefixed the file names with "1_", "2_", etc. This is how the demos on the Streamlit website are structured, but it's not how I would do it; it's old school and a bit limiting. Notably, the code generation didn't use some of the newer features of Streamlit; on the whole, it was a year or so behind the curve.
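For example, newer Streamlit releases (roughly 1.36 onwards) let you define multipage navigation in code rather than through a numbered pages folder. The file names below are hypothetical:

    import streamlit as st

    # Programmatic multipage navigation (Streamlit ~1.36+), replacing
    # the pages/ folder with "1_", "2_" file-name prefixes
    pages = [
        st.Page("home.py", title="Home"),
        st.Page("load_model.py", title="Load model"),
    ]
    st.navigation(pages).run()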

Dependency on engine

I tried this with both Claude 3.5 and GPT 4o. Unequivocally, Claude gave the best answers.

Overall

I'm convinced by code generation here. Yes, it was a little behind the times and a little awkwardly structured, but it worked and it gave me something very close to what I wanted within a few minutes.

I could have written this myself (and I have done before), but I find this kind of coding tedious and time-consuming (it would have taken me a day to do by hand what code gen did in an hour). 

I will be using code gen for this type of problem in the future.

Case 2: data science code generation

What about a real data science problem? How well does code generation perform there?

I chose random variables and quasi-Monte Carlo as something meatier. The problem was to create two random variables and populate them with samples drawn from a quasi-Monte Carlo "random" number generator with a normal distribution. For each variable, work out the distribution (which we know should be normal). Combine the variables with convolution to create a third variable, and plot the resulting distribution. Finally, calculate the mean and standard deviation of all three variables.

The specification

I won't show it here for brevity, but it was slightly longer than the description I gave above. Notably, I had to iterate on it several times.

What I got

This was a real mixed bag.

My first-pass code generation didn't use quasi-Monte Carlo at all. It normalized the distributions before the convolution for no good reason, which meant the combined result was wrong. It used a histogram for the distribution, which was kind-of OK. It did generate the charts just fine, though. Overall, it was the kind of work a junior data scientist might produce.

On my second pass, I told it to use Sobol' sequences and I told it to use kernel density estimation to calculate the distribution. This time it did very well. The code was nicely commented too. Really surprisingly, it generated the sequences the correct way (one Sobol' dimension per random variable).

(After some prompting, this was my final chart, which is correct.)
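For reference, here's a minimal sketch of the approach that eventually worked (the means and standard deviations are illustrative, not the ones from my spec):

    import numpy as np
    from scipy.stats import norm, gaussian_kde, qmc

    # One Sobol' dimension per random variable
    sobol = qmc.Sobol(d=2, scramble=True, seed=42)
    u = sobol.random(2**14)  # quasi-random uniforms in [0, 1)^2

    x = norm.ppf(u[:, 0], loc=10, scale=2)  # variable 1 ~ N(10, 2)
    y = norm.ppf(u[:, 1], loc=20, scale=3)  # variable 2 ~ N(20, 3)

    # Kernel density estimates of each variable's distribution
    kde_x, kde_y = gaussian_kde(x), gaussian_kde(y)

    # Combine by convolving the two densities on a common grid
    grid = np.linspace(-20, 80, 2001)
    dx = grid[1] - grid[0]
    pdf_z = np.convolve(kde_x(grid), kde_y(grid)) * dx  # density of x + y
    z_grid = np.linspace(2 * grid[0], 2 * grid[-1], len(pdf_z))

    # Theory check: x + y should be close to N(30, sqrt(2**2 + 3**2))
    print(x.mean(), x.std(), y.mean(), y.std())
    print((z_grid * pdf_z * dx).sum())  # mean of the convolved density, ~30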

Dependency on engine

I tried this with both Claude 3.5 and GPT 4o. Unequivocally, Claude gave the best answers.

Overall

I had to be much more prescriptive here to get what I wanted. The results were good, but only because I knew to tell it to use Sobol' sequences and I knew to tell it to use kernel density estimation. 

Again, I'm convinced that code gen works.

Observations

The model

I tried the experiment with both Claude 3.5 and GPT 4o. Claude gave much better results. Other people have reported similar experiences.

Why this works and some fundamental limitations

GitHub has access to a huge code base, so the LLM is built on the collective wisdom of a vast number of programmers. However, despite appearances, it has no insight; it can't go beyond what others have done. This is why the code it produced for the Streamlit demo was old-fashioned. It's also why I had to be prescriptive for my data science case: it just didn't understand what quasi-Monte Carlo meant without additional prompting.

AI is known to hallucinate, and we saw something of that here. You really have to know what you're doing to use AI-generated code. If you blindly implement AI-generated code, things are going to go badly for you very quickly.

Productivity

Code generation and support is a game changer. It ramps up productivity enormously. I've heard people say it's like having a (free) senior engineer by your side, and I agree. Despite the issues I've come across, code generation is good enough.

Employment

This has obvious implications for employment. With AI code generation and AI coding support, you need fewer software engineers/analysts/data scientists. The people you do need are those with the insight and the ability to spot where the AI-generated code has gone wrong, which is bad news for more junior people and those entering the workforce. It may well be a serious problem for students seeking internships.

Let me say this plainly: people will lose their jobs because of this technology.

My take on the employment issue and what you can do

There's an old joke that sums things up. A householder calls in a mechanic because their washing machine has broken down. The mechanic looks at the washing machine, rocks it around a bit, then kicks it. It starts working! The mechanic writes a bill for $200. The householder explodes: "$200 to kick a washing machine? This is outrageous!" The mechanic thinks for a second and says, "You're quite right. Let me re-write the bill." The new bill reads: "Kicking the washing machine, $1; knowing where to kick the washing machine, $199." To put it bluntly, you need to be the kind of mechanic who knows where to kick the machine.

(You've got to know where to kick it. LG전자, CC BY 2.0, via Wikimedia Commons)

Code generation has no insight. It makes errors. You have to have experience and insight to know when it's gone wrong. Not all human software engineers have that insight.

You should be very concerned if:
  • You're junior in your career or you're just entering the workforce.
  • You're developing BI-type apps as the main or only thing you do.
  • There are many people doing exactly the same software development work as you.
If that applies to you, here's my advice:
  • Use code generation and code support. You need to know first-hand what it can do and the threat it poses. Remember, it's a productivity boost, and the least productive people are the first to go.
  • Develop domain knowledge. If your company is in the finance industry, make sure you understand finance, including the legal framework. If it's a drug discovery company, learn the principles of drug discovery. Get some kind of certification (online courses work fine), apply your knowledge to your work, and make sure your employer knows it.
  • Develop specialist skills, e.g. statistics, and use those skills in your work.
  • Develop human skills. This means talking to customers and talking to people in other departments.

Some takeaways

  • AI-generated code is good enough for real use, even in more complicated cases.
  • It's a substantial productivity boost. You should be using it.
  • It's a tool, not a magic wand. It does get things wrong, and you need to be skilled enough to spot the errors.