Monday, February 3, 2025

Using AI (LLM) to generate data science code

What AI offers for data science code generation, and what it doesn't

Using generative AI for coding support has become increasingly popular for good reason; the productivity gain can be very high. But what are its limits? Can you use code gen for real data science problems?

(I, for one, welcome our new AI overlords. D J Shin, CC BY-SA 3.0, via Wikimedia Commons)

To investigate, I decided to look at two cases: a 'simple' piece of code generation to build a Streamlit UI, and a technically complicated case that's more typical of data science work. I generated Python code and evaluated it for correctness, structure, and completeness. The results were illuminating, as we'll see, and I think I understand why they came out the way they did.

My setup is pretty standard: I'm using GitHub Copilot in Visual Studio Code and GitHub Copilot directly from the website. In both cases, I chose the Claude model (more on why later).

Case 1: "commodity" UI code generation

The goal of this experiment was to see if I could automatically generate a good enough complete multi-page Streamlit app. The app was to have multiple dialog boxes on each page and was to be runnable without further modification.

Streamlit provides a simple UI for Python programs. It's several years old and extremely popular, meaning there are plenty of code examples on GitHub. I've built apps using Streamlit, so I'm familiar with it and its syntax.

The specification

The first step was a written English specification. I wrote a one-page Word document detailing what I wanted for every page of the app. I won't reproduce it all here for brevity's sake, but here's a brief excerpt:

The second page is called “Load model”. This will allow the user to load an existing model from a file. The page will have some descriptive text on what the page does. There will be a button that allows a user to load a file. The user will only be able to load a single file with the file extension “.mdl”. If the user successfully loads a model, the code will load it into a session variable that the other pages can access. The “.mdl” file will be a JSON file and the software will check that the file is valid and follows some rules. The page will tell the user if the file has been successfully loaded or if there’s an error. If there’s an error, the page will tell the user what the error is.
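To make the spec concrete, here's a minimal sketch of the kind of page that paragraph describes. This is not the generated code: the ".mdl" field names and the validity check are hypothetical stand-ins for the rules in the real spec.

```python
# A minimal sketch of the "Load model" page described in the spec excerpt.
# The ".mdl" structure and the validity check here are hypothetical.
import json
import streamlit as st

st.title("Load model")
st.write("Load an existing model from a '.mdl' (JSON) file.")

uploaded = st.file_uploader("Choose a model file", type=["mdl"])
if uploaded is not None:
    try:
        model = json.load(uploaded)
        if "model_name" not in model:          # stand-in for the real validation rules
            raise ValueError("missing 'model_name' field")
        st.session_state["model"] = model      # session variable the other pages read
        st.success(f"Loaded model '{model['model_name']}'.")
    except (json.JSONDecodeError, ValueError) as err:
        st.error(f"Could not load the model: {err}")
```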

In practice, I had to iterate on the specification a few times to get things right, but it only took a couple of iterations.

What I got

Code generation was very fast and the results were excellent. I was able to run the application immediately without modification and it did what I wanted it to do.

(A screen shot of part of the generated Streamlit app.)

It produced the necessary Python files, but it also produced:

  • a requirements.txt file - which was correct
  • a dummy JSON file for my data, inferred from my description
  • data validation code
  • test code

I didn't ask for any of these things; it just produced them anyway.

There were several downsides though. 

I found the VS Code interface a little awkward to use; for me, the GitHub Copilot web page was a much better experience (except that you have to copy the code over yourself).

Slight changes to my specification sometimes caused large changes to the generated code. For example, I added a sentence asking for a new dialog box and the code generation incorrectly dropped a page from my app. 

It seemed to struggle with long "if-then" type paragraphs, for example "If the user has loaded a model ...LONG TEXT... If the user hasn't loaded a model ...LONG TEXT...".

The code was quite old-fashioned in several ways. Code generation created the app pages in a pages folder and prefixed the page files with "1_", "2_", and so on. This is how the demos on the Streamlit website are structured, but it's not how I would do it; it's old school and a bit limited. Notably, the code generation didn't use some of the newer features of Streamlit; on the whole, it was a year or so behind the curve.
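For contrast, here's roughly what the newer multipage style looks like, using st.Page and st.navigation instead of a numbered pages folder. The file names are just placeholders, not the generated app's structure.

```python
# Sketch of the newer Streamlit multipage pattern (st.Page / st.navigation),
# which the generated code didn't use. File names are placeholders.
import streamlit as st

pages = [
    st.Page("home.py", title="Home"),
    st.Page("load_model.py", title="Load model"),
    st.Page("run_model.py", title="Run model"),
]
st.navigation(pages).run()
```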

Dependency on engine

I tried this with both Claude 3.5 and GPT-4o. Unequivocally, Claude gave the better answers.

Overall

I'm convinced by code generation here. Yes, it was a little behind the times and a little awkwardly structured, but it worked and it gave me something very close to what I wanted within a few minutes.

I could have written this myself (and I have done so before), but I find this kind of coding tedious and time-consuming; it would have taken me a day to do by hand what took an hour with code gen.

I will be using code gen for this type of problem in the future.

Case 2: data science code generation

What about a real data science problem? How well does code generation perform there?

I chose random variables and quasi-Monte Carlo sampling as something meatier. The problem was to create two random variables and populate them with samples drawn from a quasi-Monte Carlo "random" number generator with a normal distribution. For each variable, work out the distribution (which we know should be normal). Combine the variables with convolution to create a third variable, and plot the resulting distribution. Finally, calculate the mean and standard deviation of all three variables.

The specification

I won't show it here for brevity, but it was slightly longer than the description I gave above. Notably, I had to iterate on it several times.

What I got

This was a real mixed bag.

My first-pass code generation didn't use quasi-Monte Carlo at all. It normalized the distributions before the convolution for no good reason, which meant the combined result was wrong. It used a histogram to estimate each distribution, which was kind of OK. It did generate the charts just fine, though. Overall, it was the kind of work a junior data scientist might produce.

On my second pass, I told it to use Sobol' sequences and to use kernel density estimation to calculate the distributions. This time it did very well, and the code was nicely commented too. Really surprisingly, it generated the sequences the correct way (drawing the two variables from separate dimensions of a single Sobol' sequence), as in the sketch below.
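For the record, here's a minimal sketch of that second-pass approach, not the generated code itself. One two-dimensional Sobol' sequence supplies both variables, the uniform samples are mapped to normals with the inverse CDF, the third variable is formed by summing the two (for independent variables, summing samples is equivalent to convolving their densities), and kernel density estimation recovers the distributions. The sample size and distribution parameters are illustrative.

```python
# A minimal sketch of the second-pass approach; parameters are illustrative.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import qmc, norm, gaussian_kde

n = 2**12  # a power of two keeps the Sobol' sequence well balanced

# One Sobol' generator with two dimensions -- one dimension per variable --
# rather than two separate one-dimensional generators.
sampler = qmc.Sobol(d=2, scramble=True, seed=42)
u = sampler.random(n)                        # uniform points in [0, 1)^2

x = norm.ppf(u[:, 0], loc=10.0, scale=2.0)   # inverse CDF -> normal samples
y = norm.ppf(u[:, 1], loc=5.0, scale=1.0)
z = x + y  # the density of z is the convolution of the densities of x and y

# Kernel density estimates of all three distributions, plus summary statistics.
grid = np.linspace(min(x.min(), y.min(), z.min()) - 1,
                   max(x.max(), y.max(), z.max()) + 1, 500)
for samples, label in [(x, "X"), (y, "Y"), (z, "X + Y")]:
    plt.plot(grid, gaussian_kde(samples)(grid), label=label)
    print(f"{label}: mean = {samples.mean():.3f}, std = {samples.std(ddof=1):.3f}")

plt.legend()
plt.xlabel("value")
plt.ylabel("estimated density")
plt.show()
```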

(After some prompting, this was my final chart, which is correct.)

Dependency on engine

As with the UI case, I tried both Claude 3.5 and GPT-4o. Unequivocally, Claude gave the better answers.

Overall

I had to be much more prescriptive here to get what I wanted. The results were good, but only because I knew to tell it to use Sobol' sequences and kernel density estimation.

Again, I'm convinced that code gen works.

Observations

The model

I tried the experiment with both Claude 3.5 and GPT-4o. Claude gave much better results, and other people have reported similar experiences.

Why this works and some fundamental limitations

GitHub has access to a huge code base, so the LLM is built on the collective wisdom of a vast number of programmers. However, despite appearances, it has no insight; it can't go beyond what others have done. This is why the code it produced for the Streamlit app was old-fashioned. It's also why I had to be prescriptive for my data science case: for example, it just didn't understand what quasi-Monte Carlo meant without additional prompting.

AI is known to hallucinate, and we see something of that here. You really have to know what you're doing to use AI-generated code. If you blindly implement it, things are going to go badly for you very quickly.

Productivity

Code generation and coding support are a game changer; they ramp up productivity enormously. I've heard people say it's like having a (free) senior engineer by your side, and I agree. Despite the issues I've come across, code generation is "good enough".

Employment

This has obvious implications for employment. With AI code generation and AI coding support, you need fewer software engineers/analysts/data scientists. The people you do need are those with more insight and the ability to spot where the AI-generated code has gone wrong, which is bad news for more junior people or those entering the workforce. It may well be a serious problem for students seeking internships.

Let me say this plainly: people will lose their jobs because of this technology.

My take on the employment issue and what you can do

There's an old joke that sums things up. A householder calls in a mechanic because their washing machine has broken down. The mechanic looks at the washing machine and rocks it around a bit. Then the mechanic kicks the machine. It starts working! The mechanic writes a bill for $200. The householder explodes: "$200 to kick a washing machine? This is outrageous!" The mechanic thinks for a second and says, "You're quite right. Let me re-write the bill." The new bill reads: "Kicking the washing machine: $1. Knowing where to kick the washing machine: $199." To put it bluntly, you need to be the kind of mechanic who knows where to kick the machine.


(You've got to know where to kick it. LG전자, CC BY 2.0, via Wikimedia Commons)

Code generation has no insight. It makes errors. You have to have experience and insight to know when it's gone wrong. Not all human software engineers have that insight.

You should be very concerned if:
  • You're junior in your career or you're just entering the workforce.
  • You're developing BI-type apps as the main or only thing you do.
  • There are many people doing exactly the same software development work as you.
If that applies to you, here's my advice:
  • Use code generation and code support. You need to know first hand what it can do and the threat it poses. Remember, it's a productivity boost and the least productive people are the first to go.
  • Develop domain knowledge. If your company is in the finance industry, make sure you understand finance, which means knowing the legal framework and so on. If it's in drug discovery, learn the principles of drug discovery. Get some kind of certification (online courses work fine). Apply your knowledge to your work, and make sure your employer knows it.
  • Develop specialist skills, e.g. statistics. Use those skills in your work.
  • Develop human skills. This means talking to customers and to people in other departments.

Some takeaways

  • AI generated code is good enough for use, even in more complicated cases.
  • It's a substantial productivity boost. You should be using it.
  • It's a tool, not a magic wand. It does get things wrong and you need to be skilled enough to spot errors.
