
Monday, February 3, 2025

Using AI (LLM) to generate data science code

What AI offers data science code generation and what it doesn't

Using generative AI for coding support has become increasingly popular for good reason; the productivity gain can be very high. But what are its limits? Can you use code gen for real data science problems?

(I, for one, welcome our new AI overlords. D J Shin, CC BY-SA 3.0, via Wikimedia Commons)

To investigate, I decided to look at two cases: a 'simple' piece of code generation to build a Streamlit UI, and a technically complicated case that's more typical of data science work. I generated Python code and evaluated it for correctness, structure, and completeness. The results were illuminating, as we'll see, and I think I understand why they came out the way they did.

My setup is pretty standard: I'm using GitHub Copilot in Microsoft Visual Studio and GitHub Copilot directly from the website. In both cases, I chose the Claude model (more on why later).

Case 1: "commodity" UI code generation

The goal of this experiment was to see if I could automatically generate a good enough complete multi-page Streamlit app. The app was to have multiple dialog boxes on each page and was to be runnable without further modification.

Streamlit provides a simple UI for Python programs. It's several years old and extremely popular (meaning there are plenty of code examples on GitHub). I've built apps using Streamlit, so I'm familiar with it and its syntax.

The specification

The first step was a written English specification. I wrote a one-page Word document detailing what I wanted for every page of the app. I won't reproduce it here for brevity's sake, but here's a brief excerpt:

The second page is called “Load model”. This will allow the user to load an existing model from a file. The page will have some descriptive text on what the page does. There will be a button that allows a user to load a file. The user will only be able to load a single with a file extension “.mdl”. If the user successfully loads a model, the code will load it into a session variable that the other pages can access. The “.mdl” file will be a JSON file and the software will check that the file is valid and follows some rules. The page will tell the user if the file has been successfully loaded or if there’s an error. If there’s an error, the page will tell the user what the error is.

In practice, I had to iterate on the specification a few times to get things right, but it only took a couple of iterations.
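To make the excerpt concrete, here's a minimal sketch of what a "Load model" page along these lines could look like. This is not the generated code. The validation rules (a JSON object with a non-empty "name" string and a "parameters" object), the session key "model", and the function names are all invented for illustration, because the real rules from the specification aren't reproduced in this post.

```python
import json


def validate_model(raw: bytes) -> dict:
    """Parse an uploaded .mdl file and check it against some rules.

    The rules are hypothetical stand-ins: the file must be a JSON object
    with a non-empty string "name" and an object "parameters".
    Raises ValueError with a readable message if a rule is broken.
    """
    try:
        model = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError(f"File is not valid JSON: {exc}") from exc
    if not isinstance(model, dict):
        raise ValueError("Top level of the file must be a JSON object.")
    if not isinstance(model.get("name"), str) or not model["name"]:
        raise ValueError('Missing or empty "name" field.')
    if not isinstance(model.get("parameters"), dict):
        raise ValueError('Missing "parameters" object.')
    return model


def load_model_page() -> None:
    """Render the page; call this from the Streamlit script for page two."""
    # Imported here so validate_model can be tested without Streamlit.
    import streamlit as st

    st.title("Load model")
    st.write("Load an existing model from a .mdl file.")
    # type=["mdl"] restricts the picker to a single .mdl file.
    uploaded = st.file_uploader("Choose a model file", type=["mdl"])
    if uploaded is None:
        return
    try:
        # Store the model where the other pages can reach it.
        st.session_state["model"] = validate_model(uploaded.getvalue())
    except ValueError as exc:
        st.error(f"Could not load model: {exc}")
    else:
        st.success(f"Loaded model from {uploaded.name}")
```

Keeping the validation in a plain function, separate from the UI calls, means the rules can be unit tested without starting the app.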

What I got

Code generation was very fast and the results were excellent. I was able to run the application immediately without modification and it did what I wanted it to do.

(A screen shot of part of the generated Streamlit app.)

It produced the necessary Python files, but it also produced:

a requirements.txt file - which was correct
a dummy JSON file for my data, inferred from my description
data validation code
test code

I didn't ask for any of these things; it just produced them anyway.

There were several downsides though.

I found the VS Code interface a little awkward to use; for me, the GitHub Copilot web page was a much better experience (except that you have to copy the code).

Slight changes to my specification sometimes caused large changes to the generated code. For example, I added a sentence asking for a new dialog box and the code generation incorrectly dropped a page from my app.

It seemed to struggle with long "if-then" type paragraphs, for example "If the user has loaded a model ...LONG TEXT... If the user hasn't loaded a model ...LONG TEXT...".

The code was quite old-fashioned in several ways. Code generation created the app pages in a pages folder and prefixed the pages with "1_", "2_" etc. This is how the demos on the Streamlit website are structured, but it's not how I would do it; it's kind of old school and a bit limited. Notably, the code generation didn't use some of the newer features of Streamlit; on the whole it was a year or so behind the curve.

Dependency on engine

I tried this with both Claude 3.5 and GPT 4o. Unequivocally, Claude gave the best answers.

Overall

I'm convinced by code generation here. Yes, it was a little behind the times and a little awkwardly structured, but it worked and it gave me something very close to what I wanted within a few minutes.

I could have written this myself (and I have done before), but I find this kind of coding tedious and time consuming (it would have taken me a day to do what I did using code gen in an hour).

I will be using code gen for this type of problem in the future.

Case 2: data science code generation

What about a real data science problem? How well does code generation perform?

I chose to use random variables and quasi-Monte Carlo as something more meaty. The problem was to create two random variables and populate them with samples drawn from a quasi-Monte Carlo "random" number generator with a normal distribution. For each variable, work out the distribution (which we know should be normal). Combine the variables with convolution to create a third variable, and plot the resulting distribution. Finally, calculate the mean and standard deviation of all three variables.

The specification

I won't show it here for brevity, but it was slightly longer than the description I gave above. Notably, I had to iterate on it several times.

What I got

This was a real mixed bag.

My first pass code generation didn't use quasi-Monte Carlo at all. It normalized the distributions before the convolution for no good reason, which meant the combined result was wrong. It used a histogram for the distribution, which was kind-of OK. It did generate the charts just fine though. Overall, it was the kind of work a junior data scientist might produce.

On my second pass, I told it to use Sobol' sequences and I told it to use kernel density estimation to calculate the distribution. This time it did very well. The code was nicely commented too. Really surprisingly, it used the correct way of generating sequences (using dimensions).
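To show the shape of the problem, here's a small sketch that uses only the Python standard library. It is not the generated code, and it swaps in a Halton sequence (one van der Corput sequence per dimension) as a stand-in for Sobol'; in practice a library generator such as SciPy's scipy.stats.qmc.Sobol would be the more usual choice. The means, standard deviations, sample size, and function names are all assumptions chosen for illustration.

```python
import statistics
from statistics import NormalDist


def van_der_corput(index: int, base: int) -> float:
    """Return point number `index` (1-based) of the base-`base` sequence."""
    point, weight = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        point += digit * weight
        weight /= base
    return point


def qmc_normal(n: int, base: int, mu: float, sigma: float) -> list[float]:
    """Map n low-discrepancy points in (0, 1) to normal samples."""
    dist = NormalDist(mu, sigma)
    return [dist.inv_cdf(van_der_corput(i, base)) for i in range(1, n + 1)]


def kde(samples: list[float], x: float) -> float:
    """Gaussian kernel density estimate at x (rule-of-thumb bandwidth)."""
    h = 1.06 * statistics.stdev(samples) * len(samples) ** -0.2
    kernel = NormalDist(0.0, h)
    return sum(kernel.pdf(x - s) for s in samples) / len(samples)


n = 4096
# A different base per dimension (a 2-D Halton sequence), so the two
# variables are not just copies of the same stream of points.
x = qmc_normal(n, base=2, mu=1.0, sigma=2.0)
y = qmc_normal(n, base=3, mu=3.0, sigma=1.5)
# Adding the samples pairwise gives a variable whose density is the
# convolution of the two input densities. No normalization is needed.
z = [a + b for a, b in zip(x, y)]

for name, values in (("x", x), ("y", y), ("z", z)):
    print(name, statistics.mean(values), statistics.stdev(values))
print("estimated density of z at 4.0:", kde(z, 4.0))
```

For these assumed inputs, theory says the combined variable should have a mean of 1 + 3 = 4 and a standard deviation of sqrt(2² + 1.5²) = 2.5, which gives a known answer to check the printed values against.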

(After some prompting, this was my final chart, which is correct.)

Dependency on engine

I tried this with both Claude 3.5 and GPT 4o. Unequivocally, Claude gave the best answers.

Overall

I had to be much more prescriptive here to get what I wanted. The results were good, but only because I knew to tell it to use Sobol' and I knew to tell it to use kernel density estimation.

Again, I'm convinced that code gen works.

Observations

The model

I tried the experiment with both Claude 3.5 and GPT 4o. Claude gave much better results. Other people have reported similar experiences.

Why this works and some fundamental limitations

GitHub has access to a huge code base, so the LLM is based on the collective wisdom of a vast number of programmers. However, despite appearances, it has no insight; it can't go beyond what others have done. This is why the code it produced for the Streamlit demo was old-fashioned. It's also why I had to be prescriptive for my data science case; for example, it just didn't understand what quasi-Monte Carlo meant without additional prompting.

AI is known to hallucinate, and we see something of that here. You really have to know what you're doing to use AI generated code. If you blindly implement AI generated code, things are going to go badly for you very quickly.

Productivity

Code generation and support is a game changer. It ramps up productivity enormously. I've heard people say, it's like having a (free) senior engineer by your side. I agree. Despite the issues I've come across, code generation works "good enough".

Employment

This has obvious implications for employment. With AI code generation and with AI coding support, you need fewer software engineers/analysts/data scientists. The people you do need are those with more insight and the ability to spot where the AI generated code has gone wrong, which is bad news for more junior people or those entering the workforce. It may well be a serious problem for students seeking internships.

Let me say this plainly: people will lose their jobs because of this technology.

My take on the employment issue and what you can do

There's an old joke that sums things up. "A householder calls in a mechanic because their washing machine had broken down. The mechanic looks at the washing machine and rocks it around a bit. Then the mechanic kicks the machine. It starts working! The mechanic writes a bill for $200. The householder explodes, '$200 to kick a washing machine, this is outrageous!'. The mechanic thinks for a second and says, 'You're quite right. Let me re-write the bill'. The new bill says 'Kicking the washing machine $1, knowing where to kick the washing machine $199'." To put it bluntly, you need to be the kind of mechanic that knows where to kick the machine.

(You've got to know where to kick it. LG전자, CC BY 2.0, via Wikimedia Commons)

Code generation has no insight. It makes errors. You have to have experience and insight to know when it's gone wrong. Not all human software engineers have that insight.

You should be very concerned if:

You're junior in your career or you're just entering the workforce.
You're developing BI-type apps as the main or only thing you do.
There are many people doing exactly the same software development work as you.

If that applies to you, here's my advice:

Use code generation and code support. You need to know first hand what it can do and the threat it poses. Remember, it's a productivity boost and the least productive people are the first to go.
Develop domain knowledge. If your company is in the finance industry, make sure you understand finance, which means knowing the legal framework etc. If it's in drug discovery, learn the principles of drug discovery. Get some kind of certification (online courses work fine). Apply your knowledge to your work. Make sure your employer knows it.
Develop specialist skills, e.g. statistics. Use those skills in your work.
Develop human skills. This means talking to customers, talking to people in other departments.

Some takeaways

AI generated code is good enough for use, even in more complicated cases.
It's a substantial productivity boost. You should be using it.
It's a tool, not a magic wand. It does get things wrong and you need to be skilled enough to spot errors.

Thursday, August 3, 2023

Using ChatGPT for real to interpret text

What's real and what isn't with ChatGPT?

There's a huge amount of hype surrounding ChatGPT and I've heard all kinds of "game changing" stories around it. But what's real and what's not?

In this blog post, I'm going to show you one of the real things ChatGPT can do: extract meaning from text. I'll show you how well it performs, discuss some of its shortcomings, and highlight important considerations for using it in business. I'm going to do it with real code and real data.

We're going to use ChatGPT to extract meaning from news articles, specifically, two articles on the Women's World Cup.

D J Shin, CC BY-SA 3.0, via Wikimedia Commons. I for one, welcome our new robot overlords...

The Women's World Cup

At the time of writing, the Women's World Cup is in full swing and England have just beaten China 6-1. There were plenty of news stories about it, so I took just two and tried to extract structured, factual data from the articles.

Here are the two articles:

BBC Sports: https://www.bbc.com/sport/football/66369572 - I used the article text extracted from the web page, which notably doesn't include the final score.
Guardian Sports: https://www.skysports.com/football/news/12016/12931883/lauren-james-forward-says-her-performance-in-englands-6-1-womens-world-cup-win-over-china-what-dreams-are-made-of - I used the article text extracted from the web page.

Here is the data I wanted to pull out of the text:

The sport being played
The competition
The names of the teams
Who won
The score
The attendance

I wanted it in a structured format, in this case, JSON.

Obviously, you could read the articles and just extract the information, but the value of ChatGPT is doing this at scale, to scan thousands or millions of articles to search for key pieces of information. Up until now, this has been done by paying people in the developing world to read articles and extract data. ChatGPT offers the prospect of slashing the cost of this kind of work and making it widely available.

Let's see it in action.

Getting started

This example is all in Python and I'm assuming you have a good grasp of the language.

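Download the OpenAI library. The code in this post uses the ChatCompletion interface from the pre-1.0 releases of the library, so install one of those:

pip install "openai<1"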

Register for OpenAI and get an API key. At the time of writing, you get $5 in free credits and this tutorial won't consume much of that $5.

You'll need to set your API key in your code. To get going, we'll just paste it into our Python file:

import openai

openai.api_key = "YOUR_KEY"

You should note that OpenAI will rescind any keys they find on the public internet. My use of the key in code is very sloppy from a security point of view. Only do it to get started.
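Once you're past the getting-started stage, one common alternative is to read the key from an environment variable. Here's a minimal sketch; the helper name get_api_key is made up for this example, and OPENAI_API_KEY is assumed as the variable name.

```python
import os


def get_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the API key from the environment instead of the source file."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage, with the openai import from above:
# openai.api_key = get_api_key()
```

This keeps the key out of any file you might commit or share.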

Some ChatGPT basics

We're going to focus on just one part of ChatGPT, the ChatCompletion API. Because there's some complexity here, I'm going to go through some of the background before diving into the code.

To set the certainty of its answers, ChatGPT has a concept of "temperature". This parameter controls how much randomness goes into an answer; the lower the number, the more repeatable and "sure" the answer. A more certain answer comes at the price of creativity, so for some applications, you might want to choose a higher temperature (for example, you might want a higher temperature for a chatbot). The API accepts temperatures from 0 to 2, and we'll use 0 for this example because we want highly reliable analysis.

There are several ChatGPT models each with a different pricing structure. As you might expect, the larger and more recent models are more expensive, so for this tutorial, I'm going to use an older and cheaper model, "gpt-3.5-turbo", that works well enough to show what ChatGPT can do.

ChatGPT works on a model of "roles" and "messages". Roles are the actors in a chat; for a chatbot there will be a "user" role, which is the human entering text, an "assistant" role which is the chat response, and a "system" role controlling the assistant. Messages are the text from the user or the assistant or a "briefing" for the system. For a chatbot, we need multiple messages, but to extract meaning from text, we just need one. To analyze the World Cup articles, we only need the user role.
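The roles and messages described above are just a list of dictionaries. Here's a small sketch of the two shapes: a single user message for text extraction, and a longer list for a chatbot. The helper build_messages and the example conversation are invented for illustration.

```python
def build_messages(user_text, system_text=None, history=()):
    """Assemble a messages list: optional system briefing, history, new user text."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    # Earlier turns alternate between "user" and "assistant".
    messages.extend(history)
    messages.append({"role": "user", "content": user_text})
    return messages


# Text extraction: one user message is enough.
single = build_messages("In the following text, what sport is being played: ...")

# Chatbot: a system briefing plus the conversation so far.
chat = build_messages(
    "And who won?",
    system_text="You are a concise football assistant.",
    history=[
        {"role": "user", "content": "Which teams played?"},
        {"role": "assistant", "content": "England and China."},
    ],
)
print([m["role"] for m in chat])  # ['system', 'user', 'assistant', 'user']
```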

To get an answer, we need to pose a question or give ChatGPT an instruction on what to do. That's part of the "content" we set in the messages parameter. The content must contain the text we want to analyze and instructions on what we want returned. This is a bigger topic and I'm going to dive into it next.

Prompt engineering part 1

Setting the prompt correctly is the core of ChatGPT and it's a bit of an art, which is why it's been called prompt engineering. You have to very carefully write your prompt to get the results you expect.

Oddly, ChatGPT doesn't separate the text from the query; they're all bundled together in the same prompt. This means you have to clearly tell ChatGPT what you want to analyze and how you want it analyzed.

Let's start with a simple example. Imagine you want to know how many times the letter "e" occurs in the text "The kind old elephant." Here's how you might write the prompt:

f"""In the following text, how often does the letter e occur:

"The kind old elephant"

"""

This gives us the correct answer (3). We'll come back to this prompt later because it shows some of the pitfalls of working with ChatGPT. In general, we need to be crystal clear about the text we want the system to analyze.

Let's say we wanted the result in JSON. Here's how we might write the prompt:

f"""

In the following text, how often does the letter e occur, write your answer as JSON:

"The kind old elephant"

"""

Which gives us {"e": 3}

We can ask more complex questions about some text, but we need to very carefully lay out the query and distinguish between text and questions. Here's an example.

prompt = f"""
In the text indicated by three back ticks answer the \
following questions and output your answer as JSON \
using the key names indicated by the word "key_name" \
1) how often does the letter e occur key_name = "letter" \
2) what animal is referred to key_name = "animal" \
```The kind old elephant```
"""

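If you're asking many questions of many texts, it can help to build this kind of prompt from a template instead of typing it by hand. The following is one possible sketch; the function name build_prompt and the choice to replace back ticks in the input text are assumptions made for this example.

```python
DELIMITER = "`" * 3  # three back ticks, as in the prompt above


def build_prompt(text: str, questions: dict[str, str]) -> str:
    """Build an extraction prompt from {key_name: question} pairs."""
    # Replace back ticks in the text so it cannot close the delimiter early.
    safe_text = text.replace("`", "'")
    numbered = [
        f'{i}) {question} key_name = "{key}"'
        for i, (key, question) in enumerate(questions.items(), start=1)
    ]
    header = (
        "In the text indicated by three back ticks answer the "
        "following questions and output your answer as JSON "
        'using the key names indicated by the word "key_name"'
    )
    return "\n".join([header, *numbered, f"{DELIMITER}{safe_text}{DELIMITER}"])


prompt = build_prompt(
    "The kind old elephant",
    {
        "letter": "how often does the letter e occur",
        "animal": "what animal is referred to",
    },
)
print(prompt)
```

The template keeps the instructions, the questions, and the text to analyze clearly separated, whatever text you pass in.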
Using ChatGPT

Let's put what we've learned together and build a ChatGPT query to ask questions about the Women's World Cup. Here's the code using the BBC article.

world = """
Lauren James produced a sensational individual
performance as England entertained to sweep aside
China and book their place in the last 16 of the
Women's World Cup as group winners.
It was a display worthy of their status as European
champions and James once again lit the stage alight
in Adelaide with two sensational goals and three assists.
The 13,497 in attendance were treated to a masterclass
from Chelsea's James, who announced her arrival at the
World Cup with the match-winner against Denmark on Friday.
She helped England get off to the perfect start when
she teed up Alessia Russo for the opener, and
later slipped the ball through to Lauren Hemp to
coolly place it into the bottom corner.
It was largely one-way traffic as England dominated
and overwhelmed, James striking it first time into
the corner from the edge of the box to make it 3-0
before another stunning finish was ruled out by video
assistant referee (VAR) for offside in the build-up.
China knew they were heading out of the tournament
unless they responded, so they came out with more
aggression in the second half, unnerving England
slightly when Shuang Wang scored from the penalty
spot after VAR picked up a handball by defender
Lucy Bronze.
But James was not done yet - she volleyed Jess Carter's
deep cross past helpless goalkeeper Yu Zhu for
England's fourth before substitute Chloe Kelly and
striker Rachel Daly joined the party.
England, who had quietly gone about their business
in the group stages, will have raised eyebrows with
this performance before their last-16 match against
Nigeria on Monday, which will be shown live on
BBC One at 08:30 BST.
China are out of the competition after Denmark beat
Haiti to finish in second place in Group D.
England prove worth without Walsh
Manager Sarina Wiegman kept everyone guessing when
she named her starting XI, with England fans
anxiously waiting to see how they would set up
without injured midfielder Keira Walsh.
Wiegman's response was to unleash England's attacking
talent on a China side who struggled to match them
in physicality, intensity and sharpness.
James oozed magic and unpredictability, Hemp used her
pace to test China's defence and captain Millie Bright
was ferocious in her tackling, winning the ball back
on countless occasions.
After nudging past Haiti and Denmark with fairly
underwhelming 1-0 wins, England were keen to impose
themselves from the start. Although China had chances
in the second half, they were always second best.
Goalkeeper Mary Earps will be disappointed not to keep
a clean sheet, but she made two smart saves to deny
Chen Qiaozhu.
While England are yet to meet a side ranked inside
the world's top 10 at the tournament, this will help
quieten doubts that they might struggle without the
instrumental Walsh.
"We're really growing into the tournament now," said
captain Bright. "We got a lot of criticism in the first
two games but we were not concerned at all.
"It's unbelievable to be in the same team as
[the youngsters]. It feels ridiculous and I'm quite
proud. Players feeling like they can express themselves
on the pitch is what we want."
James given standing ovation
The name on everyone's lips following England's win
over Denmark was 'Lauren James', and those leaving
Adelaide on Tuesday evening will struggle to forget
her performance against China any time soon.
She punished China for the space they allowed her on
the edge of the box in the first half and could have
had a hat-trick were it not for the intervention of VAR.
Greeted on the touchline by a grinning Wiegman,
James was substituted with time to spare in the second
half and went off to a standing ovation from large
sections of the stadium.
"She's special - a very special player for us and
for women's football in general," said Kelly. "She's
a special talent and the future is bright."
She became only the third player on record (since 2011)
to be directly involved in five goals in a Women's
World Cup game.
With competition for attacking places in England's
starting XI extremely high, James has proven she is
far too good to leave out of the side and is quickly
becoming a star at this tournament at the age of 21.
"""

prompt = f"""
In the text indicated by three back ticks answer the \
following questions and output your answer as JSON \
using the key names indicated by the word "key_name" \
1) What sport was being played? key_name="sport" \
2) What competition was it? key_name="competition" \
3) What teams were playing? key_name = "teams" \
4) Which team won? key_name = "winner" \
5) What was the final score? key_name = "score" \
6) How many people attended the match? key_name = "attendance" \
```{world}```
"""

# The older, cheaper model chosen for this tutorial
model = "gpt-3.5-turbo"
# A single user message carries both the instructions and the article text
messages = [{"role": "user", "content": prompt}]
# Temperature 0 asks for the most repeatable answer
response = openai.ChatCompletion.create(
    model=model,
    messages=messages,
    temperature=0,
)
print(response.choices[0].message["content"])

Here are the results this code produces:

{
    "sport": "Football",
    "competition": "Women's World Cup",
    "teams": "England and China",
    "winner": "England",
    "score": "England 5 - China 1",
    "attendance": 13497
}

This is mostly right, but not quite. The score was actually 6-1. Even worse, the results are very sensitive to the text layout; changing line breaks changes the score.

I ran the same query, but with the Guardian article instead and here's what I got:

{
    "sport": "football",
    "competition": "World Cup",
    "teams": "England and China",
    "winner": "England",
    "score": "6-1",
    "attendance": null
}

With a better prompt, it might be possible to get better consistency and remove some of the formatting inconsistencies. By analyzing multiple articles on the same event, it may be possible to increase the accuracy still further.
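As a rough sketch of the multiple-article idea, the code below takes the two JSON answers shown above, irons out the formatting differences, and reports where the answers agree and where they conflict. The function names and the normalization rules are assumptions made for this example; with only two articles, a disagreement can be flagged but not settled.

```python
import json
import re
from collections import Counter


def normalise(key, value):
    """Reduce formatting differences so equal answers compare equal."""
    if value is None:
        return None
    if key == "score":
        # Keep only the numbers: "England 5 - China 1" becomes "5-1".
        return "-".join(re.findall(r"\d+", str(value)))
    if key == "attendance":
        digits = re.sub(r"\D", "", str(value))
        return int(digits) if digits else None
    if isinstance(value, list):
        value = " and ".join(str(item) for item in value)
    return str(value).strip().lower()


def combine(responses):
    """Return (agreed, conflicts) from several JSON answers about one event."""
    votes = {}
    for raw in responses:
        for key, value in json.loads(raw).items():
            cleaned = normalise(key, value)
            if cleaned is not None:
                votes.setdefault(key, Counter())[cleaned] += 1
    agreed, conflicts = {}, {}
    for key, counter in votes.items():
        (top, count), *rest = counter.most_common()
        if rest and rest[0][1] == count:
            conflicts[key] = sorted(counter, key=str)  # tie: no clear winner
        else:
            agreed[key] = top
    return agreed, conflicts


bbc = """{"sport": "Football", "competition": "Women's World Cup",
"teams": "England and China", "winner": "England",
"score": "England 5 - China 1", "attendance": 13497}"""
second = """{"sport": "football", "competition": "World Cup",
"teams": "England and China", "winner": "England",
"score": "6-1", "attendance": null}"""

agreed, conflicts = combine([bbc, second])
print(agreed)
print(conflicts)
```

Here the sport, teams, winner, and attendance come out as agreed, while the score and the competition name are flagged as conflicts that need a third source or a human to resolve.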

Hallucinations

Sometimes ChatGPT gets it very wrong and supplies wildly wrong answers. We've seen a little of that with its analysis of the World Cup game, where it wrongly inferred a score of 5-1 when it should have been 6-1. But ChatGPT can get it wrong in much worse ways.

I ran the queries above with text from the BBC and The Guardian. What if I ran the query with no text at all? Here's what I get.

{
    "sport": "football",
    "competition": "World Cup",
    "teams": ["France", "Croatia"],
    "winner": "France",
    "score": "4-2",
    "attendance": "80,000"
}

Which is completely made up, hence the term hallucination.
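One cheap defence against this particular failure is to refuse to send a query when there's nothing real to analyze. Here's a minimal sketch; the function name and the 20-word threshold are arbitrary choices for illustration, and this guard does nothing about hallucinations on genuine text.

```python
def require_text(text: str, min_words: int = 20) -> str:
    """Return the cleaned text, or raise if there is too little to analyze."""
    cleaned = text.strip()
    word_count = len(cleaned.split())
    if word_count < min_words:
        raise ValueError(
            f"Only {word_count} words of text; refusing to query the model."
        )
    return cleaned


try:
    require_text("")
except ValueError as exc:
    print(exc)
```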

Prompt engineering part 2

Let's go back to my elephant example from earlier and write it this way:

prompt = f"""
In the following text, "The kind old elephant",
how often does the letter e occur
"""

model = "gpt-3.5-turbo"
messages = [{"role": "user", "content": prompt}]
response = openai.ChatCompletion.create(
    model=model,
    messages=messages,
    temperature=0,
)
print(response.choices[0].message["content"])

Here's what the code returns:

In the phrase "The kind old elephant," the letter "e" occurs 4 times.

Which is clearly wrong.

In this case, the problem is the placement of the text to be analyzed. Moving the text to the end of the prompt and being more explicit about what should be returned helps. Even simply adding the phrase "Give your answer as JSON" to the prompt fixes the issue.

This is why the precise form of the prompt you use is critical and why it may take several iterations to get it right.
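For a question like this one, the right answer is easy to compute directly, which makes it a handy test case when iterating on a prompt. The sketch below counts the letter in plain Python and checks a model reply against it; the helper check_answer is made up for this example, and it simply takes the first number that appears in the reply.

```python
import re

text = "The kind old elephant"
expected = text.lower().count("e")
print(expected)  # 3


def check_answer(model_reply: str, expected: int) -> bool:
    """Return True if the first number in the reply equals the expected count."""
    match = re.search(r"\d+", model_reply)
    return match is not None and int(match.group()) == expected


# The two replies quoted in this post:
print(check_answer('{"e": 3}', expected))  # True
print(check_answer('the letter "e" occurs 4 times.', expected))  # False
```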

What does all this mean?

The promise of ChatGPT

It is possible to analyze text and extract information from it. This is huge and transformative for business. Here are just a few of the things that are possible:

Press clippings automation.
Extraction of information from bills of lading.
Automated analysis of SEC filings.
Automated analysis of company formation documents.
Entity extraction.

We haven't even touched on some of the many other things ChatGPT can do, for example:

Language translation.
Summarization.
Report writing.

How to deliver on that promise

As I've shown in this blog post, the art is in prompt engineering. To get it right, you need to invest a good deal of time in getting your prompts just right and you need to test out your prompts on a wide range of inputs. The good news is, this isn't rocket science.
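Testing prompts on a wide range of inputs is easier with a small harness that scores the answers against known results. The sketch below is one way to do it. The extract argument stands for whatever function sends your prompt to the model and parses the JSON reply; fake_extract is a deliberately crude stand-in so the example runs without an API key, and the two labelled cases are made up.

```python
import re
from typing import Callable


def evaluate(
    extract: Callable[[str], dict], cases: list[tuple[str, dict]]
) -> dict[str, float]:
    """Return per-key accuracy of `extract` over (text, expected) cases."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for text, expected in cases:
        answer = extract(text)
        for key, value in expected.items():
            total[key] = total.get(key, 0) + 1
            correct[key] = correct.get(key, 0) + (answer.get(key) == value)
    return {key: correct[key] / total[key] for key in total}


def fake_extract(text: str) -> dict:
    """Stand-in extractor: finds a score by pattern, guesses the winner."""
    match = re.search(r"\d+-\d+", text)
    return {"score": match.group() if match else None, "winner": text.split()[0]}


cases = [
    ("England beat China 6-1 in Adelaide.", {"score": "6-1", "winner": "England"}),
    ("A 2-0 win took Spain through.", {"score": "2-0", "winner": "Spain"}),
]
print(evaluate(fake_extract, cases))  # {'score': 1.0, 'winner': 0.5}
```

Scoring each key separately shows which questions in a prompt are reliable and which need more work.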

The skills you need

The biggest change ChatGPT introduces is in the skill level required. Previously, doing this kind of analysis required a good grasp of theory and underlying libraries, and it took quite a lot of effort to build a system to analyze text. Not any more: the skill level has dropped precipitously. Previously, you needed a Ph.D.; now you don't. Now it's all about formulating a good prompt, and that's something a good analyst can do really well.

The bottom line

ChatGPT, and LLMs in general, are transformative. Any business that relies on information must know how to use them.

Tuesday, July 25, 2023

ChatGPT and code generation: be careful

I've heard bold pronouncements that Large Language Models (LLMs), and ChatGPT in particular, will greatly speed up software development with all kinds of consequences. Most of these pronouncements seem to come from 'armchair generals' who are a long way from writing code. I'm going to chime in with my real-world experiences and give you a more realistic view.

D J Shin, CC BY-SA 3.0, via Wikimedia Commons

I've used ChatGPT to generate Python code to solve some small-scale problems. These are things like using an API or doing some simple statistical analysis or chart plotting. Recently, I've branched out to more complex problems, which is where its limitations become more obvious.

In my experience, ChatGPT is excellent for generating code for small problems. It might not solve the problem completely, but it will automate most of the boring pieces and give you a good platform to get going. The code it generates is good, with some exceptions. It doesn't generate docstrings for functions, it's light on comments, and it doesn't always follow PEP8 layout, but it does lay out its code clearly and it uses functions well. The supporting documentation it creates is great; in fact, it's much better than the documentation most humans produce.

For larger problems, it falls down, sometimes badly. I gave it a brief to create code to demonstrate the Central Limit Theorem (CLT) using Bokeh charts with several underlying distributions. Part of the brief it did well and it clearly understood how to demonstrate the CLT, but there were problems I had to fix. It generated code for an out-of-date version of Bokeh, which required some digging and coding to fix; this could have been cured by simply adding comments about the versions of libraries it was using. It also chose some wrong variable names (it used the reverse of what I would have chosen). More importantly, it did some weird and wrong things with the data at the end of the process; I spotted its mistake in a few minutes and spent 30 minutes rewriting code to correct it. I had similar problems with other longer briefs I gave ChatGPT.
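The version comments mentioned above are easy to produce yourself. Here's a small sketch that builds a comment header from whatever is installed in your environment, ready to paste at the top of a script or into a brief; the function name versions_header and the package list are just examples.

```python
from importlib.metadata import PackageNotFoundError, version


def versions_header(packages):
    """Return comment lines recording the installed version of each package."""
    lines = []
    for name in packages:
        try:
            lines.append(f"# {name}=={version(name)}")
        except PackageNotFoundError:
            lines.append(f"# {name}: not installed")
    return "\n".join(lines)


print(versions_header(["bokeh", "numpy"]))
```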

Obviously, the problems I encountered could have been due to incomplete or ambiguous briefs. A solution might have been to spend time refining my brief until it gave me the code I wanted, but that may have taken some time. Which would have been faster, writing new detailed briefs or fixing code that was only a bit wrong?

More worryingly, I spotted what was wrong because I knew the output I expected. What if this had been a new problem where I didn't know what the result should look like?

After playing around with ChatGPT for a while, here are my takeaways:

ChatGPT code generation is about the level of a good junior programmer.
You should use it as a productivity boost to automate the boring bits of coding, a jump start.
Never trust the code and always check what it's doing. Don't use it when you don't know what the result should look like.
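One way to act on the last takeaway is to check generated code against a result you already know. For the CLT, theory says the means of samples of size n have a standard deviation of σ/√n, so any simulation can be compared with that number. The sketch below does this for a uniform distribution; the sample size, number of trials, and function name are arbitrary choices for illustration.

```python
import random
import statistics


def sample_means(draw, sample_size, trials, seed=0):
    """Return the mean of each of `trials` samples drawn with `draw(rng)`."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(trials)
    ]


n = 30
means = sample_means(lambda rng: rng.random(), sample_size=n, trials=2000)

# Uniform(0, 1) has mean 1/2 and standard deviation sqrt(1/12).
expected_sd = (1 / 12) ** 0.5 / n ** 0.5
observed_sd = statistics.stdev(means)
print(f"expected {expected_sd:.4f}, observed {observed_sd:.4f}")
```

If the observed value is far from the expected one, something in the code is wrong, whoever or whatever wrote it.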

Obviously, this is ChatGPT today and the technology isn't standing still. I would expect future versions to improve on commenting etc. What will be harder is the brief. The problem here isn't the LLM, it's with the person writing the brief. English is a very imperfect language for detailed specifications which means we're stuck with ambiguities. I might write what I think is the perfect brief, only to find out I've been imprecise or ambiguous. Technology change is unlikely to fix this problem in the short term.

Of course, other industries have gone through similar disruptive changes in the past. The advent of CAD/CAM didn't mean the end of factory work; it raised productivity at the expense of requiring a higher skill set. The workers with the higher skill set gained, and those with a lesser skill set lost out.

In my view, here's how things are likely to evolve. LLMs will become standard tools to aid data scientists and software developers. They'll be productivity boosters that will require a high skill set to use. The people most negatively impacted will be junior staff or the less skilled; the people who gain the most will be those with experience and a high skill level.