Showing posts with label AI. Show all posts

Thursday, June 11, 2026

AI Winters

I've been doing some reading about the history of AI and I found out some things that were new to me. There's a consistent pattern of over-promising, under-delivering, and ludicrous press hype, all with disastrous consequences. There have been at least two dramatic falls in funding over the last decades, called AI winters.

(Gemini's view of an AI winter)

First AI Winter (1974-1980)

The concept of AI has been around for a while, but the first real demonstrations came with the perceptron experiments in the 1950s. The press ran wild with speculation and massively over-hyped the technology; famously, the New York Times forecast that conscious, self-replicating robots were just around the corner. The perceptron was a great start, but the technology didn't progress very far or very quickly. Of course, it couldn't live up to the hype.

In the early 1960s, researchers spent a great deal of time and money on machine translation, most notably from Russian into English, with the Cold War obviously providing the money and the motivation. Unfortunately, the methods and the computing power just weren't there and the results were very disappointing, certainly nowhere near the level needed to be useful and nowhere near the level needed for funding to continue.

Despite these setbacks, money still flowed into AI research. Eventually, governments started to take an interest in whether their money was producing results, which, frankly, it wasn't. In 1973, the British Government published the Lighthill report, a devastating assessment of the whole field, and as a result, the UK Government withdrew almost all funding. In the US, government agencies produced similar analyses with similar effects.

Over the next few years, no breakthroughs came, which seemed to justify official skepticism. Despite the lack of breakthroughs, AI research continued, with membership of AI research organizations increasing.

Second AI Winter (1987-2000)

By the early 1980s, things had changed. Japan had risen as an industrial power and its powerful industry ministry, MITI, had made waves in the West. When MITI decided to fund "Fifth Generation Computer Systems" that were supposed to deliver AI, Western governments got a dose of FOMO. At the same time, LISP was having its moment in the sun, driven by early successes in expert systems. Government funding came back and entrepreneurs founded companies to exploit the new technology. Notably, there were a number of companies producing LISP-specific hardware.

Once again, it was a false dawn. By the late 1980s, cheaper and more capable general-purpose workstations had arrived, and LISP was ported to these machines. In turn, this led to the collapse of the LISP-specific machine market and of the companies making these machines. Investors took note.

Expert systems generally ran into trouble. Outside a few domains, they weren't that successful and there were no breakthroughs. The idea limped on with a few variants, but only had minor successes.

The mighty MITI suffered a setback when progress on Fifth Generation systems was a lot slower than it had expected or wanted. In 1992, it quietly closed the project.

By the early 1990s, AI had a bad reputation again. It had suffered two hype-driven booms and had failed to deliver twice. Investors were skittish, so investment dried up. Governments spent their research money elsewhere and universities focused on other areas. But there were people still working in the area and working on new ideas. Later on, those ideas would bear fruit spectacularly.

What does this mean?

It's a cliche to say "this time, it's different", but so far it is. Yes, the technology is hyped, but the business benefits are obvious, the skeptical voices are louder, and the hype isn't as foolish as it was before. Comparing the press coverage from the late 1950s to now, you see hype, but it's more grounded in reality and there are fewer flights of fancy (no talk of self-replicating robots).

Gartner have a nice model of adoption called the hype cycle. Here's a typical chart used to explain it, taken from Wikipedia. The chart's pretty self-explanatory, so I won't explain it.

AI's path is more complex than the simple hype cycle, but you can see the same general pattern. We're in high growth now, so it's likely we're in the "Slope of Enlightenment". Are things likely to slow down as we reach the "Plateau of Productivity"? Not any time soon.

Are we likely to see another AI Winter? Probably not, but if it does happen, my guess is it will be a combination of data center constraints plus government action plus human revolt.

Tuesday, April 7, 2026

Ralph for beginners

What's Ralph and why do you care?

Ralph is all about automating the code generating process. You can use it to build small applications while you eat your lunch and build bigger applications while you sleep. Apart from the initial setup, the skills required are mostly those of a product manager; specifically, the ability to write a detailed requirements document.

Why do we need another Ralph blog post?

I found it hard to get going with Ralph because the existing content was either too theoretical or not practical enough. I figured it out in the end, but I thought I could write something to help other people get going faster, so that's what you're reading.

The what and why of Ralph

LLMs have a limited context window, which means they can only do a limited amount of reasoning. In turn, this means LLMs have problems generating code for large or complex projects. In my experience, once the prompt gets beyond a page or two, the quality falls off and code gen starts to miss things. The net result is, you need to have a human in the loop to code or to prompt; the human spots places where code gen has failed and prompts the LLM to fix the issues.

Ralph solves the problem by slicing the whole project into "bite-size requirements" with acceptance criteria after each requirement. If code gen for a requirement doesn't meet its acceptance criteria, Ralph tries again. In this way, it constructs the project step-by-step until it's built all the requirements and so delivers the complete project. The entire process is automated and there's no human involvement.

(Gemini's view of the Ralph loop. A nice AI generated image about AI.)

The Ralph loop gets its name from The Simpsons character Ralph Wiggum. If you've never watched The Simpsons, here's what you need to know: Ralph is well-meaning, but intellectually slow. Imagine you're instructing Ralph on how to build something. You'd break down the project into chunks and have Ralph run tests to make sure each chunk was correct before moving on to the next chunk. Ralph would build the project piece-by-piece until the whole thing was finished. This way might be slow, but you'd get it done right.

Ralph Wiggum, Fair use, Link

AIs and CLI

To get Ralph to work, you'll need to install a code gen CLI on your local machine. The most common tutorials I've seen on the web use the Claude CLI, so install this if you don't have an existing code gen solution. I got Ralph working with Cursor via the Cursor CLI, so I know that works too. Whatever AI you choose, you'll need an active subscription; you're not going to do this for free.

Skills

Next up, you'll need to install a skills file for your LLM. If this were a normal blog post, I'd tell you exactly where to go to get the skills file, but I'm not going to do that. The Ralph world is changing so quickly, any links I give you will be out of date by the time you read this. You'll need to search to get the latest version of the Ralph skills file you need.

(Skills enable code generation tools to do specific tasks. If you don't know what a skills file is in the context of a code-generating LLM, take some time to find out before moving ahead.)

Git for the LLM to use

As I'll explain later, Ralph uses git, so you'll need a git account and you'll need to create a repo for this project. I used my GitHub account, so I know GitHub works fine for this.

The Product Requirements Document (PRD)

This is where the fun starts. You need to write a Product Requirements Document using Markdown. The PRD lists all the requirements, each requirement being a "bite-sized chunk". Here's an excerpt from a PRD.md file on my system.

MBTA-002: How the app appears to users

Description

    • The app will consist of three pages: "trains & alerts", "map & facilities", and "about".
    • It will be possible for the user to easily navigate between pages (e.g. using a tab control or buttons).

Acceptance criteria

    • There are three pages on the app: "trains & alerts", "map & facilities", and "about".
    • On each page, the user can navigate to the other pages using a control, e.g. a tab control or buttons.

Here's what's going on

  • The PRD consists of multiple sections like this one. Each section is a "bite-sized chunk" of functionality the LLM can generate code for. Think of the sections as individual requirements.
  • The section (or requirement) title includes an ID (MBTA-002) and a descriptive title.
  • The Description sub-section contains bullet points that describe the functionality you want. Remember, the point of the Ralph loop is to keep things simple, so keep the sub-section short.
  • The Acceptance criteria sub-section states the criteria the generated code must pass. If the code passes, the LLM moves onto the next requirement. If it doesn't pass, it repeats the code generation process (there's more to this I'll discuss later).

Anyone with good Product Management skills should be able to quickly build a PRD like this.

(In practice, the Acceptance criteria sub-section looks a lot like the Description sub-section. What I do is write up the Description sub-section, then ask my LLM to add acceptance criteria based on my Description. I then add in any new acceptance criteria I can think of.)

PRD.md to JSON

The Ralph loop processes a JSON file, so the next step is the production of a JSON file from the PRD.md file. This is done using the skill you installed earlier. It's a simple call to a bash script; on my Cursor installation, the script is called convert.sh.

The output is a long JSON file consisting of multiple records. Each record is a requirement taken from the PRD.md file. Here's the JSON record for the requirement in the previous section. 

    {
      "id": "2.1",
      "category": "ui",
      "story": "Build base template with BosWay branding and navigation between three pages.",
      "steps": [
        "Header shows BosWay and page context e.g. BosWay - about (MBTA-001).",
        "Add tabs or buttons to switch trains & alerts, map & facilities, about (MBTA-002)."
      ],
      "acceptance": "Three routes work; every page can reach the other two; titles consistent with PRD.",
      "priority": 3,
      "passes": false,
      "notes": ""
    },

This JSON record is so important, I'm going to ask you to take a closer look at it. You can see the Description and the Acceptance criteria here, albeit worded differently. The other three fields to look at are priority, passes, and notes.

  • priority tells the Ralph loop what to work on next (it starts with the highest priority and works down).
  • passes. This starts as false. If the LLM successfully implements the requirement, it sets this value to true.
  • notes. This contains notes for the LLM on the next pass through the loop. Say the generated code fails the Acceptance criteria: the notes field will then contain details on the failure. On the next pass of the loop, the LLM uses these notes to try and do better. What generates these notes? The LLM.

The fields in the JSON records are read and updated by the Ralph loop. There's no human in the loop. In practice, you probably won't even view the JSON file.

Once you've generated the JSON file, you're ready to run the Ralph loop.
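To make the role of these fields concrete, here's a minimal Python sketch of how a loop might pick its next requirement. This is my illustration, not the code in the skill's scripts: the function name next_requirement is made up, the second record is invented, and I'm assuming a lower priority number means "do this first", which may not match the convention your skill uses.

```python
import json

# Two records in the shape shown above, trimmed to the fields that matter here.
# The second record is invented purely for illustration.
PRD_JSON = """
[
  {"id": "2.1", "priority": 3, "passes": false, "notes": ""},
  {"id": "1.1", "priority": 1, "passes": true, "notes": ""}
]
"""


def next_requirement(records):
    """Return the unfinished record to work on next, or None if all pass.

    Assumption: a lower priority number means more urgent.
    """
    pending = [r for r in records if not r["passes"]]
    if not pending:
        return None
    return min(pending, key=lambda r: r["priority"])


records = json.loads(PRD_JSON)
todo = next_requirement(records)
print(todo["id"] if todo else "all requirements pass")
```

Record 1.1 already passes, so the only candidate left is 2.1, even though its priority number is larger.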

The Ralph loop

The Ralph loop takes the JSON file as input and processes the requirements one-by-one, starting with the most important. The bash file to do it is called start.sh on my system and it's a little complex. I'll talk through how it works at a high level, leaving out some advanced bits.

Before starting the loop, the code performs various checks, e.g. the JSON file exists, the git settings are correct and so on.

The script then moves on to the Ralph loop. Because the Ralph loop does a lot, I'm going to break it down piece-by-piece.

  • On each trip round the loop, the code starts with some checks. It checks if the process is rate-limited on the AI API or if there are other reasons why it can't continue.
  • From the JSON file, it reads the requirement with the highest priority where passes is false.
  • It passes this requirement to the AI API along with the current git code version.
  • The AI generates code, or changes the existing code, to meet the requirement.
  • The AI generates tests based on the acceptance criteria.
    • If the tests pass, the AI updates the JSON passes field to true.
    • If the tests fail, the AI may update the notes field to provide a hint how to do better next time round. (Remember, the passes field is false by default so it doesn't change the value if the loop fails.)
  • The loop saves the generated code to a local git branch.
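The steps above can be sketched in a few lines of Python. Treat this as a toy model under my own assumptions rather than what start.sh actually does: attempt is a hypothetical callable standing in for the LLM call plus the test run, a lower priority number is assumed to mean more urgent, and the git save and rate-limit checks are left out entirely.

```python
def ralph_loop(records, attempt, max_iterations=10):
    """Work through records until all pass or the iteration budget runs out.

    `attempt` is a hypothetical callable standing in for the LLM call plus
    the test run; it takes a record and returns (passed, notes).
    Assumption: a lower priority number means more urgent.
    Returns the number of iterations used.
    """
    for iteration in range(1, max_iterations + 1):
        pending = [r for r in records if not r["passes"]]
        if not pending:
            return iteration - 1
        record = min(pending, key=lambda r: r["priority"])
        passed, notes = attempt(record)
        if passed:
            record["passes"] = True
        else:
            record["notes"] = notes  # hint for the next pass
    return max_iterations


# A fake attempt function: requirement "2.1" fails once, then passes.
seen = set()


def fake_attempt(record):
    if record["id"] == "2.1" and record["id"] not in seen:
        seen.add(record["id"])
        return False, "navigation tabs missing on the about page"
    return True, ""


records = [
    {"id": "1.1", "priority": 1, "passes": False, "notes": ""},
    {"id": "2.1", "priority": 3, "passes": False, "notes": ""},
]
used = ralph_loop(records, fake_attempt)
print(used, [r["passes"] for r in records])
```

The point of the sketch is the shape of the loop: pick one unfinished requirement, try it, record either a pass or a note, and go round again.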

In the loop, there are some more advanced bits and pieces I'm going to briefly mention here that might be important to you:

  • There are API call timeouts.
  • You can set a maximum number of iterations to prevent the loop getting stuck and burning through your tokens.
  • There's a circuit breaker that can stop the loop if zero files are changed or if the same error is detected on multiple loops.
  • You can set a rate limit to prevent the LLM provider from banning you.

The main Ralph file (start.sh) calls several bash scripts to run checks etc.
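If the circuit breaker idea is new to you, here's a rough Python sketch of one way it could work. The class name, the method name, and the threshold are all my own inventions for illustration; the real scripts are bash and may decide "stuck" differently.

```python
class CircuitBreaker:
    """Trip after too many consecutive iterations with no progress.

    "No progress" here means zero files changed, or the same error as the
    previous iteration. The default threshold of 3 is an arbitrary choice.
    """

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.stalled = 0
        self.last_error = None

    def record(self, files_changed, error=None):
        """Log one iteration; return True if the loop should stop."""
        repeated = error is not None and error == self.last_error
        if files_changed == 0 or repeated:
            self.stalled += 1
        else:
            self.stalled = 0
        self.last_error = error
        return self.stalled >= self.threshold


breaker = CircuitBreaker(threshold=2)
print(breaker.record(files_changed=4, error="TypeError in app.py"))
print(breaker.record(files_changed=2, error="TypeError in app.py"))
print(breaker.record(files_changed=0))
```

The first call makes progress, the second repeats the same error, and the third changes no files, so only the third call reports that the loop should stop.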

When the Ralph loop finishes, you should have the code for your project. In practice, you'll need to tweak what you get back, but in my experience, you'll be very close.

How long the loop takes depends on the thoroughness of your PRD and the size of your project. As a general rule of thumb, a smallish project (e.g. building an interactive web app based on a simple data source) might take an hour.

Cost!

Ralph burns through API calls. Most LLM providers will give you a limited number of API calls per month which is separate from your token allocation. Even one Ralph project can burn through your entire API allocation. The bottom line is, Ralph can be an expensive thing to play with (low hundreds of USD to properly experiment). I suggest you think carefully about your projects and test Ralph in a considered way.

The reality

I've made it sound like the Ralph process is quite smooth. Right now, it isn't. There are bumps along the way: the setup process is a little complicated, the Ralph loop reporting needs a bit of user-friendly tweaking, the online descriptions aren't as helpful as they should be, and so on.

BUT.

It works and it works well.

My experience

It took me some effort to get Ralph up and running, but once I figured it out, it blew me away. It built an entire project without human intervention and it got it nearly right. Importantly, I realized the bits it missed were gaps in the PRD. In other words, I needed a better spec.

The Ralph loop changes the balance of skills in favor of a more detailed up-front spec that anyone with product management skills can write.

That's quite a profound change.

My recommendations

I do recommend you try a Ralph loop for yourself and I have some suggestions for making your experimentation easier.

  1. Allocate enough setup time to install skills etc. This can be frustrating, so be prepared.
  2. Choose a project you've done before. This means you know what the end result should be.
  3. Write a very detailed PRD as described above. Use an LLM to add acceptance criteria and add some of your own. Thoroughness here is key.
  4. Run the Ralph loop.
  5. Compare the Ralph results to your prior results.

Good luck!

Tuesday, March 10, 2026

Arthur C. Clarke and AI

The history of AI

I was looking over the history of AI and I was struck by how far ahead of the curve Arthur C. Clarke was. It's not just technical issues either; he was way ahead on the cultural impacts, as we'll see. Of course, Clarke was too optimistic about when AI would arrive, but I think we can forgive him that.

(ITU Pictures, CC BY 2.0, via Wikimedia Commons)

Clarke and AI in his fiction

Clarke wrote quite a lot about AI and computing. The most famous example is the psychopathic AGI HAL 9000 in the 1968 movie "2001: A Space Odyssey", but he had been writing about computing for some time. In 1953, he published "The Nine Billion Names of God", which has a computer as a central element, and there followed several novels and stories through the 1950s and 1960s. In 1979's "The Fountains of Paradise", one of the characters has a medical implant that can synthesize speech to call for help if the wearer has a medical emergency.

Clarke's AI futurism

Although he's mostly known today as a science fiction writer, Clarke also popped up on TV as a futurist, giving his thoughts on how technology might develop. This included speaking about AI and its implications. Listening to these recordings now is eye-opening as we'll see.

The first clip is from 1964. Some of his futurism is (way) off, but a surprising amount is accurate. I was going to just give you a link to the AI piece, but the whole clip is worth listening to.

Here's a Nova episode from 1978 about the new "thinking machines". Clarke's segments are worth viewing. He speaks at the start, and at 34:44, 36:27, and most importantly at 41:35. If you want a bit of a chill, go to 52:48.

If you didn't know these clips were from 1978 and had the transcript alone, when would you think they had been recorded?

Ahead of his time: society vs technology

I was at a conference in 2025 where experts were speaking on AI. Shockingly, they focused exclusively on technology without giving a moment's thought to the impact on employment and society. It's apparent to me that Arthur C. Clarke in 1978 had more foresight than some of the experts in 2025.

Given his foresight, it's slightly surprising Clarke didn't explore the themes of super-intelligent AIs displacing people in his fiction. It would have been interesting to read a Clarke novel with societal AI change as a backdrop. 

Wednesday, February 11, 2026

The perceptron

Why study the perceptron?

Perceptrons were one of the first learning systems and an important early stepping-stone to most recent AI innovations. That alone would be motivation enough to study them; however, the reaction of the press, and the consequences of the hype, are a cautionary tale for us in 2026.

I'm going to share with you the why and the how of the perceptron, with some of the consequences of the hype.

Why do we care about systems that learn?

Go back to the 1950s: why would you care about a system that can learn? There’s the obvious coolness of it, but there are important real-world applications.

Photo analysts study reconnaissance photos looking for hidden bunkers or other items of military significance. The work is tiring and boring at times, but it’s hard to automate because it relies on human interpretation rather than a hard and fast set of rules. The “enemy” constantly changes how they disguise their installations, so whoever or whatever is analyzing photos must continually learn.

A similar problem occurs in post offices. If a post office wants to automate letter sorting, it has to automate reading handwritten addresses. Each person’s handwriting is different, which means creating definitive rules about letter or number formation is hard.

A learning system can adapt itself to new information and so stay productive when things change. In practice, this means it can be taught to recognize a new way a country is disguising a bunker or a new way someone is writing the number 5. It doesn’t require its creators to continually tweak settings. Of course, these automated systems can process letters or images etc. much faster (and cheaper) than human beings, which makes them very attractive.

Given the demand existed, how can you create a system that learns?

How do biological systems learn?

The obvious learning systems are biological. By the 1950s, we’d made some progress understanding how brains work; in particular, we had a basic understanding of neurons, which are the lowest level of processing in the brain.

Neurons take sensory input signals from dendrites into the soma, where the input is “processed”. If the input signal crosses some threshold, the soma fires an output signal (an action potential) through an axon. Neurons learn by changing the way they “weight” different dendrite signals, so changing the conditions under which they fire. 

The output of one neuron could be the input to another neuron and real brains have layers of processing. 

The picture below shows the arrangement for a single neuron.

(Gemini)

My explanation of how neurons work is very simplistic and in reality, it’s much more complicated. In real brains, neurons learn together and there are other biological processes going on involving dendrites. If you want to read more about biological neurons, here are some good references:

The perceptron 

In 1957, at the Cornell Aeronautical Laboratory in Buffalo, New York, the psychologist Frank Rosenblatt was studying human learning (specifically, the neuron) and trying to replicate it in software and hardware. His team built a prototype system, called the perceptron, that could “learn” in a very limited sense. The learning task was simple image classification.

(Rosenblatt and the perceptron. National Museum of the U.S. Navy, Public domain, via Wikimedia Commons)

The Mark I Perceptron input was a 20x20 photocell array; a photocell is a very limited form of digital camera. These 400 inputs were fed to “association units” that weighted the inputs. The weights were set by potentiometers that were adjusted by electric motors. Importantly, the initial weights were random to avoid bias. The system summed the weighted inputs and used a simple threshold algorithm (response units) to decide the image classification: if the sum of the weighted signals was above the threshold, the algorithm output a signal (a true output); if the sum was below the threshold, it did not (a false output). Technically, the threshold function is a Heaviside step function. If the perceptron made an error, the relevant weights were adjusted. The perceptron required 50 training iterations to reliably distinguish between squares and triangles.

(From the perceptron user manual.)

In 2026, this sounds really basic, but in 1957 it was a breakthrough. Rosenblatt and his team had demonstrated that a machine could learn and change how it “sees” the world.

The perceptron theory

Here’s a simple representation of the perceptron. The inputs from the photocell are fed in and assigned weights. There’s a bias term to account for bias in the photocells, for example, the photocells might give a very small signal instead of zero when there’s no image. The weighted inputs (and the bias) are summed. If the weighted sum exceeds some threshold, the perceptron fires; if not, it doesn’t.

(The perceptron is a linear classifier, meaning it can only separate points using a hyperplane. In two dimensions, this means it can only separate points using a straight line.)

Mathematically, this is how it works.

\[u = \sum_i w_i x_i + b \]

\[y = f(u(x)) = \begin{cases} 1, & \text{if } u(x) > \theta \\ 0, & \text{otherwise} \end{cases}\]

In the vector notation used in machine learning, the equations are usually written:

\[y = h( \mathbf{w} \cdot \mathbf{x} + b ) \]

where h is the Heaviside step function and the threshold \(\theta\) has been absorbed into the bias \(b\).

So far, this is pretty simple, but how does it learn? Rosenblatt insisted on starting training from a random state, so that gives us a starting point. Then we expose the perceptron to some training data where we know what the output should be (the data is labeled). Here’s how we update the weights:

\[ w_i  \leftarrow w_i + \Delta w_i \]

\[ \Delta w_i = \eta(t - o)x_i \]

where:

  • \(t\) is the target or correct output
  • \(o\) is the measured output
  • \(\eta\) is the training rate and \( 0 \lt \eta \leq 1\)

We update the weights and try again in an iterative loop. This continues until we can successfully predict the training data set within a certain error, or we’ve reached a set number of iterations, or we’re seeing no improvement. This is similar to how machine learning systems work today.
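Here's the update rule as a small, plain-Python sketch, trained on logical AND. A few things in it are my choices rather than Rosenblatt's: I start from zero weights instead of random ones so the run is repeatable, I fold the threshold into the bias (so the unit fires when the sum is above zero), I update the bias as if it were a weight on a constant input of 1, and the function names are made up.

```python
def predict(weights, bias, x):
    """Heaviside step on the weighted sum: 1 if w.x + b > 0, else 0."""
    u = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if u > 0 else 0


def train(samples, eta=0.5, max_epochs=100):
    """Apply delta w_i = eta * (t - o) * x_i until an epoch has no errors."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0  # zeros, not random, so runs are repeatable
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, t in samples:
            o = predict(weights, bias, x)
            if o != t:
                errors += 1
                weights = [w + eta * (t - o) * xi for w, xi in zip(weights, x)]
                bias += eta * (t - o)  # bias as a weight on a constant input of 1
        if errors == 0:
            return weights, bias, epoch
    return weights, bias, max_epochs


# Logical AND is linearly separable, so the rule converges.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias, epochs = train(AND)
print([predict(weights, bias, x) for x, _ in AND])
```

The stopping conditions mirror the ones described above: the loop ends when a full pass over the training data produces no errors, or when it hits the maximum number of epochs.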

Perceptron problems

There were lots of issues with the perceptron in its original form. Let’s start with the worst: the hype.

Rosenblatt gave interviews to the press on his system and they ran with it, but not in a good way. A 1958 New York Times article was typical. The headline read “NEW NAVY DEVICE LEARNS BY DOING; Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser”, with the lede: “The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” Other press stories were similarly sensational and hyped the technology. The press very much set the expectation that walking, talking AIs were just around the corner. Of course, the technology couldn’t deliver what the press forecast, which helped lead to a loss of confidence.

The technical problems varied from the straightforward to the severe.

The original perceptron used a simple threshold to decide whether to fire or not, but this caused problems for training weights. Most important training algorithms use derivatives (for example, gradient descent). A simple threshold isn’t differentiable, which means it can’t be used in these kinds of training algorithms. Fortunately, this is relatively easy to fix using a differentiable function to replace the simple threshold. There are a number of possible differentiable functions, and a popular choice is the sigmoid function. (The function that decides whether to fire or not is now called the activation function).
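To see the difference, here's a short Python sketch comparing the step function with the sigmoid and its derivative. The function names are mine; the formulas are the standard ones, with the sigmoid defined as 1 / (1 + e^-u).

```python
import math


def step(u):
    """Heaviside step: the slope is zero everywhere except the jump at 0."""
    return 1.0 if u > 0 else 0.0


def sigmoid(u):
    """Smooth replacement for the step: 1 / (1 + e^-u)."""
    return 1.0 / (1.0 + math.exp(-u))


def sigmoid_derivative(u):
    """d/du sigmoid(u) = sigmoid(u) * (1 - sigmoid(u))."""
    s = sigmoid(u)
    return s * (1.0 - s)


for u in (-2.0, 0.0, 2.0):
    print(u, step(u), round(sigmoid(u), 3), round(sigmoid_derivative(u), 3))
```

The step function gives a training algorithm nothing to work with, because a small change in the weights usually changes nothing in the output. The sigmoid has a non-zero slope everywhere, so gradient-based methods always get a signal about which way to nudge the weights.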

A more serious problem is the logical limitations of the simple perceptron. As Minsky and Papert showed in 1969, there are some logical structures (most notably, XOR) you can’t build using the simple single-layer perceptron architecture. Although multi-layer networks solve these problems, the Minsky and Papert book and their papers significantly damaged research in this area, as we'll see.
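The XOR limitation is easy to poke at in code. The Python sketch below is my own illustration, not anything from Minsky and Papert: it tries every single threshold unit with integer weights between -3 and 3 and finds none that reproduces XOR, then wires up three such units in two layers with hand-picked weights that do. The finite search is a demonstration, not a proof.

```python
from itertools import product


def unit(w1, w2, b, x1, x2):
    """One threshold unit: fires (1) when w1*x1 + w2*x2 + b > 0."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0


INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
XOR = [0, 1, 1, 0]

# Try every single unit with integer weights in [-3, 3]: none reproduces XOR.
grid = range(-3, 4)
single_layer_hits = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if [unit(w1, w2, b, *x) for x in INPUTS] == XOR
]


def xor_two_layer(x1, x2):
    """XOR as AND(OR, NAND), each built from one threshold unit."""
    h_or = unit(1, 1, 0, x1, x2)        # fires if either input is 1
    h_nand = unit(-1, -1, 2, x1, x2)    # fires unless both inputs are 1
    return unit(1, 1, -1, h_or, h_nand)  # fires only if both hidden units fire


print(len(single_layer_hits), [xor_two_layer(*x) for x in INPUTS])
```

The reason no single unit works is geometric: the two inputs that should fire, (0, 1) and (1, 0), can't be separated from (0, 0) and (1, 1) by one straight line.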

This is only a summary of the difficulties the perceptron faced. For a fuller description, check out: https://yuxi.ml/essays/posts/perceptron-controversy/

What happened next

By the early 1970s, the hype bubble had burst. Minsky and Papert’s book had an impact and governments found disappointing results from funding perceptron-based projects; projects promised big results, but in reality, very little was produced. Governmental patience wore thin and eventually they concluded this form of AI research wasn't worth funding. The research money went elsewhere, leading to the first “AI Winter”, which lasted for a decade or so.

Sadly, AI experienced another hype bubble and collapse in the late 1980s, a second "AI Winter". As a whole, AI research began to get a bad reputation.

The “AI Winters” bled talent and money away from neural network development, but research continued. Although multi-layer networks had been developed by the 1960s, it wasn’t known how to train them until the Rumelhart, Hinton, and Williams 1986 paper “Learning representations by back-propagating errors” [https://www.nature.com/articles/323533a0] popularized the back propagation method. Convolutional Neural Networks (CNNs) using back propagation and a convolutional structure were demonstrated in 1989. With these technologies as the backbone, LLMs were developed starting in the mid-to-late 2010s. It’s only the enormous success of LLMs that has brought a flood of money into AI research and a resurgence of interest in its origins.

Rosenblatt had a wide variety of research interests, including astronomy and photometry (measuring light). By any measure he was a genius. Unfortunately, in 1971 he died at the age of 43 in a boating accident. His death was just a few years into the first "AI Winter", so he saw the hype and the subsequent bubble bursting. Sadly, he never got to see how the field eventually developed.

Thoughts on the story

The original perceptron was very much based on what had gone before, but it was a breakthrough and ahead of its time, which was part of the problem. The necessary technology wasn’t there to advance quickly. Unfortunately, the hype in the press, fed by Rosenblatt and others, set unrealistic expectations. While great for short-term research funding, it was terrible for the long term when the hype bubble burst.

AI as a whole has been prone to hype cycles through its entire existence. It's no wonder there's a lot of discussion online about the latest AI bubble bursting. My feeling is, it is different this time, but we're still in a bubble and people are going to get hurt when it eventually pops.

Monday, February 9, 2026

Learning by hand is better than learning by AI

Accelerating learning with AI?

Recently, I've been learning a new LLM API from a vendor. There's a ton of documentation to wade through to get to what I need to know and the vendor's examples are overly detailed. In other words, it's costly to figure out how to use their API.

(Gemini)

I decided to use code gen to get me up and running quickly. In the process, I found out how to speed up learning, but equally important, I found out what not to do.

Code gen everywhere!

My first thought was to code gen the entire problem and figure out what was going on from the code. This didn't work so well.

The code worked and gave me the answer I expected, but there were two problems. Firstly, the code was bloated and secondly, it wasn't clear why it was doing what it was doing. The bloated code made it hard to wade through and zero in on what I wanted. It wasn't clear to me why it had split something into two operations, despite code gen commenting the code. Because I didn't know the vendor's API, I couldn't be sure the code was correct; it didn't look right, but was it?

Hand coding wins - mostly

I recoded the whole thing by hand the old-fashioned way, but using the generated code as an inspiration (what function to call and what arguments to use). I tried the LLM calls in the way I thought they should work, but the code didn't work the way I thought it would. On the upside, the error message I got was very helpful and I tracked down why it didn't work. Now I knew why code gen had made two LLM calls instead of one and I knew what outputs and inputs I should use.

The next step was properly formatting the final output. Foolishly, I tried code gen again. It gave me code, but once again, I couldn't follow why it was doing what it was doing. I went back looking at the data structure in detail and moved forward by hand.

But code gen was still helpful. I used it to help me fill in API argument calls and to build a Pydantic data structure. I also used it to format my code. Yes, this isn't as helpful as I'd hoped, but it's still something and it still made things easier for me.

Why code gen didn't work fully

Code gen created functioning code, not tutorial code, so the comments it generated weren't appropriate to learn what was going on and why.

Because I didn't know the API, I couldn't tell if code gen was correct. As it turned out, code gen produced code that was overly complex, but it was correct.

Lessons

This experience crystallized some other experiences I've had with AI code gen.

If I didn't care about understanding what's going on underneath, code gen would be OK. It would work perfectly well for a demo. Where things start to go wrong is if you're building a production system where performance matters or a system that will be long-lived - in these cases the why of coding matters.

Code generation is an accelerator if you know what you're doing. If you don't know the libraries (or language) you're using, you're on thin ice. Eventually, something bad is going to happen and you won't know how to fix it.

Wednesday, January 14, 2026

Replit vs. Cursor - who wins?

Building Business Apps - Cursor vs. Replit

For a while now, I've been very interested in using AI to build BI-type apps. I know you can do it with Cursor, but it requires a strong technical background. I've heard people have had great success with Replit, so I thought I would give it a go. I decided to build the same app in both Cursor and Replit. It's a kind of battle of the tools.

(Gemini.)

For my comparison contest, I chose to build a simple app that shows the weather and news for a given location.

Round 1: getting started/ease of use

I gave both contenders the same prompt and asked them to build me an app. Both tools gave me an app in about the same time. However, I found Replit much, much easier to use; by contrast, Cursor can be tough to get started with.

Round 1 is a decisive victory for Replit.

Round 2: building the app

Both apps had problems and I needed to tweak them to get them working. I found I had to give Replit multiple prompts to fix problems; problems that just didn't occur in Cursor. Replit got stuck on some simple things and I had to get creative with prompting to get round them, all the while my AI token consumption went up. Cursor didn't need this level of imaginative prompting.

I'm giving this round to Cursor on points.

Round 3: editing the visual layout

Replit let me edit the visual layout of the app directly, while Cursor did not. I know Cursor has a visual editor, but I just couldn't get it to work. This is of course an ease of use thing, and overall, Replit is easier. For this app, I didn't need to tweak the layout but it's an important consideration. 

Round 3 is a decisive victory for Replit.

Round 4: what is the app doing?

I wanted to know what the apps were doing "under the hood" so I wanted to see the code. Cursor is unashamedly a code editor, so it was simple. By contrast, Replit hides the code away and it requires a bit of digging. On a related theme, Cursor is much better at debugging, so it's easier to track down errors.

Round 4 is a victory for Cursor.

Round 5: changing the app under the hood

I wanted to change the app "under the hood", which meant changing some of the code. Cursor generates code that's very well commented, so it's easy to see what's going on. By contrast, Replit's code is sparsely commented and I found it difficult to understand what each file did. Bear in mind though, Replit is trying to be an app creation tool not a code editor.

Round 5 is a victory for Cursor.

Round 6: running the app locally

Both Replit and Cursor did well here. This round is a draw.

Round 7: deploying the app to the web

Replit makes this really easy. There's a simple process to go through and your app is deployed. Cursor doesn't do deployment and the deployment services like Render have a learning curve.

Round 7 is a victory for Replit.

A disturbing thought

I was looking at how both apps turned out and something struck me when I was looking at the code for the Cursor app: what services did these apps use? I didn't specify what APIs I wanted to use, the AIs chose for me.

Both of these apps converted an address to a latitude/longitude, showed a map, got local news, got a climate chart for the year, and so on. But what APIs (services) did they use underneath? What were the terms and conditions of the services? What are the limitations of the services? The answer is: you have to find out for yourself. Which means either asking the AI or digging into the code.

If I sign up for an API key, I have to go to a website, read what the service offers, and accept the terms and conditions. For example, some APIs forbid commercial use, some are very rate limited, and others require an acknowledgment in the app or web page. If you build an app using an AI, how do you know what you've agreed to? Will your app get rate limited? Will you get banned for using the API service inappropriately? What are the risks? It seems like a feeble defense to say "my AI made me do it".

It looks like the onus is on you to figure this out, which is definitely a problem.

Who won?

Looking at the results of the contest, my answer is: it depends on your end goal.

If you want a tool to let you build a "simplish" app and you don't have much, if any, coding experience, then Replit is the clear winner. On the downside, it will be very difficult to add more complex features later.

If you want to build a more complex app and you have coding experience, then Cursor wins. Cursor also wins if you think that you'll need to edit the app code in the future. 

What would I choose for internal reporting or BI-type development? On balance, Cursor, but it's not a clear victory. Here's my logic.

  • I love the idea of democratizing analysis. I like giving users the power to answer their own questions. This would appear to favor Replit, but...
  • I worry about maintainability and extendability. I've seen too many cases where a one-off app has become business critical and no-one knows how to maintain it. This favors Cursor because in my view, it produces more maintainable code.

Future directions

The ultimate goal is a tool that lets a non-coder quickly and simply build an app, even a complex one, that's maintainable in the future. This could be building an app for internal use (within an organization) or external use. The app development process will be a combination of natural language prompting and visual editing. Right now, we're really, really close to that goal and it's probably arriving later in 2026.

I'm sure some readers will feel I'm being harsh when I say Replit isn't quite there yet; for me, it needs less prompting and better code layout and documentation. Cursor has a way to go and I'm not convinced they're going in this direction (they may well stay focused on code development). 

In my view, the bigger problem is not app development but data availability. To build internal apps, the internal data has to be available, which means it has to be well-described and in a place where the app development program (and the app itself) can access it. In many organizations, their data isn't as well organized as it should be (to put it politely). It's like having a car but not being able to find gas (or only finding the wrong gas), it makes the car useless. To make internal app development really fly, internal data has to be organized "good enough". We may well see more focus on data organization within companies as a result.

Both Cursor and Replit have the advantage that they ultimately use common languages and packages. This means that the skills to maintain apps created using them are common in any company with programmers or analysts on staff. Contrast that with BI tools where the skills and knowledge of how to use the BI tools are only in the BI group. I can see tools like Cursor and Replit encroaching more and more into BI territory, especially as app development becomes democratized.

Tuesday, December 30, 2025

Why are weather forecasting sites so bad?

Just show me what's relevant!

Weather forecasting in the US has got really bad for no real reason. I'm not talking about the accuracy, I'm talking about the way the data is presented. Oddly, it's the professional weather sites that are the worst.

Here's what I want. I want a daily view of the weather for the next week. I want temperature highs and lows, chances of rain/snow (when and how much), and some details on the wind if it's going to be unusual. A line or two of text would be great for each day. I don't mind ads, but I don't want so many that I can't read the data. It's not much to ask, but it seems like it's hard to get.

(Gemini)

What the commercial sites give me

The commercial sites give me visual clutter everywhere. There are ads all over their pages. Of course, ads scream for attention, so multiple ads are distracting and make the page hard to use. If I try and change anything on the page, I get an ad I have to click away from.  Because they have to allow space for ads and links to other content, the screen real estate they can use for actual weather  data is very limited. Throw in some over-size icons and you leave even less room for meaningful text and data. 

The hourly views they provide are very detailed, but oddly, poorly presented. If I want the hourly forecast for three days' time, I have to scroll through lots of stuff - which I guess is the point. The summary views are too truncated because of their cluttered presentations.

The radar charts are nice, as is the animation, but again they're distracting. The choice of colors makes me feel like I'm reading a 1980s superhero comic.

Of course, these websites have to be paid for and the money comes from ads. It seems like it's ads or subscriptions and I'm already paying too much in subscription fees. It feels like things aren't going to get better.

Google and others

Google provides a very good weather summary, as do a number of other sites. Unfortunately, they don't provide all the data I want, but they get pretty close. Their data presentation is great too. 

TV is the worst

Let me be blunt. I don't trust TV forecasts. I've read that they tend to exaggerate bad weather to get viewers, this includes exaggerating rainfall and exaggerating weather severity. I've read of TV forecasters who were asked by their station manager to make forecasts worse to drive ratings. There's a saying in journalism, "if it bleeds, it leads" and it seems like sometimes weather forecasts fit into this category. It may well be that some or all of my local stations are not like this, but I have no way of knowing. If they want to gain my trust, they should publish data on their accuracy, but none of them do.

For reasons I'll get to in a minute, AI has made me lose faith in TV forecasters completely. 

NWS

By now, many of you will be screaming about the National Weather Service. They provide free forecasts and plenty of data via their API. They have exactly the data I want, but it's poorly presented. Their website feels very late 1990s, and there may be reasons for that.

There's been an on-and-off campaign against the NWS for some time now. The argument against it is that it's unfair competition for the commercial weather forecast providers.  Bear in mind that the commercial providers all use NWS data underneath and that we the tax payers have paid for weather collection. The push is to have the NWS stop providing data and forecasts to the public but still provide the data to commercial providers in bulk. In effect, this means the public will pay for data collection and pay again to see the data they paid to be collected. I can't help feeling that part of the awkward NWS data presentation is to deflect the unfair competition argument.

The NWS' parent agency is NOAA and recently, NOAA has suffered substantial cuts. At this time, it's not clear what the effects of these cuts will be, but they can't be good for forecasting.

What I did about it

I built my own app using AI code gen and using an LLM to give me the text I wanted.

I wrote a long prompt to tell Cursor to build an app. I told it to get a US zip code, find the biggest town or city in the zip code, and convert it to latitude and longitude. Next up, I told it to get the NWS seven day forecast and pass the data to Google Gemini and produce a summary forecast from the data. Finally, I added in a weather chatbot, just because. I put the whole thing into Streamlit.
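
To give a feel for the NWS step, here's a minimal sketch of how the seven day forecast can be fetched and boiled down to text for an LLM prompt. It isn't the code Cursor generated for me. It assumes the public api.weather.gov endpoints (a points lookup that links to a forecast), and the function names and the User-Agent value are placeholders I've made up for illustration.

```python
import json
from urllib.request import Request, urlopen

NWS_POINTS_URL = "https://api.weather.gov/points/{lat:.4f},{lon:.4f}"
# The NWS asks API clients to identify themselves; this value is a placeholder.
USER_AGENT = "my-weather-app (contact@example.com)"


def fetch_json(url: str) -> dict:
    """GET a URL and decode the JSON body."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def fetch_forecast(lat: float, lon: float) -> dict:
    """Look up the grid point for a location, then follow its forecast link."""
    point = fetch_json(NWS_POINTS_URL.format(lat=lat, lon=lon))
    return fetch_json(point["properties"]["forecast"])


def forecast_lines(forecast: dict) -> list[str]:
    """Reduce the forecast periods to one short line each, ready for a prompt."""
    lines = []
    for period in forecast["properties"]["periods"]:
        lines.append(
            f"{period['name']}: {period['temperature']} "
            f"{period['temperatureUnit']}, {period['shortForecast']}"
        )
    return lines
```

The list returned by forecast_lines can be joined into a single string and passed to whichever LLM you're using, along with an instruction to write the summary.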

My app isn't perfect, but it's pretty close to what I want. It all fits on one page so it's easy to see the daily forecast and the overall summary is very readable. If I have questions, I can just ask the chatbot. I'm now using my app when I want a forecast because it has what I want and it's faster and easier to use than the alternatives. It's way better than watching the TV weather forecast and I'm convinced my app isn't biased to emphasize drama.

(My app, simple but effective.)

(Future enhancements I'm thinking of adding include:

  • Changing to a tabbed display.
  • Summary and seven day view on the main tab.
  • Hourly views on another tab - including Google-like charts.
  • Adding a radar view tab using the NWS radar data.
  • Adding text-to-speech via an AI service.
This is all about adding more functionality in an easy-to-use way that lets me get what I want quickly.)

My app took 10 minutes to write.

Let me say this again. I built an app that's better for me than the existing commercial weather forecasting services and I did it in 10 minutes. 

There are implications here.

Let's say I'm a radio station and my existing meteorologist retires or leaves. Why not replace them with an app? I can generate a soothing calming voice using AI so I can automate the whole forecast and save myself some money. I can do the same thing if I'm a TV station too; I can hire someone cheap to read the forecast or generate a movie of the forecast. I could also amp up the urgency of any bad news without any fear of someone pushing back. In other words, AI is a game changer.

So long as the NWS exists and is providing free data, the potential exists to disrupt the weather forecasting market using AI. 

What other markets like this could AI disrupt?

Tuesday, December 23, 2025

Using Cursor for data science: a talk

Code generation is good enough for data science use

I gave a talk at PyData Boston on using Cursor for data science. Here's the talk.



Friday, December 19, 2025

Small adventures with small language models

Small is the new large

I've been talking to people about small language models (SLMs) for a little while now. They've told me they've got great results and they're saving money compared to using LLMs; these are people running businesses so they know what they're talking about. At an AI event, someone recommended I read the recent and short NVIDIA SLM paper, so I did. The paper was compelling; it gave the simple message that SLMs are useful now and you can save time and money if you use them instead of LLMs. 

(If you want to use SLMs, you'll be using Ollama and HuggingFace. They work together really well.)

As a result of what I've heard and read, I've looked into SLMs and I'm going to share with you what I've found. The bottom line is: they're worth using, but with strong caveats.

What is an SLM?

The boundary between an SLM and an LLM is a bit blurry, but to put it simply, an SLM is any model small enough to run on a single computer (even a laptop). In reality, SLMs require quite a powerful machine (developer spec) as we'll see, but nothing special, and certainly nothing beyond the budget of almost all businesses. Many (but not all) SLMs are open-source.

(If your laptop is "business spec", e.g., a MacBook Air, you probably don't have enough computing power to test out SLMs.) 

How to get started

To really dive into SLMs, you need to be able to use Python, but you can get started without coding. Let's start with the non-coders path because this is the easiest way for everyone to get going.

The first port of call is visiting ollama.com and downloading their software for your machine. Install the software and run it. You should see a UI like this.

Out-of-the-box, Ollama doesn't install any SLMs, so I'm going to show you how to install a model. From the drop down menu on the bottom right, select llama3.2. This will install the model on your machine which will take a minute or so. Remember, these models are resource hogs and using them will slow down your machine.

Once you've installed a model, ask it a question. For example, "Who is the Prime Minister of Canada?". The answer doesn't really matter, this is just a simple proof that your installation was successful. 

(By the way, the Ollama logo is very cute and they make great use of it. It shows you the power of good visual design.)

So many models!

The UI drop down list shows a number of models, but these are a fraction of what's available. Go to this page to see a few more: https://ollama.com/library. This is a nice list, but you actually have access to thousands more. HuggingFace has a repository of models that follow the GGUF format, you can see the list here: https://huggingface.co/models?library=gguf

Some models are newer than others and some are better than others at certain tasks. HuggingFace have a leaderboard that's useful here: https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard. It does say LLM, but it includes SLMs too and you can select just an SLM view of the models. There are also model cards you can explore that give you insight into the performance of each model for different types of tasks.

To select the right models for your project, you'll need to define your problem and look for a model metric that most closely aligns with what you're trying to do. That's a lot of work, but to get started, you can install the popular models like mistral, llama3.2, and phi3 and get testing.

Who was the King of England in 1650?

You can't just generically evaluate an SLM, you have to evaluate it for the task you want to do. For example, if you want a chatbot to talk about the stock you have in your retail company, it's no use testing the model on questions like "who was King of England in 1650?". It's nice if the model knows Kings & Queens, but not really very useful to you. So your first task is defining your evaluation criteria.

(England didn't have a King in 1650, it was a republic. Parliament had executed the previous King in 1649. This is an interesting piece of history, but why do you care if your SLM knows it?)

Text analysis: data breaches

For my evaluation, I chose a project analyzing press reports on data breaches. I selected nine questions I wanted answers to from a press report. Here are my questions:

  • "Does the article discuss a data breach - answer only Yes or No"
  • "Which entity was breached?"
  • "How many records were breached?"
  • "What date did the breach occur - answer using dd-MMM-YYYY format, if the date is not mentioned, answer Unknown, if the date is approximate, answer with a range of dates"
  • "When was the breach discovered, be as accurate as you can"
  • "Is the cause of the breach known - answer Yes or No only"
  • "If the cause of the breach is known state it"
  • "Were there any third parties involved - answer only Yes or No"
  • "If there were third parties involved, list their names"

The idea is simple, give the SLM a number of press reports. Get it to answer the questions on each article. Check the accuracy of the results for each SLM.
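
As a rough illustration of the "check the accuracy" step, here's a minimal sketch that compares a model's answers with a hand-labelled answer key. It's an assumption about how you might score things, not the scoring in my code, and the question keys used in any example data are hypothetical.

```python
def accuracy(predicted: list[dict], expected: list[dict]) -> float:
    """Fraction of (article, question) answers that match the answer key.

    Each list holds one dict per article, mapping a question key to an answer.
    Matching ignores case and surrounding whitespace, which suits Yes/No
    questions but is too strict for free-text answers.
    """
    total = 0
    correct = 0
    for model_answers, true_answers in zip(predicted, expected):
        for question, truth in true_answers.items():
            total += 1
            guess = model_answers.get(question, "")
            if guess.strip().lower() == truth.strip().lower():
                correct += 1
    return correct / total if total else 0.0
```

Exact matching is a deliberate simplification. For questions like "Which entity was breached?", you'd probably need fuzzy matching or a manual check.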

As it turns out, my questions need some work, but they're good enough to get started.

Where to run your SLM?

The first choice you face is which computer to run your SLM on. Your choices boil down to evaluating it on the cloud or on your local machine. If you evaluate on the cloud, you need to choose a machine that's powerful enough but also works with your budget. Of course, the advantage of cloud deployment is you can choose any machine you like. If you choose your local machine, it needs to be powerful enough for the job. The advantage of local deployment is that it's easier and cheaper to get started.

To get going quickly, I chose my local machine, but as it turned out, it wasn't quite powerful enough.

The code

This is where we part ways with the Ollama app and turn to coding. 

The first step is installing the Ollama Python module (https://github.com/ollama/ollama-python). Unfortunately, the documentation isn't great, so I'm going to help you through it.

We need to install the SLMs on our machine. This is easy to do, you can either do it via the command line or via the API. I'll just show you the command line way to install the model llama3.2:

ollama pull llama3.2

Because we have the same nine questions we want to ask of each article, I'm going to create a 'custom' SLM. This means selecting a model (e.g. Llama3.2) and customizing it with my questions. Here's my code.

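import ollama

# With stream=True, create() yields progress updates, so loop over them.
for progress in ollama.create(
    model='breach_analyzer',
    from_='llama3.2',
    system=system_prompt,
    stream=True,
):
    print(progress.status)  # Assumed loop body: show each status update.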

The system_prompt is my nine questions I showed you earlier plus a general prompt. model is the name I'm giving my custom model; in this case I'm calling it breach_analyzer.

Now I've customized my model, here's how I call it:

response = ollama.generate(
    model='breach_analyzer',
    prompt=prompt,
    format=BreachAnalysisResponse.model_json_schema(),
)

The prompt is the text of the article I want to analyze. The format is the JSON format I want the results to be in.  The response is the response from the model using the JSON format defined by BreachAnalysisResponse.model_json_schema().
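
To make the format argument less abstract: model_json_schema() returns a plain dictionary describing the JSON you expect back. Here's a hand-written sketch of the kind of schema involved, plus a small check on the model's reply. The field names are hypothetical and cover only three of the nine questions; they are not the fields in my BreachAnalysisResponse class.

```python
import json

# Hypothetical schema: the field names are illustrative only.
BREACH_SCHEMA = {
    "type": "object",
    "properties": {
        "is_breach": {"type": "string", "enum": ["Yes", "No"]},
        "entity": {"type": "string"},
        "records_breached": {"type": "string"},
    },
    "required": ["is_breach", "entity", "records_breached"],
}


def parse_answer(raw: str, schema: dict = BREACH_SCHEMA) -> dict:
    """Decode the model's JSON text and check the required keys are present."""
    answer = json.loads(raw)
    missing = [key for key in schema["required"] if key not in answer]
    if missing:
        raise ValueError(f"model response is missing fields: {missing}")
    return answer
```

Even with a schema, it's worth checking the reply before you use it, which is all parse_answer does here.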

Note I'm using generate here and not chat. My queries are "one-off" and there's no sense of a continuing dialog. If I'd wanted a continuing dialog, I'd have used the chat function.

Here's how my code works overall:

  1. Read in the text from six online articles.
  2. Load the model the user has selected (either mistral, llama3.2, or phi3).
  3. Customize the model.
  4. Run all six online articles through the customized model.
  5. Collect the results and analyze them.

I created two versions of my code, a command line version for testing and a Streamlit version for proper use. You can see both versions here: https://github.com/MikeWoodward/SLM-experiments/tree/main/Ollama

The results

The first thing I discovered is that these models are resource hogs! They hammered my machine and took 10-20 minutes to run each evaluation of six articles. My laptop is a 2020 developer spec MacBook Pro but it isn't really powerful enough to evaluate SLMs. The first lesson is, you need a powerful, recent machine to make this work; one that has GPUs built in that the SLM can access. I've heard from other people that running SLMs on high-spec machines leads to fast (usable) response times.

The second lesson is accuracy. Of the three models I evaluated, not all of them answered my questions correctly. One of the articles was an article about tennis and not about data breaches, but one of the models incorrectly said it was about data breaches. Another of the models told me it was unclear whether there were third parties involved in a breach and then told me the name of the third party! 

On reflection, I needed to tweak my nine questions to get clearer answers. But this was difficult because of the length of time it took to analyze each article. This is a general problem; it took so long to run the models that any tweaking of code or settings took too much time.

The overall winner in terms of accuracy was Phi-3, but this was also the slowest to run on my machine, taking nearly 20 minutes to analyze six articles. From commentary I've seen elsewhere, this model runs acceptably fast on a more powerful machine.

Here's the key question: could I replace paid-for LLMs with SLMs? My answer is: almost certainly yes, if you deploy your SLMs on a high-spec computer. There's certainly enough accuracy here to warrant a serious investigation.

How I could have improved the results?

The most obvious thing is a faster machine: a brand new top-of-the-range MacBook Pro with lots of memory and built-in GPUs. Santa, if you're listening, this is what I'd like. Alternatively, I could have gone onto the cloud and used a GPU machine.

My prompts could be better. They need some tweaking.

I get the text of these articles using requests. As part of the process, it gives me all of the text on the page, which includes a lot of irrelevant stuff. A good next step would be to get rid of some of the extraneous and distracting text. There are lots of ways to do that and it's a job any competent programmer could do.
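
Here's one possible way to do that clean-up, as a minimal sketch using only the Python standard library. It drops text inside tags that usually hold navigation, scripts, and page furniture. The list of tags to skip is my assumption and would need tuning for real news sites.

```python
from html.parser import HTMLParser

# Tags whose contents are rarely part of the article itself (an assumption).
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class ArticleTextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the page text with the skipped sections removed."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

You'd pass the page's HTML to extract_text and give the result to the model instead of the raw page text.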

If I could solve the speed problem, it would be good to investigate using multiple models. This could take several forms:

  • asking the same questions using multiple models and voting on the results
  • using different models for different questions.
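
The voting idea can be sketched in a few lines. This is a minimal illustration, assuming each model's answer to a question has already been collected as a string; returning "Unknown" on a tie is my choice, not something I've tested.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the most common answer, or 'Unknown' when there's no clear winner."""
    normalized = [answer.strip().lower() for answer in answers if answer.strip()]
    if not normalized:
        return "Unknown"
    ranked = Counter(normalized).most_common(2)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "Unknown"  # tie between the top two answers
    return ranked[0][0]
```

This works best for the Yes/No questions. Free-text answers from different models rarely match word for word, so they'd need some normalization first.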

What's notable about these ways of improving the results is how simple they are.

Some musings

  • Evaluating SLMs is firmly in the technical domain. I've heard of non-technical people trying to play with these models, but they end up going nowhere because it takes technical skills to make them do anything useful.
  • There are thousands of models and selecting the right one for your use case can be a challenge. I suggest going with the most recent and/or ones that score most highly on the HuggingFace leaderboard.
  • It takes a powerful machine to run these models. A new high-end machine with GPUs would probably run these models "fast enough". If you have a very recent and powerful local machine, it's worth playing around with SLMs locally to get started, but for serious evaluation, you need to get on the cloud and spend money.
  • Some US businesses are allergic to models developed in certain countries, some European businesses want models developed in Europe. If the geographic origin of your model is important, you need to check before you start evaluating.
  • You can get cost savings compared to LLMs, but there's hard work to be done implementing SLMs.

I have a lot more to say about evaluations and SLMs that I'm not saying here. If you want to hear more, reach out to me.

Next steps

Ian Stokes-Rees gave an excellent tutorial at PyData Boston on this topic and that's my number one choice for where to go next.

After that, I suggest you read the Ollama docs and join their Discord server. After that, the Hugging Face Community is a good place to go. Lastly, look at the YouTube tutorials out there.

Monday, December 1, 2025

Some musings on code generation: kintsugi

Hype and reality

I've been using AI code generation (Claude, Gemini, Cursor...) for months and I'm familiar with its strengths and weaknesses. It feels like I've gone through the whole hype cycle (see https://en.wikipedia.org/wiki/Gartner_hype_cycle) and now I'm firmly on the Plateau of Productivity. Here are some musings covering benefits, disappointments, and a way forward.

(The Japanese art of Kintsugi. Image by Gemini.)

Benefits

Elsewhere, people have waxed lyrical about the benefits of code generation, so I'm just going to add in a few novel points.

It's great when you're unfamiliar with an area of a language; it acts as a prompt or tutorial. In the past, you'd have to wade through pages of documentation and write code to experiment. Alternatively, you could search to see if anyone's tackled your problem and has a solution. If you were really stuck, you could try and ask a question on Stack Overflow and deal with the toxicity. Now, you can get something to get you going quickly.

Modern code development requires properly commenting code, making sure code is "linted" and PEP8 compliant, and creating test cases etc. While these things are important, they can consume a lot of time. Code generation steps on the accelerator pedal and makes them go much faster. In fact, code gen makes it quite reasonable to raise the bar on code quality.

Disappointments

Pandas dataframes

I've found code gen really doesn't do well manipulating Pandas dataframes. Several times, I've wanted to transform dataframes or do something non-trivial, for example, aggregating data, merging dataframes, transforming a column in some complex way and so on. I've found the generated code to either be wrong or really inefficient. In a few cases, the code was wrong, but in a way that was hard to spot; subtle bugs are costly to fix.

Bloated code

This is something other people have commented to me too: sometimes generated code is really bloated. I've had cases where what should have been a single line of code gets turned into 20 or more lines. Some of it is "well-intentioned", meaning lots of error trapping. But sometimes it's just a poor implementation. Bloated code is harder to maintain and slower to run.

Django

It took me a while to find the problems with Django code gen. On the whole, code gen for Django works astonishingly well, it's one of the huge benefits. But I've found the generated code to be inefficient in several ways:

  • The model manipulations have sometimes been odd or poor implementations. A more thoughtful approach to aggregation can make the code more readable and faster.
  • If the network connection is slow or backend computations take some time, a page can take a long time to even start to render. A better approach involves building the page so the user sees something quickly and then adding other elements as they become available. Code gen doesn't do this "out of the box".
  • UI layout can sometimes take a lot of prompting to get right. Mostly, it works really well, but occasionally, code gen finds something it really, really struggles with. Oddly, I've found it relatively easy to fix these issues by hand.

JavaScript oddities

Most of my work is in Python, but occasionally, I've wandered into JavaScript to build apps. I don't know a lot of JavaScript, and that's been the problem, I've been slow to spot code gen wrongness.

My projects have widgets and charts and I found the JavaScript callbacks and code were overcomplicated and bloated. I re-wrote the code to be 50% shorter and much clearer. It cost me some effort to come up to speed with JavaScript to spot and fix things.

Oddly, I found hallucination more of a problem for JavaScript than Python. My code gen system hallucinated the need to include an external CSS file that didn't exist and wasn't needed. Code gen also hallucinated "standard" functions that weren't available (that was a nice one to debug!).

Similar to my Python experience, I found code gen to be really bad at manipulating data objects. In a few cases, it would give me code that was flat out wrong.

'Unpopular' code

If you're using libraries that have been extensively used by others (e.g. requests, Django, etc.), code gen is mostly good. But when you're using libraries that are a little "off the beaten path", I've found code generation really drops down in quality. In a few cases, it's pretty much unusable.

A way forward through the trough of disappointment

It's possible that more thorough prompting might solve some of these problems, but I'm not entirely convinced. I've found that code generation often doesn't do well with very, very detailed and long prompting. Here's what I think is needed.

Accepting that code generation is flawed and needs adult supervision. It's a tool, not a magic wand. The development process must include checks the code is correct.

Proper training. You need to spot when it's gone wrong and you need to intervene. This means knowing the languages you're code generating. I didn't know JavaScript well enough and I paid the price.

Libraries to learn from and use. Code gen learns from your codebase, but this isn't enough, especially if you're doing something new, and it can also mean code gen is learning the wrong things. Having a library means code gen isn't re-inventing the wheel each time.

In a corporate setting, all this means having thoughtful policies and practices for code gen and code development. Code gen is changing rapidly, which means policies and practices will need to be updated every six months, or when you learn something new.

Kintsugi

Kintsugi is the Japanese art of taking something broken (e.g., a pot or a vase) and mending it in a way that both acknowledges its brokenness and makes it more beautiful. Code generation isn't broken, but it can be made a lot more useful with some careful thought and acknowledging its weaknesses.

Monday, November 24, 2025

Caching and token reduction

This is a short blog post to share some thoughts on how to reduce AI token consumption and improve user response times.

I was at the AI Tinkerers event in Boston and I saw a presentation on using AI report generation for quant education. The author was using a generic LLM to create multiple-choice questions on different themes. Similarly, I've been building an LLM system that produces a report based on data pulled from the internet. In both cases, there are a finite number of topics to generate reports on. My case was much larger, but even so, it was still finite.

The obvious thought is, if you're only generating a few reports or questions & answers, why not generate them in batch? There's no need to keep the user waiting and of course, you can schedule your LLM API calls in the middle of the night when there's less competition for resources. 

(Canva)

In my case, there are potentially thousands of reports, but some reports will be pulled more often than others. A better strategy in my case is something like this:

  1. Take a guess at the most popular reports (or use existing popularity data) and generate those reports overnight (or at a time when competition for resources is low). Cache them.
  2. If the user wants a report that's been cached, return the cached copy.
  3. If the user wants an uncached report:
    • Tell the user there will be a short wait for the LLM
    • Call the LLM API and generate the report
    • Display the report
    • Cache the report
  4. For each cached report, record the LLM and its creation timestamp.

You can start to do some clever things here, such as refreshing the reports every 30 days or whenever the LLM is upgraded.
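As a rough illustration, here's a minimal sketch of the strategy above in Python. It's not the code from my system: the names (`ReportCache`, `CachedReport`, `fake_llm`) are made up for this example, `generate` stands in for whatever function wraps your LLM API call, and the cache is just an in-memory dictionary (a real system would more likely use a database or a cache server). The 30-day limit and the check on the model name are assumptions taken from the refresh ideas above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Iterable, Optional

MAX_AGE = timedelta(days=30)  # assumed refresh interval


@dataclass
class CachedReport:
    text: str
    model: str            # which LLM produced the report
    created_at: datetime  # when the report was generated


class ReportCache:
    """In-memory report cache (illustrative only)."""

    def __init__(self, generate: Callable[[str], str], model: str) -> None:
        self._generate = generate  # hypothetical wrapper around the LLM API call
        self._model = model        # name of the LLM currently in use
        self._store: Dict[str, CachedReport] = {}

    def _is_fresh(self, entry: CachedReport, now: datetime) -> bool:
        # A report is stale if the LLM has changed or the report is too old.
        return entry.model == self._model and now - entry.created_at < MAX_AGE

    def warm(self, topics: Iterable[str], now: Optional[datetime] = None) -> None:
        """Step 1: batch-generate the reports you expect to be popular."""
        for topic in topics:
            self.get(topic, now)

    def get(self, topic: str, now: Optional[datetime] = None) -> str:
        now = now or datetime.now(timezone.utc)
        entry = self._store.get(topic)
        if entry is not None and self._is_fresh(entry, now):
            return entry.text  # step 2: return the cached copy
        # Step 3: cache miss. A real UI would warn the user about the wait here.
        text = self._generate(topic)
        # Step 4: cache the report with the LLM name and creation timestamp.
        self._store[topic] = CachedReport(text, self._model, now)
        return text


def fake_llm(topic: str) -> str:
    # Stand-in for a real LLM API call.
    return f"Report on {topic}"


cache = ReportCache(fake_llm, model="model-v1")
cache.warm(["inflation", "bonds"])   # overnight batch job
report = cache.get("inflation")      # served from the cache, no LLM call
```

Storing the model name next to the timestamp is what makes both refresh rules a one-line check: a report is regenerated on the next request if it's older than 30 days or was produced by a different LLM.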

I know this isn't rocket science, but I've been surprised how few LLM demos I've seen use any form of batch processing and caching.

Monday, November 17, 2025

Data scientists need to learn JavaScript

Moving quickly

Over the last few months, I've become very interested in rapid prototype development for data science projects. Here's the key question I asked myself: how can a data scientist build their own app as quickly as possible? Nowadays, speed means code gen, but that's only part of the solution.

The options

The obvious quick development path is using Streamlit; that doesn't require any new skills because it's all in Python. Streamlit is great, and I've used it extensively, but it only takes you so far and it doesn't really scale. Streamlit is really for internal demos, and it's very good at that.

The more sustainable solution is using Django. It's a bigger and more complex beast, but it's scalable. Django requires Python skills, which is fine for most data scientists. Of course, Django apps are deployed on the web and users access them as web pages.

The UI is one place code gen breaks down under pressure

Where things get tricky is adding widgets to Django apps. You might want your app to take some action when the user clicks a button, or have widgets controlling charts etc. Code gen will nicely provide you with the basics, but once you start to do more complicated UI tasks, like updating chart data, you need to write JavaScript or be able to correct code gen'd JavaScript.

(As an aside, for my money, the reason a number of code gen projects stall is that code gen only takes you so far. To do anything really useful, you need to intervene, providing detailed guidance and writing code where necessary. This means JavaScript code.)

JavaScript != Python

JavaScript is very much not Python. Even a cursory glance will tell you the JavaScript syntax is unlike Python. More subtly, and more importantly, some of the underlying ideas and approaches are quite different. The bottom line is, a Python programmer is not going to write good enough JavaScript without training.

To build even a medium complexity data science app, you need to know how JavaScript callbacks work, how arrays work, how to debug in the browser, and so on. Because code gen is doing most of the heavy lifting for you, you don't need to be a craftsman, but you do need to be a journeyman.

What data scientists need to do

The elevator pitch is simple:

  • If you want to build a scalable data science app, you need to use Django (or something like it).
  • To make the UI work properly, code gen needs adult supervision and intervention.
  • This means knowing JavaScript.

(Data Scientist becoming JavaScript programmer. Gemini.)

In my view, all that's needed here is a short course, a good book, and some practice. A week should be enough time for an experienced Python programmer to get to where they need to be.

What skillset should data scientists have?

AI is shaking everything up, including data science. In my view, data scientists will have to do more than their "traditional" role. Data scientists who can turn their analysis into apps will have an advantage. 

For me, the skillset a data scientist will need looks a lot like the skillset of a full-stack developer. This means data scientists knowing a bit of JavaScript, code gen, deployment technologies, and so on. They won't need to be experts, but they will need "good enough" skills.