Friday, January 9, 2026

The Siren Song

A happy siren accident

I was searching the web for something, and by a happy accident of mistyping, I found a completely unrelated and wonderful event. What I saw inspired this blog post. 

I'm going to write about sirens, those loud things that scare you into taking your safety seriously.

(World War II British siren, Robert Jarvis, via Wikimedia Commons.  Creative Commons Attribution 3.0 Unported license.)

Siren etymology

The word siren comes from ancient Greek mythology. Sirens were female, human-like beings who used their voices to lure young men to their deaths. In the story of Jason and the Argonauts, the crew had to sail past an island of sirens who sang to lure the ship onto the rocks. The crew had Orpheus play his lyre to drown them out so they could pass safely. Unfortunately, one man, Butes, succumbed to the sirens' song and went overboard to reach them.

(The Siren by John William Waterhouse, via Wikimedia Commons. Note the siren's fishy feet.)

From this legend, we get the use of the word siren to describe a beautiful but dangerous woman, and also its use to describe a device for making loud tones. I'm going to skip the sexist use and focus on devices that make loud tones. Of course, I need to mention the reversal here: sirens in ancient Greece used beautiful sounds to lure you to your death; modern sirens use ugly sounds to save your life.

What's a siren?

A siren is a device that makes loud and piercing noises to alert people to danger. You can use pretty much any mechanism you like to produce the noise, but in modern times, it tends to be either rotating disks pushing air through holes, or electronics. Modern sirens produce relatively 'simple' sounds compared to musical instruments, which adds to their impact.

How they work

I'm going to focus on mechanical slotted disk sirens because they're what most people associate with the word siren. You can make any sound you like with electronics, but that's boring. 

Sound is a pressure wave moving through the air (or other medium). It consists of a wave of compression and rarefaction, meaning the air is compressed (higher pressure) and decompressed (lower pressure). Wind is the movement of the air itself; sound is movement within the air. This is an important distinction for a siren, as we'll see.

To make a noise, we have to set up a sound wave. Moving air alone won't work. For instance, blowing air through a straw won't make a noise. If we want to turn blowing air through a straw into a noise (and so create a simple siren), we have to introduce a compression wave. We can do this using an electric drill.

This article in Scientific American (https://www.scientificamerican.com/article/building-a-disk-siren/) describes the process. To simplify: create a disk with holes around the edge, mount it on an electric drill, and spin it up. Have a child blow through a straw above a hole in the disk. You should hear a siren-like sound.

Obviously, operating an electric drill close to a child's face could be an interesting experience, so buyer beware.

Blowing through the straw alone doesn't make a noise, but the holes in the rotating disk stop and start the flow, creating a compression wave and hence a sound. Because the holes are equally spaced and the drill rotates at a constant angular velocity, you hear approximately a single frequency. The faster the drill spins, the higher the frequency.

To make this much louder, we need to push a lot more air through the holes. Instead of a child blowing through a straw, we need an electric fan pushing air through holes. That's what electro-mechanical sirens do.

In most sirens, it's the fan that rotates while the holes remain stationary. The holes are placed at the edge of a stationary disk called a stator. It looks something like this.

The holes are often called ports. How many there are and how fast the rotor spins determines the frequency.
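
As a back-of-the-envelope check, the fundamental frequency is simply the number of ports that pass the air stream each second. Here's a tiny Python sketch (the port count and motor speed are made-up example numbers):

def siren_frequency(n_ports: int, rpm: float) -> float:
    # Fundamental frequency in Hz: ports passed per second.
    return n_ports * rpm / 60.0

# Hypothetical example: a 10-port stator driven at 2,700 rpm.
print(siren_frequency(10, 2700))  # 450.0 Hz, squarely in warning-siren territory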

The rotor both blows air through the holes and blocks the holes, creating a pressure wave. The rotor looks something like this.

Note the design. The 'fins' push the air out of the holes when the holes in the stator and rotor line up. The fins also block the holes as the rotor rotates. So the rotor alternately blocks the holes and pushes air through them. This is what creates the pressure wave and hence the sound.

The design I've shown here creates a single tone. Most sirens create two tones, so they consist of either two rotors and stators each producing a separate tone, or a single rotor and stator in a 'sandwich'. I've shown both designs below. The 'sandwich' terminology is mine, so don't go searching for it!

(Siren that produces different tones at different ends. Srikantasarangi, CC0, via Wikimedia Commons)

('Sandwich' design for two-tone sirens, from airraidsirens.com.)

Siren sounds

The tone a siren creates depends on the speed of the motor, the number of holes, and the diameter of the stator/rotor. As the motor starts up, its angular velocity increases from zero, which means the frequency the siren produces increases. Conversely, as the motor slows down to a stop, the frequency drops. By turning the power off and on, or by varying the power to the siren, we can create a moaning or wailing effect.

Sirens don't create a pure sine wave, but the sound is fairly close to one. They produce a roughly triangular sound wave that has lots of harmonics (see https://www.airraidsirens.net/tech_howtheywork.html). Because of this distinctive wave shape, a siren is clearly an artificial sound, and that's what the authorities want.

A single tone is OK, but you can achieve a stronger psychological effect on the population with two or more tones. Sound waves interfere with one another to create new frequencies. With a two-tone siren, designers often choose an interval called a minor third, which musically is a sad or downbeat sound.

Lower frequencies travel further than higher frequencies, which is why sirens tend to use them. On the flip side, it's harder for humans to locate the source of lower frequency sounds, but that doesn't really matter for a warning. You don't need people to know where the siren is, you just need them to hear it and run. These lower frequencies are typically in the range 400-500 Hz, with the mid-range 450 Hz generally considered the most annoying.
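
If you want to hear something like this without building anything, you can synthesize a rough approximation. This sketch uses NumPy and Python's built-in wave module: it sweeps a base tone up and down to mimic the motor speeding up and winding down, and adds a second tone a minor third above (a 6:5 frequency ratio). All the specific numbers are illustrative choices, not measurements of any real siren.

import wave
import numpy as np

SAMPLE_RATE = 44100
DURATION = 8.0  # seconds

t = np.linspace(0.0, DURATION, int(SAMPLE_RATE * DURATION), endpoint=False)

# Motor spins up then winds down: base frequency sweeps 150 Hz -> 450 Hz -> 150 Hz.
sweep = 0.5 * (1.0 - np.cos(2.0 * np.pi * t / DURATION))  # 0 -> 1 -> 0
base_freq = 150.0 + 300.0 * sweep

# Integrate frequency to get phase, so the sweep is smooth and click-free.
phase = 2.0 * np.pi * np.cumsum(base_freq) / SAMPLE_RATE

# Two tones a minor third apart (6:5 ratio). A real siren is closer to a
# triangular wave with strong harmonics; sine waves keep the sketch simple.
signal = np.sin(phase) + np.sin(phase * 6.0 / 5.0)

# Normalize to 16-bit integers and write a mono WAV file.
samples = np.int16(signal / np.max(np.abs(signal)) * 32767)
with wave.open("siren.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 16-bit samples
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(samples.tobytes())

Play siren.wav and you'll hear a crude version of the two-tone wail described below.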

World War II - Wailing Winnie and Moaning Minnie

The most famous sirens of World War II are the air raid sirens used in the UK. They're mostly associated with the London Blitz, but they were used in other British cities too. They used two different signals: one to warn of an air raid and the other to give the all-clear.

Here's a recording of the air-raid alert sound (first minute). Note the wailing sound caused by varying the power to the siren. These sirens used lower frequencies, designed to be penetrating, and used a minor third for a spooky downbeat sound. Imagine sirens like this going off all at once all over a city to warn you that planes are coming to drop bombs on you.

The wailing sound led to the sirens being nicknamed Wailing Winnie or Moaning Minnie. The same names were also used for Nazi weaponry.

Here's the all clear signal (same video, but towards the end). It's a continuous tone. 


In 2012, the British band Public Service Broadcasting released a track called "London Can Take It", based on a 1940 British propaganda film that was narrated by the American Quentin Reynolds. It starts with an air-raid siren.

Post WWII - civil defense in different forms

During the Cold War, sirens were deployed in many cities to warn of an attack, though I'm not sure how useful hiding from a nuclear weapon would be.

Over the same period, siren usage was extended to warning of danger from natural disasters like tornadoes or flooding. As you might expect, the technology became more sophisticated and more compact: electronics could generate the sound, which made smaller sirens and different sounds possible. Smaller sirens were deployed on emergency vehicles, and you've certainly heard them. Despite all this change, the fundamental acoustics stay the same, which means sirens that warn the population (and so cover a wide area) must have large horn-type 'speakers' to broadcast their signals. In other words, warning sirens are big.

(Siren mounted on a fire truck. FiremanKurt, CC BY-SA 3.0, via Wikimedia Commons)

Build your own siren

There are loads of sites on the web that show you how you can build your own air-raid type siren. Most of them assume you've got access to a beefy electrical motor, though a few have designs you can use with an electric drill. 

Several sites will tell you how to build an air-raid siren from wood, but the skill level is quite high. I'm a little put off by designs that require me to cut a perfect circle with a jigsaw and balance it carefully. I'm not sure my woodworking skills are up to it.

Other sites have instructions for 3D-printing the components. This seems more doable, but the designs are mostly for sirens that can fit on an electric drill. Even though this seems easier, there are some tricky engineering stages.

The other problem is of course the noise. If you get it right, your home-built siren is going to be loud. I'm sure my neighbors would be pleased to hear my siren on a quiet Sunday afternoon.

SirenCon

My happy internet accident was searching for a conference but coming across SirenCon, a conference for people who like sirens (https://www.sirencon.com/home). I spent more time than I should clicking around their site and finding out more.

Think for a minute about how this works. SirenCon attendees will want to set off sirens, which is not good news for the neighbors. Where in New York City could you hold it? Whereabouts in any big city? The same logic applies to small towns and the suburbs. Where would be a good place to hold a loud conference?

The answer, unsurprisingly, is the countryside. SirenCon meets once a year in the woods of rural Wisconsin, in Rhinelander. Their location seems to be away from any population centers.

Each year, people come and show off their sirens. The 2025 siren list is here: https://www.sirencon.com/the-2025-line-up Rather wonderfully, there's live streaming and you can watch seven and a half hours of siren fun: https://www.youtube.com/live/ZV24Ioriar4

I think it's great that people with a niche interest like this can get together and share their passion. Good luck to them and I hope they have a wonderful 2026 SirenCon.

I've got the power: what statistical power means

Important, but overlooked

Power is a crucial number to understand for hypothesis tests, but sadly, many courses omit it, and it's often poorly understood, if it's understood at all. To be clear: if you're doing any kind of A/B testing, you have to understand power.

In this blog post, I'm going to teach you all about power.

Hypothesis testing

All A/B tests, all randomized controlled trials (RCTs), and many other forms of testing are ultimately hypothesis tests; I've blogged about what this means before. To briefly summarize and simplify: we make a statement and measure the evidence for and against it, using thresholds to make our decision.

With any hypothesis test, there are four possible outcomes (using simplified language):

  • The null hypothesis is actually true (there is no effect)
    • We say there is no effect (true negative)
    • We say there is an effect (false positive)
  • The null hypothesis is actually false (there is an effect)
    • We say there is no effect (false negative)
    • We say there is an effect (true positive)

I've summarized the possibilities in the table below.

                            Null hypothesis is true                    Null hypothesis is false
Fail to reject the null     True negative                              False negative
                            Correct inference                          Type II error
                            Probability threshold = 1 - \( \alpha \)   Probability threshold = \( \beta \)
Reject the null             False positive                             True positive
                            Type I error                               Correct inference
                            Probability threshold = \( \alpha \)       Probability threshold = power = 1 - \( \beta \)

A lot of attention goes on \(\alpha\), called the significance level, which tells us the probability of a false positive. By contrast, power is the probability of detecting an effect if it's really there (a true positive); sadly, it doesn't get nearly the same level of focus.

By the way, there's some needless complexity here. It would seem more sensible for the two threshold numbers to be \( \alpha \) and \( \beta \) because they're defined very similarly (false positive and false negative probabilities). Unfortunately, statisticians tend to quote power rather than \( \beta \).

In pictures

To get a visual sense of what power is, let's look at how a null hypothesis test works in pictures. Firstly, we assume the null is true and we draw out acceptance and rejection regions on the probability distribution (first chart). To reject the null, our test results have to land in the red rejection regions in the top chart.

Now we assume the alternate hypothesis is true (second chart). We want to land in the blue region in the second chart, and we want a certain probability (power), or more, of landing in the blue region.

To be confident there is an effect, we want the power to be as high as possible.

Calculating power - before and after

Before we run a test, we calculate the sample size we need based on a couple of factors, including the power we want the test to have. For reasons I'll explain later, 80% or 0.8 is a common choice. 

Once we've run the test and we have the test results, we then calculate the actual power based on the data we've recorded. It's very common for the actual power to be different from what we specified in our test design. If the actual power is too low, that may mean we have to continue the test or redesign it.

Unfortunately, power is hard to calculate: there are no convenient closed-form formulas, and to make matters worse, some of the websites that offer power and sample size calculations give incorrect results. The G*Power package is probably the easiest tool for most people to use, though there are convenient libraries in R and Python that will calculate power for you. If you're going to understand power, you really do need to understand statistics.

To make all this understandable, let me walk you through a sample size calculation for a conversion rate A/B test for a website. 

  • A/B tests are typically large with thousands of samples, which means we're in z-test territory rather than t-test. 
  • We also need to decide what we're testing for. A one-sided test tests for a difference in one direction only, either greater than or less than; a two-sided test tests for a difference in either direction. Two-sided tests are more common because they're more informative. Some authors use the terms one-tailed and two-tailed instead of one-sided and two-sided. 
  • Now we need to define the thresholds for our test, which are \( \alpha \)  and power. Common values are 0.05 and 0.8.  
  • Next up, we need to look at the effect. In the conversion test example, we might have a conversion rate of 2% on one branch and an expected conversion rate of 2.2% on the other branch. 
We can put all this into G*Power and here's what we get.

Test type   Tail(s)      \( \alpha \)   Power   Proportion 1   Proportion 2   Sample size
z-test      Two-tailed   0.05           0.8     0.02           0.022          161,364
z-test      Two-tailed   0.05           0.95    0.02           0.022          267,154

The first row of the table shows that a power of 80% leads to a sample size of 161,364. Increasing the power to 95% gives a sample size of 267,154, a big increase, and that's a problem. Power varies non-linearly with sample size, as I've shown in the screenshot below for this data (from G*Power).
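
If you'd rather check numbers like these in code, here's a sketch using Python's statsmodels library. Note that statsmodels approximates the two-proportion z-test via Cohen's h (an arcsine-transformed effect size), so it lands within a fraction of a percent of G*Power's figures rather than matching them exactly:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h for conversion rates of 2% vs 2.2%.
h = proportion_effectsize(0.022, 0.02)

analysis = NormalIndPower()
for power in (0.80, 0.95):
    n_per_group = analysis.solve_power(
        effect_size=h, alpha=0.05, power=power,
        ratio=1.0, alternative="two-sided",
    )
    # Report the total across both branches, to compare with the table above.
    print(f"power={power}: total sample size = {2 * n_per_group:,.0f}")

This prints roughly 161,000 and 267,000, close to G*Power's 161,364 and 267,154.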

Conversion rates of 2% are typical for many retail sites. It's very rare that any technology will increase the conversion rate greatly. A 10% increase from 2% to 2.2% would be wonderful for a retailer and they'd be celebrating. Because of these numbers, you need a lot of traffic to make A/B tests work in retail, which means A/B tests can really only be used by large retailers.

Why not just reduce the power and so reduce the sample size? Because that makes the results of the test less reliable; at some point, you might as well just flip a coin instead of running a test. A lot of A/B tests are run when a retailer is testing new ideas or new paid-for technologies. An A/B test is there to provide a data-oriented view of whether the new thing works or not. The thresholds are there to give you a known confidence in the test results. 

After a test is done, or even partway through the test, we can calculate the observed power. Let's use G*Power and the numbers from the first row of the table above, but assume a sample size of 120,000. This gives a power of 0.67, way below what's useful and too close to a 50-50 split. Of course, it's possible that we observe a smaller effect than expected, and you can experiment with G*Power to vary the effect size and see the effect on power.
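
The post-test calculation works the same way in statsmodels: fix the sample size and solve for power. Continuing the sketch above (120,000 total samples means 60,000 per branch, assuming an even split):

# Observed power with 120,000 total samples (60,000 per branch),
# reusing the effect size h and analysis object from the sketch above.
observed_power = analysis.power(
    effect_size=h, nobs1=60_000, alpha=0.05, alternative="two-sided",
)
print(f"power = {observed_power:.2f}")  # roughly 0.67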

A nightmare scenario

Let's imagine you're an analyst at a large retail company. There's a new technology which costs $500,000 a year to implement. You've been asked to evaluate the technology using an A/B test. Your conversion rate is 2% and the new technology promises a conversion rate of 2.2%. You set \(\alpha\) to 0.05, and power to 0.8 and calculate a sample size (which also gives you a test duration). The null hypothesis is that there is no effect (conversion rate of 2%) and the alternate hypothesis is that the conversion rate is 2.2%.

Your boss will ask you "how sure are you of these results?". If you say there's no effect, they will ask you "how sure are you there's no effect?", if you say there is an effect, they will ask you "how sure are you there is an effect"? Think for a moment how you'd ideally like to answer these questions (100% sure is off the cards). The level of surety you can offer depends on your website traffic and the test.

When the test is over, you calculate a p-value of 0.01, which is less than your \(\alpha\), so you reject the null hypothesis. In other words, you think there's an effect. Next, you calculate power. Let's say you get 0.75. Your threshold for accepting a conversion rate of 2.2% is 0.8. What's next?

It's quite possible that the technology works, but doesn't increase the conversion rate all the way to 2.2%. It might increase conversion to 2.05% or 2.1%, for example. These kinds of conversion rate lifts might not justify the cost of the technology.

What do you do?

You have four choices, each with positives and negatives.

  1. Reject the new technology because it didn't pass the test. This is a fast decision, but you run the risk of foregoing technology that would have helped the business.
  2. Carry on with the test until it reaches your desired power. Technically, the best, but it may take more time than you have available.
  3. Accept the technology with the lower power. This is a risky bet, and it's very dangerous to do regularly (lower thresholds mean you make more mistakes).
  4. Try a test with a lower lift, say an alternate hypothesis that the conversion rate is 2.1%.

None of these options are great. You need strong statistics to decide on the right way forward for your business.

(A/B testing was painted as an easy-to-use wonder technique. The reality is, it just isn't.)

What's a good value?

The "industry standard" power is 80%, but where does this come from? It's actually a quote from Michael Cohen in his 1988 book "Statistical Power Analysis for the Behavioral Sciences", he said if you're stuck and can't figure out what the power should be, use 80% as a last result. Somehow the value of last resort has become an unthinking industry standard. But what value should you chose?

Let's go back to the definitions of \( \alpha \) and \( \beta \) (remember, \( \beta \) is 1 - power). \( \alpha \) corresponds to the probability of a false positive; \( \beta \) corresponds to the probability of a false negative. How do you balance these two kinds of error? Do you think a false positive is just as bad as a false negative, or better, or worse? The industry standard choices for \( \alpha \) and \( \beta \) are 0.05 and 0.20 (1 - 0.8), which implies we think a false positive is four times worse than a false negative. Is that what you intended? Is that ratio appropriate for your business?

In retail, adding new technologies to a website comes with a cost, but there's also the risk of forgoing revenue if you get a false negative. I'm tempted to advise you to choose the same value, 0.05, for both \( \alpha \) and \( \beta \) (which gives a power of 95%). This does increase the sample size and may take it beyond the reach of some websites. If you're bumping up against the limits of your traffic when designing tests, it's probably better to use something other than an A/B test.

Why is power so misunderstood?

Conceptually, power is quite simple (the probability of making a true positive observation), but it's wrapped up with the procedure for defining and using a null hypothesis test. Frankly, the whole null hypothesis setup is highly complex and unsatisfactory (Bayesian statistics may offer a better approach). My gut feeling is that \( \alpha \) is easy to understand, but once you get into the full language of null hypothesis testing, people get left behind, which means they don't understand power.

Not understanding power leaves you prone to making bad mistakes, like under-powering tests. An underpowered test might mean you reject technologies that could increase your conversion rate. Conversely, underpowered tests can lead you to claim a bigger effect than is really there. Overall, it leaves you vulnerable to making the wrong decision.