Wednesday, February 11, 2026

Data is the new Lego

In 2018, I wrote a company blog post. As with most corporate content of this type, it was eventually deleted. But I liked what I wrote and I want to keep it, so I found it on the Wayback Machine and I'm reposting it here.

Reposting it serves another purpose. My post was plagiarized by someone who claimed it as their own. I want to own my work and not have other people claim it as theirs. Plagiarist have an easier time cheating if the original is hidden away on the Wayback Machine.

You can find the piece on the Wayback Machine here: https://web.archive.org/web/20190820192824/https://www.truefit.com/en/Blog/August-2018/Data-is-the-New-Lego

It was written for the company True Fit: https://truefit.com/

(Gemini)

Here's the post.

-----------

When I was a child, I used to love playing with Lego, or “Legos” as my American friends often say; my brothers and I built spaceships and trucks and houses and animals. As time went on, our creations became more ambitious, functional, and lifelike. We could each have insisted our Lego was our own, but by pooling resources, we collectively went further. Family and friends gave us Lego including unusual and hard to find bricks, which enabled us to make more accurate models. We were growing up too, and as our play became more sophisticated, we learned how to build better models.

I’m not young anymore and my bones creak on cold mornings, but I still remember playing with Lego as I go to work each morning and play with data to build models. Using data to solve real world problems, like style, fit, and size recommendations, is surprisingly like my childhood Lego memories. To build something useful you need lots of data, data diversity, and the knowledge to build the right models in the right way.

If you don’t have enough Lego bricks, the things you build aren’t realistic; the model is crude, the colors don’t match, and there are gaps. It’s the same with machine learning and computer models; if you don’t have enough data, your models are crude, and you have quantitative and qualitative errors. The history of computer modeling is rife with examples of people making bad decisions using models made with incomplete data. In dealing with style, fit, and size recommendations, not enough data means giving bad advice because your models are too crude to accurately model people and garments. This is where pooling data wins; by pooling our Lego, my brothers and I could build what we wanted; in fashion, by pooling data from many retailers, you can build better models because you have a more complete picture of consumers’ behavior and the unique style characteristics, size, and shape of garments.

To build a good quality Lego model you need a diversity of pieces – models built with just the standard 2x4 bricks are crude and inaccurate. This is where getting Lego from friends and family was so useful – we got more diverse bricks that let us build more accurate models. In fashion, you need a diversity of data on people and garments too. Simply extrapolating from the average size to plus sizes is like using 2x4 bricks for everything; one size does not fit all and you end up with something that isn’t accurate for users who aren’t ‘average’.

Simply assuming US consumers and apparel are the same as German consumers and apparel is like using the same few Lego bricks for different models; different markets need different data. Simply believing a $20,000 dress fits the same as a $100 dress is like building Lego models when the special pieces you need are missing; it’s the kind of thing you do when you don’t have the data you need. In fact, having data on $100 and $20,000 dresses lets you build richer models that make better recommendations for all dresses. The key to good modeling is having data on a diverse set of consumers and garments.

Young children make crude Lego models, the colors don’t match and the shapes are wrong; older children build working models with careful color schemes. A similar thing happens with data and algorithms. As you get to know and manipulate your data, your algorithms, and their interactions, you come to understand their limitations and you strive to build something better. As time goes by, increasing volumes of data point out the flaws in your work and you fix them– your models become better and better. In other words, the learning curve applies to building Lego and computer modeling.

It might be a brutal childhood truth, but the children with the most Lego, the best pieces, and the time to play produce the best models. The same brutal truth applies for any AI based machine learning or computer modeling project. The projects with the biggest data volumes, the most diverse data, and the best teams to use that data will produce the most accurate models. That’s why it’s fun to play with the massive data set from True Fit’s fashion Genome: it includes data from the largest number of retailers and brands; there’s a diversity of country, people sizes, and garments; and my colleagues know what they’re doing. There’s the added benefit of doing something novel and helping people find clothes they’ll love, that suit their personal style preferences, and will fit and flatter them – Lego models only make a few people happy but style, fit, and size recommendations can make millions of people happier by helping them connect more easily with the clothes and shoes that better express who they are and how they feel. Coming to work each day, it’s like playing with the world’s largest Lego set and it makes me happy.

Sometimes late at night, when it’s quiet and there’s no-one around to judge, I quietly put together Lego models. It’s a consoling and comforting reminder of my childhood, like eating ice cream, playing chase, and England losing in the World Cup. Lego has taught me a lot about data and models and collaboration. But there’s one big difference between building Lego models with my brothers and building computer models with my colleagues: I don’t fight with my colleagues quite so often.

True Fit is determined to improve the customer shopping experience by using its rich data collection from thousands of brands to provide accurate size recommendations. A larger collection of Lego increases the size of scope of projects that can be built just as a vast data collection increases the scope of customers who are provided with accurate style, fit, and size recommendations. To learn more about True Fit's data collection, called the Genome, visit here.

The perceptron

Why study the perceptron?

Perceptrons were one of the first learning systems and an important early stepping-stone to most recent AI innovations. That alone would be motivation enough to study them, however the reaction of the press, and the consequences of the hype, are a cautionary tale for us in 2026.

I'm going to share with you the why and the how of the perceptron, with some of the consequences of the hype.

Why do we care about systems that learn?

Go back to the 1950s, why would you care about a system that can learn? There’s the obvious coolness of it, but there are important real-world applications.

Photo analysts study reconnaissance photos looking for hidden bunkers or other items of military significance. The work is tiring and boring at times, but it’s hard to automate because it relies on human interpretation rather than a hard and fast set of rules. The “enemy” constantly changes how they disguise their installations, so whoever or whatever is analyzing photos must continually learn.

A similar problem occurs in post offices. If a post office wants to automate letter sorting, it has to automate reading handwritten addresses. Each person’s handwriting is different, which means creating definitive rules about letter or number formation is hard.

A learning system can adapt itself to new information and so stay productive when things change. In practice, this means it can be taught to recognize a new way a country is disguising a bunker or a new way someone is writing the number 5. It doesn’t require its creators to continually tweak settings. Of course, these automated systems can process letters or images etc. much faster (and cheaper) than human beings, which makes them very attractive.

Given the demand existed, how can you create a system that learns?

How do biological systems learn?

The obvious learning systems are biological. By the 1950s, we’d made some progress understanding how brains work, in particular, we had a basic understanding of how neurons worked, which are the lowest level of processing in the brain.

Neurons take sensory input signals from dendrites into the soma, where the input is “processed”. If the input signal crosses some threshold, the soma fires an output signal (an action potential) through an axon. Neurons learn by changing the way they “weight” different dendrite signals, so changing the conditions under which they fire.

The output of one neuron could be the input to another neuron and real brains have layers of processing.

The picture below shows the arrangement for a single neuron.

(Gemini)

My explanation of how neurons work is very simplistic and in reality, it’s much more complicated. In real brains, neurons learn together and there are other biological processes going on involving dendrites. If you want to read more about biological neurons, here are some good references:

The perceptron

In 1957, at the Cornell Aeronautical Laboratory in Buffalo, New York, the psychologist Frank Rosenblatt was studying human learning (specifically, the neuron) and trying to replicate it in software and hardware. His team built a prototype system, called the perceptron, that could “learn” in a very limited sense. The learning task was simple image classification.

(Rosenblatt and the perceptron. National Museum of the U.S. Navy, Public domain, via Wikimedia Commons)

The Mark I Perceptron input was a 20x20 photocell array; a photocell is very limited form of digital camera. These 400 inputs were fed to “association units” that weighted the inputs. The weights were set by potentiometers that were adjusted by electric motors. Importantly, the initial weights were random to avoid bias. The system summed the weights and used a simple threshold algorithm (response units) to decide the image classification, if the sum of the weighted signals was above the threshold the algorithm output a signal (a true output), if the sum of the weighted signals was below the threshold, the algorithm did not output a signal (a false output). Technically, the name of the threshold function is a Heaviside step function. If the perceptron made an error, the relevant weights were adjusted. The perceptron required 50 training iterations to reliably distinguish between squares and triangles.

(From the perceptron user manual.)

In 2026, this sounds really basic, but in 1957 it was a breakthrough. Rosenblatt and his team had demonstrated that a machine could learn and change how it “sees” the world.

References:

The perceptron theory

Here’s a simple representation of the perceptron. The inputs from the photocell are fed in and assigned weights. There’s a bias term to account for bias in the photocells, for example, the photocells might give a very small signal instead of zero when there’s no image. The weighted inputs (and the bias) are summed. If the weighted sum exceeds some threshold, the perceptron fires, if not, it doesn’t.

(The perceptron is a linear classifier, meaning it can only separates point on a hyperplane. In two dimensions, this means it can only separate points using a straight line.)

Mathematically, this is how it works.

\[u = \sum w_i x_i + b \]

\[y = f(u(x)) = \begin{cases} 1, & \text{if } u(x) > \theta \\ 0, & \text{otherwise} \end{cases}\]

In the vector notation used in machine learning, the equations are usually written:

\[y = h( \textbf{ w} \cdot \textbf{x } + b ) \]

where h is the Heaviside step function.

So far, this is pretty simple, but how does it learn? Rosenblatt insisted on starting training from a random state, so that gives us a starting point. Then we expose the perceptron to some training data where we know what the output should be (the data is labeled). Here’s how we update the weights:

\[ w_i \leftarrow w_i + \Delta w_i \]

\[ \Delta w_i = \eta(t - o)x_i \]

where:

$t$ is the target or correct output
$o$ is the measured output
$\eta$ is the training rate and $ 0 \lt \eta \leq 1$

We update the weights and try again in an iterative loop. This continues until we can successfully predict the training data set within a certain error, or we’ve reached a set number of iterations, or we’re seeing no improvement. This is similar to how machine learning systems work today.

References:

Perceptron problems

There were lots of issues with the perceptron in its original form. Let’s start with the worst: the hype.

Rosenblatt gave interviews to the press on his system and they ran with it, but not in a good way. A 1958 New York Times article was typical, the headline read “NEW NAVY DEVICE LEARNS BY DOING; Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser”, with a lede: “The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” Other press stories were similarly sensational and hyped the technology. The press very much set the expectation that walking, talking AIs were just around the corner. Of course, the technology couldn’t deliver what the press forecast, which helped lead to a loss of confidence.

The technical problems varied from the straightforward to the severe.

The original perceptron used a simple threshold to decide whether to fire or not, but this caused problems for training weights. Most important training algorithms use derivatives (for example, gradient descent). A simple threshold isn’t differentiable, which means it can’t be used in these kinds of training algorithms. Fortunately, this is relatively easy to fix using a differentiable function to replace the simple threshold. There are a number of possible differentiable functions, and a popular choice is the sigmoid function. (The function that decides whether to fire or not is now called the activation function).

A more serious problem is the logical limitations of the simple perceptron. As Minksy and Papert showed in 1969, there are some logical structures (most notably, XOR), you can’t build using the simple single-layer perceptron architecture. Although multi-layer networks solve these problems, the Minsky and Papert book and their papers significantly damaged research in this area, as we'll see.

This is only a summary of the difficulties the perceptron faced. For a fuller description, check out: https://yuxi.ml/essays/posts/perceptron-controversy/

What happened next

By the early 1970s, the hype bubble had burst. Minsky and Papert’s book had an impact and governments found disappointing results from funding perceptron-based projects; projects promised big results, but in reality, very little was produced. Governmental patience eventually wore thin and eventually they concluded this form of AI research wasn't worth funding. The research money went elsewhere leading to the first “AI Winter” which lasted for a decade or so.

Sadly, AI experienced another hype bubble and collapse in the late 1980s, a second "AI Winter". As a whole, AI research began to get a bad reputation.

The “AI Winters” bled talent and money away from neural network development, but research still continued. Although multi-layer networks had been developed by the 1960s, it wasn’t known how to train them until the Rumelhart, Hinton, and Williams 1986 paper “Learning representations by back-propagating errors” [https://www.nature.com/articles/323533a0] popularized the back propagation method. Convolutional Neural Networks (CNNs) using back propagation and a convolutional structure were demonstrated in 1989. With these technologies as the backbone, LLMs were developed starting in the mid-to-late 2010s. It’s only the enormous success of LLMs that has brought a flood of money into AI research and a resurgence of interest in its origins.

Rosenblatt had a wide variety of research interests, including astronomy and photometry (measuring light). By any measure he was a genius. Unfortunately, in 1971 he died at the age of 43 in a boating accident. His death was just a few years into the first "AI Winter", so he saw the hype and the subsequent bubble bursting. Sadly, he never go to see how the field eventually developed.

Thoughts on the story

The original perceptron was very much based on what had gone before, but it was a breakthrough and ahead of its time, which was part of the problem. The necessary technology wasn’t there to advance quickly. Unfortunately, the hype in the press, fed by Rosenblatt and others, set unrealistic expectations. While great for short-term research funding, it was terrible for the long-term when the hype bubble burst.

AI as a whole has been prone to hype cycles through its entire existence. It's no wonder there's a lot of discussion online about the latest AI bubble bursting. My feeling is, it is different this time, but we're still in a bubble and people are going to get hurt when it eventually pops.

Monday, February 9, 2026

Learning by hand is better than learning by AI

Accelerating learning with AI?

Recently, I've been learning a new LLM API from a vendor. There's a ton of documentation to wade through to get to what I need to know and the vendor's examples are overly detailed. In other words, it's costly to figure out how to use their API.

(Gemini)

I decided to use code gen to get me up and running quickly. In the process, I found out how to speed up learning, but equally important, I found out what not to do.

Code gen everywhere!

My first thought was to code gen the entire problem and figure out what was going on from the code. This didn't work so well.

The code worked and gave me the answer I expected, but there were two problems. Firstly, the code was bloated and secondly, it wasn't clear why it was doing what it was doing. The bloated code made it hard to wade through and zero in on what I wanted. It wasn't clear to me why it had split something into two operations, despite code gen commenting the code. Because I didn't know the vendor's API, I couldn't be sure the code was correct; it didn't look right, but was it?

Hand coding wins - mostly

I recoded the whole thing by hand the old fashioned way, but using the generated code as an inspiration (what function to call and what arguments to use). I tried the LLM calls in the way I thought they should work, but the code didn't work the way I thought it would. On the upside, the error message I got was very helpful and I tracked down why it didn't work. Now I knew why code gen had made two LLM calls instead of one and I knew what outputs and inputs I should use.

The next step was properly formatting the final output. Foolishly, I tried code gen again. It gave me code, but once again, I couldn't follow why it was doing what it was doing. I went back looking at the data structure in detail and moved forward by hand.

But code gen was still helpful. I used it to help me fill in API argument calls and to build a Pydantic data structure. I also used it to format my code. Yes, this isn't as helpful as I'd hoped, but it's still something and it still made things easier for me.

Why code gen didn't work fully

Code gen created functioning code, not tutorial code, so the comments it generated weren't appropriate to learn what was going on and why.

Because I didn't know the API, I couldn't tell if code gen was correct. As it turned out, code gen produced code that was overly complex, but it was correct.

Lessons

This experience crystallized some other experiences I've had with AI code gen.

If I didn't care about understanding what's going on underneath, code gen would be OK. It would work perfectly well for a demo. Where things start to go wrong is if you're building a production system where performance matters or a system that will be long-lived - in these cases the why of coding matters.

Code generation is an accelerator if you know what you're doing. If you don't know the libraries (or language) you're using, you're on thin ice. Eventually, something bad is going to happen and you won't know how to fix it.

Wednesday, January 14, 2026

Replit vs. Cursor - who wins?

Building Business Apps - Cursor vs. Replit

For a while now, I've been very interested in using AI to build BI-type apps. I know you can do it with Cursor, but it requires a strong technical background. I've heard people have had great success with Replit, so I thought I would give it a go. I decided to build the same app in both Cursor and Replit. It's a kind of battle of the tools.

(Gemini.)

For my comparison contest. I chose to build a simple app that shows the weather and news for a given location.

Round 1: getting started/ease of use

I gave both contenders the same prompt and asked them to build me an app. Both tools gave me an app in about the same time. However, I found Replit much, much easier to use; by contrast, Cursor can be tough to get started with.

Round 1 is a decisive victory for Replit.

Round 2: building the app

Both apps had problems and I needed to tweak them to get them working. I found I had to give Replit multiple prompts to fix problems; problems that just didn't occur in Cursor. Replit got stuck on some simple things and I had to get creative with prompting to get round them, all the while my AI token consumption went up. Cursor didn't need this level of imaginative prompting.

I'm giving this round to Cursor on points.

Round 3: editing the visual layout

Replit let me edit the visual layout of the app directly, while Cursor did not. I know Cursor has a visual editor, but I just couldn't get it to work. This is of course an ease of use thing, and overall, Replit is easier. For this app, I didn't need to tweak the layout but it's an important consideration.

Round 3 is a decisive victory for Replit.

Round 4: what is the app doing?

I wanted to know what the apps were doing "under the hood" so I wanted to see the code. Cursor is unashamedly a code editor, so it was simple. By contrast, Replit hides the code away and it requires a bit of digging. On a related theme, Cursor is much better at debugging, so it's easier to track down errors.

Round 4 is a victory for Cursor.

Round 5: changing the app under the hood

I wanted to change the app "under the hood", which meant changing some of the code. Cursor generates code that's very well commented, so it's easy to see what's going on. By contrast, Replit's code is sparsely commented and I found it difficult to understand what each file did. Bear in mind though, Replit is trying to be an app creation tool not a code editor.

Round 5 is a victory for Cursor.

Round 6: running the app locally

Both Replit and Cursor did well here. This round is a draw.

Round 7: deploying the app to the web

Replit makes this really easy, There's a simple process to go through and your app is deployed. Cursor doesn't do deployment and the deployment services like Render have a learning curve.

Round 7 is a victory for Replit.

A disturbing thought

I was looking at how both apps turned out and something struck me when I was looking at the code for the Cursor app: what services did these apps use? I didn't specify what APIs I wanted to use, the AIs chose for me.

Both of these apps converted an address to a latitude/longitude, showed a map, got local news, got a climate chart for the year, and so on. But what APIs (services) did they use underneath? What were the terms and conditions of the services? What are the limitations of the services? The answer is: you have to find out for yourself. Which means either asking the AI or digging into the code.

If I sign up for an API key, I have to go to a website, read what the service offers, and accept the terms and conditions. For example, some APIs forbid commercial use, some are very rate limited, and others require an acknowledgment in the app or web page. If you build an app using an AI, how do you know what you've agreed to? Will your app get rate limited? Will you get banned for using the API service inappropriately? What are the risks? It seems like a feeble defense to say "my AI made me do it".

It looks like the onus is on you to figure this out, which is definitely a problem.

Who won?

Looking at the results of the contest, my answer is: it depends on your end goal.

If you want a tool to let you build a "simplish" app and you don't have much, if any, coding experience, then Replit is the clear winner. On the downside, it will be very difficult to add more complex features later.

If you want to build a more complex app and you have coding experience, then Cursor wins. Cursor also wins if you think that you'll need to edit the app code in the future.

What would I chose for internal reporting or BI-type development? On balance, Cursor, but it's not a clear victory. Here's my logic.

I love the idea of democratizing analysis. I like giving users the power to answer their own questions. This would appear to favor Replit, but...
I worry about maintainability and extendability. I've seen too many cases where a one-off app has become business critical and no-one knows how to maintain it. This favors Cursor because in my view, it produces more maintainable code.

Future directions

The ultimate goal is a tool that lets a non-coder quickly and simply build an app, even a complex one, that's maintainable in the future. This could be building an app for internal use (within an organization) or external use. The app development process will be a combination of natural language prompting and visual editing. Right now, we're really, really close to that goal and it's probably arriving later in 2026.

I'm sure some readers will feel I'm being harsh when I say Replit isn't quite there yet; for me, it needs less prompting and better code layout and documentation. Cursor has a way to go and I'm not convinced they're going in this direction (they may well stay focused on code development).

In my view, the bigger problem is not app development but data availability. To build internal apps, the internal data has to be available, which means it has to be well-described and in a place where the app development program (and the app itself) can access it. In many organizations, their data isn't as well organized as it should be (to put it politely). It's like having a car but not being able to find gas (or only finding the wrong gas), it makes the car useless. To make internal app development really fly, internal data has to be organized "good enough". We may well see more focus on data organization within companies as a result.

Both Cursor hand Replit have the advantage that they both ultimately use common languages and packages. This means that the skills to maintain apps created using them are common in any company with programmers or analysts on staff. Contrast that with BI tools where the skills and knowledge of how to use the BI tools are only in the BI group. I can see tools like Cursor and Replit encroaching more and more into BI territory, especially as app development becomes democratized.

Friday, January 9, 2026

The Siren Song

A happy siren accident

I was searching the web for something, and by a happy accident of mistyping, I found a completely unrelated and wonderful event. What I saw inspired this blog post.

I'm going to write about sirens, those loud things that scare you into taking your safety seriously.

(World War II British siren, Robert Jarvis, via Wikimedia Commons. Creative Commons Attribution 3.0 Unported license.)

Siren etmology

The word siren comes from ancient Greek mythology. Sirens were female, human-like beings who used their voices to lure young men to their deaths. In the Jason and the argonauts story, the crew had to sail passed an island of sirens who sang to lure the ship onto the rocks. The crew had Orpheus play his lyre to drown them out so they could pass safely. Unfortunately, one man, Butes, succumbed to the sirens' song and went overboard to reach them.

(The Siren by John Willam Waterhouse, via Wikimedia Commons. Note the siren's fishy feet.)

From this legend, we get the use of the word siren to describe a beautiful woman who's dangerous, and also its use to describe a device for making loud tones. I'm going to skip the sexist use and focus on noisy devices.

Of course, I need to mention the reversal here: sirens in ancient Greece used beautiful sounds to lure you to your death, moderns sirens use ugly sounds to save your life.

What's a siren?

A siren is a device that makes loud and piercing noises to alert people of a danger. You can use pretty much any mechanism you like to produce a noise, but in modern times, it tends to be rotating disks pushing air through holes, or electronics. Modern sirens produce relatively 'simple' sounds compared to musical instruments, adding to their impact.

How they work

I'm going to focus on mechanical slotted disk sirens because they're what most people associate with the word siren. You can make any sound you like with electronics, but that's boring.

Sound is a pressure wave moving through the air (or other medium). It consists of a wave of compression and rarefaction, meaning the air is compressed (higher pressure) and decompressed (lower pressure). Sound is movement within the air, wind is the movement of the air itself. This is an important distinction for a siren as we'll see.

To make a noise, we have to set up a sound wave. Moving air alone won't work. For instance, blowing air through a straw won't make a noise. If we want to turn blowing air through a straw into a noise (and so create a simple siren), we have to create a compression wave. We can do it using an electric drill.

This article in Scientific American (https://www.scientificamerican.com/article/building-a-disk-siren/) describes the process. To simplify, create a disk with holes around the edge. Mount it on a electric drill and spin it up. Have a child blow through a straw above the holes in the rotating disk. You should hear a siren like sound.

Obviously, operating an electric drill close to a child's face could be an interesting experience, so buyer beware.

Blowing through the straw doesn't make a noise, but the holes in the rotating disk stop and start the airflow, so creating a compression wave and hence a sound. Because the holes are equally spaced and the drill is rotating at a constant angular velocity, you hear what's approximately a single frequency. The faster the drill goes, the higher the frequency.

To make this much louder, we need to push a lot more air through the holes. Instead of a child blowing through a straw, we need an electrical fan pushing air through holes. That's what electro-mechnical sirens do.

In most sirens, it's the fan that rotates and the holes remain stationary, The holes are placed at the edge of a stationary disk called a stator. It looks something like this.

(See https://www.thingiverse.com/thing:4889851.)

The holes are often called ports. How many there are and how fast the rotor spins determines the frequency.

The rotor both blows air through the holes and blocks the holes, creating a pressure wave. The rotor looks something like this.

(See https://www.youtube.com/watch?v=XAfvOjdZpkg)

Note the design. The 'fins' push air out of the holes when the holes in the stator and rotator line up. The fins also block the holes as the rotor rotates. So the rotor alternately blocks the holes and pushes air through them. This is what creates the pressure wave and hence the sound.

The design I've shown here creates a single tone. Most sirens create two tones, so they consist of either two rotors and stators each producing a separate tone, or a single rotor and stator in a 'sandwich'. I've shown both designs below. The 'sandwich' terminology is mine, so don't go searching for it!

(Siren that produces different tones at different ends. Srikantasarangi, CC0, via Wikimedia Commons)

('Sandwich' design for two-tone sirens, from airraidsirens.com. The tones are created at the same end of the siren.)

Siren sounds

The tone a siren creates depends on the speed of the motor, the number of holes, and the diameter of the stator/rotor. As the motor starts up, its angular velocity increases from zero, which means the frequency the siren produces increases. Conversely, as the motor slows down to a stop, the frequency drops. By turning the power off and on, or by varying the power to the siren, we can create a moaning or wailing effect.

Sirens don't create a pure sound sine wave, but it's fairly close. They produce a roughly triangular sound wave that has lots of harmonics (see https://www.airraidsirens.net/tech_howtheywork.html). Because of this distinct sound wave shape, a siren is clearly an artificial sound and that's what the authorities want.

A single tone is OK, but you can achieve a stronger psychological effect on the population with two tones or more. Sound waves interfere with one another to create new frequencies; with a two-tone siren, you can create what's called a minor third, a new tone. Because a minor third is musically a sad or downbeat sound, siren designers often deliberately design for it.

Lower frequencies travel further than higher frequencies, which is why sirens tend to use them. On the flip side, it's harder for humans to locate the source of lower frequency sounds, but that doesn't really matter for a warning. You don't need people to know where the siren is, you just need them to hear it and run. These lower frequencies are typically in the range 400-500 Hz, with the mid-range 450 Hz generally considered the most annoying.

World War II - wailing Winne and moaning Minnie

The most famous sirens of World War II are the air raid sirens used in the UK. They're mostly associated with the London Blitz, but they were used in other British cities. They used two different signals: one to alert for an air-raid and the other the all-clear.

Here's a recording of the air-raid alert sound (first minute). Note the wailing sound caused by varying the power to the siren. These sirens used lower frequencies, designed to be penetrating, and used a minor third for a spooky downbeat sound.

Imagine sirens like this going off all at once all over a city to warn you that planes are coming to drop bombs on you.

The wailing sounds led to the sirens being called wailing Winnne or moaning Minnie. The same names were also used for Nazi weaponry too, so be careful of your internet searches.

Here's the all clear signal (same video, but towards the end). It's a continuous tone.

In 2012, the British band Public Service Broadcasting released a track called "London Can Take It", based on a 1940 British propaganda film that was narrated by the American Quentin Reynolds. It starts with an air-raid siren. Is this the only pop-song that uses an air-raid siren?

Post WWII - civil defense in different forms

During the Cold War, sirens were deployed in many cities to warn of an attack, though I'm not sure how useful hiding from a nuclear weapon would be.

Over the same time period, siren usage was extended to include warning of danger from natural disasters like tornadoes or flooding. As you might expect, the technology became more sophisticated and more compact using electronics to generate sound, meaning smaller sirens were possible as were different sounds. Smaller sirens were deployed on emergency vehicles and you've certainly heard them.

(Siren mounted on a fire truck. FiremanKurt, CC BY-SA 3.0, via Wikimedia Commons)

Despite all this change, the fundamental acoustics stay the same, which means that sirens that warn the population (and so cover a wide area) must have large horn-type 'speakers' to broadcast their signals. In other words, warning sirens are big.

Build your own siren

There are loads of sites on the web that show you how you can build your own air-raid type siren. Most of them assume you've got access to a beefy electrical motor (like the ones used to power grinders), though a few have designs you can use with an electric drill.

Several sites will tell you how to build an air-raid siren from wood, but the skill level is quite high. I'm a little put off by designs that require me to cut a perfect circle with a jigsaw and balance it carefully. I'm not sure my woodworking skills are up to it.

Other sites have instructions for 3D-printing the components. This seems more doable, but the designs are mostly for sirens that can fit on an electric drill. Even though this seems easier than woodworking, there are some tricky engineering stages.

The other problem is of course the noise. If you get it right, your home-built siren is going to be loud. I'm sure my neighbors would be pleased to hear my siren on a quiet Sunday afternoon.

SirenCon

My happy internet accident was searching for a conference but coming across the similarly named SirenCon, a conference for people who like sirens (https://www.sirencon.com/home). I spent more time than I should clicking around their site and finding out more.

Think for a minute about how this works. SirenCon attendees will want to set off sirens which is not good news for the neighbors. Where in New York City could you hold it, whereabouts in any big city could you hold it? The same logic applies to small towns and the suburbs. Where would be a good place to hold a loud conference?

The answer unsurprisingly is in the countryside. For SirenCon, they meet once a year in the woods in rural Wisconsin, in Rhindelander. Their location seems to be away from any population centers.

Each year, people come and show off their sirens. The 2025 siren list is here: https://www.sirencon.com/the-2025-line-up. Rather wonderfully, there's live streaming and you can watch and listen too seven and a half hours of siren fun here: https://www.youtube.com/live/ZV24Ioriar4

I think it's great that people with a niche interest like this can get together and share their passion. Good luck to them and I hope they have a wonderful 2026 SirenCon.

I've got the power: what statistical power means

Important, but overlooked

Power is a crucial number to understand for hypothesis tests, but sadly, many courses omit it and it's often poorly understood if it's understood at all. To be clear, if you're doing any kind of A/B testing, you have to understand power.

In this blog post, I'm going to teach you all about power.

Hypothesis testing

All A/B tests, all randomized control trials (RCTs), and many other forms of testing are ultimately hypothesis tests; I've blogged about what this means before. To briefly summarize and simplify, we make a statement and measure the evidence in favor or against the statement using thresholds to make our decision.

With any hypothesis test, there are four possible outcomes (using simplified language):

The null hypothesis is actually true (there is no effect)

We say there is no effect (true negative)
We say there is an effect (false positive)

The null hypothesis is actually false (there is an effect)

We say there is no effect (false negative)
We say there is an effect (true positive)

I've summarized the possibilities in the table below.

		Null Hypothesis is
		True	False
Decision about null hypothesis	Fail to reject	True negative Correct inference Probability threshold= 1 - $ \alpha $	False negative Type II error Probability threshold= $ \beta $
Decision about null hypothesis	Reject	False positive Type I error Probability threshold = $ \alpha $	True positive Correct inference Probability threshold = Power = 1 - $ \beta $

A lot of attention goes on $\alpha$, called the significance or significance level, which tells us the probability of a false positive. By contrast, power is the probability of detecting an effect if it's really there (true positive), sadly it doesn't get nearly the same level of focus.

By the way, there's some needless complexity here. It would seem more sensible for the two threshold numbers to be $ \alpha $ and $ \beta $ because they're defined very similarly (false positive and false negative). Unfortunately, statisticians tend to use power rather than $ \beta $.

In pictures

To get a visual sense of what power is, let's look at how a null hypothesis test works in pictures. Firstly, we assume the null is true and we draw out acceptance and rejection regions on the probability distribution (first chart). To reject the null, our test results have to land in the red rejection regions in the top chart.

Now we assume the alternate hypothesis is true (second chart). We want to land in the blue region in the second chart, and we want a certain probability (power), or more, of landing in the blue region.

To be confident there is an effect, we want the power to be as high as possible.

Calculating power - before and after

Before we run a test, we calculate the sample size we need based on a couple of factors, including the power we want the test to have. For reasons I'll explain later, 80% or 0.8 is a common choice.

Once we've run the test and we have the rest results, we then calculate the actual power based on the data we've recorded in our test. It's very common for the actual power to be different from what we specified in our test design. If the actual power is too low, that may mean we have to continue the test or redesign it.

Unfortunately, power is hard to calculate; there are no convenient closed-form formula and to make matters worse, some of the websites that offer power and sample size calculations give incorrect results. The G*Power package is probably the easiest tool for most people to use, though there are convenient libraries in R and Python that will calculate power for you. If you're going to understand power, you really do need to understand statistics.

To make all this understandable, let me walk you through a sample size calculation for a conversion rate A/B test for a website.

A/B tests are typically large with thousands of samples, which means we're in z-test territory rather than t-test.
We also need to decide what we're testing for. A one-sided test is testing for a difference in one direction only, either greater than or less than, a two-sided test tests for a difference (in either direction). Two-sided tests are more common because they're more informative. Some authors use the term one-tailed and two-tailed instead of one-sided or two-sided.
Now we need to define the thresholds for our test, which are $ \alpha $ and power. Common values are 0.05 and 0.8.
Next up we need to look at the effect, in the conversion test example, we might have a conversion rate of 2% on one branch and expected conversion rate of 2.2% on the other branch.

We can put all this into G*Power and here's what we get.

Test type	Tail(s)	$ \alpha $	Power	Proportion 1	Proportion 2	Sample size
z-test	Two-tailed	0.05	0.8	0.02	0.022	161364
z-test	Two-tailed	0.05	0.95	0.02	0.022	267154

The first row of the table shows a power of 80% which leads to a sample size of 161,364. Increasing the power to 95% gives a sample size 267,154, a big increase and that's a problem. Power varies non-linearly with sample size as I've shown in the screen shot below for this data (from G*Power).

Conversion rates of 2% are typical for many retail sites. It's very rare that any technology will increase the conversion rate greatly. A 10% increase from 2% to 2.2% would be wonderful for a retailer and they'd be celebrating. Because of these numbers, you need a lot of traffic to make A/B tests work in retail, which means A/B tests can really only be used by large retailers.

Why not just reduce power and reduce the sample size? Because that's making the results of the test less reliable; at some point, you might as well just flip a coin instead of running a result. A lot of A/B tests are run when a retailer is testing new ideas or new paid-for technologies. An A/B test is there to provide a data-oriented view of whether the new thing works or not. The thresholds are there to give you a known confidence in the test results.

After a test is done, or even partway through the test, we can can calculate the observed power. Let's use G*Power and the numbers from the first row of the table above, but assume a sample size of 120,000. This gives a power of 0.67, way below what's useful and too close to a 50-50 split. Of course, it's possible that we observe a a smaller effect than expected, and you can experiment with G*Power to vary the effect size and see the affect on power.

A nightmare scenario

Let's imagine you're an analyst at a large retail company. There's a new technology which costs $500,000 a year to implement. You've been asked to evaluate the technology using an A/B test. Your conversion rate is 2% and the new technology promises a conversion rate of 2.2%. You set $\alpha$ to 0.05, and power to 0.8 and calculate a sample size (which also gives you a test duration). The null hypothesis is that there is no effect (conversion rate of 2%) and the alternate hypothesis is that the conversion rate is 2.2%.

Your boss will ask you "how sure are you of these results?". If you say there's no effect, they will ask you "how sure are you there's no effect?", if you say there is an effect, they will ask you "how sure are you there is an effect"? Think for a moment how you'd ideally like to answer these questions (100% sure is off the cards). The level of surety you can offer depends on your website traffic and the test.

When the test is over, you calculate a p-value of 0.01, which is less than your $\alpha$, so you reject the null hypothesis. In other words, you think there's an effect. Next you calculate power. Let's say you get a 0.75. Your threshold for accepting a conversion rate of 2.2% is 0.8. What's next?

It's quite possible that the technology works, but just not increasing the conversion rate to 2.2%. It might increase conversion to 2.05% or 2.1% for example. These kinds of conversion rate lifts might not justify the cost of the technology.

What do you do?

You have four choices, each with positives and negatives.

Reject the new technology because it didn't pass the test. This is a fast decision, but you run the risk of foregoing technology that would have helped the business.
Carry on with the test until it reaches your desired power. Technically, the best, but it may take more time than you have available.
Accept the technology with the lower power. This is a risky bet and very dangerous to do it regularly (lower thresholds mean you make more mistakes).
Try a test with a lower lift, say an alternate hypothesis that the conversion rate is 2.1%.

None of these options are great. You need strong statistics to decide on the right way forward for your business.

(A/B testing was painted as an easy-to-use wonder technique. The reality is, it just isn't.)

What's a good value?

The "industry standard" power is 80%, but where does this come from? It's actually a quote from Michael Cohen in his 1988 book "Statistical Power Analysis for the Behavioral Sciences", he said if you're stuck and can't figure out what the power should be, use 80% as a last result. Somehow the value of last resort has become an unthinking industry standard. But what value should you chose?

Let's go back to the definitions of $ \alpha $ and $ \beta $ (remember, $ \beta $ is 1 - power). $ \alpha $ corresponds to the probability of a false positive, $ \beta $ corresponds to the probability of a false negative. How do you balance these two false results? Do you think a false positive is equally as bad as false negative or do you think it's better or worse? The industry standard choices for $ \alpha $ and $ \beta $ are 0.05 and 0.20 (1 - 0.8), which means we think a false positive is four times worse than a false negative. Is that what you intended? Is that ratio appropriate for your business?

In retail, including new technologies on a website comes with a cost, but there's also the risk of forgoing revenue if you get a false negative. I'm tempted to advise you to choose the same $ \alpha $ and $ \beta $ value of 0.05 (which gives a power of 95%). This does increase the sample size and may take it beyond the reach of some websites. If you're bumping up against the limits of your traffic when designing tests, it's probably better to use something other than an A/B test.

Why is power so misunderstood?

Conceptually it's quite simple (probability of making a true positive observation), but it's wrapped up with the procedure for defining and using a null hypothesis test. Frankly, the whole null hypothesis setup is highly complex and unsatisfactory (Bayesian statistics may offer a better approach). My gut feeling is, $ \alpha $ is easy to understand, but once you get into the full language of a null hypothesis testing, people get left behind, which means they don't understand power.

Not understanding power leaves you prone to making bad mistakes, like under-powering tests. An underpowered test might mean you reject technologies that could increase conversion rate. Conversely, under-powered tests can lead you to claim a bigger effect than is really there. Overall, it leaves you vulnerable to making the wrong decision.

Wednesday, December 31, 2025

Whiskey prices!

Whiskey prices and age of the single malt

I was in a large alcohol supermarket the other day and I was looking at Scotch whiskey prices. I could see the same single malt at 18, 21, and 25 years. What struck me was how non-linear the price was. Like any good data scientist, I collected some data and took a closer look. I ended up taking a deeper dive into the whiskey market as you'll read.

(Gemini. Whiskey that's old enough to drink.)

The data and charts

From an online alcohol seller, I collected data on the retail prices of several single malt Scotch whiskies with different ages, being careful to make a like-for-like comparison and obviously comparing the same bottle size (750 ml). This is more difficult than it sounds as there are many varieties, even within the same single malt brand.

Here are the results. You can interact with this chart through the menu on the right. Yes, 50 year old whiskies do sell for $40,000.

First impressions are that the relationship between price and age is highly non-linear. To see this in more detail, I've redrawn the chart using a log y-axis.

This presentation suggests an exponential relationship between price and age. To confirm it, I did a simple curve fit and got an exponential fit that's very good.

What's going on with the price curve?

The exponential age-price curve is well-known and has been discussed in the literature [1, 2]. What might make the curve exponential? I find the literature a bit confusing here, so I'll offer some descriptions of the whiskey market and whiskey itself.

First off, by definition, whiskey takes a long time to come to market; by definition a 21 year old Scotch has been in a barrel for 21 years. This means distillers are making predictions about the demand for their product far into the future. A 50 year old whiskey on sale today was put into a barrel when Jaws was a new movie and when Microsoft was formed; do you think they could have made an accurate forecast for 2025 demand back then? Of course, the production process means the the supply is finite and relatively inelastic; you can't quickly make more 50 year old whiskey.

How whiskey ages adds to the difficulty distillers have with production. Unlike wine, whiskey ages in the barrel but not in the bottle; an 18 year old single malt bottled in 2019 is the same as an 18 year old single malt bottled in 2025. So once whiskey is bottled, it should be sold as soon as possible to avoid bottle storage costs. This punishes premature bottling; if you over-bottle, you either sell at a reduced price or bear storage costs.

There is a possible exception to whiskey not aging in the bottle known as the Old Bottle Effect (OBE). Expert tasters can taste novel flavors in whiskeys that have spent a long time in the bottle. These tastes are thought to come from oxidation, with oxygen permeating very slowly through the pores in the cork [3]. Generally speaking, oxidation is considered a bad thing for alcoholic drinks, but it seems in the case of whiskey, a little is OK. Viewing the online images of 50 year old whiskey bottles, it looks like they've been bottled recently, so I'm not convinced OBE has any bearing on whiskey prices,

Whiskey is distilled and gets its taste from the barrels, which means that unlike wine, there are no vintage years. Whiskey is unaffected by terroir or the weather; a 21 year old Scotch should taste the same regardless of the year it was bottled, which has a couple of consequences.

If you bottle too much whiskey and have to store it instead of selling it, you won't be able to charge a price premium for the bottles you store (over bottling = higher costs).
On the analysis side, it's possible to compare the prices of the same whiskey over several years; a 25 year old whiskey in 2019 is the same product as a 25 year old whiskey in 2025.

One notable production price driver is evaporation. Each year, 2-5% of whiskey in barrels is lost due to it, which is the so-called "angel's share". Let's assume a 4% annual loss from a 200 liter barrel and see what it does to the amount of whiskey we can sell (I've rounded the numbers to the nearest liter).

Year	Whiskey volume
0	200
3	177
10	133
15	108
18	96
21	85
25	72
30	59
40	39
50	26

By law, whiskey has to be matured for 3 years and in reality, the youngest single malts are 10 years old. To get the same revenue as selling the barrel at 10 years, a 50 year old barrel has to be sold for (133/26) or about 5 times the price. That helps explain the increase with age, but not the extent of the increase.

Storage costs obviously vary linearly with age and we can add in the time value of money (which follows the same type of equation as the angel's share). These costs obviously drive up the cost of older whiskey more, but all the production and supply-side factors still don't get us to an exponential price curve.

Before moving to the demand side, I should talk a bit about the phenomena of independent bottlers, also known as cask brokers or cask investment companies. These are companies that buy whiskey in barrels from the distillers and store the barrels. They either bottle the whiskey themselves or sell the barrels, sometime selling barrels back to the original distiller. As far as I can see, they're operating like a kind of futures market. There are several of these companies, the biggest being Gordon & MacPhail who were founded in 1895. It's not clear to me what effect these companies might have on the supply of single malts.

On the demand side, whiskey has been a boom-and-bust industry.

Up until the late 1970s, there had been a whiskey boom and distilleries had upped production in response. Unfortunately, that led to over-production and the creation of a whiskey 'loch' (by comparison with the wine lake and the butter mountain created by over-production). By the early 1980s, distilleries were closing and the industry was in a significant downturn. This led to a sharp reduction in production. For us in 2025, it means the supply of older whiskey is very much less than demand.

More recently, there was a whiskey boom from the early 2000s to the early 2020s. Demand increased substantially but with a fixed supply. Increased demand + fixed supply = increased price, and as older whiskies are rarer, this suggests that older whiskies appreciate in price more.

It's an anecdotal point, but I seem to remember it was uncommon to see "young" whiskeys less than 18 years old. It's only recently that I've seen lots of 10 year old whiskeys on sale. If this is true, it would be a distillers response to the boom; bottle and sell as much as you can now while demand is high. Bottling whiskies younger will have the side-effect of reducing the supply of older whiskeys.

Of course, the whiskey boom has seen older whiskies become luxury goods. The Veblan effect might be relevant here, this is an observation that when the price of some luxury goods increases, demand increases (the opposite dynamic from "normal" goods). Small additions to a product might drive up the price disproportionately (handbags being a good example), in this case, the small additions would be an increase in the age of the whiskey (say from 40 years to 45 years).

As rare and old whiskies have become more expensive, investors have moved in and bought whiskey not as something to drink, but as something to buy and sell. This has brought more money into the high-end of the market, adding to the price rise.

Let's pull all these strands together. Whiskey seems to be a boom-and-bust industry coupled with long-term production and a fixed supply. Over recent years, there's been a boom in whiskey consumption. Market dynamics suggest that distillers sell now while the market is good, which means bottling earlier, which in turn means fewer older whiskies for the future. Really old whiskies are quite rare because of the industry downturn that occurred decades ago and because of maturation costs. Rareness coupled with wealth and desirability pushes the price up to stratospheric levels for older whiskies. The price-age curve is then a function of supply, distillers bottling decisions, and market demand. That still doesn't get us to the exponential curve, but you can see how we could produce a model to get there.

What about blends, other countries, and science?

If single malt whiskey is becoming unaffordable, what about blends? Like wine, the theory goes that the blender can buy whiskey from different distillers and combine them to produce a superior product. However, like wine, the practice is somewhat different. Blends have been associated with the lower end of the market and I've had some really nasty cheap blended whiskey. At the upper end of the blend market, a 750ml bottle of Johnnie Walker Blue Label retails for about $178, and I've heard it's very good. For comparison, the $178 price tag puts it in the price range of some 18-21 year old whiskies. There are rumors that some lesser-known blends are single malts in all but name, so they might be worth investigating, but at over $150 a bottle, this feels a bit like gambling.

What about whiskey or whisky from other countries? I'm not sure I count bourbon as a Scotch-type whiskey, it kind of feels like its own thing - perhaps it's a separate branch of the whiskey family. Irish whiskey is very good and the market isn't as developed as Scotch, but prices are still high. I've tried Japanese whiskey and I didn't like it, maybe the more expensive stuff is better, but it's an expensive risk. I've seen Indian whiskey, but again the price was too high for me to want to try my luck.

What about engineered whiskey? Whiskey gets its flavor from wooden barrels and if you know the chemistry, you can in principle make an equivalent product much faster. There are several companies trying to do this and they've been trying for several years. The big publicity about these so-called molecular spirits was around 2019, but they've not dented the Scotch market at all and their products aren't widely available. The whiskey "equivalents" I've seen retail for about $40, making them much cheaper than single malts, however, the reviews are mixed. The price point does mean I'm inclined to take a risk; if I can find a bottle, I'll buy one.

Whiskey or whisky?

Both spellings are correct. Usage depends on where you are and the product you're talking about. Whiskey is the Irish spelling and it's the spelling used in the US for this category of spirits. Whisky is the Scottish spelling and it's the spelling they use on their bottles. Because I'm writing in the US, I've used whiskey in this blog post even though I'm writing about the Scottish product. I decided I couldn't win on a spelling choice, so I chose to be consistent.

The future

During the 2000s whiskey boom, investors created new distilleries and re-opened old ones, which suggests production is likely to increase over the coming years. At the same time, the whiskey boom is slowing down and sales are flattening. Are we headed to another whiskey crash? I kind of doubt it, but I think prices will stabilize or even come down slightly for younger whiskies (21 years or younger). Older whiskies will still be rare because of the industry slump in the 1980s and they're likely to remain eye-wateringly expensive.

Of course, I'll be having a glass of single malt in the near future, but I'll try not to bore everyone with whiskey facts!

References

Moroz D, Pecchioli B. Should You Invest in an Old Bottle of Whisky or in a Bottle of Old Whisky? A Hedonic Analysis of Vintage Single Malt Scotch Whisky Prices. Journal of Wine Economics. 2019;14(2):145-163. doi:10.1017/jwe.2019.13
Page, Ian B. "Why do distilleries produce multiple ages of whisky?." Journal of Wine Economics 14.1 (2019): 26-47.
https://hedonism.co.uk/what-obe-old-bottle-effect

Test type	Tail(s)	\( \alpha \)	Power	Proportion 1	Proportion 2	Sample size
z-test	Two-tailed	0.05	0.8	0.02	0.022	161364
z-test	Two-tailed	0.05	0.95	0.02	0.022	267154