Wednesday, January 14, 2026

Replit vs. Cursor - who wins?

Building Business Apps - Cursor vs. Replit

For a while now, I've been very interested in using AI to build BI-type apps. I know you can do it with Cursor, but it requires a strong technical background. I've heard people have had great success with Replit, so I thought I would give it a go. I decided to build the same app in both Cursor and Replit. It's a kind of battle of the tools.

(Gemini.)

For my comparison contest, I chose to build a simple app that shows the weather and news for a given location.

Round 1: getting started/ease of use

I gave both contenders the same prompt and asked them to build me an app. Both tools gave me an app in about the same time. However, I found Replit much, much easier to use; by contrast, Cursor can be tough to get started with.

Round 1 is a decisive victory for Replit.

Round 2: building the app

Both apps had problems and I needed to tweak them to get them working. I found I had to give Replit multiple prompts to fix problems; problems that just didn't occur in Cursor. Replit got stuck on some simple things and I had to get creative with prompting to get round them, all the while my AI token consumption went up. Cursor didn't need this level of imaginative prompting.

I'm giving this round to Cursor on points.

Round 3: editing the visual layout

Replit let me edit the visual layout of the app directly, while Cursor did not. I know Cursor has a visual editor, but I just couldn't get it to work. This is of course an ease of use thing, and overall, Replit is easier. For this app, I didn't need to tweak the layout but it's an important consideration. 

Round 3 is a decisive victory for Replit.

Round 4: what is the app doing?

I wanted to know what the apps were doing "under the hood" so I wanted to see the code. Cursor is unashamedly a code editor, so it was simple. By contrast, Replit hides the code away and it requires a bit of digging. On a related theme, Cursor is much better at debugging, so it's easier to track down errors.

Round 4 is a victory for Cursor.

Round 5: changing the app under the hood

I wanted to change the app "under the hood", which meant changing some of the code. Cursor generates code that's very well commented, so it's easy to see what's going on. By contrast, Replit's code is sparsely commented and I found it difficult to understand what each file did. Bear in mind though, Replit is trying to be an app creation tool not a code editor.

Round 5 is a victory for Cursor.

Round 6: running the app locally

Both Replit and Cursor did well here. This round is a draw.

Round 7: deploying the app to the web

Replit makes this really easy. There's a simple process to go through and your app is deployed. Cursor doesn't do deployment, and deployment services like Render have a learning curve.

Round 7 is a victory for Replit.

A disturbing thought

I was looking at how both apps turned out and something struck me when I was looking at the code for the Cursor app: what services did these apps use? I didn't specify which APIs I wanted to use; the AIs chose for me.

Both of these apps converted an address to a latitude/longitude, showed a map, got local news, got a climate chart for the year, and so on. But what APIs (services) did they use underneath? What were the terms and conditions of the services? What are the limitations of the services? The answer is: you have to find out for yourself. Which means either asking the AI or digging into the code.

If I sign up for an API key, I have to go to a website, read what the service offers, and accept the terms and conditions. For example, some APIs forbid commercial use, some are very rate limited, and others require an acknowledgment in the app or web page. If you build an app using an AI, how do you know what you've agreed to? Will your app get rate limited? Will you get banned for using the API service inappropriately? What are the risks? It seems like a feeble defense to say "my AI made me do it".

It looks like the onus is on you to figure this out, which is definitely a problem.

Who won?

Looking at the results of the contest, my answer is: it depends on your end goal.

If you want a tool to let you build a "simplish" app and you don't have much, if any, coding experience, then Replit is the clear winner. On the downside, it will be very difficult to add more complex features later.

If you want to build a more complex app and you have coding experience, then Cursor wins. Cursor also wins if you think that you'll need to edit the app code in the future. 

What would I choose for internal reporting or BI-type development? On balance, Cursor, but it's not a clear victory. Here's my logic.

  • I love the idea of democratizing analysis. I like giving users the power to answer their own questions. This would appear to favor Replit, but...
  • I worry about maintainability and extendability. I've seen too many cases where a one-off app has become business critical and no-one knows how to maintain it. This favors Cursor because in my view, it produces more maintainable code.

Future directions

The ultimate goal is a tool that lets a non-coder quickly and simply build an app, even a complex one, that's maintainable in the future. This could be building an app for internal use (within an organization) or external use. The app development process will be a combination of natural language prompting and visual editing. Right now, we're really, really close to that goal and it's probably arriving later in 2026.

I'm sure some readers will feel I'm being harsh when I say Replit isn't quite there yet; for me, it needs less prompting and better code layout and documentation. Cursor has a way to go and I'm not convinced they're going in this direction (they may well stay focused on code development). 

In my view, the bigger problem is not app development but data availability. To build internal apps, the internal data has to be available, which means it has to be well-described and in a place where the app development program (and the app itself) can access it. In many organizations, data isn't as well organized as it should be (to put it politely). It's like having a car but not being able to find gas (or only finding the wrong gas): the car is useless. To make internal app development really fly, internal data has to be organized "well enough". We may well see more focus on data organization within companies as a result.

Both Cursor and Replit have the advantage that they ultimately use common languages and packages. This means that the skills to maintain apps created with them are common in any company with programmers or analysts on staff. Contrast that with BI tools, where the skills and knowledge of how to use them live only in the BI group. I can see tools like Cursor and Replit encroaching more and more into BI territory, especially as app development becomes democratized.

Friday, January 9, 2026

The Siren Song

A happy siren accident

I was searching the web for something, and by a happy accident of mistyping, I found a completely unrelated and wonderful event. What I saw inspired this blog post. 

I'm going to write about sirens, those loud things that scare you into taking your safety seriously.

(World War II British siren, Robert Jarvis, via Wikimedia Commons.  Creative Commons Attribution 3.0 Unported license.)

Siren etymology

The word siren comes from ancient Greek mythology. Sirens were female, human-like beings who used their voices to lure young men to their deaths. In the story of Jason and the Argonauts, the crew had to sail past an island of sirens who sang to lure the ship onto the rocks. The crew had Orpheus play his lyre to drown them out so they could pass safely. Unfortunately, one man, Butes, succumbed to the sirens' song and went overboard to reach them.

(The Siren by John William Waterhouse, via Wikimedia Commons. Note the siren's fishy feet.)

From this legend, we get the use of the word siren to describe a beautiful woman who's dangerous, and also its use to describe a device for making loud tones. I'm going to skip the sexist use and focus on noisy devices. 

Of course, I need to mention the reversal here: sirens in ancient Greece used beautiful sounds to lure you to your death; modern sirens use ugly sounds to save your life.

What's a siren?

A siren is a device that makes loud, piercing noises to alert people to danger. You can use pretty much any mechanism you like to produce the noise, but in modern times, it tends to be rotating disks pushing air through holes, or electronics. Modern sirens produce relatively 'simple' sounds compared to musical instruments, adding to their impact.

How they work

I'm going to focus on mechanical slotted disk sirens because they're what most people associate with the word siren. You can make any sound you like with electronics, but that's boring. 

Sound is a pressure wave moving through the air (or other medium). It consists of a wave of compression and rarefaction, meaning the air is alternately compressed (higher pressure) and decompressed (lower pressure). Sound is movement within the air; wind is the movement of the air itself. This is an important distinction for a siren, as we'll see.

To make a noise, we have to set up a sound wave. Moving air alone won't work. For instance, blowing air through a straw won't make a noise. If we want to turn blowing air through a straw into a noise (and so create a simple siren), we have to create a compression wave. We can do it using an electric drill.

This article in Scientific American (https://www.scientificamerican.com/article/building-a-disk-siren/) describes the process. To simplify: create a disk with holes around the edge, mount it on an electric drill, and spin it up. Have a child blow through a straw above the holes in the rotating disk. You should hear a siren-like sound.

Obviously, operating an electric drill close to a child's face could be an interesting experience, so buyer beware.

Blowing through the straw alone doesn't make a noise, but the holes in the rotating disk stop and start the airflow, creating a compression wave and hence a sound. Because the holes are equally spaced and the drill rotates at a constant angular velocity, you hear what's approximately a single frequency. The faster the drill goes, the higher the frequency.

To make this much louder, we need to push a lot more air through the holes. Instead of a child blowing through a straw, we need an electric fan pushing air through holes. That's what electro-mechanical sirens do.

In most sirens, it's the fan that rotates and the holes remain stationary. The holes are placed at the edge of a stationary disk called a stator. It looks something like this.

The holes are often called ports. How many there are and how fast the rotor spins determines the frequency.
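The arithmetic here is simple enough to sketch in a few lines of Python. This is a minimal illustration; the 10-port, 2,700 rpm numbers are made-up values for illustration, not the specs of any real siren.

    # Fundamental pitch of a slotted-disk siren: ports passed per second.
    def siren_frequency(n_ports: int, rpm: float) -> float:
        """Frequency in Hz = port count x rotations per second."""
        return n_ports * rpm / 60.0

    # A hypothetical 10-port stator spun at 2,700 rpm gives 450 Hz,
    # squarely in the 400-500 Hz band discussed below.
    print(siren_frequency(10, 2700))  # 450.0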

The rotor both blows air through the holes and blocks the holes, creating a pressure wave. The rotor looks something like this.

Note the design. The 'fins' push air out of the holes when the holes in the stator and rotor line up. The fins also block the holes as the rotor rotates. So the rotor alternately blocks the holes and pushes air through them. This is what creates the pressure wave and hence the sound.

The design I've shown here creates a single tone. Most sirens create two tones, so they consist of either two rotors and stators each producing a separate tone, or a single rotor and stator in a 'sandwich'. I've shown both designs below. The 'sandwich' terminology is mine, so don't go searching for it!

(Siren that produces different tones at different ends. Srikantasarangi, CC0, via Wikimedia Commons)

('Sandwich' design for two-tone sirens, from airraidsirens.com. The tones are created at the same end of the siren.)

Siren sounds

The tone a siren creates depends on the speed of the motor, the number of holes, and the diameter of the stator/rotor. As the motor starts up, its angular velocity increases from zero, which means the frequency the siren produces increases. Conversely, as the motor slows down to a stop, the frequency drops. By turning the power off and on, or by varying the power to the siren, we can create a moaning or wailing effect.

Sirens don't create a pure sine wave, but it's fairly close. They produce a roughly triangular sound wave that has lots of harmonics (see https://www.airraidsirens.net/tech_howtheywork.html). Because of this distinct wave shape, a siren is clearly an artificial sound, and that's what the authorities want.

A single tone is OK, but you can achieve a stronger psychological effect on the population with two tones or more. Sound waves interfere with one another to create new frequencies; with a two-tone siren, you can create an interval called a minor third. Because a minor third is musically a sad or downbeat sound, siren designers often deliberately design for it.

Lower frequencies travel further than higher frequencies, which is why sirens tend to use them. On the flip side, it's harder for humans to locate the source of lower frequency sounds, but that doesn't really matter for a warning. You don't need people to know where the siren is, you just need them to hear it and run. These lower frequencies are typically in the range 400-500 Hz, with the mid-range 450 Hz generally considered the most annoying.
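If you'd rather hear these ideas than read about them, here's a rough numpy sketch that synthesizes a two-tone wail: two triangle waves a minor third apart (a 6:5 frequency ratio), with the pitch swept up and down to mimic the motor speeding up and slowing down. The frequencies, sweep rate, and mix are guesses for illustration, not measurements of any real siren.

    # Synthesize a crude two-tone siren wail and write it to a WAV file.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import sawtooth

    rate = 44_100
    t = np.linspace(0, 8, 8 * rate, endpoint=False)           # 8 seconds of audio

    freq = 360 + 90 * np.sin(2 * np.pi * t / 8 - np.pi / 2)   # pitch sweeps 270-450 Hz
    phase = 2 * np.pi * np.cumsum(freq) / rate                # integrate frequency to phase
    tone1 = sawtooth(phase, width=0.5)                        # triangle wave (rich in harmonics)
    tone2 = sawtooth(phase * 6 / 5, width=0.5)                # a minor third above (6:5 ratio)

    signal = 0.4 * (tone1 + tone2)
    wavfile.write("siren.wav", rate, (signal * 32767).astype(np.int16))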

World War II - wailing Winnie and moaning Minnie

The most famous sirens of World War II are the air raid sirens used in the UK. They're mostly associated with the London Blitz, but they were used in other British cities. They used two different signals: one to alert for an air-raid and the other the all-clear.

Here's a recording of the air-raid alert sound (first minute). Note the wailing sound caused by varying the power to the siren. These sirens used lower frequencies, designed to be penetrating, and used a minor third for a spooky downbeat sound. 

Imagine sirens like this going off all at once all over a city to warn you that planes are coming to drop bombs on you.

The wailing sounds led to the sirens being called wailing Winnie or moaning Minnie. The same names were used for Nazi weaponry too, so be careful with your internet searches.

Here's the all-clear signal (same video, but towards the end). It's a continuous tone.


In 2012, the British band Public Service Broadcasting released a track called "London Can Take It", based on a 1940 British propaganda film that was narrated by the American Quentin Reynolds. It starts with an air-raid siren. Is this the only pop song that uses an air-raid siren?

Post WWII - civil defense in different forms

During the Cold War, sirens were deployed in many cities to warn of an attack, though I'm not sure how useful hiding from a nuclear weapon would be.

Over the same period, siren usage was extended to warning of danger from natural disasters like tornadoes and flooding. As you might expect, the technology became more sophisticated and more compact, using electronics to generate sound, which made smaller sirens and different sounds possible. Smaller sirens were deployed on emergency vehicles, and you've certainly heard them.

(Siren mounted on a fire truck. FiremanKurt, CC BY-SA 3.0, via Wikimedia Commons)

Despite all this change, the fundamental acoustics stay the same, which means that sirens that warn the population (and so cover a wide area) must have large horn-type 'speakers' to broadcast their signals. In other words, warning sirens are big.

Build your own siren

There are loads of sites on the web that show you how you can build your own air-raid type siren. Most of them assume you've got access to a beefy electrical motor (like the ones used to power grinders), though a few have designs you can use with an electric drill. 

Several sites will tell you how to build an air-raid siren from wood, but the skill level is quite high. I'm a little put off by designs that require me to cut a perfect circle with a jigsaw and balance it carefully. I'm not sure my woodworking skills are up to it.

Other sites have instructions for 3D-printing the components. This seems more doable, but the designs are mostly for sirens that can fit on an electric drill. Even though this seems easier than woodworking, there are some tricky engineering stages.

The other problem is of course the noise. If you get it right, your home-built siren is going to be loud. I'm sure my neighbors would be pleased to hear my siren on a quiet Sunday afternoon.

SirenCon

My happy internet accident was searching for a conference but coming across the similarly named SirenCon, a conference for people who like sirens (https://www.sirencon.com/home). I spent more time than I should clicking around their site and finding out more.

Think for a minute about how this works. SirenCon attendees will want to set off sirens, which is not good news for the neighbors. Where in New York City could you hold it? Whereabouts in any big city could you hold it? The same logic applies to small towns and the suburbs. Where would be a good place to hold a loud conference?

The answer, unsurprisingly, is in the countryside. SirenCon meets once a year in the woods in Rhinelander, in rural Wisconsin. The location seems to be away from any population centers.

Each year, people come and show off their sirens. The 2025 siren list is here: https://www.sirencon.com/the-2025-line-up. Rather wonderfully, there's live streaming, and you can watch and listen to seven and a half hours of siren fun here: https://www.youtube.com/live/ZV24Ioriar4

I think it's great that people with a niche interest like this can get together and share their passion. Good luck to them and I hope they have a wonderful 2026 SirenCon.

I've got the power: what statistical power means

Important, but overlooked

Power is a crucial number to understand for hypothesis tests, but sadly, many courses omit it, and it's often poorly understood, if it's understood at all. To be clear: if you're doing any kind of A/B testing, you have to understand power.

In this blog post, I'm going to teach you all about power.

Hypothesis testing

All A/B tests, all randomized control trials (RCTs), and many other forms of testing are ultimately hypothesis tests; I've blogged about what this means before. To briefly summarize and simplify: we make a statement and measure the evidence for or against it, using thresholds to make our decision.

With any hypothesis test, there are four possible outcomes (using simplified language):

  • The null hypothesis is actually true (there is no effect)
    • We say there is no effect (true negative)
    • We say there is an effect (false positive)
  • The null hypothesis is actually false (there is an effect)
    • We say there is no effect (false negative)
    • We say there is an effect (true positive)

I've summarized the possibilities in the table below.

                 Null hypothesis is true                    Null hypothesis is false
Fail to reject   True negative                              False negative
                 Correct inference                          Type II error
                 Probability threshold = 1 - \( \alpha \)   Probability threshold = \( \beta \)
Reject           False positive                             True positive
                 Type I error                               Correct inference
                 Probability threshold = \( \alpha \)       Probability threshold = power = 1 - \( \beta \)

A lot of attention goes on \( \alpha \), called the significance level, which tells us the probability of a false positive. By contrast, power is the probability of detecting an effect if it's really there (a true positive); sadly, it doesn't get nearly the same level of focus.

By the way, there's some needless complexity here. It would seem more sensible for the two threshold numbers to be \( \alpha \) and \( \beta \) because they're defined very similarly (false positive and false negative). Unfortunately, statisticians tend to use power rather than \( \beta \). 

In pictures

To get a visual sense of what power is, let's look at how a null hypothesis test works in pictures. Firstly, we assume the null is true and we draw out acceptance and rejection regions on the probability distribution (first chart). To reject the null, our test results have to land in the red rejection regions in the top chart.

Now we assume the alternate hypothesis is true (second chart). We want to land in the blue region in the second chart, and we want at least a certain probability (the power) of doing so.

To be confident there is an effect, we want the power to be as high as possible.
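To make this concrete, here's a small numeric sketch: for a two-sided z-test, power is just the area of the alternative distribution that falls in the null's rejection region. The standardized shift of 2.8 is a hypothetical value chosen for illustration (it happens to be roughly what an 80%-power design implies).

    # Power as a tail area: the probability that a test statistic drawn from
    # the alternative distribution lands in the null's rejection region.
    from scipy.stats import norm

    alpha = 0.05
    z_crit = norm.ppf(1 - alpha / 2)  # two-sided rejection boundary under the null
    shift = 2.8                       # hypothetical standardized effect (noncentrality)

    power = norm.sf(z_crit - shift) + norm.cdf(-z_crit - shift)
    print(f"power: {power:.2f}")      # about 0.80 for this shift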

Calculating power - before and after

Before we run a test, we calculate the sample size we need based on a couple of factors, including the power we want the test to have. For reasons I'll explain later, 80% or 0.8 is a common choice. 

Once we've run the test and we have the test results, we then calculate the actual power based on the data we've recorded. It's very common for the actual power to be different from what we specified in our test design. If the actual power is too low, that may mean we have to continue the test or redesign it.

Unfortunately, power is hard to calculate; there's no convenient closed-form formula, and to make matters worse, some of the websites that offer power and sample size calculations give incorrect results. The G*Power package is probably the easiest tool for most people to use, though there are convenient libraries in R and Python that will calculate power for you. If you're going to understand power, you really do need to understand statistics.

To make all this understandable, let me walk you through a sample size calculation for a conversion rate A/B test for a website. 

  • A/B tests are typically large with thousands of samples, which means we're in z-test territory rather than t-test. 
  • We also need to decide what we're testing for. A one-sided test tests for a difference in one direction only (either greater than or less than); a two-sided test tests for a difference in either direction. Two-sided tests are more common because they're more informative. Some authors use the terms one-tailed and two-tailed instead of one-sided and two-sided.
  • Now we need to define the thresholds for our test, which are \( \alpha \) and power. Common values are 0.05 and 0.8.
  • Next up, we need to look at the effect; in the conversion test example, we might have a conversion rate of 2% on one branch and an expected conversion rate of 2.2% on the other branch.
We can put all this into G*Power and here's what we get.

Test type Tail(s) \( \alpha \) Power Proportion 1 Proportion 2 Sample size
z-test Two-tailed 0.05 0.8 0.02 0.022 161,364
z-test Two-tailed 0.05 0.95 0.02 0.022 267,154
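If you'd rather script the calculation than use G*Power, the statsmodels library in Python will do it. Here's a minimal sketch of the first row; note that statsmodels uses an arcsine-based effect size (Cohen's h) for proportions, so its answer lands close to, but not exactly on, G*Power's number.

    # Sample size for a two-sided, two-proportion z-test:
    # alpha = 0.05, power = 0.8, conversion rates 2% vs 2.2%.
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    effect = proportion_effectsize(0.022, 0.02)  # Cohen's h for the two rates

    n_per_group = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.8, alternative='two-sided')

    print(f"per group: {n_per_group:,.0f}, total: {2 * n_per_group:,.0f}")
    # The total comes out near the table's 161,364; the small difference is
    # down to the arcsine approximation.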

The first row of the table shows a power of 80%, which leads to a sample size of 161,364. Increasing the power to 95% gives a sample size of 267,154, a big increase, and that's a problem. Power varies non-linearly with sample size, as I've shown in the screenshot below for this data (from G*Power).

Conversion rates of 2% are typical for many retail sites. It's very rare that any technology will increase the conversion rate greatly. A 10% increase from 2% to 2.2% would be wonderful for a retailer and they'd be celebrating. Because of these numbers, you need a lot of traffic to make A/B tests work in retail, which means A/B tests can really only be used by large retailers.

Why not just reduce power and reduce the sample size? Because that makes the results of the test less reliable; at some point, you might as well just flip a coin instead of running a test. A lot of A/B tests are run when a retailer is testing new ideas or new paid-for technologies. An A/B test is there to provide a data-oriented view of whether the new thing works or not. The thresholds are there to give you a known confidence in the test results.

After a test is done, or even partway through the test, we can calculate the observed power. Let's use G*Power and the numbers from the first row of the table above, but assume a sample size of 120,000. This gives a power of 0.67, way below what's useful and too close to a 50-50 split. Of course, it's possible that we observe a smaller effect than expected, and you can experiment with G*Power to vary the effect size and see the effect on power.
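Here's the matching statsmodels sketch, this time solving for power with the sample size fixed. I'm assuming the 120,000 is the total sample, split evenly across the two branches.

    # Observed power with a fixed total sample of 120,000 (60,000 per branch).
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    effect = proportion_effectsize(0.022, 0.02)  # same 2% vs 2.2% rates
    power = NormalIndPower().solve_power(
        effect_size=effect, nobs1=60_000, alpha=0.05, alternative='two-sided')

    print(f"power: {power:.2f}")  # roughly 0.67, matching the number above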

A nightmare scenario

Let's imagine you're an analyst at a large retail company. There's a new technology which costs $500,000 a year to implement. You've been asked to evaluate the technology using an A/B test. Your conversion rate is 2% and the new technology promises a conversion rate of 2.2%. You set \(\alpha\) to 0.05, and power to 0.8 and calculate a sample size (which also gives you a test duration). The null hypothesis is that there is no effect (conversion rate of 2%) and the alternate hypothesis is that the conversion rate is 2.2%.

Your boss will ask you "how sure are you of these results?". If you say there's no effect, they'll ask "how sure are you there's no effect?"; if you say there is an effect, they'll ask "how sure are you there is an effect?". Think for a moment how you'd ideally like to answer these questions (100% sure is off the cards). The level of surety you can offer depends on your website traffic and the test.

When the test is over, you calculate a p-value of 0.01, which is less than your \(\alpha\), so you reject the null hypothesis. In other words, you think there's an effect. Next you calculate power. Let's say you get a 0.75. Your threshold for accepting a conversion rate of 2.2% is 0.8. What's next?

It's quite possible that the technology works, but just doesn't increase the conversion rate to 2.2%. It might increase conversion to 2.05% or 2.1%, for example. These kinds of conversion rate lifts might not justify the cost of the technology.

What do you do?

You have four choices, each with positives and negatives.

  1. Reject the new technology because it didn't pass the test. This is a fast decision, but you run the risk of foregoing technology that would have helped the business.
  2. Carry on with the test until it reaches your desired power. Technically, the best, but it may take more time than you have available.
  3. Accept the technology with the lower power. This is a risky bet and very dangerous to do regularly (lower thresholds mean you make more mistakes).
  4. Try a test with a lower lift, say an alternate hypothesis that the conversion rate is 2.1%.

None of these options are great. You need strong statistics to decide on the right way forward for your business.

(A/B testing was painted as an easy-to-use wonder technique. The reality is, it just isn't.)

What's a good value?

The "industry standard" power is 80%, but where does this come from? It's actually a quote from Michael Cohen in his 1988 book "Statistical Power Analysis for the Behavioral Sciences", he said if you're stuck and can't figure out what the power should be, use 80% as a last result. Somehow the value of last resort has become an unthinking industry standard. But what value should you chose?

Let's go back to the definitions of \( \alpha \) and \( \beta \) (remember, \( \beta \) is 1 - power). \( \alpha \) corresponds to the probability of a false positive; \( \beta \) corresponds to the probability of a false negative. How do you balance these two false results? Do you think a false positive is as bad as a false negative, or better, or worse? The industry standard choices for \( \alpha \) and \( \beta \) are 0.05 and 0.20 (1 - 0.8), which means we think a false positive is four times worse than a false negative. Is that what you intended? Is that ratio appropriate for your business?

In retail, including new technologies on a website comes with a cost, but there's also the risk of forgoing revenue if you get a false negative. I'm tempted to advise you to choose the same \( \alpha \) and \( \beta \) value of 0.05 (which gives a power of 95%). This does increase the sample size and may take it beyond the reach of some websites. If you're bumping up against the limits of your traffic when designing tests, it's probably better to use something other than an A/B test.

Why is power so misunderstood?

Conceptually, it's quite simple (the probability of making a true positive observation), but it's wrapped up with the procedure for defining and using a null hypothesis test. Frankly, the whole null hypothesis setup is highly complex and unsatisfactory (Bayesian statistics may offer a better approach). My gut feeling is that \( \alpha \) is easy to understand, but once you get into the full language of null hypothesis testing, people get left behind, which means they don't understand power.

Not understanding power leaves you prone to making bad mistakes, like under-powering tests. An underpowered test might mean you reject technologies that could increase conversion rate. Conversely, under-powered tests can lead you to claim a bigger effect than is really there. Overall, it leaves you vulnerable to making the wrong decision.

Wednesday, December 31, 2025

Whiskey prices!

Whiskey prices and age of the single malt

I was in a large alcohol supermarket the other day and I was looking at Scotch whiskey prices. I could see the same single malt at 18, 21, and 25 years. What struck me was how non-linear the price was. Like any good data scientist, I collected some data and took a closer look. I ended up taking a deeper dive into the whiskey market as you'll read.

(Gemini. Whiskey that's old enough to drink.)

The data and charts

From an online alcohol seller, I collected data on the retail prices of several single malt Scotch whiskies with different ages, being careful to make a like-for-like comparison and obviously comparing the same bottle size (750 ml). This is more difficult than it sounds as there are many varieties, even within the same single malt brand. 

Here are the results. You can interact with this chart through the menu on the right. Yes, 50 year old whiskies do sell for $40,000.

First impressions are that the relationship between price and age is highly non-linear. To see this in more detail, I've redrawn the chart using a log y-axis. 

This presentation suggests an exponential relationship between price and age. To confirm it, I did a simple curve fit and got an exponential fit that's very good.
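If you want to replicate the fit, an exponential relationship is a straight line in log(price), so a simple least-squares fit on the logged prices does the job. The arrays below are placeholder values standing in for my scraped data, not the data itself.

    # Fit log(price) = a + b * age: an exponential price-age curve is linear in logs.
    import numpy as np

    ages = np.array([10, 12, 15, 18, 21, 25, 30, 40, 50])               # placeholder ages
    prices = np.array([60, 80, 120, 200, 350, 700, 1800, 8000, 40000])  # placeholder prices

    b, a = np.polyfit(ages, np.log(prices), 1)
    print(f"price ~ {np.exp(a):.0f} * exp({b:.3f} * age)")
    print(f"price doubles every {np.log(2) / b:.1f} years of age")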

What's going on with the price curve?

The exponential age-price curve is well-known and has been discussed in the literature [1, 2]. What might make the curve exponential? I find the literature a bit confusing here, so I'll offer some descriptions of the whiskey market and whiskey itself.

First off, whiskey takes a long time to come to market; by definition, a 21 year old Scotch has been in a barrel for 21 years. This means distillers are making predictions about the demand for their product far into the future. A 50 year old whiskey on sale today was put into a barrel when Jaws was a new movie and Microsoft was formed; do you think they could have made an accurate forecast for 2025 demand back then? Of course, the production process means the supply is finite and relatively inelastic; you can't quickly make more 50 year old whiskey.

How whiskey ages adds to the difficulty distillers have with production. Unlike wine, whiskey ages in the barrel but not in the bottle; an 18 year old single malt bottled in 2019 is the same as an 18 year old single malt bottled in 2025. So once whiskey is bottled, it should be sold as soon as possible to avoid bottle storage costs. This punishes premature bottling; if you over-bottle, you either sell at a reduced price or bear storage costs.

There is a possible exception to whiskey not aging in the bottle known as the Old Bottle Effect (OBE). Expert tasters can taste novel flavors in whiskies that have spent a long time in the bottle. These tastes are thought to come from oxidation, with oxygen permeating very slowly through the pores in the cork [3]. Generally speaking, oxidation is considered a bad thing for alcoholic drinks, but it seems in the case of whiskey, a little is OK. Viewing the online images of 50 year old whiskey bottles, it looks like they've been bottled recently, so I'm not convinced OBE has any bearing on whiskey prices.

Whiskey is distilled and gets its taste from the barrels, which means that unlike wine, there are no vintage years. Whiskey is unaffected by terroir or the weather; a 21 year old Scotch should taste the same regardless of the year it was bottled, which has a couple of consequences. 

  • If you bottle too much whiskey and have to store it instead of selling it, you won't be able to charge a price premium for the bottles you store (over bottling = higher costs). 
  • On the analysis side, it's possible to compare the prices of the same whiskey over several years; a 25 year old whiskey in 2019 is the same product as a 25 year old whiskey in 2025.

One notable production price driver is evaporation. Each year, 2-5% of the whiskey in barrels is lost to evaporation, the so-called "angel's share". Let's assume a 4% annual loss from a 200 liter barrel and see what it does to the amount of whiskey we can sell (I've rounded the numbers to the nearest liter).

Year Whiskey volume
0 200
3 177
10 133
15 108
18 96
21 85
25 72
30 59
40 39
50 26

By law, whiskey has to be matured for 3 years and in reality, the youngest single malts are 10 years old. To get the same revenue as selling the barrel at 10 years, a 50 year old barrel has to be sold for (133/26) or about 5 times the price. That helps explain the increase with age, but not the extent of the increase.
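The angel's share table is easy to reproduce; here's a sketch of the compounding, using the 4% annual loss assumed above.

    # Angel's share: 4% annual evaporation from a 200 liter barrel.
    barrel, annual_loss = 200.0, 0.04

    for year in (0, 3, 10, 15, 18, 21, 25, 30, 40, 50):
        print(f"year {year:2d}: {barrel * (1 - annual_loss) ** year:4.0f} L")

    # At 10 years the barrel holds ~133 L; at 50 years only ~26 L, so a
    # 50 year barrel must sell at ~133/26 = 5x the price for equal revenue.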

Storage costs obviously vary linearly with age and we can add in the time value of money (which follows the same type of equation as the angel's share). These costs obviously drive up the cost of older whiskey more, but all the production and supply-side factors still don't get us to an exponential price curve.

Before moving to the demand side, I should talk a bit about the phenomenon of independent bottlers, also known as cask brokers or cask investment companies. These are companies that buy whiskey in barrels from the distillers and store the barrels. They either bottle the whiskey themselves or sell the barrels, sometimes selling barrels back to the original distiller. As far as I can see, they're operating like a kind of futures market. There are several of these companies, the biggest being Gordon & MacPhail, who were founded in 1895. It's not clear to me what effect these companies might have on the supply of single malts.

On the demand side, whiskey has been a boom-and-bust industry.

Up until the late 1970s, there had been a whiskey boom and distilleries had upped production in response. Unfortunately, that led to over-production and the creation of a whiskey 'loch' (by comparison with the wine lake and the butter mountain created by over-production).  By the early 1980s, distilleries were closing and the industry was in a significant downturn. This led to a sharp reduction in production. For us in 2025, it means the supply of older whiskey is very much less than demand. 

More recently, there was a whiskey boom from the early 2000s to the early 2020s. Demand increased substantially but with a fixed supply.  Increased demand + fixed supply = increased price, and as older whiskies are rarer, this suggests that older whiskies appreciate in price more.

It's an anecdotal point, but I seem to remember it was uncommon to see "young" whiskies of less than 18 years. It's only recently that I've seen lots of 10 year old whiskies on sale. If this is true, it would be a distiller's response to the boom: bottle and sell as much as you can now while demand is high. Bottling whiskies younger will have the side-effect of reducing the supply of older whiskies.

Of course, the whiskey boom has seen older whiskies become luxury goods. The Veblen effect might be relevant here: it's the observation that when the price of some luxury goods increases, demand increases (the opposite dynamic from "normal" goods). Small additions to a product might drive up the price disproportionately (handbags being a good example); in this case, the small addition would be an increase in the age of the whiskey (say from 40 years to 45 years).

As rare and old whiskies have become more expensive, investors have moved in and bought whiskey not as something to drink, but as something to buy and sell. This has brought more money into the high-end of the market, adding to the price rise.

Let's pull all these strands together. Whiskey seems to be a boom-and-bust industry coupled with long-term production and a fixed supply. Over recent years, there's been a boom in whiskey consumption. Market dynamics suggest that distillers sell now while the market is good, which means bottling earlier, which in turn means fewer older whiskies for the future. Really old whiskies are quite rare because of the industry downturn that occurred decades ago and because of maturation costs. Rareness coupled with wealth and desirability pushes the price up to stratospheric levels for older whiskies. The price-age curve is then a function of supply, distillers bottling decisions, and market demand. That still doesn't get us to the exponential curve, but you can see how we could produce a model to get there.

What about blends, other countries, and science?

If single malt whiskey is becoming unaffordable, what about blends? Like wine, the theory goes that the blender can buy whiskey from different distillers and combine them to produce a superior product. However, like wine, the practice is somewhat different. Blends have been associated with the lower end of the market and I've had some really nasty cheap blended whiskey. At the upper end of the blend market, a 750ml bottle of Johnnie Walker Blue Label retails for about $178, and I've heard it's very good. For comparison, the $178 price tag puts it in the price range of some 18-21 year old whiskies. There are rumors that some lesser-known blends are single malts in all but name, so they might be worth investigating, but at over $150 a bottle, this feels a bit like gambling.

What about whiskey or whisky from other countries? I'm not sure I count bourbon as a Scotch-type whiskey; it kind of feels like its own thing, perhaps a separate branch of the whiskey family. Irish whiskey is very good and the market isn't as developed as Scotch, but prices are still high. I've tried Japanese whiskey and I didn't like it; maybe the more expensive stuff is better, but it's an expensive risk. I've seen Indian whiskey, but again the price was too high for me to want to try my luck.

What about engineered whiskey? Whiskey gets its flavor from wooden barrels, and if you know the chemistry, you can in principle make an equivalent product much faster. There are several companies trying to do this and they've been trying for several years. The big publicity about these so-called molecular spirits was around 2019, but they've not dented the Scotch market at all and their products aren't widely available. The whiskey "equivalents" I've seen retail for about $40, making them much cheaper than single malts; however, the reviews are mixed. The price point does mean I'm inclined to take a risk; if I can find a bottle, I'll buy one.

Whiskey or whisky?

Both spellings are correct. Usage depends on where you are and the product you're talking about. Whiskey is the Irish spelling and it's the spelling used in the US for this category of spirits. Whisky is the Scottish spelling and it's the spelling they use on their bottles. Because I'm writing in the US, I've used whiskey in this blog post even though I'm writing about the Scottish product. I decided I couldn't win on a spelling choice, so I chose to be consistent.

The future

During the 2000s whiskey boom, investors created new distilleries and re-opened old ones, which suggests production is likely to increase over the coming years. At the same time, the whiskey boom is slowing down and sales are flattening. Are we headed to another whiskey crash? I kind of doubt it, but I think prices will stabilize or even come down slightly for younger whiskies (21 years or younger). Older whiskies will still be rare because of the industry slump in the 1980s, and they're likely to remain eye-wateringly expensive.

Of course, I'll be having a glass of single malt in the near future, but I'll try not to bore everyone with whiskey facts!

References

  1. Moroz, D. and Pecchioli, B. "Should You Invest in an Old Bottle of Whisky or in a Bottle of Old Whisky? A Hedonic Analysis of Vintage Single Malt Scotch Whisky Prices." Journal of Wine Economics 14.2 (2019): 145-163. doi:10.1017/jwe.2019.13
  2. Page, Ian B. "Why Do Distilleries Produce Multiple Ages of Whisky?" Journal of Wine Economics 14.1 (2019): 26-47.
  3. https://hedonism.co.uk/what-obe-old-bottle-effect