Friday, January 9, 2026

The Siren Song

A happy siren accident

I was searching the web for something, and by a happy accident of mistyping, I found a completely unrelated and wonderful event. What I saw inspired this blog post. 

I'm going to write about sirens, those loud things that scare you into taking your safety seriously.

(World War II British siren, Robert Jarvis, via Wikimedia Commons.  Creative Commons Attribution 3.0 Unported license.)

Siren etymology

The word siren comes from ancient Greek mythology. Sirens were female, human-like beings who used their voices to lure young men to their deaths. In the story of Jason and the Argonauts, the crew had to sail past an island of sirens who sang to lure the ship onto the rocks. The crew had Orpheus play his lyre to drown them out so they could pass safely. Unfortunately, one man, Butes, succumbed to the sirens' song and went overboard to reach them.

(The Siren by John William Waterhouse, via Wikimedia Commons. Note the siren's fishy feet.)

From this legend, we get the use of the word siren to describe a beautiful but dangerous woman, and also its use to describe a device for making loud tones. I'm going to skip the sexist use and focus on devices that make loud tones. Of course, I need to mention the reversal here: sirens in ancient Greece used beautiful sounds to lure you to your death; modern sirens use ugly sounds to save your life.

What's a siren?

A siren is a device that makes loud and piercing noises to alert people to danger. You can use pretty much any mechanism you like to produce the noise, but in modern times, it tends to be either rotating disks pushing air through holes, or electronics. Modern sirens produce relatively 'simple' sounds compared to musical instruments, which adds to their impact.

How they work

I'm going to focus on mechanical slotted disk sirens because they're what most people associate with the word siren. You can make any sound you like with electronics, but that's boring. 

Sound is a pressure wave moving through the air (or other medium). It consists of a wave of compression and rarefaction, meaning the air is compressed (higher pressure) and decompressed (lower pressure). Wind is the movement of the air itself; sound is movement within the air. This is an important distinction for a siren, as we'll see.

To make a noise, we have to set up a sound wave. Moving air alone won't work. For instance, blowing air through a straw won't make a noise. If we want to turn blowing air through a straw into a noise (and so create a simple siren), we have to introduce a compression wave. We can do this using an electric drill.

This article in Scientific American (https://www.scientificamerican.com/article/building-a-disk-siren/) describes the process. To simplify: create a disk with holes around the edge, mount it on an electric drill, and spin it up. Have a child blow through a straw above a hole in the disk. You should hear a siren-like sound.

Obviously, operating an electric drill close to a child's face could be an interesting experience, so buyer beware.

Blowing through the straw alone doesn't make a noise, but the holes in the rotating disk stop and start the flow, creating a compression wave and hence a sound. Because the holes are equally spaced and the drill rotates at a constant angular velocity, you hear approximately a single frequency. The faster the drill spins, the higher the frequency.

To make this much louder, we need to push a lot more air through the holes. Instead of a child blowing through a straw, we need an electric fan pushing air through holes. That's what electro-mechanical sirens do.

In most sirens, it's the fan that rotates while the holes remain stationary. The holes are placed at the edge of a stationary disk called a stator. It looks something like this.

The holes are often called ports. How many there are and how fast the rotor spins determines the frequency.
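
As a back-of-the-envelope check, the fundamental frequency is simply the number of ports that pass the air stream each second. Here's a tiny Python sketch (the port count and motor speed are made-up example numbers):

def siren_frequency(n_ports: int, rpm: float) -> float:
    # Fundamental frequency in Hz: ports passed per second.
    return n_ports * rpm / 60.0

# Hypothetical example: a 10-port stator driven at 2,700 rpm.
print(siren_frequency(10, 2700))  # 450.0 Hz, squarely in warning-siren territory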

The rotor both blows air through the holes and blocks the holes, creating a pressure wave. The rotor looks something like this.

Note the design. The 'fins' push the air out of the holes when the holes in the stator and rotor line up. The fins also block the holes as the rotor rotates. So the rotor alternately blocks the holes and pushes air through them. This is what creates the pressure wave and hence the sound.

The design I've shown here creates a single tone. Most sirens create two tones, so they consist of either two rotors and stators each producing a separate tone, or a single rotor and stator in a 'sandwich'. I've shown both designs below. The 'sandwich' terminology is mine, so don't go searching for it!

(Siren that produces different tones at different ends. Srikantasarangi, CC0, via Wikimedia Commons)

('Sandwich' design for two-tone sirens, from airraidsirens.com.)

Siren sounds

The tone a siren creates depends on the speed of the motor, the number of holes, and the diameter of the stator/rotor. As the motor starts up, its angular velocity increases from zero, which means the frequency the siren produces increases. Conversely, as the motor slows down to a stop, the frequency drops. By turning the power off and on, or by varying the power to the siren, we can create a moaning or wailing effect.

Sirens don't create a pure sine wave, but the sound is fairly close to one. They produce a roughly triangular sound wave that has lots of harmonics (see https://www.airraidsirens.net/tech_howtheywork.html). Because of this distinctive wave shape, a siren is clearly an artificial sound, and that's what the authorities want.

A single tone is OK, but you can achieve a stronger psychological effect on the population with two or more tones. Sound waves interfere with one another to create new frequencies. With a two-tone siren, designers often choose an interval called a minor third, which musically is a sad or downbeat sound.

Lower frequencies travel further than higher frequencies, which is why sirens tend to use them. On the flip side, it's harder for humans to locate the source of lower frequency sounds, but that doesn't really matter for a warning. You don't need people to know where the siren is, you just need them to hear it and run. These lower frequencies are typically in the range 400-500 Hz, with the mid-range 450 Hz generally considered the most annoying.
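
If you want to hear something like this without building anything, you can synthesize a rough approximation. This sketch uses NumPy and Python's built-in wave module: it sweeps a base tone up and down to mimic the motor speeding up and winding down, and adds a second tone a minor third above (a 6:5 frequency ratio). All the specific numbers are illustrative choices, not measurements of any real siren.

import wave
import numpy as np

SAMPLE_RATE = 44100
DURATION = 8.0  # seconds

t = np.linspace(0.0, DURATION, int(SAMPLE_RATE * DURATION), endpoint=False)

# Motor spins up then winds down: base frequency sweeps 150 Hz -> 450 Hz -> 150 Hz.
sweep = 0.5 * (1.0 - np.cos(2.0 * np.pi * t / DURATION))  # 0 -> 1 -> 0
base_freq = 150.0 + 300.0 * sweep

# Integrate frequency to get phase, so the sweep is smooth and click-free.
phase = 2.0 * np.pi * np.cumsum(base_freq) / SAMPLE_RATE

# Two tones a minor third apart (6:5 ratio). A real siren is closer to a
# triangular wave with strong harmonics; sine waves keep the sketch simple.
signal = np.sin(phase) + np.sin(phase * 6.0 / 5.0)

# Normalize to 16-bit integers and write a mono WAV file.
samples = np.int16(signal / np.max(np.abs(signal)) * 32767)
with wave.open("siren.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 16-bit samples
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(samples.tobytes())

Play siren.wav and you'll hear a crude version of the two-tone wail described below.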

World War II - Wailing Winnie and Moaning Minnie

The most famous sirens of World War II are the air raid sirens used in the UK. They're mostly associated with the London Blitz, but they were used in other British cities too. They used two different signals: one to warn of an air raid and the other to give the all-clear.

Here's a recording of the air-raid alert sound (first minute). Note the wailing sound caused by varying the power to the siren. These sirens used lower frequencies, designed to be penetrating, and used a minor third for a spooky downbeat sound. Imagine sirens like this going off all at once all over a city to warn you that planes are coming to drop bombs on you.

The wailing sound led to the sirens being nicknamed Wailing Winnie or Moaning Minnie. The same names were also used for Nazi weaponry.

Here's the all clear signal (same video, but towards the end). It's a continuous tone. 


In 2012, the British band Public Service Broadcasting released a track called "London Can Take It", based on a 1940 British propaganda film that was narrated by the American Quentin Reynolds. It starts with an air-raid siren.

Post WWII - civil defense in different forms

During the Cold War, sirens were deployed in many cities to warn of an attack, though I'm not sure how useful hiding from a nuclear weapon would be.

Over the same period, siren usage was extended to warning of danger from natural disasters like tornadoes or flooding. As you might expect, the technology became more sophisticated and more compact: electronics could generate the sound, which made smaller sirens and different sounds possible. Smaller sirens were deployed on emergency vehicles, and you've certainly heard them. Despite all this change, the fundamental acoustics stay the same, which means sirens that warn the population (and so cover a wide area) must have large horn-type 'speakers' to broadcast their signals. In other words, warning sirens are big.

(Siren mounted on a fire truck. FiremanKurt, CC BY-SA 3.0, via Wikimedia Commons)

Build your own siren

There are loads of sites on the web that show you how you can build your own air-raid type siren. Most of them assume you've got access to a beefy electrical motor, though a few have designs you can use with an electric drill. 

Several sites will tell you how to build an air-raid siren from wood, but the skill level is quite high. I'm a little put off by designs that require me to cut a perfect circle with a jigsaw and balance it carefully. I'm not sure my woodworking skills are up to it.

Other sites have instructions for 3D-printing the components. This seems more doable, but the designs are mostly for sirens that can fit on an electric drill. Even though this seems easier, there are some tricky engineering stages.

The other problem is of course the noise. If you get it right, your home-built siren is going to be loud. I'm sure my neighbors would be pleased to hear my siren on a quiet Sunday afternoon.

SirenCon

My happy internet accident was searching for a conference but coming across SirenCon, a conference for people who like sirens (https://www.sirencon.com/home). I spent more time than I should clicking around their site and finding out more.

Think for a minute about how this works. SirenCon attendees will want to set off sirens, which is not good news for the neighbors. Where in New York City could you hold it? Whereabouts in any big city? The same logic applies to small towns and the suburbs. Where would be a good place to hold a loud conference?

The answer, unsurprisingly, is the countryside. SirenCon meets once a year in the woods of rural Wisconsin, in Rhinelander. Their location seems to be away from any population centers.

Each year, people come and show off their sirens. The 2025 siren list is here: https://www.sirencon.com/the-2025-line-up Rather wonderfully, there's live streaming and you can watch seven and a half hours of siren fun: https://www.youtube.com/live/ZV24Ioriar4

I think it's great that people with a niche interest like this can get together and share their passion. Good luck to them and I hope they have a wonderful 2026 SirenCon.

I've got the power: what statistical power means

Important, but overlooked

Power is a crucial number to understand for hypothesis tests, but sadly, many courses omit it, and it's often poorly understood, if it's understood at all. To be clear: if you're doing any kind of A/B testing, you have to understand power.

In this blog post, I'm going to teach you all about power.

Hypothesis testing

All A/B tests, all randomized controlled trials (RCTs), and many other forms of testing are ultimately hypothesis tests; I've blogged about what this means before. To briefly summarize and simplify: we make a statement and measure the evidence for and against it, using thresholds to make our decision.

With any hypothesis test, there are four possible outcomes (using simplified language):

  • The null hypothesis is actually true (there is no effect)
    • We say there is no effect (true negative)
    • We say there is an effect (false positive)
  • The null hypothesis is actually false (there is an effect)
    • We say there is no effect (false negative)
    • We say there is an effect (true positive)

I've summarized the possibilities in the table below.

                            Null hypothesis is true                    Null hypothesis is false
Fail to reject the null     True negative                              False negative
                            Correct inference                          Type II error
                            Probability threshold = 1 - \( \alpha \)   Probability threshold = \( \beta \)
Reject the null             False positive                             True positive
                            Type I error                               Correct inference
                            Probability threshold = \( \alpha \)       Probability threshold = power = 1 - \( \beta \)

A lot of attention goes on \(\alpha\), called the significance level, which tells us the probability of a false positive. By contrast, power is the probability of detecting an effect if it's really there (a true positive); sadly, it doesn't get nearly the same level of focus.

By the way, there's some needless complexity here. It would seem more sensible for the two threshold numbers to be \( \alpha \) and \( \beta \) because they're defined very similarly (false positive and false negative probabilities). Unfortunately, statisticians tend to quote power rather than \( \beta \).

In pictures

To get a visual sense of what power is, let's look at how a null hypothesis test works in pictures. Firstly, we assume the null is true and we draw out acceptance and rejection regions on the probability distribution (first chart). To reject the null, our test results have to land in the red rejection regions in the top chart.

Now we assume the alternate hypothesis is true (second chart). We want to land in the blue region in the second chart, and we want a certain probability (power), or more, of landing in the blue region.

To be confident there is an effect, we want the power to be as high as possible.

Calculating power - before and after

Before we run a test, we calculate the sample size we need based on a couple of factors, including the power we want the test to have. For reasons I'll explain later, 80% or 0.8 is a common choice. 

Once we've run the test and we have the test results, we then calculate the actual power based on the data we've recorded. It's very common for the actual power to be different from what we specified in our test design. If the actual power is too low, that may mean we have to continue the test or redesign it.

Unfortunately, power is hard to calculate: there are no convenient closed-form formulas, and to make matters worse, some of the websites that offer power and sample size calculations give incorrect results. The G*Power package is probably the easiest tool for most people to use, though there are convenient libraries in R and Python that will calculate power for you. If you're going to understand power, you really do need to understand statistics.

To make all this understandable, let me walk you through a sample size calculation for a conversion rate A/B test for a website. 

  • A/B tests are typically large with thousands of samples, which means we're in z-test territory rather than t-test. 
  • We also need to decide what we're testing for. A one-sided test tests for a difference in one direction only, either greater than or less than; a two-sided test tests for a difference in either direction. Two-sided tests are more common because they're more informative. Some authors use the terms one-tailed and two-tailed instead of one-sided and two-sided. 
  • Now we need to define the thresholds for our test, which are \( \alpha \)  and power. Common values are 0.05 and 0.8.  
  • Next up, we need to look at the effect. In the conversion test example, we might have a conversion rate of 2% on one branch and an expected conversion rate of 2.2% on the other branch. 
We can put all this into G*Power and here's what we get.

Test type   Tail(s)      \( \alpha \)   Power   Proportion 1   Proportion 2   Sample size
z-test      Two-tailed   0.05           0.8     0.02           0.022          161,364
z-test      Two-tailed   0.05           0.95    0.02           0.022          267,154

The first row of the table shows that a power of 80% leads to a sample size of 161,364. Increasing the power to 95% gives a sample size of 267,154, a big increase, and that's a problem. Power varies non-linearly with sample size, as I've shown in the screenshot below for this data (from G*Power).
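
If you'd rather check numbers like these in code, here's a sketch using Python's statsmodels library. Note that statsmodels approximates the two-proportion z-test via Cohen's h (an arcsine-transformed effect size), so it lands within a fraction of a percent of G*Power's figures rather than matching them exactly:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h for conversion rates of 2% vs 2.2%.
h = proportion_effectsize(0.022, 0.02)

analysis = NormalIndPower()
for power in (0.80, 0.95):
    n_per_group = analysis.solve_power(
        effect_size=h, alpha=0.05, power=power,
        ratio=1.0, alternative="two-sided",
    )
    # Report the total across both branches, to compare with the table above.
    print(f"power={power}: total sample size = {2 * n_per_group:,.0f}")

This prints roughly 161,000 and 267,000, close to G*Power's 161,364 and 267,154.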

Conversion rates of 2% are typical for many retail sites. It's very rare that any technology will increase the conversion rate greatly. A 10% increase from 2% to 2.2% would be wonderful for a retailer and they'd be celebrating. Because of these numbers, you need a lot of traffic to make A/B tests work in retail, which means A/B tests can really only be used by large retailers.

Why not just reduce the power and so reduce the sample size? Because that makes the results of the test less reliable; at some point, you might as well just flip a coin instead of running a test. A lot of A/B tests are run when a retailer is testing new ideas or new paid-for technologies. An A/B test is there to provide a data-oriented view of whether the new thing works or not. The thresholds are there to give you a known confidence in the test results. 

After a test is done, or even partway through the test, we can calculate the observed power. Let's use G*Power and the numbers from the first row of the table above, but assume a sample size of 120,000. This gives a power of 0.67, way below what's useful and too close to a 50-50 split. Of course, it's possible that we observe a smaller effect than expected, and you can experiment with G*Power to vary the effect size and see the effect on power.
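
The post-test calculation works the same way in statsmodels: fix the sample size and solve for power. Continuing the sketch above (120,000 total samples means 60,000 per branch, assuming an even split):

# Observed power with 120,000 total samples (60,000 per branch),
# reusing the effect size h and analysis object from the sketch above.
observed_power = analysis.power(
    effect_size=h, nobs1=60_000, alpha=0.05, alternative="two-sided",
)
print(f"power = {observed_power:.2f}")  # roughly 0.67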

A nightmare scenario

Let's imagine you're an analyst at a large retail company. There's a new technology which costs $500,000 a year to implement. You've been asked to evaluate the technology using an A/B test. Your conversion rate is 2% and the new technology promises a conversion rate of 2.2%. You set \(\alpha\) to 0.05, and power to 0.8 and calculate a sample size (which also gives you a test duration). The null hypothesis is that there is no effect (conversion rate of 2%) and the alternate hypothesis is that the conversion rate is 2.2%.

Your boss will ask you "how sure are you of these results?". If you say there's no effect, they will ask you "how sure are you there's no effect?", if you say there is an effect, they will ask you "how sure are you there is an effect"? Think for a moment how you'd ideally like to answer these questions (100% sure is off the cards). The level of surety you can offer depends on your website traffic and the test.

When the test is over, you calculate a p-value of 0.01, which is less than your \(\alpha\), so you reject the null hypothesis. In other words, you think there's an effect. Next, you calculate power. Let's say you get 0.75. Your threshold for accepting a conversion rate of 2.2% is 0.8. What's next?

It's quite possible that the technology works, but doesn't increase the conversion rate all the way to 2.2%. It might increase conversion to 2.05% or 2.1%, for example. These kinds of conversion rate lifts might not justify the cost of the technology.

What do you do?

You have four choices, each with positives and negatives.

  1. Reject the new technology because it didn't pass the test. This is a fast decision, but you run the risk of foregoing technology that would have helped the business.
  2. Carry on with the test until it reaches your desired power. Technically, the best, but it may take more time than you have available.
  3. Accept the technology with the lower power. This is a risky bet, and it's very dangerous to do regularly (lower thresholds mean you make more mistakes).
  4. Try a test with a lower lift, say an alternate hypothesis that the conversion rate is 2.1%.

None of these options are great. You need strong statistics to decide on the right way forward for your business.

(A/B testing was painted as an easy-to-use wonder technique. The reality is, it just isn't.)

What's a good value?

The "industry standard" power is 80%, but where does this come from? It's actually a quote from Michael Cohen in his 1988 book "Statistical Power Analysis for the Behavioral Sciences", he said if you're stuck and can't figure out what the power should be, use 80% as a last result. Somehow the value of last resort has become an unthinking industry standard. But what value should you chose?

Let's go back to the definitions of \( \alpha \) and \( \beta \) (remember, \( \beta \) is 1 - power). \( \alpha \) corresponds to the probability of a false positive; \( \beta \) corresponds to the probability of a false negative. How do you balance these two kinds of error? Do you think a false positive is just as bad as a false negative, or better, or worse? The industry standard choices for \( \alpha \) and \( \beta \) are 0.05 and 0.20 (1 - 0.8), which implies we think a false positive is four times worse than a false negative. Is that what you intended? Is that ratio appropriate for your business?

In retail, adding new technologies to a website comes with a cost, but there's also the risk of forgoing revenue if you get a false negative. I'm tempted to advise you to choose the same value, 0.05, for both \( \alpha \) and \( \beta \) (which gives a power of 95%). This does increase the sample size and may take it beyond the reach of some websites. If you're bumping up against the limits of your traffic when designing tests, it's probably better to use something other than an A/B test.

Why is power so misunderstood?

Conceptually, power is quite simple (the probability of making a true positive observation), but it's wrapped up with the procedure for defining and using a null hypothesis test. Frankly, the whole null hypothesis setup is highly complex and unsatisfactory (Bayesian statistics may offer a better approach). My gut feeling is that \( \alpha \) is easy to understand, but once you get into the full language of null hypothesis testing, people get left behind, which means they don't understand power.

Not understanding power leaves you prone to making bad mistakes, like under-powering tests. An underpowered test might mean you reject technologies that could increase your conversion rate. Conversely, underpowered tests can lead you to claim a bigger effect than is really there. Overall, it leaves you vulnerable to making the wrong decision.