
Friday, December 12, 2025

Data sonification: a curious oddity that may have some uses

What is sonification?

The concept is simple: you turn data into sound. Obviously, you can play with frequency and volume, but there are more subtle sonic qualities you can use to represent data too. Let's imagine you had sales data for different countries that went up and down over time. You could assign a different instrument to each country (e.g. drums for the US, piano for Germany, violin for France), and represent different sales volumes as different notes. The hope, of course, is that the notes get higher as sales increase.

If you have more musical experience, you could turn data sets into more interesting music, for example, mapping ups and downs in the data to shifts in tone and speed. 
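At its simplest, the mapping only takes a few lines of code. Here's a minimal sketch in Python that maps a made-up monthly sales series onto pitches and writes the result to a WAV file; the sales figures, pitch range, and note length are all illustrative choices, not a recipe.

# A minimal sketch of sonification: map a hypothetical monthly sales series onto
# pitches and write them out as a WAV file. The sales numbers, pitch range, and
# note length are illustrative assumptions, not real data.
import math
import struct
import wave

SAMPLE_RATE = 44100      # CD-quality sample rate
NOTE_SECONDS = 0.3       # length of each note

sales = [12, 15, 14, 18, 22, 21, 25, 30, 28, 35]   # made-up monthly sales

# Map each value linearly onto a pitch range (A3 = 220 Hz up to A5 = 880 Hz),
# so higher sales come out as higher notes.
lo, hi = min(sales), max(sales)

def to_freq(value):
    return 220.0 + (value - lo) / (hi - lo) * (880.0 - 220.0)

samples = []
for value in sales:
    freq = to_freq(value)
    for i in range(int(SAMPLE_RATE * NOTE_SECONDS)):
        # Plain sine tone, amplitude kept well below the 16-bit maximum.
        samples.append(int(12000 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE)))

with wave.open("sales.wav", "w") as f:
    f.setnchannels(1)              # mono
    f.setsampwidth(2)              # 16-bit samples
    f.setframerate(SAMPLE_RATE)
    f.writeframes(struct.pack("<%dh" % len(samples), *samples))

Play the file back and rising sales come out as rising pitch, which is the effect described above.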

(Gemini)

Examples

Perhaps the simplest sonification example is the one you've probably seen in movies: using a Geiger counter to measure radiation. The more it clicks, the more radiation there is. Because the output is sound rather than a dial, the user can keep their eyes on where they point the detector and use their ears to detect radiation. It's so simple, even James Bond has used a Geiger counter. In a similar vein, metal detectors use sound to alert the user to the presence of metal.

Perhaps the best example of sonification I've heard is Brian Foo mapping income inequality along the New York Subway's 2 line. You can watch the video and hear the music here: https://vimeo.com/118358642?fl=pl&fe=sh. He's turned a data set into a story, and you can see how this could be taken further into a full-on multimedia presentation.

Sometimes, our ears can tell us things our eyes can't. Steve Mould's video "The planets are weirdly in sync" has a great sonification example starting here: https://youtu.be/Qyn64b4LNJ0?t=1110; the sonification reveals relationships in the data that charts or animations can't. The whole video is worth a watch too (https://www.youtube.com/watch?v=Qyn64b4LNJ0).

There are two other related examples of sonification I want to share. 

In a nuclear facility, you sometimes hear a background white noise sound. That signifies that all is well. If the sound goes away, that signifies something very bad has happened and you need to get out fast. Why not sound an alarm if something bad happens? Because if something really bad happens, there might not be power for the alarm. Silence is a fail-safe.

In a similar vein, years ago I worked on an audio processing system. We needed to know the system was reliable, so we played a CD of music over and over through the system. If we ever heard a break or glitch in the music, we knew the audio system had failed and we needed to intervene to catch the bug. This was a kind of ongoing sonic quality assurance system.

What use is it?

Frankly, sonification isn't something I'd expect people to use every day. It's a special-purpose tool, but it's handy to know about. Here are two use cases.

  • The obvious one is presenting company data. This could be sales, clicks, conversions, etc. With a bit of effort and musical ability, you could do the kind of thing that Brian Foo did. Imagine an investor presentation (or even an all-hands meeting) with a full-on multimedia presentation combining charts, video, and sound.
  • The other use is safety and alerting. Imagine a company selling items on a website. It could pipe music into common areas (e.g. restrooms and lunch areas). If sales are going well, it plays fast music; if they're slow, it plays slow music. If there are no sales at all, you get silence. This is a way of attuning everyone to the rhythm of sales and alerting them if something goes wrong. Obviously, this could go too far, but you get the idea.

Finding out more

Sonification: the music of data - https://www.youtube.com/watch?v=br_8wXKgtkg

The planets are weirdly in sync - https://www.youtube.com/watch?v=Qyn64b4LNJ0

Brian Foo's sonifications - https://datadrivendj.com/

NASA's astronomical data sonifications - https://science.nasa.gov/mission/hubble/multimedia/sonifications/

The sound of science - https://pmc.ncbi.nlm.nih.gov/articles/PMC11387736/

Thursday, November 6, 2025

How to get data analysis very wrong: sample size effects

We're not reading the data right

In the real world, we're under pressure to get results from data analysis. Sometimes, the pressure to deliver certainty means we forget some of the basics of analysis. In this blog post, I'm going to talk about one pitfall that can lead you to give wildly wrong answers. I'll start with an example.

School size - smaller schools are better?

You've probably heard the statement that "small schools produce better results than large schools". Small-school advocates point out that small schools disproportionately appear among the top-performing schools in an area. It sounds like small schools are the way to go. Or are they? It's also true that small schools disproportionately appear among the worst schools in an area. So which is it: are small schools better or worse?

The answer is: both. Small schools have a higher variation in results because they have fewer students. The results are largely due to “statistical noise” [1].

We can easily see the effects of sample size "statistical noise", more properly called variance, in a very simple example. Imagine tossing a coin and scoring heads as 1 and tails as 0. You would expect the mean over many tosses to be close to 0.5, but how many tosses do you have to do? I wrote a simple program to simulate tossing a coin and kept a running mean as I went along. The charts below show four simulations. The x axis of each chart is the number of tosses, the y axis is the running mean, the blue line is the simulation, and the red dotted line is 0.5.
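Something like the sketch below does the job; the toss counts and the random seed here are arbitrary choices.

# A minimal coin-toss simulation: score heads as 1 and tails as 0 and track the
# running mean as the number of tosses grows. Toss counts and seed are arbitrary.
import random

def running_means(n_tosses, seed=None):
    rng = random.Random(seed)
    total = 0
    means = []
    for i in range(1, n_tosses + 1):
        total += rng.randint(0, 1)       # 1 = heads, 0 = tails
        means.append(total / i)
    return means

# The early running means swing widely; only after many tosses do they settle
# near the expected value of 0.5.
for n in (10, 100, 1_000, 10_000):
    print(n, round(running_means(n, seed=42)[-1], 3))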

The charts clearly show higher variance at low numbers of tosses. It takes a surprisingly large number of tosses for the mean to get close to 0.5. If we want more certainty, and less variance, we need bigger sample sizes.

We can repeat the experiment, but this time with a six-sided die, recording the running mean. We'd see the same result: more variance for shorter simulations. Let's try a more interesting example (you'll see why in a minute). Let's imagine a 100-sided die and run the experiment multiple times, recording the mean result after each simulation (I've shown a few runs here).

Let's change the terminology a bit here. A roll of the 100-sided die is a percentage test result, and each student rolls the die. If there are 100 students in a school, there are 100 die rolls; if there are 1,500 students in the school, we roll the die 1,500 times. We now have a simulation of school test results and the effect of school size.

I simulated 500 schools with 500 to 1,500 students. Here are the results.
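If you want to reproduce something similar, a minimal sketch along these lines will do it; the random seed is arbitrary and the scoring is just the 100-sided die described above.

# A minimal version of the school simulation: each student's score is one roll of
# a 100-sided die, and a school's result is the mean of its students' scores.
# The school count and size range follow the description above; the seed is arbitrary.
import random

rng = random.Random(1)

schools = []
for _ in range(500):
    n_students = rng.randint(500, 1500)
    scores = [rng.randint(1, 100) for _ in range(n_students)]
    schools.append((n_students, sum(scores) / n_students))

# Rank schools by mean score: smaller schools tend to crowd both extremes.
schools.sort(key=lambda school: school[1])
print("sizes of the five lowest-scoring schools: ", [n for n, _ in schools[:5]])
print("sizes of the five highest-scoring schools:", [n for n, _ in schools[-5:]])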

As you can see, there's more variance for smaller schools than for larger schools. This neatly explains why smaller schools are both the best and the worst in an area.

You might object to the simplicity of my analysis: surely real school results don't look like this. What does real-world data show? Wainer [1] did the work and got the real results (read his paper for more details). Here's a screenshot from his paper showing real-world school results. It looks a lot like my simple-minded simulation.

Sample size variation is not the full explanation for school results, but it is a factor. Any analysis has to take it into account. Problems occur because of simple (wrong) analysis and overly-simple conclusions.

The law of large numbers

The effect that variance goes down with increasing sample size is known as the law of large numbers. It’s widely taught and there’s a lot written about it online. Unfortunately, most of the discussions get lost in the weeds very quickly. These two references do a very good job of explaining what’s going on: [1] [2].

The law of large numbers has a substantial body of mathematical theory behind it. It has an informal counterpart that's a bit easier to understand, called the law of small numbers, which says that there's more variance in smaller samples than in large ones. Problems occur because people assume that small samples behave in the same way as large samples (that small school results have the same variance as large school results, for example).

So far, this sounds simple and obvious, but in reality, most data analysts aren't fully aware of the effect of sample size. It doesn't help that the language used in the real world doesn't match the language used in the classroom.

Small sales territories are the best?

Let’s imagine you were given some sales data on rep performance for an American company and you were asked to find factors that led to better performance.

Most territories have about 15-20 reps, with a handful having 5 or fewer reps. The top 10 leaderboard for the end of the year shows you that the reps from the smaller territories are doing disproportionately well. The sales VP is considering changing her sales organization to create smaller territories, and she wants you to confirm what she's seen in the data. Should she re-organize to smaller territories to get better results?

Obviously, I've prepped you with the answer, but if I hadn't, would you have concluded smaller territories are the way to go?

Rural lives are healthier?

Now imagine you're an analyst at a health insurance company in the US. You've come across data on the prevalence of kidney cancer by US county. You've found that the lowest prevalence is in rural counties. Should you set company policy based on this data? It seems obvious that the rural lifestyle is healthier. Should health insurance premiums include a rural/urban cost difference?

I’ve taken this example from the paper by Wainer [1]. As you might have guessed, rural counties have both the lowest and the highest rates of kidney cancer because their populations are small, so the law of small numbers kicks in. I’ve reproduced Wainer’s chart here: the x axis is county population and the y-axis is cancer rate, see his paper for more about the chart. It’s a really great example of the effect of sample size on variance.

A/B test hell

Let’s take a more subtle example. You’re running an A/B test that’s inconclusive. The results are really important to the company. The CMO is telling everyone that all the company needs to do is run the test for a bit longer. You are the analyst and you’ve been asked if running more tests is the solution. What do you say?

The only time it's worth running the test a bit longer is if the test is on the verge of significance. Other than that, it's probably not worth it. Van Belle's book [3] has a nice chapter on sample size calculations that you can access for free online [4]. The bottom line is, the smaller the effect, the larger the sample size you need for significance, and the relationship isn't linear. I've seen A/B tests that would have to run for over a year to reach significance.
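To give a feel for the numbers, here's a minimal sketch using the standard two-proportion sample size formula; the baseline conversion rate and the uplifts are hypothetical examples.

# A minimal sample size estimate for an A/B test using the standard two-proportion
# formula. The baseline conversion rate and uplifts are hypothetical examples.
from scipy.stats import norm

def sample_size_per_arm(p1, p2, alpha=0.05, power=0.8):
    """Approximate visitors needed per arm to detect a shift from p1 to p2."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

baseline = 0.02                           # 2% conversion rate
for uplift in (0.5, 0.1, 0.02):           # 50%, 10%, and 2% relative uplifts
    n = sample_size_per_arm(baseline, baseline * (1 + uplift))
    print(f"{uplift:.0%} relative uplift: roughly {n:,.0f} visitors per arm")

Halving the detectable effect roughly quadruples the required sample size, which is why small effects can translate into test durations of months or years.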

Surprisingly, I've seen analysts who don't know how to do a sample size/duration estimate for an A/B test. That really isn't a good place to be when the business is relying on you for answers.

The missing math

Because I’m aiming for a more general audience, I’ve been careful here not to include equations. If you’re an analyst, you need to know:

  • What variance is and how to calculate it.
  • How sample size can affect results - you need to look for it everywhere.
  • How to estimate how much of what you're seeing is due to sample size effects and how much is due to something "real" (see the sketch below for one way to do this).
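One simple way to make that last estimate is to compare the spread you actually see across groups with the spread you'd expect from sample size alone: the standard error of the mean, sigma/sqrt(n), which is the equation at the heart of Wainer's paper [1]. Here's a minimal sketch using simulated scores; the population mean, standard deviation, and group sizes are made up.

# Compare the expected spread from sample size alone (the standard error,
# sigma / sqrt(n)) with the spread actually observed across simulated groups.
# The population mean, standard deviation, and group sizes are made up.
import math
import random

rng = random.Random(7)
population_sd = 10.0     # assumed individual-level standard deviation

for n in (25, 100, 400, 1600):
    expected_se = population_sd / math.sqrt(n)
    # Simulate many groups of size n and measure the spread of their mean scores.
    means = []
    for _ in range(2000):
        means.append(sum(rng.gauss(50, population_sd) for _ in range(n)) / n)
    grand_mean = sum(means) / len(means)
    observed_sd = math.sqrt(
        sum((m - grand_mean) ** 2 for m in means) / (len(means) - 1))
    print(f"n = {n:>4}: expected spread {expected_se:.2f}, observed {observed_sd:.2f}")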
Unfortunately, references for the law of large numbers get too technical too quickly. A good place to start is references that cover variance and standard deviation calculations. I like reference [5], but be aware it is technical.

The bottom line

The law of large numbers can be hidden in data; the language used and the way the data is presented can both obscure what's going on. You need to be acutely aware of sample size effects: you need to know how to calculate them and how they can manifest themselves in data in surprising ways.

References

[1] Howard Wainer, “The Most Dangerous Equation”, https://www.americanscientist.org/article/the-most-dangerous-equation

[2] Jeremy Orloff, Jonathan Bloom, “Central Limit Theorem and the Law of Large Numbers”, https://math.mit.edu/~dav/05.dir/class6-prep.pdf 

[3] Gerald van Belle, "Statistical rules of thumb", http://www.vanbelle.org/struts.htm 

[4] Gerald van Belle, "Statistical rules of thumb chapter 2 - sample size", http://www.vanbelle.org/chapters/webchapter2.pdf

[5] Steven Miller, "The probability lifesaver"




Saturday, September 6, 2025

Old & experienced vs. young and energetic: mean age in English football

Which is better, youth or experience?

Professional sports are pretty much a young person's game and English football is no exception; it's rare to see players over 30. A notable exception is Mark Howard, a goalkeeper for Wrexham up to 2025, who was 38 at the end of his contract. His advanced age earned him the nickname "Jurassic Mark". He carried on playing as long as he did because his experience gave him an edge.

Given that all teams are youthful, is it better to have an older team (guided by experience) or a younger team (powered by the energy of youth)? Which type of team might score more goals? I'm going to explore this question in this blog post.

(Canva)

The data

I've taken the data for this blog post from Transfermarkt (https://www.transfermarkt.com/), which has data on the mean age of English football club squads at the start of each season. Obviously, transfers etc. change the mean age during the season, but it's a reasonable place to start.

The charts

Here's a chart showing total goals for, goals against, and goal difference per club per season (for each league) against mean team age at the start of the season. I've added a linear fit to the data so you can see the trends, and I've included a 95% confidence band around the fit. The r² value and p-value are shown in the chart title.

The charts are interactive, you can:

  • Zoom in and out of the data using the menu on the left.
  • Save the charts to disk using the menu on the left.
  • See the data point values by hovering your mouse over them.
  • Select the league tier using the buttons.
  • Select the season using the slider.
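
For anyone who wants to reproduce this kind of fit, the sketch below shows the idea; the (age, goals) pairs are made up for illustration, and the real analysis uses the Transfermarkt squad ages and the league goal totals.

# A minimal sketch of the kind of linear fit shown in the charts: regress goals
# scored on mean squad age and report r-squared and the p-value. The data points
# here are made up for illustration only.
from scipy import stats

mean_age  = [24.1, 25.3, 26.0, 26.8, 27.5, 28.2, 24.8, 25.9, 27.1, 26.4]
goals_for = [68, 61, 55, 59, 52, 49, 72, 58, 50, 57]

fit = stats.linregress(mean_age, goals_for)
print(f"slope: {fit.slope:.2f} goals per year of mean age")
print(f"r^2:   {fit.rvalue ** 2:.3f}")
print(f"p:     {fit.pvalue:.3f}")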


What the charts show

There's some correlation between goals and mean team age, but it isn't very strong. 

For the Premier League, there is a consistent pattern over the years that younger teams do better, but it's a small effect, really something that's second-order at best.

For the lower leagues, again, there's an effect, but it's smaller and less consistent.

One thing that did surprise me was the consistency of the mean age ranges across leagues and across time. I would have thought that lower leagues might have more players towards the end of their careers (slower and cheaper), or possibly more young players (inexperienced and cheaper), which would skew the club mean age older or younger. That doesn't seem to be the case. It's possible lower leagues have a different age makeup from the Premier League, but I can't get at that from this data set.

What does it mean?

A player might have ten years (ages 20-30) in the top flight if they're lucky, which suggests 25 is mid-career for most of them. At some point, they'll have an optimal balance between experience and youth, but that's unlikely to be at the very beginning or end of their career. A similar argument might apply to teams as a whole. If there's any truth to this argument, then some form of triangular fit would be better than a straight linear fit. Even with the linear fit, we can see there is some relationship between goals and mean age, albeit a very weak one.
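As a rough sketch of what a triangular fit might look like, here's a tent-shaped curve fitted with scipy; the data points and starting guesses are invented, and the only point is that performance rises to some optimal mean age and then falls away.

# A sketch of a "triangular" (tent-shaped) fit: goals rise up to an optimal mean
# age and fall away after it. The data points and starting guesses are invented.
import numpy as np
from scipy.optimize import curve_fit

def tent(age, peak_age, peak_goals, rise, fall):
    # Piecewise-linear peak: one slope up to peak_age, another slope after it.
    return np.where(age <= peak_age,
                    peak_goals - rise * (peak_age - age),
                    peak_goals - fall * (age - peak_age))

mean_age = np.array([23.5, 24.2, 25.0, 25.8, 26.5, 27.3, 28.0, 28.8])
goals    = np.array([48, 55, 63, 66, 64, 58, 52, 45])

params, _ = curve_fit(tent, mean_age, goals, p0=[26.0, 65.0, 5.0, 5.0])
print("fitted peak mean age: %.1f" % params[0])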

I'm looking for features that help predict team success. Club mean age seems like it would be a good second-order one.