Saturday, September 6, 2025

Old & experienced vs. young and energetic: mean age in English football

Which is better, youth or experience?

Professional sports are pretty much a young person's game and English football is no exception; it's rare to see players over 30. One notable example is Mark Howard, a goalkeeper for Wrexham up to 2025, who was 38 at the end of his contract. His advanced age earned him the nickname "Jurassic Mark". He carried on playing as long as he did because his experience gave him an edge.

Given all teams are youthful, is it better to have an older team (guided by experience) or a younger team (the energy of youth)? Which type of team might score more goals? I'm going to explore this issue in this blog post.

(Canva)

The data

I've taken the data for this blog post from Transfermarkt (https://www.transfermarkt.com/), which has data on the mean age of English football clubs at the start of each season. Obviously, transfers etc. change the mean age during the season, but it's a reasonable place to start.

The charts

Here's a chart showing total goals for, against, and goal difference per season per club per league against mean team age at the start of the season. I've added a linear fit to the data so you can see the trends and I've included a 95% confidence band around the fit. The r² value is in the chart title, as is the p-value.

The charts are interactive. You can:

Zoom in and out of the data using the menu on the left.
Save the charts to disk using the menu on the left.
See a data point's values by hovering your mouse over it.
Select the league tier using the buttons.
Select the season using the slider.
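
The linear fit and r² value described above can be sketched in a few lines. This is a minimal illustration using ordinary least squares, not the code behind the charts. The function name linear_fit and the numbers are made up for the example and are not taken from Transfermarkt. It leaves out the p-value and the 95% confidence band, which need a t-distribution that the Python standard library doesn't provide.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus r squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a straight-line fit, r squared is the squared correlation coefficient.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Made-up numbers for illustration only; NOT real club data.
mean_ages = [24.0, 25.0, 26.0, 27.0, 28.0]
goals_for = [70, 62, 58, 49, 51]

slope, intercept, r_squared = linear_fit(mean_ages, goals_for)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r^2={r_squared:.3f}")
```

A negative slope here would mean goals fall as mean age rises. An r² near zero means mean age explains very little of the variation in goals.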

What the charts show

There's some correlation between goals and mean team age, but it isn't very strong.

For the Premier League, there is a consistent pattern over the years that younger teams do better, but it's a small effect, really something that's second-order at best.

For the lower leagues, again, there's an effect, but it's smaller and less consistent.

One thing that did surprise me was the consistency of the mean age ranges across leagues and across time. I would have thought that lower leagues might have more players towards the end of their careers (slower and cheaper) or possibly more younger players (inexperienced and cheaper) and that might skew the club mean age older or younger. That doesn't seem to be the case. It's possible lower leagues have a different club age makeup from the Premier League, but I can't get at that from this data set.

What does it mean?

A player might have ten years (ages 20-30) in the top flight if they're lucky, which suggests 25 is mid-career for most of them. At some point, they'll have an optimal balance between experience and youth, but that's unlikely to be at the beginning or end of their career. A similar argument might apply to teams as a whole. If there's any truth to this argument, then some form of triangular fit would be better than a straight-line fit. Even with the linear fit, we can see there is some relationship between goals and mean age, albeit a very weak one.

I'm looking for features that help predict team success. Club mean age seems like it would be a good second-order one.

Tuesday, September 2, 2025

Em dash = AI slop?

Punctuation as a giveaway

Recently, I've seen a lot of comments on the web that the use of em dashes is a dead giveaway that an article has been written by AI. This immediately made me think of my own use of dashes and semicolons. I don't use AI for text generation, but I wondered if my writing might be mistaken for AI because of my use of punctuation. I decided to take a deeper look at the whole area.

(Gemini, with some assistance.)

Punctuation symbols

Let's start by looking at the symbols themselves.

Symbol	Name	Commentary
—	Em dash	Not easily available from my keyboard. Named because it's the width of a capital M. HTML: &mdash; Markdown: ---
–	En dash	Not easily available from my keyboard. Named because it's the width of a capital N. HTML: &ndash; Markdown: --
-	Hyphen	Available easily on my keyboard (minus sign)
;	Semicolon	Easily available from my keyboard.
,	Comma	Easily available from my keyboard.

Grammatical use

I'm not going to go into grammar too much here because I'm the wrong person to do that, but I will very briefly summarize the situation (my favorite grammar book is "Rules for Writers" by Hacker and Sommers; check it out if you want a good grammar reference). Commas and semicolons have different grammatical purposes and their use goes back a long time. Hyphens are a more modern invention and seem to have some of the same uses as both commas and semicolons; a sort of generic punctuation mark.

As far as I can tell, em dashes, en dashes, and hyphens are used for more or less the same grammatical purpose; they're interchangeable. Some websites suggest that there is a difference in grammatical use between — and – and these are reputable websites, so there may be some fine distinction. For most people and most uses, em dashes, en dashes, and hyphens serve the same purpose.

Usage in the real world by people

Recent writing seems to favor the use of - rather than ;, especially in short-form communications like text messages or even emails. I've noticed some modern authors are using hyphens instead of semicolons; in fact, I've met a professional writer who always used hyphens and never semicolons. Overall, semicolon usage seems to be in decline.

If I'm typing in text, normally I only use characters easily available from my keyboard, unless I'm using a special character like a currency symbol (e.g. €). In other words, it's unlikely I'll use em dashes or en dashes. Given that it's hard to tell the different dashes apart, it's hard to understand why anyone (any human) other than a professional typesetter would use a dash other than a - (hyphen). In the sentence below, have I used an em dash, an en dash, or even a hyphen?

"David lived in Paris 2005–2010."

It's hard to tell, isn't it? Which means that humans can't easily distinguish em dashes, en dashes, and hyphens.

Is it a reliable AI detector?

Recent English usage seems to favor - over ;, so you can see why an AI might learn to use - rather than ;. As I said earlier, there are some websites that distinguish different uses between —, –, and -, so it's possible an AI will apply these rules too. You can sometimes detect non-native English speakers because their English is too good: they don't make the mistakes native speakers do. Something similar may be happening here. An AI may be applying a "dashes" rule that a native writer wouldn't.

Is it smoking-gun proof? Probably not. I'm sure there are writers who love different dashes, and of course, the software they're using may convert hyphens into different types of dashes for them. But it is a strong indicator.

I find distinguishing between dashes hard, but peeking at the underlying HTML or Markdown gives away the use of em dashes and en dashes immediately. So if you have access to the text, you can check.
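
If you have the raw text, one way to check is to count the characters directly. This is a minimal sketch, not a robust detector. The function name count_dashes and the sample string are made up for illustration. It counts the literal characters plus the named HTML entities from the table above, and it ignores numeric entities such as &#8212;.

```python
# Each mark maps to (literal character, named HTML entity or None).
# The dashes are written as Unicode escapes so they can't be confused by eye.
MARKS = {
    "em dash": ("\u2014", "&mdash;"),
    "en dash": ("\u2013", "&ndash;"),
    "hyphen": ("-", None),
}


def count_dashes(text):
    """Count em dashes, en dashes, and hyphens in raw text or HTML source."""
    counts = {}
    for name, (char, entity) in MARKS.items():
        total = text.count(char)
        if entity is not None:
            total += text.count(entity)
        counts[name] = total
    return counts


# Made-up sample: one en dash, one em dash, and one hyphen.
sample = "See pages 10\u201320 \u2014 a self-contained example."
print(count_dashes(sample))
```

Note that a Markdown source that writes an em dash as --- will show up here as three hyphens, so you need to know which format you're looking at.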

By contrast, the use of a ; may indicate a human writer, until of course, AIs learn how to use it (im)properly.

Thursday, August 28, 2025

The sisters "paradox" - counter-intuitive probability

It seems simple, but it isn't

There are a couple of famous counter-intuitive problems in probability theory and the sisters "paradox" is one of them. I'll tell you the problem, let you guess the solution, and then give you some of the background.

Here's the problem: a family has two children. You're told that at least one of them is a girl. What's the probability both are girls?

(International Film Service / American Releasing Co., Public domain, via Wikimedia Commons)

Assume that the probability of having a girl or boy is 50% and that the birth order has no effect on the probability. Assume the family is selected at random from all two-child families that have at least one girl.

What do you think the probability is that both children are girls?

A simpler question

Let's imagine you're asked a simpler question.

A family has two children. What's the probability both are girls?

We can work this out using a simple probability tree:

           Boy (0.5)                        Girl (0.5)
          /         \                      /          \
Boy-Boy (0.25)  Boy-Girl (0.25)  Girl-Boy (0.25)  Girl-Girl (0.25)

So the probability of two girls is 0.25.

Note there are two ways of having a boy and a girl, so the total probability of having a boy and a girl (in any order) is 0.5.

The wrong answer

Let's go back to the original problem and see the logic behind the most-often given wrong answer.

The birth chance is 0.5 boy and 0.5 girl. We don't know the gender of one of the children, but there must be a 0.5 probability it's a girl. Given the fact we already know one of the children is a girl, the probability of there being two girls must therefore be 0.5.

It sounds right because it sounds logical, but it isn't right, for reasons I'll explain next.

The correct answer

The correct answer is 1/3. Let's see why.

In the probability tree above, we can see four equally likely combinations: {Boy-Boy} (0.25), {Boy-Girl} (0.25), {Girl-Boy} (0.25), and {Girl-Girl} (0.25). We're told in the problem that the {Boy-Boy} combination is ruled out, which leaves us with three remaining combinations. Each of these three remaining combinations is equally likely and it has to be one of them, which means the probability of two girls is 1/3.

There are two ways of having a boy and a girl, {Boy-Girl} and {Girl-Boy}, which means there's a 2/3 probability of having a boy and a girl (in any order). The mistake is to consider that a 0.5 probability.

Sample space

The underlying method to solve this problem is to use something called the 'sample space', which is the set of all possible outcomes of a trial. In our case, once {Boy-Boy} is ruled out, the set of all outcomes is {Boy-Girl}, {Girl-Boy}, {Girl-Girl}. We can associate a probability with each element of our sample space. In our case, they're all 1/3.

The sample space idea helps us solve various versions of the problem. Here's an example. If we're told the eldest child is a girl, does this change anything? Actually, it does. The sample space becomes {Boy-Girl}, {Girl-Girl}, so the probability is now 1/2 (the eldest child is listed last in each pair). Why? Because the {Girl-Boy} combination isn't possible.
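
The sample space argument can be checked by listing the outcomes and counting. This is a minimal sketch that assumes the four combinations are equally likely, as stated earlier. The helper name prob_two_girls is made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Each family is a pair (younger, eldest), matching the convention above
# that the eldest child is listed last. All four pairs are equally likely.
families = list(product(["Boy", "Girl"], repeat=2))


def prob_two_girls(condition):
    """Probability of Girl-Girl among the families that satisfy condition."""
    kept = [family for family in families if condition(family)]
    both = [family for family in kept if family == ("Girl", "Girl")]
    return Fraction(len(both), len(kept))


at_least_one_girl = prob_two_girls(lambda family: "Girl" in family)
eldest_is_girl = prob_two_girls(lambda family: family[1] == "Girl")

print(at_least_one_girl)  # 1/3
print(eldest_is_girl)     # 1/2
```

Changing the condition is all it takes to explore other versions of the problem, which is the point of working with the sample space.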

How might you test this?

With problems like this that seem counter-intuitive, a good way forward is to actually test the theory. Plainly, it would be expensive to ask people for real, but we can do a computer simulation. Here are the steps.

Randomly create a large number of two-children families with the sample space {Boy-Boy}, {Boy-Girl}, {Girl-Boy}, {Girl-Girl} and probabilities 1/4, 1/4, 1/4, and 1/4.
Select only the families that have at least one girl.
Now figure out the fraction of all the selected families that are {Girl-Girl}.
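
The steps above can be sketched as a short simulation. This is one possible implementation, not the only one. The function name simulate, the number of families, and the random seed are arbitrary choices made for illustration.

```python
import random


def simulate(num_families=100_000, seed=0):
    """Estimate P(two girls | at least one girl) by simulation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    selected = 0
    both_girls = 0
    for _ in range(num_families):
        # Step 1: create a two-child family; each child is a boy or a girl
        # with probability 1/2, so all four combinations have probability 1/4.
        family = (rng.choice(["Boy", "Girl"]), rng.choice(["Boy", "Girl"]))
        # Step 2: select only the families with at least one girl.
        if "Girl" in family:
            selected += 1
            # Step 3: count how many selected families are Girl-Girl.
            if family == ("Girl", "Girl"):
                both_girls += 1
    return both_girls / selected


print(simulate())
```

With a large number of families, the printed fraction should be close to 1/3 rather than 1/2.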

Interestingly, if you think about ways of testing a solution, it often helps you define the problem a bit better. I found just writing the test process down helped me confirm the correct answer.

Controversy, complexity, and meaning

I've presented a simple analysis here, but you should be aware that things can get a lot, lot more complex. The Wikipedia article on the Boy or girl paradox goes into some painful detail about the problem and the controversy around it. Without going into too much detail, the detailed text of the problem is important.

This might seem abstract, but I've seen variations of this problem pop up in business and I've had difficult conversations with non-technical people as a result. It's especially hard when the "common sense" error gives a more optimistic answer than the correct answer. Realistically, the only way forward is prior education and the use of sample space arguments.

Probability theory, and conditional probability in particular, can give some very counter-intuitive results. Here's my advice if you're working with probabilities:

Be as precise as you can be and list all your assumptions.
Figure out how you might run a computer simulation to test your theory. Go back and look at the problem definition once you've defined your simulation.
Don't rely on "common sense".