Are the books right about coin tossing?
Almost every probability book and course starts with simple coin-tossing examples, but how do we know that the books are right? Has anyone tossed coins several thousand times to see what happens? Does coin-tossing actually have any relevance to business? (Spoiler alert: yes it does.) Coin tossing is boring, time-consuming, and badly paid, so there are two groups of people ideally suited to do it: prisoners and students.
Prisoner of war
John Kerrich was an English/South African mathematician who went to visit in-laws in Copenhagen, Denmark. Unfortunately, he was there in April 1940 when the Nazis invaded. He was promptly rounded up as an enemy national and spent the next five years in an internment camp in Jutland. Being a mathematician, he used the time well and conducted a series of probability experiments that he published after the War [Kerrich]. One of these experiments was tossing a coin 10,000 times. The results of the first 2,000 coin tosses are easily available on Stack Overflow and elsewhere, but I've not been able to find all 10,000, except in outline form.
We’re going to look at the cumulative mean of Kerrich’s data. To get this, we’ll score a head as 1 and a tail as 0. The cumulative mean is the mean of all the scores we’ve seen so far; if there are 55 heads after 100 tosses, it’s 0.55, and so on. Of course, we expect it to go to 0.5 ‘in the long run’, but how long is the long run? Here’s a plot of Kerrich’s data for the first 2,000 tosses.
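The scoring rule above is simple enough to sketch in a few lines of Python. This is a minimal illustration, not the code from kerrich.py; the function name `cumulative_mean` and the short example sequence are made up for the purpose.

```python
def cumulative_mean(tosses):
    """Return the running mean of a sequence of scores (1 = head, 0 = tail)."""
    means = []
    heads = 0
    for count, toss in enumerate(tosses, start=1):
        heads += toss                # heads seen so far
        means.append(heads / count)  # mean of all scores up to this toss
    return means


# Made-up example data, NOT Kerrich's results
example = [1, 0, 0, 1, 1, 1, 0, 1]
for toss_number, mean in enumerate(cumulative_mean(example), start=1):
    print(f"after {toss_number} tosses: {mean:.3f}")
```

Feeding in 55 heads and 45 tails, in any order, gives 0.55 as the final value, matching the example in the text.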
I don’t have all of Kerrich’s tossing data for individual tosses, but I do have his cumulative mean results at different numbers of tosses, which I’ve reproduced below.
| Number of tosses | Mean | Confidence interval (±) |
Do you find something surprising in these results? There are at least two things I constantly need to remind myself when I’m analyzing A/B test results and simple coin-tossing serves as a good wake-up call.
The first piece is how many tosses you need to do to get reliable results. I won’t go into probability theory too much here, but suffice it to say, we usually quote a range, called the confidence interval, to describe our level of certainty in a result. So a statistician won’t say 0.5, they’d say 0.5 ± 0.04. You can unpack this to mean “I don’t know the number exactly, but I’m 95% sure it lies in the range 0.46 to 0.54”. It’s quite easy to calculate a confidence interval for an unbiased coin for different numbers of tosses. I've put the confidence interval in the table above.
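For anyone who wants to see the calculation, here's a rough sketch. It assumes the usual normal approximation for a proportion and a 95% level (z = 1.96); the post doesn't say which method was used for the table, so treat this as one reasonable way to get such numbers rather than the exact recipe. The function name `ci_half_width` is made up.

```python
import math


def ci_half_width(n_tosses, p=0.5, z=1.96):
    """Approximate 95% confidence half-width for the mean of n_tosses tosses.

    Assumes the normal approximation to the binomial, which is reasonable
    for a fair coin once n_tosses is more than a few dozen.
    """
    return z * math.sqrt(p * (1 - p) / n_tosses)


for n in (10, 100, 600, 2000, 10000):
    print(f"{n:>6} tosses: 0.5 ± {ci_half_width(n):.3f}")
```

Under these assumptions, a range of ± 0.04 corresponds to roughly 600 tosses, and the range shrinks only with the square root of the number of tosses, so halving it costs four times as many tosses.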
The second piece is the structure of the results. Naively, you might have thought the cumulative mean would smoothly approach 0.5, but it doesn’t. The chart above shows a ‘blip’ around 100 tosses where the results seem to change, and this kind of ‘blip’ happens very often in simulation results.
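You can see this for yourself with a small simulation. The sketch below is not the coin-simulator app; it's a bare-bones stand-in that uses Python's standard `random` module, and the function name, seeds, and checkpoints are arbitrary choices for illustration.

```python
import random


def simulate_cumulative_mean(n_tosses, seed=None):
    """Simulate n_tosses fair coin tosses and return the cumulative mean."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    heads = 0
    means = []
    for count in range(1, n_tosses + 1):
        heads += rng.randint(0, 1)  # 1 = head, 0 = tail
        means.append(heads / count)
    return means


checkpoints = (10, 100, 1000, 10000)
for seed in range(3):
    run = simulate_cumulative_mean(10000, seed=seed)
    print(f"seed {seed}:", [round(run[n - 1], 3) for n in checkpoints])
```

Different seeds give different paths. Plotting a few of them side by side usually shows the cumulative mean wandering and occasionally jumping before it settles near 0.5.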
Both of these pieces have big implications. A/B tests are similar in some ways to coin tosses. The ‘blip’ reminds us that we could call a result too soon, and the number of tosses needed reminds us that we need to carefully calculate the expected duration of a test. In other words, we need to know what we're doing and we need to interpret results correctly.
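To make the "how long is long enough" question concrete, here's a sketch that turns a target confidence range into a number of tosses. It assumes a fair coin, a 95% level (z = 1.96), and the normal approximation; a real A/B test calculation would also need to account for effect size and statistical power, which this deliberately leaves out. The function name `tosses_needed` is made up.

```python
import math


def tosses_needed(margin, p=0.5, z=1.96):
    """Smallest number of tosses whose 95% half-width is at most margin.

    Solves z * sqrt(p * (1 - p) / n) <= margin for n, using the normal
    approximation to the binomial.
    """
    return math.ceil((z * math.sqrt(p * (1 - p)) / margin) ** 2)


for margin in (0.04, 0.02, 0.005):
    print(f"± {margin}: about {tosses_needed(margin)} tosses")
```

The thing to notice is the scaling: tightening the range by a factor of two needs about four times as many tosses.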
In 2009, two Berkeley undergraduates, Priscilla Ku and Janet Larwood, tossed a coin 20,000 times each and recorded the results. It took them about one hour a day for a semester. You can read about their experiment here. I've plotted their results on the chart below.
The results show a similar pattern to Kerrich’s. There’s a ‘blip’ in Priscilla's results, but the cumulative mean does tend to 0.5 in the ‘long run’ for both Janet and Priscilla.
Kerrich’s and the Berkeley students’ results are the most quoted coin-tossing results you see on the internet, but in textbooks, Kerrich’s story gets told more because it’s so colorful. However, others have spent serious time tossing coins and recording the results; they’re less famous because they only quoted the final number and didn’t give the entire dataset. In 1900, Karl Pearson reported the results of tossing a coin 24,000 times (12,012 heads), which followed on from the results of Count Buffon, who tossed a coin 4,040 times (2,048 heads).
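As a quick check on those two final numbers, the sketch below works out each mean and compares it with an approximate 95% range for a fair coin. The head and toss counts come from the paragraph above; the normal approximation with z = 1.96 and the function name `summarize` are assumptions for illustration.

```python
import math

# (heads, tosses) as quoted in the text
experiments = {
    "Buffon": (2048, 4040),
    "Pearson": (12012, 24000),
}


def summarize(heads, tosses, z=1.96):
    """Return (mean, 95% half-width for a fair coin, within-range flag)."""
    mean = heads / tosses
    half_width = z * math.sqrt(0.25 / tosses)  # 0.25 = 0.5 * (1 - 0.5)
    return mean, half_width, abs(mean - 0.5) <= half_width


for name, (heads, tosses) in experiments.items():
    mean, half_width, within = summarize(heads, tosses)
    print(f"{name}: mean {mean:.4f}, fair-coin range 0.5 ± {half_width:.4f}, "
          f"within range: {within}")
```

Both means fall inside the range you'd expect from a fair coin under these assumptions.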
I can’t leave the subject of coin tossing without mentioning Derren Brown, the English mentalist. Have a look at this YouTube video where he flips an unbiased coin heads ten times in a row. It’s all one take and there’s no trickery. Have a think about how he might have done it.
Got your ideas? Here’s how he did it: the old-fashioned way. He recorded himself flipping coins until he got ten heads in a row. It took hours.
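Why hours? For a fair coin, the expected number of flips before you first see a run of n heads is 2^(n+1) - 2, which is 2,046 flips for ten in a row. The sketch below states that formula and adds a tiny simulation you can use to check it on shorter runs. The function names are made up, and the number of flips in the actual recording isn't given here.

```python
import random


def expected_flips_for_run(run_length):
    """Expected flips of a fair coin until run_length heads in a row appear."""
    return 2 ** (run_length + 1) - 2


def flips_until_run(run_length, rng):
    """Simulate fair coin flips until run_length heads in a row appear."""
    streak = 0
    flips = 0
    while streak < run_length:
        flips += 1
        if rng.random() < 0.5:  # head: extend the streak
            streak += 1
        else:                   # tail: start over
            streak = 0
    return flips


rng = random.Random(2024)  # arbitrary seed so the run can be repeated
trials = [flips_until_run(5, rng) for _ in range(2000)]
print("theory for a run of 5: ", expected_flips_for_run(5))
print("simulated average:     ", sum(trials) / len(trials))
print("theory for a run of 10:", expected_flips_for_run(10))
```

The spread around that average is wide, so any single attempt could take far fewer or far more flips.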
But what if?
So far, all the experimental results are consistent with theory and I expect they always will be. I had a flight of fancy one day that there’s something new waiting for us out past 100,000 or 1,000,000 tosses; perhaps theory breaks down as we toss more and more. To find out if there is something there, all I need is a coin and some students or prisoners.
More technical details
I’ve put some coin-tossing resources on my GitHub page under the coin-tossing section.
- Kerrich is the Kerrich data set out to 2,000 tosses in detail and out to 10,000 tosses in summary. The Python code kerrich.py displays the data in a friendly form.
- Berkeley is the Berkeley dataset. The Python code berkeley.py reads in the data and displays it in a friendly form. The file 40000tosses.xlsx is the Excel file containing the Berkeley data.
- coin-simulator is some Python code that shows multiple coin-tossing simulations. It's built as a Bokeh app, so you'll need to install the Bokeh module to use it.