# Are the books right about coin tossing?

Almost every probability book and course starts with simple coin-tossing examples, but how do we know that the books are right? Has anyone tossed coins several thousand times to see what happens? Does coin-tossing actually have any relevance to business? (Spoiler alert: yes it does.) Coin tossing is boring, time-consuming, and badly paid, so there are two groups of people ideally suited to do it: prisoners and students.

# Prisoner of war

John Kerrich was an English/South African mathematician who went to visit in-laws in Copenhagen, Denmark. Unfortunately, he was there in April 1940 when the Nazis invaded. He was promptly rounded up as an enemy national and spent the next five years in an internment camp in Jutland. Being a mathematician, he used the time well and conducted a series of probability experiments that he published after the War [Kerrich]. One of these experiments was tossing a coin 10,000 times. The results of the first 2,000 coin tosses are easily available on Stack Overflow and elsewhere, but I've not been able to find all 10,000, except in outline form.

We’re going to look at the cumulative mean of Kerrich’s data. To get this, we’ll score a head as 1 and a tail as 0. The cumulative mean after n tosses is the mean of all the scores seen so far; if after 100 tosses there are 55 heads, the cumulative mean is 0.55, and so on. Of course, we expect it to go to 0.5 ‘in the long run’, but how long is the long run? Here’s a plot of Kerrich’s data for the first 2,000 tosses.
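As a quick sketch (my code, not Kerrich’s), here’s how a cumulative mean can be computed; the simulated fair coin below stands in for real toss data:

```python
import random

def cumulative_means(tosses):
    """tosses is a sequence of 1s (heads) and 0s (tails).
    Returns the running mean after each toss."""
    means, heads = [], 0
    for n, toss in enumerate(tosses, start=1):
        heads += toss
        means.append(heads / n)
    return means

# Simulated fair coin; Kerrich's actual 0/1 sequence would slot in here.
random.seed(42)
tosses = [random.randint(0, 1) for _ in range(2000)]
print(cumulative_means(tosses)[-1])  # near, but probably not exactly, 0.5
```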

I don’t have all of Kerrich’s tossing data for individual tosses, but I do have his cumulative mean results at different numbers of tosses, which I’ve reproduced below.

Number of tosses | Mean | Confidence interval (±) |
---|---|---|
10 | 0.4 | 0.303 |
20 | 0.5 | 0.219 |
30 | 0.566 | 0.177 |
40 | 0.525 | 0.155 |
50 | 0.5 | 0.139 |
60 | 0.483 | 0.126 |
70 | 0.457 | 0.117 |
80 | 0.437 | 0.109 |
90 | 0.444 | 0.103 |
100 | 0.44 | 0.097 |
200 | 0.49 | 0.069 |
300 | 0.486 | 0.057 |
400 | 0.498 | 0.049 |
500 | 0.51 | 0.044 |
600 | 0.52 | 0.040 |
700 | 0.526 | 0.037 |
800 | 0.516 | 0.035 |
900 | 0.509 | 0.033 |
1000 | 0.461 | 0.030 |
2000 | 0.507 | 0.022 |
3000 | 0.503 | 0.018 |
4000 | 0.507 | 0.015 |
5000 | 0.507 | 0.014 |
6000 | 0.502 | 0.013 |
7000 | 0.502 | 0.012 |
8000 | 0.504 | 0.011 |
9000 | 0.504 | 0.010 |
10000 | 0.5067 | 0.009 |

Do you find something surprising in these results? There are at least two things I constantly need to remind myself of when I’m analyzing A/B test results, and simple coin tossing serves as a good wake-up call.

The first piece is how many tosses you need to get reliable results. I won’t go into probability theory too much here, but suffice it to say, we usually quote a range, called a confidence interval, to describe our level of certainty in a result. So a statistician won’t say 0.5; they’d say 0.5 +/- 0.04. You can unpack this to mean “I don’t know the number exactly, but I’m 95% sure it lies in the range 0.46 to 0.54”. It’s quite easy to calculate a confidence interval for an unbiased coin for different numbers of tosses; I’ve put the confidence intervals in the table above.
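Here’s how I’d sketch that calculation (my code, using the standard normal approximation for an unbiased coin):

```python
import math

def ci_half_width(n, p=0.5, z=1.96):
    """Half-width of the 95% confidence interval for the mean of n tosses
    of a coin with heads probability p (normal approximation, z = 1.96)."""
    return z * math.sqrt(p * (1 - p) / n)

print(round(ci_half_width(100), 3))   # ~0.098; compare the 100-toss table row
print(round(ci_half_width(1000), 3))  # ~0.031; compare the 1000-toss table row
```

Note how slowly the interval shrinks: it narrows with the square root of the number of tosses, so ten times as many tosses only makes the interval about three times tighter.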

The second piece is the structure of the results. Naively, you might have thought the cumulative mean would smoothly approach 0.5, but it doesn’t. The chart above shows a ‘blip’ around 100 where the results seem to change, and this kind of ‘blip’ happens very often in simulation results.

Both of these pieces have big implications. A/B tests are similar in some ways to coin tosses. The ‘blip’ reminds us that we could call a result too soon, and the number of tosses needed reminds us that we need to carefully calculate the expected duration of a test. In other words, we need to know what we’re doing and we need to interpret results correctly.

# Students

In 2009, two Berkeley undergraduates, Priscilla Ku and Janet Larwood, tossed a coin 20,000 times each and recorded the results. It took them about one hour a day for a semester. You can read about their experiment here. I've plotted their results on the chart below.

The results show a similar pattern to Kerrich’s. There’s a ‘blip’ in Priscilla's results, but the cumulative mean does tend to 0.5 in the ‘long run’ for both Janet and Priscilla.

These two are the most quoted coin-tossing results you see on the internet, but in textbooks, Kerrich’s story gets told more because it’s so colorful. However, others have spent serious time tossing coins and recording the results; they’re less famous because they only quoted the final number and didn’t give the entire dataset. In 1900, Karl Pearson reported the results of tossing a coin 24,000 times (12,012 heads), which followed on from the results of Count Buffon who tossed a coin 4,040 times (2,048 heads).

# Derren Brown

I can’t leave the subject of coin tossing without mentioning Derren Brown, the English mentalist. Have a look at this YouTube video where he flips an unbiased coin and gets heads ten times in a row. It’s all one take and there’s no trickery. Have a think about how he might have done it.

Got your ideas? Here’s how he did it; the old-fashioned way. He recorded himself flipping coins until he got ten heads in a row. It took hours.
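A back-of-envelope check on why it took hours (my sketch, not Brown’s method): for a fair coin, the expected number of flips before the first run of ten heads is 2^11 - 2 = 2046. A quick simulation agrees:

```python
import random

def flips_until_streak(k, rng):
    """Number of fair-coin flips until the first run of k consecutive heads."""
    streak = flips = 0
    while streak < k:
        flips += 1
        streak = streak + 1 if rng.random() < 0.5 else 0
    return flips

rng = random.Random(0)
trials = [flips_until_streak(10, rng) for _ in range(2000)]
print(sum(trials) / len(trials))  # hovers around the theoretical 2046
```

At a few seconds per flip, two thousand or so flips is indeed hours of filming.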

# But what if?

So far, all the experimental results are consistent with theory, and I expect they always will be. I had a flight of fancy one day that there’s something new waiting for us out past 100,000 or 1,000,000 tosses - perhaps theory breaks down as we toss more and more. To find out if there is something there, all I need is a coin and some students or prisoners.

**More technical details**

I’ve put some coin-tossing resources on my GitHub page under the coin-tossing section.

* *Kerrich* is the Kerrich data set, out to 2,000 tosses in detail and out to 10,000 tosses in summary. The Python code kerrich.py displays the data in a friendly form.
* *Berkeley* is the Berkeley dataset. The Python code berkeley.py reads in the data and displays it in a friendly form. The file 40000tosses.xlsx is the Excel file containing the Berkeley data.
* *coin-simulator* is some Python code that shows multiple coin-tossing simulations. It's built as a Bokeh app, so you'll need to install the Bokeh module to use it.

Thanks, Mike. Here’s a thought... what if you had a nearly fair coin? Maybe the heads side sticks out more and the center of gravity is a bit off or something. Anyway, it comes up heads with probability 0.5 + delta. How many tosses does it take to determine the rate is not 0.5 to a certain degree of confidence?

Hi Colin, thank you for your comment. This is the basis of a statistical test called the z-test. I know some people are probably howling ‘t-test’ right now, but if the coin is only slightly biased, the number of samples needed to differentiate it from an unbiased coin is so great that we’re into z-test territory. The z-test is used in things like A/B testing to tell the difference between two normal distributions. (As an aside, if you’re looking for a difference between two non-normal distributions, then you use the Kolmogorov–Smirnov test.) I might do a future blog post about these kinds of tests if anyone is interested.
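As a sketch of the arithmetic behind Colin’s question (my own code, using the usual normal-approximation sample-size formula; the alpha and power values are my assumptions):

```python
from math import ceil
from statistics import NormalDist

def tosses_needed(delta, alpha=0.05, power=0.80):
    """Approximate tosses needed for a two-sided z-test to distinguish a
    coin with heads probability 0.5 + delta from a fair coin."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p = 0.5  # p * (1 - p) is largest at 0.5, so this is conservative
    return ceil((z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2)

print(tosses_needed(0.01))  # a 1% bias needs roughly 20,000 tosses
```

The number of tosses grows with 1/delta², so halving the bias you want to detect quadruples the work - which is why a slightly bent coin is so hard to catch.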
