Thursday, August 28, 2025

The sisters "paradox" - counter-intuitive probability

It seems simple, but it isn't

There are a couple of famous counter-intuitive problems in probability theory and the sisters "paradox" is one of them. I'll tell you the problem, let you guess the solution, and then give you some of the background.

Here's the problem: a family has two children. You're told that at least one of them is a girl. What's the probability both are girls?

(Image credit: International Film Service / American Releasing Co., Public domain, via Wikimedia Commons)

Assume that each child is equally likely to be a girl or a boy (50% each) and that the birth order has no effect on the probability. Assume the family is selected at random from all two-child families that have at least one girl.

What do you think the probability is that both children are girls?

A simpler question

Let's imagine you're asked a simpler question.

A family has two children. What's the probability both are girls?

We can work this out using a simple probability tree:

              Boy (0.5)                              Girl (0.5)
             /         \                            /          \
Boy-Boy (0.25)    Boy-Girl (0.25)      Girl-Boy (0.25)    Girl-Girl (0.25)

So the probability of two girls is 0.25.

Note there are two ways of having a boy and a girl, so the total probability of having a boy and a girl (in any order) is 0.5.
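
The tree can be checked by brute-force enumeration. This is a minimal Python sketch, assuming the 50/50 independent births stated above; the names `SEXES`, `families`, and `p_each` are my own, chosen for illustration.

```python
from fractions import Fraction
from itertools import product

SEXES = ("Boy", "Girl")

# All ordered two-child families; each one has probability 1/2 * 1/2 = 1/4.
families = list(product(SEXES, repeat=2))
p_each = Fraction(1, len(families))

# Add up the probabilities of the leaves that match each event.
p_two_girls = sum(p_each for f in families if f == ("Girl", "Girl"))
p_one_of_each = sum(p_each for f in families if set(f) == {"Boy", "Girl"})

print(p_two_girls)    # 1/4
print(p_one_of_each)  # 1/2
```

Using `Fraction` keeps the arithmetic exact, so the results match the tree rather than a rounded decimal.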

The wrong answer

Let's go back to the original problem and see the logic behind the most-often given wrong answer.

The birth chance is 0.5 boy and 0.5 girl. We already know one of the children is a girl, and the other child, whose gender we don't know, has a 0.5 probability of being a girl. The probability of there being two girls must therefore be 0.5.

It sounds right because it sounds logical, but it's wrong, for reasons I'll explain next.

The correct answer

The correct answer is 1/3. Let's see why.

In the probability tree above, we can see four equally likely combinations: {Boy-Boy} (0.25), {Boy-Girl} (0.25), {Girl-Boy} (0.25), and {Girl-Girl} (0.25). We're told in the problem that the {Boy-Boy} combination is ruled out, which leaves us with three remaining combinations. Each of these three remaining combinations is equally likely and it has to be one of them, which means the probability of two girls is 1/3.

There are two ways of having a boy and a girl, {Boy-Girl} and {Girl-Boy}, which means there's a 2/3 probability of having a boy and a girl (in any order). The mistake is to consider that a 0.5 probability.
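
The same counting argument can be written as a conditional probability: keep only the combinations consistent with what we're told, then count the fraction that match the event we care about. The sketch below assumes the four equally likely combinations from the probability tree; the helper `conditional` and the event functions are hypothetical names I've made up for this example.

```python
from fractions import Fraction
from itertools import product

# Four equally likely ordered families, as in the probability tree.
families = list(product(("Boy", "Girl"), repeat=2))


def conditional(event, given):
    """P(event | given) over equally likely families, as an exact fraction."""
    kept = [f for f in families if given(f)]
    return Fraction(sum(1 for f in kept if event(f)), len(kept))


def at_least_one_girl(family):
    return "Girl" in family


def two_girls(family):
    return family == ("Girl", "Girl")


def one_of_each(family):
    return set(family) == {"Boy", "Girl"}


print(conditional(two_girls, at_least_one_girl))    # 1/3
print(conditional(one_of_each, at_least_one_girl))  # 2/3
```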

Sample space

The underlying method to solve this problem is to use something called the 'sample space', which is the set of all possible outcomes of a trial. In our case, once {Boy-Boy} is ruled out, the set of all outcomes is {Boy-Girl}, {Girl-Boy}, {Girl-Girl}. We can associate probabilities with each of the elements of our sample space. In our case, they're all 1/3.

The sample space idea helps us solve variations of the problem. Here's an example: if we're told the eldest child is a girl, does this change anything? Actually, it does. Writing the eldest child last in each pair, the sample space becomes {Boy-Girl}, {Girl-Girl}, so the probability of two girls is now 1/2. Why? Because the {Girl-Boy} combination, where the eldest child is a boy, is no longer possible.
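
Here's a short sketch of the eldest-child variation. It assumes each family is written as a (younger, eldest) pair, so the eldest child is listed last; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Each family is (younger, eldest), so the eldest child is listed last.
families = list(product(("Boy", "Girl"), repeat=2))

# Keep only the families whose eldest child is a girl.
eldest_is_girl = [f for f in families if f[1] == "Girl"]
p_two_girls = Fraction(
    eldest_is_girl.count(("Girl", "Girl")), len(eldest_is_girl)
)

print(eldest_is_girl)  # [('Boy', 'Girl'), ('Girl', 'Girl')]
print(p_two_girls)     # 1/2
```

Knowing which child is the girl removes one of the two mixed combinations, which is why the answer moves from 1/3 to 1/2.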

How might you test this?

With problems like this that seem counter-intuitive, a good way forward is to actually test the theory. Plainly, it would be expensive to survey real families, but we can run a computer simulation. Here are the steps.

  1. Randomly create a large number of two-children families with the sample space {Boy-Boy}, {Boy-Girl}, {Girl-Boy}, {Girl-Girl} and probabilities 1/4, 1/4, 1/4, and 1/4.
  2. Select only the families that have at least one girl.
  3. Now figure out the fraction of all the selected families that are {Girl-Girl}.

Interestingly, if you think about ways of testing a solution, it often helps you define the problem a bit better. I found just writing the test process down helped me confirm the correct answer.
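
The three steps above can be sketched in a few lines of Python. This is one possible implementation, not the only one: the function name `simulate`, the family count, and the fixed seed are all assumptions I've picked so the run is repeatable.

```python
import random


def simulate(n_families=200_000, seed=0):
    """Estimate P(two girls | at least one girl) by simulation."""
    rng = random.Random(seed)

    # Step 1: each child is independently "B" or "G" with probability 1/2,
    # which gives the four combinations a probability of 1/4 each.
    families = [(rng.choice("BG"), rng.choice("BG")) for _ in range(n_families)]

    # Step 2: select only the families that have at least one girl.
    selected = [f for f in families if "G" in f]

    # Step 3: fraction of the selected families that are girl-girl.
    return sum(f == ("G", "G") for f in selected) / len(selected)


estimate = simulate()
print(f"{estimate:.3f}")  # should land close to 1/3, not 1/2
```

Changing the filter in step 2 to `f[1] == "G"` gives the eldest-child variation, and the estimate moves to about 1/2.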

Controversy, complexity, and meaning

I've presented a simple analysis here, but you should be aware that things can get much more complex. The Wikipedia article on the Boy or girl paradox goes into some painful detail about the problem and the controversy around it. In short, the exact wording of the problem matters.
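
To give a feel for why the wording matters, here's a sketch comparing two readings of "at least one of them is a girl". Both readings are my own assumptions about how the family came to our attention: reading A is the one I stated in the problem setup (the family is picked from all two-child families with at least one girl), while reading B supposes we happened to meet one child chosen at random from the family and she turned out to be a girl.

```python
from fractions import Fraction
from itertools import product

families = list(product("BG", repeat=2))
quarter = Fraction(1, 4)  # probability of each ordered family

# Reading A: the family is drawn from all families with at least one girl.
p_reading_a = quarter / sum(quarter for f in families if "G" in f)

# Reading B: we meet one child picked at random from a random family, and
# that child is a girl. Weight each family by the chance of meeting a girl.
weights = {f: quarter * Fraction(f.count("G"), 2) for f in families}
p_reading_b = weights[("G", "G")] / sum(weights.values())

print(p_reading_a)  # 1/3
print(p_reading_b)  # 1/2
```

Under reading B, a two-girl family is twice as likely to produce the observation as a mixed family, which is what pulls the answer up to 1/2.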

This might seem abstract, but I've seen variations of this problem pop up in business and I've had difficult conversations with non-technical people as a result. It's especially hard when the "common sense" error gives a more optimistic answer than the correct answer. Realistically, the only way forward is prior education and the use of sample space arguments.

Probability theory, and conditional probability in particular, can give some very counter-intuitive results. Here's my advice if you're working with probabilities:

  • Be as precise as you can be and list all your assumptions.
  • Figure out how you might run a computer simulation to test your theory. Go back and look at the problem definition once you've defined your simulation.
  • Don't rely on "common sense".

9 comments:

  1. It's just the Monty Hall paradox framed differently.

    Replies
    1. Excellent point! I can see where you're coming from. The solution follows the same logic as the Monty Hall problem. In my view, though, they are different problems that share the same solution logic.
      Perhaps we should pose another problem: what's the probability these are different problems?

    2. This is not like the Monty Hall Problem. The Monty Hall Problem is about having a statistical probability and then learning more information: when Monty eliminates one of the two wrong answers, you are left with your original choice at 33% and the other door at 67%, so you switch doors. It's new information.

      If you do the Monty Hall Problem but assume the contestant is swapped after Monty opens the door with a new contestant that doesn't know which door the first contestant picked, then both doors have a 50% chance.

      So the Monty Hall Problem is about getting new information. This question doesn't have that factor. You just know that a family has a female child and an unknown child and your job is to figure out the probability that the unknown child is a female.

  2. It's a neat riddle. Here's some R code if anybody wants to prove it to themselves:
    ```
    library(tidyverse)

    samples <- 10^6 # total number of children, i.e. samples/2 families
    # draw each child's gender independently with equal probability
    children <- sample(c("Boy", "Girl"), size = samples, replace = TRUE)
    child_number <- rep(c(1, 2), samples/2) # first or second child in the family
    families <- rep(1:(samples/2), each = 2) # family id shared by each pair of children

    tibble(gender = children, child_number = child_number, family = families) |>
      # reshape to one row per family with columns child_1 and child_2
      pivot_wider(values_from = gender, names_from = child_number, names_prefix = "child_", id_cols = family) |>
      filter(child_1 == "Girl" | child_2 == "Girl") |> # has at least one girl
      summarise(both_girls = mean(child_1 == "Girl" & child_2 == "Girl")) # proportion where both are girls
    ```

  3. Why do we treat boy-girl and girl-boy as different outcomes? Per the question, they are equal outcomes and thus represent one possible outcome.

    We don't care about which came first or second, only what gender each child is.

    Thus the answer, to the question given the information you have, is 50%. The only possible outcomes are girl-girl or girl-boy (where order is irrelevant.)

    And this is absolutely NOT the Monty Hall Problem. The Monty Hall Problem contains three possible choices and one is eliminated by the host, this is what makes the statistical math interesting in that problem. No choice is eliminated here.

    Let's look at the exact wording of the question:

    > a family has two children. You're told that at least one of them is a girl. What's the probability both are girls?

    We have a family with two children. Assume we don't know their gender. We'll represent them as XX.

    We are told one of them is a female. So now they are represented as GX (remember GX = XG, since order doesn't matter.)

    You are left with the question: what is the probability that X is female? Well, there are only two choices, F and M, and we are told elsewhere that the probability of having a girl is 50/50.

    > Assume that the probability of having a girl or boy is 50% and that the birth order has no effect on the probability.

    So the chance of X being female is 50%. Thus the answer is 50%.

    You can't say birth order doesn't matter and then use birth order to say the FM and MF are different results. The only possible results are FM and FF (since birth order is irrelevant.)

    Replies
    1. The question posed was:
      > A family has two children. You're told that at least one of them is a girl. What's the probability both are girls?

      The only correct answer is 1/3 or 33.3%. You can verify this analytically or empirically; I added code to verify it empirically in a comment above, and the author covered it analytically in the "Sample space" section.

      This part: "If we're told the eldest child is a girl, does this change anything? Actually, it does." is particularly important.

      The question you are answering is "Given that a family has one child, and that child is a girl, what is the probability that their next child will also be a girl?" That is a different question from the one posed, and it has a different answer!



    2. To get to 1/3rd you need a sample space of four options and you eliminate one option. To do this you have to assume the children are distinguishable by birth order (that is assuming something never stated.)

      If birth order matters then the sample space is: [MM, MF, FF, FM]. Since we know MM is not a possibility we have only [MF, FF, FM] for a valid sample space and the probability is 1/3rd.

      But that, again, has assumed a fact. That MF and FM are distinguishable. But, since birth order doesn't matter, we can't assume that. We can't even assume we know the birth order - what if they were adopted and we don't know who was born first?

      Thus the sample space to start with is: [FM, FF]. We don't include MF because it's the same as FM. They are only different if you assume the subjects are distinguishable by order, which we can't do.

      And when the sample space is [FM, FF] then the probability that the family has two girls is 50%.

      Your code treats the [F,M] and [M,F] outcomes as distinguishable. They aren't. They both contain one female and one male so they are equal outcomes where birth order is irrelevant.

  4. There's ambiguity here. Imagine the following.

    "The question writer took all sets of two child families and ruled out the bb case. Then they asked the exact question above". This is indeed a 1/3 chance - select gg from [gg, bg, gb].

    vs

    "The question writer came across a girl from a two child family, then they asked the exact question above". This is a 1/2 chance - select gg from [gg, gg, bg, gb] with gg listed twice since there are two ways to select a girl from that set; i.e. coming across a girl is twice as likely to occur from the gg case as it is from either gb or bg.

    The question as stated doesn't resolve this ambiguity. We don't know how the question writer sampled the data. It's a bit silly that you've linked to the Wikipedia article https://en.wikipedia.org/wiki/Boy_or_girl_paradox which talks about this and why 1/3 is NOT correct. Instead you've stated it's absolutely 1/3 and that the 1/2 answer is incorrect.

    The correct answer is there's no way to answer this without knowing how the data was sampled. A classic problem in data science.

  5. It is not that the question, posed as a math problem, is ambiguous; it is actually underdetermined. There is not enough information given in the problem at hand to provide an answer on which everyone will agree. From a mathematical standpoint, a few things are missing to fully characterise the problem:
    1) What is the event space to be truly considered? E.g., should we lump the (g, b) and (b, g) events into a single event (m) = (g,b) U (b,g) that does not discriminate order in the event space? That is what some comments have referred to when mentioning that it depends on how the data was sampled. If indeed we ask every girl belonging to a pair of siblings whether she has a sister or not, the sample space of such a question would be {yes=(g,g), no=(m)} where (m) is defined above.
    2) The probability measure (on the unspecified event space) is also missing from the question statement. Consequently, it is impossible to provide an irrefutable mathematical answer to it. Whether one should assume a uniform measure on the event space of preference {(g,g), (g,b), (b,g)} or {(g,g), (m)} is also unclear.

    But we should be honest in admitting that this underdetermination is the case for 99% of problems posed in probability and statistics classes. There is nothing special about this problem or this post specifically. The same would arise with the birthday problem, the Monty Hall problem, or even evaluating the probability of getting at least one Heads by tossing a coin twice, etc.

    One comment I would make in favour of the proposed answer by the OP is that almost all commenters appear to be able to conceive of the birth of a child as a single random experiment, the outcome of which would be a girl or a boy with probability 50%. Taking this as a given, what one can ask is what answer to the proposed problem is consistent with such an assumption: in my opinion there is only one, which is the one provided by the OP. Indeed, from the sample space 𝝎 = {g,b} for a single birth, the natural sample space for two births is going to be 𝛺 = 𝝎x𝝎 = {(g,g), (g,b), (b,g), (b,b)}. The naturally occurring event space is going to be the power set of this joint sample space which will contain the order-dependent events (g,b) and (b,g), but also the event (m) = (g,b) U (b,g).

    Next, it appears that most people would be fine with assuming the two births as being independent random experiments. This means that we can obtain the probability measure on the sample space 𝛺 of the joint experiment by taking the product of the probability measure on the sample space 𝝎 of the single random experiment corresponding to a single birth.

    Under these circumstances, the answer to the proposed problem should be 1/3, in the sense that it is formally indistinguishable from making 2 independent tosses of a fair coin and asking the probability of obtaining two Heads given that at least one toss came up Heads, which is 1/3.
