The Birthday Problem: Advanced

Posted on 05 June, 2021

Exact probabilities for counting the chance of collisions with distinct objects, using the birthday problem as a working example.

Introduction

This follows from my previous blog post on Intuitive explanations for non-intuitive problems: the Birthday Problem. Please see that post for a heuristic understanding of the problem. It assumed independence of sharing birthdays. This post makes no such assumption, so the mathematics is more advanced. A background in counting methods with combinations and permutations is required.

I have divided this section in to the following parts:

  1. Probability of k collisions
  2. Probability of groups of collisions
  3. Probability of at least 1 collision
  4. Probability of collisions when drawing from a distribution
  5. Integer partitions

Probability of $k$ collisions

What is the chance of $i$ people in a group of $k$ people sharing a birthday? In the previous post I assumed independence between sharing birthdays. But sharing is dependent. If person A and person B share a birthday, then sharing a birthday with person B depends on sharing a birthday with person A. Counting methods can account for this dependence.

In a group of 23 people there are ${23 \choose 2}=253$ ways that two people can be chosen. Separately we can imagine filling 22 slots with distinct birthdays (not 23 because one is shared). There are $^{365}P_{22}$ ways of filling these slots. The total probability is then:

\[\begin{align} P(x=2 | k=23) &= {23 \choose 2} \frac{^{365}P_{22}}{365^{23}} \\ &= 0.3634 \end{align}\]

Similarly for three people sharing a birthday:

\[\begin{align} P(x=3 | k=23) &= {23 \choose 3} \frac{^{365}P_{21}}{365^{23}} \\ &= 0.0074 \end{align}\]

This is much more unlikely.

Probability of groups of collisions

There can also be two pairs of two people sharing birthdays. Or three pairs. Or two pairs and a triplet and a quintuplet.

First let’s take the case of two pairs of two people amongst 23 people in total. We first choose 2 to be part of the first pair ($\binom{23}{2}$), then 2 of the remaining 21 to be part of the second pair ($\binom{21}{2}$). We can shuffle the birthdays between 21 unique slots ($^{365}P_{21}$). The first two slots can be shuffled without changing the arrangement because they both have identical looking pairs ($2!$). The probability is then:

partitions 2x2x19
\[\begin{align} P(x=(2,2) | k=23) &= \frac{\binom{23}{2}\binom{21}{2}}{2!}\frac{^{365}P_{21}}{365^{23} } \\ &= \frac{23!}{(2!)^2 2!19!} \frac{^{365}P_{21}}{365^{23}} \\ &= 0.1109 \end{align}\]

Now let’s add a set of three:

partitions 2x2x3x16
\[\begin{align} P(x=(2,2,3) | k=23) &= \frac{\binom{23}{2}\binom{21}{2}}{2!}\binom{19}{3} \frac{^{365}P_{19}}{365^{23} } \\ &= \frac{23!}{(2!)^2 2!3!16!} \frac{^{365}P_{19}}{365^{23}} \\ &= 0.0009 \end{align}\]

This is much more unlikely.

To get the probability of at least 2 people sharing, we need to account for all these scenarios. We can go through all ways of partitioning $k$ people. This is known as the integer partition problem. In practice, I find it is accurate enough to only consider up to 3 or 4 people sharing a birthday, because the other scenarios are so unlikely. However the method in the next section is much simpler.

Probability of at least 1 collision

As in my previous post, it’s easiest to work backwards. First find the probability that no people share a birthday and then subtract this from 1. For $k$ people, we can imagine shuffling $k$ unique birthdays in $k$ slots:

\[\begin{align} P(x \geq 1 | k) &= 1 - \frac{^{365}P_{k}}{365^k} \end{align}\]

For 23 people this gives a probability of 0.5073. The previous sections show that of the collisions that happen, 73% will be just two people, and 22% two pairs of two people.

Probability of collisions when drawing from a distribution

A rather subtle but important assumption made in the previous sections is that we are drawing people from a large population. This assumption breaks down when drawing from a small sample. For example, Facebook friends. We can only draw as many pairs, triples and sets in general as exist. In my group of friends there is a day where seven people share a birthday but none where six do. So I cannot include this probability in my calculation, which the calculation in the previous section does.

The probability of drawing from a distribution is a tricky problem. One notable application is calculating probabilities of hands in poker.

It is simpler and quicker to run Monte Carlo simulations, as I did in my previous post. Here is a comparison of running 1,000 and 10,000 trials per point compared to the true curve:

probability graphs

The Monte Carlo simulations hold up well. The MC 10,000 curve is almost indistinguishable from the theoretical curve.

But how does one calculate the theoretical curve? Let’s start with a simpler, very specific problem. What if I want to select 30 friends of which none share birthdays, but 12 share birthdays with one other friend (not in the 30), and 2 share with two other friends each, and 1 is one of the 7 that share a birthday on the same day? (This is the simpler problem.)

  • There are $\binom{374}{30}$ ways in total to choose 30 people.
  • From the $n_2=73$ pairs, I want to select 10 pairs $\binom{73}{10}$, 1 of 2 from each pair $\binom{2}{1}^{10}$.
  • From the $n_3=13$ triples, I want to select 2 triples $\binom{13}{2}$, 1 of 3 from each triple $\binom{3}{1}^{2}$.
  • From the $n_7=1$ sevens, I want to select 1 person of 7 $\binom{7}{1}$.
  • The remaining $17$ all come from the set of $n_1=132$ singeltons $\binom{132}{17}$.

Hence the total probability is: \(\begin{align} P&(x =(1\times 17, 2\times 10, 3\times 2, 7\times 1)|k=30) \\ &= \left[ \binom{73}{10} \binom{2}{1}^{10} \right] \left[ \binom{13}{2} \binom{3}{1}^2 \right] \left[ \binom{1}{1} \binom{7}{1} \right] \binom{132}{17} \div \binom{374}{30} \\ &= 0.00001981 \end{align}\)

Now repeat this calculation for every possible way of making up 30 friends from the known distribution. This can be done with an integer partition algorithm with a maximum on the values that each partition can take. This will give the theoretical value of selecting zero friends who share birthdays. Subtract from 1 to get the probability of least 2 sharing a birthday.

For Julia code on my laptop, this takes 22.5 seconds. Monte Carlo simulations with 10,000 trials per each point from 1 to 60 takes 2.2 seconds for all 600,000 trials. So the Monte Carlo simulations are faster.

If there is an easier way please let me know 🙂.

Integer partitions

For completeness, here is code for the integer partition problem implemented in Julia.

If $p(n)$ is an integer partition of $n$, then $p(n) = k + p(n -k)$. The following recursive algorithm is based off this formula:

function integer_partitions(n::Integer, max_value::Int)
    if n < 0
        throw(DomainError(n, "n must be nonnegative"))
    elseif n == 0
        return Vector{Int}[[]]
    end

    partitions = Vector{Int}[]
    for k in max_value:-1:1
        for p in integer_partitions(n-k, min(k, n-k))
            push!(partitions, vcat(k, p))
        end
    end        
    return partitions
end

This function gives all the unique partitions of a number. For the above problems we have the additional constraints:

  • There is a maximum value at each partition index.
  • Permutations of partitions are unique e.g. 3+0+0, 0+3+0 and 0+0+3 are all unique.

The function is therefore modified to keep an index count:

function integer_partitions(n::Integer, max_values::Vector{Int}, idx=1)
    if n < 0
        throw(DomainError(n, "n must be nonnegative"))
    elseif n == 0 
        return Vector{Int}[[]]
    end

    partitions = Vector{Int}[]
    max_value = (idx <= length(max_values)) ? min(n, max_values[idx]) : 0
    for k in max_value:-1:1
        for p in integer_partitions(n-k, max_values, idx + 1)
            push!(partitions, vcat(k, p))
        end
    end    
    # include 0 case 
    if (idx + 1 <= length(max_values))
        for p in integer_partitions(n, max_values, idx + 1)
            push!(partitions, vcat(0, p))
        end
    end
    return partitions
end

Caching can be added to improve the performance.