1. Introduction
Consider the coupon collector problem where each box of a brand of cereal contains a coupon and there are n different types of coupons. Suppose that the probability of a box containing a coupon of a specific type is ${1}/{n}$, and that we keep buying boxes until we collect at least one coupon of each type. It is well known (see, for example, [Reference Newman and Shepp9]) that the expected number of boxes we need to buy is $nH_n$, where $H_n$ is the nth harmonic number:
\begin{align*} H_n=\sum_{i=1}^{n}\frac{1}{i}=\log n+\gamma+o(1), \end{align*}
where $\gamma$ is the Euler–Mascheroni constant. The expected value of the total number of boxes collected has been extensively studied. For instance, it was proved in [Reference Newman and Shepp9] that, if you continue to collect boxes until you have at least m coupons of each type, then the expected number of boxes collected is $n\log n+(m-1)n\log\log n+n\cdot C_m+o(n)$, where $C_m$ is a constant depending on m.
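The identity $\mathbb{E}[\text{boxes}]=nH_n$ is easy to check empirically. The following sketch (our own illustration, not part of the paper; the function name `boxes_until_complete` and the choices of n and trial count are arbitrary) estimates the expectation by Monte Carlo simulation:

```python
import math
import random

def boxes_until_complete(n, m=1, rng=random):
    """Draw coupons uniformly at random from n types until every type
    has been seen at least m times; return the number of boxes bought."""
    counts = [0] * n
    need = n  # types still seen fewer than m times
    boxes = 0
    while need > 0:
        c = rng.randrange(n)
        counts[c] += 1
        if counts[c] == m:
            need -= 1
        boxes += 1
    return boxes

n, trials = 50, 4000
rng = random.Random(0)
avg = sum(boxes_until_complete(n, rng=rng) for _ in range(trials)) / trials
H_n = sum(1 / i for i in range(1, n + 1))
# avg should be close to n * H_n (about 224.96 for n = 50)
```

With 4000 trials the sample mean is typically within a few boxes of $nH_n$, since the standard deviation of the collection time is of order $n$.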
Definition 1.1. The Gumbel distribution with parameters $\mu$ and $\beta$, which is denoted as $\mathrm{Gumbel}(\mu,\beta)$, is defined as the probability distribution with cumulative distribution function $\exp\![-\mathrm{e}^{-(x-\mu)/\beta}]$ for $-\infty<x<\infty$.
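To make Definition 1.1 concrete, here is a small sketch (our own code, not from the paper) of the Gumbel CDF together with its inverse; the inverse also yields the standard inverse-transform sampler $\mu-\beta\log(-\log U)$ for $U$ uniform on $(0,1)$:

```python
import math
import random

def gumbel_cdf(x, mu=0.0, beta=1.0):
    """Cumulative distribution function of Gumbel(mu, beta)."""
    return math.exp(-math.exp(-(x - mu) / beta))

def gumbel_quantile(p, mu=0.0, beta=1.0):
    """Inverse CDF; applying it to a Uniform(0,1) draw gives a Gumbel sample."""
    return mu - beta * math.log(-math.log(p))

# The parameters appearing later, mu = -log((m-1)!), beta = 1, e.g. for m = 3:
mu, beta = -math.log(math.factorial(2)), 1.0
rng = random.Random(1)
sample = gumbel_quantile(rng.random(), mu, beta)  # one Gumbel(mu, 1) draw
```

The quantile and CDF are exact inverses, so `gumbel_cdf(gumbel_quantile(p))` recovers `p` up to floating-point error.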
The above result was improved in [Reference Erdös and Rényi5], which proved that, if $v_m(n)$ is the total number of boxes collected, then $({v_m(n)}/{n})-\log n-(m-1)\log\log n\sim\mathrm{Gumbel}(-\log((m-1)!),1)$ as $n\rightarrow\infty$. The probability distribution of the number of boxes that need to be collected to collect $a_n+1$ distinct coupons, where $0\leq a_n<n$, was studied in [Reference Baum and Billingsley1]. The expected number of boxes needing to be collected if the types of coupons have different probabilities of being in a box was examined in [Reference Berenbrink and Sauerwald2], and [Reference Doumas and Papanicolaou3, Reference Neal8] looked at the actual probability distribution of the number of boxes collected if the types of coupons have different probabilities. The case of collecting at least m coupons of each type has also been examined [Reference Doumas and Papanicolaou4]. Further variations of the problem, where there are multiple collectors, were studied in [Reference Foata and Han6, Reference Myers and Wilf7].
Here we examine the situation where we continue to collect boxes until we have at least m copies of each type of coupon for any fixed $m\in\mathbb{N}$ .
Definition 1.2. For all $k\geq m$, call a type of coupon a k-ton if we have seen it exactly k times by the time we have seen m copies of all of the coupons. Let the number of k-tons be denoted by $S_k$.
In the case of $m=1$, [Reference Myers and Wilf7] determined the expected number of 1-tons, which they called singletons, to be $H_n$. Singletons were also studied in [Reference Penrose10], but in the case of unequal coupon probabilities and where the total number of coupons collected is proportional to the number of types of coupons; many applications of their study, such as to databases, biological particles, and communication channels, were noted there. We extend this singleton result by determining the asymptotic joint probability distribution of the number of k-tons, for values of k in a restricted range, and the total number of coupons collected, assuming, as in [Reference Myers and Wilf7], equal coupon probabilities. We also determine the asymptotic distribution of the number of k-tons after we have collected m copies of each coupon for any such values of k, given any fixed $m\in\mathbb{N}$. While making an analogous extension to [Reference Penrose10] is certainly interesting, we leave it for a future project.
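The $\mathbb{E}(S_1)=H_n$ result of [Reference Myers and Wilf7] can also be checked by simulation. The sketch below (our own code, with an arbitrary choice of n and trial count) estimates the expected number of singletons for $m=1$:

```python
import random

def singleton_count(n, rng):
    """Collect coupons until each of the n types appears at least once;
    return the number of types seen exactly once (the 1-tons)."""
    counts = [0] * n
    missing = n
    while missing > 0:
        c = rng.randrange(n)
        if counts[c] == 0:
            missing -= 1
        counts[c] += 1
    return sum(1 for c in counts if c == 1)

n, trials = 40, 3000
rng = random.Random(2)
avg = sum(singleton_count(n, rng) for _ in range(trials)) / trials
H_n = sum(1 / i for i in range(1, n + 1))
# Myers-Wilf: E[S_1] = H_n, about 4.28 for n = 40
```

Note that $S_1\geq 1$ always, since the last new type collected is necessarily a singleton at the moment the collection completes.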
Theorem 1.1. Fix $m\in\mathbb{N}$, let $T_m(n)$ be the total number of coupons collected to see at least m copies of each coupon, and let $n\rightarrow\infty$. Let $k=o(\!\log n)$ if $m=1$ and $k=o(({\log n})/({\log\log n}))$ if $m\geq 2$, with $k\geq m$ in either case. Then the joint probability distribution of
\begin{align*} \bigg(\frac{T_m(n)-n\log n-(m-1)n\log\log n}{n},\frac{k!\,S_k}{(\!\log n)^{k-m+1}}\bigg) \end{align*}
converges to $(X,\mathrm{e}^{-X})$, where $X\sim\mathrm{Gumbel}(-\log((m-1)!),1)$.
A simple heuristic argument for Theorem 1.1 may be given as follows. First, from [Reference Erdös and Rényi5], we know that
\begin{align*} \frac{T_m(n)-n\log n-(m-1)n\log\log n}{n}\sim\mathrm{Gumbel}(-\log((m-1)!),1). \end{align*}
Later on, we also see, unsurprisingly, that the number of times a specific type of coupon is seen is essentially independent of whether or not all types of coupons have been collected at least m times. Similarly, the numbers of times two different types of coupons are seen turn out to be asymptotically independent. Since the probability of a type of coupon being seen k times is
\begin{align*} \mathrm{e}^{-T_m(n)/n}\,\frac{(T_m(n)/n)^{k}}{k!}\sim\frac{(\!\log n)^{k-m+1}}{n\,\mathrm{e}^{x}k!}, \end{align*}
where $x=({T_m(n)-n\log n-(m-1)n\log\log n})/{n}$, the expected number of k-tons is $\sim{(\!\log n)^{k-m+1}}/{\mathrm{e}^xk!}$, so that ${k!S_k}/{(\!\log n)^{k-m+1}}\sim \mathrm{e}^{-x}$.
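The heuristic above treats the number of times a fixed type is seen among $T_m(n)$ draws, a Binomial$(T_m(n),1/n)$ variable, as approximately Poisson$(T_m(n)/n)$. A quick numerical sanity check (our own code, with arbitrarily chosen values of n, m, x, and k) shows how close the two probability mass functions are:

```python
import math

def binom_pmf(N, k, p):
    """P(Bin(N, p) = k), computed exactly."""
    return math.comb(N, k) * p**k * (1 - p) ** (N - k)

def pois_pmf(lam, k):
    """P(Pois(lam) = k)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Illustrative values: n types, N = n log n + (m-1) n log log n + x n draws.
n, m, x = 10_000, 2, 0.5
N = round(n * math.log(n) + (m - 1) * n * math.log(math.log(n)) + x * n)
k = 5
b = binom_pmf(N, k, 1 / n)   # exact probability a fixed type is seen k times
p = pois_pmf(N / n, k)       # Poisson approximation used in the heuristic
# b and p agree closely (relative error well under 1% here)
```

The approximation error is of order $1/n$ for fixed k, which is why the heuristic is reliable in the ranges of k considered here.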
While Theorem 1.1 provides us with the joint distribution between the total number of coupons collected and the number of k-tons for small values of k, we can also ask what happens for the largest values of k. The answer is provided in the next theorem.
Theorem 1.2. Fix $m\in\mathbb{N}$ and let $T_m(n)$ be the total number of coupons collected to see at least m copies of each coupon.
(i) Fix $d\in\mathbb{R}$. Pick an increasing sequence $(n_j)_j\subset\mathbb{N}$ and a sequence $(d_j)_j\subset\mathbb{R}$ such that $\lim_{j\rightarrow\infty}d_j=d$ and $k_j=\mathrm{e}\log n_j+\big((\mathrm{e}-1)(m-1)-\frac{1}{2}\big)\log\log n_j+d_j\in\mathbb{N}$ for all $j\in\mathbb{N}$. Then the joint probability distribution of
\begin{align*} \bigg(\frac{T_m(n_j)-n_j\log n_j-(m-1)n_j\log\log n_j}{n_j},S_{k_j}\bigg) \end{align*}
converges to
\begin{align*} \bigg(X,\mathrm{Pois}\bigg(\frac{\mathrm{e}^{(\mathrm{e}-1)X-d}}{\sqrt{2\pi \mathrm{e}}}\bigg)\bigg), \end{align*}
where $X\sim\mathrm{Gumbel}(-\log((m-1)!),1)$.
(ii) Let $g(n)=o(\!\log\log n)$ with $\lim_{n\rightarrow\infty}g(n)=\infty$, such that $k=\mathrm{e}\log n+\big((\mathrm{e}-1)(m-1)-\frac{1}{2}\big)\log\log n+g(n)\in\mathbb{N}$ for all $n\in\mathbb{N}$. Then $S_k$ converges to 0 with probability 1.
(iii) Let $g(n)=o(\!\log\log n)$ with $\lim_{n\rightarrow\infty}g(n)=\infty$, such that $k=\mathrm{e}\log n+\big((\mathrm{e}-1)(m-1)-\frac{1}{2}\big)\log\log n-g(n)\in\mathbb{N}$ for all $n\in\mathbb{N}$. Then $\mathbb{P}(S_k=0)$ converges to 0.
A simple heuristic argument for Theorem 1.2 may be given as follows. The probability of a type of coupon being seen $k_j$ times is
Assuming that $x=({T_m(n_j)-n_j\log n_j-(m-1)n_j\log\log n_j})/{n_j}$ is constant as $j\rightarrow\infty$ and $k_j$ is as defined in Theorem 1.2, we have that the above is asymptotic to
Therefore, by similar reasoning to the heuristic for Theorem 1.1, we can see that the probability that there are y k-tons is asymptotic to
\begin{align*} \frac{\lambda^{y}\mathrm{e}^{-\lambda}}{y!}, \end{align*}
where $\lambda = {\mathrm{e}^{(\mathrm{e}-1)x-d}}/({\sqrt{2\pi \mathrm{e}}})$, which gives the Poisson distribution in Theorem 1.2(i). Parts (ii) and (iii) can be heuristically argued by letting $d_j$ approach $\infty$ or $-\infty$.
From Theorem 1.1 we can derive the following corollaries.
Corollary 1.1. Fix $m\in\mathbb{N}$, let $n\rightarrow\infty$, and let $k=o(\!\log n)$ if $m=1$ and $k=o(({\log n})/({\log\log n}))$ if $m\geq 2$, with $k\geq m$ in either case. Suppose we keep collecting coupons until we see at least m copies of each coupon. Then
\begin{align*} \frac{k!\,S_k}{(\!\log n)^{k-m+1}} \end{align*}
converges in distribution to $\mathrm{e}^{-X}$, where $X\sim\mathrm{Gumbel}(-\log((m-1)!),1)$.
Corollary 1.1 gives the asymptotic probability distribution of the number of k-tons. If we fix k, we can also determine the joint asymptotic distribution of the numbers of m-tons, $(m+1)$-tons, …, and k-tons. We can see that asymptotically all of these variables are linearly dependent.
Corollary 1.2. Fix $k,m\in\mathbb{N}$ with $k\geq m$, and let $n\rightarrow\infty$. Suppose we keep collecting coupons until we see at least m copies of each coupon. Then the joint probability distribution
\begin{align*} \bigg(\frac{m!\,S_m}{\log n},\frac{(m+1)!\,S_{m+1}}{(\!\log n)^{2}},\ldots,\frac{k!\,S_k}{(\!\log n)^{k-m+1}}\bigg) \end{align*}
converges to $X\cdot(1,1,\ldots,1)$, where $X=\mathrm{e}^{-Y}$ and $Y\sim\mathrm{Gumbel}(-\log((m-1)!),1)$.
2. Proofs
To prove Theorem 1.1 we require the following notation and lemma.
Note 2.1. Suppose we collect x coupons in total, regardless of whether we have collected all n types of coupons or not. Let $S_{i,x}$ denote the number of types of coupons we collected exactly i times.
Note 2.2. Throughout this paper, let $N(n,f(n))\;:\!=\;n\log n+(m-1)n\log\log n+f(n)$.
Lemma 2.1. Let $f\colon\mathbb{N}\rightarrow\mathbb{R}$ be such that $-\frac{1}{2}n\log\log n \leq f(n) \leq \frac{1}{2}n\log\log n$, and $N(n,f(n))\in\mathbb{N}$ for all $n\in\mathbb{N}$. Suppose we collect N(n,f(n)) coupons in total, regardless of whether we collect m copies of all n different kinds of coupons or not. Let $k=o(\!\log n)$ if $m=1$ and $k=o({\log n}/{\log\log n})$ if $m\geq 2$. If $m\leq k$, then as $n\rightarrow\infty$
is asymptotic to $\mathrm{e}^{-{f(n)}/{n}}$ with probability 1, and
Proof. First, assume that $m\leq k$. We can see that, as $n\rightarrow\infty$,
Thus, we can deduce that
as $n\rightarrow\infty$. Also,
Notice that
If k is bounded as $n\rightarrow\infty$, then the right-hand side of (2.2) is $O({1}/{\sqrt{\log n}})$. On the other hand, for sufficiently large k we have $k!<k^{k-m}$, so even if k tends toward $\infty$, the right-hand side of (2.2) is still $O({1}/{\sqrt{\log n}})$. Thus, from (2.1), we obtain
as $n\rightarrow\infty$. Thus,
as $n\rightarrow\infty$, so that we obtain our result for $m\leq k$.
Proof of Theorem 1.1. It was proved in [Reference Erdös and Rényi5] that $X=({T_m(n)-n\log n-(m-1)n\log\log n})/{n}$ is asymptotically Gumbel distributed. To prove Theorem 1.1, it therefore suffices to show that
for any fixed $x<y$, and that
for any fixed $y<x$. To prove (2.3), we first calculate an upper bound for
for all sufficiently large $n\in\mathbb{N}$. Let $M(n)=n\log n+\big(m-\frac{3}{2}\big)n\log\log n$. We have
We have
Clearly,
Combining (2.5) and (2.6), we therefore have
From Chebyshev’s inequality and Lemma 2.1 we have, for sufficiently large n,
for some constant $C>0$, depending only on x and y. From (2.7), we can thus deduce that
for all sufficiently large n.
Now we let $n\rightarrow\infty$. From the result quoted at the start of the proof, we can deduce that $\lim_{n\rightarrow\infty}\mathbb{P}(T_m(n)< M(n))=0$. From (2.8) we can also see that
These two limits give us (2.3).
We can prove (2.4) similarly, replacing $M(n)=n\log n+\big(m-\frac{3}{2}\big)n\log\log n$ with $M'(n)=n\log n+\big(m-\frac{1}{2}\big)n\log\log n$ and reversing the appropriate inequalities.
Proof of Corollaries 1.1 and 1.2. Corollaries 1.1 and 1.2 follow from Theorem 1.1 by showing that, if $X\sim\mathrm{Gumbel}(-\log((m-1)!),1)$, then
Indeed, we have, for any $r>0$,
Hence, the result follows.
To prove Theorem 1.2, we require the following lemma.
Lemma 2.2. Fix $m\in\mathbb{N}$. Let $f\colon\mathbb{N}\rightarrow\mathbb{R}$ be such that there exists $x\in\mathbb{R}$ with $\lim_{n\rightarrow\infty}f(n)/n=x$ and $N(n,f(n))\in\mathbb{N}$ for all $n\in\mathbb{N}$. Suppose we collect N(n, f(n)) coupons in total, regardless of whether we collect m copies of all n different kinds of coupons or not. Pick an increasing sequence $(n_j)_j\subset\mathbb{N}$ and a sequence $(d_j)_j\subset\mathbb{R}$ such that $\lim_{j\rightarrow\infty}d_j=d$ for some $d\in\mathbb{R}$ and $k=\mathrm{e}\log n_j+\big((\mathrm{e}-1)(m-1)-\frac{1}{2}\big)\log\log n_j+d_j\in\mathbb{N}$ for all $j\in\mathbb{N}$. Then, as $j\rightarrow\infty$,
where the implied constant in the error term only depends upon $r_1$ and $r_2$ .
Proof. Take some $j\in\mathbb{N}$. Let the probability that $r_1$ prescribed types of coupons are $(m-1)$-tons and $r_2$ prescribed types of coupons are k-tons be denoted by $W_{r_1,r_2}(n)$. Then, as $j\rightarrow\infty$,
where the implied constant in the error term only depends on $r_1$ and $r_2$ . By the inclusion–exclusion formula, we have
For $t\leq n_j-r_1-r_2$ even:
For $t\leq n_j-r_1-r_2$ odd:
Combining (2.9)–(2.11), we have
as $j\rightarrow\infty$.
Proof of Theorem 1.2. We first prove Theorem 1.2(i). Let $d\in\mathbb{R}$, and pick an increasing sequence $(n_j)_j$ and a sequence $(d_j)_j$ such that $\lim_{j\rightarrow\infty}d_j=d$ and $k_j \;:\!=\; \mathrm{e}\log n_j + \big((\mathrm{e}-1)(m-1)-\frac{1}{2}\big)\log\log n_j + d_j\in\mathbb{N}$ for all $j\in\mathbb{N}$. Let $x<y$ and $r\in\mathbb{N}$ be constants. Consider the probability $\mathbb{P}(N(n_j,n_jx)\leq T_m(n_j)\leq N(n_j,n_jy),S_{k_j}=r)$. We have
Clearly,
Also, for $M\geq N(n_j,n_jx)$, if we have $S_{i,M}>0$ for some $0\leq i\leq m-2$, then we have $S_{l,\lceil N(n_j,n_jx)\rceil}>0$ for some $0\leq l\leq i$. Thus, we can also deduce that
For any $0\leq l\leq m-2$, notice that, as $j\rightarrow\infty$,
Since $l\leq m-2$, we therefore have $\lim_{j\rightarrow\infty}\mathbb{E}(S_{l,\lceil N(n_j,n_jx)\rceil})=0$, and so
For every $N(n_j,n_jx) \leq M \leq N(n_j,n_jy)$, let $c_{M,j} \;:\!=\; ({M-n_j\log n_j-(m-1)n_j\log\log n_j})/{n_j}$. By Lemma 2.2, we have
Note that
Combining (2.12)–(2.16), we obtain
By similar reasoning, we can also obtain
Since the choices of $x<y$ were arbitrary, we can see that the joint limiting distribution holds.
For Theorem 1.2(ii), again let $x<y$ be constants. We will show that
which implies the desired result since $x<y$ are arbitrary. We have
As $n\rightarrow\infty$, we have
We have
giving us our result.
For Theorem 1.2(iii), again let $x<y$ be constants. We will show that
which implies the desired result since $x<y$ are arbitrary. We have
Note that
We have
Since $k<n$ for sufficiently large n, we have that there exists $C_1>0$ such that, for sufficiently large n,
Similarly to the derivation of (2.17), we can deduce that there exists $C_2>0$, depending only on x and y, such that, for sufficiently large n, $\mathbb{E}(S_{k,M})<C_2\mathrm{e}^{g(n)}<C_2\log n$. Therefore, for sufficiently large n, we have $\mathrm{Var}(S_{k,M}) \leq \mathbb{E}(S_{k,M}) + 1 < 2\mathbb{E}(S_{k,M})$. Thus, for sufficiently large n, $\sigma(S_{k,M}) < \sqrt{2\mathbb{E}(S_{k,M})}$. By Chebyshev’s inequality, we therefore have
for sufficiently large n. Also, similarly to the derivation of (2.17), we have that there exists $C_3>0$ depending only on x and y such that $C_3\mathrm{e}^{g(n)}<\mathbb{E}(S_{k,M})$. Thus, from (2.18) and (2.19), we have
for sufficiently large n. Since $\lim_{n\rightarrow\infty}g(n)=\infty$, we have our result.
3. Future work
There are a few open questions worth exploring with respect to the coupon collector’s problem and the number of k-tons. For instance, can we extend our ranges for k? We have results for $k=o(\!\log n)$ if $m=1$ and $k=o({\log n}/{\log\log n})$ if $m\geq 2$, as well as for $k=\mathrm{e}\log n+\big((\mathrm{e}-1)(m-1)-\frac{1}{2}\big)\log\log n+d$, but we can also ask similar questions for values of k between these minimum and maximum values. For instance, we can consider $k=\Theta(\!\log n)$ or $k=\Theta({\log n}/{\log\log n})$. We can also ask what happens if m increases with n. Also, [Reference Doumas and Papanicolaou3] studied the limiting distribution of $T_m(n)$ in the case of unequal coupon probabilities; we could likewise study the limiting distribution of the number of k-tons in that setting.
Acknowledgement
The author would like to thank Dr. Daniel Berend for helpful comments on the paper.
Funding information
The research of J. C. Saunders is supported by an Azrieli International Postdoctoral Fellowship, as well as a Postdoctoral Fellowship at the University of Calgary.
Competing interests
There were no competing interests to declare during the preparation or publication of this article.