No CrossRef data available.
Published online by Cambridge University Press: 27 July 2009
We analyze a standard algorithm for sampling m items without replacement from a computer file of n records. The algorithm repeatedly selects a record at random from the file, rejecting records that have previously been selected, until m records are obtained. The running time of the algorithm has two components: a rejection component and a search component. We show that the probability distribution of the rejection component undergoes an infinite series of phase transitions, depending on the order of magnitude of m relative to n. We identify an infinite number of ranges of m, each with a different behavior. The rejection component is distributed as a linear combination of Poisson-like random variables. The search component is customarily done using a hash table with separate chaining. The analysis of the hashing scheme in this problem differs from previous hashing analyses, as the number of lookups in the hash table for each insertion has a geometric distribution. We show that the average overall cost of searching is asymptotically linear with only two phase transitions in the coefficient of linearity.