Optimal Sampling for Identifying a Cluster: An Application in Sorting

Shmuel Gal; Dafna Sheinwald

doi:10.1017/S0269964800002990

Optimal Sampling for Identifying a Cluster: An Application in Sorting

Published online by Cambridge University Press: 27 July 2009

Shmuel Gal and

Dafna Sheinwald

Show author details

Shmuel Gal: Affiliation:
IBM Israel, Science and Technology LTD.Haifa, Israel
Dafna Sheinwald: Affiliation:
IBM Israel, Science and Technology LTD.Haifa, Israel

Article contents

Abstract
References

Get access

Rights & Permissions

Abstract

We consider the following problem. For a given population of m items, we have to make a decision whether or not the population includes a relatively large cluster of identical items. This decision affects the effectiveness of a subsequent computational process, depending on the actual existence of the cluster and its size. To make a good decision, we use a statistical sample which should indicate the existence of a cluster and find a representative thereof. This paper describes the optimal sampling technique to be used in such a case, given the cost of the sampling and the potential gain in speed of the subsequent process. The optimal fixed sample size is specified, as well as the optimal sequential sampling, along with characterizing the dependence of the cost function on the truncation point.

For the case that the a priori distribution of the cluster proportion is known, we present formulae by which the optimal sampling procedures can be easily calculated. For the common situation in which the a priori distribution is not known, we present, in the case of a fixed sample size, a tight upper bound for the sample size, which is independent of the a priori distribution, and for the case of the sequential sampling, we present an approximately optimal truncation point, which is also independent of the a priori distribution.

The situation described arose in connection with choosing the best sorting method, an application that will be described in full detail. The most interesting practical result is that for our application truncating the sequential procedure at 35 observations, out of a population of 25,000–30,000 items, guarantees that in our sorting application we are always within 2.1% of the optimal cost independently of the a priori distribution.

Type: Research Article
Information: Probability in the Engineering and Informational Sciences , Volume 7 , Issue 3 , July 1993 , pp. 387 - 407

DOI: https://doi.org/10.1017/S0269964800002990 [Opens in a new window]

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Article purchase

Temporarily unavailable

References

1.Berger, J.O. (1980). Statistical decision theory. New York: Springer-Verlag.CrossRef Google Scholar

2. IBM system/370 extended architecture principle of operation. Publication SA22-7085-1, IBM.Google Scholar

3.Knuth, D.E. (1973). The art of programming, sorting and searching. Addison-Wesley.Google Scholar

Article contents

Optimal Sampling for Identifying a Cluster: An Application in Sorting

Abstract

Access options

Article purchase

Temporarily unavailable

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests