A Cautionary Note on using Internal Cross Validation to Select the Number of Clusters

Abba M. Krieger; Paul E. Green

doi:10.1007/BF02294300

Published online by Cambridge University Press: 01 January 2025

Abba M. Krieger: Department of Statistics, University of Pennsylvania
Paul E. Green*: Department of Marketing, University of Pennsylvania
* Requests for reprints should be sent to Paul Green, Marketing Department, The Wharton School, University of Pennsylvania, 1400 Steinberg Hall-Dietrich Hall, Philadelphia PA 19104-6371.

Abstract

A highly popular method for examining the stability of a data clustering is to split the data into two parts, cluster the observations in Part A, assign the objects in Part B to their nearest centroid in Part A, and then independently cluster the Part B objects. One then examines how close the two partitions are (say, by the Rand measure). Another proposal is to split the data into k parts, and see how their centroids cluster. By means of synthetic data analyses, we demonstrate that these approaches fail to identify the appropriate number of clusters, particularly as sample size becomes large and the variables exhibit higher correlations.
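The two-part replication procedure described in the abstract can be sketched in code. The following is a minimal illustration only, not the authors' implementation: the abstract does not say which clustering algorithm was used, so a plain k-means is assumed here, and the function names (`kmeans`, `rand_index`, `replication_rand`) and the even random split are hypothetical choices made for the example. It shows the mechanics of the procedure that the paper cautions against, and does not reproduce the paper's synthetic data analyses.

```python
import random
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, rng, iters=50):
    """Plain k-means (an assumed stand-in clusterer); returns (centroids, labels)."""
    centroids = rng.sample(points, k)
    labels = [nearest(p, centroids) for p in points]
    for _ in range(iters):
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


def rand_index(labels_a, labels_b):
    """Share of object pairs that both partitions treat alike (together or apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def replication_rand(points, k, seed=0):
    """Split the data, cluster Part A, then compare two partitions of Part B."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    part_a, part_b = shuffled[:half], shuffled[half:]
    centroids_a, _ = kmeans(part_a, k, rng)
    # Partition 1 of Part B: nearest Part A centroid.
    assigned_b = [nearest(p, centroids_a) for p in part_b]
    # Partition 2 of Part B: independent clustering of Part B alone.
    _, clustered_b = kmeans(part_b, k, rng)
    return rand_index(assigned_b, clustered_b)
```

Under this approach one would call `replication_rand(points, k)` for each candidate `k` and favor the value with the highest agreement; the abstract's point is that this rule can fail to identify the appropriate number of clusters, particularly with large samples and more highly correlated variables.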

Keywords

cluster analysis; cross-validation; stopping rules

Type: Original Paper
Information: Psychometrika, Volume 64, Issue 3, September 1999, pp. 341–353

DOI: https://doi.org/10.1007/BF02294300
Copyright: Copyright © 1999 The Psychometric Society

Footnotes

The authors express their thanks to the Sol C. Snider Entrepreneurial Center, Wharton School, for support of this project.
