Validating Clusters with the Lower Bound for Sum-of-Squares Error

Douglas Steinley

doi:10.1007/s11336-003-1272-1

Validating Clusters with the Lower Bound for Sum-of-Squares Error

Published online by Cambridge University Press: 01 January 2025

Douglas Steinley

Show author details

Douglas Steinley*: Affiliation:
University of Missouri-Columbia
*: Requests for reprints should be sent to Douglas Steinley, Department of Psychological Sciences, University of Missouri-Columbia, 210 McAlester Hall, Columbia, MO 65211, USA. E-mail: [email protected].

Article contents

Abstract
Footnotes
References

Get access

Rights & Permissions

Abstract

Given that a minor condition holds (e.g., the number of variables is greater than the number of clusters), a nontrivial lower bound for the sum-of-squares error criterion in K-means clustering is derived. By calculating the lower bound for several different situations, a method is developed to determine the adequacy of cluster solution based on the observed sum-of-squares error as compared to the minimum sum-of-squares error.

Keywords

K-means cluster analysis

Type: Original Paper
Information: Psychometrika , Volume 72 , Issue 1 , March 2007 , pp. 93 - 106

DOI: https://doi.org/10.1007/s11336-003-1272-1 [Opens in a new window]
Copyright: Copyright © 2006 The Psychometric Society

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Footnotes

The author was partially supported by the Office of Naval Research Grant #N00014-06-0106.

References

Anderberg, M.R. (1973). Cluster analysis for applications, New York: Academic Press.Google Scholar

Brusco, M.J., Cradit, J.D. (2001). A variable-selection heuristic for K-means clustering. Psychometrika, 66, 249–270.CrossRef Google Scholar

Cattell, R.B., Coulter, M.A. (1966). Principles of behavioral taxonomy and the mathematical basis of the taxonome computer program. British Journal of Mathematical and Statistical Psychology, 19, 237–269.CrossRef Google Scholar PubMed

Cormack, R.M. (1971). A review of classification (with discussion). Journal of the Royal Statistical Society, Series A, 134, 321–367.CrossRef Google Scholar

Cox, D.R. (1957). Note on grouping. Journal of the American Statistical Association, 52, 543–547.CrossRef Google Scholar

Duda, R.O., Hart, P.E., Stork, D.G. (2001). Pattern recognition (2nd ed.),, New York: Wiley.Google Scholar

Engelman, L., Hartigan, J.A. (1969). Percentage points of a test of clusters. Journal of the American Statistical Association, 64, 1647–1648.CrossRef Google Scholar

Fan, K. (1949). On a theorem of Weyl concerning eigenvalues of linear transformations. Proceedings of the National Academy of Sciences of the United States of America, 35, 652–655.CrossRef Google Scholar

Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annual Eugenics, 7, 179–188.CrossRef Google Scholar

Fisher, W.D. (1958). On grouping for maximum homogeneity. Journal of the American Statistical Association, 53, 789–798.CrossRef Google Scholar

Gentle, J.E. (2002). Elements of computational statistics, New York: Springer-Verlag.Google Scholar

Gersho, A., Gray, R.M. (1992). Vector quantization and signal compression, Boston, MA: Kluwer Academic.CrossRef Google Scholar

Golub, G.H., Van Loan, C.F. (1996). Matrix computations (3rd ed.),, Baltimore: The Johns Hopkins University Press.Google Scholar

Gordon, A.D. (1999). Classification (2nd ed.),, New York: Chapman & Hall/CRC.CrossRef Google Scholar

Gordon, A.D., Henderson, J.J. (1977). An algorithm for Euclidean sum of squares classification. Biometrics, 33, 355–362.CrossRef Google Scholar

Hartigan, J.A. (1975). Clustering algorithms, New York: Wiley.Google Scholar

Hubert, L., Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.CrossRef Google Scholar

Hubert, L.J., Arabie, P., Meulman, J. (2001). Combinatorial data analysis: Optimization by dynamic programming, Philadelphia: SIAM.CrossRef Google Scholar

Johnson, R.A., Wichern, D.W. (2002). Applied multivariate statistical analysis (5th ed.),, Upper Saddle River, NJ: Prentice Hall.Google Scholar

Kaufman, L., Rousseeuw, P.J. (1987). Clustering by means of medoids. In Dodge, Y. (Eds.), Statistical data analysis based on the L ₁-norm and related methods (pp. 405–416). Amsterdam: Elsevier Science.Google Scholar

Kaufman, L., Rousseeuw, P.J. (1990). Finding groups in data: An introduction to cluster analysis, New York: Wiley.CrossRef Google Scholar

Lattin, J., Carroll, J.D., Green, P.E. (2003). Analyzing multivariate data, Pacific Grove, CA: Brooks/Cole.Google Scholar

MacQueen, J. (1967). Some methods of classification and analysis of multivariate observations. In L.M. Le Cam, & J. Neyman (Eds.), Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, pp.~281–297). Berkeley, CA: University of California Press.Google Scholar

Milligan, G.W. (1985). An algorithm for generating artificial test clusters. Psychometrika, 50, 123–127.CrossRef Google Scholar

Milligan, G.W., Cooper, M.C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.CrossRef Google Scholar

Milligan, G.W., Cooper, M.C. (1988). A study of standardization of variables in cluster analysis. Journal of Classification, 5, 181–204.CrossRef Google Scholar

Sebestyen, G.S. (1962). Decision making process in pattern recognition, New York: Macmillan.Google Scholar

Späth, H. (1980). Cluster analysis algorithms for data reduction and classification of objects, New York: Wiley.Google Scholar

Steinley, D. (2003). K-means clustering: What you don't know may hurt you. Psychological Methods, 8, 294–304.CrossRef Google Scholar PubMed

Steinley, D. (2004a). Standardizing variables in K-means clustering. In Banks, D., House, L., McMorris, F.R., Arabie, P., Gaul, W. (Eds.), Classification, clustering, and data mining applications (pp. 53–60). New York: Springer-Verlag.CrossRef Google Scholar

Steinley, D. (2004b). Properties of the Hubert–Arabie adjusted Rand index. Psychological Methods, 9, 386–396.CrossRef Google Scholar PubMed

Steinley, D. (in press). K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology.Google Scholar

Steinley, D., Henson, R. (2005). OCLUS: An analytic method for generating clusters with known overlap. Journal of Classification, 22, 221–250.CrossRef Google Scholar

Thorndike, R.L. (1953). Who belongs in the family?. Psychometrika, 18, 267–276.CrossRef Google Scholar

Tibshirani, R., Walther, G., Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, Series B, 63, 411–423.CrossRef Google Scholar

Timm, N.H. (2002). Applied multivariate analysis, New York: Springer-Verlag.Google Scholar

van Os, B.J. (2000). Dynamic programming for partitioning in multivariate data analysis, Leiden: University Press.Google Scholar

Waller, N.G., Underhill, J.M., Kaiser, H.A. (1999). A method for generating simulated plasmodes and artificial test clusters with user-defined shape, size, and orientation. Multivariate Behavioral Research, 34, 123–142.CrossRef Google Scholar PubMed

Weisstein, E.W. (2003). CRC concise encyclopedia of mathematics, Boca Raton, FL: Champman & Hall.Google Scholar

Zha, H., Ding, C., Gu, M., He, X., Simon, H.D. (2001). Spectral relaxation for K-means clustering. Nueral Information Processsing Systems, 14, 1057–1064.Google Scholar

Article contents

Validating Clusters with the Lower Bound for Sum-of-Squares Error

Abstract

Keywords

Access options

Footnotes

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests