
Selection of Variables in Cluster Analysis: An Empirical Comparison of Eight Procedures

Published online by Cambridge University Press:  01 January 2025

Douglas Steinley* (University of Missouri-Columbia)
Michael J. Brusco (Florida State University)

*Requests for reprints should be sent to Douglas Steinley, Department of Psychological Sciences, University of Missouri-Columbia, 210 McAlester Hall, Columbia, MO 65211, USA. E-mail: [email protected]

Abstract

Eight variable selection techniques for model-based and non-model-based clustering are evaluated across a wide range of cluster structures. Several of the methods are shown to have difficulties when non-informative variables (i.e., random noise) are included in the model. Furthermore, the distribution of the random noise greatly affects the performance of nearly all of the variable selection procedures. Overall, a variable selection technique based on a variance-to-range weighting procedure, coupled with selection of the variables yielding the largest decreases in the within-cluster sum-of-squares error, performed best. By contrast, variable selection methods used in conjunction with finite mixture models performed worst.
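To make the two ingredients of the best-performing approach concrete, the sketch below (Python with NumPy and scikit-learn) screens each variable by (a) a variance-to-range weight and (b) the proportional decrease in within-cluster sum-of-squares error from a univariate K-means fit, keeping the variables that score highest on the product of the two. This is a simplified toy illustration under stated assumptions, not the procedure evaluated in the paper; the specific functional form of the weight (variance over squared range), the product-score rule, and all function names are assumptions made here for illustration.

```python
# Toy sketch only: combines a variance-to-range weight with the proportional
# decrease in within-cluster SSE to screen out noise variables. Illustrative
# assumption, not the procedure evaluated in the paper.
import numpy as np
from sklearn.cluster import KMeans


def variance_to_range_weight(x):
    """Variance divided by squared range (assumed form of the weighting);
    larger for variables whose mass piles up near the ends of their range."""
    span = x.max() - x.min()
    return x.var() / span ** 2 if span > 0 else 0.0


def sse_decrease(x, n_clusters, seed=0):
    """Proportional drop in SSE from a univariate K-means fit:
    1 - (within-cluster SSE / total SSE)."""
    xc = x.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(xc)
    total_ss = ((xc - xc.mean()) ** 2).sum()
    return 1.0 - km.inertia_ / total_ss


def screen_variables(X, n_clusters, n_keep):
    """Keep the n_keep variables with the largest combined score
    (weight times SSE decrease); the product rule is an assumption."""
    scores = np.array([
        variance_to_range_weight(X[:, j]) * sse_decrease(X[:, j], n_clusters)
        for j in range(X.shape[1])
    ])
    return np.argsort(scores)[::-1][:n_keep], scores


# Example: two informative variables (two well-separated groups) plus three
# standard-normal noise variables; the informative columns should be kept.
rng = np.random.default_rng(1)
informative = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
X = np.hstack([informative, rng.normal(0, 1, (100, 3))])
kept, scores = screen_variables(X, n_clusters=2, n_keep=2)
print("kept variable indices:", kept)
```

In this toy setup the noise variables receive both a smaller variance-to-range weight and a smaller SSE decrease than the clustered variables, so they are screened out; the abstract's point is that how well such screening works depends heavily on the distribution of the noise.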

Type: Theory and Methods
Copyright: © 2007 The Psychometric Society

