
A PROBABILISTIC ℓ1 METHOD FOR CLUSTERING HIGH-DIMENSIONAL DATA

Published online by Cambridge University Press:  05 April 2021

Tsvetan Asamov
Affiliation:
Department of Management Science & Information Systems, Rutgers Business School, 100 Rockafeller Road, Piscataway, NJ 08854, USA
Adi Ben-Israel
Affiliation:
Department of Management Science & Information Systems, Rutgers Business School, 100 Rockafeller Road, Piscataway, NJ 08854, USA

Abstract

In general, the clustering problem is NP-hard, and global optimality cannot be established for non-trivial instances. For high-dimensional data, distance-based methods for clustering or classification face an additional difficulty: the unreliability of distances in very high-dimensional spaces. We propose a probabilistic, distance-based, iterative method for clustering data in very high-dimensional space, using the ℓ1-metric, which is less sensitive to high dimensionality than the Euclidean distance. For K clusters in ℝ^n, the problem decomposes into K problems coupled by probabilities, and an iteration reduces to finding Kn weighted medians of points on a line. The complexity of the algorithm is linear in the dimension of the data space, and its performance was observed to improve significantly as the dimension increases.
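
The reduction to weighted medians mentioned in the abstract can be illustrated with a minimal Python sketch. It rests on a standard fact: a point c minimizing the weighted sum of ℓ1 distances to a set of points can be found one coordinate at a time, and each coordinate is a weighted median of points on a line. The function names weighted_median and l1_center are hypothetical. The weights are taken as given inputs, since the abstract does not say how the method derives them from the cluster probabilities, so this is not the authors' full iteration.

```python
from typing import List, Sequence


def weighted_median(points: Sequence[float], weights: Sequence[float]) -> float:
    """Return a minimizer c of sum_i weights[i] * |points[i] - c|.

    This is the lower weighted median: the smallest point whose
    cumulative weight reaches half of the total weight.
    """
    if len(points) != len(weights) or len(points) == 0:
        raise ValueError("points and weights must be non-empty and equal in length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")

    pairs = sorted(zip(points, weights))  # sort by position on the line
    total = sum(w for _, w in pairs)
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    running = 0.0
    for x, w in pairs:
        running += w
        if running >= total / 2:
            return x
    return pairs[-1][0]  # not reached; kept as a safeguard


def l1_center(data: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted l1 center of the rows of data, one coordinate at a time.

    For a single cluster in n dimensions this takes n weighted medians;
    repeating it for K clusters gives the K*n medians per iteration.
    """
    n = len(data[0])
    return [weighted_median([row[j] for row in data], weights) for j in range(n)]


if __name__ == "__main__":
    sample = [[0.0, 5.0], [1.0, 7.0], [10.0, 6.0]]
    print(l1_center(sample, [1.0, 1.0, 1.0]))
```

Sorting makes each median cost O(N log N) for N points. A selection-based weighted median would bring this down to O(N), but the sorted version is easier to check by hand.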

Type
Research Article
Copyright
Copyright © The Author(s), 2021. Published by Cambridge University Press

