Data-driven penalty calibration: A case study for Gaussian mixture model selection

Cathy Maugis; Bertrand Michel

doi:10.1051/ps/2010002

Published online by Cambridge University Press: 05 January 2012

Cathy Maugis: Affiliation:
Institut de Mathématiques de Toulouse, INSA de Toulouse, Université de Toulouse, 135 avenue de Rangueil, 31077 Toulouse Cedex 4, France. [email protected]
Bertrand Michel: Affiliation:
Laboratoire de Statistique Théorique et Appliquée, Université Paris 6, 175 rue du Chevaleret, 75013 Paris, France. [email protected]

Abstract

In the companion paper [C. Maugis and B. Michel, A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: P&S 15 (2011) 41–68], a penalized likelihood criterion is proposed to select a Gaussian mixture model among a specific model collection. This criterion depends on unknown constants which have to be calibrated in practical situations. A “slope heuristics” method is described and experimented to deal with this practical problem. In a model-based clustering context, the specific form of the considered Gaussian mixtures allows us to detect the noisy variables in order to improve the data clustering and its interpretation. The behavior of our data-driven criterion is highlighted on simulated datasets, a curve clustering example and a genomics application.
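As a rough illustration of the calibration idea named in the abstract, the following Python sketch applies a data-driven slope estimate to a collection of fitted models. It is a minimal sketch under stated assumptions, not the authors' procedure: the penalty shape is assumed to be proportional to the model dimension, the function name `slope_heuristic_select` and the parameter `frac_largest` are hypothetical, and the unknown constant is estimated by a plain least-squares line through the most complex models. The final penalty is taken as twice the estimated minimal penalty, following the rule of thumb associated with Birgé and Massart's minimal penalties.

```python
import numpy as np


def slope_heuristic_select(dims, loglik, frac_largest=0.5):
    """Select a model with a slope-heuristics style calibration.

    dims         : model dimensions D_m (hypothetical input name)
    loglik       : maximized log-likelihood of each model
    frac_largest : fraction of the most complex models used to fit the
                   slope (an assumption of this sketch)
    Returns (index of the selected model, estimated constant kappa_hat).
    """
    dims = np.asarray(dims, dtype=float)
    loglik = np.asarray(loglik, dtype=float)

    # For complex models the bias is assumed negligible, so the maximized
    # log-likelihood is expected to grow roughly linearly in the dimension.
    order = np.argsort(dims)
    n_fit = max(2, int(np.ceil(frac_largest * dims.size)))
    largest = order[-n_fit:]

    # Slope of the linear part = estimate of the minimal-penalty constant.
    slope, _intercept = np.polyfit(dims[largest], loglik[largest], deg=1)
    kappa_hat = max(float(slope), 0.0)

    # Final penalty: twice the estimated minimal penalty.
    criterion = -loglik + 2.0 * kappa_hat * dims
    return int(np.argmin(criterion)), kappa_hat
```

In practice the choice of which models count as "complex enough" for the linear fit matters; a robust regression or a stability check over several values of `frac_largest` would be a natural refinement of this sketch.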

Keywords

Slope heuristics; Penalized likelihood criterion; Model-based clustering; Noisy variable detection

Type: Research Article
Information: ESAIM: Probability and Statistics, Volume 15: Supplement: In honor of Marc Yor, 2011, pp. 320–339

DOI: https://doi.org/10.1051/ps/2010002
Copyright: © EDP Sciences, SMAI, 2011

References

Abraham, C., Cornillon, P.A., Matzner-Løber, E. and Molinari, N., Unsupervised curve clustering using B-splines. Scand. J. Stat. Th. Appl. 30 (2003) 581–595.

H. Akaike, Information theory and an extension of the maximum likelihood principle, in Second International Symposium on Information Theory (Tsahkadsor, 1971). Akadémiai Kiadó, Budapest (1973) 267–281.

H. Akaike, A new look at the statistical model identification. IEEE Trans. Automatic Control AC-19 (1974) 716–723. System identification and time-series analysis.

S. Arlot, Rééchantillonnage et sélection de modèles, Ph.D. thesis, Université Paris-Sud XI (2007).

S. Arlot and P. Massart, Slope heuristics for heteroscedastic regression on a random design. Submitted to the Annals of Statistics (2008).

D. Babusiaux, S. Barreau and P.-R. Bauquis, Oil and gas exploration and production, reserves, costs, contracts. Technip, Paris (2007).

Banfield, J.D. and Raftery, A.E., Model-based Gaussian and non-Gaussian clustering. Biometrics 49 (1993) 803–821.

Barron, A., Birgé, L. and Massart, P., Risk bounds for model selection via penalization. Prob. Th. Rel. Fields 113 (1999) 301–413.

J.-P. Baudry, Clustering through model selection criteria. Poster session at One Day Statistical Workshop in Lisieux. http://www.math.u-psud.fr/ baudry, June (2007).

A. Berlinet, G. Biau and L. Rouvière, Functional classification with wavelets, Technical report (2008). To appear in Annales de l'ISUP.

Biernacki, C., Celeux, G. and Govaert, G., Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 719–725.

Biernacki, C., Celeux, G., Govaert, G. and Langrognet, F., Model-based cluster and discriminant analysis with the MIXMOD software. Comp. Stat. Data Anal. 51 (2006) 587–600.

Birgé, L. and Massart, P., Gaussian model selection. J. Eur. Math. Soc. (JEMS) 3 (2001) 203–268.

Birgé, L. and Massart, P., Minimal penalties for Gaussian model selection. Prob. Th. Rel. Fields 138 (2006) 33–73.

K.-E. Blake and C. Merz, UCI repository of machine learning databases (1999). http://mlearn.ics.uci.edu/MLSummary.html.

L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone, Classification and regression trees. Wadsworth Statistics/Probability Series. Wadsworth Advanced Books and Software, Belmont, CA (1984).

Celeux, G. and Govaert, G., Gaussian parsimonious clustering models. Patt. Recog. 28 (1995) 781–793.

Dempster, A.P., Laird, N.M. and Rubin, D.B., Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B. Methodol. 39 (1977) 1–38, With discussion.

Gagnot, S., Tamby, J.-P., Martin-Magniette, M.-L., Bitton, F., Taconnat, L., Balzergue, S., Aubourg, S., Renou, J.-P., Lecharny, A. and Brunaud, V., CATdb: a public access to Arabidopsis transcriptome data from the URGV-CATMA platform. Nucleic Acids Res. 36 (2008) 986–990.

García-Escudero, L.A. and Gordaliza, A., A proposal for robust curve clustering. J. Class. 22 (2005) 185–201.

P.J. Huber, Robust Statistics. Wiley (1981).

James, G.M. and Sugar, C.A., Clustering for sparsely sampled functional data. J. Am. Stat. Assoc. 98 (2003) 397–408.

Jiang, D., Tang, C. and Zhang, A., Cluster analysis for gene expression data: A survey. IEEE Trans. Knowl. Data Eng. 16 (2004) 1370–1386.

Keribin, C., Consistent estimation of the order of mixture models. Sankhyā Ser. A 62 (2000) 49–66.

Lebarbier, E., Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Proc. 85 (2005) 717–736.

V. Lepez, Potentiel de réserves d'un bassin pétrolier: modélisation et estimation, Ph.D. thesis, Université Paris Sud (2002).

Lurin, C., Andréas, C., Aubourg, S., Bellaoui, M., Bitton, F., Bruyère, C., Caboche, M., Debast, J., Gualberto, C., Hoffmann, B., Lecharny, M., Le Ret, A., Martin-Magniette, M.-L., Mireau, H., Peeters, N., Renou, J.-P., Szurek, B., Taconnat, L. and Small, I., Genome-wide analysis of Arabidopsis pentatricopeptide repeat proteins reveals their essential role in organelle biogenesis. Plant Cell 16 (2004) 2089–103.

Ma, P., Castillo-Davis, W., Zhong, C. and Liu, J.S., A data-driven clustering method for time course gene expression data. Nucleic Acids Res. 34 (2006) 1261–1269.

Mallows, C.L., Some comments on Cp. Technometrics 37 (1973) 362–372.

P. Massart, Concentration inequalities and model selection, Lecture Notes in Mathematics Vol. 1896. Springer, Berlin (2007). Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23 (2003).

Maugis, C., Celeux, G. and Martin-Magniette, M.-L., Variable selection for clustering with Gaussian mixture models. Biometrics 65 (2009) 701–709.

C. Maugis, G. Celeux and M.-L. Martin-Magniette, Variable selection in model-based clustering: A general variable role modeling. Comput. Stat. Data Anal. 53 (2009) 3872–3882.

Maugis, C. and Michel, B., A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: P&S 15 (2011) 41–68.

B. Michel, Modélisation de la production d'hydrocarbures dans un bassin pétrolier, Ph.D. thesis, Université Paris-Sud 11 (2008).

B.P. Percival and A.T. Walden, Wavelet methods for time series analysis. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, New York (2000).

Raftery, A.E. and Dean, N., Variable selection for model-based clustering. J. Am. Stat. Assoc. 101 (2006) 168–178.

Schwarz, G., Estimating the dimension of a model. Ann. Stat. 6 (1978) 461–464.

R. Sharan, R. Elkon and R. Shamir, Cluster analysis and its applications to gene expression data. In Ernst Schering Workshop on Bioinformatics and Genome Analysis. Springer Verlag (2002).

Tarpey, T. and Kinateder, K.K.J., Clustering functional data. J. Class. 20 (2003) 93–114.

F. Villers, Tests et sélection de modèles pour l'analyse de données protéomiques et transcriptomiques, Ph.D. thesis, Université Paris-Sud 11 (2007).
