
Coping with highly imbalanced datasets: A case study with definition extraction in a multilingual setting

Published online by Cambridge University Press: 11 February 2013

ROSA DEL GAUDIO
Affiliation:
Faculdade de Ciências, Departamento de Informática, University of Lisbon, Campo Grande, 1749-016 Lisboa, Portugal
GUSTAVO BATISTA
Affiliation:
Department of Computer Science, University of São Paulo, PO Box 668, 13560-970 São Carlos, SP, Brazil
ANTÓNIO BRANCO
Affiliation:
Faculdade de Ciências, Departamento de Informática, University of Lisbon, Campo Grande, 1749-016 Lisboa, Portugal

Abstract

This paper addresses the task of automatic definition extraction by thoroughly exploring an approach that relies solely on machine learning techniques, with a focus on the imbalance of the relevant datasets. We obtained a breakthrough in automatic definition extraction by extensively and systematically experimenting with different sampling techniques and their combinations, as well as with a range of different types of classifiers. Performance consistently scored in the range of 0.95–0.99 in terms of area under the receiver operating characteristic (ROC) curve, a notable improvement of between 17 and 22 percentage points over the baseline of 0.73–0.77, for datasets with different rates of imbalance. The present paper thus also contributes to the body of work in natural language processing that points to the importance of applying sampling techniques to mitigate the bias induced by highly imbalanced datasets, and thereby of greatly improving the performance of the large range of tools that rely on such datasets.
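To make the two central notions of the abstract concrete, the following is a minimal sketch of random over-sampling of the minority class and of the area under the ROC curve computed as a pairwise ranking probability. It is an illustration only and is not taken from the paper: the function names `random_oversample` and `roc_auc`, the toy data, and the choice of plain random over-sampling (rather than any specific technique the authors evaluated) are all assumptions.

```python
import random


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    # Nothing to do if there is no minority example or the data is already balanced.
    if not minority_idx or len(minority_idx) >= len(majority_idx):
        return list(rows), list(labels)
    n_extra = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def roc_auc(scores, labels, positive=1):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (assumed): four non-definitions (0) and one definition (1).
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)

# Toy classifier scores (assumed) for two negatives and two positives.
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

In this sketch over-sampling is applied to training data only; evaluating on resampled data would distort the reported AUC, so a held-out set should keep its original class distribution.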

Type
Articles
Copyright
Copyright © Cambridge University Press 2013 

