Learning verb complements for Modern Greek: balancing the noisy dataset

KATIA KERMANIDIS; MANOLIS MARAGOUDAKIS; NIKOS FAKOTAKIS; GEORGE KOKKINAKIS

doi:10.1017/S135132490600413X

Published online by Cambridge University Press: 01 January 2008

Affiliation (all authors): Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rio 26500, Greece

Abstract

Attempting to learn automatically to identify verb complements from natural language corpora, without the help of sophisticated linguistic resources like grammars, parsers or treebanks, leads to a significant amount of noise in the data. In machine learning terms, where learning from examples is performed using class-labelled feature-value vectors, noise leads to an imbalanced set of vectors: assuming that the class label takes two values (in this work complement/non-complement), one class (complements) is heavily underrepresented in the data in comparison to the other. To overcome the drop in accuracy when predicting instances of the rare class due to this disproportion, we balance the learning data by applying one-sided sampling to the training corpus, thereby reducing the number of non-complement instances. This approach has been used in the past in several domains (image processing, medicine, etc.) but not in natural language processing. For identifying the examples that are safe to remove, we use the value difference metric, which proves to be more suitable for nominal attributes like the ones this work deals with than the Euclidean distance, which has been used traditionally in one-sided sampling. We experiment with different learning algorithms that have been widely used and whose performance is well known to the machine learning community: Bayesian learners, instance-based learners and decision trees. Additionally, we present and test a variation of Bayesian belief networks, the COr-BBN (Class-oriented Bayesian belief network). Performance improves by up to 22% after balancing the dataset, reaching 73.7% f-measure for the complement class, using only a phrase chunker and basic morphological information for preprocessing.
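As an illustration of the balancing idea described in the abstract, the following minimal Python sketch removes majority-class (non-complement) instances that form Tomek links with minority-class instances, measuring distance with a simplified value difference metric over nominal attributes. It is not the authors' implementation. The assumptions are: the metric is the unweighted form (sum over attributes and classes of the difference in class-conditional probabilities raised to the power q, with q = 2), only the Tomek-link step of one-sided sampling is shown (the condensation step is omitted), and the function names, attribute values and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def fit_vdm(X, y, q=2):
    """Return a distance function based on a simplified, unweighted VDM.

    Assumption: d(u, v) = sum_a sum_c |P(c | a=u_a) - P(c | a=v_a)|^q,
    with probabilities estimated by counting over the training data.
    """
    classes = sorted(set(y))
    n_attrs = len(X[0])
    # counts[a][value][label] = number of instances with that value and class
    counts = [defaultdict(Counter) for _ in range(n_attrs)]
    for row, label in zip(X, y):
        for a, value in enumerate(row):
            counts[a][value][label] += 1

    def cond_prob(a, value, c):
        total = sum(counts[a][value].values())
        return counts[a][value][c] / total if total else 0.0

    def distance(u, v):
        return sum(
            abs(cond_prob(a, u[a], c) - cond_prob(a, v[a], c)) ** q
            for a in range(n_attrs)
            for c in classes
        )

    return distance


def nearest(i, X, dist):
    """Index of the nearest neighbour of instance i (first one wins on ties)."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: dist(X[i], X[j]))


def drop_majority_tomek(X, y, majority, dist):
    """Drop majority instances whose mutual nearest neighbour is a minority instance."""
    nn = [nearest(i, X, dist) for i in range(len(X))]
    drop = {i for i, j in enumerate(nn)
            if y[i] == majority and y[j] != majority and nn[j] == i}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: two nominal attributes per instance,
# "C" = complement (rare class), "N" = non-complement (frequent class).
X = [("a", "x"), ("a", "x"), ("a", "y"),
     ("b", "z"), ("b", "z"),
     ("c", "w"), ("c", "w")]
y = ["N", "N", "N", "C", "N", "C", "C"]

dist = fit_vdm(X, y)
Xb, yb = drop_majority_tomek(X, y, majority="N", dist=dist)
print(len(X), "instances before,", len(Xb), "after")
```

In this toy set the non-complement instance `("b", "z")` is the mutual nearest neighbour of an identical complement instance, so it is treated as noise and removed, while every complement instance is kept. The metric needs only class-conditional value counts, which is why it applies directly to nominal attributes where Euclidean distance has no natural meaning.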

Type: Papers
Information: Natural Language Engineering, Volume 14, Issue 1, January 2008, pp. 71–100

DOI: https://doi.org/10.1017/S135132490600413X
Copyright: Copyright © Cambridge University Press 2006
