Learning verb complements for Modern Greek: balancing the noisy dataset

Published online by Cambridge University Press:  01 January 2008

KATIA KERMANIDIS
Affiliation:
Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rio 26500, Greece
MANOLIS MARAGOUDAKIS
Affiliation:
Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rio 26500, Greece
NIKOS FAKOTAKIS
Affiliation:
Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rio 26500, Greece
GEORGE KOKKINAKIS
Affiliation:
Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rio 26500, Greece

Abstract

Attempting to learn automatically to identify verb complements from natural language corpora without the help of sophisticated linguistic resources such as grammars, parsers or treebanks leads to a significant amount of noise in the data. In machine learning terms, where learning from examples is performed using class-labelled feature-value vectors, noise leads to an imbalanced set of vectors: assuming that the class label takes two values (in this work, complement/non-complement), one class (complements) is heavily underrepresented in the data in comparison to the other. To overcome the drop in accuracy when predicting instances of the rare class due to this disproportion, we balance the learning data by applying one-sided sampling to the training corpus, thereby reducing the number of non-complement instances. This approach has been used in the past in several domains (image processing, medicine, etc.) but not in natural language processing. To identify the examples that are safe to remove, we use the value difference metric, which proves more suitable for nominal attributes like the ones this work deals with than the Euclidean distance traditionally used in one-sided sampling. We experiment with different learning algorithms that have been widely used and whose performance is well known to the machine learning community: Bayesian learners, instance-based learners and decision trees. Additionally, we present and test a variation of Bayesian belief networks, the COr-BBN (Class-oriented Bayesian belief network). Performance improves by up to 22% after balancing the dataset, reaching an f-measure of 73.7% for the complement class, using only a phrase chunker and basic morphological information for preprocessing.
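
The balancing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simplified, unweighted form of the value difference metric (the sum over attributes and classes of |P(c | value x) - P(c | value y)|), and it covers only one step usually associated with one-sided sampling, namely dropping the majority-class member of each Tomek link (a pair of mutually nearest instances with different labels). The function names, the feature vectors (phrase type, case) and the labels are hypothetical.

```python
from collections import Counter, defaultdict


def fit_vdm(rows, labels):
    """Return a simplified value-difference-metric distance for nominal rows.

    Assumption: unweighted VDM, i.e. the sum over attributes and classes of
    |P(class | value_x) - P(class | value_y)| ** q, estimated from counts.
    """
    classes = sorted(set(labels))
    counts = defaultdict(Counter)  # (attribute index, value) -> class counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            counts[(i, value)][label] += 1

    def cond_prob(i, value, c):
        total = sum(counts[(i, value)].values())
        return counts[(i, value)][c] / total if total else 0.0

    def distance(a, b, q=1):
        return sum(
            abs(cond_prob(i, x, c) - cond_prob(i, y, c)) ** q
            for i, (x, y) in enumerate(zip(a, b))
            for c in classes
        )

    return distance


def nearest(idx, rows, distance):
    """Index of the nearest other row (ties go to the lowest index)."""
    others = (j for j in range(len(rows)) if j != idx)
    return min(others, key=lambda j: distance(rows[idx], rows[j]))


def remove_tomek_majority(rows, labels, majority, distance):
    """Drop the majority-class member of every Tomek link."""
    nn = [nearest(i, rows, distance) for i in range(len(rows))]
    drop = set()
    for i, j in enumerate(nn):
        # Tomek link: i and j are each other's nearest neighbour
        # and carry different class labels.
        if nn[j] == i and labels[i] != labels[j]:
            drop.add(i if labels[i] == majority else j)
    keep = [k for k in range(len(rows)) if k not in drop]
    return [rows[k] for k in keep], [labels[k] for k in keep]


# Hypothetical toy data: (phrase type, case) -> complement / non-complement.
rows = [("NP", "acc"), ("NP", "acc"), ("PP", "gen"),
        ("PP", "gen"), ("PP", "gen"), ("NP", "nom")]
labels = ["complement", "non-complement", "non-complement",
          "non-complement", "non-complement", "complement"]

vdm = fit_vdm(rows, labels)
kept_rows, kept_labels = remove_tomek_majority(
    rows, labels, "non-complement", vdm)
print(Counter(labels), "->", Counter(kept_labels))
```

In this toy set, the non-complement instance that shares its feature values with a complement forms a Tomek link and is removed, while the other non-complement instances are untouched. The metric suits nominal attributes because it compares values through their class distributions rather than through a numeric difference that has no meaning for categories such as case or phrase type.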

Type
Papers
Copyright
Copyright © Cambridge University Press 2006

References

Aha, D., Kibler, D. and Albert, M. K. (1991) Instance based learning algorithms. Machine Learning 6 (1): 37–66.
Aldezabal, I., Aranzabe, M., Atutxa, A., Gojenola, K. and Sarasola, K. (2002) Learning argument/adjunct distinction for Basque. SIGLEX Workshop of the ACL, Philadelphia.
Batista, G., Bazan, A. and Monard, M. (2003) Balancing training data for automated annotation of keywords: a case study. Proceedings of the Second Brazilian Workshop on Bioinformatics (SBC), pp. 21–28.
Batista, G., Prati, R. and Monard, M. (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations 6 (1): 20–29.
Brent, M. (1993) From grammar to lexicon: unsupervised learning of lexical syntax. Computational Linguistics 19 (3): 243–62.
Briscoe, T. and Carroll, J. (1997) Automatic extraction of subcategorization from corpora. Proceedings of the Fifth Conference on Applied Natural Language Processing, ACL, pp. 356–363. Washington D.C.
Buchholz, S. (1998) Distinguishing complements from adjuncts using memory-based learning. Proceedings of the Workshop on Automated Acquisition of Syntax and Parsing, ESSLLI-98, pp. 41–48. Saarbruecken, Germany.
Chawla, N., Bowyer, K., Hall, L. and Kegelmeyer, W. P. (2002) SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16: 321–57. Morgan Kaufmann.
Cheng, J. and Greiner, R. (1999) Comparing Bayesian network classifiers. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI-99), Sweden.
Cheng, J. and Greiner, R. (2001) Learning Bayesian belief network classifiers: Algorithms and system. Proceedings of the Canadian Conference on Artificial Intelligence (CSCSI01), Ottawa.
Domingos, P. (1999) Metacost: A general method for making classifiers cost-sensitive. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 155–64. San Diego, CA.
Domingos, P. and Pazzani, M. (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29: 103–130.
Ersan, M. and Charniak, E. (1995) A statistical syntactic disambiguation program and what it learns. Technical Report CS-95-29. Department of Computer Science, Brown University.
Friedman, N. and Goldszmidt, M. (1996) Discretizing continuous attributes while learning Bayesian networks. In: L. Saitta, editor, Machine Learning: Proceedings of the Thirteenth International Conference. Morgan Kaufmann.
Guo, H. and Viktor, H. (2004) Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. SIGKDD Explorations 6 (1): 30–39.
Hart, P. E. (1968) The Condensed Nearest Neighbor Rule. IEEE Transactions on Information Theory, IT-14: 515–16.
Hatzigeorgiu, N., Gavriilidou, M., Piperidis, S., Carayannis, G., Papakostopoulou, A., Spiliotopoulou, A., Vacalopoulou, A., Labropoulou, P., Mantzari, E., Papageorgiou, H. and Demiros, I. (2000) Design and Implementation of the online ILSP Greek Corpus. Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), pp. 1737–1742. Athens, Greece.
Japkowicz, N. (2000) The class imbalance problem: significance and strategies. Proceedings of the International Conference on Artificial Intelligence (IC-AI'2000): Special Track on Inductive Learning. Las Vegas, Nevada.
Karanikolas, A., Spyropoulos, I., Aggelidou, D., Betsopoulou, M., Grigoriadis, N., Kandros, P., Karpouza, D., Lanaris, E., Moumtzakis, A., Tombaidis, D. and Tsolakis, X. (2004) Modern Greek Syntax. (in Greek). Publishing Organization of Educational Books. Athens.
Kermanidis, K., Fakotakis, N. and Kokkinakis, G. (2002) DELOS: An automatically tagged economic corpus for Modern Greek. In: M. Gonzalez Rodriguez and C. Paz Suarez Araujo, editors, Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), pp. 93–100. Las Palmas de Gran Canaria, Spain.
Kermanidis, K., Fakotakis, N. and Kokkinakis, G. (2004) Automatic Acquisition of Verb Subcategorization Information by Exploiting Minimal Linguistic Resources. International Journal of Corpus Linguistics 9 (1): 1–28. John Benjamins Publishing Company.
Klairis, C. and Babiniotis, G. (1999) Grammar of Modern Greek. II. The Verb. (in Greek). Athens: Greek Letters Publications.
Korhonen, A., Gorrell, G. and McCarthy, D. (2000) Statistical filtering and subcategorization frame acquisition. Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 199–205. Hong Kong.
Kubat, M. and Matwin, S. (1997) Addressing the curse of imbalanced training sets. Proceedings of the International Conference on Machine Learning (ICML-97), pp. 179–186.
Kubat, M., Pfurtscheller, G. and Flotzinger, D. (1994) AI-based approach to automatic sleep classification. Biological Cybernetics 70: 443–48.
Laurikkala, J. (2001) Improving identification of difficult small classes by balancing class distribution. Proceedings of the Eighth Conference on Artificial Intelligence in Medicine in Europe, pp. 63–66. Cascais, Portugal.
Lewis, D. and Gale, W. (1994) Training text classifiers by uncertainty sampling. Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12. Dublin, Ireland.
Ling, C. and Li, C. (1998) Data mining for direct marketing problems and solutions. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98). New York, NY.
Manning, C. (1993) Automatic acquisition of a large subcategorization dictionary from corpora. Proceedings of the 31st Meeting of the Association of Computational Linguistics, pp. 235–242. Columbus, Ohio.
Maragoudakis, M., Fakotakis, N. and Kokkinakis, G. (2004) Imposing Classification Bias in Bayesian Network Learning. Pattern Recognition and Image Analysis 14 (3).
Merlo, P. and Leybold, M. (2001) Automatic distinction of arguments and modifiers: the case of prepositional phrases. Proceedings of the Workshop on Computational Language Learning (CoNLL 2001), Toulouse, France.
Meyers, A., Macleod, C. and Grishman, R. (1994) Standardization of the complement/adjunct distinction. Proteus Project Memorandum 64, Computer Science Department, New York University.
Mitchell, T. (1997) Machine Learning. McGraw-Hill.
Partners of ESPRIT-291/860. (1986) Unification of the word classes of the ESPRIT Project 860. Internal Report BU-WKL-0376.
Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.
Provost, F. and Fawcett, T. (2001) Robust classification for imprecise environments. Machine Learning 42 (3): 203–231.
Quinlan, J. R. (1993) C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
Sarkar, A. and Zeman, D. (2000) Automatic extraction of subcategorization frames for Czech. Proceedings of the 18th International Conference on Computational Linguistics, pp. 691–697. Saarbruecken, Germany.
Schapire, R. (2002) The boosting approach to machine learning: An overview. MSRI Workshop on Nonlinear Estimation and Classification.
Sgarbas, K., Fakotakis, N. and Kokkinakis, G. (2000) A straightforward approach to morphological analysis and synthesis. Proceedings of the Workshop on Computational Lexicography and Multimedia Dictionaries, pp. 31–34. Kato Achaia, Greece.
Stamatatos, E., Fakotakis, N. and Kokkinakis, G. (2000) A practical chunker for unrestricted text. Proceedings of the Second International Conference of Natural Language Processing (NLP2000), pp. 139–150. Patras, Greece.
Stanfill, C. and Waltz, D. (1986) Toward memory-based reasoning. Communications of the ACM 29: 1213–1228.
Sung, K. and Poggio, T. (1995) Learning human face detection in cluttered scenes. Proceedings of the International Conference on Computer Analysis of Images and Patterns. Prague, Czech Republic.
Tomek, I. (1976) Two modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics SMC-6: 769–772.
Weiss, G. (2004) Mining with rarity: a unifying framework. SIGKDD Explorations 6 (1): 7–19.