Article contents
Coping with highly imbalanced datasets: A case study with definition extraction in a multilingual setting
Published online by Cambridge University Press: 11 February 2013
Abstract
This paper addresses the task of automatic extraction of definitions by thoroughly exploring an approach that solely relies on machine learning techniques, and by focusing on the issue of the imbalance of relevant datasets. We obtained a breakthrough in terms of the automatic extraction of definitions, by extensively and systematically experimenting with different sampling techniques and their combination, as well as a range of different types of classifiers. Performance consistently scored in the range of 0.95–0.99 of area under the receiver operating characteristics, with a notorious improvement between 17 and 22 percentage points regarding the baseline of 0.73–0.77, for datasets with different rates of imbalance. Thus, the present paper also represents a contribution to the seminal work in natural language processing that points toward the importance of exploring the research path of applying sampling techniques to mitigate the bias induced by highly imbalanced datasets, and thus greatly improving the performance of a large range of tools that rely on them.
- Type
- Articles
- Information
- Copyright
- Copyright © Cambridge University Press 2013
References
- 11
- Cited by