Data pre-processing to improve the mining of large feed databases

F. Maroto-Molina; A. Gómez-Cabrera; J. E. Guerrero-Ginel; A. Garrido-Varo; D. Sauvant; G. Tran; V. Heuzé; D. C. Pérez-Marín

doi:10.1017/S1751731113000293

Published online by Cambridge University Press: 08 March 2013

F. Maroto-Molina*: Servicio de Información sobre Alimentos, Universidad de Córdoba, Ctra. Nacional IV km. 396, 14014, Córdoba, Spain
A. Gómez-Cabrera: Departamento de Producción Animal, ETS Ingeniería Agronómica y de Montes, Universidad de Córdoba, Ctra. Nacional IV km. 396, 14014, Córdoba, Spain
J. E. Guerrero-Ginel: Departamento de Producción Animal, ETS Ingeniería Agronómica y de Montes, Universidad de Córdoba, Ctra. Nacional IV km. 396, 14014, Córdoba, Spain
A. Garrido-Varo: Departamento de Producción Animal, ETS Ingeniería Agronómica y de Montes, Universidad de Córdoba, Ctra. Nacional IV km. 396, 14014, Córdoba, Spain
D. Sauvant: UMR 791 Physiologie de la nutrition et de l'alimentation, AgroParisTech, 16 rue Claude Bernard, 75231, Paris, Cedex 05, France
G. Tran: Association Française de Zootechnie, AgroParisTech, 16 rue Claude Bernard, 75231, Paris, Cedex 05, France
V. Heuzé: Association Française de Zootechnie, AgroParisTech, 16 rue Claude Bernard, 75231, Paris, Cedex 05, France
D. C. Pérez-Marín: Departamento de Producción Animal, ETS Ingeniería Agronómica y de Montes, Universidad de Córdoba, Ctra. Nacional IV km. 396, 14014, Córdoba, Spain
*: †E-mail: [email protected]

Abstract

The information stored in animal feed databases is highly variable, in terms of both provenance and quality; therefore, data pre-processing is essential to ensure reliable results. Yet, pre-processing at best tends to be unsystematic; at worst, it may even be wholly ignored. This paper sought to develop a systematic approach to the various stages involved in pre-processing to improve feed database outputs. The database used contained analytical and nutritional data on roughly 20 000 alfalfa samples. A range of techniques were examined for integrating data from different sources, for detecting duplicates and, particularly, for detecting outliers. Special attention was paid to the comparison of univariate and multivariate solutions. Major issues relating to the heterogeneous nature of data contained in this database were explored, the observed outliers were characterized and ad hoc routines were designed for error control. Finally, a heuristic diagram was designed to systematize the various aspects involved in the detection and management of outliers and errors.
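
The comparison of univariate and multivariate outlier detection mentioned above can be illustrated with a minimal sketch. The abstract does not name the specific tests used, so the two rules below (a z-score cut-off and a bivariate Mahalanobis distance) are assumptions chosen for illustration, as are the function names, the threshold of 3 and the synthetic crude protein and fibre values, which are not taken from the database.

```python
import math
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Univariate rule (assumed): indices whose |z-score| exceeds threshold."""
    m, s = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - m) / s > threshold]


def mahalanobis_outliers(xs, ys, threshold=3.0):
    """Bivariate rule (assumed): indices whose Mahalanobis distance
    from the joint mean exceeds threshold."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample variances and covariance (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    flagged = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        dx, dy = x - mx, y - my
        # Quadratic form d^T S^-1 d, written out for the 2x2 case.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        if math.sqrt(d2) > threshold:
            flagged.append(i)
    return flagged


# Synthetic, hypothetical samples (% of dry matter): fibre falls as
# crude protein rises, with a small alternating deviation.
crude_protein = [14 + 0.5 * i for i in range(20)]
fibre = [60 - 1.5 * cp + (0.3 if i % 2 == 0 else -0.3)
         for i, cp in enumerate(crude_protein)]

# One suspicious record: each value is plausible on its own,
# but the combination contradicts the protein-fibre relationship.
crude_protein.append(22.0)
fibre.append(38.0)

univariate_flags = sorted(set(zscore_outliers(crude_protein))
                          | set(zscore_outliers(fibre)))
multivariate_flags = mahalanobis_outliers(crude_protein, fibre)
print("univariate:", univariate_flags)
print("multivariate:", multivariate_flags)
```

In this synthetic case the appended record lies within the observed range of each variable, so the univariate rule does not flag it, whereas the bivariate rule does because it accounts for the correlation between the two constituents. Real feed data would probably require checking distributional assumptions before applying either rule.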

Keywords

chemical composition, nutritive value, data integration, outlier mining

Type: Nutrition
Information: animal, Volume 7, Issue 7, July 2013, pp. 1128–1136

DOI: https://doi.org/10.1017/S1751731113000293
Copyright © The Animal Consortium 2013
