Data pre-processing to improve the mining of large feed databases

Published online by Cambridge University Press: 08 March 2013

F. Maroto-Molina*
Affiliation:
Servicio de Información sobre Alimentos, Universidad de Córdoba, Ctra. Nacional IV km. 396, 14014, Córdoba, Spain
A. Gómez-Cabrera
Affiliation:
Departamento de Producción Animal, ETS Ingeniería Agronómica y de Montes, Universidad de Córdoba, Ctra. Nacional IV km. 396, 14014, Córdoba, Spain
J. E. Guerrero-Ginel
Affiliation:
Departamento de Producción Animal, ETS Ingeniería Agronómica y de Montes, Universidad de Córdoba, Ctra. Nacional IV km. 396, 14014, Córdoba, Spain
A. Garrido-Varo
Affiliation:
Departamento de Producción Animal, ETS Ingeniería Agronómica y de Montes, Universidad de Córdoba, Ctra. Nacional IV km. 396, 14014, Córdoba, Spain
D. Sauvant
Affiliation:
UMR 791 Physiologie de la nutrition et de l'alimentation, AgroParisTech, 16 rue Claude Bernard, 75231, Paris, Cedex 05, France
G. Tran
Affiliation:
Association Française de Zootechnie, AgroParisTech, 16 rue Claude Bernard, 75231, Paris, Cedex 05, France
V. Heuzé
Affiliation:
Association Française de Zootechnie, AgroParisTech, 16 rue Claude Bernard, 75231, Paris, Cedex 05, France
D. C. Pérez-Marín
Affiliation:
Departamento de Producción Animal, ETS Ingeniería Agronómica y de Montes, Universidad de Córdoba, Ctra. Nacional IV km. 396, 14014, Córdoba, Spain

Abstract

The information stored in animal feed databases is highly variable, in terms of both provenance and quality; therefore, data pre-processing is essential to ensure reliable results. Yet, pre-processing at best tends to be unsystematic; at worst, it may even be wholly ignored. This paper sought to develop a systematic approach to the various stages involved in pre-processing to improve feed database outputs. The database used contained analytical and nutritional data on roughly 20 000 alfalfa samples. A range of techniques were examined for integrating data from different sources, for detecting duplicates and, particularly, for detecting outliers. Special attention was paid to the comparison of univariate and multivariate solutions. Major issues relating to the heterogeneous nature of data contained in this database were explored, the observed outliers were characterized and ad hoc routines were designed for error control. Finally, a heuristic diagram was designed to systematize the various aspects involved in the detection and management of outliers and errors.
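The contrast the abstract draws between univariate and multivariate outlier detection, together with duplicate detection, can be illustrated with a minimal sketch. It is not the authors' procedure: the function names, the z-score and Mahalanobis-distance screens, the threshold of 3 and the synthetic crude protein (CP) and neutral detergent fibre (NDF) values are all assumptions chosen for illustration only.

```python
import math
from statistics import mean, stdev


def find_duplicates(records):
    """Return indices of records that repeat an earlier one.

    Records are (name, cp, ndf) tuples. Names are normalised (trimmed,
    lower-cased) and values rounded so that the same sample entered by
    two sources with trivial formatting differences is still matched.
    """
    seen = set()
    duplicates = []
    for i, (name, cp, ndf) in enumerate(records):
        key = (name.strip().lower(), round(cp, 1), round(ndf, 1))
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates


def univariate_outliers(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    m, s = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - m) / s > threshold]


def mahalanobis_outliers(xs, ys, threshold=3.0):
    """Flag (x, y) pairs whose Mahalanobis distance exceeds the threshold.

    The 2x2 sample covariance matrix is inverted in closed form, so the
    distance accounts for the correlation between the two variables.
    """
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    flagged = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        dx, dy = x - mx, y - my
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        if math.sqrt(d2) > threshold:
            flagged.append(i)
    return flagged


# Synthetic data (hypothetical, % of dry matter): CP and NDF are
# negatively correlated, with a small alternating deviation.
cp = [14 + 0.5 * i for i in range(21)]
ndf = [60 - c + (0.3 if i % 2 == 0 else -0.3) for i, c in enumerate(cp)]

# One extra sample whose CP and NDF are each within the observed range
# but whose combination breaks the CP-NDF relationship.
cp.append(22.0)
ndf.append(45.0)

# Hypothetical records from two sources; the second repeats the first.
records = [
    ("alfalfa hay", 18.2, 42.1),
    ("Alfalfa Hay ", 18.2, 42.1),
    ("alfalfa hay", 17.5, 44.0),
]

print("duplicates:", find_duplicates(records))
print("univariate CP:", univariate_outliers(cp))
print("univariate NDF:", univariate_outliers(ndf))
print("multivariate:", mahalanobis_outliers(cp, ndf))
```

On this synthetic data neither univariate screen flags anything, because the added sample is unremarkable on each variable taken alone, whereas the multivariate screen flags it because the pair of values departs from the correlation between the two variables. A real database would need more variables, a justified threshold and a review of each flagged sample before any deletion.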

Type
Nutrition
Copyright
Copyright © The Animal Consortium 2013 
