Inflammation of the bovine udder, known as mastitis, is often associated with bacterial intra-mammary infections (IMI), and is subdivided into clinical (visible signs of inflammation of the udder or milk) and subclinical (inflammation without visible signs) mastitis. Mastitis is the most widespread and economically challenging disease of dairy cattle (Petrovski et al. 2006), as it can directly affect milk production and quality, costing up to £277 per infected cow (Halasa et al. 2007) and may affect up to 70% of a given herd (Bradley et al. 2007). In addition, it can be fatal for some infected cows (Addis et al. 2016).
Clinical mastitis (CM) is classified as mild, moderate or severe and can be detected by simple observational tests. Cows may show clinical signs such as fever, loss of appetite, dehydration, a swollen udder that is sensitive to touch, and reduced milk secretion with flakes, clots, and a watery appearance (Petrovski et al. 2006; Goncalves et al. 2016).
Subclinical mastitis (SCM) is up to 40 times more common than CM and far more difficult to detect, thus having a greater economic impact. To control SCM, accurate surveillance strategies are required. Somatic cell count (SCC) is currently the key SCM detection method (Bochniarz et al. 2016).
SCC primarily comprises white blood cells (leucocytes), including macrophages, lymphocytes and polymorphonuclear leucocytes (essentially neutrophils), which are produced by the cow's immune system in response to an infection. Upon bacterial invasion of the bovine mammary gland, leucocytes are recruited into the gland from the blood stream, increasing the SCC, measured in number of cells per ml of milk (Harmon, 1994). In a normal, healthy cow, SCC is around 70 000 cells/ml but when a cow has an IMI, it increases sharply. Since SCC increases with the severity of mastitis, it is used to indicate the IMI status at the time of sampling. Once SCC exceeds the selected cut-off (Australian cut-off ≥250 000 cells/ml, European Union cut-off ≥200 000 cells/ml, and New Zealand cut-off ≥150 000 cells/ml), a cow is considered to have an IMI.
SCC (among other milk quality parameters) is measured during routine herd testing. The dairy industry worldwide has adopted this test-day SCC as its main mastitis surveillance tool (Schukken et al. 2003; Sharma et al. 2011). However, simple test-day SCC may not be stable enough to accurately monitor SCM, as SCC widely fluctuates between test days (Berglund et al. 2007).
Frequent (longitudinal) monitoring of SCC (Berglund et al. 2007) and/or finding additional SCC-independent predictors (such as milk quality parameters) could increase the robustness and early predictive power of SCM detection. Some milk composition features have high potential for use as SCM predictors. Infected cows have a decreased milk yield (volume), and composition changes such as reduced fat, protein and lactose (Bartlett et al. 1990; Leitner et al. 2003; Petrovski et al. 2006). However, Gyr cows with SCM had reduced lactose and solids only, while the protein and fat content of the milk appeared unaffected (Malek dos Reis et al. 2013). Thus, key predictors of SCM need to be reliably identified.
Key predictors of SCM can be identified through mining a large number of milking records, to develop a robust and reliable statistical model of feature selection. Data mining (machine learning) is the process of extracting implicit, previously unknown, and potentially useful information from data. It can also be defined as data analysis for finding regularities and patterns in a given data set. Data mining outperforms the common multivariate statistical methods in large-scale studies. Great attention has been paid to the application of data mining in finance, medicine, drug discovery, agriculture, and biology as it can create accurate risk models and discover patterns in data (Ebrahimi et al. 2010; Ebrahimi et al. 2011; Vinod Bharat et al. 2016). Data mining is extensively used in the medical field for disease detection and for assessing drug outcomes, owing to its predictive abilities and pattern recognition (KayvanJoo et al. 2014; Hande Küçükönder, 2015). Supervised feature selection has been widely used in data mining to analyse complex data.
For this study, the goal of feature selection was to find a subset of input features of milking parameters by removing the ones with little or no predictive value (Shekoofa et al. 2011). In a large-scale, longitudinal study, we analysed a range of milk composition variables by 10 attribute weighting models to find those linked to SCC and SCM.
Materials and methods
Data collection
Data was collected between July 2011 and June 2013 from a split-calving, commercial dairy herd with a year-round average of 2400 mixed-age, Holstein Friesian cows milked twice daily. The farm, located in Ongaonga, Hawkes Bay, New Zealand, supplements a 40% pasture-based diet with 50% mixed rations (palm kernel extract, tapioca, triticale, barley, oak, straw, grass silage, corn silage, hay), 10% food processing by-products (including apples, beetroot, turnips, brewer's grain) and 5 g yeast per cow per day, to promote high milk production in its cows. Cows were regularly sourced from various family lines and origins at a production-driven, high replacement rate. No experimental control group was used in this study. However, the farm was chosen for its strict enforcement of standard operating procedures to minimise confounding variables, so that all cows were exposed to the same environmental and management conditions.
An automated, electronic monitoring system was used to analyse milk for volume, weight, fat, protein, lactose, electrical conductivity, milking time, peak flow and SCC. SCC was measured by CellSense®, an on-line SCC detector that collects milk samples from individual cows approximately one minute after the start of milking, and mixes them with a reagent, CellGel™, using the same principle as the rapid mastitis test, to give SCC, as described by Whyte et al. (2004). CellSense® measures SCC with reasonably good precision, showing an overall correlation coefficient of 0·76 when compared with the Fossomatic technique (CombiFoss 5000, Foss Electric, Hillerød, Denmark), the most commonly used laboratory measure for herd testing globally (Kawai et al. 2013).
Automatically collected data was first tabulated in Excel to enable statistical processing. Incomplete records or those with recording errors were identified and removed from the data set before analysis.
Description of features
The following variables were recorded at each milking: SCC, milk volume, fat, protein, lactose, Electrical Conductivity (EC), milking time, and peak flow. Fifteen bails of the farm's 60 bail rotary parlour (a one in four ratio) were fitted with CellSense®, allowing data capture of an average three and a half SCC measurements per cow, per week.
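The sampling rate quoted above can be checked with a short calculation. This is only a sketch: it assumes that each cow is equally likely to enter any of the 60 bails at each of her two daily milkings, which the text does not state explicitly.

```latex
\[
\underbrace{2 \times 7}_{\text{milkings per cow per week}}
\times
\underbrace{\frac{15}{60}}_{\text{fraction of bails fitted with a sensor}}
= 14 \times 0.25 = 3.5 \ \text{SCC measurements per cow per week}
\]
```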
As repeated bacteriological cultures were not available for individual cows, the accepted SCC cut-offs of ≥150 000 cells/ml in New Zealand, ≥200 000 cells/ml in the European Union and ≥250 000 cells/ml in Australia were used to classify a cow's mastitis status (0 = non-mastitis, 1 = mastitis). In other words, SCM occurrence was coded as a binary (Yes/No) variable under each of the New Zealand, European Union, and Australian cut-offs.
Since infection status was not confirmed bacteriologically, cows classified as having mastitis (1) were described as having an assumed intramammary infection (IMI).
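The labelling rule described above can be sketched in a few lines. The cut-off values are those quoted in the text; the dictionary keys and function names are hypothetical and are not taken from the software used in the study.

```python
# Cut-offs in cells/ml as quoted in the text; the keys are illustrative names.
CUTOFFS = {"NZ": 150_000, "EU": 200_000, "AU": 250_000}


def mastitis_label(scc_cells_per_ml: float, scheme: str) -> int:
    """Return 1 (assumed IMI) if SCC is at or above the cut-off, else 0."""
    return int(scc_cells_per_ml >= CUTOFFS[scheme])


def label_all(scc_cells_per_ml: float) -> dict:
    """Label one SCC reading under all three scoring schemes."""
    return {scheme: mastitis_label(scc_cells_per_ml, scheme) for scheme in CUTOFFS}
```

A single reading can therefore be labelled differently depending on the scheme: a record at 180 000 cells/ml counts as mastitis under the New Zealand cut-off only.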
Preparation of data sets
Two types of data sets were created:
(1) Data set containing SCC variable (DCS): Herein, the sampling score (Australian, European Union or New Zealand threshold/cut-off) was set as the target (label) variable. In this data set, SCC was set as the predictive variable.
(2) Data set without SCC variable (DwSCC): In this data set, mastitis was set as the target (label) variable.
The data sets were imported into RapidMiner software (RapidMiner 5.0.001, Rapid-I GmbH, Stockumer Str. 475, 44227 Dortmund, Germany). Either the sampling score (Australian, New Zealand, or European Union threshold/cut-off) or mastitis status was set as the output (label or target) variable, as described for the two data sets above, and the other variables (with or without SCC) were set as predictor inputs.
Descriptive statistics, univariate, and multivariate analysis
Descriptive statistics, univariate, and multivariate analyses were performed using the Minitab 17 package (http://www.minitab.com/). Means of milk composition variables were compared with the two-sample t-test. Two proportions were compared using the Z-test and Fisher's exact test. Spearman correlation was used to calculate the correlation between each numeric variable and mastitis, as a binary (Yes/No) variable.
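For readers unfamiliar with the two-proportion Z-test, the following is a minimal sketch of the calculation. It assumes the pooled-variance form of the test and a two-sided alternative; the settings actually used in Minitab are not stated in the text, and the function name is hypothetical.

```python
import math


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Pooled two-proportion Z-test; returns (z, two-sided p-value).

    x1, x2 are the numbers of mastitis records and n1, n2 the group sizes.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, with made-up counts of 70 mastitis records out of 100 in one group and 30 out of 100 in another, the function returns a z statistic of about 5·66 and a p-value far below 0·01.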
Attribute weighting algorithms
To determine the most important variables, various attribute weighting algorithms were applied to each data set (Ebrahimi et al. 2011):
• Information gain attribute weighting: This operator calculates the relevance of a factor by computing the information gain in class distribution.
• Weighting by the information gain ratio: This operator calculates the relevance of the attributes based on the information gain ratio and assigns weights to them accordingly.
• Weighting by rule: This operator calculates the relevance of a factor by computing the error rate of a model on the sample data set without the factor.
• Weighting by deviation: This operator creates weights from the standard deviations of all attributes. The values can be normalised by the average, minimum, or maximum of the attribute.
• Weighting by the chi-squared statistic: This operator calculates the relevance of a factor by computing, for each attribute in the input sample data set, the value of the chi-squared statistic with respect to the class attribute.
• Weighting by the Gini Index: This operator calculates the relevance of a factor by computing the Gini Index of the class distribution, if the given sample data set would have been split per the factor in question.
• Weighting by uncertainty: This operator calculates the relevance of an attribute by measuring the symmetrical uncertainty with respect to the class.
• Weighting by Relief: This operator measures the relevance of a factor by sampling the examples and comparing the value of the current factor with the nearest example of the same, and of a different, class. The resulting weights are normalised into the interval between 0 and 1.
• Weighting by Support Vector Machine (SVM): This operator uses the coefficients of the normal vector of a linear SVM as feature weights.
• Weighting by Principal Component Analysis (PCA): This operator uses the factors of the first principal component as feature weights. Data is normalised before running the models, so it is reasonable to expect that all weights will be presented as a value between 0 and 1, showing the importance of each attribute for the target attribute.
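To make two of the criteria in the list above concrete, the sketch below computes information gain and information gain ratio for a single attribute. It assumes the attribute has already been discretised into categories (for example, low or high lactose); how RapidMiner handles numeric attributes internally is not described in the text, and the function names here are hypothetical rather than RapidMiner operators.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def partition(values, labels):
    """Group the class labels by the value of the attribute."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())


def information_gain(values, labels):
    """Reduction in class entropy after splitting on the attribute."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain divided by the entropy of the attribute itself."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0
```

An attribute that separates mastitis from non-mastitis records perfectly has an information gain equal to the class entropy, whereas an attribute unrelated to the class has a gain of zero.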
To enable comparison between the results of different attribute weighting models, the calculated weights were normalised to the range 0 to 1. As previously described (Ebrahimi et al. 2014), we defined an index based on the intersection (agreement) of different weighting models. The features that were identified as important by most weighting models (weights >0·5, >0·75, or >0·95) were assumed to be the key indicators of mastitis.
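The agreement index can be illustrated as follows. This is a sketch only: it assumes min-max scaling as the normalisation (the text says only that weights were normalised to the range 0 to 1), and the function names and any input weights are hypothetical.

```python
def min_max_normalise(weights: dict) -> dict:
    """Rescale one model's raw feature weights to the range 0 to 1."""
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:
        return {feature: 0.0 for feature in weights}
    return {feature: (w - lo) / (hi - lo) for feature, w in weights.items()}


def agreement_count(weights_by_model: dict, threshold: float) -> dict:
    """Count, per feature, how many models give a normalised weight above threshold."""
    counts = {}
    for model_weights in weights_by_model.values():
        for feature, w in min_max_normalise(model_weights).items():
            counts[feature] = counts.get(feature, 0) + int(w > threshold)
    return counts
```

Features with the highest counts at a given threshold (0·5, 0·75, or 0·95) are those on which the weighting models agree.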
Results
Distribution of milk composition variables and SCM occurrence
The distribution of SCM was compared between the 3 different scoring methods. As expected, due to the higher level of acceptable SCC, scoring based on the Australian cut-off resulted in the lowest SCM prevalence (19%) and scoring based on New Zealand's low cut-off had the highest prevalence (35%) with the European cut-off intermediate (24%: online Supplementary File Fig. S1).
Figure 1 shows the difference in SCC and milk composition variables between records with and without mastitis (using the Australian mastitis scoring system). As expected, cows with mastitis had significantly higher SCC than cows without mastitis (433·2 ± 369·2 vs. 83·9 ± 60·6, P < 0·01) (Fig. 1). Mastitis significantly decreased lactose content, milking time (by 11·6%), peak flow (by 9·9%) and milk volume (by 26%) (all P < 0·01). In contrast, fat and EC were increased (P < 0·01) (Fig. 1). Although the mean comparison suggested that mastitis increases the protein content of milk, this increase was not supported by the median comparison (data not shown): the median protein content was 4·00 g in cows both with and without mastitis. Milk EC was highly variable within cows without mastitis (Fig. 1).
Attribute weighting analysis of mastitis data set containing SCC variable (DCS)
Table 1 presents the normalised weights for the features measured in this study with respect to mastitis (based on the Australian cut-off), with SCC included as a feature. Attribute weighting applied to data scored by the Australian cut-off showed that the SCC variable gained the highest possible weight of 1·0 from 9 out of 10 weighting algorithms (PCA, SVM, Uncertainty, Gini Index, Chi Squared, Deviation, Rule, Info Gain Ratio, and Info Gain) and 0·65 from the Relief method (Table 1). Other variables did not attain weights in the presence of SCC. When data with the European Union and New Zealand scoring systems were analysed, the same results were derived, and the SCC variable was the sole important feature (online Supplementary File Figure S1).
SCC, Somatic Cell Count; EC, Electric Conductivity
Attribute weighting analysis of mastitis data set without SCC variable (DwSCC)
When the SCC variable was omitted from the variable list and the Australian mastitis scoring was set as the target (label) variable, lactose was promoted to first place, followed by EC (Table 2). When analysing data with Australian mastitis scoring, three attribute weighting algorithms (Uncertainty, Chi Squared, and Info Gain Ratio) gave the highest possible value of 1·0 to lactose (Table 2). EC received the weight of 1·0 from the Gini Index and Info Gain algorithms. Milking time received two 1·0 weights by Gini Index and Info Gain (Table 2). In the data labelled with EU and New Zealand scoring, the same results were obtained, and again, lactose, EC, and milking time were the most important features (data not shown). Altogether, lactose content of milk is the key distinguishing attribute for mastitis prediction (Table 2), as identified by most of the attribute weighting models (intersection of different models).
EC, Electric Conductivity
Lactose content of milk: negative indicator of mastitis occurrence
Attribute weighting models selected lactose as the most important negative indicator of mastitis (when SCC variable was omitted from the variable list). We determined the frequency of mastitis at different concentrations of milk lactose (online Supplementary File Table S1). When the lactose content of milk decreased, the frequency of mastitis occurrence sharply increased. Spearman correlation showed a significantly (P-value < 0·01) negative correlation (−26·8%) between lactose and mastitis (Australian mastitis scoring). The prevalence of mastitis differed greatly between lactose <4·2 mg/100 ml (average of 73·5% mastitis prevalence) and lactose >4·4 mg/100 ml (average of 17·3% mastitis prevalence: online Supplementary File Fig. S2). Fisher's exact test and Z tests showed this difference to be significant at P < 0·01.
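The comparison of mastitis prevalence across lactose bands can be sketched as below. The band limits of 4·2 and 4·4 are taken from the text and are in the same units as reported there; the function name and the record layout (lactose value, 0/1 mastitis label) are assumptions made for illustration.

```python
def prevalence_by_band(records, low=4.2, high=4.4):
    """Return mastitis prevalence in the low, mid and high lactose bands.

    records is an iterable of (lactose, mastitis) pairs, mastitis being 0 or 1.
    A band with no records gets None.
    """
    bands = {"low": [0, 0], "mid": [0, 0], "high": [0, 0]}  # [positives, total]
    for lactose, mastitis in records:
        band = "low" if lactose < low else "high" if lactose > high else "mid"
        bands[band][0] += mastitis
        bands[band][1] += 1
    return {b: (pos / n if n else None) for b, (pos, n) in bands.items()}
```

The resulting counts per band are the inputs that a two-proportion comparison, such as the Z-test or Fisher's exact test named in the text, would operate on.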
Discussion
In this study, we analysed a range of milk composition variables by 10 attribute weighting models, to find the ones linked to SCM. SCM was detected by on-line SCC measurements and a variety of threshold/cut-offs (i.e. Australian, New Zealand and European Union). Lactose, EC and their combination were the most accurate variables to detect mastitis on dairy farms equipped with in-line sensors. Both forms of mastitis (CM and SCM) influence the quantity and quality of milk and therefore are of major economic concern for the dairy industry. Early detection of mastitis (especially SCM) can improve animal welfare and the quality of milk production, and increase economic gains. Data mining, also known as knowledge discovery, is a relatively new technique to extract useful knowledge from data (compared to traditional multivariate analysis) (Shekoofa et al. 2014). The main strength of data mining resides in its ability to handle large data sets and to extract meaningful and easy-to-understand results, which is usually impossible by human calculations (Torkzaban et al. 2015). A few studies have looked into this field to improve early detection techniques by novel data mining tools (Sharifi et al. 2018).
In recent years, the application of sensors to collect data during automatic milking has generated a considerable amount of data which needs efficient pattern recognition models to mine the data (Kamphuis et al. 2008). Decision tree algorithms have been employed to predict CM and SCM from sensor data with promising outcomes (Kamphuis et al. 2010b; Ebrahimie et al. 2018). Support vector machine (SVM) methods have been used to predict SCM in dairy cattle by constructing and examining a prediction model with 89% sensitivity, 92% specificity, and 50% error in mastitis detection (Mammadova & Keskin, 2013). In a recent study, neural network algorithms have been used to classify healthy and mastitic buffaloes, based on yield and milk quality parameters (Panchal et al. 2016).
This is the first large-scale and longitudinal study employing 10 different attribute weighting models to distinguish healthy from mastitic dairy cattle based on quality and quantity of milking parameters. Ten different attribute weighting models (PCA, Chi Squared, Rule, Relief, Info Gain, Info Gain Ratio, SVM, Uncertainty, Deviation and Gini Index) were used to find the most important features associated with SCM. In line with previous findings, SCC is one of the best indicators of SCM (Mitchell et al. 1986).
Lactose was the most important predictor of mastitis after SCC. Prevalence of mastitis was statistically higher when milk lactose was low. EC was another important predictor of SCM. EC of milk increases during mastitis because of changes in its ionic composition: increased Na+ and Cl− levels and decreases in other mineral substances. Computerised herd management systems have been implemented by modern enterprises, allowing variables such as milk yield, flow rate and EC to be automatically recorded during milking. Then, based on EC, each individual cow's mastitis status is evaluated and an alarm is raised to indicate mastitis. However, these alarms may be raised at the wrong time, and thus the importance of this feature for CM and SCM detection has been debated previously (Kamphuis et al. 2010a). Interestingly, we found no report showing the relationship of SCC or mastitis with lactose or EC. Our results showed that lactose and EC are the second and third most important features, and were given high values by some attribute weighting models. The results of this study should be further evaluated on more farms.
Conclusion
The high variability of SCC, particularly in test-day SCC, reinforces the need to identify additional predictive indicators for SCM, derived from longitudinal, large-scale studies, mined by highly efficient pattern recognition models. This study employed ten different attribute weighting models (data mining) with different statistical backgrounds and selected key features based on the agreement of these algorithms. Results showed that, after SCC, lactose concentration and EC gained the highest weights across the attribute weighting models. We suggest that a combination of lactose concentration and EC can provide a predictive tool for mastitis.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0022029918000249