Introduction
Machine learning has gained substantial interest in nutritional sciences over the last decade(Reference Kirk, Kok, Tufano, Tekinerdogan, Feskens and Camps1). A PubMed search using the terms ‘nutrition’ and ‘machine learning’ shows the number of articles with title and abstract matches increasing exponentially from 2013 onwards (Fig. 1). Examples of applications of machine learning can be seen in various areas of nutrition research, including precision nutrition(Reference Kirk, Catal and Tekinerdogan2), malnutrition(Reference Janssen, Bouzembrak and Tekinerdogan3), obesity(Reference DeGregory, Kuiper, DeSilvio, Pleuss, Miller, Roginski, Fisher, Harness, Viswanath and Heymsfield4), food intake assessment(Reference Oliveira Chaves, Gomes Domingos, Louzada Fernandes, Ribeiro Cerqueira, Siqueira-Batista and Bressan5), diet recommendation(Reference Shah, Degadwala and Vyas6) and chatbots for nutritional support(Reference Oh, Zhang, Fang and Fukuoka7).
The growing interest in machine learning can be attributed to its appealing properties. Machine learning has the capability to automate tasks that would otherwise be performed manually, thereby freeing up human resources for other activities. Additionally, the different approaches and focuses involved in machine learning compared to traditional statistical methods bring the possibility to analyse data in new ways, which could lead to new scientific discoveries and, ultimately, improve individual and population health(Reference Rajpurkar, Chen, Banerjee and Topol8,Reference Badillo, Banfai, Birzele, Davydov, Hutchinson, Kam-Thong, Siebourg-Polster, Steiert and Zhang9) .
As with any research tool, proper use is essential to ensure the validity of the findings. Unfortunately, the enthusiasm around machine learning has led to adoption preceding a proper understanding of its workings by those applying it(Reference Lipton and Steinhardt10,Reference Vollmer, Mateen, Bohner, Király, Ghani, Jonsson, Cumbers, Jonas, McAllister and Myles11) . This has become apparent in various ways, including the application of machine learning on datasets to which it is not suited, inappropriate methodological choices in data processing steps, non-robust validation schemes, misinterpretation of results, and inadequately described methodology(Reference Vollmer, Mateen, Bohner, Király, Ghani, Jonsson, Cumbers, Jonas, McAllister and Myles11). The consequences of these issues and similar ones include false findings, models that do not generalise to unseen data (i.e. overfitting), and ultimately a reduction in the quality of the literature in the nutrition field.
Claims about the properties of machine learning are used to justify its use, with the considerations behind these claims sometimes neglected. For example, it is often claimed that machine learning techniques are more flexible and make fewer assumptions about the data than traditional statistical methods(Reference Bzdok, Altman and Krzywinski12). However, this does not mean that careful methodological planning and data processing are no longer necessary(Reference Badillo, Banfai, Birzele, Davydov, Hutchinson, Kam-Thong, Siebourg-Polster, Steiert and Zhang9,Reference Libbrecht and Noble13) . Even if fewer statistical assumptions are made by some of the algorithms, improper data processing can still lead to suboptimal results.
Machine learning approaches are also praised for their ability to handle high-dimensional datasets(Reference Bzdok, Altman and Krzywinski12,Reference Hastie, Tibshirani and Friedman14) . For example, ordinary least squares regression cannot be fit when the number of predictors exceeds that of the number of observations because a unique solution to the problem cannot be found(Reference Montgomery, Peck and Vining15). In contrast, machine learning regression algorithms generally allow this without any apparent problem, even though this can lead to overfitting and unstable feature importance estimates, which may go unnoticed unless checked by the analyst(Reference Hastie, Tibshirani and Friedman14). Indeed, the ease with which machine learning experiments can be performed by certain programmes or software libraries and the way in which the outputs are presented can give a false sense of certainty about the results that are generated. Without a better understanding of machine learning and its capabilities and limits, issues will persist in the literature.
To shed light on some of these issues, this review briefly discusses the concept of machine learning before going through steps in the machine learning process and describing good practices and common pitfalls or misconceptions in each. There is a particular focus on concepts of machine learning as it tends to be applied to modern datasets observed in the nutrition sciences, such as large cohorts and omics datasets. The goal is to increase awareness of important details of the machine learning process, promoting robust methodologies in research and enabling a better understanding when interpreting the work of others using machine learning.
Machine learning overview and advantages
Machine learning is a subdivision of artificial intelligence (AI) that learns patterns in a dataset to perform a given task without being explicitly programmed to do so(Reference Sarker16). Different types of machine learning exist, within which tasks are performed to achieve an objective.
The two most common types of machine learning in nutrition research are supervised and unsupervised machine learning. In supervised learning, data come with labels and thus the target is known. Regression and classification are the tasks performed to predict the output labels, using algorithms such as logistic regression, decision trees, random forest and support vector machines. In unsupervised learning, labels are not available and instead patterns or similarities within the data structure are sought. Tasks include clustering and dimensionality reduction, with example algorithms including k-means and principal component analysis (PCA).
In semi-supervised learning, some of the data (usually a small portion) have labels whereas others (usually a large portion) do not. A combination of both supervised and unsupervised tasks and algorithms may be applied. Reinforcement learning is the final type of learning in which the algorithm updates its behaviour based on feedback from a dynamic environment. Reinforcement learning is currently less often seen in nutrition research but is involved in chatbots and recommendation systems(Reference Theodore Armand, Nfor, Kim and Kim17–Reference Yau, Chong, Fan, Wu, Saleem and Lim19) and will likely become more common as personalised nutrition grows and chatbots improve. Detailed descriptions of machine learning types and the algorithms used to complete the tasks within them can be seen in Kirk et al.(Reference Kirk, Kok, Tufano, Tekinerdogan, Feskens and Camps1).
Machine learning approaches have certain attractive properties which have motivated their inclusion in scientific research methodologies. Being able to learn for themselves how to complete tasks without explicit programming brings the possibility of continuous improvement with the addition of new data(Reference Jordan and Mitchell20). Additionally, the principles of machine learning are not limited to single domains, which means that machine learning can be applied to many different problem areas (provided the data are suitable).
Machine learning can automate the jobs that have historically been undertaken by humans, particularly repetitive ones and those with elements of pattern recognition. One example of this is the research area involved in tracking food intake to improve the accuracy of food intake assessment while simultaneously reducing the burden for those doing so(Reference Oliveira Chaves, Gomes Domingos, Louzada Fernandes, Ribeiro Cerqueira, Siqueira-Batista and Bressan5). Many studies in this area make use of machine learning to model unstructured data such as videos of subjects eating(Reference Tufano, Lasschuijt, Chauhan, Feskens and Camps21), pictures of food(Reference Liu, Cao, Luo, Chen, Vokkarane and Ma22) or audio-based approaches(Reference Kalantarian and Sarrafzadeh23). Machine learning solutions are usually also much faster and can be permanently available, unlike human counterparts that may perform similar duties. For example, ChatGPT not only provided answers to common nutrition questions that scored higher than those from dietitians but could also do so instantly and at any moment of the day(Reference Kirk, Van Eijnatten and Camps24).
In comparison to traditional descriptive statistical methods, there is usually a focus on prediction on unseen data in machine learning(Reference Bzdok, Altman and Krzywinski12). Hence, in problem areas where prediction is more important than a detailed understanding of the contribution of a set of variables to an outcome, machine learning may be preferred. Unsupervised machine learning can be used for uncovering relationships within complex data structures even in the absence of predefined hypotheses. For example, clustering is often used to group individuals with similar characteristics who might have similar physiological responses to foods or nutritional interventions, such as in metabotyping studies(Reference Palmnäs, Brunius, Shi, Rostgaard-Hansen, Torres, González-Domínguez, Zamora-Ros, Ye, Halkjær and Tjønneland25). However, key characteristics of the groups, such as how many there might be, which features define them, and whether they exist at all, are not known beforehand.
Finally, machine learning techniques and traditional statistical methods are sometimes pitted against one another to compare which performs better in a certain problem area(Reference Chowdhury, Leung, Walker, Sikdar, O’Beirne, Quan and Turin26–Reference Suzuki, Yamashita, Sakama, Arita, Yagi, Otsuka, Semba, Kano, Matsuno and Kato32). Whilst such studies might be well-intentioned and simply wish to inform on the effectiveness of a given method, machine learning and traditional statistics should not be thought of as rivals competing for the same space. Instead, they should be seen as complementary tools with significant overlap, though also with distinct properties and use cases(Reference Bzdok, Altman and Krzywinski12,Reference Bzdok33,Reference Breiman34) . This is perhaps exemplified by techniques which could belong to either category depending on how they are used, such as Lasso(Reference Tibshirani35). When the goal is inference and the focus is on drawing conclusions about a population sample and describing underlying relationships within the data, this is a case for traditional statistical methods. When the goal is predictive performance on unseen data, machine learning techniques would be used(Reference Bzdok, Altman and Krzywinski12).
Steps in machine learning
The machine learning process is composed of a series of steps which start with collecting the data and end with deployment. These steps are described below, although in the context of research, interpreting (Interpretation) and describing the methodology and results (Reporting) are discussed in place of deployment. At each step, key points, good practices and common misconceptions or pitfalls within them are addressed, as summarised in Table 1.
Step 1. Data collection
Data collection is arguably the most important step in the machine learning process because of its influence in determining the quantity and quality of the data available for modelling(Reference Sarker16). Broadly, data collection can be done either with a specific research question in mind, for which the relevant data must be gathered, or with a dataset in mind, which can be used to investigate a variety of questions.
In the case of the former, it should be ensured that data collected are relevant to the problem and that the collection methods are capable of generating data of a sufficient quality(Reference Vollmer, Mateen, Bohner, Király, Ghani, Jonsson, Cumbers, Jonas, McAllister and Myles11). Whilst machine learning algorithms may be able to learn structures in datasets that are not apparent to humans, they cannot produce meaningful results from poor-quality data (akin to ‘garbage in, garbage out’). Similarly, machine learning algorithms require an adequate number of instances from which to learn, making sample size an important factor. The number of data points (i.e. individual observations or instances containing unique information) required to achieve adequate performance and reliable results varies depending on data quality, signal to noise ratio, and the machine learning task being performed. In general, however, sample sizes below around 100 are considered small for many supervised and some unsupervised machine learning approaches in nutrition research using biological data and may not provide enough instances from which the algorithms can learn. It is also important that the sample is representative of the population for which the final model is intended. When this is not the case, models may fail to generalise or struggle upon encountering observations with data that were absent in their training (e.g. Naïve Bayes)(Reference Wickramasinghe and Kalutarage36).
Alternatively, data may also be collected without a specific research question in mind and where the goal is to create a dataset of sufficient size and depth to answer a broad array of questions and remain relevant over a long period(Reference Setia37). It remains important that the sample is representative of the population of interest, meaning inclusion criteria must be carefully defined(Reference Wang and Kattan38). For example, the inclusion and exclusion criteria for a large cohort study must ensure that participants are representative of the target population and recruitment techniques must be selected in a way that minimises biases(Reference Song and Chung39). Various types of data should be collected to permit investigating questions on a broad range of topics and the methods used in their collection should be documented in detail(Reference Budach, Feuerpfeil, Moritz, Ihde, Nathansen, Noack, Patzlaff, Naumann and Harmouch40). Important considerations include the longevity of the data collected, questionnaire wording and response options, which data (variables) will be collected, data storage (both physically and digitally), privacy and ethical considerations, and documentation and metadata, amongst others.
There has been much focus in recent decades on improving machine learning algorithms or developing new ones(Reference Budach, Feuerpfeil, Moritz, Ihde, Nathansen, Noack, Patzlaff, Naumann and Harmouch40). Great strides have been made in this regard, and there are now many algorithms available for various machine learning tasks and problem areas(Reference Woodman and Mangoni41). Despite this, data quality remains the most important limiting factor on the performance of most machine learning applications. It is unlikely that future breakthroughs will occur solely through the development of new and improved algorithms; rather, there must be a focus on improving data quality through rigorous methods of data collection and processing (discussed below)(Reference Budach, Feuerpfeil, Moritz, Ihde, Nathansen, Noack, Patzlaff, Naumann and Harmouch40).
Step 2. Data processing
Once collected, data usually require processing. The methodological decisions made during data processing influence the data that are eventually used for modelling. This section describes common data processing steps that should be considered in a machine learning experiment. Importantly, some data processing steps should be performed within validation steps and not applied to the whole dataset in order to avoid information leakage (see below: Internal validation schemes in Step 3. Modelling). Attention is brought to these situations below.
Selecting observations
It is possible that not all of the observations in the dataset are suitable to be included in the analysis. Reasons for this could include outliers, repeated measures (from which only one is required) or subgroups with few observations, such as males in a predominantly female sample. Decisions regarding which observations should be included and excluded should be justified (e.g. based on good statistical grounds or findings from previous studies) and well documented when the methodology is described.
Pre-selecting features & dimensionality reduction
Whilst classic feature selection makes use of the outcome and uses statistical techniques to determine which features to include (see below: Step 3. Modelling), the analyst may also have to decide which features (or groups of features) are potentially relevant to the problem and therefore should be included for data processing. Where possible, domain knowledge should be used to justify the elimination of variables that are not expected to be relevant to the problem(Reference Badillo, Banfai, Birzele, Davydov, Hutchinson, Kam-Thong, Siebourg-Polster, Steiert and Zhang9,Reference Nguyen and Holmes42) . For example, when investigating cardiovascular disease risk using a large cohort, specific biochemical measures may be included, whereas others deemed insufficiently relevant can be excluded. Importantly, any feature selection steps that make use of the outcome variable must be performed within validation loops (see below: Internal validation schemes in Step 3. Modelling).
Dimensionality reduction refers to reducing the number of features irrespective of the outcome variable (i.e. unsupervised approaches), such as through identifying redundancies in the data(Reference Harrell43,Reference Sorzano, Vargas and Pascual-Montano44) . Reducing the dimensionality of a dataset can be desirable because it reduces model complexity, computation time, problems related to collinearity, and the risk of overfitting(Reference Nguyen and Holmes42). Low-variance features may not contain enough information to be useful to the problem and are sometimes removed or combined with other features, if appropriate(Reference Sarker16). However, this should be done with caution since loss of information is possible. For example, in microbiome studies it is common to see bacteria present in fewer than a given proportion of the sample removed (e.g. <5%). However, in some cases the presence of a given bacterium in a small group of individuals may be informative for health outcomes or the problem being investigated. Such findings may be lost due to low-variance filtering.
Techniques for identifying similarities or redundancies within the data can also be employed, such as correlation-based approaches, PCA and variable clustering(Reference Zebari, Abdulazeez, Zeebaree, Zebari and Saeed45,Reference Xu and Lee46) . It is commonly believed that it is not necessary to perform such unsupervised data reduction techniques within validation steps and, instead, that they can be safely applied to the entire dataset. This belief is based on the understanding that these techniques do not make use of the outcome variable, thus minimising the risk of information leakage(Reference Hastie, Tibshirani and Friedman14,Reference Sarker16) . However, recent findings suggest that unsupervised dimensionality reduction steps on the whole dataset can still introduce bias(Reference Moscovich and Rosset47). Whilst more work is needed on the topic, analysts may consider repeating analyses where dimensionality reduction is confined within training splits of validation steps to assess the sensitivity of results to the timing of these steps.
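As a minimal illustration of confining dimensionality reduction to the training data, the sketch below (Python with scikit-learn; the feature matrix X and outcome y are hypothetical placeholders, not data from any real study) places scaling and PCA inside a pipeline so that both are refitted on the training folds of each cross-validation split.

```python
# Minimal sketch: scaling and PCA are refitted on the training portion of
# every fold, so the held-out fold never influences the component loadings.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))      # hypothetical high-dimensional features
y = rng.integers(0, 2, size=120)    # hypothetical binary outcome

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"AUC: {scores.mean():.2f} ± {scores.std():.2f}")
```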
Processing missing data
Missing data is common in many datasets and comes in different forms, including missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR)(Reference Bennett48). Each reflects different underlying mechanisms: MCAR implies that the missingness is unrelated to any data, MAR suggests that the missingness is related to observed data but not the missing data themselves, and NMAR indicates that the missingness is related to the unobserved data(Reference Bennett48). These distinctions are crucial because they influence the methods used to handle missing data and the conclusions derived from the results.
One approach to missing data is to simply restrict the analyses to complete cases. This may be justified in certain circumstances, such as when data are MCAR and the number of observations with missing data is relatively small. However, when data are NMAR, removing observations with missing data can bias the results, and verifying the type of missing data in question is not always possible(Reference Sterne, White, Carlin, Spratt, Royston, Kenward, Wood and Carpenter49).
An alternative is to impute the missing data in various ways. Single imputation approaches with the mean, mode or other summary statistics are simple to implement, although they are almost never optimal and can distort the distribution of the features of the dataset and the relationships between them(Reference Zhang50,Reference Donders, van der Heijden, Stijnen and Moons51) . In contrast, multiple imputation uses distributions of the observed data to provide multiple plausible complete datasets, thus accounting for uncertainty in the missing values(Reference Sterne, White, Carlin, Spratt, Royston, Kenward, Wood and Carpenter49). Analyses are repeated on each dataset and the results are pooled to account for the variability due to imputation.
Finally, model-based approaches, such as regression and machine learning techniques (e.g. neural networks, k-nearest neighbours and random forest), use the observed data to predict missing values(Reference Jerez, Molina, García-Laencina, Alba, Ribelles, Martín and Franco52,Reference Emmanuel, Maupong, Mpoeleng, Semong, Mphago and Tabona53) . Various packages exist which facilitate the implementation of imputation techniques(Reference Yadav and Roychoudhury54). Imputation of missing data should be done within validation steps.
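A minimal sketch of keeping imputation inside the validation loop (Python/scikit-learn; the data and the choice of a k-nearest neighbours imputer are illustrative assumptions) is to place the imputer in a pipeline so that its parameters are learned from the training folds only.

```python
# Sketch: the imputer is fitted on the training folds within each
# cross-validation split, so the held-out fold does not leak into imputation.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))            # hypothetical features
X[rng.random(X.shape) < 0.10] = np.nan   # ~10% of values set missing for illustration
y = rng.integers(0, 2, size=200)         # hypothetical binary outcome

pipe = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),
    ("model", RandomForestClassifier(n_estimators=200, random_state=0)),
])
print(cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean())
```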
Processing outliers
Outliers can arise due to a variety of reasons, including equipment malfunction, human error during data entry, or extreme (but valid) data measurements(Reference Dash, Behera, Dehuri and Ghosh55). Regardless of their origin, the presence of outliers should be investigated during exploratory data analysis and, if necessary, appropriate action should be taken to account for them(Reference Badillo, Banfai, Birzele, Davydov, Hutchinson, Kam-Thong, Siebourg-Polster, Steiert and Zhang9). Outliers can sometimes be detected through manual observation of features through descriptive characteristics or plots(Reference Kuhn and Johnson56). For example, if the median LDL cholesterol level in a sample were 2.5 mmol/L, with an interquartile range of 1 mmol/L but a maximum value of 25 mmol/L, it would be likely that one or more values were unrealistically high, possibly reflecting a mistake during manual entry or measuring equipment malfunction. Other times, however, what defines an outlier is less clear, and even in cases where extreme values are found, it is not always the case that some type of treatment is required.
Whilst general approaches based on statistical properties can be simple and easy to implement (e.g. values greater than the third quartile plus 1.5 × the interquartile range)(Reference Dash, Behera, Dehuri and Ghosh55), they often do not consider the characteristics of the data or the problem at hand and are unjustified in many situations. For example, a small number of people may report a much higher income than the rest in a sample; however, these may be perfectly valid observations that do not require corrective action, despite their appearance in descriptive statistics or plots. Whenever possible, domain knowledge and additional related information should be used to guide decision-making for identifying and processing outliers, and this should be documented in the methodology.
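For illustration only, such a rule takes a few lines to apply (Python/pandas; the income values are hypothetical), which underlines how mechanical it is and why flagged values should be treated as candidates for review rather than automatic exclusion.

```python
# Sketch: flag values above Q3 + 1.5*IQR; flagged points are candidates
# for review using domain knowledge, not automatic removal.
import pandas as pd

income = pd.Series([28, 31, 35, 40, 42, 45, 250])   # hypothetical income values
q1, q3 = income.quantile([0.25, 0.75])
upper_limit = q3 + 1.5 * (q3 - q1)
print(income[income > upper_limit])
```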
If observations are determined to be outliers, if and how they should be dealt with should be based on logical grounds and on a case-by-case basis, whenever possible. When this is not possible, such as when features cannot be easily interpreted or in high-dimensional settings, identifying and dealing with outliers should be done carefully to avoid loss of information or risk of introducing bias.
Finally, the processing of outliers may be done on either the whole dataset or within validation steps, depending on the purpose. For instance, approaches to correcting genuine errors in the dataset, such as mistaken data entries, can be performed on the whole dataset as there is little risk of introducing bias. However, if characteristics of the data are used to identify and correct potential outliers, this should be done within validation steps.
Transforming features
Feature transformation, such as normalisation or standardisation, is another common data processing step and may be required by some algorithms when the features are on different scales. Examples include the regularised regression techniques Lasso and Ridge regression, where the scale of the features is relevant to the penalty term(Reference Tibshirani35), and k-nearest neighbours, where the scale of the features is important for distance calculations(Reference Peterson57). Whether transformation is required for the algorithms used for data analysis should be known by the analyst beforehand. However, it should be noted that the choice of transformation technique used can have a significant impact on the results and should not be an arbitrary decision(Reference Singh and Singh58–Reference van den Berg, Hoefsloot, Westerhuis, Smilde and van der Werf60). Hence, if transformation techniques are required, it may be warranted to investigate how the results change with different transformation techniques in order to assess the sensitivity of the findings. Transformations that depend on the data distribution (e.g., Z-score normalisation) should be performed within validation steps to prevent data leakage.
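As a minimal sketch of keeping a data-dependent transformation inside the training data (Python/scikit-learn; X and y are hypothetical placeholders), the scaler below is fitted on the training split and then applied, without refitting, to the test split.

```python
# Sketch: the scaler learns means and standard deviations from the training
# data only; the test data are transformed with those training parameters.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=2.0, size=(150, 4))   # hypothetical features
y = rng.integers(0, 2, size=150)                    # hypothetical outcome

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)   # fitted on training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)    # no refitting on the test data
```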
Discretisation of continuous features
This refers to converting features on a continuous scale to categorical ones. A common example in nutrition is the conversion of BMI in kg/m² to underweight, healthy weight, overweight and obese. Despite the dangers of this practice being well-documented and long-known(Reference Harrell43,Reference Naggara, Raymond, Guilbert, Roy, Weill and Altman61–Reference Sauerbrei and Royston65) , it remains widespread in the literature.
Discretisation almost inevitably leads to a loss of information, which can hinder the predictive capacity of the model(Reference Naggara, Raymond, Guilbert, Roy, Weill and Altman61,Reference Royston, Altman and Sauerbrei64) . The fewer the number of levels (or bins) in the newly formed category, the greater the loss of information. Another consequence of discretisation is the introduction of step functions where the response ‘jumps’ when moving from one level to the next within a category(Reference Harrell43). Unless this is justified (e.g. the pH at which a biological reaction occurs), such situations are usually undesirable and, in biological settings, may be unrealistic(Reference Sauerbrei and Royston65).
The decision regarding the number of bins to use (‘binning’) and the numerical limits which define them is also problematic(Reference Royston, Altman and Sauerbrei64). This is sometimes done in terms of quantiles (e.g. quartiles) or groups that make sense to humans (e.g. age groups of 40–49 years, 50–59 years, etc.), even though such groups may not make sense for the problem at hand(Reference Altman62,Reference Bennette and Vickers63) . An additional problem related to binning concerns the sensitivity of the results to the limits defining the bins. In the absence of well-defined cut-points based on prior work(Reference Naggara, Raymond, Guilbert, Roy, Weill and Altman61), this decision is left in the hands of the analyst. Unintentionally or otherwise, this can lead to cut-points being selected that support hypotheses or iteratively trying enough combinations of bins and cut-points until favourable results are found. The consequences of such practices include an increased chance of spurious findings, false positives and biased results(Reference Naggara, Raymond, Guilbert, Roy, Weill and Altman61,Reference Sauerbrei and Royston65,Reference Heavner, Phillips, Burstyn and Hare66) .
Finally, discretisation of the outcome variable can influence the information that can be obtained from the results. Vastly different observations may be grouped together, whereas observations with only minimal differences may end up in different levels(Reference Naggara, Raymond, Guilbert, Roy, Weill and Altman61,Reference Bennette and Vickers63) . For example, discretising blood pressure measurements into levels of hypertension (e.g. normotensive, pre-hypertension and hypertension) could lead to individuals with dangerously high blood pressure and those who are barely hypertensive being classified into the same group, whereas those only 1 mmHg apart could be assumed to have different risks(Reference Naggara, Raymond, Guilbert, Roy, Weill and Altman61). This can have important consequences for the conclusions derived from the results, resource allocation, and treatment or intervention options.
Step 3. Modelling
The algorithms suitable for a specific problem depend on the machine learning task it involves. Various options exist for regression, classification, clustering, dimensionality reduction, and reinforcement learning, and some algorithms can be used for multiple tasks(Reference Woodman and Mangoni41).
The myriad options available can be overwhelming, and it is usually difficult to deduce on purely theoretical grounds which methods will perform best on a given problem beforehand(Reference Sarker16). For this reason, it is common to implement different ones and compare their performances(Reference Kirk, Catal and Tekinerdogan67). One important consideration in doing this is making sure that hyperparameter tuning (discussed below) takes place on a level playing field. Failure to ensure this risks providing an unfair representation of the results since some algorithms may perform comparatively well with little or no tuning (e.g. random forest(Reference Probst, Wright and Boulesteix68)), whereas others can be more sensitive to this. Similarly, the data used for training and testing should be the same for each candidate algorithm.
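One way to keep the comparison on a level playing field is to evaluate every candidate algorithm on identical cross-validation splits, as in the sketch below (Python/scikit-learn; the data are hypothetical and, for brevity, no hyperparameter tuning is shown, although in practice each candidate would be tuned within the same scheme).

```python
# Sketch: the same StratifiedKFold splits are reused for every candidate,
# so score differences reflect the algorithms rather than the data splits.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 12))      # hypothetical features
y = rng.integers(0, 2, size=200)    # hypothetical binary outcome

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "support vector machine": SVC(),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```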
Sometimes the performance between different models may differ only slightly and be of little practical relevance. Whilst a given model may achieve the best performance in a given experiment, a certain amount of variability should always be expected and the possibility that the performance of the other models used for comparison could have been different had a different data sample been used cannot be ruled out. Hence, analysts should avoid overemphasising small performance differences and instead set an acceptable limit of tolerance (determined in the context of the problem and possibly set in advance and described in the methodology) within which various models could be considered. If formal testing is preferred, statistical tests can also be used to compare the performance of different machine learning models(Reference Rainio, Teuho and Klén69).
Prediction quality on scoring metrics, such as accuracy or mean squared error (see below: Step 4. Evaluation), is an important factor in selecting a machine learning model, but it is not the only one. Interpretability and calculation speed can also motivate the selection of machine learning algorithms. For example, ensemble methods are a group of algorithms that combine the results of many individual learners as their final output(Reference Mienye and Sun70), with notable examples including random forest(Reference Breiman71) and XGBoost(Reference Chen and Guestrin72). By aggregating multiple individual learners, ensemble methods are less sensitive to the errors that individual learners make and thus tend to have lower variance and, in general, perform well on unseen data compared to methods which only rely on single learners(Reference Dong, Yu, Cao, Shi and Ma73). However, this comes at the cost of longer fitting times and lower interpretability, which may motivate the selection of simpler methods (even if predictive accuracy is lower).
Hyperparameter tuning
Hyperparameters are modifiable parameters that affect the learning process of an algorithm. Examples include maximum depth on decision trees and the learning rate in neural networks. How these hyperparameters are tuned can have a significant impact on model performance. The tuning process involves fitting many different models, each with different hyperparameter configurations, to see which set of configurations leads to the best performance. Grid search and random search(Reference Bergstra, Ca and Ca74) are common and relatively basic optimisation techniques, though more advanced techniques also include Bayesian Optimisation, Hyperband, and evolution strategies(Reference Bischl, Binder, Lang, Pielok, Richter, Coors, Thomas, Ullmann, Becker and Boulesteix75).
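A minimal sketch of hyperparameter tuning (Python/scikit-learn; the search space, budget and data are illustrative assumptions) using random search evaluated by cross-validation is shown below.

```python
# Sketch: random search samples hyperparameter configurations and scores
# each by cross-validation; the best configuration is then refitted.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))      # hypothetical features
y = rng.integers(0, 2, size=200)    # hypothetical binary outcome

param_distributions = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 3, 5, 10],
    "min_samples_leaf": [1, 2, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0), param_distributions,
    n_iter=20, cv=5, scoring="roc_auc", random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```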
Feature selection
Feature selection aims to reduce the initial feature space by eliminating features which are less important to the model output, ideally while preserving predictive performance(Reference Chandrashekar and Sahin76). This is increasingly sought-after as data in the modern world are being collected on a wide range of variables, sometimes from only a small number of samples. For example, thousands of microbes can be collected from each individual in gut microbiome studies, yet such studies rarely have so many participants.
Many feature selection approaches exist(Reference Theng and Bhoyar77,Reference Pudjihartono, Fadason, Kempa-Liehr and O’Sullivan78) , although the difficulty of the task of feature selection in the high-dimensional setting is often underappreciated(Reference Fan and Lv79). Especially when the number of features is much higher than the number of observations, identifying features which generalise to unseen data and distinguishing those that are relevant to the problem from those that are not can be challenging(Reference Fan and Lv79–Reference Saeys, Inza and Larrañaga81).
An additional challenge is feature selection stability, which refers to how sensitive the selected features are to perturbations of the data(Reference Khaire and Dhanalakshmi82). Ideally, the feature subset would contain features that are selected across multiple repeats of feature selection, each using different splits of the data for training and testing. Feature selection stability can be estimated by repeating feature selection across multiple different subsamples of the data, as part of robust internal validation schemes(Reference Saeys, Inza and Larrañaga81,Reference Khaire and Dhanalakshmi82) (as described below).
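A crude sketch of estimating selection stability (Python/scikit-learn; the data, the univariate filter and the number of resamples are illustrative assumptions) counts how often each feature is selected across repeated training subsamples.

```python
# Sketch: repeat a simple univariate feature selection over many random
# training subsamples and record how often each feature is selected.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 50))      # hypothetical: many features, few samples
y = rng.integers(0, 2, size=100)

counts = np.zeros(X.shape[1])
splitter = StratifiedShuffleSplit(n_splits=50, test_size=0.3, random_state=0)
for train_idx, _ in splitter.split(X, y):
    selector = SelectKBest(f_classif, k=10).fit(X[train_idx], y[train_idx])
    counts[selector.get_support()] += 1

selection_frequency = counts / 50
print(np.sort(selection_frequency)[::-1][:10])   # most frequently selected features
```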
Internal validation schemes
Many techniques for hyperparameter tuning, feature selection and data processing steps make use of the outcome variable, meaning there is risk of information leakage(Reference Boehmke and Greenwell83,Reference Whalen, Schreiber, Noble and Pollard84) . Information leakage describes the situation in which the same observations in a dataset are used to both construct and evaluate the model, resulting in an optimistic evaluation of model performance on unseen data(Reference Boehmke and Greenwell83,Reference Whalen, Schreiber, Noble and Pollard84) .
To minimise information leakage, robust internal validation schemes preserve an entirely unseen portion of the data that was not involved in data processing or model-building for validation. Various approaches exist for this and are discussed below (see below: Step 4. Evaluation), though in general they involve splitting the data for training, where data processing, hyperparameter optimisation and feature selection are performed, and testing, where evaluation is performed. This procedure is then repeated multiple times to estimate uncertainty and stability in the results(Reference Lones85). This can protect against spurious findings and make the results more robust. A visual representation of how a robust internal validation scheme might look can be seen in Fig. 2. Examples of machine learning experiments with good internal validation schemes can be seen in(Reference Acharjee, Finkers, Gf Visser and Maliepaard86,Reference Acharjee, Larkman, Xu, Cardoso and Gkoutos87) .
Step 4. Evaluation
The purpose of model evaluation is to assess how well the model has performed its task, usually with a focus on performance on unseen data. Internal validation approaches divide the dataset into different portions for model-building and evaluation and would, ideally, be followed by external validation, where the models developed on the original dataset are validated on an unrelated dataset(Reference Kirk, Kok, Tufano, Tekinerdogan, Feskens and Camps1). Below, some common validation techniques are introduced, along with metrics by which optimisation occurs and model performance is evaluated.
Metrics
The metrics by which machine learning models are evaluated are important since they not only reveal how a model performs from different perspectives but also determine how models are optimised(Reference Dinga, Penninx, Veltman, Schmaal and Marquand88). Hence, it is important that metrics are chosen that reflect the desired performance of the models with respect to the problem(Reference Vollmer, Mateen, Bohner, Király, Ghani, Jonsson, Cumbers, Jonas, McAllister and Myles11). In some cases, performance with respect to some metrics may be more important than others. For example, in a classification problem, it may be more important that all true cases are correctly identified even at the expense of incorrectly labelling true negatives as positives (i.e. high sensitivity, low specificity), such as in identifying children at risk of obesity for targeted advertisement for events at a local sports centre. In other circumstances, false positives can be costly, such as when cases predicted as positive are selected to undergo surgery. In this case, a low specificity could lead to unnecessary surgery and related complications. Alternatively, if there is no particular preference for one metric over others, composite metrics which consider multiple aspects of performance (e.g. F1 score(Reference Dinga, Penninx, Veltman, Schmaal and Marquand88)) may be preferred.
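As a small illustration of how different metrics reward different behaviour (Python/scikit-learn; the labels and predictions are hypothetical), sensitivity, specificity and the F1 score can be computed from the confusion matrix.

```python
# Sketch: the same predictions give different impressions depending on
# whether sensitivity, specificity or a composite metric is reported.
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]   # hypothetical true labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]   # hypothetical predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # proportion of true positives identified
specificity = tn / (tn + fp)   # proportion of true negatives identified
print(sensitivity, specificity, f1_score(y_true, y_pred))
```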
Similar differences exist for metrics in regression. For example, mean-squared error (MSE) and root MSE (RMSE) punish larger errors more than mean absolute error (MAE), which can be desirable in some cases(Reference Naser and Alavi89). The coefficient of determination, R-squared, on the other hand, measures the proportion of the variation in the outcome that is explained by the features(Reference Chicco, Warrens and Jurman90). Because R-squared is unitless and bounded above by 1, it is easier to compare performance between different datasets that may have different variables on different scales(Reference Chicco, Warrens and Jurman90). However, it should be noted that it is not always the case that optimal performance with respect to one single metric is desired; instead, models may be evaluated across different metrics with each evaluating different aspects of performance(Reference Chicco, Warrens and Jurman90).
Whereas in supervised machine learning labels are available for the data, providing a ground truth on which to score predictions, this is not the case with unsupervised machine learning, which uses different metrics for assessing model performance. The exact metrics depend on the specific unsupervised task and often involve measures of homogeneity or dissimilarity. For example, for clustering, one of the most common unsupervised machine learning tasks, metrics include silhouette score, Calinski-Harabasz coefficient, and Dunn index(Reference Palacio-Niño and Berzal91). The different ways in which these metrics determine the quality of clustering can lead them to arrive at different optimal solutions. Hence, unless there is rationale to prefer one over another, it may be interesting in unsupervised approaches to explore how results differ across different metrics and how this influences the eventual conclusions drawn from the machine learning experiment.
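As a brief sketch (Python/scikit-learn; the data are hypothetical), candidate numbers of clusters can be compared under two internal metrics, which need not agree on the best solution.

```python
# Sketch: different internal clustering metrics may favour different
# numbers of clusters for the same data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 5))   # hypothetical unlabelled data

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3),
          round(calinski_harabasz_score(X, labels), 1))
```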
Validation procedures
Train-test split
Splitting the dataset into a portion for training (model-building) and a portion for testing (evaluating model performance) is the most basic way in which a model can be evaluated(Reference Kirk, Kok, Tufano, Tekinerdogan, Feskens and Camps1). The data may be split entirely randomly or in a way that maintains certain aspects of the data characteristics in each portion, such as ensuring the same proportion of cases and controls in the training and test data. The cost of this simplicity is that the results obtained from a train-test split can be highly dependent on how the dataset was split, which can lead to significant variability in performance metrics, especially with smaller datasets. While train-test splits are more reliable on large datasets or those with good signal-to-noise ratios, alternative methods such as cross-validation (discussed below) offer a more robust evaluation of model performance. If train-test splits are to be used, they should be repeated in order to increase the stability of the results(Reference Raschka and Mirjalili92).
The importance of repeating validation techniques with different subsamples of the data is shown in Fig. 3, which uses a decision tree classifier with the Pima Indians Diabetes dataset(Reference Smith, Everhart, Dickson, Knowler and Johannes93) to demonstrate the effect of validation technique, the number of times it is repeated, and sample size on the stability and uncertainty of the results. The code for the analysis and location of the dataset can be seen in the supplementary code. Fig. 3 makes clear the importance of repeating validation techniques on smaller datasets in order to increase the certainty of the results. However, this is not always observed in the literature, despite the fact that relatively small samples are often used in machine learning experiments.
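The principle can be reproduced with a minimal sketch (Python/scikit-learn; this is not the supplementary code, and the data here are random placeholders rather than the Pima Indians Diabetes dataset) in which the same classifier is evaluated over many random train-test splits and the spread of the test scores is summarised.

```python
# Sketch: repeating random train-test splits shows how much a single
# split's test score can vary, especially for small datasets.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(7)
X = rng.normal(size=(150, 8))       # hypothetical features
y = rng.integers(0, 2, size=150)    # hypothetical binary outcome

scores = []
for seed in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

print(np.mean(scores), np.std(scores), np.min(scores), np.max(scores))
```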
Cross-validation
Cross-validation approaches are a group of resampling procedures that can be used for model selection and to estimate the generalisability of the model(Reference Hastie, Tibshirani and Friedman14). In k-fold cross-validation, the dataset is split into k folds so that each observation is used once for testing and k-1 times for training(Reference Hastie, Tibshirani and Friedman14). The number of folds k is often chosen based on computational efficiency and a suitable bias-variance trade-off, with values generally between 5 and 10 being used in practice(Reference Raschka and Mirjalili92). Different variations of cross-validation exist, such as leave-one-out cross-validation, stratified cross-validation (which preserves class proportions across folds) and grouped cross-validation (which accounts for dependent observations) (see Kirk et al.(Reference Kirk, Kok, Tufano, Tekinerdogan, Feskens and Camps1)). Unlike the train-test split, cross-validation produces k test scores, which may be presented individually (as in Fig. 3) or aggregated into summary statistics (e.g. mean across all test folds).
Cross-validation is an improvement over a single train-test split because all of the data are used for training and testing, making the results less sensitive to any single split of the data. Even so, cross-validation can still be sensitive to how the data are split within each fold, and test scores between folds can differ greatly, especially with smaller datasets. To account for this, cross-validation can be repeated multiple times over, ensuring that on each repeat different splits of the data are used within each fold (Fig. 2)(Reference Lones85,Reference Raschka and Mirjalili92) .
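A minimal sketch of repeated cross-validation (Python/scikit-learn; the data and model are hypothetical placeholders) is shown below.

```python
# Sketch: repeated stratified k-fold cross-validation; each repeat uses a
# different random fold assignment, and all fold scores are summarised.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(150, 8))       # hypothetical features
y = rng.integers(0, 2, size=150)    # hypothetical binary outcome

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(len(scores), scores.mean(), scores.std())   # 100 fold scores in total
```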
Nested cross-validation
One concern with cross-validation is that data processing steps, hyperparameter tuning and feature selection are performed on the same data used to evaluate the model performance. Nested cross-validation deals with this by adding another cross-validation loop (known as the inner loop) within the training data of each fold of the outer loop (see Fig. 2). The inner loop is used for tuning hyperparameters and feature selection, and this is then validated on the test data of the outer loop (for details see(Reference Wainer and Cawley94)). This ensures a portion of the data that was not involved in any part of the model-building process is available to estimate performance on unseen data.
Nested cross-validation reduces the chance of information leakage and allows for an unbiased estimation of model performance(Reference Cawley and Talbot95). However, in doing so it greatly increases computational time, and whether this justifies the reduction in bias has been called into question(Reference Wainer and Cawley94). Analysts may prefer to first perform traditional cross-validation and then, if the results appear promising, validate that the findings are not the result of optimism due to information leakage by using nested cross-validation. This can avoid wasting time and computational resources on machine learning experiments that would not be fruitful anyway.
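A minimal sketch of nested cross-validation (Python/scikit-learn; the algorithm, search grid and data are illustrative assumptions) tunes hyperparameters in an inner loop while estimating performance on the outer test folds.

```python
# Sketch: GridSearchCV performs the inner tuning loop; cross_val_score
# wraps it in an outer loop whose test folds never inform the tuning.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score)

rng = np.random.default_rng(9)
X = rng.normal(size=(150, 10))      # hypothetical features
y = rng.integers(0, 2, size=150)    # hypothetical binary outcome

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

tuned_model = GridSearchCV(
    SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
    cv=inner_cv, scoring="roc_auc")

outer_scores = cross_val_score(tuned_model, X, y, cv=outer_cv,
                               scoring="roc_auc")
print(outer_scores.mean(), outer_scores.std())
```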
Calibration
An important yet often overlooked concept for supervised classifiers is calibration, which refers to the alignment between the predicted probabilities and the observed outcomes(Reference Steyerberg and Harrell96). For example, a classifier that is trained to predict diabetes would be well-calibrated if the observed occurrence of diabetes was close to 10% for those whose predicted probabilities were close to 10%. Model miscalibration may not always be apparent in internal validation, but low calibration can mislead expectations and be problematic when looking at high or low-risk groups (where it may be needed most) or external datasets(Reference Van Calster, McLernon, Van Smeden, Wynants, Steyerberg, Bossuyt, Collins, MacAskill, Moons and Vickers97). This can have important implications for the action taken in response to the predictions made by the model.
In contrast to predicting class labels directly, there are advantages to working with predicted probabilities(Reference Van Calster, McLernon, Van Smeden, Wynants, Steyerberg, Bossuyt, Collins, MacAskill, Moons and Vickers97). Firstly, it is often more relevant to know the chance of an event occurring rather than a class label without further context. For example, whilst two individuals may both be predicted to belong to the same class, the probability for one being 55% and the other being 95% shows a clear difference in the confidence of their predicted class membership. This can have important implications, such as how resources are allocated in response to model predictions (e.g. children with a higher predicted probability of malnutrition are prioritised for corrective nutritional intervention). Using probabilities also means that custom thresholds can be more easily set and adjusted, which can be desirable when the cost or benefit of correct classification is not the same as that for misclassification. Finally, probabilities are also inherently more interpretable than class labels, which is a desired property of machine learning procedures(Reference Doshi-Velez and Kim98).
Miscalibration may occur due to the data or the model used to fit the data (i.e. overfitting)(Reference Van Calster, McLernon, Van Smeden, Wynants, Steyerberg, Bossuyt, Collins, MacAskill, Moons and Vickers97). It is most often diagnosed by plotting the predicted probabilities against the observed frequencies (known as calibration curves or reliability plots), with a straight-line y=x representing perfect calibration(Reference Dormann99). Due to differences in how they operate, some models drive probabilities away from 0 and 1 (e.g. support vector machines), whereas others more readily predict probabilities at the extremes (e.g. Naïve Bayes) and others still are naturally well-calibrated (e.g. logistic regression)(Reference Kull, De Menezes, Filho and Flach100,Reference Niculescu-Mizil and Caruana101) . Following the identification of miscalibration, calibration correction techniques can be used. Two well-known approaches are Platt scaling(Reference Platt102), which is used for those with S-shaped calibration curves, such as those seen for support vector machines, and isotonic regression, which is capable of modelling more complex shapes but with an increased risk of overfitting(Reference Niculescu-Mizil and Caruana101).
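As a brief sketch (Python/scikit-learn; the synthetic data and the choice of a support vector machine are illustrative assumptions), calibration can be inspected with a reliability curve and corrected with Platt scaling applied through cross-validated calibration.

```python
# Sketch: Platt scaling (method="sigmoid") applied to a support vector
# machine, with a reliability curve computed on held-out data.
import numpy as np
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(10)
X = rng.normal(size=(400, 6))                                    # hypothetical features
y = (X[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)  # synthetic outcome

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

calibrated = CalibratedClassifierCV(SVC(), method="sigmoid", cv=5).fit(X_tr, y_tr)
prob_pred = calibrated.predict_proba(X_te)[:, 1]

prob_true, prob_mean = calibration_curve(y_te, prob_pred, n_bins=5)
print(np.column_stack([prob_mean, prob_true]))   # close to y = x if well calibrated
```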
External validation
A primary goal of machine learning applications is to make predictions using new data that were not available during training. While good internal validation schemes provide an estimate of how well prediction models do this, they can still be optimistically biased(Reference Vollmer, Mateen, Bohner, Király, Ghani, Jonsson, Cumbers, Jonas, McAllister and Myles11,Reference Ho, Phua, Wong and Bin Goh103) . External validation involves validating the generalisability of the model on a dataset that reflects the target population but comes from another source(Reference Vollmer, Mateen, Bohner, Király, Ghani, Jonsson, Cumbers, Jonas, McAllister and Myles11,Reference Steyerberg and Harrell96,Reference Ho, Phua, Wong and Bin Goh103–Reference Siontis, Tzoulaki, Castaldi and Ioannidis107) .
External datasets may differ in key characteristics such as the location, time and methods of data collection, as well as the individuals responsible for collecting the data. However, it is crucial that key features remain constant. For instance, when externally validating a model predicting disease outcomes, it is imperative that the disease is defined in the same way in both datasets to prevent differences in model performance reflecting disease definitions rather than generalisability. To further enhance robustness and reduce the chance of optimistic bias, external validation may be performed by an independent research group that was not involved in model development or data collection for the initial dataset(Reference Vollmer, Mateen, Bohner, Király, Ghani, Jonsson, Cumbers, Jonas, McAllister and Myles11).
It is worth noting that some studies erroneously claim that external validation was used, when in fact their ‘external validation’ set is simply a test set or an extension of the original dataset (e.g.(Reference Heshiki, Vazquez-Uribe, Li, Ni, Quainoo, Imamovic, Li, Sørensen, Chow and Weiss108)). This is sometimes a semantic issue rather than intentional misrepresentation(Reference Siontis, Tzoulaki, Castaldi and Ioannidis107), but care should be taken when interpreting such results because external validation is a stronger indication of generalisability than internal validation with a larger dataset. Even so, it should be kept in mind that good external validation performance does not prove generalisability(Reference Cabitza, Campagner, Soares, García de Guadiana-Romualdo, Challa, Sulejmani, Seghezzi and Carobene105,Reference Van Calster, Steyerberg, Wynants and van Smeden109,Reference la Roi-Teeuw, van Royen, de Hond, Zahra, de Vries, Bartels, Carriero, van Doorn, Dunias and Kant110) . External validation performance is still dependent on the external dataset used, and conclusions made must bear the characteristics of this dataset (e.g. location, time, sample characteristics, etc.) in mind. As external validation is repeated on a greater number of external datasets that are more different from the dataset used to develop the model, confidence in the generalisability of the model increases(Reference Van Calster, Steyerberg, Wynants and van Smeden109,Reference la Roi-Teeuw, van Royen, de Hond, Zahra, de Vries, Bartels, Carriero, van Doorn, Dunias and Kant110) .
Step 5. Interpretation
Metrics by which machine learning models are optimised were introduced in Step 4. Evaluation, though it is also important that they are correctly interpreted and presented. For example, the simplest and most common metric for evaluating classifier performance is accuracy; however, accuracy can be deceptive. This is particularly evident in imbalanced datasets (i.e. where the proportion of one class is much higher than the other), where a classifier which always predicts the majority class (irrespective of the data it receives) can score highly on accuracy, despite having no predictive capacity(Reference Dinga, Penninx, Veltman, Schmaal and Marquand88). Similar considerations exist for other metrics, such as MSE for regression, which is dependent on the scale of the outcome variable, and Area Under the Receiver Operating Characteristic Curve (AUC-ROC), for which scores of around 0.65 are sometimes viewed positively, despite 0.5 being what could be expected with random guessing.
The interpretation of machine learning results can be made complicated when using complex validation schemes. For example, in comparison to a single train-test split, in which metrics from one portion of test data can be easily understood and reported, machine learning experiments which involve multiple repeats of train-test split or cross-validation can have scores from many test sets which may need to be summarised concisely to be understood. Summary statistics on the results can be useful, such as reporting the mean, median, range and interquartile range, provided there are enough test scores for such statistics to be meaningful. Plots of results, such as those seen in Fig. 3, can also be useful to present many results at once without loss of information.
Feature importance
It is often desirable to know which features were important for machine learning models in generating their output, even though there is usually no ground truth for feature importance and the concept itself is poorly defined(Reference Linardatos, Papastefanopoulos and Kotsiantis111). Different approaches estimate feature importance in different ways, often arriving at different conclusions(Reference Molnar, König, Herbinger, Freiesleben, Dandl, Scholbeck, Casalicchio, Grosse-Wentrup and Bischl112), though popular approaches with good statistical properties include SHAP(Reference Lundberg, Allen and Lee113), LIME(Reference Ribeiro, Singh and Guestrin114), and permutation-based feature importance(Reference Altmann, Toloşi, Sander and Lengauer115), amongst others(Reference Linardatos, Papastefanopoulos and Kotsiantis111).
Some algorithms can provide feature importance estimates as part of their architecture (so-called ‘built-in’ feature importance). Such built-in feature importance estimates, whilst convenient, can come with significant drawbacks. This is particularly well-described for random forest(Reference Breiman71), which can provide biased results based on the scale of continuous features and number of categories in categorical ones, as well as when multicollinearity is present(Reference Strobl, Boulesteix, Zeileis and Hothorn116–Reference Toloşi and Lengauer118). While no feature importance method is perfect(Reference Linardatos, Papastefanopoulos and Kotsiantis111), it is concerning how often random forest-derived feature importances are reported in scientific literature without consideration for their potential limitations or how the results may be different had other feature importance techniques been used(Reference Kirk, Catal and Tekinerdogan2), creating a false sense of certainty for feature importance estimates calculated with this approach.
Inappropriately calculated feature importance estimates can lead to both false positives (i.e. unimportant features falsely identified as important) and false negatives (i.e. important features falsely identified as unimportant), both of which reduce the quality of the literature. In response to this, analysts should first think carefully about how their choice of feature importance technique relates to their data(Reference Molnar, König, Herbinger, Freiesleben, Dandl, Scholbeck, Casalicchio, Grosse-Wentrup and Bischl112). It should be known if there are feature interactions or collinearity present, and how the chosen feature importance techniques may be affected by this(Reference Linardatos, Papastefanopoulos and Kotsiantis111). Analysts should also be open to comparing results across various suitable techniques and place their findings in the context of the model and explainable AI technique used(Reference Saarela and Jauhiainen119), rather than making conclusions about feature importance in a general sense.
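As an illustrative sketch of comparing techniques (Python/scikit-learn; the synthetic data are hypothetical), built-in impurity-based importances from a random forest can be set alongside permutation importances computed on held-out data.

```python
# Sketch: impurity-based importances come from the fitted forest itself,
# whereas permutation importances are computed on held-out data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)
X = rng.normal(size=(300, 10))                                             # hypothetical features
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=300) > 0).astype(int)  # synthetic outcome

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

builtin = model.feature_importances_                  # impurity-based
perm = permutation_importance(model, X_te, y_te, n_repeats=30, random_state=0)
print(np.column_stack([builtin, perm.importances_mean]))
```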
Additionally, an important but sometimes underappreciated fact is that feature importance estimates on models with poor predictive performance cannot be trusted, and analysts should resist reporting feature importance estimates in such cases(Reference Molnar, König, Herbinger, Freiesleben, Dandl, Scholbeck, Casalicchio, Grosse-Wentrup and Bischl112,Reference Murdoch, Singh, Kumbier, Abbasi-Asl and Yu120) . Moreover, similarly to the results from predictions generated by machine learning models, feature importance estimates may also be dependent on how the dataset was split for training and testing(Reference Wang, Yang and Luo121,Reference Kalousis, Prados and Hilario122) . It can be informative to estimate the stability of feature importances by seeing how they change across many repeats of model fitting on different splits of the data(Reference Linardatos, Papastefanopoulos and Kotsiantis111,Reference Wang, Yang and Luo121,Reference Kalousis, Prados and Hilario122) . Finally, the subfield of explainable AI is not without issues and has been criticised for the consequences of unfaithful explanations, complex explanations, failing to consider that important features may change over time, and ambiguity regarding terminology(Reference Rudin123–Reference Yang, Wei, Wei, Chen, Huang, Li, Li, Yao, Wang and Gu125). Hence, researchers should be aware of the shortcomings of explainable AI techniques they decide to use and the potential costs of these in the context of the problem.
Step 6. Reporting
Reporting refers to describing both the methodology of the machine learning experiment and reporting the results obtained. There have been numerous concerns raised about the reproducibility of published work in the health field(Reference Ioannidis126–Reference Begley and Ioannidis129) and, unfortunately, the growing use of machine learning in research may further exacerbate this issue(Reference McDermott, Wang, Marinsek, Ranganath, Foschini and Ghassemi130). This is owed in part to difficulties involved in the machine learning process, such as minor details in data processing, modelling, and evaluation that may go undocumented, along with other factors such as randomness, software versions, data availability, biased methodologies, and selective reporting of optimistic results(Reference McDermott, Wang, Marinsek, Ranganath, Foschini and Ghassemi130,Reference Beam, Manrai and Ghassemi131) . Because of this, it is crucial that studies using machine learning in their research describe their methodology and report their results in appropriate detail.
The methodology in studies using machine learning should be described as thoroughly as possible, in a way that another analyst would be able to obtain the same results if they had access to the same dataset(Reference Beam, Manrai and Ghassemi131,Reference Heil, Hoffman, Markowetz, Lee, Greene and Hicks132) . In this regard, publishing code can be helpful because most, if not all, steps of the analysis can be automatically documented within it. Another advantage of publishing code is that it may still be interpreted and understood even without access to the data, and if the data are made available at some point in the future, investigating reproducibility is made much easier. When code is not available, it becomes more important that each step, in the correct order, is described in adequate detail.
It is not uncommon to see data processing steps described in insufficient detail(Reference Heil, Hoffman, Markowetz, Lee, Greene and Hicks132). For example, if there were any missing data or outliers present in the initial dataset, their processing should be described in sufficient detail to allow an external party to perform exactly the same steps on the relevant data points. If there were no missing data or outliers, then this should be mentioned; otherwise, it can be inferred that some level of data processing occurred that was not documented, which brings into question the validity of the whole experiment.
The same applies to many other steps in the machine learning process, such as optimising hyperparameters, model validation schemes, and feature selection. To ensure reproducible work, these and other steps should be described in adequate detail with respect to the complexity of the procedure, with accompanying random states or random seeds provided(Reference Beam, Manrai and Ghassemi131). Statements such as ‘hyperparameters were optimised’ or ‘cross-validation was used’ should not be acceptable without further specification of how this was done.
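As a small sketch of the level of detail that supports reproducibility (Python/scikit-learn; the settings shown are illustrative, not recommendations), the resampling scheme, model configuration, random seeds and library versions can all be fixed and reported.

```python
# Sketch: fixing and reporting seeds, settings and library versions so
# that the reported results can be regenerated exactly.
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
model = RandomForestClassifier(n_estimators=500, random_state=42)
print(sklearn.__version__)   # report library versions alongside the seeds
```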
When reporting the results of machine learning experiments, it is important to be transparent about the findings(Reference Lipton and Steinhardt10,Reference Lones85) . It can be tempting to report only those results that make the experiment seem successful and to ignore those which make the findings seem less convincing(Reference Lones85). However, all findings can be interesting and may provide different information, which can be useful for informing future work based on the results. For example, the consequence of omitting findings which expose instability in the results could be that money and time are wasted on validation studies. It is particularly important that any changes to the data processing or modelling steps that were informed by the outcome variable are documented; otherwise, results can become biased, similar to p-hacking with traditional statistical methods(Reference Stefan and Schönbrodt133).
Conclusion
Machine learning has the potential to be a valuable tool in nutrition research. For this potential to be realised, it is imperative that researchers have sufficient understanding of machine learning concepts to be able to interpret the results of others and apply well-designed machine learning methodologies themselves. Failure to achieve this will lead to a reduction in the quality of the literature, missed opportunities, and wasted resources in unproductive efforts to validate or extend upon prior work.
By going through each of the key steps in the machine learning process, this review aimed to provide an overview on good practices and highlight common misconceptions and pitfalls of using machine learning in nutrition research. Nutrition researchers using machine learning in the coming years should focus on the generation of high-quality data, robust validation techniques, quantifying the stability or uncertainty of results, proper interpretation of machine learning outputs, adequately described methodologies, and transparency when reporting results.
Supplementary material
For supplementary material/s referred to in this article, please visit https://doi.org/10.1017/S0029665124007638
Acknowledgements
I would like to express my gratitude to the Nutrition Society, particularly Dr Anne Nugent and Professor Jayne Woodside, for inviting me to speak at the inaugural Nutrition Society Congress, which was an excellent event. I would like to thank also Andrea Onipede for her organisational support before and during the congress.
Financial support
This research received no specific grant from any funding agency, commercial or not-for-profit sectors.
Competing interests
The author declares none.