For almost three decades, anxiety, mood and psychotic disorders have accounted for over 14% of age-standardised years lived with disability (YLDs) globally.1 Beyond their high prevalence, the burden of these disorders is driven by the large number of patients who do not respond to initial or subsequent treatments.Reference Howes, Thase and Pillinger2 Most patients do not remit after their first treatment, and typically several therapies fail before one that works is found.Reference Howes, Thase and Pillinger2 For patients, this means enduring a longer period of ineffectively treated symptoms and discomfort. For society, prolonged treatment duration strains the limited resources of mental healthcare services, causing long waiting lists for those who seek care.
Precision psychiatry
Prolonged treatment duration is related to the low clinical utility of available diagnostic systems for guiding treatment choices.Reference Howes, Thase and Pillinger2 Classification systems such as the ICD-10 and DSM are descriptive in nature, based on experienced symptoms.Reference Maj3,Reference Feczko, Miranda-Dominguez, Marr, Graham, Nigg and Fair4 As such, individuals with the same diagnosis receiving similar treatments may not necessarily share biopsychosocial psychopathological mechanisms. Consequently, treatment responses vary widely.Reference Howes, Thase and Pillinger2,Reference Feczko, Miranda-Dominguez, Marr, Graham, Nigg and Fair4 To overcome this heterogeneity, the promise of bringing precision medicine into psychiatry, defined as ‘to integrate clinical data with patient characteristics to uncover disease subtypes and improve the accuracy with which patients are categorized and treated’, is very appealing.Reference Bourke, White, Kumar and M5 By integrating clinical data with relevant biological, psychological or social patient information, more specific biopsychosocial markers can be established for better treatment response prediction and, consequently, improved outcomes.Reference Gillan and Whelan6,Reference Insel and Cuthbert7
Clinical prediction models
Over the past two decades, precision psychiatry has gained momentum and, as such, a wealth of clinical prediction models for personalised treatment outcomes has been developed.Reference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl8 However, a large translational gap between model development and clinical practice has so far prevented successful implementation.
Requirements for successful prediction model implementation
For a prediction model to have added value in clinical practice, the model first needs to have adequate discriminative abilityFootnote a and calibrationFootnote b.Reference Debray, Vergouwe, Koffijberg, Nieboer, Steyerberg and Moons9 Second, a model should be shown to be generalisable to different clinical contexts (e.g. location, treatment setting or period) through external validation.Reference Debray, Vergouwe, Koffijberg, Nieboer, Steyerberg and Moons9 Third, successful implementation depends on applicability factors, such as the feasibility of obtaining the relevant predictors in clinical contexts where time and financial constraints are common.Reference Wolff, Moons, Riley, Whiting, Westwood and Collins10 Previous systematic reviews of clinical prediction models in psychiatry shed light on the above-mentioned factors, either for a broad class of prediction models in psychiatry, including diagnostic models and models predicting onset, or for treatment outcome prediction in depression only.Reference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl8,Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 In addition, all covered a broad range of study types (e.g. from predictor-finding and development to external validation and implementation), and only one review assessed between-study heterogeneity in a quantitative way.
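To make the first two criteria concrete, the following minimal R sketch (using simulated data and hypothetical variable names, not taken from any included study) illustrates how discrimination (the c-statistic) and calibration (calibration slope and intercept) could be computed when an existing model is applied to an external validation sample.

```r
# Minimal illustration (hypothetical data): checking discrimination and
# calibration of an existing prediction model in an external validation sample.
set.seed(1)
n  <- 500
lp <- rnorm(n)                              # linear predictor from the original model
y  <- rbinom(n, 1, plogis(0.8 * lp - 0.2))  # observed outcomes (simulated here)
p  <- plogis(lp)                            # predicted probabilities

# Discrimination: c-statistic, i.e. the probability that a random event receives
# a higher predicted risk than a random non-event (Wilcoxon/Mann-Whitney form).
ranks  <- rank(p)
n1     <- sum(y == 1); n0 <- sum(y == 0)
c_stat <- (sum(ranks[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Calibration: slope (ideal = 1) and intercept (ideal = 0) of the linear
# predictor against the observed outcomes.
cal_slope     <- coef(glm(y ~ lp, family = binomial))["lp"]
cal_intercept <- coef(glm(y ~ offset(lp), family = binomial))["(Intercept)"]

c_stat; cal_slope; cal_intercept
```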
Aims
The current systematic review and meta-regression addresses this important knowledge gap regarding the prediction accuracy, methodological quality, applicability and potential sources of heterogeneity of externally validated models for treatment outcomes of common mental disorders. We focus on four aims. First, we systematically appraise the current literature on externally validated clinical prediction models that estimate treatment outcomes for mood, anxiety and psychotic disorders, to qualitatively describe variability across studies and their models. This includes, among other study properties, a systematic categorisation of included predictor types, as precision psychiatry aims to develop prediction models containing relevant predictors, which, according to the biopsychosocial model of mental disorders,Reference Bashmi, Cohn, Chan, Tobia, Gohar, Herrera and IsHak12 should be biological and psychosocial, in addition to clinical predictors. Second, we critically review the methodological quality and applicability of included studies using the Prediction Model Risk of Bias Assessment Tool (PROBAST), to ensure proper interpretation of study results and to determine the validity and applicability of prediction models. Third, given the broad inclusion criteria, data are expected to be heterogeneous on characteristics such as patients’ disorders, predictors and study settings, so a meta-analysis aimed solely at estimating an average discriminative ability is of limited use.Reference Greco, Zangrillo, Biondi-Zoccai and Landoni13 However, the observed heterogeneity itself and its sources are informative for investigating how clinical and methodological aspects of the studies relate to their results.Reference Altman, Ashby, Birks, Borenstein, Campbell, Deeks, Deeks, Higgings and Altman14 We therefore conduct a meta-analysis of omnibus discrimination performance to estimate how well the currently available models perform overall, while concurrently quantifying the observed heterogeneity. Fourth, if heterogeneity is confirmed, we examine whether heterogeneity in discrimination performance can be explained by a meta-regression on various study characteristics.
Method
The systematic review and meta-analysis were conducted in accordance with the Cochrane Handbook for Systematic Reviews of InterventionsReference Bourton, Page, Altman, Lundh and Hróbjartsson15 and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statementReference Page, McKenzie, Bossuyt, Boutron, Hoffmann and Mulrow16 (see the research checklist). The protocol was registered in the International Prospective Register of Systematic Reviews (PROSPERO) as CRD42022307987 (307987). See Supplementary file 2, available at https://doi.org/10.1192/bjo.2024.789, which describes the scope of this review in terms of patient population, intervention (treatment), comparison, outcome and type of study (PICOT).
Definitions
The following key concepts are described for a clear and precise understanding of the terminology central to our study.
• Clinical prediction model: a model developed to support prognostic estimation in daily medical practice by (statistically) relating multiple predictors to outcome data from a sample.Reference Debray, Vergouwe, Koffijberg, Nieboer, Steyerberg and Moons9
• Multivariate model: a model estimating the probability that a specific event (e.g. remission) will occur, based on multiple characteristics or pieces of information for a specific individual.Reference Heus, Damen, Pajouheshnia, Scholten, Reitsma and Collins17
• Treatment outcome: any type of outcome, whether of a clinical, psychosocial or biological nature, that could be associated with an effect (or absence thereof) induced by pharmacological and/or psychoeducational treatment.Reference Debray, Vergouwe, Koffijberg, Nieboer, Steyerberg and Moons9
• External validation: the prediction model is applied to a similar (but not necessarily the same) target population as in the development data-set, by one or more of the following validation methods to ensure the test data-set is independent:Reference Debray, Vergouwe, Koffijberg, Nieboer, Steyerberg and Moons9
• temporal: model tested in the same data collection, but using different individuals at a later timepoint;
• geographical: model is tested in a sample from a different location;
• different settings: model is tested in a different setting (e.g. from secondary to primary care setting).
• Statistical learning: includes the Cox proportional hazards model, logistic regression model, linear regression model, negative binomial model, generalised linear model, Weibull regression model, regularised regression model and other regression methods, either standard, penalised, boosted or bagged.Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11
• Machine learning: includes classification trees, random forests, artificial neural networks, support vector machines, boosted tree methods, Bayes machine learning algorithms, K-nearest neighbours algorithms, multivariate adaptive regression splines and genetic algorithms.Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 The distinction between the two learning types is illustrated in the sketch following this list.
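As an illustration of this distinction (a minimal sketch on simulated data; all variable names are hypothetical and not drawn from the included studies), the code below fits one statistical learning model (logistic regression) and one machine learning model (a random forest) to the same predictors; either could subsequently be externally validated in the same way.

```r
# Illustrative contrast between a statistical learning model (logistic
# regression) and a machine learning model (random forest) on simulated data.
library(randomForest)

set.seed(2)
n   <- 400
dat <- data.frame(severity = rnorm(n), age = rnorm(n, 40, 10), prior_episodes = rpois(n, 2))
dat$remission <- rbinom(n, 1, plogis(-0.5 + 0.7 * dat$severity - 0.02 * dat$age))

# Statistical learning: standard logistic regression.
fit_lr <- glm(remission ~ severity + age + prior_episodes, data = dat, family = binomial)

# Machine learning: random forest classifier on the same predictors.
fit_rf <- randomForest(factor(remission) ~ severity + age + prior_episodes, data = dat)

# Both yield predicted probabilities that can be validated in the same way.
p_lr <- predict(fit_lr, type = "response")
p_rf <- predict(fit_rf, type = "prob")[, "1"]
```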
In- and exclusion criteria
The following inclusion criteria were adopted: (1) studies including adults with a mood, anxiety or psychotic diagnosis as confirmed by diagnostic interview or patient file; (2) studies externally validating clinical prediction models for any type of treatment outcome; and (3) studies testing multivariable models following the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) statement.Reference Heus, Damen, Pajouheshnia, Scholten, Reitsma and Collins17
The following exclusion criteria were adopted: (1) publication types not based on original data (abstracts, conference proceedings, reviews, meta-analyses); (2) predictor-finding studies that did not report prediction models; (3) studies with predictive models that did not evaluate a model's performance for personalised risk prediction in any recommended way;Footnote c (4) studies predicting the first onset of disorders; (5) studies including children (under the age of 12) or exclusively focusing on adolescents (aged 12–18); (6) studies solely including participants recruited from the general population (i.e. not in mental healthcare settings); (7) studies exclusively focusing on postnatal mental disorders; and (8) studies not written in English, German or Dutch. No restrictions were placed on the model's performance in the development data-set.
Development of the search string
To ensure conceptual validity, 13 must-find articles (identified by their PubMed reference numbers (PMIDs)) were specified a priori. The search string was developed in consultation with an independent information expert (Dr Paul Braun (P.B.)) from the Central Medical Library (CMB) of the University Medical Center Groningen (UMCG). Once the algorithm for PubMed was finalised, the string was translated to PsycINFO and Excerpta Medica (EMBASE) in collaboration with P.B. All three databases were searched on 8 November 2021. Final search strings are given in Supplementary file 1.
Screening
Results from the three databases were imported into EndNote 20.1 (Clarivate, Philadelphia, PA, USA; see https://support.clarivate.com/Endnote/s/article/Download-EndNote?language=en_US). Duplicate results were first detected with the ‘find duplicates’ feature in EndNote, and then manually removed after review. The remaining results were imported into Rayyan (Rayyan, Cambridge, MA, USA; see www.rayyan.ai), and duplicate removal was repeated. Next, an order of exclusion criteria was tested blindly on ten randomly selected papers to synchronise the eligibility determination process between the two reviewers (see Supplementary file 7). Pilot testing showed that the order did not require adjustment. The order was used for the double-blinded title/abstract and full-text screening. The outcomes of the double-blinded screening were discussed during weekly meetings. A priori, a third reviewer (H.R.) was identified in case any discrepancies could not be resolved during those meetings; however, consultation of the third reviewer was not needed. Some disagreements between D.G.B. and S.H.B. arose from differing understandings of what constitutes a clinical sample, resulting in a further specification of the definition of a clinical sample in the PROSPERO protocol (307987), as follows: ‘Studies with a small part of the sample being recruited from the general population (with diagnoses being established via interviews) and without information about mental healthcare, is accepted when at least a majority of the total sample is reported to receive mental health care’.
Data collection
All data were extracted by means of an in-house developed data extraction form following the critical appraisal and data extraction for systematic reviews of prediction modelling studies (CHARMS) checklist.Reference Moons, de Groot, Bouwmeester, Vergouwe, Mallett and Altman18 The data extraction form was piloted on a sample of five included studies. Data extraction started on 13 January 2022. The first reviewer (D.G.B.) extracted all relevant data; the second reviewer (S.H.B.) independently assessed data extraction validity using a random subsample of full texts (n = 6 out of 21) and checked the correctness of all reported outcomes of main interest, as shown in Table 3.
Risk of bias and applicability
For each study, the risk of bias was assessed double-blinded by D.G.B. and S.H.B. using PROBAST. With PROBAST, the risk of bias is assessed in four domains: participants, predictors, outcomes and analyses.Reference Wolff, Moons, Riley, Whiting, Westwood and Collins10 Each domain contains signalling questions rated as low, high or unclear concern. If one domain was judged to have a high risk of bias, the overall risk of bias was considered high. Similarly, if one domain was assessed as unclear, the overall risk of bias was considered unclear. Model applicability was judged using the same approach. Potential publication bias of the studies included in the meta-analysis was explored with a funnel plot using the R package metafor.Reference Viechtbauer19
Data synthesis
Discrimination and calibration measures were of main interest, including the type of measure reported and the lowest and highest reported values when multiple models were externally validated on the same data-set. Other performance measures and descriptive data were also extracted and summarised from the papers: author(s), year of publication, source validation data-set (i.e. cohort, clinical trial), participant characteristics (i.e. type of disorder, age range), treatment characteristics (type of treatment, in- or out-patient setting, primary or secondary facility), outcome measures (definition, timing, method of assessment), predictors (psychosocial/clinical/biological nature), modelling methods (type of statistical analysis, method of predictor selection), power (sample size, number of events per variable) and the handling of missing data.
Predictor type categorisation
Predictor type categorisation is not straightforward, and transparent protocols for it are lacking in previous reviews in the field of precision psychiatry.Reference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl8,Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 As such, the qualitative method of grounded coding was applied. This method is directed towards finding common categories among codes and, where possible, deriving overarching themes through higher-order categorisation.Reference Punch20 To ensure transparency in reporting and possible reproduction, the resulting codebook generated by grounded coding is given in Supplementary file 6.
Meta-analysis, sensitivity analyses and meta-regression
Formal meta-analysis to estimate an average discriminative ability is only useful if its assumptions are met.Reference Greco, Zangrillo, Biondi-Zoccai and Landoni13,Reference Eysenck21 The data included in this meta-analysis are heterogeneous on characteristics such as patients’ disorders, outcomes, predictors and study settings. However, the heterogeneity itself and its sources are informative for investigating how clinical and methodological aspects of the studies relate to their results.Reference Altman, Ashby, Birks, Borenstein, Campbell, Deeks, Deeks, Higgings and Altman14
As such, a meta-analysis of omnibus discrimination performance measures was executed, as a prerequisite for the meta-regression. Any study that presented an omnibus measure of discrimination, such as the concordance statistic (c-statisticReference Debray, Vergouwe, Koffijberg, Nieboer, Steyerberg and Moons9,Reference Moons, de Groot, Bouwmeester, Vergouwe, Mallett and Altman18), or, if that was not available, reported accuracy at a single cut-off (ASC), was used in the meta-analysis. Given that ASC is less informative than the area under the curve (AUC),Reference Bradley22,Reference Carrington, Manuel, Fieguth, Ramsay, Osmani and Wernley23 a sensitivity analysis was performed excluding ASCs.
Currently, CochraneReference Altman, Ashby, Birks, Borenstein, Campbell, Deeks, Deeks, Higgings and Altman14 does not provide recommendations for selecting models for meta-analysis when multiple models are reported within the same study. Therefore, to increase comparability, we followed the strategy of Lee et al,Reference Lee, Ragguett, Mansur, Boutilier, Rosenblat and Trevizol24 whose review addressed machine learning-based prediction models in psychiatry. The best performing model per outcome was selected based on the highest reported discrimination. Multiple discrimination estimates from the same study were included under a few conditions, as specified in the a priori established decision document (see Supplementary file 6).
When measures of uncertainty were missing for the c-statistic but the total sample size and the number of events were reported, these were estimated according to a formula described by Debray et al.Reference Debray, Vergouwe, Koffijberg, Nieboer, Steyerberg and Moons9 Before the meta-analysis, discrimination measures were logit transformed to meet the normality assumptions. These transformations were done with the R package metamisc.Reference Debray and Jong25
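As a rough illustration of this step (the review itself follows the formula described by Debray et al; the sketch below substitutes the well-known Hanley and McNeil approximation and uses hypothetical numbers), a missing standard error can be approximated from the reported c-statistic, sample size and number of events, and then carried to the logit scale with the delta method.

```r
# Approximate a missing standard error for a reported c-statistic and
# transform it to the logit scale (delta method). Numbers are hypothetical;
# the review itself follows the formula described by Debray et al.
approx_se_cstat <- function(cstat, n, events) {
  n1 <- events            # number of events
  n2 <- n - events        # number of non-events
  q1 <- cstat / (2 - cstat)
  q2 <- 2 * cstat^2 / (1 + cstat)
  sqrt((cstat * (1 - cstat) + (n1 - 1) * (q1 - cstat^2) +
        (n2 - 1) * (q2 - cstat^2)) / (n1 * n2))   # Hanley & McNeil (1982)
}

cstat  <- 0.72; n <- 350; events <- 120
se_c   <- approx_se_cstat(cstat, n, events)

# Logit transformation of the c-statistic and its standard error, as applied
# before pooling.
logit_c    <- qlogis(cstat)
se_logit_c <- se_c / (cstat * (1 - cstat))
```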
Between-study heterogeneity was assessed with the common parameters I2Footnote d and tau2Footnote e and with the prediction interval; the meta-analysis was implemented with the R package metafor.Reference Viechtbauer19 As heterogeneity between studies was anticipated, a random-effects model was used with restricted maximum likelihood estimation. To ease interpretation, estimates were back-transformed with the inverse logit transformation function (itrans.logit) of the metafor package before being presented in tables and plots, so that they represent c-statistics (or ASCs).
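A minimal sketch of this modelling step is given below, using metafor on hypothetical logit-transformed c-statistics and standard errors; the actual analysis script is listed in Supplementary file 7 and may differ in its details.

```r
# Random-effects meta-analysis (REML) of logit-transformed c-statistics,
# with heterogeneity statistics, a prediction interval and a funnel plot.
# Input values are hypothetical.
library(metafor)

yi  <- c(0.95, 0.61, 1.20, 0.40, 0.85, 1.05)   # logit(c-statistic) per model
sei <- c(0.15, 0.20, 0.10, 0.25, 0.18, 0.12)   # standard errors on the logit scale

fit <- rma(yi = yi, sei = sei, method = "REML")

summary(fit)                           # reports tau^2 and I^2, among other statistics
predict(fit, transf = transf.ilogit)   # pooled estimate and prediction interval,
                                       # back-transformed to the c-statistic scale
funnel(fit)                            # funnel plot to explore potential publication bias
```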
To investigate which clinical and methodological aspects of the studies relate to discriminative ability, a random-effects meta-regression was conducted. The number of potential effect modifiers of model performance investigated should be limited, as the likelihood of a false-positive result among subgroup analyses and meta-regressions increases with the number of characteristics investigated.Reference Moons, de Groot, Bouwmeester, Vergouwe, Mallett and Altman18 Following the Cochrane Handbook, potential effect modifiers may include study-level characteristics, such as population type, type of intervention or treatment, length of follow-up or methodology (i.e. design and quality). For testing interactions with categorical variables, the handbook advises a minimum of five studies per category.Reference Altman, Ashby, Birks, Borenstein, Campbell, Deeks, Deeks, Higgings and Altman14 For testing interactions with continuous variables, a minimum of ten studies in total is advised.Reference Altman, Ashby, Birks, Borenstein, Campbell, Deeks, Deeks, Higgings and Altman14 The following study characteristics were eligible for meta-regression: type and number of predictors, and type of disorder. The other above-mentioned characteristics could not be tested because of insufficient numbers of studies. Regression coefficients, the associated z-scores, s.e. values, significance values and prediction intervals were reported as primary outcomes per moderator. The R script for these analyses is listed in Supplementary file 7.
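The meta-regression step could look as follows; this is a sketch with hypothetical data and moderator codings, not the script used for the reported analyses (which is listed in Supplementary file 7).

```r
# Random-effects meta-regression on study-level moderators (hypothetical data):
# type of disorder and number of predictors, each tested in a separate model,
# mirroring the limited number of eligible study characteristics.
library(metafor)

dat <- data.frame(
  yi           = c(0.95, 0.61, 1.20, 0.40, 0.85, 1.05),  # logit(c-statistic)
  sei          = c(0.15, 0.20, 0.10, 0.25, 0.18, 0.12),
  depression   = c(1, 1, 0, 1, 0, 0),                    # 1 = depressive disorder sample
  n_predictors = c(8, 15, 6, 25, 10, 12)
)

fit_disorder <- rma(yi, sei = sei, mods = ~ depression,   data = dat, method = "REML")
fit_npred    <- rma(yi, sei = sei, mods = ~ n_predictors, data = dat, method = "REML")

summary(fit_disorder)   # coefficient, s.e., z-value and P-value per moderator
summary(fit_npred)
```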
Results
Study selection
The systematic search in PubMed, EMBASE and PsycINFO yielded a total of 2389 hits. Fifty-seven studies were screened full text (see Fig. 1 for a flowchart of the screening process). In total, 28 studiesReference Ashar, Clark, Gunning, Goldin, Gross and Wager26–Reference Wang, Patten, Sareen, Bolton, Schmitz and MacQueen53 were part of the systematic review (see Supplementary file 3 for the list of included papers).
The meta-analysis was conducted on the 21 studiesReference Athreya, Neavin, Carrillo-Roa, Skime, Biernacka and Frye27,Reference Bone, Simmonds-Buckley, Thwaites, Sandford, Merzhvynska and Rubel29,Reference Chekroud, Zotti, Shehzad, Gueorguieva, Johnson and Trivedi31–Reference Hayes, Osborn, Francis, Ambler, Tomlinson and Boman37,Reference Jha, South, Trivedi, Minhajuddin, Rush and Trivedi39,Reference Kambeitz-Ilankovic, Vinogradov, Wenzel, Fisher, Haas and Betz40,Reference Leighton, Krishnadas, Chung, Blair, Brown and Clark43–Reference Nie, Vairavan, Narayan, Ye and Li45 that either reported the needed parameters or allowed these parameters to be calculated as described in the analysis section. Notably, Ashar et alReference Ashar, Clark, Gunning, Goldin, Gross and Wager26 used a continuous metric for predictive accuracy and could not be included in the meta-analysis. Multiple models from the same study were included following the previously described decision document. Therefore, from the 21 studies, 28 models were included in the meta-analysis.
Study characteristics
Source validation data-set
Five studies made use of data derived from registries, 14 studies included data from clinical trials (either open-label or randomised; nine and five, respectively), eight studies used observational cohort data and one study used survey data.
Participants and setting
Most studies focused on patients diagnosed with a mood disorder (unipolar n = 14, bipolar n = 2), followed by mixed samples of patients with severe mental illness (SMI) (n = 6) and psychotic disorders (n = 5), while one study validated prediction models for anxiety and mood disorder separately. Most studies included adult patients only (n = 14), some also included adolescents (n = 9) and in some papers the inclusion of adolescents remained unclear (n = 5). All study characteristics, including sample characteristics and care setting details, are given in Table 1. More details regarding the in- and exclusion criteria per validation data-set are listed in Supplementary file 7.
ACT, auditory-based computed tomography intervention; CBT, cognitive behavioural therapy; DEP, depressive disorders; Der, derivation sample; EET, employment, education or training status; EOT, end of treatment; GENDEP, genome-based therapeutic drugs for depression project; GP, general practitioner; INT, interview; ISPC, International SSRI Pharmacogenomics Consortium; MARS, Munich Antidepressant Response Signature project; MHC, medical hospital centre; NMB, no meaningful benefit; OBS, observational status; OUT, out-patient status; SMI, severe mental illness; STAR*D, Sequenced Treatment Alternatives to Relieve Depression study; TAU, treatment as usual; TRS, treatment resistance; UNK, unknown.
a. Calculated from data on training set and training and validation set.
Treatment outcome
Most studies reported clinical outcomes (n = 25), such as symptom change, remission status, adverse events and care consumption. One studyReference Leighton, Krishnadas, Chung, Blair, Brown and Clark43 reported both a clinical outcome (remission) and a psychosocial outcome (employment, education and training status). One study described the psychosocial outcome of the likelihood of committing a crime.Reference Fazel, Wolf, Larsson, Lichtenstein, Mallett and Fanshawe34 To assess the outcome, most studies used a clinician-rated instrument (n = 16), followed by self-report instruments (n = 7) and event occurrence in registry data (n = 5). The timing of the end-points varied greatly. The earliest outcome assessment was at 1-week follow-up,Reference Furukawa, Kato, Shinagawa, Miki, Fujita and Tsujino36 while the latest was at 6-year follow-up.Reference Perry, Osimo, Upthegrove, Mallikarjun, Yorke and Stochl49 In some studies (n = 4) the outcome assessment depended on the end of treatment.Reference Bone, Simmonds-Buckley, Thwaites, Sandford, Merzhvynska and Rubel29,Reference Kautzky, Dold, Bartova, Spies, Kranz and Souery41,Reference Perlis, Ostacher, Miklowitz, Hay, Nierenberg and Thase48,Reference Taliaz, Spinrad, Barzilay, Barnett-Itzhaki, Averbuch and Teltsh52 A detailed description of the end-point assessment per study is given in Table 1.
Predictors
With the exception of three studies,Reference Cattaneo, Ferrari, Uher, Bocchio-Chiavetto, Riva and Pariante30,Reference Fabbri, Kasper, Kautzky, Zohar, Souery and Montgomery32,Reference Kambeitz-Ilankovic, Vinogradov, Wenzel, Fisher, Haas and Betz40 all studies included clinical predictors in their model, such as symptom severity, past psychiatric history (PHX) or general medical history (GMH). Notably, studies that included psychosocial variables also included biological variables (n = 16). Only a few studies (n = 7) included clinical predictors exclusively in their model(s). A detailed overview of the predictors used per study is given in Table 2. The codebook specifying which variable was coded as which type is given in Supplementary file 7.
ASC, accuracy at single cut-off; AUC, area under the curve; BMI, body mass index; Childhood AEs, Childhood adverse events; GMH, general medical history; HRQOL, health-related quality of life; LSOA, lower super output area (measurement of neighbourhood deprivation); MHX, psychiatric medication history; NHS, National Health Service; PANSS Gx, positive and negative symptoms scale subscale general psychopathology; PANSS Nx, positive and negative symptoms scale subscale negative symptoms; PANSS Px, positive and negative symptoms scale subscale positive symptoms; PHX, psychiatric history and past psychiatric care use.
Performance measures
Out of the 28 included studies, 11 reported on both calibration and discrimination measures (see Table 3). The majority of studies only reported discrimination measures (n = 16). One studyReference Cattaneo, Ferrari, Uher, Bocchio-Chiavetto, Riva and Pariante30 did not report omnibus calibration and discrimination measures, despite constructing and validating a clinical prediction model. The study reported accuracy measures instead (sensitivity, specificity). No studies included calibration measures only.
AdaBoost, adaptive boosting; Anx, anxiety symptoms; Anypol, any polarity in bipolar; ASC, accuracy at single cut-off; AUC, area under the curve; BAC, balanced accuracy curve; Bay, Bayesian updating algorithm; BCT, boosted classification trees; Brier, Brier score; BNPV, Bayes negative predictive value; BPPV, Bayes positive predictive value; CCA, complete case analysis; C-stat, c-statistic; Dep, depressive symptoms; EPV, event per variable (lowest incidence category/number of predictors); FNR, false negative rate; GBDT, gradient boosting decision trees; GBM, gradient boosting; HLT, Hosmer–Lemeshow test; Imp, imputation; LDF, linear discriminant function; Learning, model learning type; LOCV, last observation carried forward; Log, logistic regression; Mania, manic episode in bipolar; MDB, major depressive episode in bipolar; ModR, model-based R square; NND, number needed to diagnose; NR, not recorded; NRI, net reclassification index; NPV, net positive value; PLR, penalised linear regression; PPV, positive predictive value; Rem, removal; Sens, sensitivity; Spec, specificity; SVM, support vector machine; Stat, statistical learning; UNK, unknown; XGBoost, eXtreme gradient boosting.
All reported measures are rounded to two decimals.
a. Highest performance was only filled in if the study reported multiple models.
b. In case of multiple models, ‘other measures’ are given for the highest performing model.
c. Calculated confidence interval.
Of the 11 studies reporting calibration measures, most reported calibration slopes (n = 4), while some reported calibration plots (n = 3) or results from the Hosmer–Lemeshow test (n = 3). One study reported the calibration intercept. Studies that included discrimination measures (n = 27) reported AUC statistics (n = 14), c-statistics (n = 7), ASCs (n = 5) and a continuous outcome with an R 2 measure to represent predictive accuracy (n = 1).
Modelling methods
Almost all studies constructed classification models (n = 27). Many studies (see Table 3) constructed statistical learning models (n = 16), with logistic models being the most prevalent (n = 14). Eleven studies constructed models using machine learning only, with some form of boosting (gradient, extreme or adaptive) (n = 9) and elastic net (n = 4) being the most utilised techniques. One studyReference Bone, Simmonds-Buckley, Thwaites, Sandford, Merzhvynska and Rubel29 used both statistical and machine learning methods. For further detail, see Supplementary file 7, in which the technique utilised per study is described in detail.
Handling of missing data
Several studies (n = 9) accounted for missing data by imputation only. A few studies (n = 4) excluded variables from analysis if a given share of a variable was missing and, subsequently, imputed variables. Notably, some studies (n = 7) relied on complete case analysis. One studyReference Ashar, Clark, Gunning, Goldin, Gross and Wager26 applied last observation carried forward for handling missing data. In various studies (n = 7), handling of missing data remained unclear from the text and supplement.
Risk of bias
Following PROBAST, the overall quality of reporting in the studies was low (see Supplementary file 6 for study-specific (sub)domain evaluations). Main areas of high concern with respect to bias introduction were the poor reporting on the handling of missing data, analyses and outcomes.
Applicability
Applicability was also assessed with PROBAST. Main areas of concern were participant in- and exclusion criteria, included predictors and outcome assessment. Many studies (n = 12) were rated as of high concern in the participant domain, mostly because of stringent in- and exclusion criteria, which harmed the generalisability of the model. Seven studies were rated as of high concern on predictor feasibility, as operationalising the chosen predictors in clinical practice would be challenging. For example, one studyReference Ashar, Clark, Gunning, Goldin, Gross and Wager26 included the predictor amygdala connectivity, which was calculated from a functional magnetic resonance imaging (fMRI) scan. Following the PROBAST guidelines, this study was rated of high concern with regard to applicability. Not only are neuroimaging data-sets prone to variability in analysis,Reference Botvinik-Nezer, Holzmeister, Camerer, Dreber, Huber and Johannesson54 but the limited sample size of these studies also limits the applicability of the results.Reference Rashid and Calhoun55,Reference Schnack and Kahn56 The specialised techniques required for imaging, and their associated complexity and cost, may further lower applicability for widespread clinical deployment.
Two studies were rated as of high concern on outcome feasibility, for different reasons. Notably, only two studiesReference Furukawa, Kato, Shinagawa, Miki, Fujita and Tsujino36,Reference Puntis, Whiting, Pappa and Lennox50 were of low concern in both risk of bias and applicability.
Meta-analysis
In total, 28 models (derived from 21 studies) were eligible for meta-analysis according to the pre-specified criteria. The funnel plot in Supplementary file 5 indicates heterogeneity in discrimination performance, even between studies with a small standard error.
As expected, the heterogeneity among studies was substantial (I2 = 97.83%, tau2 = 0.28). Overall, evidence for the performance accuracy was mixed. The overall summary discrimination performance was fair (>0.70; point estimate: 0.72; 95% CI [0.46; 0.89]). However, the wide prediction interval (see Fig. 2) indicates that some models performed no better than chance (the interval extends below 0.50). Excluding the ASC-reporting studies resulted in a slightly higher summary estimate and a narrower confidence interval: 0.76 [0.71–0.80]. The corresponding prediction interval [0.51; 0.90] remained above 0.50, meaning that all included models performed better than chance. The forest plot of this sensitivity analysis is presented in Supplementary file 4.
As can be seen in Fig. 2, the two studies that reported a psychosocial outcome performed among the best. However, because of the small number of studies in this category, no meta-regression on the type of outcome could be conducted.
Potential sources of heterogeneity
When applying the criteria given in the Method section, the following sources of heterogeneity could be investigated: type of disorder and number and type of included predictors.
Type of disorder
Diagnosis of a depressive disorder was a significant moderator (B = −0.47, z = −2.31, s.e. = 0.20, P = 0.02, prediction interval = −0.86; −0.07), explaining 3.41% of the residual heterogeneity. Models predicting outcomes for individuals diagnosed with depressive disorders (n = 18) showed lower discrimination than models predicting outcomes for other disorders (n = 10). The remaining models included SMI samples, such as psychosis (n = 6) and bipolar disorder (n = 2), or mixed samples (n = 2).
Number of included predictors
The number of included predictors did not significantly moderate discrimination performance (B = 0.00, z = 0.25, s.e. = 0.01, P = 0.80, prediction interval = −0.01; 0.01).
Type of predictors
Of the 28 eligible models, 18 used biological, psychosocial and clinical predictors. Using a combination of these three predictor types did not significantly moderate the discrimination performance (B = 0.11, z = 0.46, s.e. = 0.24, P = 0.65, prediction interval = −0.35; 0.57). The use of clinical predictors only (n = 7) also did not significantly moderate discrimination performance (B = 0.03, z = 0.13, s.e. = 0.26, P = 0.89, prediction interval = −0.48; 0.55). As there were few models using biological predictors exclusively (n = 2), no meta-regression for biological models was performed. There were no models using psychosocial predictors only.
Discussion
This is the first systematic review specifically examining the performance and applicability of externally validated prediction models that estimate treatment outcomes for individuals diagnosed with a mood, anxiety or psychotic disorder. Twenty-eight studies were identified. Studies included in this review differed in source of validation data-set; participant in- and exclusion criteria; definition, timing and assessment method of the treatment outcome; type and number of included predictors; reporting of discrimination and calibration measures; handling of missing data; and modelling methods. The majority of the studies were labelled as being at high risk of bias by PROBAST, because of methodological concerns, (poorly reported) analyses or inappropriate handling of missing data. The applicability of the included studies was overall poor because of strict in- and exclusion criteria, the included predictors (low feasibility of obtaining predictors within a clinical context) and the outcome assessment. The two studies rated as of low concern with regard to both risk of bias and applicabilityReference Furukawa, Kato, Shinagawa, Miki, Fujita and Tsujino36,Reference Puntis, Whiting, Pappa and Lennox50 shared the following characteristics: both were pragmatic in nature (either a trial close to clinical practice or routinely collected data), predicted a clinical outcome, utilised accessible clinical, biological and psychosocial variables, employed logistic regression and imputed missing data using chained equations (see Supplementary Table 7.6).
In the meta-analysis, the overall summary discrimination measure was fair. Nevertheless, the wide prediction interval indicates that some models performed at chance level, while other models were adequate. The two models reporting psychosocial outcomes performed among the best. Of the meta-regressions performed, only the type of disorder (depression versus other disorders) explained a limited share of the heterogeneity; models for participants with depression performed worse than the others.
Despite our effort to conduct a comprehensive search and include all relevant literature, papers may have been omitted that should have been included, owing to the way we conducted our title and abstract screening. For example, during the screening of titles and abstracts we excluded a paper on lithium response by Nunes et alReference Nunes, Ardau, Berghöfer, Bocchetta, Chillotti and Deiana57 (2020) because the methods in the abstract mentioned cross-validation as the validation method, while the full text also included a form of external validation (leave-one-site-out). However, we had to balance rigour against effort.
Because of resource constraints, data extraction was only partially duplicated. For full texts, only a subsample was independently checked and, while every effort was made to ensure accuracy, minor discrepancies may have occurred. It is important to note that all outcomes of primary interest, particularly those outlined in Table 3, were meticulously double-checked. Despite our best efforts, we acknowledge the possibility of minor discrepancies in other areas.
Literature comparison and interpretation
A recent meta-reviewReference Gillett, Tomlinson, Efthimiou and Cipriani58 identified four externally validated models, one of which was a predictor-finding study according to the review of Lee et al.Reference Lee, Ragguett, Mansur, Boutilier, Rosenblat and Trevizol24 A systematic review and meta-analysis of clinical prediction models in psychiatry beyond treatment outcomes aloneReference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 identified 25 externally validated predictive models. The number of studies identified in the current review illustrates a rapidly growing interest in clinical prediction model research in psychiatry. Still, the number of externally validated models is very low compared with the number of developed models. The meta-review of Gillett et alReference Gillett, Tomlinson, Efthimiou and Cipriani58 noted this lack of external validation in the literature. In the reviews of Lee et alReference Lee, Ragguett, Mansur, Boutilier, Rosenblat and Trevizol24 and Salazar de Pablo et al,Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 only 20.0% and 29.9% of the included studies externally validated their models, respectively. Taken together, these findings support our notion that the field could greatly benefit from a shift in focus from developing new models to externally validating existing ones.
The finding of large methodological heterogeneity among studies constructing and validating clinical prediction models is in line with the conclusions of Gillett et alReference Gillett, Tomlinson, Efthimiou and Cipriani58 and Salazar de Pablo et al.Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 More specifically, Gillett et alReference Gillett, Tomlinson, Efthimiou and Cipriani58 concluded that, within systematic reviews, the definition of treatment response, the predictor variables and their assessment varied greatly among studies, and Salazar de Pablo et alReference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 concluded that the main limitation of their study was the heterogeneity of the characteristics of the prediction models. Chekroud et alReference Chekroud, Hawrilenko, Loho, Bondar, Gueorguieva and Hasan59 highlight the challenge of external validation across clinical trials, advocating the importance of identifying trial-level characteristics in relation to patient outcomes to improve model generalisability. As the growth of mental healthcare delivery systems enables extensive data collection, longitudinal validation methods are the most pragmatic way of examining a model's generalisability.
Similar to previous work by Salazar de Pablo et al,Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 Gillett et alReference Gillett, Tomlinson, Efthimiou and Cipriani58 and Meehan et al,Reference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl8 the majority of the externally validated prediction models were classification models. One could speculate why this is the case. Perhaps, for clinical decision-making, the classification of treatment outcomes into binary categories is considered more clinically useful than a change in symptom score. In our review, several studies employed a change in, or a threshold on, symptom scores to determine treatment outcome status, such as remission versus non-remission or response versus non-response. Another possible reason for finding only classification models could be inherent to the search string used: if calibration and discrimination measures are underreported in non-classification models, such models would not be captured by this review.
We extend previous work by Salazar de Pablo et al,Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 as the frequent use of clinical variables and the less frequent use of biomarkers (such as genetic information) were identified in this review as well. Salazar de Pablo et alReference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 found that most models were based on clinical predictors, and that there is no evidence that models including biomarkers outperform other types of models. In existing biomarker-based models, the great variability in raw data processing and feature type selection potentially hampers successful external validation.Reference Botvinik-Nezer, Holzmeister, Camerer, Dreber, Huber and Johannesson54,Reference Rashid and Calhoun55 Some even question whether biomarkers are necessary for estimating treatment outcomes. In Meehan et al,Reference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl8 studies reporting models with a majority of biomarkers were excluded because of pragmatic concerns. In this review, no such restrictions were applied. Notably, few externally validated models that predominantly used biomarkers were identified, and those that were performed poorly. This supports the notion that biomarkers are not, or not yet sufficiently, able to reflect psychopathological mechanisms.
Notably, we could not identify externally validated clinical prediction models that contained tests of known psychological mechanisms. For example, negative bias (i.e. the tendency to pay more attention to negative information) is a well-established characteristic of the psychopathology of patients suffering from depressive disorders, anxiety disorders, bipolar disorders and schizophrenia,Reference Fazel, Wolf, Larsson, Lichtenstein, Mallett and Fanshawe33–Reference Fiedorowicz, Merranko, Iyengar, Hower, Gill and Yen35 and its severity can be assessed with the dot-probe computer task.Reference Furukawa, Kato, Shinagawa, Miki, Fujita and Tsujino36 In the review of Lee et al,Reference Lee, Ragguett, Mansur, Boutilier, Rosenblat and Trevizol24 one study constructed a prediction model utilising a cognitive-emotional biomarker, but unfortunately this model was not externally validated. Based on this research, it is tempting to speculate either that models using outcomes of cognitive tests fail to replicate in external validation, or that this observation identifies a promising gap in the literature that invites further exploration.
Comparisons of accuracy performance among reviews must be made with great caution, as the reported accuracy depends on the type of validation the included models were tested with.Reference Greco, Zangrillo, Biondi-Zoccai and Landoni13,Reference Eysenck21 In this review, only externally validated models were included, regardless of the model's performance in the development data-set (see Supplementary Table 7.7). Based on the lower point estimates and wider prediction intervals of this study, one could conclude that the accuracy performance in the current study is lower than that of the prediction models reported by Salazar de Pablo et al.Reference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 The same holds for the review of Meehan et al.Reference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl8 However, in both reviews the meta-analysis samples consisted mostly of internally validated models, making the higher reported discrimination measures an expected finding.Reference Debray, Vergouwe, Koffijberg, Nieboer, Steyerberg and Moons9,Reference Greco, Zangrillo, Biondi-Zoccai and Landoni13 Moreover, in both reviews a protocol for selecting models for meta-analysis was lacking.
Comparing the findings of systematic reviews and meta-analyses increases the risk of ecological fallacy, as aggregated data can mask important differences between subgroups or individuals, which runs counter to the mission of precision psychiatry. One may even argue that a narrative review would be more appropriate, as this publication type allows for a broader scope, a broader audience and more flexibility.Reference Pae60 However, narrative reviews are, because of their flexibility, more prone to bias in the gathering and evaluation of evidence.Reference Bourton, Page, Altman, Lundh and Hróbjartsson15
This review illustrates the close relationship between introduced bias and applicability concerns arising from methodological design. In our review, only two out of 28 models were of low concern for both risk of bias and applicability. The reviews of Meehan et alReference Meehan, Lewis, Fazel, Fusar-Poli, Steyerberg and Stahl8 and Salazar de Pablo et alReference Salazar de Pablo, Studerus, Vaquerizo-Serrano, Irving, Catalan and Oliver11 identified key weaknesses in model construction and performance: underpowered studies, failure to report key aspects of model handling (missing data handling and predictive performance, particularly calibration) and bias-prone variable selection strategies. Not surprisingly, most models in both reviews were rated of high concern in both the risk of bias and applicability domains.
The observation, informed by the meta-regression, that models predicting treatment outcomes for depression performed worse than the other models in our sample is novel and intriguing. A possible explanation to consider is the particular case mix of the ‘rest’ category in the meta-regression, which included mainly SMI samples. Studies using samples with more severe disorders are more often conducted in patients treated in secondary care, and may have different distributions of outcomes and predictors, which may lead to different accuracy performance.Reference Eysenck21 Relatedly, the presentation of patients with moderate severity, which is often the case in depression, is typically more heterogeneous, making it more difficult for a model to perform well than in a more homogeneous sample of severely mentally ill patients.Reference Greco, Zangrillo, Biondi-Zoccai and Landoni13 Notably, the models that predicted psychosocial outcomes did so for an SMI population, raising the question of whether the good discrimination of these models is due to the sample, the type of outcome predicted, or both.
The diversity of studies and settings encountered in this review reflects the challenge of clinical model implementation. The finding that many models were labelled of ‘high concern’ with regard to risk of bias and/or applicability indicates that few models seem ready for further implementation in clinical practice to aid treatment allocation. For the implementation of clinical prediction models in practice, the need for close examination of the clinical setting before model implementation remains, regardless of whether the models have been externally validated. The observation in the meta-regression that depression models showed lower accuracy, and the visually apparent stronger accuracy for psychosocial outcomes in the meta-analysis, are new and remarkable findings. More research is needed to understand these differences and their implications.
In conclusion, (large-scale) implementation of individualised, externally validated prediction models in clinical practice does not seem feasible in the near future. Few models seem ready for further implementation in clinical practice to aid treatment allocation. Besides the need for more external validation studies, we recommend close examination of the clinical setting before model implementation.
Supplementary material
Supplementary material is available online at https://doi.org/10.1192/bjo.2024.789
Data availability
Data availability is not applicable to this article as no new data were created or analysed in this study.
Acknowledgements
We would like to express our appreciation to Dr Peter Braun for his contributions to the development of the search string. We would like to thank Dr Ans Hovenkamp for her methodological advice during the inception of this review.
Author contributions
The PROSPERO protocol was designed and registered by S.H.B., D.G.B. and H.R. The search was performed by D.G.B. and S.H.B. Data extraction was performed by D.G.B. and S.H.B. The manuscript was written by D.G.B. and reviewed by H.R., S.H.B. and R.A.S. All authors gave final approval for the work to be published.
Funding
This research received no specific grant from any funding agency, commercial or not-for-profit sectors.
Declaration of interest
None.