Person-specific and pooled prediction models for binge eating, alcohol use and binge drinking in bulimia nervosa and alcohol use disorder

N. Leenaerts; P. Soyster; J. Ceccarini; S. Sunaert; A. Fisher; E. Vrieze

doi:10.1017/S0033291724000862

Published online by Cambridge University Press: 22 May 2024

N. Leenaerts*: Affiliation:
Department of Neurosciences, KU Leuven, Leuven Brain Institute, Research Group Psychiatry, Leuven, Belgium; Department of Neurosciences, Mind-Body Research, Research Group Psychiatry, KU Leuven, Belgium
P. Soyster: Affiliation:
Department of Psychology, Idiographic Dynamics Lab, University of California, Berkeley, USA
J. Ceccarini: Affiliation:
Department of Nuclear Medicine and Molecular Imaging, KU Leuven, Leuven Brain Institute, Research Nuclear Medicine & Molecular Imaging, Leuven, Belgium
S. Sunaert: Affiliation:
Department of Imaging and Pathology, Translational MRI, Biomedical Sciences Group, KU Leuven, Belgium
A. Fisher: Affiliation:
Department of Psychology, Idiographic Dynamics Lab, University of California, Berkeley, USA
E. Vrieze: Affiliation:
Department of Neurosciences, KU Leuven, Leuven Brain Institute, Research Group Psychiatry, Leuven, Belgium; Department of Neurosciences, Mind-Body Research, Research Group Psychiatry, KU Leuven, Belgium
*: Corresponding author: N. Leenaerts; Email: [email protected]

Abstract

Background

Machine learning could predict binge behavior and help develop treatments for bulimia nervosa (BN) and alcohol use disorder (AUD). Therefore, this study evaluates person-specific and pooled prediction models for binge eating (BE), alcohol use, and binge drinking (BD) in daily life, and identifies the most important predictors.

Methods

A total of 120 patients (BN: 50; AUD: 51; BN/AUD: 19) participated in an experience sampling study, where over a period of 12 months they reported on their eating and drinking behaviors as well as on several other emotional, behavioral, and contextual factors in daily life. The study had a burst-measurement design, where assessments occurred eight times a day on Thursdays, Fridays, and Saturdays in seven bursts of three weeks. Afterwards, person-specific and pooled models were fit with elastic net regularized regression and evaluated with cross-validation. From these models, the variables with the 10% highest estimates were identified.

Results

The person-specific models had a median AUC of 0.61, 0.80, and 0.85 for BE, alcohol use, and BD respectively, while the pooled models had a median AUC of 0.70, 0.90, and 0.93. The most important predictors across the behaviors were craving and time of day. However, predictors concerning social context and affect differed among BE, alcohol use, and BD.

Conclusions

Pooled models outperformed person-specific models and the models for alcohol use and BD outperformed those for BE. Future studies should explore how the performance of these models can be improved and how they can be used to deliver interventions in daily life.

Keywords

alcohol use; alcohol use disorder; binge drinking; binge eating; bulimia nervosa; ecological momentary assessment; experience sampling method; machine learning

Type: Original Article
Information: Psychological Medicine, Volume 54, Issue 10, July 2024, pp. 2758-2773

DOI: https://doi.org/10.1017/S0033291724000862
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright: Copyright © The Author(s), 2024. Published by Cambridge University Press

Introduction

Bulimia nervosa (BN) and alcohol use disorder (AUD) are two psychiatric disorders that share a number of similarities (American Psychiatric Association, 2013). First, they can both be characterized by binge behavior where patients consume a large amount of food (i.e. binge eating [BE]) or alcohol (i.e. binge drinking [BD]) within a short period of time, while not being able to stop eating or drinking or not being able to control the amount they eat or drink (American Psychiatric Association, 2013). Second, they can both have a significant impact on health with BN having a high mortality of 1.7 per 1000 person-years and with AUD being the largest risk factor for disease and disability among 15- to 49-year-olds (Arcelus, Mitchell, Wales, & Nielsen, 2011; Griswold et al., 2018). Third, both disorders can be difficult to treat, with up to 60% of patients who receive treatment not achieving remission (Fleury et al., 2016; Linardon & Wade, 2018). Taken together, the high impact and poor treatment outcomes highlight the need for more effective therapies for BN and AUD.

One promising new form of therapy is the just-in-time adaptive intervention (JITAI) (Nahum-Shani et al., 2018). In a JITAI, support is given ‘just-in-time’ or when a patient needs it the most. For example, a patient could report their emotions, behaviors, and context with a smartphone application, and an algorithm could evaluate the risk of BE or BD based on this information, after which an intervention could be sent out when this momentary risk is elevated. The support can also be adaptive, meaning that it can be tailored to a patient's in-the-moment needs. For instance, a patient could receive a text message alert when the estimated risk for BE or BD is moderate, but a phone call when the estimated risk is high. Because of its potential benefits, several researchers have developed and implemented JITAIs in recent years, but their results have been mixed (Carpenter, Menictas, Nahum-Shani, Wetter, & Murphy, 2020; Hardeman, Houghton, Lane, Jones, & Naughton, 2019; Wang & Miller, 2020). One reason for this could be the limited adaptive nature of these JITAIs. Namely, the research designs of these JITAIs were primarily based on previous literature (e.g. which information should be gathered from participants and how it should be evaluated), which means that the decision to send out an intervention was static and based on findings from previous studies. However, if such decisions were to be based on the actual data provided by participants, a JITAI would be more adaptive and perhaps more effective.
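The tiered support described above (a text message at moderate risk, a phone call at high risk) can be sketched as a simple decision rule. This is only an illustration: the function name, the action labels, and the two cut-offs are hypothetical and are not taken from the study or from any existing JITAI.

```python
def choose_intervention(risk, moderate=0.3, high=0.6):
    """Map an estimated momentary risk (a probability) to a support level.

    The cut-offs `moderate` and `high` are hypothetical placeholders.
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be a probability between 0 and 1")
    if risk >= high:
        return "phone_call"
    if risk >= moderate:
        return "text_message"
    return "no_action"


# Three illustrative risk estimates, one per support level.
for estimated_risk in (0.10, 0.45, 0.80):
    print(estimated_risk, choose_intervention(estimated_risk))
```

In a data-driven JITAI, the risk estimate would come from a prediction model rather than from fixed rules, which is the motivation for the models evaluated in this study.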

This goal could be realized with the help of machine learning (ML) where statistical models and algorithms learn from data without explicit instruction (Shatte, Hutchinson, & Teague, 2019). An ML model could learn when individuals are at risk of BE or BD in daily life and then subsequently predict this risk when presented with new data in a JITAI. An ML model could also determine which of the many possibly assessed variables (e.g. momentary mood, location, social context, and time) are predictive of BE or BD and which ones are not. Several ML algorithms can examine a large number of variables and select only those that are most predictive of an outcome (Cai, Luo, Wang, & Yang, 2018). This kind of information could then provide targets for the interventions in a JITAI. However, researchers are confronted with specific challenges when using ML to predict daily life behavior. Namely, they need to decide whether they want to build person-specific or group-level (pooled) prediction models. On the one hand, person-specific models are trained with the data of an individual patient (Soyster, Ashlock, & Fisher, 2021). This type of model can be built more easily and can result in more person-specific information. On the other hand, pooled prediction models are trained with the data of multiple patients (Soyster et al., 2021). This model type is more difficult to build as more patients need to be included, but can result in a better model performance, particularly if the factors driving a momentary behavior are similar across the study participants.

In recent years, several studies have built person-specific and pooled models to predict BE, alcohol use, or BD in daily life (Arend et al., 2023; Bae, Chung, Ferreira, Dey, & Suffoletto, 2018; Bae et al., 2017; Levinson, Trombley, Brosof, Williams, & Hunt, 2022; Soyster et al., 2021; Walters et al., 2021). When it comes to disordered eating behavior, one study demonstrates that pooled models can predict future occurrences of BE, restriction, and purging in patients with an eating disorder (Levinson et al., 2022). By utilizing predictors related to disordered eating cognitions and behaviors, along with affect, the study shows that dietary restriction, weighing, and anxiousness are important predictors of subsequent BE. Additionally, another study indicates that person-specific models can perform well in predicting BE in patients with a binge-type eating disorder (Arend et al., 2023). This study, using a set of variables selected from feedback from clinicians and patients, reports that hunger and craving are the most common predictors of BE. When it comes to alcohol use, studies have demonstrated that pooled models can successfully use phone sensor data to predict non-heavy alcohol use as well as BD in individuals without AUD (Bae et al., 2017, 2018). In these studies, time of day, number of activities, and phone usage emerge as the most informative predictors.
Furthermore, another study finds that both person-specific and pooled models can predict alcohol use in individuals without AUD utilizing predictors related to affect, craving, and recent alcohol use (Soyster et al., 2021). More specifically, it finds that craving and feeling pressured to drink are the most common predictors across individuals. Employing a similar set of variables, with additional predictors related to social setting and location, a different study also concludes that pooled models can predict alcohol use in individuals without AUD (Walters et al., 2021). Here, craving, the availability of alcohol, and feeling that alcohol will improve mood were the most important predictors.

However, these studies have significant limitations. First, their generalizability to a broader clinical population is often limited. This is because only a few studies include a clinical sample, and those that do include a small number of participants for whom they have only a limited number of observations. This is problematic as a small sample size can have serious methodological implications in ML (Way, Sahiner, Hadjiiski, & Chan, 2010). Indeed, most studies did not hold out data when training or tuning their ML models and therefore were not able to evaluate model performance on unseen data. This implies that their models run a serious risk of overfitting and might not generalize. Second, the majority of variables in these studies assess emotions or behaviors and do not look at the social or situational context of a patient. However, previous studies show the importance of context in BE, alcohol use, and BD (Allison & Timmerman, 2007; Clapp, Shillington, & Segars, 2009). Third, to our knowledge only one study evaluated both person-specific and pooled prediction models and did so only for alcohol use, leaving unanswered the question of which model type performs best for BE and BD (Soyster et al., 2021).

Because of these limitations, it is still unclear to what extent BE, alcohol use, and BD can be predicted in the daily lives of patients with BN and/or AUD and which variables are important predictors. This study aims to fill that gap by collecting a large amount of data from a clinical sample on a variety of variables and by building methodologically sound prediction models. We followed patients with BN and/or AUD over a period of 12 months during which we used the experience sampling method (ESM) to repeatedly assess the patients' emotions, behaviors, and contexts in daily life. We then used these data to fulfill the following two objectives. First, to build and evaluate person-specific and pooled prediction models for BE, alcohol use, and BD in daily life. Second, to identify the most important predictors of these behaviors.

Methods

Study sample

The participants were drawn from a larger ESM study that followed patients with BN and/or AUD as well as control volunteers without these diagnoses in daily life. In the current study, all the data of the patients with BN (n = 50), with AUD (n = 51) or with BN and AUD (n = 19) were used, after the elimination of one patient with BN and two patients with AUD due to insufficient data (i.e. not answering two consecutive assessments, causing the lagging procedure described below to fail). Recruitment happened in Flanders, Belgium through residential and ambulatory care centers, patient groups, universities, social media, and by handing out flyers on the street. Inclusion ran from September 2019 to February 2022. The inclusion criteria were: (1) being assigned female at birth; (2) understanding the Dutch language; (3) being aged ⩾18 years; and (4) having a BMI ⩾18.5 kg/m². It was decided to not include individuals assigned male at birth as the prevalence of BN is significantly lower in this population (Galmiche, Déchelotte, Lambert, & Tavolacci, 2019). Additional inclusion criteria for patients were: (5) meeting the criteria for BN or AUD of the Diagnostic and Statistical Manual of Mental Disorders (American Psychiatric Association, 2013); (6) meeting those diagnostic criteria for a duration of ⩽5 years. This maximum was set as the importance of certain predictors of BE, alcohol use, and BD is thought to change over the course of AUD and BN (Boness, Watts, Moeller, & Sher, 2021; Pearson, Wonderlich, & Smith, 2015). For example, it is thought that BE episodes start as a rash action during moments of high negative affect, and are reinforced by a subsequent decrease in negative affect (Pearson et al., 2015).
However, it is then thought that after repeated cycles of emotional distress, urge, and BE, the behavior becomes habitual, where it persists, even when it is no longer reinforced by a decrease in negative affect (Pearson et al., 2015). Furthermore, it is thought that the roles of negative affect and positive affect change over the course of AUD, whereby changes in positive affect are more predictive of alcohol use in the beginning of AUD, while the role of negative affect increases over the course of AUD (Koob & Le Moal, 1997). Participants with AUD also needed to display a pattern of repetitive BD according to the criteria of the National Institute on Alcohol Abuse and Alcoholism (i.e. drinking 4 units of alcohol within 2 h for women) (National Institute on Alcohol Abuse & Alcoholism [NIAAA], 2022). Participants were excluded for the following reasons: (1) major medical pathology (e.g. severe liver or kidney disease, uncontrolled diabetes, cancer or untreated hyper- or hypothyroidism); (2) chronic use of sedatives (i.e. more than three times in the past three months); (3) pregnancy; (4) presence of major psychiatric pathology (i.e. schizophrenia, autism spectrum disorder, bipolar disorder, substance use disorder other than AUD). All participants gave their written consent, and the study was approved by the ethical committee of the UZ/KU Leuven.

Study design

Potential participants were initially screened via telephone or email, after which they attended an in-person assessment. Here, a psychiatry resident confirmed an individual's eligibility to participate. The participants had their weight and height measured with a calibrated scale and stadiometer and completed clinical interviews and questionnaires. All participants underwent a briefing on the ESM questions and practiced the use of the mobile application. Then, the participants entered the ESM protocol on the first Thursday after the in-person assessment. An overview of the protocol can be seen in Fig. 1. It consisted of a repeated measurement design where seven bursts of data collection were spread out over a 12-month period. The bursts had a duration of three weeks and were separated by intervals of five weeks. During the bursts, data were only collected on Thursday, Friday, and Saturday to limit the protocol's impact on the participants. These specific days were selected to consecutively gather data on both week and weekend days. On each day of data collection, participants were required to respond to eight signals, which were sent out on a signal-contingent (i.e. semi-random) basis. The participants received 20 eurocent per answered assessment. The ESM data were initially collected with the app MobileQ (Meers, Dejonckheere, Kalokerinos, Rummens, & Kuppens, 2020). When the development of the app was discontinued in October 2020, data collection continued using m-Path (Mestdagh et al., 2022). More information about the apps can be found in the supplement (online Supplementary eMethods 1 and eTable 1).

Figure 1. Experience sampling method protocol. The protocol consisted of seven bursts of data collection which were spread out over a 12-month period. The bursts had a duration of three weeks and were separated by intervals of five weeks. During the bursts, data were only collected on Thursday, Friday, and Saturday. On a given day of data collection, participants received eight signals which were sent on a signal-contingent (i.e. semi-random) basis.
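As a quick check on the scale of this protocol, the arithmetic below derives the maximum number of assessments per participant, the total protocol length, and the maximum compensation from the figures given above. It assumes that the five-week intervals occur only between bursts (six intervals for seven bursts); the constant names are chosen for illustration.

```python
BURSTS = 7
WEEKS_PER_BURST = 3
DAYS_PER_WEEK = 3            # Thursday, Friday, Saturday
SIGNALS_PER_DAY = 8
INTERVAL_WEEKS = 5           # assumed to occur only between bursts
CENTS_PER_ASSESSMENT = 20

max_assessments = BURSTS * WEEKS_PER_BURST * DAYS_PER_WEEK * SIGNALS_PER_DAY
protocol_weeks = BURSTS * WEEKS_PER_BURST + (BURSTS - 1) * INTERVAL_WEEKS
max_payment_cents = max_assessments * CENTS_PER_ASSESSMENT

print(max_assessments)           # 504 assessments at most
print(protocol_weeks)            # 51 weeks, i.e. roughly 12 months
print(max_payment_cents / 100)   # 100.8 euro at most
```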

Measures

Baseline measures

The Structured Clinical Interview for DSM-5 (SCID-5-S) was used to confirm the diagnosis of BN or AUD and to screen for other psychiatric disorders (American Psychiatric Association [APA], 2017). BN and AUD severity were assessed using the Eating Disorder Examination Questionnaire (EDE-Q) and the Alcohol Use Disorders Identification Test (AUDIT) (Fairburn & Beglin, 1994; Saunders, Aasland, Babor, De La Fuente, & Grant, 1993).

ESM measures

At each assessment, the participants received questions evaluating different emotions, behaviors, and contexts. The exact number of items varied at each assessment as the presentation of some questions was conditional on a participant's response to a previous question. The full list of questions can be seen in Table 1. More information on the reliability and/or validity of the items can be found in the supplement (online Supplementary eMethods 2). Importantly, participants needed to indicate if they had eaten since the previous assessment. If so, they had to identify the eating event as undereating, normal eating, or overeating. Then, participants were asked if they experienced a loss of control over their eating behavior. As in previous studies, BE was defined as an episode of overeating with loss of control (Ambwani, Roche, Minnick, & Pincus, 2015). The participants were trained to interpret overeating as eating an amount of food that is definitely larger than what most people would eat under similar circumstances. Furthermore, they were instructed to interpret loss of control as wanting to stop eating, but not being able to. Similarly, participants needed to indicate whether they drank alcohol since the previous assessment and if so, how many units they drank and if they experienced a loss of control over their drinking behavior. The participants were instructed on the definition of an alcohol unit. Here, BD was defined as having consumed at least four units of alcohol since the previous assessment while alcohol use was conceptualized as having consumed at least one unit since the previous assessment.
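The outcome definitions above translate directly into three binary variables per assessment. The sketch below shows one way to code them; the function and argument names are hypothetical and do not correspond to the item names in Table 1.

```python
def code_outcomes(ate, eating_type, loss_of_control_eating, alcohol_units):
    """Derive the three binary outcomes from one assessment.

    eating_type is one of "undereating", "normal", "overeating";
    alcohol_units is the number of units reported since the previous assessment.
    """
    binge_eating = bool(ate and eating_type == "overeating" and loss_of_control_eating)
    alcohol_use = alcohol_units >= 1      # at least one unit
    binge_drinking = alcohol_units >= 4   # at least four units
    return {
        "binge_eating": binge_eating,
        "alcohol_use": alcohol_use,
        "binge_drinking": binge_drinking,
    }


example = code_outcomes(ate=True, eating_type="overeating",
                        loss_of_control_eating=True, alcohol_units=5)
print(example)
```

Under these definitions every binge drinking moment is also an alcohol use moment, because four or more units necessarily means at least one unit.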

Table 1. Experience sampling method questions

Data analysis

Data preparation

A figure providing an overview of the data preparation procedure can be found in the supplement (online Supplementary eFigure 1). Only assessments answered within 240 min of the prompt were used in the analyses. This window was chosen to include assessments which were answered later at night, where the likelihood that patients binge eat, drink alcohol, or binge drink could be higher. First, the scoring of the conditional ESM variables was corrected. A conditional ESM variable depended on a previous ESM answer (e.g. the question of how stressful an event was, was asked only if a participant answered ‘yes’ to experiencing a stressful event). The conditional ESM variables therefore included missing values when the condition was not met, which could be filled in with zeroes (i.e. indicating that past events were not stressful at all). Second, temporal variables were created that represented assessment number (i.e. 1–8), weekday (i.e. Thursday, Friday, and Saturday), time since starting participation in the study (linear, quadratic, and cubic), and cycles with periods of 12 h and 24 h (Flury & Levri, 1999). These temporal variables have been used in previous studies predicting BE and alcohol use (Arend et al., 2023; Soyster et al., 2021). Third, to account for the varying levels of COVID-19 prevention measures throughout the study, a COVID-19 stringency variable was created based on the Oxford COVID-19 Government Response Tracker (Kira et al., 2022). More information on the temporal and COVID-19 stringency variables can be found in the supplement (online Supplementary eMethods 3). This brought the total possible number of predictors to 110. However, a predictor could not be entered in the prediction models when it had a variance of zero (i.e. it always had the same response value). More specifically, for the person-specific models, the median number of predictors for BE was 97 (Q1–Q3: 90–100), while the median number of predictors for alcohol use was 96 (Q1–Q3: 90–99), and the median number of predictors for BD was 98 (Q1–Q3: 91–99). The pooled models used all predictors. Fourth, all ESM variables except for the outcome variables were lagged by one assessment, with time between assessments measuring 102 min on average. The variables could be lagged across days, but not across weeks. The temporal variables and the COVID-19 stringency variable were not lagged and therefore remained aligned in time with the outcome variables. Fifth, observations with missing values were removed from the data.
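Two of the preparation steps above, filling conditional items with zero and dropping predictors with zero variance, can be illustrated with a minimal sketch. The variable names (`stress_event`, `stress_intensity`, `alone`) and helper functions are hypothetical; the authors' own scripts are in R and are not reproduced here.

```python
def fill_conditional(rows, gate_key, item_key):
    """Set a conditional item to 0 when its gating question was answered 'no'."""
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input rows are not mutated
        if row[item_key] is None and not row[gate_key]:
            row[item_key] = 0
        filled.append(row)
    return filled


def drop_zero_variance(rows, predictors):
    """Keep only predictors that take more than one distinct value."""
    return [p for p in predictors if len({row[p] for row in rows}) > 1]


rows = [
    {"stress_event": True, "stress_intensity": 5, "alone": 1},
    {"stress_event": False, "stress_intensity": None, "alone": 1},
    {"stress_event": False, "stress_intensity": None, "alone": 1},
]
clean = fill_conditional(rows, "stress_event", "stress_intensity")
kept = drop_zero_variance(clean, ["stress_event", "stress_intensity", "alone"])
print(kept)
```

In this toy participant, `alone` never varies and is therefore dropped, which is why the number of usable predictors differed between person-specific models.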

This resulted in a dataset which could be used to predict BE, BD, and alcohol use at a certain point in time in the future, based on the temporal variables and the COVID-19 stringency variable at that timepoint as well as the ESM variables at a previous timepoint. This was done to emulate how a machine learning-based JITAI would be used to treat a patient in daily life. For example, a model could predict whether a patient who reported to experience more stress is more likely to binge eat after two hours. As lagging across days was permitted, BE, BD and alcohol use episodes which happened at night but were reported in the morning could also be predicted.
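The lagging step can be sketched as follows. The sketch assumes that each record carries a `week` identifier and a `slot` giving the position of the prompt within that week (0 to 23 for three days of eight prompts), and that an outcome is paired only with the prompt immediately before it; these field names and the exact pairing rule are assumptions made for illustration.

```python
def lag_by_one(records, predictor_keys, outcome_key):
    """Pair each outcome with the predictors reported at the preceding prompt.

    Pairs may cross a day boundary (slot 7 to slot 8) but never a week boundary.
    """
    by_position = {(r["week"], r["slot"]): r for r in records}
    lagged = []
    for r in records:
        previous = by_position.get((r["week"], r["slot"] - 1))
        if previous is None:
            continue  # first prompt of a week, or the preceding prompt was missed
        row = {key + "_lag1": previous[key] for key in predictor_keys}
        row[outcome_key] = r[outcome_key]
        row["week"], row["slot"] = r["week"], r["slot"]
        lagged.append(row)
    return lagged


records = [
    {"week": 1, "slot": 7, "stress": 6, "binge_eating": 0},  # last prompt on Thursday
    {"week": 1, "slot": 8, "stress": 2, "binge_eating": 1},  # first prompt on Friday
    {"week": 2, "slot": 0, "stress": 4, "binge_eating": 0},  # first prompt of the next week
]
lagged = lag_by_one(records, ["stress"], "binge_eating")
print(lagged)
```

Only the Friday-morning record yields a lagged row here: it is predicted from Thursday's last prompt, while the first prompt of the following week has no eligible predecessor.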

Model training and evaluation

Person-specific as well as pooled prediction models were built for BE, alcohol use, and BD. Based on the definitions outlined under ESM measures, the moments of BD were also considered moments of alcohol use. This approach was taken because a JITAI would most likely focus on either alcohol use (i.e. drinking any alcohol) or BD (i.e. drinking more than 4 units in 2 h for women), rather than non-heavy alcohol use (i.e. alcohol use that is not BD). For BE, the data of the patients with BN and the patients with BN and AUD were used (n = 69). Similarly, for alcohol use and BD, the data of the patients with AUD and the patients with BN and AUD were used (n = 70). This meant that only the data of patients who displayed the behavior were included in the analyses for a specific outcome. The models were trained and evaluated with the ensr, glmnet, pROC and caret packages in R, version 4.1.1 (DeWitt, 2019; Friedman, Hastie, & Tibshirani, 2010; Kuhn, 2021; Robin et al., 2011). The scripts and data for the analyses can be found at https://rdr.kuleuven.be/dataset.xhtml?persistentId=doi:10.48804/OBLDWE. A brief description of the elastic net wrappers which were developed for this paper can be found in the supplement (online Supplementary eMethods 4). More detailed information can be found at https://github.com/nicolasleenaerts/NLML.

Person-specific

The models were fitted and evaluated on the data of each participant with nested k-fold cross-validation. A visual representation of this method can be seen in Fig. 2. More information on nested cross-validation can be found in the supplement (online Supplementary eMethods 5). For the outer loop, a stratified 5-fold cross-validation was used. Due to the stratification, the distribution of positive events was similar across folds.
The allocation of observations to specific folds was random, meaning that the observations within each fold were not temporally contiguous. However, due to the lagging procedure described above, each instance of the dependent variables was only ever predicted by the independent variables for the immediately previous observation. The continuous variables of the training folds were standardized. This can increase performance of regression-based models and simplify comparisons between model estimates (Shahriyari, 2019). Additionally, the continuous variables from the test fold were also standardized, but with the mean and standard deviation from the training folds. This separated procedure transforms the testing data to the same scale as the training data but prevents any information from leaking.
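The two ideas in this paragraph, stratified random fold assignment and standardization that uses only training statistics, can be sketched with the standard library. The study used caret in R; the functions below are illustrative stand-ins and their names are hypothetical. The sample standard deviation is used here, which is an assumption.

```python
import random
from statistics import mean, stdev


def stratified_folds(labels, k=5, seed=0):
    """Randomly assign indices to k folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def standardize(train_values, test_values):
    """Scale both sets with the mean and SD of the training values only."""
    mu, sd = mean(train_values), stdev(train_values)
    scale = lambda v: (v - mu) / sd
    return [scale(v) for v in train_values], [scale(v) for v in test_values]


labels = [1] * 10 + [0] * 40          # 10 events among 50 observations
folds = stratified_folds(labels)
train_z, test_z = standardize([1.0, 2.0, 3.0], [4.0])
print([sum(labels[i] for i in fold) for fold in folds], test_z)
```

Because the test value is scaled with the training mean and standard deviation, nothing about the test fold influences the transformation, which is the leakage protection described above. A zero standard deviation would cause a division error, which is one reason zero-variance predictors have to be removed first.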

Figure 2. Nested cross-validation. In the outer loop, the total dataset was divided into five folds which were processed in five rounds. During each round, one fold was used as a test set while the other four folds were used as a training set. In the inner loop, the training folds were used to select the optimal alpha and lambda and to fit the elastic net model. A grid search of 10 alphas and 100 lambdas was performed with a 10-fold cross-validation. The combination with the lowest cross-validation error was used to fit the definitive elastic net model. This model was then evaluated on the test fold of the outer loop.

For the inner loop, an elastic net regularized regression model was fitted to the training folds of the outer loop (Zou & Hastie, 2005). Though other machine learning techniques exist, elastic net was specifically chosen as the machine learning algorithm as it is especially suited for the objectives of the current manuscript. This is because it is a performant machine learning technique that reduces overfitting, can work with high-dimensional data (i.e. more predictors than observations), handles correlated variables, but also provides information on the strength and nature of the relation between a predictor and an outcome. It combines two regularization methods: ridge regression, which shrinks model estimates, and LASSO regression, which removes variables that do not contribute to the model. The balance between ridge and LASSO regression is expressed by a variable alpha which varies from 0 (exclusively ridge regression) to 1 (exclusively LASSO). The strength of the regularization is defined by a variable lambda with higher values leading to more shrinkage of the coefficients. The optimal alpha and lambda were selected with a grid search of 10 alphas and 100 lambdas (i.e. the default settings of ensr). For each possible combination, a cross-validation error was calculated with 10-fold cross-validation. The combination with the lowest cross-validation error was then used to fit the definitive elastic net model on the training folds of the outer loop. This elastic net model was then used to predict BE, BD or alcohol use in the data of the test fold of the outer loop. The predictions were then compared with the actual BE, BD, and alcohol use events in the test fold to calculate the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, accuracy, positive predictive value (PPV) and negative predictive value (NPV).
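For reference, the roles of alpha and lambda can be made explicit with the penalized objective that glmnet documents for a binary outcome. The notation is introduced here for illustration and does not appear in the article: N is the number of observations, y_i the binary outcome, x_i the predictor vector, beta_0 the intercept, and beta the coefficient vector.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
   - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
\;+\; \lambda\Big[\, \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
   + \alpha\,\lVert\beta\rVert_1 \Big]
```

Setting alpha to 0 leaves only the squared (ridge) term, setting it to 1 leaves only the absolute-value (LASSO) term, and lambda scales the whole penalty, which matches the verbal description above.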
Due to the nested cross-validation, a participant needed to have a sufficient number of BE, BD or alcohol use events (n > 4) to be included in the analysis.

Pooled

The pooled models were also fitted and evaluated with nested k-fold cross-validation. However, they were trained on the pooled training data of all the participants and tested on the individual test data of each participant. More specifically, in the outer loop, the training folds were a combination of the standardized training folds of the participants. Due to the standardization at a participant-level, the values of the continuous variables represented a deviation from the within-person means. As no multilevel variant of elastic net regularized regression exists, this accounted in part for the within-person nesting of the data (Soyster et al., 2021). In the inner loop, the optimal alpha and lambda were again determined with 10-fold cross-validation and used to fit the final elastic net model. This model was then applied to the test fold of every individual participant to evaluate the AUC, sensitivity, specificity, accuracy, PPV, and NPV of the pooled model for each participant. As the training folds were pooled, a participant only needed to have one BE, BD, or alcohol use event in the test fold to be included in the analyses.
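The evaluation metrics named above can be computed from a test fold as in the sketch below. The study used pROC and caret in R; this standard-library version is only an illustration, and the 0.5 classification threshold in the example is an assumption, since the text does not state how predicted probabilities were converted to classes.

```python
def safe_div(numerator, denominator):
    """Return None instead of failing when a metric is undefined."""
    return numerator / denominator if denominator else None


def confusion_metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "accuracy": safe_div(tp + tn, len(y_true)),
        "ppv": safe_div(tp, tp + fp),
        "npv": safe_div(tn, tn + fn),
    }


def auc(y_true, scores):
    """Probability that a random event is scored above a random non-event."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return None  # undefined without at least one event and one non-event
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical threshold
metrics = confusion_metrics(y_true, y_pred)
print(auc(y_true, scores), metrics)
```

The `None` returned by `auc` when a fold contains no events reflects why a participant needed at least one event in the test fold to be evaluated.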

Model comparison

For each participant and each outcome, the AUC of the pooled models was subtracted from the AUC of the person-specific models. Then, to explore why some participants had a better performance with the person-specific model than with the pooled model, this difference in AUC was compared between the different analysis groups with Mann–Whitney U tests and correlated to age, BMI, EDE-Q scores, AUDIT scores, BE frequency, and BD frequency with Spearman correlations. Non-parametric tests were performed due to the non-normal distribution of the AUCs.
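The comparison can be illustrated with a small sketch that computes the per-participant difference (person-specific AUC minus pooled AUC) and a Spearman correlation with a participant characteristic. All numbers below are invented for illustration, and the helper functions are hypothetical stand-ins for the R routines used in the study.

```python
from statistics import mean


def ranks(values):
    """Ranks starting at 1, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


person_auc = [0.60, 0.72, 0.55, 0.81]   # invented values
pooled_auc = [0.70, 0.70, 0.66, 0.85]   # invented values
be_frequency = [3, 9, 2, 5]             # invented values
auc_difference = [p - q for p, q in zip(person_auc, pooled_auc)]
rho = spearman(auc_difference, be_frequency)
print(auc_difference, rho)
```

A positive difference means the person-specific model did better for that participant; a rank-based correlation is used because it does not assume normally distributed AUCs.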

Validation analyses

First, to assess whether patients were more likely to binge eat, drink alcohol, or binge drink on certain days of the week (e.g. Thursday, Friday, and Saturday), generalized linear mixed models were constructed for each outcome of interest (i.e. binge eating, alcohol use, and binge drinking) with the aggregated data of all participants, with day of the week as a main effect and with a random intercept for the participants. Second, as there was an imbalance in the outcomes whereby patients typically did not display BE, alcohol use or BD, several analyses were performed to assess the validity of the results. To begin, the results of the pooled and person-specific models were compared to those of models that always predict the majority class (i.e. not BE, not drinking alcohol, not BD). Additionally, the results of the person-specific and pooled models were compared to those of models where the imbalance in the outcomes was corrected with ROSE or SMOTE. Third, as some participants changed apps over the course of the study, the impact of app type on model performance was explored. More specifically, person-specific and pooled models were constructed with an additional variable indicating whether MobileQ or m-Path was used, after which their performance was compared to that of the original models. Additionally, the AUC of the original person-specific and pooled models was correlated to the percentage of observations that a participant reported through m-Path. Spearman correlations were performed due to the non-normal distribution of the AUCs. More information on these validation analyses can be found in the supplement (online Supplementary eResults 3–5).
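To illustrate why a majority-class comparison is informative when outcomes are imbalanced, the following minimal Python sketch evaluates a model that always predicts the most frequent class. The function name and the example participant (events in 14 of 100 observations) are hypothetical and not taken from the study data.

```python
from typing import Dict, List


def majority_class_baseline(y_true: List[int]) -> Dict[str, float]:
    """Metrics of a model that always predicts the most frequent class (1 = event)."""
    n = len(y_true)
    n_events = sum(y_true)
    majority = 1 if 2 * n_events > n else 0
    y_pred = [majority] * n

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, n),
        "sensitivity": ratio(tp, n_events),
        "specificity": ratio(tn, n - n_events),
    }


if __name__ == "__main__":
    # Hypothetical participant: 14 events in 100 observations.
    y_true = [1] * 14 + [0] * 86
    print(majority_class_baseline(y_true))
```

In this example the baseline reaches an accuracy of 0.86 while detecting no events at all (sensitivity of 0), so accuracy alone cannot show whether a model is useful for rare outcomes.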

Model predictors

For each outcome and each model type, the 10% best predictors were identified. This was based on the raw estimates for the pooled model (as only one estimate per variable existed) and the mean estimates over all participants for the person-specific models.
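The selection of the top 10% of predictors can be sketched in Python as follows. Several details are assumptions made for illustration only: predictors are ranked by the absolute size of their estimate, the number of selected predictors is rounded up, and the predictor names and values are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def mean_estimates(per_person: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each predictor's estimate over all person-specific models."""
    names = per_person[0].keys()
    return {name: mean(model[name] for model in per_person) for name in names}


def top_predictors(estimates: Dict[str, float], percent: int = 10) -> List[str]:
    """Names of the top `percent` of predictors, ranked by absolute estimate
    (ranking by absolute value is an assumption of this sketch)."""
    # Integer ceiling of len * percent / 100, with at least one predictor kept.
    k = max(1, (len(estimates) * percent + 99) // 100)
    ranked = sorted(estimates, key=lambda name: abs(estimates[name]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical estimates from two person-specific models.
    person_models = [
        {"craving": 0.9, "negative_affect": 0.2, "hour_of_day": -0.4},
        {"craving": 0.5, "negative_affect": 0.4, "hour_of_day": -1.0},
    ]
    averaged = mean_estimates(person_models)
    print(averaged)
    print(top_predictors(averaged, percent=10))
```

For a pooled model, `top_predictors` would be applied directly to the single set of estimates, without the averaging step.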

Results

Sample characteristics

The characteristics of the different patient groups can be found in Table 2. Additionally, the characteristics of the different analysis groups (i.e. for BE or BD/alcohol use) can be found in the supplement (online Supplementary eTable 2). Notably, the age of the patients with BN/AUD (mean = 20.4, s.d. = 1.7, CI 19.6–21.2) was lower than that of the patients with BN (mean = 22.4, s.d. = 4.1, CI 21.3–23.6). Also, the BMI of the patients with BN (mean = 25.6, s.d. = 5.9, CI 23.9–27.3) was higher than that of the patients with AUD (mean = 21.5, s.d. = 3.5, CI 20.5–22.4).

Table 2. Sample characteristics

Abbreviations: ADHD, attention deficit hyperactivity disorder; AP, agoraphobia; AUD, alcohol use disorder; BMI, body mass index; BN, bulimia nervosa; CI, confidence interval; EDE-Q, Eating Disorder Examination Questionnaire; MDD, major depressive disorder; N, number; PD, panic disorder; PTSD, post-traumatic stress disorder; SAD, social anxiety disorder; SD, standard deviation.

Data characteristics

In total, 41 (34.2%) participants (16 (32.0%) BN, 19 (37.3%) AUD, 6 (31.6%) BN/AUD) dropped out of the study before the end of the ESM protocol. For every participant group (AUD, BN, BN/AUD), there was no significant difference between patients who dropped out and those who did not when it came to age, BMI, illness duration, AUDIT scores, or EDE-Q scores. The mean compliance (percentage of signals answered) per participant during the first burst was 80.4% for the patients with BN, 75.2% for the patients with AUD and 73.6% for the patients with BN/AUD. This is similar to the compliance rates of previous cross-sectional ESM studies in patients with an eating disorder or AUD (Fischer, Wonderlich, Breithaupt, Byrne, & Engel, 2018; Jones et al., 2019; Schaefer et al., 2020). In total, the patients with BN answered 12 932 (61.5%) of their scheduled beeps, while the patients with AUD answered 12 328 (62.9%) and the patients with BN/AUD answered 3947 (51.2%). The overall compliance of this study fell in the range of the lengthier ESM studies on substance use (Jones et al., 2019). More information on the reasons for dropout and the compliance per burst can be found in the supplement (online Supplementary eResults 1 and eTable 3). There was an imbalance in the outcomes: the median percentage of BE episodes that patients with BN and BN/AUD experienced was 14% (Q1–Q3: 6–20%), the median percentage of alcohol use episodes that patients with AUD and BN/AUD experienced was 14% (Q1–Q3: 8–18%), and the median percentage of BD episodes that patients with AUD and BN/AUD experienced was 4% (Q1–Q3: 2–7%).

Model performance

The performance metrics had a skewed distribution within and between participants. Therefore, the median across folds and across participants was used to describe them. An extended overview can be found in Table 3. A visual summary can be seen in Fig. 3. The confusion matrices of the predictions can be found in the supplement (online Supplementary eTables 4–9).

Table 3. Model performance

Abbreviations: AUC, area under the curve; CV, cross-validation; N, number of participants with a successful model; NPV, negative predictive value; PPV, positive predictive value.

The performance metrics had a skewed distribution within and between participants. Therefore, they are best described by the median across folds and participants. For comparison, the results after taking the mean across folds are also presented.

Figure 3. Model performance. Performance of the person-specific and pooled prediction models for binge eating, alcohol use, and binge drinking. Due to a skewed distribution of the performance metrics within participants, the median across folds was taken for the area under the curve.

Binge eating

A person-specific model could be fitted and evaluated for 48 (69.6%) participants. The performance of the person-specific models was poor with a median AUC of 0.61 (Q1:0.53; Q3:0.73), sensitivity of 0.83 (Q1:0.67; Q3:1.00), specificity of 0.71 (Q1:0.56; Q3:0.78), PPV of 0.31 (Q1:0.22; Q3:0.43), and NPV of 0.97 (Q1:0.91; Q3:1.00). The pooled model could be evaluated on 66 (95.7%) participants. Its performance was adequate with a median AUC of 0.71 (Q1:0.60; Q3:0.78), sensitivity of 1.00 (Q1:0.75; Q3:1.00), specificity of 0.75 (Q1:0.60; Q3:0.86), PPV of 0.33 (Q1:0.23; Q3:0.50), and NPV of 1.00 (Q1:0.94; Q3:1.00).

Alcohol use

There were 43 (61.4%) participants with a person-specific model. The performance of these models was good with a median AUC of 0.80 (Q1:0.72; Q3:0.89), sensitivity of 1.00 (Q1:0.79; Q3:1.00), specificity of 0.80 (Q1:0.75; Q3:0.88), PPV of 0.38 (Q1:0.28; Q3:0.57), and NPV of 1.00 (Q1:0.96; Q3:1.00). The pooled model could be evaluated on 63 (90.0%) participants. It had an outstanding performance with a median AUC of 0.90 (Q1:0.83; Q3:0.96), sensitivity of 1.00 (Q1:1.00; Q3:1.00), specificity of 0.88 (Q1:0.80; Q3:0.96), PPV of 0.50 (Q1:0.37; Q3:0.75), and NPV of 1.00 (Q1:1.00; Q3:1.00).

Binge drinking

A person-specific model could be fitted and evaluated for 13 (18.6%) participants. The performance of the person-specific models was good with a median AUC of 0.85 (Q1:0.71; Q3:0.93), sensitivity of 1.00 (Q1:0.75; Q3:1.00), specificity of 0.90 (Q1:0.78; Q3:0.96), PPV of 0.28 (Q1:0.18; Q3:0.50), and NPV of 1.00 (Q1:0.98; Q3:1.00). The pooled model could be evaluated on 49 (70.0%) participants. Its performance was outstanding with a median AUC of 0.93 (Q1:0.87; Q3:0.98), sensitivity of 1.00 (Q1:1.00; Q3:1.00), specificity of 0.93 (Q1:0.84; Q3:0.99), PPV of 0.50 (Q1:0.20; Q3:0.83), and NPV of 1.00 (Q1:1.00; Q3:1.00).