Introduction
Approximately 50% of UK adults live with at least one long-term health condition (LTC) (Office for National Statistics, 2022). Adjusting to an LTC can be challenging given the burden of symptoms and treatments, physical disability, loss of independence, and reduced quality of life. Unsurprisingly, people living with LTCs are 2–3 times more likely to have anxiety and depression than those without (McDaid, Knapp, Fossey, & Galea, 2012). Yet many of these patients do not have access to psychological support for their illness (Diabetes UK, 2019; Ellison, Gask, Bakerly, & Roberts, 2012; IBD UK, 2021; Ponzio, Tacchino, Zaratin, Vaccaro, & Battaglia, 2015; Schwarz, Schmidt, Bobek, & Ladurner, 2022). When treatment is offered, it is often targeted at depression and anxiety as a primary mental health condition rather than the unique LTC stressors that can lead to illness-related distress.
The transdiagnostic model of adjustment in LTCs (TMA-LTC) (Carroll, Moon, Hudson, Hulme, & Moss-Morris, 2022) suggests that poor psychological adjustment to an LTC, or LTC-related distress, results in part from unique illness-specific stressors (e.g. stigma, symptom and treatment management, uncertainty about the future) which are distinct from primary mental health risk factors such as low self-esteem or global hopelessness. Though different LTCs have specific sets of stressors and self-management demands, there are core transdiagnostic mechanisms underlying psychological adjustment and LTC distress. Helping people manage these illness stressors should be central to psychological therapy for people with LTC-related distress. Being able to distinguish LTC distress from a primary mental health disorder is an important first step in ensuring LTC patients get the correct psychological support (Carroll, Moss-Morris, Hulme, & Hudson, 2021).
The Patient Health Questionnaire (PHQ-9) (Kroenke, Spitzer, & Williams, 2001) and Generalized Anxiety Disorder Questionnaire (GAD-7) (Spitzer, Kroenke, & Williams, 2006) are commonly used to screen patients for possible mental health disorders. Whilst these measures have excellent psychometric properties and well-validated cut points for clinical caseness, they have limitations in screening for LTC distress. First, negative emotions associated with poor LTC adjustment extend beyond anxiety and depression, including feelings of anger, guilt, embarrassment, and shame (Ayers & Steptoe, 2007; Browne, Ventura, Mosely, & Speight, 2013; Kreider, 2017). Second, patients distressed by their illness may score subthreshold on traditional measures of anxiety and depression (Geraghty & Esmail, 2016; Katon & Roy-Byrne, 1991), because these measures inadequately capture LTC distress. Third, relating adjustment to diagnostic levels of anxiety and depression may unnecessarily pathologize the negative emotions resulting from objectively challenging illness-related stressors (Hudson & Moss-Morris, 2019). Finally, some anxiety/depression symptoms are common symptoms of LTCs (e.g. fatigue, sleep disturbances), obscuring the unique distress experienced due to poor adjustment.
Therefore, there is a need to measure LTC-related distress to aid clinical decision-making. In LTC care, Distress Thermometers alongside Problem Lists are sometimes used; however, these have some important psychometric limitations. Distress Thermometers and other single-item measures inadequately capture complex psychological constructs (Allen, Iliescu, & Greiff, 2022; Cuvillier, Léger, & Sénécal, 2021; Stewart-Knight, Parry, Abey, & Seymour, 2012). Problem Lists have clinical utility in identifying sources of distress, but they do not measure the severity of distress or allow comparisons across conditions. Generic psychological distress measures such as the Kessler K-10 scale (Kessler et al., 2002) and Patient Health Questionnaire Anxiety and Depression Scale (PHQ-ADS) (Kroenke et al., 2016) effectively assess the severity of distress; however, they do not differentiate whether the distress is related to an individual's LTC, an unrelated mental health disorder, or other non-LTC life stressors. Conversely, illness-specific distress measures exist for some LTCs (e.g. inflammatory bowel disease [IBD]; Dibley et al., 2018; and diabetes; Fisher, Glasgow, Mullan, Skaff, & Polonsky, 2008). However, there is no transdiagnostic measure of illness-specific distress that can be used across various LTC populations. A transdiagnostic measure has greater utility in primary care or mental health services that are not specialized to particular LTCs, while minimizing administrative burden.
For instance, in UK Talking Therapies services, healthcare professionals report having low confidence in determining whether an LTC treatment is appropriate and wanting additional tools and skills to assess and treat these patients (Carroll et al., 2021). Therefore, a transdiagnostic measure of illness-related distress (IRD) could be used alongside more traditional measures of distress, anxiety, and depression to signal whether a primary mental health or LTC adjustment protocol should be used (Carroll et al., 2022; Jenkinson, Hudson, Moss-Morris, & Hackett, in prep.). Moreover, as multimorbidity is increasingly common, estimated to affect over 50% of the UK and US populations (Fleetwood et al., 2025; Head et al., 2021; Knies & Kumari, 2022; Mossadeghi et al., 2023), and appears to confer additional risk of distress (Fleetwood et al., 2025; Read, Sharpe, Modini, & Dear, 2017), a transdiagnostic measure would be better placed to capture the additive impact of multiple health concerns. Furthermore, it would cater for rarer conditions and would allow comparison of LTC distress across conditions in both clinical and research settings.
The primary aims of the current study were to develop a novel, concise transdiagnostic measure of illness-related distress (IRD) with good face validity and to assess the factor structure and the minimal number of best-fit items, convergent validity, internal consistency, and test–retest reliability of the scale. A secondary aim was to explore clinical cut points of the scale using Receiver Operating Characteristic (ROC) analyses to guide clinical decision-making and treatment assessment.
Methods
The study was registered on clinicaltrials.gov (NCT06072287). Ethics approval was obtained from the King’s College London Health Faculties Research Ethics Subcommittee on July 13, 2023 (HR/DP-22/23-36320). All participants provided informed consent.
Procedures and recruitment
Eligibility criteria were: self-reporting an LTC; being UK-based; being aged ≥18 years; having an email address; and having English proficiency. Participants were excluded if they only reported psychological or mental disorders.
We conducted convenience sampling via social media and charity website advertisements (Supplement 1 of the Supplementary Material). Links directed participants to the information sheet, followed by eligibility screening, consent, and the baseline questionnaire. To assess test–retest reliability, 1 week later, respondents were emailed a link to complete a follow-up questionnaire (IRD scale only).
Measures
The Illness-Related Distress (IRD) Scale
Several pieces of formative research, summarized in Table 1, shaped the initial selection of items for the IRD scale, with a focus on ensuring good face validity of the items.
Table 1. Summary of methods used to develop the initial 28-item pool of the IRD scale

Note: HCPs, healthcare professionals; IBD, inflammatory bowel disease; LTC, long-term condition; MS, multiple sclerosis; PPI, patient and public involvement; TMA-LTC, Transdiagnostic Model of Adjustment in Long-Term Conditions.
A preliminary 28-item scale was tested in the current study (Supplement 3 of the Supplementary Material). Respondents reported the frequency with which they had experienced each item during the past 2 weeks. Items were scored on a five-point Likert-type response scale from 0 (‘Never’) to 4 (‘Always’); five items were reverse scored.
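As a concrete illustration, sum-scoring a scale of this kind can be sketched as follows. The reversed item positions used here are hypothetical; the paper does not specify which five items are reverse scored.

```python
MAX_SCORE = 4  # items scored 0 ("Never") to 4 ("Always")
REVERSED = {2, 7, 14, 21, 26}  # hypothetical 0-based indices of the reversed items

def score_ird(responses):
    """Sum-score a list of 28 item responses, flipping reverse-scored items."""
    total = 0
    for i, r in enumerate(responses):
        total += (MAX_SCORE - r) if i in REVERSED else r
    return total
```

For example, a respondent answering "Never" (0) to every item would still score 20 here, because the five reversed items each contribute 4 after flipping.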
A slider item was included as a validity check, whereby participants rated the source of their distress, ranging from ‘entirely due to other life stressors’ (0%) to ‘entirely due to their LTC’ (100%). Respondents could select N/A if they did not feel distressed.
Demographics
At baseline, respondents provided their age, gender, ethnicity, level of education, employment status, and LTC diagnoses. LTC response options were determined via gold-standard studies in LTCs and the National Institute for Health and Care Excellence (NICE) guidelines (Coulter et al., 2015; NICE, 2024).
Assessment of validity
Self-report measures to assess the validity of the IRD scale were informed by the COSMIN Taxonomy of Measurement Properties (Mokkink et al., 2010). Measures were selected to maximize relevance to our transdiagnostic populations while minimizing participant burden.
To assess convergent validity, we measured:
- Psychological distress, depression, and anxiety: The PHQ Anxiety and Depression Scale (PHQ-ADS) (Kroenke et al., 2016) combines the Patient Health Questionnaire-8 (PHQ-8) (Kroenke et al., 2009) for depression and the Generalized Anxiety Disorder-7 (GAD-7) (Spitzer et al., 2006) scale for anxiety to create an overall measure of psychological distress. All measures were responded to on a four-point Likert scale (0–3) and utilized sum scores. Higher scores indicate greater levels of distress/depression/anxiety. Here, Cronbach’s $\alpha_{PHQ-ADS}$ = 0.92; $\alpha_{PHQ}$ = 0.85; $\alpha_{GAD}$ = 0.91.
- Illness-specific distress: The Diabetes Distress Scale (DDS) is a 17-item diabetes-related distress questionnaire (Fisher et al., 2008) ($\alpha_{DDS\ baseline}$ = 0.95). The 28-item IBD Distress Scale (IBD-DS) measures distress in IBD (Dibley et al., 2018) ($\alpha_{IBD\text{-}DS\ baseline}$ = 0.93). Higher total scores on each measure indicate increased distress. Only participants who identified as having diabetes or IBD completed the DDS or IBD-DS, respectively.
- Functional impairment: The Work and Social Adjustment Scale (WSAS) (Mundt, Marks, Shear, & Greist, 2002) measures overall impairment in everyday life using five items. Higher total summed scores indicate greater impairments in functioning ($\alpha_{WSAS}$ = 0.88).
- Cognitive and behavioral responses to symptoms: The Cognitive and Behavioral Responses to Symptoms Questionnaire (CBRQ) (Picariello, Chilcot, Chalder, Herdman, & Moss-Morris, 2023) has 40 items with seven subscales: five cognitive (Fear Avoidance, Catastrophizing, Damage Beliefs, Embarrassment Avoidance, and Symptom Focusing) and two behavioral (All-or-Nothing Behavior, Avoidance/Resting Behavior) subscales. Higher summed scores indicate a stronger presence of the specific cognitive/behavioral response ($\alpha_{CBRQ\ subscale\ range}$ = 0.80–0.91).
Readability
Readability was assessed with the Flesch Reading Ease score, which yields a score out of 100 together with the reading age for which the material is appropriate.
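The Flesch Reading Ease score is computed from word, sentence, and syllable counts. The sketch below uses a crude vowel-group heuristic for syllables, so its output is approximate; dedicated readability tools use better syllable dictionaries.

```python
import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels (minimum 1 per word).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))
```

Higher scores indicate easier text; scores above roughly 60 are generally readable by a typical adolescent.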
Statistical analyses
Data were analyzed between March 8, 2024 and March 14, 2025 (available in online repository: https://osf.io/gnwe6/).
Step 1: Characteristics of samples
The sample was randomly split into two groups using a random number generator (Microsoft Excel v2402) to allow for an initial exploratory factor analysis (EFA) in one sample followed by a confirmatory factor analysis (CFA) in the other. The rationale for using both methods is described in the next sections. Descriptive statistics were performed in STATA v18.0 for the total sample and the subgroups.
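A random split of this kind can be sketched as below. This is an illustrative helper, not the study's procedure (the study used a random number generator in Microsoft Excel); assigning each participant by an independent coin flip yields only approximately equal groups, consistent with the study's 698/700 split.

```python
import random

def split_sample(ids, seed=0):
    """Assign each participant ID to one of two subsamples by a random
    draw, so group sizes are approximately (not exactly) equal."""
    rng = random.Random(seed)
    s1, s2 = [], []
    for pid in ids:
        (s1 if rng.random() < 0.5 else s2).append(pid)
    return s1, s2
```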
Step 2: Factor analysis
Factor analysis was used to assess the factor structure and best-fit items of the IRD. Unless otherwise specified, steps 2.1 (EFA), 2.2 (CFA/ESEM), and 4 (invariance testing) of the analysis were conducted using MPlus V7.4. Maximum likelihood estimation with robust standard errors (MLR) was used to treat missing data and account for non-normality. Given the five response categories, the ordinal data from each item were treated as continuous (Rhemtulla, Brosseau-Liard, & Savalei, 2012).
For best practice, exploratory and confirmatory models should be conducted in different samples to reduce bias and the risk of overfitting. The EFA was conducted in Sample 1 (n = 698). All model fitting (Step 2.2) was conducted with Sample 2 (n = 700). The sample-to-item ratio exceeded the recommended 10:1 (Costello & Osborne, 2019) (Sample 1 ratio: 24.9:1; Sample 2 ratio: 25:1).
Step 2.1: Exploratory factor analysis (EFA)
EFA was used to (1) reduce the item pool and (2) determine the general factor structure (Rhemtulla et al., 2012). Factors were examined for item loadings and eigenvalues. Factors with eigenvalues ≥1 and a minimum of three items per factor were retained (Costello & Osborne, 2019). This approach, although seen as excessively liberal (Cliff, 1988; Horn, 1965), was chosen due to the exploratory nature of this scale development. Initially, items with primary factor loadings of ≤0.4 were eliminated, and those ≤0.45 were investigated further. Items were removed if cross-loadings between primary and secondary factors were <0.15. Factors in the final EFA, selected based on eigenvalues, root mean square error of approximation (RMSEA; values of 0.01, 0.05, and 0.08 indicate excellent, good, and mediocre fit, respectively) (Xia & Yang, 2019), and factor loadings, in conjunction with theory and previous evidence, were used in subsequent CFA and ESEM analyses. Importantly, we did not let the eigenvalue ≥ 1 rule dictate our final model, and we ran a parallel analysis to further guide our decisions. Face and construct validity were prioritized rather than pre-emptively restricting models in our item/factor reduction steps.
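The logic of parallel analysis can be sketched as follows: the eigenvalues of the observed correlation matrix are compared against the mean eigenvalues of random data of the same dimensions, and only factors whose observed eigenvalue exceeds its random counterpart are retained. This is a minimal numpy sketch, not the O'Connor (2000) implementation used in the study.

```python
import numpy as np

def parallel_analysis(data, n_random=100, seed=0):
    """Return the suggested number of factors plus observed and mean
    random eigenvalues, both sorted in descending order."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    rand = np.zeros(p)
    for _ in range(n_random):
        sim = rng.standard_normal((n, p))
        rand += np.sort(np.linalg.eigvalsh(np.corrcoef(sim, rowvar=False)))[::-1]
    rand /= n_random
    return int(np.sum(obs > rand)), obs, rand
```

Random data still produce leading eigenvalues above 1 by chance, which is why parallel analysis is considered more robust than the eigenvalue ≥ 1 rule.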
Step 2.2: Model fitting
ESEM is an integrative approach that balances the strictness of CFA and the adaptability of EFA (Marsh et al., 2014). CFAs often use overly restrictive models in which cross-loadings between items and non-target factors are fixed at zero. Consequently, CFAs may not always yield a good model fit or assist in the theoretical interpretation of multidimensional constructs, particularly those with multiple factors (Brown, Barker, & Rahman, 2022; Dicke et al., 2018; Marsh, Hau, & Grayson, 2005; Morin, Arens, Tran, & Caci, 2016). This is common with psychological constructs; items rarely define a construct perfectly, due to potential association with similar constructs or sub-dimensions. ESEM incorporates an EFA measurement model using target rotation, allowing for confirmatory use by specifying a priori cross-loadings (Asparouhov & Muthén, 2009; Morin, Myers, & Lee, 2020; Morin et al., 2016). Cross-loadings for non-targeted items are set close to zero, avoiding the unnecessary restrictions of CFA. We used an analytic framework to systematically compare CFA and ESEM hierarchical models (Morin et al., 2020).
Three CFA models were investigated: (1) unidimensional, (2) correlated factors (specified by the final EFA), and (3) a bifactor model with one general factor and specific (orthogonal) factors. Two ESEM models were assessed: (1) correlated lower order factors and (2) a bifactor model. Superiority of models was decided by: (1) better model fit, (2) smaller factor correlations, (3) smaller cross-loadings, and (4) well-defined factors (Morin et al., 2020; Morin et al., 2016). Bifactor model superiority would be confirmed if there was (1) an improved fit in comparison to lower-order correlated factor models and (2) well-defined general and specific factors.
Absolute model fit was assessed with the $\chi^2$ goodness-of-fit statistic (non-significant (p > .05) values indicating good fit) and the standardized root mean square residual (SRMR; values <0.05 indicating good and <0.08 indicating acceptable fit). Both indices were deemed necessary, as the $\chi^2$ goodness-of-fit statistic typically rejects models with large sample sizes (Hooper, Coughlan, & Mullen, 2008).
Relative fit was assessed using Hu and Bentler’s (1999) two-index criterion: the comparative fit index (CFI) and Tucker–Lewis index (TLI) should be >.95, and RMSEA and SRMR should be <.06, to minimize Type I and Type II error rates. The Akaike Information Criterion (AIC; Akaike, 1987) and the Bayesian Information Criterion (BIC; Schwarz, 1978) were also used to assess relative fit, where lower values indicate improved fit.
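As a worked example of one of these indices, RMSEA can be computed directly from the $\chi^2$ statistic, its degrees of freedom, and the sample size. This is one common unscaled formulation; MLR estimation applies robust corrections, so software output (including the values reported in this study) will differ somewhat.

```python
import math

def rmsea(chi2, df, n):
    """Point estimate of RMSEA: sqrt(max(chi2 - df, 0) / (df * (n - 1))).
    Some software uses n rather than n - 1 in the denominator."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))
```

A model whose $\chi^2$ does not exceed its degrees of freedom gets an RMSEA of exactly zero, which is why RMSEA rewards parsimony as well as fit.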
The Omega (ω) program (Watkins, 2013) was used to estimate fit indices for the final bifactor models, as Cronbach’s α is limited by the assumption of equal factor loadings across all constructs for each indicator (Dunn, Baguley, & Brunsden, 2014).
Several ω coefficient variants were used to assess whether sufficient variance was accounted for by both general and specific factors to justify the selection of a hierarchical bifactor model (Rodriguez, Reise, & Haviland, 2016). We also assessed construct replicability (H), explained common variance (ECV), and the percentage of uncontaminated correlations (PUC).
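Omega-hierarchical, ECV, and PUC have closed-form expressions in terms of standardized bifactor loadings; the sketch below illustrates them (the loadings in the usage test are made up for illustration, not the study's estimates).

```python
import numpy as np

def bifactor_indices(gen, spec, resid):
    """Compute omega-hierarchical, ECV, and PUC from standardized loadings.
    gen: general-factor loadings for all items; spec: list of per-specific-
    factor loading arrays (each covering only its own items); resid:
    residual variances for all items."""
    g = np.asarray(gen, float)
    # omega_h: general-factor variance over total score variance
    omega_h = g.sum() ** 2 / (
        g.sum() ** 2 + sum(np.sum(s) ** 2 for s in spec) + np.sum(resid))
    # ECV: share of common variance attributable to the general factor
    ecv = np.sum(g ** 2) / (np.sum(g ** 2) + sum(np.sum(np.square(s)) for s in spec))
    # PUC: proportion of item pairs spanning different specific factors
    sizes = [len(s) for s in spec]
    total_pairs = len(g) * (len(g) - 1) / 2
    within = sum(k * (k - 1) / 2 for k in sizes)
    puc = (total_pairs - within) / total_pairs
    return omega_h, ecv, puc
```

For a 14-item scale with two 7-item specific factors (as in the final IRD scale), PUC is 49/91 ≈ 0.54 regardless of the loadings, which is why the paper notes PUC < .8 when arguing against a unidimensional interpretation.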
Step 3: Creating a shortened clinical IRD scale
The study aimed to create a brief questionnaire for use in clinical settings. Items were considered for removal if they: were thematically similar; had primary factor loadings ≤0.55 (in CFA analysis); had a cross-loading ≤0.20 (in EFA analysis); or correlated highly (r ≥ 0.6 in CFA analysis). Step 2.2 was repeated for the final clinical version.
Step 4: Measurement invariance
Measurement invariance was assessed across gender in the final best-fitting model for the initial and final clinical versions. We assessed (1) configural (factor structure), (2) metric (factor loadings), and (3) scalar (item intercepts) invariance to determine whether these patterns were stable across groups (Morin et al., 2016).
The models for each invariance sub-group were compared with the models arising from Steps 2 and 3, using the same goodness-of-fit indices as in Step 2. Since $\chi^2$ is sensitive to sample size, we also considered changes in CFI (ΔCFI), RMSEA (ΔRMSEA), and SRMR (ΔSRMR) for invariance decisions. Thresholds for each level of invariance were decided a priori, reflecting similar analyses (Brown et al., 2022). Cut-offs were: ΔCFI <0.010; ΔRMSEA <0.015 (Chen, 2007); and ΔSRMR <0.030 for configural and metric invariance and <0.010 for scalar invariance.
Step 5: Validity and reliability testing
Additional psychometric testing was conducted in STATA V18. Total subscale scores were calculated using weighted factor scores. Due to the expected clinical utility of the scale, unweighted totals (sum of items, irrespective of factor loading) were also calculated.
Criterion validity testing was conducted on both initial and final clinical questionnaire versions (weighted and unweighted scores). Pearson’s bivariate correlations assessed test–retest reliability and convergent validity of the latent factors from the final best-fit model (Step 2.2). Effect sizes were interpreted using standard cut-offs (Cohen, 2013). Convergent validity was supported by moderate correlations (typically |r| = 0.4–0.6), whereas strong correlations indicate construct overlap or multicollinearity. Internal consistency was assessed using both Cronbach’s α and McDonald’s ω estimates (target values >0.70) (Bland & Altman, 1997; Nunnally & Bernstein, 1978).
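Cronbach's α itself is straightforward to compute from an item-response matrix; a minimal sketch:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, k_items) array:
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    x = np.asarray(items, float)
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1).sum()
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)
```

When items are perfectly correlated, the total-score variance dominates the summed item variances and α reaches 1; uncorrelated items push α toward 0.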
Step 6: Receiver operating characteristic (ROC) analyses
ROC analyses with bootstrapped 95% confidence intervals (CI95, based on 10,000 iterations) were used to determine optimal cut-points, with equal emphasis on sensitivity and specificity (Youden, 1950). In this context, sensitivity is important because illness-related distress can result in suicidal ideation and behavior; specificity is important because clinical support is expensive and the capacity of health services is limited. The slider item served as the true class/gold standard for caseness, with participants attributing at least 50% of their psychological distress to their LTC considered as a case. ROC analyses were conducted in R version 4.1.2 (2021-11-01), using the pROC package (Robin et al., 2011).
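Selecting a cut-point with equal emphasis on sensitivity and specificity means maximizing Youden's J = sensitivity + specificity − 1. The study used pROC in R; the sketch below shows the same idea in plain Python.

```python
import numpy as np

def youden_cutpoint(scores, is_case):
    """Return the threshold maximizing Youden's J, scoring a participant
    as a 'case' when score > threshold, along with the J achieved."""
    scores = np.asarray(scores, float)
    is_case = np.asarray(is_case, bool)
    # candidate thresholds midway between adjacent observed scores
    uniq = np.unique(scores)
    cands = (uniq[:-1] + uniq[1:]) / 2
    best_t, best_j = None, -1.0
    for t in cands:
        pred = scores > t
        sens = (pred & is_case).sum() / is_case.sum()
        spec = (~pred & ~is_case).sum() / (~is_case).sum()
        j = sens + spec - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

J ranges from 0 (chance-level discrimination) to 1 (a threshold that separates cases and non-cases perfectly).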
Results
Characteristics of study samples
We recruited participants from June 28, 2023 to January 22, 2024. There were 2,114 entries to the baseline questionnaire; however, 474 were removed as they were suspected to be automated ‘bot’ submissions. Responses were deemed at high likelihood of automation if they were excessively similar (e.g. multiple submissions had identical response patterns), contained suspicious responses (e.g. numerical postcodes in a UK-based study), had faster-than-expected response times, and/or originated from non-UK IP addresses. This left 1,640 authentic entries to the baseline questionnaire. Of these, 242 entries were removed, leaving a total of 1,398 in the baseline sample (participant flow in Supplement 4 of the Supplementary Material).
Follow-up was completed by 1,240 participants (88.7% response rate), with 1,171 completing their follow-up questionnaire between 6 and 48 days after baseline (M = 11.51, SD = 7.71). Table 2 shows the demographics of respondents and the most common illnesses reported. Supplement 5 of the Supplementary Material shows a full list of the 198 ‘Other’ LTCs reported.
Table 2. Demographic and clinical characteristics of the total sample as well as exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) sub-samples

Note: COPD, chronic obstructive pulmonary disease; GAD-7, Generalized Anxiety Disorder Scale; IBD, inflammatory bowel disease; LTC, long-term condition; MI, myocardial infarction; PCOS, polycystic ovary syndrome; PHQ-ADS, Patient Health Questionnaire Anxiety and Depression Scale; PHQ-9, Patient Health Questionnaire-9; TIA, transient ischemic attack.
Exploratory factor analysis
The initial EFA with 28 items yielded four factors with eigenvalues >1. Additionally, we performed a parallel analysis (O’Connor, 2000) with 5,000 random samples, as this is considered more statistically robust. This similarly suggested four factors be retained ($\lambda_1$: 12.95 vs. $\lambda_{1\,random\ sample}$ = 1.39; $\lambda_2$: 5.52 vs. $\lambda_{2\,random\ sample}$ = 1.33; $\lambda_3$: 2.36 vs. $\lambda_{3\,random\ sample}$ = 1.29; $\lambda_4$: 1.49 vs. $\lambda_{4\,random\ sample}$ = 1.26). Variance explained by each factor was 46.2%, 19.7%, 8.3%, and 5.3%, respectively. One factor was removed as it contained only two items, and another was investigated further as it contained only three items.
In subsequent EFAs, items were removed if primary loadings were <|.40| or if cross-loadings were within 0.15 of primary loadings. Consequently, the final EFA used 23 of the original 28 items and had three factors with eigenvalues >1 ($\lambda_1$: 11.48; $\lambda_2$: 1.34; $\lambda_3$: 1.04); parallel analysis suggested two factors ($\lambda_{1\,random\ sample}$ = 1.34; $\lambda_{2\,random\ sample}$ = 1.28; $\lambda_{3\,random\ sample}$ = 1.24). Taken together with factor inspection, this suggested the three-factor model was inappropriate, as the third factor contained only reverse-coded items, a common phenomenon in scale development and psychometrics (Salazar, 2015). Thus, the two-factor EFA model was selected: $\chi^2(208)$ = 932.32, p < .001, RMSEA = .086, 90% CI [.081, .090], RMSR = .052. The factors were named (1) intrapersonal distress (14 items; 48.29% of variance) and (2) interpersonal distress (9 items; 19.75% of variance); the factors were strongly correlated (r = 0.811, p < 0.001) (primary rotated loadings in Supplement 7 of the Supplementary Material).
Model fitting (CFA and ESEM)
For the $\mathrm{IRD}_{initial}$, the lower order CFA model was most appropriate. Factor loadings and bifactor fit indices indicated that a ‘g’ factor was inappropriate. The lower order CFA was preferred over the lower order ESEM given that the former was more parsimonious, with only minimal differences in fit and factor loadings compared with the latter (Supplement 8 of the Supplementary Material). We subsequently removed nine items as described in Step 3 to create the $\mathrm{IRD}_{final}$, with seven items defining each factor. Factors remained strongly and positively correlated in this model (r = 0.774, p < 0.001). All testing (Steps 4–6) was performed on both $\mathrm{IRD}_{initial}$ and $\mathrm{IRD}_{final}$. For brevity, $\mathrm{IRD}_{initial}$ results are presented in Supplement 8 of the Supplementary Material; all results presented below pertain to the $\mathrm{IRD}_{final}$.
Amongst CFA models, the bifactor model demonstrated the lowest values for $\chi^2$, RMSEA, SRMR, AIC, and BIC and the highest CFI and TLI (Table 3), but poorer model fit with respect to standardized item loadings on each factor (Supplement 9 of the Supplementary Material). We rejected a unidimensional model as, although factor loadings were adequate, the two-factor model had good-to-excellent fit and demonstrated face validity (Supplement 9 of the Supplementary Material).
In ESEM models, the bifactor model demonstrated superior fit indices (Table 3). Items loaded well onto a general factor in the bifactor model; however, loadings onto both specific factors were inadequate/inconsistent, suggesting a non-hierarchical model was more appropriate (Supplement 10 of the Supplementary Material). In additional calculations, both bifactor models supported the rejection of hierarchical CFA and ESEM models (Supplement 11 of the Supplementary Material). Specific factors (labeled intrapersonal distress and interpersonal distress) demonstrated high internal reliability (acceptable cut-off $\omega_S$ > .70), supporting scale multidimensionality. In the hierarchical model, the general factor explained a large proportion of variance in scores (89.0%), with low ω values for specific factors ($\omega_{HS}$ < 0.50). H values were <.80 for specific factors, indicating poorly defined latent variables in the bifactor model. Although the ECV statistics for the general factor were >.8, indicating high proportions of common variance attributable to the general factor, the PUC statistics <.8 suggest the scale should not be interpreted as unidimensional. Therefore, hierarchical multidimensionality was not supported.
Table 3. Model fit information for estimated SFI models and invariance testing

Note: G, general factor; F1, factor one (intrapersonal distress); F2, factor two (interpersonal distress). Measurement invariance levels were considered reached if (1) CFI did not deteriorate by >.010 in the more restrictive model, (2) ΔRMSEA was <.015, and (3) ΔSRMR was <.030 in the configural model and <.010 in the scalar model.
The superiority of the CFA over the ESEM lower order model (correlated factors) was supported by generally larger factor correlations in the CFA and small or non-significant cross-loadings in the ESEM. Moreover, although the ESEM model demonstrated a marginally better fit, this was not judged sufficient to justify the substantial increase in model complexity.
Supplement 12 of the Supplementary Material presents the final IRD scale.
Invariance testing
All levels of gender invariance were reached (Table 3), with no significant differences in factor structure (configural invariance), factor loadings (metric invariance), or item intercepts (scalar invariance).
Validity and reliability testing
Additional psychometric assessment was completed using weighted factor scores in the CFA sample (Table 4). Examination of skewness and kurtosis indicated normality (Hair, Anderson, Babin, & Black, 2010). The number of LTC diagnoses had a small, positive, significant correlation with both factors (intrapersonal: r = 0.230, p < 0.001; interpersonal: r = 0.238, p < 0.001) (inter-item correlations in Supplement 13 of the Supplementary Material).
Table 4. Descriptive statistics, validity testing, and reliability testing of the weighted IRD subscales

Note: *** p < 0.001, **p < 0.01.
LTC, long-term condition; PHQ, Patient Health Questionnaire; GAD, Generalized Anxiety Disorder Scale; PHQ-ADS, Patient Health Questionnaire Anxiety and Depression Scale; WSAS, Work and Social Adjustment Scale; CBRQ, Cognitive and Behavioral Responses to Symptoms Questionnaire; DDS, Diabetes Distress Scale; IBD-DS, Inflammatory Bowel Disease Distress Scale; IRD, illness-related distress.
Assessments were repeated with summed scores (Supplement 14 of the Supplementary Material) to assess if similar properties were found compared with the weighted scores. Minimal differences were found, supporting the use of sum scoring while maintaining strong psychometric properties.
Mean scores by illness groups are presented in Supplement 15 of the Supplementary Material. Regression models of the subscales and the most common LTCs in our sample (see Table 2) were run. LTCs positively associated with the intrapersonal score were chronic pain (β = 0.149, p < 0.001), rheumatoid arthritis (β = 0.118, p = 0.002), IBD (β = 0.107, p = 0.006), gynecological conditions (β = 0.080, p = 0.035), and diabetes (β = 0.079, p = 0.042); however, hypertension was negatively associated (β = −0.104, p = 0.016). LTCs positively associated with the interpersonal score were IBD (β = 0.144, p < 0.001), gynecological conditions (β = 0.105, p = 0.005), chronic pain (β = 0.097, p = 0.020), psoriasis (β = 0.089, p = 0.018), and rheumatoid arthritis (β = 0.080, p = 0.039).
Internal consistency
Both the intrapersonal and interpersonal subscales demonstrated excellent internal consistency, as measured by both Cronbach α and McDonald’s Ω statistics (Table 4).
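For readers unfamiliar with how internal consistency is quantified, the following is a minimal sketch of Cronbach's α, one of the two statistics reported in Table 4, computed from per-item score lists. The toy data are illustrative only and are not drawn from the study sample; population variances are used for simplicity.

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha from per-item score lists.

    items: list of k lists, each holding one item's scores across the
    same respondents (population variances used for simplicity).
    """
    k = len(items)
    # Sum of the individual item variances
    item_var_sum = sum(pvariance(scores) for scores in items)
    # Variance of each respondent's total score across all items
    totals = [sum(responses) for responses in zip(*items)]
    total_var = pvariance(totals)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Toy data: three perfectly correlated items yield alpha = 1.0
toy = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
print(round(cronbach_alpha(toy), 3))  # 1.0
```

Values above roughly 0.9, as observed for both IRD subscales, are conventionally described as excellent.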
Convergent validity
Both the intrapersonal and interpersonal subscales correlated significantly, positively, and weakly with DDS scores. Both subscales demonstrated large, significant, positive correlations with the IBD-DS, PHQ-ADS, and WSAS scores. Both subscales had moderate to strong, significant, positive correlations with subscales of the CBRQ, with strongest correlations between the embarrassment avoidance subscale and the interpersonal subscale, and symptom focusing and catastrophizing and the intrapersonal subscale. The IRD scale therefore demonstrated good convergent validity with measures of illness-specific distress, generalized distress, functional impairment, and cognitive and behavioral responses to symptoms (Table 4).
Both subscale scores had positive, significant correlations with the slider scale item, indicating higher factor scores were associated with participants describing their LTC(s) as their primary source of distress.
Test–retest reliability
Both factors demonstrated excellent test–retest reliability (Table 4), with very strong, positive significant correlations between baseline and follow-up scores (intrapersonal: r = 0.811, p < 0.0001; interpersonal: r = 0.829, p < 0.0001).
ROC analyses
The results of the ROC analyses are reported in Supplement 16 of the Supplementary Material. The true/gold standard class (cases) comprised participants who attributed at least 50% of their psychological distress to their LTC(s). This applied to 66.28% of our sample (916/1,382 participants; 16 of the 1,398 [1.14%] participants overall had missing IRD scores), indicating moderately high case prevalence. Figure 1 shows IRD factor scores for cases (n = 916, 66.28%) and non-cases (n = 466, 33.72%) at sensitivity- and specificity-optimizing cut-points for the intrapersonal and interpersonal factors (14.5 and 11.5). Demographic and disease predictors of caseness are shown in Supplement 17 of the Supplementary Material. The number of LTCs increased the odds of caseness, whereas older age, Asian ethnicity, and diagnoses of asthma and hypertension reduced the odds of caseness.

Figure 1. Receiver operating characteristic (ROC) curves and associated 95% confidence intervals (left panel) for the illness related distress (IRD) scale, final version (intrapersonal and interpersonal factors), as a predictor of attributing ≥50% of psychological distress to the primary long-term condition. Boxplots of underlying IRD scores by caseness are shown in the right panel.
The area under the ROC curve (Figure 1) values for each IRD scale version and factor were acceptable (lower bound of the 95% CI) to excellent (point estimates, upper bound of the 95% CI), with 79–88% of participants correctly classified. Sensitivity and specificity values for optimal cut-points showed that 72–87% of cases and 66–76% of non-cases, respectively, were correctly classified. Based on positive and negative predictive values, IRD scale cut-points (intrapersonal factor: 14.5; interpersonal factor: 11.5) correctly classified 82–87% of participants who attributed at least 50% of their psychological distress to their LTC(s), and 58–73% of participants who attributed a lower percentage of their distress to their LTC(s). Positive and negative likelihood ratios indicate that cases are 2.11–3.62 times more likely to score above optimal IRD cut-points than non-cases, and non-cases are 2.32–5.88 times more likely to score below optimal IRD cut-points than cases. For the intrapersonal factor, the optimal cut-point of 14.5 identified 97.82% of cases correctly (896/916 participants) and misclassified 27.90% of non-cases (130/466 participants). For the interpersonal factor, the optimal cut-point of 11.5 identified 92.03% of cases correctly (843/916 participants) and misclassified 28.97% of non-cases (135/466 participants).
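The classification statistics above all follow from the standard 2×2 table at a given cut-point. As an illustration (not the authors' analysis code), the sketch below derives them from the intrapersonal-factor counts reported above, assuming TN = 466 − 130 = 336 and FN = 916 − 896 = 20; the likelihood-ratio ranges quoted in the text span all scale versions in Supplement 16, so a single cut-point falls within rather than reproduces them.

```python
def binary_test_metrics(tp, fn, fp, tn):
    """Standard 2x2 classification metrics for a screening cut-point."""
    sens = tp / (tp + fn)        # sensitivity: cases scoring above cut-point
    spec = tn / (tn + fp)        # specificity: non-cases scoring below it
    ppv = tp / (tp + fp)         # positive predictive value
    npv = tn / (tn + fn)         # negative predictive value
    lr_pos = sens / (1 - spec)   # positive likelihood ratio
    lr_neg = (1 - sens) / spec   # negative likelihood ratio
    return sens, spec, ppv, npv, lr_pos, lr_neg

# Intrapersonal factor at the optimal cut-point of 14.5, using the
# reported counts (896/916 cases correct; 130/466 non-cases misclassified)
sens, spec, ppv, npv, lr_pos, lr_neg = binary_test_metrics(896, 20, 130, 336)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} LR+={lr_pos:.2f}")
```

The resulting positive likelihood ratio (about 3.5) sits within the 2.11–3.62 range reported for scores above the cut-points.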
Readability
The final IRD scale (Supplement 12 of the Supplementary Material) had a Flesch Reading Ease score of 74.9, equivalent to a US grade 5 reading level (10–11 years old).
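The Flesch Reading Ease statistic is a fixed formula over word, sentence, and syllable counts. A minimal sketch is shown below; the counts are hypothetical for illustration, and in practice syllable counting is delegated to a readability tool rather than done by hand.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease from document-level counts.

    Higher scores indicate easier text; 70-80 corresponds roughly to
    'fairly easy' prose.
    """
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Hypothetical counts for illustration only
print(round(flesch_reading_ease(words=100, sentences=10, syllables=130), 1))
```

A score of 74.9, as reported for the IRD scale, therefore reflects short sentences and predominantly short words.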
Discussion
This study aimed to develop a measure of Illness-Related Distress that can be used across LTCs and to test the psychometric properties of this new scale. To our knowledge, this is the first transdiagnostic measure of IRD. The final IRD Scale comprised two 7-item factors demonstrated through EFA and confirmed by CFA. Model fitting demonstrated that a single-factor or bifactor model was not supported. A CFA lower-order correlated factor model was favored over ESEM due to marginal differences in fit statistics and the greater simplicity of the CFA. Therefore, the two factors, although conceptually related, should be calculated separately and not combined into a total score. The model had excellent fit statistics and passed invariance testing.
We labeled the two factors the intrapersonal distress subscale, measuring a range of emotions directly related to the challenges of living with an LTC such as anger, frustration, and worry, and the interpersonal distress subscale, capturing feelings associated with social/self-perception issues, such as being embarrassed by the illness or feeling like a burden. The subscales demonstrated excellent internal reliability, good test–retest reliability, good readability, and promising clinical cut-points against the reference category (asking the percentage of distress respondents attributed to their LTC(s)). There were significant, small to large, positive correlations between the subscales and measures of conceptually related constructs, including psychological distress, depression, anxiety, impaired functioning, cognitive and behavioral responses to symptoms, and illness-specific measures of distress (in diabetes and IBD). While convergent validity was supported with moderately strong correlations between the IRD subscales and IBD-related distress, the relationships between IRD subscales and diabetes-related distress were smaller. This may be because the Diabetes Distress Scale (DDS) includes items concerning impact and management (e.g. ‘Feeling that my doctor doesn’t give me clear enough directions on how to manage my diabetes’), rather than purely distress. Overall, the results indicate that the IRD scale is a brief, valid, reliable, and potentially clinically informative instrument for measuring and classifying transdiagnostic IRD.
Extensive research underpinned the initial scale item pool, including qualitative interviews with people living with LTCs, expert consensus meetings, a systematic literature search, and feedback on items from people living with LTCs and HCPs who treat anxiety and depression in LTCs. This focus on the face validity of items may underlie the robust psychometrics of the scale (Boateng, Neilands, Frongillo, Melgar-Quiñonez, & Young, Reference Boateng, Neilands, Frongillo, Melgar-Quiñonez and Young2018; Flake & Fried, Reference Flake and Fried2020; Morgado, Meireles, Neves, Amaral, & Ferreira, Reference Morgado, Meireles, Neves, Amaral and Ferreira2017). Given its excellent psychometrics and good readability, the IRD Scale may be an acceptable tool to introduce consistency in the measurement of distress related to adjusting to LTCs, irrespective of diagnosis and multimorbidity.
There are several potential clinical applications of this scale. First, it can improve understanding of how IRD is experienced, incorporating a wider range of emotions beyond depression and anxiety. Second, it may help clinicians determine the presenting problem, that is, whether current distress is related to the difficulties of adjusting to the challenges of an LTC (IRD), or if the presentation resembles primary anxiety or depression. The range for each IRD subscale is 0–28. Preliminary analysis suggests cut-offs of 14.5 and 11.5, respectively, for the intrapersonal and interpersonal factors of the scale, rounded to 15 and 12 for clinical use. Importantly, these cut-offs do not indicate whether distress is clinically significant or preclude a prior or primary diagnosis that may increase vulnerability to IRD. However, they can be used to decide if a significant proportion of distress is LTC-related, thus signaling whether a therapy tailored to IRD may be most appropriate. Third, it may help to identify specific treatment targets for interventions. For example, those who score high on interpersonal distress may benefit from interventions that focus on making social connections to reduce loneliness, challenging cognitions related to embarrassment, and exploring ways to reciprocate social support when feeling like a burden. Finally, the IRD Scale could be used as a primary outcome measure in trials assessing interventions designed to treat IRD and adjustment in LTCs. This could be utilized alongside more traditional measures of anxiety and depression such as the GAD-7 or PHQ-9, to explore relative sensitivity to change.
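The scoring logic described above can be sketched as follows. This is an illustrative sketch only, not the authors' published scoring code: it assumes each of the seven subscale items is scored 0–4 (consistent with the stated 0–28 range) and applies the rounded clinical cut-offs of 15 and 12.

```python
# Rounded clinical cut-offs proposed in the text (preliminary, not
# indicators of clinically significant distress by themselves)
CUTOFFS = {"intrapersonal": 15, "interpersonal": 12}

def score_subscale(item_responses, subscale):
    """Sum a subscale's 7 item responses (assumed 0-4 each) and flag
    whether the total reaches the rounded clinical cut-off."""
    if len(item_responses) != 7 or not all(0 <= r <= 4 for r in item_responses):
        raise ValueError("expected seven item responses scored 0-4")
    total = sum(item_responses)
    return total, total >= CUTOFFS[subscale]

total, above_cutoff = score_subscale([3, 2, 3, 2, 3, 2, 3], "intrapersonal")
print(total, above_cutoff)  # 18 True
```

As the text notes, a score at or above a cut-off signals that a substantial proportion of distress may be LTC-related, not that the distress is clinically significant.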
Strengths, limitations, and future directions
The IRD scale was developed rigorously, ensuring common pitfalls in scale development were avoided by (1) specifying a construct; (2) confirming the absence of existing scales through a literature search; (3) prioritizing lived experience by conducting exploratory interviews with and getting detailed feedback from LTC patients; and (4) consulting expert judges (clinicians) (Boateng et al., Reference Boateng, Neilands, Frongillo, Melgar-Quiñonez and Young2018). The study used a large sample with diverse LTCs and employed sophisticated analyses to reduce the item pool and identify the best model fit, allowing the assessment of varied complex structures and construct-relevant multidimensionality. Moreover, scale development relied upon factor loadings, factor correlations, and general factor structure, alongside health psychology theory.
Despite the large sample size, the self-selected community sample may limit generalizability. Overrepresented demographics included white ethnicity, female gender, and high educational level. Moreover, some LTCs appeared to be underrepresented based on epidemiological prevalence (e.g. hypertension, obesity, type 2 diabetes). However, the subscales passed invariance testing (based on gender), suggesting that this demographic imbalance does not affect factor structure. Although internet research allows anonymous participation, improves accessibility, and minimizes embarrassment, social stigma, and fear of judgment, it prevents confirmation of LTC diagnoses. The ROC analysis used to define the clinical cut-points has limitations, as the case definition was a single-item, non-validated ratio rather than a severity measure (Pepe, Janes, Longton, Leisenring, & Newcomb, Reference Pepe, Janes, Longton, Leisenring and Newcomb2004). However, as there is no gold standard measure of IRD, the ratio measure provides a starting point. Future research should systematically compare methods and explore machine learning approaches for potentially improved accuracy. Moreover, the model could be tested for invariance based on LTC diagnosis and/or with qualitatively different EFA and CFA/ESEM samples instead of different data portions. Though the IRD scale measures the severity of distress, a transdiagnostic illness-related stressor checklist may also complement research and clinical decision-making in this area. Although initial convergent validity testing was performed in this study, future research should include measures to test divergent validity. Finally, sensitivity to change and criterion validity should be assessed by utilizing the IRD subscales in intervention studies and comparing IRD cut-offs with diagnostic interviews.
Conclusions
The IRD scale is a 14-item valid and reliable measure comprising two factors of distress (intrapersonal and interpersonal IRD). It reliably captures IRD, with excellent evidence of internal consistency, convergent validity with thematically similar measures, and test–retest reliability. The IRD scale has significant utility in clinical and research settings, particularly in treatment decision-making and the assessment of treatment efficacy. Further research is needed to assess sensitivity to change and criterion validity.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S003329172500090X.
Acknowledgements
We would like to thank the following charities who advertised our study with their network: Alopecia UK, Arthritis Action UK, The Arthritis and Musculoskeletal Alliance (ARMA), Asthma + Lung UK, Bowel Research UK, Brave Hearts Northern Ireland, Breast Cancer Now, Breathe Arts Health Research, British Polio Fellowship, British Porphyria Association, British Skin Foundation, Burning Nights, Crohn’s & Colitis UK, Complex Regional Pain Syndrome (CRPS) UK, Dewis Cymru, Diabetes UK, Fibromyalgia Action UK, Juvenile Diabetes Research Foundation, Kidney Care UK, Kidney Research UK, Lipoedema UK, Liver4lifeUK, Motor Neuron Disease Association, Multiple Sclerosis (MS) Trust, National Kidney Federation, National Rheumatoid Arthritis Society (NRAS) UK, Neuroblastoma UK, Neuroendocrine Cancer UK, Pain UK, The Psoriasis and Psoriatic Arthritis Alliance, Psoriasis UK, Royal Osteoporosis Society, Stroke Association. We would like to thank students Edward Le Marchant, Amber (Kai Xin) Siow, and Glenn (Chin) Kong for their support with study recruitment.
Funding statement
This study has been delivered through the National Institute for Health and Care Research (NIHR) Maudsley Biomedical Research Centre (BRC) (NIHR203318). The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care.
Competing interests
R.M.M. has received personal fees from Mahana Therapeutics for scientific advisory work and from other universities and hospital trusts for cognitive behavioral therapy training in irritable bowel syndrome. She was a beneficiary of a license agreement between Mahana Therapeutics and King’s College London. The remaining authors declare no further competing interests.