Introduction
Randomized controlled trials (RCTs) are used widely across palliative care and behavioral medicine research to evaluate the efficacy of interventions. The defining features of an RCT are the presence of a control group and randomization, which remove allocation bias and minimize confounding effects. RCTs are thus considered the most rigorous way to establish a cause–effect relationship between treatment and outcomes and are regarded as the “gold standard,” as they minimize biases introduced by confounding variables or covariates (Sibbald and Roland 1998).
Despite the prevalence of RCTs, it can be difficult to select a statistical method to analyze the results, given the range of available options and the lack of clear guidance. The assessment of change using a pretest–posttest control group design is a deceptively straightforward task, as there is often a lack of clarity around when and how to use different methods (Rudestam and Newton 2012; Wilkinson 1999).
There have historically been 5 main ways to analyze continuous, individual-level RCT outcomes: analysis of variance (ANOVA) of change scores, ANOVA of follow-up scores, analysis of covariance (ANCOVA), multivariate analysis of variance (MANOVA), and longitudinal modeling, each with relative strengths and weaknesses. The first 2 options are the analysis of follow-up scores or the analysis of change scores, both of which are typically done using an ANOVA or a t-test of 2 groups. ANOVA is a method of partitioning variability in a dependent variable in order to test hypotheses regarding differences in means (Maxwell et al. 2017). Although these methods are straightforward, they assume that randomization balances the groups on pretest scores and do not adjust for baseline differences (see Table 1). It has been argued that these 2 methods of analysis should be avoided, as they are often utilized inappropriately (Rudestam and Newton 2012). However, they are accepted among proponents of the posttest-only approach, who reason that randomization should control for between-group baseline differences and that the inclusion of a pretest can reduce external validity by confounding change from an intervention (Vickers 2005b). Theoretically, if a study has a large enough sample, baseline characteristics, including baseline measures, should be balanced; however, how large a sample must be to achieve such balance is rarely known, and in smaller samples there may indeed be imbalance due to random chance that must be accounted for in the analysis.
Note. This table was adapted from Rudestam and Newton (2012). The syntax above assumes that the variables y1 and y2 are the baseline and first follow-up outcome measures, respectively, and that group is the factor indicating group assignment. For the longitudinal model, the data must be structured as one row per timepoint, indexed by the time and id variables (timepoint and per-participant identifier), where y is the outcome at a given timepoint.
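To make the first 2 options concrete, the following Python sketch (not part of Table 1) analyzes follow-up scores and change scores with 2-group t-tests. The variable names mirror the table note; the data frame and all simulated values are purely illustrative.

import numpy as np
import pandas as pd
from scipy import stats

# Simulated two-arm trial data, for illustration only.
rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame({
    "id": np.arange(2 * n),
    "group": np.repeat([0, 1], n),   # 0 = control, 1 = intervention
})
df["y1"] = rng.normal(10, 2, 2 * n)                                 # baseline
df["y2"] = 0.5 * df["y1"] + rng.normal(5, 2, 2 * n) + df["group"]   # follow-up

arm0, arm1 = df[df["group"] == 0], df[df["group"] == 1]

# Method 1: ANOVA/t-test of follow-up scores only.
print(stats.ttest_ind(arm1["y2"], arm0["y2"]))

# Method 2: ANOVA/t-test of change scores (follow-up minus baseline).
print(stats.ttest_ind(arm1["y2"] - arm1["y1"], arm0["y2"] - arm0["y1"]))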
ANCOVA (controlling for baseline score) adjusts for differences on the covariate by including the covariate (baseline score) as a continuous predictor variable in the analysis (Maxwell et al. 2017). This method thus accounts for baseline differences in the primary outcome. ANCOVA also has higher statistical power than posttest and change score ANOVAs (Vickers 2001), an advantage that may be particularly useful for studies with smaller sample sizes. ANCOVA can also be extended to incorporate time effects when using repeated measures and randomization strata as covariates (Vickers 2005a).
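In a regression framework, ANCOVA amounts to regressing the follow-up score on group with the baseline score as a covariate. A minimal sketch, continuing with the simulated df from the example above (statsmodels is one of several packages that could be used):

import statsmodels.formula.api as smf

# Method 3: ANCOVA, adjusting the treatment comparison for baseline.
ancova = smf.ols("y2 ~ y1 + C(group)", data=df).fit()
print(ancova.params)  # the C(group)[T.1] coefficient is the adjusted treatment effect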
MANOVA is an extension of ANOVA that accommodates 2 or more dependent variables (Warne 2014); the multiple dependent variables may be a single measure taken at multiple longitudinal timepoints. Similar to ANCOVA, this method accounts for baseline differences. However, it has been suggested that results from MANOVA are often misinterpreted, as the interaction effect, not the main effect, is typically the analysis of interest (Vickers 2005b). There is also growing evidence that the field of psychology specifically is unfamiliar with the proper statistical procedures to follow after rejecting a null hypothesis (Warne 2014).
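For reference, a minimal MANOVA sketch treating baseline and follow-up scores as a joint multivariate outcome, again reusing the simulated df; with 2 timepoints, the group-by-time interaction of interest corresponds to the group effect on the within-person change.

from statsmodels.multivariate.manova import MANOVA

# Method 4: MANOVA with y1 and y2 as joint dependent variables.
mv = MANOVA.from_formula("y1 + y2 ~ C(group)", data=df)
print(mv.mv_test())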
Longitudinal modeling, such as mixed effects models with a random per-person intercept or generalized estimating equations (GEEs) that adjust variance estimates based on within-subject correlation, has also grown in popularity. Mixed effects models are regression models that explicitly incorporate a random per-person intercept to account for within-person variation, while GEEs treat the per-person correlation as a nuisance parameter and simply estimate the common correlation among observations from the same person. These methods are often highly regarded because they can utilize all available data for participants lost to follow-up and can analyze multiple dependent variables. They can also accommodate unbalanced timepoints, a clear advantage over ANOVA. However, interpretation and the choice of covariance structures and parameters may become overly technical.
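Both longitudinal approaches require the long data format described in the Table 1 note. A minimal sketch, continuing with the simulated df from above (the statsmodels functions and exchangeable working correlation are illustrative choices, not a prescription):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Restructure to long format: one row per person-timepoint.
long_df = pd.wide_to_long(df, stubnames="y", i="id", j="time").reset_index()

# Method 5a: mixed effects model with a random per-person intercept.
mixed = smf.mixedlm("y ~ time * C(group)", data=long_df, groups="id").fit()
print(mixed.summary())

# Method 5b: GEE treating within-person correlation as a nuisance.
gee = smf.gee("y ~ time * C(group)", groups="id", data=long_df,
              cov_struct=sm.cov_struct.Exchangeable()).fit()
print(gee.summary())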
Thus, there are many analysis options and a lack of clear guidance for selecting one method over another. This is especially true of RCTs determining the efficacy of behavioral interventions, also known as behavioral clinical trials, as the outcome variable of such studies is often continuous (Vickers 2005b). There has also been a growing appreciation of the differences in optimal methodology between behavioral clinical trials and standard pharmacological trials (Bacon et al. 2015; Penzien et al. 2005). Accordingly, behavioral clinical trials have added complexity in terms of research design and guidelines, as well as noted limitations in their execution and dissemination (Bacon et al. 2015). While recommendations for determining sample sizes have been outlined (Penzien et al. 2005), outcome analysis guidelines have yet to be established, despite the growing call for preregistration, which requires researchers to make a thoughtful selection of their statistical analysis plan a priori.
In response to the recent acknowledgment of the field’s replicability crisis (Open Science Collaboration 2015), metascience has emerged as a scientific social movement that seeks to use quantification of science to diagnose issues in research practices with the goal of improving them (Peterson and Panofsky 2023). Metascience of statistical analyses may prove particularly useful, as Breznau et al. (2022) have noted idiosyncratic variation among researchers’ analytic choices, even when working with the same data, and suggested that this may be especially true for behavioral research. Given the rise of palliative care and behavioral medicine interventions (e.g., Cognitive Behavioral Therapy, Motivational Interviewing, Meaning Centered Psychotherapy, etc.), it is critical to characterize the analytic patterns employed (Breitbart et al. 2018; Funderburk et al. 2018). Thus, the present study aimed to characterize and understand any patterns in the predominant methods utilized in recently published peer-reviewed RCTs in top palliative care and behavioral medicine journals, with the goal of highlighting potential opportunities to reform future scientific practices.
Methods
Four journals with some of the most impactful research in palliative care, behavioral medicine, and psycho-oncology were selected for analysis: Annals of Behavioral Medicine, Health Psychology, Psycho-Oncology, and Psychosomatic Medicine. These journals were selected based on study team consensus that they represent a sampling (i.e., not intended to be exhaustive) of some of the most widely respected journals in the field, as well as the study team’s interest in psycho-oncology. Inclusion criteria were (1) peer-reviewed publication in one of the 4 target journals; (2) RCT design with randomization at the participant level; and (3) analysis of the intervention effect on a continuous primary outcome. Studies whose primary outcome was feasibility (e.g., recruitment, retention, etc.) were excluded, as these outcomes do not require inferential statistics. Studies were also excluded if they analyzed more than 2 follow-up timepoints, as this design likely addresses questions beyond the scope of a pre–post analysis, or if they were secondary analyses or analyses only of mechanisms (i.e., moderation or mediation) that did not report the main effect of the intervention.
IRB approval was not necessary for this review. An electronic query using the PubMed search engine was conducted for all manuscripts published in the 4 target journals during the calendar years 2015–2021. Articles were first excluded if they were not RCTs. Each remaining article was then deemed eligible or ineligible based on the study inclusion criteria, yielding the final set of analyzable manuscripts.
Among the manuscripts deemed eligible, 2 raters independently classified each study based on its statistical methods for the primary outcome prior to a consensus meeting of at least 3 raters, where classifications were finalized after discussing any inter-rater disagreement. Classification categories were determined in concordance with Rudestam and Newton (2012), who delineated 4 primary methods for analyzing pre–post effects: (1) ANOVA of posttest scores, (2) ANOVA of change scores, (3) ANCOVA, and (4) MANOVA. In the general modeling context, these 4 methods translate to (1) regression of posttest scores without adjustment for pretest, (2) regression of change scores without adjustment for pretest, (3) regression of posttest scores with adjustment for pretest, and (4) repeated measures ANOVA models with multiple observations per person, respectively. A fifth option, (5) multilevel modeling, such as generalized estimating equations and hierarchical linear modeling, was also included. The data that support the findings of this study are available from the corresponding author upon reasonable request.
Descriptive statistics were calculated based on the classifications; frequencies were calculated for each method by journal, and overall classifications were compared by sample size using Kruskal–Wallis tests (due to non-normality) and by journal using Chi-square tests. For the Chi-square test of 5 methods among 4 journals, a sample of 183 manuscripts provides 80% power to detect a standardized effect size of at least Cohen’s w = 0.31, a medium effect. Adjusted analysis was conducted via multinomial regression of method on journal and log-transformed sample size, with overall (type 3) tests of the journal and sample size effects.
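The following sketch illustrates these manuscript-level analyses in Python. The data frame studies, its column names, and all values below are hypothetical stand-ins (the actual study data are available from the corresponding author), and the power calculation approximates the contingency-table test via its degrees of freedom.

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf
from statsmodels.stats.power import GofChisquarePower

# Hypothetical manuscript-level data: analytic method (coded 0-4 for the
# 5 categories), journal, and study sample size n. All values are simulated.
rng = np.random.default_rng(1)
studies = pd.DataFrame({
    "method": rng.integers(0, 5, 183),
    "journal": rng.choice(["ABM", "HP", "PO", "PM"], 183),
    "n": rng.integers(19, 2006, 183),
})

# Kruskal-Wallis comparison of sample size across the 5 methods.
print(stats.kruskal(*[g["n"] for _, g in studies.groupby("method")]))

# Chi-square test of method by journal.
print(stats.chi2_contingency(pd.crosstab(studies["method"], studies["journal"])))

# Power for the 5 x 4 chi-square: df = (5 - 1)(4 - 1) = 12, so n_bins = 13.
print(GofChisquarePower().power(effect_size=0.31, nobs=183, alpha=0.05, n_bins=13))

# Multinomial regression of method on journal and log-transformed sample size.
print(smf.mnlogit("method ~ journal + np.log(n)", data=studies).fit().summary())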
Results
The 7-year electronic query netted 3,989 manuscripts, of which 380 (10%) were identified as RCTs. Among all RCT manuscripts, 197 (52%) were excluded based on initial eligibility criteria as described above, resulting in 183 (48%) analyzable manuscripts from Annals of Behavioral Medicine (49 manuscripts), Health Psychology (41 manuscripts), Psycho-Oncology (58 manuscripts), and Psychosomatic Medicine (35 manuscripts). The consensus team classified all 183 manuscripts into one of the 5 distinct categories for statistical methods.
The most prevalent analytic method for the included RCTs was longitudinal modeling (n = 58, 32%), followed by ANCOVA controlling for baseline (n = 42, 23%) and MANOVA (n = 40, 22%). While longitudinal modeling (method 5) was the most prevalent method overall and for 3 of the individual journals, manuscripts in Psychosomatic Medicine more frequently used a MANOVA (method 4; 37%); however, this differential result was not statistically significant.
Sample sizes for the included studies ranged widely, from 19 to 2,005 participants. Distributions of sample sizes by method are depicted in Figure 1. Statistical methods varied significantly by sample size (p = 0.008), such that manuscripts with larger sample sizes were more likely to employ ANOVA methods (of either change scores or follow-up scores, methods 1 and 2) and those with smaller sample sizes were more likely to use method 3 (ANCOVA). In a model including both log-transformed sample size and journal, only sample size was significantly associated with statistical method (p = 0.03).
Discussion
The great variability observed in analytic methods highlights the range of options researchers face when selecting an analysis method, each with its own pros and cons (Table 1). Standardized guidelines outlining this decision-making process are of particular relevance given the growing utilization of registered reports, which require researchers to present their analysis plan a priori. As such, these guidelines would have the potential to aid in open science reform.
The prevailing method in these influential peer-reviewed palliative care and behavioral medicine journals was longitudinal modeling (method 5). As discussed in Table 1, an advantage of longitudinal (multilevel) modeling is the inclusion of baseline data for participants who were lost to follow-up. That is, multilevel models employ available-case analysis, whereas analyses of change scores or follow-up scores necessitate listwise deletion. Multilevel models thus increase the effective sample size and statistical power relative to the other methods and reduce bias related to participants lost to follow-up. If a researcher can assume attrition is random, listwise deletion is a concern only for power and not bias, but this assumption is rarely tenable in the behavioral sciences.
Another advantage of longitudinal modeling is that multiple comparisons of interest (e.g., interval-specific time effects or pairwise group comparisons) can be extracted from a single model when appropriate contrasts are used. However, some recommend the use of multiple ANOVAs over longitudinal modeling because researchers may misuse the technique (Vickers 2005b). The biggest barrier to utilizing longitudinal modeling is a lack of familiarity with the techniques and their computational logistics. For example, longitudinal data may need to be restructured into the less familiar “long” format, in which the multiple observations per person are disaggregated into separate rows. This can be accomplished fairly succinctly using something like the VARSTOCASES command in SPSS but adds another layer of complexity if the researcher is not well versed in data management. Despite concerns that this technique may be misused, researchers may still select these more sophisticated analyses for a perceived increase in publication potential.
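For illustration, the analogous wide-to-long restructuring in Python takes a single call; this is a hedged sketch whose variable names follow the Table 1 note and whose values are invented.

import pandas as pd

# Wide format: one row per participant.
wide = pd.DataFrame({"id": [1, 2], "group": [0, 1],
                     "y1": [10.2, 9.8], "y2": [12.1, 14.3]})

# Long format: one row per person-timepoint, as required for method 5.
long_df = pd.wide_to_long(wide, stubnames="y", i="id", j="time").reset_index()
print(long_df)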
The second most prevalent method in the 4 journals was ANCOVA controlling for baseline (method 3). As previously mentioned, one strength of this method is that it can be extended to incorporate time effects (for repeated measures) and randomization strata as covariates, which has the benefit of potentially increasing power (Kalish and Begg 1985; Vickers 2005a). Another strength is its generally greater statistical power to detect a treatment effect (Wasserstein and Lazar 2016), making it advantageous for the smaller sample sizes that may be more common in behavioral clinical trials. Accordingly, studies in the current review with smaller sample sizes were more likely to employ ANCOVAs.
The third most prevalent method in these journals was MANOVA. One possible strength of MANOVA over the previously described ANOVAs is that it adjusts for baseline differences. Proponents of analyses such as MANOVA argue that, even though adequately sized RCTs ought to be balanced on baseline measures, studies with random imbalance or smaller samples may be better served by allowing a baseline adjustment. Analyses using t-tests or ANOVA on either follow-up or change scores are also likely more accessible to readers with limited statistical training than methods such as MANOVA, and it has been suggested that results using MANOVA are often misinterpreted (Vickers 2005b).
Statistical methods also varied significantly by sample size across all journals. The observation that manuscripts using ANOVA of follow-up scores (method 1) had the largest sample sizes is appropriate, given the statistical principle that with a large enough sample size, baseline differences will be negligible due to randomization. The observation that studies with smaller sample sizes were more likely to employ ANCOVAs also aligns with the claim that ANCOVAs have greater statistical power than ANOVAs (Vickers 2001). Given that psychological research is oftentimes underpowered and sample sizes have not increased over time (Marszalek et al. 2011), sample size should be considered when selecting the appropriate method of analysis.
In addition to the descriptive findings of the current study, an incidental observation was that none of the papers reviewed utilized a Bayesian framework for the primary analysis. Bayesian estimation falls outside traditional null hypothesis testing, instead yielding parameter estimates with a “highest density region” or a Bayes factor for model comparison. One study, by Yeung et al. (2020), first utilized null hypothesis significance testing followed by post hoc Bayesian analyses to further analyze their data, perhaps representing an acknowledgment of the limitations of traditional null hypothesis significance testing. Recent statements on the obsolescence of traditional significance testing, from sources such as the American Statistical Association (Wasserstein and Lazar 2016) and Nature (2019), have pointed to Bayesian methods and the Bayes factor as an indicator of the credibility of results. For analyses that utilize ANOVA or standard regression models, Bayesian methods have recently been made accessible and relatively user-friendly through incorporation into software such as SPSS (IBM Corp 2020).
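As a purely illustrative sketch, a default Bayes factor for a 2-group comparison can be obtained in a single call; the pingouin Python package and simulated data below are our own choices, not software used by any of the reviewed studies.

import numpy as np
import pingouin as pg

# Simulated 2-arm outcome data, for illustration only.
rng = np.random.default_rng(2)
control = rng.normal(10, 2, 40)
treated = rng.normal(11, 2, 40)

res = pg.ttest(treated, control)    # includes a default Bayes factor
print(res[["T", "p-val", "BF10"]])  # BF10: evidence for the alternative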
In sum, the assessment of change using a pretest–posttest control group design is a potentially complex task, and recent work has documented great variation among researchers’ analytic choices (Breznau et al. 2022). While a statistical analysis plan should ultimately be driven by factors such as the research question, assumptions about the nature of change in the outcome, assumptions about attrition, and design factors including sample size, knowledge of the relative strengths and weaknesses of each common method, as well as guidelines for their use, may prove useful to researchers in palliative care and behavioral medicine. The current characterization of the literature and overview of the available statistical methods may help to inform this decision and aid in the development of future selection guidelines. Given the high levels of variability observed in this review, future discussion around best practices in RCT analyses is warranted to compare the relative impact of interventions in a more standardized way and to aid future scientific practice reform (e.g., preregistration, selection of an analysis plan a priori, open science reform, replicability, etc.).
Competing interests
The authors have no conflicts of interest to report.