We thank Dr Jenkins for his comment on our review of advances in the treatment of anorexia nervosa (AN) (Brockmeyer et al., 2018). We agree with Dr Jenkins' main statement, i.e. that adequate sample size calculations are a necessary prerequisite in designing treatment studies, and that limited statistical power can lead to erroneous conclusions, including so-called false negatives (i.e. failures to statistically identify real differences between the efficacy of treatments). Given the only moderate treatment response, particularly in adult patients with AN, and the disorder's severity, high mortality rate, long-term impairments in social functioning and employment, low quality of life, high burden on caregivers, and huge societal costs (Giel et al., 2016; Schmidt et al., 2016; Zipfel et al., 2016), we entirely agree that adequate clinical trials with large enough samples of patients with AN are urgently needed. Hence, we do not disagree with Dr Jenkins regarding the main message of his comment – which is also the basic principle of sample size calculation. However, we would like to comment on a few of his statements, which, in our view, might themselves lead to erroneous conclusions.
The dearth of large-scale randomised controlled trials in AN arises not only from underestimations of statistical power but also from limited funding for AN research (Schmidt et al., 2016), the low prevalence of AN, and high treatment ambivalence in this population (Abbate-Daga et al., 2013; Williams and Reid, 2010; Gregertsen et al., 2017). For instance, it took 4 years and 10 participating centres to recruit n = 242 eligible patients with AN for the ANTOP study (Zipfel et al., 2014). These factors should be taken into account when judging small sample sizes in clinical trials on AN.
Furthermore, Dr Jenkins states that null findings in superiority trials on AN are often interpreted as suggesting that the examined treatments are equivalent. Indeed, such an interpretation of a null finding in a superiority trial would be improper. However, Dr Jenkins' comment provides no reference for such interpretations in the AN literature. In our review, we do not interpret findings in this way. On the contrary, we clearly state (as cited by Dr Jenkins in his comment) that ‘there is no single psychotherapy that is substantially superior to another’. This is also the common tone in other reviews on treatments for AN (Hay, 2013; Kass et al., 2013; Le Grange, 2016). Likewise, the original studies (Zipfel et al., 2014; Schmidt et al., 2015) clearly communicated that there was no significant difference between the treatment conditions. No conclusions were drawn that any treatment is non-inferior or equivalent to another. Thus, Dr Jenkins presses charges where no crime has been committed.
Dr Jenkins further argues that differences between treatments for AN rarely exceed effect sizes around d = 0.30, without providing any proper reference for this specific number [actually, the sample size calculation for the ANTOP study was, for instance, based on an effect size of d = 0.59, which was deduced from a previous trial on AN (Dare et al., 2001)]. He then states that, given conventions of power analysis (α = 0.05; power = 80%), one would need a sample size of n = 139 in each treatment arm to detect an effect of this size. Unfortunately, we cannot reconstruct how this specific sample size results from the given parameters. The required sample size depends very much on the statistical test that is applied (e.g. for an independent samples t test, a sample size of n = 176 per condition would be necessary, whereas for a mixed ANOVA it could be n = 45 per condition, given the parameters suggested by Dr Jenkins). Dr Jenkins' line of argumentation suggests that previous clinical trials on AN have not utilised appropriate a priori sample size calculations, but this is definitely not the case. For instance, in the ANTOP study (Zipfel et al., 2014) it was expected that one of the two specific treatments (focal psychodynamic therapy and/or enhanced cognitive behaviour therapy) would result in an improvement in body mass index (BMI) of 1.0 kg/m² compared with optimised treatment as usual, which was considered a clinically meaningful difference and translates into a between-groups effect size of d = 0.59. Given this expected effect size, an alpha of 0.025 (corrected for multiple comparisons), and 80% power, one would need n = 55 per condition. Expecting an attrition rate of 30%, this sample size was increased to n = 80 per condition.
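The arithmetic behind such a priori calculations can be sketched as follows. This is a minimal illustration that uses the standard normal approximation for comparing two independent means, not the software or exact procedure used in the ANTOP study. The function names are hypothetical, and treating α as two-sided is an assumption. Because the approximation ignores the t distribution, it returns values slightly below exact t-test results (e.g. 175 rather than the 176 quoted above for d = 0.30).

```python
import math
from statistics import NormalDist


def n_per_group(d: float, alpha: float, power: float, two_sided: bool = True) -> int:
    """Approximate per-group sample size for comparing two independent means.

    Normal approximation: n = 2 * (z_alpha + z_power)^2 / d^2.
    Illustrative helper only; the two-sided default is an assumption.
    """
    z = NormalDist()  # standard normal distribution
    tail = alpha / 2 if two_sided else alpha
    z_alpha = z.inv_cdf(1 - tail)   # critical value for the significance level
    z_power = z.inv_cdf(power)      # quantile corresponding to the desired power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


def inflate_for_attrition(n: int, attrition: float) -> int:
    """Inflate n so that the expected number of completers is still n."""
    return math.ceil(n / (1 - attrition))


# Parameters reported above for the ANTOP study: d = 0.59, alpha = 0.025, power = 80%
n_antop = n_per_group(0.59, 0.025, 0.80)
print(n_antop)                               # 55
print(inflate_for_attrition(n_antop, 0.30))  # 79; the text reports a planned n of 80

# Parameters suggested by Dr Jenkins: d = 0.30, alpha = 0.05, power = 80%
print(n_per_group(0.30, 0.05, 0.80))         # 175 under this approximation
```

Under these assumptions the sketch reproduces the n = 55 per condition stated for the ANTOP study. It also shows how strongly the required sample size depends on the assumed effect size.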
In our view, the ANTOP sample size calculation is a reasonable rationale for a clinical trial on AN. In addition, Dr Jenkins' argument that previous treatment studies on AN have been insufficiently powered to detect effect sizes of d = 0.30 seems to neglect the issue of clinical significance (Jacobson and Truax, 1991; Bauer et al., 2004). Taking into account the standard deviation in BMI at end of treatment in the ANTOP study, for instance, an effect size of d = 0.30 would translate into a mean difference of 0.513 BMI points, equalling 1.43 kg (given the mean height of the sample in this study). Thus, studies that are sufficiently powered to detect an effect size of d = 0.30, as suggested by Dr Jenkins, would render two treatments significantly different if they result in a mean difference in body weight of 1.43 kg. It can be questioned whether such a difference should be considered clinically meaningful.
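The conversion from a standardised effect size to raw units can be sketched as follows. The standard deviation (1.71 BMI points) and mean height (1.67 m) used here are not quoted from the ANTOP publication. They are assumptions back-calculated from the figures given above (0.513 / 0.30 and the ratio 1.43 / 0.513, respectively), and the function names are hypothetical.

```python
def d_to_raw_difference(d: float, sd: float) -> float:
    """Convert Cohen's d into a raw mean difference: difference = d * SD."""
    return d * sd


def bmi_to_kg(bmi_diff: float, height_m: float) -> float:
    """Convert a BMI difference (kg/m^2) into kg for a given height: kg = BMI * height^2."""
    return bmi_diff * height_m ** 2


# ASSUMPTIONS: both values are back-calculated from the numbers in the text above,
# not taken from the ANTOP paper itself.
SD_BMI = 1.71         # kg/m^2, implied by 0.513 / 0.30
MEAN_HEIGHT_M = 1.67  # metres, implied by 1.43 kg / 0.513 kg/m^2

bmi_diff = d_to_raw_difference(0.30, SD_BMI)
kg_diff = bmi_to_kg(bmi_diff, MEAN_HEIGHT_M)
print(round(bmi_diff, 3))  # 0.513
print(round(kg_diff, 2))   # 1.43
```

Expressing the effect in kilograms rather than in standard deviation units makes it easier to judge whether a statistically detectable difference is also clinically meaningful.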
In sum, we would like to emphasise once again that we agree with Dr Jenkins' point about the need for large enough sample sizes in AN research. However, this valuable discussion should discount neither the obstacles AN researchers face when planning a clinical trial (including low funding, low prevalence, and high treatment ambivalence in patients) nor the efforts researchers in the field have made to design and conduct methodologically rigorous randomised controlled trials in the past. Finally, discussions around sample size in psychotherapy research should generally take not only statistical but also clinical significance into account.
Author ORCIDs
Timo Brockmeyer, 0000-0003-2544-7610.
Target article
Treatment of anorexia nervosa: is it lacking power?