Two recent studies in PSM (Brockmeyer et al., Reference Brockmeyer, Friederich and Schmidt2018; Murray et al., Reference Murray, Quintana, Loeb, Griffiths and Le Grange2018) have independently synthesised recent evidence on treatments for anorexia nervosa (AN) in adults and have reached similar, sobering conclusions. In essence, despite the large amount of time, effort, and money that has been invested in evaluating psychological treatments (one, for example, reviewed 19 trials comprising 2092 patients over 5 years), ‘no single psychotherapy has emerged as clearly superior to others in the treatment of adults with AN’ (Brockmeyer et al., Reference Brockmeyer, Friederich and Schmidt2018, p. 1250). Whilst this conclusion is largely supported by the evidence, such results have often been interpreted as meaning that the psychological therapies evaluated are therefore equivalent in effectiveness. Although it has been argued that this supports a ‘common factors’ approach to treatment (e.g. Lose et al., Reference Lose, Davies, Renwick, Kenyon, Treasure and Schmidt2014), one shortcoming is often given insufficient attention: low statistical power.
When comparing two treatments for an illness, the researcher's aim is critical. Consider a novel treatment, treatment B, and an existing treatment, treatment A. Is the aim to determine that treatment B is: (1) different from treatment A (either better or worse); (2) at least as effective as treatment A; or (3) equivalent to treatment A (Tamayo-Sarver et al., Reference Tamayo-Sarver, Albert, Tamayo-Sarver and Cydulka2005)? The apparent confusion around equivalence and non-inferiority trials has been raised by other authors (e.g. see Leichsenring et al., Reference Leichsenring, Abbass, Driessen, Hilsenroth, Luyten, Rabung and Steinert2018), but suffice it to say here that the null and alternative hypotheses differ for each of these aims, such that a traditional comparative test – as might be used in scenario (1) – is likely to be inappropriate in tests of non-inferiority and equivalence (Walker and Nowacki, Reference Walker and Nowacki2011).
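The distinction between these aims can be made concrete with a brief sketch. A non-inferiority claim stands or falls on whether a one-sided confidence bound for the treatment difference excludes a pre-specified margin, not on whether a two-sided difference test is non-significant. The numbers below (an observed standardised difference of 0.05 favouring B, 30 or 60 patients per arm, a margin of 0.30) are purely illustrative assumptions, not values from any trial:

```python
from math import sqrt
from statistics import NormalDist

def noninferior(d_hat, se, margin, alpha=0.05):
    """Non-inferiority check on a standardised difference (B minus A):
    B is declared non-inferior to A if the lower bound of the one-sided
    (1 - alpha) confidence interval lies above -margin."""
    z = NormalDist().inv_cdf(1 - alpha)
    return d_hat - z * se > -margin

# Same illustrative observed difference, two hypothetical sample sizes;
# sqrt(2/n) approximates the standard error of a standardised difference.
print(noninferior(0.05, sqrt(2 / 60), margin=0.30))  # True
print(noninferior(0.05, sqrt(2 / 30), margin=0.30))  # False
```

Note that the same observed difference supports a non-inferiority claim at n = 60 per arm but not at n = 30: an underpowered trial cannot demonstrate non-inferiority even when the point estimate looks favourable, which is precisely why a non-significant two-sided test is not evidence of equivalence.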
Sticking with perhaps the simplest example: a researcher wishes to establish that treatment B is superior to treatment A. A power calculation is vital in determining the minimum sample size needed to detect an effect. Across the two review articles, effect sizes for the difference between two treatments on body mass index (BMI) outcomes rarely exceeded d = 0.30, a small-to-medium-sized effect (and were typically smaller than this, particularly for psychological outcomes). In the absence of an agreed non-inferiority margin (or ‘clinically acceptable difference’), this would seem a reasonable assumption of magnitude and is less than half of the anticipated effect of the comparator treatment (for BMI, for example, effect sizes are usually around d = 0.60–1.00 for established treatments; e.g. Zipfel et al., Reference Zipfel, Wild, Groß, Friederich, Teufel, Schellberg, Giel, de Zwaan, Dinkel, Herpertz, Burgmer, Löwe, Tagay, von Wietersheim, Zeeck, Schade-Brittinger, Schauenburg and Herzog2014). Given conventions around acceptable error rates (i.e. α = 0.05, 1 − β = 0.80), this would suggest a minimum sample size of 139 in each arm to demonstrate that treatment B is superior to treatment A. None of the studies of outpatient trials in adults reached this level. The issue is underscored by the confidence intervals of treatment comparisons, which are often wide and encompass what might be proffered as a clinically acceptable difference.
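The arithmetic behind such a figure can be sketched with the standard normal-approximation formula, n = 2(z₁₋α + z₁₋β)²/d² per arm, assuming a one-sided superiority test. (The approximation gives 138; exact t-based software gives the 139 quoted above.)

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for a one-sided
    two-sample comparison of means at standardised effect size d."""
    z = NormalDist().inv_cdf
    return ceil(2 * (z(1 - alpha) + z(power)) ** 2 / d ** 2)

print(n_per_arm(0.30))  # 138 per arm for a between-treatment difference
print(n_per_arm(0.80))  # 20 per arm if powered only for one treatment's effect
```

The second call illustrates the point made below: a sample size chosen on the basis of a single treatment's anticipated effect (here d = 0.80) is far smaller than that required to detect a plausible difference between two active treatments.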
Even if other study biases are limited, low statistical power will engender a tendency towards showing more false negatives. By way of illustration, in their review, Brockmeyer et al. (Reference Brockmeyer, Friederich and Schmidt2018) considered only studies ‘with a minimal sample size of n = 100’ (p. 1229) and included having a ‘sample size n > 30 in each condition’ (p. 1229) in their quality appraisal. Although this may seem encouraging, it remains possible that any lack of differences found between treatments rests on low statistical power; a sample size of 30 seems arbitrary and, if based on the hypothesised effectiveness (effect size) of one treatment, is unlikely to be sufficient to detect an effect between two treatments. Studies with low statistical power also risk reporting findings as true when in fact they are not, and risk over-estimating the magnitude of those effects (Button et al., Reference Button, Ioannidis, Mokrysz, Nosek, Flint, Robinson and Munafò2013). This can affect later research, whereby sample sizes determined on ‘historical precedent rather than through formal power calculation’ may hamper attempts at replication (Button et al., Reference Button, Ioannidis, Mokrysz, Nosek, Flint, Robinson and Munafò2013, p. 367).
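The twin risks of low power – few true effects detected, and inflated estimates among those that are – can be shown with a small Monte Carlo sketch. The parameter values (n = 30 per arm, a true effect of d = 0.30, an approximate z test) are illustrative assumptions chosen to mirror the figures discussed above:

```python
import random
from statistics import mean, stdev

def simulate(true_d=0.30, n=30, sims=5000, seed=1):
    """Simulate many small two-arm trials with a true effect of true_d.
    Returns the empirical power and the mean observed effect size
    among trials reaching significance in the expected direction."""
    rng = random.Random(seed)
    sig = []
    for _ in range(sims):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(true_d, 1.0) for _ in range(n)]
        pooled = ((stdev(a) ** 2 + stdev(b) ** 2) / 2) ** 0.5
        d_hat = (mean(b) - mean(a)) / pooled
        if d_hat * (n / 2) ** 0.5 > 1.96:   # approximate z test
            sig.append(d_hat)
    return len(sig) / sims, mean(sig)

power, d_sig = simulate()
print(f"empirical power ≈ {power:.2f}; mean significant d ≈ {d_sig:.2f}")
```

Under these assumptions, only around a fifth of such trials detect the effect, and those that do report an average effect size roughly double the true d = 0.30 – the over-estimation described by Button et al.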
This brief summary echoes the conclusions of Rief and Hofmann (Reference Rief and Hofmann2018) that the scientific community needs to take issues of study design more seriously. Of concern, a number of authors have repeatedly argued for the issue of low statistical power to be addressed, with studies suggesting that the problem is, in fact, endemic (e.g. see Le Henanff et al., Reference Le Henanff, Giraudeau, Baron and Ravaud2006; Button et al., Reference Button, Ioannidis, Mokrysz, Nosek, Flint, Robinson and Munafò2013; Vankov et al., Reference Vankov, Bowers and Munafò2014). Power to detect an effect ought to be an essential element of research design within null-hypothesis significance testing; failing to meet a minimum sample size is likely to render subsequent conclusions questionable at best. Guidance has been published on the reporting of non-inferiority and equivalence trials (Piaggio et al., Reference Piaggio, Elbourne, Pocock, Evans and Altman2012).
That researchers and other stakeholders invest so much in evaluating treatments of AN likely reflects both the severity of the illness and the positive intent of those wishing to eradicate it. Failing to recruit sufficient numbers casts doubt on most conclusions, can influence later meta-analyses (e.g. Peters et al., Reference Peters, Sutton, Jones, Abrams and Rushton2006), and, ultimately, does not do justice to those who volunteer to participate in these studies. If sufficient investment is given (alongside appropriate structural change; see Vankov et al., Reference Vankov, Bowers and Munafò2014; Higginson and Munafò, Reference Higginson and Munafò2016), rewards in terms of treatment outcomes should begin to emerge more clearly.
Conflict of interest
None.
Target article
Treatment of anorexia nervosa: is it lacking power?
Related commentaries (1)
Sample size in clinical trials on anorexia nervosa: a rejoinder to Jenkins