With more than 100 non-inferiority or equivalence trials published per year in many areas of research (Piaggio et al., 2012), statistical and methodological issues involved in these trials become increasingly important. A recent article by Rief and Hofmann (2018) suggests, however, that some of these issues are not sufficiently clear. For this reason, central issues will be discussed here and some misunderstandings will be addressed.
Equivalence and non-inferiority margins
No generally accepted standards exist for defining a non-inferiority or equivalence margin (i.e. the minimum difference important enough to make treatments non-equivalent). In 332 equivalence or non-inferiority medical trials, a median margin of 0.50 standard deviations was found (Lange and Freitag, 2005), corresponding quite well to the value of 0.42 reported by Gladstone and Vach (2014). Only five studies used margins < 0.25 (Gladstone and Vach, 2014) and only 12% of studies used margins ⩽0.25 (Lange and Freitag, 2005).
In psychotherapy research, margins ranging from 0.24 to 0.60 have been proposed (e.g. Steinert et al., 2017, p. 944). In a meta-analysis of psychodynamic therapy (PDT) including different mental disorders, Steinert et al. (2017) chose a margin of g = 0.25, which is among the smallest margins ever used in psychotherapy and medical research (Gladstone and Vach, 2014, Figure 2; Steinert et al., 2017, p. 944). This margin is very close to both (a) the threshold for a minimally important difference specifically suggested for depression (0.24, Cuijpers et al., 2014), and (b) the margin recommended by Gladstone and Vach (2014) to protect against degradation of treatment effects in non-inferiority trials (d = −0.23).
In their recent correspondence article, Rief and Hofmann (2018) make a quite different proposal, recommending that margins not fall below 90% of the uncontrolled effect size of the established treatment. This proposal, however, is associated with several problems described in more detail in Table 1, particularly regarding the clinical significance of the suggested margin and its implications for sample size determination, which would render non-inferiority trials in psychotherapy research virtually impossible.
a Paul Crits-Christoph, personal communication, 16 February 2018.
b Paul Crits-Christoph, personal communication, 26 February 2018.
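How strongly the choice of margin drives the required sample size can be illustrated with the common normal-approximation formula for a two-arm non-inferiority comparison of means, n per group = 2(z_{1−α} + z_{1−β})² / δ², where δ is the margin in standard deviation units. The following is a minimal sketch and not a calculation taken from any of the cited papers. It assumes a true difference of zero, equal group sizes, a one-sided α of 0.025 and 80% power; the function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(margin, alpha=0.025, power=0.80):
    """Approximate sample size per arm for a non-inferiority test of means.

    Assumptions (illustrative only): standardized margin in SD units,
    true difference of zero, equal allocation, one-sided alpha,
    normal approximation.
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)  # critical value of the one-sided test
    z_beta = std_normal.inv_cdf(power)       # quantile corresponding to the power
    return ceil(2 * (z_alpha + z_beta) ** 2 / margin ** 2)


if __name__ == "__main__":
    # Margins spanning the range mentioned for psychotherapy and medical research
    for margin in (0.24, 0.25, 0.50, 0.60):
        print(f"margin = {margin:.2f} SD -> n per group = {n_per_group(margin)}")
```

Under these assumptions a margin of 0.25 requires about 252 participants per group, whereas a margin of 0.50 requires about 63, because the required sample size grows with the inverse square of the margin.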
Statistical hypotheses in equivalence and non-inferiority testing
In equivalence testing, the null and alternative hypotheses of superiority testing are reversed and the statistical alternative hypothesis is consistent with the assumption of equivalence (Lesaffre, 2008; Walker and Nowacki, 2011). To test for equivalence, two one-sided tests are performed determining whether the upper and the lower boundary of the confidence interval (CI) are included in the margin, whereas, for testing non-inferiority, one one-sided test inspecting the lower boundary is used (Lesaffre, 2008; Walker and Nowacki, 2011). A statistically significant result implies here that the effect size and its CI are within the margin, demonstrating equivalence or non-inferiority (Walker and Nowacki, 2011). A recent meta-analysis testing equivalence of PDT to other approaches established in efficacy reported a significant result indicating that the effect sizes and their CIs were completely included in the margin (Steinert et al., 2017). Thus, the recently given interpretation by Rief and Hofmann (2018, p. 2) that Steinert et al. (2017) ‘… found a significant disadvantage of PDT [psychodynamic therapy] compared with other treatments (including CBT)’ is simply wrong (Lesaffre, 2008; Walker and Nowacki, 2011).
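The logic of the two one-sided tests can be sketched in a few lines. The example below is illustrative only: it assumes a normally distributed effect size estimate with a known standard error, a symmetric margin, and a sign convention in which negative values favour the comparator. The numbers in the usage section are hypothetical and are not the results of Steinert et al. (2017); the names `tost`, `non_inferior` and `TostResult` are likewise hypothetical.

```python
from statistics import NormalDist
from typing import NamedTuple


class TostResult(NamedTuple):
    p_lower: float     # p value for H0: effect <= -margin
    p_upper: float     # p value for H0: effect >= +margin
    ci_low: float      # lower bound of the (1 - 2*alpha) CI
    ci_high: float     # upper bound of the (1 - 2*alpha) CI
    equivalent: bool   # True if both one-sided tests are significant


def tost(effect, se, margin, alpha=0.05):
    """Two one-sided tests for equivalence (normal approximation)."""
    std_normal = NormalDist()
    # Test 1: reject "effect <= -margin" when the estimate lies well above -margin
    p_lower = 1 - std_normal.cdf((effect + margin) / se)
    # Test 2: reject "effect >= +margin" when the estimate lies well below +margin
    p_upper = std_normal.cdf((effect - margin) / se)
    # Equivalent view: the (1 - 2*alpha) CI must lie entirely inside the margin
    z_crit = std_normal.inv_cdf(1 - alpha)
    ci_low, ci_high = effect - z_crit * se, effect + z_crit * se
    return TostResult(p_lower, p_upper, ci_low, ci_high,
                      max(p_lower, p_upper) < alpha)


def non_inferior(effect, se, margin, alpha=0.025):
    """One one-sided test: only the lower boundary is inspected."""
    return 1 - NormalDist().cdf((effect + margin) / se) < alpha


if __name__ == "__main__":
    # Hypothetical estimate: g = -0.05, SE = 0.08, margin = 0.25
    result = tost(effect=-0.05, se=0.08, margin=0.25)
    print(result)
    print("non-inferior:", non_inferior(effect=-0.05, se=0.08, margin=0.25))
```

In this sketch a significant result means that the null hypothesis of non-equivalence is rejected, which is the reverse of what significance means in a superiority test.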
Equivalence v. non-inferiority testing
Equivalence and non-inferiority testing need to be differentiated (Treadwell et al., 2012). In non-inferiority testing, for example, the test treatment is expected to be superior to the standard treatment in measures not related to efficacy such as side effects or costs (Treadwell et al., 2012). Rief and Hofmann did not make this differentiation. In fact, the meta-analysis by Steinert et al. (2017), for example, was a test of equivalence, not of non-inferiority as suggested by Rief and Hofmann (2018).
Assay sensitivity and constancy of study conditions
Equivalence and non-inferiority testing require that the efficacy of the comparator is ensured and that the study conditions are comparable with those in which the efficacy of the comparator was established (Treadwell et al., 2012). In this context, Rief and Hofmann (2018) claim that specific issues of (low) study quality favour non-inferiority results, e.g. low response rates found in specific studies or low treatment integrity. Again, however, these claims are not supported by evidence (Table 1). This applies to several further issues put forward by Rief and Hofmann (2018) which are briefly discussed in Table 1, for example to the relationship between equivalence testing and the number of studies available for a specific treatment.
Conclusions
Equivalence and non-inferiority testing pose specific methodological problems (Piaggio et al., 2012; Treadwell et al., 2012), for example, in defining a margin, statistical testing, and ensuring the efficacy of the comparator or comparability of study conditions (Table 1). The conclusions about equivalence and non-inferiority testing presented here differ from those of Rief and Hofmann (2018) and are more consistent with the available evidence and with usual standards across a range of scientific disciplines.