Introduction
Consider the hypothetical information presented in Figure 1. Data like these are often followed by enthusiastic statements that observed responses were ‘very highly significantly different’ (Fig. 1A) or dispirited assertions that differences were ‘not statistically significant’ (Fig. 1B) when assessed against a theoretical level of statistical significance such as 0.05. Researchers then interpret these findings as evidence for large (Fig. 1A) or no (Fig. 1B) effects of the independent variables on response variables. Perhaps you have witnessed such examples at recent meetings or in publications. However, do comparisons of P-values against cut-off values (e.g., α = 0.05) grant researchers the ability to make claims about the magnitude of differences, the strength of association between variables, or the practical relevance of a study? No! They do not (Nickerson, 2000; Ellis, 2010; Aarts et al., 2014; Nuzzo, 2014; Greenland et al., 2016; Wasserstein and Lazar, 2016; Wasserstein et al., 2019).
Unfortunately, the problem of P-value misinterpretation is widespread and chronic. Authors link this problem to: (1) pervasive misunderstandings regarding the fundamentals of statistical hypothesis testing; (2) conflation of the original intent of P-values as a test of evidence against a null hypothesis with the later application of P-values in evidence-based decision-making frameworks; and (3) shortcomings of statistical training programmes (Nickerson, 2000; Nuzzo, 2014; Greenland et al., 2016; Pernet, 2016; Wasserstein and Lazar, 2016). Regardless, consensus exists that the use of results from statistical hypothesis testing alone, especially when viewed through the lens of significance or non-significance, distorts conclusions (Nuzzo, 2014; Greenland et al., 2016; Wasserstein and Lazar, 2016; Kimmel et al., 2023). As Wasserstein and Lazar (2016) stated: ‘Statistical significance is not equivalent to scientific, human, or economic significance’ [p. 132]. In this brief paper, I aim to raise awareness of effect size measures as necessary statistical tools, provide resources for further consideration, and encourage more widespread use of effect sizes in the seed science literature.
A reminder of the information P-values provide
Fundamentally, a P-value represents the probability of observing a summary statistic (e.g., a mean difference between two groups) that is equal to or more extreme than the sample statistic, given a specific statistical model (Wasserstein and Lazar, 2016). In more tangible terms, a P-value is a measure of compatibility between the observed data and the data expected if all assumptions of a test model (e.g., the null hypothesis) were correct. The smaller the P-value, the more unusual the observed data are relative to the test model; the larger the P-value, the less unusual they are. That's it! This is all the information a P-value provides the researcher – nothing else (Nuzzo, 2014; Greenland et al., 2016; Wasserstein and Lazar, 2016).
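This definition can be made concrete with a small simulation. The sketch below (Python; all data are invented for illustration) builds the distribution of mean differences expected under a null model by permutation and reads the P-value off as a tail proportion:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical samples of days-to-germination; all values are invented
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=11.0, scale=2.0, size=30)
observed_diff = group_a.mean() - group_b.mean()

# Build the distribution of mean differences expected under the test model
# (no group effect) by repeatedly shuffling the group labels
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
null_diffs = np.empty(10_000)
for i in range(null_diffs.size):
    perm = rng.permutation(pooled)
    null_diffs[i] = perm[:n_a].mean() - perm[n_a:].mean()

# The P-value is the proportion of null differences at least as extreme
# as the difference actually observed
p_value = np.mean(np.abs(null_diffs) >= abs(observed_diff))
print(f"observed difference: {observed_diff:.2f}; permutation P-value: {p_value:.4f}")
```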
Crucially, notice how P-values provide no information regarding the magnitude of differences or the level of association between variables. Nuzzo (2014), Greenland et al. (2016), Wasserstein and Lazar (2016) and Wasserstein et al. (2019) offer more complete descriptions of P-value misinterpretations, provide an excellent refresher on what P-values do and do not represent, and explain what not to do with P-values while offering meaningful actions researchers can take in the context of statistical analyses.
The power of effect size indices
It is important to note that sample size affects P-values. P-values typically decrease as sample size increases because random error shrinks: estimates from large samples are less variable and more precise, which facilitates the detection of smaller differences (Cohen, 1988; Ellis, 2010; Greenland et al., 2016; Wasserstein and Lazar, 2016). This means that trivial differences or associations may be deemed statistically significant (e.g., Fig. 1A) if the sample size is large enough or measurements are highly precise. The reverse is also true: non-trivial differences may show up as not statistically significant in studies with small sample sizes or imprecise measurements (Cohen, 1988; Ellis, 2010; Greenland et al., 2016; Wasserstein and Lazar, 2016). In contrast, effect size indices are independent of sample size (Cohen, 1988; Ellis, 2010).
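An illustrative simulation of this point (Python; the ‘trivial’ true effect of 0.05 standard deviations and both sample sizes are invented for illustration). The P-value collapses as n grows, while the effect size estimate does not:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                        / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled_sd

# The same trivial true difference (0.05 SD) at a small and a huge sample size
for n in (20, 20_000):
    a = rng.normal(0.00, 1.0, n)
    b = rng.normal(0.05, 1.0, n)
    t, p = stats.ttest_ind(a, b)
    print(f"n = {n:>6}: P = {p:.4f}, Cohen's d = {cohens_d(a, b):+.3f}")

# Typical output: the tiny effect is 'non-significant' at n = 20 but
# 'significant' at n = 20,000, while d stays near 0.05 in both cases.
```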
So, what are effect size indices? Effect size indices are statistics that quantify the magnitude of differences between treatment groups or experimental conditions and the strength of correlations between variables (Cohen, 1988; Ellis, 2010). Researchers may be familiar with some types of indices (Kallogjeri and Piccirillo, 2023) but may not have interpreted them as effect sizes. For example, the odds ratio, which is often calculated in connection with logistic regression, compares the odds of an event (e.g., fungal contamination) occurring in one group (e.g., seeds treated with fungicide A) with the odds in another group (e.g., seeds treated with fungicide B). Suppose an analysis yields an odds ratio of 1.86. This means that the odds of fungal contamination in seeds treated with fungicide A are 86% higher than the odds of fungal contamination in seeds treated with fungicide B. Similarly, the hazard ratio (HR), which is associated with regression-based time-to-event analyses in seed biology (McNair et al., 2012; Pérez and Kettner, 2013; Genna et al., 2015; Adegbola and Pérez, 2016; Genna and Pérez, 2016; Pérez and Kane, 2017; Tyler et al., 2017; Campbell-Martínez et al., 2019; Pérez and Chumana, 2020), represents the ratio of estimated hazard rates (i.e., the likelihood of germination) between different covariate values (e.g., doses of a germination-stimulating chemical; treated vs. control) over a unit of time (Allison, 2010). Consider a germination experiment in which seeds received increasing doses of a germination inhibitor and the calculated HR equals 0.95. Applying the formula 100 ⋅ (HR − 1) yields the percent change in hazard for each 1-unit increase in inhibitor dose; here, the likelihood of germination decreases by 5% for each 1-unit increase of inhibitor. Other indices, such as Hedges' g, Cramér's V, or eta-squared (η²), may be less familiar, given the large number (around 70) of indices that exist (Cohen, 1988; Kirk, 2003; Ellis, 2010).
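A worked sketch of both calculations (Python; the 2 × 2 counts are hypothetical and chosen only to reproduce an odds ratio near 1.86):

```python
import numpy as np

# Hypothetical 2x2 contamination counts (rows: fungicide A, B;
# columns: contaminated, clean); invented for illustration
table = np.array([[45, 55],   # fungicide A: 45 contaminated, 55 clean
                  [30, 68]])  # fungicide B: 30 contaminated, 68 clean

odds_a = table[0, 0] / table[0, 1]        # odds of contamination under A
odds_b = table[1, 0] / table[1, 1]        # odds of contamination under B
odds_ratio = odds_a / odds_b
print(f"odds ratio = {odds_ratio:.2f}")   # ~1.86: odds under A are ~86% higher

# Percent change in germination hazard per 1-unit increase in inhibitor dose
hr = 0.95
pct_change = 100 * (hr - 1)
print(f"hazard changes by {pct_change:.0f}% per unit dose")  # -5%
```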
Effect size indices fall into the d or r families. Indices in the d family measure differences between groups. Indices of the r family measure associations between variables (Ellis, 2010; Kallogjeri and Piccirillo, 2023). Ellis (2010, see table 1.1) goes on to subdivide the d family into indices that compare groups on dichotomous outcomes (e.g., odds ratio) and those that compare groups on continuous outcomes (e.g., Cohen's d). Likewise, the r family is divided into indices assessing correlation (e.g., Cramér's V) or the proportion of variance (e.g., η²).
Selecting a suitable effect size index requires the consideration of several factors (Ellis, 2010; Kallogjeri and Piccirillo, 2023). For example, researchers should consider the research problem under investigation. This helps to identify study aims, target outcomes, data structure, measurement methods and study design. Next, researchers define whether outcomes or dependent variables are categorical, continuous or time-to-event in nature. Finally, researchers describe the type of analysis being conducted, such as correlations, regressions, multivariate analysis or analysis of variance (ANOVA) with multiple groups. Researchers with this information in hand will find it easier to determine which index to use when referring to helpful tabulated resources (Ellis, 2010, see table 1.2) or decision trees (Kallogjeri and Piccirillo, 2023).
With the proper effect size index selected, researchers can then move on to analyses and interpretation. But first, consider these cautions. Different indices use different measurement scales for what constitutes a small, medium or large effect (or association). For example, depending on the scientific discipline, a Pearson's correlation coefficient (r) of 0.25 could be considered a small association between variables of interest, whereas a Cohen's d of 0.25 can be deemed a medium effect size (Aarts et al., 2014). Therefore, it is challenging to compare indices that use different effect size criteria unless index conversion formulas are available, and in some cases it may not be possible to convert between indices. Additionally, the criteria for effect sizes of a specific index (e.g., Cohen's d) may not necessarily be applicable across disciplines: a small effect in seed science may not be the same as a small effect in medical research. Consequently, interpretations of effect sizes should be discipline-specific (Cohen, 1988; Ellis, 2010; Brydges, 2019). Finally, remember to report confidence intervals alongside the calculated effect size index. This provides a measure of the precision of the effect size estimate and represents good statistical practice (Greenland et al., 2016; Wasserstein and Lazar, 2016; Wasserstein et al., 2019; Kallogjeri and Piccirillo, 2023).
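As an example of pairing an effect size with its confidence interval, the sketch below computes Cohen's d by hand together with one common large-sample approximation for its standard error (Hedges and Olkin-style); the germination data are invented for illustration:

```python
import numpy as np

def cohens_d_with_ci(x, y, z=1.96):
    """Cohen's d plus an approximate 95% CI (large-sample normal approximation)."""
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1))
                        / (nx + ny - 2))
    d = (np.mean(x) - np.mean(y)) / pooled_sd
    # Approximate standard error of d
    se = np.sqrt((nx + ny) / (nx * ny) + d**2 / (2 * (nx + ny)))
    return d, d - z * se, d + z * se

# Hypothetical germination percentages under two priming treatments
primed  = np.array([88, 91, 85, 90, 93, 87, 92, 89])
control = np.array([82, 85, 80, 84, 86, 81, 88, 83])

d, lo, hi = cohens_d_with_ci(primed, control)
print(f"d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```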
Researchers in the medical and social sciences have been applying effect size indices in their analyses for decades. Such a robust body of analyses often leads to the standardization and contextualization of small, medium and large effects within a discipline (Cohen, 1988; Ellis, 2010). By contrast, apart from ecology, the utilization of effect sizes in many plant-related disciplines, including seed science, has been negligible (Sileshi, 2012). One consequence is that standardized criteria for small, medium and large effects will be absent for some indices. To remedy this, Cohen (1988) cautiously suggested using the criteria outlined in his publication when no discipline-specific criteria exist; for example, Cohen's d values of 0.2, 0.5 and 0.8 represent benchmarks for small, medium and large effect sizes, respectively. But the use of the various general criteria offered by Cohen (1988) must be tempered with the researcher's experience and wisdom. For instance, a five-percentage-point difference in a laboratory germination test (e.g., 93 vs. 98%) for lettuce seeds may turn out to be a small effect. Nonetheless, when scaled to the field level, this difference can have a substantial impact, since the success of a lettuce crop may rely on each sown seed producing a harvestable head. So, context is essential when interpreting effect sizes; otherwise, criteria such as small, medium or large may remain ambiguous (Cohen, 1988; Ellis, 2010; Carey et al., 2023).
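To make the lettuce example concrete: Cohen (1988) proposed Cohen's h, based on an arcsine-square-root transformation, as an effect size for the difference between two proportions, with the same 0.2/0.5/0.8 benchmarks. A minimal sketch using the proportions from the example above:

```python
import math

def cohens_h(p1, p2):
    """Cohen's h: effect size for the difference between two proportions
    (arcsine-square-root transformation; Cohen, 1988)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Lettuce germination example from the text: 98% vs. 93%
h = cohens_h(0.98, 0.93)
print(f"Cohen's h = {h:.2f}")  # ~0.25, a 'small' effect by Cohen's benchmarks
```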
More information on effect sizes
Reporting and interpretation of effect sizes in the seed science literature are rare (Sileshi, 2012), suggesting that effect sizes represent new concepts for our discipline. If the topic of effect sizes is new to you, then a good place to start is with easy-to-digest reading materials (Table 1). Fortunately, most statistical analysis programmes can calculate many effect size indices (Table 1), and these programmes also tend to provide adequate documentation explaining the available indices. If statistical programmes are unavailable or inaccessible, then various websites provide applications to calculate effect sizes (Table 1). Similarly, calculations for several effect sizes are straightforward (Cohen, 1988; Ellis, 2010) and can easily be computed in a spreadsheet or by hand if necessary (Table 1).
Concluding remarks
Effect size indices are powerful tools for extending our results beyond mere statistical significance. Moreover, effect sizes are important for ensuring that our studies are properly powered rather than underpowered (Cohen, 1988; Ellis, 2010; Nuzzo, 2014; Greenland et al., 2016; Brydges, 2019; Kimmel et al., 2023; also see Table 1). Effect size indices are simple to apply and, more importantly, the information they yield contributes to more impactful conclusions relevant to broader audiences while moving a discipline forward. Therefore, I strongly encourage the use of effect sizes in future seed science research.
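To illustrate the link between effect sizes and study power, the sketch below estimates the per-group sample size needed to detect a ‘medium’ standardized difference; it assumes a two-group design analysed with a two-sample t-test and uses the TTestIndPower class from the Python statsmodels package:

```python
from statsmodels.stats.power import TTestIndPower

# Prospective power analysis: sample size per group needed to detect a
# 'medium' standardized effect (d = 0.5) with 80% power at alpha = 0.05
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"required n per group: {n_per_group:.0f}")  # ~64
```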
Funding statement
This work received no specific grant from any funding agency, commercial or non-profit sectors.
Competing interests
The author declares none.