Statistical inference has played a crucial role in scientific research since the latter half of the 20th century by bridging data and hypothesis testing (Gigerenzer, Swijtink, Porter, & Daston, 1990). Currently, the most common statistical index in the scientific literature is the p value, despite repeated criticism of its thoughtless use (Benjamin et al., 2018; Cumming, 2013; Cumming et al., 2007; McCloskey & Ziliak, 2008). Over the last 20 years, items (e.g., figures and tables) displayed in the top three multidisciplinary journals (Nature, Science, and PNAS) have relied increasingly on p values (Cristea & Ioannidis, 2018).
However, the widely used p value is also generally misunderstood. Several surveys in psychology show that most researchers and students misinterpret p values (Badenes-Ribera, Frias-Navarro, Iotti, Bonilla-Campos, & Longobardi, 2016; Badenes-Ribera, Frías-Navarro, Monterde-i-Bort, & Pascual-Soler, 2015; Haller & Krauss, 2002; Lyu, Peng, & Hu, 2018; Oakes, 1986). This misinterpretation may result in the misuse and abuse of p values, such as the cult of statistical significance (McCloskey & Ziliak, 2008) and p-hacking (Head, Holman, Lanfear, Kahn, & Jennions, 2015; Nuijten, Hartgerink, van Assen, Epskamp, & Wicherts, 2016), which might be the main reason behind the replication crisis in psychology (Hu et al., 2016; John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011).
Effect sizes and their confidence intervals (CIs) offer an alternative to p values. In particular, CIs convey the uncertainty around an effect size estimate and help researchers draw better statistical inferences (Coulson, Healey, Fidler, & Cumming, 2010). However, CIs are also difficult to understand. For example, Hoekstra, Morey, Rouder, and Wagenmakers (2014) surveyed researchers’ understanding of CIs using an approach similar to the p-value surveys and found that most researchers misunderstood CIs. This phenomenon has been confirmed by surveys from multiple countries (Greenland et al., 2016; Lyu et al., 2018; Morey, Hoekstra, Rouder, & Wagenmakers, 2016).
Even with multiple surveys available, several questions remain unanswered. First, almost all available data come from researchers in psychology or biomedical science; only a few studies have surveyed researchers in other disciplines. Given that p values and CIs are used in other fields as frequently as in psychology (Colquhoun, 2014; Vidgen & Yasseri, 2016), how well researchers and students in those fields understand these statistical indices is an open question. Second, most previous surveys failed to assess how confident respondents were in their own judgments. Third, most previous surveys focused only on statements about statistically significant results, even though non-significant results are equally important and often miscomprehended (Aczel et al., 2018). To address these issues, we conducted a survey investigating the following aspects of the misinterpretation of p values and CIs: (1) whether the misinterpretation prevails across different fields of science; (2) whether researchers interpret significant and non-significant results differently; and (3) whether researchers are aware of their own misinterpretations, that is, how confident they are when they endorse a statement about p values or CIs.
In this survey, we adopted four questions from previous studies (Gigerenzer, 2004; Haller & Krauss, 2002; Hoekstra et al., 2014) for p values and CIs. These questions have been used in Germany (Haller & Krauss, 2002), the UK (Oakes, 1986), Spain (Badenes-Ribera et al., 2015), Italy (Badenes-Ribera et al., 2015), Chile (Badenes-Ribera et al., 2015), and China (Hu et al., 2016; Lyu et al., 2018). We selected only four items to keep the questionnaire short, and we opted for these particular items because they are widely used and enable comparison between the present and previous surveys. These items nevertheless have limitations. For example, certain items (e.g., “The probability that the true mean is greater than 0 is at least 95%”; “The probability that the true mean equals 0 is smaller than 5%.”) in the study of Hoekstra et al. (2014) may not be strictly “incorrect”, given the varied understandings of the concept of “probability” (Miller & Ulrich, 2015).
Materials and methods
Participants
All participants were recruited through online advertisements on WeChat public accounts; these subscription accounts enable users to obtain information and interact with the account owners (Montag, Becker, & Gan, 2018). Specifically, our advertisements were spread via The Intellectuals (知识份子), Guoke Scientists (果壳科学人), Capital for Statistics (统计之都), Research Circle (科研圈), 52brain (我爱脑科学网), and Quantitative Sociology (定量群学). The advertisements posted on these accounts were identical, emphasizing the importance of statistics and encouraging readers to devote their time to a scientific purpose by clicking the Qualtrics link at the end of the post and completing our survey. A total of 4,206 respondents from different backgrounds (academic background was classified by the degree respondents had been awarded in China) voluntarily participated in the survey. However, 2,727 of them withdrew before completing the survey, leaving a sample size of 1,479. All participants read and signed the informed consent form prior to their participation. Data were collected from September 2017 to November 2018. The completion rate (35%) was higher than in previous studies in psychology: specifically, 10% and 7% higher than in Badenes-Ribera et al. (2015) and Badenes-Ribera et al. (2016), respectively.
Materials
The questions on the interpretation of p values and CIs were adopted from Lyu et al. (2018). They were first translated by C-P Hu and then reviewed by other bilingual psychological researchers (X-K Lyu and Dr Fei Wang at Tsinghua University) to ensure accuracy. Our survey included scenarios on p values and CIs. To investigate the understanding of non-significant results, we created two versions of the survey: one used a significant scenario (p < .05 and a CI that did not include zero) and the other a non-significant scenario (p > .05 and a CI that included zero). Participants were randomly assigned to either the significant or the non-significant version by Qualtrics.
Questions for p values
This scenario was adopted from previous studies (Gigerenzer, 2004; Haller & Krauss, 2002; Lyu et al., 2018). Respondents read a research context and were then asked to judge whether four statements could be logically inferred from the p value of the results. To explore the effect of significant versus non-significant results, the p value was either smaller than .05 or greater than .05. Respondents first read the following scenario: Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (50 subjects in each sample). Your hypotheses are as follows. H0: No significant difference exists between the experimental and the control groups. H1: Significant difference exists between the experimental and the control groups. Further, suppose you use a simple independent means t test, and your result is t = 2.7, df = 98, p = .008 (in the significant version) or t = 1.26, df = 98, p = .21 (in the non-significant version).
Participants were asked to judge the following statements (the italicized phrases differ between the two versions of our survey; the non-significant version is given in parentheses): (a) You have absolutely disproved (proved) the null hypothesis; (b) You have found the probability of the null (alternative) hypothesis being true; (c) You know, if you decide to (not to) reject the null hypothesis, the probability that you are making the wrong decision; (d) You have a reliable (unreliable) experimental finding in the sense that you would obtain a significant result on 99% (21%) of occasions if, hypothetically, the experiment were repeated multiple times.
Questions for CIs
This scenario was also adopted from previous studies (Hoekstra et al., 2014; Lyu et al., 2018). As in the p-value scenario, respondents first read one of two versions of the context, in which the CI did not (significant) or did (non-significant) include zero: A researcher conducts an experiment, analyzes the data, and reports: “The 95% (bilateral) confidence interval of the mean difference between the experimental group and the control group ranges from .1 to .4 (or from –.1 to .4 in the non-significant version).”
They were then required to judge the accuracy of each statement (the italicized phrases differ between the two versions of our survey; the non-significant version is given in parentheses): (a) A 95% probability exists that the true mean lies between .1 (–.1) and .4; (b) If we were to repeat the experiment over and over, then 95% of the time the true mean would fall between .1 (–.1) and .4; (c) If the null hypothesis is that no difference exists between the means of the experimental group and the control group, then the experiment has disproved (proved) the null hypothesis; (d) The null hypothesis is that no difference exists between the means of the experimental and the control groups. If you decide to (not to) reject the null hypothesis, then the probability that you are making the wrong decision is 5%. The English-translated questionnaires are available at osf.io/mcu9q/.
After judging each statement, respondents were immediately asked to rate their confidence in that judgment from 1 (not confident at all) to 5 (very confident). None of the statements can be logically inferred from the results; hence, any statement for which the “True” option was chosen was coded as a misinterpretation of the p value or CIs.
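As a minimal sketch of this coding scheme (the data frame `responses` below is hypothetical, not our survey data), the per-item error rates and the share of respondents with at least one error can be computed in base R:

# Hypothetical example: each row is a respondent, each column a statement
# (TRUE = the "True" option was chosen, i.e., a misinterpretation).
set.seed(1)
responses <- data.frame(
  s_a = sample(c(TRUE, FALSE), 100, replace = TRUE),
  s_b = sample(c(TRUE, FALSE), 100, replace = TRUE),
  s_c = sample(c(TRUE, FALSE), 100, replace = TRUE),
  s_d = sample(c(TRUE, FALSE), 100, replace = TRUE)
)
colMeans(responses)            # error rate per statement
mean(rowSums(responses) >= 1)  # share of respondents with at least one error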
Data analysis
R 3.5.3 was used to analyze the data. The error rates of different groups of participants were compared with chi-square tests under the NHST framework. In addition, we report Bayes factors (BFs) as complementary indices for statistical inference. Bayes factors were calculated using JASP 0.8.6 with the default prior (Hu, Kong, Wagenmakers, Ly, & Peng, 2018; Love et al., 2019). The following criteria were used for Bayesian inference: 1 < BF10 < 3 indicates anecdotal evidence for H1, 3 < BF10 < 6 weak evidence, 6 < BF10 < 10 moderate evidence, 10 < BF10 < 100 strong evidence, and BF10 > 100 overwhelming evidence (Jeffreys, 1961). All analysis code is available at osf.io/mcu9q/.
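Although the BFs were computed in JASP, the same comparison can be sketched in R with the BayesFactor package, on which JASP builds; the counts below are illustrative placeholders, not our data:

# Illustrative sketch (hypothetical counts): compare two groups'
# error rates with a chi-square test and a default-prior Bayes factor.
library(BayesFactor)
counts <- matrix(c(698,  61,   # group 1: >= 1 error / no error
                   619, 101),  # group 2: >= 1 error / no error
                 nrow = 2, byrow = TRUE)
chisq.test(counts)  # NHST comparison of the two error rates
# Bayes factor for the same 2 x 2 table; group sizes were fixed
# by design, hence independent-multinomial sampling with fixed rows.
contingencyTableBF(counts, sampleType = "indepMulti", fixedMargin = "rows")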
Results
A total of 1,479 participants provided valid data on the p-value or CI items. Sample sizes for the significant and non-significant versions were n = 759 and n = 720, respectively. None of the statements about p values and CIs can logically be inferred from the given context; we therefore calculated the error rate for each item (see the supplementary materials for explanations of why each statement is wrong).
In general, the results (all raw data are available at osf.io/mcu9q/) show that 89% of respondents made at least one error when interpreting a p value, and 93% made at least one error when interpreting CIs. The percentage of misinterpretation did not differ across educational attainment (Figure 1a) or academic background (Figure 1b and Table 1). This pattern held when we limited the analysis to postgraduates and researchers (excluding respondents whose highest degree was a bachelor’s; see Supplementary result 1, Figure S1).
Note: Discipline division was based on the degree the respondents were awarded in China. Science = disciplines awarded a degree in natural science, excluding mathematics and statistics; Eng/Agr. = engineering/agronomy; Social Science = sociology or other social sciences.
Regarding the difference between the significant and non-significant versions, the error rate for p values was lower in the latter (86%) than in the former (92%), χ2(1) = 16.841, p < .001, BF10 = 543.871. We found no strong evidence for a difference between significant CIs (94%) and non-significant CIs (91%), χ2(1) = 2.892, p = .089, BF10 = 0.580. For detailed analyses and figures, see the supplementary materials.
Most respondents were confident in their judgments: across all four statements for both p values and CIs, the average confidence rating exceeded 3.8 out of 5 (see Figure 2a and 2b). We also compared confidence levels between correct and wrong answers using t tests and found that, for certain items, confidence was higher for accurate answers (see Supplementary results 3, Table S1).
Our exploratory analysis revealed that respondents who obtained their highest degree overseas or in Hong Kong, Macao, or Taiwan might have a lower error rate in interpreting p values than those who obtained their highest degree in Mainland China (see Figure 1c). For p values, 90% of respondents who obtained their highest degree in Mainland China (n = 1,231) gave at least one wrong answer, compared with 84% of respondents who obtained their highest degree overseas (n = 248), χ2(1) = 6.38, p = .012, BF10 = 1.654. For CIs, 93% of respondents with a Mainland China degree gave at least one wrong answer, compared with 89% of those with an overseas degree, χ2(1) = 4.57, p = .033, BF10 = 0.602. For further analysis of the difference between Mainland China and overseas, see Supplementary Figures S1c and S1d.
Discussion
The current survey found that misinterpretation of p values and CIs is prevalent in the Chinese scientific community, even in methodology-related fields. Misinterpretation rates were high for both significant and non-significant p values, and for CIs that did or did not include zero. Moreover, researchers and students were generally confident about their (incorrect) judgments. These results suggest that researchers generally do not understand these common statistical indices well.
The possible reasons for these misconceptions have been discussed in the literature. For example, Gigerenzer (2004, 2018) suggested that researchers use p values as a “null ritual”, which has the following steps (Gigerenzer, 2004):
1. Set up a null hypothesis of “no mean difference” or “zero correlation”. Do not specify the predictions of your own research hypothesis.
2. Use 5% as a convention for rejecting the null hypothesis. If the test is significant, then accept your research hypothesis. Report the test result as p < .05, p < .01, or p < .001, whichever level is met by the obtained p value.
3. Always perform this procedure.
This “ritual” has been “inherited” by generations of researchers in psychology, as demonstrated by the inaccurate interpretations of statistical significance in introductory psychology textbooks (Cassidy, Dimova, Giguère, Spence, & Stanley, 2019). Our results confirm and extend this view. First, similar to previous surveys in psychology (Haller & Krauss, 2002), we found that respondents who taught statistics had a high error rate (>80%); thus, students may acquire a wrong understanding of p values from the very beginning. Second, our results extend the scope of previous surveys and suggest that the “ritual” is not limited to psychology or the social sciences but pervades the entire scientific community. In our survey, the four items represent different “illusions” that are necessary for justifying the null ritual (Gigerenzer, 2004, 2018).
First, over half of the respondents treated p values as evidence that disproves or proves a null hypothesis (statement A for p values and statement C for CIs). This “illusion of certainty” (Gigerenzer, 2004, 2018) justifies the use of the null ritual. It may even motivate researchers to interrogate their data until they obtain a value smaller than .05 as evidence for the existence of an effect. This motivation is further reinforced by the current publishing system, in which p < .05 is a premise of publication.
Our results also revealed that respondents across different fields share the “replication delusion” and Bayesian wishful thinking. Over 50% of respondents believed that 1 − p or 1 − α represents the probability of a successful replication (statement D in the p-value section and statement B in the CI section); however, p values convey nothing about the replication rate. For statement C in the p-value section, respondents equated the p value with the type I (or type II) error rate, thereby confusing the probability of the data given the hypothesis, P(D|M), with the probability of the hypothesis given the data, P(M|D). This confusion represents Bayesian wishful thinking.
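A one-line application of Bayes’ theorem makes the distinction explicit: P(M|D) depends not only on P(D|M) but also on the prior P(M) and on how probable the data are under the alternative, so neither conditional probability can be read off from the other:

P(M \mid D) = \frac{P(D \mid M)\, P(M)}{P(D \mid M)\, P(M) + P(D \mid \neg M)\, P(\neg M)}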
Methodologists have long discussed the lack of statistical thinking and its potential consequences (Cohen, 1962, 1994; Gigerenzer, 2004; Goodman, 2008; Meehl, 1978), but these warnings went largely unheeded. Only after the “replication crisis” did researchers rediscover the problems with p values, and the “p-war” became one of the highlights of the field (Amrhein & Greenland, 2017; Amrhein, Greenland, & McShane, 2019; Benjamin et al., 2018; Lakens et al., 2018). The rationale behind this debate is straightforward: the p value is the most widely used statistical index, and many of the problems that have plagued psychology and the social sciences are related to the misunderstanding of p values and of statistics in general. For example, statistical power (Bakker, Hartgerink, Wicherts, & van der Maas, 2016) was promoted by Cohen from the 1960s onward (Cohen, 1962, 1994), yet the low-power problem has persisted in psychology (Button et al., 2013; Maxwell, 2004), probably because statistical power is not part of the “null ritual” (Gigerenzer, 2018). Similar issues are questionable research practices (John et al., 2012) and publication bias (Franco, Malhotra, & Simonovits, 2014), which are probably due to the “illusion of certainty” among researchers. By revealing that researchers outside psychology share the same inaccurate understanding of p values and CIs, our results suggest that other fields may be threatened by the same problems.
Another important addition to what is known about the misunderstanding of p values and CIs is our respondents’ confidence ratings. Most respondents were relatively confident about their own responses. This finding provides further evidence that people hold a false certainty about their own understanding, and this unwarranted certainty justifies their use of p values. As with researchers’ understanding of power (Bakker et al., 2016), this result indicates that researchers across different fields may rely on intuition rather than statistical thinking when making research decisions.
In our survey, respondents who received their highest degree abroad performed better on the p-value items than their peers who acquired their highest degree in Mainland China, although this difference did not extend to the CI items. One possible explanation is that the replication crisis has been discussed more in English-language media than in Chinese-language media; therefore, students who studied overseas may be more familiar with the topic than their local counterparts.
Limitations
Several limitations of this survey should be pointed out. First, although we used a multidisciplinary and relatively large sample, the data came from a convenience sample, which might not be representative of the entire population. If anything, our results may underestimate the rate of misunderstanding of p values and CIs: because our survey provided no compensation, most respondents were probably already interested in p values and related issues, and people who are interested in statistical issues would typically perform better than those who are not. Second, as mentioned above, we used four items for p values and four for CIs, and the validity of certain items remains controversial. Third, we found that respondents had great confidence in their interpretations of p values and CIs, but we did not examine why they were confident or how they made their decisions.
Conclusion
The current survey showed that researchers across various fields of science may be unable to interpret p values and CIs correctly and, moreover, are largely unaware of their own misinterpretations. These results call for deeper and more accurate statistical training in all scientific fields.
Supplementary material
To view supplementary material for this article, please visit https://doi.org/10.1017/prp.2019.28
Acknowledgments
We thank the following online media/websites for circulating our recruitment information: The Intellectuals (知识份子); Guoke Scientists (果壳科学人); Capital for Statistics (统计之都); Research Circle (科研圈); 52brain (我爱脑科学网); Quantitative Sociology (定量群学).
Funding
This study was supported by Social Sciences and Humanities Youth Foundation of Ministry of Education of China (19YJC840030) and Philosophy and Social Science Foundation of Tianjin (TJJX18-001).