The replicability of empirical findings in science is disappointingly low. Recent data suggest that 70% of surveyed scientists admit to being unable to replicate another’s work (Baker, 2016). Efforts to replicate well-known psychology studies further found that only 36% of studies replicated the original findings (Open Science Collaboration, 2015). Nonetheless, the reporting of positive results in scientific literature increased by 22% between 1990 and 2007 (Fanelli, 2012). This increase could be partially attributed to “p-hacking,” an ethically troubling practice of manipulating analyses to shift results into the range considered significant (Simmons et al., 2011). Although some behaviors could be defensible with a priori reasoning (e.g., removing statistical outliers; Sacco et al., 2018; Sacco & Brown, 2019), using these practices to increase the odds of significant results could inflate Type I error rates in published research. Governing bodies of science have gone so far as to call these practices detrimental (NASEM, 2017).
As concerns grew over the prevalence of these practices, various scientific fields have implemented ameliorative systemic reforms. Some academic journals have instituted submission checklists requiring authors to state their adherence to best practices (Wicherts et al., 2016). For example, the Journal of Experimental Social Psychology requires authors to report every manipulation, measure, and exclusion. With psychology oftentimes leading efforts to develop and implement open science practices (Nosek et al., 2022), evidence of their efficacy exists primarily in that discipline (Brown et al., 2022; Protzko et al., 2023). Recent research has tasked scientists across disciplines with identifying these practices in their respective fields. Scientists in the life sciences report concerns about HARKing, p-hacking, selective reporting, and a lack of methodological transparency. Political scientists report concerns about fashion-based selection of research ideas, politicization of research, p-hacking, salami-slicing, and selective reporting (Ravn & Sørensen, 2021; Rubenson, 2021). This letter documents the extent to which journals in the behavioral sciences have developed submission requirements to minimize the proliferation of these practices in published research. From there, we provide preliminary evidence for the efficacy of these measures while addressing the practical constraints faced by more interdisciplinary sciences, offering tangible recommendations for journals to consider.
Detrimental practices increase the likelihood of a result being deemed “publishable,” given journals’ general bias toward publishing significant findings. Examples of detrimental behaviors include adding unjustified covariates to a model on a post hoc basis (Simmons et al., 2011) and selectively reporting findings that support hypotheses (Ioannidis & Trikalinos, 2007). To identify potential rates of non-replicable findings, researchers have begun evaluating published results and calculating estimates of this likelihood. Many of these indices (e.g., p-curves) consider the extent to which significant p-values cluster just below the α threshold rather than spanning the entire critical region of p < .05 (Simonsohn et al., 2014), whereas others estimate the probability that results would replicate (e.g., z-curves; Bartoš & Schimmack, 2022).
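To illustrate the intuition behind these indices, the brief Python sketch below implements a simplified skew check in the spirit of p-curve analysis: if significant p-values were uniform below .05 (as expected under a true null with no selective reporting), roughly half should fall below .025, so a surplus of very small p-values suggests evidential value, whereas clustering just under .05 is consistent with p-hacking. The p-values and the binomial-test shortcut are illustrative assumptions, not the full procedure of Simonsohn et al. (2014).

```python
# Minimal sketch of a p-curve-style skew check (simplified; hypothetical data).
from scipy.stats import binomtest

p_values = [0.003, 0.012, 0.021, 0.034, 0.041, 0.008, 0.019]  # illustrative reported p-values

significant = [p for p in p_values if p < 0.05]   # p-curve considers only significant results
low_half = sum(p < 0.025 for p in significant)    # how many fall in the lower half of (0, .05)

# Under a true null with no selective reporting, significant p-values are roughly uniform on
# (0, .05), so about half should fall below .025; a surplus below .025 (right skew) suggests
# evidential value.
result = binomtest(low_half, n=len(significant), p=0.5, alternative="greater")
print(f"{low_half}/{len(significant)} significant p-values < .025; binomial p = {result.pvalue:.3f}")
```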
As the possibility of assessing replicability increases, objective metrics could be used to evaluate the efficacy of submission requirements. We have recently begun evaluating these efforts empirically. This endeavor involved calculating p-curves for findings published in major psychology journals (e.g., Psychological Science and Journal of Personality and Social Psychology) following the enactment of submission requirements. We quantified the number of requirements (e.g., reporting all measures and providing open data) listed on journals’ websites. Journals with more submission requirements had lower estimates of non-replicable findings (Brown et al., 2022). Table 1 provides the list of empirically identified submission rules from this analysis.
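As a purely hypothetical illustration of this kind of journal-level comparison (not the actual data or analysis from Brown et al., 2022), the sketch below correlates each journal’s count of submission requirements with an estimated proportion of non-replicable findings; the journal labels and values are invented.

```python
# Hypothetical journal-level comparison: number of submission requirements versus an estimated
# proportion of non-replicable findings (e.g., derived from p-curve or z-curve analyses).
from scipy.stats import spearmanr

journals = {
    # journal: (number of submission requirements, estimated proportion of non-replicable findings)
    "Journal A": (6, 0.18),
    "Journal B": (4, 0.24),
    "Journal C": (2, 0.35),
    "Journal D": (0, 0.41),
}

n_requirements = [v[0] for v in journals.values()]
nonreplicable = [v[1] for v in journals.values()]

rho, p = spearmanr(n_requirements, nonreplicable)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")  # a negative rho: more requirements, fewer flagged findings
```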
As psychology provides an initial model for how to implement best-practice policies, other sciences could feel empowered to join this conversation and voice their concerns and needs. The intersection of political and life sciences presents an interesting challenge. Some journals within this purview have begun implementing submission requirements (e.g., Evolution and Human Behavior and Political Psychology). Conversely, Politics and the Life Sciences uses a version of these policies to encourage transparency and best practices but does not make them a requirement. This discrepancy could reflect a moving target in interdisciplinary sciences shaped by the constraints of each contributing field. Outlets with less explicit ties to psychology may have different criteria for reporting results, which could make implementing a standardized battery of requirements difficult. Journals in these areas could nonetheless begin comparing outlets with and without submission requirements. Even without reported p-values, many results remain amenable to such analyses (e.g., those reporting confidence intervals). For qualitative work, researchers and journals could collaboratively develop metrics to assess robustness appropriate to the methodology.
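For quantitative results that report confidence intervals rather than p-values, one hedged possibility is to recover approximate test statistics for use in z-curve-style analyses, as sketched below; the effect estimate and interval are hypothetical, and the symmetric, normal-theory interval is an assumption.

```python
# Hedged sketch: recovering an approximate z statistic from a symmetric 95% confidence interval
# so that results reported without p-values can still enter replicability analyses.
from scipy.stats import norm

def z_from_ci(estimate: float, lower: float, upper: float, level: float = 0.95) -> float:
    """Approximate z statistic from a symmetric, normal-theory confidence interval."""
    crit = norm.ppf(1 - (1 - level) / 2)   # e.g., 1.96 for a 95% interval
    se = (upper - lower) / (2 * crit)      # back out the standard error from the interval width
    return estimate / se

z = z_from_ci(estimate=0.40, lower=0.05, upper=0.75)   # hypothetical effect and interval
p_two_sided = 2 * (1 - norm.cdf(abs(z)))
print(f"z = {z:.2f}, two-sided p = {p_two_sided:.3f}")
```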
As outlets in the political and life sciences implement similar policies, it would be advantageous to foster collaborative discussions among those involved in the peer review process. Objective analyses of submission requirements could be complemented by system-level feedback from authors, reviewers, and editors. This feedback could inform policy about which requirements are helpful while identifying the burdens those requirements impose and how to address them. For authors, for example, some requirements could be prohibitive without special permission (e.g., proprietary data and participant privacy), which may increase systemic barriers for early-career researchers or those at smaller institutions (e.g., Beer et al., 2023; Begum Ali et al., 2023; McDermott, 2022; Mulligan et al., 2013; Rubenson, 2021). Nonetheless, such measures could prove popular with reviewers and editors. Based on estimates derived from the biomedical literature indexed in MEDLINE, 63.8 million hours were dedicated to peer review in 2015, and 20% of reviewers performed 60%–94% of reviews (Kovanis et al., 2016). This suggests that peer review can be burdensome. Requirements could allow editors to screen and vet submissions more easily before sending them out for review (i.e., desk rejections). Reviewers could provide more substantive reviews efficiently, without needing to parse ambiguous findings that may obfuscate detrimental behaviors.
The increasing need for transparency in science requires governing bodies to address researchers’ appetite for engaging in best practices. In addition to rewarding engagement, research could begin investigating how to increase participation and reduce the barriers to it. Such efforts may require funding from outlets in the political and life sciences, but governing bodies would benefit from putting their money where their mouths are, given how effective these practices appear to be at increasing empirical rigor in the behavioral sciences (Protzko et al., 2023). Table 2 provides a summary of the recommendations discussed in this letter.