
Consensus meetings will outperform integrative experiments

Published online by Cambridge University Press:  05 February 2024

Maximilian A. Primbs
Affiliation:
Behavioural Science Institute, Radboud University, Nijmegen, The Netherlands, [email protected], https://max-primbs.netlify.app/
Leonie A. Dudda
Affiliation:
Department of Otorhinolaryngology, Head and Neck Surgery, University Medical Center, Utrecht, The Netherlands, [email protected]; University Medical Center Utrecht Brain Center, University Medical Center Utrecht, Utrecht, The Netherlands
Pia K. Andresen
Affiliation:
Department of Methodology & Statistics, Utrecht University, Utrecht, The Netherlands [email protected]
Erin M. Buchanan
Affiliation:
Harrisburg University of Science and Technology, Harrisburg, PA, USA [email protected], https://www.aggieerin.com/
Hannah K. Peetz
Affiliation:
Behavioural Science Institute, Radboud University, Nijmegen, The Netherlands, [email protected]
Miguel Silan
Affiliation:
Annecy Behavioral Science Lab, Menthon Saint Bernard, France, [email protected]; Développement, individu, processus, handicap, éducation (DIPHE), Université Lumière Lyon 2, Bron Cedex, France
Daniël Lakens*
Affiliation:
Human–Technology Interaction Group, Eindhoven University of Technology, Eindhoven, The Netherlands [email protected], https://sites.google.com/site/lakens2
*Corresponding author.

Abstract

We expect that consensus meetings, where researchers come together to discuss their theoretical viewpoints, prioritize the factors they agree are important to study, standardize their measures, and determine a smallest effect size of interest, will prove to be a more efficient solution to the lack of coordination and integration of claims in science than integrative experiments.

Type
Open Peer Commentary
Copyright
Copyright © The Author(s), 2024. Published by Cambridge University Press

Lack of coordination limits both the accumulation and integration of claims and the efficient falsification of theories. How should the field deal with this problem? We expect that consensus meetings (Fink, Kosecoff, Chassin, & Brook, 1984), where researchers come together to discuss their theoretical viewpoints, prioritize the factors they all agree are important to study, standardize their measures, and determine a smallest effect size of interest, will prove to be a more efficient solution to the lack of coordination and integration of claims in science than integrative experiments. We provide four reasons.

First, design spaces are simply an extension of the principles of multiverse analysis (Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016) to theory-building. Researchers have recognized that any specified multiverse is just one of many possible multiverses (Primbs et al., 2022). The same is true for design spaces. People from different backgrounds and fields are aware of different literatures and might therefore construct different design spaces. In practice, then, a design space does not include all factors that members of a scientific community deem relevant; it merely includes one possible subset of these factors. While any single design space can lead to findings that can be used to generate new hypotheses, it is not sufficient to integrate existing hypotheses. Designing experiments that inform the integration of disparate findings requires that members of the community agree that the design space contains all factors relevant to corroborating or falsifying their predictions. If any such factor is missing, members of the scientific community can more easily dismiss the conclusions of an integrative experiment for lacking a crucial moderator or including a damning confound. Committing a priori to the implications of the outcome, for example in a consensus meeting, makes it more difficult to dismiss the conclusions.
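
To make this point concrete, the sketch below enumerates two hypothetical design spaces; the factor names and levels are our own illustrative assumptions, not taken from the target article. Two groups studying the same question but drawing on different literatures end up with different, only partially overlapping design spaces.

```python
# A minimal, hypothetical sketch: every concrete design space is one possible
# subset of the factors a community deems relevant. Factor names and levels
# below are invented for illustration only.
from itertools import product

group_a_factors = {              # factors group A considers relevant (hypothetical)
    "stimulus_type": ["words", "images"],
    "incentive": ["none", "paid"],
    "sample": ["students", "online_panel"],
}
group_b_factors = {              # a different group adds a moderator and drops a factor
    "stimulus_type": ["words", "images"],
    "incentive": ["none", "paid"],
    "time_pressure": ["low", "high"],
}

def enumerate_designs(factors):
    """Return every combination of factor levels (the full design space)."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]

designs_a = enumerate_designs(group_a_factors)
designs_b = enumerate_designs(group_b_factors)
shared = sorted(set(group_a_factors) & set(group_b_factors))
print(len(designs_a), "designs for group A,", len(designs_b), "for group B;",
      "shared factors:", shared)
```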

We believe that consensus meetings will be required to guarantee that people from different backgrounds, fields, and convictions are involved in the creation and approval of the design space. During these consensus meetings, researchers will need to commit in advance to the consequences that the results of an integrative experiment will have for their hypotheses. Examples in the psychological literature show how initial versions of such consensus-based tests of predictions can efficiently falsify predictions (Vohs et al., 2021) and exclude competing hypotheses (Coles et al., 2022). Furthermore, because study-design decisions always predetermine the types of effects that can be identified in the design space, varying operationalizations may result in multiple versions of a study outcome that are not directly comparable. To reduce the risk of a “methodological imperative” (Danziger, 1990), we need consensus among experts on the theory and on the construct validity of the variables being tested.

Second, many of the observed effects in a partial design space will be either too small to be theoretically interesting or too small to be practically important. Determining when effect sizes are too small to be theoretically or practically interesting can be challenging, yet it is essential for falsifying predictions and for demonstrating the absence of differences between experiments (Primbs et al., 2023). Because of the combination of “crud” (Orben & Lakens, 2020) and large sample sizes, very small effects can be statistically significant in integrative experiments. Without specifying a smallest effect size of interest, the scientific literature will be polluted with a multitude of irrelevant and unfalsifiable claims. For integrative experiments, which require a large investment of time and money, discussions about which effects are large enough to matter should happen before data are collected. Many fields that have specified smallest effect sizes of interest have used consensus meetings to discuss this important topic.
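
The following minimal simulation (our illustration; the sample size, true effect, and smallest effect size of interest are assumed values) shows why a smallest effect size of interest matters: with a very large sample, a crud-sized effect is statistically significant, yet an equivalence test against an assumed smallest effect of d = 0.10 shows it is too small to be of interest.

```python
# A minimal sketch (simulated data, assumed SESOI) of why a smallest effect
# size of interest is needed when samples are very large.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200_000                      # very large per-group sample, integrative-experiment scale
d_true = 0.02                    # tiny, theoretically uninteresting true effect
x = rng.normal(0.0, 1.0, n)
y = rng.normal(d_true, 1.0, n)

# Standard null-hypothesis test: significant despite the trivial effect size.
t, p = stats.ttest_ind(y, x)
print(f"NHST: t = {t:.2f}, p = {p:.2g}")

# Equivalence test (two one-sided tests) against an assumed SESOI in raw units.
sesoi = 0.10                     # hypothetical smallest effect size of interest
se = np.sqrt(np.var(x, ddof=1) / n + np.var(y, ddof=1) / n)
diff = y.mean() - x.mean()
df = 2 * n - 2
t_upper = (diff - sesoi) / se    # tests H0: diff >= +SESOI
t_lower = (diff + sesoi) / se    # tests H0: diff <= -SESOI
p_tost = max(stats.t.cdf(t_upper, df), stats.t.sf(t_lower, df))
print(f"TOST: p = {p_tost:.2g} -> the effect is significantly smaller than the SESOI")
```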

Third, it is important to note that, due to the large number of comparisons made in integrative experiments, some significant differences might not be due to crud (i.e., true effects caused by uninteresting mechanisms) but due to false positives. Strictly controlling the Type I error rate when comparing many variations of studies will lower the statistical power of the tests as the number of comparisons increases. Not controlling for multiple comparisons will require follow-up replication studies before claims can be made. Such is the cost of a fishing expedition. Consensus meetings, one goal of which is to reach collective agreement on which research questions should be prioritized while coordinating measures and manipulations across studies, might end up being more efficient.
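
The sketch below (our illustration, with assumed effect size, sample size, and numbers of comparisons) quantifies this trade-off: without correction, the family-wise error rate quickly approaches 1, whereas a Bonferroni correction holds the error rate but erodes the power of each individual test.

```python
# A minimal sketch (assumed values) of the multiple-comparisons trade-off:
# uncorrected testing inflates the family-wise error rate (FWER), while
# Bonferroni correction shrinks per-test power as the number of comparisons grows.
import numpy as np
from scipy import stats

alpha = 0.05
d = 0.2            # assumed true standardized effect for the tests that are real
n = 100            # assumed per-group sample size for each comparison

def power_two_sample(d, n, alpha):
    """Approximate power of a two-sided two-sample z-test."""
    se = np.sqrt(2 / n)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.sf(z_crit - d / se) + stats.norm.cdf(-z_crit - d / se)

for m in (1, 10, 100, 1000):
    fwer = 1 - (1 - alpha) ** m                 # uncorrected FWER for m independent tests
    power = power_two_sample(d, n, alpha / m)   # Bonferroni-corrected per-test power
    print(f"m = {m:4d}: uncorrected FWER = {fwer:.2f}, corrected power = {power:.2f}")
```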

Fourth, identifying variation in effect sizes across a range of combinatorial factors is not sufficient to explain this variation. To make generalizable claims and to distinguish hypothesized effects from confounding variables, one must understand how design choices affect effect sizes. Here, we consider machine-learning (ML) approaches a toothless tiger. Because these models exploit all kinds of stochastic dependencies in the data, ML models are excellent at identifying predictors in nonexplanatory, predictive research (Hamaker, Mulder, & Van IJzendoorn, 2020; Shmueli, 2010). If there is a true causal model explaining the influence of a set of design choices and variables on a study outcome, the algorithm will find all relations, even those due to confounding, collider bias, or crud (Pearl, 1995). Algorithms identify predictors only relative to the variable set (the design space), so even “interpretable, mechanistic” (target article, sect. 3.3.1, para. 3) ML models cannot simply grant indulgence in causal reasoning. Achieving causal understanding with ML tools (e.g., causal discovery algorithms) requires researchers to make strong assumptions and to engage in a priori theorizing about causal dependencies (Glymour, Zhang, & Spirtes, 2019). Here again, we believe it would be more efficient to debate such considerations in consensus meetings.
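
As a minimal illustration of this point (a simulated toy example of confounding, not an analysis from the target article), the sketch below shows that a variable can strongly predict an outcome even when it has no causal effect on it, and that the association disappears once the confounder is taken into account.

```python
# A minimal sketch (simulated data, our assumptions): a purely predictive view
# flags x as a "predictor" of y even though the association is produced entirely
# by an unobserved confounder z, so identified predictors do not imply causes.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
z = rng.normal(size=n)                 # confounder (e.g., a hypothetical lab-level factor)
x = 0.8 * z + rng.normal(size=n)       # design choice influenced by the confounder
y = 0.8 * z + rng.normal(size=n)       # outcome influenced by the confounder, NOT by x

# Predictive view: x clearly "predicts" y ...
r_marginal = np.corrcoef(x, y)[0, 1]

# ... but the association vanishes after adjusting for the confounder
# (partial correlation via residuals from simple linear fits on z).
rx = x - np.polyval(np.polyfit(z, x, 1), z)
ry = y - np.polyval(np.polyfit(z, y, 1), z)
r_partial = np.corrcoef(rx, ry)[0, 1]

print(f"corr(x, y)              = {r_marginal:.2f}  (looks like an important predictor)")
print(f"corr(x, y | confounder) = {r_partial:.2f}  (no causal effect of x on y)")
```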

We believe integrative experiments may be useful when data collection is cheap and the goal is to develop detailed models that predict variation in real-world factors. Such models are most useful when they aim to explain variation in naturally occurring combinations of factors (as effect sizes for combinations of experimental manipulations could quickly become nonsensical). For all other research questions where a lack of coordination causes inefficiencies, we hope researchers studying the same topic will come together in consensus meetings to coordinate their research.

Competing interest

None.

References

Coles, N. A., March, D. S., Marmolejo-Ramos, F., Larsen, J. T., Arinze, N. C., Ndukaihe, I. L. G., … Liuzza, M. T. (2022). A multi-lab test of the facial feedback hypothesis by the Many Smiles Collaboration. Nature Human Behaviour, 6(12), 1731–1742. https://doi.org/10.1038/s41562-022-01458-9
Danziger, K. (1990). Constructing the subject: Historical origins of psychological research. Cambridge University Press. https://doi.org/10.1017/CBO9780511524059
Fink, A., Kosecoff, J., Chassin, M., & Brook, R. H. (1984). Consensus methods: Characteristics and guidelines for use. American Journal of Public Health, 74(9), 979–983. https://doi.org/10.2105/AJPH.74.9.979
Glymour, C., Zhang, K., & Spirtes, P. (2019). Review of causal discovery methods based on graphical models. Frontiers in Genetics, 10, 524. https://doi.org/10.3389/fgene.2019.00524
Hamaker, E. L., Mulder, J. D., & Van IJzendoorn, M. H. (2020). Description, prediction and causation: Methodological challenges of studying child and adolescent development. Developmental Cognitive Neuroscience, 46, 100867. https://doi.org/10.1016/j.dcn.2020.100867
Orben, A., & Lakens, D. (2020). Crud (re)defined. Advances in Methods and Practices in Psychological Science, 3(2), 238–247. https://doi.org/10.1177/2515245920917961
Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4), 669–688. https://doi.org/10.2307/2337329
Primbs, M. A., Pennington, C. R., Lakens, D., Silan, M. A. A., Lieck, D. S. N., Forscher, P. S., … Westwood, S. J. (2023). Are small effects the indispensable foundation for a cumulative psychological science? A reply to Götz et al. (2022). Perspectives on Psychological Science, 18(2), 508–512. https://doi.org/10.1177/17456916221100420
Primbs, M. A., Rinck, M., Holland, R., Knol, W., Nies, A., & Bijlstra, G. (2022). The effect of face masks on the stereotype effect in emotion perception. Journal of Experimental Social Psychology, 103, Article 104394. https://doi.org/10.1016/j.jesp.2022.104394
Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-sts330
Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5), 702–712. https://doi.org/10.1177/1745691616658637
Vohs, K. D., Schmeichel, B. J., Lohmann, S., Gronau, Q. F., Finley, A. J., Ainsworth, S. E., … Albarracín, D. (2021). A multisite preregistered paradigmatic test of the ego-depletion effect. Psychological Science, 32(10), 1566–1581. https://doi.org/10.1177/0956797621989733