We welcome this thoughtful and creative set of ideas for improving experimentation in the social sciences. We offer several points for discussion that might further clarify and strengthen the authors’ arguments.
First, how should the design space be constructed? The authors suggest that the design space from which researchers sample various aspects of the phenomena of interest can be constructed mostly by reviewing the past literature. However, past studies are often a biased sample of the phenomena of interest, shaped by the implicit or explicit theories their authors held at the time, by methodological limitations, or by adherence to a particular experimental paradigm.
An example from the judgment and decision-making literature is the phenomenon of overconfidence. The assumption that an experimenter can choose “good general knowledge items” led to results suggesting that people almost always show overconfidence. But using the Brunswikian ideas of representative design, later studies (Gigerenzer, Hoffrage, & Kleinbölting, 1991; Juslin, 1994) showed that the items that had been previously selected were not representative of the whole population of items people experience in the real world. By randomly sampling from the whole population of items, which approximates representative design, studies showed that the overconfidence effect is not as general as previously thought (Juslin, Olsson, & Björkman, 1997; Juslin, Winman, & Olsson, 2000).
Another example is research on risky choices, where participants have traditionally been presented with summary descriptions of different options. Later research has shown that risky choices can be very different when people sample from the options themselves rather than relying on a description (Hertwig, Barron, Weber, & Erev, 2004; Lejarraga & Hertwig, 2021; Wulff, Mergenthaler-Canseco, & Hertwig, 2018). Relying solely on prior psychological studies to construct a design space for risky choice would not have uncovered these insights.
Of course, new dimensions can always be added to the design space as they are discovered by new research, but this poses a practical problem: the number of experiments that could potentially be conducted grows rapidly. We therefore propose two ideas for a more exhaustive construction of the design space. One is to sample the phenomenon of interest directly. For example, Brunswik would sample participants’ behavior at random intervals over several weeks, recording the behavior of interest as it occurred in participants’ natural environments (Brunswik, 1944). With today's technological developments, such experience-based sampling has become easier to do and might be a way toward a more exhaustive grasp of the phenomenon of interest.
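To make the experience-sampling idea concrete, below is a minimal sketch of how such a random-interval prompt schedule might be generated. The study length, daily window, number of prompts per day, and the function name `sampling_schedule` are illustrative assumptions, not part of the proposal itself.

```python
# A minimal sketch of a Brunswikian experience-sampling schedule: prompts at
# random times within a daily window, over several weeks. All parameters are
# illustrative assumptions.
import random
from datetime import date, datetime, time, timedelta

def sampling_schedule(start: date, weeks: int = 4, prompts_per_day: int = 5,
                      day_start: time = time(9, 0), day_end: time = time(21, 0),
                      seed: int = 42) -> list[datetime]:
    rng = random.Random(seed)
    # Length of the daily sampling window in seconds.
    window = (datetime.combine(start, day_end)
              - datetime.combine(start, day_start)).total_seconds()
    prompts = []
    for day in range(weeks * 7):
        day_date = start + timedelta(days=day)
        offsets = sorted(rng.uniform(0, window) for _ in range(prompts_per_day))
        prompts.extend(datetime.combine(day_date, day_start)
                       + timedelta(seconds=off) for off in offsets)
    return prompts

if __name__ == "__main__":
    for prompt in sampling_schedule(date(2023, 1, 9))[:5]:
        print(prompt.strftime("%Y-%m-%d %H:%M"))
```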
The other way to improve the construction of the design space is to construct it collectively across many labs, in particular labs situated in different disciplines. For example, decades of research in social psychology suggest many different biases in human social cognition, which are often contradictory (Krueger & Funder, 2004). A tighter integration of psychology and network science has made it possible to recognize that some of these biases in fact reflect well-adapted cognition in specific social network structures (Dawes, 1989; Galesic, Olsson, & Rieskamp, 2018; Lee et al., 2019; Lerman, Yan, & Wu, 2016).
Second, how should one deal with the adaptive nature of complex social systems? As the authors point out, social and behavioral phenomena are typically caused by many interacting factors that can be hard to pin down. An additional, often overlooked property of these social-cognitive systems is that they are adaptive: They change over time in response to internal and external factors. As a consequence, even the most detailed static picture of these systems would not provide a full understanding of the underlying dynamics. This is of course a problem for both one-shot and integrative experiments, and it can be addressed by conducting longitudinal studies of these systems, coupled with theoretical development. For integrative experiments, however, it introduces the additional complication and cost of longitudinal studies, which further multiplies the already large number of dimensions in the design space.
This explosion of potentially important dimensions in integrative experiment design could be tamed by assigning a stronger role to theory and modeling. The article focuses mostly on their role in interpreting the results of samples taken from an already constructed design space. However, theory and computational models seem essential already in constructing the design space itself. In particular, an integrative theoretical framework developed through the collective, strongly interdisciplinary effort mentioned above could be a useful starting point for the initial design space. Such a collective effort could also help recognize parts of the space that are implausible and would hardly be expected to occur in the real world. Computational modeling could then be used to narrow down the space further by investigating which dimensions could have a meaningful influence on the results. Such models could show that some apparently important dimensions have only a marginal influence on system performance. Recognizing this could significantly narrow the otherwise vast space of possible experiments that could be run.
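As an illustration of how modeling could screen dimensions before any experiments are run, the sketch below simulates a purely hypothetical design space with four invented dimensions and uses a coarse variance-based proxy (the squared correlation between each dimension and a toy model's outcome) to flag dimensions with only marginal influence. The model, dimension names, and effect sizes are assumptions for illustration only; in practice one would substitute a theory-derived simulation and a more principled sensitivity analysis.

```python
# A minimal sketch of screening design-space dimensions with a computational
# model before committing to experiments. The toy model and its dimensions are
# hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)
dimensions = ["group_size", "incentive", "time_pressure", "anonymity"]

def toy_model(x):
    # Hypothetical simulated outcome: driven mainly by the first two
    # dimensions, with only a marginal contribution of the third and noise.
    return 2.0 * x[:, 0] + 1.0 * x[:, 1] ** 2 + 0.05 * x[:, 2] + rng.normal(0, 0.1, len(x))

X = rng.uniform(0, 1, size=(5000, len(dimensions)))  # random points in the design space
y = toy_model(X)

for j, name in enumerate(dimensions):
    share = np.corrcoef(X[:, j], y)[0, 1] ** 2  # rough first-order influence
    print(f"{name:>13}: approx. variance explained = {share:.2f}")
# Dimensions with negligible influence could be dropped or sampled more sparsely.
```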
Third, what does it mean when results of experiments at particular points in the design space fail to generalize to other points? The authors suggest that this might point to an important missing dimension or even a fundamental limit on the explanation of a particular phenomenon. It is, however, also possible that the reason is more prosaic, merely reflecting inevitable random measurement error. This suggests that integrative design experiments, just like one-at-a-time experiments, should be replicated. Replication would allow researchers to approximate confidence intervals around each of the samples from the design space and to recognize which apparent differences between points can be expected by chance. Moreover, it is likely that, beyond random error, experiments conducted by any single lab will have some systematic biases stemming from lab-specific practices that can be hard to recognize without explicitly comparing labs. Different data analysts are also likely to reach different conclusions even from exactly the same data, so different labs conducting experiments from the same design space could reach different conclusions (Breznau et al., 2022). To the extent that integrative design experiments require resources that limit them to a few larger labs, these biases could go unnoticed.
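As a rough illustration of this point, the sketch below simulates replicated outcomes at three hypothetical design points and approximates a bootstrap confidence interval around each; overlapping intervals flag differences that could plausibly be due to chance. The point labels, sample sizes, and effect sizes are invented for the example.

```python
# A minimal sketch of using replications at sampled design points to
# approximate confidence intervals. Data are simulated; all numbers are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical replicated outcomes at three points in the design space:
# points A and B share the same true effect; C differs.
points = {"A": rng.normal(0.50, 0.15, 20),
          "B": rng.normal(0.50, 0.15, 20),
          "C": rng.normal(0.65, 0.15, 20)}

def bootstrap_ci(x, n_boot=10_000, alpha=0.05):
    means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                      for _ in range(n_boot)])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

for label, x in points.items():
    lo, hi = bootstrap_ci(x)
    print(f"point {label}: mean = {x.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
# Overlapping intervals (A vs. B) flag differences that may reflect chance;
# clearly separated intervals (A vs. C) flag differences worth pursuing.
```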
Competing interest
None.