
Beyond playing 20 questions with nature: Integrative experiment design in the social and behavioral sciences

Published online by Cambridge University Press:  21 December 2022

Abdullah Almaatouq*
Affiliation:
Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA, USA [email protected]
Thomas L. Griffiths
Affiliation:
Departments of Psychology and Computer Science, Princeton University, Princeton, NJ, USA [email protected]
Jordan W. Suchow
Affiliation:
School of Business, Stevens Institute of Technology, Hoboken, NJ, USA [email protected]
Mark E. Whiting
Affiliation:
School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA [email protected]
James Evans
Affiliation:
Department of Sociology, University of Chicago, Chicago, IL, USA [email protected] Santa Fe Institute, Santa Fe, NM, USA
Duncan J. Watts
Affiliation:
Department of Computer and Information Science, Annenberg School of Communication, and Operations, Information, and Decisions Department, University of Pennsylvania, Philadelphia, PA, USA [email protected]
* Corresponding author: Abdullah Almaatouq; Email: [email protected]

Abstract

The dominant paradigm of experiments in the social and behavioral sciences views an experiment as a test of a theory, where the theory is assumed to generalize beyond the experiment's specific conditions. According to this view, which Allen Newell once characterized as “playing twenty questions with nature,” theory is advanced one experiment at a time, and the integration of disparate findings is assumed to happen via the scientific publishing process. In this article, we argue that the process of integration is at best inefficient, and at worst it does not, in fact, occur. We further show that the challenge of integration cannot be adequately addressed by recently proposed reforms that focus on the reliability and replicability of individual findings, nor simply by conducting more or larger experiments. Rather, the problem arises from the imprecise nature of social and behavioral theories and, consequently, a lack of commensurability across experiments conducted under different conditions. Therefore, researchers must fundamentally rethink how they design experiments and how the experiments relate to theory. We specifically describe an alternative framework, integrative experiment design, which intrinsically promotes commensurability and continuous integration of knowledge. In this paradigm, researchers explicitly map the design space of possible experiments associated with a given research question, embracing many potentially relevant theories rather than focusing on just one. Researchers then iteratively generate theories and test them with experiments explicitly sampled from the design space, allowing results to be integrated across experiments. Given recent methodological and technological developments, we conclude that this approach is feasible and would generate more-reliable, more-cumulative empirical and theoretical knowledge than the current paradigm – and with far greater efficiency.

Type: Target Article
Copyright: © The Author(s), 2022. Published by Cambridge University Press

1. Introduction

You can't play 20 questions with Nature and win. (Newell, Reference Newell1973)

Fifty years ago, Allen Newell summed up the state of contemporary experimental psychology as follows: “Science advances by playing twenty questions with nature. The proper tactic is to frame a general question, hopefully binary, that can be attacked experimentally. Having settled that bits-worth, one can proceed to the next … Unfortunately, the questions never seem to be really answered, the strategy does not seem to work” (italics added for emphasis).

The problem, Newell noted, was a lack of coherence among experimental findings. “We never seem in the experimental literature to put the results of all the experiments together,” he wrote, “Innumerable aspects of the situations are permitted to be suppressed. Thus, no way exists of knowing whether the earlier studies are in fact commensurate with whatever ones are under present scrutiny, or are in fact contradictory.” Referring to a collection of papers by prominent experimentalists, Newell concluded that although it was “exceedingly clear that each paper made a contribution … I couldn't convince myself that it would add up, even in thirty more years of trying, even if one had another 300 papers of similar, excellent ilk.”

More than 20 years after Newell's imagined future date, his outlook seems, if anything, optimistic. To illustrate the problem, consider the phenomenon of group “synergy,” defined as the performance of an interacting group exceeding that of an equivalently sized “nominal group” of individuals working independently (Hill, Reference Hill1982; Larson, Reference Larson2013). A century of experimental research in social psychology, organizational psychology, and organizational behavior has tested the performance implications of working in groups relative to working individually (Allen & Hecht, Reference Allen and Hecht2004; Richard Hackman & Morris, Reference Richard Hackman, Morris and Berkowitz1975; Husband, Reference Husband1940; Schulz-Hardt & Mojzisch, Reference Schulz-Hardt and Mojzisch2012; Tasca, Reference Tasca2021; Watson, Reference Watson1928), but substantial contributions can also be found in cognitive science, communications, sociology, education, computer science, and complexity science (Allport, Reference Allport1924; Arrow, McGrath, & Berdahl, Reference Arrow, McGrath and Berdahl2000; Barron, Reference Barron2003; Devine, Clayton, Dunford, Seying, & Pryce, Reference Devine, Clayton, Dunford, Seying and Pryce2001). In spite of this attention across time and disciplines – or maybe because of it – this body of research often reaches inconsistent or conflicting conclusions. For example, some studies find that interacting groups outperform individuals because they are able to distribute effort (Laughlin, Bonner, & Miner, Reference Laughlin, Bonner and Miner2002), share information about high-quality solutions (Mason & Watts, Reference Mason and Watts2012), or correct errors (Mao, Mason, Suri, & Watts, Reference Mao, Mason, Suri and Watts2016), whereas other studies find that “process losses” – including social loafing (Harkins, Reference Harkins1987; Karau & Williams, Reference Karau and Williams1993), groupthink (Janis, Reference Janis1972), and interpersonal conflict (Steiner, Reference Steiner1972) – cause groups to underperform their members.

As we will argue, the problem is not that researchers lack theoretically informed hypotheses about the causes and predictors of group synergy; to the contrary, the literature contains dozens, or possibly even hundreds, of such hypotheses. Rather, the problem is that because each of these experiments was designed with the goal of testing a hypothesis but, critically, not with the goal of explicitly comparing the results with other experiments of the same general class, researchers in this space have no way to articulate how similar or different their experiment is from anyone else's. As a result, it is impossible to determine – via systematic review, meta-analysis, or any other ex-post method of synthesis – how all of the potentially relevant factors jointly determine group synergy or how their relative importance and interactions change over contexts and populations.

Nor is group synergy the only topic in the social and behavioral sciences for which one can find a proliferation of irreconcilable theories and empirical results. For any substantive area of the social and behavioral sciences on which we have undertaken a significant amount of reading, we see hundreds of experiments, each of which tests the effects of some independent variables on other dependent variables while suppressing innumerable “aspects of the situation.”Footnote 1 Setting aside the much-discussed problems of replicability and reproducibility, many of these papers are interesting when read in isolation, but it is no more possible to “put them all together” today than it was in Newell's time (Almaatouq, Reference Almaatouq2019; Muthukrishna & Henrich, Reference Muthukrishna and Henrich2019; Watts, Reference Watts2017).

Naturally, our subjective experience of reading across several domains of interest does not constitute proof that successful integration of many independently designed and conducted experiments cannot occur in principle, or even that it has not occurred in practice. Indeed it is possible to think of isolated examples, such as mechanism design applied to auctions (Myerson, Reference Myerson1981; Vickrey, Reference Vickrey1961) and matching markets (Aumann & Hart, Reference Aumann and Hart1992; Gale & Shapley, Reference Gale and Shapley1962), in which theory and experiment appear to have accumulated into a reasonably self-consistent, empirically validated, and practically useful body of knowledge. We believe, however, that these examples represent rare exceptions and that examples such as group synergy are far more typical.

We propose two explanations for why not much has changed since Newell's time. The first is that not everyone agrees with the premise of Newell's critique – that “putting things together” is a pressing concern for the scientific enterprise. In effect, this view holds that the approach Newell critiqued (and that remains predominant in the social and behavioral sciences) is sufficient for accumulating knowledge. Such accumulation manifests itself indirectly through the scientific publishing process, with each new paper building upon earlier work, and directly through literature reviews and meta-analyses. The second explanation for the lack of change since Newell's time is that even if one accepts Newell's premise, neither Newell nor anyone else has proposed a workable alternative; hence, the current paradigm persists by default in spite of its flaws.Footnote 2

In the remainder of this paper, we offer our responses to the two explanations just proposed. Section 2 addresses the first explanation, describing what we call the “one-at-a-time” paradigm and arguing that it is poorly suited to the purpose of integrating knowledge over many studies in large part because it was not designed for that purpose. We also argue that existing mechanisms for integrating knowledge, such as systematic reviews and meta-analyses, are insufficient on the grounds that they, in effect, assume commensurability. If the studies that these methods are attempting to integrate cannot be compared with one another, because they were not designed to be commensurable, then there is little that ex-post methods can do.Footnote 3 Rather, an alternative approach to designing experiments and evaluating theories is needed. Section 3 addresses the second explanation by describing such an alternative, which we call the “integrative” approach and which is explicitly designed to integrate knowledge about a particular problem domain. Although integrative experiments of the sort we describe may not have been possible in Newell's day, we argue that they can now be productively pursued in parts of the social and behavioral sciences thanks to increasing theoretical maturity and methodological developments. Section 4 then illustrates the potential of the integrative approach by describing three experiments that take first steps in its direction. Finally, section 5 outlines questions and concerns we have encountered and offers our responses.

2. The “one-at-a-time” paradigm

In the simplest version of what we call the “one-at-a-time” approach to experimentation, a researcher poses a question about the relation between one independent and one dependent variable and then offers a theory-motivated hypothesis that the relation is positive or negative. Next, the researcher devises an experiment to test this hypothesis by introducing variability in the independent variable, aiming to reject the “null hypothesis” that the proposed dependency does not exist on the basis of the evidence, quantified by a p-value. If the null hypothesis is successfully rejected, the researcher concludes that the experiment corroborates the theory and then elaborates on potential implications, both for other experiments and for phenomena outside the lab.
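To make this workflow concrete, the following minimal Python sketch simulates a single hypothetical one-at-a-time experiment; the manipulation, sample sizes, and effect size are invented for illustration and correspond to no particular study.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical one-at-a-time experiment: does a single manipulation
# (treatment vs. control) shift a single outcome variable?
control = rng.normal(loc=0.0, scale=1.0, size=100)    # outcome under the null
treatment = rng.normal(loc=0.3, scale=1.0, size=100)  # outcome with an assumed effect

# Test the null hypothesis of no difference between conditions.
t_stat, p_value = ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A p-value below the conventional threshold is taken as corroborating the
# theory; every other design choice is held fixed and typically undocumented.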

In practice, one-at-a-time experiments can be considerably more complex. The researcher may articulate hypotheses about more than one independent variable, more than one dependent variable, or both. The test itself may focus on effect sizes or confidence intervals rather than statistical significance, or it may compare two or more competing hypotheses. Alternatively, both the hypothesis and the test may be qualitative in nature. Regardless, each experiment tests at most a small number of theoretically informed hypotheses in isolation by varying at most a small number of parameters. By design, all other factors are held constant. For example, a study of the effect of reward or punishment on levels of cooperation typically focuses on the manipulation of theoretical interest (e.g., introducing a punishment stage between contribution rounds in a repeated game) while holding fixed other parameters, such as the numerical values of the payoffs or the game's length (Fehr & Gachter, Reference Fehr and Gachter2000). Similarly, a study of the effect of network structure on group performance typically focuses on some manipulation of the underlying network while holding fixed the group size or the time allotted to perform the task (Almaatouq et al., Reference Almaatouq, Noriega-Campero, Alotaibi, Krafft, Moussaid and Pentland2020; Becker, Brackbill, & Centola, Reference Becker, Brackbill and Centola2017).

2.1. The problem with the one-at-a-time paradigm

As Newell himself noted, this approach to experimentation seems reasonable. After all, the sequence of question → theory → hypothesis → experiment → analysis → revision to theory → repeat appears to be almost interchangeable with the scientific method itself. Nonetheless, the one-at-a-time paradigm rests on an important but rarely articulated assumption: That because the researcher's purpose in designing an experiment is to test a theory of interest, the only constructs of interest are those that the theory itself explicitly articulates as relevant. Conversely, where the theory is silent, the corresponding parameters are deemed to be irrelevant. According to this logic, articulating a precise theory leads naturally to a well-specified experiment with only one, or at most a few, constructs in need of consideration. Correspondingly, theory can aid the interpretation of the experiment's results – and can be generalized to other cases (Mook, Reference Mook1983; Zelditch, Reference Zelditch1969).

Unfortunately, while such an assumption may be reasonable in fields such as physics, it is rarely justified in the social and behavioral sciences (Debrouwere & Rosseel, Reference Debrouwere and Rosseel2022; Meehl, Reference Meehl1967). Social and behavioral phenomena exhibit higher “causal density” (or what Meehl called the “crud factor”) than physical phenomena, such that the number of potential causes of variation in any outcome is much larger than in physics and the interactions among these causes are often consequential (Manzi, Reference Manzi2012; Meehl, Reference Meehl1990b). In other words, the human world is vastly more complex than the physical one, and researchers should be neither surprised nor embarrassed that their theories about it are correspondingly less precise and predictive (Watts, Reference Watts2011). The result is that theories in the social and behavioral sciences are rarely articulated with enough precision or supported by enough evidence for researchers to be sure which parameters are relevant and which can be safely ignored (Berkman & Wilson, Reference Berkman and Wilson2021; Meehl, Reference Meehl1990b; Turner & Smaldino, Reference Turner and Smaldino2022; Yarkoni, Reference Yarkoni2022). Researchers working independently in the same domain of inquiry will therefore invariably make design choices (e.g., parameter settings, subject pools) differently (Breznau et al., Reference Breznau, Rinke, Wuttke, Nguyen, Adem, Adriaans and Żółtak2022; Gelman & Loken, Reference Gelman and Loken2014). Moreover, because the one-at-a-time paradigm is premised on the (typically unstated) assumption that theories dictate the design of experiments, the process of making design decisions about constructs that are not specified under the theory being tested is often arbitrary, vague, undocumented, or (as Newell puts it) “suppressed.”

2.2. The universe of possible experiments

To express the problem more precisely, it is useful to think of a one-at-a-time experiment as a sample from an implicit universe of possible experiments in a domain of inquiry. Before proceeding, we emphasize that neither the sample nor the universe is typically acknowledged in the one-at-a-time paradigm. Indeed, it is precisely the transition from implicit to explicit construction of the sampling universe that forms the basis of the solution we describe in the next section.

In imagining such a universe, it is useful to distinguish the independent variables needed to define the effect of interest – the experimental manipulation – from the experiment's context. We define this context as the set of independent variables that are hypothesized to moderate the effect in question as well as the nuisance parameters (which, strictly speaking, are also independent variables) over which the effect is expected to generalize and that correspond to the design choices the researcher makes about the specific experiment that will be conducted. For example, an experiment comparing the performance of teams to that of individuals not only will randomize participants into a set of experimental conditions (e.g., individuals vs. teams of varying sizes), but will also reflect decisions about other contextual features, including, for example, the specific tasks on which to compare performance, where each task could then be parameterized along multiple dimensions (Almaatouq, Alsobay, Yin, & Watts, Reference Almaatouq, Alsobay, Yin and Watts2021a; Larson, Reference Larson2013). Other contextual choices include the incentives provided to participants, time allotted to perform the task, modality of response, and so on. Similarly, we define the population of the experiment as a set of measurable attributes that characterize the sample of participants (e.g., undergraduate women in the United States aged 18–23 with a certain distribution of cognitive reflection test scores). Putting all these choices together, we can now define an abstract space of possible experiments, the dimensions of which are the union of the context and population. We call this space the design space on the grounds that every conceivable design of the experiment is describable by some choice of parameters that maps to a unique point in the space.Footnote 4 (Although this is an abstract way of defining what we mean by the experiment design space, we will suggest concrete and practical ways of defining it later in the article.)
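As a rough illustration of this abstraction (not a proposal for a specific data standard), a design point can be represented in code as a combination of contextual and population coordinates; all dimension names and values below are hypothetical.

from dataclasses import dataclass

# Minimal sketch: an experiment is a point in a design space whose dimensions
# are the union of contextual variables and population attributes.

@dataclass
class ExperimentDesign:
    context: dict      # e.g., task parameters, incentives, time limit
    population: dict   # e.g., age range, country, subject-pool attributes

    def coordinates(self):
        """Return the design as a flat, ordered coordinate vector."""
        dims = {**self.context, **self.population}
        return tuple(sorted(dims.items()))

design = ExperimentDesign(
    context={"task": "maze", "group_size": 4, "time_limit_min": 10, "incentive_usd": 2.0},
    population={"country": "US", "age_range": "18-23", "student_sample": True},
)
print(design.coordinates())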

Figure 1 shows a simplified rendering of a design space and illustrates several important properties of the one-at-a-time paradigm. Figure 1A shows a single experiment conducted in a particular context with a particular sample population. The color of the point represents the “result” of the experiment: The effect of one or more independent variables on some dependent variable. In the absence of a theory, nothing can be concluded from the experiment alone, other than that the observed result holds for one particular sample of participants under one particular context. From this observation, the appeal of strong theory becomes clear: By framing an experiment as a test of a theory, rather than as a measurement of the relationship between dependent and independent variables (Koyré, Reference Koyré1953), the observed results can be generalized well beyond the point in question, as shown in Figure 1B. For example, while a methods section of an experimental paper might note that the participants were recruited from the subject pool at a particular university, it is not uncommon for research articles to report findings as if they apply to all of humanity (Henrich, Heine, & Norenzayan, Reference Henrich, Heine and Norenzayan2010). According to this view, theories (and in fields such as experimental economics, formal models) are what help us understand the world, whereas experiments are merely instruments that enable researchers to test theories (Lakens, Uygun Tunç, & Necip Tunç, Reference Lakens, Uygun Tunç and Necip Tunç2022; Levitt & List, Reference Levitt and List2007; Mook, Reference Mook1983; Zelditch, Reference Zelditch1969).

Figure 1. Implicit design space. Panel A depicts a single experiment (a single point) that generates a result in a particular sample population and context; the point's color represents a relationship between variables. Panel B depicts the expectation that results will generalize over broader regions of conditions. Panel C shows a result that applies to a bounded range of conditions. Panel D illustrates how isolated studies about specific hypotheses can reach inconsistent conclusions, as represented by different-colored points.

As noted above, however, we rarely expect theories in the social and behavioral sciences to be universally valid. The ability of the theory in question to generalize the result is therefore almost always limited to some region of the design space that includes the sampled point but not the entire space, as shown in Figure 1C. While we expect that most researchers would acknowledge that they lack evidence for unconstrained generality over the population, it is important to note that there is nothing special about the subjects. In principle, what goes for subjects also holds for contexts (Simons, Shoda, & Lindsay, Reference Simons, Shoda and Lindsay2017; Yarkoni, Reference Yarkoni2022). Indeed, as Brunswik long ago observed, “…proper sampling of situations and problems may in the end be more important than proper sampling of subjects, considering the fact that individuals are probably on the whole much more alike than are situations among one another” (Brunswik, Reference Brunswik1947).

Unfortunately, because the design space is never explicitly constructed, and hence the sampled point has no well-defined location in the space, the one-at-a-time paradigm cannot specify a proposed domain of generalizability. Instead, any statements regarding “scope” or “boundary” conditions for a finding are often implicit and qualitative in nature, leaving readers to assume the broadest possible generalizations. These scope conditions may appear in an article's discussion section but typically not in its title, abstract, or introduction. Rarely, if ever, is it possible to precisely identify, based on the theory alone, over what domain of the design space one should expect an empirical result to hold (Cesario, Reference Cesario2014, Reference Cesario2022).

2.3. Incommensurability leads to irreconcilability

Given that the choices about the design of experiments are not systematically documented, it becomes impossible to establish how similar or different two experiments are. This form of incommensurability, whereby experiments about the same effect of interest are incomparable, generates a pattern like that shown in Figure 1D, where inconsistent and contradictory findings appear in no particular order or pattern (Levinthal & Rosenkopf, Reference Levinthal and Rosenkopf2021). If one had a metatheory that specified precisely under what conditions (i.e., over what region of parameter values in the design space) each theory should apply, it might be possible to reconcile the results under that metatheory's umbrella, but rarely do such metatheories exist (Muthukrishna & Henrich, Reference Muthukrishna and Henrich2019). As a result, the one-at-a-time paradigm provides no mechanism by which to determine whether the observed differences (a) are to be expected on the grounds that they lie in distinct subdomains governed by different theories, (b) represent a true disagreement between competing theories that make different claims on the same subdomain, or (c) indicate that one or both results are likely to be wrong and therefore require further replication and scrutiny. In other words, inconsistent findings arising in the research literature are essentially irreconcilable (Almaatouq, Reference Almaatouq2019; Muthukrishna & Henrich, Reference Muthukrishna and Henrich2019; Van Bavel, Mende-Siedlecki, Brady, & Reinero, Reference Van Bavel, Mende-Siedlecki, Brady and Reinero2016; Watts, Reference Watts2017; Yarkoni, Reference Yarkoni2022).

Critically, the absence of commensurability also creates serious problems for existing methods of synthesizing knowledge such as systematic reviews and meta-analyses. As all these methods are post-hoc, meaning that they are applied after the studies in question have been completed, they are necessarily reliant on the designs of the experiments they are attempting to integrate. If those designs do not satisfy the property of commensurability (again, because they were never intended to), then ex-post methods are intrinsically limited in how much they can say about observed differences. A concrete illustration of this problem has emerged recently in the context of “nudging” due to the publication of a large meta-analysis of over 400 studies spanning a wide range of contexts and interventions (Mertens, Herberz, Hahnel, & Brosch, Reference Mertens, Herberz, Hahnel and Brosch2022). The paper was subsequently criticized for failing to account adequately for publication bias (Maier et al., Reference Maier, Bartoš, Stanley, Shanks, Harris and Wagenmakers2022), the quality of the included studies (Simonsohn, Simmons, & Nelson, Reference Simonsohn, Simmons and Nelson2022), and their heterogeneity (Szaszi et al., Reference Szaszi, Higney, Charlton, Gelman, Ziano, Aczel and Tipton2022). While the first two of these problems can be addressed by proposed reforms in science, such as universal registries of study designs (which are designed to mitigate publication bias) and adoption of preanalysis plans (which are intended to improve study quality), the problem of heterogeneity requires a framework for expressing study characteristics in a way that is commensurate. That is, if two studies differ, a meta-analysis is left with no means to incorporate information from both of them in a way that properly accounts for their differences. Thus, while meta-analyses (and reviews more generally) can acknowledge the importance of moderating variables, they are inherently limited in their ability to do so by the commensurability of the underlying studies.

Finally, we note that the lack of commensurability is also unaddressed by existing proposals to improve the reliability of science by, for example, increasing sample sizes, calculating effect sizes rather than measures of statistical significance, replicating findings, or requiring preregistered designs. Although these practices can indeed improve the reliability of individual findings, they are not concerned directly with the issue of how many such findings “fit together” and hence do not address our fundamental concern with the one-at-a-time framework. In other words, just as Newell claimed 50 years ago, improving the commensurability of experiments – and the theories they seek to test – will require a paradigmatic shift in how we think about experimental design.

3. From one-at-a-time to integrative by design

We earlier noted that a second explanation for the persistence of the one-at-a-time approach is the lack of any realistic alternative. Even if one sees the need for a “paradigmatic shift in how we think about experimental design,” it remains unclear what that shift would look like and how to implement it. To address this issue, we now describe an alternative approach, which we call “integrative” experimentation and which can resolve some of the difficulties described previously. In general terms, the one-at-a-time approach starts with a single, often very specific, theoretically informed hypothesis. In contrast, the integrative approach starts from the position of embracing many potentially relevant theories: All sources of measurable experimental-design variation are potentially relevant, and questions about which parameters are relatively more or less important are to be answered empirically. The integrative approach proceeds in three phases: (1) Constructing a design space, (2) sampling from the design space, and (3) building theories from the resulting data. The rest of this section elucidates these three main conceptual components of the integrative approach.

3.1. Constructing the design space

The integrative approach starts by explicitly constructing the design space. Experiments that have already been conducted can then be assigned well-defined coordinates, whereas those not yet conducted can be identified as as-yet-unsampled points. Critically, the differences between any pair of experiments that share the same effect of interest – whether past or future – can be determined; thus, it is possible to precisely identify the similarities and differences between two designs. In other words, commensurability is “baked in” by design.

How should the design space be constructed in practice? The method will depend on the domain of interest but is likely to entail a discovery stage that identifies candidate dimensions from the literature. Best practices for constructing the design space will emerge with experience, giving birth to a new field of what we tentatively label “research cartography”: The systematic process of mapping out research fields in design spaces. Efforts in research cartography are likely to benefit from and contribute to ongoing endeavors to produce formal ontologies in social and behavioral science research and other disciplines, in support of a more integrative science (Larson & Martone, Reference Larson and Martone2009; Rubin et al., Reference Rubin, Lewis, Mungall, Misra, Westerfield, Ashburner and Musen2006; Turner & Laird, Reference Turner and Laird2012).

To illustrate this process, consider the phenomenon of group synergy discussed earlier. Given existing theory and decades of experiments, one might expect the existence and strength of group synergy to depend on the task: For some tasks, interacting groups might outperform nominal groups, whereas for others, the reverse might hold. In addition, synergy might (or might not) be expected depending on the specific composition of the group: Some combinations of skills and other individual attributes might lead to synergistic performance; other combinations might not. Finally, group synergy might depend on “group processes,” defined as variables such as the communications technology or incentive structure that affect how group members interact with one another, but which are distinct both from the individuals themselves and their collective task.

Given these three broad sources of variation, an integrative approach would start by identifying the dimensions associated with each, as suggested either by prior research or some other source of insight such as practical experience. In this respect, research cartography resembles the process of identifying the nodes of a nomological network (Cronbach & Meehl, Reference Cronbach and Meehl1955; Preckel & Brunner, Reference Preckel and Brunner2017) or the dimensions of methodological diversity for a meta-analysis (Higgins, Thompson, Deeks, & Altman, Reference Higgins, Thompson, Deeks and Altman2003); however, it will typically involve many more dimensions and require the “cartographer” to assign numerical coordinates to each “location” in the space. For example, the literature on group performance has produced several well-known task taxonomies, such as those by Shaw (Reference Shaw1963), Hackman (Reference Hackman1968), Steiner (Reference Steiner1972), McGrath (Reference McGrath1984), and Wood (Reference Wood1986). Task-related dimensions of variation (e.g., divisibility, complexity, solution demonstrability, and solution multiplicity) would be extracted from these taxonomies and used to label tasks that have appeared in experimental studies of group performance. Similarly, prior work has variously suggested that group performance depends on the composition of the group with respect to individual-level traits as captured by, say, average skill (Bell, Reference Bell2007; Devine & Philips, Reference Devine and Philips2001; LePine, Reference LePine2003; Stewart, Reference Stewart2006), skill diversity (Hong & Page, Reference Hong and Page2004; Page, Reference Page2008), gender diversity (Schneid, Isidor, Li, & Kabst, Reference Schneid, Isidor, Li and Kabst2015), social perceptiveness (Engel, Woolley, Jing, Chabris, & Malone, Reference Engel, Woolley, Jing, Chabris and Malone2014; Kim et al., Reference Kim, Engel, Woolley, Lin, McArthur and Malone2017; Woolley, Chabris, Pentland, Hashmi, & Malone, Reference Woolley, Chabris, Pentland, Hashmi and Malone2010), and cognitive-style diversity (Aggarwal & Woolley, Reference Aggarwal and Woolley2018; Ellemers & Rink, Reference Ellemers and Rink2016), all of which could be represented as dimensions of the design space. Finally, group-process variables might include group size (Mao et al., Reference Mao, Mason, Suri and Watts2016), properties of the communication network (Almaatouq, Rahimian, Burton, & Alhajri, Reference Almaatouq, Rahimian, Burton and Alhajri2022; Becker et al., Reference Becker, Brackbill and Centola2017; Mason & Watts, Reference Mason and Watts2012), and the ability of groups to reorganize themselves (Almaatouq et al., Reference Almaatouq, Noriega-Campero, Alotaibi, Krafft, Moussaid and Pentland2020). Together, these variables might identify upward of 50 dimensions that define a design space of possible experiments for studying group synergy through integrative experiment design, where any given study should, in principle, be assignable to one unique point in the space.Footnote 5
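To give a sense of what such a mapping might look like in code, the sketch below lists a hypothetical fragment of a design-space catalog for group synergy and assigns coordinates to one illustrative study. The dimension names are loosely inspired by the taxonomies cited above; the coordinate values are invented and do not describe any actual published experiment.

# Hypothetical fragment of a "research cartography" exercise for group synergy.
DESIGN_DIMENSIONS = {
    # task-related dimensions
    "task_divisibility":          ["unitary", "divisible"],
    "solution_demonstrability":   ["low", "medium", "high"],
    "solution_multiplicity":      ["single", "multiple"],
    # group-composition dimensions
    "mean_skill":                 "continuous",
    "skill_diversity":            "continuous",
    "social_perceptiveness_mean": "continuous",
    # group-process dimensions
    "group_size":                 "integer >= 2",
    "communication_network":      ["complete", "ring", "star"],
    "self_organization_allowed":  [False, True],
}

# Coordinates assigned to one (hypothetical) previously conducted experiment.
example_study_coordinates = {
    "task_divisibility": "unitary",
    "solution_demonstrability": "high",
    "solution_multiplicity": "single",
    "mean_skill": 0.6,
    "skill_diversity": 0.2,
    "social_perceptiveness_mean": 0.5,
    "group_size": 4,
    "communication_network": "complete",
    "self_organization_allowed": False,
}
print(sorted(example_study_coordinates.items()))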

As this example illustrates, the list of possibly relevant variables can be long, and the dimensionality of the design space can therefore be large. Complicating matters, we do not necessarily know up front which of the many variables are in fact relevant to the effects of interest. In the example of group synergy, for instance, even an exhaustive reading of the relevant literature is not guaranteed to reveal all the ways in which tasks, groups, and group processes can vary in ways that meaningfully affect synergy. Conversely, there is no guarantee that all, or even most, of the dimensions chosen to represent the design space will play any important role in generating synergy. As a result, experiments that map to the same point in the design space could yield different results (because some important dimension is missing from the representation of the space), while in other cases, experiments that map to very different points yield indistinguishable behavior (because the dimensions along which they differ are irrelevant).

Factors such as these complicate matters in practice but do not present a fundamental problem to the approach described here. The integrative approach does not require the initial configuration of the space to be correct or its dimensionality to be fixed. Rather, the dimensionality of the space can be learned in parallel with theory construction and testing. Really, the only critical requirement for constructing the design space is to do it explicitly and systematically by identifying potentially relevant dimensions (either from the literature or from experience, including any known experiments that have already been performed) and by assigning coordinates to individual experiments along all identified dimensions. Using this process of explicit, systematic mapping of research designs to points in the design space (research cartography), the integrative approach ensures commensurability. We next will describe how the approach leverages commensurability to produce integrated knowledge in two steps: Via sampling, and via theory construction and testing.

3.2. Sampling from the design space

An important practical challenge to integrative experiment design is that the size of the design space (i.e., the number of possible experiments) increases exponentially with the number of identified dimensions D. To illustrate, assume that each dimension can be represented as a binary variable (0, 1), such that a given experiment either exhibits the property encoded in the dimension or does not. The number of possible experiments is then 2^D. When D is reasonably small and experiments are inexpensive to run, it may be possible to exhaustively explore the space by conducting every experiment in a full factorial design. For example, when D = 8, there are 256 experiments in the design space, a number that is beyond the scale of most studies in the social and behavioral sciences but is potentially achievable with recent innovations in crowdsourcing and other “high-throughput” methods, especially if distributed among a consortium of labs (Byers-Heinlein et al., Reference Byers-Heinlein, Bergmann, Davies, Frank, Kiley Hamlin, Kline and Soderstrom2020; Jones et al., Reference Jones, DeBruine, Flake, Liuzza, Antfolk, Arinze and Coles2021). Moreover, running all possible experiments may not be necessary: If the goal is to estimate the impact that each variable has, together with their interactions, a random (or more efficient) sample of the experiments can be run (Auspurg & Hinz, Reference Auspurg and Hinz2014). This sample could also favor areas where prior work suggests meaningful variation will be observed. Using these methods, together with large samples, it is possible to run studies for higher values of D (e.g., 20). Section 4 describes examples of such studies.
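The combinatorics are easy to illustrate: the short Python sketch below enumerates a full factorial design over hypothetical binary dimensions and then draws a random subsample of design points, as one would when exhaustive coverage is infeasible.

import itertools
import random

D = 8
dimensions = [f"dim_{i}" for i in range(D)]  # placeholder dimension names

# Full factorial design: every combination of the D binary dimensions.
full_factorial = list(itertools.product([0, 1], repeat=D))
print(len(full_factorial))  # 2**8 = 256 candidate experiments

# When exhaustive coverage is infeasible, draw an unbiased random sample
# of design points instead of running every cell.
random.seed(0)
sampled_designs = random.sample(full_factorial, k=32)
print(sampled_designs[0])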

Exhaustive and random sampling are both desirable because they allow unbiased evaluation of hypotheses that are not tethered to the experimental design – there is no risk of looking only at regions of the space that current hypotheses favor (Dubova, Moskvichev, & Zollman, Reference Dubova, Moskvichev and Zollman2022), and no need to collect more data from the design space because the hypotheses under consideration change. But as the dimensionality increases, exhaustive and random sampling quickly becomes infeasible. When D is greater than 20, the number of experiment designs grows to over 1 million, and when D = 30, it is over 1 billion. Given that the dimensionality of design spaces for even moderately complex problems could easily exceed these numbers, and that many dimensions will be not binary but ternary or greater, integrative experiments will require using different sampling methods.

Fortunately, there already exist a number of methods that enable researchers to efficiently sample high-dimensional design spaces (Atkinson & Donev, Reference Atkinson and Donev1992; McClelland, Reference McClelland1997; Smucker, Krzywinski, & Altman, Reference Smucker, Krzywinski and Altman2018; Thompson, Reference Thompson1933). For example, one contemporary class of methods is “active learning,” an umbrella term for sequential optimal experimental-design strategies that iteratively select the most informative design points to sample.Footnote 6 Active learning has become an important tool in the design of A/B tests in industry (Letham, Karrer, Ottoni, & Bakshy, Reference Letham, Karrer, Ottoni and Bakshy2019) and, more recently, of behavioral experiments in the lab (Balietti, Klein, & Riedl, Reference Balietti, Klein and Riedl2021).Footnote 7 Most commonly, an active learning process begins by conducting a small number of randomly selected experiments (i.e., points in the design space) and fitting a surrogate model to the outcome of these experiments. As we later elucidate, one can think of the surrogate model as a “theory” that predicts the outcome of all experiments in the design space, including those that have not been conducted. Then, a sampling strategy (also called an “acquisition function,” “query algorithm,” or “utility measure”) selects a new batch of experiments to be conducted according to the value of potential experiments. Notably, the choice of a surrogate model and sampling strategy is flexible, and the best alternative to choose will depend on the problem (Eyke, Koscher, & Jensen, Reference Eyke, Koscher and Jensen2021).Footnote 8
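As a schematic illustration of such a loop (not a prescription for any particular surrogate or acquisition function), the sketch below pairs a Gaussian-process surrogate with an uncertainty-based sampling rule; the run_experiment function is a hypothetical stand-in for actually collecting data, and the design space is synthetic.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
D = 10
candidate_designs = rng.integers(0, 2, size=(500, D))  # candidate points in a binary design space

def run_experiment(design):
    """Hypothetical stand-in for running a real experiment at this design point."""
    return float(design[:3].sum() + rng.normal(scale=0.1))  # unknown "true" outcome

# Seed the loop with a few randomly chosen experiments.
idx = list(rng.choice(len(candidate_designs), size=10, replace=False))
X = candidate_designs[idx].astype(float)
y = np.array([run_experiment(d) for d in candidate_designs[idx]])

surrogate = GaussianProcessRegressor(alpha=1e-3)
for _ in range(20):                       # each iteration = one new experiment
    surrogate.fit(X, y)                   # the surrogate acts as a provisional "theory"
    mean, std = surrogate.predict(candidate_designs.astype(float), return_std=True)
    next_i = int(np.argmax(std))          # acquisition rule: sample where the model is most uncertain
    new_y = run_experiment(candidate_designs[next_i])
    X = np.vstack([X, candidate_designs[next_i].astype(float)])
    y = np.append(y, new_y)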

We will not explore the details of these methods or their implementation,Footnote 9 as this large topic has been – and continues to be – extensively developed in the machine-learning and statistics communities.Footnote 10 For the purpose of our argument, it is necessary only to convey that systematic sampling from the design space allows for unbiased evaluation of hypotheses (see Fig. 2A) and can leverage a relatively small number of sampled points in the design space to make predictions about every point in the space, the vast majority of which are never sampled (see Fig. 2B). Even so, by iteratively evaluating the model against newly sampled points and updating it accordingly, the model can learn about the entire space, including which dimensions are informative. As we explain next, this iterative process will also form the basis of theory construction and evaluation.

Figure 2. Explicit design space. Panel A shows that systematically sampling the space of possible experiments can reveal contingencies, thereby increasing the integrativeness of theories (as shown in panel B). Panel C depicts that what matters most is the overlap between the most practically useful conditions and domains defined by theoretical boundaries. The elephants in panels B and C represent the bigger picture that findings from a large number of experiments allow researchers to discern, but which is invisible to those from situated theoretical and empirical positions.

3.3. Building and testing theories

Much like in the one-at-a-time paradigm, the ultimate goal of integrative experiment design is to develop a reliable, cohesive, and cumulative theoretical understanding. However, because the integrative approach constructs and tests theories differently, the theories that tend to emerge from it depart from the traditional notion of theory in two regards. First, the shift to integrative experiments will change our expectations about what theories look like (Watts, Reference Watts2014, Reference Watts2017), requiring researchers to focus less on proposing novel theories that seek to differentiate themselves from existing theories by identifying new variables and their effects, and more on identifying theory boundaries, which may involve many known variables working together in complex ways. Second, although traditional theory development distinguishes sharply between basic and applied research, integrative theories will lend themselves to a “use-inspired” approach in which basic and applied science are treated as complements rather than as substitutes where one necessarily drives out the other (Stokes, Reference Stokes1997; Watts, Reference Watts2017). We now describe each of these adaptations in more detail.

3.3.1. Integrating and reconciling existing theories

As researchers sample experiments that cover more of the design space, simple theories and models that explain behavior with singular factors will no longer be adequate because they will fail to generalize. From a statistical perspective, the “bias-variance trade-off” principle identifies two ways a model (or theory) can fail to generalize: It can be too simple and thus unable to capture trends in the observed data, or too complex, overfitting the observed data and manifesting great variance across datasets (Geman, Bienenstock, & Doursat, Reference Geman, Bienenstock and Doursat1992). However, this variance decreases as the datasets increase in size and breadth, making oversimplification and reliance on personal intuitions more-likely causes of poor generalization. As a consequence, we must develop new kinds of theories – or metatheories – that capture the complexity of human behaviors while retaining the interpretability of simpler theories.Footnote 11 In particular, such theories must account for variation in behavior across the entire design space and will be subject to different evaluation criteria than those traditionally used in the social and behavioral sciences.
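For reference, the standard decomposition underlying this principle, for squared-error loss with true outcome function f, estimated model \hat{f}, and irreducible noise variance \sigma^2, is:

\mathbb{E}\!\left[\bigl(y - \hat{f}(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}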

One such criterion is the requirement that theories generate “risky” predictions, defined roughly as quantitative predictions about as-yet unseen outcomes (Meehl, Reference Meehl1990b; Yarkoni, Reference Yarkoni2022). For example, in the “active sampling” approach outlined above, the surrogate model encodes prior theory and experimental results into a formal representation that (a) can be viewed as an explanation of all previously sampled experimental results and (b) can be queried for predictions treated as hypotheses. This dual status of the surrogate model as both explanation and prediction (Hofman et al., Reference Hofman, Watts, Athey, Garip, Griffiths, Kleinberg and Yarkoni2021; Nemesure, Heinz, Huang, & Jacobson, Reference Nemesure, Heinz, Huang and Jacobson2021; Yarkoni & Westfall, Reference Yarkoni and Westfall2017) distinguishes it from the traditional notion of hypothesis testing. Rather than evaluating a theory based on how well it fits existing (i.e., in-sample) experimental data, the surrogate model is continually evaluated on its ability to predict new (i.e., out-of-sample) experimental data. Moreover, once the new data have been observed, the model is updated to reflect the new information, and new predictions are generated.

We emphasize that the surrogate model from the active learning approach is just one way to generate, test, and learn from risky predictions. Many other approaches also satisfy this criterion. For example, one might train a machine-learning model other than the surrogate model to estimate heterogeneity of treatment effects and to discover complex structures that were not specified in advance (Wager & Athey, Reference Wager and Athey2018). Alternatively, one could use an interpretable, mechanistic model. The only essential requirements for an integrative model are that it leverages the commensurability of the design space to in some way (a) accurately explain data that researchers have already observed, (b) make predictions about as-yet-unseen experiments, and (c), having run those experiments, integrate the newly learned information to improve the model. If accurate predictions are achievable across some broad domain of the design space, the model can then be interpreted as supporting or rejecting various theoretical claims in a context-population-dependent way, as illustrated schematically in Figure 2B. Reflecting Merton's (Reference Merton1968) call for “theories of the middle range,” a successful metatheory could identify the boundaries between empirically distinct regions of the design space (i.e., regions where different observed answers to the same research question pertain), making it possible to precisely state under what conditions (i.e., for which ranges of parameter values) one should expect different theoretically informed results to apply.
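As one illustration of how treatment-effect heterogeneity can be probed, the sketch below uses a simple "T-learner" on synthetic data – a much cruder approach than the causal-forest methods of Wager and Athey, but enough to show how an estimated effect can vary across the design space rather than being a single number; all variables are invented.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic design-space coordinates (X), random treatment assignment (T),
# and an outcome whose treatment effect grows with the first coordinate.
n, d = 2000, 5
X = rng.uniform(size=(n, d))
T = rng.integers(0, 2, size=n)
y = X @ rng.normal(size=d) + T * (2.0 * X[:, 0]) + rng.normal(scale=0.1, size=n)

# T-learner: fit separate outcome models for treated and control units,
# then read off the estimated effect as the difference in predictions.
m1 = RandomForestRegressor(random_state=0).fit(X[T == 1], y[T == 1])
m0 = RandomForestRegressor(random_state=0).fit(X[T == 0], y[T == 0])
estimated_effect = m1.predict(X) - m0.predict(X)

# The estimated effect should track 2 * X[:, 0], i.e., it is a function of
# where an experiment sits in the space, not a single constant.
print(np.corrcoef(estimated_effect, 2.0 * X[:, 0])[0, 1])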

If accurate predictions are unachievable even after an arduous search, the result is not a failure of the integrative framework. Rather, it would be an example of the framework's revealing a fundamental limit to prediction and, hence, explanation (Hofman, Sharma, & Watts, Reference Hofman, Sharma and Watts2017; Martin, Hofman, Sharma, Anderson, & Watts, Reference Martin, Hofman, Sharma, Anderson and Watts2016; Watts et al., Reference Watts, Beck, Bienenstock, Bowers, Frank, Grubesic and Salganik2018).Footnote 12 In the extreme, when no point in the space is informative of any other point, generalizations of any sort are unwarranted. In such a scenario, applied research might still be possible, for example, by sampling the precise point of interest (Manzi, Reference Manzi2012), but the researcher's drive to attain a generalizable theoretical understanding of a domain of inquiry would be exposed as fruitless. Such an outcome would be disappointing, but from a larger scientific perspective, it is better to know what cannot be known than to believe in false promises. Naturally, whether such outcomes arise – and if so, how frequently – is itself an empirical question that the proposed framework could inform. With sufficient integrative experiments over many domains, the framework might yield a “meta-metatheory” that clarifies under which conditions one should (or should not) expect to find predictively accurate metatheories.

3.3.2. Bridging scientific and pragmatic knowledge

Another feature of integrative theories is that they will lend themselves to a “use-inspired” approach. Practitioners and researchers alike generally acknowledge that no single intervention, however evidence-based, benefits all individuals in all circumstances (i.e., across populations and contexts) and that overgeneralization from lab experiments in many areas of behavioral science can (and routinely does) lead practitioners and policymakers to deploy suboptimal and even dangerous real-world interventions (Brewin, Reference Brewin2022; de Leeuw, Motz, Fyfe, Carvalho, & Goldstone, Reference de Leeuw, Motz, Fyfe, Carvalho and Goldstone2022; Grubbs, Reference Grubbs2022; Wiernik, Raghavan, Allan, & Denison, Reference Wiernik, Raghavan, Allan and Denison2022). Therefore, social scientists should precisely identify the most effective intervention under each arising set of circumstances.

The integrative approach naturally emphasizes contingencies and enables practitioners to distinguish between the most general result and the result that is most useful in practice. For example, in Figure 2B, the experiments depicted with a gray point correspond to the most general claim, occupying the largest region in the design space. However, this view ignores relevance, defined as points that represent the “target” conditions or the particular real-world context to which the practitioner hopes to generalize the results (Berkman & Wilson, Reference Berkman and Wilson2021; Brunswik, Reference Brunswik1955), as shown in Figure 2C. By concretely emphasizing these theoretical contingencies, the integrative approach supports “use-inspired” research (Stokes, Reference Stokes1997; Watts, Reference Watts2017).

4. Existing steps toward integrative experiments

Integrative experiment design is not yet an established framework. However, some recent experimental work has begun to move in the direction we endorse – for example, by explicitly constructing a design space, sampling conditions more broadly and densely than the one-at-a-time approach would have, and constructing new kinds of theories that reflect the complexity of human behavior. In this section, we describe three examples of such experiments in the domains of (1) moral judgments, (2) risky choices, and (3) subliminal priming effects. Note that these examples are not an exhaustive accounting of relevant work, nor fully fleshed out exemplars of the integrative framework. Rather, we find them to be helpful illustrations of work that is closely adjacent to what we describe and evidence that the approach is realizable and can yield useful insights.

4.1. Factors influencing moral judgments

Inspired by the trolley problem, the seminal “Moral Machine” experiment used crowdsourcing to study human perspectives on moral decisions made by autonomous vehicles (Awad et al., Reference Awad, Dsouza, Kim, Schulz, Henrich, Shariff and Rahwan2018, Reference Awad, Dsouza, Bonnefon, Shariff and Rahwan2020). The experiment was supported by an algorithm that sampled a nine-dimensional space of over 9 million distinct moral dilemmas. In the first 18 months after deployment, the researchers collected more than 40 million decisions in 10 languages from over 4 million unique participants in 233 countries and territories (Fig. 3A).

Figure 3. Examples of integrative experiments. The top row illustrates the experimental tasks used in the Moral Machine, decisions under risk, and subliminal priming effects experiments, respectively, followed by the parameters varied across each experiment (bottom row). Each experiment instance (i.e., a scenario in the Moral Machine experiment, a pair of gambles in the risky-choice experiment, and a selection of facet values in the subliminal priming effects experiment) can be described by a vector of parameter values. Reducing the resulting space to two dimensions (2D) visualizes coverage by different experiments. This 2D embedding results from applying principal component analysis (PCA) to the parameters of these experimental conditions.
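For readers curious how such a two-dimensional embedding is produced, the sketch below applies PCA to synthetic parameter vectors standing in for the scenario or gamble parameters of each experimental condition; it is illustrative only and does not reproduce the figure.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Each row stands in for the parameter vector of one experimental condition
# (e.g., one Moral Machine scenario or one pair of gambles); the values are
# synthetic placeholders.
condition_parameters = rng.normal(size=(1000, 12))

# Project the conditions onto their first two principal components to
# visualize how densely a given experiment covers the design space.
embedding = PCA(n_components=2).fit_transform(condition_parameters)
print(embedding.shape)  # (1000, 2)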

The study offers numerous findings that were neither obvious nor deducible from prior research or traditional experimental designs. For example, the results show that once a moral dilemma is made sufficiently complex, few people will hold to the principle of treating all lives equally. Instead, participants appear to treat demographic groups quite differently – showing, for example, a willingness to sacrifice the elderly in service of the young, and a preference for sparing the wealthy over the poor at about the same level as the preference for preserving people following the law over those breaking it (Awad et al., Reference Awad, Dsouza, Kim, Schulz, Henrich, Shariff and Rahwan2018). A second surprising finding by Awad et al. (Reference Awad, Dsouza, Kim, Schulz, Henrich, Shariff and Rahwan2018) was that the difference between omission and commission (a staple of discussions of Western moral philosophy) ranks surprisingly low relative to other variables affecting judgments of morality and that this ethical preference for inaction is primarily concentrated in Western cultures (e.g., North America and many European countries of Protestant, Catholic, and Orthodox Christian cultural groups). Indeed, the observation that clustering between countries is not just based on one or two ethical dimensions, but on a full profile of the multiplicity of ethical dimensions, is something that would have been impossible to detect using studies that lacked the breadth of experimental conditions sampled in this study.

Moreover, such an approach to experimentation yields datasets that are more useful to other researchers as they evaluate their hypotheses, develop new theories, and address long-standing concerns such as which variables matter most to producing a behavior and what their relative contributions might be. For instance, Agrawal and colleagues used the dataset generated by the Moral Machine experiment to build a model with a black-box machine-learning method (specifically, an artificial neural network) for predicting people's decisions (Agrawal, Peterson, & Griffiths, Reference Agrawal, Peterson and Griffiths2020). This predictive model was used to critique a traditional cognitive model and identify potentially causal variables influencing people's decisions. The cognitive model was then evaluated in a new round of experiments that tested its predictions about the consequences of manipulating the causal variables. This approach of “scientific regret minimization” combined machine learning with rational choice models to jointly maximize the theoretical model's predictive accuracy and interpretability in the context of moral judgments. It also yielded a more-complex theory than psychologists might be accustomed to: The final model had over 100 meaningful predictors, each of which could have been the subject of a distinct experiment and theoretical insight about human moral reasoning. By considering the influence of these variables in a single study by Awad et al. (Reference Awad, Dsouza, Kim, Schulz, Henrich, Shariff and Rahwan2018), the researchers could ask what contribution each made to explaining the results. Investigation at this scale becomes possible when machine-learning methods augment the efforts of human theorists (Agrawal et al., Reference Agrawal, Peterson and Griffiths2020).

4.2. The space of risky decisions

The choice prediction competitions studied human decisions under risk (i.e., where outcomes are uncertain) by using an algorithm to select more than 100 pairs of gambles from a 12-dimensional space (Erev, Ert, Plonsky, Cohen, & Cohen, Reference Erev, Ert, Plonsky, Cohen and Cohen2017; Plonsky et al., Reference Plonsky, Apel, Ert, Tennenholtz, Bourgin, Peterson and Erev2019). Recent work scaled this approach by taking advantage of the larger sample sizes made possible by virtual labs, collecting human decisions for over 10,000 pairs of gambles (Bourgin, Peterson, Reichman, Russell, & Griffiths, Reference Bourgin, Peterson, Reichman, Russell, Griffiths, Chaudhuri and Salakhutdinov2019; Peterson, Bourgin, Agrawal, Reichman, & Griffiths, Reference Peterson, Bourgin, Agrawal, Reichman and Griffiths2021).

By sampling the space of possible experiments (in this case, gambles) much more densely (Fig. 3B), Peterson et al. (Reference Peterson, Bourgin, Agrawal, Reichman and Griffiths2021) found that two of the classic phenomena of risky choice – loss aversion and overweighting of small probabilities – did not manifest uniformly across the entire space of possible gambles. These two phenomena originally prompted the development of prospect theory (Kahneman & Tversky, Reference Kahneman and Tversky1979), representing significant deviations from the predictions of classic expected utility theory. By identifying regions of the space of possible gambles where loss aversion and overweighting of small probabilities occur, Kahneman and Tversky showed that expected utility theory does not capture some aspects of human decision making. However, in analyzing predictive performance across the entire space of gambles, Peterson et al. found that prospect theory was outperformed by a model in which the degree of loss aversion and overweighting of small probabilities varied smoothly over the space.

The work of Peterson et al. (Reference Peterson, Bourgin, Agrawal, Reichman and Griffiths2021) illustrates how the content of theories might be expected to change with a shift to the integrative approach. Prospect theory makes a simple assertion about human decision making: People exhibit loss aversion and overweight small probabilities. Densely sampling a larger region of the design space yields a more nuanced theory: While the functional form of prospect theory is well suited for characterizing human decisions, the extent to which people show loss aversion and overweight small probabilities depends on the context of the choice problem. That dependency is complicated. Even so, Peterson et al. identified several relevant variables such as the variability of the outcomes of the underlying gambles and whether the gamble was entirely in the domain of losses. Machine-learning methods were useful in developing this theory, initially to optimize the parameters of the functions assumed by prospect theory and other classic theories of decision making so as to ensure evaluation of the best possible instances of those theories, and then to demonstrate that these models did not capture variation in people's choices that could be predicted by more-complex models.
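For concreteness, the functional forms at issue can be written as follows. The first pair gives the commonly used parametric forms of prospect theory's value and probability-weighting functions; the second pair is a schematic rendering of the context-dependent generalization described above (the notation for context dependence is ours, not Peterson et al.'s exact model).

```latex
% Canonical prospect-theory forms: a value function with loss aversion
% (lambda) and a weighting function that overweights small probabilities.
\[
v(x) =
\begin{cases}
x^{\alpha} & x \ge 0,\\[2pt]
-\lambda\,(-x)^{\beta} & x < 0,
\end{cases}
\qquad
\pi(p) = \frac{p^{\gamma}}{\bigl(p^{\gamma} + (1-p)^{\gamma}\bigr)^{1/\gamma}} .
\]
% Schematic context-dependent generalization: the same functional forms, but
% with the parameters varying smoothly with features c of the choice problem
% (e.g., outcome variability, whether all outcomes are losses).
\[
v_c(x) =
\begin{cases}
x^{\alpha(c)} & x \ge 0,\\[2pt]
-\lambda(c)\,(-x)^{\beta(c)} & x < 0,
\end{cases}
\qquad
\pi_c(p) = \frac{p^{\gamma(c)}}{\bigl(p^{\gamma(c)} + (1-p)^{\gamma(c)}\bigr)^{1/\gamma(c)}} .
\]
```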

4.3. A metastudy of subliminal priming effects

A recent cognitive psychology paper described an experiment in which a subliminal cue influences how participants balance speed and accuracy in a response-time task (Reuss, Kiesel, & Kunde, Reference Reuss, Kiesel and Kunde2015). In particular, participants were instructed to rapidly select a target according to a cue that signaled whether to prioritize response accuracy over speed, or vice versa. Reuss et al. reported typical speed–accuracy tradeoffs: When cued to prioritize speed, participants were faster and gave less accurate responses, whereas when cued to prioritize accuracy, participants were slower and more accurate. Crucially, this relationship was also found with cues that were rendered undetectable via a mask, an image presented directly before or after the cue that can suppress conscious perception of it.

The study design of the original experiment included several nuisance variables (e.g., the color of the cue) whose values were not thought to affect the finding of subliminal effects. If the claimed effect were general, it would appear for all plausible values of the nuisance variables; if contingent, it would appear in some (contiguous) ranges of values but not in others; and if spurious, it would appear only for the original values, if at all.

Baribault et al. (Reference Baribault, Donkin, Little, Trueblood, Oravecz, van Ravenzwaaij and Vandekerckhove2018) took a “radical randomization” approach (also called a “metastudy” approach) in examining the generalizability and robustness of the original finding by randomizing 16 independent variables that could moderate the subliminal priming effect (Fig. 3C). By sampling nearly 5,000 “microexperiments” from the 16-dimensional design space, Baribault et al. revealed that masked cues had an effect on participant behavior only in the subregion of the design space where the cue is consciously visible, thus providing much stronger evidence about the lack of the subliminal priming effect than any single traditional experiment evaluating this effect could have. For a recent, thorough discussion of the metastudy approach and its advantages, along with a demonstration using the risky-choice framing effect, see DeKay, Rubinchik, Li, and De Boeck (Reference DeKay, Rubinchik, Li and De Boeck2022).
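As an illustration of the sampling logic only (the dimension names and levels below are placeholders, not Baribault et al.'s actual factors or code), a metastudy of this kind draws each microexperiment's settings at random from the full design space and then models the effect as a function of those settings.

```python
# A minimal sketch of "radical randomization": every microexperiment draws its
# own values of the design/nuisance variables, and the effect of interest is
# later modeled as a function of those variables.
import numpy as np

rng = np.random.default_rng(0)

DESIGN_SPACE = {                       # illustrative subset of dimensions
    "cue_color":         ["red", "green", "blue"],
    "mask_duration_ms":  [17, 33, 50, 100],
    "cue_duration_ms":   [17, 33, 50],
    "soa_ms":            [50, 100, 200],
    "instruction_frame": ["speed_first", "accuracy_first"],
}

def sample_microexperiment():
    """Draw one point in the design space uniformly at random."""
    return {dim: rng.choice(levels) for dim, levels in DESIGN_SPACE.items()}

# ~5,000 microexperiments, each run with a small number of participants.
protocol = [sample_microexperiment() for _ in range(5000)]

# After data collection, each microexperiment contributes one effect estimate
# (e.g., the cued-speed minus cued-accuracy difference in error rate), and the
# analysis asks where in the design space that effect is nonzero - for example,
# by regressing the per-microexperiment effect on the design dimensions.
```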

5. Critiques and concerns

We have argued that adopting what we have called “integrative designs” in experimental social and behavioral science will lead to more-consistent, more-cumulative, and more-useful science. As should be clear from our discussion, however, our proposal is preliminary and therefore subject to several questions and concerns. Here we outline some of the critiques we have encountered and offer our responses.

5.1. Isn't the critique of the one-at-a-time approach unfair?

One possible objection is that our critique of the one-at-a-time approach is unduly harsh and does not recognize its proper role in the future of the social and behavioral sciences. To be clear, we are neither arguing that scientists should discard the “one-at-a-time” paradigm entirely nor denigrating studies (including our own!) that have employed it. The approach has generated a substantial amount of valuable work and continues to be useful for understanding individual causal effects, shaping theoretical models, and guiding policy. For example, it can be a sufficient and effective means to provide evidence for the existence of a phenomenon (but not the conditions under which it exists), as in field experiments showing that job applicants with characteristically “Black” names are less likely to be interviewed than those with “White” names, revealing the presence of structural racism and informing public debates about discrimination (Bertrand & Mullainathan, Reference Bertrand and Mullainathan2004). Moreover, one-at-a-time experimentation can precede the integrative approach when exploring a new topic and identifying the variables that make up the design space.

Rather, our point is that the one-at-a-time approach cannot do all the work that is being asked of it, in large part because theories in the social and behavioral sciences cannot do all the work that is being asked of them. Once we recognize the inherent imprecision and ambiguity of social and behavioral theories, the lack of commensurability across independently designed and executed experiments is revealed as inevitable. Similarly, the solution we describe here can be understood simply as baking commensurability into the design process, by explicitly recognizing potential dimensions of variability and mapping experiments such that they can be compared with one another. In this way, the integrative approach can complement one-at-a-time experiments by incorporating them within design spaces (analogous to how articles already contextualize their contribution in terms of the prior literature), through which the research field might quickly recognize creative and pathbreaking contributions from one-at-a-time research.

5.2. Can't we solve the problem with meta-analysis?

As discussed earlier, meta-analyses offer the attractive proposition that accumulation of knowledge can be achieved through a procedure that compares and combines results across experiments. But the integrative approach is different in at least three important ways.

First, meta-analyses – as well as systematic reviews and integrative conceptual reviews – are by nature post hoc mechanisms for performing integration: The synthesis and integration steps occur after the data are collected and the results are published. Therefore, it can take years of waiting for studies to accumulate “naturally” before one can attempt to “put them together” via meta-analyses (if at all, as the vast majority of published effects are never meta-analyzed). More importantly, because commensurability is not a first-order consideration of one-at-a-time studies, attempts to synthesize collections of such studies after the fact are intrinsically challenging. The integrative approach is distinct in that it treats commensurability as a first-order consideration that is baked into the research design at the outset (i.e., ex ante). As we have argued, the main benefit of ex ante over ex post integration is that the explicit focus on commensurability greatly eases the difficulty of comparing different studies and hence integrating their findings (whether similar or different). In this respect, our approach can be viewed as a “planned meta-analysis” that is explicitly designed to sample conditions more broadly, minimize sampling bias, and efficiently reveal how effects vary across conditions. Although it takes more time and effort (and thus money) to run an integrative experiment than a single traditional experiment, it takes far less than the accumulated effort of all the original studies that a typical meta-analysis draws upon (see sect. 5.6 for a discussion about costs).

Second, although a meta-analysis typically aims to estimate the size of an effect by aggregating (e.g., averaging) over design variations across experiments, our emphasis is on trying to map the variation in an effect across an entire design space. While some meta-analyses with sufficient data attempt to determine the heterogeneity of the effect of interest, these efforts are typically hindered by the absence of systematic data on the variations in design choices (as well as in methods).

Third, publication bias induced by selective reporting of conditions and results – known as the file drawer problem (Carter, Schönbrodt, Gervais, & Hilgard, Reference Carter, Schönbrodt, Gervais and Hilgard2019; Rosenthal, Reference Rosenthal1979) – can lead to biased effect-size estimates in meta-analyses. While there are methods for identifying and correcting such biases, one cannot be sure of their effectiveness in any particular case because of their sensitivity to untestable assumptions (Carter et al., Reference Carter, Schönbrodt, Gervais and Hilgard2019; Cooper, Hedges, & Valentine, Reference Cooper, Hedges and Valentine2019). Another advantage of the integrative approach is that it is largely immune to such problems because all sampled experiments are treated as informative, regardless of the novelty or surprise value of the individual findings, thereby greatly reducing the potential for bias.

5.3. How do integrative experiments differ from other recent innovations in psychology?

There have been several efforts to innovate on traditional experiments in the behavioral and social sciences. One key innovation is collaboration by multiple research labs to conduct systematic replications or to run larger-scale experiments than had previously been possible. For instance, the Many Labs initiative coordinated numerous research labs to conduct a series of replications of significant psychological results (Ebersole et al., Reference Ebersole, Atherton, Belanger, Skulborstad, Allen, Banks and Nosek2016; Klein et al., Reference Klein, Ratliff, Vianello, Adams, Bahník, Bernstein and Nosek2014, Reference Klein, Vianello, Hasselman, Adams, Adams, Alper and Nosek2018). This effort has itself been replicated in enterprises such as the ManyBabies Consortium (ManyBabies Consortium, 2020), ManyClasses (Fyfe et al., Reference Fyfe, de Leeuw, Carvalho, Goldstone, Sherman, Admiraal and Motz2021), and ManyPrimates (Many Primates et al., Reference Altschul, Beran, Bohn, Call, DeTroy, Duguid and Watzek2019), which pursue the same goal with more-specialized populations, and in the DARPA SCORE program, which did so over a representative sample of experimental research in the behavioral and social sciences (Witkop, Reference Witkopn.d.).Footnote 13 The Psychological Science Accelerator brings together multiple labs with a different goal: To evaluate key findings in a broader range of participant populations and at a global scale (Moshontz et al., Reference Moshontz, Campbell, Ebersole, IJzerman, Urry, Forscher and Chartier2018). Then, there is the Crowdsourcing Hypothesis Tests collaboration, which assigned 15 research teams to each design a study targeting the same hypothesis, varying in methods (Landy et al., Reference Landy, Jia, Ding, Viganola, Tierney, Dreber and Uhlmann2020). Moreover, there is a recent trend in behavioral science to run “megastudies,” in which researchers test a large number of treatments in a single study in order to increase the pace and comparability of experimental results (Milkman et al., Reference Milkman, Patel, Gandhi, Graci, Gromet, Ho and Duckworth2021, Reference Milkman, Gandhi, Patel, Graci, Gromet, Ho and Duckworth2022; Voelkel et al., Reference Voelkel, Stagnaro, Chu, Pink, Mernyk, Redekopp and Willer2022).

All of these efforts are laudable and represent substantial methodological advances that we view as complements to, not substitutes for, integrative designs. What is core to the integrative approach is the explicit construction of, sampling from, and building theories upon a design space of experiments. Each ongoing innovation can contribute to the design of integrative experiments in its own way. For example, large-scale collaborative networks such as Many Labs can run integrative experiments together by assigning points in the design space to participating labs. Or in the megastudy research design, the interventions selected by researchers can be explicitly mapped into design spaces and then analyzed in a way that aims to reveal contingencies and generate metatheories of the sort discussed in section 3.3.

5.4. What about unknown unknowns?

There will always be systematic, nontrivial variables that should be represented in the design space but are missing: these are the unknown unknowns. We offer two responses to this challenge.

First, we acknowledge the challenge inherent in the first step of integrative experiment design: Constructing the design space. This construction requires identifying the subset of variables to include from an infinite set of possible variables that could define the design space of experiments within a domain. To illustrate such a process, we discussed the example domain of group synergy (see sect. 3.1). But, of course, we think that the field is wide open, with many options to explore; that the methodological details will depend on the domain of interest; and that best practices will emerge with experience.

Second, although we do not yet know which of the many potentially relevant dimensions should be selected to represent the space, and there are no guarantees that all (or even most) of the selected dimensions will play a role in determining the outcome, the integrative approach can shed light on both issues. On the one hand, experiments that map to the same point in the design space but yield different results indicate that some important dimension is missing from the representation of the space. On the other hand, experiments that systematically vary in the design space but yield similar results could indicate that the dimensions where they differ are irrelevant to the effect of interest and should be collapsed.
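To make the first diagnostic concrete, one minimal way (our illustration, not a prescribed procedure) to flag a missing dimension is to check whether experiments occupying the same coordinates in the design space disagree more than sampling error allows, using a standard heterogeneity statistic.

```python
# A minimal sketch: Cochran's Q test for whether experiments that map to the
# *same* point in the design space disagree more than sampling error allows.
# Significant heterogeneity suggests an unmodeled ("unknown unknown") dimension.
import numpy as np
from scipy.stats import chi2

def heterogeneity_at_point(effects, std_errors):
    """effects, std_errors: estimates from k experiments at one design point."""
    effects = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(std_errors, dtype=float) ** 2   # inverse-variance weights
    pooled = np.sum(w * effects) / np.sum(w)
    Q = np.sum(w * (effects - pooled) ** 2)               # Cochran's Q
    df = len(effects) - 1
    return Q, chi2.sf(Q, df)

# Example: three nominally identical experiments at one design point.
Q, p = heterogeneity_at_point([0.30, 0.05, 0.42], [0.06, 0.07, 0.08])
# A small p suggests that some relevant variable distinguishing these
# experiments is missing from the design-space representation.
```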

5.5. This sounds great in principle but it is impossible to do in practice

Even with an efficient sampling scheme, integrative designs are likely to require a much larger number of experiments than is typical in the one-at-a-time paradigm; therefore, practical implementation is a real concern. However, given recent innovations in virtual lab environments, participant sourcing, mass collaboration mechanisms, and machine-learning methods, the approach is now feasible.

5.5.1. Virtual lab environments

Software packages such as jsPsych (de Leeuw, Reference de Leeuw2015), nodeGame (Balietti, Reference Balietti2017), Dallinger (https://dallinger.readthedocs.io/), Pushkin (Hartshorne, de Leeuw, Goodman, Jennings, & O'Donnell, Reference Hartshorne, de Leeuw, Goodman, Jennings and O'Donnell2019), Hemlock (Bowen, Reference Bowenn.d.), and Empirica (Almaatouq et al., Reference Almaatouq, Becker, Houghton, Paton, Watts and Whiting2021b) support the development of integrative experiments that can systematically cover an experimental design's parameter space with automatically executed conditions. Even with these tools, whose development is ongoing, we believe that one of the most promising, cost-effective ways to accelerate and improve progress in social science is to increase investment in automation (Yarkoni et al., Reference Yarkoni, Eckles, Heathers, Levenstein, Smaldino and Lane2019).
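The following sketch shows, independently of any particular package's API, what systematically covering a design's parameter space with automatically executed conditions can look like; the factor names, levels, and the launch_session call are hypothetical placeholders for whatever deployment interface a given platform provides.

```python
# A minimal sketch of enumerating an experiment's parameter space so that a
# virtual-lab platform can execute the conditions automatically.
from itertools import product

FACTORS = {                      # illustrative design dimensions and levels
    "group_size":    [1, 3, 6, 12],
    "task_type":     ["estimation", "optimization", "idea_generation"],
    "communication": ["none", "chat", "structured"],
    "incentive":     ["flat", "performance"],
}

def enumerate_conditions(factors):
    """Yield one configuration dict per cell of the full factorial design."""
    names = list(factors)
    for values in product(*(factors[n] for n in names)):
        yield dict(zip(names, values))

conditions = list(enumerate_conditions(FACTORS))   # 4 * 3 * 3 * 2 = 72 cells

# In practice the platform, rather than the experimenter, iterates over these
# cells (or a sampled subset of them) and tags every session with its cell, so
# that each observation is indexed by its point in the design space, e.g.:
# for cfg in conditions:
#     launch_session(config=cfg, n_participants=..., replications=...)
```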

5.5.2. Recruiting participants

Another logistical challenge to integrative designs is that adequately sampling the space of experiments will typically require a large participant pool from which the experimenter can draw, often repeatedly. As it stands, the most common means of recruiting participants online involves crowdsourcing platforms (Horton, Rand, & Zeckhauser, Reference Horton, Rand and Zeckhauser2011; Mason & Suri, Reference Mason and Suri2012). The large-scale risky-choice dataset described above, for example, used this approach to collect its 10,000 pairs of gambles (Bourgin et al., Reference Bourgin, Peterson, Reichman, Russell, Griffiths, Chaudhuri and Salakhutdinov2019). However, popular crowdsourcing platforms such as Amazon Mechanical Turk (Litman, Robinson, & Abberbock, Reference Litman, Robinson and Abberbock2017) were designed for basic labeling tasks, which can be performed by a single person and require low levels of effort. And the crowdworkers performing the tasks may have widely varying levels of commitment and produce work of varying quality (Goodman, Cryder, & Cheema, Reference Goodman, Cryder and Cheema2013). Researchers are prevented by Amazon's terms of use from knowing whether crowdworkers have participated in similar experiments in the past, possibly as professional study participants (Chandler, Mueller, & Paolacci, Reference Chandler, Mueller and Paolacci2014). To accommodate behavioral research's special requirements, Prolific and other services (Palan & Schitter, Reference Palan and Schitter2018) have made changes to the crowdsourcing model, such as by giving researchers greater control over how participants are sampled and over the quality of their work.

Larger, more diverse volunteer populations are also possible to recruit, as the Moral Machine experiment exemplifies. In the first 18 months after deployment, that team gathered more than 40 million moral judgments from over 4 million unique participants in 233 countries and territories (Awad, Dsouza, Bonnefon, Shariff, & Rahwan, Reference Awad, Dsouza, Bonnefon, Shariff and Rahwan2020). Recruiting such large sample sizes from volunteers is appealing; however, success with such recruitment requires participant-reward strategies like gamification or personalized feedback (Hartshorne et al., Reference Hartshorne, de Leeuw, Goodman, Jennings and O'Donnell2019; Li, Germine, Mehr, Srinivasan, & Hartshorne, Reference Li, Germine, Mehr, Srinivasan and Hartshorne2022). Thus, it has been hard to generalize the model to other important research questions and experiments, particularly when taking part in the experiment does not appear to be fun or interesting. Moreover, such large-scale data collection using viral platforms such as the Moral Machine may require some flexibility from Institutional Review Boards (IRBs), as they resemble software products that are open to consumers more than they do closed experiments that recruit from well-organized, intentional participant pools. In the Moral Machine experiment, for example, the MIT IRB approved pushing the consent to an “opt-out” option at the end, rather than obtaining consent prior to participation in the experiment, as the latter would have significantly increased participant attrition (Awad et al., Reference Awad, Dsouza, Kim, Schulz, Henrich, Shariff and Rahwan2018).

5.5.3. Mass collaboration

Obtaining a sufficiently large sample may require leveraging emerging forms of organizing research in the behavioral and social sciences, such as distributed collaborative networks of laboratories (Moshontz et al., Reference Moshontz, Campbell, Ebersole, IJzerman, Urry, Forscher and Chartier2018). As we discussed earlier, in principle, large-scale collaborative networks can cooperatively run integrative experiments by assigning points in the design space to participating labs.

5.5.4. Machine learning

The physical and life sciences have benefited greatly from machine learning. Astrophysicists use image-classification systems to interpret the massive amounts of data recorded by their telescopes (Shallue & Vanderburg, Reference Shallue and Vanderburg2018). Life scientists use statistical methods to reconstruct phylogeny from DNA sequences and use neural networks to predict the folded structure of proteins (Jumper et al., Reference Jumper, Evans, Pritzel, Green, Figurnov, Ronneberger and Hassabis2021). Experiments in the social and behavioral sciences, in contrast, have had relatively few new methodological breakthroughs related to these technologies. While social and behavioral scientists in general have embraced “big data” and machine learning, their focus to date has largely been on nonexperimental data.Footnote 14 In contrast, the current scale of experiments in the experimental social and behavioral sciences does not typically produce data at the volumes necessary for machine-learning models to yield substantial benefits over traditional methods.

Integrative experiments offer several new opportunities for machine-learning methods to be used to facilitate social and behavioral science. First, by producing larger datasets – either within a single experiment or across multiple integrated experiments in the same design space – the approach makes it possible to use a wider range of machine-learning methods, particularly ones less constrained by existing theories. This advantage is illustrated by the work of Peterson et al. (Reference Peterson, Bourgin, Agrawal, Reichman and Griffiths2021), whose neural network models were trained on human choice data to explore the implications of different theoretical assumptions for predicting decisions. Second, these methods can play a valuable role in helping scientists make sense of the many factors that potentially influence behavior in these larger datasets, as in Agrawal et al.'s (Reference Agrawal, Peterson and Griffiths2020) analysis of the Moral Machine data. Finally, machine-learning techniques are a key part of designing experiments that efficiently explore large design spaces, as they are used to define surrogate models that are the basis for active sampling methods.
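To illustrate the final point, a surrogate-model loop can be as simple as the sketch below, which pairs a Gaussian-process surrogate with uncertainty sampling; the numeric encoding of design points, the sampling budget, and the run_experiment stub are assumptions for illustration, not a specific platform's interface or the only choice of surrogate or acquisition rule.

```python
# A minimal sketch of surrogate-based active sampling over a design space:
# fit a Gaussian process to the experiments run so far, then run the next
# experiment where the surrogate is most uncertain.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def active_sampling(candidate_points, run_experiment, n_initial=10, budget=50):
    """candidate_points: (n, d) array of numerically encoded design points.
    run_experiment: callable that runs one experiment and returns an effect size."""
    rng = np.random.default_rng(0)
    idx = rng.choice(len(candidate_points), size=n_initial, replace=False)
    X = candidate_points[idx]
    y = np.array([run_experiment(x) for x in X])     # observed outcomes

    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(budget - n_initial):
        gp.fit(X, y)
        _, std = gp.predict(candidate_points, return_std=True)
        nxt = int(np.argmax(std))                    # most uncertain point
        X = np.vstack([X, candidate_points[nxt]])
        y = np.append(y, run_experiment(candidate_points[nxt]))
    return gp, X, y
```

Swapping the acquisition rule (e.g., expected improvement instead of maximum uncertainty) or the surrogate class changes what the loop prioritizes, but the structure is the same: the surrogate summarizes what is known so far, and the sampler spends the experimental budget where that summary is weakest.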

5.6. Even if such experiments are possible, costs will be prohibitive

It is true that integrative experiments are more expensive to run than individual one-at-a-time experiments, which may partly explain why the former have not yet become more popular. However, this comparison is misleading because it ignores the cost of human capital in generating scientific insight. Assume that a typical experimental paper in the social and behavioral sciences reflects on the order of $100,000 of labor costs in the form of graduate students or postdocs designing and running the experiment, analyzing the data, and writing up the results. Under the one-at-a-time approach, such a paper typically contains just one or at most a handful of experiments. The next paper builds upon the previous results and the process repeats. With hundreds of articles published over a few decades, the cumulative cost of a research program that explores roughly 100 points in the implicit design space easily reaches tens of millions of dollars.

Of those tens of millions of dollars, a tiny fraction – on the order of $1,000 per paper, or $100,000 per research program (<1%) – is spent on data collection. If instead researchers conducted a single integrative experiment that covered the entire design space, they could collect all the data produced by the entire research program and then some. Even if this effort explored the design space significantly less efficiently than the traditional research program, requiring 10 times more data, data collection would cost about $1,000,000 (<10%). This is a big financial commitment, but the labor costs for interpreting these data do not scale with the amount of data. So, even if researchers needed to commit 10 times as much labor as for a typical research paper, they would have discovered everything an entire multidecade research program would uncover in a single study costing only $2,000,000.
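The figures above, written out explicitly (all values are the rough assumptions stated in the text, not empirical estimates):

```python
# Back-of-the-envelope comparison of the two research strategies.
points_explored   = 100        # design-space points explored by the program
labor_per_paper   = 100_000    # $ labor per experimental paper
data_per_paper    = 1_000      # $ data collection per paper

program_cost = points_explored * (labor_per_paper + data_per_paper)
# ~ $10M with one paper per point; tens of millions once multiple papers,
# replications, and follow-ups accumulate around the same points.

integrative_data  = 10 * points_explored * data_per_paper   # 10x less efficient sampling
integrative_labor = 10 * labor_per_paper                    # 10x a single paper's labor
integrative_cost  = integrative_data + integrative_labor    # = $2,000,000
```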

The cost–benefit ratio of integrative experiments is hence at least an order of magnitude better than that of one-at-a-time experiments.Footnote 15 Pinching pennies on data collection results in losing dollars (and time and effort) in labor. If anything, when considered in aggregate, the efficiency gains of the integrative approach will be substantially greater than this back-of-the-envelope calculation suggests. As an institution, the social and behavioral sciences have spent tens of billions of dollars during the past half-century.Footnote 16 With integrative designs, a larger up-front investment can save decades of unfruitful investigation and instead realize grounded, systematic results.

5.7. Does this mean that small labs can't participate?

Although the high up-front costs of designing and running an integrative experiment may seem to exclude small labs as well as principal investigators (PIs) from low-resource institutions, we anticipate that the integrative approach will actually broaden the range of people involved in behavioral research. The key insight here is that the methods and infrastructure needed to run integrative experiments are inherently shareable. Thus, while the development costs are indeed high, once the infrastructure has been built, the marginal costs of using it are low – potentially even lower than running a single, one-at-a-time experiment. As long as funding for the necessary technical infrastructure is tied to a requirement for sustaining collaborative research (as discussed in previous sections), it will create opportunities for a wider range of scientists to be involved in integrative projects and for researchers at smaller or undergraduate-focused institutions to participate in ambitious research efforts.

Moreover, research efforts in other fields illustrate how labs of different sizes can make different kinds of contributions. In biology and physics, some groups of scientists form consortia that work together to define a large-scale research agenda and seek the necessary funding (as described earlier, several thriving experimental consortia in the behavioral sciences illustrate this possibility). Other groups develop theory by digging deeper into the data produced by these large-scale efforts to make discoveries they may not have imagined when the data were first collected; some scientists focus on answering questions that do not require large-scale studies, such as the properties of specific organisms or materials that can be easily studied in a small lab; still other researchers conduct exploratory work to identify the variables or theoretical principles that may be considered in future large-scale studies. We envision a similar ecosystem for the future of the behavioral sciences.

5.8. Shouldn't the replication crisis be resolved first?

The replication crisis in the behavioral sciences has led to much reflection about research methods and substantial efforts to conduct more-replicable research (Freese & Peterson, Reference Freese and Peterson2017). We view our proposal as consistent with these goals, but with a different emphasis than replication. To some extent, our emphasis is complementary to replication and can be pursued in parallel with it, but it may suggest a different allocation of resources than a “replication first” approach.

Taking the complementary role first: integrative experiments naturally support replicable science. Because choices about nuisance variables are rarely documented systematically in the one-at-a-time paradigm, it is not generally possible to establish how similar or different two experiments are. This observation may account for some recently documented replication failures (Camerer et al., Reference Camerer, Dreber, Holzmeister, Ho, Huber, Johannesson and Wu2018; Levinthal & Rosenkopf, Reference Levinthal and Rosenkopf2021). While the replication debate has focused on shoddy research practices (e.g., p-hacking) and bad incentives (e.g., journals rewarding “positive, novel, and exciting” results), another possible cause of nonreplication is that the replicating experiment is in fact sufficiently dissimilar to the original (usually as a result of different choices of nuisance parameters) that one should not expect the result to replicate (Muthukrishna & Henrich, Reference Muthukrishna and Henrich2019; Yarkoni, Reference Yarkoni2022). In other words, without operating within a space that makes experiments commensurable, failures to replicate previous findings are never conclusive, because doubt remains as to whether one of the many possible moderator variables explains the lack of replication (Cesario, Reference Cesario2014). Regardless of whether an experimental finding's fragility to (supposedly) theoretically irrelevant parameters should be considered a legitimate defense of the finding, the difficulty of resolving such arguments further illustrates the need for a more explicit articulation of theoretical scope conditions.

The integrative approach, accepting that treatment effects vary across conditions, also suggests that directing massive resources toward replicating existing effects may not be the best way to help our fields advance. Because those historical effects were discovered under the one-at-a-time approach, the experiments that produced them evaluated only specific points in the design space. Consistent with the argument above, rather than trying to reproduce those points exactly (via “direct” replications), a better use of resources would be to sample the design space more extensively and to use continuous measures to compare different studies (Gelman, Reference Gelman2018). In this way, researchers can not only discover whether historical effects replicate, but also draw stronger conclusions about whether (and to what extent) they generalize.

5.9. This proposal is incompatible with incentives in the social and behavioral sciences

Science does not occur in a vacuum. Scientists are constantly evaluated by their peers as they submit papers for publication, seek funding, apply for jobs, and pursue promotions. For the integrative approach to become widespread, it must be compatible with the incentives of individual behavioral scientists, including early career researchers. Given the current priority that hiring, tenure & promotion, and awards committees in the social and behavioral sciences place on identifiable individual contributions (e.g., lead authorship of scholarly works, perceived “ownership” of distinct programs of research, leadership positions, etc.), a key pragmatic concern is that the large-scale collaborative nature of integrative research designs might make them less rewarding than the one-at-a-time paradigm for anyone other than the project leaders.

Although a shift to large-scale, collaborative science does indeed present an adoption challenge, it is encouraging to note that even more dramatic shifts have taken place in other fields. In physics, for example, some of the most important results in recent decades – the discovery of the Higgs Boson (Aad et al., Reference Aad, Abajyan, Abbott, Abdallah, Abdel Khalek, Abdelalim and Zwalinski2012), gravitational waves (Abbott et al., Reference Abbott, Abbott, Abbott, Abernathy, Acernese and Ackley2016), and so on – have been obtained via collaborations of thousands of researchers.Footnote 17 To ensure that junior team members are rewarded for their contributions, many collaborations maintain “speaker lists” that prominently feature early career researchers, offering them a chance to appear as the face of the collaboration. When these researchers apply for jobs or are considered for promotion, the leader of the collaboration writes a letter of recommendation that describes the scientists' role in the collaboration and why their work is significant. A description of such roles can also be included directly in manuscripts through the Contributor Roles Taxonomy (Allen, Scott, Brand, Hlava, & Altman, Reference Allen, Scott, Brand, Hlava and Altman2014), a high-level taxonomy with 14 roles that describe typical contributions to scholarly output; the taxonomy has been adopted as an American National Standards Institute (ANSI)/National Information Standards Organization (NISO) standard and is beginning to see uptake (National Information Standards Organization, 2022). Researchers who participate substantially in creating the infrastructure used by a collaborative effort can receive “builder” status, appearing as coauthors on subsequent publications that use that infrastructure. Many collaborations also have mentoring plans designed to support early career researchers. Together, these mechanisms are intended to make participation in large collaborations attractive to a wide range of researchers at various career stages. While acknowledging that physics differs in many ways from the social and behavioral sciences, we nonetheless believe that the model of large collaborative research efforts can take root in the latter. Indeed, we have already noted the existence of several large collaborations in the behavioral sciences that appear to have been successful in attracting participation from small labs and early career researchers.

6. Conclusion

The widespread approach of designing experiments one-at-a-time – under different conditions with different participant pools, and with nonstandardized methods and reporting – is problematic because it is at best an inefficient way to accumulate knowledge, and at worst it fails to produce consistent, cumulative knowledge. The problem clearly will not be solved by increasing sample sizes, focusing on effect sizes rather than statistical significance, or replicating findings with preregistered designs. We instead need a fundamental shift in how to think about theory construction and testing.

We describe one possible approach, one that promotes commensurability and continuous integration of knowledge by design. In this “integrative” approach, experiments would not just evaluate a few hypotheses but would explore and integrate over a wide range of conditions that deserve explanation by all pertinent theories. Although this kind of experiment may strike many as atheoretical, we believe the one-at-a-time approach owes its dominance not to any particular virtues of theory construction and evaluation but rather to the historical emergence of experimental methods under a particular set of physical and logistical constraints. Over time, generations of researchers have internalized these features to such an extent that they are thought to be inseparable from sound scientific practice. Therefore, the key to realizing our proposed type of reform – and to making it productive and useful – is not only technical, but also cultural and institutional.

Acknowledgments

We owe an important debt to Saul Perlmutter, Serguei Saavedra, Matthew J. Salganik, Gary King, Todd Gureckis, Alex “Sandy” Pentland, Thomas W. Malone, David G. Rand, Iyad Rahwan, Ray E. Reagans, and the members of the MIT Behavioral Lab and the UPenn Computational Social Science Lab for valuable discussions and comments. This article also benefited from conversations with dozens of people at two workshops: (1) “Scaling Cognitive Science” at Princeton University in December 2019, and (2) “Scaling up Experimental Social, Behavioral, and Economic Science” at the University of Pennsylvania in January 2020.

Financial support

This work was supported in part by the Alfred P. Sloan Foundation (2020-13924) and the NOMIS Foundation.

Competing interest

None.

Footnotes

1. Although we restrict the focus of our discussion to lab experiments in the social and behavioral sciences, with which we are most familiar, we expect that our core arguments generalize well to other modes of inquiry and adjacent disciplines.

2. By analogy, we note that for almost as long as p-values have been used as a standard of evidence in the social and behavioral sciences, critics have argued that they are somewhere between insufficient and meaningless (Cohen, Reference Cohen1994; Dienes, Reference Dienes2008; Gelman & Carlin, Reference Gelman and Carlin2017; Meehl, Reference Meehl1990a). Yet, in the absence of an equally formulaic alternative, p-value analysis remains pervasive (Benjamin et al., Reference Benjamin, Berger, Johannesson, Nosek, Wagenmakers, Berk and Camerer2018).

3. Nor do recent proposals to improve the replicability and reproducibility of scientific results (Gelman & Loken, Reference Gelman and Loken2014; Ioannidis, Reference Ioannidis2005; Munafò et al., Reference Munafò, Nosek, Bishop, Button, Chambers, du Sert and Ioannidis2017; Open Science Collaboration, 2015; Simmons, Nelson, & Simonsohn, Reference Simmons, Nelson and Simonsohn2011) address the problem. While these proposals are worthy, their focus is on individual results, not on how collections of results fit together.

4. We also note that in an alternative formulation of the design space, all variables (including what one would think of as experimental manipulations) are included as dimensions of the design space and the focal experimental manipulation is represented as a comparison across two or more points in the space. Some of the examples described in section 4 are more readily expressed in one formulation, whereas others are more readily expressed in the other. They are equivalent: It is possible to convert from one to the other without any loss of information.

5. To illustrate with another example, cultural psychologists such as Hofstede (Reference Hofstede2016), Inglehart and Welzel (Reference Inglehart and Welzel2005), and Schwartz (Reference Schwartz2006) identified cultural dimensions along which groups differ, which then can be used to define distance measures between populations and to guide researchers in deciding where to target their data-collection efforts (Muthukrishna et al., Reference Muthukrishna, Bell, Henrich, Curtin, Gedranovich, McInerney and Thue2020). Another example of this exercise is the extensive breakdown of the “auction design space” by Wurman, Wellman, and Walsh (Reference Wurman, Wellman and Walsh2001), which captures the essential similarities and differences of many auction mechanisms in a format more descriptive and useful than simple taxonomies and serves as an organizational framework for classifying work within the field.

6. Active learning is also called “query learning” or sometimes “sequential optimal experimental design” in the statistics literature.

7. Active learning has recently become an important tool for optimizing experiments in other fields, such as tuning machine-learning hyperparameters (Snoek, Larochelle, & Adams, Reference Snoek, Larochelle and Adams2012), designing materials and mechanical structures (Burger et al., Reference Burger, Maffettone, Gusev, Aitchison, Bai, Wang and Cooper2020; Gongora et al., Reference Gongora, Xu, Perry, Okoye, Riley, Reyes and Brown2020; Lei et al., Reference Lei, Kirk, Bhattacharya, Pati, Qian, Arroyave and Mallick2021), and screening chemical reactions (Eyke, Green, & Jensen, Reference Eyke, Green and Jensen2020, Reference Eyke, Koscher and Jensen2021; Shields et al., Reference Shields, Stevens, Li, Parasram, Damani, Alvarado and Doyle2021), to mention just a few.

8. For example, surrogate models can be probabilistic models (e.g., a Gaussian process) as well as nonprobabilistic (e.g., neural networks, tree-based methods), while sampling strategies can include uncertainty sampling, greedy sampling, and distance-based sampling.

9. Popular active learning libraries for experiments include Ax (Bakshy et al., Reference Bakshy, Dworkin, Karrer, Kashin, Letham, Murthy and Singh2018), BoTorch (Balandat et al., Reference Balandat, Karrer, Jiang, Daulton, Letham, Wilson and Bakshy2020), and GPflowOpt (Knudde, van der Herten, Dhaene, & Couckuyt, Reference Knudde, van der Herten, Dhaene and Couckuyt2017).

11. Given that the data from the integrative approach are generated independent of the current set of theories in the field, the resulting data are potentially informative not just about those theories, but about theories that are yet to be proposed. As a consequence, data generated by this integrative approach are intended to have greater longevity than data generated by “one-at-a-time” experiments.

12. Another explanation for the inability to make accurate predictions is that the majority of dimensions defining the design space are uninformative and need to be reconsidered.

13. For a more comprehensive list, see Uhlmann et al. (Reference Uhlmann, Ebersole, Chartier, Errington, Kidwell, Lai and Nosek2019).

14. For example, the CHILDES dataset of child-directed speech (MacWhinney, Reference MacWhinney2014) has had a significant impact on studies of language development, and census data, macroeconomic data, and other large datasets (e.g., from social media and e-commerce platforms) are increasingly prevalent in political science, sociology, and economics.

15. This shift has already occurred in some areas. For example, the cognitive neuroscience field has been transformed in the past few decades by the availability of increasingly effective methods for brain imaging. Researchers now take for granted that data collection costs tens or hundreds of thousands of dollars and that the newly required equipment and other infrastructure for this kind of research costs millions of dollars – that is, they now budget more for data collection than for hiring staff. Unlocking the full potential of our envisioned integrative approach will require similarly new, imaginative ways of allocating resources and a willingness to spend money on generating more-definitive, reusable datasets (Griffiths, Reference Griffiths2015).

16. The budget associated with the NSF Directorate for Social, Behavioral, and Economic Sciences alone is roughly 5 billion dollars over the past two decades and, by its 2022 estimate, accounts for “approximately 65 percent of the federal funding for basic research at academic institutions in the social, behavioral, and economic sciences” (National Science Foundation, 2022). Extending the time range to 50 years and accounting for sources of funding beyond the US federal government, including all other governments, private foundations, corporations, and direct funding from universities, brings our estimate to tens of billions of dollars.

17. We thank Saul Perlmutter for sharing his perspective on how issues of incentives are addressed in physics, drawing on his experience in particle physics and cosmology.

References

Aad, G., Abajyan, T., Abbott, B., Abdallah, J., Abdel Khalek, S., Abdelalim, A. A., … Zwalinski, L. (2012). Observation of a new particle in the search for the Standard Model Higgs Boson with the ATLAS detector at the LHC. Physics Letters, Part B, 716(1), 1–29.
Abbott, B. P., Abbott, R., Abbott, T. D., Abernathy, M. R., Acernese, F., Ackley, K., … LIGO Scientific Collaboration and Virgo Collaboration. (2016). Observation of gravitational waves from a binary black hole merger. Physical Review Letters, 116(6), 061102.
Aggarwal, I., & Woolley, A. W. (2018). Team creativity, cognition, and cognitive style diversity. Management Science, 65(4), 1586–1599. https://doi.org/10.1287/mnsc.2017.3001
Agrawal, M., Peterson, J. C., & Griffiths, T. L. (2020). Scaling up psychology via scientific regret minimization. Proceedings of the National Academy of Sciences of the United States of America, 117(16), 8825–8835.
Allen, L., Scott, J., Brand, A., Hlava, M., & Altman, M. (2014). Publishing: Credit where credit is due. Nature, 508(7496), 312–313.
Allen, N. J., & Hecht, T. D. (2004). The “romance of teams”: Toward an understanding of its psychological underpinnings and implications. Journal of Occupational and Organizational Psychology, 77(4), 439–461.
Allport, F. H. (1924). The group fallacy in relation to social science. The American Journal of Sociology, 29(6), 688–706.
Almaatouq, A. (2019). Towards stable principles of collective intelligence under an environment-dependent framework. Massachusetts Institute of Technology. https://dspace.mit.edu/handle/1721.1/123223?show=full
Almaatouq, A., Alsobay, M., Yin, M., & Watts, D. J. (2021a). Task complexity moderates group synergy. Proceedings of the National Academy of Sciences of the United States of America, 118(36), e2101062118. https://doi.org/10.1073/pnas.2101062118
Almaatouq, A., Becker, J., Houghton, J. P., Paton, N., Watts, D. J., & Whiting, M. E. (2021b). Empirica: A virtual lab for high-throughput macro-level experiments. Behavior Research Methods, 53, 2158–2171. https://doi.org/10.3758/s13428-020-01535-9
Almaatouq, A., Noriega-Campero, A., Alotaibi, A., Krafft, P. M., Moussaid, M., & Pentland, A. (2020). Adaptive social networks promote the wisdom of crowds. Proceedings of the National Academy of Sciences of the United States of America, 117(21), 11379–11386.
Almaatouq, A., Rahimian, M. A., Burton, J. W., & Alhajri, A. (2022). The distribution of initial estimates moderates the effect of social influence on the wisdom of the crowd. Scientific Reports, 12(1), 16546.
Many Primates, Altschul, D. M., Beran, M. J., Bohn, M., Call, J., DeTroy, S., Duguid, S. J., … Watzek, J. (2019). Establishing an infrastructure for collaboration in primate cognition research. PLoS ONE, 14(10), e0223675.
Arrow, H., McGrath, J. E., & Berdahl, J. L. (2000). Small groups as complex systems: Formation, coordination, development, and adaptation. Sage.
Atkinson, A. C., & Donev, A. N. (1992). Optimum experimental designs (Oxford statistical science series, 8) (1st ed.). Clarendon Press.
Aumann, R. J., & Hart, S. (1992). Handbook of game theory with economic applications. Elsevier.
Auspurg, K., & Hinz, T. (2014). Factorial survey experiments. Sage.
Awad, E., Dsouza, S., Bonnefon, J.-F., Shariff, A., & Rahwan, I. (2020). Crowdsourcing moral machines. Communications of the ACM, 63(3), 48–55.
Awad, E., Dsouza, S., Kim, R., Schulz, J., Henrich, J., Shariff, A., … Rahwan, I. (2018). The Moral Machine experiment. Nature, 563(7729), 59–64.
Bakshy, E., Dworkin, L., Karrer, B., Kashin, K., Letham, B., Murthy, A., & Singh, S. (2018). AE: A domain-agnostic platform for adaptive experimentation. Workshop on System for ML. http://learningsys.org/nips18/assets/papers/87CameraReadySubmissionAE%20-%20NeurIPS%202018.pdf
Balandat, M., Karrer, B., Jiang, D. R., Daulton, S., Letham, B., Wilson, A. G., & Bakshy, E. (2020). BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS'20) (pp. 21524–21538). Curran Associates Inc.
Balietti, S. (2017). NodeGame: Real-time, synchronous, online experiments in the browser. Behavior Research Methods, 49(5), 1696–1715.
Balietti, S., Klein, B., & Riedl, C. (2021). Optimal design of experiments to identify latent behavioral types. Experimental Economics, 24, 772–799. https://doi.org/10.1007/s10683-020-09680-w
Baribault, B., Donkin, C., Little, D. R., Trueblood, J. S., Oravecz, Z., van Ravenzwaaij, D., … Vandekerckhove, J. (2018). Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences of the United States of America, 115(11), 2607–2612.
Barron, B. (2003). When smart groups fail. Journal of the Learning Sciences, 12(3), 307–359.
Becker, J., Brackbill, D., & Centola, D. (2017). Network dynamics of social influence in the wisdom of crowds. Proceedings of the National Academy of Sciences of the United States of America, 114(26), E5070–E5076.
Bell, S. T. (2007). Deep-level composition variables as predictors of team performance: A meta-analysis. The Journal of Applied Psychology, 92(3), 595–615.
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., … Camerer, C. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6–10. https://doi.org/10.1038/s41562-017-0189-z
Berkman, E. T., & Wilson, S. M. (2021). So useful as a good theory? The practicality crisis in (social) psychological theory. Perspectives on Psychological Science, 16(4), 864–874. https://doi.org/10.1177/1745691620969650
Bertrand, M., & Mullainathan, S. (2004). Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. The American Economic Review, 94(4), 991–1013.
Bourgin, D. D., Peterson, J. C., Reichman, D., Russell, S. J., & Griffiths, T. L. (2019). Cognitive model priors for predicting human decisions. In Chaudhuri, K. & Salakhutdinov, R. (Eds.), Proceedings of the 36th international conference on machine learning (Vol. 97, pp. 5133–5141). PMLR.
Bowen, D. (n.d.). Hemlock. Retrieved April 22, 2022, from https://dsbowen.gitlab.io/hemlock
Brewin, C. R. (2022). Impact on the legal system of the generalizability crisis in psychology. The Behavioral and Brain Sciences, 45, e7.
Breznau, N., Rinke, E. M., Wuttke, A., Nguyen, H. H. V., Adem, M., Adriaans, J., … Żółtak, T. (2022). Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. Proceedings of the National Academy of Sciences of the United States of America, 119(44), e2203150119.
Brunswik, E. (1947). Systematic and representative design of psychological experiments. In Proceedings of the Berkeley symposium on mathematical statistics and probability (pp. 143–202). University of California Press.
Brunswik, E. (1955). Representative design and probabilistic theory in a functional psychology. Psychological Review, 62(3), 193–217.
Burger, B., Maffettone, P. M., Gusev, V. V., Aitchison, C. M., Bai, Y., Wang, X., … Cooper, A. I. (2020). A mobile robotic chemist. Nature, 583(7815), 237–241.
Byers-Heinlein, K., Bergmann, C., Davies, C., Frank, M. C., Kiley Hamlin, J., Kline, M., … Soderstrom, M. (2020). Building a collaborative psychological science: Lessons learned from ManyBabies 1. Canadian Psychology/Psychologie Canadienne, 61(4), 349–363. https://doi.org/10.1037/cap0000216
Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., … Wu, H. (2018). Evaluating the replicability of social science experiments in nature and science between 2010 and 2015. Nature Human Behaviour, 2(9), 637–644.
Carter, E. C., Schönbrodt, F. D., Gervais, W. M., & Hilgard, J. (2019). Correcting for bias in psychology: A comparison of meta-analytic methods. Advances in Methods and Practices in Psychological Science, 2(2), 115–144.
Cesario, J. (2014). Priming, replication, and the hardest science. Perspectives on Psychological Science, 9(1), 40–48. https://doi.org/10.1177/1745691613513470
Cesario, J. (2022). What can experimental studies of bias tell us about real-world group disparities? Behavioral and Brain Sciences, 45, E66. https://doi.org/10.1017/S0140525X21000017
Chandler, J., Mueller, P., & Paolacci, G. (2014). Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behavior Research Methods, 46(1), 112–130.
Cohen, J. (1994). The earth is round (p<.05). The American Psychologist, 49(12), 997.
Cooper, H., Hedges, L. V., & Valentine, J. C. (Eds.) (2019). The handbook of research synthesis and meta-analysis. Russell Sage Foundation.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302.
Debrouwere, S., & Rosseel, Y. (2022). The conceptual, cunning and conclusive experiment in psychology. Perspectives on Psychological Science, 17(3), 852–862. https://doi.org/10.1177/17456916211026947
DeKay, M. L., Rubinchik, N., Li, Z., & De Boeck, P. (2022). Accelerating psychological science with metastudies: A demonstration using the risky-choice framing effect. Perspectives on Psychological Science, 17(6), 1704–1736. https://doi.org/10.1177/17456916221079611
de Leeuw, J. R. (2015). JsPsych: A JavaScript library for creating behavioral experiments in a web browser. Behavior Research Methods, 47(1), 1–12.
de Leeuw, J. R., Motz, B. A., Fyfe, E. R., Carvalho, P. F., & Goldstone, R. L. (2022). Generalizability, transferability, and the practice-to-practice gap. The Behavioral and Brain Sciences, 45, e11.
Devine, D. J., Clayton, L. D., Dunford, B. B., Seying, R., & Pryce, J. (2001). Jury decision making: 45 years of empirical research on deliberating groups. Psychology, Public Policy, and Law, 7(3), 622–727.
Devine, D. J., & Philips, J. L. (2001). Do smarter teams do better: A meta-analysis of cognitive ability and team performance. Small Group Research, 32(5), 507–532.
Dienes, Z. (2008). Understanding psychology as a science: An introduction to scientific and statistical inference. Macmillan.
Dubova, M., Moskvichev, A., & Zollman, K. (2022). Against theory-motivated experimentation in science. MetaArXiv. June 24. https://doi.org/10.31222/osf.io/ysv2u
Ebersole, C. R., Atherton, O. E., Belanger, A. L., Skulborstad, H. M., Allen, J. M., Banks, J. B., … Nosek, B. A. (2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68–82.
Ellemers, N., & Rink, F. (2016). Diversity in work groups. Current Opinion in Psychology, 11, 49–53.
Engel, D., Woolley, A. W., Jing, L. X., Chabris, C. F., & Malone, T. W. (2014). Reading the mind in the eyes or reading between the lines? Theory of mind predicts collective intelligence equally well online and face-to-face. PLoS ONE, 9(12), e115212.
Erev, I., Ert, E., Plonsky, O., Cohen, D., & Cohen, O. (2017). From anomalies to forecasts: Toward a descriptive model of decisions under risk, under ambiguity, and from experience. Psychological Review, 124(4), 369–409.
Eyke, N. S., Green, W. H., & Jensen, K. F. (2020). Iterative experimental design based on active machine learning reduces the experimental burden associated with reaction screening. Reaction Chemistry & Engineering, 5(10), 1963–1972.
Eyke, N. S., Koscher, B. A., & Jensen, K. F. (2021). Toward machine learning-enhanced high-throughput experimentation. Trends in Chemistry, 3(2), 120132.CrossRefGoogle Scholar
Fehr, E., & Gachter, S. (2000). Cooperation and punishment in public goods experiments. The American Economic Review, 90(4), 980994.CrossRefGoogle Scholar
Freese, J., & Peterson, D. (2017). Replication in social science. Annual Review of Sociology, 43, 147–165. https://doi.org/10.1146/annurev-soc-060116-053450
Fyfe, E. R., de Leeuw, J. R., Carvalho, P. F., Goldstone, R. L., Sherman, J., Admiraal, D., … Motz, B. A. (2021). ManyClasses 1: Assessing the generalizable effect of immediate feedback versus delayed feedback across many college classes. Advances in Methods and Practices in Psychological Science, 4(3), 25152459211027575.
Gale, D., & Shapley, L. S. (1962). College admissions and the stability of marriage. The American Mathematical Monthly, 69(1), 9–15.
Gelman, A. (2018). Don't characterize replications as successes or failures. The Behavioral and Brain Sciences, 41, e128.
Gelman, A., & Carlin, J. (2017). Some natural solutions to the p-value communication problem – and why they won't work. Journal of the American Statistical Association, 112(519), 899–901.
Gelman, A., & Loken, E. (2014). The statistical crisis in science: Data-dependent analysis – a "garden of forking paths" – explains why many statistically significant comparisons don't hold up. American Scientist, 102(6), 460.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58.
Gongora, A. E., Xu, B., Perry, W., Okoye, C., Riley, P., Reyes, K. G., … Brown, K. A. (2020). A Bayesian experimental autonomous researcher for mechanical design. Science Advances, 6(15), eaaz1708.
Goodman, J. K., Cryder, C. E., & Cheema, A. (2013). Data collection in a flat world: The strengths and weaknesses of Mechanical Turk samples. Journal of Behavioral Decision Making, 26(3), 213–224.
Greenhill, S., Rana, S., Gupta, S., Vellanki, P., & Venkatesh, S. (2020). Bayesian optimization for adaptive experimental design: A review. IEEE Access, 8, 13937–13948.
Griffiths, T. L. (2015). Manifesto for a new (computational) cognitive revolution. Cognition, 135, 21–23.
Grubbs, J. B. (2022). The cost of crisis in clinical psychological science. The Behavioral and Brain Sciences, 45, e18.
Hackman, J. R. (1968). Effects of task characteristics on group products. Journal of Experimental Social Psychology, 4(2), 162–187.
Harkins, S. G. (1987). Social loafing and social facilitation. Journal of Experimental Social Psychology, 23(1), 1–18.
Hartshorne, J. K., de Leeuw, J. R., Goodman, N. D., Jennings, M., & O'Donnell, T. J. (2019). A thousand studies for the price of one: Accelerating psychological science with Pushkin. Behavior Research Methods, 51(4), 1782–1803. https://doi.org/10.3758/s13428-018-1155-z
Henrich, J., Heine, S., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33(2–3), 61–83. https://doi.org/10.1017/S0140525X0999152X
Higgins, J. P. T., Thompson, S. G., Deeks, J. J., & Altman, D. G. (2003). Measuring inconsistency in meta-analyses. BMJ, 327(7414), 557–560.
Hill, G. W. (1982). Group versus individual performance: Are N + 1 heads better than one? Psychological Bulletin, 91(3), 517–539.
Hofman, J. M., Sharma, A., & Watts, D. J. (2017). Prediction and explanation in social systems. Science (New York, N.Y.), 355(6324), 486–488.
Hofman, J. M., Watts, D. J., Athey, S., Garip, F., Griffiths, T. L., Kleinberg, J., … Yarkoni, T. (2021). Integrating explanation and prediction in computational social science. Nature, 595(7866), 181–188.
Hofstede, G. (2016). Culture's consequences: Comparing values, behaviors, institutions, and organizations across nations (2nd ed.). Collegiate Aviation Review, 34(2), 108–109. Retrieved from https://www.proquest.com/scholarly-journals/cultures-consequences-comparing-values-behaviors/docview/1841323332/se-2
Hong, L., & Page, S. E. (2004). Groups of diverse problem solvers can outperform groups of high-ability problem solvers. Proceedings of the National Academy of Sciences of the United States of America, 101(46), 16385–16389.
Horton, J. J., Rand, D. G., & Zeckhauser, R. J. (2011). The online laboratory: Conducting experiments in a real labor market. Experimental Economics, 14(3), 399–425.
Husband, R. W. (1940). Cooperative versus solitary problem solution. The Journal of Social Psychology, 11(2), 405–409.
Inglehart, R., & Welzel, C. (2005). Modernization, cultural change, and democracy: The human development sequence. Cambridge University Press.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.
Janis, I. L. (1972). Victims of groupthink: A psychological study of foreign-policy decisions and fiascoes. Houghton Mifflin Company. https://psycnet.apa.org/fulltext/1975-29417-000.pdf
Jones, B. C., DeBruine, L. M., Flake, J. K., Liuzza, M. T., Antfolk, J., Arinze, N. C., … Coles, N. A. (2021). To which world regions does the valence-dominance model of social perception apply? Nature Human Behaviour, 5(1), 159–169.
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., … Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589.
Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica: Journal of the Econometric Society, 47(2), 263–291.
Karau, S. J., & Williams, K. D. (1993). Social loafing: A meta-analytic review and theoretical integration. Journal of Personality and Social Psychology, 65(4), 681–706.
Kim, Y. J., Engel, D., Woolley, A. W., Lin, J. Y.-T., McArthur, N., & Malone, T. W. (2017). What makes a strong team? Using collective intelligence to predict team performance in League of Legends. Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing – CSCW '17 (pp. 2316–2329). New York, NY, USA.
Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Bahník, Š., Bernstein, M. J., … Nosek, B. A. (2014). Investigating variation in replicability. Social Psychology, 45(3), 142–152.
Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams, R. B., Alper, S., … Nosek, B. A. (2018). Many Labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science, 1(4), 443–490.
Knudde, N., van der Herten, J., Dhaene, T., & Couckuyt, I. (2017). GPflowOpt: A Bayesian optimization library using TensorFlow. arXiv preprint arXiv:1711.03845. http://arxiv.org/abs/1711.03845
Koyré, A. (1953). An experiment in measurement. Proceedings of the American Philosophical Society, 97(2), 222–237.
Lakens, D., Uygun Tunç, D., & Necip Tunç, M. (2022). There is no generalizability crisis. The Behavioral and Brain Sciences, 45, e25.
Landy, J. F., Jia, M. L., Ding, I. L., Viganola, D., Tierney, W., Dreber, A., … Uhlmann, E. L. (2020). Crowdsourcing hypothesis tests: Making transparent how design choices shape research results. Psychological Bulletin, 146(5), 451–479.
Larson, J. R. (2013). In search of synergy in small group performance. Psychology Press.
Larson, S. D., & Martone, M. E. (2009). Ontologies for neuroscience: What are they and what are they good for? Frontiers in Neuroscience, 3(1), 60–67. https://doi.org/10.3389/neuro.01.007.2009
Laughlin, P. R., Bonner, B. L., & Miner, A. G. (2002). Groups perform better than the best individuals on letters-to-numbers problems. Organizational Behavior and Human Decision Processes, 88(2), 605–620.
Lei, B., Kirk, T. Q., Bhattacharya, A., Pati, D., Qian, X., Arroyave, R., & Mallick, B. K. (2021). Bayesian optimization with adaptive surrogate models for automated experimental design. npj Computational Materials, 7(1), 1–12.
LePine, J. A. (2003). Team adaptation and postchange performance: Effects of team composition in terms of members' cognitive ability and personality. The Journal of Applied Psychology, 88(1), 27–39.
Letham, B., Karrer, B., Ottoni, G., & Bakshy, E. (2019). Constrained Bayesian optimization with noisy experiments. Bayesian Analysis, 14(2), 495–519. https://doi.org/10.1214/18-ba1110
Levinthal, D. A., & Rosenkopf, L. (2021). Commensurability and collective impact in strategic management research: When non-replicability is a feature, not a bug. Unpublished working paper. https://mackinstitute.wharton.upenn.edu/2020/commensurability-and-collective-impact-in-strategic-management-research/
Levitt, S. D., & List, J. A. (2007). What do laboratory experiments measuring social preferences reveal about the real world? The Journal of Economic Perspectives, 21(2), 153–174.
Li, W., Germine, L. T., Mehr, S. A., Srinivasan, M., & Hartshorne, J. (2022). Developmental psychologists should adopt citizen science to improve generalization and reproducibility. Infant and Child Development, e2348. https://doi.org/10.1002/icd.2348
Litman, L., Robinson, J., & Abberbock, T. (2017). TurkPrime.com: A versatile crowdsourcing data acquisition platform for the behavioral sciences. Behavior Research Methods, 49(2), 433–442.
MacWhinney, B. (2014). The CHILDES project: Tools for analyzing talk, Volume II: The database (3rd ed.). Psychology Press. https://doi.org/10.4324/9781315805641
Maier, M., Bartoš, F., Stanley, T. D., Shanks, D. R., Harris, A. J. L., & Wagenmakers, E.-J. (2022). No evidence for nudging after adjusting for publication bias. Proceedings of the National Academy of Sciences of the United States of America, 119(31), e2200300119.
ManyBabies Consortium. (2020). Quantifying sources of variability in infancy research using the infant-directed-speech preference. Advances in Methods and Practices in Psychological Science, 3(1), 24–52.
Manzi, J. (2012). Uncontrolled: The surprising payoff of trial-and-error for business, politics, and society (pp. 13–20). Basic Books.
Mao, A., Mason, W., Suri, S., & Watts, D. J. (2016). An experimental study of team size and performance on a complex task. PLoS ONE, 11(4), e0153048.
Martin, T., Hofman, J. M., Sharma, A., Anderson, A., & Watts, D. J. (2016). Exploring limits to prediction in complex social systems. In Proceedings of the 25th International Conference on World Wide Web (pp. 683–694). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE.
Mason, W., & Suri, S. (2012). Conducting behavioral research on Amazon's Mechanical Turk. Behavior Research Methods, 44(1), 1–23.
Mason, W., & Watts, D. J. (2012). Collaborative learning in networks. Proceedings of the National Academy of Sciences of the United States of America, 109(3), 764–769.
McClelland, G. H. (1997). Optimal design in psychological research. Psychological Methods, 2(1), 3–19.
McGrath, J. E. (1984). Groups: Interaction and performance. Prentice Hall.
Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103–115.
Meehl, P. E. (1990a). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195–244.
Meehl, P. E. (1990b). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1(2), 108–141.
Mertens, S., Herberz, M., Hahnel, U. J. J., & Brosch, T. (2022). The effectiveness of nudging: A meta-analysis of choice architecture interventions across behavioral domains. Proceedings of the National Academy of Sciences of the United States of America, 119(1). https://doi.org/10.1073/pnas.2107346118
Merton, R. K. (1968). On sociological theories of the middle range. Social Theory and Social Structure, 39–72.
Milkman, K. L., Gandhi, L., Patel, M. S., Graci, H. N., Gromet, D. M., Ho, H., … Duckworth, A. L. (2022). A 680,000-person megastudy of nudges to encourage vaccination in pharmacies. Proceedings of the National Academy of Sciences of the United States of America, 119(6). https://doi.org/10.1073/pnas.2115126119
Milkman, K. L., Patel, M. S., Gandhi, L., Graci, H. N., Gromet, D. M., Ho, H., … Duckworth, A. L. (2021). A megastudy of text-based nudges encouraging patients to get vaccinated at an upcoming doctor's appointment. Proceedings of the National Academy of Sciences of the United States of America, 118(20), e2101165118.
Mook, D. G. (1983). In defense of external invalidity. The American Psychologist, 38(4), 379–387.
Moshontz, H., Campbell, L., Ebersole, C. R., IJzerman, H., Urry, H. L., Forscher, P. S., … Chartier, C. R. (2018). The Psychological Science Accelerator: Advancing psychology through a distributed collaborative network. Advances in Methods and Practices in Psychological Science, 1(4), 501–515.
Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., du Sert, N. P., … Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 21.
Muthukrishna, M., Bell, A. V., Henrich, J., Curtin, C. M., Gedranovich, A., McInerney, J., & Thue, B. (2020). Beyond western, educated, industrial, rich, and democratic (WEIRD) psychology: Measuring and mapping scales of cultural and psychological distance. Psychological Science, 31(6), 678–701.
Muthukrishna, M., & Henrich, J. A. (2019). A problem in theory. Nature Human Behaviour, 3, 221–229. https://doi.org/10.1038/s41562-018-0522-1
Myerson, R. B. (1981). Optimal auction design. Mathematics of Operations Research, 6(1), 58–73.
National Information Standards Organization. (2022). ANSI/NISO Z39.104-2022, CRediT, Contributor Roles Taxonomy. National Information Standards Organization. https://www.niso.org/publications/z39104-2022-credit
National Science Foundation. (2022). NSF budget requests to Congress and annual appropriations. National Science Foundation. https://www.nsf.gov/about/budget/
Nemesure, M. D., Heinz, M. V., Huang, R., & Jacobson, N. C. (2021). Predictive modeling of depression and anxiety using electronic health records and a novel machine learning approach with artificial intelligence. Scientific Reports, 11(1), 1980.
Newell, A. (1973). You can't play 20 questions with nature and win: Projective comments on the papers of this symposium. http://shelf2.library.cmu.edu/Tech/240474311.pdf
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science (New York, N.Y.), 349(6251), aac4716.
Page, S. E. (2008). The difference: How the power of diversity creates better groups, firms, schools, and societies – New edition. Princeton University Press.
Palan, S., & Schitter, C. (2018). Prolific.ac – A subject pool for online experiments. Journal of Behavioral and Experimental Finance, 17, 22–27.
Peterson, J. C., Bourgin, D. D., Agrawal, M., Reichman, D., & Griffiths, T. L. (2021). Using large-scale experiments and machine learning to discover theories of human decision-making. Science (New York, N.Y.), 372(6547), 1209–1214.
Plonsky, O., Apel, R., Ert, E., Tennenholtz, M., Bourgin, D., Peterson, J. C., … Erev, I. (2019). Predicting human decisions with behavioral theories and machine learning. arXiv preprint arXiv:1904.06866. http://arxiv.org/abs/1904.06866
Preckel, F., & Brunner, M. (2017). Nomological nets. Encyclopedia of Personality and Individual Differences, 1–4. https://doi.org/10.1007/978-3-319-28099-8_1334-1
Ren, P., Xiao, Y., Chang, X., Huang, P.-Y., Li, Z., Gupta, B. B., … Wang, X. (2021). A survey of deep active learning. ACM Computing Surveys, 54(9), 1–40.
Reuss, H., Kiesel, A., & Kunde, W. (2015). Adjustments of response speed and accuracy to unconscious cues. Cognition, 134, 57–62.
Richard Hackman, J., & Morris, C. G. (1975). Group tasks, group interaction process, and group performance effectiveness: A review and proposed integration. In Berkowitz, L. (Ed.), Advances in Experimental Social Psychology (Vol. 8, pp. 45–99). Academic Press. https://doi.org/10.1016/s0065-2601(08)60248-8
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638–641.
Rubin, D. L., Lewis, S. E., Mungall, C. J., Misra, S., Westerfield, M., Ashburner, M., … Musen, M. A. (2006). National Center for Biomedical Ontology: Advancing biomedicine through structured organization of scientific knowledge. OMICS: A Journal of Integrative Biology, 10(2), 185–198. https://doi.org/10.1089/omi.2006.10.185
Schneid, M., Isidor, R., Li, C., & Kabst, R. (2015). The influence of cultural context on the relationship between gender diversity and team performance: A meta-analysis. The International Journal of Human Resource Management, 26(6), 733–756.
Schulz-Hardt, S., & Mojzisch, A. (2012). How to achieve synergy in group decision making: Lessons to be learned from the hidden profile paradigm. European Review of Social Psychology, 23(1), 305–343.
Schwartz, S. (2006). A theory of cultural value orientations: Explication and applications. Comparative Sociology, 5(2–3), 137–182.
Settles, B. (2011). From theories to queries: Active learning in practice. In Guyon, I., Cawley, G., Dror, G., Lemaire, V., & Statnikov, A. (Eds.), Active learning and experimental design workshop in conjunction with AISTATS 2010 (Vol. 16, pp. 1–18). PMLR.
Shallue, C. J., & Vanderburg, A. (2018). Identifying exoplanets with deep learning: A five-planet resonant chain around Kepler-80 and an eighth planet around Kepler-90. The Astronomical Journal, 155(2), 94.
Shaw, M. E. (1963). Scaling group tasks: A method for dimensional analysis. https://apps.dtic.mil/sti/pdfs/AD0415033.pdf
Shields, B. J., Stevens, J., Li, J., Parasram, M., Damani, F., Alvarado, J. I. M., … Doyle, A. G. (2021). Bayesian reaction optimization as a tool for chemical synthesis. Nature, 590(7844), 89–96.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
Simons, D. J., Shoda, Y., & Lindsay, D. S. (2017). Constraints on generality (COG): A proposed addition to all empirical papers. Perspectives on Psychological Science, 12(6), 1123–1128.
Simonsohn, U., Simmons, J., & Nelson, L. D. (2022). Above averaging in literature reviews. Nature Reviews Psychology, 1(10), 551–552.
Smucker, B., Krzywinski, M., & Altman, N. (2018). Optimal experimental design. Nature Methods, 15(8), 559–560.
Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. arXiv preprint arXiv:1206.2944. http://arxiv.org/abs/1206.2944
Steiner, I. D. (1972). Group process and productivity. Academic Press.
Stewart, G. L. (2006). A meta-analytic review of relationships between team design features and team performance. Journal of Management, 32(1), 29–55.
Stokes, D. E. (1997). Pasteur's quadrant: Basic science and technological innovation. Brookings Institution Press.
Szaszi, B., Higney, A., Charlton, A., Gelman, A., Ziano, I., Aczel, B., … Tipton, E. (2022). No reason to expect large and consistent effects of nudge interventions. Proceedings of the National Academy of Sciences of the United States of America, 119(31), e2200732119.
Tasca, G. A. (2021). Team cognition and reflective functioning: A review and search for synergy. Group Dynamics: Theory, Research, and Practice, 25(3), 258–270.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3–4), 285–294.
Turner, J. A., & Laird, A. R. (2012). The cognitive paradigm ontology: Design and application. Neuroinformatics, 10(1), 57–66.
Turner, M. A., & Smaldino, P. E. (2022). Mechanistic modeling for the masses. The Behavioral and Brain Sciences, 45, e33.
Uhlmann, E. L., Ebersole, C. R., Chartier, C. R., Errington, T. M., Kidwell, M. C., Lai, C. K., … Nosek, B. A. (2019). Scientific utopia III: Crowdsourcing science. Perspectives on Psychological Science, 14(5), 711–733.
Van Bavel, J. J., Mende-Siedlecki, P., Brady, W. J., & Reinero, D. A. (2016). Contextual sensitivity in scientific reproducibility. Proceedings of the National Academy of Sciences of the United States of America, 113(23), 6454–6459.
Vickrey, W. (1961). Counterspeculation, auctions, and competitive sealed tenders. The Journal of Finance, 16(1), 8–37.
Voelkel, J. G., Stagnaro, M. N., Chu, J., Pink, S. L., Mernyk, J. S., Redekopp, C., … Willer, R. (2022). Megastudy identifying successful interventions to strengthen Americans' democratic attitudes. Preprint. https://doi.org/10.31219/osf.io/y79u5
Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228–1242.
Watson, G. B. (1928). Do groups think more efficiently than individuals? Journal of Abnormal and Social Psychology, 23(3), 328.
Watts, D. (2017). Response to Turco and Zuckerman's "Verstehen for sociology." The American Journal of Sociology, 122(4), 1292–1299.
Watts, D. J. (2011). Everything is obvious*: Once you know the answer. Crown Business.
Watts, D. J. (2014). Common sense and sociological explanations. The American Journal of Sociology, 120(2), 313–351.
Watts, D. J. (2017). Should social science be more solution-oriented? Nature Human Behaviour, 1, 15.
Watts, D. J., Beck, E. D., Bienenstock, E. J., Bowers, J., Frank, A., Grubesic, A., … Salganik, M. (2018). Explanation, prediction, and causality: Three sides of the same coin? https://doi.org/10.31219/osf.io/u6vz5
Wiernik, B. M., Raghavan, M., Allan, T., & Denison, A. J. (2022). Generalizability challenges in applied psychological and organizational research and practice. The Behavioral and Brain Sciences, 45, e38.
Witkop, G. (n.d.). Systematizing confidence in open research and evidence (SCORE). DARPA. Retrieved June 22, 2022, from https://www.darpa.mil/program/systematizing-confidence-in-open-research-and-evidence
Wood, R. E. (1986). Task complexity: Definition of the construct. Organizational Behavior and Human Decision Processes, 37(1), 60–82.
Woolley, A. W., Chabris, C. F., Pentland, A., Hashmi, N., & Malone, T. W. (2010). Evidence for a collective intelligence factor in the performance of human groups. Science (New York, N.Y.), 330(6004), 686–688.
Wurman, P. R., Wellman, M. P., & Walsh, W. E. (2001). A parametrization of the auction design space. Games and Economic Behavior, 35(1), 304–338.
Yarkoni, T. (2022). The generalizability crisis. Behavioral and Brain Sciences, 45, e1. https://doi.org/10.1017/S0140525X20001685
Yarkoni, T., Eckles, D., Heathers, J., Levenstein, M., Smaldino, P. E., & Lane, J. I. (2019). Enhancing and accelerating social science via automation: Challenges and opportunities. https://doi.org/10.31235/osf.io/vncwe
Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science, 12(6), 1100–1122.
Zelditch, M., Jr. (1969). Can you really study an army in the laboratory? A Sociological Reader on Complex Organizations, 528–539.

Figure 1. Implicit design space. Panel A depicts a single experiment (a single point) that generates a result in a particular sample population and context; the point's color represents a relationship between variables. Panel B depicts the expectation that results will generalize over broader regions of conditions. Panel C shows a result that applies to a bounded range of conditions. Panel D illustrates how isolated studies about specific hypotheses can reach inconsistent conclusions, as represented by different-colored points.

Figure 2. Explicit design space. Panel A shows that systematically sampling the space of possible experiments can reveal contingencies, thereby increasing the integrativeness of theories (as shown in panel B). Panel C depicts that what matters most is the overlap between the most practically useful conditions and domains defined by theoretical boundaries. The elephants in panels B and C represent the bigger picture that findings from a large number of experiments allow researchers to discern, but which is invisible to those from situated theoretical and empirical positions.

Figure 3. Examples of integrative experiments. The top row illustrates the experimental tasks used in the Moral Machine, decisions under risk, and subliminal priming effects experiments, respectively, followed by the parameters varied across each experiment (bottom row). Each experiment instance (i.e., a scenario in the Moral Machine experiment, a pair of gambles in the risky-choice experiment, and a selection of facet values in the subliminal priming effects experiment) can be described by a vector of parameter values. Reducing the resulting space to two dimensions (2D) visualizes coverage by different experiments. This 2D embedding results from applying principal component analysis (PCA) to the parameters of these experimental conditions.
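
For readers who want to produce this kind of coverage plot for their own design space, the sketch below illustrates the general recipe described in the Figure 3 caption: represent each experiment instance as a vector of design parameters, project all vectors onto their first two principal components, and plot each experiment's sampled conditions in the resulting 2D embedding. It is a minimal sketch only; the parameter matrices, study labels, and dimensionality are hypothetical placeholders, not the data or code behind Figure 3.

```python
# Minimal sketch (hypothetical data): visualize design-space coverage by
# embedding experiment-condition parameter vectors in 2D with PCA.
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Each row is one experiment instance (e.g., one pair of gambles in a
# risky-choice study) described by a vector of design parameters.
# Two placeholder "studies" sample the space differently.
conditions_study_a = rng.normal(loc=0.0, scale=1.0, size=(200, 12))
conditions_study_b = rng.normal(loc=0.5, scale=0.7, size=(150, 12))

all_conditions = np.vstack([conditions_study_a, conditions_study_b])
labels = np.array(
    ["Study A"] * len(conditions_study_a) + ["Study B"] * len(conditions_study_b)
)

# Project the full parameter space onto its first two principal components.
embedding = PCA(n_components=2).fit_transform(all_conditions)

# Plot each study's sampled conditions to compare their coverage of the space.
for study in ("Study A", "Study B"):
    mask = labels == study
    plt.scatter(embedding[mask, 0], embedding[mask, 1], s=8, alpha=0.5, label=study)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.legend()
plt.title("Design-space coverage (2D PCA embedding)")
plt.show()
```

In practice, the parameter vectors would come from the actual experimental conditions (scenario attributes, gamble payoffs and probabilities, facet values), and sparsely covered regions of the embedding indicate parts of the design space that existing experiments have not yet sampled.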