1. Introduction
You can't play 20 questions with Nature and win. (Newell, Reference Newell1973)
Fifty years ago, Allen Newell summed up the state of contemporary experimental psychology as follows: “Science advances by playing twenty questions with nature. The proper tactic is to frame a general question, hopefully binary, that can be attacked experimentally. Having settled that bits-worth, one can proceed to the next … Unfortunately, the questions never seem to be really answered, the strategy does not seem to work” (italics added for emphasis).
The problem, Newell noted, was a lack of coherence among experimental findings. “We never seem in the experimental literature to put the results of all the experiments together,” he wrote, “Innumerable aspects of the situations are permitted to be suppressed. Thus, no way exists of knowing whether the earlier studies are in fact commensurate with whatever ones are under present scrutiny, or are in fact contradictory.” Referring to a collection of papers by prominent experimentalists, Newell concluded that although it was “exceedingly clear that each paper made a contribution … I couldn't convince myself that it would add up, even in thirty more years of trying, even if one had another 300 papers of similar, excellent ilk.”
More than 20 years after Newell's imagined future date, his outlook seems, if anything, optimistic. To illustrate the problem, consider the phenomenon of group “synergy,” defined as the performance of an interacting group exceeding that of an equivalently sized “nominal group” of individuals working independently (Hill, Reference Hill1982; Larson, Reference Larson2013). A century of experimental research in social psychology, organizational psychology, and organizational behavior has tested the performance implications of working in groups relative to working individually (Allen & Hecht, Reference Allen and Hecht2004; Richard Hackman & Morris, Reference Richard Hackman, Morris and Berkowitz1975; Husband, Reference Husband1940; Schulz-Hardt & Mojzisch, Reference Schulz-Hardt and Mojzisch2012; Tasca, Reference Tasca2021; Watson, Reference Watson1928), but substantial contributions can also be found in cognitive science, communications, sociology, education, computer science, and complexity science (Allport, Reference Allport1924; Arrow, McGrath, & Berdahl, Reference Arrow, McGrath and Berdahl2000; Barron, Reference Barron2003; Devine, Clayton, Dunford, Seying, & Pryce, Reference Devine, Clayton, Dunford, Seying and Pryce2001). In spite of this attention across time and disciplines – or maybe because of it – this body of research often reaches inconsistent or conflicting conclusions. For example, some studies find that interacting groups outperform individuals because they are able to distribute effort (Laughlin, Bonner, & Miner, Reference Laughlin, Bonner and Miner2002), share information about high-quality solutions (Mason & Watts, Reference Mason and Watts2012), or correct errors (Mao, Mason, Suri, & Watts, Reference Mao, Mason, Suri and Watts2016), whereas other studies find that “process losses” – including social loafing (Harkins, Reference Harkins1987; Karau & Williams, Reference Karau and Williams1993), groupthink (Janis, Reference Janis1972), and interpersonal conflict (Steiner, Reference Steiner1972) – cause groups to underperform their members.
As we will argue, the problem is not that researchers lack theoretically informed hypotheses about the causes and predictors of group synergy; to the contrary, the literature contains dozens, or possibly even hundreds, of such hypotheses. Rather, the problem is that because each of these experiments was designed with the goal of testing a hypothesis but, critically, not with the goal of explicitly comparing the results with other experiments of the same general class, researchers in this space have no way to articulate how similar their experiment is to, or how different it is from, anyone else's. As a result, it is impossible to determine – via systematic review, meta-analysis, or any other ex-post method of synthesis – how all of the potentially relevant factors jointly determine group synergy or how their relative importance and interactions change over contexts and populations.
Nor is group synergy the only topic in the social and behavioral sciences for which one can find a proliferation of irreconcilable theories and empirical results. In any substantive area of the social and behavioral sciences in which we have read extensively, we see hundreds of experiments, each of which tests the effects of some independent variables on some dependent variables while suppressing innumerable “aspects of the situation.”Footnote 1 Setting aside the much-discussed problems of replicability and reproducibility, many of these papers are interesting when read in isolation, but it is no more possible to “put them all together” today than it was in Newell's time (Almaatouq, Reference Almaatouq2019; Muthukrishna & Henrich, Reference Muthukrishna and Henrich2019; Watts, Reference Watts2017).
Naturally, our subjective experience of reading across several domains of interest does not constitute proof that successful integration of many independently designed and conducted experiments cannot occur in principle, or even that it has not occurred in practice. Indeed it is possible to think of isolated examples, such as mechanism design applied to auctions (Myerson, Reference Myerson1981; Vickrey, Reference Vickrey1961) and matching markets (Aumann & Hart, Reference Aumann and Hart1992; Gale & Shapley, Reference Gale and Shapley1962), in which theory and experiment appear to have accumulated into a reasonably self-consistent, empirically validated, and practically useful body of knowledge. We believe, however, that these examples represent rare exceptions and that examples such as group synergy are far more typical.
We propose two explanations for why not much has changed since Newell's time. The first is that not everyone agrees with the premise of Newell's critique – that “putting things together” is a pressing concern for the scientific enterprise. In effect, this view holds that the approach Newell critiqued (and that remains predominant in the social and behavioral sciences) is sufficient for accumulating knowledge. Such accumulation manifests itself indirectly through the scientific publishing process, with each new paper building upon earlier work, and directly through literature reviews and meta-analyses. The second explanation for the lack of change since Newell's time is that even if one accepts Newell's premise, neither Newell nor anyone else has proposed a workable alternative; hence, the current paradigm persists by default in spite of its flaws.Footnote 2
In the remainder of this paper, we offer our responses to the two explanations just proposed. Section 2 addresses the first explanation, describing what we call the “one-at-a-time” paradigm and arguing that it is poorly suited to the purpose of integrating knowledge over many studies in large part because it was not designed for that purpose. We also argue that existing mechanisms for integrating knowledge, such as systematic reviews and meta-analyses, are insufficient on the grounds that they, in effect, assume commensurability. If the studies that these methods are attempting to integrate cannot be compared with one another, because they were not designed to be commensurable, then there is little that ex-post methods can do.Footnote 3 Rather, an alternative approach to designing experiments and evaluating theories is needed. Section 3 addresses the second explanation by describing such an alternative, which we call the “integrative” approach and which is explicitly designed to integrate knowledge about a particular problem domain. Although integrative experiments of the sort we describe may not have been possible in Newell's day, we argue that they can now be productively pursued in parts of the social and behavioral sciences thanks to increasing theoretical maturity and methodological developments. Section 4 then illustrates the potential of the integrative approach by describing three experiments that take first steps in its direction. Finally, section 5 outlines questions and concerns we have encountered and offers our responses.
2. The “one-at-a-time” paradigm
In the simplest version of what we call the “one-at-a-time” approach to experimentation, a researcher poses a question about the relation between one independent and one dependent variable and then offers a theory-motivated hypothesis that the relation is positive or negative. Next, the researcher devises an experiment to test this hypothesis by introducing variability in the independent variable, aiming to reject, on the basis of the evidence (quantified by a p-value), the “null hypothesis” that the proposed dependency does not exist. If the null hypothesis is successfully rejected, the researcher concludes that the experiment corroborates the theory and then elaborates on potential implications, both for other experiments and for phenomena outside the lab.
In practice, one-at-a-time experiments can be considerably more complex. The researcher may articulate hypotheses about more than one independent variable, more than one dependent variable, or both. The test itself may focus on effect sizes or confidence intervals rather than statistical significance, or it may compare two or more competing hypotheses. Alternatively, both the hypothesis and the test may be qualitative in nature. Regardless, each experiment tests at most a small number of theoretically informed hypotheses in isolation by varying at most a small number of parameters. By design, all other factors are held constant. For example, a study of the effect of reward or punishment on levels of cooperation typically focuses on the manipulation of theoretical interest (e.g., introducing a punishment stage between contribution rounds in a repeated game) while holding fixed other parameters, such as the numerical values of the payoffs or the game's length (Fehr & Gachter, Reference Fehr and Gachter2000). Similarly, a study of the effect of network structure on group performance typically focuses on some manipulation of the underlying network while holding fixed the group size or the time allotted to perform the task (Almaatouq et al., Reference Almaatouq, Noriega-Campero, Alotaibi, Krafft, Moussaid and Pentland2020; Becker, Brackbill, & Centola, Reference Becker, Brackbill and Centola2017).
2.1. The problem with the one-at-a-time paradigm
As Newell himself noted, this approach to experimentation seems reasonable. After all, the sequence of question → theory → hypothesis → experiment → analysis → revision to theory → repeat appears to be almost interchangeable with the scientific method itself. Nonetheless, the one-at-a-time paradigm rests on an important but rarely articulated assumption: That because the researcher's purpose in designing an experiment is to test a theory of interest, the only constructs of interest are those that the theory itself explicitly articulates as relevant. Conversely, where the theory is silent, the corresponding parameters are deemed to be irrelevant. According to this logic, articulating a precise theory leads naturally to a well-specified experiment with only one, or at most a few, constructs in need of consideration. Correspondingly, theory can aid the interpretation of the experiment's results – and can be generalized to other cases (Mook, Reference Mook1983; Zelditch, Reference Zelditch1969).
Unfortunately, while such an assumption may be reasonable in fields such as physics, it is rarely justified in the social and behavioral sciences (Debrouwere & Rosseel, Reference Debrouwere and Rosseel2022; Meehl, Reference Meehl1967). Social and behavioral phenomena exhibit higher “causal density” (or what Meehl called the “crud factor”) than physical phenomena, such that the number of potential causes of variation in any outcome is much larger than in physics and the interactions among these causes are often consequential (Manzi, Reference Manzi2012; Meehl, Reference Meehl1990b). In other words, the human world is vastly more complex than the physical one, and researchers should be neither surprised nor embarrassed that their theories about it are correspondingly less precise and predictive (Watts, Reference Watts2011). The result is that theories in the social and behavioral sciences are rarely articulated with enough precision or supported by enough evidence for researchers to be sure which parameters are relevant and which can be safely ignored (Berkman & Wilson, Reference Berkman and Wilson2021; Meehl, Reference Meehl1990b; Turner & Smaldino, Reference Turner and Smaldino2022; Yarkoni, Reference Yarkoni2022). Researchers working independently in the same domain of inquiry will therefore invariably make design choices (e.g., parameter settings, subject pools) differently (Breznau et al., Reference Breznau, Rinke, Wuttke, Nguyen, Adem, Adriaans and Żółtak2022; Gelman & Loken, Reference Gelman and Loken2014). Moreover, because the one-at-a-time paradigm is premised on the (typically unstated) assumption that theories dictate the design of experiments, the process of making design decisions about constructs that are not specified under the theory being tested is often arbitrary, vague, undocumented, or (as Newell puts it) “suppressed.”
2.2. The universe of possible experiments
To express the problem more precisely, it is useful to think of a one-at-a-time experiment as a sample from an implicit universe of possible experiments in a domain of inquiry. Before proceeding, we emphasize that neither the sample nor the universe is typically acknowledged in the one-at-a-time paradigm. Indeed, it is precisely the transition from implicit to explicit construction of the sampling universe that forms the basis of the solution we describe in the next section.
In imagining such a universe, it is useful to distinguish the independent variables needed to define the effect of interest – the experimental manipulation – from the experiment's context. We define this context as the set of independent variables that are hypothesized to moderate the effect in question as well as the nuisance parameters (which, strictly speaking, are also independent variables) over which the effect is expected to generalize and that correspond to the design choices the researcher makes about the specific experiment that will be conducted. For example, an experiment comparing the performance of teams to that of individuals not only will randomize participants into a set of experimental conditions (e.g., individuals vs. teams of varying sizes), but will also reflect decisions about other contextual features, including, for example, the specific tasks on which to compare performance, where each task could then be parameterized along multiple dimensions (Almaatouq, Alsobay, Yin, & Watts, Reference Almaatouq, Alsobay, Yin and Watts2021a; Larson, Reference Larson2013). Other contextual choices include the incentives provided to participants, time allotted to perform the task, modality of response, and so on. Similarly, we define the population of the experiment as a set of measurable attributes that characterize the sample of participants (e.g., undergraduate women in the United States aged 18–23 with a certain distribution of cognitive reflection test scores). Putting all these choices together, we can now define an abstract space of possible experiments, the dimensions of which are the union of the context and population. We call this space the design space on the grounds that every conceivable design of the experiment is describable by some choice of parameters that maps to a unique point in the space.Footnote 4 (Although this is an abstract way of defining what we mean by the experiment design space, we will suggest concrete and practical ways of defining it later in the article.)
Figure 1 shows a simplified rendering of a design space and illustrates several important properties of the one-at-a-time paradigm. Figure 1A shows a single experiment conducted in a particular context with a particular sample population. The color of the point represents the “result” of the experiment: The effect of one or more independent variables on some dependent variable. In the absence of a theory, nothing can be concluded from the experiment alone, other than that the observed result holds for one particular sample of participants under one particular context. From this observation, the appeal of strong theory becomes clear: By framing an experiment as a test of a theory, rather than as a measurement of the relationship between dependent and independent variables (Koyré, Reference Koyré1953), the observed results can be generalized well beyond the point in question, as shown in Figure 1B. For example, while a methods section of an experimental paper might note that the participants were recruited from the subject pool at a particular university, it is not uncommon for research articles to report findings as if they apply to all of humanity (Henrich, Heine, & Norenzayan, Reference Henrich, Heine and Norenzayan2010). According to this view, theories (and in fields such as experimental economics, formal models) are what help us understand the world, whereas experiments are merely instruments that enable researchers to test theories (Lakens, Uygun Tunç, & Necip Tunç, Reference Lakens, Uygun Tunç and Necip Tunç2022; Levitt & List, Reference Levitt and List2007; Mook, Reference Mook1983; Zelditch, Reference Zelditch1969).
As noted above, however, we rarely expect theories in the social and behavioral sciences to be universally valid. The ability of the theory in question to generalize the result is therefore almost always limited to some region of the design space that includes the sampled point but not the entire space, as shown in Figure 1C. While we expect that most researchers would acknowledge that they lack evidence for unconstrained generality over the population, it is important to note that there is nothing special about the subjects. In principle, what goes for subjects also holds for contexts (Simons, Shoda, & Lindsay, Reference Simons, Shoda and Lindsay2017; Yarkoni, Reference Yarkoni2022). Indeed, as Brunswik long ago observed, “…proper sampling of situations and problems may in the end be more important than proper sampling of subjects, considering the fact that individuals are probably on the whole much more alike than are situations among one another” (Brunswik, Reference Brunswik1947).
Unfortunately, because the design space is never explicitly constructed, and hence the sampled point has no well-defined location in the space, the one-at-a-time paradigm cannot specify a proposed domain of generalizability. Instead, any statements regarding “scope” or “boundary” conditions for a finding are often implicit and qualitative in nature, leaving readers to assume the broadest possible generalizations. These scope conditions may appear in an article's discussion section but typically not in its title, abstract, or introduction. Rarely, if ever, is it possible to precisely identify, based on the theory alone, over what domain of the design space one should expect an empirical result to hold (Cesario, Reference Cesario2014, Reference Cesario2022).
2.3. Incommensurability leads to irreconcilability
Given that the choices about the design of experiments are not systematically documented, it becomes impossible to establish how similar or different two experiments are. This form of incommensurability, whereby experiments about the same effect of interest are incomparable, generates a pattern like that shown in Figure 1D, where inconsistent and contradictory findings appear in no particular order or pattern (Levinthal & Rosenkopf, Reference Levinthal and Rosenkopf2021). If one had a metatheory that specified precisely under what conditions (i.e., over what region of parameter values in the design space) each theory should apply, it might be possible to reconcile the results under that metatheory's umbrella, but rarely do such metatheories exist (Muthukrishna & Henrich, Reference Muthukrishna and Henrich2019). As a result, the one-at-a-time paradigm provides no mechanism by which to determine whether the observed differences (a) are to be expected on the grounds that they lie in distinct subdomains governed by different theories, (b) represent a true disagreement between competing theories that make different claims on the same subdomain, or (c) indicate that one or both results are likely to be wrong and therefore require further replication and scrutiny. In other words, inconsistent findings arising in the research literature are essentially irreconcilable (Almaatouq, Reference Almaatouq2019; Muthukrishna & Henrich, Reference Muthukrishna and Henrich2019; Van Bavel, Mende-Siedlecki, Brady, & Reinero, Reference Van Bavel, Mende-Siedlecki, Brady and Reinero2016; Watts, Reference Watts2017; Yarkoni, Reference Yarkoni2022).
Critically, the absence of commensurability also creates serious problems for existing methods of synthesizing knowledge such as systematic reviews and meta-analyses. As all these methods are post-hoc, meaning that they are applied after the studies in question have been completed, they are necessarily reliant on the designs of the experiments they are attempting to integrate. If those designs do not satisfy the property of commensurability (again, because they were never intended to), then ex-post methods are intrinsically limited in how much they can say about observed differences. A concrete illustration of this problem has emerged recently in the context of “nudging” due to the publication of a large meta-analysis of over 400 studies spanning a wide range of contexts and interventions (Mertens, Herberz, Hahnel, & Brosch, Reference Mertens, Herberz, Hahnel and Brosch2022). The paper was subsequently criticized for failing to account adequately for publication bias (Maier et al., Reference Maier, Bartoš, Stanley, Shanks, Harris and Wagenmakers2022), the quality of the included studies (Simonsohn, Simmons, & Nelson, Reference Simonsohn, Simmons and Nelson2022), and their heterogeneity (Szaszi et al., Reference Szaszi, Higney, Charlton, Gelman, Ziano, Aczel and Tipton2022). While the first two of these problems can be addressed by proposed reforms in science, such as universal registries of study designs (which are designed to mitigate publication bias) and adoption of preanalysis plans (which are intended to improve study quality), the problem of heterogeneity requires a framework for expressing study characteristics in a way that is commensurate. That is, if two studies differ, a meta-analysis has no means of incorporating information from both of them in a way that properly accounts for their differences. Thus, while meta-analyses (and reviews more generally) can acknowledge the importance of moderating variables, they are inherently limited in their ability to do so by the commensurability of the underlying studies.
Finally, we note that the lack of commensurability is also unaddressed by existing proposals to improve the reliability of science by, for example, increasing sample sizes, calculating effect sizes rather than measures of statistical significance, replicating findings, or requiring preregistered designs. Although these practices can indeed improve the reliability of individual findings, they are not concerned directly with the issue of how many such findings “fit together” and hence do not address our fundamental concern with the one-at-a-time framework. In other words, just as Newell claimed 50 years ago, improving the commensurability of experiments – and the theories they seek to test – will require a paradigmatic shift in how we think about experimental design.
3. From one-at-a-time to integrative by design
We earlier noted that a second explanation for the persistence of the one-at-a-time approach is the lack of any realistic alternative. Even if one sees the need for a “paradigmatic shift in how we think about experimental design,” it remains unclear what that shift would look like and how to implement it. To address this issue, we now describe an alternative approach, which we call “integrative” experimentation, that can resolve some of the difficulties described previously. In general terms, the one-at-a-time approach starts with a single, often very specific, theoretically informed hypothesis. In contrast, the integrative approach starts from the position of embracing many potentially relevant theories: All sources of measurable experimental-design variation are potentially relevant, and questions about which parameters are relatively more or less important are to be answered empirically. The integrative approach proceeds in three phases: (1) Constructing a design space, (2) sampling from the design space, and (3) building theories from the resulting data. The rest of this section elucidates these three main conceptual components of the integrative approach.
3.1. Constructing the design space
The integrative approach starts by explicitly constructing the design space. Experiments that have already been conducted can then be assigned well-defined coordinates, whereas those not yet conducted can be identified as as-yet-unsampled points. Critically, the differences between any pair of experiments that share the same effect of interest – whether past or future – can be determined; thus, it is possible to precisely identify the similarities and differences between two designs. In other words, commensurability is “baked in” by design.
How should the design space be constructed in practice? The method will depend on the domain of interest but is likely to entail a discovery stage that identifies candidate dimensions from the literature. Best practices for constructing the design space will emerge with experience, giving birth to a new field of what we tentatively label “research cartography”: The systematic process of mapping out research fields in design spaces. Efforts in research cartography are likely to benefit from and contribute to ongoing endeavors to produce formal ontologies in social and behavioral science research and other disciplines, in support of a more integrative science (Larson & Martone, Reference Larson and Martone2009; Rubin et al., Reference Rubin, Lewis, Mungall, Misra, Westerfield, Ashburner and Musen2006; Turner & Laird, Reference Turner and Laird2012).
To illustrate this process, consider the phenomenon of group synergy discussed earlier. Given existing theory and decades of experiments, one might expect the existence and strength of group synergy to depend on the task: For some tasks, interacting groups might outperform nominal groups, whereas for others, the reverse might hold. In addition, synergy might (or might not) be expected depending on the specific composition of the group: Some combinations of skills and other individual attributes might lead to synergistic performance; other combinations might not. Finally, group synergy might depend on “group processes,” defined as variables such as the communications technology or incentive structure that affect how group members interact with one another, but which are distinct both from the individuals themselves and their collective task.
Given these three broad sources of variation, an integrative approach would start by identifying the dimensions associated with each, as suggested either by prior research or some other source of insight such as practical experience. In this respect, research cartography resembles the process of identifying the nodes of a nomological network (Cronbach & Meehl, Reference Cronbach and Meehl1955; Preckel & Brunner, Reference Preckel and Brunner2017) or the dimensions of methodological diversity for a meta-analysis (Higgins, Thompson, Deeks, & Altman, Reference Higgins, Thompson, Deeks and Altman2003); however, it will typically involve many more dimensions and require the “cartographer” to assign numerical coordinates to each “location” in the space. For example, the literature on group performance has produced several well-known task taxonomies, such as those by Shaw (Reference Shaw1963), Hackman (Reference Hackman1968), Steiner (Reference Steiner1972), McGrath (Reference McGrath1984), and Wood (Reference Wood1986). Task-related dimensions of variation (e.g., divisibility, complexity, solution demonstrability, and solution multiplicity) would be extracted from these taxonomies and used to label tasks that have appeared in experimental studies of group performance. Similarly, prior work has variously suggested that group performance depends on the composition of the group with respect to individual-level traits as captured by, say, average skill (Bell, Reference Bell2007; Devine & Philips, Reference Devine and Philips2001; LePine, Reference LePine2003; Stewart, Reference Stewart2006), skill diversity (Hong & Page, Reference Hong and Page2004; Page, Reference Page2008), gender diversity (Schneid, Isidor, Li, & Kabst, Reference Schneid, Isidor, Li and Kabst2015), social perceptiveness (Engel, Woolley, Jing, Chabris, & Malone, Reference Engel, Woolley, Jing, Chabris and Malone2014; Kim et al., Reference Kim, Engel, Woolley, Lin, McArthur and Malone2017; Woolley, Chabris, Pentland, Hashmi, & Malone, Reference Woolley, Chabris, Pentland, Hashmi and Malone2010), and cognitive-style diversity (Aggarwal & Woolley, Reference Aggarwal and Woolley2018; Ellemers & Rink, Reference Ellemers and Rink2016), all of which could be represented as dimensions of the design space. Finally, group-process variables might include group size (Mao et al., Reference Mao, Mason, Suri and Watts2016), properties of the communication network (Almaatouq, Rahimian, Burton, & Alhajri, Reference Almaatouq, Rahimian, Burton and Alhajri2022; Becker et al., Reference Becker, Brackbill and Centola2017; Mason & Watts, Reference Mason and Watts2012), and the ability of groups to reorganize themselves (Almaatouq et al., Reference Almaatouq, Noriega-Campero, Alotaibi, Krafft, Moussaid and Pentland2020). Together, these variables might identify upward of 50 dimensions that define a design space of possible experiments for studying group synergy through integrative experiment design, where any given study should, in principle, be assignable to one unique point in the space.Footnote 5
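To make the notion of research cartography concrete, the sketch below (in Python) shows one way a design space and the coordinates of a single study might be represented. It is purely illustrative: the dimension names and values are hypothetical stand-ins for the task, composition, and group-process dimensions discussed above, not a canonical encoding.

```python
# Purely illustrative sketch: a design space as a set of named dimensions
# (task, composition, and group-process variables), with each study assigned
# a unique coordinate point. All dimension names and values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class DesignSpace:
    # Each dimension maps to the set of values it may take.
    dimensions: dict[str, list] = field(default_factory=dict)

    def locate(self, study: dict) -> tuple:
        """Map a study's design choices to a unique point in the space."""
        coords = []
        for name, allowed in self.dimensions.items():
            value = study.get(name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} lies outside the design space")
            coords.append(value)
        return tuple(coords)

group_synergy_space = DesignSpace(dimensions={
    # Task dimensions (drawn, e.g., from existing task taxonomies)
    "task_divisibility": ["unitary", "divisible"],
    "solution_demonstrability": ["low", "high"],
    # Composition dimensions
    "mean_skill": ["low", "medium", "high"],
    "skill_diversity": ["homogeneous", "heterogeneous"],
    # Group-process dimensions
    "group_size": [1, 3, 6, 12],
    "communication_network": ["complete", "ring", "star"],
})

# A hypothetical study's coordinates in the space:
point = group_synergy_space.locate({
    "task_divisibility": "divisible", "solution_demonstrability": "high",
    "mean_skill": "medium", "skill_diversity": "heterogeneous",
    "group_size": 6, "communication_network": "complete",
})
```

In practice the space would contain many more dimensions (upward of 50, as noted above), but the principle is the same: every study receives explicit coordinates along every identified dimension.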
As this example illustrates, the list of possibly relevant variables can be long, and the dimensionality of the design space can therefore be large. Complicating matters, we do not necessarily know up front which of the many variables are in fact relevant to the effects of interest. In the example of group synergy, for instance, even an exhaustive reading of the relevant literature is not guaranteed to reveal all the ways in which tasks, groups, and group processes can vary in ways that meaningfully affect synergy. Conversely, there is no guarantee that all, or even most, of the dimensions chosen to represent the design space will play any important role in generating synergy. As a result, experiments that map to the same point in the design space could yield different results (because some important dimension is missing from the representation of the space), while in other cases, experiments that map to very different points yield indistinguishable behavior (because the dimensions along which they differ are irrelevant).
Factors such as these complicate matters in practice but do not present a fundamental problem to the approach described here. The integrative approach does not require the initial configuration of the space to be correct or its dimensionality to be fixed. Rather, the dimensionality of the space can be learned in parallel with theory construction and testing. Really, the only critical requirement for constructing the design space is to do it explicitly and systematically by identifying potentially relevant dimensions (either from the literature or from experience, including any known experiments that have already been performed) and by assigning coordinates to individual experiments along all identified dimensions. Using this process of explicit, systematic mapping of research designs to points in the design space (research cartography), the integrative approach ensures commensurability. We next will describe how the approach leverages commensurability to produce integrated knowledge in two steps: Via sampling, and via theory construction and testing.
3.2. Sampling from the design space
An important practical challenge to integrative experiment design is that the size of the design space (i.e., the number of possible experiments) increases exponentially with the number of identified dimensions D. To illustrate, assume that each dimension can be represented as a binary variable (0, 1), such that a given experiment either exhibits the property encoded in the dimension or does not. The number of possible experiments is then 2D. When D is reasonably small and experiments are inexpensive to run, it may be possible to exhaustively explore the space by conducting every experiment in a full factorial design. For example, when D = 8, there are 256 experiments in the design space, a number that is beyond the scale of most studies in the social and behavioral sciences but is potentially achievable with recent innovations in crowdsourcing and other “high-throughput” methods, especially if distributed among a consortium of labs (Byers-Heinlein et al., Reference Byers-Heinlein, Bergmann, Davies, Frank, Kiley Hamlin, Kline and Soderstrom2020; Jones et al., Reference Jones, DeBruine, Flake, Liuzza, Antfolk, Arinze and Coles2021). Moreover, running all possible experiments may not be necessary: If the goal is to estimate the impact that each variable has, together with their interactions, a random (or more efficient) sample of the experiments can be run (Auspurg & Hinz, Reference Auspurg and Hinz2014). This sample could also favor areas where prior work suggests meaningful variation will be observed. Using these methods, together with large samples, it is possible to run studies for higher values of D (e.g., 20). Section 4 describes examples of such studies.
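As a rough illustration of these two regimes (assuming, as above, purely binary dimensions), the sketch below enumerates the full factorial when D is small and draws a random subsample of design points when D is larger; both functions are illustrative rather than prescriptive.

```python
# Illustrative sketch: exhaustive versus random sampling of a binary design space.
import itertools
import random

def full_factorial(D):
    """All 2**D experiment designs, each a tuple of 0/1 settings."""
    return list(itertools.product([0, 1], repeat=D))

def random_designs(D, n, seed=0):
    """Draw n distinct design points at random, without enumerating the space."""
    rng = random.Random(seed)
    designs = set()
    while len(designs) < n:
        designs.add(tuple(rng.randint(0, 1) for _ in range(D)))
    return list(designs)

print(len(full_factorial(8)))        # 256 designs: exhaustive sampling is feasible
print(len(random_designs(20, 500)))  # 500 of the ~1 million possible designs
```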
Exhaustive and random sampling are both desirable because they allow unbiased evaluation of hypotheses that are not tethered to the experimental design – there is no risk of looking only at regions of the space that current hypotheses favor (Dubova, Moskvichev, & Zollman, Reference Dubova, Moskvichev and Zollman2022), and no need to collect more data from the design space when the hypotheses under consideration change. But as the dimensionality increases, exhaustive and random sampling quickly become infeasible. When D = 20, the number of experiment designs exceeds 1 million, and when D = 30, it exceeds 1 billion. Given that the dimensionality of design spaces for even moderately complex problems could easily exceed these values, and that many dimensions will not be binary but ternary or of even higher cardinality, integrative experiments will require different sampling methods.
Fortunately, there already exist a number of methods that enable researchers to efficiently sample high-dimensional design spaces (Atkinson & Donev, Reference Atkinson and Donev1992; McClelland, Reference McClelland1997; Smucker, Krzywinski, & Altman, Reference Smucker, Krzywinski and Altman2018; Thompson, Reference Thompson1933). For example, one contemporary class of methods is “active learning,” an umbrella term for sequential optimal experimental-design strategies that iteratively select the most informative design points to sample.Footnote 6 Active learning has become an important tool in the design of A/B tests in industry (Letham, Karrer, Ottoni, & Bakshy, Reference Letham, Karrer, Ottoni and Bakshy2019) and, more recently, of behavioral experiments in the lab (Balietti, Klein, & Riedl, Reference Balietti, Klein and Riedl2021).Footnote 7 Most commonly, an active learning process begins by conducting a small number of randomly selected experiments (i.e., points in the design space) and fitting a surrogate model to the outcomes of these experiments. As we later elucidate, one can think of the surrogate model as a “theory” that predicts the outcome of all experiments in the design space, including those that have not been conducted. Then, a sampling strategy (also called an “acquisition function,” “query algorithm,” or “utility measure”) selects a new batch of experiments to be conducted according to the expected value of conducting them. Notably, the choice of a surrogate model and sampling strategy is flexible, and the best choice will depend on the problem (Eyke, Koscher, & Jensen, Reference Eyke, Koscher and Jensen2021).Footnote 8
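The sketch below illustrates the generic loop just described. It is a schematic rather than an implementation from any of the cited studies: we arbitrarily use a Gaussian-process regressor as the surrogate model and simple uncertainty sampling as the acquisition function, and run_experiment is a hypothetical stand-in for actually fielding an experiment and measuring its outcome.

```python
# Schematic active-learning loop: fit a surrogate model to the experiments run so
# far, pick the next batch with an acquisition rule (here, uncertainty sampling),
# run that batch, and repeat. `run_experiment` is a hypothetical placeholder.
import itertools
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def run_experiment(design):
    # Placeholder for fielding a real experiment at this design point and
    # returning the measured outcome (e.g., group synergy).
    rng = np.random.default_rng(abs(hash(tuple(design))) % (2**32))
    return float(design[0] - 0.5 * design[1] + rng.normal(scale=0.1))

# Candidate designs: all 2**10 binary design points, for illustration.
candidates = np.array(list(itertools.product([0, 1], repeat=10)))

rng = np.random.default_rng(0)
idx = rng.choice(len(candidates), size=16, replace=False)    # initial random batch
X = candidates[idx]
y = np.array([run_experiment(d) for d in X])

for _ in range(5):                                           # five rounds of sampling
    surrogate = GaussianProcessRegressor().fit(X, y)         # surrogate "theory" of the space
    _, std = surrogate.predict(candidates, return_std=True)  # predictive uncertainty everywhere
    batch = candidates[np.argsort(-std)[:16]]                # most uncertain (informative) designs
    X = np.vstack([X, batch])
    y = np.concatenate([y, [run_experiment(d) for d in batch]])
```

In practice, the surrogate model, acquisition function, and batch size would all be chosen to suit the problem, as noted above.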
We will not explore the details of these methods or their implementation,Footnote 9 as this large topic has been – and continues to be – extensively developed in the machine-learning and statistics communities.Footnote 10 For the purpose of our argument, it is necessary only to convey that systematic sampling from the design space allows for unbiased evaluation of hypotheses (see Fig. 2A) and can leverage a relatively small number of sampled points in the design space to make predictions about every point in the space, the vast majority of which are never sampled (see Fig. 2B). Even so, by iteratively evaluating the model against newly sampled points and updating it accordingly, researchers can learn about the entire space, including which dimensions are informative. As we explain next, this iterative process will also form the basis of theory construction and evaluation.
3.3. Building and testing theories
Much like in the one-at-a-time paradigm, the ultimate goal of integrative experiment design is to develop a reliable, cohesive, and cumulative theoretical understanding. However, because the integrative approach constructs and tests theories differently, the theories that tend to emerge from it depart from the traditional notion of theory in two regards. First, the shift to integrative experiments will change our expectations about what theories look like (Watts, Reference Watts2014, Reference Watts2017), requiring researchers to focus less on proposing novel theories that seek to differentiate themselves from existing theories by identifying new variables and their effects, and more on identifying theory boundaries, which may involve many known variables working together in complex ways. Second, although traditional theory development distinguishes sharply between basic and applied research, integrative theories will lend themselves to a “use-inspired” approach in which basic and applied science are treated as complements rather than as substitutes where one necessarily drives out the other (Stokes, Reference Stokes1997; Watts, Reference Watts2017). We now describe each of these adaptations in more detail.
3.3.1. Integrating and reconciling existing theories
As researchers sample experiments that cover more of the design space, simple theories and models that explain behavior with singular factors will no longer be adequate because they will fail to generalize. From a statistical perspective, the “bias-variance trade-off” principle identifies two ways a model (or theory) can fail to generalize: It can be too simple and thus unable to capture trends in the observed data, or too complex, overfitting the observed data and manifesting great variance across datasets (Geman, Bienenstock, & Doursat, Reference Geman, Bienenstock and Doursat1992). However, this variance decreases as the datasets increase in size and breadth, making oversimplification and reliance on personal intuitions more-likely causes of poor generalization. As a consequence, we must develop new kinds of theories – or metatheories – that capture the complexity of human behaviors while retaining the interpretability of simpler theories.Footnote 11 In particular, such theories must account for variation in behavior across the entire design space and will be subject to different evaluation criteria than those traditionally used in the social and behavioral sciences.
One such criterion is the requirement that theories generate “risky” predictions, defined roughly as quantitative predictions about as-yet unseen outcomes (Meehl, Reference Meehl1990b; Yarkoni, Reference Yarkoni2022). For example, in the “active learning” approach outlined above, the surrogate model encodes prior theory and experimental results into a formal representation that (a) can be viewed as an explanation of all previously sampled experimental results and (b) can be queried for predictions treated as hypotheses. This dual status of the surrogate model as both explanation and prediction (Hofman et al., Reference Hofman, Watts, Athey, Garip, Griffiths, Kleinberg and Yarkoni2021; Nemesure, Heinz, Huang, & Jacobson, Reference Nemesure, Heinz, Huang and Jacobson2021; Yarkoni & Westfall, Reference Yarkoni and Westfall2017) distinguishes it from the traditional notion of hypothesis testing. Rather than evaluating a theory based on how well it fits existing (i.e., in-sample) experimental data, the surrogate model is continually evaluated on its ability to predict new (i.e., out-of-sample) experimental data. Moreover, once the new data have been observed, the model is updated to reflect the new information, and new predictions are generated.
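The following sketch makes this evaluation criterion explicit: the current model is scored on predictions it makes about experiments before their data exist, and only then is it refit to incorporate the new results. The function and argument names are hypothetical placeholders, not a prescribed implementation.

```python
# Sketch of the "risky prediction" cycle: predict unseen experiments, score the
# predictions out of sample, then update the model with the new evidence.
# `model` and `run_experiment` are hypothetical placeholders.
import numpy as np
from sklearn.metrics import mean_squared_error

def risky_prediction_cycle(model, X_observed, y_observed, X_new, run_experiment):
    model.fit(X_observed, y_observed)                      # explain what has been observed
    y_predicted = model.predict(X_new)                     # predictions made before the data exist
    y_new = np.array([run_experiment(x) for x in X_new])   # now run the new experiments
    error = mean_squared_error(y_new, y_predicted)         # out-of-sample, not in-sample, fit
    model.fit(np.vstack([X_observed, X_new]),              # integrate the new information
              np.concatenate([y_observed, y_new]))
    return model, error
```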
We emphasize that the surrogate model from the active learning approach is just one way to generate, test, and learn from risky predictions. Many other approaches also satisfy this criterion. For example, one might train a machine-learning model other than the surrogate model to estimate heterogeneity of treatment effects and to discover complex structures that were not specified in advance (Wager & Athey, Reference Wager and Athey2018). Alternatively, one could use an interpretable, mechanistic model. The only essential requirements for an integrative model are that it leverages the commensurability of the design space to in some way (a) accurately explain data that researchers have already observed, (b) make predictions about as-yet-unseen experiments, and (c) having run those experiments, integrate the newly learned information to improve the model. If accurate predictions are achievable across some broad domain of the design space, the model can then be interpreted as supporting or rejecting various theoretical claims in a context-population-dependent way, as illustrated schematically in Figure 2B. Reflecting Merton's (Reference Merton1968) call for “theories of the middle range,” a successful metatheory could identify the boundaries between empirically distinct regions of the design space (i.e., regions where different observed answers to the same research question pertain), making it possible to precisely state under what conditions (i.e., for which ranges of parameter values) one should expect different theoretically informed results to apply.
If accurate predictions are unachievable even after an arduous search, the result is not a failure of the integrative framework. Rather, it would be an example of the framework's revealing a fundamental limit to prediction and, hence, explanation (Hofman, Sharma, & Watts, Reference Hofman, Sharma and Watts2017; Martin, Hofman, Sharma, Anderson, & Watts, Reference Martin, Hofman, Sharma, Anderson and Watts2016; Watts et al., Reference Watts, Beck, Bienenstock, Bowers, Frank, Grubesic and Salganik2018).Footnote 12 In the extreme, when no point in the space is informative of any other point, generalizations of any sort are unwarranted. In such a scenario, applied research might still be possible, for example, by sampling the precise point of interest (Manzi, Reference Manzi2012), but the researcher's drive to attain a generalizable theoretical understanding of a domain of inquiry would be exposed as fruitless. Such an outcome would be disappointing, but from a larger scientific perspective, it is better to know what cannot be known than to believe in false promises. Naturally, whether such outcomes arise – and if so, how frequently – is itself an empirical question that the proposed framework could inform. With sufficient integrative experiments over many domains, the framework might yield a “meta-metatheory” that clarifies under which conditions one should (or should not) expect to find predictively accurate metatheories.
3.3.2. Bridging scientific and pragmatic knowledge
Another feature of integrative theories is that they will lend themselves to a “use-inspired” approach. Practitioners and researchers alike generally acknowledge that no single intervention, however evidence-based, benefits all individuals in all circumstances (i.e., across populations and contexts) and that overgeneralization from lab experiments in many areas of behavioral science can (and routinely does) lead practitioners and policymakers to deploy suboptimal and even dangerous real-world interventions (Brewin, Reference Brewin2022; de Leeuw, Motz, Fyfe, Carvalho, & Goldstone, Reference de Leeuw, Motz, Fyfe, Carvalho and Goldstone2022; Grubbs, Reference Grubbs2022; Wiernik, Raghavan, Allan, & Denison, Reference Wiernik, Raghavan, Allan and Denison2022). Social scientists should therefore aim to identify precisely which intervention is most effective under which set of circumstances.
The integrative approach naturally emphasizes contingencies and enables practitioners to distinguish between the most general result and the result that is most useful in practice. For example, in Figure 2B, the experiments depicted with a gray point correspond to the most general claim, occupying the largest region in the design space. However, this view ignores relevance, defined as points that represent the “target” conditions or the particular real-world context to which the practitioner hopes to generalize the results (Berkman & Wilson, Reference Berkman and Wilson2021; Brunswik, Reference Brunswik1955), as shown in Figure 2C. By concretely emphasizing these theoretical contingencies, the integrative approach supports “use-inspired” research (Stokes, Reference Stokes1997; Watts, Reference Watts2017).
4. Existing steps toward integrative experiments
Integrative experiment design is not yet an established framework. However, some recent experimental work has begun to move in the direction we endorse – for example, by explicitly constructing a design space, sampling conditions more broadly and densely than the one-at-a-time approach would have, and constructing new kinds of theories that reflect the complexity of human behavior. In this section, we describe three examples of such experiments in the domains of (1) moral judgments, (2) risky choices, and (3) subliminal priming effects. Note that these examples are not an exhaustive accounting of relevant work, nor fully fleshed out exemplars of the integrative framework. Rather, we find them to be helpful illustrations of work that is closely adjacent to what we describe and evidence that the approach is realizable and can yield useful insights.
4.1. Factors influencing moral judgments
Inspired by the trolley problem, the seminal “Moral Machine” experiment used crowdsourcing to study human perspectives on moral decisions made by autonomous vehicles (Awad et al., Reference Awad, Dsouza, Kim, Schulz, Henrich, Shariff and Rahwan2018, Reference Awad, Dsouza, Bonnefon, Shariff and Rahwan2020). The experiment was supported by an algorithm that sampled a nine-dimensional space of over 9 million distinct moral dilemmas. In the first 18 months after deployment, the researchers collected more than 40 million decisions in 10 languages from over 4 million unique participants in 233 countries and territories (Fig. 3A).
The study offers numerous findings that were neither obvious nor deducible from prior research or traditional experimental designs. For example, the results show that once a moral dilemma is made sufficiently complex, few people hold to the principle of treating all lives equally. Instead, people appear to treat demographic groups quite differently, showing, for example, a willingness to sacrifice the elderly in service of the young and a preference for sparing the wealthy over the poor that is about as strong as the preference for sparing people who follow the law over those who break it (Awad et al., Reference Awad, Dsouza, Kim, Schulz, Henrich, Shariff and Rahwan2018). A second surprising finding by Awad et al. (Reference Awad, Dsouza, Kim, Schulz, Henrich, Shariff and Rahwan2018) was that the distinction between omission and commission (a staple of discussions in Western moral philosophy) ranks low relative to other variables affecting judgments of morality and that this ethical preference for inaction is concentrated primarily in Western cultures (e.g., North America and many European countries of Protestant, Catholic, and Orthodox Christian cultural groups). Indeed, the observation that the clustering between countries is based not just on one or two ethical dimensions but on a full profile across the multiplicity of ethical dimensions would have been impossible to detect using studies that lacked the breadth of experimental conditions sampled in this study.
Moreover, such an approach to experimentation yields datasets that are more useful to other researchers as they evaluate their hypotheses, develop new theories, and address long-standing concerns such as which variables matter most to producing a behavior and what their relative contributions might be. For instance, Agrawal and colleagues used the dataset generated by the Moral Machine experiment to build a model with a black-box machine-learning method (specifically, an artificial neural network) for predicting people's decisions (Agrawal, Peterson, & Griffiths, Reference Agrawal, Peterson and Griffiths2020). This predictive model was used to critique a traditional cognitive model and identify potentially causal variables influencing people's decisions. The cognitive model was then evaluated in a new round of experiments that tested its predictions about the consequences of manipulating the causal variables. This approach of “scientific regret minimization” combined machine learning with rational choice models to jointly maximize the theoretical model's predictive accuracy and interpretability in the context of moral judgments. It also yielded a more-complex theory than psychologists might be accustomed to: The final model had over 100 meaningful predictors, each of which could have been the subject of a distinct experiment and theoretical insight about human moral reasoning. By considering the influence of these variables in a single study by Awad et al. (Reference Awad, Dsouza, Kim, Schulz, Henrich, Shariff and Rahwan2018), the researchers could ask what contribution each made to explaining the results. Investigation at this scale becomes possible when machine-learning methods augment the efforts of human theorists (Agrawal et al., Reference Agrawal, Peterson and Griffiths2020).
4.2. The space of risky decisions
The choice prediction competitions studied human decisions under risk (i.e., where outcomes are uncertain) by using an algorithm to select more than 100 pairs of gambles from a 12-dimensional space (Erev, Ert, Plonsky, Cohen, & Cohen, Reference Erev, Ert, Plonsky, Cohen and Cohen2017; Plonsky et al., Reference Plonsky, Apel, Ert, Tennenholtz, Bourgin, Peterson and Erev2019). Recent work scaled this approach by taking advantage of the larger sample sizes made possible by virtual labs, collecting human decisions for over 10,000 pairs of gambles (Bourgin, Peterson, Reichman, Russell, & Griffiths, Reference Bourgin, Peterson, Reichman, Russell, Griffiths, Chaudhuri and Salakhutdinov2019; Peterson, Bourgin, Agrawal, Reichman, & Griffiths, Reference Peterson, Bourgin, Agrawal, Reichman and Griffiths2021).
By sampling the space of possible experiments (in this case, gambles) much more densely (Fig. 3B), Peterson et al. (Reference Peterson, Bourgin, Agrawal, Reichman and Griffiths2021) found that two of the classic phenomena of risky choice – loss aversion and overweighting of small probabilities – did not manifest uniformly across the entire space of possible gambles. These two phenomena originally prompted the development of prospect theory (Kahneman & Tversky, Reference Kahneman and Tversky1979), representing significant deviations from the predictions of classic expected utility theory. By identifying regions of the space of possible gambles where loss aversion and overweighting of small probabilities occur, Kahneman and Tversky showed that expected utility theory does not capture some aspects of human decision making. However, in analyzing predictive performance across the entire space of gambles, Peterson et al. found that prospect theory was outperformed by a model in which the degree of loss aversion and overweighting of small probabilities varied smoothly over the space.
The work of Peterson et al. (Reference Peterson, Bourgin, Agrawal, Reichman and Griffiths2021) illustrates how the content of theories might be expected to change with a shift to the integrative approach. Prospect theory makes a simple assertion about human decision making: People exhibit loss aversion and overweight small probabilities. Densely sampling a larger region of the design space yields a more nuanced theory: While the functional form of prospect theory is well suited for characterizing human decisions, the extent to which people show loss aversion and overweight small probabilities depends on the context of the choice problem. That dependency is complicated. Even so, Peterson et al. identified several relevant variables such as the variability of the outcomes of the underlying gambles and whether the gamble was entirely in the domain of losses. Machine-learning methods were useful in developing this theory, initially to optimize the parameters of the functions assumed by prospect theory and other classic theories of decision making so as to ensure evaluation of the best possible instances of those theories, and then to demonstrate that these models did not capture variation in people's choices that could be predicted by more-complex models.
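To give a flavor of what such a theory can look like, the sketch below writes down prospect theory's standard value and probability-weighting functions but lets the loss-aversion and probability-weighting parameters depend on features of the choice context. The mapping from context to parameters here is entirely invented for illustration; it is not the model estimated by Peterson et al.

```python
# Schematic only: prospect theory's functional form with context-dependent
# parameters. The context_to_params mapping is invented for illustration and is
# not the model fit by Peterson et al. (2021).
import math

def weight(p, gamma):
    """Probability weighting function (Tversky-Kahneman form)."""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

def value(x, alpha, beta, lam):
    """Value function: diminishing sensitivity for gains, loss aversion (lam) for losses."""
    return x**alpha if x >= 0 else -lam * (-x)**beta

def context_to_params(outcome_spread, all_losses):
    # Hypothetical: loss aversion and probability weighting vary smoothly with
    # features of the gamble, such as outcome variability and the loss domain.
    lam = 1.0 + 1.5 / (1.0 + math.exp(-0.2 * outcome_spread))
    gamma = 0.9 if all_losses else 0.65
    return 0.88, 0.88, lam, gamma   # alpha, beta, lam, gamma

def subjective_utility(gamble, outcome_spread, all_losses):
    """Prospect-theoretic utility of a list of (outcome, probability) pairs."""
    alpha, beta, lam, gamma = context_to_params(outcome_spread, all_losses)
    return sum(weight(p, gamma) * value(x, alpha, beta, lam) for x, p in gamble)

# Example: +10 with probability 0.1, otherwise -1, in a high-variability context.
print(subjective_utility([(10, 0.1), (-1, 0.9)], outcome_spread=11.0, all_losses=False))
```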
4.3. A metastudy of subliminal priming effects
A recent cognitive psychology paper described an experiment in which a subliminal cue influences how participants balance speed and accuracy in a response-time task (Reuss, Kiesel, & Kunde, Reference Reuss, Kiesel and Kunde2015). In particular, participants were instructed to rapidly select a target according to a cue that signaled whether to prioritize response accuracy over speed, or vice versa. Reuss et al. reported typical speed–accuracy tradeoffs: When cued to prioritize speed, participants were faster and gave less accurate responses, whereas when cued to prioritize accuracy, participants were slower and more accurate. Crucially, this relationship was also found with cues that were rendered undetectable via a mask, an image presented directly before or after the cue that can suppress conscious perception of it.
The study design of the original experiment included several nuisance variables (e.g., the color of the cue), the values of which were not thought to affect the finding of subliminal effects. If the claimed effect were general, it would appear for all plausible values of the nuisance variables, whereas its appearance in some (contiguous) ranges of values but not in others would indicate contingency. And if the effect were spurious, it would appear only for the original values, if at all.
Baribault et al. (Reference Baribault, Donkin, Little, Trueblood, Oravecz, van Ravenzwaaij and Vandekerckhove2018) took a “radical randomization” approach (also called a “metastudy” approach) in examining the generalizability and robustness of the original finding by randomizing 16 independent variables that could moderate the subliminal priming effect (Fig. 3C). By sampling nearly 5,000 “microexperiments” from the 16-dimensional design space, Baribault et al. revealed that masked cues had an effect on participant behavior only in the subregion of the design space where the cue is consciously visible, thus providing much stronger evidence about the lack of the subliminal priming effect than any single traditional experiment evaluating this effect could have. For a recent, thorough discussion of the metastudy approach and its advantages, along with a demonstration using the risky-choice framing effect, see DeKay, Rubinchik, Li, and De Boeck (Reference DeKay, Rubinchik, Li and De Boeck2022).
5. Critiques and concerns
We have argued that adopting what we have called “integrative designs” in experimental social and behavioral science will lead to more-consistent, more-cumulative, and more-useful science. As should be clear from our discussion, however, our proposal is preliminary and therefore subject to several questions and concerns. Here we outline some of the critiques we have encountered and offer our responses.
5.1. Isn't the critique of the one-at-a-time approach unfair?
One possible objection is that our critique of the one-at-a-time approach is unduly harsh and fails to recognize its proper role in the future of the social and behavioral sciences. To be clear, we are neither arguing that scientists should discard the "one-at-a-time" paradigm entirely nor denigrating studies (including our own!) that have employed it. The approach has generated a substantial amount of valuable work and continues to be useful for understanding individual causal effects, shaping theoretical models, and guiding policy. For example, it can be a sufficient and effective means to provide evidence for the existence of a phenomenon (but not the conditions under which it exists), as in field experiments showing that job applicants with characteristically "Black" names are less likely to be interviewed than those with "White" names, revealing the presence of structural racism and informing public debates about discrimination (Bertrand & Mullainathan, Reference Bertrand and Mullainathan2004). Moreover, one-at-a-time experimentation can precede the integrative approach when exploring a new topic and identifying the variables that make up the design space.
Rather, our point is that the one-at-a-time approach cannot do all the work that is being asked of it, in large part because theories in the social and behavioral sciences cannot do all the work that is being asked of them. Once we recognize the inherent imprecision and ambiguity of social and behavioral theories, the lack of commensurability across independently designed and executed experiments is revealed as inevitable. Similarly, the solution we describe here can be understood simply as baking commensurability into the design process, by explicitly recognizing potential dimensions of variability and mapping experiments such that they can be compared with one another. In this way, the integrative approach can complement one-at-a-time experiments by incorporating them within design spaces (analogous to how articles already contextualize their contribution in terms of the prior literature), through which the research field might quickly recognize creative and pathbreaking contributions from one-at-a-time research.
5.2. Can't we solve the problem with meta-analysis?
As discussed earlier, meta-analyses offer the attractive proposition that accumulation of knowledge can be achieved through a procedure that compares and combines results across experiments. But the integrative approach is different in at least three important ways.
First, meta-analyses – as well as systematic reviews and integrative conceptual reviews – are by nature post hoc mechanisms for performing integration: The synthesis and integration steps occur after the data are collected and the results are published. Therefore, it can take years of waiting for studies to accumulate "naturally" before one can attempt to "put them together" via meta-analyses (if at all, as the vast majority of published effects are never meta-analyzed). More importantly, because commensurability is not a first-order consideration of one-at-a-time studies, attempts to synthesize collections of such studies after the fact are intrinsically challenging. The integrative approach is distinct in that it treats commensurability as a first-order consideration that is baked into the research design at the outset (i.e., ex ante). As we have argued, the main benefit of ex ante over ex post integration is that the explicit focus on commensurability greatly eases the difficulty of comparing different studies and hence integrating their findings (whether similar or different). In this respect, our approach can be viewed as a "planned meta-analysis" that is explicitly designed to sample conditions more broadly, minimize sampling bias, and efficiently reveal how effects vary across conditions. Although running an integrative experiment may take more time and effort (and thus money) than running a single traditional experiment, this cost is far smaller than the accumulated cost of all the original research on which a typical meta-analysis depends (see sect. 5.6 for a discussion about costs).
Second, although a meta-analysis typically aims to estimate the size of an effect by aggregating (e.g., averaging) over design variations across experiments, our emphasis is on trying to map the variation in an effect across an entire design space. While some meta-analyses with sufficient data attempt to determine the heterogeneity of the effect of interest, these efforts are typically hindered by the absence of systematic data on the variations in design choices (as well as in methods).
Third, publication bias induced by selective reporting of conditions and results – known as the file drawer problem (Carter, Schönbrodt, Gervais, & Hilgard, Reference Carter, Schönbrodt, Gervais and Hilgard2019; Rosenthal, Reference Rosenthal1979) – can lead to biased effect-size estimates in meta-analyses. While there are methods for identifying and correcting such biases, one cannot be sure of their effectiveness in any particular case because of their sensitivity to untestable assumptions (Carter et al., Reference Carter, Schönbrodt, Gervais and Hilgard2019; Cooper, Hedges, & Valentine, Reference Cooper, Hedges and Valentine2019). Another advantage of the integrative approach is that it is largely immune to such problems because all sampled experiments are treated as informative, regardless of the novelty or surprise value of the individual findings, thereby greatly reducing the potential for bias.
5.3. How do integrative experiments differ from other recent innovations in psychology?
There have been several efforts to innovate on traditional experiments in the behavioral and social sciences. One key innovation is collaboration by multiple research labs to conduct systematic replications or to run larger-scale experiments than had previously been possible. For instance, the Many Labs initiative coordinated numerous research labs to conduct a series of replications of significant psychological results (Ebersole et al., Reference Ebersole, Atherton, Belanger, Skulborstad, Allen, Banks and Nosek2016; Klein et al., Reference Klein, Ratliff, Vianello, Adams, Bahník, Bernstein and Nosek2014, Reference Klein, Vianello, Hasselman, Adams, Adams, Alper and Nosek2018). This effort has itself been replicated in enterprises such as the ManyBabies Consortium (ManyBabies Consortium, 2020), ManyClasses (Fyfe et al., Reference Fyfe, de Leeuw, Carvalho, Goldstone, Sherman, Admiraal and Motz2021), and ManyPrimates (Many Primates et al., Reference Altschul, Beran, Bohn, Call, DeTroy, Duguid and Watzek2019), which pursue the same goal with more-specialized populations, and in the DARPA SCORE program, which did so over a representative sample of experimental research in the behavioral and social sciences (Witkop, Reference Witkopn.d.).Footnote 13 The Psychological Science Accelerator brings together multiple labs with a different goal: To evaluate key findings in a broader range of participant populations and at a global scale (Moshontz et al., Reference Moshontz, Campbell, Ebersole, IJzerman, Urry, Forscher and Chartier2018). Then, there is the Crowdsourcing Hypothesis Tests collaboration, which assigned 15 research teams to each design a study targeting the same hypothesis, varying in methods (Landy et al., Reference Landy, Jia, Ding, Viganola, Tierney, Dreber and Uhlmann2020). Moreover, there is a recent trend in behavioral science to run “megastudies,” in which researchers test a large number of treatments in a single study in order to increase the pace and comparability of experimental results (Milkman et al., Reference Milkman, Patel, Gandhi, Graci, Gromet, Ho and Duckworth2021, Reference Milkman, Gandhi, Patel, Graci, Gromet, Ho and Duckworth2022; Voelkel et al., Reference Voelkel, Stagnaro, Chu, Pink, Mernyk, Redekopp and Willer2022).
All of these efforts are laudable and represent substantial methodological advances that we view as complements to, not substitutes for, integrative designs. What is core to the integrative approach is the explicit construction of, sampling from, and building theories upon a design space of experiments. Each ongoing innovation can contribute to the design of integrative experiments in its own way. For example, large-scale collaborative networks such as Many Labs can run integrative experiments together by assigning points in the design space to participating labs. Or in the megastudy research design, the interventions selected by researchers can be explicitly mapped into design spaces and then analyzed in a way that aims to reveal contingencies and generate metatheories of the sort discussed in section 3.3.
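As a minimal sketch of the lab-assignment idea, the snippet below distributes sampled design points across collaborating labs in round-robin fashion; the lab names, data structure, and assignment rule are illustrative assumptions, not a description of how any existing consortium operates.

```python
from collections import defaultdict
from itertools import cycle

def assign_design_points(design_points, labs):
    """Round-robin assignment of sampled design points to participating labs.
    A real collaboration might instead weight assignments by each lab's capacity."""
    assignments = defaultdict(list)
    for lab, point in zip(cycle(labs), design_points):
        assignments[lab].append(point)
    return dict(assignments)

# Hypothetical example: nine sampled design points split across three labs.
design_points = [{"design_point_id": i} for i in range(9)]
schedule = assign_design_points(design_points, ["Lab A", "Lab B", "Lab C"])
print({lab: len(points) for lab, points in schedule.items()})  # each lab gets three points
```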
5.4. What about unknown unknowns?
There will always be systematic, nontrivial variables that should be represented in the design space but are missing – the unknown unknowns. We offer two responses to this challenge.
First, we acknowledge the challenge inherent in the first step of integrative experiment design: Constructing the design space. This construction requires identifying the subset of variables to include from an infinite set of possible variables that could define the design space of experiments within a domain. To illustrate such a process, we discussed the example domain of group synergy (see sect. 3.1). But, of course, we think that the field is wide open, with many options to explore; that the methodological details will depend on the domain of interest; and that best practices will emerge with experience.
Second, although we do not yet know which of the many potentially relevant dimensions should be selected to represent the space, and there are no guarantees that all (or even most) of the selected dimensions will play a role in determining the outcome, the integrative approach can shed light on both issues. On the one hand, experiments that map to the same point in the design space but yield different results indicate that some important dimension is missing from the representation of the space. On the other, experiments that systematically vary in the design space but yield similar results could indicate that the dimensions where they differ are irrelevant to the effect of interest and should be collapsed.
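The two diagnostics just described can be stated algorithmically. The sketch below flags design points whose experiments disagree (suggesting a missing dimension) and dimensions whose variation leaves results essentially unchanged (candidates for collapsing); the toy records, effect sizes, and thresholds are illustrative assumptions.

```python
from collections import defaultdict
from statistics import pvariance

# Toy records: (design-point coordinates, observed effect size); all values are illustrative.
experiments = [
    ({"group_size": 3, "task": "memory", "incentive": "flat"}, 0.41),
    ({"group_size": 3, "task": "memory", "incentive": "flat"}, -0.22),  # replicate disagrees
    ({"group_size": 6, "task": "memory", "incentive": "flat"}, 0.38),
    ({"group_size": 6, "task": "memory", "incentive": "piece-rate"}, 0.39),
]

def points_with_disagreement(experiments, threshold=0.05):
    """Design points whose replicates disagree (effect-size variance above `threshold`),
    hinting that some important dimension is missing from the space."""
    by_point = defaultdict(list)
    for point, effect in experiments:
        by_point[tuple(sorted(point.items()))].append(effect)
    return [dict(coords) for coords, effects in by_point.items()
            if len(effects) > 1 and pvariance(effects) > threshold]

def collapsible_dimensions(experiments, tolerance=0.05):
    """Dimensions whose variation (all else held equal) changes the effect by at most
    `tolerance`, suggesting they could be collapsed."""
    collapsible = []
    for dim in experiments[0][0]:
        by_rest = defaultdict(list)
        for point, effect in experiments:
            rest = tuple(sorted((k, v) for k, v in point.items() if k != dim))
            by_rest[rest].append((point[dim], effect))
        spans = [max(e for _, e in grp) - min(e for _, e in grp)
                 for grp in by_rest.values()
                 if len({v for v, _ in grp}) > 1]      # the dimension actually varies here
        if spans and max(spans) <= tolerance:
            collapsible.append(dim)
    return collapsible

print(points_with_disagreement(experiments))   # flags the group_size=3 point
print(collapsible_dimensions(experiments))     # ['incentive']
```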
5.5. This sounds great in principle but it is impossible to do in practice
Even with an efficient sampling scheme, integrative designs are likely to require a much larger number of experiments than is typical in the one-at-a-time paradigm; therefore, practical implementation is a real concern. However, given recent innovations in virtual lab environments, participant sourcing, mass collaboration mechanisms, and machine-learning methods, the approach is now feasible.
5.5.1. Virtual lab environments
Software packages such as jsPsych (de Leeuw, Reference de Leeuw2015), nodeGame (Balietti, Reference Balietti2017), Dallinger (https://dallinger.readthedocs.io/), Pushkin (Hartshorne, de Leeuw, Goodman, Jennings, & O'Donnell, Reference Hartshorne, de Leeuw, Goodman, Jennings and O'Donnell2019), Hemlock (Bowen, Reference Bowenn.d.), and Empirica (Almaatouq et al., Reference Almaatouq, Becker, Houghton, Paton, Watts and Whiting2021b) support the development of integrative experiments that can systematically cover an experimental design's parameter space with automatically executed conditions. Even with these promising tools, whose development is ongoing, we believe that one of the most cost-effective ways to accelerate and improve progress in social science is to increase investment in automation (Yarkoni et al., Reference Yarkoni, Eckles, Heathers, Levenstein, Smaldino and Lane2019).
5.5.2. Recruiting participants
Another logistical challenge to integrative designs is that adequately sampling the space of experiments will typically require a large participant pool from which the experimenter can draw, often repeatedly. As it stands, the most common means of recruiting participants online involves crowdsourcing platforms (Horton, Rand, & Zeckhauser, Reference Horton, Rand and Zeckhauser2011; Mason & Suri, Reference Mason and Suri2012). The large-scale risky-choice dataset described above, for example, used this approach to collect its 10,000 pairs of gambles (Bourgin et al., Reference Bourgin, Peterson, Reichman, Russell, Griffiths, Chaudhuri and Salakhutdinov2019). However, popular crowdsourcing platforms such as Amazon Mechanical Turk (Litman, Robinson, & Abberbock, Reference Litman, Robinson and Abberbock2017) were designed for basic labeling tasks, which can be performed by a single person and require low levels of effort. And the crowdworkers performing the tasks may have widely varying levels of commitment and produce work of varying quality (Goodman, Cryder, & Cheema, Reference Goodman, Cryder and Cheema2013). Researchers are prevented by Amazon's terms of use from knowing whether crowdworkers have participated in similar experiments in the past, possibly as professional study participants (Chandler, Mueller, & Paolacci, Reference Chandler, Mueller and Paolacci2014). To accommodate behavioral research's special requirements, Prolific and other services (Palan & Schitter, Reference Palan and Schitter2018) have made changes to the crowdsourcing model, such as by giving researchers greater control over how participants are sampled and over the quality of their work.
It is also possible to recruit larger, more diverse populations of volunteers, as the Moral Machine experiment exemplifies. In the first 18 months after deployment, that team gathered more than 40 million moral judgments from over 4 million unique participants in 233 countries and territories (Awad, Dsouza, Bonnefon, Shariff, & Rahwan, Reference Awad, Dsouza, Bonnefon, Shariff and Rahwan2020). Recruiting such large samples of volunteers is appealing; however, doing so successfully requires participant-reward strategies such as gamification or personalized feedback (Hartshorne et al., Reference Hartshorne, de Leeuw, Goodman, Jennings and O'Donnell2019; Li, Germine, Mehr, Srinivasan, & Hartshorne, Reference Li, Germine, Mehr, Srinivasan and Hartshorne2022). Consequently, the model has been hard to generalize to other important research questions and experiments, particularly when taking part in the experiment does not appear to be fun or interesting. Moreover, large-scale data collection using viral platforms such as the Moral Machine may require some flexibility from Institutional Review Boards (IRBs), as such platforms resemble software products that are open to consumers more than they do closed experiments that recruit from well-organized, intentional participant pools. In the Moral Machine experiment, for example, the MIT IRB approved moving consent to an "opt-out" option at the end, rather than requiring consent prior to participation, because the latter would have significantly increased participant attrition (Awad et al., Reference Awad, Dsouza, Kim, Schulz, Henrich, Shariff and Rahwan2018).
5.5.3. Mass collaboration
Obtaining a sufficiently large sample may require leveraging emerging forms of organizing research in the behavioral and social sciences, such as distributed collaborative networks of laboratories (Moshontz et al., Reference Moshontz, Campbell, Ebersole, IJzerman, Urry, Forscher and Chartier2018). As we discussed earlier, in principle, large-scale collaborative networks can cooperatively run integrative experiments by assigning points in the design space to participating labs.
5.5.4. Machine learning
The physical and life sciences have benefited greatly from machine learning. Astrophysicists use image-classification systems to interpret the massive amounts of data recorded by their telescopes (Shallue & Vanderburg, Reference Shallue and Vanderburg2018). Life scientists use statistical methods to reconstruct phylogeny from DNA sequences and use neural networks to predict the folded structure of proteins (Jumper et al., Reference Jumper, Evans, Pritzel, Green, Figurnov, Ronneberger and Hassabis2021). Experiments in the social and behavioral sciences, in contrast, have seen relatively few methodological breakthroughs related to these technologies. While social and behavioral scientists in general have embraced "big data" and machine learning, their focus to date has largely been on nonexperimental data.Footnote 14 Experiments in these fields do not typically produce data at the volumes necessary for machine-learning models to yield substantial benefits over traditional methods.
Integrative experiments offer several new opportunities for machine-learning methods to be used to facilitate social and behavioral science. First, by producing larger datasets – either within a single experiment or across multiple integrated experiments in the same design space – the approach makes it possible to use a wider range of machine-learning methods, particularly ones less constrained by existing theories. This advantage is illustrated by the work of Peterson et al. (Reference Peterson, Bourgin, Agrawal, Reichman and Griffiths2021), whose neural network models were trained on human choice data to explore the implications of different theoretical assumptions for predicting decisions. Second, these methods can play a valuable role in helping scientists make sense of the many factors that potentially influence behavior in these larger datasets, as in Agrawal et al.'s (Reference Agrawal, Peterson and Griffiths2020) analysis of the Moral Machine data. Finally, machine-learning techniques are a key part of designing experiments that efficiently explore large design spaces, as they are used to define surrogate models that are the basis for active sampling methods.
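To illustrate the last point, the sketch below pairs a Gaussian-process surrogate with a simple uncertainty-sampling rule for choosing which design point to run next; the two-dimensional design space, the simulated "experiment," the kernel, and the acquisition rule are all illustrative assumptions rather than the specific methods used in the studies cited above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Numerically encoded design space (illustrative: two continuous design dimensions).
candidate_points = rng.uniform(0, 1, size=(500, 2))

def run_experiment(point):
    """Stand-in for actually running an experiment at this design point."""
    x, y = point
    return float(np.sin(3 * x) * np.cos(2 * y) + rng.normal(0, 0.1))

# Seed the surrogate with a handful of randomly chosen design points.
observed = list(rng.choice(len(candidate_points), size=10, replace=False))
effects = [run_experiment(candidate_points[i]) for i in observed]

surrogate = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(30):                                   # active-sampling loop
    surrogate.fit(candidate_points[observed], effects)
    _, std = surrogate.predict(candidate_points, return_std=True)
    std[observed] = -np.inf                           # never re-select an already-run point
    next_point = int(np.argmax(std))                  # uncertainty sampling: most uncertain next
    observed.append(next_point)
    effects.append(run_experiment(candidate_points[next_point]))

print(f"Ran {len(observed)} experiments; the surrogate now predicts the rest of the space.")
```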
5.6. Even if such experiments are possible, costs will be prohibitive
It is true that integrative experiments are more expensive to run than individual one-at-a-time experiments, which may partly explain why the former have not yet become more popular. However, this comparison is misleading because it ignores the cost of human capital in generating scientific insight. Assume that a typical experimental paper in the social and behavioral sciences reflects on the order of $100,000 of labor costs in the form of graduate students or postdocs designing and running the experiment, analyzing the data, and writing up the results. Under the one-at-a-time approach, such a paper typically contains just one or at most a handful of experiments. The next paper builds upon the previous results and the process repeats. With hundreds of articles published over a few decades, the cumulative cost of a research program that explores roughly 100 points in the implicit design space easily reaches tens of millions of dollars.
Of those tens of millions of dollars, a tiny fraction – on the order of $1,000 per paper, or $100,000 per research program (<1%) – is spent on data collection. If instead researchers conducted a single integrative experiment that covered the entire design space, they could collect all the data produced by the entire research program and then some. Even if this effort explored the design space significantly less efficiently than the traditional research program, requiring 10 times more data, data collection would cost about $1,000,000 (<10%). This is a big financial commitment, but the labor costs for interpreting these data do not scale with the amount of data. So, even if researchers needed to commit 10 times as much labor as for a typical research paper, they would have discovered everything an entire multidecade research program would uncover in a single study costing only $2,000,000.
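For readers who prefer the arithmetic spelled out, the snippet below restates the comparison above using only the article's own illustrative figures; no new data or estimates are introduced.

```python
# The illustrative figures from the text, restated as arithmetic.
labor_per_paper = 100_000       # $ of labor behind a typical one-at-a-time paper
program_data_cost = 100_000     # $ of data collection across the whole one-at-a-time program (<1%)

integrative_data_cost = 10 * program_data_cost   # ten times the program's data: $1,000,000 (<10%)
integrative_labor_cost = 10 * labor_per_paper    # ten times a single paper's labor: $1,000,000
integrative_total = integrative_data_cost + integrative_labor_cost

print(f"Integrative experiment, all in: ${integrative_total:,}")  # $2,000,000, versus the
# tens of millions of dollars accumulated by the equivalent multidecade research program.
```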
The cost–benefit ratio of integrative experiments is hence at least an order of magnitude better than that of one-at-a-time experiments.Footnote 15 Pinching pennies on data collection results in losing dollars (and time and effort) in labor. If anything, when considered in aggregate, the efficiency gains of the integrative approach will be substantially greater than this back-of-the-envelope calculation suggests. As an institution, the social and behavioral sciences have spent tens of billions of dollars during the past half-century.Footnote 16 With integrative designs, a larger up-front investment can save decades of unfruitful investigation and instead realize grounded, systematic results.
5.7. Does this mean that small labs can't participate?
Although the high up-front costs of designing and running an integrative experiment may seem to exclude small labs as well as principal investigators (PIs) from low-resource institutions, we anticipate that the integrative approach will actually broaden the range of people involved in behavioral research. The key insight here is that the methods and infrastructure needed to run integrative experiments are inherently shareable. Thus, while the development costs are indeed high, once the infrastructure has been built, the marginal costs of using it are low – potentially even lower than running a single, one-at-a-time experiment. As long as funding for the necessary technical infrastructure is tied to a requirement for sustaining collaborative research (as discussed in previous sections), it will create opportunities for a wider range of scientists to be involved in integrative projects and for researchers at smaller or undergraduate-focused institutions to participate in ambitious research efforts.
Moreover, research efforts in other fields illustrate how labs of different sizes can make different kinds of contributions. In biology and physics, some groups of scientists form consortia that work together to define a large-scale research agenda and seek the necessary funding (as described earlier, several thriving experimental consortia in the behavioral sciences illustrate this possibility). Other groups develop theory by digging deeper into the data produced by these large-scale efforts to make discoveries they may not have imagined when the data were first collected; some scientists focus on answering questions that do not require large-scale studies, such as the properties of specific organisms or materials that can be easily studied in a small lab; still other researchers conduct exploratory work to identify the variables or theoretical principles that may be considered in future large-scale studies. We envision a similar ecosystem for the future of the behavioral sciences.
5.8. Shouldn't the replication crisis be resolved first?
The replication crisis in the behavioral sciences has led to much reflection about research methods and substantial efforts to conduct more-replicable research (Freese & Peterson, Reference Freese and Peterson2017). We view our proposal as being consistent with these goals, but with a different emphasis than replication. To some extent, our approach is complementary to replication and can be pursued in parallel with it, but it may suggest a different allocation of resources than a "replication first" approach.
Discussing the complementary role first, integrative experiments naturally support replicable science. Because choices about nuisance variables are rarely documented systematically in the one-at-a-time paradigm, it is not generally possible to establish how similar or different two experiments are. This observation may account for some recently documented replication failures (Camerer et al., Reference Camerer, Dreber, Holzmeister, Ho, Huber, Johannesson and Wu2018; Levinthal & Rosenkopf, Reference Levinthal and Rosenkopf2021). While the replication debate has focused on shoddy research practices (e.g., p-hacking) and bad incentives (e.g., journals rewarding “positive, novel, and exciting” results), another possible cause of nonreplication is that the replicating experiment is in fact sufficiently dissimilar to the original (usually as a result of different choices of nuisance parameters) that one should not expect the result to replicate (Muthukrishna & Henrich, Reference Muthukrishna and Henrich2019; Yarkoni, Reference Yarkoni2022). In other words, without operating within a space that makes experiments commensurate, failures to replicate previous findings are never conclusive, because doubt remains as to whether one of the many possible moderator variables explains the lack of replication (Cesario, Reference Cesario2014). Regardless of whether an experimental finding's fragility to (supposedly) theoretically irrelevant parameters should be considered a legitimate defense of the finding, the difficulty of resolving such arguments further illustrates the need for a more explicit articulation of theoretical scope conditions.
The integrative approach, which accepts that treatment effects vary across conditions, also suggests that directing massive resources toward replicating existing effects may not be the best way to help our fields advance. Because those historical effects were discovered under the one-at-a-time approach, the corresponding experiments evaluated only specific points in the design space. Consistent with the argument above, rather than trying to perfectly reproduce those points in the design space (via "direct" replications), a better use of resources would be to sample the design space more extensively and use continuous measures to compare different studies (Gelman, Reference Gelman2018). In this way, researchers can not only discover whether historical effects replicate, but also draw stronger conclusions about whether (and to what extent) they generalize.
5.9. This proposal is incompatible with incentives in the social and behavioral sciences
Science does not occur in a vacuum. Scientists are constantly evaluated by their peers as they submit papers for publication, seek funding, apply for jobs, and pursue promotions. For the integrative approach to become widespread, it must be compatible with the incentives of individual behavioral scientists, including early career researchers. Given the current priority that hiring, tenure & promotion, and awards committees in the social and behavioral sciences place on identifiable individual contributions (e.g., lead authorship of scholarly works, perceived “ownership” of distinct programs of research, leadership positions, etc.), a key pragmatic concern is that the large-scale collaborative nature of integrative research designs might make them less rewarding than the one-at-a-time paradigm for anyone other than the project leaders.
Although a shift to large-scale, collaborative science does indeed present an adoption challenge, it is encouraging to note that even more dramatic shifts have taken place in other fields. In physics, for example, some of the most important results in recent decades – the discovery of the Higgs Boson (Aad et al., Reference Aad, Abajyan, Abbott, Abdallah, Abdel Khalek, Abdelalim and Zwalinski2012), gravitational waves (Abbott et al., Reference Abbott, Abbott, Abbott, Abernathy, Acernese and Ackley2016), and so on – have been obtained via collaborations of thousands of researchers.Footnote 17 To ensure that junior team members are rewarded for their contributions, many collaborations maintain “speaker lists” that prominently feature early career researchers, offering them a chance to appear as the face of the collaboration. When these researchers apply for jobs or are considered for promotion, the leader of the collaboration writes a letter of recommendation that describes the scientists' role in the collaboration and why their work is significant. A description of such roles can also be included directly in manuscripts through the Contributor Roles Taxonomy (Allen, Scott, Brand, Hlava, & Altman, Reference Allen, Scott, Brand, Hlava and Altman2014), a high-level taxonomy with 14 roles that describe typical contributions to scholarly output; the taxonomy has been adopted as an American National Standards Institute (ANSI)/National Information Standards Organization (NISO) standard and is beginning to see uptake (National Information Standards Organization, 2022). Researchers who participate substantially in creating the infrastructure used by a collaborative effort can receive “builder” status, appearing as coauthors on subsequent publications that use that infrastructure. Many collaborations also have mentoring plans designed to support early career researchers. Together, these mechanisms are intended to make participation in large collaborations attractive to a wide range of researchers at various career stages. While acknowledging that physics differs in many ways from the social and behavioral sciences, we nonetheless believe that the model of large collaborative research efforts can take root in the latter. Indeed, we have already noted the existence of several large collaborations in the behavioral sciences that appear to have been successful in attracting participation from small labs and early career researchers.
6. Conclusion
The widespread approach of designing experiments one-at-a-time – under different conditions with different participant pools, and with nonstandardized methods and reporting – is problematic because it is at best an inefficient way to accumulate knowledge, and at worst it fails to produce consistent, cumulative knowledge. The problem clearly will not be solved by increasing sample sizes, focusing on effect sizes rather than statistical significance, or replicating findings with preregistered designs. We instead need a fundamental shift in how to think about theory construction and testing.
We describe one possible approach, one that promotes commensurability and continuous integration of knowledge by design. In this “integrative” approach, experiments would not just evaluate a few hypotheses but would explore and integrate over a wide range of conditions that deserve explanation by all pertinent theories. Although this kind of experiment may strike many as atheoretical, we believe the one-at-a-time approach owes its dominance not to any particular virtues of theory construction and evaluation but rather to the historical emergence of experimental methods under a particular set of physical and logistical constraints. Over time, generations of researchers have internalized these features to such an extent that they are thought to be inseparable from sound scientific practice. Therefore, the key to realizing our proposed type of reform – and to making it productive and useful – is not only technical, but also cultural and institutional.
Acknowledgments
We owe an important debt to Saul Perlmutter, Serguei Saavedra, Matthew J. Salganik, Gary King, Todd Gureckis, Alex “Sandy” Pentland, Thomas W. Malone, David G. Rand, Iyad Rahwan, Ray E. Reagans, and the members of the MIT Behavioral Lab and the UPenn Computational Social Science Lab for valuable discussions and comments. This article also benefited from conversations with dozens of people at two workshops: (1) “Scaling Cognitive Science” at Princeton University in December 2019, and (2) “Scaling up Experimental Social, Behavioral, and Economic Science” at the University of Pennsylvania in January 2020.
Financial support
This work was supported in part by the Alfred P. Sloan Foundation (2020-13924) and the NOMIS Foundation.
Competing interest
None.