1. Introduction
Observations are central to empirical science, though what counts as an observation is anything but obvious: van Fraassen (1980) introduced a notion that allowed him to distinguish observation from inference by tying observation to unaided sense perception; Shapere (1982) criticized empiricist notions as inappropriate to scientific usage, but his own account was in turn criticized as too narrow (Bogen and Woodward 1988) or even off target (Linden 1992). So what is observation, and what role does it play in the generation of scientific knowledge?
Furthermore, there is a complicated relation between observation and experiment that “mainstream philosophy of science has had rather little to say about” (Okasha 2011, 223). On one hand, experiments seem unthinkable without observations: Michelson and Morley (1887) observed interference fringes to determine earth’s motion relative to the ether, and Geiger and Marsden (1913) observed scintillations on a fluorescent screen to probe the nucleus’s structure. On the other hand, “observational” is sometimes used as an antonym to “experimental,” and we see claims to experiment’s epistemic superiority over observation (Okasha 2011, 226–27; Woodward 2003b, 43–45). But the latter claim cannot be right if the former point is true as well. So how does observation relate to experiment?
These questions can be answered only after due disambiguation. I shall hence distinguish “observation in the technical sense” (TO) from “experiential observation” (EO) as a concept closely tied to experience and from “field observation” (FO) as a notion that reasonably contrasts with experiment.
This threefold distinction will prove helpful in answering questions concerning the epistemic role of observation in science. Specifically, I will here argue that FO is by no means generally epistemically inferior to experiment: in certain cases, it may even enjoy systematic epistemic advantages due to its unperturbing nature.
The first part (sections 2 and 3) introduces the three notions and their relations. This requires going into the relation between observations and data, as the kind of data taking involved distinguishes experiment from observation and is vital for evaluating the epistemic priority between them.Footnote 1 The second part (sections 4–6) then focuses on this epistemic priority between experiment and observation, as recently scrutinized also by Boyd and Matthiessen (2023).
2. Three notions of observation
There have been various attempts to define “observation” in general terms, but all of them are wanting in some respect or another.Footnote 2 Arguably, this connects to the fact that scientists’ use of “observation” “is typically relativized to the inquiry they have in hand” (Fodor 1984, 25). For instance, in high-energy physics (HEP), “observation” has a decidedly statistical character:
if you want to claim, at least in high-energy physics, that you have observed a phenomenon, your result must be at least five standard deviations above background. (Franklin 2013, 1)
Thus “observation” here means an excess of specific activities in a particle detector that cannot be explained as a random fluctuation but indicates the presence of a sought-for particle.
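To make this statistical character concrete, the following minimal sketch (in Python, assuming scipy is available; the event counts are purely illustrative) shows how an excess over an expected background is converted into a significance in standard deviations and checked against the five-sigma threshold:

```python
# Minimal sketch of the "five sigma" criterion; all counts are illustrative.
from scipy import stats

expected_background = 100.0   # events expected from background processes alone
observed_count = 160          # events actually recorded in the signal region

# p-value: probability of at least this many events from background alone
p_value = stats.poisson.sf(observed_count - 1, mu=expected_background)

# Convert the p-value into a one-sided Gaussian significance ("number of sigmas")
significance = stats.norm.isf(p_value)

print(f"p = {p_value:.2e}, significance = {significance:.1f} sigma")
print("claim an observation" if significance >= 5 else "excess is not yet an observation")
```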
In contrast, tissue biologists use advanced microscopes to gain insight into things like the interaction between nanoparticles and biological tissues (Jin, Bae, and Hong 2010). Atomic force microscopes, for example, direct a laser beam onto a cantilever with a sharp tip that interacts with the biological material through various forces. Because of this interaction, the cantilever is deflected and the reflected light altered, so that a differential image of the tissue is generated. In this way, biologists have “directly observed” the impact of nanoparticles on biological membranes through such things as the “formation of nanoscale holes …, membrane thinning, and/or membrane erosion” (Jin, Bae, and Hong 2010, 815).
These usages of “observation” in HEP and biology are clearly distinct. Yet they have commonalities: both involve close causal contact with the studied system (also Bird 2022). An “observation” in HEP is an observation because the relevant type of particle has been produced and decayed into characteristic products that interact with the detector so often that the resulting data cannot be discarded as a statistical fluctuation. Likewise, an “observation” in tissue biology is an observation because the cantilever interacts with the tissue through atomic forces.
A second unifying characteristic is that these are success terms: only if a certain level of statistical significance is exceeded, or an image can be interpreted as showing the action of certain nanoparticles, can observation be claimed. Hence I suggest collecting these different notions under a common heading and speaking of “observation in the technical sense” (TO):
$x$ makes an observation in a technical sense (TO) on $y$ iff $x$ successfully establishes some relevant claim $c$ about $y$ by means of close causal contact with $y$ within a scientific inquiry.
This defines a family of terms because standards of relevance and success vary with the field and context of inquiry, as the examples show. “Success” must not be misinterpreted though: null results can represent tremendous successes (think Michelson–Morley). What does forestall epistemic productivity is when research remains “inconclusive,” that is, when the conditions for applying the respective notion of TO have been met neither positively nor negatively.Footnote 3
“Observation” in a nontechnical sense is arguably different. “Seeing with the unaided eye” may be “a clear case” (van Fraassen 1980, 16), but only if this includes paying attention to a certain property, pattern, or object (Shapere 1982, 507). For instance, observing a bird in the backyard is distinguished from merely gazing out the window exactly by the fact that dedicated attention is being paid to the bird; observing the color shift of a TV is distinguished in the same way from merely watching TV. Hence I suggest introducing a second notion, which we may preliminarily define as the paying of dedicated attention to an object of one’s sense perception.Footnote 4
I claimed that all experiments involve observation, but as we saw, this is not true if we mean this in the sense of TO: some experiments are inconclusive and thus yield no observations in the technical sense. However, is it at least true that all experiments involve observation-as-perception-plus-attention? Bird (2022, 169–70) discusses a science fiction scenario in which knowledge from an experiment is fed directly into a subject’s brain by means of an implant and so is gained without observation as perception-plus-attention. Hence there are imaginable experiments (and TOs) that could be done (or made) without perception.
This suggests that we should lean on a broader notion of “experience” than sense perception, for the subject would still experience the knowledge gain:
$x$ makes an experiential observation (EO) on $y$ iff $y$ is an object of $x$ ’s experience and $x$ pays dedicated attention to $y$ .
Now, if all experiments involve EOs and many even involve TOs, then neither EO nor TO defines a contrast class for “experiment.” So how can we make sense of the distinction between experiment and observation indicated in the introduction? I suggest that we must acknowledge a third, distinct notion that thus sensibly contrasts with experiment: that of a field observation, which we may preliminarily define as the unperturbed taking of data on an object of interest, that is, under natural conditions. In contrast, “manipulation” and “control” are the key terms defining experimentation.Footnote 5
For instance, a team of biologists analyzing the correlation between vocalizations of male and female rhinoceroses and the testosterone levels in the males’ feces during mating season with advanced software, technology, and statistics (Jenikejew et al. 2021) is seemingly engaged in an activity very similar to that of a team of particle physicists analyzing count rates of quantities computed from detector readouts with advanced software, technology, and statistics. However, while the biologists will take every precaution not to disturb the rhinoceroses, there is no way of measuring the relevant quantities pertaining to certain particles without exerting control over them.
Putting these intuitions into explicit definitions again would require discussion of further features of experimentation (such as repeatability; Currie and Levy 2019), but it will be sufficient to formulate criteria here that partially define both notions:
Process $p$ is a field observation (FO) of $y$ by $x$ only if, in the course of $p$ , $x$ takes data on $y$ in an unperturbed fashion, that is, without $x$ exerting control over $y$ by relevantly manipulating $y$ ’s state.
In contrast,
Process $p$ is an experiment on $y$ by $x$ only if, in the course of $p$ , $x$ takes data on $y$ while exerting control over $y$ by relevantly manipulating $y$ ’s state.
These conditions naturally extend to collectives of scientists: a process is an FO when none of the scientists in the collective takes data by manipulating $y$, and an experiment when some of them do. They should be widely agreeable: Currie and Levy (2019, 1067, 1084) define experiments as “controlled manipulations” and contrast these with “observational fieldwork”; Boyd and Matthiessen (2023, 111) acknowledge a notion of “experiment as active manipulation,” whereas “observation is … characteristically non-manipulative.”Footnote 6
Earlier, I clarified the relation of EO and TO to experiment, but what is their relation to FO? Obviously, all FOs also involve EO and may generate TOs: a representation that provides evidence for certain phenomena may be generated or the statistical frequency of some type of event may exceed some threshold. In sum (figure 1), EO is the most encompassing notion, TO may occur as part of FO or experiment, and only FO contrasts with experiment.
3. Data taking and experiential observation
Prima facie, experimental control seems like a good thing: we can ensure (say) that particles collide where we want them to, in the quantities needed for TOs of Higgses. However, exertion of control means a perturbation of the studied system that may irretrievably destroy subtle, sought-for effects. This issue will be addressed at length later, but we should first clarify the notions of “data” and “data taking” centrally involved in the distinction between experiment and FO.
Empiricists like Hempel (1952, 21) famously put great emphasis on “data … obtainable by direct experience,” but in the age of complex experimentation and computer-aided data taking, assuming an intimate connection between data and EO seems inappropriate. This aspect is prominent in the work of Leonelli (2015, 812), who emphasizes that data are
the results of complex processes of interaction between researchers and the world, which typically happen with the help of interfaces such as observational techniques, registration and measurement devices…. This is … also the case for data generated outside the controlled environment of the laboratory.
Thus an ornithologist watching a bird needs to write down selective results from her EO, or use a digital camera to make images and video clips, to create data. These data then are “conditioned both by the employment of specific techniques and instruments … and by the interests and position of the observer” (Leonelli 2015, 812).
Furthermore, in many scientific disciplines, even “raw” data are not connected to the experience of anything having to do with the system under study. To draw on the example again, high-energy physicists call “raw” those data “arriving from an experiment’s data acquisition system,” which are then “organized in ‘event records’” (Delfino 2020, 626): lists of numbers that constitute basic representations of the activity in the detector (see Jacobsen 2006, 4–5).Footnote 7 Data taking here takes place when the measurable currents created by the interaction of “debris”Footnote 8 from scattering events with the detector arrive at storage.
Indeed, “what counts as data,” at least as relevant data, “depends on who uses them, how, and for which purposes” (Leonelli 2015, 811). For example, particles’ energies, momenta, and angles relative to the colliding beams are usually computed as functions of HEP event records before analysis. Sometimes even higher functions are used, such as masses of decayed particles computed from energy–momentum conservation, and these are simply considered “high(er)-level data.”
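To illustrate what such higher-level data look like, here is a minimal sketch (in Python; the four-momenta are invented for illustration) of how the invariant mass of a decayed particle is computed from the energies and momenta of its detected decay products:

```python
# Minimal sketch: computing "higher-level" data (an invariant mass) from event-level
# four-momenta. All numbers are invented; units are GeV with c = 1.
import math

def invariant_mass(p1, p2):
    """Invariant mass of a two-body decay, m^2 = (E1 + E2)^2 - |p1 + p2|^2."""
    E = p1[0] + p2[0]
    px, py, pz = (p1[i] + p2[i] for i in range(1, 4))
    return math.sqrt(max(E**2 - (px**2 + py**2 + pz**2), 0.0))

# Two photon candidates, each given as (E, px, py, pz)
photon1 = (125.0, 125.0, 0.0, 0.0)
photon2 = (62.5, 0.0, 62.5, 0.0)
print(f"reconstructed diphoton mass: {invariant_mass(photon1, photon2):.1f} GeV")
```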
So data are representations of systems’ properties as exhibited in interactions, and raw data are generated by means of close causal contact. How and which of these properties are represented depends on the aims of the analysis.
Two things are noteworthy. First, in the preceding criteria for experiment and FO, I highlighted control and the unperturbing nature of the investigation, respectively. We can now make sense of this by taking into account the causal nature of data taking: if the act of data taking steers the studied system into a particular state, this cannot be an FO, though it might mean experimenting. If this feature is absent, this data taking cannot be part of an experiment, as control requires the manipulation of states. As I will argue, this can put FO at a systematic advantage, contrary to appearances.
Second, analyses aimed at establishing TOs target data, and we noted that EO is often quite distinct from data taking. Hence EO is not the main driving force behind the inferences made within those activities, so what role is left for it in modern science? I submit that EO usually functions as a mediator between FO or experiment and TO: only by witnessing certain displays on a computer screen, or by noticing information transmitted into the brain by a computer chip, can a scientist establish a claim of interest, based on experimental or field observational data.
As a corollary, empiricists remain at liberty to claim (van Fraassen 1980, 15) that our interpretation of EOs may change, leading to the reinterpretation and conceptual revision of many accepted TOs, but that the EOs themselves remain intact: EOs constitute the “phenomena” empiricists should want to save (Teller 2001, 135).Footnote 9
4. Benefits of experimental control
A number of authors have addressed the question of why increased control over data taking, as involved in experimentation, might imply an epistemic advantage. I focus on two discernible claims to epistemic superiority: to an increased ability to establish causal dependencies and to an increased ability to confirm lawlike connections.
The first claim has been voiced by many scientists (see Woodward 2003a, 88) and is an integral part of Woodward’s own account of causation. Accordingly, the most valuable experiments are those that, like randomized controlled trials (RCTs), most closely approximate interventions.Footnote 10
For instance, in medical RCTs (see Rothman, Greenland, and Lash 2008), patients are administered one of two treatments. The kind of treatment is assigned at random, and one of them is typically a placebo, which can be safely assumed not to have the desired effect. Furthermore, randomizing eliminates the possibility of unconsciously selecting a group composition that by itself has an effect. In this way, many possible alternative causal chains from the initial conditions of the trial to the final outcome can be statistically nullified.Footnote 11 So a significantly better recovery in the treatment group suggests that the treatment has the desired effect.
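The logic of randomization can be illustrated with a minimal simulation sketch (in Python with numpy; group sizes and effect sizes are invented): a hidden prognostic factor influences recovery on its own, but random assignment balances it across the two arms, so the difference in mean recovery tracks the treatment effect rather than the composition of the groups.

```python
# Minimal sketch of a randomized controlled trial; all numbers are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # patients

# A hidden prognostic factor (say, baseline health) that affects recovery on its own
baseline_health = rng.normal(0.0, 1.0, size=n)

# Random assignment: roughly half receive the treatment, half the placebo
treated = rng.random(n) < 0.5

true_treatment_effect = 0.3
recovery = baseline_health + true_treatment_effect * treated + rng.normal(0.0, 1.0, size=n)

# Because assignment is random, baseline health is balanced across arms, so the
# difference in mean recovery estimates the treatment effect itself.
estimate = recovery[treated].mean() - recovery[~treated].mean()
print(f"estimated treatment effect: {estimate:.2f} (true value {true_treatment_effect})")
```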
The upshot is that experimental manipulations, interpreted as active changes in the causal variables describing a studied system’s state, may offer a handle on seeing whether changes in $X$ do cause changes in $Y$ if they reasonably approximate “surgical” interventions, as other influences on $Y$ have been (statistically, and approximately) eliminated. This level of control is clearly missing in FO: our ability to plausibly infer that $X$ influences $Y$ by means of FO may crucially depend on, say, the availability of different lines of sufficiently diverse evidence, and this availability may depend on pure happenstance.
Turn to the second claim of epistemic priority: that experiment increases our ability to confirm lawlike connections. A Bayesian argument to this effect has been given by Okasha (2011). Confirming a law $\forall x(Fx \to Gx)$ by an FO to the effect that $Fa \wedge Ga$ for some $a$ can be problematic: in case the lawlike connection $\forall x(Fx \to Gx)$ doesn’t make it any likelier to meet an $F$ that is also $G$, conditioning one’s credences on $Fa \wedge Ga$ won’t increase the law’s probability.
For example,Footnote 12 assume that, for some contingent reason, all meteoroids in our solar system happen to be such that meteorites landing on earth have diameter greater than five centimeters. Additionally, assume that a law ensures that meteorites on earth would end up being greater than five centimeters in diameter should these contingencies cease to exist. Does the law make it any likelier that the next meteorite will be greater than five centimeters in diameter? Given how the scenario was set up, this is doubtful.
In contrast, in an experiment in which all $a$s are prepared to be $F$, $Fa$ becomes part of the knowledge base, and the law is bound to receive confirmation from the observation that $Ga$, so long as $0 < P_{Fa}(Ga) < 1$—which should hold while we still seek confirmation. Thus, producing a small meteoroid and making it fall to earth, we would probably be able to observe a meteorite that is smaller than five centimeters.
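The Bayesian step can be spelled out briefly (writing $L$ for the law). By Bayes’s theorem, $P(L \mid Fa \wedge Ga) = P(Fa \wedge Ga \mid L)\,P(L)/P(Fa \wedge Ga)$; so if the law makes the joint evidence no likelier, that is, if $P(Fa \wedge Ga \mid L) = P(Fa \wedge Ga)$, the posterior simply equals the prior $P(L)$, and the FO confirms nothing. In the experimental case, $Fa$ is already part of the knowledge base and the law entails $Ga$ given $Fa$, so $P(Ga \mid L \wedge Fa) = 1$; hence $P(L \mid Ga \wedge Fa) = P(L \mid Fa)/P_{Fa}(Ga) > P(L \mid Fa)$ whenever $0 < P_{Fa}(Ga) < 1$.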
Naively read, this argument seems to oversimplify, because we cannot always prepare our $a$s to be $F$. This is certainly true in the meteoroid example, but that basically just says that the envisioned experiment is not feasible. However, we also cannot prepare the particles produced in proton–proton collisions to be Higgs bosons. Should we thus take it that we cannot experimentally confirm that Higgs bosons have a mass of approximately 125 GeV?
I believe this would mean overstating the argument’s underlying intuition: even though our preparation method produces all kinds of things that are not Higgs events, we can at least select for such events by carefully selecting data points that fit the expected characteristics.Footnote 13 We then use these to check whether they exhibit the mass value expected on account of the “standard model.” But of course, this is possible only because the conditions of proton–proton collisions at the Large Hadron Collider (LHC) in Geneva are well controlled and, therefore, well known.
5. Systematic benefits of field observation
As we saw, there is an epistemic benefit to experimentation in HEP, as the relevant information would likely be impossible to acquire under less controlled conditions. But this is just one example. Generalized claims to an epistemic benefit from increased control have often been embraced unquestioningly. Among the few to argue for control’s benefits are Currie and Levy (2019, 1070ff.): according to them, control allows the isolation of a studied system from environmental factors so that one can reproducibly interact with the system’s relevant properties and retrieve more fine-grained information for discriminating between hypotheses. However, whether a general epistemic priority follows from this remains unclear.
In fact, Boyd and Matthiessen (2023, 123–66) have recently argued that it does not. In detail, Boyd and Matthiessen discuss the following factors that make an empirical activity epistemically privileged: signal clarity, characterization of backgrounds, and the discrimination and variability of precipitating conditions. Signal clarity means establishing the sensitivity of an apparatus to a given type of signal, as well as its being affected by processes not of interest, generically termed “noise” (124). “Backgrounds,” in contrast, are data contaminations that “can be attributed more specifically to certain sources” (124). Finally, precipitating conditions are “the conditions that produce the signal in the first place” (125). Hence discriminating these means seeking out various causes of a TOed effect or signal.
Boyd and Matthiessen (2023) succeed in providing real-world examples in which FOs can claim high performance on all these measures. This is an important achievement, but it does not quite establish whether intrinsic features of experiments can make them epistemically inferior (and hence FO intrinsically superior). In what follows, I discuss several cases in which the reasons for FO’s epistemic superiority have to do with an intrinsic factor: the absence of control. I will call these reasons “systematic,” in contrast to “contingent” ones, where it just so happens that certain pieces of information can be obtained only by means of FO.
To be clear on this issue, let me first briefly discuss those cases in which superiority does hinge on contingent factors. Astronomy provides a wealth of examples, as FOs here cannot be complemented by experiments (also Boyd and Matthiessen 2023). They are hence “all there is to go on” (Okasha 2011, 227). For instance, MIT describes the Event Horizon Telescope as “a group of observatories united to image the emission around supermassive black holes.”Footnote 14 The use of “observatory” here reflects the fact that we cannot prepare black holes and investigate their properties in a controlled fashion, as it just so happens that human beings lack command of the relevant scales of size and energy to perform such experiments.
However, to gather strong evidence about the laws of relativity, we might want to experiment on black holes: this could give us an edge in finding deviations, thereby tentatively confirming certain approaches to quantum gravity.
A distraction might be created by cases in which there are systematic deficiencies in actual experiments, but an experiment that could ultimately overrule FOs seems feasible. An example is caffeine research, in which experiments and FOs tend to highlight conflicting aspects in relation to health: whereas FOs suggest health benefits, such as cardioprotective effects and decreased risk for development of type 2 diabetes or even neurodegenerative conditions, experiments suggest adverse effects, such as increased systolic and diastolic blood pressure or increased blood glucose levels (James 2018).
The main problem associated with the experimental evidence here is the time scale, for “acute physiological effects tend to … abate within hours,” and RCTs have so far only been conducted on the scale of “weeks and months” (James 2018, 853). Hence there are limitations to the quality of experimental evidence that relate to intrinsic features of actual experiments but still fall short of establishing FOs’ superiority in this case:
Poorly understood confounder influence is a likely major cause of the enduring disjunction between the findings of experimental and observational studies…. Long-term randomised trials are needed to [understand] the health implications of lifelong coffee/caffeine consumption. (James 2018, 852–53)
In other words, the conflict between experimentation and FO here has nothing to do with features of experimentation per se: “coffee consumption is but one among numerous variables of life-style and environment,” whence long-term experiments that control the “many factors” that “may confound the relatively weak coffee-health associations reported in the observational literature” (James 2018, 852) might settle the debate.
A similar distraction arises when surrogate systems are experimented on. Famous examples are analogue (Dardashti, Thébault, and Winsberg 2017) and “bottle” experiments (Currie 2020). Analogue experiments involve a system that is easier to handle than the system of interest but is assumed to share a set of common laws with it under specific conditions (Dardashti, Thébault, and Winsberg 2017, 63ff.). Dardashti, Thébault, and Winsberg (2017) and Dardashti et al. (2019) argue that this delivers a basis for confirming facts about the targeted system; others (e.g., Crowther, Linnemann, and Wüthrich 2021) have been more skeptical. In any case, the fact that a different system is used makes this a surrogate experiment, something that has been suggested to define a general sense of simulation (Dardashti, Thébault, and Winsberg 2017; Boge 2019, 2020) or representation (Suárez 2004).
Due to the need for first establishing the connection between the targeted system and the system experimented on, it remains unclear whether such experiments are advantageous over FO if the latter is conducted on the right kind of system. But it seems clear that an experiment on the right kind of system would be advantageous.
Bottle experiments are another example (Currie 2020, 905), which, however, involve specimens from the relevant ontological domain. In ecology (where the term originates) these are experiments on “lab-raised, easily managed critters in highly artificial environments” (Currie 2020, 906). So, does this not provide an epistemic advantage over both analogue experiments and FO?
This seems doubtful, as the surrogate nature of bottle experiments nevertheless creates obstacles in confirming laws and causation, because it relies on what Currie (2020, 912) calls “extrapolationism”:
Surrogates, according to extrapolationism, target natural systems, and the resemblance between them facilitates extrapolating results from the former to the latter…. For the extrapolationist the value of an investigation is primarily due to its confirmatory prowess: it provides grounds for belief in some hypothesis pertaining to natural systems.
Despite the fact that a bottled ecosystem is an ecosystem, it is an additional assumption in need of justification that findings on the bottled system can be representative of those on its larger-scale counterpart.Footnote 15 Furthermore, these limitations are due to factors intrinsic to the experiment itself: they arise from the fact that a surrogate (scaled-down or merely analogous) system is being used. However, as with the meteoroid case, this does not establish that an experiment on an entire ecosystem would not be advantageous over bottle experiments and FO.
None of these examples is thus convincing as an example of systematic advantages of FO. As a kind of proof of concept, note that Boyd and Matthiessen (2023, 120) discuss causal models by Spirtes, Glymour, and Scheines (2000) in which “observation can distinguish between two hypotheses that experiment cannot.” Another such proof is delivered by the possibility of “intervention artifacts,” as discussed by Craver and Dan-Cohen (2024, 259):
Perhaps when $I$ alters $X$ it also influences the detection apparatus via a route that does not pass through $Y$ . Or perhaps some intermediate variable $S$ influences the detection in a way that foils our ability to assess the changes to $Y$ .
However, we are here looking for real-world cases that exemplify systematic disadvantages to experimentation. Hence, to see the general kind of problem associated with experimental evidence at work, consider the so-called Hawthorne effect (also Feest 2022). This effect was first discovered in experiments conducted by Roethlisberger and Dickson (1939) at Western Electric Company’s Hawthorne plant that were supposed to investigate the relation between workplace illumination and productivity.
The findings were curious: “the illumination was decreased step by step,” but “it was not until illumination in the experimental room was reduced to a level corresponding to moonlight that … productivity finally started to decline” (Wickström and Bendix 2000, 363). Later analysis suggested that the detailed engagement with the workers, which was supposed to ensure their cooperation in the study, led to an increase in motivation, which fully compensated for the effects of decreased lighting. Thus the very act of making workers participate in the experiment was in large part responsible for the outcome.
Today, the “Hawthorne effect” is used as an umbrella term for any kind of effect whereby controlled data taking on human subjects influences their behavior, and the evidence for this is fairly robust (McCambridge, Witton, and Elbourne 2014). However, control is definitive of experiments. Thus, insofar as the data taking relevantly alters subjects’ behaviors, an experiment cannot possibly reveal the sought-for information and suffers a systematic disadvantage.
Now, data taking is involved in FO as well, and subjects might alter their behaviors in virtue of the very fact that data are being taken on them. Thus maybe there is no advantage to FO after all? This is indeed a problem, but there is the option of concealing the data taking in FO. By definition, this is not possible in experiment: its data-taking activities involve manipulating the investigated system’s state.Footnote 16
Concealment of data taking has been discussed in the marketing sciences as a means of compensating for Hawthorne-like effects (Grove and Fisk 1992). An example is “mystery shopping,” whereby a participant (experiential) observer acts as a regular customer so as not to be recognized as an observer. Of course, this concealment might not work: the EOed subjects might notice some odd behavior from the participant observer or equally notice a hidden camera. But when executed skillfully, concealed FO can compensate for the problem of “fat-handed” manipulations, as involved in the Hawthorne effect.
An anonymous referee has confronted me with an interesting objection here: in so-called deception studies (Stricker 1967, 13), test subjects are misled about the attitudes, beliefs, and so on being probed. Hence, when the relevance condition involved in the partial definition of experiment offered earlier is taken into account, concealment of experimentation might be possible after all.
A prominent example is conformity experiments, such as those by Asch (1951). Here test subjects were instructed to offer perceptual judgments about sameness or difference between lengths of lines on paper. In reality, most participants were actors offering false judgments, and the conformity of actual test subjects’ judgments to the majority was probed.
Deceptions like these might seem to mitigate Hawthorne-like effects. However, Schulman (1967, 27) early on demonstrated that subjects’ responses varied as “a function of concern with the evaluation of [their] behavior,” by varying “whether the experimenter and the group were perceived by the subject … to observe (evaluate) [them].” In turn, this dependency might be mitigated by concealing the test subject from direct EO by other participants and the experimenter in the response situation. But regardless of this, participants’ suspicions about the purposes of a given experiment remain a delicate matter: Stricker (1967) reported this issue to be underconsidered, inadequately probed (for example, by binarized variables), or underestimated in many psychological studies.
To date, methods for probing for suspicion are varied, as are estimates of the percentage of suspicious participants, and a unified framework is missing (Barrett, Neuberg, and Luce 2023). Furthermore, the use of deception methods within psychological experiments is now widely known, whence the worry quickly arose that participants would become more and more unreliable sources of information over time (Kelman 1967). Thus it remains a legitimate concern that the very act of making subjects participate in a study can distort their responses, and this sort of effect cannot be handled by deception.
This reasonably establishes that FO may be advantageous for certain purposes in psychological research, but does this issue pertain only to the social sciences? I believe the answer is no: an issue quite analogous to the Hawthorne effect can be straightforwardly seen to arise in natural science experiments, as preparing a physical, chemical, or biological system in a particular way may accidentally introduce additional effects that spoil the informativeness of the outcome.Footnote 17
For example, Weber (2004, 287, emphasis omitted) points out that “preparation artifacts,” which “arise when the biological specimen is fixed, cut, stained, or decorated for light or electron microscopy,” are “one of the most frequent forms of error in biological laboratories.” Thus, depending on the type of artifact, an experimental study of biological materials may well become uninformative about the properties investigated, and in virtue of the very preparation method. However, it remains unclear whether an FO could here yield the sought-for information instead.
A clearer example is provided by the conflict between internal and external validity in medical RCTs. “Internal validity” refers to a study’s freedom from systematic biases, “external validity” to its generalizability. In RCTs, the attempt to achieve internal validity is “operationalized … as inclusion and exclusion criteria,” which lead to “a study population … with increasingly controlled conditions” (Averitt et al. 2020, 1). However, a treatment might have a nonrandom variability across different subgroups (Varadhan and Seeger 2013), and this information can be lost by exclusion of relevant subjects.
So ensuring internal validity relies crucially on exerting control by handcrafting treatment and control groups. At the same time, this might spoil generalizability. In particular, one can apply eligibility criteria from RCTs to select data from an FO. If the RCT is externally valid, this should not lead to differences in the comparison between FO and RCT—but nevertheless sometimes does (see Averitt et al. 2020, 2ff.). This ostensibly shows that there are pieces of information (such as the influence of “undocumented factors” on treatment variability; Averitt et al. 2020, 7) that are destroyed by the very act of exerting control.
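Schematically, such a check looks as follows (a sketch in Python with invented field names, criteria, and records, not the actual procedure of Averitt et al.): the field-observational records are filtered by the trial’s eligibility criteria, and the effect estimate in the trial-like subset is compared with the estimate in the full field data and with the trial’s reported effect.

```python
# Schematic check of an RCT's external validity against field-observational data.
# Field names, eligibility criteria, and records are invented for illustration.

def eligible(patient):
    """Apply the trial's inclusion/exclusion criteria to an observational record."""
    return 18 <= patient["age"] <= 65 and not patient["comorbidity"]

def mean_outcome(records, treated):
    group = [r["outcome"] for r in records if r["treated"] == treated]
    return sum(group) / len(group)

observational_records = [
    {"age": 30, "comorbidity": False, "treated": True,  "outcome": 0.9},
    {"age": 72, "comorbidity": False, "treated": True,  "outcome": 0.4},
    {"age": 45, "comorbidity": True,  "treated": False, "outcome": 0.3},
    {"age": 50, "comorbidity": False, "treated": False, "outcome": 0.5},
    # ... many more records in a realistic data set
]

# Restrict the field data to the trial-eligible subpopulation
trial_like = [r for r in observational_records if eligible(r)]

effect_all = mean_outcome(observational_records, True) - mean_outcome(observational_records, False)
effect_trial_like = mean_outcome(trial_like, True) - mean_outcome(trial_like, False)

# If the RCT were externally valid, the two estimates should roughly agree with
# each other and with the trial's reported effect; a discrepancy points to
# information lost through the trial's exclusion criteria.
print(f"effect in full field data: {effect_all:.2f}")
print(f"effect in trial-eligible subset: {effect_trial_like:.2f}")
```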
I have provided two examples in which intrinsic disadvantages of experiment are salient and FOs exist that can arguably yield the sought-for information. What to conclude from this in general? The least we can say is that whether experiment or FO is advantageous is a case-by-case decision and that this is due to features that make an empirical inquiry an experiment or FO. However, I would also point out that it is usually very hard to tell what the overall effects of manipulation are. Hence, in disciplines ranging from physics to social science, researchers should value FO as a complementary source of information that need not be seen as generally inferior but can also provide hints as to where experiment might go wrong.
6. A strict dichotomy?
I proposed that data-taking activities that involve control over a studied system are not FOs, whereas those that don’t are not experiments. This leaves it open whether there are data-taking activities that are neither. But are there any compelling cases?
Indeed, Perović (2021) argues that experiment and observation lie on a continuum but acknowledges that certain cases “are points at the far ends of the continuum in terms of their respective levels of manipulation.” It is unclear to me, however, why the distinctions drawn above should not suffice to cut that continuum in half.
First off, note the crucial qualifier “relevantly” in the criterion for FO. For instance, we may ask people to fill out a survey, and of course we would thereby manipulate their state, but not necessarily the relevant state: what the survey is supposed to find out is whether people antecedently happened to be in some state that led to certain responses in the survey. So carefully planned “observational studies” involving questionnaires may count as FO (rather than experiment) if they are indeed unperturbing in the desired sense.
Furthermore, consider the role of “field” in “field observation”: experiments may famously also be conducted in the field (e.g., Morgan 2013), but this merely means that a system is studied, in a controlled way, within its natural environment. It doesn’t mean that one leaves the system alone so that it exhibits its natural behavior. This, however, is what I take to be implied by the “field” in FO: that some naturally occurring sequence of states can be detected on $y$ by means of data taking, without thereby running the risk of altering that sequence.Footnote 18
In contrast, because “field experiments” are “experiments designed and carried out by scientists to ape … laboratory conditions in the field” (Morgan 2013, 343), we immediately see that these are just specific experiments: “the interventions are controlled by means such as dividing subject units into treated and untreated groups in order that experimental effects can be isolated” (Morgan 2013, 343). The original Hawthorne studies may serve as an example exhibiting the disadvantages of experimentation even in the field.
Slightly more interesting are “natural experiments,” which Woodward (2003b, 103) takes to be cases in which an intervention takes place without human action. As my account of experimentation decidedly involves human action, these still fall under FO, while underscoring that FO can be epistemically equivalent or even superior to experiment. This is consistent with verdicts by Anderl (2016, 661), who describes them as “the direct equivalent of randomized controlled experiments in an observational situation,” or Currie and Levy (2019, 1086), who hold that “there are significant analogies between experiments simpliciter and natural experiments”: analogy and equivalence can meaningfully obtain only between things that are in fact distinct.Footnote 19
A final issue that deserves attention is the fact that Mättig (2021, 14455) has recently called the LHC, which I have called an experiment, “a hybrid of experimental practices and observation”:
The collisions of interest are primarily not those of protons, but of the quarks and gluons inside the proton. These can hardly be varied by targeted intervention…. What the LHC delivers is a huge range of different final states. The “properties of interest” are obtained by selecting certain types of events, comparable to surveys of galaxies by telescopes. In consequence, the material information obtained from the LHC is a mixture of targeted intervention and observation. (14432–33)
So, should we say that the LHC inextricably intertwines FO with manipulation? I doubt it. First, note the tremendous degree of control exerted by physicists over the colliding protons. For example, the angle at which beams of protons cross is dynamically fine-tuned on the order of $10^{-2}$ radians so as to yield the greatest number of interactions in the right places.Footnote 20 Furthermore, following quantum field theory, particles like Higgses are literally brought into existence in proton–proton scattering. If this doesn’t count as “control” over relevant specimens, what does?
Of course, physicists are interested primarily in the interactions between quarks and gluons, not protons. Yet, it is fairly common that the targeted system can be controlled only indirectly: in vivo studies of the effects of drugs on an organ, say, will inevitably involve manipulating the entire organism. Nevertheless, such studies are straightforwardly considered experiments.
Finally, that properties of interest are “obtained by selecting certain types of events” is also rather typical of experiments. In particular, consider how the LHC serves multiple purposes: although it was designed primarily to search for the Higgs, it also serves the purpose of precision measurements on known particles and searches for new physics. Hence the “properties of interest” relative to one purpose define “background events” relative to another. But this says nothing over and above the fact that any measurement activity will also produce “noise,” next to the (final) states of interest.
There might be additional reasons to see the LHC as an FO. For example, “heavy-ion collisions at the LHC recreate in laboratory conditions the plasma of quarks and gluons that is thought to have existed shortly after the Big Bang.”Footnote 21 Thus, owing to the immense energies involved, the LHC can recreate “natural” conditions—conditions that have occurred absent any human intervention. And does that not make it FO by definition?
I believe concluding as much would be in error: just as FO can replicate experimental conditions when circumstances “happen to be” an intervention, some experiments can replicate natural conditions of interest. None of this speaks for a breakdown of a dichotomy between the two sorts of activities, with their complementary advantages.
7. Conclusion
I have argued that we need to distinguish between EO as dedicated attention to experience, TO as a family of technical success terms, and FO as the unperturbed taking of data. EO was argued to be distinct from data taking but to function as a mediator between an experiment or FO and its result in instrument-heavy fields. TO was argued to be that at which experiment and FO aim. Most importantly, FO was argued to be sometimes epistemically superior to experiment, and for systematic reasons: in some cases, the very act of exerting control forestalls the kind of TO that may be available in a carefully designed FO. Furthermore, because it is generally hard to estimate the overall effects of manipulating a targeted system, researchers might want to value FO as a complementary source of information not prone to the same kinds of error.
Acknowledgments
The research for this article was generously funded by the German Research Foundation (DFG) as part of the research unit “The Epistemology of the Large Hadron Collider” (grant FOR 2063) and my Emmy Noether Group (“UDNN: Scientific Understanding and Deep Neural Networks,” grant 508844757). I have profited from comments by three anonymous referees, from multiple discussions within the DFG research unit “The Epistemology of the Large Hadron Collider,” and from an internal conference on the experiment–observation dichotomy.