R1. Overview
We are pleased that so many commentators agree with so many of our core claims. For instance, there is general agreement that current deep neural networks (DNNs) do a poor job in accounting for many psychological findings; that an important direction for future research is to train DNNs on new tasks and datasets that more closely capture human experience; and that new objective functions like self-supervision may improve DNN–human correspondences. Most importantly, there is widespread agreement that research in psychology should play a central role in building better models of human vision. It is important to appreciate the implication of this last point because psychological experiments reveal some weird and wonderful properties of human vision that DNNs must seek to explain. We start by discussing some of these key properties before responding to the specific points of the commentators.
To give only the most cursory of overviews, the following findings should play a central role in theory and model building. The input to our visual system is degraded due to a large blind spot and an inverted retina with light having to pass through multiple layers of retinal neurons, axons, and blood vessels before reaching the photoreceptors. Nevertheless, we are unaware of the degraded signals due to a process of actively filling in missing signals in early visual cortex (e.g., Grossberg, Reference Grossberg, Pessoa and de Weerd2003; Ramachandran & Gregory, Reference Ramachandran and Gregory1991). We have a fovea that supports high-acuity colour vision for only about 2 degrees of visual angle (about the size of a thumbnail at arm's length). Nevertheless, we have the subjective sense of a rich visual experience across a much wider visual field because we move our eyes approximately three times per second (Rayner, Reference Rayner1978), with the encoding of visual inputs suppressed during each saccade (Matin, Reference Matin1974), and the visual system somehow integrating inputs across fixations (Irwin, Reference Irwin1991). At the same time, we can identify multiple objects in scenes following a single fixation (Biederman, Reference Biederman1972), with object identification taking approximately 150 ms (Thorpe, Fize, & Marlot, Reference Thorpe, Fize and Marlot1996) – too quick to rely on recurrence. We are also blind to major changes in a scene as revealed by change blindness (Simons & Levin, Reference Simons and Levin1997) and have a visual short-term memory of approximately four items (Cowan, Reference Cowan2001). Our visual system organizes image contours by various Gestalt rules to separate figure from ground (Wagemans et al., Reference Wagemans, Elder, Kubovy, Palmer, Peterson, Singh and von der Heydt2012) and organizes contours to build representations of object parts (Biederman, Reference Biederman1987). Objects are encoded in terms of their surfaces, parts, and relations between parts to build three-dimensional (3D) representations relying on monocular and binocular inputs (Biederman, Reference Biederman1987; Marr, Reference Marr1982; Nakayama & Shimojo, Reference Nakayama and Shimojo1992). Colour, form, and motion processing are factorized to the extent that it is possible to be cortically colour blind (Cavanagh et al., Reference Cavanagh, Hénaff, Michel, Landis, Troscianko and Intriligator1998), or suffer motion blindness where objects disappear during motion but are visible and recognizable while static (Zeki, Reference Zeki1991), or show severe impairments with object identification while maintaining the ability to reach and manipulate objects (Goodale & Milner, Reference Goodale and Milner1992). Participants can even classify objects while denying seeing them (Koculak & Wierzchoń). Our visual system manifests a wide range of visual constancies, such as size and shape constancy, to estimate the distal properties of the world independent of lighting and object pose, and we suffer from size, colour, and motion illusions that reflect the very mechanisms that serve the building of these distal representations from the proximal image projected onto our retinas. These representations of distal stimuli in the world support a range of visual tasks, including object classification, navigation, grasping, and visual reasoning.
All this is done with spiking networks composed of neurons with a vast range of morphologies that vary in ways relevant to their function, with architectures constrained by evolution and biophysics.
All of this and much more needs to be explained, and various modelling approaches are warranted. We agree with the commentators that one valuable approach is to keep working with current image-computable DNNs while altering the tasks they solve, the data they are fed, their objective functions, learning rules, and architectures. Perhaps DNNs will converge with the biological solutions in some important respects. Whether DNNs will “automagically” (Xu & Vaziri-Pashkam) converge on many of these solutions when trained on the right tasks and data, however, is far from certain, and in our view, it is a mistake to put all our eggs in this one basket. Whatever approach one adopts, the current trend of emphasizing prediction success on observational behavioural and brain benchmarks and downplaying failures is unlikely to advance our understanding of human vision and the brain more generally.
Our response to the commentaries is organized as follows. In section R2 we show there is no basis for the claim that we are advocating for the abandonment of DNNs as a modelling framework to test hypotheses about human vision. In sections R3 and R4 we challenge the common claims that image computability is the minimal criterion for any serious model of vision and that DNNs are the "current best" models of human vision. In section R5 we argue that models should be developed for the sake of explanations rather than predictions. In section R6 we discuss how the marketing of DNNs as the best models of human vision is contributing to a current trend of emphasizing DNN–human similarities and downplaying discrepancies. Finally, in section R7, we respond to the DiCarlo, Yamins, Ferguson, Fedorenko, Bethge, Bonnen, & Schrimpf (DiCarlo et al.) and Golan, Taylor, Schütt, Peters, Sommers, Seeliger, Doerig, Linton, Konkle, van Gerven, Kording, Richards, Kietzmann, Lindsay, & Kriegeskorte (Golan et al.) commentaries. Many of the (over 20) authors have played leading roles in developing this new field of comparing DNNs to humans, and in both commentaries, the authors advance research agendas for the field going forward. However, they fail to address any of our concerns, and at the same time, mischaracterize some of our key positions.
R2. Do we recommend abandoning DNNs as models of human vision?
Many commentators claim that we are categorically rejecting DNNs as models of human vision (Golan et al.; Hermann, Nayebi, van Steenkiste, & Jones [Hermann et al.]; Love & Mok; Op de Beeck & Bracci; Summerfield & Thompson; Wichmann, Kornblith, & Geirhos [Wichmann et al.]; Yovel & Abudarham), with quotes like:
In the target article, Bowers et al. propose that psychologists should abandon DNNs as models of human vision, because they do not produce some of the perceptual effects that are found in humans (Yovel & Abudarham)
Unlike Bowers et al. we do not see any evidence that future, novel DNN architectures, training data and regimes may not be able to overcome at least some of the limitations mentioned in the target article – and Bowers et al. certainly do not provide any convincing evidence why solving such tasks is beyond DNNs in principle, that is, forever (Wichmann et al.)
Nevertheless, the target article advocates for jettisoning deep-learning models with some competency in object recognition for toy models evaluated against a checklist of laboratory findings (Love & Mok)
…Bowers et al. take failures of ImageNet-trained models to behave in human-like ways as support for abandoning DNN architectures (Hermann et al.)
However, this is not our position. Indeed, in section 6.1 in the target article, we clearly lay out four different approaches to modelling that should be pursued going forward, the first of which is to continue to work with standard DNNs that perform well in identifying naturalistic images of objects but modify their architectures, optimization rules, and training environments to better account for key experimental results in psychology. This is exactly the view that so many commentators are endorsing. Nowhere in the target article do we advocate for “jettisoning” DNNs, and it is hard to understand why so many researchers claim that we have.
R3. Is image computability an entry requirement for developing models of human vision?
While we explicitly endorse a research programme that, amongst other things, compares image-computable DNNs to human vision (if severely tested), most of the commentators are less ecumenical and reject alternative modelling approaches in psychology and neuroscience that already account for some key aspects of human vision and the brain more generally. The main reason for this selective interest in DNNs is that only DNNs can recognize photographic images of objects at human or superhuman levels (under some conditions), that is, only DNNs are “image computable.” This is considered an essential starting point for developing models of human vision (Anderson, Storrs, & Fleming [Anderson et al.]; DiCarlo et al.; Golan et al.; Love & Mok; Op de Beeck & Bracci; Spratling; Summerfield & Thompson; Wichmann et al.; Yovel & Abudarham). As Spratling puts it “… the ability to process images would seem to me to be a minimum requirement for a model of vision, and models that cannot be scaled to deal with images are not worth evaluating.” Similarly, Summerfield & Thompson describe working with nonimage-computable models as “regressive.” Not to be outdone, Love & Mok write:
The authors invite us to return to the halcyon days before deep learning to a time of box-and-arrow models in cognitive psychology and “blocks world” models of language (Winograd, Reference Winograd1971), when modelers could narrowly apply toy models to toy problems safe in the knowledge that they would not be called upon to generalize beyond their confines nor pave the way for future progress.
This emphasis on image computability betrays a fundamental misunderstanding of what models are and what they are for. The goal of a scientific theory/model in the cognitive sciences is to account for capacities, predict data, and explain key phenomena, not to superficially resemble that which it purports to explain. When developing DNNs of human vision, image computability makes a system look like a visual system, but it does not make that system a good model of the human visual system. The ability to identify photorealistic images is a perk, not a barrier to entry. The barrier to entry is explanatory power and accounting for key empirical results. Rather than dismiss alternative approaches to modelling because they are not image computable, the relevant questions are “What have we learned from the multitude of modelling approaches available to vision scientists?” and “What are the most promising approaches going forward?”
To answer these questions, we need to consider the different modelling approaches of the past and the different approaches currently on offer. First, there is a long history in neuroscience and psychology of developing conceptual and mathematical theories of human vision that have provided insights into key empirical phenomena, from wiring diagrams designed to explain single-cell responses of simple and complex cells in V1 (Hubel & Wiesel, Reference Hubel and Wiesel1962), to dual-stream theories of vision designed to explain neuropsychological disorders of vision (Goodale & Milner, Reference Goodale and Milner1992), to theories of object recognition in normal vision (e.g., Biederman, Reference Biederman1987; Marr, Reference Marr1982). These approaches to modelling are still active and providing valuable insights (Baker, Garrigan, & Kellman, Reference Baker, Garrigan and Kellman2021; Goodale & Milner, Reference Goodale and Milner2023; Vannuscorps, Galaburda, & Caramazza, Reference Vannuscorps, Galaburda and Caramazza2021).
Second, there is a long history of building neural networks that process simple visual inputs to gain insights into the psychological and neural processes involved in object recognition, such as the neocognitron model (Fukushima, Reference Fukushima1980) that implemented and extended the theory of Hubel and Wiesel, and the JIM model that implemented and extended the theory of Biederman (Hummel & Biederman, Reference Hummel and Biederman1992). This latter model, JIM, and its successors (Hummel, Reference Hummel2001; Hummel & Stankiewicz, Reference Hummel, Stankiewicz, Inui and McClelland1996) recognize simple line drawings of objects and are premised on the assumption that the goal of the ventral visual stream is to build a representation of the distal stimulus (the world and the objects in it) that can be used to understand the visual world. On this view, object classification is merely a consequence, not the be-all and end-all, of the ventral visual stream. Unlike current DNNs, JIM and its successors account for many key psychological findings in human object recognition – such as the sensitivity of humans to part–whole relations – without being able to process naturalistic photographic images.
In a similar way, Grossberg et al. developed adaptive resonance theory (ART) models that quickly learn to classify simple visual patterns without forgetting past learning, that is, networks that solve the stability–plasticity dilemma (e.g., Carpenter & Grossberg, Reference Carpenter and Grossberg1987; Grossberg, Reference Grossberg1980). ART models not only account for a range of empirical findings reported in psychology and neuroscience (Grossberg, Reference Grossberg2021), but they have also been used to solve engineering challenges (Da Silva, Elnabarawy, & Wunsch, Reference Da Silva, Elnabarawy and Wunsch2019). Grossberg has also developed detailed models of low-level vision that take in simple visual inputs to capture a wide range of perceptual illusions (Grossberg, Reference Grossberg2014). Expanding on the work of Grossberg, Francis, Manassi, and Herzog (Reference Francis, Manassi and Herzog2017) implemented networks that process simple visual inputs to explain a range of crowding phenomena that current DNNs cannot explain. In related work, George et al. (Reference George, Lehrach, Kansky, Lázaro-Gredilla, Laan, Marthi and Phoenix2017, Reference George, Lazaro-Gredilla, Lehrach, Dedieu and Zhou2020) developed recursive cortical networks that support the recognition of “captchas” and can account for several phenomena core to human vision, including some Gestalt phenomena (Lavin, Guntupalli, Lázaro-Gredilla, Lehrach, & George, Reference Lavin, Guntupalli, Lázaro-Gredilla, Lehrach and George2018). These models rely on segmentation and occlusion-reasoning in a unified framework to support object recognition, but only work with simple visual stimuli. These modelling efforts (and many others) largely fall into the second research programme we endorse in section 6.1 in the target article, namely, building networks that focus on explaining key psychological phenomena rather than image computability.
Third, there are active research programmes following the third approach we endorse in section 6.1 in the target article, namely, building models that support various human capacities that current DNNs struggle with (without focusing on the details of psychological or neuroscience research). But again, these models cannot process the photographic images that DNNs recognize. For example, Hinton, a coauthor of AlexNet, rejects current image-computable DNNs as models of human vision and is instead developing Capsule and GLOM models (Hinton, Reference Hinton2022; Sabour, Frosst, & Hinton, Reference Sabour, Frosst and Hinton2017). Hinton (Reference Hinton2022) writes:
There is strong psychological evidence that people parse visual scenes into part–whole hierarchies and model the viewpoint–invariant spatial relationship between a part and a whole as the coordinate transformation between intrinsic coordinate frames that they assign to the part and the whole [Hinton, Reference Hinton1979]. If we want to make neural networks that understand images in the same way as people do, we need to figure out how neural networks can represent part–whole hierarchies.
Indeed, current DNNs fail to represent objects in terms of their parts and relations even when explicitly trained to do so (Malhotra, Dujmović, Hummel, & Bowers, Reference Malhotra, Dujmović, Hummel and Bowers, in press).
Similarly, generative models, such as variational autoencoders, are being developed that learn disentangled representations of visual elements of a scene (single hidden units that encode shape, colour, position, etc.; e.g., Higgins et al., Reference Higgins, Matthey, Pal, Burgess, Glorot, Botvinick and Lerchner2016; Montero, Bowers, Ponte Costa, Ludwig, & Malhotra, Reference Montero, Bowers, Ponte Costa, Ludwig and Malhotra2022; Zhang et al., Reference Zhang, Zhang, Liu, Weller, Schölkopf and Xing2022) and object-centric learning models are being built to perform perceptual grouping (e.g., Anciukevicius, Fox-Roberts, Rosten, & Henderson, Reference Anciukevicius, Fox-Roberts, Rosten and Henderson2022; Locatello et al., Reference Locatello, Poole, Rätsch, Schölkopf, Bachem and Tschannen2020). To understand these principles, these models are frequently trained and tested on datasets of artificially created simple visual stimuli. German & Jacobs explicitly argue that variational autoencoders provide a more promising framework for understanding how human vision encodes objects in terms of their parts and relations between parts. But at present, exploring this requires working with simple rather than photorealistic images.
The important point to emphasize here is that all these models would (and some actually do) receive low Brain-Scores (some cannot even be tested) because they cannot process the photorealistic inputs in ImageNet. Yet these models explore important phenomena in constrained settings. Are we supposed to discard these models because they cannot process and recognize photographs of objects? We think not. In our view, the diversity of modelling approaches in psychology (and the cognitive sciences more generally) fits well with the diversity of productive questions that can be asked about cognitive systems (cf., van Rooij, Reference van Rooij2022). This is important to counteract the assumption that all worthwhile models of vision can recognize naturalistic photographs of objects or are on a trajectory towards becoming image computable.
R4. Are image-computable models the "current best" models of human vision?
Still, it might be argued that image-computable DNNs that perform well on prediction-based experiments are the current best models of human vision because they provide more insights into human vision. However, we are struggling to see what the new insights are (although see our responses to Anderson et al. and Op de Beeck & Bracci below). Current DNNs account for few findings from psychology, and only do well on brain prediction-based studies when there is no attempt to rule out confounds as the basis of their successes. At the same time, DNNs that vary in terms of their architectures (CNNs vs. transformers) and objective functions (classification vs. image reconstruction) support similar levels of predictions on behavioural and brain benchmarks (e.g., Storrs, Kietzmann, Walther, Mehrer, & Kriegeskorte, Reference Storrs, Kietzmann, Walther, Mehrer and Kriegeskorte2021), with Hermann et al. and Linsley & Serre noting a recent trend for better-performing object recognition models to do more poorly on Brain-Score (although Wichmann et al. note that a transformer model trained on 4 billion images does much better on behavioural benchmarks). And as noted by Xu & Vaziri-Pashkam, when representational similarity analysis (RSA) is assessed with higher quality brain data, the correspondence across levels of DNNs and visual cortex is lost for familiar objects, and the predictivity scores go down dramatically for unfamiliar objects. More problematically, Xu & Vaziri-Pashkam note that RSA scores are greatly reduced following theoretically motivated experimental manipulations of images. What conclusions or insights about human vision follow from these observations? At present, it seems that the main advantage of image-computable DNNs compared to alternative models is that they recognize things, with little evidence that they do this in the way that humans do.
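For readers unfamiliar with the method, the following is a minimal sketch of how an RSA score is typically computed. The arrays, their shapes, and the random data are illustrative assumptions rather than any particular benchmark's data or pipeline, but the sketch makes clear why a high score, on its own, does not reveal which stimulus properties drive the correspondence.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Illustrative data: responses of a DNN layer and of a brain region
# to the same set of stimuli (rows = stimuli, columns = units/voxels).
rng = np.random.default_rng(0)
dnn_activations = rng.normal(size=(50, 512))   # hypothetical layer activations
brain_responses = rng.normal(size=(50, 100))   # hypothetical voxel responses

# Representational dissimilarity matrices (RDMs): pairwise distances
# between stimulus representations within each system.
dnn_rdm = pdist(dnn_activations, metric="correlation")
brain_rdm = pdist(brain_responses, metric="correlation")

# The RSA score is the rank correlation between the two RDMs. A high score
# only shows that the two systems order stimulus pairs similarly; it does
# not identify which stimulus properties (shape, texture, or a confound)
# produce that ordering.
rsa_score, _ = spearmanr(dnn_rdm, brain_rdm)
print(f"RSA (Spearman) = {rsa_score:.3f}")
```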
In fact, many commentators readily concede that current DNNs are doing a poor job in accounting for the results of experimental studies of human vision, and multiple possible solutions have been proposed. DNNs need to be trained with a better diet of images that more closely resemble human experience (Linsley & Serre; Op de Beeck & Bracci; Yovel & Abudarham), more biological constraints need to be added to models, such as representing binocular input from two eyes (Chandran, Paul, Paul, & Ghosh), and new objective functions and tasks need to be explored, including building DNNs that support vision for action (German & Jacobs; Hermann et al.; Li & Mur; Liu & Bartolomeo; Rothkopf, Bremmer, Fiehler, Dobs, & Triesch; Slagter; Summerfield & Thompson), with many of these authors advocating for some combination of the above approaches. Again, we agree with these research agendas, and we are pursuing some of these ourselves, including adding biological constraints to networks (Evans, Malhotra, & Bowers, Reference Evans, Malhotra and Bowers2022; Tsvetkov, Malhotra, Evans, & Bowers, Reference Tsvetkov, Malhotra, Evans and Bowers2023) and modifying training environments (Biscione & Bowers, Reference Biscione and Bowers2022), in an attempt to make DNNs encode information in a more human-like manner. At the same time, there are good a priori reasons to think major architectural innovations may be necessary, for example, to encode relations between parts (Kellman, Baker, Garrigan, Phillips, & Lu), with some authors more pessimistic regarding the promise of DNNs as models of brains, with quotes such as: "Deep neural networks (DNNs) are not just inadequate models of the visual system but are so different in their structure and functionality that they are not even on the same playing field" (Gur) and the claim that DNNs "are doomed to be largely useless models for psychological research on language" (Bever, Chomsky, Fong, & Piattelli-Palmarini [Bever et al.]).
Of course, the human visual system is an image-computable neural network (although a network that differs from current DNNs in many fundamental ways; Izhikevich, Reference Izhikevich2004). However, the claim that current image-computable DNNs are the most promising models of human vision going forward, despite the limited insights gathered thus far, is nothing more than a faith-based prophecy that may or may not pan out. In our view, researchers should be pursuing multiple different modelling approaches to advance our understanding of human vision. It is the dismissal of alternative approaches that is regressive (cf., Rich, de Haan, Wareham, & van Rooij, Reference Rich, de Haan, Wareham and van Rooij2021, for a computational account of why this is detrimental).
R5. The role of prediction and explanation in model building
In the target article, we distinguished between uncontrolled, prediction-based studies that often highlight DNN–human similarities and controlled experiments that often highlight dissimilarities. We argued that the former experiments are problematic given that predictions can be driven by confounds whereas the latter experiments can help rule out confounds and allow researchers to draw causal conclusions regarding similarities and differences between DNNs and humans. To our surprise, few commentators even comment on this issue. The only exceptions are Srivastava, Sifar, & Srinivasan, who highlight that similar issues apply in other domains; Golan et al., who highlight the importance of a variety of experimental designs; and Veit & Browning, who point out that properties and abilities of biological systems can be multiply realized and that controlled experiments are needed to make causal conclusions regarding the similarity of DNNs and humans.
Despite the potential problem of confounds in prediction-based studies, several commentators emphasize the importance of model predictions (Golan et al.; Lin; Moldoveanu; Op de Beeck & Bracci; Veit & Browning; Wichmann et al.; Yovel & Abudarham). For example, Wichmann et al. write: “we believe that both prediction and explanation are required: An explanation without prediction cannot be trusted, and a prediction without explanation does not aid understanding,” and Lin writes “developing models with predictive accuracy might be a complementary approach that could help to test the relevance of explanatory models that have been developed through controlled experimentation.”
These comments seem to suggest that testing models on controlled experiments does not involve prediction. In fact, both prediction-based studies and controlled experiments test model-based predictions (Golan et al.). The important distinction is between predictions with and without explanation. In the case of testing DNNs on prediction-based studies, there is no manipulation of independent variables designed to test specific hypotheses regarding how the models made their predictions, and accordingly, no explanation for any good predictions. Indeed, even 100% predictivity would not help the scientist understand how a DNN makes its predictions (see Fig. 5 in the target article). By contrast, in the case of testing DNNs on controlled experiments, the models are assessed on how well they predict performance across conditions designed to test hypotheses, and accordingly, good predictions can contribute to an explanation.
Of course, some types of predictions provide a stronger test of a model than others (Spratling), and this applies to both prediction-based studies and controlled experiments. In the case of prediction-based studies, current DNNs only perform well in the easy cases, namely, when training and test images are from the same distribution (often described as independent and identically distributed data or i.i.d. data). When DNNs are assessed on their ability to make behavioural or brain predictions for test images from a different distribution (out-of-distribution data or o.o.d. data), performance plummets. For example, as noted above, Xu & Vaziri-Pashkam showed that brain predictivity with RSA was much weaker when they included novel stimuli in the test set, and DNN successes on same-different visual judgements are limited to cases in which training and test images are similar (Puebla & Bowers, Reference Puebla and Bowers2022, Reference Puebla and Bowers2023). In other words, not only do prediction-based studies provide little insight into how models predict, but also their successful predictions are highly circumscribed.
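To make the i.i.d./o.o.d. distinction concrete, here is a schematic sketch in which a linear encoding model is fitted from (hypothetical) DNN features to (hypothetical) brain responses and then evaluated on held-out stimuli from the same distribution versus stimuli from a different distribution. All names and data are illustrative assumptions, and the o.o.d. data are deliberately constructed so that the learned mapping no longer applies, mimicking the drop in predictivity described above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Hypothetical DNN features and brain responses for one stimulus distribution
# (e.g., photographs of familiar objects) ...
features_iid = rng.normal(size=(200, 64))
brain_iid = features_iid @ rng.normal(size=(64, 20)) + rng.normal(size=(200, 20))

# ... and for a different distribution (e.g., unfamiliar or manipulated images),
# where, by construction, the mapping learned on the first set does not hold.
features_ood = rng.normal(loc=2.0, size=(100, 64))
brain_ood = rng.normal(size=(100, 20))

X_train, X_test, y_train, y_test = train_test_split(
    features_iid, brain_iid, test_size=0.25, random_state=0)

encoder = Ridge(alpha=1.0).fit(X_train, y_train)

# i.i.d. test: held-out stimuli from the *same* distribution as training.
print("i.i.d. R^2:", encoder.score(X_test, y_test))

# o.o.d. test: stimuli from a *different* distribution; predictivity drops.
print("o.o.d. R^2:", encoder.score(features_ood, brain_ood))
```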
Similarly, in the case of DNNs that successfully account for the results of controlled psychological experiments, the models predict that the controlled experiments will replicate on another sample of participants, images, and so on taken from the same population (i.i.d. data). But DNNs rarely make counterintuitive predictions that are subsequently confirmed in controlled experiments (analogous to predictions of o.o.d. data). It is worth noting that models tested on controlled experiments are generally described as accounting for (rather than predicting) results when successful, and this terminology might be more appropriate for prediction-based studies tested on i.i.d. data. Whatever the terminology, prediction-based studies and controlled experiments both assess how well DNNs predict (account for) data, but only the latter method tests hypotheses to rule out confounds and to make causal claims regarding how DNNs and humans identify objects.
Arguments regarding the relative advantages of prediction versus explanation touch on a broader debate regarding the relative advantages of studying natural systems in artificial conditions that allow precise control of variables versus naturalistic conditions where control is more limited. For example, Love & Mok cite the classic paper by Newell (Reference Newell1973) “You can't play 20 questions with nature and win” as a fundamental problem with studying the brain with controlled experiments. According to Love & Mok, laboratory studies in psychology have only produced a collection of findings they characterize as “cognitive science trivia.” Summerfield & Thompson are not so dismissive of these experimental results, but they are critical of models in psychology that narrowly focus on explaining a small set of laboratory findings. DNNs, by contrast, are thought to hold promise of “genuine predictive power in the natural world” when trained on tasks that humans face in everyday life.
It strikes us as peculiar to characterize the empirical findings from psychology as “trivia” rather than core constraints for theory building and odd to dismiss models of specific empirical findings if they help explain key aspects of vision. What other area of science does not break down complex phenomena into parts? When Summerfield & Thompson highlight the narrow scope of psychological models with the example “…a model that explains crowding typically does not explain filling in and vice versa,” it is important to note that current DNNs account for neither result.
For the sake of argument, let us accept the claim that image-computable models provide the best way forward for addressing Newell's challenge. Nevertheless, it is still the case that only controlled experiments provide specific hypotheses about how to improve DNN–human correspondences. For example, controlled experiments highlighted specific limitations of current DNNs as models of human vision (e.g., relying too much on texture, etc.) leading to specific suggestions about how to address them (e.g., a generative rather than discriminative objective function may result in a model that encodes shape rather than texture; German & Jacobs). A research programme of training image-computable DNNs on naturalistic datasets without running specific controlled experiments will simply lead to black-box models in which there is no understanding of how the model works, let alone whether the model learns similar representations to humans.
It is also important to recognize the challenges of working with naturalistic images, even when relying on controlled studies. For example, Rust and Movshon (Reference Rust and Movshon2005) argued for the importance of building theories of biological vision using artificial and simple stimuli. They pushed back on the view that the best way to understand vision was to probe the system with naturalistic images, writing:
Implicit in this approach is the assumption that synthetic stimuli are in some way impoverished or “simplistic” and therefore somehow miss important features of visual response. The main – and in our view, crippling – challenge is that the statistics of natural images are complex and poorly understood. Without understanding the constituents of natural images, it is imprudent to use them to develop a well-controlled hypothesis-driven experiment.
Although these comments were made before the current interest in DNNs, it remains just as difficult to design well-controlled hypothesis-driven experiments using natural images now as it was then given the billions of features associated with images. As a result, DNNs trained on these images become liable to learning based on shortcuts (Geirhos et al., Reference Geirhos, Jacobsen, Michaelis, Zemel, Brendel, Bethge and Wichmann2020) and confounds (Dujmović, Bowers, Adolfi, & Malhotra, Reference Dujmović, Bowers, Adolfi and Malhotra2023), making it difficult to interpret their mechanisms and internal representations.
Finally, it is important to emphasize that model predictions are not the only way to advance our understanding of natural systems. Lin gives the example of Darwinian evolution as a model that has explanatory power but limited predictive accuracy. We think the term theory rather than model is more appropriate here, but the critical point is that evolution explains existing data very well, and it would be silly to dismiss the theory because it does not make precise predictions going forward. This point generalizes to all areas of science, such that unimplemented theories of vision can provide important insights into human vision if they can provide an account of key existing findings. Indeed, simply running experiments that test hypotheses can be highly informative. Of course, formal modelling has an important role to play, but in all cases, the focus should be on explanation, not prediction.
R6. The marketing of DNNs as the current best models of human vision is impeding our progress in developing better models
When comparing DNNs to humans it is not enough to carry out controlled experiments; it is also important to emphasize both the similarities and the differences. This involves not only correctly characterizing the results from both DNNs and humans, but also carrying out studies that attempt to falsify claims regarding DNN–human similarities. Indeed, the best empirical evidence for a model is that it survives "severe" tests (Mayo, Reference Mayo2018), namely, experiments that have a high probability of falsifying a claim if and only if the claim is false in some relevant manner (for a detailed discussion of the importance of severe testing when comparing DNNs to humans, see Bowers et al., Reference Bowers, Malhotra, Adolfi, Dujmović, Montero, Biscione and Heaton2023).
However, this does not characterize standard practice in the field at present. Instead, there appears to be a bias towards highlighting similarities and downplaying differences. Indeed, Tarr notes that many of the strong claims regarding DNN–human similarities are best understood as marketing rather than serious scientific claims – and on his view, the problem rests with the consumers who take the hype (too) seriously. He tells the story of a fool buying a pig because he saw a brochure suggesting pigs could fly. It is an allegory – the person should not be so naïve as to believe the marketing. Similarly, he cautions us to be smart consumers of science and not take strong claims regarding DNN–human similarity too seriously. He describes DNNs as only "proxy models" of vision and writes: "I don't think there is much actual confusion that deep neural networks (DNNs) are 'models of the human visual system.'"
We imagine it would be hard for DiCarlo et al. and Golan et al. to agree with this conclusion given they both repeat the claim that DNNs are the best models of human vision. But more importantly, this marketing impacts the field in two general ways.
R6.1. Marketing and research practices
When looking for DNN–human similarities, there is little motivation to move away from prediction-based studies that can provide misleading estimates of similarities, little reason for researchers to carry out controlled studies that provide severe tests of these claims, and little interest from editors and reviewers in publishing studies that highlight DNN–human dissimilarities. Consistent with these claims, two commentators explicitly minimize the importance of falsification. Tarr writes: “…less handwringing about what current models can't do; instead, they should focus on what DNNs can do.” Similarly, Love & Mok write: “…we do not share their enthusiasm for falsifying models that are a priori wrong and incomplete.” Instead, Love & Mok advocate for a Bayesian approach to model evaluation, where the question is which model is most likely given the data. But model selection depends on which data are under consideration, and currently, too many fundamental psychological findings are ignored because DNNs do not capture them. If Bayesian methods were used to select models that account for psychological phenomena, then in many cases, nonimage-computable models would perform best.
Perhaps the above comments are anomalous, and Golan et al. are right to doubt a bias against falsification in the field. But in our experience, this attitude towards falsification is widespread. For example, see the following NeurIPS workshop talk by Bowers (Reference Bowers2022) that provides multiple examples of reviewers and editors stating that falsification is not enough. Rather, it is necessary to find “solutions” to make DNNs more like humans to publish: https://slideslive.com/38996707/researchers-comparing-dnns-to-brains-need-to-adopt-standard-methods-of-science. Similar biases are well recognized in other fields. For example, it is analogous to a bias against publishing null results in psychology that is well understood to have led to many false conclusions (Simmons, Nelson, & Simonsohn, Reference Simmons, Nelson and Simonsohn2011).
R6.2. Marketing and (mis)characterizing research findings
There is another respect in which this marketing manifests itself, namely, weak or ambiguous findings are too often characterized as supporting strong conclusions. We gave multiple examples of this in the target article (e.g., Caucheteux, Gramfort, & King, Reference Caucheteux, Gramfort and King2022; Duan et al., Reference Duan, Matthey, Saraiva, Watters, Burgess, Lerchner and Higgins2020; Hermann, Chen, & Kornblith, Reference Hermann, Chen and Kornblith2020; Kim, Reif, Wattenberg, Bengio, & Mozer, Reference Kim, Reif, Wattenberg, Bengio and Mozer2021; Messina, Amato, Carrara, Gennaro, & Falchi, Reference Messina, Amato, Carrara, Gennaro and Falchi2021; Zhou & Firestone, Reference Zhou and Firestone2019) and there are more examples in the current commentaries themselves. For instance, de Vries, Flachot, Morimoto, & Gegenfurtner (de Vries et al.) criticize us for claiming that colour and form are processed entirely separately in V1 and cite some studies of theirs showing that DNNs do a good job in capturing important features of human colour processing. We take the point that the strong claims by Livingstone and Hubel (Reference Livingstone and Hubel1988) need to be qualified given subsequent work (e.g., Garg, Li, Rashid, & Callaway, Reference Garg, Li, Rashid and Callaway2019), but de Vries et al. mischaracterize their own findings. They claim that categorical perception of colour emerges as a function of training models to classify objects and note that this effect did not emerge in a DNN trained to distinguish artificial from human-made scenes (de Vries, Akbarinia, Flachot, & Gegenfurtner, Reference de Vries, Akbarinia, Flachot and Gegenfurtner2022). However, as reported in Appendix 7 of de Vries et al. (Reference de Vries, Akbarinia, Flachot and Gegenfurtner2022), an untrained DNN also showed some degree of categorical perceptual effects. This latter finding substantially weakens the evidence for their claim that colour perception emerges as a consequence of learning to classify objects.
Similarly, Love & Mok criticize us for not "engaging with work that successfully addresses their criticisms," but the evidence they report does not support their conclusions. Love & Mok give two examples from their own lab. First, they describe the work of Sexton and Love (Reference Sexton and Love2022), who note that RSA and linear prediction methods of comparing DNNs to brains rely on correlations and write: "Just as correlation does [not] imply causation, correlation does not imply correspondence." We agree. The problem is in how they draw correspondence claims. The authors assess whether brain signals can causally drive object recognition in DNNs by substituting the response elicited in an internal layer of a DNN with (a linear transform of) the brain response elicited by the same visual stimulus. They find that the activities from brain regions do indeed drive DNN object recognition performance above chance levels and take this as evidence that the representations in DNNs and brains are similar.
However, there are both empirical and logical problems with their studies and the conclusions they draw. Empirically, as reported in the Supplemental materials (Fig. S10 and Table S3), when brain data are used to drive DNN object recognition, performance drops from ~80% to <10% in one experiment and from ~58% to <2% in the second experiment. This large drop in performance is problematic for their conclusion. More fundamentally, the observation that brain responses support (limited) object recognition in DNNs does not address the issue of confounds. Just as texture-like representations in DNNs might be used to predict shape representations in cortex (leading to good RSA or Brain-Scores in the absence of similar representations), it is possible that shape representations in cortex can be mapped to texture-like representations in DNNs to drive object recognition to a limited extent. That is, the (weak) causal link between brain activation and DNN object recognition does nothing to address our concern that good predictions do not imply similar representations. Just as correlation does not imply causation, causation does not imply correspondence.
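To clarify the procedure at issue, here is a minimal sketch of this kind of substitution analysis. The arrays, the linear readout, and the data are hypothetical stand-ins, not the pipeline of Sexton and Love (Reference Sexton and Love2022); the point of the sketch is that above-chance accuracy after substitution shows only that some linearly decodable information is shared, not that the two systems represent objects in the same way.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n_stimuli, n_voxels, n_units, n_classes = 300, 120, 64, 10

# Hypothetical quantities: activations of one internal DNN layer, voxel
# responses to the same stimuli, and the weights of the readout layer
# that maps the internal layer to class scores.
layer_acts = rng.normal(size=(n_stimuli, n_units))
brain_resp = (0.3 * layer_acts @ rng.normal(size=(n_units, n_voxels))
              + rng.normal(size=(n_stimuli, n_voxels)))
readout_w = rng.normal(size=(n_units, n_classes))
labels = np.argmax(layer_acts @ readout_w, axis=1)   # the network's own outputs

# Step 1: learn a linear transform from brain responses to the DNN layer.
mapper = LinearRegression().fit(brain_resp, layer_acts)

# Step 2: substitute the brain-derived activations for the network's own
# and pass them through the remaining (readout) stage.
substituted_acts = mapper.predict(brain_resp)
preds = np.argmax(substituted_acts @ readout_w, axis=1)

print("accuracy with substituted activations:", np.mean(preds == labels))
print("chance level:", 1 / n_classes)
```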
Love & Mok also describe a study by Dagaev et al. (Reference Dagaev, Roads, Luo, Barry, Patil and Love2023) that they claim addresses a problem identified by Malhotra, Evans, and Bowers (Reference Malhotra, Evans and Bowers2020), namely, that DNNs are so susceptible to shortcut learning that they will classify the images from CIFAR10 based on a single-pixel confound. Their solution involved introducing a too-good-to-be-true prior during training – if an image can be classified successfully by a low-capacity network (which Dagaev et al. use as a shortcut detector), the image is down-weighted when training a full-capacity network. This way, the full-capacity network only learns from images that, Dagaev et al. claim, are less likely to contain shortcuts. While this method is certainly of interest to a machine-learning engineer, it is of limited relevance to a cognitive scientist and does not address the criticisms made by Malhotra et al. (2020). Firstly, if the shortcut is widely prevalent in the dataset – in Malhotra et al. a diagnostic pixel was present in 80–100% of images – this method would fail. Secondly, there is nothing to say that shortcuts picked up by DNNs are necessarily easier to pick up by a low-capacity network. There could be many complex shortcuts, involving conjunctions of features, that are ignored by humans and picked up by full-capacity DNNs but not by low-capacity DNNs. The point that Dagaev et al. miss is that we do not want models to ignore simple diagnostic visual features (humans rely on heuristics across a wide range of domains); rather, models should learn the right kind of features, that is, they should incorporate appropriate human inductive biases rather than simply learning whatever features the low-capacity DNN does not happen to find diagnostic.
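For concreteness, a simplified sketch of the kind of reweighting scheme being discussed is given below. This is our own illustration of the general idea rather than Dagaev et al.'s implementation, and all function names and data are hypothetical.

```python
import numpy as np

def shortcut_downweighting(probe_confidences, labels, floor=0.05):
    """Down-weight training images that a low-capacity 'shortcut detector'
    network already classifies confidently.

    probe_confidences: class probabilities assigned by the low-capacity
        network (n_images x n_classes).
    labels: integer class labels (n_images,).
    Returns per-image weights for training the full-capacity network.
    """
    # Confidence of the low-capacity network in the correct class.
    correct_conf = probe_confidences[np.arange(len(labels)), labels]
    # Images the small network finds easy ("too good to be true") get low
    # weight; images it struggles with keep a weight close to 1.
    return np.clip(1.0 - correct_conf, floor, 1.0)

# Hypothetical usage with random stand-in data.
rng = np.random.default_rng(3)
probe_conf = rng.dirichlet(np.ones(10), size=8)
labels = rng.integers(0, 10, size=8)
print(shortcut_downweighting(probe_conf, labels))
```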
Yovel & Abudarham describe how DNNs capture the face-inversion effect, writing: "Interestingly, a human-like face inversion effect that is larger than an object inversion effect is found in DNNs." In fact, as shown by Yovel, Grosbard, and Abudarham (Reference Yovel, Grosbard and Abudarham2022) and others, DNNs show similarly sized inversion effects for face and nonface stimuli when trained with an equal number of images per category (e.g., when trained to identify the same number of human faces and birds of the same species). That is, the models showed an expertise inversion effect, not a face-specific inversion effect. This contradicts the bulk of current empirical evidence showing that humans exhibit a greater inversion effect for faces compared to other categories even when they are experts in the other category. To reconcile these findings with the modelling work, Yovel et al. (Reference Yovel, Grosbard and Abudarham2022) argue that bird watchers are more expert at recognizing human faces than birds, and this is why they show larger face inversion effects. Future work may well support this hypothesis, and if so, it would provide a good example of DNNs explaining important psychological data. However, as it stands, the DNN results are inconsistent with most psychological data.
This is not to say that there are no examples of DNNs doing a good job at accounting for the results from controlled experiments. For instance, Anderson et al. describe the results of Storrs et al. (Reference Storrs, Kietzmann, Walther, Mehrer and Kriegeskorte2021) who identified conditions in which DNNs do and do not replicate illusions of gloss in humans. They found that unsupervised but not supervised learning produced human-like results and suggest unsupervised learning may play a similar role in humans. Similarly, Op de Beeck & Bracci describe the controlled studies by Kubilius, Bracci, and Op de Beeck (Reference Kubilius, Bracci and Op de Beeck2016) showing that DNNs trained on ImageNet are sensitive to many of the nonaccidental features described by Biederman (Reference Biederman1987), a finding we found surprising but subsequently replicated in unpublished work.
However, these successes are, in our view, the exception, not the rule. A combination of relying so heavily on uncontrolled prediction-based studies, a bias against falsification in controlled studies, and selectively characterizing results to emphasize DNN–human similarities is not the way forward to advancing our understanding of human vision.
The same issues arise when large language models are compared to human language. In the target article, we gave the example of Caucheteux et al. (Reference Caucheteux, Gramfort and King2022) drawing strong conclusions about human language despite the fact that the DNNs accounted for approximately 0.004 of the BOLD variance in response to spoken sentences. Similarly, Schrimpf et al. (Reference Schrimpf, Blank, Tuckute, Kauf, Hosseini, Kanwisher and Fedorenko2021) report that transformer models predict nearly 100% of explainable variance in neural responses to written sentences and suggest that "a computationally adequate model of language processing in the brain may be closer than previously thought." However, the strong claims in the article are undermined by data reported in the appendices. From Appendix S1 one learns that the explainable variance is between 4 and 10% of the overall variance in three of the four datasets they analyse, and from the Appendix section "SI-1 – Language specificity," we learn that the DNNs not only predict brain activation in language areas, but also in nonlanguage areas, and in some analyses, the predictions are numerically larger for nonlanguage areas. Rather than providing evidence that these models process language like humans, the correlations may be more akin to the spurious correlation observed between mouse brain activations and cryptocurrency markets (Meijer, Reference Meijer2021).
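The arithmetic is worth spelling out. If the noise ceiling (the explainable variance) is only 4–10% of the overall variance, then predicting "nearly 100% of the explainable variance" corresponds to accounting for only roughly 4–10% of the total variance, as the following back-of-the-envelope calculation (using the figures quoted above) illustrates:

```python
# Fraction of total variance that is explainable (the noise ceiling),
# using the 4-10% range quoted above for three of the four datasets.
explainable_fractions = [0.04, 0.10]

# Proportion of the explainable variance the model is reported to predict
# ("nearly 100% of explainable variance").
model_score = 1.0

for ceiling in explainable_fractions:
    total_explained = model_score * ceiling
    print(f"noise ceiling {ceiling:.0%} of total variance -> "
          f"model accounts for about {total_explained:.0%} of total variance")
```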
Furthermore, as noted by Houghton, Kazanina, & Sukumaran (Houghton et al.), when a child is learning to speak, it is unlikely that she is focusing on predicting the next word. Rather, it seems likely that she is trying to communicate thoughts and desires. That is, these models learn to produce well-formed syntactic sentences when trained on arguably the wrong objective function. Similarly, these DNNs do not appear to share human-like inductive biases in learning languages, what Bever et al. call a universal grammar. These innate properties of humans allow the child to learn languages with many orders of magnitude less training than DNNs (human learning must be compatible with the poverty-of-the-stimulus constraint), and at the same time, limit the types of languages that the human language system acquires (unlike language learning in DNNs; Mitchell & Bowers, Reference Mitchell and Bowers2020). In our view, research with DNNs in the domain of language provides another example that good predictions in uncontrolled studies provide little evidence that DNNs rely on human-like representations, processes, or even objective functions.
We do agree with Houghton et al. that it can be useful to compare language in DNNs and humans to explore the capacities of DNNs that do not have any language-specific learning mechanism. But at present, not only do the learning objectives and learning constraints seem wildly different in the two systems, but also, the performance of fully trained models “sharply diverges” from humans in controlled experiments (Huang et al., Reference Huang, Arehalli, Kugemoto, Muxica, Prasad, Dillon and Linzen2023).
R7. The Brain-Score neuroconnectionists
Before concluding, we thought it would be worthwhile to focus on the commentaries by DiCarlo et al. and Golan et al. Many of these authors have been amongst the most vocal in highlighting DNN–human similarities, and in both commentaries, they are describing agendas for how to push the field forward.
Perhaps most surprising for us, DiCarlo et al. do not even attempt to address the core problem with prediction-based studies used in Brain-Score, namely, predictions of observational datasets might be mediated by confounds. Instead, they mischaracterize our views regarding benchmarks, writing:
Bowers et al. eschew community-transparent suites of benchmarks yet they imply an alternative notion of vision model evaluation, which is somehow not a suite of benchmarks… we see no alternative to support advances in models of vision other than an open, transparent, and community-driven way of model comparison.
Where DiCarlo et al. get the impression that we are opposed to “open, transparent, and community-driven way of model comparison” is beyond us. Rather, we caution against prediction-based studies and endorse controlled experiments to assess models, including image-computable DNNs. Indeed, we are building our own (open, transparent, and community-driven) evaluation suite, that we call MindSet, that will make it easy for researchers to assess image-computable DNNs against key findings in psychology (Biscione et al., Reference Biscione, Yin, Malhotra, Dujmović, Montero, Puebla, Bowers and S2023). MindSet facilitates the testing of DNNs across a series of controlled psychological experiments, each of which tests a specific hypothesis regarding how DNNs process and represent information.
The authors also report on an upcoming update on Brain-Score, with the inclusion of a controlled study by Baker and Elder (Reference Baker and Elder2022). They note that some DNN vision models tested on this dataset are within the noise ceiling of human data. It will be interesting to see these results given that Baker and Elder reported that VGG19, ResNet50, CorNET, and a visual transformer all failed to capture human results, writing:
Our configural manipulation reveals an enormous difference in how humans and networks recognize the objects: while humans rely profoundly on configural cues, networks do not.
Regardless of how current DNNs perform on this specific dataset, we welcome the introduction of controlled studies to the Brain-Score benchmark. But if the authors of Brain-Score modify their benchmark to assess the results of controlled experiments, they will need to assess models in terms of how well they explain the impact of independent variables that test specific hypotheses rather than rank models by their overall prediction accuracy.
DiCarlo et al. also defend their claim that DNNs are the current leading models of human ventral visual processing and write: "Bowers et al. critique ANN models without offering a better alternative: They imply that better models exist or should exist, but do not elaborate on what those models are." They set the bar quite low for "best" given that current DNNs do extremely poorly in predicting the results of experiments that manipulate independent variables and provide little insight into how humans identify the objects included in current behavioural and brain benchmark studies. But in any case, we have detailed a long list of alternative models in section 6.1 of the target article and in section R3 of our response. In our view, these nonimage-computable models have provided more insight into human vision thus far. Still, going forward, we do think it is important to try to build image-computable DNNs that do account for controlled studies, and in parallel, to pursue alternative modelling approaches.
Golan et al. describe a progressive Lakatosian research programme they call "neuroconnectionism" (Doerig et al., Reference Doerig, Sommers, Seeliger, Richards, Ismael, Lindsay and Kietzmann2023) that generates a rich variety of falsifiable hypotheses and advances through model comparison. They note that neuroconnectionism itself is best thought of as a computational language that cannot be falsified and that a failure of a specific DNN does not amount to a refutation of neural network models in general. The problem with this argument is that no one claims that a rejection of a specific model amounts to a falsification of DNNs in general, and no one rejects modelling as a core method for advancing science. They are mounting a defence against an imaginary critique (as do other commentators, as noted in sect. R2). Our criticism of neuroconnectionism is that current claims regarding DNN–human similarity are grossly overstated because researchers rely too heavily on uncontrolled prediction-based studies and avoid severe testing of their hypotheses. When the right methods are employed – namely, controlled experiments as used in virtually all other areas of science – models account for few empirical findings of interest to vision researchers.
Unlike DiCarlo et al., Golan et al. do note some of the advantages of controlled experiments and briefly touch on the limitations of uncontrolled prediction-based studies, writing:
Controlled experiments pose specific questions. They promise to give us theoretically important bits of information but are biased by theoretical assumptions and risk missing the computational challenge of task performance under realistic conditions… Observational studies and experiments with large numbers of natural images pose more general questions. They promise evaluation of many models with comprehensive data under more naturalistic conditions, but risk inconclusive results because they are not designed to adjudicate among alternative computational mechanisms (Rust & Movshon, Reference Rust and Movshon2005). Between these extremes lies a rich space of neural and behavioral empirical tests for models of vision. The community should seek models that can account for data across this spectrum, not just one end of it.
But we do not find their arguments against controlled studies and in support of observational studies persuasive. Yes, controlled studies are biased in the sense that they are driven by theoretical assumptions, but the unstated (and unknown) assumptions in uncontrolled studies do not avoid biased results. For example, the image datasets used in Brain-Score (see Fig. 2 in the target article) are not "neutral" and different results are obtained with other datasets (Xu & Vaziri-Pashkam). And what does it mean to claim that observational studies with naturalistic images promise to evaluate many models, and at the same time, note that this approach risks inconclusive results? Indeed, predictions made from naturalistic images taken from observational studies are, by their very nature, ambiguous as there are many potential confounds that can lead models to make predictions on the basis of shortcuts and confounds (Dujmović et al., Reference Dujmović, Bowers, Adolfi and Malhotra2023; Geirhos et al., Reference Geirhos, Jacobsen, Michaelis, Zemel, Brendel, Bethge and Wichmann2020).
Furthermore, what does it mean to design tests that fall in-between observational and controlled studies? An experiment either does or does not manipulate independent variables designed to test hypotheses and rule out confounds. If the point is that it is important to work with image datasets that vary in their degree of complexity and naturalism, it remains the case that controlled experiments need to be run on all types of stimuli. Indeed, Golan et al. cite the discovery of texture bias and adversarial susceptibility as two examples of shortcomings of DNNs that have led to improvements. Putting aside the fact that current DNNs show almost none of the features of human shape processing and there are still no solutions to adversarial images, these limitations were both identified using controlled experiments that rely on complex but unnatural stimuli. Golan et al. do not identify any insights that have derived from uncontrolled studies.
Golan et al. also caricature psychology, writing: “Traditional psychological experiments are designed to test verbally defined theories.” In fact, controlled experiments have been used to assess computational models in psychology long before the invention of AlexNet (e.g., Grossberg, Reference Grossberg1967; Hummel & Biederman, Reference Hummel and Biederman1992; Medin & Schaffer, Reference Medin and Schaffer1978; Ratcliff & McKoon, Reference Ratcliff and McKoon2008; Rescorla & Wagner, Reference Rescorla, Wagner, Black and Prokasy1972; Shepard, Reference Shepard1987). This general lack of regard for formal models and results in psychology (not to mention the lack of regard for verbal theories) is impeding progress in characterizing DNN–human similarities and building better models of vision and the brain more generally. Indeed, this common and unwarranted attitude towards psychology partly motivated us to write the target article in the first place.
Golan et al. also defend the claim that DNNs are the “best models” of human vision, writing:
The empirical reason why ANNs can be called the “current best” models of human vision is that they offer unprecedented mechanistic explanations of the human capacity to make sense of complex, naturalistic inputs.
Here perhaps we should take the advice of Tarr and appreciate that this is more marketing than a scientific statement.