Norbert Wiener, the founder of cybernetics, famously wrote: “the best […] model for a cat is another, or preferably the same cat” (Rosenblueth & Wiener, 1945). Wiener was referencing an assumption that good models are general – their predictions match data across diverse settings. A model of a cat should walk like a cat, purr like a cat, and scowl like a cat. Until recently, vision research has lacked general models. Instead, it has focused on unveiling a marvellous cabinet of perceptual curiosities, including the exotic illusions that characterise human vision. The models that explain these phenomena are typically quite narrow. For example, a model that explains crowding typically does not explain filling in and vice versa.
This (along with the vagaries of intellectual fashion) explains the enthusiasm that has greeted deep neural networks as theories of biological vision. Deep networks are (quite) general. After being trained to classify objects from a well-mixed distribution of natural scenes, they can generalise to accurately label new exemplars of those classes in wholly novel images. To achieve this, many networks use computational motifs recognisable from neurobiology, such as local receptivity, dimensionality reduction, divisive normalisation, and layerwise depth. This has provoked an upswell of enthusiasm around a model class that is at once a passable neural simulacrum and a source of genuine predictive power in the natural world.
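To make those motifs concrete, the sketch below shows a minimal convolutional network in PyTorch that pairs each named motif with a layer. It is our own illustration, not a description of any published model: the layer sizes, the 32 × 32 input, and the use of local response normalisation as a stand-in for divisive normalisation are all arbitrary assumptions.

```python
# Minimal illustrative sketch (assumed sizes; not any specific published model).
import torch
import torch.nn as nn

class TinyVentralNet(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2),    # local receptivity
            nn.ReLU(),
            nn.LocalResponseNorm(size=5),                   # divisive normalisation
            nn.MaxPool2d(2),                                # dimensionality reduction
            nn.Conv2d(16, 32, kernel_size=3, padding=1),    # layerwise depth
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, n_classes)  # assumes 32x32 RGB input

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)
        return self.classifier(h.flatten(start_dim=1))

# Example: class logits for a batch of four 32x32 RGB images.
logits = TinyVentralNet()(torch.randn(4, 3, 32, 32))
```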
In the target article, Bowers et al. demur. Babies, they worry, may have been lost with the bathwater. Deep networks fail to capture many of the remarkable constraints on perception that have been painstakingly identified by vision researchers. The target article offers a useful tour of some behaviours we might want deep networks to display before victory is declared. For example, we should expect deep networks to show the advantage of uncrowding, to benefit from Gestalt principles, and to show a predilection to recognise objects by their shape rather than merely their texture. This point is well taken. The problem, which has been widely noted before, is that neural networks have an exasperating tendency to use every means possible to minimise their loss, including those alien to biology. In the supervised setting, if a single pixel unambiguously discloses the object label, deep networks will happily use it. If trained ad nauseam on shuffled labels, they will memorise the training set. If cows are always viewed in lush green pastures, they will mistake any animal in a field for a cow. None of this should be in the least surprising. It is, of course, mandated by the principles of gradient descent which empower learning in these networks.
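The single-pixel shortcut is easy to reproduce in a toy setting. The sketch below is our own construction (arbitrary sizes, a hypothetical synthetic dataset): one pixel perfectly encodes the label, and a linear classifier trained by gradient descent concentrates its weight on that pixel while ignoring the rest of the image.

```python
# Toy demonstration of shortcut learning via a single tell-tale pixel.
import torch
import torch.nn as nn

n, d = 1024, 28 * 28
labels = torch.randint(0, 2, (n,))
images = torch.randn(n, d)             # uninformative "content"
images[:, 0] = labels.float()          # pixel 0 perfectly discloses the label

model = nn.Linear(d, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):                   # full-batch gradient descent
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()

# The weight on the shortcut pixel is typically far larger than the average
# weight on any other pixel: the network has happily used the cue that
# minimises its loss.
print(model.weight[:, 0].abs().mean(), model.weight[:, 1:].abs().mean())
```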
So how do we build computational models that perceive the world in more biologically plausible ways? The target article is long on critique and short on solutions. In their concluding sections, the authors muse about the merits of a return to handcrafted models, or the augmentation of deep networks with neurosymbolic approaches. This would be a regressive step. To move away from large-scale function approximation would be to jettison the very boon that has (rightfully) propelled deep network models to prominence: their remarkable generality.
Instead, to make progress, it would help to recall that primate vision relies on two parallel streams flowing dorsally and ventrally from early visual cortex (Mishkin, Ungerleider, & Macko, 1983). Deep networks trained for object recognition may offer a plausible model of the ventral stream, but an exclusive reliance on this stream leads to stereotyped deficits that seem to stem from a failure to understand how objects and scenes are structured. For example, damage to parieto-occipital regions can lead to integrative agnosia, where patients fail to recognise objects by integrating their parts; or to Balint's syndrome, where patients struggle to compare, count, or track multiple objects in space (Robertson, Treisman, Friedman-Hill, & Grabowecky, 1997). These are precisely the sorts of deficits that standard deep networks display: They fail to process the “objectness” of an object, relying instead on shortcuts such as mapping textures onto labels (Geirhos et al., 2020; Jagadeesh & Gardner, 2022). In primates, this computational problem is solved in the dorsal stream, where neurons code not just for objects and their labels but for the substrate (egocentric space) in which they occur. By representing space explicitly, neural populations in the dorsal stream can signal how objects occupying different positions relate to each other (scene understanding), as well as encoding the spatially directed motor responses that are required to pick an object up or apprehend it with the gaze (skilled action). Thus, to account for the richness of primate visual perception, we need to build networks with both “what” and “where” streams. Recent research has started to make progress in this direction (Bakhtiari, Mineault, Lillicrap, Pack, & Richards, 2021; Han & Sereno, 2022; Thompson, Sheahan, & Summerfield, 2022).
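One possible dual-stream arrangement is sketched below: a shared early encoder feeds a “what” head that pools over space to report object identity, and a “where” head that preserves the spatial map to read out an egocentric coordinate. This is our own minimal illustration of the general idea, under assumed layer sizes, and does not reproduce the architecture of any of the studies cited above.

```python
# Hedged sketch of a "what"/"where" two-stream network (assumed sizes).
import torch
import torch.nn as nn

class TwoStreamNet(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.early_visual = nn.Sequential(              # shared early visual features
            nn.Conv2d(3, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.ventral = nn.Sequential(                   # "what": identity, pooled over space
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes),
        )
        self.dorsal = nn.Sequential(                    # "where": spatial map kept explicit
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * 16 * 16, 2),   # hypothetical (x, y) readout
        )

    def forward(self, x):
        h = self.early_visual(x)
        return self.ventral(h), self.dorsal(h)

# Example: identity logits and an egocentric coordinate for each image.
what_logits, where_xy = TwoStreamNet()(torch.randn(4, 3, 32, 32))
```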
More generally, the problem is not that deep networks are poor models of vision. The problem is that popular tests of object recognition (such as ImageNet) are unrepresentative of the challenges that biological visual systems actually evolved to solve. In the natural world, object recognition is not an end in itself, but a route to scene understanding and skilled motor control. Of course, if a network is trained to slavishly maximise its accuracy at labelling carefully curated images of singleton objects, it will find shortcuts to solving this task which do not necessarily resemble those seen in biological organisms (which generally have other more interesting things to do, such as walking, purring, and scowling).
Thus, to tackle the challenges highlighted in the target article, we do not need less generality – we need more. Neuroscience researchers should focus on the complex problems that biological organisms actually face, rather than copying benchmark problems from machine learning researchers (for whom building systems that solve object recognition alone is a perfectly reasonable goal). This will require a more serious consideration of what other brain regions – including dorsal stream structures involved in spatial cognition and action selection – contribute to visual perception.
Financial support
This work was funded by the Human Brain Project (SGA3) and by a European Research Council consolidator award to C. S.
Competing interest
None.