
Models of vision need some action

Published online by Cambridge University Press:  06 December 2023

Constantin Rothkopf
Affiliation:
Centre for Cognitive Science, Technical University of Darmstadt, Darmstadt, Germany; Frankfurt Institute for Advanced Studies, Goethe-Universität Frankfurt, Frankfurt am Main, Germany; Center for Mind, Brain and Behavior, University of Marburg and Justus Liebig University Giessen, Giessen, Germany; HMWK-Clusterproject The Adaptive Mind, Hesse, Germany (https://www.theadaptivemind.de/)
Frank Bremmer
Affiliation:
Center for Mind, Brain and Behavior, University of Marburg and Justus Liebig University Giessen, Giessen, Germany; HMWK-Clusterproject The Adaptive Mind, Hesse, Germany (https://www.theadaptivemind.de/); Applied Physics and Neurophysics, University of Marburg, Marburg, Germany
Katja Fiehler
Affiliation:
Center for Mind, Brain and Behavior, University of Marburg and Justus Liebig University Giessen, Giessen, Germany; HMWK-Clusterproject The Adaptive Mind, Hesse, Germany (https://www.theadaptivemind.de/); Experimental Psychology, Justus Liebig University Giessen, Giessen, Germany
Katharina Dobs
Affiliation:
Center for Mind, Brain and Behavior, University of Marburg and Justus Liebig University Giessen, Giessen, Germany; HMWK-Clusterproject The Adaptive Mind, Hesse, Germany (https://www.theadaptivemind.de/); Experimental Psychology, Justus Liebig University Giessen, Giessen, Germany
Jochen Triesch
Affiliation:
Frankfurt Institute for Advanced Studies, Goethe-Universität Frankfurt, Frankfurt am Main, Germany; Center for Mind, Brain and Behavior, University of Marburg and Justus Liebig University Giessen, Giessen, Germany; HMWK-Clusterproject The Adaptive Mind, Hesse, Germany (https://www.theadaptivemind.de/)

Abstract

Bowers et al. focus their criticisms on research that compares behavioral and brain data from the ventral stream with a class of deep neural networks for object recognition. While they are right to identify issues with current benchmarking research programs, they overlook a much more fundamental limitation of this literature: its disregard for the importance of action and interaction for perception.

Type
Open Peer Commentary
Copyright
Copyright © The Author(s), 2023. Published by Cambridge University Press

Computationally, perception, cognition, and action are inseparably intertwined in sequential, goal-directed behavior (Kessler, Frankenstein, & Rothkopf, 2022). However, the branch of research considered in Bowers et al. focuses on a single visual task, that of assigning single, discrete labels of object identity to images. This is as if the whole goal of human vision were to learn to shout out an appropriate word while being presented with a random pile of photographs. But, in the words of Thomas H. Huxley, the nineteenth-century English biologist and anthropologist: “The great end of life is not knowledge but action.” Perception is not l'art pour l'art. Instead, it occurs continuously in space and time as we perform structured tasks in a complex and dynamic environment (Fiehler & Karimpur, 2023). Perception guides action and action, in turn, impacts perception (Bremmer, Churan, & Lappe, 2017; Bremmer & Krekelberg, 2003; Eckmann, Klimmasch, Shi, & Triesch, 2020; Fiehler, Brenner, & Spering, 2019). Without action, we could not make changes in the world or interact with others. Here we argue that many of the limitations of current deep neural networks (DNNs) pointed out by Bowers et al. are likely rooted in a flawed and limited framing of perception and in implausible supervised learning objectives; that recent DNNs represent fruitful avenues for overcoming some of these limitations; but that we must extend current models to account for the different functions of vision (perception, cognition, and action) and how they interact. Acknowledging that perception and action are intimately related has fundamental consequences. We highlight five of them.

The sensory input to biological visual systems is highly structured as it unfolds during goal-directed behavior. Accordingly, DNNs should be trained not on independent images presented in random order with corresponding labels, but in self-supervised ways by observing continuous, structured datasets, that is, events unfolding in space and time. Many real-world objects, such as animals or faces, are not just static entities, but move dynamically and nonrigidly (Dobs, Bülthoff, & Schultz, 2018). One potential avenue currently being explored is using forms of time-based self-supervised deep learning (Orhan, Gupta, & Lake, 2020; Schneider, Xu, Ernst, Yu, & Triesch, 2021; Zhuang et al., 2021), which form invariant object representations by mapping sequences of views onto close-by latent representations without the need for labels. These models also have the potential to capture dynamic aspects of object recognition, such as the perception of dynamic faces, which cannot be captured well by current models trained on static images (Jiahui et al., 2022).
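
As a rough illustration of the time-based self-supervised idea described above, the following sketch computes an InfoNCE-style contrastive loss in which temporally adjacent views serve as positive pairs and all other views as negatives. It is a minimal toy under stated assumptions, not the method of any of the cited studies: the function name `temporal_contrastive_loss`, the temperature value, and the synthetic two-dimensional "embeddings" are all hypothetical choices made for illustration.

```python
import numpy as np

def temporal_contrastive_loss(z, temperature=0.1):
    """InfoNCE-style loss in which frame t+1 is the positive for frame t.

    z: array of shape (T, D) holding latent codes of T consecutive views.
    All names and defaults here are illustrative assumptions.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize each code
    sim = z @ z.T / temperature                        # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # a frame is never its own negative
    anchors = np.arange(len(z) - 1)                    # frames 0 .. T-2
    positives = anchors + 1                            # their temporal successors
    logits = sim[anchors]                              # shape (T-1, T)
    # Numerically stable log-softmax over each row.
    row_max = logits.max(axis=1, keepdims=True)
    log_prob = logits - row_max - np.log(
        np.exp(logits - row_max).sum(axis=1, keepdims=True)
    )
    return -log_prob[anchors, positives].mean()

# Toy latent codes: a smooth trajectory along half a circle (slowly changing views).
angles = np.linspace(0.0, np.pi, 16)
z_smooth = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# The same codes in a scrambled temporal order (interleaving distant frames).
order = np.arange(16).reshape(2, 8).T.ravel()          # 0, 8, 1, 9, ...
z_scrambled = z_smooth[order]

loss_smooth = temporal_contrastive_loss(z_smooth)
loss_scrambled = temporal_contrastive_loss(z_scrambled)
print(f"smooth: {loss_smooth:.3f}  scrambled: {loss_scrambled:.3f}")
```

In this toy setting the loss is lower when successive views map to nearby codes than when the temporal order is scrambled, which is the pressure that pushes a trained encoder toward representations that are stable across a sequence of views.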

The structure of sensory input is in large part dependent on the observer's own actions. Thus, object perception and vision in general can only be understood in the context of an active, exploratory, multi-sensory observer, a view also reflected in current experimental work (Ayzenberg & Behrmann, 2023). Supervised approaches miss the impact of goal-directed action and interaction on structuring visual representations (Krakauer, Ghazanfar, Gomez-Marin, MacIver, & Poeppel, 2017). Accordingly, models should learn in self-supervised ways while interacting with their environment. Indeed, visual representations have been shown to depend on the active visual policy (Rothkopf, Weisswange, & Triesch, 2009). Going beyond pure self-supervised invariance learning, a recent approach considers the benefits of active control of the viewpoint for learning object representations (Xu & Triesch, 2023). Mimicking visual input from self-generated object manipulations, it learns a hierarchical representation that satisfies two complementary desiderata: being partly invariant to viewpoint changes while still permitting prediction of which action is responsible for a particular change in the representation.
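
The two desiderata named above, invariance to viewpoint and predictability of the responsible action, can be made concrete with a deliberately simple toy. The sketch below uses a hand-crafted encoder rather than a learned one, and it is not the model of Xu and Triesch: the "view" is a 2-D point, the "action" is a rotation, and every function name is a hypothetical stand-in chosen for illustration.

```python
import numpy as np

def rotate(view, angle):
    """Toy 'action': rotate a 2-D view by the given angle (radians)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * view[0] - s * view[1], s * view[0] + c * view[1]])

def encode(view):
    """Hand-crafted encoder (an assumption, not a learned network).

    Returns an invariant code (radius, unchanged by rotation) and an
    equivariant code (polar angle, shifted by exactly the rotation applied).
    """
    invariant = np.hypot(view[0], view[1])
    equivariant = np.arctan2(view[1], view[0])
    return invariant, equivariant

def predict_action(code_before, code_after):
    """Recover the rotation from the change in the equivariant code."""
    diff = code_after[1] - code_before[1]
    return np.arctan2(np.sin(diff), np.cos(diff))      # wrap into (-pi, pi]

view = np.array([2.0, 1.0])
action = 0.7                                           # self-generated rotation
before = encode(view)
after = encode(rotate(view, action))

print("invariant part unchanged:", np.isclose(before[0], after[0]))
print("recovered action:", predict_action(before, after))
```

The point of the toy is only that a single representation can contain one part that ignores the transformation and another part from which the transformation can be read off; a learned model would have to discover such a split from interaction data.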

Learning and adaptation must be a continuous process, not limited to discrete training and test phases, but occurring continually during extended interactions with the environment. Recent approaches involving DNNs have addressed the challenge of continual learning (Wang, Liu, Duan, Kong, & Tao, 2022). However, the breadth of the required continuous adaptation to changing conditions (Roelfsema & Holtmaat, 2018; Schmitt et al., 2021) and the delicate balance of the classic stability–plasticity dilemma are still open problems for current DNNs.

The learning objectives must permit rich and adaptive representations that can feed multiple forms of interacting with the world. Instead, many of the studies considered by Bowers et al. relate to the single task of object recognition simply because the vast majority of current DNN approaches to vision select a task that gets away with ignoring actions: attaching labels to images. Few current neural network models conceptualize visual tasks in terms of visual routines, with some exceptions applying the framework of reinforcement learning to sequential visual behaviors (Araslanov, Rothkopf, & Roth, 2019). Promising directions are to jointly investigate a broad range of visual tasks (Dwivedi, Bonner, Cichy, & Roig, 2021), to investigate those computational visual tasks relevant for action, which are predominantly attributed to the dorsal stream, and to consider ecologically relevant cost functions that can account for dorsal stream properties in the primate brain (Mineault, Bakhtiari, Richards, & Pack, 2021).

Models will need to properly compute the interactions of sensory uncertainties, internally modeled uncertain beliefs, and action variabilities to successfully achieve the organism's goals in sequential, adaptive behavior. Bowers et al. do not mention uncertainty once in their article. Current DNN models are not well suited to the computations required for proper belief propagation in sequential perception and action under uncertainty, as required in extended behavior, where perception and action are inseparably intertwined. As an example, humans use their perception and their actions actively to shape their internal beliefs about landmarks in navigation (Kessler et al., 2022). In their critique, Bowers et al. ignore the major computational challenge, which requires making accurate causal inferences about the origins of uncertainty in sensory data and adaptive motor output (Straub & Rothkopf, 2022).
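
To make the interplay of sensory uncertainty, internal belief, and action variability more tangible, here is a minimal sketch of one predict-and-update cycle of a one-dimensional Kalman filter. It is a textbook linear-Gaussian stand-in chosen for illustration, not the dynamic Bayesian actor model of Kessler et al. or the inverse optimal control approach of Straub and Rothkopf; the function names and all numerical values are assumptions.

```python
def predict(mean, var, action, motor_var):
    """Acting shifts the belief and inflates its variance by the motor variability."""
    return mean + action, var + motor_var

def update(mean, var, observation, sensory_var):
    """Observing pulls the belief toward the measurement, weighted by reliability."""
    gain = var / (var + sensory_var)                   # Kalman gain in [0, 1]
    return mean + gain * (observation - mean), (1.0 - gain) * var

# Initial belief about one's position (illustrative values).
mean, var = 0.0, 1.0

# Step 1: a noisy movement of intended size 1.0 makes the belief less certain.
prior_mean, prior_var = predict(mean, var, action=1.0, motor_var=0.5)

# Step 2: a noisy observation of 1.6 sharpens the belief again.
post_mean, post_var = update(prior_mean, prior_var, observation=1.6, sensory_var=0.5)

print(prior_mean, prior_var)   # 1.0 1.5
print(post_mean, post_var)     # approximately 1.45 and 0.375
```

Even in this simplest case the posterior belief depends jointly on how variable the action was and how reliable the observation is, so attributing an error to the right source already requires tracking both kinds of uncertainty over time.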

In conclusion, we agree with Bowers et al.'s critique, but if we want to fully understand human vision including object recognition, our models must embrace the fact that vision is intimately intertwined with action in behaving, goal-directed agents.

Financial support

The research reported herein was supported by “The Adaptive Mind,” funded by the Excellence Program of the Hessian Ministry of Higher Education, Science, Research and Art.

Competing interest

None.

References

Araslanov, N., Rothkopf, C. A., & Roth, S. (2019). Actor-critic instance segmentation. In L. Davis, P. Torr, & S.-Z. Zhu (Eds.), Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Long Beach, California, 16–20 June 2019 (pp. 8237–8246).
Ayzenberg, V., & Behrmann, M. (2023). The where, what, and how of object recognition. Trends in Cognitive Sciences, 27, 335–336.
Bremmer, F., Churan, J., & Lappe, M. (2017). Heading representations in primates are compressed by saccades. Nature Communications, 8, 920.
Bremmer, F., & Krekelberg, B. (2003). Seeing and acting at the same time: Challenges for brain (and) research. Neuron, 38, 367–370.
Dobs, K., Bülthoff, I., & Schultz, J. (2018). Use and usefulness of dynamic face stimuli for face perception studies – A review of behavioral findings and methodology. Frontiers in Psychology, 9, 1355.
Dwivedi, K., Bonner, M. F., Cichy, R. M., & Roig, G. (2021). Unveiling functions of the visual cortex using task-specific deep neural networks. PLoS Computational Biology, 17(8), e1009267.
Eckmann, S., Klimmasch, L., Shi, B. E., & Triesch, J. (2020). Active efficient coding explains the development of binocular vision and its failure in amblyopia. Proceedings of the National Academy of Sciences of the United States of America, 117(11), 6156–6162.
Fiehler, K., Brenner, E., & Spering, M. (2019). Prediction in goal-directed action. Journal of Vision, 19(9), 10, 1–21.
Fiehler, K., & Karimpur, H. (2023). Spatial coding for action across spatial scales. Nature Reviews Psychology, 2, 72–84.
Jiahui, G., Feilong, M., di Oleggio Castello, M. V., Nastase, S. A., Haxby, J. V., & Gobbini, M. I. (2022). Modeling naturalistic face processing in humans with deep convolutional neural networks. bioRxiv, 1–39.
Kessler, F., Frankenstein, J., & Rothkopf, C. A. (2022). A dynamic Bayesian actor model explains endpoint variability in homing tasks. bioRxiv, 1–25.
Krakauer, J. W., Ghazanfar, A. A., Gomez-Marin, A., MacIver, M. A., & Poeppel, D. (2017). Neuroscience needs behavior: Correcting a reductionist bias. Neuron, 93, 480–490.
Mineault, P. J., Bakhtiari, S., Richards, B. A., & Pack, C. C. (2021). Your head is there to move you around: Goal-driven models of the primate dorsal pathway. Advances in Neural Information Processing Systems, 34, 28757–28771.
Orhan, E., Gupta, V., & Lake, B. M. (2020). Self-supervised learning through the eyes of a child. Advances in Neural Information Processing Systems, 33, 9960–9971.
Roelfsema, P. R., & Holtmaat, A. (2018). Control of synaptic plasticity in deep cortical networks. Nature Reviews Neuroscience, 19, 166–180.
Rothkopf, C. A., Weisswange, T. H., & Triesch, J. (2009). Learning independent causes in natural images explains the space variant oblique effect. In M. Amine, N. Enayati, & H. Li (Eds.), 2009 IEEE 8th international conference on development and learning, Shanghai, China, 5–7 June 2009 (pp. 1–6). IEEE.
Schmitt, C., Schwenk, J. C. B., Schütz, A., Churan, J., Kaminiarz, A., & Bremmer, F. (2021). Preattentive processing of visually guided self-motion in humans and monkeys. Progress in Neurobiology, 205, 102117.
Schneider, F., Xu, X., Ernst, M. R., Yu, Z., & Triesch, J. (2021). Contrastive learning through time. In SVRHM 2021.
Straub, D., & Rothkopf, C. A. (2022). Putting perception into action with inverse optimal control for continuous psychophysics. eLife, 11, 76635.
Wang, Z., Liu, L., Duan, Y., Kong, Y., & Tao, D. (2022). Continual learning with lifelong vision transformer. In R. Chellappa, J. Matas, L. Quan, & M. Shah (Eds.), Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, New Orleans, Louisiana, 19–24 June 2022 (pp. 171–181).
Xu, X., & Triesch, J. (2023). CIPER: Combining invariant and equivariant representations using contrastive and predictive learning. http://arxiv.org/abs/2302.02330
Zhuang, C., Yan, S., Nayebi, A., Schrimpf, M., Frank, M. C., DiCarlo, J. J., & Yamins, D. L. (2021). Unsupervised neural network models of the ventral visual stream. Proceedings of the National Academy of Sciences of the United States of America, 118(3), e2014196118.