For human-like models, train on human-like tasks
Published online by Cambridge University Press: 06 December 2023
Abstract
Bowers et al. express skepticism about deep neural networks (DNNs) as models of human vision due to DNNs' failures to account for results from psychological research. We argue that to fairly assess DNNs, we must first train them on more human-like tasks, which we hypothesize will induce more human-like behaviors and representations.
- Type: Open Peer Commentary
- Copyright: © The Author(s), 2023. Published by Cambridge University Press
References
Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893.
Baker, N., Lu, H., Erlikhman, G., & Kellman, P. J. (2018). Deep convolutional networks do not classify based on global object shape. PLoS Computational Biology, 14(12), e1006613.
Beery, S., Van Horn, G., & Perona, P. (2018). Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 456–473).
Brohan, A., Chebotar, Y., Finn, C., Hausman, K., Herzog, A., Ho, D., … Fu, C. K. (2023). Do as I can, not as I say: Grounding language in robotic affordances. In Conference on Robot Learning (pp. 287–318). PMLR.
Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A. J., Padlewski, P., Salz, D., … Soricut, R. (2023). PaLI: A jointly-scaled multilingual language-image model. In International Conference on Learning Representations (ICLR).
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255).
Gan, C., Schwartz, J., Alter, S., Schrimpf, M., Traer, J., De Freitas, J., … Yamins, D. L. K. (2021). ThreeDWorld: A platform for interactive multi-modal physical simulation. Advances in Neural Information Processing Systems (NeurIPS).
Geirhos, R., Jacobsen, J. H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., & Wichmann, F. A. (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11), 665–673.
Geirhos, R., Narayanappa, K., Mitzkus, B., Thieringer, T., Bethge, M., Wichmann, F. A., & Brendel, W. (2021). Partial success in closing the gap between human and machine vision. Advances in Neural Information Processing Systems (NeurIPS), 34, 23885–23899.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., & Brendel, W. (2019). ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations (ICLR).
Greff, K., Belletti, F., Beyer, L., Doersch, C., Du, Y., Duckworth, D., … Tagliasacchi, A. (2022). Kubric: A scalable dataset generator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3749–3761).
Haber, N., Mrowca, D., Wang, S., Fei-Fei, L., & Yamins, D. L. (2018). Learning to play with intrinsically-motivated, self-aware agents. Advances in Neural Information Processing Systems (NeurIPS), 31.
Hermann, K., Chen, T., & Kornblith, S. (2020). The origins and prevalence of texture bias in convolutional neural networks. Advances in Neural Information Processing Systems (NeurIPS), 33, 19000–19015.
Hill, F., Lampinen, A., Schneider, R., Clark, S., Botvinick, M., McClelland, J. L., & Santoro, A. (2020). Environmental drivers of systematicity and generalization in a situated agent. In International Conference on Learning Representations (ICLR).
Konkle, T., & Alvarez, G. A. (2022). A self-supervised domain-general learning framework for human ventral stream representation. Nature Communications, 13(1), 491.
Kucker, S. C., Samuelson, L. K., Perry, L. K., Yoshida, H., Colunga, E., Lorenz, M. G., & Smith, L. B. (2019). Reproducibility and a unifying explanation: Lessons from the shape bias. Infant Behavior and Development, 54, 156–165.
Kumar, M., Houlsby, N., Kalchbrenner, N., & Cubuk, E. D. (2022). Do better ImageNet classifiers assess perceptual similarity better? Transactions on Machine Learning Research.
Landau, B., Smith, L. B., & Jones, S. S. (1988). The importance of shape in early lexical learning. Cognitive Development, 3(3), 299–321.
Malhotra, G., Evans, B. D., & Bowers, J. S. (2020). Hiding a plane with a pixel: Examining shape-bias in CNNs and the benefit of building in biological constraints. Vision Research, 174, 57–68.
McCoy, R. T., Pavlick, E., & Linzen, T. (2020). Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019) (pp. 3428–3448). Association for Computational Linguistics. https://aclanthology.org/P19-1334/
Muttenthaler, L., Dippel, J., Linhardt, L., Vandermeulen, R. A., & Kornblith, S. (2023). Human alignment of neural network representations. In International Conference on Learning Representations (ICLR).
Nayebi, A., Kong, N. C., Zhuang, C., Gardner, J. L., Norcia, A. M., & Yamins, D. L. (2021). Mouse visual cortex as a limited resource system that self-learns an ecologically-general representation. bioRxiv.
Nguyen, A., Yosinski, J., & Clune, J. (2015). Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 427–436).
Puig, X., Ra, K., Boben, M., Li, J., Wang, T., Fidler, S., & Torralba, A. (2018). VirtualHome: Simulating household activities via programs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8494–8502).
Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., … Batra, D. (2019). Habitat: A platform for embodied AI research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 9339–9347).
Schrimpf, M. (2022). Advancing system models of brain processing via integrative benchmarking. Doctoral dissertation, Massachusetts Institute of Technology.
Schrimpf, M., Kubilius, J., Hong, H., Majaj, N. J., Rajalingham, R., Issa, E. B., … DiCarlo, J. J. (2018). Brain-Score: Which artificial neural network for object recognition is most brain-like? bioRxiv, 407007.
Sun, C., Shrivastava, A., Singh, S., & Gupta, A. (2017). Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision (pp. 843–852).
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
Weihs, L., Kembhavi, A., Ehsani, K., Pratt, S. M., Han, W., Herrasti, A., … Farhadi, A. (2021). Learning generalizable visual representations via interactive gameplay. In International Conference on Learning Representations (ICLR).
Xiang, F., Qin, Y., Mo, K., Xia, Y., Zhu, H., Liu, F., … Su, H. (2020). SAPIEN: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11097–11107).
Xiao, K., Engstrom, L., Ilyas, A., & Madry, A. (2021). Noise or signal: The role of image backgrounds in object recognition. In International Conference on Learning Representations (ICLR).
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR).
Zhuang, C., Xiang, V., Bai, Y., Jia, X., Turk-Browne, N., Norman, K., … Yamins, D. L. (2022). How well do unsupervised learning algorithms model human real-time and life-long learning? In Thirty-Sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Zhuang, C., Yan, S., Nayebi, A., Schrimpf, M., Frank, M. C., DiCarlo, J. J., & Yamins, D. L. (2021). Unsupervised neural network models of the ventral visual stream. Proceedings of the National Academy of Sciences of the United States of America, 118(3), e2014196118.
We agree with Bowers et al. that accounting for results from behavioral experiments should serve as a North Star as we develop models of human vision. But what is a promising path to finding models that perform well on experimental benchmarks? In this commentary, we focus on the role of the task(s) on which models are trained. Zhang, Bengio, Hardt, Recht, and Vinyals (2017) have shown that modern deep neural networks (DNNs) are more than expressive enough to overfit to any classification task on which they are trained. In particular, the authors show that DNNs can learn to classify ImageNet images (Deng et al., 2009) with arbitrarily shuffled labels, demonstrating maximal flexibility with respect to this training set. To introduce a metaphor, our models are like sponges, capable of absorbing whatever information we teach them through the training tasks we present. Thus, when we ask about a model's behavior, we should ask, first, what it was trained to do. Although Bowers et al. take failures of ImageNet-trained models to behave in human-like ways as support for abandoning DNN architectures, we argue that we should instead consider alternative training tasks for DNNs.
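To make the shuffled-label result concrete, here is a minimal PyTorch sketch in the spirit of the Zhang et al. (2017) experiment. It uses CIFAR-10 as a small stand-in for ImageNet; the architecture, hyperparameters, and dataset are our illustrative assumptions rather than the authors' exact protocol.

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# Randomized-label experiment (after Zhang et al., 2017): destroy any
# relationship between images and labels, then train as usual. A standard
# network can still drive training error toward zero, i.e., it can
# memorize an arbitrary labeling of the data.
dataset = torchvision.datasets.CIFAR10(      # small stand-in for ImageNet
    root="data", train=True, download=True, transform=T.ToTensor())
num_classes = 10
dataset.targets = torch.randint(0, num_classes, (len(dataset),)).tolist()

model = torchvision.models.resnet18(num_classes=num_classes)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()
loader = DataLoader(dataset, batch_size=128, shuffle=True)

for epoch in range(100):                     # memorization takes many epochs
    for images, labels in loader:
        optimizer.zero_grad()
        loss_fn(model(images), labels).backward()
        optimizer.step()
```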
Recent work has shown that pushing DNNs to perform well on ImageNet may not, in general, push them to be more human-like. Beyond a certain point, higher ImageNet performance becomes inversely related to primate neural predictivity (Schrimpf et al., 2018; Schrimpf, 2022). ImageNet performance also trades off against perceptual scores derived from human judgments (Kumar, Houlsby, Kalchbrenner, & Cubuk, 2022), and against shape bias when shape bias is modulated by data augmentation (Hermann, Chen, & Kornblith, 2020).
Certainly, humans can categorize the objects they see, but categorization is only a small part of how we process the visual world. Mostly, we use our visual systems to interact with the objects around us, in a closed loop comprising perception, inference, decision making, and action. There are several reasons to believe that training models on similarly embodied and active learning tasks may bring their behavior and representations closer to humans'. First, physically interacting with objects requires detailed perception of their global spatial properties (shape, position, motor affordances, etc.). Arguably, several of the most famous divergences between models and people stem from models' failures to weigh exactly this kind of information. For example, unlike people (Kucker et al., 2019; Landau, Smith, & Jones, 1988), many standard DNNs seem to rely on texture information more than shape (Baker, Lu, Erlikhman, & Kellman, 2018; Geirhos et al., 2019; Hermann et al., 2020). While, empirically, texture seems to be sufficient for good performance on ImageNet, it is unlikely to suffice for embodied navigation or manipulation tasks. In determining how to position oneself to sit in a chair, the shape and position of the chair are far more important than its color or upholstery texture. Similarly, adversarial examples (Nguyen, Yosinski, & Clune, 2015; Szegedy et al., 2013), another often-cited separator of humans and DNNs, arguably arise from models' over-reliance on local pixel patterns at the expense of the global configural information required for embodied interaction. Overall, we hypothesize that existing DNN architectures, if trained to navigate the world and interact with objects in the way that humans do, would be more likely to display human-like visual behavior and representations than they do under current training methods.
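As an illustration of how such divergences are quantified, the sketch below computes a shape-bias score in the spirit of the cue-conflict protocol of Geirhos et al. (2019); the `model` and the cue-conflict `stimuli` tensors are hypothetical inputs the reader would supply.

```python
import torch

def shape_bias(model, stimuli, shape_labels, texture_labels):
    """Fraction of cue-conflict trials decided by shape rather than texture.

    Each stimulus pits two cues against each other (e.g., a cat-shaped
    silhouette filled with elephant texture); shape_labels and
    texture_labels hold the class consistent with each cue. Following
    Geirhos et al. (2019), only trials where the model answers with one
    of the two conflicting classes count toward the score; humans score
    close to 1.0 on this measure, ImageNet-trained CNNs far lower.
    """
    model.eval()
    with torch.no_grad():
        preds = model(stimuli).argmax(dim=1)
    shape_hits = (preds == shape_labels).sum().item()
    texture_hits = (preds == texture_labels).sum().item()
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")
```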
Another implication of the Zhang et al. (2017) work is that modern networks are sufficiently large that training them on a 1,000-way classification task over a million images is insufficient to exhaust their capacity. This leaves important degrees of freedom governing their generalization performance underconstrained, which allows for deviant phenomena such as adversarial examples of the kind and severity currently observed. As another example of flexibility in how DNNs can learn a classification task, models often learn spurious or shortcut features (Arjovsky, Bottou, Gulrajani, & Lopez-Paz, 2019; Geirhos et al., 2020; McCoy, Pavlick, & Linzen, 2020), for example, using image backgrounds rather than foreground objects (Beery, Van Horn, & Perona, 2018; Xiao, Engstrom, Ilyas, & Madry, 2021), or single diagnostic pixels rather than other image content (Malhotra, Evans, & Bowers, 2020); a synthetic construction of the latter appears below. This brings us to a second argument in favor of embodied training tasks. A dataset of similar size to ImageNet but with a richer, more ecological output space – for example, choosing a physical action and its control parameters, or predicting subsequent frames – would contain a vastly larger amount of information, perhaps more fully constraining the model's behavior.
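The "single diagnostic pixel" failure mode is easy to reproduce synthetically. The sketch below is our own construction, not Malhotra et al.'s stimuli: it builds images whose top-left pixel perfectly encodes the label, so a network trained on them can reach perfect accuracy while ignoring everything else, and collapses to chance if that pixel is masked at test time.

```python
import torch

def make_shortcut_batch(batch_size=64, num_classes=10, size=32):
    """Synthetic images in which one pixel's intensity encodes the label.

    The image content is pure noise; only the top-left pixel is
    diagnostic. A classifier trained on such batches can succeed by
    reading that pixel alone, illustrating shortcut learning in the
    spirit of Malhotra et al. (2020) and Geirhos et al. (2020).
    """
    images = torch.rand(batch_size, 3, size, size)   # uninformative "content"
    labels = torch.randint(0, num_classes, (batch_size,))
    images[:, :, 0, 0] = labels.float().view(-1, 1) / num_classes  # the shortcut
    return images, labels

# Masking the shortcut at test time reveals what was actually learned:
test_images, test_labels = make_shortcut_batch()
test_images[:, :, 0, 0] = 0.0   # a shortcut learner's accuracy drops to chance
```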
Existing work validates the impact of training tasks on model behavior and representations. Even when restricted to training on ImageNet images, the training objective and/or data augmentation can affect how well models match human similarity judgments of images (Muttenthaler, Dippel, Linhardt, Vandermeulen, & Kornblith, 2023), categorization patterns (Geirhos et al., 2021), performance on real-time and life-long learning benchmarks (Zhuang et al., 2022), and feature preferences (Hermann et al., 2020), and also how well they predict primate physiology (Zhuang et al., 2021) and human fMRI (Konkle & Alvarez, 2022) data. Still, it is possible to enrich DNN training tasks much further, even for object categorization (Sun, Shrivastava, Singh, & Gupta, 2017).
We have discussed the promise of training embodied, interactive agents in rich, ethologically relevant environments. What efforts have already been made in this direction, and what might they look like in the future? Past work situating a vision system within a simulated agent that navigates and interacts with its environment gives promising initial indications that human-like visual behaviors can emerge in this setting (Haber, Mrowca, Wang, Fei-Fei, & Yamins, 2018; Hill et al., 2020; Nayebi et al., 2021; Weihs et al., 2021). The continued development of new, more naturalistic training environments (Gan et al., 2021; Greff et al., 2022; Puig et al., 2018; Savva et al., 2019; Xiang et al., 2020) should support pushing this research program still further toward human-like learning. In addition, state-of-the-art large language models provide a new means of communicating richer tasks to models (Chen et al., 2023), and a new reservoir of human-like knowledge for models to draw on (Brohan et al., 2023). We predict that further work in these directions will address shortcomings Bowers et al. identify and yield improved DNN accounts of human vision.
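To fix ideas, here is a schematic of the kind of embodied training loop we have in mind. The `env` interface is a hypothetical stand-in for simulators such as Habitat or ThreeDWorld (whose real APIs differ), and the REINFORCE-style objective is just one simple choice; the point is that the visual encoder's gradients come from acting in a world rather than from labeling static images.

```python
import torch

def embodied_training_step(env, encoder, policy, optimizer, max_steps=100):
    """One episode of interaction; gradients flow into the visual encoder.

    env is a hypothetical simulator with reset() -> observation and
    step(action) -> (observation, reward, done); encoder maps images to
    features and policy maps features to action logits.
    """
    obs = env.reset()
    log_probs, rewards = [], []
    for _ in range(max_steps):
        logits = policy(encoder(obs))                 # perceive, then decide
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        obs, reward, done = env.step(action.item())   # act in the world
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
        if done:
            break
    # REINFORCE: reinforce the episode's actions in proportion to its return.
    episode_return = sum(rewards)
    loss = -episode_return * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```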
Acknowledgments
We thank Mike Mozer and Robert Geirhos for interesting discussions and helpful feedback.
Financial support
Aran Nayebi is supported by a K. Lisa Yang Integrative and Computational Neuroscience (ICoN) Postdoctoral Fellowship. Matt Jones is supported in part by NSF Grant 2020906.
Competing interest
None.