
Learning quantification from images: A structured neural architecture

Published online by Cambridge University Press: 02 April 2018

I. SORODOC
Affiliation:
Center for Mind/Brain Sciences (CIMeC), University of Trento, Palazzo Fedrigotti - corso Bettini 31, 38068 Rovereto (TN), Italy
S. PEZZELLE
Affiliation:
Center for Mind/Brain Sciences (CIMeC), University of Trento, Palazzo Fedrigotti - corso Bettini 31, 38068 Rovereto (TN), Italy
A. HERBELOT
Affiliation:
Center for Mind/Brain Sciences (CIMeC), University of Trento, Palazzo Fedrigotti - corso Bettini 31, 38068 Rovereto (TN), Italy
M. DIMICCOLI
Affiliation:
University of Barcelona, Gran via de les Corts Catalanes 585, 08007 Barcelona, Spain; Computer Vision Center, Edificio O, Campus UAB, 08193 Bellaterra (Cerdanyola), Barcelona, Spain
R. BERNARDI
Affiliation:
Center for Mind/Brain Sciences (CIMeC), University of Trento, Palazzo Fedrigotti - corso Bettini 31, 38068 Rovereto (TN), Italy; Department of Information Engineering and Computer Science (DISI), University of Trento, Via Sommarive, 9 I-38123 Povo (TN), Italy

Abstract

Major advances have recently been made in merging language and vision representations. Most tasks considered so far have confined themselves to the processing of objects and lexicalised relations amongst objects (content words). We know, however, that humans (even pre-school children) can abstract over raw multimodal data to perform certain types of higher-level reasoning, expressed in natural language by function words. A case in point is given by their ability to learn quantifiers, i.e. expressions like "few", "some" and "all". From formal semantics and cognitive linguistics, we know that quantifiers are relations over sets which, as a simplification, we can see as proportions. For instance, in "most fish are red", "most" encodes the proportion of fish which are red fish. In this paper, we study how well current neural network strategies model such relations. We propose a task where, given an image and a query expressed by an object–property pair, the system must return a quantifier expressing what proportion of the queried objects have the queried property. Our contributions are twofold. First, we show that the best performance on this task involves coupling state-of-the-art attention mechanisms with a network architecture mirroring the logical structure assigned to quantifiers by classic linguistic formalisation. Second, we introduce a new balanced dataset of image scenarios associated with quantification queries, which we hope will foster further research in this area.
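
The reading of quantifiers as proportions can be illustrated with a minimal Python sketch. It is not the neural architecture the paper proposes: it works on a hand-written symbolic scene instead of an image, and the scene format, the function names, the label "no" and every threshold below are assumptions made only for illustration.

```python
from fractions import Fraction


def proportion(scene, target, prop):
    """Share of `target` objects in the scene that have property `prop`."""
    # Restrictor set: all objects of the queried kind (e.g. all fish).
    restrictor = [props for name, props in scene if name == target]
    if not restrictor:
        raise ValueError(f"no '{target}' in the scene")
    # Scope set: those restrictor objects that have the queried property.
    scope = [props for props in restrictor if prop in props]
    return Fraction(len(scope), len(restrictor))


def quantifier(p):
    """Map a proportion to a quantifier label.

    The cut-offs are hypothetical; the abstract does not specify any.
    """
    if p == 0:
        return "no"
    if p == 1:
        return "all"
    if p < Fraction(1, 3):
        return "few"
    if p > Fraction(1, 2):
        return "most"
    return "some"


# Toy scene: each object is (name, set of properties).
scene = [
    ("fish", {"red"}),
    ("fish", {"red"}),
    ("fish", {"red"}),
    ("fish", {"blue"}),
    ("shell", {"red"}),
]

# Query: the object-property pair (fish, red).
answer = quantifier(proportion(scene, "fish", "red"))
print(answer)  # most
```

Here three of the four fish are red, so the proportion is 3/4 and the sketch answers "most"; the red shell is ignored because it lies outside the restrictor set.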

Type
Articles
Copyright
Copyright © Cambridge University Press 2018 
