
The role of image representations in vision to language tasks

Published online by Cambridge University Press:  21 March 2018

PRANAVA MADHYASTHA
Affiliation:
Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello St., Sheffield S1 4DP, UK
JOSIAH WANG
Affiliation:
Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello St., Sheffield S1 4DP, UK
LUCIA SPECIA
Affiliation:
Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello St., Sheffield S1 4DP, UK

Abstract

Tasks that require modeling both language and visual information, such as image captioning, have become very popular in recent years. Most state-of-the-art approaches use image representations obtained from a deep neural network, which end-to-end neural models then use in a variety of ways to generate language. However, it is not clear how different image representations contribute to language generation tasks. In this paper, we probe the representational contribution of image features in an end-to-end neural modeling framework and study the properties of different types of image representations. We focus on two popular vision-to-language problems: image captioning and multimodal machine translation. Our analysis provides insights into the representational properties and suggests that end-to-end approaches implicitly learn a visual-semantic subspace and exploit it to generate captions.

Copyright
Copyright © Cambridge University Press 2018 

