
Neural embeddings: accurate and readable inferences based on semantic kernels

Published online by Cambridge University Press: 31 July 2019

Danilo Croce*
Affiliation:
Department of Enterprise Engineering, University of Roma, Tor Vergata, Rome, Italy
Daniele Rossini
Affiliation:
Department of Enterprise Engineering, University of Roma, Tor Vergata, Rome, Italy
Roberto Basili
Affiliation:
Department of Enterprise Engineering, University of Roma, Tor Vergata, Rome, Italy
*Corresponding author. Email: [email protected]

Abstract

Sentence embeddings are suitable input vectors for the neural learning of many inferences about content and meaning. Similarity estimation, classification, and emotional characterization of sentences, as well as pragmatic tasks such as question answering or dialogue, have largely demonstrated the effectiveness of vector embeddings in modeling semantics. Unfortunately, most of these decisions are epistemologically opaque, owing to the limited interpretability of the neural models acquired from such embeddings. We think that any effective approach to meaning representation should be at least epistemologically coherent. In this paper, we concentrate on the readability of neural models as a core property of any embedding technique that is consistent and effective in representing sentence meaning. From this perspective, the paper discusses a novel embedding technique (the Nyström methodology) that corresponds to the reconstruction of a sentence in a kernel space, inspired by rich semantic similarity metrics (a semantic kernel) rather than by a language model. In addition to being based on a kernel that captures grammatical and lexical semantic information, the proposed embedding can be used as the input vector of an effective neural learning architecture, called Kernel-based Deep Architecture (KDA). Finally, the embedding also determines by design the explanatory capability of the KDA, as it is derived from examples that are both human readable and labeled. This property is obtained by integrating KDAs with an explanation methodology, layer-wise relevance propagation (LRP), already proposed in image processing. Here, the Nyström embeddings support the automatic compilation of arguments for or against a KDA inference, in the form of an explanation: each decision can be linked through LRP back to real examples, that is, the landmarks linguistically related to the input instance. The KDA network output is explained by analogy with the activated landmarks. Quantitative evaluation of the explanations shows that richer explanations, based on semantic and syntagmatic structures, make for convincing arguments, as they effectively help the user assess whether or not to trust the machine's decisions in different tasks, for example, Question Classification or Semantic Role Labeling. This confirms the epistemological benefit that Nyström embeddings may bring, as linguistically rich and meaningful representations for a variety of inference tasks.
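
The two ideas named in the abstract, embedding an input through its kernel similarities to a set of landmark examples and tracing a decision back to those landmarks, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a Gaussian kernel over toy numeric vectors as a stand-in for the semantic kernel over sentences, a standard Nyström projection of the form U S^{-1/2} computed from the landmark Gram matrix, and the epsilon rule of LRP applied to bias-free linear layers only. All function names (`rbf_kernel`, `nystrom_projection`, `nystrom_embed`, `lrp_linear`) are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, gamma=0.5):
    """Stand-in kernel on numeric vectors (assumption: the paper's kernel
    is a semantic kernel over sentences, not reproduced here)."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))


def nystrom_projection(landmarks, kernel, tol=1e-10):
    """Return H = U S^{-1/2} from the eigendecomposition of the
    landmark Gram matrix W = U S U^T, dropping near-zero eigenvalues."""
    gram = kernel(landmarks, landmarks)
    eigvals, eigvecs = np.linalg.eigh(gram)
    keep = eigvals > tol * eigvals.max()
    return eigvecs[:, keep] / np.sqrt(eigvals[keep])


def nystrom_embed(x, landmarks, projection, kernel):
    """Embed each row of x as (kernel similarities to landmarks) @ H."""
    return kernel(x, landmarks) @ projection


def lrp_linear(inputs, weights, relevance_out, eps=1e-9):
    """Epsilon-rule LRP through a bias-free linear layer z = inputs @ weights.
    Redistributes the relevance of each output unit onto the inputs in
    proportion to their contribution inputs[i] * weights[i, j]."""
    z = inputs @ weights
    z = z + eps * np.where(z >= 0, 1.0, -1.0)  # stabilizer, avoids division by 0
    return inputs * (weights @ (relevance_out / z))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    landmarks = rng.normal(size=(5, 3))        # toy "landmark examples"
    x = rng.normal(size=(1, 3))                # toy input instance

    H = nystrom_projection(landmarks, rbf_kernel)
    c = rbf_kernel(x, landmarks)[0]            # similarities to each landmark
    e = c @ H                                  # Nystrom embedding of x

    V = rng.normal(size=(H.shape[1], 1))       # toy linear scoring layer
    y = e @ V                                  # network output (one score)

    r_embedding = lrp_linear(e, V, y)          # relevance per embedding dimension
    r_landmarks = lrp_linear(c, H, r_embedding)  # relevance per landmark
    print("most relevant landmark index:", int(np.argmax(r_landmarks)))
```

In this sketch the landmark with the highest relevance plays the role of the example offered in support of the decision, while landmarks with negative relevance would count against it. A real KDA would add non-linear hidden layers between the projection and the output, which this sketch omits.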

Type
Article
Copyright
© Cambridge University Press 2019 

