
Relational paraphrase acquisition from Wikipedia: The WRPA method and corpus

Published online by Cambridge University Press: 16 September 2013

M. VILA
Affiliation:
CLiC, Departament de Lingüística, Universitat de Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
H. RODRÍGUEZ
Affiliation:
TALP, Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Jordi Girona Salgado 1-3, 08034 Barcelona, Spain
M. A. MARTÍ
Affiliation:
CLiC, Departament de Lingüística, Universitat de Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain

Abstract

Paraphrase corpora are an essential but scarce resource in Natural Language Processing. In this paper, we present the Wikipedia-based Relational Paraphrase Acquisition (WRPA) method, which extracts relational paraphrases from Wikipedia, and the derived WRPA paraphrase corpus. The WRPA corpus currently covers person-related and authorship relations in English and Spanish, respectively, suggesting that, given adequate Wikipedia coverage, our method is independent of the language and the relation addressed. WRPA extracts entity pairs from structured information in Wikipedia applying distant learning and, based on the distributional hypothesis, uses them as anchor points for candidate paraphrase extraction from the free text in the body of Wikipedia articles. Focussing on relational paraphrasing and taking advantage of Wikipedia-structured information allows for an automatic and consistent evaluation of the results. The WRPA corpus characteristics distinguish it from other types of corpora that rely on string similarity or transformation operations. WRPA relies on distributional similarity and is the result of the free use of language outside any reformulation framework. Validation results show a high precision for the corpus.
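
The anchor-point idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `candidate_snippets`, the plain substring matching, and the example data are all assumptions made here for illustration. It takes entity pairs that would come from structured information (for example, a person and a birth year) and keeps the stretch of each sentence that contains both entities, with the entities replaced by the slots `[X]` and `[Y]`.

```python
def candidate_snippets(sentences, entity_pairs):
    """Collect candidate paraphrase snippets anchored on known entity pairs.

    Hypothetical helper: plain substring matching stands in for whatever
    entity matching a real system would use.
    """
    snippets = []
    for x, y in entity_pairs:
        for sentence in sentences:
            i, j = sentence.find(x), sentence.find(y)
            if i == -1 or j == -1:
                continue  # both anchors must occur in the sentence
            # Keep only the span running from the first anchor to the last.
            start = min(i, j)
            end = max(i + len(x), j + len(y))
            span = sentence[start:end]
            # Abstract away the entities so snippets become comparable.
            snippets.append(span.replace(x, "[X]").replace(y, "[Y]"))
    return snippets


# Made-up example data, not taken from the WRPA corpus.
pairs = [("Ada Lovelace", "1815")]
text = [
    "Ada Lovelace was born in 1815 in London.",
    "Ada Lovelace, born in 1815, was a mathematician.",
    "London was a large city in 1815.",
]
print(candidate_snippets(text, pairs))
```

Snippets that share the same entity pair are candidates for expressing the same relation; under the assumptions above, deciding which candidates are true paraphrases would still require a separate filtering or validation step.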

Type
Articles
Copyright
Copyright © Cambridge University Press 2013 

Footnotes

This work was supported by the MINECO projects DIANA (TIN2012-38603-C02-02) and SKATER (TIN2012-38584-C06-01), as well as a MECD FPU grant (AP2008-02185). We are also grateful to Esther Arias, Santiago González, Rita Zaragoza and Oriol Borrega, the linguists who worked on the annotation processes.
