Relational paraphrase acquisition from Wikipedia: The WRPA method and corpus†

M. VILA; H. RODRÍGUEZ; M. A. MARTÍ

doi:10.1017/S1351324913000235

Published online by Cambridge University Press: 16 September 2013

M. VILA: Affiliation:
CLiC, Departament de Lingüística, Universitat de Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain e-mail: [email protected], [email protected]
H. RODRÍGUEZ: Affiliation:
TALP, Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Jordi Girona Salgado 1-3, 08034 Barcelona, Spain e-mail: [email protected]
M. A. MARTÍ: Affiliation:
CLiC, Departament de Lingüística, Universitat de Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain e-mail: [email protected], [email protected]

Abstract

Paraphrase corpora are an essential but scarce resource in Natural Language Processing. In this paper, we present the Wikipedia-based Relational Paraphrase Acquisition (WRPA) method, which extracts relational paraphrases from Wikipedia, and the derived WRPA paraphrase corpus. The WRPA corpus currently covers person-related and authorship relations in English and Spanish, respectively, suggesting that, given adequate Wikipedia coverage, our method is independent of the language and the relation addressed. WRPA extracts entity pairs from structured information in Wikipedia applying distant learning and, based on the distributional hypothesis, uses them as anchor points for candidate paraphrase extraction from the free text in the body of Wikipedia articles. Focussing on relational paraphrasing and taking advantage of Wikipedia-structured information allows for an automatic and consistent evaluation of the results. The WRPA corpus characteristics distinguish it from other types of corpora that rely on string similarity or transformation operations. WRPA relies on distributional similarity and is the result of the free use of language outside any reformulation framework. Validation results show a high precision for the corpus.
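The anchor-based extraction step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `candidate_patterns`, the toy entity pairs and the toy sentences are all hypothetical, and plain substring matching stands in for whatever entity matching and filtering WRPA actually performs. The sketch assumes the entity pairs have already been obtained from structured information such as infoboxes.

```python
import re
from collections import Counter


def candidate_patterns(entity_pairs, sentences):
    """Collect text spans that link both members of an entity pair.

    Hypothetical helper: each (x, y) pair acts as an anchor; any sentence
    mentioning both is masked with [X]/[Y] placeholders, and the span from
    one placeholder to the other is counted as a candidate paraphrase.
    """
    patterns = Counter()
    for x, y in entity_pairs:
        for sentence in sentences:
            # Both anchors must occur in the sentence (naive substring test).
            if x not in sentence or y not in sentence:
                continue
            masked = sentence.replace(x, "[X]").replace(y, "[Y]")
            # Keep only the span between the two placeholders, in either order.
            match = re.search(r"\[X\].*\[Y\]|\[Y\].*\[X\]", masked)
            if match:
                patterns[match.group(0)] += 1
    return patterns


# Toy data (assumed for illustration): person / place-of-birth pairs.
PAIRS = [("Ada Lovelace", "London"), ("Alan Turing", "London")]
SENTENCES = [
    "Ada Lovelace was born in London in 1815.",
    "Alan Turing was born in London.",
    "London is the birthplace of Ada Lovelace.",
]

patterns = candidate_patterns(PAIRS, SENTENCES)
for pattern, count in patterns.most_common():
    print(count, pattern)
```

Spans that recur across different entity pairs of the same relation are, under the distributional hypothesis, plausible paraphrases of one another; a real system would add further filtering and validation on top of this.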

Type: Articles
Information: Natural Language Engineering, Volume 21, Issue 3, May 2015, pp. 355–389

DOI: https://doi.org/10.1017/S1351324913000235
Copyright: Copyright © Cambridge University Press 2013

Footnotes

† This work was supported by the MINECO projects DIANA (TIN2012-38603-C02-02) and SKATER (TIN2012-38584-C06-01), as well as a MECD FPU grant (AP2008-02185). We are also grateful to Esther Arias, Santiago González, Rita Zaragoza and Oriol Borrega, the linguists who worked on the annotation processes.
