Building a multi-domain comparable corpus using a learning to rank method

Published online by Cambridge University Press:  15 June 2016

RAZIEH RAHIMI
Affiliation:
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
AZADEH SHAKERY
Affiliation:
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran; School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
JAVID DADASHKARIMI
Affiliation:
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
MOZHDEH ARIANNEZHAD
Affiliation:
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
MOSTAFA DEHGHANI
Affiliation:
Institute for Logic, Language and Computation, University of Amsterdam, Amsterdam, The Netherlands
HOSSEIN NASR ESFAHANI
Affiliation:
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran

Abstract

Comparable corpora are key translation resources for languages and domains with limited linguistic resources. Existing approaches for building comparable corpora mostly rank candidate documents in the target language for each source document using a cross-lingual retrieval model. These approaches also exploit other evidence of document similarity, such as proper names and publication dates, to build more reliable alignments. However, the importance of each piece of evidence in the scores of candidate target documents is determined heuristically. In this paper, we employ a learning to rank method for ranking candidate target documents with respect to each source document. The ranking model is constructed by defining each piece of evidence for the similarity of bilingual documents as a feature whose weight is learned automatically. Learning feature weights can significantly improve the quality of alignments, because the reliability of features depends on the characteristics of both the source and target languages of a comparable corpus. We also propose a method to generate appropriate training data for the task of building comparable corpora. We employed the proposed learning-based approach to build a multi-domain English–Persian comparable corpus which covers twelve different domains obtained from the Open Directory Project. Experimental results show that the created alignments have high degrees of comparability. Comparison with existing approaches for building comparable corpora shows that our learning-based approach improves both the quality and coverage of alignments.
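
The idea of treating each piece of similarity evidence as a feature with a learned weight can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear scoring function trained with a simple pairwise hinge-loss update, and the three features (retrieval score, proper-name overlap, date proximity), the toy values, and all function names are hypothetical.

```python
# Assumed feature order (hypothetical):
# [cross-lingual retrieval score, proper-name overlap, publication-date proximity]

def score(weights, features):
    """Linear ranking score: weighted sum of the similarity features."""
    return sum(w * f for w, f in zip(weights, features))

def train_pairwise(pairs, n_features, epochs=50, lr=0.1, margin=1.0):
    """Learn feature weights from (better, worse) feature-vector pairs.

    Whenever the better candidate does not outscore the worse one by
    at least `margin`, the weights move toward the feature difference.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            if score(weights, better) - score(weights, worse) < margin:
                for i in range(n_features):
                    weights[i] += lr * (better[i] - worse[i])
    return weights

def rank(weights, candidates):
    """Return candidate ids ordered from highest to lowest score."""
    return sorted(candidates, key=lambda c: score(weights, candidates[c]),
                  reverse=True)

# Made-up training pairs: for one source document, the first vector
# belongs to the candidate judged more comparable than the second.
pairs = [
    ([0.9, 0.8, 1.0], [0.4, 0.1, 0.0]),
    ([0.7, 0.6, 1.0], [0.8, 0.0, 0.0]),
    ([0.6, 0.9, 0.0], [0.5, 0.2, 1.0]),
]
weights = train_pairwise(pairs, n_features=3)

# Rank two made-up target-language candidates for a new source document.
candidates = {"fa_doc_1": [0.5, 0.1, 0.0], "fa_doc_2": [0.6, 0.7, 1.0]}
ranking = rank(weights, candidates)
```

In this toy setting the weights are driven by the preference pairs rather than set by hand, which is the contrast with heuristic weighting drawn in the abstract; a real system would use far more training pairs and an established learning to rank toolkit.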

Type
Articles
Copyright
Copyright © Cambridge University Press 2016 

Footnotes

This research was supported in part by a grant from the Institute for Research in Fundamental Sciences (No. CS1393-4-43).
