Leveraging bilingual terminology to improve machine translation in a CAT environment*

MIHAEL ARCAN; MARCO TURCHI; SARA TONELLI; PAUL BUITELAAR

doi:10.1017/S1351324917000195

Published online by Cambridge University Press: 30 May 2017

MIHAEL ARCAN: Affiliation:
Insight Centre for Data Analytics, National University of Ireland, Galway
MARCO TURCHI: Affiliation:
FBK - Fondazione Bruno Kessler, Via Sommarive 18, 38123 Trento, Italy
SARA TONELLI: Affiliation:
FBK - Fondazione Bruno Kessler, Via Sommarive 18, 38123 Trento, Italy
PAUL BUITELAAR: Affiliation:
Insight Centre for Data Analytics, National University of Ireland, Galway

Abstract

This work focuses on extracting automatically aligned bilingual terminology and integrating it into a Statistical Machine Translation (SMT) system in a Computer Aided Translation (CAT) scenario. We evaluate a framework that takes a small set of parallel documents as input, gathers domain-specific bilingual terms, and injects them into an SMT system to enhance translation quality. To this end, we investigate several strategies for extracting and aligning terminology across languages and for integrating it into an SMT system. We compare two terminology injection methods that can be easily used at run-time without altering the normal activity of an SMT system: XML markup and a cache-based model. We test the cache-based model on two domains (information technology and medical) in English, Italian and German, showing significant improvements ranging from 2.23 to 6.78 BLEU points over a baseline SMT system and from 0.05 to 3.03 BLEU points over the widely used XML markup approach.
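
As a rough illustration of the XML markup injection method named in the abstract, the Python sketch below wraps known source-language terms in a tag that carries the suggested target translation. It follows the general convention of decoders that accept XML-annotated input, such as Moses. The function name `annotate_terms`, the tag name `term`, the sample term table and the longest-match lookup are all assumptions made for illustration; the article's own pipeline may differ.

```python
from xml.sax.saxutils import escape, quoteattr


def annotate_terms(sentence, term_table, tag="term"):
    """Wrap known source terms in XML markup carrying a suggested translation.

    term_table maps a source term (one or more tokens) to its target term.
    Hypothetical helper: the tag and attribute names are illustrative.
    """
    tokens = sentence.split()
    # Index terms by lower-cased token tuple for case-insensitive lookup.
    index = {tuple(src.lower().split()): tgt for src, tgt in term_table.items()}
    max_len = max((len(key) for key in index), default=0)

    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first so multi-word terms win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok.lower() for tok in tokens[i:i + n])
            if key in index:
                surface = escape(" ".join(tokens[i:i + n]))
                out.append(
                    f"<{tag} translation={quoteattr(index[key])}>{surface}</{tag}>"
                )
                i += n
                break
        else:
            # No term starts at this position: copy the token, XML-escaped.
            out.append(escape(tokens[i]))
            i += 1
    return " ".join(out)


# Illustrative English-Italian term table (made-up entries).
terms = {"control panel": "pannello di controllo", "panel": "pannello"}
print(annotate_terms("open the control panel now", terms))
```

The annotated sentence would then be passed to a decoder configured to read XML input. Whether the suggested translation replaces or merely competes with the system's own phrase options depends on the decoder settings, which are not covered by this sketch.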

Type: Articles
Information: Natural Language Engineering, Volume 23, Issue 5, September 2017, pp. 763–788

DOI: https://doi.org/10.1017/S1351324917000195
Copyright: Copyright © Cambridge University Press 2017

Footnotes

This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 (Insight).
