Statistics for Corpus-Based and Corpus-Driven Approaches to Empirical Translation Studies

doi:10.1017/9781108525695.003

3 - Statistics for Corpus-Based and Corpus-Driven Approaches to Empirical Translation Studies

Published online by Cambridge University Press: 10 June 2019

Michael Oakes

Edited by Meng Ji and Michael Oakes

Meng Ji: University of Sydney
Michael Oakes: University of Wolverhampton

Summary

Tognini-Bonelli (2001) distinguished between corpus-based and corpus-driven studies: corpus-based studies start with pre-existing theories, which are tested using corpus data, whereas in corpus-driven studies the hypothesis is derived by examination of the corpus evidence. This chapter gives an overview of the two families of statistical tests suited to these two approaches. For corpus-based approaches, we use more traditional statistics, such as the t-test or ANOVA, which return a p-value indicating to what extent the initial hypothesis should be accepted or rejected. Multi-level modelling (also known as mixed modelling) is a new technique which shows considerable promise for corpus-based studies; it is also described here and applied to the ENNTT subset of the Europarl corpus. Multi-level modelling is useful for examining hierarchically structured or “nested” data, where, for example, translations may be “nested” together in a class if they have the same language of origin. A multi-level model takes account of both the variation between individual translations and the variation between classes. For example, we might expect the scores (such as vocabulary richness or readability scores) of two translations in the same class to be more similar to each other than those of two translations in different classes.
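
The nesting idea in the summary can be illustrated with a simple variance decomposition. What follows is a minimal sketch under stated assumptions, not the chapter's own analysis and not a fitted multi-level model: the scores, the language labels and the function name `variance_components` are all hypothetical, and the design is assumed to be balanced (the same number of translations per class). It uses the classical one-way random-effects ANOVA estimates to separate between-class from within-class variance and reports the intraclass correlation, which is high when translations sharing a language of origin resemble each other more than translations from different classes.

```python
from statistics import mean


def variance_components(groups):
    """Split score variance into between-class and within-class parts.

    `groups` maps a class label (here, a hypothetical source language)
    to a list of scores. A balanced design is assumed.
    """
    k = len(groups)                       # number of classes
    n = len(next(iter(groups.values())))  # translations per class
    if k < 2 or n < 2:
        raise ValueError("need at least 2 classes and 2 scores per class")
    if any(len(v) != n for v in groups.values()):
        raise ValueError("balanced design assumed: equal class sizes")

    grand_mean = mean(x for v in groups.values() for x in v)
    class_means = {g: mean(v) for g, v in groups.items()}

    # Sums of squares between classes and within classes.
    ss_between = n * sum((m - grand_mean) ** 2 for m in class_means.values())
    ss_within = sum(
        (x - class_means[g]) ** 2 for g, v in groups.items() for x in v
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (k * (n - 1))
    if ms_within == 0:
        raise ValueError("no within-class variation; F and ICC are undefined")

    # Method-of-moments estimate of the class-level variance (floored at 0).
    var_between = max(0.0, (ms_between - ms_within) / n)
    icc = var_between / (var_between + ms_within)

    return {
        "F": ms_between / ms_within,  # the one-way ANOVA test statistic
        "var_between": var_between,
        "var_within": ms_within,
        "icc": icc,
    }


# Hypothetical vocabulary-richness scores, grouped by language of origin.
scores = {
    "fr": [0.52, 0.54, 0.53, 0.55],
    "de": [0.60, 0.62, 0.61, 0.63],
    "fi": [0.45, 0.47, 0.46, 0.44],
}

result = variance_components(scores)
print(result)
```

An intraclass correlation near 1 means that most of the variation lies between classes, which is the situation in which ignoring the nesting would be most misleading. A full multi-level model, fitted with dedicated statistical software, would additionally handle unbalanced classes and further predictors.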

Keywords

Corpus-based and corpus-driven translation studies; linear regression; mixed models; analysis of variance; principal components analysis

Type: Chapter
Information: Advances in Empirical Translation Studies: Developing Translation Resources and Technologies, pp. 28–52

DOI: https://doi.org/10.1017/9781108525695.003

Publisher: Cambridge University Press

Print publication year: 2019

References

Baayen, R. Harald (2008). Analysing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge, UK: Cambridge University Press.

Biber, Douglas (2009). Corpus-based and corpus-driven analyses of language variation and use. In Heine, Bernd and Narrog, Heiko (eds.), The Oxford Handbook of Linguistics (1st edition). Oxford, UK: Oxford University Press.

Koehn, Philipp (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit (MT Summit X), Phuket, Thailand. Tokyo: Asia-Pacific Association for Machine Translation.

Koppel, M. and Ordan, N. (2011). Translationese and its dialects. In Proceedings of ACL, Portland OR, June 2011. Stroudsburg, PA: Association for Computational Linguistics, pp. 1318–1326.

Nisioi, Sergiu, Rabinovich, Ella, Dinu, Liviu P. and Wintner, Shuly (2016). A corpus of native, non-native and translated texts. In Calzolari, Nicoletta et al. (eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), Portoroz, Slovenia, May 23–28, 2016. European Language Resources Association, pp. 4197–4200.

Rabinovich, Ella, Nisioi, Sergiu, Ordan, Noam and Wintner, Shuly (2016). On the similarities between native, non-native and translated texts. In van den Bosch, Antal (General Chair) (ed.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August. Stroudsburg, PA: Association for Computational Linguistics, pp. 1870–1881.

Rabinovich, Ella, Ordan, Noam and Wintner, Shuly (2017). Found in translation: Reconstructing phylogenetic language trees from translations. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, 30 July–4 August. Stroudsburg, PA: Association for Computational Linguistics, pp. 530–540.

Serva, Maurizio and Petroni, Filippo (2008). Indo-European languages tree by Levenshtein distance. Europhysics Letters 81(6), 68005.

Tognini-Bonelli, Elena (2001). Corpus Linguistics at Work. Amsterdam: John Benjamins.

Winter, Bodo (2013). Linear models and linear mixed effects models in R with linguistic applications. Tutorials 1 and 2. arXiv:1308.5499. http://arxiv.org/pdf/1308.5499.pdf.