Non-native text analysis: A survey

SEAN MASSUNG; CHENGXIANG ZHAI

doi:10.1017/S1351324915000303

Published online by Cambridge University Press: 07 September 2015

SEAN MASSUNG: Department of Computer Science, College of Engineering, University of Illinois at Urbana–Champaign, Urbana, Illinois, USA
CHENGXIANG ZHAI: Department of Computer Science, College of Engineering, University of Illinois at Urbana–Champaign, Urbana, Illinois, USA

Abstract

Non-native speakers of English far outnumber native speakers; English is the main language of books, newspapers, airports, air-traffic control, international business, academic conferences, science, technology, diplomacy, sports, international competitions, pop music, and advertising (British Council 2014). Online education in the form of massive open online courses is also primarily in English, even for courses that teach English. This creates enormous amounts of text written by non-native speakers, which in turn generates a need for grammar correction and analysis. Even aside from massive open online courses, the number of English learners in Asia alone is in the tens of millions. In this paper, we survey the two main areas of existing work on non-native text analysis, prefaced by an overview of common datasets used by researchers, comparing their attributes and potential uses. An introduction to native language identification follows: determining the native language of an author based on text written in a second language. This section is subdivided into various techniques and a shared task on this classification problem. Next, we discuss non-native grammatical error correction: finding and modifying text to fix errors or to make it sound more fluent. Again, we discuss different methods before investigating a relevant shared task. Lastly, we end with conclusions and potential future directions. While this survey primarily focuses on detecting and correcting non-native English text, many approaches are general and can be used across any language pairing.

Type: Survey Paper
Information: Natural Language Engineering, Volume 22, Issue 2, March 2016, pp. 163–186

DOI: https://doi.org/10.1017/S1351324915000303
Copyright: Copyright © Cambridge University Press 2015

References

Ando, R., and Zhang, T. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6: 1817–53.

Blanchard, D., Tetreault, J., Higgins, D., Cahill, A., and Chodorow, M. 2013. TOEFL11: a corpus of non-native English. Technical Report, Educational Testing Service (ETS). (https://www.ets.org/)

Blei, D., Ng, A., and Jordan, M. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3: 993–1022.

Blunsom, P., and Cohn, T. 2010. Unsupervised induction of tree substitution grammars for dependency parsing. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, Massachusetts, USA, pp. 1204–13.

Bobicev, V. 2013. Native language identification with PPM. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, Atlanta, Georgia, USA, pp. 180–87.

Boisson, J., Kao, T., Wu, J., Yen, T., and Chang, J. 2013. Linggle: a web-scale linguistic search engine for words in context. In Proceedings of Association for Computational Linguistics (Conference System Demonstrations), Sofia, Bulgaria, pp. 139–44.

British Council 2014. How Many People Speak English? http://www.britishcouncil.org/learning-faq-the-english-language.htm

Brockett, C., Dolan, W., and Gamon, M. 2006. Correcting ESL errors using phrasal SMT techniques. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, pp. 249–56.

Brooke, J., and Hirst, G. 2011. Native language detection with ‘Cheap’ learner corpora. In Proceedings of the 2011 Conference on Learner Corpus Research, Louvain-la-Neuve, Belgium, pp. 37–47.

Brooke, J., and Hirst, G. 2012. Robust, lexicalized native language identification. In Proceedings of the International Conference on Computational Linguistics, Mumbai, India, pp. 391–408.

Bykh, S., and Meurers, D. 2014. Exploring syntactic features for native language identification: a variationist perspective on feature encoding and ensemble optimization. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin, Ireland, pp. 1962–73.

Cahill, A., Madnani, N., Tetreault, J., and Napolitano, D. 2013. Robust Systems for Preposition Error Correction Using Wikipedia Revisions. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, USA, pp. 507–17.

Chodorow, M., Dickinson, M., Israel, R., and Tetreault, J. 2012. Problems in evaluating grammatical error detection systems. In Proceedings of COLING 2012, Mumbai, India, pp. 611–28.

Corston-Oliver, S., Gamon, M., and Brockett, C. 2001. A machine learning approach to the automatic evaluation of machine translation. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, Toulouse, France, pp. 148–55.

Dahlmeier, D., and Ng, H. 2011a. Correcting Semantic Collocation Errors with L1-induced Paraphrases. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK, pp. 107–17.

Dahlmeier, D., and Ng, H. 2011b. Grammatical error correction with alternating structure optimization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, Portland, Oregon, USA, pp. 915–23.

Dahlmeier, D., Ng, H., and Wu, S. 2013. Building a large annotated corpus of learner English: the NUS Corpus of Learner English. In Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications, Atlanta, Georgia, USA, pp. 22–31.

Felice, M., Yuan, Z., Andersen, Ø., Yannakoudakis, H., and Kochmar, E. 2014. Grammatical error correction using hybrid systems and type filtering. In Proceedings of the 18th Conference on Computational Natural Language Learning: Shared Task, Baltimore, Maryland, USA, pp. 15–24.

Gamon, M. 2010. Using mostly native data to correct errors in Learners’ writing: a meta-classifier approach. In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, California, USA, pp. 163–71.

Gamon, M., Aue, A., and Smets, M. 2005. Sentence-level MT evaluation without reference translations: beyond language modeling. In Proceedings of the European Association for Machine Translation (EAMT), Budapest, Hungary, pp. 103–11.

Granger, S. 2003. The International Corpus of Learner English: a new resource for foreign language learning and teaching and second language acquisition research. In Teachers of English to Speakers of Other Languages Quarterly, pp. 538–46.

Granger, S., Dagneaux, E., Meunier, F., and Paquot, M. 2009. The International Learner Corpus of English, Version 2. Presses Universitaires de Louvain.

Gui, S., and Yang, H. 2003. Zhongguo Xuexizhe Yingyu Yuliaohu (Chinese Learner English Corpus). In Shanghai Waiyu Jiaoyu Chubanshe.

Guthrie, D., Allison, B., Liu, W., Guthrie, L., and Wilks, Y. 2006. A closer look at skip-gram modelling. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy, pp. 101–11.

Ionescu, R., Popescu, M., and Cahill, A. 2014. Can characters reveal your native language? A language-independent approach to native language identification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, pp. 1363–73.

Ishikawa, S. 2009. Vocabulary in interlanguage: a study on corpus of English essays written by Asian university students (CEEAUS). In Phraseology: Corpus Linguistics and Lexicology, pp. 87–100.

Ishikawa, S. 2013. The ICNALE and sophisticated contrastive interlanguage analysis of Asian learners of English. In Learner Corpus Studies in Asia and the World, pp. 91–118.

Jarvis, S., Bestgen, Y., and Pepper, S. 2013. Maximizing classification accuracy in native language identification. In Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications, Atlanta, Georgia, USA, pp. 111–18.

Johnson, M., Griffiths, T., and Goldwater, S. 2006. Adaptor grammars: a framework for specifying compositional nonparametric Bayesian models. In Neural Information Processing Systems, pp. 641–8.

Kao, T., Chang, Y., Chiu, H., Yen, T., Boisson, J., Wu, J., and Chang, J. 2013. CoNLL-2013 shared task: grammatical error correction NTHU system description. In Proceedings of the 17th Conference on Computational Natural Language Learning: Shared Task, Sofia, Bulgaria, pp. 20–25.

Koppel, M., Schler, J., and Argamon, S. 2009. Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology 9–26.

Koppel, M., Schler, J., and Zigdon, K. 2005. Determining an author's native language by mining a text for errors. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, Illinois, USA, pp. 624–8.

Kukich, K. 1992. Techniques for automatically correcting words in text. In ACM Computing Surveys, pp. 377–439.

Kulkarni, C., Wei, K. P., Le, H., Chia, D., Papadopoulos, K., Cheng, J., Koller, D., and Klemmer, S. R. 2013. Peer and Self Assessment in Massive Online Classes. ACM Trans. Comput.-Hum. Interact. 20 (6): 33:1–33:31.

Kyle, K., Crossley, S., Dai, J., and McNamara, D. 2013. Native language identification: a key N-gram category approach. In Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications, Atlanta, Georgia, USA, pp. 242–50.

Lavergne, T., Illouz, G., Max, A., and Nagata, R. 2013. LIMSI’s participation to the 2013 shared task on Native Language Identification. In Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications, Atlanta, Georgia, USA, pp. 260–65.

Leacock, C., Chodorow, M., Gamon, M., and Tetreault, J. 2014. Automated Grammatical Error Detection for Language Learners, 2nd ed. In G. Hirst (ed.), Morgan and Claypool (Synthesis lectures on human language technologies).

Lee, J., and Seneff, S. 2006. Automatic grammar correction for second-language learners. In Proceedings of the 9th International Conference on Spoken Language Processing, Pittsburgh, Pennsylvania, USA, pp. 1978–81.

Madnani, N., Tetreault, J., and Chodorow, M. 2012. Exploring grammatical error correction with not-so-crummy machine translation. In Proceedings of the 7th Workshop on Building Educational Applications Using NLP, Montreal, Canada, pp. 44–53.

Massung, S., Zhai, C., and Hockenmaier, J. 2013. Structural parse tree features for text representation. In Proceedings of the International Conference on Semantic Computing, Irvine, California, USA, pp. 9–16.

Ng, H., Wu, S., Briscoe, T., Hadiwinoto, C., Susanto, R., and Bryant, C. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the 18th Conference on Computational Natural Language Learning: Shared Task, Baltimore, Maryland, USA, pp. 1–14.

Ng, H., Wu, S., Wu, Y., Hadiwinoto, C., and Tetreault, J. 2013. The CoNLL-2013 shared task on grammatical error correction. In Proceedings of the 17th Conference on Computational Natural Language Learning: Shared Task, Sofia, Bulgaria, pp. 1–12.

Popescu, M., and Ionescu, T. 2013. The story of the characters, the DNA and the native language. In Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications, Atlanta, Georgia, USA, pp. 270–78.

Rangel, F., Rosso, F., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., and Daelemans, W. 2014. Overview of the 2nd author profiling task at PAN 2014. In Proceedings of the Conference and Labs of the Evaluation Forum (Working Notes), Sheffield, England, UK.

Rozovskaya, A., and Roth, D. 2010. Training paradigms for correcting errors in grammar and usage. In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, California, pp. 154–62.

Rozovskaya, A., and Roth, D. 2013. Joint learning and inference for grammatical error correction. In Proceedings of Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 791–802.

Rozovskaya, A., Chang, K., Sammons, M., and Roth, D. 2013. The University of Illinois system in the CoNLL-2013 shared task. In Proceedings of the 17th Conference on Computational Natural Language Learning: Shared Task, Sofia, Bulgaria, pp. 13–19.

Rozovskaya, A., Chang, K., Sammons, M., Roth, D., and Habash, N. 2014. The Illinois-Columbia system in the CoNLL-2014 shared task. In Proceedings of the 18th Conference on Computational Natural Language Learning: Shared Task, Baltimore, Maryland, USA, pp. 34–42.

Stamatatos, E. 2009. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 60: 538–56.

Stolerman, A., Caliskan, A., and Greenstadt, R. 2013. From language to family and back: native language and language family identification from English text. In Proceedings of the 2013 NAACL HLT Student Research Workshop, Atlanta, Georgia, USA, pp. 32–9.

Swanson, B., and Charniak, E. 2012. Native language detection with tree substitution grammars. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, Jeju Island, Korea, pp. 193–97.

Tajiri, T., Komachi, M., and Matsumoto, Y. 2012. Tense and aspect error correction for ESL learners using global context. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Jeju Island, Korea, pp. 198–202.

Tetreault, J., Blanchard, D., and Cahill, A. 2013. A report on the first native language identification shared task. In Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications, Atlanta, Georgia, USA, pp. 48–57.

Tsur, O., and Rappoport, A. 2007. Using classifier features for studying the effect of native language on the choice of written second language words. In Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition, Prague, Czech Republic, pp. 9–16.

Tsvetkov, Y., Twitto, N., Schneider, N., Ordan, N., Faruqui, M., Chahuneau, V., Wintner, S., and Dyer, C. 2013. Identifying the L1 of non-native writers: the CMU-Haifa System. In Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications, Atlanta, Georgia, USA, pp. 279–87.

West, R., Park, A., and Levy, R. 2011. Bilingual random walk models for automated grammar correction of ESL author-produced text. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications, Portland, Oregon, USA, pp. 170–79.

Wong, J., and Dras, M. 2009. Contrastive analysis and native language identification. In Australasian Language Technology Association Workshop 2009, pp. 53–61.

Wong, J., and Dras, M. 2011. Exploiting parse structures for native language identification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK, pp. 1600–10.

Wong, J., Dras, M., and Johnson, M. 2012. Exploring adaptor grammars for native language identification. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju Island, Korea, pp. 699–709.

Xiang, Y., Yuan, B., Zhang, Y., Wang, X., Zheng, W., and Wei, C. 2013. A hybrid model for grammatical error correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, Sofia, Bulgaria, pp. 115–22.

Yannakoudakis, H., Briscoe, T., and Medlock, B. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 180–89.

Yu, L., Lee, L., and Chang, L. 2014. Overview of grammatical error diagnosis for learning Chinese as a foreign language. In Proceedings of the 22nd International Conference on Computers in Education, Nara, Japan, pp. 42–7.

Zheng, Z., Wu, X., and Srihari, R. 2004. Feature Selection for Text Categorization on Imbalanced Data. SIGKDD Explor. Newsl. 6 (1): 80–89.
