Programming for Corpus Linguistics with Python and Dataframes

Daniel Keller

doi:10.1017/9781108904094

Series: Elements in Corpus Linguistics

Published online by Cambridge University Press: 24 May 2024

Author affiliation: Daniel Keller, Western Kentucky University

Summary

This Element offers intermediate or experienced programmers algorithms for corpus linguistic (CL) programming in Python using dataframes, which provide a fast, efficient, and intuitive set of methods for working with large, complex datasets such as corpora. It demonstrates principles of dataframe programming applied to CL analyses and presents complete algorithms for creating concordances; producing lists of collocates, keywords, and lexical bundles; and performing key feature analysis. It also presents an algorithm for creating dataframe corpora, including methods for tokenizing, part-of-speech tagging, and lemmatizing with spaCy. Together, these provide a set of core skills that can be applied to a range of CL research questions, as well as to original analyses not possible with existing corpus software.

Keywords

corpus linguistics; programming; Python; corpus linguistic methods; algorithms for corpus linguistics

Type: Element
DOI: https://doi.org/10.1017/9781108904094

Online ISBN: 9781108904094

Publisher: Cambridge University Press

Print publication: 20 June 2024

References

Anthony, L. (2020). Programming for corpus linguistics. In Paquot, M. & Gries, S. T., eds. Practical Handbook of Corpus Linguistics. Springer, pp. 181–207.

Biber, D., Conrad, S., & Cortes, V. (2004). If you look at … : Lexical bundles in university teaching and textbooks. Applied Linguistics, 25(3), 371–405.

Biber, D., & Egbert, J. (2018). Register Variation Online. Cambridge University Press.

Brezina, V. (2018). Statistics in Corpus Linguistics: A Practical Guide. Cambridge University Press.

Dunning, T. E. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.

Egbert, J., & Biber, D. (2019). Incorporating text dispersion into keyword analyses. Corpora, 14(1), 77–104.

Egbert, J., & Biber, D. (2023). Key feature analysis: A simple, yet powerful method for comparing text varieties. Corpora, 18(1), 121–133.

Gabrielatos, C. (2018). Keyness analysis: Nature, metrics and techniques. In Taylor, C. & Marchi, A., eds. Corpus Approaches to Discourse: A Critical Review. Routledge, pp. 225–258.

Hetland, M. L. (2014). Python Algorithms: Mastering Basic Algorithms in the Python Language. Apress.

Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. (2020). spaCy: Industrial-strength natural language processing in Python. https://spacy.io/

Ide, N., & Suderman, K. (2004, May). The American National Corpus First Release. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal. European Language Resources Association (ELRA). https://aclanthology.org/L04-1313/

Lee, K. D., & Hubbard, S. H. (2015). Data Structures and Algorithms with Python. Springer.

Nivre, J., Agić, Ž., Ahrenberg, L. et al. (2017). Universal Dependencies 2.1. https://universaldependencies.org/u/pos/

Rayson, P. (n.d.). Log-likelihood and effect size calculator. http://ucrel.lancs.ac.uk/llwizard.html

Rychlý, P. (2008). A lexicographer-friendly association score. Proceedings from Recent Advances in Slavonic Natural Language Processing (pp. 6–9). Karlova Studánka, Czech Republic: Masaryk University. nlp.fi.muni.cz/raslan/2008/raslan08.pdf