Pull out all the stops: Textual analysis via punctuation sequences

ALEXANDRA N. M. DARMON; MARYA BAZZI; SAM D. HOWISON; MASON A. PORTER

doi:10.1017/S0956792520000157

Published online by Cambridge University Press: 21 September 2020

Affiliations

ALEXANDRA N. M. DARMON: Oxford Centre for Industrial and Applied Mathematics, Mathematical Institute, University of Oxford, Oxford OX2 6GG, UK
MARYA BAZZI: Oxford Centre for Industrial and Applied Mathematics, Mathematical Institute, University of Oxford, Oxford OX2 6GG, UK; The Alan Turing Institute, London NW1 2DB, UK; Warwick Mathematics Institute, University of Warwick, Coventry CV4 7AL, UK
SAM D. HOWISON: Oxford Centre for Industrial and Applied Mathematics, Mathematical Institute, University of Oxford, Oxford OX2 6GG, UK
MASON A. PORTER: Oxford Centre for Industrial and Applied Mathematics, Mathematical Institute, University of Oxford, Oxford OX2 6GG, UK; Department of Mathematics, University of California, Los Angeles, Los Angeles, California 90095, USA

Abstract

Whether enjoying the lucid prose of a favourite author or slogging through some other writer’s cumbersome, heavy-set prattle (full of parentheses, em dashes, compound adjectives, and Oxford commas), readers will notice stylistic signatures not only in word choice and grammar but also in punctuation itself. Indeed, visual sequences of punctuation from different authors produce marvellously different (and visually striking) sequences. Punctuation is a largely overlooked stylistic feature in stylometry, the quantitative analysis of written text. In this paper, we examine punctuation sequences in a corpus of literary documents and ask the following questions: Are the properties of such sequences a distinctive feature of different authors? Is it possible to distinguish literary genres based on their punctuation sequences? Do the punctuation styles of authors evolve over time? Are we on to something interesting in trying to do stylometry without words, or are we full of sound and fury (signifying nothing)?

In our investigation, we examine a large corpus of documents from Project Gutenberg (a digital library with many possible editorial influences). We extract punctuation sequences from each document in our corpus and record the number of words that separate punctuation marks. Using such information about punctuation-usage patterns, we attempt both author and genre recognition, and we also examine the evolution of punctuation usage over time. Our efforts at author recognition are particularly successful. Among the features that we consider, the one that seems to carry the most explanatory power is an empirical approximation of the joint probability of the successive occurrence of two punctuation marks. In our conclusions, we suggest several directions for future work, including the application of similar analyses for investigating translations and other types of categorical time series.
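The features described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the set of punctuation marks in `MARKS`, the whitespace-based word counting, and the function names are all assumptions made for illustration, because the abstract does not specify them. The sketch extracts the ordered punctuation sequence of a text, counts the words that precede each mark since the previous one, and estimates the joint probability of two successive marks from pair counts.

```python
import re
from collections import Counter

# Assumed set of punctuation marks (the abstract does not list the exact set).
MARKS = ".,;:!?()\""


def punctuation_sequence(text, marks=MARKS):
    """Return the punctuation marks of `text` in order of appearance."""
    return [ch for ch in text if ch in marks]


def words_between_marks(text, marks=MARKS):
    """Count the words that precede each mark since the previous mark.

    Words are taken to be whitespace-separated tokens (an assumption).
    """
    pattern = "[" + re.escape(marks) + "]"
    segments = re.split(pattern, text)
    # The final segment follows the last mark, so it is not counted.
    return [len(segment.split()) for segment in segments[:-1]]


def joint_frequencies(sequence):
    """Empirical joint probability of each ordered pair of successive marks."""
    pairs = Counter(zip(sequence, sequence[1:]))
    total = sum(pairs.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in pairs.items()}


if __name__ == "__main__":
    sample = "It was late, and we ran. Stop!"
    seq = punctuation_sequence(sample)
    print(seq)
    print(words_between_marks(sample))
    print(joint_frequencies(seq))
```

Comparing the dictionaries returned by `joint_frequencies` for different documents is one plausible way to contrast punctuation styles. The choice of a comparison measure is left open here, since the abstract does not name one.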

Keywords

Stylometry; computational linguistics; natural language processing; digital humanities; computational methods; mathematical modelling; Markov processes; categorical time series

Type: Papers
Information: European Journal of Applied Mathematics, Volume 32, Special Issue 6: Special issue featuring papers on Professor Sam Howison, December 2021, pp. 1069–1105

DOI: https://doi.org/10.1017/S0956792520000157
Copyright: © The Author(s), 2020. Published by Cambridge University Press

