Robust stylometric analysis and author attribution based on tones and rimes

Renkui Hou; Chu-Ren Huang

doi:10.1017/S135132491900010X

Published online by Cambridge University Press: 10 April 2019

Renkui Hou*: Department of Linguistics, College of Humanities, Guangzhou University, Guangzhou, China; Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong
Chu-Ren Huang: Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong
*Corresponding author. Email: [email protected]

Abstract

In this article, we propose an innovative and robust approach to stylometric analysis that requires no annotation and leverages lexical and sub-lexical information. In particular, we leverage phonological information, namely the tones and rimes of Mandarin Chinese, extracted automatically from unannotated texts. The texts from different authors were represented by tones, tone motifs, and word length motifs as well as rimes and rime motifs. Support vector machines and random forests were used to build the text classification models for authorship attribution. From the results of the experiments, we conclude that the combination of bigrams of rimes, word-final rimes, and segment-final rimes can discriminate effectively between texts from different authors when random forests are used to build the classification model. This robust approach can in principle be applied to other languages with an established phonological inventory of onsets and rimes.
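
As a rough illustration of the kind of feature the abstract refers to, the sketch below splits tone-numbered pinyin syllables into onset, rime, and tone, then computes relative frequencies of rime bigrams. It is a minimal sketch, not the authors' implementation. The function names `split_syllable` and `rime_bigram_frequencies`, the tone-numbered pinyin input, and the purely orthographic onset list (which treats `y` and `w` as onsets and does not normalise spellings such as `ju` to ü) are all assumptions made for illustration.

```python
from collections import Counter

# Pinyin initials (onsets). Two-letter initials come first so that
# "zh" is matched before "z". Treating "y" and "w" as onsets is a
# simplifying assumption of this sketch.
ONSETS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
          "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable):
    """Split a tone-numbered pinyin syllable, e.g. 'zhong1', into (onset, rime, tone)."""
    # Assumption: a syllable without a trailing digit carries the neutral tone "5".
    tone = syllable[-1:] if syllable[-1:].isdigit() else "5"
    body = syllable.rstrip("0123456789")
    for onset in ONSETS:
        if body.startswith(onset):
            return onset, body[len(onset):], tone
    return "", body, tone  # zero-onset syllable such as 'an4'


def rime_bigram_frequencies(syllables):
    """Return the relative frequency of each pair of adjacent rimes in a text."""
    rimes = [split_syllable(s)[1] for s in syllables]
    counts = Counter(zip(rimes, rimes[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}  # fewer than two syllables: no bigrams
    return {bigram: n / total for bigram, n in counts.items()}


if __name__ == "__main__":
    text = ["zhong1", "guo2", "ren2"]  # hypothetical three-syllable input
    print(rime_bigram_frequencies(text))
```

The resulting relative frequencies could serve as one row of a feature matrix passed to a classifier such as a random forest. That step, and the extraction of pinyin from Chinese characters, are omitted here.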

Keywords

Stylometrics; Quantitative stylistics; Tone and rime motifs; Random forest; SVM; Author identification

Type: Article
Information: Natural Language Engineering, Volume 26, Issue 1, January 2020, pp. 49–71

DOI: https://doi.org/10.1017/S135132491900010X
Copyright: © Cambridge University Press 2019
