Authorship attribution using author profiling classifiers

Caio Deutsch; Ivandré Paraboni

doi:10.1017/S1351324921000383

Published online by Cambridge University Press: 19 January 2022

Caio Deutsch
Affiliation: School of Arts, Sciences and Humanities, University of São Paulo, Av. Arlindo Bettio 1000, São Paulo, Brazil

Ivandré Paraboni*
Affiliation: School of Arts, Sciences and Humanities, University of São Paulo, Av. Arlindo Bettio 1000, São Paulo, Brazil

*Corresponding author. E-mail: [email protected]

Abstract

Authorship attribution – the computational task of identifying the author of a given text document within a set of possible candidates – has been attracting interest in Natural Language Processing research for many years. At the same time, significant advances have also been observed in the related field of author profiling, that is, the computational task of learning author demographics, such as gender and age, from text. The close relation between the two topics – both of which focus on gaining knowledge about the individual who wrote a piece of text – suggests that research in these fields may benefit from each other. To illustrate this, the present work addresses the issue of author identification with the aid of author profiling methods, adding demographics predictions to an authorship attribution architecture that may be particularly suitable to extensions of this kind, namely, a stack of classifiers devoted to different aspects of the input text (words, characters and text distortion patterns). The enriched model is evaluated across a range of text domains, languages and author profiling estimators, and its results are shown to compare favourably to those obtained by a standard authorship attribution method that does not have access to author demographics predictions.
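
The architecture outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the similarity-based base scorers, the character-masking stand-in for text distortion, the function names, the toy texts and the profiling values are all assumptions made for illustration. The sketch only shows how the outputs of three text views (words, characters, distorted text) might be concatenated with demographics predictions into a single meta-feature vector, on which a second-level classifier would then be trained.

```python
import math
from collections import Counter


def word_counts(text):
    # Word view: lower-cased whitespace tokens.
    return Counter(text.lower().split())


def char_ngram_counts(text, n=3):
    # Character view: overlapping character n-grams.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def distorted_counts(text, n=3):
    # Crude stand-in for text distortion (an assumption): mask every
    # letter and digit, keeping only punctuation and spacing patterns.
    masked = "".join("*" if ch.isalnum() else ch for ch in text)
    return char_ngram_counts(masked, n)


def cosine(a, b):
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def base_scores(extract, train, text):
    # One normalised similarity score per candidate author (sorted by name).
    target = extract(text)
    sims = [cosine(extract(" ".join(train[a])), target) for a in sorted(train)]
    total = sum(sims)
    return [s / total for s in sims] if total else [1.0 / len(sims)] * len(sims)


def meta_features(train, text, profile_predictions):
    # Concatenate the scores of each view with the profiling predictions.
    features = []
    for extract in (word_counts, char_ngram_counts, distorted_counts):
        features.extend(base_scores(extract, train, text))
    features.extend(profile_predictions)
    return features


# Toy data (made up for illustration).
train = {
    "author_a": ["I really, really like tea!", "Tea is great, really."],
    "author_b": ["coffee first; questions later", "coffee is fuel; sleep is optional"],
}
unknown = "Really, tea is the best!"
profile = [0.8, 0.2]  # made-up output of a hypothetical profiling estimator

vector = meta_features(train, unknown, profile)
print(len(vector))  # 3 views x 2 authors + 2 profiling values = 8
```

In a stacked setup, vectors of this kind would be computed for held-out training documents and passed to a second-level learner; the choice of that learner is left open here.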

Keywords

Authorship attribution; Author profiling; Text classification

Type: Article
Information: Natural Language Engineering, Volume 29, Issue 1, January 2023, pp. 110–137

DOI: https://doi.org/10.1017/S1351324921000383
Copyright: © The Author(s), 2022. Published by Cambridge University Press
