
Cross-Lingual Classification of Political Texts Using Multilingual Sentence Embeddings

Published online by Cambridge University Press: 26 January 2023

Hauke Licht*
Affiliation:
Cologne Center for Comparative Politics, Institute of Political Science and European Affairs, University of Cologne, Cologne, Germany. E-mail: [email protected]
*Corresponding author: Hauke Licht

Abstract

Established approaches to analyzing multilingual text corpora require either a duplication of analysts’ efforts or high-quality machine translation (MT). In this paper, I argue that multilingual sentence embedding (MSE) is an attractive alternative approach to language-independent text representation. To support this argument, I evaluate MSE for cross-lingual supervised text classification. Specifically, I assess how reliably MSE-based classifiers detect manifesto sentences’ topics and positions compared to classifiers trained using bag-of-words representations of machine-translated texts, and how this depends on the amount of training data. These analyses show that when training data are relatively scarce (e.g., 20K or fewer labeled sentences), MSE-based classifiers can be more reliable than their MT-based counterparts and are at least no less reliable. Furthermore, I examine how reliably MSE-based classifiers label sentences written in languages not in the training data, focusing on the task of discriminating sentences that discuss the issue of immigration from those that do not. This analysis shows that, compared to the within-language classification benchmark, such “cross-lingual transfer” tends to result in fewer reliability losses when relying on the MSE approach instead of the MT approach. This study thus presents an important addition to the cross-lingual text analysis toolkit.
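The cross-lingual transfer setup described in the abstract can be illustrated with a minimal sketch. It does not reproduce the paper's models, data, or results. The `embed` function and its `TOY_LEXICON` are hypothetical stand-ins for a pretrained multilingual sentence encoder, which would map sentences from different languages into one shared vector space. The nearest-centroid classifier, the example sentences, and the labels are likewise assumptions chosen only to keep the example self-contained.

```python
import math
from collections import defaultdict

# ASSUMPTION: a toy lexicon standing in for a pretrained multilingual encoder.
# Words with the same meaning in English and German share one dimension,
# mimicking a language-independent representation on a tiny vocabulary.
TOY_LEXICON = {
    "immigration": 0, "einwanderung": 0, "migrants": 0, "migranten": 0,
    "border": 1, "grenze": 1,
    "taxes": 2, "steuern": 2,
    "schools": 3, "schulen": 3,
}
DIM = 4


def embed(sentence):
    """Return a unit-length toy sentence vector (hypothetical encoder)."""
    vec = [0.0] * DIM
    for token in sentence.lower().split():
        token = token.strip(".,;:!?")
        if token in TOY_LEXICON:
            vec[TOY_LEXICON[token]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def fit_centroids(sentences, labels):
    """Average the embeddings of the training sentences per label."""
    sums = defaultdict(lambda: [0.0] * DIM)
    counts = defaultdict(int)
    for sentence, label in zip(sentences, labels):
        for i, value in enumerate(embed(sentence)):
            sums[label][i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in sums[label]] for label in sums}


def predict(centroids, sentence):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    vec = embed(sentence)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Train on English sentences only (illustrative, invented examples).
train_sentences = [
    "We will limit immigration.",
    "Migrants must cross the border legally.",
    "We will lower taxes.",
    "Schools need more funding.",
]
train_labels = ["immigration", "immigration", "other", "other"]
centroids = fit_centroids(train_sentences, train_labels)

# Apply the classifier to German sentences, a language absent from training.
for sentence in ["Wir wollen die Einwanderung begrenzen.", "Die Steuern sind zu hoch."]:
    print(sentence, "->", predict(centroids, sentence))
```

In an actual application, `embed` would be replaced by a pretrained multilingual sentence encoder and the nearest-centroid rule by whichever supervised classifier the analyst prefers. The point of the sketch is only that no translation step is needed once sentences from all languages live in the same embedding space.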

Type
Article
Copyright
© The Author(s), 2023. Published by Cambridge University Press on behalf of the Society for Political Methodology


Footnotes

Edited by Jeff Gill

Supplementary material

Licht supplementary material (PDF, 366 KB)