Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts

Justin Grimmer; Brandon M. Stewart

doi:10.1093/pan/mps028

Part of: PA editors' choice articles

Published online by Cambridge University Press: 04 January 2017

Justin Grimmer*
Affiliation: Department of Political Science, Stanford University, Encina Hall West, 616 Serra Street, Stanford, CA 94305
Brandon M. Stewart
Affiliation: Department of Government and Institute for Quantitative Social Science, Harvard University, 1737 Cambridge Street, Cambridge, MA 02138; e-mail: [email protected]
*e-mail: [email protected] (corresponding author)

Abstract

Politics and political conflict often occur in the written and spoken word. Scholars have long recognized this, but the massive costs of analyzing even moderately sized collections of texts have hindered their use in political science research. Here lies the promise of automated text analysis: it substantially reduces the costs of analyzing large collections of text. We provide a guide to this exciting new area of research and show how, in many instances, the methods have already obtained part of their promise. But there are pitfalls to using automated methods—they are no substitute for careful thought and close reading and require extensive and problem-specific validation. We survey a wide range of new methods, provide guidance on how to validate the output of the models, and clarify misconceptions and errors in the literature. To conclude, we argue that for automated text methods to become a standard tool for political scientists, methodologists must contribute new methods and new methods of validation.

Type: Research Article
Information: Political Analysis, Volume 21, Issue 3, Summer 2013, pp. 267–297

DOI: https://doi.org/10.1093/pan/mps028
Copyright: Copyright © The Author 2013. Published by Oxford University Press on behalf of the Society for Political Methodology

Footnotes

Authors' note: For helpful comments and discussions, we thank participants in Stanford University's Text as Data class, Mike Alvarez, Dan Hopkins, Gary King, Kevin Quinn, Molly Roberts, Mike Tomz, Hanna Wallach, Yuri Zhukov, and Frances Zlotnick. Replication data are available on the Political Analysis Dataverse at http://hdl.handle.net/1902.1/18517. Supplementary materials for this article are available on the Political Analysis Web site.

