Automated dictionary generation for political event coding

Published online by Cambridge University Press:  27 March 2019

Benjamin J. Radford*
Affiliation:
LevelUp Research, LLC, Arlington, Virginia, US
*Corresponding author. Email: [email protected]

Abstract

Event data provide high-resolution and high-volume information about political events and have supported a variety of research efforts across fields within and beyond political science. While these datasets are machine coded from vast amounts of raw text input, the necessary dictionaries require substantial prior knowledge and human effort to produce and update, effectively limiting the application of automated event-coding solutions to those domains for which dictionaries already exist. I introduce a novel method for generating dictionaries appropriate for event coding given only a small sample dictionary. This technique leverages recent advances in natural language processing and machine learning to reduce the prior knowledge and researcher-hours required to go from defining a new domain-of-interest to producing structured event data that describe that domain. I evaluate the method with the production of a novel event dataset on cybersecurity incidents.

Type: Original Article
Copyright: © The European Political Science Association 2019

Supplementary material

Radford Dataset (link)
Radford supplementary material: Online Appendix (PDF, 225.1 KB)