Where does armed conflict occur? Despite the plethora of subnational studies on civil war, we still lack clear answers to this question, which we may think of as a mere nuisance. In a number of regression studies, for instance, scholars use specific areal units, such as administrative boundaries or grid cells, and assume that the presence of a combatant event means that the entire unit is a conflict zone. These areal assignments are so common that we may not recognize that they are in fact assumptions. For example, a number of studies using the PRIOGRID (Tollefsen et al., 2012) assume that if one or more events occur in a grid cell, the entire 55-km-by-55-km cell would be affected by the conflict (Buhaug et al., 2011; Pierskalla and Hollenbach, 2013; Fjelde and Hultman, 2014). Other scholars use large administrative units, such as provinces (Cunningham and Weidmann, 2010; Fjelde and von Uexkull, 2012; Ritter and Conrad, 2016), and rely on a similar set of assumptions. Although these studies carefully defend their choices of areal units and measurements, none check the robustness of their findings with alternative units.Footnote 1
The areal-assignment assumptions are, however, consequential for our understanding of civil war. As an example, Figure 1 maps the zones of the Somali Civil War (1989–2017) produced by different areal-unit assignment rules applied to the same dataset of conflict events (UCDP GED; Sundberg et al., 2010). If one assigns the conflict events to the grid cells (PRIOGRID; Tollefsen et al., 2012; red dotted polygons in Figure 1), the conflict zones tightly fit the conflict event locations (dot points in Figure 1). In contrast, if one uses the second-order administrative units (districts; blue dot-dashed polygons in Figure 1), the conflict zones grow to include lands within Somalia. The UCDP Polygons Dataset (Croicu and Sundberg, 2012; yellow dashed polygons in Figure 1), a commonly used conflict zones dataset, indicates an even larger area that includes Ethiopia's Ogaden region, which has no record of conflict events. The profound differences in how conflict zones can be defined from a given set of underlying data suggest that empirical findings may be sensitive to the choice of areal unit. How can we define conflict zones in a way that is less dependent on areal-unit assumptions?
I argue that the extant zoning methods rest on strong assumptions about the areal assignment of war zones, which can potentially result in misleading pictures. I demonstrate this by formalizing a zone as a summary function that maps locations and (if necessary) other substantive information onto the presence/absence of conflict events. From this perspective, these approaches not only impose strong constraints on the zoning function, but also assume that the mapping has no stochastic error. However, since a conflict zone is a function, we can readily apply statistical methods to estimate the zones.
Statistically estimating conflict zones presents a special challenge, however; while we can observe the presence of conflict events and their locations, we do not have direct observations about their absence. Although one might consider that the lack of recorded conflict events within particular geographical boundaries—such as grid cells or administrative units—would constitute absence data, the construction of the absence data is not as straightforward as one might think. Importantly, it requires pre-defined areal units, and the results may differ depending on which areal units one uses. Furthermore, since locations near conflict events are less likely to be “real” absence observations than locations farther from the events, one might also need to build a sampling scheme that accounts for the spatial heterogeneity. However, all of these procedures require additional assumptions that make the zoning exercise sensitive to researchers' arbitrary choices. Ideally, we would like to estimate conflict zones without relying on the pseudo-absence data or pre-defined areal units.
In this paper, I address these problems by using the one-class support vector machine (OCSVM), which is an unsupervised machine learning method commonly used for outlier detection (Schölkopf et al., 2000). Unlike other methods, the OCSVM requires only presence data, allowing us to estimate the conflict zones even without any pre-defined areal units. Even though the OCSVM does not use absence data and is less powerful than other statistical methods, it allows us to construct conflict zones with fewer assumptions and is therefore suitable for creating data infrastructure for broader application. In order to provide such infrastructure, I apply the OCSVM to the UCDP GED and create a new dataset of conflict zones. With this new dataset, I replicate Daskin and Pringle's (2018) study on civil wars' effect on wildlife. The results suggest that the actual ecological costs of civil war are much smaller than the original estimate.
1. A conflict zone as a representation
I consider a conflict zone to be a concise representation of the geographic distribution of conflict. A “population” conflict zone is an area within which conflict takes place and thus generates conflict events. An “estimated” conflict zone, by contrast, is an area in which conflict is likely to take place given our observations of conflict events. In reality, the population conflict zone may not exist; we cannot draw a line such that conflict takes place one millimeter inside of it, while conflict does not exist one millimeter outside of it. Thus, as is common in structural parameter estimation (such as the utility maximization theory of a logistic regression), the data generation process should be considered a theoretical construct. The key question is not whether the population conflict zones are “true”, but whether they are useful for specific purposes.
Conflict zones are useful for certain purposes. For instance, by having conflict and non-conflict zones, we can directly compare human, economic, and environmental costs of civil war inside and outside of the conflict zones (Ghobarah et al., 2003; Daskin and Pringle, 2018). Moreover, the conflict zones can be used for the purpose of issuing travel advisories. In fact, as can be seen in the travel advisory maps, it is more helpful to display zones of high risk instead of the precise locations of violent events. Finally, the method that this paper proposes can be potentially used for other mapping exercises, such as poverty maps, crime zones, state controls over territories, hazard maps, and zones of racial segregation, all of which have substantive applications.
This paper is agnostic with respect to the definitions of “conflict” and “conflict events.” I assume that conflict events are presented as point locations,Footnote 2 and that the term “conflict” in “conflict events” and “conflict zones” has the same meaning, but this study does not depend on a particular definition of conflict. Since there are a number of studies about the concepts of armed conflict (Sundberg et al., 2010), violence (Kalyvas, 2006), civil war (Sambanis, 2004), peace (Campbell et al., 2017), and territorial controls (Tao et al., 2016; Anders et al., 2017), I focus on the concept of a zone and ask readers to refer to those studies.
Finally, this paper is primarily interested in a binary measure of conflict zones.Footnote 3 Although continuous indicators of conflict risks might be more nuanced and useful for some purposes (Anders et al., 2017; Campbell et al., 2017), a dichotomous zone has at least one clear advantage: providing a new geographical unit of analysis that allows us to compare conflict and non-conflict areas. What motivates this study is not to estimate a “true” distribution of conflict events or to create as precise a description of events as possible. Rather, the goal is to provide a concise representation of conflict as a part of data infrastructure.
1.1 Formalizing a conflict zone
This paper makes a conceptual shift in the geography of civil war; I conceptualize locations as predictors of conflict instead of units of analysis. This conceptualization allows me to create a new areal unit, a conflict zone, without assuming any prior areal units. Consider a set of conflict events, ${\boldsymbol X} = \{ {\boldsymbol x}_1, \ldots, {\boldsymbol x}_n : y_i = 1 \text{ for } i = 1, \ldots, n \}$, where ${\boldsymbol x}_i$ is a vector of longitude and latitude (and if necessary other predictors) of an event $i$,Footnote 4 which I call a location, and $y_i$ is an indicator of the presence or absence of conflict. A zoning function $f_Y$ is a function that maps every location on the earth to the sample space of $Y$,

$$f_Y : G \rightarrow \{0, 1\},$$
where $G$ is the entire surface of the globe. Intuitively, as seen in Figure 2, a zoning function tells us whether each location belongs to a zone of a certain conflict. A conflict zone is an uncountable set of locations, $A_c = \{ {\boldsymbol x} \in G : f_Y({\boldsymbol x}) = 1 \}$, and a non-conflict zone is its complement, $A_{\neg c} = \{ {\boldsymbol x} \in G : f_Y({\boldsymbol x}) = 0 \}$. Our goal is therefore to estimate a zoning function that approximates the population zoning function and hence best summarizes the conflict events.
One advantage of this formalization is that we can now define the fitness of zoning. Let $\tilde{f}_Y$ be the population zoning function and $\hat{f}_{Y\vert {\boldsymbol X}}$ be a zoning function estimated from data. The population zoning function represents the underlying data generation process of conflict events, while the estimated zoning function is our estimate of the data generation process. The difference between the population and estimated zoning functions is then defined by a loss function $L( \tilde{f}_Y, \hat{f}_{Y\vert {\boldsymbol X}} )$. Our objective is therefore to find $\hat{f}_{Y\vert {\boldsymbol X}}$ that minimizes the expected value of the loss function, $E_{\boldsymbol X}[ L( \tilde{f}_Y, \hat{f}_{Y\vert {\boldsymbol X}} ) ]$. Under certain conditions (Friedman, 1997; Valentini and Dietterich, 2004), the expected loss function is decomposed into bias and variance terms:

$$E_{\boldsymbol X}[ L( \tilde{f}_Y, \hat{f}_{Y\vert {\boldsymbol X}} ) ] = g( L_1, L_2 ),$$
where $g$ is a generic function that is increasing with $L_1$ and $L_2$. The $L_1$ term represents a systematic difference between the population and estimated zoning functions (bias), while the $L_2$ term indicates how random noise can alter our estimate (variance). When a zoning function is too inflexible and thus underfitted to data, the zoning function is heavily influenced by our assumptions, resulting in a large bias. By contrast, when a zoning function is overfitted to data, the function is extremely sensitive to random noise, indicating a large variance. Thus, estimating the population zoning function requires striking a delicate balance between bias and variance.
1.2 Fitting problems in deterministic methods
From the bias-variance perspective, deterministic methods of zoning like those commonly used in conflict studies are suboptimal. In fact, they tend to risk both underfitting and overfitting. Because those methods impose relatively strong constraints on the zoning function, the estimated zoning functions are dependent on those assumptions and potentially biased (unless those constraints were in fact correct). For instance, although we might use simple polygon assignment rules, such as assigning an administrative unit polygon as part of a conflict zone if it contains one or more conflict events, this method presumes the following functional form:

$$f_Y({\boldsymbol x}) = \begin{cases} 1 & \text{if } {\boldsymbol x} \in p \text{ for some } p \in P_{\text{conflict}}, \\ 0 & \text{otherwise,} \end{cases}$$
where $P_{\text{conflict}}$ is a set of polygons that have at least one conflict event; if, for example, there are one or more conflict events at the eastern border area in Ogaden, the entire Ogaden region is assumed to be affected by the conflict. Although the UCDP Polygons (Croicu and Sundberg, 2012) take a more sophisticated approach (called a convex hull method), they also assume that the shapes of conflict zones are convex, which may not always be realistic. In the case of the Somali Civil War, for instance, the convex hull method cannot account for the concave shape of Somalia (yellow dashed zone in Figure 1), resulting in a conflict zone that mistakenly includes the Ethiopian Ogaden region (despite the fact that no conflict event is reported in Ethiopian Ogaden).
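The contrast between the two deterministic rules can be illustrated with a minimal sketch. Everything in it is hypothetical: the L-shaped event pattern, the probe location, and the helper names are made up for illustration, and the 0.5-degree cell size is an assumption chosen to mimic a PRIOGRID-like grid. It is not the procedure used to build the UCDP Polygons.

```python
import math

def grid_cell(lon, lat, size=0.5):
    """Index (column, row) of the size-degree cell that contains a point."""
    return (math.floor(lon / size), math.floor(lat / size))

def grid_zone(events, size=0.5):
    """Polygon-assignment rule: a cell belongs to the zone if it holds >= 1 event."""
    return {grid_cell(lon, lat, size) for lon, lat in events}

def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def in_hull(hull, x):
    """True if x lies inside or on a counter-clockwise convex polygon."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], x) >= 0 for i in range(n))

# Hypothetical concave (L-shaped) event pattern: one arm on each axis.
events = [(float(i), 0.0) for i in range(6)] + [(0.0, float(j)) for j in range(1, 6)]
hull = convex_hull(events)
probe = (2.0, 2.0)  # a location in the concavity, far from every event

in_grid_zone = grid_cell(*probe) in grid_zone(events)
in_hull_zone = in_hull(hull, probe)
print(in_grid_zone, in_hull_zone)
```

In this toy setting the grid rule leaves the probe outside the zone while the convex hull absorbs it, which mirrors the way a convex zone can swallow an area with no recorded events.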
Even worse, because the deterministic rules do not account for stochastic errors in our observations,Footnote 5 they also tend to overfit the data. For instance, if there is a single combat event in a far distant location (say, bombing in Paris by the combatants of the Sri Lankan Civil War), the polygon assignment method treats the surrounding areas as a part of the conflict zone. Thus, even if the deterministic approaches were to minimize the differences between the zoning function and observed data, the zoning function may not be optimal.
Fortunately, we can avoid these shortcomings by using statistical methods. With statistical learning methods, we can assume a fairly flexible zoning function and systematically account for random errors. An easy way to understand the statistical approach is a logistic regression (even though it is not flexible); one could estimate a logistic regression of y on X and then use the estimated model as a zoning function. However, as I discuss in the next section, extending statistical methods to the zoning problem is not as straightforward as one might expect.
2. Statistical approaches to zoning: problems of presence-only data
A methodological challenge is that even though we have data on the presence of conflict events, we do not have direct observations about the absence of conflict events. As a result, $y_i$ always takes a value of 1 in our sample, and thus conventional methods, such as logistic regression, cannot be used without further innovations. Although the presence-only data do not draw much attention and are rarely recognized as a problem in political science, this problem arises in other fields, including the conservation sciences (Mack and Waske, 2017), genetics (Mei and Zhu, 2015), and text analyses (Lee and Liu, 2003).
2.1 Positive-absence (PA) data approach
The most straightforward approach is the positive-absence (PA) data method. The idea is that we “make up” absence data and then apply conventional classification methods. To create the pseudo-absence data, one might assign areal units, such as grid cells or administrative boundaries, to the conflict events and then treat the remaining areal units as absence data. Alternatively, one could build more sophisticated sampling schemes that account for spatial relationships (Mei and Zhu, 2015). Once the absence data are created, a variety of classification methods are readily available.
A drawback to the PA approach is its sensitivity to the absence-data generation. Researchers must specify the areal units or sampling schemes, and it is well known that those choices can greatly influence the estimates (Phillips et al., 2009). Even worse, because both estimation and cross-validation rest on the pseudo-absence data, there is no established way to evaluate different absence-data sampling schemes. Thus, without strong substantive reasons to justify particular methods of absence-data generation, it is difficult to use the PA methods.
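The sensitivity to areal units can be seen in a small, purely illustrative calculation. The three event locations, the 4-by-4-degree extent, and the function name below are all hypothetical; the sketch only shows how the share of the study area labeled as "absence" moves with the cell size when empty cells are treated as absence data.

```python
import math

def pseudo_absence_cells(events, extent, size):
    """Cells of the given size (in degrees) inside extent that contain no event."""
    x0, y0, x1, y1 = extent
    occupied = {(math.floor(x / size), math.floor(y / size)) for x, y in events}
    all_cells = {
        (i, j)
        for i in range(math.floor(x0 / size), math.ceil(x1 / size))
        for j in range(math.floor(y0 / size), math.ceil(y1 / size))
    }
    return all_cells - occupied

# Hypothetical events and study extent (min lon, min lat, max lon, max lat).
events = [(0.2, 0.2), (1.7, 0.4), (3.1, 3.6)]
extent = (0.0, 0.0, 4.0, 4.0)
extent_area = (extent[2] - extent[0]) * (extent[3] - extent[1])

# Share of the extent that ends up labeled as absence, by cell size.
absence_share = {
    size: len(pseudo_absence_cells(events, extent, size)) * size ** 2 / extent_area
    for size in (0.5, 1.0, 2.0)
}
print(absence_share)
```

The same three events leave very different amounts of territory labeled as absence depending on the grid resolution, before any classifier is even fitted.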
2.2 Positive-unlabeled (PU) data approach
Unlike the PA methods, the positive-unlabeled (PU) data methods do not treat the pseudo-absence data as genuine $Y = 0$ observations. Instead, the PU methods treat the outcome of the pseudo-absence data as indeterminate. For instance, the maximum entropy method (Phillips and Dudík, 2008), which is one of the most widely used methods in species distribution modeling, estimates the probability distribution of $Y$ over a specific extent using observed events. The estimated probability distribution is then used for predicting zones as well as assigning specific probabilities to the unlabeled data.
Although the PU method is the current standard in the literature on species distribution modeling, recent studies have shown that the PU methods are actually dependent on how one defines the scope of unlabeled data (VanDerWal et al., 2009). In conflict studies, Schutte (2017) applies a point process model (PPM) to predict zones of ten insurgent wars in Africa. Although the author correctly refers to the problems of areal-unit assumptions, the PPM actually depends on particular areal assumptions, including the geographical scope of the analysis and the specification of the grid cells.Footnote 6 Thus, although the PPM and more generally the PU methods are great departures from the deterministic methods, they are still confined by the areal-unit assumptions.Footnote 7 At the crux, the deterministic, PA, and PU methods suffer from the same problem; they are sensitive to the choices of pre-defined areal units.
2.3 Positive-only (PO) approach
The positive-only (PO) methods can provide a possible solution to the areal-unit problems (Mack and Waske, 2017). Unlike the PA or PU methods, the PO methods solely rely on presence data without requiring absence data or pre-defined areal units. The PO approach therefore can be considered as a minimalist approach to conflict zoning; even though the PO methods can be less informative as they do not utilize unlabeled data, they do not require strong assumptions and hence allow broader applications. In general, while the PA and PU approaches are useful when one's objective is to make the best possible zones for a few conflicts with field-level knowledge, the PO approach is more suitable when one would like to create database infrastructure for the purpose of broader application. This paper aims at the latter objective and hence develops a PO method.
3. Statistical method of zoning: one-class support vector machine (OCSVM)
The OCSVM is an unsupervised machine learning method and one of the most popular among the PO approaches. There are several applications in the fields of text analysis (Lee and Liu, 2003), species distribution models (Mack and Waske, 2017), and gene science (Mei and Zhu, 2015). The advantages of OCSVM over other PO methods are that it is particularly useful for handling continuous predictors and that the hyper-parameter tuning is relatively well understood.Footnote 8
Conceptually (but not algorithmically), the OCSVM can be considered as a two-step procedure: transforming data with a fairly flexible function φ and then fitting the tightest enclosing circle to the transformed data.Footnote 9 As seen in Figure 3, the function φ maps the observed $m$ predictors to $m$-dimensional Cartesian space so that the data are centered at ${\boldsymbol b}$ (in Figure 3, $m = 2$). Although such an $m$-to-$m$ function is hard to even express, it is mathematically sufficient to define its kernel, $K({\boldsymbol x}_j, {\boldsymbol x}_k) = \varphi({\boldsymbol x}_j)^{T} \varphi({\boldsymbol x}_k)$, which maps two $m$-length vectors to a scalar and hence is mathematically tractable.Footnote 10 The Euclidean distance, for instance, would be such a kernel, but we can use more flexible kernels as well. A standard choice is the radial basis function:

$$K({\boldsymbol x}_j, {\boldsymbol x}_k) = \exp\left( -\gamma \lVert {\boldsymbol x}_j - {\boldsymbol x}_k \rVert^2 \right),$$
where γ is a kernel parameter, which represents the influence of a single observation on the overall estimate. Larger γ indicates a tighter fit to every observation. The support vector machine with the radial basis function is so flexible that it can approximate any finite function (a so-called universal approximator; Hammer and Gersmann, 2003).
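As a small numerical illustration of the role of γ, the sketch below evaluates the standard radial basis function, $\exp(-\gamma \lVert {\boldsymbol x}_j - {\boldsymbol x}_k \rVert^2)$, for two hypothetical locations one degree apart. It also uses the identity $\lVert \varphi({\boldsymbol x}_j) - \varphi({\boldsymbol x}_k) \rVert^2 = K({\boldsymbol x}_j, {\boldsymbol x}_j) - 2K({\boldsymbol x}_j, {\boldsymbol x}_k) + K({\boldsymbol x}_k, {\boldsymbol x}_k)$ to show that distances in the transformed space can be computed from the kernel alone. The coordinates and function names are made up for illustration.

```python
import math

def rbf_kernel(xj, xk, gamma):
    """Radial basis function kernel: exp(-gamma * ||xj - xk||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xj, xk))
    return math.exp(-gamma * sq_dist)

def feature_sq_distance(xj, xk, gamma):
    """Squared distance between phi(xj) and phi(xk), using only the kernel.

    For the RBF kernel K(x, x) = 1, so the distance is 2 - 2 * K(xj, xk).
    """
    return (rbf_kernel(xj, xj, gamma)
            - 2.0 * rbf_kernel(xj, xk, gamma)
            + rbf_kernel(xk, xk, gamma))

# Hypothetical (longitude, latitude) pairs one degree apart.
a, b = (45.3, 2.0), (46.3, 2.0)
for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(a, b, gamma), feature_sq_distance(a, b, gamma))
```

With a small γ the two locations remain similar in the transformed space; with a large γ the kernel drops toward zero and each observation influences only its immediate neighborhood.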
Given a specified kernel, the OCSVM searches for the tightest circle that encloses the transformed data points (the red dashed circle in the right pane of Figure 3). However, because it is not desirable to fit the circle too tightly to the data and risk overfitting, we also allow several observations to be outside of the circle (four data points in the right pane of Figure 3). This provides a guard against overfitting. Formally, the loss function and corresponding optimization problem are expressed as

$$\min_{R, {\boldsymbol b}, {\boldsymbol \delta}} \; R^2 + \frac{1}{\nu n} \sum_{i = 1}^n \delta_i,$$

with constraints of

$$\lVert \varphi({\boldsymbol x}_i) - {\boldsymbol b} \rVert^2 \le R^2 + \delta_i, \quad \delta_i \ge 0 \quad \text{for } i = 1, \ldots, n,$$
where $R$ is the radius of the circle. We would like to have a circle that encloses the points as tightly as possible (minimizing $R^2$), but we also want the circle to be sufficiently inclusive and thus not so far from the outliers (minimizing $\sum_{i = 1}^n \delta_i$). The parameter ν controls the weights of those two opposing forces; large ν allows many outliers, while small ν means an inclusive circle. By solving the optimization problem for $\hat{R}$, $\hat{{\boldsymbol b}}$, and $\hat{{\boldsymbol \delta}}$, we get the OCSVM approximation to the population zoning function:Footnote 11

$$\hat{f}_{Y\vert {\boldsymbol X}}({\boldsymbol x}) = \begin{cases} 1 & \text{if } \lVert \varphi({\boldsymbol x}) - \hat{{\boldsymbol b}} \rVert^2 \le \hat{R}^2, \\ 0 & \text{otherwise.} \end{cases}$$
Since the two hyper-parameters γ and ν (both of which control the balance between underfitting and overfitting)Footnote 12 are not directly estimated, I follow Ghafoori et al. (2018) to choose the optimal values.Footnote 13 The predictive intervals are obtained via bootstrapping.
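To make the estimation step concrete, the following is a minimal sketch using scikit-learn's `OneClassSVM`, not the implementation behind the results reported here. It rests on several assumptions: the event locations are simulated, the values of `gamma` and `nu` are fixed by hand rather than tuned with the procedure of Ghafoori et al., and the helper `zone` is a hypothetical name. scikit-learn implements the hyperplane formulation of Schölkopf et al., which for the radial basis function kernel yields the same boundary as the enclosing-circle description.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical presence-only data: 300 event locations (longitude, latitude).
events = rng.normal(loc=[45.0, 5.0], scale=0.5, size=(300, 2))

# gamma and nu are illustrative values only (assumption, not tuned).
model = OneClassSVM(kernel="rbf", gamma=1.0, nu=0.1).fit(events)

def zone(locations):
    """Estimated zoning function: 1 inside the estimated zone, 0 outside."""
    return (model.predict(np.asarray(locations, dtype=float)) == 1).astype(int)

# Share of the observed events that fall inside the estimated zone.
inside_share = zone(events).mean()

# One location at the center of the events and one far away from all of them.
print(zone([[45.0, 5.0], [60.0, 20.0]]), inside_share)
```

No absence data or areal units enter the fit; only the presence locations and the two hyper-parameters determine the boundary.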
4. Performance comparison I: simulation analysis
One advantage of using statistical methods is their ability to separate a systematic pattern of conflict events from non-systematic errors. In this section, I compare the performance of both deterministic and statistical methods by conducting a couple of simulation analyses. I first define a population conflict zone as the entire territory of Nigeria or Somalia. I choose these countries because they have perhaps the most convex and concave shapes among African countries. I then randomly draw 1000 locations within the territory and add random noise:

$${\boldsymbol y}_i = \tilde{{\boldsymbol y}}_i + {\boldsymbol v}_i, \qquad \tilde{{\boldsymbol y}}_i \sim U_{\text{poly}},$$
where $U_{\text{poly}}$ is a uniform distribution over the territory of Nigeria or Somalia, $\tilde{{\boldsymbol y}}_i$ is a location within the territory, and ${\boldsymbol v}_i$ is noise drawn from a normal distribution of mean zero and variance $\sigma^2$. I vary the size of the noise σ from 0 to 1 degree (~0 to 111 km).Footnote 14 We are supposed to have no information about $\tilde{{\boldsymbol y}}_i$, ${\boldsymbol v}_i$, or their distributions; we observe only the data ${\boldsymbol y}_i$. Our task is to infer the population conflict zone from the observed data ${\boldsymbol y}_i$.
In the following analysis, I compare the performances of the PRIOGRID and district assignments, the convex hull (deterministic approaches), support vector machines (SVM; PA data approach), maximum entropy method (MAXENT; PU data approach), and one-class support vector machine (OCSVM; PO approach).Footnote 15 The convex hull method is supplemented with a deterministic rule for outlier removal, which is used in the UCDP Polygons dataset (the so-called 20–5 percent rule).Footnote 16 Since the SVM and MAXENT require pseudo-absence or unlabeled data, I randomly sample locations and use them as pseudo-absence or unlabeled data.Footnote 17 Finally, I evaluate the performance by calculating the accuracy of the predictions (the proportion of correctly predicted conflict and non-conflict area across the entire area). I repeat the simulation 1000 times for each value of σ and calculate the average accuracies.Footnote 18
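The simulation design can be sketched in a few lines. This is only an illustration under stated assumptions: the population zone is a hypothetical L-shaped polygon rather than the boundary of Nigeria or Somalia, accuracy is approximated on a regular lattice of evaluation points, and the zoning rule being scored is a crude bounding box standing in for the methods compared in the text.

```python
import random

def point_in_polygon(x, y, poly):
    """Ray-casting test for a simple polygon given as a list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

def simulate_events(poly, n, sigma, rng):
    """Draw n locations uniformly inside poly, then add N(0, sigma^2) noise."""
    xs, ys = [p[0] for p in poly], [p[1] for p in poly]
    events = []
    while len(events) < n:
        x, y = rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, poly):  # rejection sampling
            events.append((x + rng.gauss(0, sigma), y + rng.gauss(0, sigma)))
    return events

def accuracy(zone, poly, grid):
    """Share of evaluation points where the zone agrees with the population zone."""
    hits = sum(zone(x, y) == point_in_polygon(x, y, poly) for x, y in grid)
    return hits / len(grid)

# Hypothetical concave population zone (coordinates in degrees).
poly = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
rng = random.Random(1)
events = simulate_events(poly, 1000, sigma=0.2, rng=rng)

# Evaluation lattice covering [-1, 5] x [-1, 5].
grid = [(i / 4 - 1, j / 4 - 1) for i in range(25) for j in range(25)]

# Stand-in zoning rule (assumption): the bounding box of the noisy events.
lo_x, hi_x = min(e[0] for e in events), max(e[0] for e in events)
lo_y, hi_y = min(e[1] for e in events), max(e[1] for e in events)

def bbox_zone(x, y):
    return lo_x <= x <= hi_x and lo_y <= y <= hi_y

bbox_accuracy = accuracy(bbox_zone, poly, grid)
print(bbox_accuracy)
```

Any zoning rule that returns a true or false value for a location can be passed to `accuracy` in place of `bbox_zone`, and repeating the draw over many seeds and values of `sigma` gives averages of the kind reported below.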
4.1 Results
The following figure (Figure 4) shows the results of the simulation analyses. On average, the OCSVM has a higher performance than the other methods in both simulations. Although the PRIOGRID assignment performs relatively well, the performance is sensitive to the addition of small amounts of noise especially in the case of Somalia, which is not surprising given its deterministic nature. In both simulations, the district assignment exhibits relatively low performance; in the case of Somalia, its accuracy quickly deteriorates and then becomes comparatively stable. The convex hull method works well only when a population conflict zone is convex. When the assumption of a convex conflict zone is violated, the accuracy becomes much lower.
Among the statistical methods, only OCSVM has high performance in both simulations. While the performance of the MAXENT is as high as OCSVM's in the case of Nigeria when there is a large amount of noise, the MAXENT has the second lowest accuracy in the Somalia simulation. Similarly, the performance of the SVM is roughly equivalent to that of the PRIOGRID assignment for Somalia, but it does poorly for the Nigeria simulation. These results reinforce the fact that SVM and MAXENT are sensitive to pseudo-absence data generation. Overall, the OCSVM exhibits the highest and most stable performance.
5. Performance comparison II: validation with the Rohingya crisis
Although it is usually difficult to validate conflict zones with real-world data as we rarely have absence observations (and without absence observations, we cannot calculate accuracy), the case of the Rohingya Crisis provides a unique analytical opportunity. Specifically, the United Nations Institute for Training and Research (UNITAR) analyzes high-resolution satellite images to measure the levels of housing destruction at 900 Rohingya villages in Myanmar for the period of 31 August 2017 to 31 March 2018 (UNITAR, 2018). Importantly, the dataset contains information about both presence and absence of housing destruction in each village. Although housing destruction might not be a valid indicator of conflict, I can at least analyze to what extent the conflict zones (or zones of housing destruction) validly reflect the reality. If the UNITAR data indicate “few” or more destruction, it is considered as evidence for the presence of conflict, and hence the outcome variable takes a value of 1.
I conduct a two-fold cross-validation test with the housing destruction data. I first randomly split each of the destroyed and unaffected villages into two groups. The assignments of the PRIOGRID cells and township polygons,Footnote 19 convex hull, SVM, MAXENT, and OCSVM are then applied to one half of the affected villages, and the corresponding conflict zones are estimated. I calculate the accuracies of the conflict zones by comparing them to the other half of the affected villages and one half of the villages that were unaffected.Footnote 20 The same exercise is done by swapping the groups. The two-fold cross-validation is repeated 500 times (thus, 2 × 500 = 1,000 simulation outputs). Finally, the average accuracy is calculated.Footnote 21 The other specifications are the same as those in the simulation analyses.
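The cross-validation scheme can be sketched as follows. The sketch makes several assumptions that are not in the text: the village coordinates are simulated, the zoning rule `fit_disc` is a simple stand-in rather than any of the methods compared here, the pairing of unaffected halves with folds is one possible reading of the design, and only 50 repetitions are run to keep the example fast.

```python
import math
import random

def two_fold_scores(affected, unaffected, fit_zone, rng):
    """Fit on one half of the affected villages, score on the other half plus
    one half of the unaffected villages, then swap the halves."""
    a, u = affected[:], unaffected[:]
    rng.shuffle(a)
    rng.shuffle(u)
    a_halves = (a[: len(a) // 2], a[len(a) // 2:])
    u_halves = (u[: len(u) // 2], u[len(u) // 2:])
    scores = []
    for k in (0, 1):
        zone = fit_zone(a_halves[k])  # presence data only
        test = [(p, 1) for p in a_halves[1 - k]] + [(p, 0) for p in u_halves[1 - k]]
        scores.append(sum(int(zone(p)) == y for p, y in test) / len(test))
    return scores

def fit_disc(points):
    """Stand-in zoning rule (assumption): a disc around the centroid that
    covers roughly 90 percent of the training points."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    dists = sorted(math.hypot(p[0] - cx, p[1] - cy) for p in points)
    radius = dists[int(0.9 * (len(dists) - 1))]
    return lambda p: math.hypot(p[0] - cx, p[1] - cy) <= radius

# Hypothetical village coordinates: two loosely separated clusters.
rng = random.Random(0)
affected = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
unaffected = [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(200)]

scores = [s for _ in range(50) for s in two_fold_scores(affected, unaffected, fit_disc, rng)]
mean_accuracy = sum(scores) / len(scores)
print(len(scores), mean_accuracy)
```

Because the zone is fitted to affected villages only, the unaffected villages never enter estimation and serve purely as held-out absence observations.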
As seen in Figure 5, the OCSVM exhibits the highest performance, indicating that the OCSVM better reflects the reality of the Rohingya Crisis. Nonetheless, it should be noted that the performance is not very high in absolute terms; the OCSVM correctly distinguishes affected from unaffected villages only about seven times out of ten. This reflects the generic difficulties of one-class classification. Thus, as mentioned above, the OCSVM should not be considered a substitute for detailed field-level knowledge. Having said that, however, the OCSVM marks an improvement compared to the extant methods; the OCSVM increases the probability of correct predictions by 0.25, 0.2, 0.05, and 0.03 compared to the PRIOGRID and polygon assignments, the MAXENT, the convex hull, and the SVM respectively.
Although the SVM exhibits a performance similar to the OCSVM, the SVM's accuracy varies substantially across simulations. Indeed, the standard deviation of SVM's accuracy is 0.065, which is far larger than any of the other methods (the standard deviations of the other methods are below 0.03). This is not surprising because the SVM relies on the random sampling of absence data and hence is subject to additional noise. Next, even though the convex hull also exhibits a relatively high performance, it includes the central mountain areas in which there is no housing destruction or conflict (the upper middle pane of Figure 6). Because there is no observation in the mountain areas, these mis-predictions are not reflected in the accuracy metric, which creates the impression that the convex hull would be as accurate as the OCSVM. The OCSVM, on the other hand, does not include those central mountain areas.
Compared to those statistical methods, the MAXENT exhibits very low accuracy. As seen in the upper right pane of Figure 6, the MAXENT is unstable outside the extent of the presence observations. Moreover, even within this extent, the predictions are too inclusive and therefore inaccurate. Finally, the PRIOGRID and township polygon assignments have the lowest accuracies, which is not surprising given the large sizes of the grid cells and townships. Because a majority of the UCDP GED events are also reported at the levels of villages, towns, or cities, those findings cast doubt on the validity of those polygon assignments. Overall, even though none of these methods can substitute for field-level knowledge, the OCSVM exhibits the highest performance, indicating its potential use for macro-level analysis.
6. New conflict zones: application to the UCDP GED
I apply the OCSVM to the UCDP GED (version 19.1), a conflict event dataset commonly used not only in political science but also in other fields (Daskin and Pringle, 2018).Footnote 22 A conflict event is defined as “[a]n incident where armed force was used by an organized actor against another organized actor, or against civilians, resulting in at least 1 direct death at a specific location and a specific date” (Sundberg, Lindgren, and Padskocimaite, 2010: 2). Although recent studies point out reporting biases in the dataset (Weidmann, 2015, 2016), the reporting biases require solutions at the level of event data collection. Thus, they are beyond the scope of this paper. The following analysis is readily replicable with more accurate event data. The new conflict zone data can also potentially be used in conjunction with the calibration method proposed by Donnay et al. (2018).
I estimate conflict zones with and without using the conflict event dates as an additional predictor so that I can create both time-variant and time-invariant conflict zones. Each conflict event is weighted by the casualties so that events of higher casualties have larger weights in the estimation. With the event data, I separately estimate the conflict zones for each dyad of actors. The UCDP GED specifies a conflict name (which I call “conflict episode”) and names of two involved actors (which I call “conflict dyad”) for every conflict event.Footnote 23 Therefore, in the example of the Iraqi Insurgency, I create conflict zones for battles between the government and Islamic State, battles between the government and Ansar al-Islam, and so forth.Footnote 24 Because each dyad is always assigned to a single episode—which in turn belongs to either state-based, one-sided or non-state conflict type—the dyadic conflict zones can be easily aggregated to zones at the levels of conflict episodes or types.
I do not include any geographic or climatic predictors so that the conflict zones are solely based on the UCDP GED and hence those predictors can be used in later analyses.Footnote 25 These features are intended to match those of the UCDP Polygons dataset.Footnote 26 The goal here is to provide a reliable alternative to the UCDP Polygons dataset, which has not been updated for the past 8 years and only includes Africa.Footnote 27 Moreover, the roles of additional predictors are rather limited in the OCSVM. If predictors do not affect the conflict locations, there is no reason to include them. By contrast, if predictors can affect the locations of conflict events, such effects are already reflected in the conflict events themselves. Although the predictors may still provide efficiency gains, they do not reduce biases in the estimate. Finally, the predictors also limit the possible usage of the conflict zones. Including geographic predictors, for instance, would prevent us from analyzing the relationship between those predictors and conflict zones in a causal analysis. This is unattractive not only because we cannot answer those substantive questions, but also because we can no longer use those exogenous variables for the purpose of causal identification.
As a final note, recall that the conflict zones are not real geographical objects but concise summaries of conflict events, and hence the conflict zones are primarily suited for macro-level analysis. For instance, it makes little sense to compare areas one or a few kilometers inside and outside of the conflict zones, as the approximation errors are usually larger than that scale. For this reason, the dataset also comes with estimates of the approximation errors. Specifically, I use parametric bootstrapping to provide the standard errors and corresponding 95 percent lower and upper bounds of the conflict zones.Footnote 28 The interval estimates can be used for the purpose of sensitivity analysis.
The new conflict zone dataset, Wzone, is publicly available in time-varying (daily; 1989–2018) and static versions at the levels of conflict dyads and episodes. Any geo-spatial covariate can be incorporated into Wzone by calculating the mean or other metrics within each zone. Conversely, the Wzone dataset can be integrated into PRIOGRID (Tollefsen et al., Reference Tollefsen, Strand and Buhaug2012) and other spatial datasets by calculating the proportion of each spatial unit that is covered by conflict zones. The integration with PRIOGRID will allow researchers to access a wide array of covariates for further analysis.Footnote 29
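The integration step described above, calculating the proportion of a spatial unit covered by a conflict zone, can be sketched as follows. This is only an illustration under simplifying assumptions: the cell and zone coordinates are hypothetical, the geometry is treated as planar (no map projection), and the covered share is approximated by sampling a regular lattice of points rather than by exact polygon intersection. The function names `point_in_polygon` and `zone_share_of_cell` are introduced here and are not part of Wzone or PRIOGRID.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    `polygon` is a list of (x, y) vertices in order; the last vertex is
    implicitly connected back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def zone_share_of_cell(cell, zone, steps=100):
    """Approximate the share of a rectangular cell covered by a zone polygon.

    `cell` is (xmin, ymin, xmax, ymax). The cell is sampled at the midpoints
    of a steps-by-steps lattice, so the result is an approximation.
    """
    xmin, ymin, xmax, ymax = cell
    dx = (xmax - xmin) / steps
    dy = (ymax - ymin) / steps
    hits = 0
    for i in range(steps):
        for j in range(steps):
            px = xmin + (i + 0.5) * dx
            py = ymin + (j + 0.5) * dy
            if point_in_polygon(px, py, zone):
                hits += 1
    return hits / (steps * steps)


# Hypothetical 0.5-degree cell and a triangular zone covering its lower-left half.
cell = (45.0, 2.0, 45.5, 2.5)
zone = [(45.0, 2.0), (45.5, 2.0), (45.0, 2.5)]
share = zone_share_of_cell(cell, zone)  # roughly one half
```

In practice a GIS library with exact polygon intersection and an equal-area projection would be preferable; the lattice approach is used here only to keep the sketch self-contained.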
6.1 Results
The following figure (Figure 7) shows the time-invariant estimates of conflict zones. The left and right panes show the UCDP Polygons dataset and the OCSVM estimates respectively.Footnote 30 For graphical purposes, the figure shows only the zones of state-based conflicts in Africa. Consistent with my argument, the UCDP Polygons tend to be less flexible but more sensitive to outliers. For example, while the UCDP Polygons contain substantial areas of ocean in the case of Mozambique (blue; bottom right of Figure 7), the OCSVM estimates mostly follow the coastline. As I argued, because a majority of conflict events occurred inland of the coastline, the OCSVM properly accounts for the spatial distribution even without any covariates about terrain.
A more noticeable and perhaps more important difference is the sensitivity to outliers. For quite a few conflicts, the UCDP Polygons indicate larger conflict zones than those in the OCSVM estimates, including those in Algeria (brown; top left) and Angola (orange; bottom left). In the case of the Algerian Civil War, for instance, the conflict was mostly fought within the northern region of Algeria. A few terrorist attacks, however, stretch the UCDP conflict zone beyond the country's borders, into Mauritania, Mali, Niger, and a large area of the Sahara Desert. By contrast, the OCSVM estimate is contained within the northern coastal regions of Algeria, more accurately representing the nature of the civil war.Footnote 31
7. The ecological costs of armed conflict: replication of Daskin and Pringle (2018)
Finally, I replicate Daskin and Pringle's (Reference Daskin and Pringle2018) study on the ecological consequences of armed conflict to demonstrate how the zones could alter the inferences they made. I choose the Nature letter to examine the potentially broad implications of the zoning problem and to highlight an issue that is understudied in political science.Footnote 32 The article, which was published on 18 January 2018, has already been cited by 52 newspapers, including the New York Times and the Economist (8 August 2018).Footnote 33 Their sample is cross-sectional and comprised of 172 park-species combinations in Africa.Footnote 34
The outcome variable is the annualized finite rate of population change,

$$\lambda = \left(\frac{d_{t=1}}{d_{t=0}}\right)^{1/(y_{t=1} - y_{t=0})},$$

where $d_{t=0}$ and $d_{t=1}$ are the densities of wild large herbivores in the beginning and end years of mammal population records ($y_{t=0}$ and $y_{t=1}$ respectively). The rate λ measures the ratio of the population size at the end of a year to the population size at the beginning of the year. The value λ = 0.9, for instance, indicates that if there are 100 animals at the beginning of a year, their population decreases to 90 at the end of that year. The densities of wild large herbivores are compiled by “systematically reviewing academic and grey literature” (Daskin and Pringle, Reference Daskin and Pringle2018: 329).Footnote 35 Their key predictor is the proportion of conflict zones averaged over the years of mammal population records. While the authors use the UCDP Polygons dataset, I use the new conflict zone dataset and calculate the proportion of zones within each protected area. In the following section, I compare the results with the updated version of the UCDP Polygons and the results with the OCSVM estimates, while keeping the other specifications intact so that the only difference lies in the zoning methods.Footnote 36
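As a small worked example of the outcome variable, the sketch below computes an annualized finite rate of population change from two density records. It assumes the rate is the ratio of end to start density raised to one over the number of years elapsed, which is how the text describes λ; the function name `annualized_lambda` and the input values are hypothetical and chosen only for illustration.

```python
def annualized_lambda(d_start, d_end, y_start, y_end):
    """Annualized finite rate of population change between two records.

    d_start, d_end: densities in the first and last record years.
    y_start, y_end: the corresponding years (y_end must be later).
    """
    if d_start <= 0 or d_end < 0:
        raise ValueError("start density must be positive and end density non-negative")
    if y_end <= y_start:
        raise ValueError("the end year must be later than the start year")
    years = y_end - y_start
    return (d_end / d_start) ** (1.0 / years)


# One-year example from the text: 100 animals falling to 90 gives a rate of 0.9.
one_year = annualized_lambda(100, 90, 2000, 2001)

# Hypothetical five-year record with the same annual rate of decline.
five_year = annualized_lambda(10.0, 5.9049, 2000, 2005)
```

Annualizing in this way makes records of different lengths comparable, since each is reduced to the constant yearly rate that would produce the observed change.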
7.1 Results
The following table (Table 1) compares the results based on the updated UCDP Polygons and the OCSVM estimates. While the original finding (left columns in Table 1) indicates a statistically and substantively significant association between conflict zones and the decline of the mammal population, these results are not consistent with my conflict zones dataset. In fact, with the new zones, we cannot draw meaningful inferences from the data.
Note: The table shows the regressions of mammal population trajectories on the average proportions of conflict areas in protected areas in Africa. The left and right columns show the results based on the updated UCDP Polygons and the OCSVM estimates respectively. In each column, the regression coefficient and corresponding 95 percent confidence intervals are reported. The control variables are human population density, proportion of urban areas, and drought frequency, which are included in the “best” model of Daskin and Pringle (Reference Daskin and Pringle2018). n = 172.
The differences become even clearer once we consider the effect sizes. The following figure (Figure 8) compares the trajectories of a hypothetical mammal population with an initial size of 100,000. For each of the estimated effects in Table 1, I calculate the population trajectory in a protected area that lies entirely outside conflict zones (blue dotted line) and that in an area lying entirely inside conflict zones (red solid line). As seen in pane (a) of Figure 8, according to Daskin and Pringle (Reference Daskin and Pringle2018), the mammal population is stable or only slightly decreases without armed conflict, but it drastically decreases in conflict zones; in each year of armed conflict, the population is estimated to decline to about 35 percent of its size at the start of that year. This means that within 5 years of armed conflict, the population would decrease to less than 1 percent of the initial size.
The estimates with the new conflict zones, however, indicate more modest and perhaps more realistic trajectories (pane (b) of Figure 8); in each year of armed conflict, the population is predicted to decline to 90 percent of its size at the start of that year. Although this estimate is still large given the prolonged nature of armed conflict (the population decreases to about 59 percent of the initial size within five years of armed conflict), it at least does not mean that fighting would nearly eradicate the animals within a few years.Footnote 37 The results are also indeterminate. There is no definite evidence that the mammal population decreases in conflict zones or that the rate of population loss is higher than that in non-conflict zones. Given the relatively large difference in the mean estimates, the null result is probably due to the small sample size. Future studies need to collect more observations to increase the power of the analysis.
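The five-year figure quoted above follows from compounding the annual rate. The minimal sketch below reproduces that arithmetic for the hypothetical population of 100,000 and an annual rate of 0.9; the function name `trajectory` is introduced here for illustration, and the constant-rate assumption is a simplification.

```python
def trajectory(initial, annual_rate, years):
    """Population size at the start and after each of `years` years,
    assuming the same finite rate of change applies every year."""
    return [initial * annual_rate ** t for t in range(years + 1)]


# Hypothetical population of 100,000 under an annual rate of 0.9.
path = trajectory(100_000, 0.9, 5)

# Share of the initial population remaining after five years of conflict.
share_after_5 = path[-1] / path[0]
```

Because the decline compounds, even a seemingly moderate annual rate implies a large cumulative loss over a prolonged conflict.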
I also conduct two additional replications, which are detailed in Supporting Information 6 and 7. Although I refrain from drawing a definite conclusion given the small number of replications, it appears that the measurement errors tend to have large impacts when we use the conflict zones for creating variables and/or when the sample size is small. The analysis with a small sample can be heavily influenced by systematic or non-systematic measurement errors in a few observations. In fact, the new conflict zones also substantially alter the results of Beardsley et al. (Reference Beardsley, Gleditsch and Lo2015), who use the UCDP Polygons for measuring rebels' movement in an analysis with a relatively small sample (n = 257). By contrast, the new measure does not alter the main findings of Fjelde and Hultman (Reference Fjelde and Hultman2014), who use the conflict zones for selecting a sample in an analysis with large panel data.Footnote 38
These results, however, do not mean that a larger number of observations can always mitigate the biases from measurement errors. In fact, the measurement errors in Beardsley et al. (Reference Beardsley, Gleditsch and Lo2015) have systematic patterns that will not disappear even with a large sample. This demonstrates that the biases in empirical estimates can persist. Future studies should therefore carefully assess the underlying assumptions of conflict zones and the patterns of the measurement errors. If the measurement errors are not systematic, a large sample can help (even though such errors can still cause attenuation bias). If the measurement errors are systematic, however, the empirical findings must be taken with great caution.
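The point about non-systematic errors and attenuation can be illustrated with a small simulation. The sketch below is not based on any of the replicated datasets: it generates an artificial predictor, adds purely random noise to it, and compares the regression slopes. All values (sample size, true slope, noise variance, random seed) are assumptions chosen for illustration.

```python
import random


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 20_000
true_slope = 2.0

# Artificial "true" predictor and outcome.
x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [true_slope * x + rng.gauss(0.0, 1.0) for x in x_true]

# The same predictor measured with random (non-systematic) error
# whose variance equals the variance of the true predictor.
x_noisy = [x + rng.gauss(0.0, 1.0) for x in x_true]

slope_clean = ols_slope(x_true, y)   # close to the true slope
slope_noisy = ols_slope(x_noisy, y)  # pulled toward zero, here to about half
```

In this setup the large sample makes the estimate precise, but it converges to the attenuated value rather than to the true slope, which is why sample size alone does not remove the bias.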
8. Conclusion
In conflict studies, the selection of areal units is so common that people may not recognize that the areal assignment is indeed an assumption. Without properly understanding where armed conflict takes place, however, we cannot know why armed conflict occurs or what its consequences are. In this paper, I have addressed the areal-unit problems by developing a theory, method, and dataset of conflict zones. I define a zone as a summary function that maps locations and other relevant information onto the presence and absence of armed conflict. This formalization clarifies that the zoning exercise is essentially a statistical problem—it is a matter of how we can infer a zoning function from observed data of conflict events. I answer this question by applying the OCSVM, which unlike other deterministic or statistical methods does not depend on a predefined areal unit. I apply the OCSVM to the UCDP GED conflict event dataset and create a new dataset of conflict zones. The replication of Daskin and Pringle (Reference Daskin and Pringle2018) indeed indicates that zones can potentially alter our inferences about the ecological costs of armed conflict in statistically and substantively significant ways.
Although this paper is primarily interested in armed conflict and applies the method to the UCDP GED, the theory and method can be applied to other conflict data, such as ACLED (Raleigh et al., Reference Raleigh, Linke, Hegre and Karlsen2010), SCAD (Hendrix and Salehyan, Reference Hendrix and Salehyan2013), and ICEWS (Boschee et al., Reference Boschee, Lautenschlager, O'Brien, Shellman, Starz and Ward2015), and potentially to other topics in the social sciences, including poverty mapping, crime zones, state control over territories, hazard maps, and zones of racial segregation. Although application to those topics will certainly require extensions and modifications, the framework of this paper provides a way to think about the problems and thus to develop suitable methods. I hope this paper advances our understanding of the geography of armed conflict and, more broadly, of the areal-unit assumptions in political science.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/psrm.2020.16.
Acknowledgements
I am grateful to Michael G. Findley, Stephen Jessee, Yuta Kamahara, the two anonymous reviewers, and the editors of PSRM for thoughtful comments. I also express appreciation to Ross Buchanan for writing support. This paper was presented at the IR Workshop at the University of Texas at Austin, the Workshop on Armed Conflict and Political Economy of Development at Kyoto, and the American Political Science Association Annual Meeting in 2018.