1. Introduction
The Ganges spans approximately 26 per cent of India's territory and sustains nearly half of its population (Chakraborti et al., 2018). Despite its importance, it is becoming one of the world's most polluted rivers due to growing population, industrialisation, and urbanisation (Chaudhary and Walker, 2019). Urban areas near the Ganges saw a 30 per cent population increase from 2001 to 2011, which likely worsened the pollution (Government of India, 2011). The pollution in the Ganges not only harms the environment but also imposes significant health and economic costs on the people living nearby (Das and Birol, 2010; Khan et al., 2016; and others).
Many studies show that polluted water threatens public health and economic well-being. The Ganges, a key water source, is among the world's most polluted rivers (Chaudhary and Walker, 2019). Pollution can affect children's physical growth and cognitive development, as water filters may not remove all pollutants. This paper explores how pollution in the Ganges Basin affects the education of children aged 8–11 across 39 districts. Long-term exposure to pollution could impair cognitive abilities, potentially leading to lower educational achievements (Dewey et al., 2023). We use data from the Central Pollution Control Board (2012a) and the 2011–12 wave of the Indian Human Development Survey (Desai and Vanneman, 2012) to analyse how organic and inorganic pollutants impact children's test scores. We focus on the effects of faecal coliform and Nitrate Nitrogen + Nitrite Nitrogen on children's reading, maths and writing abilities. For brevity, we will refer to Nitrate Nitrogen + Nitrite Nitrogen as Nitrate-N + Nitrite-N henceforth.
Originating from the Gangotri glacier in Uttarakhand, India, the Ganges flows 2,525 km across five states to the Bay of Bengal. It is essential for drinking, cooking and irrigation. However, pollution from sewage, industrial waste and agricultural runoff, exacerbated by population and industrial growth, poses a significant challenge. A recent report indicates that 764 industries release 500 million litres of wastewater into the Ganges daily.Footnote 1 Heavy metals in the water can cause kidney damage and cancer (Lellis et al., 2019). Furthermore, long-term consumption of water with heavy metal content has been shown to impair cognitive function (Siegal and Share, 1990; Tolins et al., 2014; Tyler and Allan, 2014). Nitrates and antibiotic-resistant bacteria in the water also pose health risks (Quist et al., 2018; Adimalla, 2020). This study examines the impact of faecal coliform and Nitrate-N + Nitrite-N on children's cognitive abilities and educational outcomes, establishing an association between polluted water in the Ganges and lower test scores.
Religious activities such as ritual baths, idol immersion, and cremation add to the Ganges' pollution, increasing heavy-metal levels and the river's biochemical oxygen demand (BOD), often exceeding the Central Pollution Control Board (CPCB) standards. During the Maha Kumbh festival, studies of the Ganges water show that such mass gatherings significantly raise BOD, total suspended solids, and ammonia nitrogen beyond safe limits for outdoor bathing. The water also shows high levels of faecal and total coliforms, leading to more water-borne diseases (Tyagi et al., 2013).Footnote 2
Several studies have shown that the water quality of the Ganges is unsuitable for drinking and bathing at many monitoring points (Mariya et al., 2019). This can pose a higher risk to human health (Chaudhri and Jha, 2012), and can potentially lead to lower cognitive abilities through the channel of health deterioration. When it comes to the educational outcomes of children in developing countries, researchers have focused more on socioeconomic and household conditions as determinants of children's education (Nambissan, 2009; Chaudhri and Jha, 2012). A growing literature provides evidence that exposure to pollutants, especially air pollutants, leads to lower educational outcomes in the US (Sanders, 2012; Rosofsky et al., 2014; Ebenstein et al., 2016; Roth, 2017). However, to the best of our knowledge, this is the first research that specifically examines the negative impact of poor water quality on educational outcomes in the context of a developing country like India.
This paper investigates the understudied area of pollution's impact on education in developing countries such as India. Water pollution leads to both immediate and long-term health issues, including negative effects on cognitive development from prolonged pollution exposure. Increased population density in polluted areas further exacerbates these effects, reducing children's cognitive abilities. Despite its importance, such research is limited, often overshadowed by urgent issues like child mortality. Moreover, while the discourse on environment and development prioritizes health and the reduction of child mortality, interest in educational outcomes often takes a backseat. Some studies that explored only the environmental and health outcomes were conducted after pollution control laws like the Ganga Action Plan were implemented (Dwivedi et al., 2018). The lack of data for long-run health and cognitive outcomes is another hurdle in researching the connection between water pollution and children's educational outcomes.Footnote 3
2. Data
To examine the relationship between the water quality of the river Ganges and children's educational outcomes, we merge two types of data: (1) household survey data, which provides information on children's educational outcomes, and (2) water quality data, encompassing various measures of water quality.Footnote 4 Below, we detail both data sources and describe the variables employed to estimate our empirical model.
2.1 Indian human development survey
The source of the household survey data for this paper is the Indian Human Development Survey (IHDS), a nationally representative dataset.Footnote 5 For this paper, we use the second round of the survey, conducted between November 2011 and October 2012. In this round, 42,152 households across 1,503 villages and 971 urban neighbourhoods throughout India were interviewed. While the first wave took place in the 2004–05 period, data from both the base year and the second round cannot be combined for this study because educational outcomes were only measured in the second round. Most children surveyed were at most two years old during the 2004–05 period and not suitable for educational aptitude testing. Data on various socioeconomic characteristics, such as individual health, household employment, and income, along with school facilities and staff, were collected. The interviews utilised two sets of questionnaires: one on income and social capital, typically answered by the male head of the household, and another on education and health, answered by an ever-married woman. The collected data are organised into fourteen modules, of which the Individual, Household, and School Facilities modules are used for this study.Footnote 6 After merging the data and excluding missing values, we retain 1,147 observations for children aged 8–11 living in 39 districts across five states in the Ganges Basin, where water quality was monitored.Footnote 7
2.2 Water quality data
We gathered water quality data for the districts in the Ganges basin for the years 2012 and 2013, drawing from the CPCB (2012a) database. This database operates under the Ministry of Environment, Forest, and Climate Change of the Indian Government.Footnote 8 The CPCB selects monitoring points along rivers or near water bodies (lakes and groundwater sources) that likely exhibit varying levels of key pollutants and potential turbidity. Monitoring points within districts along a river are sometimes categorised as either upstream or downstream from well-known locations. With each monitoring point's specific location provided, we identify the nearest district to each point. For instance, if a monitoring point is in a river, we assign it to the district situated directly on the riverbank. Most districts in our sample are located by a river, on the banks of the Ganges and/or Yamuna, or along their tributaries.Footnote 9
Pollution data were collected quarterly and monthly at these monitoring points, with CPCB publishing yearly averages for minimum, mean and maximum levels of each water quality indicator. For example, at a specific monitoring point $j$ at time $t = 1$, the CPCB calculates the minimum, mean and maximum levels of faecal coliform, $F_{\textrm{min},1,j}$, $F_{\textrm{mean},1,j}$ and $F_{\textrm{max},1,j}$, respectively. By averaging these measurements over $T$ periods, they create $\left(\sum\nolimits_{t = 1}^T F_{\textrm{min},t,j}\right)/T$, $\left(\sum\nolimits_{t = 1}^T F_{\textrm{mean},t,j}\right)/T$ and $\left(\sum\nolimits_{t = 1}^T F_{\textrm{max},t,j}\right)/T$. If a district has $J$ monitors, indexed $j = 1, 2, 3, \ldots, J$, and if data were collected by CPCB at $T$ times in 2012, then we calculate the district mean of faecal coliform as $\left(\sum\nolimits_{j = 1}^J \sum\nolimits_{t = 1}^T F_{\textrm{mean},t,j}\right)/(T \times J)$. We use this averaging scheme for each district. Compared to the average maximum and minimum levels of pollution exposure, represented by $\left(\sum\nolimits_{j = 1}^J \sum\nolimits_{t = 1}^T F_{\textrm{max},t,j}\right)/(T \times J)$ and $\left(\sum\nolimits_{j = 1}^J \sum\nolimits_{t = 1}^T F_{\textrm{min},t,j}\right)/(T \times J)$ respectively, the overall mean pollution level $\left(\sum\nolimits_{j = 1}^J \sum\nolimits_{t = 1}^T F_{\textrm{mean},t,j}\right)/(T \times J)$ more accurately indicates the level of pollution to which the sample respondents were most frequently exposed. The minimum and maximum readings from the monitoring points may reflect infrequent dips and spikes in pollution, not necessarily representing the regular exposure levels for children.
Since CPCB provides only minimum and maximum readings at each monitoring point but not their frequencies, we use only the mean pollution levels from the monitoring points to calculate $\left(\sum\nolimits_{j = 1}^J \sum\nolimits_{t = 1}^T F_{\textrm{mean},t,j}\right)/(T \times J)$, the district-level pollution measure. District-level means of the other water quality variables have been calculated in the same way.
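The district-level averaging scheme can be illustrated with a minimal sketch. The monitor names and readings below are hypothetical, and the function assumes that every monitor in a district reports the same number of periods $T$, as in the formula above.

```python
from statistics import mean


def district_mean(readings):
    """Average mean-level readings over all monitors and periods in a district.

    `readings` maps a monitor id j to its list of T mean readings F_mean[t, j].
    With equal T per monitor, the pooled mean equals the double sum over j and t
    divided by T x J.
    """
    all_values = [value for series in readings.values() for value in series]
    return mean(all_values)


# Hypothetical district with J = 2 monitors and T = 4 readings each (MPN/100 ml).
example = {
    "monitor_1": [1000.0, 3000.0, 2000.0, 2000.0],
    "monitor_2": [4000.0, 4000.0, 6000.0, 2000.0],
}
district_fcoli = district_mean(example)
```

If monitors reported different numbers of periods, the pooled mean would weight monitors by their number of readings, which is a different quantity from the one described in the text.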
We primarily use water quality data from 2012, supplementing it with 2013 data to fill any gaps. Missing readings for certain monitoring points in 2012 could potentially bias the computation of average water quality variables. To address this, we impute missing values using their 2013 counterparts. We found that readings from monitoring points available in both years were consistent, with no cases of monitors shifting from benign pollution levels in 2012 to hazardous levels in 2013. Therefore, we are confident that our approach to handling missing data ensures the reliability and representativeness of the actual pollution levels.
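The gap-filling rule can be sketched as follows. The data layout (one reading per monitor per year, with `None` marking a missing value) and all names are assumptions made for illustration, not the structure of the CPCB files.

```python
def impute_from_next_year(readings_2012, readings_2013):
    """Fill missing 2012 monitor readings (None) with the 2013 value if one exists."""
    filled = {}
    for monitor, value in readings_2012.items():
        if value is None:
            # Fall back to 2013; the value stays None if 2013 is missing too.
            value = readings_2013.get(monitor)
        filled[monitor] = value
    return filled


# Hypothetical monitors: m2 is missing in 2012, m3 is missing in both years.
r2012 = {"m1": 1200.0, "m2": None, "m3": None}
r2013 = {"m1": 1500.0, "m2": 900.0}
filled = impute_from_next_year(r2012, r2013)
```

Observed 2012 readings are never overwritten; 2013 values enter only where 2012 is missing.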
2.3 Descriptive statistics
Table 1 displays mean values for key variables, with each column representing a sample based on the type of water source monitored for pollution. For instance, the averages in the first column are derived from data on children in districts where river water was monitored. Column 7 in table 1 shows variable means for the full sample of 1,147 children. In some districts, more than one type of water source was monitored. According to columns 1 and 2 in table 1, mean faecal coliform and mean Nitrate-N + Nitrite-N levels are higher in the ‘river’ and ‘Ganges’ samples compared to ‘Yamuna’, ‘groundwater’ (GW) and ‘Tributaries’ (Trib.). The main binary variables of interest are district-average $1[\textrm{Mean faecal coliform} > 2,500\ \textrm{MPN}/100\ \textrm{ml}]$ and $1[\textrm{Mean Nitrate-N} + \textrm{Nitrite-N} > 1\ \textrm{mg/L}]$. For simplicity and to save space, we express these variables as $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ and $1[\overline {\textrm{NIT}} > \textrm{limit}]$, respectively, using Iverson bracket notation.Footnote 10
Notes: Columns (1) to (7) show variable means for district groups by water source type monitored in 2012. Columns (1) to (6) detail specific sources: Ganges and Yamuna (1), only Ganges (2), only Yamuna (3), lakes (4), groundwater (5) and tributaries (6), with column (7) combining all districts.
a Mean faecal coliform (MPN/100 ml), reported in millions.
b HH expenditure: Household per capita expenditure.
c Household purifies water by boiling, filtering, aquaguard, or chemicals.
d Members of the households always wash hands after defaecation.
Table 1 displays significant variations in the average values of water pollution measures. For example, the highest mean faecal coliform level is observed in the ‘Lake’ sample, while the ‘Ganges’ sample records the highest mean levels of Nitrate-N + Nitrite-N. Conversely, the ‘Yamuna’ sample, shown in column (3), has the lowest levels of both mean faecal coliform and mean Nitrate-N + Nitrite-N, coinciding with the lowest mean test scores. These patterns indicate a possible link between higher district test scores and elevated levels of pollutants, possibly because urban districts, despite higher pollution, often have access to better educational resources and means to counteract water pollution effects. Hence, the descriptive data in table 1 alone cannot comprehensively evaluate pollution's negative impact on test scores. A detailed analytical model is essential to pinpoint the impact of pollution exposure on test scores.
Our study focuses primarily on district-mean levels of faecal coliform and Nitrate-N + Nitrite-N as the main water pollutants, rather than on other pollutants for which data are available. Other water quality metrics, such as biochemical oxygen demand (BOD), dissolved oxygen level (DO) and pH, are not classified as pollutants, though they do assess water quality.Footnote 11 We incorporate these metrics as control variables in our model. It is important to note that BOD and DO levels do not consistently correlate with the levels of our primary pollutants of interest. Typically, higher BOD levels and lower DO levels are observed in more turbid water, which may coincide with higher levels of faecal coliform and Nitrate-N + Nitrite-N (Ahipathy and Puttaiah, 2006). However, the absence of undesirable BOD and DO levels does not necessarily mean the absence of unsafe levels of faecal coliform and Nitrate-N + Nitrite-N. For instance, table 1 indicates that the groundwater sample exhibits relatively fewer occurrences of undesirable BOD and DO levels, yet the mean faecal coliform level in these districts is very similar to that of the full sample. In addition, in districts adjacent to the Yamuna River where BOD levels exceed preferred thresholds, Nitrate-N + Nitrite-N levels do not reach hazardous levels. Thus, BOD and DO levels do not always serve as accurate indicators of pollution. Lastly, the pH level exhibits minimal variation across the samples mentioned in columns 1 to 7 of table 1.Footnote 12 All these samples, along with almost all districts in ‘tributaries’ and the full sample, maintain high but safe pH levels. Consequently, overall pH levels do not present a significant risk to the cognitive abilities of children.
In table 1, individual characteristics such as age, gender, height, weight, and family consumption expenditure show only marginal variation across the monitored water source categories. Interestingly, the proportion of households with indoor piped water supply and those purifying water vary between 0.05–0.77 and 0.09–0.77, respectively. Handwashing after defaecation is a critical preventive measure against many diseases (Curtis and Cairncross, 2003), and the proportion of households consistently practising this varies narrowly from 0.69 to 0.77. Table A2 presents variable means for samples that are exposed to unsafe levels of faecal coliform and Nitrate-N + Nitrite-N. Table A3 includes means of additional variables we use as controls. Note that all tables whose numbers are preceded by ‘A’ appear in the online appendix, in which we provide explanations of the table contents below the tables as needed.
For regression analysis, we employ binary measures of the water pollutants, $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ and $1[\overline {\textrm{NIT}} > \textrm{limit}]$. Using binary variables offers three distinct advantages. First, they enable a clear distinction between the districts experiencing unsafe pollution levels and those that do not, based on the established safety limits for pollutant concentrations. Second, interpreting the estimated effect of the binary variables that signal unsafe pollution levels in districts does not rely on pollution changing by a certain amount; there was not much difference in pollution levels from 2012 to 2013. Also, minute fluctuations, like an increase of one MPN of faecal coliform per 100 ml of water, are unlikely to make noticeable differences in test scores, making the estimated effect of a one-unit change hard to interpret. Lastly, identification of the effects of pollutants in a regression model can be challenging at extremely high values of the continuous pollution measures. This complexity arises because districts with the most significant river pollution are often both densely populated and economically advanced. It is easier for such districts to insure themselves against high levels of pollution by establishing superior water filtration systems.
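The construction of the two binary indicators from district means can be sketched as below. The thresholds are the safety limits stated in section 2.3 (2,500 MPN/100 ml for faecal coliform and 1 mg/L for Nitrate-N + Nitrite-N); the district names and values are hypothetical.

```python
FCOLI_LIMIT = 2500.0  # MPN per 100 ml, limit stated in the text
NIT_LIMIT = 1.0       # mg per litre, limit stated in the text


def unsafe_indicator(district_mean, limit):
    """Return 1 if the district mean strictly exceeds the safety limit, else 0."""
    return 1 if district_mean > limit else 0


# Hypothetical district means: (faecal coliform, Nitrate-N + Nitrite-N).
districts = {
    "district_a": (3000.0, 0.4),
    "district_b": (1200.0, 1.8),
}
indicators = {
    name: (unsafe_indicator(fcoli, FCOLI_LIMIT), unsafe_indicator(nit, NIT_LIMIT))
    for name, (fcoli, nit) in districts.items()
}
```

The strict inequality mirrors the bracket notation: a district whose mean sits exactly at the limit is coded 0.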
We examine the educational outcomes of children living in Ganges Basin districts, focusing on areas where water sources were monitored for pollution. The survey assessed children's reading, writing and arithmetic skills through tests administered to all eligible children aged 8–11 in each household. As indicated in table 1, the test scores are considered continuous variables, with a comprehensive description provided in table A4.Footnote 13 These tests, developed in collaboration with researchers from PRATHAM,Footnote 14 were pretested to ensure they were comparable across various languages. This method allows us to analyse the educational performance of school children in different states, accommodating the diverse languages used as mediums of instruction. Despite each Indian state having its unique school curriculum, PRATHAM's tests remain consistent across the board. The standardisation of test scores enables us to assess the impact of pollution exposure on children's average position within the test score distribution.
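As an illustration of what standardisation does to a score, the sketch below converts raw scores to z-scores. The exact standardisation used in the paper is not specified in this section, so the use of the population standard deviation and the example scores are assumptions.

```python
from statistics import mean, pstdev


def standardise(scores):
    """Convert raw scores to z-scores (mean 0, standard deviation 1)."""
    centre = mean(scores)
    spread = pstdev(scores)  # population SD; an assumption, not the paper's stated choice
    if spread == 0:
        raise ValueError("scores have no variation, so z-scores are undefined")
    return [(score - centre) / spread for score in scores]


# Hypothetical raw test scores on a 0 to 4 scale.
raw_scores = [0, 1, 2, 3, 4]
z_scores = standardise(raw_scores)
```

A coefficient estimated on such an outcome is read in standard-deviation units, that is, as a shift in a child's position within the score distribution.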
3. Empirical model
The empirical model examines the effect of water quality on test scores (equation (1)). The analytical sample contains unique children $i = 1, 2, 3, \ldots, n$ living in districts $k = 1, 2, 3, \ldots, K$,
where $\boldsymbol{W}$ is the vector of water quality variables and their values vary between districts, $\boldsymbol{X}$ is a vector of ${X_{ik}}$ control variables, and ${\chi _k}$ are district dummy variables. We use the same right-hand-side variables for each test outcome, ${Z_{ik}}.$ The main treatment variables, $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ and $1[\overline {\textrm{NIT}} > \textrm{limit}]$, vary only between districts and not within each district. Our baseline model uses random intercept regression. ${\epsilon _{ik}}$ is the individual-level error term and ${Z_{ik}}$ indicates our set of dependent variables are nested within cluster k, with each district representing a separate cluster. Since $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ and $1[\overline {\textrm{NIT}} > \textrm{limit}]$ vary between districts, we can interpret the coefficient estimates of these two variables as the average decline in the children's position within the test score distribution due to exposure to district-level pollutants.Footnote 15 We include district-mean pH, and binary indicators of BOD and DO in the vector $\boldsymbol{W}$ from equation (1).Footnote 16
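Equation (1) is not reproduced in this text. One form that would be consistent with the definitions above is sketched below; the intercept $\alpha$ and the coefficient vectors $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ are placeholder symbols introduced here for illustration and may differ from the notation of the original equation.

```latex
Z_{ik} = \alpha + \boldsymbol{W}_k \boldsymbol{\beta} + \boldsymbol{X}_{ik} \boldsymbol{\gamma} + \chi_k + \epsilon_{ik}
```

Under this reading, $\boldsymbol{W}_k$ carries only a district subscript because the water quality variables do not vary within a district, and $\chi_k$ is the district-level random intercept.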
The economic intuition behind applying the random-effects model is that the district-level errors are not necessarily affecting ${Z_{ik}}$ through the variables of interest, $\boldsymbol{W}$. Communities within a district can invest in water treatment plants and water supply networks to insure against pollution. More affluent districts, often more urbanised, tend to pool resources to develop better public water supply networks to mitigate water pollution risks (Sarker et al., 2021). Since water supply networks are monopolies requiring an initial fixed investment, and the marginal cost of water supply to additional households is low, all the households in a district would have the same quality of water supply network available to them irrespective of individual household-level wealth and income. In other words, both rich and poor participate in the same water distribution network and are subject to similar levels of water quality. Thus, the unobserved heterogeneity due to a district's water supply characteristics can be considered as random intercepts, $E({\boldsymbol{X}\textrm{|}{\chi_k}} )= 0$, for the households and is not likely to drive or be driven by the household-level observed variables in $\boldsymbol{X}$. If $E({\boldsymbol{X}\textrm{|}{\chi_k}} )\ne 0$, then we would need fixed-effects estimation of equation (1). Therefore, we model district-level exposure to water quality as random district-level effects.Footnote 17
We prefer a random-effects model over one with district fixed effects because the fixed-effects model can introduce multicollinearity between the district-level dummy variables and the binary pollution variables. We run different tests to check if the random-effects model should be used instead of some alternative models. Diagnostic tests developed by Hausman (1978) and Schaffer and Stillman (2006) show that the random-effects model is preferred over the fixed-effects model.Footnote 18 Additionally, a test by Breusch and Pagan (1980) shows that the random-effects model is favoured over a simple ordinary least squares (OLS) model. Furthermore, we conduct a likelihood-ratio (LR) test that indicates that a random-effects model is preferred to a pooled model with district dummy controls. Overall, the results support applying a random intercept (district-level) specification.
The binary variables indicating unsafe levels of faecal coliform and Nitrate-N + Nitrite-N correlate with DO, BOD, and pH to some degree, as they all reflect aspects of water quality. The exact functional relationships between them are unknown. Generally, water quality deteriorates when faecal coliform and Nitrate-N + Nitrite-N exceed safety limits. Consequently, the estimated effect of main water pollution measures may be overstated, capturing both the overall water quality impact and specific pollution contents. However, water turbidity is also associated with poor quality, making it essential to control for the effects of mean BOD, mean pH and mean DO in equation (1). By doing so, we might have overly adjusted for water quality effects, rendering the estimates of the impact of unsafe levels of faecal coliform and Nitrate-N + Nitrite-N as ‘lower-bound’ estimates.
3.1 Identification
Equation (1) is based on the structure of a simple education production function. This function, widely discussed in the education economics literature, relates educational inputs to outcomes like test scores and class rankings (see Krueger (1999) and Hanushek (2010), among others). We assume that water quality levels are ‘predetermined’ factors in the education production process. Thus, the error term ${\epsilon _{ik}}$ is uncorrelated with water quality, or $E(\boldsymbol{W}|{\epsilon _{ik}}) = 0$. While this is a strong assumption, we later introduce a propensity score matching model to estimate the causal effects of $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ and $1[\overline {\textrm{NIT}} > \textrm{limit}]$ on test scores, relaxing this initial assumption.
River pollution is an outcome tied to the economic activities, population density and geographic characteristics of an area. However, schooling is governed by state policies and government mandates in India, i.e., all children must attend schools (Chhokar, 2010). The government provides funding to the schools and dictates school curricula and related policies (Kingdon, 2007). The average quality of education and outreach in a district is not subject to the aggregate factors which may drive river pollution, namely overpopulation, urbanisation and industrialisation. Average education outcomes of the children may be driven by river pollution and other aggregate factors. Pollution impacts education production through the channel of both short-term and long-term health, as health is directly linked to water quality and, consequently, to productive outcomes such as educational attainment.
The CPCB employs stringent criteria to select monitoring points, indicating a non-random selection process. Consequently, the non-random selection of monitoring stations leads to a non-random selection of districts in our analysis. To address this, we calculate district-level mean pollution after aggregating readings from all monitoring points in a district. If the sample distribution of pollutants is skewed right because CPCB monitors more polluted areas, then the sample mean might exceed the true average pollution level. However, our focus is on binary indicators that show whether average monitored pollution levels exceed safety limits. Given that the sample includes districts with pollution levels below the unsafe threshold, it seems unlikely that CPCB exclusively monitored the most polluted river sections. Furthermore, some monitors detected no faecal coliform and Nitrate-N + Nitrite-N levels, suggesting that the selection of monitoring sites is unlikely to compromise the validity of our findings on the pollutants' treatment effect.
For robustness checks, the vector $\boldsymbol{X}$ in equation (1) is expanded to include the effects of teaching quality, educational expenditure, schooling quality, short-term morbidity, use of technology, and household members' personal hygiene. Since we lack variables for long-term morbidity throughout the children's lives, which could be linked to river pollution, we use district-level short-term morbidity as a proxy. The decline in skills such as maths, reading and writing cannot result from random sickness episodes alone. Short-term morbidity does not reveal the children's susceptibility to illness. Continuous consumption of poor-quality water, even if it does not cause immediate sickness, may lead to cognitive declines in children. The reading, writing and maths tests administered by Pratham (2021) measure the students' average cognitive abilities. Therefore, mean district-level morbidity is intended to capture spikes in short-term morbidity due to unforeseen reasons and the overall health of children in the district, excluding the cognitive loss channel in children exposed to unsafe pollution levels in drinking water.
We investigate the possible channels of cognitive ability loss due to pollutant contents in drinking water. Thus, we further demonstrate that interaction terms between $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ and binary variables describing household water supply and storage choices are statistically significant. This analysis aims to identify how water pollutants not removed by the water supply system – which may or may not have a filtration system – affect children's cognitive abilities.Footnote 19
Household characteristics such as the educational level of the head, available resources, and income significantly impact children's educational outcomes. Families with well-educated heads, ample resources, and higher incomes often see better educational results for their children. However, when considering the substantial impact of high water-pollution levels on education and income, children from households with lower educational outcomes may become trapped in a cycle of poverty. These children may go on to earn low incomes and lack the means to relocate from areas with poor water quality. In such a scenario, the current household head's lower investment in children's education might be linked to lower investment $({{P_k}} )$ in his/her education when he/she was a child and therefore, $E({P_k}|{\epsilon _{ik}}) \ne 0.$ In addition, the observational data used here do not include individual or household-level instruments that could be used to infer causation between poor water quality and educational outcomes.
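We define a binary treatment variable ${T_f}$ in the following way:
$T_f = \begin{cases} 1 & \textrm{if } \overline {\textrm{FCOLI}} > \textrm{limit} \\ 0 & \textrm{otherwise} \end{cases}$
Therefore, we estimate the average treatment effect on the treated (ATT), which measures the difference between the expected test scores of children in high-pollution districts ($T_f = 1$) and a counterfactual outcome, expressed as:
$\textrm{AT}{\textrm{T}_f} = E[{{Z_1} - {Z_0}\textrm{|}{T_f} = 1} ]= E[{{Z_1}\textrm{|}{T_f} = 1} ]- E[{Z_0}|{T_f} = 1] \quad (2)$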
In equation (2), ${Z_0}$ and ${Z_1}$ are outcomes of the non-treated $({T_f} = 0)$ and the treated $({T_f} = 1)$. The subscript $f$ expresses that the treatment is unsafe levels of faecal coliform. $E[{Z_0}|{T_f} = 1]$ is the counterfactual state, which we do not observe and therefore must estimate. By extension, the ATT is also applicable for unsafe levels of Nitrate-N + Nitrite-N. If ${T_n}$ equals 1 when the district-level mean Nitrate-N + Nitrite-N is over the safe level, and 0 otherwise, then $\textrm{AT}{\textrm{T}_n} = E[{{Z_1} - {Z_0}\textrm{|}{T_n} = 1} ]= E[{{Z_1}\textrm{|}{T_n} = 1} ]- E[{Z_0}|{T_n} = 1]$. The subscript $n$ expresses that the treatment is unsafe levels of Nitrate-N + Nitrite-N. Identification is dependent on the assumption of conditional independence: if we control for the household and individual factors that drive educational outcomes, then the treatment effect can be considered random. For this non-experimental exercise, we use the widely known propensity score matching (PSM) developed by Rosenbaum and Rubin (1983).Footnote 20
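The ATT logic can be illustrated with a minimal matching sketch. It assumes the propensity scores have already been estimated (for example from a logit of treatment on the controls) and uses one-to-one nearest-neighbour matching with replacement; the matching variant used in the paper is not stated in this section, and all values below are hypothetical.

```python
def att_nearest_neighbour(units):
    """ATT by 1-to-1 nearest-neighbour matching on the propensity score.

    Each unit is a tuple (treated, propensity_score, outcome). Controls may be
    reused as matches (matching with replacement).
    """
    treated = [u for u in units if u[0] == 1]
    controls = [u for u in units if u[0] == 0]
    if not treated or not controls:
        raise ValueError("need at least one treated and one control unit")
    gaps = []
    for _, score, outcome in treated:
        # The closest control by propensity score stands in for the unobserved Z_0.
        match = min(controls, key=lambda c: abs(c[1] - score))
        gaps.append(outcome - match[2])
    return sum(gaps) / len(gaps)


# Hypothetical children: (T_f, estimated propensity score, standardised test score).
sample = [
    (1, 0.80, -0.50),
    (1, 0.60, -0.25),
    (0, 0.75, 0.00),
    (0, 0.55, 0.25),
    (0, 0.20, 1.00),
]
att = att_nearest_neighbour(sample)
```

The average of the matched outcome gaps is the sample analogue of $E[{Z_1}|{T_f} = 1] - E[{Z_0}|{T_f} = 1]$, with each matched control outcome replacing the unobserved counterfactual.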
The baseline regression results in tables 2–4 can be combined to provide a picture of the negative impact of river pollution on children's test outcomes. Column 1 results are estimated using the full sample in each of the three tables. The pollutants do not appear to generate a statistically significant effect on the test scores in the full sample. Only for the ‘river’ and the ‘Ganges’ samples do we see unsafe levels of faecal coliform generating a statistically significant negative impact.Footnote 21 The largest impact of faecal coliform is on the writing test and the smallest on the reading test when the ‘river’ and ‘Ganges’ samples are considered (columns 1 and 5 in tables 2–4). Overall, faecal coliform has a negative impact on test outcomes. Unsafe levels of Nitrate-N + Nitrite-N have a significant impact only on reading tests when ‘groundwater’ districts are considered. Among other variables, age, height, and weight have some estimated positive impact on the test scores, as expected. The binary indicators of household consumption are coded 1 if the per capita consumption expenditure of a household is at or below the 25th, 50th and 75th percentiles of the distribution, respectively. As the reference group is children from households above the 75th percentile of the per capita consumption expenditure distribution, the estimated effects of these variables, when statistically significant, are understandably negative.
HH con., Household consumption per capita; ptile, percentile; GW, groundwater; Trib., Tributaries.
Notes: Robust standard errors clustered at district level in parentheses.
Explanatory variables not reported: Numerical variables such as ‘hours spent at school per week’, ‘hours spent doing homework per week’, ‘hours spent being tutored per week’, ‘distance from school to home’, ‘number of days the child spent disabled because of short-term morbidity in the last 30 days’. Binary variables such as ‘1 = Rupees spent on books and uniform > Rs. 500’, ‘1 = water storage vessel available at home’, ‘1 = water is purified at home through some mode of filtration or boiling’, ‘1 = household members always wash hands after defaecation’.
Having an indoor piped water supply is also estimated to have a positive impact on children's reading test scores (columns 1 and 3–6 in table 2), and on maths and writing test scores (column 6 in tables 3 and 4). In districts adjacent to groundwater and tributaries that were monitored for pollution, the effects of unsafe levels of faecal coliform and Nitrate-N + Nitrite-N are statistically indistinguishable from zero.Footnote 22 We investigate whether the interaction between unsafe levels of faecal coliform and access to indoor piped water supply significantly affects test scores. While indoor piped water alone has minimal impact on scores, column 6 in table A5 reveals that in the ‘river’ sample, the positive effect of indoor piped water (+0.818) on writing scores is nearly cancelled out by its interaction with the faecal coliform variable (−0.803). This suggests that faecal coliform may impair children's cognitive abilities, as reflected in test scores, despite the presence of an indoor piped water supply. The results in columns 3 and 5 in table 3 are based on the ‘Ganges’ and ‘groundwater’ samples. Tables 2–4 suggest that the impact of unsafe levels of faecal coliform is primarily driven by pollution in the river Ganges. Our other binary variable of interest, for Nitrate-N + Nitrite-N, has a significant impact only on reading test scores, and only in the districts where groundwater is monitored.
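As a worked illustration of the near-cancellation reported for column 6 of table A5, the two quoted coefficients can simply be added. The symbols $\hat{\beta}_{\textrm{pipe}}$ and $\hat{\beta}_{\textrm{pipe} \times \textrm{FCOLI}}$ are labels introduced here for illustration and are not the paper's notation; the sum also leaves aside the main effect of the faecal coliform indicator itself, which is not quoted in the text. For a child in the ‘river’ sample with indoor piped water in a district where $1[\overline {\textrm{FCOLI}} > \textrm{limit}] = 1$, the implied net piped-water effect on the writing score is

$$\hat{\beta}_{\textrm{pipe}} + \hat{\beta}_{\textrm{pipe} \times \textrm{FCOLI}} = 0.818 + (-0.803) = 0.015,$$

compared with $0.818$ in a district where faecal coliform is within the safe limit.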
We look for heterogeneity in the estimated effects of $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ and $1[\overline {\textrm{NIT}} > \textrm{limit}]$ between genders. Looking for a differential pollution effect on boys versus girls, we find that $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ has an approximately 0.01 standard deviation greater effect on boys than on girls in writing tests (columns 9 and 12 in table A6).Footnote 23
Caste-based and religion-based discrimination in accessing safe water suggests that water pollution's impact might vary across different castes and religious groups (Hoff, 2016). However, dividing the sample by religion and caste results in too few observations per group, leading mostly to inconclusive results and hindering our ability to detect potential heterogeneity in the effects of $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ and $1[\overline {\textrm{NIT}} > \textrm{limit}]$. Given the distinct social statuses and relationships among the six religious and caste groups, merging these groups to enlarge sample sizes could lead to misleading conclusions.
In table 5, we present the ATT obtained by estimating a PSM model as outlined in equation (2). Under the conditional independence assumption, the estimated ATT can be interpreted as the causal impact of the main pollution treatments. The results show that, when the full sample is considered, ${T_f}$ has a statistically significant causal impact on reading, maths and writing scores. ${T_n}$ also has a negative impact on reading and maths scores.
Notes: Abadie and Imbens (2016) robust standard errors in parentheses. ${T_f} = 1$ means that the household is in a district that received the treatment of exposure to unsafe levels of faecal coliform, and ${T_f} = 0$ means untreated. ${T_n} = 1$ means that the household is in a district that received the treatment of exposure to unsafe levels of Nitrate-N + Nitrite-N, and ${T_n} = 0$ means untreated. The average treatment effect on the treated has been estimated by propensity-score matching. We consider a logit treatment model. Conditioning variables in the treatment model: demographic identities, age, height, weight, consumption expenditure by households, and individual-level variables: household per capita income, school distance, school hours/week, homework hours/week, private tuition hours/week, expenditure on books and uniform, short-term morbidity (days of disability in the thirty days before the survey interview); binary variables: whether the household boils water for purification (1 = yes), whether household members wash hands after defaecation (1 = yes).
3.2 Robustness checks
We check the robustness of the effects of the pollutants in several ways. First, we check whether the effects of $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ and $1[\overline {\textrm{NIT}} > \textrm{limit}]$ differ across states. We find that the more economically developed West Bengal sees a greater negative impact of $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ on writing tests than the Uttar Pradesh and Bihar-Jharkhand sample (columns 6 and 9 in table A10).Footnote 24 Next, we include more variables in $\boldsymbol{X}^{\mathrm{\prime}}\Gamma$ (equation (1)) that cover more factors related to individual characteristics, household characteristics, water source information, short-term morbidity and schooling. The results in table A11 show that the effects of $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ and $1[\overline {\textrm{NIT}} > \textrm{limit}]$ on reading and writing remain robust even after the inclusion of a long list of control variables. The results in table A12 are estimated by adding indicators related to teaching quality to the regression specification, in addition to the set of explanatory variables used for table A11.Footnote 25 The estimated effect of $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ on reading and writing scores is still robust in table A12.
As a sensitivity analysis, we estimate the baseline results using mixed-model specifications in which the random effects are interpreted as district-specific random intercepts (table A13). The estimated effects of $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ in table A13 are similar to those in tables 2–4, indicating that these alternative specifications do not change the baseline results. In addition, tables A14 and A15 exhibit the statistically robust effects of $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ and $1[\overline {\textrm{NIT}} > \textrm{limit}]$, employing two-level and three-level random-intercept models, respectively, that account for variations within villages, neighbourhoods and households. In table A16, we find that after including a measure of short-term morbidity, the effects of $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ on reading scores in the ‘river’ sample and of $1[\overline {\textrm{NIT}} > \textrm{limit}]$ on reading scores in the full sample remain statistically robust. Next, after adding state-specific controls to our regression specifications, we find that the effect of $1[\overline {\textrm{NIT}} > \textrm{limit}]$ loses its statistical significance, but the effect of $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ on the three test scores remains statistically robust for the full sample (table A17). We attempt to separate the seasonality effect from the pollution effect in table A18. As our dataset is cross-sectional, we add the interaction terms $\textrm{State ID} \times \textrm{District mean morbidity} \times \textrm{Survey month}$ to the model, which are intended to account for variations in district-mean morbidity over the survey months, and find that the effect of $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ on reading and maths scores remains robust in the full-sample regression (table A18).
Besides water pollution, other types of pollution, such as land and air pollution, may also affect test scores. Increases in water and air pollution can coincide when both are driven by rapid urbanisation, so the estimated effect of water pollutants may partially capture the effect of air pollution. We therefore include PM2.5,Footnote 26 a measure of air pollution, as a control variable in our model. PM2.5 refers to particulate matter in the air that is less than 2.5 micrometres in diameter. We find that the impact of $1[\overline {\textrm{FCOLI}} > \textrm{limit}]$ on reading and maths scores remains statistically significant in the full sample in table A19. Moreover, its influence on writing scores is also statistically significant in districts near the Ganges. Our final robustness check instruments the district-mean level of faecal coliform (MeanFCOLI) with the mean level of faecal coliform in the adjacent upstream district. This instrumentation is based on the idea that pollution from an upstream district generates exogenous variation in its downstream neighbouring district; the upstream district is not likely to be influenced by downstream conditions. The effects of instrumented MeanFCOLI on reading scores across three different samples (the full sample, the ‘river’ sample, and the ‘tributaries’ sample) are reported in table A20.
The section ‘Explanation for Table A20’ in the online appendix describes the instrumentation strategy. We also observe a weaker effect of the instrumented MeanFCOLI on the maths score in the full sample and the ‘tributaries’ sample, but not on the writing score, potentially because of the smaller number of observations available. Notably, in the ‘tributaries’ sample, the coefficients for district MeanFCOLI remain unchanged between the random-effects and generalised 2SLS random-effects models (columns 13 to 18 in table A20). This instrumental variable analysis, leveraging upstream faecal coliform levels, acts as an additional robustness check, supporting our primary findings.
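The intuition behind using upstream pollution as an instrument can be sketched with a plain two-stage least squares estimator on simulated data. This is a deliberate simplification, not the generalised 2SLS random-effects model reported in table A20: it has one endogenous regressor, one instrument and no controls or random effects. The function names (`ols`, `two_stage_least_squares`, `simulate_iv`), the unobserved confounder and the assumed true effect of −0.3 are hypothetical.

```python
import random


def ols(x, y):
    """One-regressor OLS; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def two_stage_least_squares(z, x, y):
    """Stage 1: regress the endogenous regressor x on the instrument z.
    Stage 2: regress the outcome y on the stage-1 fitted values."""
    a0, a1 = ols(z, x)
    x_hat = [a0 + a1 * zi for zi in z]
    _, beta = ols(x_hat, y)
    return beta


def simulate_iv(n, seed=0, true_effect=-0.3):
    """Simulated districts (assumption): an unobserved confounder raises both
    downstream pollution and test scores, biasing OLS; upstream pollution
    affects scores only through downstream pollution."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        upstream = rng.gauss(0.0, 1.0)
        confounder = rng.gauss(0.0, 1.0)
        downstream = 0.7 * upstream + confounder + rng.gauss(0.0, 0.5)
        score = true_effect * downstream + confounder + rng.gauss(0.0, 0.5)
        z.append(upstream)
        x.append(downstream)
        y.append(score)
    return z, x, y


if __name__ == "__main__":
    z, x, y = simulate_iv(2000, seed=1)
    print("OLS slope: ", round(ols(x, y)[1], 3))
    print("2SLS slope:", round(two_stage_least_squares(z, x, y), 3))
```

In this simulation the OLS slope is pulled upwards by the confounder, whereas the 2SLS slope should lie close to the assumed effect, because only the variation in downstream pollution that is predicted by upstream pollution is used in the second stage.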
4. Conclusion
This study focuses on the impact of water pollution on the educational outcomes of school-going children aged 8–11 across 39 districts in the Ganges Basin of India. Water, as a crucial natural resource for production and consumption, can have long-term effects on human health, life expectancy, and cognitive functions through various channels. Using data from the CPCB of India and the IHDS 2011–12, we estimate water pollution's effect on performance in three tests taken by children aged 8–11 as part of the IHDS. We find that unsafe faecal coliform levels have a consistently robust negative effect on reading and writing test scores. In several extended specifications and sensitivity analyses, the impact of faecal coliform on maths scores was not statistically robust. The negative effect of Nitrate-N + Nitrite-N was statistically indistinguishable from zero in some robustness checks. The negative effects of faecal coliform in water sources on children's reading and writing performance prove to be consistently significant, even when controlling for additional factors such as average district-level short-term morbidity in children (over thirty days), quality of teaching, and adjustments made using a PSM model. This suggests that faecal coliform contamination may impair the cognitive development of children exposed to poor water quality through the channel of health deterioration for prolonged periods (exceeding 30 days). Future studies employing larger datasets and more precisely pinpointed water pollution data have the potential to refine our understanding of how water contaminants like faecal coliform and Nitrate-N + Nitrite-N impact cognitive functions.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S1355770X24000123.
Acknowledgements
The views expressed in this research paper are those of the authors and do not reflect the positions or opinions of any affiliated organizations.
Data
Data from the Indian Human Development Survey are openly available at Inter-university Consortium for Political and Social Research, Ann Arbor, Michigan, The United States (URL: https://doi.org/10.3886/ICPSR36151.v6).
The water quality data is derived from the publicly available source at URL: https://cpcb.nic.in/nwmp-data-2012. This data is collected and maintained by Central Pollution Control Board (CPCB), Ministry of Environment, Forests and Climate Change, Government of India.
Financial support
We did not receive any financial support for the research, authorship, and/or publication of this article.
Competing interests
The authors declare none.