Introduction
There is an extensive body of research examining associations between educational attainment (EA) and mental and physical health outcomes. Existing studies have pointed to EA (measured as years of education, age at leaving education, or diploma obtained) as a correlate of well-being (Bücker, Nuraydin, Simonsmeier, Schneider, & Luhmann, Reference Bücker, Nuraydin, Simonsmeier, Schneider and Luhmann2018), depression (Lorant et al., Reference Lorant, Deliège, Eaton, Robert, Philippot and Ansseau2003), quality-adjusted life years (Furnée, Groot, & Van Den Brink, Reference Furnée, Groot and Van Den Brink2008), different cardiovascular outcomes (Khaing, Vallibhakara, Attia, McEvoy, & Thakkinstian, Reference Khaing, Vallibhakara, Attia, McEvoy and Thakkinstian2017), and a wide range of other diseases and disorders (Choi et al., Reference Choi, Weekley, Chen, Li, Kurella Tamura, Norris and Shlipak2011; Putrik et al., Reference Putrik, Ramiro, Keszei, Hmamouchi, Dougados, Uhlig and Boonen2016; Telfair & Shelton, Reference Telfair and Shelton2012). Often, EA is interpreted as a modifiable risk factor that might improve outcomes in these different domains, but confounding and reverse causation are difficult to rule out.
Correlational evidence provides us with a first indication of associations between education and (mental) health outcomes. For example, a meta-analysis by Bücker et al. suggests a small-to-medium positive correlation between academic achievement and subjective well-being (SWB) that was stable across different measures of academic achievement and SWB (Bücker et al., Reference Bücker, Nuraydin, Simonsmeier, Schneider and Luhmann2018). Similarly, a small but significant correlation has been found between academic achievement and subsequent depression through meta-analysis (Huang, Reference Huang2015). In addition, lower education has been associated with a higher risk of different cardiovascular outcomes (Khaing et al., Reference Khaing, Vallibhakara, Attia, McEvoy and Thakkinstian2017), and lower self-reported health (Furnée et al., Reference Furnée, Groot and Van Den Brink2008).
Such meta-analytic studies offer the opportunity to evaluate and summarize the existing literature, which allows us to identify correlations worth exploring in more detail. However, it is difficult to establish whether these associations reflect causal associations or whether they might be caused by residual confounding (e.g. genetics, socioeconomic status) (Fewell, Davey Smith, & Sterne, Reference Fewell, Davey Smith and Sterne2007; Sobel, Reference Sobel2000). While confounders can be considered in meta-analysis, it is rarely the case that a large number of studies include the same confounders. Moreover, even if confounding factors could be ruled out, correlational studies would not offer clarity on the direction of causation. For example, while higher levels of education might lead to better access to healthcare, less health problems, and higher health (van der Heide et al., Reference van der Heide, Wang, Droomers, Spreeuwenberg, Rademakers and Uiters2013), the reverse could also be true: for example, people in good health might have better possibilities to focus on education and reach higher levels of education than those in poor health (Kawachi, Adler, & Dow, Reference Kawachi, Adler and Dow2010).
A quasi-experimental design that has been applied widely in educational research is to consider compulsory schooling laws where the legal minimum school-leaving age is increased (Brunello, Fort, & Weber, Reference Brunello, Fort and Weber2009; Clark & Royer, Reference Clark and Royer2013; Glymour & Manly, Reference Glymour and Manly2018; Lleras-Muney, Reference Lleras-Muney2002) as an exposure over which individuals can be reasonably assumed to have no control. The implementation of these laws serves as a natural experiment where people are quasi-randomly separated in two groups (before and after, or subject to or not subject to the policy change). Assuming that this policy change only directly impacts the number of years someone stays in education, and assuming that is unrelated to confounding factors, this policy change can be used to estimate the direct effect of educational duration on diverse outcomes. Using this design, researchers have found positive effects of educational duration on mental health (Chevalier & Feinstein, Reference Chevalier and Feinstein2006; Graeber, Reference Graeber2017), cognitive abilities (Banks & Mazzonna, Reference Banks and Mazzonna2012), mortality (Davies, Dickson, Smith, Van Den Berg, & Windmeijer, Reference Davies, Dickson, Smith, Van Den Berg and Windmeijer2018), income (Davies et al., Reference Davies, Dickson, Smith, Van Den Berg and Windmeijer2018; Grenet, Reference Grenet2013), and cardiovascular health (Hamad, Nguyen, Bhattacharya, Glymour, & Rehkopf, Reference Hamad, Nguyen, Bhattacharya, Glymour and Rehkopf2019). Nevertheless, there is still considerable disagreement across different studies employing this design due to heterogeneity in study features such as the included instrument, the examined number of years around the reform, or the populations included (see Hamad, Elser, Tran, Rehkopf, & Goodman, Reference Hamad, Elser, Tran, Rehkopf and Goodman2018). Additionally, the policy shift only affects those that would otherwise have left school earlier, meaning that we study a Local Average Treatment Effect (LATE) in this context. This is important to keep in mind when interpreting results, since this limits the generalizability of findings to those not affected by the reform (Ichino & Winter-Ebmer, Reference Ichino and Winter-Ebmer1999). For the subgroup of individuals affected by the reform, we also assume monotonicity, i.e. there are no individuals for whom the reform decreases their educational duration.
Another quasi-experimental design controlling for several forms of confounding using observational data is the sibling-control design. Comparing outcomes of biological siblings brought up in the same family allows to control for shared environmental confounding (e.g. socioeconomic conditions during childhood), and for shared genetic predispositions. However, factors unique to one of the siblings but not the other and measurement error can still bias the results of sibling-control studies (Frisell, Reference Frisell2021). Additionally, even if we could control for all unshared confounders, the method would not help us determine the direction of causation. If we find that siblings who score higher on well-being also stay in school longer, this could be because well-being causally increases school-leaving age, but the reverse is as likely: school-leaving age might causally increase well-being.
In Mendelian randomization (MR), one or more genetic variant(s) robustly associated with a predictor variable are used as instrumental variables to examine a potentially causal association between a predictor and outcome. The approach relies on Mendel's laws of segregation and independent assortment, which assume that genetic variants are inherited randomly from one's parents and independent from other genetic variants. Assuming that (1) the genetic variants are robustly associated with the exposure, (2) there are no unmeasured confounders of the instrument–outcome association, and (3) the genetic variants are not associated with the outcome of interest other than via the exposure (no pleiotropy), the genetic variants for an exposure can be used as instruments to examine potential causality between the exposure and an outcome. For example, a genetic variant associated with educational duration that is also indirectly associated with higher well-being (through its association with educational duration) provides supportive evidence of a causal association from education on well-being. Multiple studies have used MR to examine causal links between EA and health-related traits, with suggestive evidence for causal influences on traits like alcohol consumption, physical activity, and cardiovascular outcomes (Davies, Dickson, Davey Smith, Windmeijer, & van den Berg, Reference Davies, Hill, Anderson, Sanderson, Deary and Davey Smith2019b; Gill, Efstathiadou, Cawood, Tzoulaki, & Dehghan, Reference Gill, Efstathiadou, Cawood, Tzoulaki and Dehghan2019). Importantly, these associations are only valid if the three key assumptions mentioned above are met. Unfortunately, it is often difficult to evaluate if the assumption of no pleiotropy is met, as many, or even most, genetic variants exert pleiotropic effects. In addition, unmodeled assortative mating, dynastic effects, and population stratification can spuriously induce associations between the genetic variant(s) and outcomes (Brumpton et al., Reference Brumpton, Sanderson, Heilbron, Hartwig, Harrison, Vie and Davies2020).
A further development of MR is the application of this method in the context of within-family analysis (Brumpton et al., Reference Brumpton, Sanderson, Heilbron, Hartwig, Harrison, Vie and Davies2020). By performing genetic instrumental variable within sibling pairs, we directly control for the influences of assortative mating, population stratification (siblings share the same population background), and dynastic effects. First, since genetic variants inherited by siblings are random within a family, genotype differences between siblings will be independent of assortative mating. Second, since the effects of parental wealth and status on their offspring is likely similar across siblings, genetic differences between siblings will be independent of dynastic effects. Lastly, genetic differences between siblings are independent of population stratification. Using within-sibling MR, Brumpton et al. demonstrate that conventional non-family MR estimates for the association between taller height/lower body mass index (BMI) and increased EA were almost entirely attenuated in the context of within-family MR (Brumpton et al., Reference Brumpton, Sanderson, Heilbron, Hartwig, Harrison, Vie and Davies2020). Similarly, Davies et al. used a sibling sample to check if identified associations between EA and different health measures were due to dynastic effects or assortative mating (Davies et al., Reference Davies, Dickson, Davey Smith, Windmeijer and van den Berg2019a, Reference Davies, Hill, Anderson, Sanderson, Deary and Davey Smith2019b). They found little evidence that the within-family results were different from bivariate two-sample MR, but also note a probable lack of power.
While within-family MR has important advantages over conventional MR, it is nevertheless still fallible to unmet assumptions (e.g. the presence of pleiotropy) and is also less powerful as it is applied only in siblings within a larger sample. For both the conventional and the within-family MR, we assume monotonicity (i.e. the genetic variants do not have opposite effects in subgroups of people) and interpret identified effects as LATE.
There are various methods for examining causality in observational data, but all rely on strict assumptions that often are difficult to meet or evaluate. A way in which we can reduce our reliance on these individual assumptions is by applying multiple methods and evaluate the consistency of results and potential discrepancies therein, in light of the biases that accompany each of these methods. In a study where the effect of BMI on different outcomes was assessed, the authors used both MR (subject to family-level confounding) and non-genetic and genetic within-family analyses (subject to reverse causation) (Howe et al., Reference Howe, Kanayalal, Harrison, Beaumont, Davies, Frayling and Tyrrell2020). By verifying that these methods converge upon the same conclusion, the authors increase the certainty that the results were not a by-product of their respective biases. In a similar fashion, Davies et al. examined potential causal effects of education on health, mortality, and income using both a design where they leverage the raising of school-leaving age (ROSLA) and MR, with both methods suggesting similar effects for almost all outcomes (Davies, Dickson, Davey Smith, Windmeijer, & van den Berg, Reference Davies, Dickson, Davey Smith, Windmeijer and van den Berg2021).
For the current project, we are interested in causal influences on specific aspects of well-being, anxiety and mood disorders, and cardiovascular health. As educational effects on well-being are of primary interest to us, we depart from treating ‘well-being’ as a single unified outcome and separately consider effects on satisfaction with family relations, work, friendships, health, and finances (Schimmack, Reference Schimmack2008). We rely on four widely accepted techniques for causal inference: we make use of a random natural policy shift in England and Wales in September 1972 that raised school-leaving age from 15 to 16 but is unlikely to be related to confounding factors. We perform analyses within sibships to control for shared environmental confounders, and partly control for shared genetics. We make use of an index of genetic variation related to EA as an instrumental variable in MR. Finally, we combine the genetic instrumental variable with within-family analysis in sibling pairs. We apply those techniques in a single homogenously measured sample (the UK Biobank [UKB]), minimizing variation in results due to differences in measurement. By assessing if these different methods converge on the same conclusion in terms of whether or not there is a causal effect of educational duration on the different outcomes, we can be more confident in our conclusions on the potential causal relation between education and the different outcomes.
Methods
This project was pre-registered at the Open Science Framework (https://osf.io/s6gha). Deviations from the pre-registration are indicated throughout the manuscript.
Sample
We used data from the UKB, a large UK cohort study which collected genetic and phenotypic data on ±500 000 participants between 40 and 69 years old at recruitment (Bycroft et al., Reference Bycroft, Freeman, Petkova, Band, Elliott, Sharp and Marchini2018). For the current project, we selected individuals of European ancestry (a decision taken to minimize ancestral confounding in genetic analyses) that were born in England and Wales (to ensure participants were likely affected by the school-leaving age reform). Specific further sample selection procedures for the four different analyses are described below per analysis, and a flowchart of sample selection per analysis is found in online Supplementary Fig. S1.
Education variable
We used UKB data-field 845 ‘age completed full-time education’ as our education exposure variable. Participants were asked to answer the question ‘at what age did you complete your continuous full-time education?’. If someone provided an answer below 5, or an answer higher than their age, the answer was rejected. If someone answered with an age higher than 40, the participant was asked to confirm their answer. Since the question was not collected in participants who indicated having a college or university degree, we, in line with the literature (Davies et al., Reference Davies, Dickson, Smith, Van Den Berg and Windmeijer2018; Plotnikov et al., Reference Plotnikov, Williams, Atan, Davies, Mojarrad and Guggenheim2020), imputed their age at completed full-time education as 21. In case someone provided an answer on more than one instance, we used the last available answer as the age at which one completed their full-time education. If the answer at the later time-point indicated a lower age than a previous answer (N = 72), we coded the answer as missing.
Outcome variables
General information on item construction and cleaning procedures for these variables can be found in the Supplementary Methods. The following self-report items were included as well-being outcome variables: general happiness based on happiness (UKB ID 4526) and general happiness (UKB ID 20459), family relationship satisfaction (UKB ID 4559), financial situation satisfaction (UKB ID 4581), friendship satisfaction (UKB ID 4570), work/job satisfaction (UKB ID 4537), health satisfaction based on health satisfaction (UKB ID 4548) and general happiness with own health (UKB ID 20459), and belief that own life is meaningful (UKB ID 20460). All items were coded so that a higher score indicated a higher level of well-being. For neuroticism, we included a summary score (UKB ID 20127) that was based on 12 neurotic domain self-report items. We used a combination of medical record data (UKB ID 41270) and self-report data (UKB ID 20002) to create binary variables reflecting if someone was ever diagnosed with depression, anxiety, or manic or bipolar disorder. Lastly, a binary variable indicating cardiovascular problems was constructed based on vascular/heart problems diagnosed by a doctor (UKB ID 6150) or self-reported (UKB ID 20002).
Control outcomes
We selected four negative control outcomes: height (UKB ID 50), birthweight (UKB ID 20022), comparative body size at age 10 (UKB ID 1687), and comparative height size at age 10 (UKB ID 1697). It is unlikely these variables are causally influenced by additional years of schooling, but the presence of confounding parental variables (e.g. parental SES) might lead to observable but false-positive associations. As a positive control outcome, we included average total household income before tax (UKB ID 738), which was split into the four yes/no dichotomous variables: income over 18k, income over 31k, income over 52k, and income over 100k. General information on item construction and cleaning procedures for these variables can also be found in the Supplementary Methods.
Covariates
As phenotypic covariates, we included sex (UKB ID 31), assessment center (UKB ID 54), family size (based on number of [adopted] siblings, UKB IDs 1873, 3972, 1883, and 3982), season of birth (based on month of birth, UKB ID 52), and year of birth (UKB ID 34). Genetic covariates included the first 10 genomic principal components (PCs) and batch (UKB ID 22000).
Genotype data
Single-nucleotide polymorphisms (SNPs) from HapMap3 (CEU: Utah residents with Northern and Western European Ancestry) (1 345 801 SNPs) were filtered out of the imputed dataset. A pre-principal component analysis (PCA) quality control (QC) was done on unrelated individuals, filtering out SNPs with minor allele frequency (MAF) <0.01 and missingness >0.05, leaving 1 252 123 SNPs. After filtering out individuals with non-European ancestry, the SNP QC was repeated on unrelated Europeans (N = 312 927). SNPs with MAF <0.01, missingness >0.05, and Hardy-Weinberg equilibrium (HWE) p < 10−10 were filtered, leaving 1 246 531 SNPs. The HWE p-value threshold of 10−10 was based on: http://www.nealelab.is/blog/2019/9/17/genotyped-snps-in-uk-biobank-failing-hardy-weinberg-equilibrium-test. A final dataset of 1 246 531 QC-ed SNPs was created for 456 028 UKB subjects of European ancestry.
UKB correction
While the UKB is a valuable dataset where a large number of participants have been genotyped and extensively phenotyped, it is not necessarily representative of the UK population due to confounding from volunteer bias (Batty, Gale, Kivimäki, Deary, & Bell, Reference Batty, Gale, Kivimäki, Deary and Bell2020). To partially correct for volunteer bias, we calculate and include inverse probability weights using procedures by van Alten, Domingue, Galama, and Marees (Reference van Alten, Domingue, Galama and Marees2022). The respondents are weighted using weights based on sex, year of birth (5-year cohort), education level, ethnicity, region of residence (Census Greater London Area), tenure of dwelling, employment status, number of cars in the household, a dummy indicating whether the person lives in a single-person household, and self-reported health. For a more detailed description, see van Alten et al. (Reference van Alten, Domingue, Galama and Marees2022).
Analyses
We use four different methods to examine potential causal effects between educational duration and our outcomes. Table 1 provides an overview of these four methods, including their respective advantages and limitations. Sample descriptives per method can be found in Table 2. Below, we describe each of the four methods in more detail. All analysis code is available at https://github.com/margotvandeweijer/EA_causality. All continuous outcomes were standardized so that the resulting effect sizes reflect the s.d. increase in the outcomes for each additional year of education (see Table 2 for an overview of the s.d.s of the included variables).
* All methods are susceptible for bias from selection/collider bias.
*For some participants, sex information is missing.
Instrumental variable analysis leveraging the ROSLA
We used the ROSLA policy reform where the minimum school-leaving age was increased from 15 to 16 in England and Wales to examine the effects of longer schooling on our different outcomes. We selected a sample of UKB participants born in a 5-year window (1 February 1955 to 1 February 1960) around the reform (1 September 1972), and excluded related individuals (KING kinship coefficient >0.0884) using the ukbtools package in R (Hanscombe, Coleman, Traylor, & Lewis, Reference Hanscombe, Coleman, Traylor and Lewis2019). A binary ROSLA indicator was created for this subset of participants that indicates if a participant was born before (affected = 0) or after (affected = 1) 1 September 1957 and was thus affected by the reform or not. Additionally, we transformed the age at which one left full-time education variable into a binary variable that indicates if an individual stayed in school after age 15 or not (Davies et al., Reference Davies, Dickson, Davey Smith, Windmeijer and van den Berg2019a). Next, we used two-stage least squares (2SLS) instrumental variable analyses using the fixest R package (Bergé, Reference Bergé2018), where in the first stage the binary education variable was included as the dependent variable and the binary ROSLA indicator was included as the instrument. In the second stage, we regressed all our standardized outcome variables on the fitted education values from the first-stage regression. Both stages included the phenotypic covariates. For comparative purposes, we also run regular (non-pre-registered) ordinary least squares (OLS) regression in the same sample the binary education predictor was used to predict the different outcomes (including the same covariates as the ROSLA analyses). To examine the robustness of the ROSLA results, we repeated the analyses using samples born in a 2 and 10 years window around the reform.
Sibling control design
We perform analyses within sibships to control for shared familial background characteristics, and partly control for genetic effects. Biological sibships in the UKB dataset are defined as participants with a kinship coefficient between ${1 \over {2^{5/2}}}$ and ${1 \over {2^{3/2}}}$ and a probability of zero identical-by-state sharing >0.0012 (Bycroft et al., Reference Bycroft, Freeman, Petkova, Band, Elliott, Sharp and Marchini2018; Manichaikul et al., Reference Manichaikul, Mychaleckyj, Rich, Daly, Sale and Chen2010). Individuals indicating they were adopted were removed from this sample. For each sibship j with i siblings, we start by calculating the average age at which sibships left full-time education $\overline {edu_{oj}} = \sum\nolimits_1^m {edu_{ij}/m}$. Next, we calculate each sibling's deviation from the sibship average: $edu_{\Delta ij} = edu_{ij}-\overline {edu_{0j}}$. We use these estimates in a linear model where each outcome Y ij for sibling i in sibship j is predicted as follows:
where β B is the between-sibship effect estimating if the average school-leaving age within sibships is associated with our outcomes, and β W is the within-sibship effect estimating if a sibling deviating from the sibship school-leaving age average is associated with our outcome measures. Since we examine the effect of these within- and between-sibship estimates on the outcomes of individual siblings, we excluded sibships where only one sibling reported on educational duration, but we did not exclude sibships where not all siblings reported on one or more outcome measures. We report robust standard errors taking into account familial clustering, calculated using the coeftest function from the lmtest r-package (Hothorn et al., Reference Hothorn, Zeileis, Farebrother (pan.f), Cummins (pan.f), Millo and Mitchell2022). All phenotypic covariates were included in the analyses.
Mendelian randomization
We used polygenic scores (PGS) for EA in 2SLS instrumental variable analysis as genetic instruments for testing a directed causal association between educational duration and the outcomes. PGS are aggregate measures of genetic susceptibility for a trait of interest weighted by effect size estimates from genome-wide association studies (Choi, Mak, & O'Reilly, Reference Choi, Mak and O'Reilly2020). To calculate the PGS for EA, we used the summary statistics from the Genome Wide Association Study (GWAS) of years of education by Lee et al. (Reference Lee, Wedow, Okbay, Kong, Maghzian, Zacher and Cesarini2018), excluding 23andme and British cohorts (N = ~245k). PGS were constructed from the set of genome-wide significant HapMap3 SNPs (p < 5 × 10−8), pruned to be independent (using the package TwoSampleMR [Hemani et al., Reference Hemani, Zheng, Elsworth, Wade, Haberland, Baird and Haycock2018]) using a clumping window of 1000 kb and a linkage disequilibrium (LD) cut-off of R 2 = 0.1. The PGS prediction accuracy for EA was assessed based on the incremental R 2 when including the PGS in a regression with all covariates.
Next, the PGS was used as a genetic instrument in 2SLS instrumental variable analysis in a sample of unrelated UKB participants (KING kinship coefficient >0.0884). In the first stage, we predicted age at which one left full-time education (standardized) from the PGSs. In the second stage, the outcome and control outcomes were predicted from the fitted education values. All phenotypic and genetic covariates were included as covariates in both stages. The MR analyses were conducted using the fixest package in R (Bergé, Reference Bergé2018). For comparison, we also perform regular (non-pre-registered) OLS regression in the same sample, where standardized age at which one left full-time education is used to predict the outcomes, whilst correcting for the phenotypic covariates.
Mendelian randomization in sibships
Since one of the limitations of MR is its susceptibility to residual confounding stemming from dynastic effects, population stratification, and assortative mating, we additionally perform MR within sibships. We identify siblings in UKB and calculate each sibling's deviation from the sibship average using the same methodology as used for the sibling control design (see ‘Sibling control design’). Additionally, we use the PGSs calculated for the MR analyses (see ‘Mendelian randomization’) to calculate a PGS average within sibships: $\overline {PGS_{oj}} = \sum\nolimits_1^m {PGS_{ij}/m}$, and each sibling's deviation from the sibship average: $PGI_{\Delta ij} = PGI_{ij}-\overline {PGI_{0j}}$. We use these deviation estimates in instrumental variable regression (using the fixest package), where in the first stage we predict the sibling education deviation from the sibling PGS deviation. Next, the outcome and control outcomes were predicted from the first-stage fitted education values. Similar to the within-sibling analyses, we excluded sibships where only one sibling reported on EA, but we did not exclude sibships where not all siblings reported on one or more outcome measures. We report robust standard errors taking into account familial clustering, calculated using the coeftest function from the lmtest r-package (Hothorn et al., Reference Hothorn, Zeileis, Farebrother (pan.f), Cummins (pan.f), Millo and Mitchell2022). Both the phenotypic and genetic covariates were included.
Pre-registered interpretation of results
We define an unambiguous causal association as one where the policy shift, the sibling control design, and the Mendelian randomization analyses all imply a significant result in the same direction. The absence of significance across these methods would imply the absence of such a result. Due to the lower power associated with our within-sibship MR analyses, we are satisfied if the magnitude and direction of the Mendelian randomization within siblings is consistent with the other methods. With respect to statistical significance and multiple testing, we use two significance thresholds: (1) a suggestive threshold where we correct for the number of outcomes (15), so that α = 0.05/15 = 0.003, and (2) a conservative threshold where we correct for the number of outcomes (15) and analysis types (4), so that α = 0.05/60 = 0.0008. Inconsistencies across results will be interpreted along the potential biases and assumptions that accompany the different methods.
Results
Instrumental variable analysis leveraging the ROSLA
Table 3 depicts the results of the ROSLA instrumental variable analyses. Based on the 2SLS models, none of the outcomes are significantly predicted by age at which one left full-time education. This contrasts our comparative OLS analyses, which do not control for unmeasured confounders, where most associations were significant. The F-statistic of the 2SLS analyses ranged from 216.9 to 1205.9 depending on the outcome of interest, indicating that our instrument is unlikely to suffer from weak instrument bias. Since the standard errors are relatively large and the Wu–Hausman statistics, which test for the absence of endogeneity, were almost always non-significant at α = 0.05, it is suggested that the 2SLS and OLS models do not statistically differ. However, the methods do lead to different estimates, suggesting the OLS results are nonetheless subject to considerable bias. Examining these associations in a 2- or 10-year window around the reform did not change our conclusions (see online Supplementary Table S1).
Note. All continuous outcomes were standardized. Assessment center, sex, season of birth, and year of birth were included as covariates.
p-values indicated in bold are lower than the conservative p-value threshold of 0.0008.
a H0 is the absence of endogeneity of the instrumented variables.
These findings contrast earlier findings by Davies et al. (Reference Davies, Dickson, Smith, Van Den Berg and Windmeijer2018). Using instrumental variable regression in UKB, they did observe an effect of remaining in school after age 15 on different cardiovascular outcomes and income. The main difference between the current study and the Davies et al. study is the method of correcting for year of birth, where they used a difference-in-difference approach instead of including this variable as a covariate. Therefore, we performed supplementary (non-preregistered) analyses where we, in a step-wise fashion, added season of birth and year of birth. The results are shown in online Supplementary Table S2 and Fig. 1. While adding year of birth as covariates might increase the chance that we are overcorrecting, it is evident from these results that the use of a policy experiment as an instrumental variable is very sensitive to the model specification: inclusion year of birth renders previously significant associations with happiness, familial, financial, and work satisfaction, cardiovascular problems, income, birthweight, and height non-significant.
Sibling control design
In total, there were 15 237 families with sibships of at least two siblings. The number of included individuals per outcome varied (see Table 4 for the sample size per outcome). The intra-class correlation for education, reflecting the amount of total variation in education explained by the family-level, was 0.40. Table 4 presents the within- and between-sibship estimates from the sibling control analyses. For the main outcomes, the between-sibling estimates for school-leaving age were significantly associated (based on the conservative α = 0.0008 threshold) with happiness, health satisfaction, family satisfaction, financial satisfaction, work satisfaction, neuroticism, anxiety, and cardiovascular problems. However, the within-sibling estimates (indicating a potential causal effect) were only significant for financial satisfaction (β = 0.025, s.e. = 0.006, p = 9.70 × 10−6), and neuroticism (β = −0.016, s.e. = 0.004, p = 3.67 × 10−5), indicating a positive association between longer education and financial satisfaction and a negative association with neuroticism. With respect to the positive control outcomes, all between- and within-sibship estimates were significant. Between-sibling estimates for negative control outcomes also showed significant positive associations with age at leaving school, except for comparative body size at age 10. As expected, the within-sibling estimates were however not significant for birthweight and comparative height/body size at age 10, but surprisingly still significant for height, suggesting that the within-family estimates are not adequately correcting for all sources of bias.
p-values indicated in bold are lower than the conservative p-value threshold of 0.0008.
Mendelian randomization
The EA PGS predicted 0.28% of the variance in school-leaving age, which is similar to the predictive power of the EA PGS from Lee et al. that was based on the genome-wide significant SNPs based on a similarly sized (N = 293 723) discovery GWAS (Okbay et al., Reference Okbay, Beauchamp, Fontana, Lee, Pers, Rietveld and Benjamin2016). Despite the relatively low predictive power, the F-values from our MR analyses ranged from 240.1 to 957.1 for the different outcomes, indicating that the PGS did not suffer from weak instrument bias.
The full results from the MR analyses are shown in Table 5. Six outcomes were significantly associated with school-leaving age based on our conservative significance threshold of α = 0.0008: we found positive associations with health satisfaction (β = 0.37, s.e. = 0.05, p = 1.31 × 10−14) and financial satisfaction (β = 0.43, s.e. = 0.06, p = 5.96 × 10−13), and negative associations with friendship satisfaction (β = −0.23, s.e. = 0.06, p = 4.02 × 10−5), neuroticism (β = −0.23, s.e. = 0.04, p = 1.67 × 10−9), depression (β = −0.04, s.e. = 0.01, p = 0.0003), and cardiovascular problems (β = −0.07, s.e. = 0.01, p = 9.32 × 10−6). All positive control outcomes, and the negative control outcomes height and height at age 10 were significantly associated with school-leaving age. For comparison, we also associated the outcomes with age at which one left full-time education in regular OLS regression. We did so to examine if analyses that do not take into account causality through a genetic instrument would indicate an association. When doing so, all outcomes except happiness, meaning in life, and bipolar disorder were significantly associated.
Note. Sex, family size, season of birth, year of birth, assessment center, batch, and the first 10 genomic PCs were included as covariates for the MR analyses. The OLS regression is the prediction of the outcomes with education including the same covariates, with the exception of batch and the genomic PCs. All continuous outcomes and age at which one left full-time education were standardized.
p-values indicated in bold are lower than the conservative p-value threshold of 0.0008.
Mendelian randomization in sibships
The results from the MR analyses within sibships can be found in Table 6. While our instrument was much less powerful than in regular MR, all F-statistics except the F-statistic for meaning in life were higher than 10 (which is commonly used as a rule of thumb to avoid bias [Lawlor, Harbord, Sterne, Timpson, & Smith, Reference Lawlor, Harbord, Sterne, Timpson and Smith2008]). None of the associations (for both control and outcome variables) were significant after correcting for multiple testing.
p-values indicated in bold are lower than the conservative p-value threshold of 0.0008.
Interpretation of results
An overview of all results from the four different methods can be found in Fig. 2. As mentioned in the methods section, we define an unambiguous result as one which is consistent across all methods. Additionally, due to the lower power associated with our within-sibship MR analyses, we are satisfied if the magnitude and direction of the Mendelian randomization within siblings is consistent with the other methods.
The ROSLA estimates displayed in Fig. 2 reflect the associations where year of birth was included as a covariate (as pre-registered). With respect to our main outcomes, we found non-significant associations across all four methods for: happiness, family satisfaction, work satisfaction, meaning in life, anxiety, and bipolar disorder. Educational duration was positively associated with financial satisfaction, and negatively associated with neuroticism in the sibling-control and MR analyses, but these associations were non-significant in both the ROSLA and within-sibling MR. The within-sibling MR estimate for financial satisfaction was in the same direction, and of comparable magnitude as the conventional MR results, whereas the result for neuroticism was in the opposite direction. Lastly, educational duration was significantly positively associated with health satisfaction and significantly negatively associated with friendship satisfaction, depression, and cardiovascular outcomes in the conventional MR analyses only. Thus, overall, the different analyses do not seem to converge on a consistent conclusion in terms of whether there is a causal effect.
We included different income classes as positive control outcomes, as we expected educational duration to causally influence income. Only in the sibling control and MR analyses were the different income variables significantly associated with educational duration. The non-significant within-family MR estimates for income were in the same direction but of slightly smaller magnitude as the conventional MR. When not including year of birth as a covariate in the ROSLA analyses, the first three income classes were significantly associated with educational duration, suggesting a potential overcorrection in our ROSLA analyses (online Supplementary Table S2). With respect to our negative controls, height was significantly associated with education in both the sibling-control and MR analyses. Additionally, comparative body height at age 10 was significantly associated with education in the MR analyses. Associations with these negative control phenotypes suggest the possible presence of residual bias.
Discussion
Our study was designed to disentangle causal effects from confounding in the association between educational duration and different well-being, and mental and physical health indicators. To this end, we applied four established techniques for causal inference to a homogeneous sample, the UKB. We find consistent non-significant associations for happiness, family satisfaction, work satisfaction, meaning in life, depression, anxiety, and bipolar disorder. However, we do not find robust significant associations across all four methods for health satisfaction, friendship satisfaction, financial satisfaction, neuroticism, and cardiovascular outcomes. The absence of significant consistent results suggests that associations between educational duration and well-being, mental and physical health are largely confounded or biased by reverse causation. Alternatively, a small causal effect may exist but power in one or some of our techniques may have been insufficient to detect it.
Overall, in our first set of analyses (based on the ROSLA), we do not find an effect of educational duration on any of the outcomes, including our positive controls. This contradicts an earlier study similarly examining the causal effect of education in the UK in light of the ROSLA reform, where a causal effect was found for different cardiovascular outcomes, income, and height. The main difference between the current study and the Davies et al. study is the method of correcting for year of birth. Whereas Davies et al. employ a difference-in-difference approach where the data are stratified by year of birth, we directly include year of birth as a covariate. To examine the effect of year of birth on the associations, we ran supplementary analyses where we compare the associations with and without year of birth as a covariate. When including year of birth, educational duration no longer significantly influenced our positive control outcomes (the four income classes), suggesting that by including year of birth in this set of analyses we might be overcorrecting. This degree of sensitivity of the model to inclusion of covariates, and to what way covariates are taken into account, does complicate the interpretation of our findings.
Although the ROSLA analyses did not result in significant associations, we found an effect of educational duration on health satisfaction, friendship satisfaction, depression, and cardiovascular outcomes, but only in the conventional (non-family) MR analyses. The finding that these associations were only significant in the MR analyses might indicate that one of the assumptions was violated, such as the no pleiotropy assumption. Moreover, we found significant associations with financial satisfaction and neuroticism in both the conventional MR and within-sibship analyses. For the latter, we found that people who stayed in school longer had higher financial satisfaction and lower neuroticism. While these associations were not significant in the within-sibling MR analyses, the direction of effect was consistent. We can therefore interpret these associations as at least suggestive evidence in three out of four analyses. Besides our potential overcorrection issue, it could be argued that discrepancies with the ROSLA analyses are caused by the caveat that the ROSLA results only apply to those who would have left school at age 15 in the absence of the reform. In this sense, the ROSLA results are less generalizable to the population than the other methods as the reform did not affect those who would have stayed in school until age 16 or later irrespective of the reform. Additionally, one of the possibilities that comes to mind when interpreting the results for financial satisfaction is the potential mediation of income. We therefore re-ran the sibling control analyses including the mean income and the sibling deviation in income from the sibship income average as covariates (Saunders, McGue, & Malone, Reference Saunders, McGue and Malone2019) (see online Supplementary Table S3). We found that the standardized estimates for both financial satisfaction and neuroticism decreased substantially (with similar standard errors), resulting in non-significant associations controlling for income, suggesting that the association between educational duration and these two outcomes may be mediated through income. Since financial satisfaction is partly the result of one's income, this finding is not surprising. With respect to neuroticism, a previous MR study found evidence for bidirectional causality between education and neuroticism, but did not consider potential mediation of income (Nagel et al., Reference Nagel, Jansen, Stringer, Watanabe, de Leeuw, Bryois and Posthuma2018).
The most consistent finding to emerge from the data is the lack of evidence for a causal effect of educational duration on happiness, family satisfaction, work satisfaction, meaning in life, anxiety, and bipolar disorder. The association between educational duration and these outcomes was non-significant, irrespective of which method was applied, while OLS suggested there was an association. In these OLS analyses, we observe that almost all outcomes were significantly predicted by school-leaving age. It is therefore likely that the OLS associations are subject to confounding and/or reverse causation, and are unlikely to reflect direct, causal effects. One note in the context of our findings is that we examine variation in educational duration from a minimal school-leaving age onward, and do not examine the effect of attending education in general. It is therefore important to note that schooling in general has important pecuniary and non-pecuniary consequences (Grossman, Reference Grossman, Hanushek and Welch2006), but that our study suggests a lack of evidence for causal effects of variation in educational duration on variation in (mental) health outcomes beyond a minimum school-leaving age. Additionally, important to note is that the current project focused on potential causal effects of educational duration on (mental) health, and not reverse effect or bi-directional causality. It might be the case that, for some phenotypes, there is a causal effect from health on educational duration. In addition, we focused on linear associations, disregarding potential non-linear effects. While these two directions were beyond the scope of the current project, they would be interesting directions for future research.
When applying the methods used here in isolation, it is often difficult or impossible to evaluate all the respective limitations and assumptions. A strength of this study is that we try to minimize our reliance on any one set of assumptions by applying various existing approaches for causal inference that rely on different assumptions to account for possible confounding and bias, and triangulate results. In doing so, we found that the different causal inference approaches led to heterogeneous results. Since we investigate the same measures in (largely) the same population, differences in results across methods are most likely attributable to the methods themselves. Importantly, if we decided to focus on only one of these methods for the current paper, we would have drawn very different conclusions than we do now. With respect to health satisfaction, friendship satisfaction, and cardiovascular outcomes, these were only significantly predicted by educational duration in the conventional MR analyses, but not in any of the other analyses. Evaluating this discrepancy considering the characteristics of the different methods, it is possible that these associations are caused by a familial or population effect that is uncontrolled for in conventional MR but is controlled for in the other analyses. Additionally, the MR sample was the largest sample we examined, and it might be the case that an increase in sample size for the ROSLA and within-sibship analyses would allow us to detect smaller effects that remain undetected using the current sample size. This is also reflected in the confidence intervals for the results from the different methods (Fig. 2), where we see relatively precise estimates for the sibling control and MR analyses, and relatively imprecise estimates for the within-sibling MR and especially the ROSLA analyses. Therefore, it might be the case that the lack of significant results in the ROSLA and within-sibling MR is due to a lack of power. Regardless of power, the negative control traits suggest a reliance on MR alone risks false-positive results for obvious reasons: the significant causal effects of educational duration on birthweight and height at age 10 cannot be true effects.
While we tried to account for the limitations of the separate methods by means of triangulation, our results are still sensitive to our sample and measurement characteristics. We used a relatively homogeneous sample that allowed for a straightforward comparison between methods, but this also limits the generalizability of our findings. More specifically, the UKB sample is known to suffer from a ‘healthy volunteer’ bias, where participants are more healthy than the general population (Fry et al., Reference Fry, Littlejohns, Sudlow, Doherty, Adamska, Sprosen and Ellen2017; Munafò, Tilling, Taylor, Evans, & Davey Smith, Reference Munafò, Tilling, Taylor, Evans and Davey Smith2018). Additionally, participants are more likely to be older, female, and live in more socioeconomically advantaged areas than non-participants (Fry et al., Reference Fry, Littlejohns, Sudlow, Doherty, Adamska, Sprosen and Ellen2017). To (partly) correct for the confounding stemming from volunteer bias, we used weights as calculated in van Alten et al. (Reference van Alten, Domingue, Galama and Marees2022). While these weights help us correct for volunteer bias, it is important to stress that the examined sample is still a WEIRD (Western, Educated, Industrialized, Rich, Democratic) sample. It is therefore unknown if these results generalize to non-WEIRD societies. We also tested if the conclusions of the sibling control analyses would change if we did not restrict to European Ancestry individuals born in England and Wales. We did not find different results in this larger set of siblings (see online Supplementary Table 4). Second, we used relatively broad, imprecise phenotype and disease definitions. For example, we included all depression diagnoses present in UKB under the umbrella ‘depression’, and all cardiovascular-related diagnoses under the umbrella ‘cardiovascular outcomes’. For our continuous phenotypes (except neuroticism), we used single items to measure the phenotypes. It is possible that more precise phenotype and disease definitions could reduce measurement error and influence power. Additionally, it has been argued that quantitative education measures such as years of education is not an optimal measure of education, especially in the context of non-pecuniary returns of education (Oreopoulos & Salvanes, Reference Oreopoulos and Salvanes2009). More qualitative measures of education, such as teaching methods or curricula differences, might be better suited in this context, but these data are difficult to acquire and analyze on a large scale. Our quantitative measure of education was moreover collected through self-report questionnaires, and those responses might have suffered from recall bias (i.e. participants might not remember school-leaving age accurately). We did use single items for our well-being phenotypes, but we did not treat well-being as a unidimensional construct. Alternative to looking at a general well-being item or sum-score, we assessed if educational duration influenced specific well-being aspects, such as work satisfaction and meaning in life. However, these well-being measures also relied on self-report, which might have introduced measurement error. When examining the test–retest reliability for education and happiness, which were measured on multiple occasions, we find moderate test–retest reliabilities between 0.57 and 0.68, which is in line with previous findings on the stability of well-being (Anusic & Schimmack, Reference Anusic and Schimmack2016). Lastly, the included analyses do not necessarily estimate the same type of causal effect. While the ROSLA and conventional MR identify LATE effects, the sibling analyses identify Conditional Average Treatment Effects (CATE). For ROSLA and MR, this means that we estimate an effect in those whose treatment (educational duration) would differ if the value of the instrumental variable differed. However, note that under different assumptions, an MR effect can be identified as an Average Treatment Effect (ATE). An MR effect can be identified as ATE when we assume the exposure (i.e. education) has the same causal effect on the outcomes for the whole population, or when the instrument (i.e. genetic variants associated with education) has a consistent effect on the exposure, and that this effect is the same for the whole population. In addition, ATE can be identified under the no simultaneous heterogeneity assumption (NoSH), which assumes that the heterogeneity in the instrument–exposure effect and the exposure–outcome effect should be uncorrelated (Hartwig, Wang, Davey Smith, & Davies, Reference Hartwig, Wang, Davey Smith and Davies2023). For CATE, the causal interpretation we obtain only applies to our sub-population of sibling pairs, excluding singletons. In addition, if the exposure of one sibling affects the exposure of the other, the estimator describes what happens if the treatment affects both siblings (Petersen & Lange, Reference Petersen and Lange2020).
We used a natural experiment, sibling-control analysis, Mendelian randomization, and within-sibship Mendelian randomization to a large UK sample to disentangle potential causal effects of education duration on several mental and physical health outcomes. A comparison of results across these four methods illustrates that (1) associations between education and these several outcomes are largely confounded, and (2) triangulation of evidence across different methods is necessary to examine the results in light of their respective limitations. Notwithstanding the relatively limited generalizability of our findings across different cultures, time frames, and educational systems, this work provides valuable insight into the complexities of establishing the causal effects of EA on important life outcomes.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S003329172300329X.
Acknowledgements
This research has been conducted using the UK Biobank Resource under Application Number 40310.
Author contributions
All authors contributed to the study conception and design. Data analyses were performed by Margot van de Weijer, Perline Demange, and Michel Nivard. The first draft of the manuscript was written by Margot van de Weijer and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Funding statement
Margot van de Weijer, Dirk Pelt, and Meike Bartels are supported by the European Research Council Consolidator Grant (ERC-2017-COG 771057 WELL-BEING PI Bartels). Margot van de Weijer is funded by the European Union (ERC, UNRAVEL-CAUSALITY, project nr. 101076686). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. Perline Demange is supported by the grant 531003014 from The Netherlands Organisation for Health Research and Development (ZonMW). Michel Nivard is supported by R01MH120219, ZonMW grants 849200011919 and 531003014 from The Netherlands Organisation for Health Research and Development, a VENI grant awarded by NWO (VI.Veni.191G.030), and is a Jacobs Foundation Research Fellow.
Competing interests
The authors have no relevant financial or non-financial interests to disclose.
Ethical standards
The authors assert that all procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008.