The last two decades have witnessed many research initiatives and claims to demonstrate the importance of genotype × environment interaction (G × E) for human behavior and its disorders. The intimate connection between statistical demonstrations of G × E and the instruments employed to measure behavior has received less attention. This ambiguity has long been recognized by plant and animal geneticists (see, e.g., Mather & Jinks, Reference Mather and Jinks1982), but its implications for the substantive significance of claims to detect G × E for human behavior have been virtually ignored.
With the aid of simulated data, this article shows that the problem of inferring G × E is acute for the types of data (symptom counts and checklists) that form the mainstay of most psychiatric genetic epidemiology, to the point that claims to find G × E for psychiatric outcomes must seriously be questioned in the absence of any deeper analysis of the underlying biology or the relationship between behavioral dimensions and the instruments used to assess them.
Outline of Approach
(1) Simulated twin data are generated for a continuous latent behavioral trait and a continuous correlated index of environmental risk on the assumption that a measured environmental covariate does not modulate any contributions of genes, shared family environment, or individual specific environment to differences in the simulated behavioral trait. (2) The latent trait and environmental index are ‘scored’ using a simple test procedure simulating those frequently used in behavioral measurement to generate symptom counts and environmental checklist scores with properties similar to those often found in psychiatric genetic epidemiology. (3) A series of statistical tests and analyses are performed on the simulated liabilities, derived test scores and transformed test scores to detect and estimate the main effects of genes, measured and unmeasured environment and the apparent modulation (‘interaction’) of genetic and environmental contributions by the environmental covariate. (4) An integrated model is developed and tested that recovers both the parameters of the genetic model, allowing for any ‘true’ G × E on the underlying liability, and the parameters of a typical measurement model relating the responses of subjects to survey items to differences in the underlying variables they are designed to assess.
Data Simulation
Simulated pairwise data vectors were generated for environmental covariates and continuous outcomes for each of 10,000 monozygotic (MZ) and 10,000 dizygotic (DZ) twin pairs from multivariate normal distributions with zero mean vectors and covariance matrices shown in Table 1. The underlying ‘true’ model assumes that there is a main effect of the covariate on outcome but no modulation of the contributions of genes and environment by the covariate. The expected covariance matrices correspond to a path model (Figure 1) in which the environmental covariate is equally correlated between MZ and DZ twins (i.e., there is no within-person, genotype–environment correlation for the covariate) and explains all of the shared environmental component in the outcome. The model allows genetic and residual environmental effects on the outcome but no modulation of the genetic and environmental path coefficients by the environmental covariate (i.e., no G × E interaction or other modulation).
In practice, researchers do not have direct access to the ‘true’ dimensions of liability but attempt to measure them with plausible ‘instruments’. In psychiatric genetic epidemiology, the instruments typically comprise checklists of symptoms and/or risk factors such as life events or features of the social environment. For the purposes of this study, we assume that the ‘environment’ is assessed by a checklist of 10 dichotomous items and that behavior is assessed by a list of 25 dichotomous symptoms. These numbers were chosen to correspond to some in one of our own studies. Responses to multi-item checklists of covariates and symptoms were simulated for each twin on the assumption that the probability of a subject endorsing the kth item is a monotonic function of increasing subject liability, θ. In our case θ corresponds to the subject's simulated ‘true’ covariate or outcome value. The probability that the ith subject endorses the kth item is thus assumed to be:
where Φ(θ) is the standardized cumulative normal p.d.f. (normal ogive), bk is the value of θ at which Φ(θ) changes most rapidly (also referred to as the ‘item difficulty’ or ‘threshold’) and sk is the rate of change of Φ at bk (also referred to as the ‘sensitivity’ or ‘discriminating power’ of the item).
The simulation assumed that the item parameters were the same for all items in a checklist. Thus, for the environmental checklist, it was assumed that bk = 1.3 and sk = 1 for all k. Similarly, all the items in the simulated behavioral inventory were assumed to be equivalent in difficulty and sensitivity with bk = 1.5 and sk = 1 for all k. The item parameters were selected to generate sum score values with distributions resembling those in some of our own data. Ideally, the items should be selected to represent different thresholds across the entire practical range of the latent trait (see, e.g., Lord & Novick, Reference Lord and Novick1968). In practice, however, this is seldom possible for relatively infrequent symptoms of abnormal behavior and environmental indicators (see, e.g., Eaves et al., Reference Eaves, Erkanli, Silberg, Angold, Maes and Foley2005).
The simulated 0/1 item responses for the environmental covariate and behavioral checklist were added to generate a sum score for the environment and symptom count for each subject. Figure 2a–d shows the distributions of the latent trait scores and checklist sum scores for the behavioral and environmental checklists. The reversed J-shaped distribution is typical of many clinical measures derived from symptom counts.
The matrices of twin correlations for the latent traits and checklist scores are given in Table 2. The correlations for the square roots of the sum scores for the symptoms and environmental items are also included because scale transformation is a typical precursor to statistical analysis in which the effects of ignoring heteroscedasticity are expected to generate apparent interactions. The choice of square-root transformation (see Bartlett, Reference Bartlett1947) for this application is suggested by the scoring of items showing relatively infrequent endorsement (e.g., Eaves & Eysenck, Reference Eaves and Eysenck1977). Other scales suggest other transformations (see, e.g., Eaves et al., Reference Eaves, Eysenck and Martin1989) but, ultimately, the best approach is to integrate the genetic model for the latent trait and covariate(s) with the psychometric model for how these are projected onto the items chosen to measure them (see below). The twin correlations for the derived scores are lower than those for the latent traits because of the inherent stochastic error attached to each item response.
Twin correlations for outcome phenotype and environmental covariate shown in bold.
Analysis of Simulated Data to Detect G × E Interaction
Regression of Intra-Pair Outcome Variance on Pair Mean Outcome for MZ and DZ Twins
A seminal paper by Jinks and Fulker (Reference Jinks and Fulker1970) pointed out that the expected means of separated MZ twin pairs were estimates of genetic effects and the absolute intra-pair differences were estimates of the effects of random environmental effects. If sensitivity to the environment, measured by intra-pair variability was a function of average genetic deviation of the pair, Jinks and Fulker predicted a significant regression of absolute intra-pair difference (or intra-pair variance) on pair means for separated twins. In principle, the regression of variance on mean need not be linear or monotonic. The same basic approach, used with other types of relative pairs (e.g., MZ and DZ twins reared together), would reveal other kinds of interaction involving the modulation of intra-pair differences by sources of between pair variation (e.g., Eaves & Eysenck, Reference Eaves and Eysenck1976). Figure 3a–c shows the (linear) regression of intra-pair variances on pair means for the behavioral outcome in MZ and DZ twins: latent trait (Figure 3a), sum scores (Figure 3b), and square-root transformed sum scores (Figure 3c).
As expected, Figure 3a shows little evidence that intra-pair variance changes with pair mean for the simulated latent trait scores, although the regression line for DZ twins is significantly more elevated than that for MZs due the segregation of genetic differences within pairs of DZ twins. Indeed, twice the difference between the elevations of the line for DZs and MZs is an estimate of the genetic variance in the measured trait. By contrast, the sum scores show a markedly different picture with a very steep regression of intra-pair variance on mean for MZ pairs, indicating that the apparent effects of the individual twins’ environments are modulated by the sources of differences between pairs, including G × E interaction if the differences between pairs are partly genetic. The slope for DZs is much steeper than that for MZs, reflecting the apparent increase in the importance of genetic factors as a function of differences between families. Without a deeper examination, it might be tempting to infer that G × E interaction is responsible for the increase in genetic variance as a function of increasing ‘dose’ of the environmental covariate. The difference between the apparent absence of interaction in Figure 2a and the very marked interaction in Figure 2b says nothing about biological process but depends entirely on the characteristics of the specific behavioral items chosen to measure it. Other items or other ways of scaling could produce entirely different patterns of interaction. Thus, patterns of interaction may lead no deeper than the theory of measurement, if indeed there is one. The point is emphasized by the fact that a square-root transformation (e.g., Bartlett, Reference Bartlett1947) of the sum scores derived from basic statistical considerations removes much of the apparent support for non-additivity (Figure 3c).
Regression of Intra-Pair Outcome Variance on Pair Mean Environmental Covariate for MZ and DZ Twins
The above analysis does not specify any specific environmental covariate. The simulated data include a ‘measured’ environmental covariate that is highly correlated between twin pairs and has a substantial linear relationship with the underlying behavioral trait. Figure 4a–c shows the regression of intra-pair outcome variances on pair mean environment for MZ and DZ pairs when behavior and outcome are both assessed by latent trait (Figure 4a), sum score (Figure 4b), and square root transformed sum score (Figure 4c.). The graphs now illustrate the modulation of genetic and environmental within-pair variance of MZ and DZ twins by the shared environmental covariate.
The regressions on pair mean environmental covariate resemble those in Figure 3 for the regression on mean phenotype score. There is little or no evidence of any slope in Figure 4a; the lines are flat, with that for DZs being more elevated than that for MZs as a function of the genetic contribution to the outcome. Analysis of sum scores from the behavioral and environmental checklists yields striking support for the increase in genetic contribution as a function of the environmental covariate (G × E, Figure 4b). The same pattern is to be expected with any covariate that has a linear relationship to the latent trait when checklist scores are employed to summarize the behavior of interest. As before, the square-root transformation removes much of the evidence for environmental modulation of genetic and environmental components of variance with twin pairs (Figure 4c).
Fitting Structural Models Allowing for the Environmental Modulation of Genetic and Environmental Path Coefficients
The above analyses are instructive and require nothing more than twin data and the most basic resources for data summary and graphical display. Such basic considerations are often overlooked, given the availability of efficient software for structural modeling of family resemblance such as Mx (Neale et al., Reference Neale, Boker, Xie and Maes2006) and openMx (Boker et al., Reference Boker, Neale, Maes, Wilde, Spiegel, Brick and Fox2011). However, the availability of ready-built code for a variety of models allows the investigator to demonstrate relative sophistication without ever looking critically at how the assumptions of a specific model map onto those used to generate the data to which it is applied.
The dependence between scale of measurement and the results of genetic modeling is revealed by fitting a variety of models to the simulated outcome measures of behavior, scored as a latent trait, checklist sum score, and square-root transformed sum score allowing for the fixed main effect, β, of the measured environmental covariate on outcome, the random effects of genes (a), residual shared environment (c), and individual unique environment (e), and the modulation of a, c, and e by the fixed environmental covariate (parameters γ, η, and δ, respectively). This approach is a similar to that proposed by Purcell (Reference Purcell2002), treating the covariate as a fixed effect in a manner comparable to those used in the analysis of G × E in other species (see, e.g., Bucio-Alanis & Hill, Reference Bucio-Alanis and Hill1966; Perkins & Jinks, Reference Perkins and Jinks1973; and references).
Four models were fitted to each of the three assessments of outcome by full information maximum likelihood (FIML) in openMx (Boker et al., Reference Boker, Neale, Maes, Wilde, Spiegel, Brick and Fox2011). (1) Estimating the mean, μ, and paths from genes (a), shared environment (c), and unique environment (e) to outcome without any regression of outcome on measured environment (β) or modulation of a, c, and e by the measured environment (i.e., no interaction). (2) Estimating all the parameters of model 1 plus the main effect of the environmental covariate on outcome (β). (3) Estimating the parameters of model 2 plus the modulation (γ) of the genetic path a by the environment (G × E interaction). (4) A ‘full’ model including the parameters of model 3 plus the addition modulation of c and e by the measured environment (i.e., adding interaction between the measured environment and the random effects of the shared and unique environment).
Results of fitting the four models to the three pairs of simulated outcomes and environments are summarized in Table 3.
See Table 4 for definition of symbols. Path coefficients are not standardized to facilitate comparison of estimated effects as a function of adding interactions to model.
All three measurement scales show strong support for the main effects of genes, environment and covariate on outcome. The estimates of β are all highly significant and the improvement in fit over the model that excludes any effect of covariate is substantial. These are very large samples for a twin study, but even a study one-tenth of the size (1,000 of both MZ and DZ pairs) is expected to yield highly significant main effects of the shared environment. As with the simple variance-mean regressions (see Figure 4), support for environmental modulation of random genetic and environmental effects is contingent on the units chosen to measure behavior. When the latent outcome and environment are both measured directly there is absolutely no gain in support from adding G × E (χ2(1) = 0.3) or modulation of the shared and unique environment (χ2(2) = 2.4). The picture is sharply different when analysis is based on checklist sum scores. There is now highly significant G × E (χ2(1)= 977.6) and strong additional support for modulation of c and e (χ2(2)= 362.6). The evidence for interaction is likely to remain even in samples one-tenth as large or smaller with these effect sizes. Square-root transformation removes all significant support for modulation of genetic and environmental effects by the measured environment. Thus, all evidence for interaction between genes, environment, and an environmental covariate is generated by summarizing behavior by adding responses to checklists of items.
Integrating Models for the Causes of Variation With Theories of Measurement
The above analyses confirm the intimate connection between the model for the effects of genes and environment and the instrument used to assess behavior and its covariates. The conclusions of a genetic analysis are acutely sensitive to the units of measurement when it comes to interpreting G × E interaction and other kinds of non-additivity at anything other than the most superficial level in the absence of any theory about how individual differences in liability translate into the measures used for analysis.
Such results, long familiar to statistical geneticists, require that claims to detect G × E for human behavior need to be viewed far more critically than is typical in the current climate. At the very least it is incumbent on those working in the area take the time to evaluate the scientific significance of their findings more carefully by taking into account the theory of measurement that is typically ignored in psychiatric genetics. As a final step, the first 2,500 MZ and DZ pairs (5,000 pairs, in total) of the 200,000 simulated pairs above were employed to demonstrate the kind of approach that investigators could employ to deepen theoretical and substantive understanding of the processes they profess to study. The reduction in sample size significantly reduces computation time without undermining the basic conclusion.
The model comprises two elements: ‘genetic’ and ‘psychometric’. The genetic model comprises the parameters that characterize the additive and non-additive effects of genes and environment on variation in the latent trait (θ) such as that employed above in Table 3. The psychometric model comprises the function that maps onto θ the scores or responses used for assessment. In our example, where the simulated checklist items are all assumed to have the same endorsement probability, p(θ), the log-likelihood that a subject with trait value θ endorses r out of N items is given by the binomial theorem:
and the unconditional likelihood of a specific outcome score, r, is the integral
The analytical task devolves upon estimating the parameters relating p(θ) to θ, that is, the parameters of the psychometric model and those of the ‘genetic’ model accounting for the variation in the latent trait θ: the additive genetic and environmental paths, a, c, and e, including the regression of liability on the covariate and the coefficients modulating the main effects of genes and environment, γ, η, and δ.The assumption of equivalent items requires only the estimation of one free parameter of the psychometric model: the item difficulty (threshold), b. The sensitivity, s, is fixed at 1, allowing the paths a, c and e to set the scale of variance in the latent trait.
The model for the trait in the jth twin of the ith pair is thus:
where Xi,j is the corresponding covariate value. In this application, the latent trait covariate score, not the environmental checklist score, was used to simplify the model. Ai,j, Ci,j, and Ei,j are the random additive genetic, shared environmental and unique environmental effects standardized to zero mean and unit variance. The correlation between the genetic effects is assumed to be 1 for MZ pairs and ½ for DZs. The correlation between the shared environmental effects of twins is 1 and that for the unique environments is 0.
Estimates of the model parameters may be obtained in a variety of ways. We employed a Markov Chain Monte Carlo approach (MCMC) using the Gibbs sampler (see, e.g., Gilks et al., Reference Gilks, Richardson and Spiegelhalter1996) implemented in BUGS (Bayesian analysis using Gibbs sampling; Lunn et al., Reference Lunn, Jackson, Best, Thomas and Spiegelhalter2012). Elsewhere we have used this approach to estimate parameters of models combining elements of genetic and item-response theory (Eaves et al., Reference Eaves, Erkanli, Silberg, Angold, Maes and Foley2005) in twin data and for models involving G × E interaction and correlation (Coventry et al., Reference Coventry, James, Eaves, Gordon, Gillespie, Ryan and Wray2010; Eaves et al., Reference Eaves, Silberg and Erkanli2003; Wray et al., Reference Wray, Coventry, James, Montgomery, Eaves and Martin2008). The MCMC approach is especially convenient for fitting models that require large amounts of numerical integration to compute likelihoods and has the added flexibility of yielding estimates of the standard errors and confidence intervals of parameters at little additional computational cost.
Results of fitting the full combined psychometric and non-additive genetic model are summarized in Table 4. Estimates are obtained from every fifth MCMC sample from 30,000 iterations (N = 6,000) after a 20,000 iteration ‘burn in’. Computations required approximately six hours on a MacBook Pro laptop computer.
The results are striking. Once the theory of measurement is integrated with the genetic model there is absolutely no evidence of G × E interaction or of any modulation of the shared and unique environmental parameters by the environmental covariate. The 95% confidence intervals of γ,η, and δ all lie either side of zero, and the median estimates of the modulation parameters are all close to zero. Furthermore, the estimates of the main genetic and environmental parameters and the regression of latent trait on measured environment are all close to those used in simulating the original data. In particular, after the effect of the measured environment is extracted, there is no residual contribution of the shared environment to outcome liability (-0.221 < c < 0.236) and the estimated residual correlation of MZ twins (0.660) is close to twice that estimated for DZs (0.339), as expected when only additive genetic effects contribute to residual twin resemblance.
Discussion
The psychometric model assumed in the simulation and MCMC analysis is simpler than might apply in practice. Items may not be equivalent and the detailed implications of different methods of environmental assessment have still to be explored. Nevertheless, our simulation captures the essence of the problem and it is unlikely that elaboration of the measurement model with different item difficulties and sensitivities will substantially change our conclusions here.
Geneticists have long recognized that phenotypic distributions depend on parameters of the underlying genetic model for individual differences. Thus, the moments of distributions in kinship data are well known to depend on factors as diverse as the number of loci affecting a complex trait (e.g., Carmelli et al., Reference Carmelli, Karlin and Williams1979; Karlin et al., Reference Karlin, Williams and Carmelli1981), their dominant and epistatic interactions (e.g., Fisher et al., Reference Fisher, Immer and Tedin1932), and G × E interactions (e.g. Jinks and Fulker, Reference Jinks and Fulker1970). On the whole, geneticists have been far more cautious about the using the properties of observed phenotypic distributions in kinships to infer subtleties of the genetic architecture of complex traits because a variety of more or less arbitrary factors, having little or nothing to do with genetics, can affect the more subtle features of trait distributions. Paramount among such factors are those arising from the fact that the scales used to measure variation have an ill-defined relationship to underlying biological differences such that changes in the units or method of measurement can lead to drastically different conclusions about the genetic architecture of the underlying biological system. Mather and Jinks (Reference Mather and Jinks1982) offer a classical statement of the interdependence of measurement and genetic inference:
The scale on which the measurements are expressed for the purposes of genetical analysis must therefore be reached by empirical means. Obviously it should be one which facilitates both the analysis of the data and the interpretation and use of the resulting statistics. The scale should preferably be one on which . . . the interactions among the genes and between genotype and environment are absent, or at any rate as small as they can reasonably be made. (p. 64, italics added)
Lack of attention to this goal leaves in question the heuristic value of claims to find G × E in psychiatric data.
Mather and Jinks (Reference Mather and Jinks1982) recommend that ‘so far as possible the non-allelic genes and non-heritable agents should all be additive in action’ but also caution that such scales may be hard to find since ‘each gene and each non-heritable agent may be acting on its own scale’ and the elegance of a parsimonious additive model may be elusive. The problem is that psychiatric geneticists seldom bother to look. We are not blessed with decisions as simple as whether to measure body weight in kilograms or log-kilograms, though even here the choice of scale will not be neutral with respect to conclusions about the contributions of additive and non-additive effects.
With respect to behavior, the issue often comes down to decisions about which constellations of items, combined in which way, best characterize the salient latent behavioral outcomes and psychosocial risk factors of interest. The relationship between the numbers generated by a test and the way genes and environment work is tenuous and theory-dependent. There is an intimate connection between the choice of measure, and conclusions drawn about the relative importance of genes, environment and the various possible interactions between them. Elegant pictures of the role of G × E interaction, such as those in Figures 3 and 4, may be no more robust than the items selected to measure the hypothesized latent variable, the rule used to combine them, or how the scores are scaled after they have been combined. It has long been recognized that as simple a step as taking the square root or arcsine of a personality test score can remove otherwise striking evidence of G × E and sex differences in genetic effects in twin data (e.g., Eaves & Eysenck, Reference Eaves and Eysenck1977; Eaves et al., Reference Eaves, Eysenck and Martin1989) and yield a far more parsimonious and heuristically powerful, additive genetic, and environmental model. Indeed, it has been shown that analysis of psychological test scores that ignore the relationship between test and the underlying trait can even simulate the segregation of a single gene of large effect in kinship data (Eaves, Reference Eaves1983).
The recent emphasis on the importance of G × E in psychiatric genetics necessitates a cautionary re-examination of the relationship between measurement and G × E for the benefit of researchers, editors, and others who make critical decisions about the seriousness with which findings about G × E are to be received (see also Duncan & Keller, Reference Duncan and Keller2011).
The current academic and funding environment creates pressure to publish novel and more subtle findings. Yet, there may be nowhere where the risks of perpetuating and funding replicable artifact are greater than in attempts to demonstrate G × E for psychiatric phenotypes and risk factors assessed by checklists of items. The above analyses counsel the exercise of greater thoroughness before publishing claims to detect G × E interaction for psychiatric outcomes. Otherwise, statistically significant findings may reveal less about the biology of behavior than they do about cultural incentives to premature publication.
Acknowledgments
This research was supported by grant DA024413 from the U.S. Department of Health and Human Services (PI E Jane Costello, Duke University). I am grateful to Mike Neale, Tim York, Brad Verhulst, Judy Silberg, Nick Martin, and John Hewitt for the helpful advice and discussion.