Hostname: page-component-586b7cd67f-t8hqh Total loading time: 0 Render date: 2024-12-01T01:40:58.410Z Has data issue: false hasContentIssue false

Improved Lasso for genomic selection

Published online by Cambridge University Press:  14 December 2010

ANDRÉS LEGARRA*
Affiliation:
INRA, UR 631 SAGA, F-31326 Castanet-Tolosan, France
CHRISTÈLE ROBERT-GRANIÉ
Affiliation:
INRA, UR 631 SAGA, F-31326 Castanet-Tolosan, France
PASCAL CROISEAU
Affiliation:
INRA, UMR1313 GABI, F-78352 Jouy en Josas, France
FRANÇOIS GUILLAUME
Affiliation:
Institut de l'Elevage, F-75595 Paris, France
SÉBASTIEN FRITZ
Affiliation:
UNCEIA, F-75595 Paris, France
*
*Corresponding author. INRA, UR 631 SAGA, BP52627, F-31326 Castanet Tolosan, France. Tel: +33561285182. Fax: +33561285353. e-mail: [email protected]
Rights & Permissions [Opens in a new window]

Summary

Empirical experience with genomic selection in dairy cattle suggests that the distribution of the effects of single nucleotide polymorphisms (SNPs) might be far from normality for some traits. An alternative, avoiding the use of arbitrary prior information, is the Bayesian Lasso (BL). Regular BL uses a common variance parameter for residual and SNP effects (BL1Var). We propose here a BL with different residual and SNP effect variances (BL2Var), equivalent to the original Lasso formulation. The λ parameter in Lasso is related to genetic variation in the population. We also suggest precomputing individual variances of SNP effects by BL2Var, to be later used in a linear mixed model (HetVar-GBLUP). Models were tested in a cross-validation design including 1756 Holstein and 678 Montbéliarde French bulls, with 1216 and 451 bulls used as training data; 51 325 and 49 625 polymorphic SNP were used. Milk production traits were tested. Other methods tested included linear mixed models using variances inferred from pedigree estimates or integrated out from the data. Estimates of genetic variation in the population were close to pedigree estimates in BL2Var but not in BL1Var. BL1Var shrank breeding values too little because of the common variance. BL2Var was the most accurate method for prediction and accommodated well major genes, in particular for fat percentage. BL1Var was the least accurate. HetVar-GBLUP was almost as accurate as BL2Var and allows for simple computations and extensions.

Type
Research Papers
Copyright
Copyright © Cambridge University Press 2010

1. Introduction

Genome-wide strategies for genetic evaluation can be roughly divided into BLUP-like methods (postulating normal distribution of single nucleotide polymorphism (SNP) effects) and variable selection methods using more sophisticated distributions. The seminal paper of Meuwissen et al. (Reference Meuwissen, Hayes and Goddard2001) already made this distinction, by creating BLUP and Bayes (A, B) methods. In the first group, marker effects are posited normal distributions with zero mean and identical variance for all markers. This results in nice properties, like simplicity of computations and, in particular, an equivalent model using a ‘genomic’ relationship matrix (Van Raden, Reference Van Raden2008). The latter can be meshed with additive relationship matrices and extended to the whole pedigree (Legarra et al., Reference Legarra, Aguilar and Misztal2009). Further, under mild assumptions, equivalences exist between genetic variances in an additive relationship model and marker variances (Gianola et al., Reference Gianola, de los Campos, Hill, Manfredi and Fernando2009).

However, at least for some traits, it has been shown that departures of SNP effects from normality exist. This results in (and can be seen by) higher accuracy of methods with more sophisticated a priori distributions of the marker effects, like BayesA or non-linear regression (Hayes et al., Reference Hayes, Bowman, Chamberlain and Goddard2009; Van Raden et al., Reference Van Raden, Van Tassell, Wiggans, Sonstegard, Schnabel, Taylor and Schenkel2009b). These methods are sometimes called ‘Bayesian methods’ (Lund et al., Reference Lund, Sahana, de Koning, Su and Carlborg2009). This is inappropriate, because BLUP is also a Bayesian method, and also because they have frequentist counterparts (e.g. Usai et al., Reference Usai, Goddard and Hayes2009). Thus, we will call them ‘variable selection methods’ because most of them assume values of most SNP effects to be zero or close to zero. Another property of variable selection methods, shown in simulations, is that these methods have better properties in the long run, that is, estimates of SNP effects are stable after several generations (Habier et al., Reference Habier, Fernando and Dekkers2007). In addition, small (and possibly much cheaper) subsets of markers chosen by variable selection methods have been shown to be of acceptable accuracy (Weigel et al., Reference Weigel, de los Campos, González-Recio, Naya, Wu, Long, Rosa and Gianola2009). Thus, variable selection methods are being heavily used in simulations (Meuwissen et al., Reference Meuwissen, Hayes and Goddard2001; Calus et al., Reference Calus, Meuwissen, de Roos and Veerkamp2008; Kizilkaya et al., Reference Kizilkaya, Fernando and Garrick2010) and in real data analysis (Hayes et al., Reference Hayes, Bowman, Chamberlain and Goddard2009; Van Raden et al., Reference Van Raden, Van Tassell, Wiggans, Sonstegard, Schnabel, Taylor and Schenkel2009b).

Most variable-selection methods nevertheless require a priori distributions or tuning parameters. These include the number of SNPs a priori in the model and its variance (Meuwissen et al., Reference Meuwissen, Hayes and Goddard2001; Verbyla et al., Reference Verbyla, Hayes, Bowman and Goddard2009); or the ratio of variances of SNPs ‘in’ or ‘out’ (Calus et al., Reference Calus, Meuwissen, de Roos and Veerkamp2008; Verbyla et al., Reference Verbyla, Hayes, Bowman and Goddard2009); or the a priori variance of SNP effects (Kizilkaya et al., Reference Kizilkaya, Fernando and Garrick2010). No clear clue, based on biological knowledge, exists about these a priori distributions. This complicates their practical application.

The Lasso (least absolute shrinkage and selection operator; Tibshirani, Reference Tibshirani1996) combines variable selection and shrinkage. Its Bayesian counterpart, the Bayesian Lasso (Park & Casella, Reference Park and Casella2008) provides a more natural interpretation in terms of a priori distributions. It is well known that, generally, conditional expectations are optimal for selection (Gianola & Fernando, Reference Gianola and Fernando1986). These can be obtained through the Bayesian Lasso but not the regular Lasso. Also, in particular, Bayesian Lasso provides a fully parametric model with a simple Gibbs sampler implementation, as well as an EM algorithm for the estimation of the ‘sharpness’ parameter λ, needing little (or no) prior information. Thus, Bayesian Lasso is an attractive candidate for genomic selection because of its simplicity, computational ease and little (or no) need to postulate prior information. Further, the exponential distribution of the Lasso is thought to reflect reasonably well the nature of quantitative trait locus (QTL) effects (Goddard, Reference Goddard2008).

The (Bayesian or not) Lasso has been used in an animal breeding context (de los Campos et al., Reference de los Campos, Naya, Gianola, Crossa, Legarra, Manfredi, Weigel and Cotes2009; Usai et al., Reference Usai, Goddard and Hayes2009; Weigel et al., Reference Weigel, de los Campos, González-Recio, Naya, Wu, Long, Rosa and Gianola2009), albeit a broad comparison with related methods using several traits and a real, large data set has not yet been published. In addition, we find that the particular case of Park and Casella's Bayesian Lasso includes a common variance term for modelling both residual terms and effects in the model, instead of two different variances. We find that this parameterization is not optimal. The purpose of this paper is thus manifold. First, to propose and compare a different, more general, model for the Bayesian Lasso, which in fact is equivalent to Tibshirani's (Reference Tibshirani1996) original Lasso. This model implies different variances for residual terms and for SNP effects. Second, an alternative linear model for genomic prediction will be presented and tested empirically; in this model individual SNP variances are inferred via the Bayesian Lasso first and then used in a BLUP-like estimator. Third, we compare the performance of these models with a more standard ‘genomic BLUP’ (GBLUP) either fixing the variance for the marker effects from pedigree estimates applying a rough equivalence (Gianola et al., Reference Gianola, de los Campos, Hill, Manfredi and Fernando2009), or inferring and integrating it out from the data via the Gibbs Sampler.

2. Parameterization of the Bayesian Lasso

The base of the Lasso is a typical linear model of the form:

where b are fixed effects (e.g. an average mean), a are, in this work, SNP effects and MVN stands for multivariate normal. Originality of Lasso is in modelling effects a. In the Bayesian Lasso, the distribution of (a single) SNP effect a is modelled as

In the classical Lasso (Tibshirani, Reference Tibshirani1996) this distribution is actually p(a2, λ)=(λ/2)exp(−λ|a|); however, Tibshirani (Reference Tibshirani1996) assumes that incidence matrix Z has been standardized, which is not assumed here or in the Bayesian Lasso.

Finally, in Bayesian Lasso the variance of a is Var(a)=2σ22.

Intriguingly, as shown in the expressions above, in Bayesian Lasso applications in genomic selection (e.g. de los Campos et al., Reference de los Campos, Naya, Gianola, Crossa, Legarra, Manfredi, Weigel and Cotes2009; Weigel et al., Reference Weigel, de los Campos, González-Recio, Naya, Wu, Long, Rosa and Gianola2009) the variance σ2 has been used at the same time to model the residual term as well as the distribution of the SNP effects. However, we do expect the distribution of SNP effects not to be related to unobservable, unaccounted (residual) effects that can, for example, vary from site to site for the same individuals. Assume, for instance, a crop trial design in which some varieties are tested. Each variety can be tested 1 or 100 times. If the phenotype to be analysed is the average yield of the variety, everything else being equal, it is expected that the residual variation is divided by 10 in the second option, but not the variation across SNP effects. Another example is as follows. Assume that a set of dairy bulls is tested in two different locations, the second with less frequent milk recording. The second location will show higher residual variation for milk yield, whereas genetic variation in the bulls will be the same.

The implementation of the Bayesian Lasso in Park & Casella (Reference Park and Casella2008) does not take this into account. A more general implementation would split the sources of variation in purely residual (σe2) and variation due to SNPs (σa2), by rewriting the model as

However, this is clearly equivalent to

which is the original form of Tibshirani's (Reference Tibshirani1996) original Lasso, because only the ratio λ/σa is used and thus they cannot be estimated separately. Equivalently, the model could be written in terms of σa2 by dropping λ.

In the original Lasso, cross-validation is used for the estimation of λ (Usai et al., Reference Usai, Goddard and Hayes2009). Park & Casella (Reference Park and Casella2008) proposed a fully parametric implementation by computing a posterior distribution (using the Gibbs sampler) or an empirical Bayes estimation by marginal maximum likelihood by a Monte Carlo Expectation–Maximization (MCEM) algorithm. The latter avoids the problem of choosing a hyperprior for λ, pointed out by both Park & Casella (Reference Park and Casella2008) and de los Campos et al. (Reference de los Campos, Naya, Gianola, Crossa, Legarra, Manfredi, Weigel and Cotes2009).

The hierarchical formulation of Lasso shown above includes explicitly two sources of variation and is thus akin to classical models in quantitative genetics and genetic evaluation (Henderson, Reference Henderson1984; Falconer & Mackay, Reference Falconer and Mackay1996) where variation is split into environmental and genetic variances. The shape of the distribution of SNP effects is determined by λ, which effectively determines the variance of SNP effects by using Var (a)=2/λ2. Thus, λ plays the same role as the inverse of a standard deviation in normal models. This does not seem to have been recognized by previous scholars (de los Campos et al., Reference de los Campos, Naya, Gianola, Crossa, Legarra, Manfredi, Weigel and Cotes2009; Usai et al., Reference Usai, Goddard and Hayes2009).

Applying the same logic as in Gianola et al. (Reference Gianola, de los Campos, Hill, Manfredi and Fernando2009), and in ideal conditions, it is possible to establish a rough equivalence between genetic variance in a population (σu2; usually estimated by an additive, relationship-based model) and the variance of SNP effects:

where pi is the allelic frequency at the ith marker.

3. Estimation and cross-validation study

(i) Data

Two sets of bulls from French dairy cattle populations have been analysed from, respectively, Holstein (1756 bulls) and Montbéliarde (678 bulls) breeds. Bulls were genotyped with the Illumina Bovine SNP50 BeadChip. Markers were discarded based on low call rate, lack of positioning in the genome, or very high Mendelian inconsistency rate. No minor allele frequency threshold was imposed. Finally, 51 325 and 49 625 polymorphic SNP were, respectively, used in each breed. A cross-validation approach was used where 1216 and 451 bulls were taken as the training data and the rest as validation data. Bulls in the validation data set were, roughly, bulls being tested in 2004 and 2005, and younger than the training bulls. All parameter estimation in this work was carried out on the training population.

Data for training (y in the model) were daughter yield deviations (DYDs; Van Raden & Wiggans, Reference Van Raden and Wiggans1991) as computed with data available in 2004; data for validation were DYDs from data available in 2009. Thus, the validation mimics well a real scenario. To account for different accuracies in the estimation of DYDs, these were weighted by their prediction error variances (in terms of number of equivalent daughters) as estimated from regular genetic evaluation. This will be explained in more detail later.

Traits analysed were milk, fat and protein yields (MY, FY and PY) and fat and protein percentages (FP and PP). Several models were used. The estimation was mostly made by Bayesian methods using Markov Chain Monte Carlo (MCMC) as well as, for certain cases, a marginal maximum likelihood by an MCEM algorithm, as suggested by Park & Casella (Reference Park and Casella2008) to avoid the use of a hyperprior for λ. An example of marginal maximum likelihood in the genetics literature is the REML estimator of variance components (Patterson & Thompson, Reference Patterson and Thompson1971). The models used to analyse the data sets are described next.

(ii) Bayesian Lasso with genetic and residual variances (Bayesian lasso with two variances; BL2Var)

The model is as follows:

where y contains twice the DYDs for each bull, μ is a general mean, Z is an incidence matrix of SNP effects a, e is a vector of residuals and F is a diagonal matrix that contains, in the diagonal, the inverse of the number of equivalent daughters for each DYD. The parameterization of SNP effects is as in Van Raden (Reference Van Raden2008) : −2pi , 1−2pi , and 2−2pi for the genotypes 00, 01 and 11, where pi is the allelic frequency of ‘1’. In this way, assuming Hardy–Weinberg equilibrium, SNP genotypic effects are substitution effects with average effect of 0 in the population (Falconer & Mackay, Reference Falconer and Mackay1996), which is one of the conditions for the expression of the genetic variation in the population as (Gianola et al., Reference Gianola, de los Campos, Hill, Manfredi and Fernando2009). This parameterization also results in slightly better predictive abilities compared to other ones such as −1, 0, 1 for the 00, 01 and 11 genotypes (data not shown).

The prior distribution for σe2 was an inverted chi-square distributions with 4 degrees of freedom and expectations equal to the value used in regular genetic evaluation for σe2. Prior for λ was deliberately vague, being uniform between 0 and 1 000 000.

In practice, the model above was transformed in an equivalent model, yielding the same solutions, as follows:

which amounts to multiply each row of 1 and Z by the square root of the number of equivalent daughters, so that

which simplifies the computations.

A Gibbs sampler was implemented as in Park & Casella (Reference Park and Casella2008) or de los Campos et al. (Reference de los Campos, Naya, Gianola, Crossa, Legarra, Manfredi, Weigel and Cotes2009), via the introduction of additional (augmented) variables τi2, which can be seen as variance components for each SNP effect. The Gibbs sampler with residual update (Legarra & Misztal, Reference Legarra and Misztal2008) was used to speed up sampling of location parameters μ and a. The full conditional posterior distributions are as follows (the symbol indicates the current state of variable b):

or

or

where , zi* is the row of Z* corresponding to the ith effect and ai indicates all a variables except for ai . Further,

where IG stands for inverted Gaussian,

bounded between 0 and 1 000 000, and where G is a gamma distribution with shape ‘s’ and scale ‘sc’ and ‘nsnp’ is the number of a effects. Finally,

where S e2 is the scale of the a priori distribution of the residual variance and ndata is the number of records in y. For the inverted Gaussian distribution, we used the algorithm of Michael et al. (Reference Michael, Schucany and Haas1976) with a minor modification: extracting the largest root of the quadratic to avoid numerical cancellation.

For the MCEM estimation of λ (BL2Var-EM), the iterations proceed as above but sampling of λ is substituted by an updated estimate

where are obtained by MonteCarlo using the previous estimate of λ. In our case and after experimentation with one trait, the number of iterations to get was reduced to just one. This seems to be possible because the very large number (51 325) of variables included provides a reasonable estimate. At convergence, the last 100 samples of λ were averaged to obtain a MonteCarlo error-free estimate (as suggested by Park & Casella, Reference Park and Casella2008).

(iii) Bayesian Lasso with one variance (BL1Var)

The model by Park & Casella (Reference Park and Casella2008) and de los Campos et al. (Reference de los Campos, Naya, Gianola, Crossa, Legarra, Manfredi, Weigel and Cotes2009) postulates a one-variance component linked to a priori variation in both residual and SNP effects, and thus:

The conditional distributions are as above, with the following modifications:

and

where D is a diagonal matrix with τi2σe2 in the (i,i) position. This conditional distribution shows well that SNP effects are in practice considered as pseudo-residuals in the one-variance Bayesian Lasso.

(iv) Bayesian mixed model with unknown genetic and residual variances (MCMC-GBLUP)

This model is similar to the ‘BLUP’ model in Meuwissen et al. (Reference Meuwissen, Hayes and Goddard2001), although the variance components are not fixed a priori. Instead, they are estimated in the model as in Legarra et al. (Reference Legarra, Robert-Granié, Manfredi and Elsen2008) :

The prior distribution for σe2 is as in the Bayesian lasso with two variances; the prior distribution for σa2 was a chi-squared distribution with 4 degrees of freedom and expectation equal to ; σu2 being the genetic variance component used in genetic evaluation. The Gibbs sampler for this model has been extensively described (e.g. Sorensen & Gianola, Reference Sorensen and Gianola2002).

(v) Bayesian mixed model with known genetic and residual variances (GBLUP)

This model is as the previous one, except that variance components were assumed to be known with certainty and inferred from values used in current genetic evaluation, as for the priors in MCMC-GBLUP. To estimate solutions for μ and a, Henderson's (Reference Henderson1984) mixed model equations were used, which were solved by preconditioned conjugated gradients as described in Legarra & Misztal (Reference Legarra and Misztal2008).

(vi) Bayesian mixed model with heterogeneous genetic variances (Het-GBLUP)

This model assumes that components of overall genetic variation (λ and σe2) are known with certainty but allows for heterogeneous variances of SNP effects, which are τi2 for the ith SNP. In order to accommodate heterogeneous variances in a linear estimator, these have to be previously known. Thus, we followed a three-step procedure. First, λ and σe2 were estimated as in the Bayesian Lasso with two variances. Second, estimates of τi2 were computed by a Gibbs sampler (as the one in the Bayesian Lasso with two variances) with λ and σe2 fixed to their estimated values. Finally, a diagonal matrix D was formed to describe the heterogeneous variance, with in the (i,i) position. Thus, the model becomes

which is solvable by Henderson's mixed model equations as above.

All models above were fit to the five traits for the Holstein breed; for the Montbéliarde, only the Bayesian Lasso with two variances and GBLUP were fit. The MCEM was run for 50 000 iterations, with final convergence; the MCMC were run for 50 000 iterations with 25 000 of burn-in after which solutions for all unknowns were estimated by their posterior means. Self-made programs were written in Fortran95.

Different parameters were estimated. In addition to σe2 and λ, a rough equivalent of the classical, pedigree-based genetic variance (σu2) was estimated for each model (except for GBLUP where it is supposed fixed). For the Bayesian Lasso with two variances, this is . For the Bayesian Lasso with one variance, this is . For the MCMC-GBLUP, this is . A pedigree-based estimate of σu2 was obtained by REML for the Holstein breed using REMLF90 (Misztal et al., Reference Misztal, Tsuruta, Strabel, Auvray, Druet and Lee2002).

(vii) Cross-validation

Predictions (genomic estimated breeding values (GEBVs)) for the validation data test were computed as for the different models, and compared with 2009 progeny test-based DYDs. Predictive ability was measured as the correlation between both. The correlation was weighted by the number of equivalent daughters in 2009 DYD data. The formula for the weighted Pearson product moment correlation coefficient is as follows (e.g. Peers, Reference Peers1996):

where and and w are the weights for each data point. Cross-validation results for BL2Var-EM approach were essentially the same as BL2Var (correlation among EBVs was higher than 0·99) and are therefore not shown.

4. Results

(i) Estimates of parameters

Tables 1 (for Holstein) and 2 (for Montbéliarde) show estimates of λ parameters. Estimates are generally accurate as shown by their standard errors. Estimates from BL1Var are very similar across traits, which does not occur for BL2Var. On the other hand, estimates by full Bayesian inference (BL2Var) or marginal maximum likelihood (BL2Var – EM) are virtually identical in both Holstein (Table 1) and Montbéliarde (Table 2). Also, estimates in Holstein and in Montbéliarde are quite similar for the same traits.

Table 1. Estimates of ‘sharpness’ parameter λ (±se) in Holstein

MY, milk yield; FY, fat yield; PY, protein yield; FP, fat percentage; PP, protein percentage.

Table 2. Results in Montbéliarde: estimates (±se) of ‘sharpness’ parameter λ, of population genetic variance σu2 and accuracies r (correlations between GEBVs and 2DYDs in the validation data set)

a Divided by 1000.

MY, milk yield; FY, fat yield; PY, protein yield; FP, fat percentage; PP, protein percentage.

Estimates of genetic variation in the population (σu2) are shown in Tables 2 and 3. Estimates from BL2Var and MCMC-GBLUP (which are remarkably close) are similar to pedigree-based estimates by REML or to values currently used, so that they make clear biological sense and are understandable in terms of genetic variation. Differences of the REML estimate from current values can be explained by the incompleteness of the DYD data; differences among pedigree-based and SNP-based models can be due to this reason, but also to the different assumptions of both models, which make them not fully comparable, as discussed below. Estimates from BL1Var are clearly different from any other estimate and are not reliable as estimates of the genetic variation in the population.

Table 3. Estimates of population genetic variance σu2se) in Holstein

a As used in regular genetic evaluation.

b Divided by 1000.

MY, milk yield; FY, fat yield; PY, protein yield; FP, fat percentage; PP, protein percentage.

(ii) Empirical accuracies

Tables 4 (for Holstein) and 2 (for Montbéliarde) show empirical accuracies of predicted GEBVs. Possibly due to the relatively small data set, accuracies are lower than some previously reported estimates (Hayes et al., Reference Hayes, Bowman, Chamberlain and Goddard2009; Van Raden et al., Reference Van Raden, Van Tassell, Wiggans, Sonstegard, Schnabel, Taylor and Schenkel2009b) although comparable to accuracies in a small study of similar size (Luan et al., Reference Luan, Woolliams, Lien, Kent, Svendsen and Meuwissen2009).

Table 4. Accuracies: correlations between GEBVs and 2DYDs in the validation data set, in Holstein

MY, milk yield; FY, fat yield; PY, protein yield; FP, fat percentage; PP, protein percentage.

Accuracies are almost systematically highest for BL2Var, in particular for traits controlled by major genes as DGAT1 (Grisart et al., Reference Grisart, Coppieters, Farnir, Karim, Ford, Berzi, Cambisano, Mni, Reid, Simon, Spelman, Georges and Snell2002) (FP). GBLUP and MCMC-GBLUP perform similarly (with 10% less accuracy than BL2Var for FP). The two-step approach HetVar-GBLUP reaches similar accuracies to BL2Var, with only 2% less accuracy for FP. As for BL1Var, its performance is the poorest almost systematically, although the difference is minimal for FY and PY. Thus, results from previous users of Bayesian Lasso (e.g. Weigel et al., Reference Weigel, de los Campos, González-Recio, Naya, Wu, Long, Rosa and Gianola2009) might underestimate the true potential of Lasso. This is possibly trait dependent.

Results in Montbéliarde show no major difference between BL2Var and GBLUP. This is partly due to the fact that DGAT1 has an extremely low minor allele frequency (4%) in Montbéliarde and causes almost no genetic variation in the population (Gautier et al., Reference Gautier, Capitan, Fritz, Eggen, Boichard and Druet2007).

In general, models with the same accuracy are almost identical (correlations among EBVs higher than 0·99), as shown in Table 6 for MY and FP. This implies that errors in estimation of EBVs are very similar across methods. Models with different accuracies show, obviously, lower correlations. For example, correlations of EBVs estimated with BL2Var with those estimated by BL1Var and MCMC-GBLUP for FP are, respectively, 0·73 and 0·92.

Table 5. Regression coefficients b of 2DYDs on GEBVs in the validation data set, in Holstein

MY, milk yield; FY, fat yield; PY, protein yield; FP, fat percentage; PP, protein percentage.

Table 6. Correlation among GEBVs in the validation data set predicted by various methods for milk yield (above diagonal) and fat percentage (below diagonal), in Holstein

Table 5 shows regression coefficients of 2DYDs on GEBVs. These regression coefficients should ideally be 1, implying that the predicted has the same variance as the true value. This is relevant for the comparison of estimated breeding values across generations. Methods give often inflated variances of GEBVs (b<1) for yields. For contents, they oscillate around 1; the reason is that most genetic variation is well captured due to large QTL effects. However, BL1Var results in inflation for all traits because it does not shrink estimators enough. Even in the absence of genomic information, predictions of young bulls by parent average are known to be biased (Van Raden et al., Reference Van Raden, Tooker and Cole2009a). One explanation for the generally low value for b is pre-selection of validation bulls (Mäntysaari et al., Reference Mäntysaari, Liu and Van Raden2010); according to our information, this is actually not the case in the French industry. Another more likely explanation is lack of enough information, because in this work dams ‘information is not added to the genomic predictions. A combined index was suggested by Van Raden et al. (Reference Van Raden, Van Tassell, Wiggans, Sonstegard, Schnabel, Taylor and Schenkel2009b).

Figure 1 shows the estimates by HetVar-GBLUP of SNP effects for FP in chromosomes 13 (representative of the rest of the genome) and 14 in a log-10 scale. At the beginning of chromosome 14, the effect of DGAT1 can be appreciated, presenting a sharp peak even in the logarithmic scale of the plot. Peaks of this size cannot be observed elsewhere in the genome. Also, it can be observed that whereas most of the effect of the markers range between 10−2 and 10−4, a few have very low values of about 10−8; this corresponds to markers whose effects are ‘almost nullified’ by the estimate. As pointed out by Usai et al. (Reference Usai, Goddard and Hayes2009), in the original Lasso a joint mode is estimated and most markers are expected to have values of exactly zero; whereas in Bayesian Lasso, posterior means are estimated, possibly with small values but not zero. Posterior means are optimal for selection (Gianola & Fernando, Reference Gianola and Fernando1986; Goddard, Reference Goddard2008).

Fig. 1. Estimated effects of SNP loci for FP in Holstein by the HetVar-GBLUP method, for chromosomes 13 (crosses) and 14 (rounds).

5. Discussion

(i) Sense of hyperparameters in Bayesian Lasso and genetic variation in the population

Little has been discussed on estimates of λ in Bayesian Lasso for genomic selection. In BL1Var, values are similar across traits. This is possibly a by-product of modelling residual and SNP effects with a common σ2 parameter. However, in BL2Var, λ differs from trait to trait. Further, we provide an interpretation of the meaning of λ in a quantitative genetics context: it is a measure of the genetic variation, as in, for example,

This estimate of genetic variation in the population is appropriate for an ideal population in Hardy–Weinberg and linkage equilibrium (Gianola et al., Reference Gianola, de los Campos, Hill, Manfredi and Fernando2009). Yet, in spite of this assumption, estimates of σu2 are close to pedigree-based estimates and currently used estimates in the overall population for both BL2Var and MCMC-GBLUP, whereas estimates of σu2 in BL1Var do not agree well and make no biological sense. Thus, BL2Var has the advantage of providing biologically reasonable parameters. We recall that pedigree estimates of genetic variation are also ideal, assuming, for instance, unrelated base individuals; thus, both cannot be exactly compared.

(ii) Predictive ability and use of prior information

Predictive ability is optimal for BL2Var almost systematically. This agrees with Van Raden et al. (Reference Van Raden, Van Tassell, Wiggans, Sonstegard, Schnabel, Taylor and Schenkel2009b) who found better predictions for non-linear than for linear equations (GBLUP) for these traits. However, Luan et al. (Reference Luan, Woolliams, Lien, Kent, Svendsen and Meuwissen2009) found similar or better results for GBLUP than for non-linear (Mixture and BayesB) methods. For other traits (e.g. fertility) both Van Raden et al. (Reference Van Raden, Van Tassell, Wiggans, Sonstegard, Schnabel, Taylor and Schenkel2009b) and Hayes et al. (Reference Hayes, Bowman, Chamberlain and Goddard2009) found that GBLUP performed better than non-linear methods. In general, for large data sets, the difference seems to be negligible for most traits in dairy cattle except for FP and PP (Van Raden et al., Reference Van Raden, Van Tassell, Wiggans, Sonstegard, Schnabel, Taylor and Schenkel2009b).

A possible explanation for the superiority of BLVar2 over GBLUP is that we did use very little prior information or tuning parameters, extracting most information from data. This is a very important issue in genomic selection nowadays. As extensively shown by simulations (e.g., Meuwissen et al., Reference Meuwissen, Hayes and Goddard2001), the use of the correct – biological – a priori information results in better predictive abilities. However, current state of knowledge on QTL action and location does not allow the construction of this prior information, which is replaced by somehow controversial (e.g. Gianola et al., Reference Gianola, de los Campos, Hill, Manfredi and Fernando2009; Hill, Reference Hill2010) figures or deductions from population genetics theory. This prior information includes the number of QTLs for non-linear regression or BayesB (Meuwissen et al., Reference Meuwissen, Hayes and Goddard2001; Van Raden, Reference Van Raden2008); or the a priori variance of ‘true’ respect to ‘false’ SNP effects (Calus et al., Reference Calus, Meuwissen, de Roos and Veerkamp2008); or, yet, the a priori variance of SNP effects in BayesA and BayesB (Meuwissen et al., Reference Meuwissen, Hayes and Goddard2001). Some of this information does not vanish asymptotically as data cumulate (Gianola et al., Reference Gianola, de los Campos, Hill, Manfredi and Fernando2009).

Another option is the ‘trial and error’ of several ‘priors’ (i.e. Usai et al., Reference Usai, Goddard and Hayes2009) to find the best predictive ability. This is not real Bayesian (or parametric) inference. In fact, this is an estimation of parameters by trial and error, like the cross-validation methods used in non-parametric inference. This is indeed a legitimate strategy, albeit in practice it is hard to ascertain if all the parametric space of the prior has been covered or how to conceive the different priors. Another problem is that inference depends on the constitution of the validation data set, a difficult problem in an animal breeding context where data are correlated and selected.

In hierarchical Bayesian models, prior is established in high-order hyperparameters (e.g. variance components or λ), so that the influence of the prior vanishes asymptotically. This is used for example in MCMC-GBLUP (in this work) or in Bayesian Lasso (Park & Casella, Reference Park and Casella2008; de los Campos et al., Reference de los Campos, Naya, Gianola, Crossa, Legarra, Manfredi, Weigel and Cotes2009; Weigel et al., Reference Weigel, de los Campos, González-Recio, Naya, Wu, Long, Rosa and Gianola2009). This is routinely done for the estimation of genetic parameters via the Gibbs Sampler; the influence of the prior information is negligible for reasonably large data sets (Van Tassell et al., Reference Van Tassell and Van Vleck1996). There is little experience, though, on the practical influence of the prior information on genomic selection. de los Campos (Reference de los Campos, Naya, Gianola, Crossa, Legarra, Manfredi, Weigel and Cotes2009) found this influence to exist in estimates of λ, but not really on estimates of SNP effects. Indeed, priors for λ were difficult to conceive, because no natural interpretation on this parameter was recognized. From this work, a reasonable (but not exact) guess of λ is . At any rate, if priors are not sought, and as suggested by Park & Casella (Reference Park and Casella2008), the marginal maximum likelihood estimate can be used. We have shown the feasibility of this estimate by MC-EM, finding values similar to the use of low informative priors for λ. Thus, Bayesian Lasso is rather unique among parametric variable selection methods in genomic selection because it is readily estimable using fully parametric methods, either fully Bayesian or using marginal maximum likelihoods.

(iii) Shape of SNP effects

Figure 2 shows clearly that BL2Var is able to accommodate SNPs of large effect (i.e. around DGAT1) and also of small, almost nil, effects. Because of the nature of shrinkage caused by models positing a priori normal distributions, both features are generally difficult to attain, unless large amounts of information exist. It is unknown if these kinds of distributions (e.g. similar to double exponential) for SNP effects are frequent in nature or not, but the case of DGAT1 shows its importance in practice, at least for the dairy cattle industry. Because of this ability, BL2Var results in optimal predictive abilities.

Fig. 2. Theoretical distribution of SNP effects for fat content according to estimates of σe2, σa2 and λ in BL2Var (continuous line), BL1Var (grey dashed line) and MCMC-GBLUP normal model (dotted black line). The figure has been scaled so that the normal distribution has a variance of 1.

Figure 2 shows the theoretical distribution of SNP effects for FP in Holstein according to the distributions for a described in Methods and estimates for λ (for BL2Var), λ and σe2 (for BL1Var) and σa2 (for MCMC-GBLUP). FP is the trait with more differences observed for predictive ability in the cross-validation approach, and partially controlled by DGAT1. For the BL1Var approach, the peak is not very sharp and is indeed lower than for BL2Var. This produces a distribution for SNP effects with little shrinkage for BL1Var, which is also reflected in Table 6. Thus, BL1Var does not seem to shrink enough in the appropriate regions. This is reflected in its poor predictive ability. On the other hand, BL2Var shrinks more than the normal distribution for most of the domain except for effects of more than about 3 standard units. This illustrates well that in order to account for the distribution of SNP effects both the sharpness and the variance have to be considered.

(iv) Precomputation of variance of SNP effects

In HetVar-GBLUP, variances of SNPs are pre-computed via BL2Var. This results in good accuracies. This strategy presents several practical advantages. Computation of SNP solutions, once variances are known, is very fast following, for example, Legarra & Misztal (Reference Legarra and Misztal2008). Also, a genomic relationship matrix with the same results, can be constructed as G=ZDZ′σa2 (Goddard, Reference Goddard2008; Van Raden, Reference Van Raden2008), giving more importance to SNPs with large than small effect. Mixed model equations using G can be set up with several nice properties. Solving is quite simple (Van Raden, Reference Van Raden2008) even for singular G (Henderson, Reference Henderson1984). Pseudo-reliabilities can be constructed from their inverse; extensions exist to include full pedigree in the relationships and all data available (Legarra et al., Reference Legarra, Aguilar and Misztal2009; Aguilar et al., Reference Aguilar, Misztal, Johnson, Legarra, Tsuruta and Lawlor2010; Christensen & Lund, Reference Christensen and Lund2010). Models including an additional polygenic term (i.e. Guillaume et al., Reference Guillaume, Fritz, Boichard and Druet2008) can easily be set up.

We suggest, based on these practical considerations, a two-step procedure to include large amounts of data (i.e. genotyped cows, genotyped individuals with no record, or all genotyped and ungenotyped individuals). First, variances in D can be estimated in a small yet informative sample of data (e.g. bulls). Second, D can be used for genetic evaluation either for all genotyped animals or all animals in the population. Pre-computation is the strategy used for variance components in regular genetic evaluation. It is an open question whether this strategy is stable with time or across different strata in a population.

6. Conclusion

The Bayesian Lasso with different variances for residual or SNP effects (BL2Var), which is equivalent to the original Lasso (Tibshirani, Reference Tibshirani1996) is appropriate for genomic selection, with generally highest accuracies and less inflation of GEBVs than other methods included in this study. Park & Casella's (Reference Park and Casella2008) original BL1Var cannot be recommended because of inappropriate constraints in the model. We have shown how to estimate parameter λ with little (or no) prior information, and its biological interpretation in relation to genetic variation in the population. The inclusion of specific SNP variances in linear models is feasible by pre-computing the variances with the BL2Var. These methods should be further explored in other data sets including different traits and species.

This work was financed by ANR project AMASGEN (2009–2011). Project partly supported by Toulouse Midi-Pyrénées bioinformatic platform. We thank G. de los Campos for carefully explaining to us the Bayesian Lasso and for his initial code in R. Suggestions of the reviewers are gratefully acknowledged.

References

Aguilar, I., Misztal, I., Johnson, D. L., Legarra, A., Tsuruta, S. & Lawlor, T. J. (2010). A unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score. Journal of Dairy Science 93, 743752.CrossRefGoogle ScholarPubMed
Calus, M. P. L., Meuwissen, T. H. E., de Roos, A. P. W. & Veerkamp, R. F. (2008). Accuracy of genomic selection using different methods to define haplotypes. Genetics 178, 553561.CrossRefGoogle ScholarPubMed
Christensen, O. F. & Lund, M. S. (2010). Genomic prediction when some animals are not genotyped. Genetics Selection Evolution 42, 2.CrossRefGoogle Scholar
de los Campos, G., Naya, H., Gianola, D., Crossa, J., Legarra, A., Manfredi, E., Weigel, K. & Cotes, J. M. (2009). Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics 182, 375385.CrossRefGoogle ScholarPubMed
Falconer, D. & Mackay, T. (1996). Introduction to Quantitative Genetics. New York: Longman.Google Scholar
Gautier, M., Capitan, A., Fritz, S., Eggen, A., Boichard, D. & Druet, T. (2007). Characterization of the DGAT1K232A and variable number of tandem repeat polymorphisms in French dairy cattle. Journal of Dairy Science 90, 29802988.CrossRefGoogle Scholar
Gianola, D., de los Campos, G., Hill, W. G., Manfredi, E. & Fernando, R. L. (2009). Additive genetic variability and the Bayesian alphabet. Genetics 183, 347363.CrossRefGoogle ScholarPubMed
Gianola, D. & Fernando, R. L. (1986). Bayesian methods in animal breeding theory. Journal of Animal Science 63, 217244.CrossRefGoogle Scholar
Goddard, M. (2008). Genomic selection: prediction of accuracy and maximisation of long term response. Genetica 136, 245257.CrossRefGoogle ScholarPubMed
Grisart, B., Coppieters, W., Farnir, F., Karim, L., Ford, C., Berzi, P., Cambisano, N., Mni, M., Reid, S., Simon, P., Spelman, R., Georges, M. & Snell, R. (2002). Positional candidate cloning of a QTL in dairy cattle: Identification of a missense mutation in the bovine DGAT1 gene with major effect on milk yield and composition. Genome Research 12, 222231.CrossRefGoogle ScholarPubMed
Guillaume, F., Fritz, S., Boichard, D.& Druet, T. (2008). Correlations of marker-assisted breeding values with progeny-test breeding values for eight hundred ninety-nine French Holstein bulls. Journal of Dairy Science 91, 25202522.CrossRefGoogle ScholarPubMed
Habier, D., Fernando, R. L. & Dekkers, J. C. M. (2007). The impact of genetic relationship information on genome-assisted breeding values. Genetics 177, 23892397.CrossRefGoogle ScholarPubMed
Hayes, B. J., Bowman, P. J., Chamberlain, A. J. & Goddard, M. E. (2009). Invited review: Genomic selection in dairy cattle: Progress and challenges. Journal of Dairy Science 92, 433443.CrossRefGoogle ScholarPubMed
Henderson, C. R. (1984). Applications of Linear Models in Animal Breeding. Guelph: University of Guelph.Google Scholar
Hill, W. G. (2010). Understanding and using quantitative genetic variation. Philosophical Transactions of the Royal Society B 365, 7385.CrossRefGoogle ScholarPubMed
Kizilkaya, K., Fernando, R. L. & Garrick, D. J. (2010). Genomic prediction of simulated multibreed and purebred performance using observed fifty thousand single nucleotide polymorphism genotypes. Journal of Animal Science 88, 544551.CrossRefGoogle ScholarPubMed
Legarra, A., Aguilar, I. & Misztal, I. (2009). A relationship matrix including full pedigree and genomic information. Journal of Dairy Science 92, 46564663.CrossRefGoogle ScholarPubMed
Legarra, A. & Misztal, I. (2008). Technical note: Computing strategies in Genome-wide selection. Journal of Dairy Science 91, 360366.CrossRefGoogle ScholarPubMed
Legarra, A., Robert-Granié, C., Manfredi, E. & Elsen, J. M. (2008). Performance of genomic selection in mice. Genetics 180, 611618.CrossRefGoogle ScholarPubMed
Luan, T., Woolliams, J. A., Lien, S., Kent, M., Svendsen, M. & Meuwissen, T. H. E. (2009). The accuracy of Genomic Selection in Norwegian red cattle assessed by cross-validation. Genetics 183, 11191126.CrossRefGoogle ScholarPubMed
Lund, M. S., Sahana, G., de Koning, D. J., Su, G. & Carlborg, O. (2009). Comparison of analyses of the QTLMAS XII common dataset. I: Genomic selection. BMC Proceedings 3, S1.CrossRefGoogle ScholarPubMed
Mäntysaari, E., Liu, Z. & Van Raden, P. (2010). Interbull validation test for genomic evaluations. Interbull Bulletin 41.Google Scholar
Meuwissen, T. H. E., Hayes, B. J. & Goddard, M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157, 18191829.CrossRefGoogle ScholarPubMed
Michael, J., Schucany, W. & Haas, R. (1976). Generating random variates using transformations with multiple roots. American Statistician 30, 8890.CrossRefGoogle Scholar
Misztal, I., Tsuruta, S., Strabel, T., Auvray, B., Druet, T. & Lee, D. (2002). BLUPF90 and related programs (BGF90). In Seventh World Congress on Genetics Applied to Livestock Production, 2002, CD-ROM Communication N° 28–07.Google Scholar
Park, T. & Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association 103, 681686.CrossRefGoogle Scholar
Patterson, H. & Thompson, R. (1971). Recovery of inter-block information when block sizes are unequal. Biometrika 58, 545554.CrossRefGoogle Scholar
Peers, I. (1996). Statistical Analysis for Education and Psychology Researchers. Washington, DC: The Falmer Press.Google Scholar
Sorensen, D. & Gianola, D. (2002). Likelihood, Bayesian and MCMC Methods in Quantitative Genetics. New York: Springer.CrossRefGoogle Scholar
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B 58, 267288.Google Scholar
Usai, M. G., Goddard, M. E. & Hayes, B. J. (2009). LASSO with cross-validation for genomic selection. Genetics Research 91, 427436.CrossRefGoogle ScholarPubMed
Van Raden, P. M. (2008). Efficient Methods to Compute Genomic Predictions. Journal of Dairy Science 91, 44144423.CrossRefGoogle ScholarPubMed
Van Raden, P. M., Tooker, M. E. & Cole, J. B. (2009 a). Can you believe those genomic evaluations for young bulls? Journal of Animal Science 87(E-Suppl. 2), 314(abstr. 279).Google Scholar
Van Raden, P. M., Van Tassell, C. P., Wiggans, G. R., Sonstegard, T. S., Schnabel, R. D., Taylor, J. F. & Schenkel, F. S. (2009 b). Invited review: Reliability of genomic predictions for North American Holstein bulls. Journal of Dairy Science 92, 1624.CrossRefGoogle ScholarPubMed
Van Raden, P. M. & Wiggans, G. R. (1991). Derivation, calculation, and use of national animal model information. Journal of Dairy Science 74, 27372746.CrossRefGoogle ScholarPubMed
Van Tassell, C. P. & Van Vleck, L. D. (1996). Multiple-trait Gibbs sampler for animal models: flexible programs for Bayesian and likelihood-based (co)variance component inference. Journal of Animal Science 74, 25862597.CrossRefGoogle ScholarPubMed
Verbyla, K. L., Hayes, B. J., Bowman, P. J. & Goddard, M. E. (2009). Accuracy of genomic selection using stochastic search variable selection in Australian Holstein Friesian dairy cattle. Genetics Research 91, 307311.CrossRefGoogle ScholarPubMed
Weigel, K. A., de los Campos, G., González-Recio, O., Naya, H., Wu, X. L., Long, N., Rosa, G. J. M. & Gianola, D. (2009). Predictive ability of direct genomic values for lifetime net merit of Holstein sires using selected subsets of single nucleotide polymorphism markers. Journal of Dairy Science 92, 52485257.CrossRefGoogle ScholarPubMed
Figure 0

Table 1. Estimates of ‘sharpness’ parameter λ (±se) in Holstein

Figure 1

Table 2. Results in Montbéliarde: estimates (±se) of ‘sharpness’ parameter λ, of population genetic variance σu2 and accuracies r (correlations between GEBVs and 2DYDs in the validation data set)

Figure 2

Table 3. Estimates of population genetic variance σu2se) in Holstein

Figure 3

Table 4. Accuracies: correlations between GEBVs and 2DYDs in the validation data set, in Holstein

Figure 4

Table 5. Regression coefficients b of 2DYDs on GEBVs in the validation data set, in Holstein

Figure 5

Table 6. Correlation among GEBVs in the validation data set predicted by various methods for milk yield (above diagonal) and fat percentage (below diagonal), in Holstein

Figure 6

Fig. 1. Estimated effects of SNP loci for FP in Holstein by the HetVar-GBLUP method, for chromosomes 13 (crosses) and 14 (rounds).

Figure 7

Fig. 2. Theoretical distribution of SNP effects for fat content according to estimates of σe2, σa2 and λ in BL2Var (continuous line), BL1Var (grey dashed line) and MCMC-GBLUP normal model (dotted black line). The figure has been scaled so that the normal distribution has a variance of 1.