A two-step method for detecting selection signatures using genetic markers

DANIEL GIANOLA; HENNER SIMIANER; SABER QANBARI

doi:10.1017/S0016672310000121

A two-step method for detecting selection signatures using genetic markers

Published online by Cambridge University Press: 01 June 2010

DANIEL GIANOLA ,

HENNER SIMIANER and

SABER QANBARI

Show author details

DANIEL GIANOLA*: Affiliation:
Department of Animal Sciences and Department of Dairy Science, University of Wisconsin-Madison, Madison, Wisconsin 53706, USA Department of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, N-1432 Ås, Norway Department of Animal Sciences, Georg-August-Universität, Göttingen, Germany
HENNER SIMIANER: Affiliation:
Department of Animal Sciences, Georg-August-Universität, Göttingen, Germany
SABER QANBARI: Affiliation:
Department of Animal Sciences, Georg-August-Universität, Göttingen, Germany
*: *Corresponding author. e-mail: [email protected]

Article contents

Summary
Introduction
Background
Estimation of parameters
Clustering of θ-parameters
Discussion
References

Rights & Permissions

Summary

A two-step procedure is presented for analysis of θ (FST) statistics obtained for a battery of loci, which eventually leads to a clustered structure of values. The first step uses a simple Bayesian model for drawing samples from posterior distributions of θ-parameters, but without constructing Markov chains. This step assigns a weakly informative prior to allelic frequencies and does not make any assumptions about evolutionary models. The second step regards samples from these posterior distributions as ‘data’ and fits a sequence of finite mixture models, with the aim of identifying clusters of θ-statistics. Hopefully, these would reflect different types of processes and would assist in interpreting results. Procedures are illustrated with hypothetical data, and with published allelic frequency data for type II diabetes in three human populations, and for 12 isozyme loci in 12 populations of the argan tree in Morocco.

Type: Research Papers
Information: Genetics Research , Volume 92 , Issue 2 , April 2010 , pp. 141 - 155

DOI: https://doi.org/10.1017/S0016672310000121 [Opens in a new window]
Copyright: Copyright © Cambridge University Press 2010

1. Introduction

The discovery of a massive number of single nucleotide polymorphisms (SNPs) in the genome of several species has enabled exploration of genome-wide signatures of selection via an assessment of variation in marker allele frequencies among populations (e.g. Holsinger & Weir, Reference Holsinger and Weir2009). Several methods have been proposed for doing this, such as site frequency spectrum, linkage disequilibrium and population differentiation (Sabeti et al., Reference Sabeti, Schaffner, Fry, Lohmueller, Varilly, Shamovsky, Palma, Mikkelsen, Altshuler and Lander2006; Akey, Reference Akey2009). Concerning population differentiation, a parameter θ=F _ST, measuring relatedness between pairs of alleles within a sub-population relative to that in an entire population, has been used for this purpose (Wright, Reference Wright1951; Cockerham, Reference Cockerham1969; Weir & Hill, Reference Weir and Hill2002); Lewontin & Krakauer (Reference Lewontin and Krakauer1973) and Robertson (Reference Robertson1975) discuss related approaches. Equivalently, θ can be interpreted as a measure of dispersion of gene frequencies among groups relative to the variation expected in the population from which such groups derived. For example, Akey et al. (Reference Akey, Zhang, Zhang, Jin and Shriver2002) analysed over 26 500 SNPs for which allele frequencies were available in three populations of humans. The θ parameter was estimated for every marker locus and the distribution of estimates over the entire genome, and by chromosome, was examined. By referring these estimates to their empirical genome-wide distribution, 174 candidate genes were identified as possible targets of selection.

Holsinger & Weir (Reference Holsinger and Weir2009) provide an account of the logic of the procedure. Briefly, given a set of loci in a given species, a reasonable assumption is that all share the same demographic history and patterns of migration. If these loci are neutral and have similar mutation rates, members of this set can be conceivably regarded as exchangeable realizations of the same evolutionary process. Loci showing departures from the resulting distribution may serve as flags of genomic regions that have been under the influence of selection. Under the hypothesis of selective neutrality, the distribution (over loci) of estimates of θ is expected to be driven by genetic drift, assumed to affect all loci in a similar fashion. On the other hand, when selection operates on one or several loci (as in a multifactorial model for complex traits), markers that are within genes or in nearby locations will display large or small values of θ, the latter occurring when some sort of balancing selection takes place (Cavalli-Sforza, Reference Cavalli-Sforza1966). This opens an avenue for identification of regions associated with population differentiation, e.g., high versus low producing breeds of dairy cattle. Knowledge of such regions may be useful for enhancing the effectiveness of breeding programs via marker-assisted selection, or for tagging variants associated with disease or quantitative traits. While unusual values of θ may point to genomic locations where selection may have operated, there is arbitrariness with respect to characterizing the type of selection that might have occurred. Typically, loci are classified as either neutral, or subject to balancing selection (low values of θ), or favoured by selection within some specific population or environment (large population differentiation, thus leading to large values of θ). If the values of θ arise from different evolutionary or artificial (such as in plant and animal breeding) processes, one would expect to observe a mixture of distributions leading to clusters representing the different kinds of mechanisms operating. There is no apparent reason why there should be only two or three such clusters; there may be several clusters harbouring loci undergoing different types of selection processes. On the other hand, if θ values vary completely at random due to genetic drift, a single cluster is to be expected.

Statistical issues associated with inferring θ-statistics have been discussed, e.g., by Weir & Cockerham (Reference Weir and Cockerham1984) and Weir & Hill (Reference Weir and Hill2002), with emphasis on methods of moments estimation; by Balding (Reference Balding2003) using maximum likelihood for beta-binomial and Dirichlet-multinomial distributions; and by Holsinger (Reference Holsinger1999), Beaumont & Balding (Reference Beaumont and Balding2004) and Guo et al. (Reference Guo, Dey and Holsinger2009) employing Bayesian procedures. None of these treatments have addressed the possible existence of a clustered structure.

The objective of this paper is to present a two-step procedure eventually leading to clusters of θ values. The first step, along the lines of Holsinger (Reference Holsinger1999), Balding (Reference Balding2003) and Beaumont & Balding (Reference Beaumont and Balding2004), uses a simple Bayesian structure for drawing samples from the posterior distributions of θ-parameters, but without constructing Markov chains. This step assigns a weakly informative prior to allelic frequencies and does not make any assumptions about evolutionary models. The second step regards samples from these posterior distributions as ‘data’ and fits a sequence of finite mixture models, with the aim of identifying clusters of θ-statistics. Hopefully, these would reflect different types of processes and would assist in interpreting results.

The paper is organized as follows. Section 2 reviews basic concepts. In section 3, the first step of the procedure is presented, contrasted with maximum likelihood, and illustrated with a hypothetical dataset and with data on type II diabetes in three populations. Section 4 describes the second step of the procedure, and illustrates it with a dataset containing allelic frequencies for 12 polymorphic isozyme loci in 12 populations of the argan tree (Argania spinosa L. Skeels) of Morocco presented in Petit et al. (Reference Petit, El Mousadik and Pons1998) and analysed by Holsinger (Reference Holsinger1999). The paper concludes with a discussion of the proposed methodology.

2. Background

(i) Basic concepts

The stage is set by reviewing essentials of a random effects treatment proposed by Cockerham (Reference Cockerham1969, Reference Cockerham1973). Suppose that genetic markers (e.g. SNPs) are screened in a set of individuals in each of R groups or populations, the latter viewed as drawn at random from some conceptual hyper-population from which such groups derive. Consider a bi-allelic locus (developments carry to multiple alleles as well) and let A _l and a _l be the two alleles at locus l (l=1, 2, …, L); define p _l=Pr (A _l) to be the true, unobserved, frequency of allele A _l in the hyper-population, with 1−p _l=Pr (a _l) being the frequency of a _l. Cockerham (Reference Cockerham1969) defines a _l as any allele other than A _l and uses an indicator variable x to denote allelic state (‘content’), such that

$x_{rij\comma l} \equals \left\{\! {\matrix{ {1\comma } \quad {{\rm if\ an\ allele\ is\ }A_{l} \comma } \hfill\cr {0\comma } \quad{{\rm otherwise}.}\hfill \cr} } \right.$

Here, r=1, 2, …, R denotes group or replicate, i indicates an individual, j is an index for a within-individual deviation and l=1, 2, …, L is an indicator for locus. Even though x _rij,l is a binary variable (so a linear model is questionable), Cockerham (Reference Cockerham1969) uses the linear decomposition

(1)

$x_{rij\comma l} \equals p_{l} \plus a_{r\comma l} \plus b_{ri\comma l} \plus w_{rij\comma l} \comma$

where p _l is as before and a _r,l~(0, σ_a,l²), b _ri,l~(0, σ_b,l²) and w _rij,l~(0, σ_w,l²) are mutually uncorrelated zero-mean random deviates, specific to locus l; the σ²′s are variance components. Under the assumption that all alleles at locus l have the same marginal distribution,

$E\lpar x_{rij\comma l} \rpar \equals p_{l}$

and

${\rm Var}\lpar x_{rij\comma l} {\rm \rpar } \equals p_{l} \lpar 1 \minus p_{l} \rpar \equals \sigma _{a\comma l}^{\setnum{2}} \plus \sigma _{b\comma l}^{\setnum{2}} \plus \sigma _{w\comma l}^{\setnum{2}} \equals \sigma _{l}^{\setnum{2}}$

for l=1, 2, …, L. Decomposition (1) induces the following covariance structure between allelic content variables:

$\eqalign\tab {{\rm Cov\lpar }x_{rij\comma l} \comma x_{r{\prime } i{\prime } j{\prime } \comma l} {\rm \rpar } \equals} \cr\quad \openup3pt\hskip-6pt\left\{\! {\matrix{ {\sigma _{l}^{\setnum{2}} {\rm \comma \hfill}\hfill\quad \hskip32pt{\rm if\ }r \equals r{\prime } \comma i \equals i{\prime } \comma j \equals j{\prime } \comma }\hfill \cr {\sigma _{a}^{\setnum{2}} {\rm \comma }\quad\hskip32pt {\rm if\ }r \equals r{\prime } \comma i \ne i{\prime } \comma j \ne j{\prime } \comma }\hfill \cr {\sigma _{a\comma l}^{\setnum{2}} \plus \sigma _{b\comma l}^{\setnum{2}} {\rm \comma }\quad\enspace \hskip-3pt{\rm if\ }r \equals r{\prime } \comma i \equals i{\prime } \comma j \ne j{\prime } \comma }\hfill \cr {{\rm Cov}\lpar a_{r} \comma a_{r{\prime } } \rpar {\rm \comma \enspace if\ replicates\ are\ correlated\ somehow}.} \cr} } \right.}$

A standard assumption is Cov(a _r, a _r′)=0. The following correlations (all positive) follow.

• Pairs of alleles drawn at random from different individuals in the same group are correlated as

(2)
$\rho _{a\comma l} \equals {{\sigma _{a\comma l}^{\setnum{2}} } \over {\sigma _{a\comma l}^{\setnum{2}} \plus \sigma _{b\comma l}^{\setnum{2}} \plus \sigma _{w\comma l}^{\setnum{2}} }} \equals \theta _{l} \equals F_{{\rm ST}\comma l} \comma$
so 0⩽θ_l⩽1 for all l.
• Pairs of alleles drawn within individuals over all replicates bear a correlation equal to

$\rho _{ab\comma l} \equals {{\sigma _{a\comma l}^{\setnum{2}} \plus \sigma _{b\comma l}^{\setnum{2}} } \over {\sigma _{a\comma l}^{\setnum{2}} \plus \sigma _{b\comma l}^{\setnum{2}} \plus \sigma _{w\comma l}^{\setnum{2}} }} \equals F_{l} \equals F_{{\rm IT}\comma l} \comma$
where F is the total inbreeding coefficient, also known as F _IT (e.g. Weir & Hill, Reference Weir and Hill2002).
• The correlation between alleles within individuals within the same replicate is

$\rho _{b\comma l} \equals {{\sigma _{b\comma l}^{\setnum{2}} } \over {\sigma _{b\comma l}^{\setnum{2}} \plus \sigma _{w\comma l}^{\setnum{2}} }} \equals f_{l} \equals F_{{\rm IS}\comma l} \comma$
which is the within sub-population inbreeding coefficient f.

It is easy to show that

$\theta _{l} \equals {{F_{{\rm IT}\comma l} \minus F_{{\rm IS}\comma l} } \over {1 \minus F_{{\rm IS}\comma l} }} \equals F_{{\rm ST}\comma l}.$

This expression satisfies

$1 \minus F_{{\rm IT}\comma l} \equals \lpar 1 \minus F_{{\rm IS}\comma l} \rpar \lpar 1 \minus F_{{\rm ST}\comma l} \rpar \comma$

indicating that a reduction in heterozygosity has two sources: one that is due to population sub-division or Wahlund's effect, (1−F _ST,l), and a reduction within subpopulation or group caused by ‘local’ inbreeding, (1−F _IS,l).

Note that parameter F _ST given in (2) can also be written as

$\theta _{l} \equals {{\sigma _{a\comma l}^{\setnum{2}} } \over {\sigma _{a\comma l}^{\setnum{2}} \plus \sigma _{b\comma l}^{\setnum{2}} \plus \sigma _{w\comma l}^{\setnum{2}} }} \equals {{\sigma _{a\comma l}^{\setnum{2}} } \over {p_{l} \lpar 1 \minus p_{l} \rpar }}.$

In Cockerham (Reference Cockerham1969), the variance σ_a,l² arises by drawing alleles from a random sample of populations. Under conceptual repeated sampling, this generates a distribution having such variance. However, in many applications, the R groups under study are targeted (as opposed to randomly sampled) populations, e.g., Myles et al. (Reference Myles, Hradetzky, Engelken, Lao, Nürnberg, Trent, Wang, Kayser and Stoneking2007) ; this defines a ‘fixed effects’ model. Now, since σ_a,l² is the between-group variance in allelic content as per model (1), an alternative parametric definition of θ_l in terms of the unknown gene frequencies of the R groups is

(3)

$\theta _{l} \equals {{{\textstyle{{\sum\nolimits_{r \equals \setnum{1}}^{R} {{\rm \lpar }p_{r\comma l} \minus \bar{p}_{l} {\rm \rpar }^{\rm \setnum{2}} } {\rm \ }} \over R}}} \over {\bar{p}_{l} \lpar 1 \minus \bar{p}_{l} \rpar }}\comma$

where $\bar{p}_{l} \equals \sum\nolimits_{r \equals \setnum{1}}^{R} {p_{r\comma l} \sol\! R}$ is the average (over groups) of the frequencies of allele A _l at locus l. Note that _l is taken as an unweighted average; it does not seem sensible to express a parameter in terms of sample size (unless weights assigned to samples reflect true population sizes). Expressing θ_l explicitly in terms of the locus-specific gene frequencies yields

(4)

$\theta _{l} \equals {{\sum\nolimits_{r \equals \setnum{1}}^{R} {p_{r\comma l}^{\setnum{2}} } \minus {{\mathop {\left( {\sum\nolimits_{r \equals \setnum{1}}^{R} {p_{r{\rm \comma }l} } } \right)}\nolimits^{\setnum{2}} } \over R}} \over {\left( {\sum\nolimits_{r \equals \setnum{1}}^{R} {p_{r{\rm \comma }l} } \minus {{\mathop {\left( {\sum\nolimits_{r \equals \setnum{1}}^{R} {p_{r{\rm \comma }l} } } \right)}\nolimits^{\setnum{2}} } \over R}} \right)}}\comma$

providing a mapping from the joint space of R allelic frequencies to the single-dimensional space of θ_l, which resides in (0, 1). If allelic frequencies for the different loci are driven from the same stochastic evolutionary process (e.g. as generated by random drift), this defines a distribution of values of θ expected under neutrality assumptions. From a Bayesian perspective, every unknown is a random variable and, since allelic frequencies are unknown, θ as given in (3) will posses a distribution, both a priori and a posteriori. In the first step of the method proposed in this paper, the posterior distribution of θ_l will result from assigning a vague prior to all allelic frequencies, corresponding in some sense to what could be termed as a fixed effects treatment from a frequentist perspective. The second step addresses the question of whether or not all θ_l stem from the same distribution or from different distributions resulting from heterogeneity of the underlying stochastic processes. This makes the treatment proposed here different from those in, e.g., Holsinger (Reference Holsinger1999) or Balding (Reference Balding2003).

3. Estimation of parameters

(i) Inferring gene frequencies

Gene frequencies can be inferred using a simple Bayesian approach. Suppose that n _r individuals are genotyped in population r, so that the number of alleles screened at locus l is 2n _r=n _r,Al+n _r,al, where n _r,Al and n _r,al are the observed numbers of copies of A _l and a _l, respectively.

A convenient assumption is that of mutual independence between the distributions of alleles at different loci (stronger than that of pairwise linkage equilibrium). Linkage disequilibrium is pervasive but the assumption made above facilitates matters and is widely used, e.g., by Corander et al. (Reference Corander, Waldmann and Sillanpää2003). Let p=(p₁, p₂, …, p_R)′ be an RL×1 vector of allelic frequencies for all R groups, where p_r=(p _r,1, p _r,2, …, p _r,L)′ has order L×1. Under the mutual independence assumption, the likelihood conferred by the observed number of copies of alleles to the gene frequencies is

(5)

$l\lpar {{\bf p}\vert {\rm DATA}} \rpar \equals \prod\limits_{r \equals \setnum{1}}^{R} \prod\limits_{l \equals \setnum{1}}^{L} {\rm \ }p_{r\comma l}^{n_{{r\comma A_{l} }} } \lpar 1 \minus p_{r\comma l} \rpar ^{n_{{r\comma a_{l} }} }.$

The maximum likelihood estimator of p _r,l is _r,l=n _r,A _l/2n _r and its empirical variance is $\widehat{{\rm Var}}\lpar \hat{p}_{r\comma l} \rpar \equals {{\hat{p}_{r\comma l} \lpar 1 \minus \hat{p}_{r\comma l} \rpar } \sol {2n_{r} }}$ . The maximum likelihood estimator is unbiased but unstable, and may take values at the boundaries of the parameter space in small samples.

In a Bayesian treatment, allelic frequencies are assigned a prior distribution that might be homogeneous or heterogeneous over populations, chromosomes or genomic regions (e.g. coding versus non-coding regions). For example, Holsinger (Reference Holsinger1999, Reference Holsinger, Clark and Gelfand2006) adopts a prior beta distribution, Beta (p _l|a _l, b _l) (and interprets it as describing variation over populations) with parameters

$a_{l} \equals {{1 \minus \theta } \over \theta }x_{l}$

and

$b_{l} \equals {{1 \minus \theta } \over \theta }\lpar 1 \minus x_{l} \rpar.$

Here θ is common to all loci (i.e. the hypothesis of neutrality) and x _l is the mean allelic frequency at locus l (averaged over populations). Using properties of the beta distribution in the parametric definition of θ leads to

${{{\rm Var\lpar }p_{l} {\rm \rpar }} \over {E{\rm \lpar }p_{l} {\rm \rpar \lsqb }1 \minus E\lpar p_{l} {\rm \rpar \rsqb }}} \equals {{{\textstyle{{a_{l} b_{l} } \over {\lpar a_{l} \plus b_{l} \rpar ^{\setnum{2}} \lpar a_{l} \plus b_{l} \plus 1\rpar }}}} \over {{\textstyle{{a_{l} } \over {a_{l} \plus b_{l} }}} \cdot {\textstyle{{b_{l} } \over {a_{l} \plus b_{l} }}}}} \equals \theta.$

Then, the joint posterior distribution of all unknowns (allelic frequencies, θ and vector x={x _l}) is

$\eqalign{ \tab g\lpar {{\bf p}\comma \theta \comma {\bf x}\vert {\rm DATA}} \rpar \cr \propto \tab \prod\limits_{r \equals \setnum{1}}^{R} \prod\limits_{l \equals \setnum{1}}^{L} {\rm \ }p_{r\comma l}^{n_{{r\comma A_{l} }} \plus \lpar \lpar \setnum{1} \minus \theta \rpar \sol \theta \rpar x_{l} \minus \setnum{1}}\cr\tab \lpar 1 \minus p_{r\comma l} \rpar ^{n_{{r\comma a_{l} }} \plus \lpar \lpar \setnum{1} \minus \theta \rpar \sol \theta \rpar \lpar \setnum{1} \minus x_{l} \rpar \minus \setnum{1}} g\lpar \theta \rpar g\lpar {\bf x} \rpar. \cr}$

Holsinger (Reference Holsinger1999) took g(θ)=Beta(1, 2) distribution as prior for θ, and assumed that all x _l were identically distributed according to the uniform process g(x _l)=U(0, 1). Given θ and x, the allelic frequencies are conditionally independent with conditional posterior distributions:

$\eqalign{ g\lpar {p_{r\comma l} \vert {\rm ELSE}} \rpar\equals\tab {\rm Beta}\left( {n_{r\comma A_{l} } \plus {{1 \minus \theta } \over \theta }x_{l} \comma n_{r\comma a_{l} } \plus {{1 \minus \theta } \over \theta }\lpar {1 \minus x_{l} } \rpar} \right)\comma \cr r \equals\tab 1\comma \ 2 \comma \, \ldots \comma \, R\quad l \equals 1\comma \ 2\comma \, \ldots \comma \, L\comma \cr}$

where ELSE means all parameters other than p _r,l and the data observed. However, the conditional posterior distributions of θ and x are not recognizable, so an elaborate sampling scheme, e.g., one based on Markov chain Monte Carlo (MCMC) methods, must be tailored. Holsinger (Reference Holsinger1999) found that inferences were insensitive with respect to the choices of beta and uniform prior distributions for θ and elements of x, respectively. However, it was assumed (as in a neutral model) that all loci share the same θ parameter. This produces a mutual borrowing of information among loci (shrinking p _r,l towards a common value), but the procedures are not explicit with respect to the existence of heterogeneity over loci due to forces such as differential mutation or selective sweeps. As proposed by Beaumont & Balding (Reference Beaumont and Balding2004), one could estimate locus specific θ values and refer these estimates to the posterior distribution of θ under the homogeneity value. In this manner, outliers could be found with respect to the ‘neutral’ distribution, but this would not inform about the structure of any latent heterogeneity.

Here, an alternative approach is used. Jeffreys rule (Bernardo & Smith, Reference Bernardo and Smith1994; Sorensen & Gianola, Reference Sorensen and Gianola2002) is used to produce a reference prior, which is a ${\rm Beta}\left( {{\textstyle{1 \over 2}}\comma {\textstyle{1 \over 2}}} \right)$ distribution assigned to all loci in all populations. This reference prior distribution is minimally informative in a well defined sense (Bernardo and Smith, Reference Bernardo and Smith1994). Using Bayes theorem, the joint posterior density of all allelic frequencies is now

(6)

$\eqalign{ g\lpar {\bf p}\vert {\rm DATA}\rpar \propto \tab \prod\limits_{r \equals \setnum{1}}^{R} \prod\limits_{l \equals \setnum{1}}^{L} {\rm \ }p_{r\comma l}^{n_{{r\comma A_{l} }} \plus \lpar \setnum{1}\sol \setnum{2}\rpar \minus \setnum{1}} \lpar 1 \minus p_{r\comma l} \rpar ^{n_{{r\comma a_{l} }} \plus \lpar \setnum{1}\sol \setnum{2}\rpar \minus \setnum{1}}\hskip-30pt \cr \equals \tab \prod\limits_{r \equals \setnum{1}}^{R} \prod\limits_{l \equals \setnum{1}}^{L} {\rm \ Beta}\left( {n_{r\comma A_{l} } \plus {\textstyle{1 \over 2}}\comma n_{r\comma a_{l} } \plus {\textstyle{1 \over 2}}} \right).\hskip-30pt \cr}$

Thus, allelic frequencies at different loci are mutually independent, a posteriori, with p _r,l following a beta distribution with parameters $\alpha _{rl} \equals n_{r\comma A_{L}\! }\plus {\textstyle{1 \over 2}}$ and $\beta _{rl} \equals n_{r\comma a_{l} } \plus {\textstyle{1 \over 2}}$ . Possible point estimates of allelic frequencies are the posterior mean

(7)

${\mathop{p}\limits^{\leftrightarrow}}\nolimits _{r\comma l}\equals {{n_{r\comma A_{l} } \plus {\textstyle{1 \over 2}}} \over {2n_{r} \plus 1}}\comma$

and the posterior mode

(8)

$\tilde{p}_{r\comma l} \equals {{n_{r\comma A_{l} } \minus {\textstyle{1 \over 2}}} \over {2n_{r} \minus 1}}\comma \quad {\rm for\ }n_{r\comma A_{l} } \ges 1.$

The variance of the posterior distribution of p _r,l is

(9)

${\rm Var\lpar }p_{r\comma l} \vert {\rm DATA\rpar } \equals {{\left( {n_{r\comma A_{l} } \plus {\textstyle{1 \over 2}}} \right)\left( {n_{r\comma a_{l} } \plus {\textstyle{1 \over 2}}} \right)} \over {\mathop {\lpar 2n_{r} \plus 1\rpar }\nolimits^{\setnum{2}} \lpar 2n_{r} \plus 2\rpar }}.$

Even though a weakly informative prior is used, differences exist with respect to maximum likelihood. To illustrate this point, consider a hypothetical example with two groups, M and N. Suppose that 100 individuals are genotyped in group M and that the observed number of A _l alleles is 199, i.e. the locus is nearly fixed. The maximum likelihood estimate of p _M,l is 0·995 and its estimated standard error is 4·99×10⁻³; a calculation based on asymptotic normality (without truncation) yields that the probability of obtaining estimates larger than 1 is close to 0·16! Further, the probability of obtaining estimates between 0·9 and 0·995 is close to ${\textstyle{1 \over 2}}$ . On the other hand, the posterior distribution of p _M,l is ${\rm Beta}\left( {199 \plus {\textstyle{1 \over 2}}\comma 1 \plus {\textstyle{1 \over 2}}} \right)$ . The posterior mean and posterior standard deviation are 0·993 (note some shrinkage away from the edge of the parameter space) and 6·06×10⁻³, respectively; the posterior probability of the frequency being larger than 1 is exactly zero, and the probability that p _M,l takes values between 0·9 and 0·995 is about 0·57. Figure 1 displays the posterior distribution of the allelic frequency obtained with Jeffreys prior, overlaid against the normal approximation to the distribution of the maximum likelihood estimates. Clearly, the approach used makes a difference, even in a situation where allelic frequencies are estimated with reasonable precision, as indicated by the small standard error of the maximum likelihood estimate and the small posterior standard deviation in the Bayesian analysis (the coefficient of variation of the posterior distribution is less than 1%).

Fig. 1. Posterior density (thick line) of the allelic frequency p at a locus for which 199 copies have been observed out of 200 alleles counted in hypothetical population M; the posterior distribution is Beta $\left( {199 \plus {\textstyle{1 \over 2}}\comma 1 \plus {\textstyle{1 \over 2}}} \right)$ . The thin line is the density of a normal approximation to the sampling distribution of the maximum likelihood estimator.

In the second population, N, 30 individuals are genotyped and 10 alleles are of the type A _l; the maximum likelihood estimate of p _N,l is then ${\textstyle{1 \over 6}}$ , much lower than in M, and its sampling variance is 2·31×10⁻³. The posterior distribution of p _N,l is ${\rm Beta}\left( {10 \plus {\textstyle{1 \over 2}}\comma {\rm \ }50 \plus {\textstyle{1 \over 2}}} \right)$ . In N, the posterior density of p _N,l and the normal approximation to the density of the distribution of the maximum likelihood estimator are very similar (not shown here).

Differences in allelic frequencies between populations M and N at the locus in question may be due to random drift or may suggest a signature of selection.

(ii) Inferring θ by maximum likelihood

A likelihood-based estimate of θ can be obtained by replacing in (3) or (4) the unknown allelic frequencies by their maximum likelihood estimates. For the example of populations M and N above, the estimate is

$\hats{\theta }_{l} \equals {{\sum\nolimits_{r \equals \setnum{1}}^{\setnum{2}} {\lpar{\hat{p}}\nolimits_{r\comma l} \minus \bars{\hat{p}}_{l} \rpar ^{\setnum{2}} } } \over {2\bars{\hat{p}}_{l} \lpar 1 \minus \bars{\hat{p}}_{l} \rpar }} \approx 0{\cdot}7046.$

The sampling variance of the maximum likelihood estimator of θ_l can be approximated using a Taylor series expansion. As shown in Appendix A, the first derivative of θ_l with respect to the allelic frequency at locus l in group r is

${\partial \over {\partial p_{r\comma l} }}\theta _{l} \equals \left[ {{{2\lpar p_{r\comma l} \minus \bar{p}_{l} \rpar } \over {\overline{{p}_{l}^{\setnum{2}}} \minus \bar{p}_{l}^{\setnum{2}} }} \minus {{\lpar 1 \minus 2\bar{p}_{l} \rpar } \over {\bar{p}_{l} \lpar 1 \minus \bar{p}_{l} \rpar }}} \right]{{\theta _{l} } \over R}\comma$

for r=1,2, …, R; l=1,2, …, L, where $\bar{p}_{l} \equals \sum\nolimits_{r \equals \setnum{1}}^{R} {p_{r\comma l} \sol\! R}$ is as before and $\bar{p}_{l}^{\setnum{2}} \equals \sum\nolimits_{r \equals \setnum{1}}^{R} {p_{{r\comma l} }^{\setnum{2}} \sol\! R}$ . Further, let

$\hats{\nabla } \equals{\left\{ {{\partial \over {\partial p_{r\comma l} }}\theta _{l} } \right\}}_{p_{{r\comma l}} \equals \hates{p}_{{r\comma l}} } \comma$

be an RL×1 vector of first derivatives evaluated at the maximum likelihood estimates of the allelic frequencies. Then, approximately

$\eqalign{ \widehat{{\rm Var}}\lpar \hats{\theta }_{l} \rpar \approx\tab\hskip2pt \hats{\nabla }\, \prime\, \widehat{{\rm Var}}\lpar\! {\bf\hat{p}}\rpar \hats{\nabla } \cr \equals \tab \mathop\sum\limits_{r \equals \setnum{1}}^{R}{\left\{ {\left[ {{{2\lpar p_{r\comma l} \minus \bar{p}_{l} \rpar } \over \overline{{p}_{l}^{\setnum{2}}} \minus \bar{p}_{l}^{\setnum{2}} }} \minus {{\lpar 1 \minus 2\bar{p}_{l} \rpar } \over {\bar{p}_{l} \lpar 1 \minus \bar{p}_{l} \rpar }}} \right]{{\theta _{l} } \over R}} \right\}}\nolimits_{_{{p_{{r\comma l}} \equals \hates{p}_{{r\comma l}} }} }^{\setnum{2}} \cr \tab \times {{\hat{p}_{r\comma l} \lpar 1 \minus \hat{p}_{r\comma l} \rpar } \over {2n_{r} }}\comma \cr}$

where $\widehat{{\rm Var}}\lpar {\bf \hat{p}}\rpar \equals {\rm Diag\lpar }\hat{p}_{r\comma l} \lpar 1 \minus \hat{p}_{r\comma l} \rpar \sol 2n_{r} \rpar$ is a diagonal matrix containing the estimates of the sampling variances of the maximum likelihood estimates of allelic frequencies p _r,l. For the hypothetical example, $\widehat{{\rm Var}}\lpar \hats{\theta }_{l} \rpar \approx 9{\cdot}8265 \times 10^{ \minus \setnum{5}}$ . The asymptotic normal approximation to the distribution of the estimates assigns nil probability to ‘estimates’ outside of (0,1); the probability of obtaining estimates of θ between 0·67 and 0·74 for this two-population situation is 0·9996.

(iii) Bayesian inference of θ

Consider now finding the posterior distribution of θ_l as defined in (4) and without making the assumption that the θs are realizations from the same stochastic process, i.e. without borrowing information across loci over and above the shrinkage of allelic frequencies produced by Jeffreys prior. The posterior distribution is analytically difficult to arrive at because θ_l is a non-linear function of gene frequencies in all R groups. However, since it is easy to obtain independent samples from each of the ${\rm Beta\lpar }n_{r\comma A_{l} } \plus {\textstyle{1 \over 2}}\,\comma n_{r\comma a_{l} } \plus {\textstyle{1 \over 2}}{\rm \rpar }$ processes, Monte Carlo estimates of features of the posterior distribution of θ_l can be obtained without using MCMC methods at all. Let p _r,l^(s), s=1,2, …, S, be samples from the posterior (beta) distribution of p _r,l, the frequency of allele A _l at locus l. Then, a draw from the posterior distribution of θ_l is given by

(10)

$\theta _{l}^{\lpar s \rpar} \equals {\displaystyle{\sum\nolimits_{\,r \equals \setnum{1}}^{R} {\lpar p_{r\comma l}^{\lpar s\rpar } \rpar ^{\setnum{2}} } \minus {{\left( {\sum\nolimits_{r \equals \setnum{1}}^{R} {p_{r\comma l}^{\lpar s\rpar } } } \right)^{\setnum{2}} } \over R}} \over {\left( \displaystyle{{{R\sum\nolimits_{r \equals \setnum{1}}^{R} {p_{r\comma l}^{\lpar s\rpar } } \minus{\left( {\sum\nolimits_{r \equals \setnum{1}}^{R} {p_{r\comma l}^{\lpar s\rpar } } } \right)}\nolimits^{\setnum{2}} } \over R}} \right)}}\comma$

which is a random variable with support in (0,1) (Holsinger, Reference Holsinger, Clark and Gelfand2006). Then, from S samples, the mean, median, variance, etc., of the posterior distribution of θ_l can be estimated. Each θ_l (l=1, 2, …, L) will have a point estimate and an assessment of uncertainty, e.g., a credibility interval of size 95% given by the 2·5% and 97·5% percentiles of the corresponding posterior distribution estimated either from samples or from the normal theory approximation given in Appendix B.

In the hypothetical populations M and N, the posterior distributions of the frequency of A _l are Beta(199·5,1·5) and Beta(10·5,50·5), respectively. With draws denoted as B ^(s) (.,.), S samples from the posterior distribution of θ_l can be obtained as

$\eqalign{\theta _{l}^{\left( s \right)} \equals \tab {{\lsqb B^{\lpar s\rpar } \lpar 199{\cdot}5\comma 1{\cdot}5\rpar \rsqb ^{\setnum{2}} \plus \lsqb B^{\lpar s\rpar } \lpar 10{\cdot}5\comma 50{\cdot}5\rpar \rsqb ^{\setnum{2}} \minus \displaystyle{{\lcub \lsqb B^{\lpar s\rpar } \lpar 199{\cdot}5\comma 1{\cdot}5\rpar \rsqb{ \plus \lsqb B^{\lpar s\rpar } \lpar 10{\cdot}5\comma 50{\cdot}5\rpar \rsqb \rcub }\nolimits^{\setnum{2}} } \over 2}} \over \lsqb{B^{\lpar s\rpar } \lpar 199{\cdot}5\comma 1{\cdot}5\rpar \rsqb \plus \lsqb B^{\lpar s\rpar } \lpar 10{\cdot}5\comma 50{\cdot}5\rpar \rsqb \minus \displaystyle{{\lcub \lsqb B^{\lpar s\rpar } \lpar 199{\cdot}5\comma 1{\cdot}5\rpar \rsqb{ \plus \lsqb B^{\lpar s\rpar } \lpar 10{\cdot}5\comma 50{\cdot}5\rpar \rsqb \rcub }\nolimits^{\setnum{2}} } \over 2}}}}\comma\quad s\equals 1,2,\ldots, S.$

To illustrate, 5000 samples were drawn from each of the two beta distributions, to form S=5000 corresponding draws from the posterior distribution of θ_l. The mean and median were 0·6966 and 0·6972, respectively; the standard deviation was 0·070 and the range of values samples spanned from 0·4268 to 0·8883. The posterior density of θ_l and the empirical cumulative distribution function are in Figs 2 and 3, respectively. Values of θ_l appearing with appreciable density range from about 0·5 to 0·9 (Fig. 2), with small posterior probability assigned to values smaller than 0·6 (Fig. 3).

Fig. 2. Posterior density of θ_l for the hypothetical example of populations M and N.

Fig. 3. Empirical cumulative distribution function of θ_l for the hypothetical example of populations M and N.

(iv) A Bayesian ‘null’ distribution for assessing sampling variation uncertainty

It is important to check whether or not posterior estimates of θ_l depart from what would be expected by chance alone. A posterior distribution consistent with expectations under a ‘null’ model is formulated next. The θ_l statistics calculated from the ‘full’ model above can then be referred to this null distribution. Note that the ‘null’ distribution given below describes the uncertainty to be expected from drawing random samples from the same population, but not the variability to be expected due to genetic drift. If estimates of θ_l fall in this null distribution, this would indicate that the study lacks power to answer evolutionary questions in any meaningful manner.

A ‘null’ distribution is arrived at by stating that p _r _, _l=p _l is the same random variable for all R populations. Under this assumption, the posterior distribution of the vector of gene frequencies (now of dimension L×1) under the ‘null’ model is

(11)

$\eqalign\tab{ g\lpar {\bf p}\vert {\rm DATA}\comma {\rm Null}\rpar}} \cr\quad \propto \left[ {\prod\limits_{r \equals \setnum{1}}^{R} \prod\limits_{l \equals \setnum{1}}^{L} \,p_{l}^{n_{{r\comma A_{l} }} } \lpar 1 \minus p_{l} \rpar ^{n_{{r\comma a_{l} }} }} \right]{\prod\limits_{l \equals \setnum{1}}^{R} \,p_{l}^{{\textstyle{\setnum{1} \over \setnum{2}}} \minus \setnum{1}} \lpar 1 \minus p_{l} \rpar ^{{\textstyle{\setnum{1} \over \setnum{2}}} \minus \setnum{1}}} \cr \tab\quad {{{\equals \prod\limits_{l \equals \setnum{1}}^{L} \,{\rm Beta}\left( {\mathop\sum\limits_{r \equals \setnum{1}}^{R} {n_{r\comma A_{l} } } \plus {1 \over 2}\comma \mathop\sum\limits_{r \equals \setnum{1}}^{R} {n_{r\comma a_{l} } } \plus {1 \over 2}} \right). \cr}$

Hence, allelic frequencies p _l are mutually independent, a posteriori, with p _l|DATA, Null being a beta distribution with parameters $\alpha _{l} \equals \sum\nolimits_{r \equals \setnum{1}}^{R} {n_{r\comma A_{l} } } \plus {\textstyle{1 \over 2}}$ and $\beta _{r} \equals \sum\nolimits_{r \equals \setnum{1}}^{R} {n_{r\comma a_{l} } \plus {\textstyle{1 \over 2}}}.$ A draw from the posterior distribution of the F _ST statistic under this model takes the form

(12)

$\theta _{l\comma {\rm Null}}^{\lpar s\rpar } \equals {{{\textstyle{{\sum\nolimits_{r \equals \setnum{1}}^{R} {\lpar p_{l}^{\lpar r\comma s\rpar } \minus \bar{p}_{l} ^{\lpar s\rpar } \rpar } ^{\setnum{2}} } \over R}}} \over {\bar{p}_{l} ^{\lpar s\rpar } \lpar 1 \minus \bar{p}_{l} ^{\lpar s\rpar } \rpar }}\comma$

where p _l^(r,s) is a draw from ${\rm Beta}\left( {\sum\nolimits_{r \equals \setnum{1}}^{R} {n_{r\comma A_{l} } } \plus {\textstyle{1 \over 2}}\comma \sum\nolimits_{r \equals \setnum{1}}^{R} {n_{r\comma a_{l} } } \plus {\textstyle{1 \over 2}}} \right)$ , with R such draws involved in a realization of θ_l^(s), and $\bar{p}_{l} ^{\lpar s\rpar }$ is the average of the R draws. A set of samples from the posterior distribution of θ_l under the null model is obtained by repeating the sampling process S times. This distribution serves as a reference against which the θ_l statistics calculated from the ‘full’ model can be compared. If the posterior mean of θ_l obtained from the ‘full’ model falls outside of a high density area of the posterior distribution of θ in the null model, then the divergence between populations would be probably due to drift or selection (assuming mutation rates are constant over populations), but not due to chance alone.

For the example of populations M and N, $\sum\nolimits_{r \equals \setnum{1}}^{R} {n_{r\comma A_{l} } } \equals 209$ and $\sum\nolimits_{r \equals \setnum{1}}^{R} {n_{r\comma a_{l} } } \equals 51$ . Figure 4 depicts the Beta(209·5,51·5) distribution of the allelic frequency under the ‘null’ model. Note that the maximum likelihood estimates of the allelic frequencies in the M and N populations, of 0·995 and ${\textstyle{1 \over 6}}\comma$ respectively, are not assigned any appreciable density under this model. Upon drawing 5000 independent samples from the beta distribution of the allelic frequency under the null model, 5000 draws for θ_l,Null^(s) were obtained by evaluating (12) for each of the samples. Draws ranged from 8·24×10⁻¹³ to 0·0503; the mean (standard deviation) was 0·0038(0·0053) and the posterior median was 0·0017. The posterior density of θ_l,Null was very sharp as shown in Fig. 5. In the full model, the estimated posterior mean (standard deviation) of θ_l was 0·6966, which is unlikely to have been generated under the null distribution. This would make the locus a reasonable candidate for further examination.

Fig. 4. Posterior density of the allelic frequency p under the ‘null’ model for hypothetical populations M and N: 209 copies of A _l are observed out of 260 alleles screened.

Fig. 5. Posterior density of θ_l under the null model for the hypothetical example of populations M and N.

(v) Illustration of sampling variation with candidate genes for type-II diabetes

The Bayesian method was applied to data pertaining to identification of candidate gene variants for type II diabetes in Polynesians (Myles et al., Reference Myles, Hradetzky, Engelken, Lao, Nürnberg, Trent, Wang, Kayser and Stoneking2007). Prevalence of this disease is high in several Pacific populations, e.g. 40% of adults living in the island of Nauru. DNA samples were obtained from 23 Polynesians, 23 New Guineans and 19 Han Chinese from Beijing. Type II diabetes-associated alleles were from 10 SNP loci having evidence of association. Estimated frequencies and θ_l statistics are shown in page 587 of Myles et al. (Reference Myles, Hradetzky, Engelken, Lao, Nürnberg, Trent, Wang, Kayser and Stoneking2007). To illustrate the Bayesian procedure, data for the KCNJ11 locus was used, and susceptibility allele frequencies (A _l in our notation) were 0·30, 0·25 and 0·34 in the three populations above, respectively. Their figures do not lead to an integer number of alleles, due to rounding error, so the number of observed A _l alleles used here was set to 14 (Polynesians), 12 (New Guineans) and 13 (Han Chinese). Myles et al. (Reference Myles, Hradetzky, Engelken, Lao, Nürnberg, Trent, Wang, Kayser and Stoneking2007) employed an ‘unbiased estimator’ of θ_l for calculating population pairwise differences, and their estimates were 0·003 (New Guinea–China), −0·024 (China–Polynesia) and −0·017 (New Guinea–Polynesia). Note the two negative estimates of a parameter that resides in (0,1); standard errors or significance levels were not provided. Their analysis suggests that this locus is not associated with prevalence of the disease.

The posterior distributions of A _l were Beta_Polynesians(14·5,32·5), Beta_NewGuineans(12·5,34·5) and Beta_HanChinese(13·5,25·5). The number of samples drawn from each of these 3 posterior distributions was S=1000, and 1000 draws from the posterior distribution of θ_KCNJ11 were obtained by evaluation of (10). Values of θ_KCNJ11 ranged from 2·423×10⁻⁵ to 0·1500, with an estimated posterior mean (standard deviation) of 0·019 (0·0193); this estimate is higher than that of Myles et al. (Reference Myles, Hradetzky, Engelken, Lao, Nürnberg, Trent, Wang, Kayser and Stoneking2007). The non-parametric estimate of the posterior density of θ_KCNJ11 is shown in Fig. 6, illustrating that the true value of the F _ST parameter is very likely below 0·10. The posterior inter-correlation structure between allelic frequencies and θ_KCNJ11 in the full model was examined and, as expected, draws from the posterior distributions of allelic frequencies in the three populations were uncorrelated. Samples of θ_KCNJ11 were positively correlated (0·55) with those for allelic frequency in Chinese Han, and the 95% confidence interval for the correlation was 0·51–0·60. However, draws for θ_KCNJ11 were negatively correlated with allele frequencies in Polynesians (−0·07) and New Guineans (−0·39); the confidence intervals for these two correlations were (−0·13,−0·01) and (−0·44,−0·34), respectively.

Fig. 6. Density of the posterior distribution of θ_KCNJ11 obtained from allelic frequencies in Myles et al. (Reference Myles, Hradetzky, Engelken, Lao, Nürnberg, Trent, Wang, Kayser and Stoneking2007).

For the ‘null’ model, the 1000 samples from the posterior distribution of θ_KCNJ11,Null ranged from 3·62×10⁻⁶ to 0·1460, with the posterior mean (standard deviation) estimated at 0·002 (0·002); the posterior median was 0·002 as well. The posterior mean (standard deviation) estimate of θ_KCNJ11 under the ‘full’ model was 0·019, and it did not enter with high density in the ‘null’ model (not shown). Although variation in allelic frequency at locus KCNJ11 among the three populations departs from what would be expected from chance alone (statistical sampling), the observed θ value is very small. This may support the hypothesis that this locus may not be associated with differences in prevalence of type II diabetes, in agreement with Myles et al. (Reference Myles, Hradetzky, Engelken, Lao, Nürnberg, Trent, Wang, Kayser and Stoneking2007). Allelic frequencies were uncorrelated, as it should be, given that the three replicates were drawn from the same Beta(209·5,51·5) distribution. The θ_KCNJ11,Null statistic was uncorrelated with allelic frequencies, and the correlations were −0·08, −0·11 and 0·03 in the three replicates, with all confidence intervals including 0.

4. Clustering of θ-parameters

The second step of the procedure consists of clustering a set of estimates of θ values (in this case, posterior means) from a multi-locus analysis into data driven groups. The expectation is that these clusters might be representative of different processes taking place in the populations such as balancing or directional selection, neutrality or anything else.

The method is illustrated with data from a study of Petit et al. (Reference Petit, El Mousadik and Pons1998) in which alleles were sampled for 12 isozyme loci of the Argania genus tree in each of 12 areas (populations) of Morocco. The data, given in page 847 of Petit et al. (Reference Petit, El Mousadik and Pons1998), were modified as shown in Table 1. The modification consisted of treating all loci as bi-allelic by lumping alleles for loci with more than two variants into two classes. The number of individuals sampled per population ranged between 20 and 50, and the number of alleles per locus varied originally between 2 and 5. Note that, at some loci, one of the alleles was fixed in almost all populations. For example, for locus 3, the only population in which segregation was observed was TA.

Table 1. Allelic frequencies at 12 isozyme loci in each of 12 Argan tree populations, adapted from Petit et al. (Reference Petit, El Mousadik and Pons1998) by making all loci bi-allelic. A1–A12 represent frequencies of the ‘A’ allele at loci 1–12; No. A1–No. A12 are the observed number of copies of the alleles. The number of ‘a’ alleles can be calculated from the number of individuals samples and the number of ‘A’ alleles observed

For each locus, 2000 samples were drawn from the beta posterior distributions of allelic frequencies. For example, the posterior distribution of p _AB _, ₁ was Beta(21·5,19·5). From these samples, 2000 draws from the posterior distribution of θ for each locus were formed as in (10). The posterior means were as follows:

$\left[ {\matrix{ {\theta _{\setnum{1}} } \tab {\theta _{\setnum{2}} } \tab {\theta _{\setnum{3}} } \tab {\theta _{\setnum{4}} } \tab {\theta _{\setnum{5}} } \tab {\theta _{\setnum{6}} } \tab {\theta _{\setnum{7}} } \tab {\theta _{\setnum{8}} } \tab {\theta _{\setnum{9}} } \tab {\theta _{\setnum{10}} } \tab {\theta _{\setnum{11}} } \tab {\theta _{\setnum{12}} } \cr {0{\cdot}098\,} \tab {0{\cdot}168\,} \tab {0{\cdot}791\,} \tab {0{\cdot}166\,} \tab {0{\cdot}393\,} \tab {0{\cdot}137} \tab {0{\cdot}299} \tab {0{\cdot}382} \tab {0{\cdot}593} \tab {0{\cdot}095} \tab {0{\cdot}122} \tab {0{\cdot}190} \cr} } \right]$

so estimates of θ varied over loci from about 0·095 (locus 10) to 0·791 (locus 3); all these estimates did not enter into the corresponding ‘null’ distributions. Box plots of the posterior distributions of the θ parameters are in Fig. 7. Visually, it is tempting to suggest four clusters: the first one would include locus 3, with the posterior mean of θ close to 0·79; the second cluster would include locus 9, with an estimate of θ of 0·59. The third cluster would include loci 5, 7 and 8 with estimates ranging between 0·30 and 0·39, and the fourth cluster would be represented by loci 1, 2, 4, 6, 10, 11 and 12 having the lowest estimates of θ.

Fig. 7. Box plot of the posterior distributions of θ-parameters in 12 isozyme loci of the argan tree in Morocco (data originally from Petit et al. Reference Petit, El Mousadik and Pons1998).

The existence of an underlying structure is suggested by the distribution of all 24 000 samples, presented in Fig. 8. In the left panel, a non-parametric density estimate was obtained from these samples treated as if all draws (2000 for each of the 12 loci) had been made from the same process; the densities in the middle and right panels correspond to the logit, i.e., log(θ/(1−θ)), and Gompit, −log(−log(θ)), transforms of the sampled θ values, respectively. The three densities suggest that θ values cluster around 3, perhaps 4, modes.

Fig. 8. Non-parametric density estimates of θ values (based on 2000 sample for each of 12 loci), logit(θ) and Gompit(θ). All samples treated as homogeneous, i.e. as generated from the same stochastic process.

The structure was explored more formally by fitting a sequence of finite mixture models to the means of the posterior distribution of the θ values for each of the 12 loci. These posterior means are independent (under the assumptions made for the allelic frequency models), but not identically distributed, since they are estimated with different precision, due to unequal numbers of individuals sampled and varying allelic frequencies. The distributions of θ values among loci are not normal (the logit and Gompit transforms would be expected to be more nearly so). This should not be an issue because the mixture model was not used for testing hypotheses; its objective, rather, was to explore a clustered structure. Since there are only 12 posterior means, the mixture models must have less than 12 parameters; otherwise, a perfect fit would be obtained. The mixture model fitted to the posterior mean estimates $\bars{\theta\, }_{l}$ postulated that

$\eqalign{\tab\bars{\theta\, }_{l} \ {\rm or} \ {\log}\left( {{{\bars{\theta\, }_{l} } \over {1 \minus \bars{\theta\, }_{l} }}} \right) \ {\rm or} \ \minus {\log\lpar } \minus \log\lpar \bars{\theta\, }_{l} \rpar {\rm \rpar } \cr \tab \sim \mathop\sum\limits_{k \equals \setnum{1}}^{K} {\pi_{k} N\lpar \bars{\theta\, }_{l} \vert \mu _{k} \comma \sigma _{k}^{\setnum{2}} \rpar } \comma}$

where K is the number of components of the mixture (clusters of posterior means of θ values or transforms thereof), π_k is the probability that $\bars{\theta\, }_{l}$ belongs to cluster k (subject to $\sum\nolimits_{k \equals \setnum{1}}^{K} {\pi _{k} \equals 1}$ ), and μ_k and σ_k² are the mean and variance, respectively, of component k. For example, if k=2, there are 5 ‘free’ parameters in the mixture; if k=4, there are 11 such parameters, so it is not sensible to fit a model with more than 4 components. Mixture model parameters were estimated by maximum likelihood via the expectation–maximization algorithm as implemented in the FlexMix package (Leisch, Reference Leisch2004) in the R project (R development core team, 2008). Upon convergence (assuming the stationary point was a global maximum), the conditional probability that $\bars{\theta\, }_{l}$ (or its transformation) belongs to cluster k is calculated as

$\eqalign{\tab{\rm Pr}\left( {{\rm locus} \ l \in {\rm cluster} \ k\vert {\rm parameter} \ {\rm estimates}} \right) \equals \cr \tab \quad {{\hat{\pi }_{k} N\lpar \bars{\theta\, }_{l} \vert \hat{\mu}_k\comma \hat{\sigma }_{k} ^{\setnum{2}} \rpar } \over {\sum\nolimits_{k \equals \setnum{1}}^{K} {\hat{\pi }_{k} N\lpar \bars{\theta\, }_{l} \vert \hat{\mu }_{k}\comma \hat{\sigma }_{k} ^{\setnum{2}} \rpar } }}.}$

The locus was assigned to the cluster with the largest conditional probability. Models with different values of k=1, 2, 3 and 4 were compared using Akaike's information criterion (AIC), that is

${\rm AIC}\left( K \right) \equals 2\left[ {p_{K} \minus \mathop{\sum}\limits_{l \equals \setnum{1}}^{\setnum{12}} {\log \left( {\mathop{\sum}\limits_{k \equals \setnum{1}}^{K} {\hat{\pi }_{k} N\lpar \bars{\theta\, }_{l} \vert \hat{\mu }_{k} \comma \hat{\sigma }_{k} ^{\setnum{2}} \rpar } } \right)} } \right]\comma$

where p _K is the number of parameters for a model with K components (McLachlan & Peel, Reference McLachlan and Peel2000). Models with the smallest AIC values are preferred. It is known that this criterion tends to overstate the number of components due to violation of regularity conditions in mixture models (Celeux & Soromenho, Reference Celeux and Soromenho1996).

Results of the mixture model analysis, by number of components fitted, are shown in Table 2. The AIC favoured a mixture with two clusters when the response was either θ or its Gompit transform, and a single component when the logit transformation was used. Clearly, with data from only 12 loci, the analyses did not have enough power to resolve heterogeneity in a finer manner. This would certainly not be the case with SNP data, where the number of marker loci typically oscillates between a few thousands in some animal species to close to a million in humans. Classification probabilities using K=2 and estimates of cluster mean and standard deviation are shown in Table 3. Irrespective of whether θ values were transformed or not, loci were clustered into two groups, one consisting of loci 3, 5, 7, 8 and 9, possibly reflecting a selection signature, and the other one including the remaining loci, presumably representing neutral loci. The maximum likelihood estimates of the mean and variance of θ values in the cluster with loci 3, 5, 7, 8 and 9 were 0·41±0·21, whereas the corresponding estimates in the other cluster were 0·12±0·03. This assignment into clusters is consistent with the picture emerging from visual consideration of the box plots in Fig. 7.

Table 2. Comparison of mixture models with 2, 3 or 4 components fitted to the 12 posterior means of θ-parameters and their logit or Gompit transforms in the argan tree data of Petit et al. (Reference Petit, El Mousadik and Pons1998). AIC (models with smallest values are favoured and indicated in boldface)

Table 3. Conditional probabilities of membership to one of two clusters for mixture models fitted to the posterior means of θ for the 12 loci in the argan tree, and their logit, log(θ/1−θ), and Gompit, −log(−log(θ)), transformations (boldfaced probability indicates the cluster with largest probability of membership)

The two-step procedure is simple and does not require the tailoring of problem-specific software. However, it has the drawback of not taking into account the uncertainty associated with the posterior distributions of the θ-parameters, inferred in the first step. In principle, a better approach is to feed the entire set of posterior samples to the clustering procedure, such that not only the location of the posterior distributions of the θs is considered but also their uncertainty as well. Although this is very appealing conceptually, it may create difficulties with the EM algorithm, leading to convergence failure. For instance Qanbari et al. (personal communication) employed the procedure with posterior means (each calculated with 1 million samples from the corresponding posterior distribution) with about 35 000 SNPs in Hereford and Simmental cattle. When posterior means were used as data, the mixture model approach revealed the existence of 4–5 clusters. However, when the 35 million samples were used as data points, the Expectation–Maximization (EM) algorithm, as implemented in FlexMix, failed to converge. An alternative to using the entire collection of samples is to feed a selected set of percentiles of the posterior distribution of each θ_l, so that a proxy for the dispersion of the individual posterior distributions enters into the analysis.

5. Discussion

The use of F-statistics for the study of genetic divergence between population dates back to Wright (Reference Wright1931). Holsinger & Weir (Reference Holsinger and Weir2009) have provided a justification for their usefulness, e.g., in association mapping and in detecting genomic regions affected by evolutionary processes, such as selection. These authors also reviewed different types of statistical methods for inferring F _ST, including Bayesian procedures. Method of moments estimation was prompted by the linear model formalism of Cockerham (Reference Cockerham1969, Reference Cockerham1973), and a review is in Weir & Hill (Reference Weir and Hill2002). There has been an increased interest in Bayesian methods, and important contributions in this front have been made by Holsinger (Reference Holsinger1999, Reference Holsinger, Clark and Gelfand2006), Beaumont & Balding (Reference Beaumont and Balding2004) and Guo et al. (Reference Guo, Dey and Holsinger2009).

In the Bayesian approaches that have been suggested, e.g., Holsinger (Reference Holsinger1999), the model poses a product binomial (or product multinomial in the case of multiple alleles) likelihood function for allelic frequencies, with conjugate prior distributions, such as beta or Dirichlet processes. Marginalizing over the allelic frequencies yields the beta binomial or Dirichlet-multinomial distributions used by Balding (Reference Balding2003) for likelihood-based inference. Holsinger (Reference Holsinger1999) matched the mean and variance of, e.g., the beta distribution, to the definition of θ, and obtained a joint posterior distribution which is a function of the unknown allelic frequencies, of θ (assumed exchangeable over all loci) and of the mean allelic frequencies in an undivided population. The implementation, as well as those of Beaumont & Balding (Reference Beaumont and Balding2004) and of Guo et al. (Reference Guo, Dey and Holsinger2009) requires MCMC. While the power and flexibility of hierarchical models coupled with MCMC are well known (Sorensen & Gianola, Reference Sorensen and Gianola2002), implementations are not trivial and monitoring of convergence to the equilibrium distribution is a delicate matter (Cowles & Carlin, Reference Cowles and Carlin1996). The idea in these methods is that, under a neutral model, all θ (over loci) should be realizations of the same stochastic process. Outlying θ values may be suggestive of genomic regions affected by selection. Typically, it is argued that loci are either neutral, subject to balancing selection or to directional selection favouring alleles in specific environments, e.g. Akey et al. (Reference Akey, Zhang, Zhang, Jin and Shriver2002). However, the assignment of loci to specific types of processes is often arbitrary.

The present paper follows ideas of Holsinger (Reference Holsinger1999), but it differs in two important respects. The proposed method has two steps. First, allelic frequencies are assigned a non-informative prior, so that the mutual borrowing of information between loci is limited, leading to less shrinkage of frequencies towards a common value; in maximum likelihood there is no shrinkage at all, an issue criticized by Haldane (Reference Haldane1948). Samples of allelic frequencies can be obtained directly (actually, their posterior distributions are tractable, analytically), and these draws are used to form draws from the posterior distribution of locus-specific θ-parameters, using the parametric definition of F _ST as a function of allelic frequencies. The first step was illustrated with hypothetical data and with type II diabetes data in Myles et al. (Reference Myles, Hradetzky, Engelken, Lao, Nürnberg, Trent, Wang, Kayser and Stoneking2007). The step leads to estimates of the posterior distribution of the θs, which can be used to explore underlying structure, presumably caused by different evolutionary forces. In the second step, the structure is explored by using features of the posterior distribution of the θs (posterior means or transformations thereof) as response variables in a mixture model. Data from Petit et al. (Reference Petit, El Mousadik and Pons1998) on 12 isozyme loci in 12 populations of the argan tree in Morocco were used to illustrate the second step. Here, the posterior means of θ are treated as belonging to a mixture of normal distributions, which is then resolved into data-supported components. Since the final objective is that of clustering loci according to their similarity in θ values, departures from normality are arguably of little consequence. Here, logit and Gompit transformations were examined, and the clustering procedure produced exactly the same results. Using AIC as a gauge for model comparison, it was suggested that the 12 estimates of θ clustered into two groups, one representing putatively neutral loci (provided that this group reflects variation due to drift), and another one possibly corresponding to genomic regions affected by selection. With 12 loci only, it is unreasonable to expect a finer clustering structure. An ongoing study is applying the two-step procedure to large-scale SNP data in an animal population and this will be reported in a future communication.

As mentioned earlier in the paper, the two-step procedure has the disadvantage of not incorporating the uncertainty about the posterior distributions inferred in the first step. Although this can be remedied by using all posterior samples as input into the mixture model analysis, it can create numerical difficulties with the EM algorithm. This is an issue that needs additional research.

The method proposed here extends naturally to multiple alleles. In this case the likelihood is product multinomial, and the beta prior distribution is replaced by a Dirichlet distribution with minimum information content. The posterior distribution of the allelic frequencies is product Dirichlet, which is simple to sample from. Then, samples from the posterior distribution of θ_l would be drawn by evaluation of formulae similar to those in Nei (Reference Nei1973) where θ-parameters are averaged over alleles. For example, one could define

$\theta _{l} \equals {\mathop\sum_{m \equals \setnum{1}}^{M}} {\bar{p}_{l\comma m} } {{{\textstyle{{\sum\nolimits_{r \equals \setnum{1}}^{R} {\lpar p_{r\comma l\comma m} \minus \bar{p}_{l\comma m} \rpar ^{\setnum{2}} } } \over R}}} \over {\bar{p}_{l\comma m} \lpar 1 \minus \bar{p}_{l\comma m} \rpar }} \equals \mathop\sum\limits_{m \equals \setnum{1}}^{M} {{{\sum\nolimits_{r \equals \setnum{1}}^{R} {\lpar p_{r\comma l\comma m} \minus \bar{p}_{l\comma m} \rpar ^{\setnum{2}} } } \over {R\lpar 1 \minus \bar{p}_{l\comma m} \rpar }}} \comma$

where p _r,l,m is the frequency of allele m at locus l in population r and $\bar{p}_{l\comma m}$ is the unweighted average over the R populations.

In common with the studies of Holsinger (Reference Holsinger1999), Beaumont & Balding (Reference Beaumont and Balding2004), Weir et al. (Reference Weir, Cardon, Anderson, Nielsen and Hill2005) and Guo et al. (Reference Guo, Dey and Holsinger2009), the procedure presented here assumes that allelic frequencies are in linkage equilibrium, so that the likelihood of all allelic frequencies is either product binomial or product multinomial. Accommodating linkage disequilibrium, especially with dense batteries of marker loci, represents a formidable task and it is a challenge for future research. For example, Akey et al. (Reference Akey, Zhang, Zhang, Jin and Shriver2002) and Weir et al. (Reference Weir, Cardon, Anderson, Nielsen and Hill2005) reported that θ values of loci in regions of high linkage disequilibrium were similar. Guo et al. (Reference Guo, Dey and Holsinger2009) address correlations due to linkage, but not due to linkage disequilibrium, and do so by introducing a spatial structure for loci located in the same chromosome. Specifically, they proposed an autoregressive model in which logit transforms of θ values are correlated according to physical distance. The model is quite involved and requires MCMC computations. However, loci may be in linkage disequilibrium even though not being physically linked (Crow & Kimura, Reference Crow and Kimura1970), and such disequilibrium is very common in animal populations (Sandor et al., Reference Sandor, Farnir, Hansoul, Coppieters, Meuwissen and Georges2006; de Roos et al., Reference de Roos, Hayes, Spelman and Goddard2008; Lipkin et al., Reference Lipkin, Straus, Stein, Bagnato, Schiavini, Fontanesi, Russo, Medugorac, Foerster, Sölkner, Dolezal, Medrano, Friedmann and Soller2009; Qanbari et al., Reference Qanbari, Pimentel, Tetens, Thaller, Lichtner, Sharifi and Simianer2009), where finite size and selection under epistasis are factors in building up linkage disequilibrium. The two-step approach considered here could be enhanced by exploring algorithms alternative to EM as well as by consideration of different types of mixtures, e.g., of beta distributions, which are more appropriate for random variables taking values in (0,1).

Part of this work was carried out while the senior author was a Visiting Professor at Georg-August-Universität, Göttingen (Alexander von Humboldt Foundation Senior Researcher Award). Support by the Wisconsin Agriculture Experiment Station, and by grant NSF DMS-NSF DMS-044371 is acknowledged. This study is part of the project FUGATO-plus GenoTrack and was financially supported by the German Ministry of Education and Research, BMBF, the Föderverein Biotechnologieforschung e.V. (FBF), Bonn, and Lohmann Tierzucht GmbH, Cuxhaven. S. Qanbari thanks the H. Wilhelm Schaumann Stiftung Hamburg for financial support. Professor W. G. Hill and three anonymous reviewers are thanked for useful comments.

Appendix A: First derivatives of θ with respect to allelic frequencies

Let $\mathop {\overline{p} }\nolimits_{.\comma l} \equals {{\sum\nolimits_{r \equals \setnum{1}}^{R} {p_{r\comma l} } } \over R}.$ From (4), the derivative is

(A.1)

$\openup-4\eqalign{ {\partial \over {\partial p_{r\comma l} }}\theta _{l} \equals \tab {1 \over {\left( {{\textstyle{{R\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} \minus \mathop {\left( {\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} } \right)}\nolimits^{\setnum{2}} } \over R}}} \right)}}\left( {2p_{r\comma l} \minus {{2\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} } \over R}} \right)\cr\tab \minus {{\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l}^{\setnum{2}} \minus {\textstyle{{\mathop {\left( {\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} } \right)}\nolimits^{\setnum{2}} } \over R}}} \over {\mathop {\left( {{\textstyle{{R\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} \minus \mathop {\left( {\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} } \right)}\nolimits^{\setnum{2}} } \over R}}} \right)}\nolimits^{\setnum{2}} }}\left( {{{R \minus 2\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} } \over R}} \right)\cr\equals\tab {{2\theta _{l} } \over {\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l}^{\setnum{2}} \minus {\textstyle{{\mathop {\left( {\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} } \right)}\nolimits^{\setnum{2}} } \over R}}}}\left( {p_{r\comma l} \minus {{\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} } \over R}} \right)\cr\tab \minus {{\theta _{l} } \over {\left( {{\textstyle{{R\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} \minus \mathop {\left( {\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} } \right)}\nolimits^{\setnum{2}} } \over R}}} \right)}}\left( {{{R \minus 2\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} } \over R}} \right)\hskip22pt\eqalign{\hskip-10pt\equals \tab \left[ {{{2\left( {p_{r\comma l}\hskip-1 \minus {\textstyle{{\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} } \over R}}} \right)} \over {\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l}^{\setnum{2}}\hskip-1 \minus {\textstyle{{\mathop {\left( {\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} } \right)}\nolimits^{\setnum{2}} } \over R}}}}\hskip-1 \minus\hskip-1 {{\left( {1 \hskip-1\minus {\textstyle{{2\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} } \over R}}} \right)} \over {\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} \hskip-1\minus {\textstyle{{\mathop {\left( {\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} } \right)}\nolimits^{\setnum{2}} } \over R}}}}} \right]\theta _{l}\hskip-22pt \cr \equals \tab \left[ {{{2\left( {p_{r\comma l} \minus \mathop {\overline{p} }\nolimits_{.\comma l} } \right)} \over {\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l}^{\setnum{2}} \minus {\textstyle{{\mathop {\left( {R\mathop {\overline{p} }\nolimits_{.\comma l} } \right)}\nolimits^{\setnum{2}} } \over R}}}} \minus {{\left( {1 \minus 2\mathop {\overline{p} }\nolimits_{.\comma l} } \right)} \over {\sum\nolimits_{r \equals \setnum{1}}^{R} \,p_{r\comma l} \minus {\textstyle{{\mathop {\left( {R\mathop {\overline{p} }\nolimits_{.\comma l} } \right)}\nolimits^{\setnum{2}} } \over R}}}}} \right]\theta _{l}\hskip-22pt \cr \equals \tab \left[ {{{2\left( {p_{r\comma l} \minus \mathop {\overline{p} }\nolimits_{.\comma l} } \right)} \over {R\left( {\overline{{p_{.\comma l}^{\setnum{2}} }} \minus \mathop {\overline{p} }\nolimits_{.\comma l}^{\setnum{2}} } \right)}} \minus {{\left( {1 \minus 2\mathop {\overline{p} }\nolimits_{.\comma l} } \right)} \over {R\left( {\mathop {\overline{p} }\nolimits_{.\comma l} \minus \mathop {\overline{p} }\nolimits_{.\comma l}^{\setnum{2}} } \right)}}} \right]\theta _{l}\hskip-22pt \cr \equals \tab \left[ {{{2\left( {p_{r\comma l} \minus \mathop {\overline{p} }\nolimits_{.\comma l} } \right)} \over {\overline{{p_{.\comma l}^{\setnum{2}} }} \minus \mathop {\overline{p} }\nolimits_{.\comma l}^{\setnum{2}} }} \minus {{\left( {1 \minus 2\mathop {\overline{p} }\nolimits_{.\comma l} } \right)} \over {\mathop {\overline{p} }\nolimits_{.\comma l} \left( {1 \minus \mathop {\overline{p} }\nolimits_{.\comma l} } \right)}}} \right]{{\theta _{l} } \over R}\hskip-22pt \cr}\hskip-22pt$

Appendix B: Approximate Bayesian analysis

An approximate Bayesian analysis without sampling from the posterior distribution is also possible. An approximation to the mean and variance of the posterior distribution of θ_l can be obtained using a Taylor series expansion about the modes ${\tilde {p}}_{r\comma l}$ of the allelic frequencies. Let now $\skew-6\widetilde\nabla\equals \mathop {\left\{ {{\textstyle{\partial \over {\partial p_{r\comma l} }}}\theta _{l} } \right\}}\nolimits_{p_{{r\comma l \equals }} \equals \mathop {\tilde p}\nolimits_{{r\comma l}} }$ be an R×1 vector of first derivatives evaluated at the posterior mode estimates (8) of the allelic frequencies. Then, approximately

$\theta _{l} \approx \mathop {\skew6\widetilde\theta }\nolimits_{l} \plus \mathop {\bf {\skew-6\widetilde\nabla} }{\prime } \left( {{\bf p}_{l} \minus \widetilde{{\bf p}_{l} }} \right)\comma$

where $\tilde{\bf p}_{l}$ is the vector of posterior mode estimates of allele frequencies in the R groups. Then, approximately

(B.1)

$\eqalign{\tab \hskip-12ptE\left( {\theta _{l} \vert {\rm DATA}} \right)\cr\approx\tab \mathop {\widetilde\theta }\nolimits_{l} \plus \mathop{\sum}\limits_{r \equals \setnum{1}}^{R} \,\mathop {\left\{ {{\partial \over {\partial p_{r\comma l} }}\theta _{l} } \right\}}\nolimits_{p_{{r\comma l}} \equals \mathop {\tilde p}\nolimits_{{r\comma l}} } \left( {{{n_{r\comma A_{l} } \plus {\textstyle{1 \over 2}}} \over {2n_{r} \plus 1}} \minus {{n_{r\comma A_{l} } \minus {\textstyle{1 \over 2}}} \over {2n_{r} \minus 1}}} \right) \cr \equals \tab\mathop {\widetilde\theta }\nolimits_{l} \plus \mathop{\sum}\limits_{r \equals \setnum{1}}^{R} \,\mathop {\left\{ {\left[ {{{2\left( {p_{r\comma l} \minus \mathop {\overline{p} }\nolimits_{.\comma l} } \right)} \over {\overline{{p_{.\comma l}^{\setnum{2}} }} \minus \mathop {\overline{p} }\nolimits_{.\comma l}^{\setnum{2}} }} \!\minus\! {{\left( {1 \minus 2\mathop {\overline{p} }\nolimits_{.\comma l} } \right)} \over {\mathop {\overline{p} }\nolimits_{.\comma l} \left( {1 \minus \mathop {\overline{p} }\nolimits_{.\comma l} } \right)}}} \right]{{\theta _{l} } \over R}} \right\}}\nolimits_{_{{p_{{r\comma l}} \equals \mathop {\tilde p}\nolimits_{{r\comma l}} }\hskip-32pt} }\cr\tab\times \left( {{{n_{r\comma A_{l} } \plus {\textstyle{1 \over 2}}} \over {2n_{r} \plus 1}} \minus {{n_{r\comma A_{l} } \minus {\textstyle{1 \over 2}}} \over {2n_{r} \minus 1}}} \right).\cr}\hskip-32pt$

Likewise

${\rm Var}\left( {\theta _{l} \vert {\rm DATA}} \right) \approx \mathop {\bf {\skew-6\widetilde\nabla}\prime } {\rm Var}\lpar {{\bf p}_{l} \vert {\rm DATA}} \rpar{\bf {\skew-6\widetilde\nabla}}.$

Since allelic frequencies have mutually independent distributions, the R×R variance–covariance matrix Var (p_l|DATA) is diagonal with elements given by (9). Thus

(B.2)

${\rm Var}\lpar {\theta _{l} \vert {\rm DATA}} \rpar \approx \mathop {\skew-6\widetilde\nabla \prime } {\rm Diag}\left[ {{{\left( {n_{r\comma A_{l} } \plus {\textstyle{1 \over 2}}} \right)\left( {n_{r\comma a_{l} } \plus {\textstyle{1 \over 2}}} \right)} \over {\mathop {\left( {2n_{r} \plus 1} \right)}\nolimits^{\setnum{2}} \left( {2n_{r} \plus 2} \right)}}} \right]\skew-6\widetilde\nabla\prime.$

In short, each θ_l (l=1, 2, …, L) statistic will have a point estimate and an assessment of uncertainty, e.g., a credibility interval of size 95% given by the 2·5% and 97·5% percentiles of the corresponding posterior distribution estimated from samples, or from using a normal theory approximation, e.g.,

$\eqalign{ \tab \mathop {\widetilde\theta }\nolimits_{l} \plus 2\mathop\sum\limits_{r \equals \setnum{1}}^{R} \,\mathop {\left\{ {\left[ {{{2\left( {p_{r\comma l} \minus \mathop {\bar{p} }\nolimits_{l} } \right)} \over {\overline{{p_{l}^{\setnum{2}} }} \minus \mathop {\bar{p} }\nolimits_{l}^{\setnum{2}} }} \minus {{\left( {1 \minus 2\mathop {\bar{p} }\nolimits_{l} } \right)} \over {\mathop {\bar{p} }\nolimits_{l} \left( {1 \minus \mathop {\bar{p} }\nolimits_{l} } \right)}}} \right]{{\theta _{l} } \over R}} \right\}}\nolimits_{_{{p_{{r\comma l}} \equals {\hatess{p}}\nolimits_{{r\comma l}} }} } \left( {{{n_{r\comma A_{l} } \plus {\textstyle{1 \over 2}}} \over {2n_{r} \plus 1}} \minus {{n_{r\comma A_{l} } \minus {\textstyle{1 \over 2}}} \over {2n_{r} \minus 1}}} \right) \cr \tab \pm 1{\cdot}96\sqrt {\mathop\sum\limits_{r \equals \setnum{1}}^{R} \,\mathop {\left\{ {\left[ {{{2\left( {p_{r\comma l} \minus \mathop {\bar{p} }\nolimits_{l} } \right)} \over {\overline{{p_{l}^{\setnum{2}} }} \minus \mathop {\bar{p} }\nolimits_{l}^{\setnum{2}} }} \minus {{\left( {1 \minus 2\mathop {\bar{p} }\nolimits_{l} } \right)} \over {\mathop {\bar{p} }\nolimits_{l} \left( {1 \minus \mathop {\bar{p} }\nolimits_{l} } \right)}}} \right]{{\theta _{l} } \over R}} \right\}}\nolimits_{_{{p_{{r\comma l}} \equals {\hatess p}\nolimits_{{r\comma l}} }} }^{\setnum{2}} {{\left( {n_{r\comma A_{l} } \plus {\textstyle{1 \over 2}}} \right)\left( {n_{r\comma a_{l} } \plus {\textstyle{1 \over 2}}} \right)} \over {\mathop {\left( {2n_{r} \plus 1} \right)}\nolimits^{\setnum{2}} \left( {2n_{r} \plus 2} \right)}}.} \cr}$

References

Akey, J. M. (2009). Constructing genomic maps of positive selection in humans: where do we go from here? Genome Research 19, 711–722.CrossRef Google Scholar

Akey, J. M., Zhang, G., Zhang, K., Jin, L. & Shriver, M. D. (2002). Interrogating a high-density SNP map for signatures of natural selection. Genome Research 12, 1805–1814.CrossRef Google Scholar PubMed

Balding, D. J. (2003). Likelihood-based inference for genetic correlations. Theoretical Population Biology 63, 221–230.CrossRef Google Scholar

Beaumont, M. A. & Balding, D. J. (2004). Identifying adaptive genetic divergence among populations from genome scans. Molecular Ecology 13, 969–980.CrossRef Google Scholar PubMed

Bernardo, J. M. & Smith, A. F. M. (1994). Bayesian Theory. Chichester: Wiley.CrossRef Google Scholar

Cavalli-Sforza, L. L. (1966). Population structure and human evolution. Proccedings of the Royal Society London B. Biological Sciences 164, 362–379.Google Scholar PubMed

Celeux, G. & Soromenho, G. (1996). An entropy criterion for assessing the number of clusters in a mixture model. Classification Journal 13, 195–212.CrossRef Google Scholar

Cockerham, C. C. (1969). Variance of gene frequencies. Evolution 23, 72–84.CrossRef Google Scholar PubMed

Cockerham, C. C. (1973). Analyses of gene frequencies. Genetics 74, 679–700.CrossRef Google Scholar PubMed

Corander, J., Waldmann, P. & Sillanpää, M. (2003). Bayesian analysis of genetic differentiation between populations. Genetics 163, 367–374.CrossRef Google Scholar PubMed

Cowles, M. K. & Carlin, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association 91, 883–904.CrossRef Google Scholar

Crow, J. F. & Kimura, M. (1970). An Introduction to Population Genetics Theory. Caldwell: Blackburn Press.Google Scholar

de Roos, A. P. W., Hayes, B. J., Spelman, R. J. & Goddard, M. E. (2008). Linkage disequilibrium and persistence of phase in Holstein-Friesian, Jersey and Angus cattle. Genetics 179, 1503–1512.CrossRef Google Scholar PubMed

Guo, F., Dey, D. K. & Holsinger, K. E. (2009). A Bayesian hierarchical model for analysis of single nucleotide polymorphisms, diversity in multilocus, multipopulation samples. Journal of the American Statistical Association 104, 142–154.CrossRef Google Scholar PubMed

Haldane, J. B. S. (1948). The precision of observed values of small frequencies. Biometrika 35, 297–303.CrossRef Google Scholar

Holsinger, K. E. (1999). Analysis of genetic diversity in geographically structured populations: a Bayesian perspective. Hereditas 130, 245–255.CrossRef Google Scholar

Holsinger, K. E. (2006). Bayesian hierarchical models in geographical genetics. In: Applications of Computational Statistics in the Environmental Sciences (ed. Clark, J. S. & Gelfand, A. E.), pp. 25–37. New York: Oxford University Press.Google Scholar

Holsinger, K. E. & Weir, B. S. (2009). Genetics in geographically structured populations: defining, estimating and interpreting F _ST. Nature Review Genetics 10, 639–650.CrossRef Google Scholar PubMed

Leisch, F. (2004). FlexMix: a general framework for finite mixture models and latent class regression in R. Journal of Statistical Software 11, 1–18.CrossRef Google Scholar

Lewontin, R. C. & Krakauer, J. (1973). Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics 74, 175–195.CrossRef Google Scholar PubMed

Lipkin, E., Straus, K., Stein, T., Bagnato, A., Schiavini, F., Fontanesi, L., Russo, V., Medugorac, M., Foerster, M., Sölkner, J., Dolezal, M., Medrano, M. F., Friedmann, A. & Soller, M. (2009). Extensive long-range and non-syntenic linkage disequilibrium in livestock populations. Genetics 181, 691–699.CrossRef Google Scholar

Myles, S., Hradetzky, E., Engelken, J., Lao, O., Nürnberg, P., Trent, R. J., Wang, X., Kayser, M. & Stoneking, M. (2007). Identification of a candidate genetic variant for the high prevalence of type II diabetes in Polynesians. European Journal of Human Genetics 15, 584–589.CrossRef Google Scholar PubMed

McLachlan, G. & Peel, D. (2000). Finite Mixture Models. New York: Wiley.CrossRef Google Scholar

Nei, M. (1973). Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences of the USA 70, 3321–3323.CrossRef Google Scholar PubMed

Petit, R. J., El Mousadik, A. & Pons, O. (1998). Identifying populations for conservation on the basis of genetic markers. Conservation Biology 12, 844–855.CrossRef Google Scholar

Qanbari, S., Pimentel, E. C. G., Tetens, J., Thaller, G., Lichtner, P., Sharifi, A. R. & Simianer, H. (2009). The pattern of linkage disequilibrium in German Holstein cattle. Animal Genetics doi: 1111/j.1365-2052.2.2009.02011.X.Google Scholar

R Development Core Team (2008). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0. Available at http://www.R-project.org Google Scholar

Robertson, A. (1975). Gene frequency distributions as a test of selective neutrality. Genetics 81, 775–785.CrossRef Google Scholar PubMed

Sabeti, P. C., Schaffner, S. F., Fry, B., Lohmueller, J., Varilly, P., Shamovsky, S., Palma, A., Mikkelsen, T. S., Altshuler, D. & Lander, E. S. (2006). Positive natural selection in the human lineage. Science 312, 1614–1620.CrossRef Google Scholar PubMed

Sandor, C., Farnir, F., Hansoul, S., Coppieters, W., Meuwissen, T. & Georges, M. (2006). Linkage disequilibrium on the bovine X chromosome: characterization and use in quantitative trait locus mapping. Genetics 173, 1777–1786.CrossRef Google Scholar PubMed

Sorensen, D. & Gianola, D. (2002). Likelihood, Bayesian and MCMC Methods in Quantitative Genetics. New York: Springer.CrossRef Google Scholar

Weir, B. S., Cardon, L. R., Anderson, A. D., Nielsen, D. M. & Hill, W. G. (2005). Measures of human population structure show heterogeneity among genomic regions. Genome Research 15, 1468–1476.CrossRef Google Scholar PubMed

Weir, B. S. & Cockerham, C. C. (1984). Estimating F-statistics for the analysis of population structure. Evolution 38, 1358–1370.Google Scholar PubMed

Weir, B. S. & Hill, W. G. (2002). Estimating F-statistics. Annual Review of Genetics 36, 721–750.CrossRef Google Scholar PubMed

Wright, S. (1931). Evolution in Mendelian populations. Genetics 16, 97–159.CrossRef Google Scholar PubMed

Wright, S. (1951). The genetical structure of populations. Annals of Eugenics 15, 323–354.CrossRef Google Scholar PubMed

Fig. 1. Posterior density (thick line) of the allelic frequency p at a locus for which 199 copies have been observed out of 200 alleles counted in hypothetical population M; the posterior distribution is Beta\left( {199 \plus {\textstyle{1 \over 2}}\comma 1 \plus {\textstyle{1 \over 2}}} \right). The thin line is the density of a normal approximation to the sampling distribution of the maximum likelihood estimator.

Fig. 2. Posterior density of θl for the hypothetical example of populations M and N.

Fig. 3. Empirical cumulative distribution function of θl for the hypothetical example of populations M and N.

Fig. 4. Posterior density of the allelic frequency p under the ‘null’ model for hypothetical populations M and N: 209 copies of Al are observed out of 260 alleles screened.

Fig. 5. Posterior density of θl under the null model for the hypothetical example of populations M and N.

Fig. 6. Density of the posterior distribution of θKCNJ11 obtained from allelic frequencies in Myles et al. (2007).

Table 1. Allelic frequencies at 12 isozyme loci in each of 12 Argan tree populations, adapted from Petit et al. (1998) by making all loci bi-allelic. A1–A12 represent frequencies of the ‘A’ allele at loci 1–12; No. A1–No. A12 are the observed number of copies of the alleles. The number of ‘a’ alleles can be calculated from the number of individuals samples and the number of ‘A’ alleles observed

Fig. 7. Box plot of the posterior distributions of θ-parameters in 12 isozyme loci of the argan tree in Morocco (data originally from Petit et al.1998).

Table 2. Comparison of mixture models with 2, 3 or 4 components fitted to the 12 posterior means of θ-parameters and their logit or Gompit transforms in the argan tree data of Petit et al. (1998). AIC (models with smallest values are favoured and indicated in boldface)

Article contents

A two-step method for detecting selection signatures using genetic markers

Summary

1. Introduction

2. Background

(i) Basic concepts

3. Estimation of parameters

(i) Inferring gene frequencies

(ii) Inferring θ by maximum likelihood

(iii) Bayesian inference of θ

(iv) A Bayesian ‘null’ distribution for assessing sampling variation uncertainty

(v) Illustration of sampling variation with candidate genes for type-II diabetes

4. Clustering of θ-parameters

5. Discussion

Appendix A: First derivatives of θ with respect to allelic frequencies

Appendix B: Approximate Bayesian analysis

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests