1. Introduction
Mutation rate is one of the most important parameters in genetics (Crow, Reference Crow2000). It can provide crucial information about the evolution of genome sequence composition. Unfortunately, direct estimation of mutation rates in higher eukaryotes is not as easy as in microbes (Drake et al., Reference Drake, Charlesworth, Charlesworth and Crow1998). An attempt has been made to measure bidirectional somatic mutation rates using hyper-mutating chicken B cells (Jolly et al., Reference Jolly, Cook, Raftery and Jones2007), but the method is not widely applicable. Human mutations are much more complicated than expected (Duret, Reference Duret2009). It is well known that human mutations are influenced by factors such as age and gender (Drake et al., Reference Drake, Charlesworth, Charlesworth and Crow1998; Crow, Reference Crow2000, Reference Crow2006). Moreover, transcription-induced mutation biases are observed in gene loci (Green et al., Reference Green, Ewing, Miller, Thomas and Green2003; Polak & Arndt, Reference Polak and Arndt2008). Direct estimation of mutation rates found differential mutation rates across loci (Drake et al., Reference Drake, Charlesworth, Charlesworth and Crow1998; Reich & Lander, Reference Reich and Lander2001; Kondrashov, Reference Kondrashov2002). However, this direct estimation, using the incidence of dominant Mendelian diseases in humans, would only be appropriate for providing approximate information.
Estimates of human mutation rates are usually based on theoretical population genetics. One earlier estimate using a phylogenetic approach resulted in an average mutation rate of ~2·5×10−8 per nucleotide, assuming the effective population size to be 104 or 105 (Nachman & Crowell, Reference Nachman and Crowell2000). This mutation rate is very similar to that found by direct estimation (Drake et al., Reference Drake, Charlesworth, Charlesworth and Crow1998; Kondrashov, Reference Kondrashov2002). There are many ways to estimate mutation rates based on theoretical population genetics, including coalescent-based approaches (Kimura & Ohta, Reference Kimura and Ohta1971; Clark et al., Reference Clark, Tavare and Doebley2005; Hartl & Clark, Reference Hartl and Clark2007). However, most previous estimates are based on phylogeny, probably due to the availability of appropriate data. All of the approaches based on population genetics should be applied with caution, since past population histories and selective pressure can influence the results (Hartl & Clark, Reference Hartl and Clark2007).
It is well known that transition mutations are much more frequent than transversions and that the biochemical properties of CpG methylation help facilitate C-to-T transition mutations (Krawczak et al., Reference Krawczak, Ball and Cooper1998; Strachan & Read, Reference Strachan and Read2004). Therefore, it is highly plausible that mutation rates will vary depending on the adjacent nucleotide. Moreover, recent bovine genome data strongly support neighbouring nucleotide effects on mutations (Jiang et al., Reference Jiang, Wu, Zhang, Michal and Wright2008). Phylogeny-based methods that account for nearest-neighbour interactions (Hwang & Green, Reference Hwang and Green2004; Lunter & Hein, Reference Lunter and Hein2004; Arndt & Hwa, Reference Arndt and Hwa2005) also confirm the high C-to-T transition mutation rate, especially in CpG sites. More recently, a similar method based on phylogeny was developed based on the general time-reversible model, but the authors did not apply the method to real data (Catanzaro et al., Reference Catanzaro, Pesenti and Milinkovitch2006). However, the sequences of untranscribed regions were used for most of the estimates based on phylogeny. Moreover, rate estimates generated using the Human Genome Mutation Database (HGMD) are limited because the data rely solely upon reported mutations, which are usually associated with diseases (Krawczak et al., Reference Krawczak, Ball and Cooper1998). The analysis of re-sequencing data of human gene loci can provide more detailed general information regarding these issues in human gene loci.
Previous studies have found that the evolution of genome sequence composition is quite complex (Duret et al., Reference Duret, Semon, Piganeau, Mouchiroud and Galtier2002; Lercher et al., Reference Lercher, Smith, Eyre-Walker and Hurst2002; Piganeau et al., Reference Piganeau, Mouchiroud, Duret and Gautier2002; Webster et al., Reference Webster, Smith and Ellegren2003; Belle et al., Reference Belle, Duret, Galtier and Eyre-Walker2004; Duret, Reference Duret2009) and is even influenced by recombination (Duret & Arndt, Reference Duret and Arndt2008). In previous comparative genomics studies, more fixed nucleotide substitutions of G/C to A/T were found, but polymorphism data suggested more nucleotide substitutions of A/T to G/C. To address such discrepancies, more extensive estimates using public re-sequencing data based on the basic theories of population genetics would be helpful. The number of polymorphic sites from re-sequencing data provides the overall mutation rate, and the allele frequency spectra are used to estimate the mutation rate of one base for another (Wright, Reference Wright1931; Kimura & Ohta, Reference Kimura and Ohta1971). Mutation rates in this study will be estimated using allele frequency spectra from large-scale, high-quality, public re-sequencing data generated by the SeattleSNPs Program for Genome Application (PGA). Like other estimates based on theoretical population genetics, these methodologies require knowledge of the effective population size in order to estimate mutation rates. Since the primary purpose of this study is to enhance our knowledge of mutations and the evolution of the human genome, the current study focuses on determining relative rates of each base being mutated for another, without discussing the actual effective population sizes.
2. Methods
(i) Data
Data from the SeattleSNPs PGA were used for the analyses. A detailed description of the data and the patterns of variations has been reported previously (Crawford et al., Reference Crawford, Akey and Nickerson2005). There were 24 samples of African descent (AD) and 23 of European descent (ED), consisted of the first and second study populations in the SeattleSNPs PGA. Excluding X-linked genes and re-sequenced genes using the third study population, 292 genes were selected to estimate the population mutation parameters for each base substitution (Supplementary Table 1). Because the evolutionary processes governing mutations in repeat regions such as Alu, Mer and others, may be different from those of the rest of the genome, such regions were also excluded. Finally, tri-allelic sites were excluded, bringing the total number of base pairs (bp) examined to 4 437 493.
(ii) Mutation rates for each base to another
Let μ represent the rate per generation of the mutation of ‘A’ to ‘a’ and ν represent the rate of the opposite mutation. Let the frequency of ‘A’ be denoted as q. If mutations are subject to uniform rates of recurrence of mutation and reverse mutation over the whole genome, the distribution of allele frequencies for reversible recurrent mutation at equilibrium can be expressed as the following beta distribution, in which N is the effective population size (Wright, Reference Wright1931; Kimura & Ohta, Reference Kimura and Ohta1971):
In contrast to a scenario in which there is no mutation, where the derived allele frequency distribution would ignore the fixed alleles, the frequency, q, in this study involves fixed alleles. From eqn (1), the population mutation parameter of A-to-a is 4Nμ, and that of a-to-A is 4Nν. These parameters can be estimated from the mean and variance of the beta distribution. The estimates of 4Nμ and 4Nν are expressed as follows, where m is the mean and ‘var’ is the variance:
Data extraction and handling were performed using Perl programming, and the actual estimations were conducted using the R statistical package. First, the total sequence was analysed and sub-sequences were then analysed based on whether or not they were transcribed and/or translated. If there were transcript isoforms, the longer isoform was selected for the region classification. In total, analyses were conducted on the following eight regions: complete sequence (indicated as ‘Total’ in tables), mRNA, coding sequence (CDS), 5′-untranslated region (5′-UTR), 3′-UTR, all UTRs, intron and untranscribed region. Here, the mRNA region included the fully processed, mature mRNA without any introns. For estimations of population mutation parameters while taking into account the adjacent nucleotide effect, the data were divided into groups based on the identity of the next and previous nucleotides. Correlations between pairs of mutation rates expected to be equal due to strand reversal symmetry, assuming mutation-drift equilibrium were analysed by non-parametric rank-based measures for association, Kendall's tau and Spearman's rho statistics in the R package. Additionally, correlations between rates of AD and ED samples were tested to examine whether there was a population difference.
Another approach used to estimate the population mutation parameters was based on the number of polymorphic sites. If θ is the population mutation parameter 4Nμ and P is the number of polymorphic sites, then θ can be expressed as eqn (3) (Kimura & Ohta, Reference Kimura and Ohta1971; Hartl & Clark, Reference Hartl and Clark2007). Here, μ is the overall mutation rate. In this equation, q is an arbitrary frequency that defines whether or not a variant is polymorphic. If the frequency of a variant is higher than q, the site is considered to be polymorphic. In this study, q is equal to one divided by the number of sample chromosomes (1/48 for the AD samples and 1/46 for the ED samples):
(iii) Simulations
To test whether the differences in mutation rates in the AD and ED samples come from the past population histories, samples were generated based on expected past population history using a program, ms, which generates samples under the Wright–Fisher neutral model (Hudson, Reference Hudson1983, Reference Hudson2002). The same parameters of past population histories as a previous study were used in this study. The parameters were derived by extensive assessment of the possible past population histories of African–Americans and European–Americans (Boyko et al., Reference Boyko, Williamson, Indap, Degenhardt, Hernandez, Lohmueller, Adams, Schmidt, Sninsky, Sunyaev, White, Nielsen, Clark and Bustamante2008). However, the number of samples was changed to 48 for the AD samples and 46 for the ED samples to reflect the data set used in this study. The African expansion model was applied for the AD samples, and the European simple bottleneck and European complex bottleneck models were applied for the ED samples. Simulations were repeated ten times.
3. Results
Mutation rates of one base for another were estimated with re-sequencing data, using the reversible recurrent mutation model at equilibrium. As described in the Methods section, repeat regions, insertions and tri-allelic sites were excluded, and a total of 4 437 493 bp in 292 genes were examined, including 21 472 polymorphic sites. Sixteen polymorphic sites found within inserted regions were also excluded from the data. Among the polymorphic sites, 8444 sites were polymorphic in both populations, 10 015 sites were polymorphic only in the AD population and 3013 sites were polymorphic only in the ED population. Bases analysed, excluding the polymorphic sites, included: A, 1 182 263; C, 986 450; G, 1 007 768; and T, 1 239 540. The data set contained 1·21 times more A+T than C+G (45·2% GC content).
(i) Overall estimates
The population mutation parameters (4Nμ) were estimated to examine relative mutation rates, which are the product of effective population size (N) and mutation rate per base per generation (μ). Estimates of the population mutation parameter based on the number of polymorphic sites (Kimura & Ohta, Reference Kimura and Ohta1971; Hartl & Clark, Reference Hartl and Clark2007) were 0·00108 for the AD samples and 0·000675 for the ED samples. These estimates are similar to the previous estimates using coalescent- or phylogeny-based methods (Nachman & Crowell, Reference Nachman and Crowell2000; Bhangale et al., Reference Bhangale, Rieder, Livingston and Nickerson2005; Carlson et al., Reference Carlson, Reider, Nickerson, Eberle and Kruglyak2005). The estimated mutation rates based on the reversible recurrent mutation model were also similar to values from these earlier studies. Taking sequence constitution into consideration, the average population mutation parameters were 0·000810 for the AD samples and 0·000613 for the ED samples. If the effective population number is assumed to be approximately 10 000, the same as in the previous study (Nachman & Crowell, Reference Nachman and Crowell2000), the actual mutation rates would be 2·02×10−8 in the AD samples and 1·53×10−8 in the ED samples, which is similar to the previous estimate of approximately 2·5×10−8. In this study, the estimates of population mutation parameters were considered as relative mutation rates, instead of assuming an effective population size. Therefore, rates described in this study refer to relative mutation rates.
(ii) Mutation-drift equilibrium
There are six possible combinations of two of the four nucleotides; therefore, there are 12 mutation rates of one base for another. Assuming that the effective population size is constant and that substitutions are in equilibrium, the following equalities should be observed:
• A→C=T→G Transversion
• A→G=T→C Transition
• C→A=G→T Transversion
• C→T=G→A Transition
• A→T=T→A Transversion
• C→G=G→C Transversion
As shown in Tables 1 and 2, estimates for the AD and ED samples correspond relatively well to the expected equalities of mutation rates between certain bases corresponding to the strand reversal symmetry. Table 3 shows the excellent overall correlation between the observed rates and expected equalities. Poor correlation between the observed rates and expected equalities was noted for 5′-UTR regions, which contained the least amount of data of all the regions analysed.
a Number of polymorphic sites.
a Number of polymorphic sites.
a Combined estimates of total, mRNA, CDS, 5′-UTR, 3′-UTR, UTRs, intron and untranscribed regions.
b Test for estimates of both AD and ED samples.
(iii) Population differences
Increased population mutation parameters were observed in the AD samples compared with those in the ED samples. Tables 1 and 2 reflect the presence of more polymorphic sites in the AD samples than in the ED samples, which is suspected as a main reason for the increased population mutation parameters in AD samples. The parameters in the AD and ED samples showed very tight correlations (Table 3), indicating that there is a very similar pattern of base substitution in each sample set. These results suggest that the same mechanisms affecting genomic composition may underlie both populations.
Differences in the magnitude of rates between the AD and ED samples may exist as a result of differential effective population sizes between AD and ED samples. Using the effective population sizes estimated from linkage disequilibrium (Tenesa et al., Reference Tenesa, Navarro, Hayes, Duffy, Clarke, Goddard and Visscher2007), the average mutation rates were obtained as 2·70×10−8 in the AD samples and 4·94×10−8 in the ED samples. For the calculation, the estimate of 7500 for the population of Yoruba in Ibadan, Nigeria, was used for the AD samples as the effective population size; the estimate of 3100 for the population from Utah with European ancestry was used for the ED samples. Compared with the estimates with the same effective population size of 10 000 (2·02×10−8 in the AD samples and 1·53×10−8 in the ED samples), the gap between the mutation rates of AD and ED samples became larger after using the estimated effective population sizes (Tenesa et al., Reference Tenesa, Navarro, Hayes, Duffy, Clarke, Goddard and Visscher2007). The larger gap between samples was observed again for the estimates using the number of polymorphic sites. This phenomenon may result because the method for estimating effective population size (Tenesa et al., Reference Tenesa, Navarro, Hayes, Duffy, Clarke, Goddard and Visscher2007) did not reflect the past population histories.
To test the effect of past population history directly, simulations were conducted using the most probable parameters of past population history for each sample set as described in the methods section (Boyko et al., Reference Boyko, Williamson, Indap, Degenhardt, Hernandez, Lohmueller, Adams, Schmidt, Sninsky, Sunyaev, White, Nielsen, Clark and Bustamante2008). The simulations showed that the number of polymorphic sites for the AD population is 1·75-fold higher than the number of polymorphic sites for the ED population. The same ratio was calculated when the complex bottleneck model was applied instead of the simple bottleneck model for the ED population (Boyko et al., Reference Boyko, Williamson, Indap, Degenhardt, Hernandez, Lohmueller, Adams, Schmidt, Sninsky, Sunyaev, White, Nielsen, Clark and Bustamante2008). The simulated ratio was very similar to the real ratio (1·61) found in the SeattleSNPs PGA data. Results from the simulations support that the higher population mutation parameters in the AD samples might come from differences in the past population histories of the AD and ED samples, which influence the effective population sizes of populations.
(iv) Regional differences and transition versus transversion
As shown in Tables 1 and 2, the population mutation parameters were estimated in several specific gene regions: mRNA, CDS, 5′-UTR, 3′-UTR, all UTRs, intron and untranscribed region. The relative mutation rates were different depending on the region. The mRNA regions showed lower mutation rates than intronic and untranscribed regions, primarily attributable to the lower rates in the CDS. In particular, introns showed higher mutation rates than overall sequences and even untranscribed regions. These lower rates in the CDS may result from the negative selection pressures in genes as indicated in a recent study (Schmidt et al., Reference Schmidt, Gerasimova, Kondrashov, Adzhubei, Kondrashov and Sunyaev2008). One interesting observation is that T-to-C transitions were more frequent than C-to-T transitions in the 5′-UTR and CDS regions. In 5′-UTRs, the increased rate of T-to-C transitions may indicate the importance of CpG maintenance in this region. However, it is not clear whether the difference in transition rates is due to selective pressure or variation in overall mutation rates in the region.
Although overall estimates show mutation-drift equilibrium with good strand reversal symmetry, the transcription-induced mutation biases were observed mainly in intronic regions and weakly in 3′-UTR regions, showing increased A-to-G over T-to-C mutation rates compared with G-to-A over C-to-T. Additionally, as observed in a previous study (Green et al., Reference Green, Ewing, Miller, Thomas and Green2003; Polak & Arndt, Reference Polak and Arndt2008), the increased C-to-T over G-to-A is observed in the 5′-UTR regions. However, contrary to the previous results (Polak & Arndt, Reference Polak and Arndt2008), the global asymmetry in transcripts of exceeding A-to-G rates over T-to-C rates was not observed in the 5′-UTR region, which showed even opposite results. The asymmetry in the 5′-UTR was higher in ED samples than AD samples. Overall, these effects of mutation biases due to transcription were not strong in this study, and were primarily observed in intronic regions, possibly because the target genes were selected to examine the inflammatory pathway so that they may be less regularly transcribed.
The mutation biases can induce an excess of G+T over A+C on the coding strand (Green et al., Reference Green, Ewing, Miller, Thomas and Green2003). As expected, intronic regions showed 51·4% GT content, which is slightly increased compared with 50·9% GT content in the total sequence. This 51·4% GT content in introns is the highest among all investigated regions. However, in this study, taking the sequence compositions into account, the average G+T-to-A+C mutation rate is equal to the reverse rate, as the population mutation parameters were 0·000355 for AD samples and 0·00027 for ED samples, if all possible relevant exchanges were considered (A-to-G, G-to-A, C-to-T, T-to-C, A-to-T, T-to-A, C-to-G and G-to-C). Not only A-to-G, T-to-C and their reverse mutations but also A-to-T, C-to-G and their reverse mutations can eventually contribute to the excess of G+T over A+C in sequence composition. Consideration of sequence compositions and inclusion of these rates counterbalance the increased A-to-G over T-to-C mutation rates, with an expectation of the equilibrium of the sequence composition in intronic regions.
Not surprisingly, there were more transition polymorphisms than transversion polymorphisms. The number of transitions in the AD samples was 2·16-fold higher than the number of transversions. This ratio was 2·19 in the ED samples. The average population mutation parameters for transversions, adjusting for the base composition of the sequence, were 0·000255 for the AD samples and 0·000195 for the ED samples. The parameters for transitions were 0·000555 for the AD samples and 0·000418 for the ED samples (a 2·18- and 2·14-fold increase in rate over transversions, respectively). For transitions, C-to-T and G-to-A mutations were slightly more frequent than the reverse (Tables 1 and 2), presumably due to CpG methylation. The transition to transversion ratios also differed slightly depending on the region (Table 4). Interestingly, much higher ratios were observed in the CDS and the 5′-UTR. However, the increase in transitions versus transversions was not due to a higher level of CpG-to-TpG transitions (or CpG-to-CpA), but rather to an increase in other transition substitutions, especially T-to-C (Supplementary Tables 2 and 3).
(v) Genomic composition equilibrium
To examine the nature of changes in genomic composition, the average rates of G/C-to-A/T and A/T-to-G/C mutations were estimated with consideration of current base composition. Without adjusting the sequence composition, the raw estimates showed that there were higher mutation rates of G/C-to-A/T than A/T-to-G/C. After considering the current sequence composition of each base, the average rates of G/C-to-A/T mutations were almost the same as A/T-to-G/C, which were 0·000343 for the AD samples and 0·000258 for the ED samples. These estimates indicate that there is the same number of changes between A/T and G/C bases. This result supports the notion that current sequence compositions due to the nucleotide substitutions between A/T and G/C are in equilibrium at these gene loci. As shown in Table 4, the average rate ratios of A/T-to-G/C versus G/C-to-A/T substitutions indicate that all the regions are in equilibrium.
(vi) Adjacent nucleotide effect
To examine the neighbouring nucleotide effects of mutation, rates dependent on the previous and subsequent nucleotide in the sequence were estimated. As shown in Tables 5 and 6, different mutation rates were observed depending on the adjacent sequence. Again, the correlation between estimates for the AD and ED samples was highly significant, indicating a similar mechanism of mutation in both populations. In principle, due to equilibrium in the mutation rates between paired base substitutions, estimates of the effect of the previous nucleotide should match the estimates of the effect of the next nucleotide, i.e. CpG-to-TpG=CpG-to-CpA. As shown in Table 7, overall association tests between the estimates considering the previous and the reversely corresponding next nucleotides were more significant than those between the estimates not taking the adjacent sequence into account (Table 3). This confirms that the adjacent sequence influences the mutation rates.
a Adjacent sequence.
b Number of polymorphic sites.
a Adjacent sequence.
b Number of polymorphic sites.
a Combined transition rates of total, mRNA, 5′-UTR, 3′-UTR, UTRs, intron and untranscribed regions.
b Transition rates in total sequence only.
c Test for estimates of both AD and ED samples.
As expected, the highest rates were for CpG-to-TpG and CpG-to-CpA, indicating that CpG methylation is an important factor in the generation of specific mutations. To examine regional differences in these rates, the transition rates were estimated again after separating the data into sub-sequences (Supplementary Tables 2 and 3). Because transitions were more common than transversions and provided sufficient data for generating estimates, they were chosen for analysis. CpG methylation effects were reduced in mRNA regions and enhanced in intron regions. Like the estimates in Tables 1 and 2, the effects were lowest in the 5′-UTR and next lowest in the CDS. However, in this analysis, the rates of CpG-to-TpG rates were still slightly higher than those for TpG-to-CpG, although increased rates of TpG-to-CpG could certainly be seen in these regions compared with other regions. The increase in T-to-C transition rates as compared with the reverse substitution for 5′-UTR and CDS (Tables 1 and 2) comes from the overall effects of the adjacent sequence (Supplementary Tables 2 and 3). Combined with the results in Table 4, it may be suspected that a CpG maintenance system may exist in these two regions.
4. Discussion
This study is the first report of the relative mutation rates of one base for another by applying the recurrent reversible mutation model in equilibrium to re-sequencing data. The overall estimated population mutation parameters found were similar to previous studies using coalescent- and phylogeny-based approaches (Nachman & Crowell, Reference Nachman and Crowell2000; Bhangale et al., Reference Bhangale, Rieder, Livingston and Nickerson2005; Carlson et al., Reference Carlson, Reider, Nickerson, Eberle and Kruglyak2005). This study provides several important insights into mutation in the human genome. First, mutation-drift equilibrium was observed in human gene loci from the relative mutation rates. Second, the sequence composition of the human genome is in equilibrium using this method, at least in these gene loci. Third, mutation rates differ depending on the region of a gene locus in question. In particular, the lower rate in mRNA is observed primarily due to CDS regions, which may result from negative selection pressure, as described in a recent report (Schmidt et al., Reference Schmidt, Gerasimova, Kondrashov, Adzhubei, Kondrashov and Sunyaev2008). This study also confirms several expectations with more detailed information: higher rates of transition substitutions, mutations due to CpG methylation, adjacent sequence effects and transcription-induced mutation biases.
The frequent C-to-T substitutions occurring due to CpG methylation are well known and have been observed in mutation databases (Krawczak et al., Reference Krawczak, Ball and Cooper1998; Olivier et al., Reference Olivier, Eeles, Hollstein, Khan, Harris and Hainaut2002; Petitjean et al., Reference Petitjean, Mathe, Kato, Ishioka, Tavtigian, Hainaut and Olivier2007). This phenomenon is also common in other organisms (Morton et al., Reference Morton, Bi, McMullen and Gaut2006). Phylogeny-based estimates with the nearest-neighbour interactions showed that the CG-to-TG substitution rate is more than 10-fold higher than other substitution rates (Hwang & Green, Reference Hwang and Green2004; Lunter & Hein, Reference Lunter and Hein2004). However, in this study, the pattern was quite different from previously published results. The differences between CG-to-TG (and CG-to-CA) and other rates were not as large as in the previous estimates. Moreover, the rates of all other transitions except CpG-to-TpG (or CpG-to-CpA) lay in a relatively small range, in contrast to the previous study. Differences from the previous results may be a consequence of analysing human gene loci, in which negative selections are underway with observations of lower substitution rates at non-synonymous CpG sites (Schmidt et al., Reference Schmidt, Gerasimova, Kondrashov, Adzhubei, Kondrashov and Sunyaev2008). However, the possibility that differences in methodology may have led to these discrepancies cannot be ruled out.
The investigated sequence had a GC content of 45·2%, which is higher than the genome-wide average of 41% (Strachan & Read, Reference Strachan and Read2004). The number of CpGs analysed was 60 035, which was only 27% of the expected amounts, taking the sequence compositions into account. However, this is a bit higher than the regular expectation of the human genome, which is one-fifth of the expected value given the overall contents of C and G (Strachan & Read, Reference Strachan and Read2004). This discrepancy may result from the examination of gene loci in this study. Previous studies have indicated that the transition bias is due mainly to CpGs (Rosenberg et al., Reference Rosenberg, Subramanian and Kumar2003; Keller et al., Reference Keller, Bensasson and Nichols2007). However, considering the small number of CpGs in the human genome, it is unlikely that CpG methylation completely explains the transition bias. The estimates in this study show that other transition rates are also much higher than transversion rates (Tables 5 and 6).
Previous measurements relying primarily on phylogeny-based methods suggested higher G/C to A/T substitution rates (Hwang & Green, Reference Hwang and Green2004; Lunter & Hein, Reference Lunter and Hein2004; Lipatov et al., Reference Lipatov, Arndt, Hwa and Petrov2006; Duret & Arndt, Reference Duret and Arndt2008). However, several studies using polymorphism data found evidence for a fixation bias in favour of mutations that increase GC content (Eyre-Walker, Reference Eyre-Walker1999; Webster et al., Reference Webster, Smith and Ellegren2003). A more recent study suggested that these observations could be caused by misidentification of the ancestral state of partial data (Hernandez et al., Reference Hernandez, Williamson, Zhu and Bustamante2007). However, their results did not show the reverse observation, that is, a higher rate of G/C to A/T mutations. The estimates in the current study indicate that there is no mutational preference for G/C or A/T and that the overall sequence is in base compositional equilibrium (Table 4). Since the estimates in this study also indicate higher rates of G/C to A/T mutations than the reverse before adjusting for the current sequence composition, strict adjustment for the sequence composition may somewhat alter the previous estimates based on phylogeny. However, unfortunately, it is still true that the current results cannot explain the observation of higher rates of G/C to A/T substitutions than the reverse in previous studies using comparative methods.
Increased transition to transversion ratios in the 5′-UTR and CDS were observed in this study (Table 4). As shown in Supplementary Tables 2 and 3, these phenomena appear to result from increased T-to-C transition rates rather than increased CpG-to-TpG transition rates. Decreased CpG-to-TpG transition rates were observed in 5′-UTR and CDS regions, indicating that these rates did not contribute to the increased transition to transversion ratios in the 5′-UTR and CDS. Since the reversible recurrent mutation model assumes mutation-drift equilibrium and the estimates are dependent on the sequence composition, it cannot be ruled out that these observations might be a consequence of methodology. However, it should also be noted that these estimates may provide a reasonable explanation for the evolution of sequence composition and the maintenance of CpG islands in 5′-UTR and CDS regions.
The estimated population mutation parameters for the ED samples are lower than the estimates for the AD samples, which may arise from the different effective population sizes between samples. As shown in the results section, application of the effective population sizes estimated from linkage disequilibrium did increase the discrepancy between the estimates of AD and ED samples (Tenesa et al., Reference Tenesa, Navarro, Hayes, Duffy, Clarke, Goddard and Visscher2007). The increased discrepancy may arise because the estimation method of effective population sizes lacks the consideration of past population history. Previous studies found a demographic history of expansion in AD populations and a bottleneck in ED populations (Akey et al., Reference Akey, Eberle, Rieder, Carlson, Shriver, Nickerson and Kruglyak2004; Marth et al., Reference Marth, Czabarka, Murvai and Sherry2004; Boyko et al., Reference Boyko, Williamson, Indap, Degenhardt, Hernandez, Lohmueller, Adams, Schmidt, Sninsky, Sunyaev, White, Nielsen, Clark and Bustamante2008). Using the same parameters suggested in a previous study (Boyko et al., Reference Boyko, Williamson, Indap, Degenhardt, Hernandez, Lohmueller, Adams, Schmidt, Sninsky, Sunyaev, White, Nielsen, Clark and Bustamante2008), simulation results provided evidence that the increase in polymorphic sites in the AD samples were due to the differences between the past demographic histories of AD and ED populations, which may influence effective population sizes. However, these simulations can only superficially explain the increased estimates calculated for the AD samples. Moreover, considering the overall effect of past population histories, higher asymmetry of transition rates of the 5′-UTR in ED samples than AD samples may come from other effects, such as selection pressures, rather than the effect of past population history. For a complete understanding of population differences, more studies would be necessary.
There are limitations to the interpretation of the findings of this study. Direct estimation from the incidence of dominant Mendelian diseases in humans indicates differential mutation rates across loci (Drake et al., Reference Drake, Charlesworth, Charlesworth and Crow1998; Reich & Lander, Reference Reich and Lander2001; Kondrashov, Reference Kondrashov2002). The gene lengths in this study are varied; some of the genes have a very small number of polymorphic sites due to their short lengths. To address these concerns, further study of gene clusters containing a sufficient number of polymorphic sites may be useful. It should also be noted that mutation rates are influenced by gender and age (Crow, Reference Crow2000, Reference Crow2006). In addition, attention must be paid to population demography and selective pressure. Although they are not expected to influence comparison of the relative rates significantly, these other factors may nevertheless alter the mutation rate estimates. Moreover, all the regions examined in this study are gene loci. Mutation rates may be different for regions that do not harbour genes.
Using high-quality, large-scale, public re-sequencing data, this study explored the relative mutation rates of each base for each of the other bases. This study is the first report of population mutation parameters as relative mutation rates by applying the recurrent reversible mutation model to re-sequencing data. By providing more detailed information on the mutations in human gene loci, this study offers reasonable explanations for the evolution of the human genome by mutational events. Nevertheless, further studies using larger data sets and theoretical studies with consideration of population genetic factors would be helpful in providing better understanding for human mutation processes. In addition, as suggested (Duret, Reference Duret2009), direct estimation using large-scale re-sequenced data of families would be extremely useful.
The author appreciates the reviewers' comments, which substantially improve this manuscript. This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2007-532-C00017). Key calculations were performed using the supercomputing resource at the Korea Institute of Science and Technology Information (KISTI).