Book contents
- Frontmatter
- Contents
- List of Contributors
- Preface
- 1 An Introduction to Next-Generation Biological Platforms
- 2 An Introduction to The Cancer Genome Atlas
- 3 DNA Variant Calling in Targeted Sequencing Data
- 4 Statistical Analysis of Mapped Reads from mRNA-Seq Data
- 5 Model-Based Methods for Transcript Expression-Level Quantification in RNA-Seq
- 6 Bayesian Model-Based Approaches for Solexa Sequencing Data
- 7 Statistical Aspects of ChIP-Seq Analysis
- 8 Bayesian Modeling of ChIP-Seq Data from Transcription Factor to Nucleosome Positioning
- 9 Multivariate Linear Models for GWAS
- 10 Bayesian Model Averaging for Genetic Association Studies
- 11 Whole-Genome Multi-SNP-Phenotype Association Analysis
- 12 Methods for the Analysis of Copy Number Data in Cancer Research
- 13 Bayesian Models for Integrative Genomics
- 14 Bayesian Graphical Models for Integrating Multiplatform Genomics Data
- 15 Genetical Genomics Data: Some Statistical Problems and Solutions
- 16 A Bayesian Framework for Integrating Copy Number and Gene Expression Data
- 17 Application of Bayesian Sparse Factor Analysis Models in Bioinformatics
- 18 Predicting Cancer Subtypes Using Survival-Supervised Latent Dirichlet Allocation Models
- 19 Regularization Techniques for Highly Correlated Gene Expression Data with Unknown Group Structure
- 20 Optimized Cross-Study Analysis of Microarray-Based Predictors
- 21 Functional Enrichment Testing: A Survey of Statistical Methods
- 22 Discover Trend and Progression Underlying High-Dimensional Data
- 23 Bayesian Phylogenetics Adapts to Comprehensive Infectious Disease Sequence Data
- Index
- Plate section
3 - DNA Variant Calling in Targeted Sequencing Data
Published online by Cambridge University Press: 05 June 2013
Summary
Introduction
Rare DNA variants (minor allele frequency [MAF] of 1% or less in a population), which occur less often than once per 1 KB (Wang et al., 1998), can be distributed across different genes, interact with each other, and affect more than one disease phenotype. To study the association of rare variants with disease, it is necessary to obtain genomic DNA from many individuals with specific disorders. Although next-generation sequencing has achieved a low cost per base and high throughput on the terabase (TB) scale, it remains challenging for a regular laboratory to sequence hundreds of samples while meeting the high standards of accuracy and completeness required in medical research. Recent developments in targeted sequencing provide a timely solution by generating sequencing data only from the genomic regions of interest (e.g., 1 MB for 500 candidate genes vs. 3 GB for the whole genome, per sample), thereby reducing the time, the cost, and the amount of data in the downstream analysis. These regions or candidate genes can be selected through linkage mapping, phenotype-based gene association, or network analysis (Scharfe et al., 2009).
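The figures in this paragraph can be checked with a little arithmetic. The sketch below is illustrative only: the function names, the example allele counts, and the assumption that a human genome is roughly 3 billion bases are ours, not the chapter's.

```python
def minor_allele_frequency(alt_count: int, total_alleles: int) -> float:
    """Return the frequency of the less common allele at a biallelic site."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    freq = alt_count / total_alleles
    return min(freq, 1.0 - freq)


def is_rare(maf: float, threshold: float = 0.01) -> bool:
    """A variant is called rare here when its MAF is 1% or less."""
    return maf <= threshold


# Hypothetical example: 3 alternate alleles among 200 diploid samples (400 alleles).
maf = minor_allele_frequency(3, 400)

# Size comparison, per sample (assumed values, in bases).
TARGET_BASES = 1_000_000          # about 1 MB covering 500 candidate genes
GENOME_BASES = 3_000_000_000      # assumed size of a whole human genome

reduction_factor = GENOME_BASES // TARGET_BASES

# At fewer than one rare variant per 1 KB, the target holds fewer than this many.
max_rare_sites_in_target = TARGET_BASES // 1_000

print(f"MAF = {maf:.4f}, rare = {is_rare(maf)}")
print(f"Targeted sequencing covers 1/{reduction_factor} of the genome")
print(f"Upper bound on rare variant sites in target: {max_rare_sites_in_target}")
```

Under these assumptions, a 1 MB target is a small fraction of the genome per sample, which is where the savings in time, cost, and downstream data volume come from.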
Efficient and specific enrichment of tens of thousands of selected genomic regions across hundreds of samples is essential for the success of a targeted sequencing study. The field is still under development. The available methods include hybridization-based capture and in-solution capture. Compared with hybridization-based methods, in-solution enrichment strategies usually deliver higher target specificity (>98%) at lower cost and with smaller DNA sample requirements, which is useful for multisample studies. In particular, we have developed a novel probe-based in-solution capture technology called the long padlock probe (LPP) method (Shen et al., 2011).
- Type: Chapter
- Information: Advances in Statistical Bioinformatics: Models and Integrative Inference for High-Throughput Data, pp. 54-76. Publisher: Cambridge University Press. Print publication year: 2013