With diploid organisms, one is interested not only in discovering variants but also in discovering to which of the two haplotypes each variant belongs. One would thus like to identify the variants that are co-located on the same haplotype, a process called haplotype phasing. Assume we have managed to do haplotype phasing for several individuals. It is then of interest to do haplotype matching, that is, to locate long contiguous blocks shared by multiple individuals. The chapter covers algorithms and complexity analysis of these key haplotype analysis tasks. A close connection between classical indexes and a tailored data structure called the positional BWT index is established.
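As a rough illustration of the positional idea, the sketch below keeps phased haplotypes sorted by their reversed prefixes, one column at a time, so that haplotypes sharing a long block ending at a given site become neighbours in the ordering. It assumes biallelic sites encoded as 0/1 and equal-length haplotypes; the function name `positional_prefix_arrays` is hypothetical, and this is a minimal sketch rather than the chapter's own construction.

```python
def positional_prefix_arrays(haplotypes):
    """Return the orderings a_0, ..., a_N of haplotype indices.

    a_k lists the haplotypes sorted by their reversed prefix of length k
    (ties broken by original index). Assumes 0/1 alleles (illustrative).
    """
    num_sites = len(haplotypes[0]) if haplotypes else 0
    order = list(range(len(haplotypes)))
    arrays = [order]
    for k in range(num_sites):
        # Stable partition by the allele at site k: zeros first, then ones.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
        arrays.append(order)
    return arrays


haps = [[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
arrays = positional_prefix_arrays(haps)
```

Each column is processed in time linear in the number of haplotypes, so all orderings are obtained in time proportional to the size of the input matrix.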
Analysing the content of a biological sequence can often be modeled as a segmentation problem. For example, one may wish to segment a genome into coding and non-coding regions, where only the former are translated to proteins. Statistical features of what genes usually look like can be used to derive an optimization framework. This process can be formalized through hidden Markov models, and the underlying segmentation problem can be solved using dynamic programming. This chapter introduces the key methods related to such optimization.
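To make the segmentation idea concrete, here is a minimal sketch of the Viterbi dynamic program on a toy two-state hidden Markov model. The states `C` (coding) and `N` (non-coding) and all probabilities are invented for illustration only; they are assumptions, not values from the book. The sequence is assumed to be non-empty.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Most probable state path for obs, computed in log space."""
    score = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + math.log(trans[p][s]))
            new_score[s] = (score[prev] + math.log(trans[prev][s])
                            + math.log(emit[s][symbol]))
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


# Hypothetical toy parameters: "coding" is GC-rich, "non-coding" is AT-rich.
states = ["C", "N"]
start = {"C": 0.5, "N": 0.5}
trans = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
emit = {"C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

segmentation = viterbi("AAAAGGGGGG", states, start, trans, emit)
```

With these toy parameters the AT-rich prefix is labelled non-coding and the GC-rich suffix coding; real gene finders use far richer models.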
Classical index structures like suffix trees are powerful, but they occupy much more space than the data they are built on. Many space-efficient alternatives exist that occupy space close to the input data. This chapter covers such data structures based on the Burrows–Wheeler transform (BWT). A special emphasis is given to the bidirectional BWT index, which can be used for solving basic genome analysis tasks by simulating suffix tree exploration without any sacrifice in run time. Space-efficient representations of de Bruijn graphs are also covered.
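The following sketch shows the transform itself and how pattern occurrences can be counted on it with backward search. It is deliberately naive: the transform is built by sorting all rotations, and rank queries scan the string, whereas a space-efficient index would use succinct rank structures instead. The function names and the `$` terminator are assumptions made for this illustration.

```python
def bwt(text):
    """Burrows-Wheeler transform of text + '$' via sorted rotations."""
    assert "$" not in text, "'$' is reserved as the terminator"
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count occurrences of pattern in the original text by backward search."""
    # C[c] = number of characters in the text that are smaller than c.
    C, total = {}, 0
    for c in sorted(set(bwt_str)):
        C[c] = total
        total += bwt_str.count(c)

    def rank(c, i):
        # Occurrences of c in bwt_str[:i]; linear scan, for illustration only.
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)  # current interval of sorted rotations
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("banana")
```

For example, `count_occurrences(transformed, "ana")` processes the pattern right to left, narrowing the interval of rotations prefixed by `a`, then `na`, then `ana`.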
Graphs are a fundamental model for representing various relations among data. The aim of this chapter is to present some basic problems and techniques relating to graphs, mainly for finding particular paths in directed and undirected graphs. In the following chapters, we deal with various problems in biological sequence analysis that can be reduced to one of these basic ones.
In this chapter we assume that we have a collection of reads from all the different (copies of the) transcripts of a gene. We start by showing how to extend read alignment techniques to short RNA reads, and later we show how to exploit the output of genome analysis techniques to obtain an aligner for long reads of RNA transcripts. Our final goal is to assemble the reads into the different RNA transcripts and to estimate the expression level of each transcript. The main difficulty of this problem, which we call multi-assembly, arises from the fact that the transcripts share identical substrings. We illustrate different scenarios, and corresponding multi-assembly formulations, which we then solve using network flow techniques.
A full-text index for a string T is a data structure that is built once and kept in memory for answering an arbitrarily large number of queries on the position and frequency of substrings of T. Such queries can be used for speeding up dynamic programming algorithms tailored for mapping reads to a reference genome – a fundamental task in the analysis of high-throughput sequencing data. This chapter covers the classical full-text indexes, including k-mer indexes, suffix arrays, and suffix trees. Linear-time algorithms for suffix sorting and for basic genome analysis tasks, such as finding maximal exact matches, are also presented.
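As a small illustration of the build-once, query-many pattern, the sketch below constructs a suffix array by plain sorting and locates a pattern with two binary searches. The construction is far from the linear-time suffix sorting mentioned above; it is a naive stand-in chosen for brevity, and the function names are hypothetical.

```python
def suffix_array(text):
    """Starting positions of the suffixes of text in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """All starting positions of pattern in text, via binary search on sa."""
    m = len(pattern)

    def boundary(strict):
        # First index whose length-m suffix prefix is >= pattern (strict=False)
        # or > pattern (strict=True).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return sorted(sa[boundary(False):boundary(True)])


text = "banana"
sa = suffix_array(text)
```

Once `sa` is built, each query costs a logarithmic number of string comparisons, independent of how many queries came before.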
This chapter gives a minimalistic, combinatorial introduction to molecular biology, omitting the description of most biochemical processes and focusing on inputs and outputs, abstracted as mathematical objects.
This chapter connects the alignment techniques and space-efficient data structures covered in earlier chapters. It shows how to use BWT indexes for aligning sequencing reads to a reference genome. This powerful read mapping procedure enables variant calling and genotyping of new individuals from a species whose reference genome has already been assembled.
An alignment of two sequences aims to highlight how much the two sequences have in common. In computational biology, an alignment is a prediction of the evolutionary steps between the two sequences. Different costs can be assigned to such steps, and one can then seek an optimal alignment. This chapter gives a comprehensive introduction to the dynamic programming algorithms developed for various alignment formulations.
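The simplest such formulation is unit-cost edit distance, where every substitution, insertion, and deletion costs 1. The sketch below fills the dynamic programming table row by row, keeping only the previous row in memory. The unit costs are an assumption made for this example; other cost schemes change only the three candidate terms in the recurrence.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    # previous[j] = distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        current = [i]  # deleting all i characters of a[:i]
        for j in range(1, len(b) + 1):
            substitute = previous[j - 1] + (a[i - 1] != b[j - 1])
            delete = previous[j] + 1
            insert = current[j - 1] + 1
            current.append(min(substitute, delete, insert))
        previous = current
    return previous[-1]


distance = edit_distance("kitten", "sitting")
```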
This chapter shows how to perform analysis and comparison of genomes without assuming a reference genome to be available. The bidirectional BWT index turns out to be essential here, and the chapter covers a comprehensive set of techniques to manipulate this data structure. The algorithms covered include computing maximal exact/unique matches, substring kernels, matching statistics, and Jaccard similarity.
Several large-scale studies aim to build comprehensive catalogs of all the variants in a population, for example all the frequent variants in a species or all the variants in a group of individuals with a specific trait or disease. Such catalogs are the substrate for subsequent genome-wide association studies that aim to correlate variants to traits, and ultimately to personalized treatments. Such catalogs can also be leveraged for performing basic analysis tasks, such as read alignment, using not just one reference genome but a pangenome data structure representing all genomes in the catalog. The chapter gives an overview of different pangenome data structures and their applications. Selected data structures are covered in more depth, including the r-index.
Assume that a drop of seawater contains cells from many distinct species. Sequencing such a mixed sample and figuring out the relative abundance of every species is a key problem in metagenomics. This chapter explores techniques for metagenomics analysis in different settings, for example with and without assuming that reference sequences are available. To solve these problems, we use techniques including tailored k-mer-based analyses, bidirectional BWT indexing, and network flows.
This chapter presents the minimal setup of data structures required to follow the rest of the book in a self-contained manner. Balanced binary trees are enhanced to solve dynamic range minimum queries. Bitvector rank and select data structures and their extensions to larger alphabets with wavelet trees are covered. Then a special structure for solving static range minimum queries is derived. The chapter ends with a concise description of hashing primitives, such as perfect hashing, Bloom filters, minimizers, and the Rabin–Karp rolling hash.
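Of the hashing primitives listed, the rolling hash is the easiest to show in a few lines. The sketch below slides a window over the text, updating a polynomial fingerprint in constant time per step and verifying each fingerprint match explicitly. The base and modulus are arbitrary illustrative choices, not values prescribed by the book.

```python
def rabin_karp(text, pattern, base=256, mod=(1 << 61) - 1):
    """Starting positions of pattern in text, found with a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leftmost character
    pattern_hash = window_hash = 0
    for i in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[i])) % mod
        window_hash = (window_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # Compare the strings on a hash match to rule out collisions.
        if window_hash == pattern_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i + m < n:
            # Drop text[i], shift the window, and append text[i + m].
            window_hash = ((window_hash - ord(text[i]) * high) * base
                           + ord(text[i + m])) % mod
    return hits


positions = rabin_karp("abracadabra", "abra")
```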
Throughout the book we mostly assume the genome sequence under study to be known. In this chapter we look at strategies for how to assemble fragments of DNA into longer contiguous blocks, and eventually into chromosomes. This chapter is partitioned into sections roughly following the workflow of a de novo assembly project, namely, error correction, contig assembly, scaffolding, and gap filling. Algorithms working with de Bruijn graphs and overlap graphs are studied.
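As a toy version of the contig assembly step, the sketch below builds a de Bruijn graph whose nodes are (k-1)-mers and whose edges are the k-mers observed in the reads, and then spells a contig by following edges for as long as the next node is unambiguous. It assumes error-free reads from a single strand and checks only out-degrees, so it is a simplification of proper unitig construction; the function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_unambiguously(graph, start):
    """Spell a contig from start while the current node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (successor,) = graph[node]
        if successor in seen:  # stop instead of looping around a cycle
            break
        contig += successor[-1]
        seen.add(successor)
        node = successor
    return contig


reads = ["ATGGCG", "GGCGTA", "CGTACC"]
graph = de_bruijn_graph(reads, k=4)
contig = extend_unambiguously(graph, "ATG")
```

Here the three overlapping reads are merged into one contig because every node on the path has a single outgoing edge; repeats in a real genome create branching nodes where the walk stops.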