Search results for Computational Biology and Bioinformatics

3 - Constructing Trees from True Subtrees
from PART I - BASIC TECHNIQUES
Tandy Warnow, University of Illinois, Urbana-Champaign
Book:

Computational Phylogenetics

Published online:

26 October 2017

Print publication:

02 November 2017, pp 51-60
- Chapter

Frontmatter
Tandy Warnow, University of Illinois, Urbana-Champaign
Book:

Computational Phylogenetics

Published online:

26 October 2017

Print publication:

02 November 2017, pp i-iv
- Chapter

PART I - BASIC TECHNIQUES
Tandy Warnow, University of Illinois, Urbana-Champaign
Book:

Computational Phylogenetics

Published online:

26 October 2017

Print publication:

02 November 2017, pp 1-2
- Chapter

1 - Brief Introduction to Phylogenetic Estimation
from PART I - BASIC TECHNIQUES
Tandy Warnow, University of Illinois, Urbana-Champaign
Book:

Computational Phylogenetics

Published online:

26 October 2017

Print publication:

02 November 2017, pp 3-28
- Chapter

9 - Multiple Sequence Alignment
from PART II - MOLECULAR PHYLOGENETICS
Tandy Warnow, University of Illinois, Urbana-Champaign
Book:

Computational Phylogenetics

Published online:

26 October 2017

Print publication:

02 November 2017, pp 178-233
- Chapter
Summary

Introduction
Phylogeny estimation generally begins by estimating a multiple sequence alignment on the set of sequences. Once the multiple sequence alignment is computed, a tree can then be computed on the alignment (Figure 9.1). Not surprisingly, errors in multiple sequence alignment estimation tend to produce errors in estimated trees (Ogden and Rosenberg, 2006; Nelesen et al., 2008; Liu et al., 2009a; Wang et al., 2012) and other downstream analyses. Hence, multiple sequence alignment is an important part of phylogeny estimation.
As we have seen, there are many methods for estimating trees from gap-free data. However, because multiple sequence alignments almost always contain gaps, represented as dashes, phylogeny estimation methods must be modified to be able to analyze alignments with dashes. Typically this is performed by treating the dashes as missing data (i.e., missing data means there is an actual nucleotide or amino acid, but it is not known). Alternatively, the dashes are sometimes treated as an additional state in the sequence evolution model, thus producing five states for nucleotide alignments or 21 states for amino acid alignments. Finally, sometimes sites (i.e., columns in the multiple sequence alignment) containing dashes are eliminated from the alignment before a tree is computed. The different treatments of sequence alignments can result in quite different theoretical and empirical performance.
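Two of the gap treatments described above can be sketched in a few lines. This is a minimal illustration, assuming the alignment is held as a dictionary from taxon name to aligned sequence; the function names, the `?` missing-data symbol, and the toy data are hypothetical and not taken from the book.

```python
def drop_gap_columns(alignment):
    """Remove every site (column) in which at least one sequence has a dash."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("all aligned sequences must have the same length")
    names = list(alignment)
    # zip(*rows) transposes the alignment so that each tuple is one column.
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if "-" not in col]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


def gaps_as_missing(alignment, missing="?"):
    """Recode dashes as a missing-data symbol (hypothetical choice: '?')."""
    return {name: seq.replace("-", missing) for name, seq in alignment.items()}


toy = {"a": "AC-GT", "b": "ACTGT", "c": "A-TGT"}
print(drop_gap_columns(toy))
print(gaps_as_missing(toy))
```

Note how aggressive column removal can be: in the toy alignment, two of the five sites are discarded even though each contains a single dash.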
Multiple sequence alignments are computed for different purposes, including phylogeny estimation and protein structure prediction, and the definition of what constitutes a correct alignment depends, at least in part, on the purpose for the alignment. For some biological datasets, curated alignments, typically based on experimentally confirmed structural features of the molecules (e.g., secondary structures or tertiary structures of RNAs and proteins), are used as benchmarks for evaluating alignment methods. Examples of such benchmarks for evaluating large amino acid alignments include HomFam (Sievers et al., 2011), BAliBASE (Thompson et al., 1999), and the 10AA collection (Nguyen et al., 2015b), while the Comparative Ribosomal Website (CRW) provides benchmarks for RNA alignment (Cannone et al., 2002). Evolutionary alignments, on the other hand, are defined by the evolutionary history relating the sequences.

6 - Consensus and Agreement Trees
from PART I - BASIC TECHNIQUES
Tandy Warnow, University of Illinois, Urbana-Champaign
Book:

Computational Phylogenetics

Published online:

26 October 2017

Print publication:

02 November 2017, pp 109-120
- Chapter
Summary

Introduction
In this chapter we discuss techniques for analyzing sets of trees, which are called profiles. Depending on how the set of trees was computed and the objective of the analysis, different techniques are needed.
For example, when a maximum parsimony analysis is performed, many equally good trees may be found, all having the same “best” score (meaning that it is the best score found during the analysis). Similarly, when a Bayesian MCMC (see Section 8.7) analysis is performed, then a random sample of the trees is examined. Sometimes, many different types of methods are used on the same data, and for each analysis a set of trees is obtained. Then, from the full set of trees, each of which has been estimated on the same data, again some kind of point estimate is sought. In each of these cases, a “consensus method” is used to provide a point estimate of the tree from the full set of trees.
An alternative objective might be to find those subsets of the taxon set on which all or most of the trees in the profile agree; this kind of approach does not produce a point estimate of the true tree, but can be used to identify the portion of the history on which all trees agree, and also (potentially) problematic taxa. We refer to this type of approach as an “agreement method.”
In this chapter, we discuss consensus and agreement methods, which are methods for analyzing datasets of trees under the assumption that all the trees are estimated species trees with the same leafset S. We discuss supertree methods in Chapter 7, which extend the consensus methods in this chapter to allow for the taxon sets to be different between trees. Finally, methods for analyzing sets of gene trees that can differ from each other due to incomplete lineage sorting, horizontal gene transfer, or gene duplication and loss are discussed in Chapter 10. Many of the methods described in Chapter 10 can also be used as supertree methods.
Consensus Trees
In general, consensus methods are applied to unrooted trees (and we will define them in that context), but they can be modified so as to be applicable to rooted trees as well.
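As a rough illustration of how a consensus method can turn a profile into a point estimate, the sketch below applies the majority-rule idea: keep the bipartitions that appear in more than half of the trees. The excerpt does not say which consensus methods the chapter defines, so treat this as one common example. It assumes each unrooted tree is given as a collection of its nontrivial bipartitions, each written as one side of the split; all names and the toy profile are hypothetical.

```python
from collections import Counter


def canonical(side, taxa):
    """Pick one fixed representative for a bipartition.

    A bipartition A|B can be written from either side, so we always store
    the side that does not contain the alphabetically smallest taxon.
    """
    side = frozenset(side)
    anchor = min(taxa)
    return frozenset(taxa) - side if anchor in side else side


def majority_bipartitions(profile, taxa):
    """Return the bipartitions found in more than half of the trees."""
    counts = Counter()
    for tree in profile:
        # A set comprehension ensures each tree votes once per bipartition.
        counts.update({canonical(side, taxa) for side in tree})
    return {bp for bp, c in counts.items() if c > len(profile) / 2}


taxa = {"a", "b", "c", "d", "e"}
profile = [
    [{"a", "b"}, {"d", "e"}],  # ((a,b),c,(d,e))
    [{"a", "b"}, {"c", "d"}],  # ((a,b),e,(c,d))
    [{"a", "c"}, {"d", "e"}],  # ((a,c),b,(d,e))
]
print(majority_bipartitions(profile, taxa))
```

In the toy profile, the splits ab|cde and de|abc each appear in two of the three trees, so both survive; building the tree that displays exactly these splits is a separate step not shown here.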

Appendix C - Guidelines for Writing Papers About Computational Methods
Tandy Warnow, University of Illinois, Urbana-Champaign
Book:

Computational Phylogenetics

Published online:

26 October 2017

Print publication:

02 November 2017, pp 327-330
- Chapter
Summary

Becoming a good writer is tremendously important, for many reasons. The most obvious one is that your papers will be easier to understand and this will help you get published and also contribute to the research field. However, in addition, being a good writer will also help you find mistakes in your work, which is perhaps even more important!
There are many ways to write well, and you should find your own style. However, all good scientific writing should have the following properties:
• clarity of exposition, so that both you and the reader understand what you've done and can draw correct inferences from the data;
• reproducibility, so that the experiment can be performed by someone else, using the exact same methods and data;
• rigor, so that what you infer makes sense; and
• scientific relevance, so that what you generate is relevant to some real data analysis.

To become a good writer, you should read the scientific literature as much as you can, and note what you like and don't like about each paper, and why. Extensive reading helps in terms of developing a good writing style, and also developing skills in designing and doing experiments that are convincing and appropriate. It's also very good practice to read the supplementary materials of all the papers you like, because often it's only in the supplement that you will find out the details that are the most important. For that matter, some journals make it quite difficult to provide sufficient detail, due to space limitations, and may not even make it feasible to provide supplementary materials on the journal's website. So, as you read, develop a sense of how the different journals enable or discourage reproducibility. It may be that this will end up informing your thoughts on where you wish to publish.

It should be obvious that any paper you want to write should be written well, with a clear introduction that provides a context for the study and engages the reader, a discussion of what was done and why, the observations that you made from your work, detailed discussion of what the observations suggest and how they relate to the rest of the scientific literature, and – of course – conclusions.

Appendix D - Projects
Tandy Warnow, University of Illinois, Urbana-Champaign
Book:

Computational Phylogenetics

Published online:

26 October 2017

Print publication:

02 November 2017, pp 331-338
- Chapter
Summary

Introduction
There are three types of projects in this collection: short projects, long projects, and projects that involve the development of novel methods. Each project requires data analysis, either on real or simulated data, and also writing. Therefore, even the short projects will require about a week for completion.
The main purpose of the short projects is to familiarize the student with the process of computing and interpreting alignments and trees on datasets. Because the data analysis part of these projects should be fast to complete, they are focused on relatively small nucleotide datasets. If the student has access to sufficient computational resources, then analyses of larger datasets or amino acid datasets are possible. Each short project also asks the student to explore the impact of method choice (i.e., alignment method or tree estimation method) or dataset on the resultant tree, typically using visualization tools.
The long projects build on the short projects, but do more exploration of the impact of method choice (for alignment estimation or tree estimation) or dataset on phylogeny estimation. Some of these projects examine scalability of methods to large datasets, and so will require substantial computational resources. As the student will learn, the degree to which the method selection impacts the final phylogeny can depend on the properties of the data, such as number of sequences, number of sites (i.e., sequence length), rate of evolution, percentage of missing data, etc. The use of both biological and simulated data will help the students evaluate the impact of the different factors on the final outcomes.
The projects aimed at novel method development are likely to be the most difficult, and success in these projects will probably require substantial effort beyond the period of the course. However, a student who wishes to do a novel method development project is usually best served by starting with a long project to identify the competing methods and select datasets that are best able to differentiate between methods.
Final projects for the course are typically long projects rather than novel method development projects, and are focused on comparisons of leading computational methods on simulated or biological datasets, with an eye toward assessing the relative performance of these methods, and gaining insight into the conditions that impact each method.

2 - Trees
from PART I - BASIC TECHNIQUES
Tandy Warnow, University of Illinois, Urbana-Champaign
Book:

Computational Phylogenetics

Published online:

26 October 2017

Print publication:

02 November 2017, pp 29-50
- Chapter

11 - Designing Methods for Large-Scale Phylogeny Estimation
from PART II - MOLECULAR PHYLOGENETICS
Tandy Warnow, University of Illinois, Urbana-Champaign
Book:

Computational Phylogenetics

Published online:

26 October 2017

Print publication:

02 November 2017, pp 274-298
- Chapter
Summary

Introduction
The construction of large phylogenies is of increasing interest, and the size of these datasets can be enormous. For example, the insect portion of the Tree of Life already contains “nearly a million described species,” and the evolutionary relationships between these species are far from resolved (Maddison, 2016). Yet methods for constructing large trees are typically based on approaches that were designed for much smaller datasets, and very few available software packages have adequate accuracy on large datasets. Thus, there is a gap in terms of basic algorithmic strategies and in terms of software development for large-scale phylogeny estimation.
In this chapter, we investigate techniques for constructing phylogenetic trees for large datasets. We explore algorithm design, including standard heuristics used in many software packages, and also divide-and-conquer techniques used to scale computationally intensive methods to large datasets. Some of the initial parts of this chapter appear in earlier chapters, and are repeated here for the sake of completeness.
Standard Approaches
Many phylogeny estimation methods fall into the following categories: distance-based methods, subtree assembly-based methods, heuristics for NP-hard optimization methods, or Bayesian methods. Understanding each of these types of methods is helpful in developing methods for large datasets.
Distance-based methods. Generally the fastest phylogeny estimation methods operate by computing a matrix of pairwise “distances” between every pair of taxa, and then using that distance matrix to compute a tree. Distance-based methods (described in Chapter 5) typically run in O(n^3) time, but some even run in O(n^2) time, and many are fast enough to be used on even very large datasets.
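The first stage of a distance-based method, computing the pairwise matrix, might look like the sketch below. It assumes the simplest possible distance, the uncorrected proportion of differing sites (the p-distance), which is an illustrative choice rather than the distance the book recommends; the function names and toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(s, t):
    """Fraction of sites that differ, ignoring sites where either has a dash."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(s, t) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no sites are shared by the two sequences")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """Return a symmetric dictionary d[(a, b)] over all pairs of taxa."""
    names = sorted(seqs)
    d = {(a, a): 0.0 for a in names}
    # n taxa give n(n-1)/2 pairs, so this stage is quadratic in n.
    for a, b in combinations(names, 2):
        d[a, b] = d[b, a] = p_distance(seqs[a], seqs[b])
    return d


toy = {"a": "ACGT", "b": "ACGA", "c": "TCGA"}
matrix = distance_matrix(toy)
print(matrix["a", "b"], matrix["a", "c"], matrix["b", "c"])
```

The tree-building stage that consumes this matrix is where the O(n^2) or O(n^3) running time quoted above applies.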
Subtree assembly-based methods. Some methods operate by computing trees on a collection of small subsets of the taxon set, and then combine the subset trees together into a tree. We call these “subtree assembly-based methods” since they operate by estimating subtrees and then assembling them into a larger tree. Many subtree assembly-based methods compute quartet trees using maximum likelihood, and then combine the quartet trees using some quartet amalgamation method, such as Quartet Puzzling (Strimmer and von Haeseler, 1996), Weight Optimization (Ranwez and Gascuel, 2001), Quartet Joining (Xin et al., 2007), Quartets MaxCut (Snir and Rao, 2010), and Quartet FM (Reaz et al., 2014).
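To make the idea of a quartet tree concrete, the sketch below picks a topology for four taxa from pairwise distances using the four-point method: of the three possible pairings, choose the one with the smallest sum of within-pair distances. This is a swapped-in technique for illustration only; the methods cited above typically estimate quartet trees by maximum likelihood, and the amalgamation step is not shown. All names and the toy distances are hypothetical.

```python
from math import comb


def four_point_quartet(d, quartet):
    """Return the pairing ((w, x), (y, z)) with the smallest distance sum.

    `d` is a nested dictionary of pairwise distances, d[x][y]. Ties are
    broken arbitrarily in favour of the first pairing listed.
    """
    a, b, c, e = quartet
    options = {
        ((a, b), (c, e)): d[a][b] + d[c][e],
        ((a, c), (b, e)): d[a][c] + d[b][e],
        ((a, e), (b, c)): d[a][e] + d[b][c],
    }
    return min(options, key=options.get)


# Additive distances from the tree ((a,b),(c,d)) with every edge of length 1.
close, far = 2.0, 3.0
d = {
    "a": {"b": close, "c": far, "d": far},
    "b": {"a": close, "c": far, "d": far},
    "c": {"a": far, "b": far, "d": close},
    "d": {"a": far, "b": far, "c": close},
}
print(four_point_quartet(d, ("a", "b", "c", "d")))

# The number of four-taxon subsets grows as n choose 4.
print(comb(10, 4), comb(1000, 4))
```

The second print statement shows why methods that examine all quartets become expensive: the count of four-taxon subsets grows with the fourth power of the number of taxa.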

Preface
Tandy Warnow, University of Illinois, Urbana-Champaign
Book:

Computational Phylogenetics

Published online:

26 October 2017

Print publication:

02 November 2017, pp xiii-xvi
- Chapter
Summary

Overview
The evolutionary history of a set of genes, species, or individuals provides a context in which biological questions can be addressed. For this reason, phylogeny estimation is a fundamental step in many biological studies, with many applications throughout biology, such as protein structure and function prediction, analyses of microbiomes, inference of human migrations, etc. In fact, there is a famous saying by Dobzhansky that “Nothing in biology makes sense except in the light of evolution” (Dobzhansky, 1973).
Because phylogenies represent what has happened in the past, they cannot be directly observed but rather must be estimated. Consequently, increasingly sophisticated statistical models of sequence evolution have been developed, and are now used to estimate phylogenetic trees. Indeed, over the last few decades, hundreds of software packages and novel algorithms have been developed for phylogeny estimation, and this influx of computational approaches into phylogenetic estimation has transformed systematics. The availability of sophisticated computational methods, fast computers and high-performance computing (HPC) platforms, and large sequence datasets enabled through DNA sequencing technologies, has led to the expectation that highly accurate large-scale phylogeny estimation, potentially answering open questions about how life evolved on earth, should be achievable.
Yet large-scale phylogeny estimation turns out to be much more difficult than expected, for multiple reasons. First, all the best methods are computationally intensive, and standard techniques do not scale well to large datasets; for example, maximum likelihood phylogeny estimation is NP-hard, so exact solutions cannot be found efficiently (unless P = NP), and Bayesian MCMC methods can take a long time to reach stationarity. While massive parallelism can ameliorate these challenges to some extent, it doesn't really address the basic challenge inherent in searching an exponential search space. Second, the statistical models of sequence evolution that properly address genomic data are substantially more complex than the ones that model individual loci, and methods to estimate genome-scale phylogenies are (relatively speaking) in their infancy compared to methods for single gene phylogenies.
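The size of the search space mentioned above can be made concrete with a standard combinatorial fact that is not stated in this excerpt: the number of unrooted binary trees on n labeled leaves is (2n-5)!!, the product of the odd numbers from 1 to 2n-5. A small sketch, with a hypothetical function name:

```python
def num_unrooted_binary_trees(n):
    """Count unrooted binary trees on n labeled leaves, i.e. (2n-5)!!."""
    if n < 3:
        raise ValueError("the formula applies for n >= 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (the loop is empty for n = 3).
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20, 50):
    print(n, num_unrooted_binary_trees(n))
```

Even at 10 leaves there are already more than two million candidate topologies, which is why exhaustive search is ruled out well before datasets become large.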

8 - Statistical Gene Tree Estimation Methods
from PART II - MOLECULAR PHYLOGENETICS
Tandy Warnow, University of Illinois, Urbana-Champaign
Book:

Computational Phylogenetics

Published online:

26 October 2017

Print publication:

02 November 2017, pp 145-177
- Chapter
Summary

Introduction to Statistical Estimation in Phylogenetics
Phylogeny estimation from biomolecular sequences is often posed as a statistical inference problem where the sequences evolve down a tree via a stochastic process. Statistical estimation methods take advantage of what is known (or hypothesized) about that stochastic process in order to produce an estimate of the evolutionary history. When we consider phylogeny reconstruction methods as statistical estimation methods, many statistical performance issues arise. For example: Is the method guaranteed to construct the true tree (with high probability) if there are enough data (i.e., is the method statistically consistent under the model)? How much data does the method need to obtain the true tree with high probability (i.e., what is the sample complexity of the method)? Is the method still relatively accurate if the assumptions of the model do not apply to the data that are used to estimate the tree (i.e., is the method robust to model misspecification)?
Markov models of evolution form the basis of most computational methods of analysis used in phylogenetics. The simplest of these models are for characters with two states, reflecting the presence or absence of a trait. However, the most common models are for nucleotide (four-state characters) or amino acid (20-state characters) data. They can also be used (although less commonly) for codon data, in which case they have 64 states (Goldman and Yang, 1994; Yang et al., 2000; Kosiol et al., 2007; Anisimova and Kosiol, 2009; De Maio et al., 2013).
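The state counts quoted above can be checked by enumerating the alphabets directly. This is a small sketch, assuming the usual one-letter codes for nucleotides and amino acids; the constant names are hypothetical, and the codon list simply enumerates all nucleotide triplets to match the 64-state count given in the text.

```python
from itertools import product

BINARY_STATES = "01"                   # presence or absence of a trait
NUCLEOTIDES = "ACGT"                   # four-state characters
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20-state characters
# Every ordered triplet of nucleotides: 4 * 4 * 4 = 64 codon states.
CODONS = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]

for label, states in [
    ("binary", BINARY_STATES),
    ("nucleotide", NUCLEOTIDES),
    ("amino acid", AMINO_ACIDS),
    ("codon", CODONS),
]:
    k = len(states)
    # A Markov model on k states is described by a k x k matrix of
    # substitution probabilities along each edge of the tree.
    print(f"{label}: {k} states, {k * k} matrix entries")
```

The growth in matrix size from 4 entries for binary characters to 4096 for codons gives a sense of why models with larger state spaces are more costly to work with.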
In Chapter 1, we described the Cavender–Farris–Neyman (CFN) model of binary sequence evolution, and a simple method to estimate the CFN tree from binary sequences. In this chapter we focus on sequence evolution models that are applicable to molecular sequence evolution. As we will see, the mathematical theorems and algorithmic approaches are very similar to those developed for phylogeny estimation under the CFN model.
Statistical identifiability is an important concept related to Markov models. We say that a parameter (such as the tree topology) of the Markov model is identifiable if the probability distributions of the patterns of states at the leaves of the tree are always different for two Markov models that differ in that parameter. Thus, some parameters of a model may be identifiable while others may not be. For example, the unrooted tree topology is identifiable under the CFN model, but the location of the root is not.

5 - Distance-based Tree Estimation Methods
from PART I - BASIC TECHNIQUES
Tandy Warnow, University of Illinois, Urbana-Champaign
Book:

Computational Phylogenetics

Published online:

26 October 2017

Print publication:

02 November 2017, pp 83-108
- Chapter

