Molecular phylogenetics continued…
Outline
1. Models of evolution
2. Phylogenetic tree reconstruction methods:
   -- distance based methods
   -- maximum parsimony (MP)
   -- maximum likelihood (ML)
3. Bootstrapping: evaluating the significance of a tree
The simplest model of evolution: pairwise distance
The simplest approach to measuring distances between sequences is to align pairs of sequences and then count the number of differences. The degree of divergence is called the p-distance. For an alignment of length N with n sites at which there are differences, the degree of divergence D is:
D = n / N
For example, in an alignment where 3 of 60 aligned residues differ, the p-distance is 3/60 = 0.05.
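As a quick illustration, a minimal Python sketch of the p-distance calculation (the function name and the lack of gap/ambiguity handling are my own simplifications):

```python
def p_distance(seq1, seq2):
    """Proportion of aligned sites that differ between two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n_diff = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return n_diff / len(seq1)

# Example from the slide: 3 differences over 60 aligned sites
# p_distance("A" * 57 + "CGT", "A" * 57 + "TTA") == 3 / 60 == 0.05
```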
Common assumptions of simple evolutionary models
Simple models of the evolutionary process make several incorrect assumptions:
1) equal base or amino acid substitution rates
2) an equal frequency of all bases or amino acids
3) an equal evolutionary rate at all sites of an alignment
4) independent evolution between sites of an alignment
The Poisson model is an oversimplification of the evolutionary process. In addition to an equal substitution probability, it assumes an equal frequency of all amino acids, an equal evolutionary rate at all sites, and independent evolution between sites. Observations of DNA and protein alignments demonstrate that these assumptions are often not met in nature; for example, amino acid substitution tends to occur much more frequently between amino acids of similar physicochemical properties [19]. Much more realistic models of DNA and protein evolution have therefore been devised (more on this to follow).
Evolutionary models: The Poisson distance correction
-- A simple correction of the p-distance can be derived by assuming the probability of mutation at a site follows a Poisson distribution (with a uniform mutation rate)
-- The correction takes account of multiple mutations at the same site
Poisson distribution: the probability of a given number of events occurring over a given time interval with a known average rate
Evolutionary models: The Poisson distance correction
-- A simple correction of the p-distance can be derived by assuming the probability of mutation at a site follows a Poisson distribution (with a uniform mutation rate)
-- The correction takes account of multiple mutations at the same site
Poisson corrected distance: d_P = -ln(1 - p)
-- The corrected distance starts to deviate noticeably from the p-distance for p > 0.25
Assumption: equal rate of mutation at all sites
Figure 8.1
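A one-line sketch of the correction in Python (assuming p < 1; natural logarithm):

```python
import math

def poisson_distance(p):
    """Poisson-corrected distance d_P = -ln(1 - p) for an observed p-distance p."""
    return -math.log(1.0 - p)

# poisson_distance(0.05) ~= 0.051 (close to p); poisson_distance(0.5) ~= 0.69 (well above p)
```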
Evolutionary models: the Gamma distance correction
-- The Gamma distance correction takes account of mutation rate variation at different sites
-- A Gamma distribution (Γ) can effectively model realistic variation in mutation rates
Gamma-corrected distance: d_Γ = a[(1 - p)^(-1/a) - 1]
-- The parameter a determines the rate variation
-- Values of a estimated from real protein sequence data vary between 0.2 (high variation) and 3.5 (lower variation)
Figure 8.2
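The same correction in code form (a sketch; a is the Gamma shape parameter). As a becomes large, the Gamma correction approaches the Poisson correction.

```python
def gamma_distance(p, a):
    """Gamma-corrected distance d_Gamma = a * ((1 - p)**(-1/a) - 1).

    Small a (e.g. 0.2) models strong among-site rate variation;
    large a approaches the Poisson correction -ln(1 - p)."""
    return a * ((1.0 - p) ** (-1.0 / a) - 1.0)

# gamma_distance(0.3, 0.2) is noticeably larger than gamma_distance(0.3, 3.5)
```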
Evolutionary models
The p-distance, Poisson correction and Gamma correction do not include any information relating to the chemical nature of the sequences, which means they can be applied to both nucleotide and protein sequences. It follows that there is a whole series of more complex evolutionary models specific to nucleotide or protein sequence evolution.
Jukes and Cantor (JC) one-parameter model of nucleotide substitution: all substitutions occur with equal probability (rate α)

Substitution rate matrix (rows and columns in the order A, C, G, T):
      A     C     G     T
A   -3α     α     α     α
C     α   -3α     α     α
G     α     α   -3α     α
T     α     α     α   -3α

P. 271
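To make the rate matrix concrete, here is a small sketch (Python, assuming NumPy and SciPy are available) that builds the JC matrix and converts it into substitution probabilities over a branch; the choice of alpha and branch length is arbitrary.

```python
import numpy as np
from scipy.linalg import expm

def jc_rate_matrix(alpha=1.0):
    """JC69 instantaneous rate matrix Q (order A, C, G, T): every change has rate alpha."""
    Q = np.full((4, 4), alpha)
    np.fill_diagonal(Q, -3.0 * alpha)   # rows of a rate matrix sum to zero
    return Q

# Substitution probabilities over time t: P(t) = exp(Q t).
# As t grows, every entry of P(t) approaches 0.25 (equal base frequencies).
P = expm(jc_rate_matrix(alpha=1.0) * 0.1)
```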
Kimura two-parameter model (K2P) of nucleotide substitution: transitions and transversions occur with different probabilities (transition rate α, transversion rate β)

Substitution rate matrix (rows and columns in the order A, C, G, T):
         A         C         G         T
A    -(α+2β)       β         α         β
C        β     -(α+2β)       β         α
G        α         β     -(α+2β)       β
T        β         α         β     -(α+2β)

P. 272
Incorporation of unequal base frequencies
HKY85 substitution rate matrix: a K2P-style model in which the rate matrix is modified to account for differences in base composition (πA : πC : πG : πT)

Off-diagonal rates (rows and columns in the order A, C, G, T); each diagonal element is minus the sum of the other entries in its row:
        A        C        G        T
A       ·       βπC      απG      βπT
C      βπA       ·       βπG      απT
G      απA      βπC       ·       βπT
T      βπA      απC      βπG       ·

P. 273
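A sketch of how such a rate matrix could be assembled (Python/NumPy; the parameterization and scaling convention here are a simplification, with each diagonal set so the rows sum to zero):

```python
import numpy as np

def hky85_rate_matrix(pi, alpha, beta):
    """HKY85 rate matrix, base order A, C, G, T.

    pi: equilibrium base frequencies (piA, piC, piG, piT)
    alpha: transition rate parameter (A<->G, C<->T); beta: transversion rate parameter."""
    bases = "ACGT"
    transitions = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
    Q = np.zeros((4, 4))
    for i, x in enumerate(bases):
        for j, y in enumerate(bases):
            if i != j:
                rate = alpha if (x, y) in transitions else beta
                Q[i, j] = rate * pi[j]          # rate toward y is weighted by pi_y
        Q[i, i] = -Q[i].sum()                   # diagonal makes the row sum to zero
    return Q

Q = hky85_rate_matrix(pi=(0.3, 0.2, 0.2, 0.3), alpha=2.0, beta=1.0)
```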
Different models of molecular evolution (nucleotides)
Model            Base composition   Different transition and transversion rates   All transition rates identical   All transversion rates identical
JC               1:1:1:1            No                                             Yes                              Yes
F81              Variable           No                                             Yes                              Yes
K2P              1:1:1:1            Yes                                            Yes                              Yes
HKY85            Variable           Yes                                            Yes                              Yes
Tamura-Nei (TN)  Variable           Yes                                            No                               Yes
K3P              1:1:1:1            Yes                                            Yes                              No
SYM              1:1:1:1            Yes                                            No                               No
REV (GTR)        Variable           Yes                                            No                               No
Table 7.2
Evolutionary models: amino acid substitution matrices
There are empirically based models of amino acid substitution, which consist of a 20 x 20 rate matrix that estimates the probability of each amino acid being replaced by each alternative amino acid.
Earlier we described the Poisson model as a simple, theoretical model of amino acid substitution that treats all amino acid replacements as equally probable. However, empirical evidence has revealed that amino acids are much more likely to be replaced by amino acids with similar physicochemical properties (polarity, size and charge) than is assumed under an equal-replacement model such as the Poisson model [19]. This observation has led to empirically based models of this kind. The approach was first employed by Dayhoff and coworkers, who developed an amino acid substitution matrix by calculating the replacement probabilities of amino acids from trees inferred from protein alignments by maximum parsimony [45]. Additional empirical models have since been developed. The Jones-Taylor-Thornton (JTT) model is based on a more up-to-date substitution matrix constructed from a larger database of sequences [46] and as such is preferred over the Dayhoff model; its probabilities were estimated from protein alignments that were at least 85% identical, to reduce the chance that an observed change resulted from multiple substitutions [46]. The PMB model is derived from the BLOCKS database of conserved protein motifs [47] and is therefore related to BLOSUM. The WAG model uses a substitution matrix calculated with an approximate maximum likelihood method [48]; the NadV maximum likelihood phylogeny presented in Figure 10 was inferred under the WAG model.
Common assumptions of simple evolutionary models
Simple models of the evolutionary process make several incorrect assumptions:
1) equal base or amino acid substitution rates
   solution: use a more complex substitution matrix
2) an equal frequency of all bases or amino acids
   solution: estimate frequencies from the sequence alignment data
3) an equal evolutionary rate at all sites of an alignment
   solution: model among-site rate variation (ASRV) with a Gamma distribution
4) independent evolution between sites of an alignment
   solution: yikes! No easy solution here…
How to select an appropriate evolutionary model:
While it is easy to identify models that are formally more realistic, these are not necessarily more effective in representing the real data (i.e. the MSA)
Figure 7.18
How to select an appropriate evolutionary model:
While it is easy to identify models that are formally more realistic, these are not necessarily more effective in representing the real data (i.e. the MSA)

Example of model selection (Table 7.3):
Model       No. of parameters   log-likelihood (lnL)   AIC
JC                17                 -19864            39762
F81               20                 -19859            39758
HKY85             21                 -19779            39601
HKY85+Γ           22                 -19462            38968

Akaike information criterion (AIC): measures the support in the data for a given model. The model with the smallest AIC value is regarded as the most suitable.
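The comparison in the table can be reproduced with a few lines of Python (values copied from the table above; "G" stands for Γ):

```python
def aic(ln_likelihood, n_params):
    """Akaike information criterion: AIC = 2k - 2 lnL (smaller is better)."""
    return 2 * n_params - 2 * ln_likelihood

models = {"JC": (17, -19864), "F81": (20, -19859),
          "HKY85": (21, -19779), "HKY85+G": (22, -19462)}
scores = {name: aic(lnl, k) for name, (k, lnl) in models.items()}
best = min(scores, key=scores.get)   # "HKY85+G": its extra parameters are justified by the data
```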
Outline
1. Models of evolution
2. Phylogenetic tree reconstruction methods:
   -- distance based methods
   -- maximum parsimony (MP)
   -- maximum likelihood (ML)
3. Bootstrapping: evaluating the significance of a tree
Phylogenetic tree reconstruction
Phylogenetic inference is a hypothesis-generating procedure, where an inferred tree represents the "best hypothesis" of evolutionary relationships based on the limited information contained in molecular sequence data and the assumptions of the phylogenetic reconstruction method. Of the many possible evolutionary histories that could produce the observed differences between homologous sequences, we must have some method for choosing one or more best trees from all possible trees.
Tree reconstruction methods
Algorithmic methods follow a fixed series of procedures (an algorithm) to derive a tree from the data.
- computationally fast
- how well the tree fits the data relative to an alternative tree is unknown
- e.g. UPGMA or neighbor-joining methods
Optimality criterion methods define a criterion for comparing trees and then find the tree that maximizes/minimizes the criterion.
- can define how good or bad any one tree is compared to other possibilities
- e.g. maximum parsimony and maximum likelihood methods
Algorithmic methods tend to be computationally fast. However, because they proceed directly to a final tree, without evaluating multiple trees, confidence in how well the algorithm-generated tree fits the data relative to an alternative hypothesized tree is unknown. Most distance-based clustering methods (e.g. UPGMA, neighbor-joining) fall into this category. Optimality criterion (objective function) methods define a criterion for comparing alternative trees and then find the best tree that maximizes or minimizes the criterion. The advantage of optimality criterion methods, which include maximum likelihood and maximum parsimony, is that they can define how good or bad any one tree is with respect to the criterion and the data. If many trees can explain the data equally well, the user will not be deceived into choosing a single tree as the best hypothesis.
Outline
1. Models of evolution
2. Phylogenetic tree reconstruction methods:
   -- distance based methods
   -- maximum parsimony (MP)
   -- maximum likelihood (ML)
3. Bootstrapping: evaluating the significance of a tree
Distance matrix methods
Phylogenetic inference by distance matrix methods involves two sequential steps:
1) the evolutionary distances (i.e. number of substitutions) between all taxa in an alignment are estimated based on a model of evolution
2) the results are tabulated in a distance matrix and one of a variety of approaches is used to reconstruct a phylogenetic tree from the pairwise distance values
The general flow of a distance matrix method for phylogenetic inference
[Figure: an alignment of Species A to E is converted into a pairwise distance matrix, from which a tree is reconstructed]
Inferring a tree from a distance matrix
There is an extensive variety of both algorithmic and optimality criterion methods available for inferring a phylogenetic tree from a matrix of evolutionary distances. The simplest algorithmic method is the unweighted pair-group method with arithmetic mean (UPGMA) [28]. UPGMA uses a sequential clustering algorithm to group taxa in order of decreasing similarity, producing an ultrametric tree. The details of this algorithm are presented in Chapter 8 (p ).
Assumptions of UPGMA
UPGMA makes the assumption that there is a linear relationship between evolutionary distance and divergence time, or, in other words, that the rate of evolution is equal and has remained constant among taxa (i.e. ultrametric or clock-like). This assumption is rarely, if ever, met, and therefore it is advised that UPGMA not be used to infer a best tree. There are many other superior methods for tree reconstruction that are as easy to implement and are computationally fast.
The neighbor joining (NJ) method
NJ does not assume all sequences have the same constant rate of evolution. The basis of the method lies in the concept of minimum evolution, specifically that the tree with the shortest total branch length is the best tree.
Neighbor-joining is a widely employed algorithmic procedure that does not assume the data are ultrametric [29]. It is a star decomposition algorithm that attempts to minimize the overall branch length of the tree. From an initial star tree with a single internal node, all possible two-node trees are constructed, where the second node consists of all pairs of taxa (Figure 6). The pair of taxa that gives the tree with the smallest sum of branch lengths (S) is chosen as the first pair of "neighbors". These two taxa are then treated as a single composite taxon, a new distance matrix is computed, and the process is repeated successively until a fully resolved tree is assembled [30].
The first steps of NJ: start with a star tree, identify the first pair of nearest neighbors… full details on p
Figure 8.6
Neighbor joining method
Modified versions of the original neighbor-joining method, such as BioNJ [31] and Weighbor [32], have been formulated and tend to outperform the original neighbor-joining algorithm [33]. Because of its fast run times, neighbor-joining is particularly useful for large studies or for bootstrap resampling studies that require analysis of multiple datasets (see the section on bootstrap analysis below).
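A sketch of the neighbor-selection step. The notes above describe choosing the pair that gives the smallest sum of branch lengths S; standard implementations compute this via the equivalent Q criterion shown here (Python/NumPy; the function name is my own):

```python
import numpy as np

def nj_first_neighbors(D):
    """Pick the first pair of 'neighbors' to join under neighbor joining.

    D: n x n symmetric matrix of pairwise evolutionary distances.
    Q(i, j) = (n - 2) * D[i, j] - sum_k D[i, k] - sum_k D[j, k]; the smallest Q wins."""
    n = D.shape[0]
    row_sums = D.sum(axis=1)
    best_pair, best_q = None, np.inf
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * D[i, j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair

# In the full algorithm the chosen pair becomes a composite taxon, distances to it are
# recomputed, and the step repeats until the tree is fully resolved.
```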
Outline
1. Models of evolution
2. Phylogenetic tree reconstruction methods:
   -- distance based methods
   -- maximum parsimony (MP)
   -- maximum likelihood (ML)
3. Bootstrapping: evaluating the significance of a tree
Parsimony methods
Parsimony methods were among the first methods for inferring phylogenies and are based on the concept that the best hypothesis is the one that requires the least amount of evolutionary change [37].
Objective: to find the tree (i.e. hypothesis) that requires the minimum number of substitutions to explain the observed/inferred differences between sequences.
Maximum parsimony (MP) is thus an optimality-criterion method in which the criterion (i.e. number of substitutions) is to be minimized. The tree that minimizes the number of substitutions required to explain the data is called the maximum parsimony tree.
There are only 3 possible trees with 4 taxa
[Figure: unrooted four-taxon trees for taxa A, B, C and D]
Which two trees are the same?
Parsimony methods
Parsimony begins with the classification of sites as either informative (sensu parsimony) or uninformative. A site is considered informative if it favors a subset of trees over all possible trees. Site 1 is uninformative because the character states are all identical.
The classification of sites and the basic procedure behind choosing the maximum parsimony tree can be illustrated by considering a hypothetical four-taxon alignment (Figure 7a). For four taxa there exist three possible unrooted trees, and we can use the information in the sequence alignment to choose which tree amongst these three possibilities is the most parsimonious. Site 1 is considered uninformative because all sequences at this site have the same character state (adenosine) and no change is required, regardless of the inferred tree. Site 2, although not invariant, is also uninformative because the most parsimonious explanation for this character pattern requires an identical minimum of two substitutions in all three trees (Figure 7b). The remaining three sites are all parsimony informative, in that they favor one tree over the other two possibilities, and can be used to search for the maximum parsimony tree. To identify the maximum parsimony tree, we first calculate the minimum number of substitutions at each informative site for all three possible trees (Figure 7b). For example, a single change is required to explain the substitution pattern observed at Site 3 given Tree 1, whereas a minimum of two substitutions must be inferred from Trees 2 and 3 in order to reconstruct the same substitution pattern; therefore, Tree 1 is the most parsimonious explanation for the observed substitution pattern at Site 3. Next, we sum the number of substitutions across all informative sites for each possible tree and select the tree that gives the minimum number of changes (Figure 7c). In our example, Tree 2 is the most parsimonious tree because it requires only six substitutions, whereas Trees 1 and 3 require 7 and 8 substitutions, respectively. This is a simple example of parsimony, presented here to explain the basic principle; different types of parsimony exist and for further information see [37].
Parsimony methods
Site 2 is uninformative because two substitutions are required in all possible trees.
Site 3 is informative and tree 1 is most parsimonious
Site 4 is informative and tree 2 is most parsimonious.
Site 5 is informative and tree 2 is most parsimonious.
Tree 2 is the maximum parsimony tree: it requires only six substitutions, whereas Trees 1 and 3 require 7 and 8 substitutions, respectively.
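A minimal sketch of the per-site counting step for four taxa, using Fitch's algorithm (the standard small-parsimony procedure). The alignment in Figure 7a is not reproduced here, so the example site pattern below is hypothetical:

```python
def fitch_score_4taxa(site, tree):
    """Minimum number of substitutions at one site on a 4-taxon unrooted tree.

    site: dict mapping taxon -> character state, e.g. {"A": "G", "B": "G", "C": "A", "D": "A"}
    tree: the two cherries of the unrooted tree, e.g. (("A", "B"), ("C", "D"))"""
    changes = 0
    internal_states = []
    for left, right in tree:                               # resolve each cherry first
        s_left, s_right = {site[left]}, {site[right]}
        shared = s_left & s_right
        internal_states.append(shared if shared else s_left | s_right)
        if not shared:
            changes += 1
    if not (internal_states[0] & internal_states[1]):      # join the two internal nodes
        changes += 1
    return changes

trees = [(("A", "B"), ("C", "D")), (("A", "C"), ("B", "D")), (("A", "D"), ("B", "C"))]
site = {"A": "G", "B": "G", "C": "A", "D": "A"}            # hypothetical informative site
scores = [fitch_score_4taxa(site, t) for t in trees]       # -> [1, 2, 2]: first tree favored
```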
Searching through the "forest" for the "best tree"
As the number of taxa becomes large (10+), the number of possible trees becomes enormous (Table 1) and searching this "tree space" for the optimal tree can become computationally impossible. However, procedures exist for reducing the search time (e.g. heuristic searches), discussed below. Bayesian methods are an alternative to the algorithmic and optimality criterion methods presented here [24].
Searching tree space
Heuristic tree searches seek the optimal tree through the use of iterative trial-and-error processes, which examine a subset of all possible trees.
Some common branch swapping algorithms:
- Nearest neighbor interchange (NNI): a branch swapping method that results in local rearrangements of a tree [27]
- Subtree pruning and regrafting (SPR): all possible subtrees are "pruned" from the reference tree and then "regrafted" at an alternative location [27]
- Tree bisection and reconnection (TBR): all possible bisections of a tree are considered, from which all combinations of pair-wise reconnections are evaluated [27]
Optimality-criterion phylogenetic methods, such as maximum likelihood, search through multiple trees to find the optimal tree. The most thorough approach is to evaluate all possible trees in an exhaustive search and choose the globally optimal tree. Exhaustive search is only useful for a small number of taxa, since the number of possible trees increases so rapidly with the number of taxa that it becomes computationally impossible to evaluate them all (Table 1). One approach to overcoming this problem is to employ heuristic search methods. Most heuristic searches operate under similar principles: a starting tree is first constructed by a fast algorithmic tree building method, such as neighbor joining. Alternative trees are then examined by systematically rearranging branches, and any tree with a higher optimality than the reference tree is retained and used as the new reference tree. This process is repeated until no better tree can be found. The tradeoff of heuristic searching is that it is not guaranteed to find the globally optimal tree, due to the possibility of local optima in the evaluated criterion (see [37]).
Problems with parsimony
[Figure: correct phylogeny; true convergent evolution; incorrect reconstruction]
Suppose the true phylogeny for a collection of four taxa is as illustrated in Figure 8a, where the lengths of the branches specify the amount of evolutionary change that has occurred. The long terminal branches suggest that the rate of evolution is accelerated in taxa I and II. The probability that homoplasious substitutions have occurred in these fast-evolving lineages is higher than for taxa III and IV. Hence, an informative site may have a substitution pattern as illustrated in Figure 7b. Unfortunately, this pattern supports an incorrect tree if incorporated into parsimony (Figure 7c). Because this inconsistency in parsimony tends to cluster long branches together, it has become known as "long branch attraction" [38]. Long branch attraction can be a problem in all phylogenetic methods and will be discussed more thoroughly in a later section.
Outline
1. Models of evolution
2. Phylogenetic tree reconstruction methods:
   -- distance based methods
   -- maximum parsimony (MP)
   -- maximum likelihood (ML)
3. Bootstrapping: evaluating the significance of a tree
Maximum likelihood methods
Maximum likelihood is an optimality-based method which evaluates a hypothesized tree in terms of the probability that it would lead to the observed sequence data under a proposed model of evolution [39, 40]. The principle of maximum likelihood is to find the tree that maximizes the likelihood of observing the data. ML methods are among the most accurate for inferring phylogenetic trees, but also among the most time-consuming to run.
A very brief overview of the maximum likelihood method
1) Calculate the likelihood (L) of each site given the tree
Calculating the likelihood of a tree is a statistical procedure which, like parsimony, considers each site of the alignment individually. Here we introduce the basic principles of the likelihood calculation using a collection of aligned nucleotide sequences and a four-taxon tree (Figure 9). To calculate the likelihood for the unrooted tree illustrated in Figure 9b, we begin by evaluating each site individually and then combine all the site likelihoods into a total likelihood value for the tree. To calculate a site likelihood, for example the likelihood of Site 2 in the nucleotide alignment illustrated in Figure 9a, we must consider every potential path of evolution that could have led to the extant character pattern. There are a total of 16 possible paths to consider for Site 2 (Figure 9c) and, although some pathways are less reasonable than others, they must all be considered because they exist with a probability greater than zero and contribute positively to the likelihood value of the tree. The likelihood for each site is found by summing the probabilities of the 16 possible pathways (Figure 9c). Under the assumption of independent evolution of sites, the overall likelihood for the tree is equal to the product of the likelihoods for each site (Figure 9d). In practice, likelihood values are extremely small, so it is the logarithmic transformation of the likelihood that is usually evaluated. When the log likelihood (lnL) is considered, multiplication is transformed to summation, and the equation in Figure 9d becomes that in Figure 9e.
A very brief overview of the maximum likelihood method
1) Calculate the likelihood (L) of each site given the tree
2) Sum the ln(L) values to get the likelihood of the whole alignment
This calculation must be performed for each tree during a heuristic search.
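A minimal numerical sketch of the site-likelihood sum for a four-taxon tree under the JC model (Python). The tree ((A,B),(C,D)), the branch lengths, and the example site are my own assumptions; the 16-term sum corresponds to the 16 possible paths described above.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(i, j, d):
    """JC69 probability of ending in base j after a branch of length d (subs/site), starting from i."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def site_likelihood(tips, bl):
    """Likelihood of one site on the unrooted tree ((A,B),(C,D)).

    tips: observed bases at taxa A, B, C, D
    bl: branch lengths keyed by "A", "B", "C", "D", "internal"
    Sums over the 16 joint states of the two internal nodes."""
    a, b, c, d = tips
    total = 0.0
    for x, y in product(BASES, repeat=2):
        total += (0.25                                    # stationary frequency of x under JC
                  * jc_prob(x, a, bl["A"]) * jc_prob(x, b, bl["B"])
                  * jc_prob(x, y, bl["internal"])
                  * jc_prob(y, c, bl["C"]) * jc_prob(y, d, bl["D"]))
    return total

bl = {"A": 0.1, "B": 0.1, "C": 0.1, "D": 0.1, "internal": 0.05}
lnL_site = math.log(site_likelihood(("A", "A", "G", "G"), bl))
# Total lnL of the alignment = sum of the per-site lnL values.
```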
Outline
1. Models of evolution
2. Phylogenetic tree reconstruction methods:
   -- distance based methods
   -- maximum parsimony (MP)
   -- maximum likelihood (ML)
3. Bootstrapping: evaluating the significance of a tree
Error associated with inferred trees
There are several sources of error encountered when inferring a phylogenetic tree.
Random error is the deviation from the true tree that arises because only a limited length of sequence data is available. Random error therefore tends to decrease with increasing data length, as the stochastic variation associated with small sample size becomes smaller.
Systematic error is the deviation from the true tree due to incorrect assumptions in the method or model used for phylogenetic inference. Systematic error introduces a bias that may support the wrong tree and, unlike random error, the addition of more data will tend to increase support for the incorrect tree.
Fortunately, procedures for assessing both random and systematic error exist, and considerable effort has been directed at minimizing their effects in phylogenetic inference. For further discussion of artifacts due to phylogenetic pitfalls see [51].
Evaluating trees: bootstrapping
Bootstrapping is a commonly used approach to measuring the robustness of a tree topology. Given a branching order, how consistently does a phylogenetic method find that branching order in randomly resampled versions of the original data set?
IMPORTANT: Bootstrapping allows an assessment of random error only, not systematic error due to inaccurate assumptions in an evolutionary model.
Evaluating trees: bootstrapping
- To bootstrap, make an artificial dataset by randomly sampling columns (with replacement) from your multiple sequence alignment. Make the dataset the same size as the original.
- Do 100 (to 1,000) bootstrap replicates.
- Observe the percentage of cases in which the assignment of clades in the original tree is supported by the bootstrap replicates.
- >70% is considered significant.
Evaluating trees: bootstrap analysis
Random error: When the length of sequence data available for a given set of taxa is limited, there exists the possibility that one tree will be favoured over a second tree by chance alone. This random error associated with finite sample size would only disappear once an infinite amount of data had been obtained, a realistically unattainable situation. Hence, once the best tree has been generated, it is important to assess how sensitive this tree is to the amount of sequence data from which it was inferred.
Nonparametric bootstrap analysis: The nonparametric bootstrap is a statistical technique that uses random resampling of data with replacement to determine the sampling error or confidence interval for an estimated parameter, in this case the groups of the hypothesized best tree [52]. This type of analysis is commonly employed in phylogenetics and is outlined in Figure 12. To begin, a series of pseudo-alignments of the same length as the original alignment is generated by sampling with replacement from the original alignment; sites can be sampled multiple times, or not at all, each with the same probability. Typically either 100 or 1000 pseudo-alignments are generated, depending on the computational time required by the phylogenetic method. From each pseudo-alignment a tree is inferred, resulting in a collection of 100 or 1000 estimated trees. The phylogenetic information contained in this set of trees (i.e. the number of trees in which the same group of taxa is recovered) is summarized in a consensus tree. There are several different methods for constructing a consensus tree, but the most common are strict consensus and majority-rule consensus trees [37]. In practice, bootstrap values >70% are often considered significant support for a clade; however, the significance of bootstrap values is highly debated. In our example, the clade containing A and B is reconstructed in all estimated trees, so this branch is given a value of 100%. In contrast, the clade containing D and E is recovered in only 66% of the estimated trees, and therefore our confidence that this is a robust relationship is lower. The recovery of this clade may reflect a weak phylogenetic signal, potentially affected by stochastic error in the data.
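A sketch of the column-resampling step that produces one pseudo-alignment (Python; the toy alignment and function name are illustrative only):

```python
import random

def bootstrap_alignment(alignment, rng):
    """One bootstrap pseudo-alignment: sample alignment columns with replacement.

    alignment: dict mapping taxon name -> aligned sequence (all equal length)."""
    taxa = list(alignment)
    length = len(alignment[taxa[0]])
    cols = [rng.randrange(length) for _ in range(length)]   # same length, columns resampled
    return {t: "".join(alignment[t][c] for c in cols) for t in taxa}

rng = random.Random(1)
toy = {"A": "ACGTACGTAC", "B": "ACGTACGAAC", "C": "TCGTACGTAA"}
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
# A tree is inferred from each replicate; clade support is the fraction of replicate trees
# in which the clade of interest is recovered.
```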