1
Phylogenetics -- Introduction
Phylogenetics (pron.: /faɪlɵdʒɪˈnɛtɪks/) is the study of evolutionary relationships among groups of organisms (e.g. species, populations).
2
Why phylogeny? Phylogeny is the evolutionary relationship among taxa. It provides a framework for studying comparative morphology, conservation, agriculture, biogeography, disease, and genomics.
4
Terminology Branches Internal nodes Root Terminal nodes
Phylogram or phylogenetic tree
5
Number of Trees Unrooted bifurcating (2N-5)!!
[Figure: the three unrooted bifurcating trees for four taxa, labeled 1 to 4]
6
Number of Trees Rooted bifurcating (2N-3)!!
[Figure: examples of rooted bifurcating trees for four taxa, labeled 1 to 4]
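The two double-factorial formulas can be checked numerically. A minimal sketch; the function names are illustrative and do not come from the slides:

```python
def double_factorial(k: int) -> int:
    """Product k * (k-2) * (k-4) * ... down to 1 or 2 (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n_taxa: int) -> int:
    """Unrooted bifurcating trees for n_taxa labeled tips: (2N-5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    return double_factorial(2 * n_taxa - 5)


def num_rooted_trees(n_taxa: int) -> int:
    """Rooted bifurcating trees for n_taxa labeled tips: (2N-3)!!"""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa for a rooted tree")
    return double_factorial(2 * n_taxa - 3)


for n in (4, 5, 10):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

For four taxa this gives 3 unrooted and 15 rooted trees. The counts grow so quickly that exhaustive tree search is only feasible for small N.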
7
Reading phylogenetic trees
A valid clade is monophyletic, consisting of the ancestor taxon and all its descendants. (a) Monophyletic. In this tree, grouping 1, consisting of the seven species B–H, is a monophyletic group, or clade. A monophyletic group is made up of an ancestral species (species B in this case) and all of its descendant species. Only monophyletic groups qualify as legitimate taxa derived from cladistics.
8
Reading phylogenetic trees
A paraphyletic group consists of an ancestral taxon and some, but not all, of the descendants. (b) Paraphyletic. Grouping 2 does not meet the cladistic criterion: It is paraphyletic, which means that it consists of an ancestor (A in this case) and some, but not all, of that ancestor’s descendants. (Grouping 2 includes the descendants I, J, and K, but excludes B–H, which also descended from A.)
9
Reading phylogenetic trees
A polyphyletic grouping includes numerous taxa that lack a common ancestor. (c) Polyphyletic. Grouping 3 also fails the cladistic test. It is polyphyletic, which means that it lacks the common ancestor (A) of the species in the group. Furthermore, a valid taxon that includes the extant species G, H, J, and K would necessarily also contain D and E, which are also descended from A.
10
Homology vs. Homoplasy
Homology is any similarity between characters that is due to their shared ancestry.
Homoplasy occurs when characters are similar, but are not derived from a common ancestor. Homoplasy often results from convergent evolution.
What type of characters do we use to determine evolutionary relationships?
11
Vertebrate limbs are homologous.
12
Are bird and bat WINGS homologous?
No, they are homoplasious. (Image source: lazybirder.blogspot.com)
13
Sequence homolog
Orthologs arise by speciation, paralogs by gene duplication, and xenologs by horizontal gene transfer.
14
Sequence homolog
15
Gene Duplication
If we unwittingly sequenced genes 1, 3 and 5 we would incorrectly infer that ((A,C) B) was the organismal tree. This would be a paralogous comparison. Only orthologous comparisons reflect the organismal tree.
16
Phylogenetic Inference
Joseph Felsenstein “father of statistical phylogenetics” Professor Departments of Genome Sciences and Biology Departments of Computer Science and Statistics University of Washington in Seattle
17
Methods
Distance based, algorithmic: UPGMA, NJ
Distance based, optimization: Minimum evolution
Character based, optimization: Maximum parsimony, Maximum likelihood, Bayesian analysis
Others!!!
18
Parsimony methods Occam's razor is a principle of parsimony, economy, or succinctness used in logic and problem-solving. It states that among competing hypotheses, the one that makes the fewest assumptions should be selected.
19
Parsimony: An Example
Various trees could explain the phylogeny of the following four sequences: AAG, AAA, GGA, AGA.
[Figure: two candidate trees over the sequences AAA, AGA, AAG, GGA]
Parsimony prefers the second tree to the first, because it requires fewer substitution events (three vs. four changes).
20
Fitch’s Algorithm – Step 1
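Number of changes = number of union operations
[Figure: bottom-up pass on an example tree; leaf states C, T, G, T, A, T and internal state sets CT, GT, AGT, T]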
21
Fitch’s Algorithm – Step 2
[Figure: top-down pass on the same tree, shown in six panels; node state sets CT, AGT, GT and assigned states C, A, G, T]
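The bottom-up pass of Fitch's algorithm (intersect the child state sets when possible, otherwise take their union and count one change) can be sketched as follows. The nested-tuple tree encoding and the leaf names s1 to s4 are assumptions made for illustration. The four sequences are the ones from the parsimony example.

```python
def fitch_site(tree, states):
    """Bottom-up Fitch pass for one site.

    tree   : a leaf name (str) or a 2-tuple of subtrees
    states : dict mapping leaf name -> observed character at this site
    Returns (state set at this node, number of union operations below it).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:                      # intersection is non-empty: no change needed
        return common, left_cost + right_cost
    # empty intersection: take the union and count one change
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Sum the Fitch change counts over all sites of an alignment."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(n_sites)
    )


# Leaf names are hypothetical labels for the four example sequences.
seqs = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}
tree_a = (("s1", "s3"), ("s2", "s4"))   # groups AAG with GGA
tree_b = (("s1", "s2"), ("s3", "s4"))   # groups AAG with AAA

print(parsimony_score(tree_a, seqs), parsimony_score(tree_b, seqs))
```

The two groupings need four and three changes respectively, matching the counts quoted in the parsimony example. The score does not depend on where the tree is rooted, so an arbitrary root is used here.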
22
Parsimony vs. probabilistic methods
The important advantage of probabilistic methods over parsimony is statistical consistency. MP is not consistent, particularly in the case of unequal evolutionary rates between different lineages (Felsenstein, 1978). The “model freeness” of MP methods does not protect them from error due to model misspecification; rather, it makes them less flexible in accommodating complex data signals.
23
Phylogenetic reconstruction: distance matrix methods
UPGMA (unweighted pair-group method with arithmetic mean)
Example: a distance matrix for 5 sequences.
      B     C     D     E
A    .53   .99  1.02   .82
B          .80   .93   .73
C                .65   .81
D                      .94
Entries for the merged clusters A/B, C/D and A/B/E as listed on the slide: .90, .98, .78, .86
The pair with the smallest distance is grouped until all of the sequences are clustered in a tree.
• assumes all sequences evolve at the same rate
• generates a rooted tree
March 11, 2005 BIOS427/827
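A minimal UPGMA sketch on the five-sequence matrix (A-B .53, A-C .99, A-D 1.02, A-E .82, B-C .80, B-D .93, B-E .73, C-D .65, C-E .81, D-E .94). Cluster-to-cluster distances are computed here as the arithmetic mean over all original leaf pairs. The merge distances are whatever the code computes and are not taken from the slide. The data layout and function name are illustrative.

```python
from itertools import combinations

names = ["A", "B", "C", "D", "E"]
upper_rows = {
    "A": [0.53, 0.99, 1.02, 0.82],
    "B": [0.80, 0.93, 0.73],
    "C": [0.65, 0.81],
    "D": [0.94],
}
# Store each pairwise distance under an unordered pair of leaf names.
dist = {}
for i, a in enumerate(names[:-1]):
    for b, d in zip(names[i + 1:], upper_rows[a]):
        dist[frozenset((a, b))] = d


def upgma(names, dist):
    """Return the list of merges as (cluster1, cluster2, distance).

    Clusters are tuples of leaf names; the distance between two clusters is
    the unweighted mean of all leaf-to-leaf distances between them.
    """
    def avg(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((c1, c2, avg(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


for c1, c2, d in upgma(names, dist):
    print("join", c1, c2, "at distance", round(d, 3))
```

Each join would be placed at height d/2 in the rooted tree, which is where the equal-rate assumption enters.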
24
Phylogenetic reconstruction: distance matrix methods
NJ (neighbor joining method)
Example: a distance matrix for 5 sequences.
      B     C     D     E
A    .53   .99  1.02   .82
B          .80   .93   .73
C                .65   .81
D                      .94
1) Start with a star-like phylogeny.
2) The total length of the tree (the sum of the branch lengths) is estimated.
3) Find neighbors sequentially that minimize the total length of the tree.
• does not assume a constant rate
• generates an unrooted tree
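The first neighbor-joining step can be sketched with the standard Q criterion, Q(i,j) = (n-2)·d(i,j) - Σ_k d(i,k) - Σ_k d(j,k), which picks the pair whose joining minimizes the estimated total tree length. The slide does not spell out this formula, so treat it as the usual textbook formulation rather than the lecturer's exact procedure. The matrix is the same five-sequence example.

```python
names = ["A", "B", "C", "D", "E"]
upper_rows = {
    "A": [0.53, 0.99, 1.02, 0.82],
    "B": [0.80, 0.93, 0.73],
    "C": [0.65, 0.81],
    "D": [0.94],
}
dist = {}
for i, a in enumerate(names[:-1]):
    for b, d in zip(names[i + 1:], upper_rows[a]):
        dist[frozenset((a, b))] = d


def nj_first_pair(names, dist):
    """Return (Q value, taxon1, taxon2) for the pair with the smallest Q."""
    n = len(names)
    # Row sums: total distance from each taxon to all others.
    total = {
        a: sum(dist[frozenset((a, b))] for b in names if b != a) for a in names
    }
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            q = (n - 2) * dist[frozenset((a, b))] - total[a] - total[b]
            if best is None or q < best[0]:
                best = (q, a, b)
    return best


q, first, second = nj_first_pair(names, dist)
print("first neighbors:", first, second, "Q =", round(q, 2))
```

On this matrix the Q criterion joins C and D first, even though A and B have the smallest raw distance. This is how NJ avoids assuming a constant rate.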
25
Main strength of distance approach is speed of computation relative to other methods
26
Shortcomings ... Summarizing a set of sequences by a pairwise distance matrix loses information. (For example, we can no longer trace the evolution of individual sites or categories of sites on a tree once the data have been converted to distances.) Branch lengths estimated by some methods may not be biologically interpretable. (For example, we could get a distance of 0.6 substitutions - what is 0.6 of a substitution? We can also get estimates that are biologically impossible, like tree distances that are less than the pairwise distances observed between sequences.)
27
Maximum Likelihood
28
Given two explanations for a particular outcome which should we choose?
The explanation that makes the observed outcome more likely... More formally, given some data D and a hypothesis H, the likelihood of those data is given by L_D = Pr(D | H), which is the probability of D given H.
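A small worked instance of L_D = Pr(D | H). Here D is a pair of aligned sequences that differ at some of their sites, and H is a candidate evolutionary distance d under the Jukes-Cantor (JC69) model. JC69 and the numbers (100 sites, 20 differences) are assumptions chosen for illustration and do not come from the slides.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """log Pr(D | H) for two sequences under JC69.

    d       : hypothesised distance (expected substitutions per site)
    n_sites : alignment length
    n_diff  : number of sites at which the two sequences differ
    """
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))   # Pr(a site differs)
    same = (1.0 - p) / 4.0        # Pr(a specific identical pair, e.g. A/A)
    diff = p / 12.0               # Pr(a specific differing pair, e.g. A/G)
    return (n_sites - n_diff) * math.log(same) + n_diff * math.log(diff)


n_sites, n_diff = 100, 20         # hypothetical data
grid = [i / 1000 for i in range(1, 2001)]
best_d = max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))

# Closed-form JC69 maximum-likelihood distance, for comparison.
closed_form = -0.75 * math.log(1.0 - (4.0 / 3.0) * n_diff / n_sites)
print(best_d, round(closed_form, 4))
```

The hypothesis that makes the observed data most probable on the grid agrees with the closed-form JC69 distance, which is the sense in which maximum likelihood chooses between explanations.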
29
Likelihood function
30
Basic evolutionary models
Topology and branch length (Taxa 1, Taxa 2, Taxa 3, Taxa 4, Taxa 5)
Substitution matrix: rTC (= rCT), rTA (= rAT), rTG (= rGT), rCA (= rAC), rCG (= rGC), rAG (= rGA)
Stationary base frequencies: fT, fC, fA, fG
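The six symmetric exchange rates and four stationary frequencies combine into an instantaneous rate matrix. A sketch under the usual general time-reversible (GTR) parameterization, q_ij = r_ij · f_j for i ≠ j, with each diagonal entry set so the row sums to zero. The numeric values are hypothetical and the slide does not name the parameterization, so that choice is an assumption.

```python
BASES = "TCAG"


def gtr_rate_matrix(rates, freqs):
    """Build Q as a dict of dicts from exchange rates and base frequencies.

    rates : dict mapping frozenset({i, j}) -> r_ij (= r_ji)
    freqs : dict mapping base -> stationary frequency (should sum to 1)
    """
    q = {i: {} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                q[i][j] = rates[frozenset((i, j))] * freqs[j]
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)   # rows sum to zero
    return q


# Hypothetical parameter values, for illustration only.
rates = {
    frozenset("TC"): 4.0, frozenset("TA"): 1.0, frozenset("TG"): 1.0,
    frozenset("CA"): 1.0, frozenset("CG"): 1.0, frozenset("AG"): 4.0,
}
freqs = {"T": 0.3, "C": 0.2, "A": 0.3, "G": 0.2}

Q = gtr_rate_matrix(rates, freqs)
for i in BASES:
    print(i, [round(Q[i][j], 2) for j in BASES])
```

Because r_ij = r_ji, the matrix satisfies f_i · q_ij = f_j · q_ji, which is the time-reversibility condition.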
31
Markov chain
A Markov chain is a model in which changes in state follow transition probabilities.
It is a stochastic system, i.e. a random process.
The probability of the next state depends on the current state, although a chain can also be given memory.
The probability of moving to another state follows a probability distribution.
The chain can also stay in the same locality, where locality may be in space or time.
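A minimal sketch of such a chain for a single nucleotide site. The transition probabilities (0.91 to stay, 0.03 to move to each other base) are made-up values for illustration.

```python
import random

STATES = "ACGT"
# Hypothetical transition probabilities: each row sums to 1.
TRANSITIONS = {
    s: {t: (0.91 if s == t else 0.03) for t in STATES} for s in STATES
}


def simulate_chain(start, n_steps, rng):
    """Simulate a first-order Markov chain; the next state depends only
    on the current state."""
    path = [start]
    for _ in range(n_steps):
        probs = TRANSITIONS[path[-1]]
        nxt = rng.choices(list(probs), weights=list(probs.values()))[0]
        path.append(nxt)
    return path


rng = random.Random(1)            # fixed seed so the run is repeatable
print("".join(simulate_chain("A", 50, rng)))
```

Most steps stay in the same state, which matches the remark that the chain can remain in the same locality.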
32
Markov chain Monte Carlo
Monte Carlo: town in Monaco famous for its casino (including the European Poker Tour and World Backgammon Championship). Relevance? Both operate on random processes.
33
MCMC Robot
34
Putting confidence limits on phylogenies
35
Most scientific measures are accompanied by some estimate of precision.
For example cms
Phylogenies should also be accompanied by some indication of confidence limits.
One reason for a poor estimate is sampling error.
As a consequence, estimates of phylogeny based on samples will be accompanied by error.
36
The effect of sampling error can be seen by comparing the trees that result for different genes in the mitochondrion. Phylogenies for the same 6 mammals based on 15 different mitochondrial genes
37
Taxon sampling. Just as we may have limited samples of DNA, so we may also have limited samples of taxa. In the mammalian example, only 6 taxa were used - a small fraction of the total number of extant mammals. Inferred relationships among species can change if additional sequences are added. For clades with a good fossil record, morphological data may prove superior to molecular data if the evidence from extinct taxa is crucial to recovering the phylogeny.
38
Estimating the sampling error using the non-parametric bootstrap
A good way to measure sampling error is to take multiple samples from the population being studied and compare the estimates from the different samples. The spread of the estimates gives an indication of the sampling error, i.e. how much our conclusions would vary depending on the samples we took.
39
Bootstrap contd... The non-parametric bootstrap invokes the same underlying principle, but rather than re-sample replicates from the population we re-sample pseudoreplicates from the data. For each pseudoreplicate we derive an estimate of the parameter we are trying to measure (like mean height of population). The variation among the estimates derived from each pseudoreplicate provides a measure of the sampling error.
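The idea can be sketched for the mean-height example. The height values below are invented for illustration. The spread of the pseudoreplicate means is used as the estimate of sampling error.

```python
import random
import statistics

# Hypothetical sample of heights (cm), for illustration only.
heights = [162, 175, 168, 181, 159, 170, 177, 165, 172, 169]


def bootstrap_means(data, n_reps, rng):
    """Resample the data with replacement n_reps times and return the
    mean of each pseudoreplicate."""
    n = len(data)
    return [statistics.mean(rng.choices(data, k=n)) for _ in range(n_reps)]


rng = random.Random(42)
means = bootstrap_means(heights, 1000, rng)

boot_se = statistics.stdev(means)                       # bootstrap estimate
formula_se = statistics.stdev(heights) / len(heights) ** 0.5   # sd / sqrt(n)
print(round(boot_se, 2), round(formula_se, 2))
```

For a simple statistic like the mean, the bootstrap spread can be compared with the textbook standard error sd/√n. For a phylogeny no such formula exists, which is why resampling is attractive.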
40
Comparison of methods for estimating sampling error of the estimate of population mean
41
The bootstrap can be applied to phylogenies
Used to give us a “feel” for how good our inferred tree is.
Procedure: Generate pseudoreplicates from the sequences by sampling columns at random with replacement until we have a new (fake) data set with the same number of sites as the original.
[Figure: original data with columns gggg, gttt, cccc, ctct, ttcc, agag, gaaa, tcct, and pseudoreplicate 1 showing how many times each sampled column was drawn]
42
Bootstrap contd.
Estimate the phylogeny for the pseudoreplicate data set.
[Figure: pseudoreplicate 1]
Repeat the process of generating pseudoreplicates and estimating their trees a large number of times (100-10,000 times).
This set of trees contains information on the sampling error.
43
Because sampling is with replacement, some sites may occur more than once in the pseudoreplicate while others may not be represented at all.
[Figure: original data and pseudoreplicate 1, with the number of times each column was drawn]
Each pseudoreplicate resembles the original data in that it contains ONLY sites found in that data set, but differs in the frequency of the sites represented.
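Column resampling can be sketched as below. The eight four-taxon columns are read off the slide's example figure (gggg, gttt, cccc, ctct, ttcc, agag, gaaa, tcct). That reading of the extracted figure text is an assumption, and the function name is illustrative.

```python
import random
from collections import Counter

# Alignment stored column-wise: one string per site, one character per taxon.
columns = ["gggg", "gttt", "cccc", "ctct", "ttcc", "agag", "gaaa", "tcct"]


def bootstrap_columns(columns, rng):
    """Draw len(columns) sites at random with replacement."""
    n = len(columns)
    return [columns[rng.randrange(n)] for _ in range(n)]


rng = random.Random(7)
pseudo = bootstrap_columns(columns, rng)

# How often each original column was drawn (columns not drawn are absent).
print(Counter(pseudo))

# Convert back to one sequence per taxon for tree estimation.
n_taxa = len(columns[0])
sequences = ["".join(col[i] for col in pseudo) for i in range(n_taxa)]
print(sequences)
```

Each pseudoreplicate would then be passed to the same tree-building method used for the original data. That step is outside this sketch.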
44
For 100 bootstrap replicates of a hominoid mtDNA data set, three topologies are obtained
The split {orang, gibbon} {human, chimp, gorilla} occurs in all 100 bootstrap replicates (i.e. has 100% bootstrap support). However, there is a conflict between the relationships among the African apes {human, chimp, gorilla}. This suggests these data lack the information to discriminate among the three hypotheses of relationship.
45
Bootstrap consensus
For small numbers of taxa, it is feasible to show the kinds of trees resulting from bootstrap replications. For larger numbers of taxa we merely use a consensus tree to summarize the information collectively from each replicate.
[Figure: consensus tree of taxa O, A, B, C, D, E with bootstrap support values 55, 85, 100, 100]
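Bootstrap support for a group is the percentage of replicate trees that contain it, and a majority-rule consensus keeps the groups found in more than half of the replicates. A sketch in which each replicate tree is represented simply as the set of groups it contains. The ten replicates below are invented for illustration and are not results from the slides.

```python
from collections import Counter


def clade_support(replicate_trees):
    """Percentage of replicates containing each group.

    replicate_trees : list of sets of frozensets (one set of groups per tree)
    """
    counts = Counter()
    for groups in replicate_trees:
        counts.update(groups)
    n = len(replicate_trees)
    return {group: 100.0 * c / n for group, c in counts.items()}


apes = frozenset({"human", "chimp", "gorilla"})
hc = frozenset({"human", "chimp"})
cg = frozenset({"chimp", "gorilla"})
hg = frozenset({"human", "gorilla"})

# Hypothetical replicates: three competing resolutions of the African apes.
replicates = [{apes, hc}] * 6 + [{apes, cg}] * 3 + [{apes, hg}] * 1

support = clade_support(replicates)
consensus = {group for group, pct in support.items() if pct > 50.0}

for group, pct in sorted(support.items(), key=lambda kv: -kv[1]):
    print(sorted(group), pct)
```

Only groups above the 50% threshold appear in the consensus. Conflicting minority groups are dropped, which is why support values are reported on the retained branches.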
46
Parametric bootstrap Cross between a simulation and a bootstrap.
Would like to know if the tree resulting from an analysis is erroneous due to an intrinsic bias in the data. Involves generating artificial data sets with a computer, but using the tree that was generated from the initial data with a particular model in mind.
47
Parametric bootstrap contd.
Example: 18S rRNA subjected to parsimony analysis suggests birds and mammals are each other's closest living relatives. Morphology and fossil data suggest this is incorrect. (Birds and crocodiles are sister taxa with morphology.) A possible reason for the conflict is long branch attraction between birds and mammals. Would like to know: If the traditional tree (bird-croc) is indeed correct, what are the chances that we could mistakenly conclude that the bird-mammal tree was best for these data? To test this, Huelsenbeck et al 1996 simulated the evolution of 18S on the 3 possible trees for 4 taxa to see how often the various tree-building methods recovered the correct tree.
48
Huelsenbeck et al 1996
[Table: frequency of each tree recovered, by assumed tree]
Estimated branch lengths and ts/tv ratio from the original data when applied to each of the possible trees for 4 taxa. Used the parameters they obtained to generate 1000 artificial data sets of the same size as the original. Results show that no matter which topology the sequences were evolved on, tree 1 was recovered 85% of the time. So even if tree 3 is correct as suggested by morphology, the 18S data would likely support tree 1.
49
Popular software used for phylogenetics and their limits
50
PAUP PHYLIP RAxML PhyML TreeFinder MrBayes BEAST BayesPhylogenetics
51
MEGA
52
Summary of sources of error in phylogenetic reconstruction
Sampling error
Incorrect model of sequence evolution
Tree structure. In some cases the evolutionary history itself can conspire to thwart our efforts to recover it. Rapid cladogenesis, widely different rates of divergence, and extinction all compromise the ability to reconstruct evolutionary trees.
53
Sampling errors Stochastic errors
Often a problem with small sample sizes, e.g. the standard error of the population mean decreases with sample size: se = sd / √n
54
Systematic errors
Systematic errors are caused by incorrect modeling.
They lead to inconsistency, which means that adding more data converges on the wrong answer. They usually result from under-parameterized models.
55
An Illustration of the General Properties of Model Selection (Pybus OG, 2006)
(A) A hypothetical dataset consisting of thirteen points plotted on two axes. (B) A simple model, represented by a straight line through the points. (C) A very complex model, which fits the data almost perfectly but has too many parameters. (D) A model with an intermediate number of parameters represented by a curve. This fits the data well but still has relatively few parameters and therefore has greater explanatory power.
56
Systematic errors: analytical factor
Long-branch attraction Non-stationarity Heterotachy Among site rate variation etc.
57
Basic evolutionary models
Topology and branch length (Taxa 1, Taxa 2, Taxa 3, Taxa 4, Taxa 5)
Substitution matrix: rTC (= rCT), rTA (= rAT), rTG (= rGT), rCA (= rAC), rCG (= rGC), rAG (= rGA)
Stationary base frequencies: fT, fC, fA, fG
58
Assumptions in basic models
The evolution of characters follows a Markov model with Poisson distribution, but some evidence suggests that an overdispersed point process fits the data better. Each site evolves independently and according to the identical process, the so-called “i.i.d.” process. The molecular clock assumption describes the evolutionary rate as constant along the evolutionary process.
59
i.i.d. assumption Each site evolves independently and according to the identical process, so called “i.i.d.” process.
60
Assumptions in basic models
Stationarity and time reversibility. Stationarity and time reversibility assure that the expected frequencies of the nucleotides or amino acids are constant along the evolutionary pathway. The conditional probabilities of nucleotide substitution are the same for all sites and do not change over time or among lineages. Q: Are these assumptions reasonable?
61
Weight differently for stem and loop sites
INDEPENDENCE? We assume that change at one site has no effect on other sites. Frequently violated, e.g. ribosomal RNA: a substitution in a stem region can result in a pair of nucleotides that cannot “Watson-Crick pair” correctly, reducing stability of the structure. Often we find that single changes are accompanied by compensatory changes. This clearly violates the independence assumption. Weight differently for stem and loop sites.
62
Variation in rates of substitution among sites?
All of the methods presented assume that each site in a sequence is equally likely to undergo substitution. If rates of substitution vary, this can have considerable influence on sequence divergence (i.e. how much change we estimate to have occurred). Consider the case where some sites are free to vary while others are constrained to be invariant.
63
If a large proportion of sites are not free to vary then, paradoxically, sequences that evolve at a fast rate can appear to show less sequence divergence than more slowly evolving sequences that have fewer constraints. (A) rate of substitution 0.5%/Myr, 80% of sites free to vary; (B) rate of substitution 2%/Myr, 50% of sites free to vary.
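The paradox can be reproduced numerically. This sketch assumes that the quoted rates apply per free site along each lineage, that free sites follow a Jukes-Cantor process, and that constrained sites never change. All three are simplifying assumptions not stated on the slide.

```python
import math


def observed_divergence(rate_per_myr, free_fraction, t_myr):
    """Expected proportion of differing sites between two lineages that
    split t_myr million years ago.

    rate_per_myr  : substitutions per free site per Myr (per lineage)
    free_fraction : fraction of sites free to vary
    """
    d = 2.0 * rate_per_myr * t_myr                  # both lineages accumulate change
    p_free = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))    # JC69 saturation curve
    return free_fraction * p_free


seq_a = (0.005, 0.8)   # (A) 0.5%/Myr, 80% of sites free to vary
seq_b = (0.02, 0.5)    # (B) 2%/Myr, 50% of sites free to vary

for t in (10, 50, 100, 500):
    a = observed_divergence(*seq_a, t)
    b = observed_divergence(*seq_b, t)
    print(t, round(a, 3), round(b, 3))
```

Under these assumptions the faster sequence (B) is more diverged at recent split times, but it saturates at 0.75 × 50% while (A) can reach 0.75 × 80%, so at deep split times (A) appears more diverged.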
64
In reality, sites show a range of substitution rates
Challenge is to develop a tractable model of the rate variation. The most widely used approach uses the “gamma distribution”. The gamma distribution has a shape parameter α that specifies the range of rate variation among sites: small values of α result in an L-shaped distribution; larger values give a smaller range of rates; when α > 1 the distribution is “bell shaped”.
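The effect of the shape parameter can be seen by evaluating the density directly. This sketch assumes the common convention of a gamma distribution with mean 1 (shape α, rate α), so that rates are relative to the average. The slide does not state the scaling.

```python
import math


def gamma_rate_density(r, alpha):
    """Density of a mean-1 gamma distribution (shape alpha, rate alpha)
    at relative rate r > 0."""
    return (alpha ** alpha / math.gamma(alpha)) * r ** (alpha - 1) * math.exp(-alpha * r)


for alpha in (0.5, 1.0, 5.0):
    row = [round(gamma_rate_density(r, alpha), 3) for r in (0.1, 0.5, 1.0, 2.0)]
    print("alpha =", alpha, row)
```

With α = 0.5 the density falls steadily from very slow rates (L-shaped: most sites nearly invariant, a few fast). With α = 5 it peaks near the average rate and the range of rates is much narrower.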
65
those from 3rd codon positions
Estimates of α from nuclear and mitochondrial genes vary between 0.16 (12S rRNA) (prolactin). Note: values of α from first and 2nd codon positions tend to be smaller than those from 3rd codon positions.
66
Can modify models of evolutionary change to include the gamma distribution - typically represented by the symbol Γ, e.g. HKY + Γ
67
Base Composition Equilibrium?
Assumes that base composition is roughly the same over the collection of sequences. Deviations from this assumption occur commonly and often lead to misleading inferences. When constructing trees there is a tendency to cluster sequences together that have similar base compositional profiles. Remedy: explicitly modeling the non-stationary process.
68
Compositional bias (non-stationary)
“Compositional bias can result in the artefactual grouping of species with similar nucleotide composition, because most methods assume the homogeneity of the substitution process and the constancy of sequence composition (stationarity) through time” (Delsuc et al. 2005).
[Figure: two trees of taxa A, B, C, D with compositions A 50%, B 70%, C 70%, D 50%]
69
Heterotachy Heterotachy is the variation of evolutionary rate of a given position of a molecule through time. The diagram on the right is a simple scenario used by Kolaczkowski and Thornton (2004). From Steel, 2005
70
Long branch attraction
“Intuitively, with long branches leading to species A and C, the probability of parallel changes that arrive at the same state becomes greater than the probability of an informative single change in the interior branch of the tree” (Felsenstein, 2004).
[Figure: two tree diagrams, with labels B and D]
71
Phylogenomics Prediction of gene function (Eisen, 1998)
Establishment of evolutionary relationships using genome or genome-scale data
72
One gene or more genes? A single gene or a few genes often result in low resolution. A single gene or a few genes may even lead to the wrong phylogeny.
73
[Figure: Gene A + Gene B + Gene C + …, showing phylogenetic signal and systematic error]
75
How many genes are needed? The figure shows that resolving different nodes may need different numbers of genes. Few nodes can be resolved by a single gene or a few genes. Most nodes need 5 to 10 thousand amino acids (15-30 genes) to be resolved. A few nodes cannot be resolved even with many genes. 2,5000 nucleotides are needed for resolution of the avian tree (Edwards et al., 2005). From Delsuc et al. 2005
76
How to analyze multilocus data?
Remember i.i.d.?
77
Partitioned analysis guided by cluster analysis and phylogeny of ray-finned fish
78
??? Assumption of i.i.d. Topology and branch length
Substitution matrix: rTC (= rCT), rTA (= rAT), rTG (= rGT), rCA (= rAC), rCG (= rGC), rAG (= rGA)
Stationary base frequencies: fT, fC, fA, fG
Taxa 1, Taxa 2, Taxa 3, Taxa 4, Taxa 5
79
Partitioning by genes and codons
Concatenated sequence By genes G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 By codon positions 1st 2nd 3rd
80
By both genes and codon positions
[Figure: each gene split into 1st, 2nd and 3rd codon positions]
Is partitioning by both genes and codon positions over-parameterized?
81
Data
Ten nuclear genes: zic1, myh6, RYR3, Ptc, tbr1, ENC1, Glyt, SH3PX3, plagl2 and sreb2.
56 taxa representing 41 of the 44 orders of ray-finned fish and 4 outgroups.
8025 nucleotides.
82
Data
83
Clustering of blocks based on genes and codons
5 parts 2 parts
84
Δi (= AICi − AICbest) and Bayes likelihood for partitioning based on grouping blocks
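The Δi values are differences between each scheme's AIC and the best (lowest) AIC, where AIC = 2k − 2 ln L for a model with k free parameters. A sketch with invented log-likelihoods and parameter counts. None of the numbers or scheme names below are results from the study.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical (ln L, number of free parameters) for three partitioning schemes.
schemes = {
    "unpartitioned": (-52000.0, 10),
    "by codon position": (-50500.0, 30),
    "by gene and codon position": (-50450.0, 300),
}

aic_values = {name: aic(lnl, k) for name, (lnl, k) in schemes.items()}
best = min(aic_values.values())
delta = {name: value - best for name, value in aic_values.items()}

for name in schemes:
    print(name, aic_values[name], delta[name])
```

In this made-up case the most heavily partitioned scheme has the highest likelihood but not the lowest AIC, because the penalty for its extra parameters outweighs the gain in fit.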
85
Conclusion Partitioning by both genes and codon positions is over-parameterized. Cluster analysis helps in reducing the number of partitions. Li et al., 2008, Syst. Biol. 57(4):
86
Lanfear et al., 2012, MBE, cited by: 483
87
Gene tree vs. species tree
A paradigm shift (Scott Edwards)
88
Gene duplication and loss
Li et al., 2007 BMC Evolutionary Biology
89
Horizontal Gene Transfer
Cordero et al., 2009 PNAS
90
Incomplete lineage sorting (ILS)
92
Changing probabilities of reciprocal mono-, para- and polyphyly and
93
‘Deep coalescence’ = discordance between gene and species tree
94
Deep coalescence vs. branch length heterogeneity
Edwards Evolution 63:1-19
95
Discordance is a function of t/Ne in the internode
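For the simplest case this dependence has a closed form. With three species, one sampled lineage per species, and an internal branch lasting t generations in a diploid population of constant effective size Ne, the probability that the gene tree disagrees with the species tree is (2/3)·exp(−t / (2·Ne)). This is the standard coalescent result and is not taken from the slide, so the setting and the parameter values below are assumptions.

```python
import math


def prob_discordance(t_generations, ne):
    """Probability that a three-taxon gene tree differs from the species tree.

    t_generations : length of the internal branch (internode), in generations
    ne            : diploid effective population size on that branch
    """
    return (2.0 / 3.0) * math.exp(-t_generations / (2.0 * ne))


ne = 10_000                      # hypothetical effective population size
for t in (0, 5_000, 20_000, 100_000):
    print(t, round(prob_discordance(t, ne), 4))
```

Short internodes or large populations push the probability toward 2/3, where the three gene tree topologies are equally likely. Long internodes or small populations push it toward zero.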
96
Gene divergence substantially predates population divergence
[Figure: gene divergence (D/2) and population divergence (MLE and Bayesian estimates) plotted as divergence time (MYA, 0.00 to 3.00, spanning the Pliocene and Pleistocene) for two comparisons: cincta vs. (acuticauda, hecki) and acuticauda vs. hecki]
97
Hierarchical nature of phylogeny
Liu, Yu, Kubatko, Pearl and Edwards Mol. Phyl. Evol. 53:
99
BEAST is a cross-platform program for Bayesian MCMC analysis of molecular sequences. It can be used to reconstruct species trees.