Class 9: Phylogenetic Trees. The Tree of Life D’après Ernst Haeckel, 1891.


1 Class 9: Phylogenetic Trees

2 The Tree of Life D’après Ernst Haeckel, 1891

3 Evolution
- Many theories of evolution
- Basic idea:
  - Speciation events lead to the creation of different species
  - Speciation is caused by physical separation into groups in which different genetic variants become dominant
- Any two species share a (possibly distant) common ancestor

4 Phylogenies
- A phylogeny is a tree that describes the sequence of speciation events that led to the formation of a set of current-day species
- Leaves: current-day species
- Nodes: hypothetical most recent common ancestors
- Edge lengths: “time” from one speciation to the next
(Figure: example tree with leaves Aardvark, Bison, Chimp, Dog, Elephant)

5 Phylogenetic Tree
- Topology: bifurcating
  - Leaves: 1…N
  - Internal nodes: N+1…2N-2
(Figure labels: leaf, branch, internal node)

6 Example: Primate evolution
(Figure: primate tree with divergence times of 40-45 mya, 35-37 mya, and 20-25 mya)

7 How to construct a Phylogeny?
- Until the mid-1950s, phylogenies were constructed by experts based on their opinion (subjective criteria)
- Since then, the focus has been on objective criteria for constructing phylogenetic trees
  - Thousands of articles in the last decades
- Important for many aspects of biology
  - Classification (systematics)
  - Understanding biological mechanisms

8 Morphological vs. Molecular
- Classical phylogenetic analysis: morphological features
  - Number of legs, lengths of legs, etc.
- Modern biological methods allow the use of molecular features
  - Gene sequences
  - Protein sequences
- Analysis is based on homologous sequences (e.g., globins) in different species

9 Dangers in Molecular Phylogenies
- We have to remember that gene or protein sequences can be homologous for different reasons:
  - Orthologs: sequences diverged after a speciation event
  - Paralogs: sequences diverged after a duplication event
  - Xenologs: sequences diverged after a horizontal transfer (e.g., by a virus)

10 Dangers of Paralogs
(Figure: gene tree with a gene duplication and speciation events; leaves 1A, 2A, 3A, 3B, 2B, 1B)

11 Dangers of Paralogs
(Figure: gene tree with a gene duplication and speciation events; leaves 1A, 2A, 3A, 3B, 2B, 1B)
- If we only consider 1A, 2B, and 3A...

12 Types of Trees
- A natural model to consider is that of rooted trees
(Figure label: Common Ancestor)

13 Types of Trees
- Depending on the model, data from current-day species does not distinguish between different placements of the root
(Figure: two alternative root placements, side by side)

14 Types of Trees
- An unrooted tree represents the same phylogeny without the root node

15 Positioning Roots in Unrooted Trees
- We can estimate the position of the root by introducing an outgroup:
  - A set of species that are definitely distant from all the species of interest
(Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant, outgroup Falcon, and the proposed root)

16 Types of Data
- Distance-based
  - Input is a matrix of distances between species
  - A distance can be the fraction of residues on which two sequences disagree, or the negated alignment score between them, or …
- Character-based
  - Examine each character (e.g., residue) separately
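As a minimal sketch of the first distance mentioned above (the fraction of residues on which two aligned sequences disagree), the following computes a pairwise distance matrix. The function names `p_distance` and `distance_matrix` are illustrative, and the toy alignment reuses the five six-letter sequences from the parsimony example later in these slides.

```python
from itertools import combinations


def p_distance(s, t):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t)) / len(s)


def distance_matrix(seqs):
    """Return {(name1, name2): distance} for every unordered pair of species."""
    return {
        (u, v): p_distance(seqs[u], seqs[v])
        for u, v in combinations(sorted(seqs), 2)
    }


# Toy alignment: species label -> aligned sequence.
seqs = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA", "D": "TGCACT", "E": "TGCGTA"}
dist = distance_matrix(seqs)
```

This raw fraction ignores multiple substitutions at the same site, so it tends to underestimate the true evolutionary distance for divergent sequences.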

17 Simple Distance-Based Method
Input: distance matrix between species
Outline:
- Cluster species together
- Initially, clusters are singletons
- At each iteration, combine the two “closest” clusters to get a new one

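18 UPGMA Clustering
- Let C_i and C_j be clusters; define the distance between them to be the average distance between their members:
  $d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$
- When combining two clusters C_i and C_j to form a new cluster C_k, then for any other cluster C_l:
  $d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$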
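The clustering loop can be sketched as follows. This is an illustrative implementation, not the lecture's own code: distances are stored in a dict keyed by `frozenset` pairs, clusters are represented as nested tuples, and the size-weighted update keeps each cluster distance equal to the average leaf-to-leaf distance. The example distances are made up and satisfy the molecular clock.

```python
from itertools import combinations


def upgma(names, d):
    """Cluster with UPGMA.

    names: list of leaf names.
    d: dict mapping frozenset({x, y}) -> distance for every pair of leaves.
    Returns (tree, height): tree is a nested tuple of leaf names and height
    is the height of the root (half the distance between the last two clusters).
    """
    d = dict(d)                          # work on a copy
    size = {n: 1 for n in names}         # cluster -> number of leaves in it
    height = {n: 0.0 for n in names}
    active = list(names)
    while len(active) > 1:
        # Pick the two closest clusters.
        ci, cj = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        ck = (ci, cj)
        height[ck] = d[frozenset((ci, cj))] / 2
        size[ck] = size[ci] + size[cj]
        active = [c for c in active if c not in (ci, cj)]
        # Size-weighted average of the distances from the two merged clusters.
        for cl in active:
            d[frozenset((ck, cl))] = (
                size[ci] * d[frozenset((ci, cl))]
                + size[cj] * d[frozenset((cj, cl))]
            ) / size[ck]
        active.append(ck)
    root = active[0]
    return root, height[root]


# Made-up clock-like distances: A,B are closest, then C, then D.
names = ["A", "B", "C", "D"]
pairs = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
         ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6}
d = {frozenset(p): v for p, v in pairs.items()}
tree, root_height = upgma(names, d)
```

The pair search is quadratic in the number of active clusters at every step, which is fine for a sketch but slower than implementations that maintain a priority queue.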

19 Molecular Clock
- UPGMA implicitly assumes that all distances measure time in the same way
(Figure: two trees over leaves 1, 2, 3, 4)

20 Additivity
- A weaker requirement is additivity
  - In the “real” tree, distances between species are the sum of distances between intermediate nodes
(Figure: tree with nodes i, j, k and branch lengths a, b, c)

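21 Consequences of Additivity
- Suppose input distances are additive
- For any three leaves i, j, m that meet at an internal node k:
  $d(i,m) = d(i,k) + d(k,m)$, $d(j,m) = d(j,k) + d(k,m)$, $d(i,j) = d(i,k) + d(k,j)$
- Thus
  $d(k,m) = \frac{1}{2}\left(d(i,m) + d(j,m) - d(i,j)\right)$
(Figure: leaves i, j, m joined at internal node k, with branch lengths a, b, c)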

22 Neighbor Joining
- Can we use this fact to construct trees?
- Let
  $D(i,j) = d(i,j) - (r_i + r_j)$, where $r_i = \frac{1}{|L| - 2} \sum_{k \in L} d(i,k)$
Theorem: if D(i,j) is minimal (among all pairs of leaves), then i and j are neighbors in the tree

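23 Neighbor Joining
- Set L to contain all leaves
Iteration:
- Choose i, j such that D(i,j) is minimal
- Create a new node k, and set
  $d(k,m) = \frac{1}{2}\left(d(i,m) + d(j,m) - d(i,j)\right)$ for every other m in L
  $d(i,k) = \frac{1}{2}\left(d(i,j) + r_i - r_j\right)$, $d(j,k) = d(i,j) - d(i,k)$, with $r_i = \frac{1}{|L| - 2} \sum_{m \in L} d(i,m)$
- Remove i, j from L, and add k
Terminate: when |L| = 2, connect the two remaining nodes
(Figure: leaves i and j joined at the new node k, with another node m)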
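A minimal sketch of the neighbor-joining iteration follows, assuming the standard definitions r_i = (1/(|L|-2)) Σ_k d(i,k) and D(i,j) = d(i,j) - (r_i + r_j). The data layout (a dict keyed by `frozenset` pairs), the internal node names `n0`, `n1`, … and the example distances are illustrative choices. The example distances are additive on a tree with branch lengths A:1, B:2, C:4, D:5 and an internal branch of length 3.

```python
from itertools import combinations


def neighbor_joining(names, d):
    """names: leaf names; d: dict frozenset({x, y}) -> distance.

    Returns the unrooted tree as a list of edges (node, node, length).
    Internal nodes are named n0, n1, ... (assumed not to clash with leaf names).
    """
    d = dict(d)
    L = list(names)
    edges = []
    next_id = 0

    def dist(x, y):
        return d[frozenset((x, y))]

    while len(L) > 2:
        # r_i: summed distance from i to the other nodes, normalised by |L| - 2.
        r = {i: sum(dist(i, k) for k in L if k != i) / (len(L) - 2) for i in L}
        # Choose the pair minimising D(i,j) = d(i,j) - (r_i + r_j).
        i, j = min(combinations(L, 2),
                   key=lambda p: dist(p[0], p[1]) - r[p[0]] - r[p[1]])
        k = f"n{next_id}"
        next_id += 1
        d_ik = 0.5 * (dist(i, j) + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, dist(i, j) - d_ik))
        # Distances from the new node to every remaining node.
        for m in L:
            if m not in (i, j):
                d[frozenset((k, m))] = 0.5 * (dist(i, m) + dist(j, m) - dist(i, j))
        L = [m for m in L if m not in (i, j)] + [k]
    edges.append((L[0], L[1], dist(L[0], L[1])))
    return edges


names = ["A", "B", "C", "D"]
pairs = {("A", "B"): 3, ("A", "C"): 8, ("A", "D"): 9,
         ("B", "C"): 9, ("B", "D"): 10, ("C", "D"): 9}
d = {frozenset(p): v for p, v in pairs.items()}
edges = neighbor_joining(names, d)
```

On additive input the recovered branch lengths match the generating tree, even though these distances do not satisfy a molecular clock.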

24 Distance Based Methods
- If we make strong assumptions on distances, we can reconstruct trees
- In real life, distances are not additive
- Sometimes they are close to additive

25 Character Based Methods
- We start with a multiple alignment
- Assumptions:
  - All sequences are homologous
  - Each position in the alignment is homologous
  - Positions evolve independently
  - No gaps
- We seek to explain the evolution of each position in the alignment

26 Parsimony
- Character-based method
- A way to score trees (but not to build trees!)
Assumptions:
- Independence of characters (no interactions)
- The best tree is the one in which the fewest changes take place

27 A Simple Example
- What is the parsimony score of the tree?
(Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant)
  A : CAGGTA
  B : CAGACA
  C : CGGGTA
  D : TGCACT
  E : TGCGTA

28 A Simple Example
- Each column is scored separately.
- Let’s look at the first column:
- The minimal tree has one evolutionary change
(Figure: tree with leaves labeled C, C, C, T, T and a single change between T and C on one branch)
  A : CAGGTA
  B : CAGACA
  C : CGGGTA
  D : TGCACT
  E : TGCGTA

29 Evaluating Parsimony Scores
- How do we compute the parsimony score for a given tree?
- Traditional Parsimony
  - Each base change has a cost of 1
- Weighted Parsimony
  - Each change is weighted by the score c(a,b)

30 Traditional Parsimony
- Solved independently for each position
- Linear time solution
(Figure: tree with leaves a, g, a; internal nodes labeled {a,g} and {a})
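The set labels in the figure ({a,g}, {a}) are what a bottom-up set-intersection pass produces, commonly known as Fitch's algorithm; the slide does not name it, so that attribution is an assumption. The sketch below scores the five sequences of the simple example on a hypothetical rooted topology (((A,B),C),(D,E)), which is not given in the transcript.

```python
def fitch(tree, column):
    """Return (state_set, changes) for one alignment column.

    tree: a leaf name (str) or a pair (left, right) of subtrees.
    column: dict mapping leaf name -> character at this position.
    """
    if isinstance(tree, str):            # leaf: its observed state, no changes
        return {column[tree]}, 0
    left_set, left_cost = fitch(tree[0], column)
    right_set, right_cost = fitch(tree[1], column)
    common = left_set & right_set
    if common:                           # children can agree: no extra change
        return common, left_cost + right_cost
    # Children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Sum the per-column change counts over all alignment positions."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch(tree, {name: s[p] for name, s in seqs.items()})[1]
        for p in range(length)
    )


seqs = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA", "D": "TGCACT", "E": "TGCGTA"}
tree = ((("A", "B"), "C"), ("D", "E"))   # hypothetical topology, for illustration
first_column = {name: s[0] for name, s in seqs.items()}
first_column_changes = fitch(tree, first_column)[1]
total = parsimony_score(tree, seqs)
```

Each node is visited once per column and does a constant amount of set work for a fixed alphabet, which is where the linear-time claim comes from.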

31 Evaluating Weighted Parsimony
Dynamic programming on the tree
S(i,a) = cost of the tree rooted at i if i is labeled by a
Initialization:
- For each leaf i, set S(i,a) = 0 if i is labeled by a, otherwise S(i,a) = ∞
Iteration:
- If k is a node with children i and j, then
  $S(k,a) = \min_b \left(S(i,b) + c(a,b)\right) + \min_b \left(S(j,b) + c(a,b)\right)$
Termination:
- The cost of the tree is $\min_a S(r,a)$, where r is the root
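The recurrence translates almost line for line into a recursive function. This is a sketch under stated assumptions: the alphabet is DNA, the tree is a nested pair structure, the topology (((A,B),C),(D,E)) is hypothetical, and `unit_cost` is just one possible choice of c(a,b), under which the result coincides with traditional parsimony.

```python
import math

ALPHABET = "ACGT"


def sankoff(tree, column, cost):
    """Return {a: S(node, a)} for the root of `tree` at one alignment position.

    tree: a leaf name (str) or a pair (left, right) of subtrees.
    column: dict mapping leaf name -> observed character.
    cost: function c(a, b) giving the cost of a change between a and b.
    """
    if isinstance(tree, str):
        # Initialization: 0 for the observed label, infinity otherwise.
        return {a: 0 if column[tree] == a else math.inf for a in ALPHABET}
    S_i = sankoff(tree[0], column, cost)
    S_j = sankoff(tree[1], column, cost)
    return {
        a: min(S_i[b] + cost(a, b) for b in ALPHABET)
        + min(S_j[b] + cost(a, b) for b in ALPHABET)
        for a in ALPHABET
    }


def unit_cost(a, b):
    """Every change costs 1 (traditional parsimony as a special case)."""
    return 0 if a == b else 1


def weighted_parsimony(tree, seqs, cost):
    """Sum over positions of min_a S(root, a)."""
    length = len(next(iter(seqs.values())))
    total = 0
    for p in range(length):
        column = {name: s[p] for name, s in seqs.items()}
        total += min(sankoff(tree, column, cost).values())
    return total


seqs = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA", "D": "TGCACT", "E": "TGCGTA"}
tree = ((("A", "B"), "C"), ("D", "E"))   # hypothetical topology, for illustration
score = weighted_parsimony(tree, seqs, unit_cost)
```

Swapping in a different `cost` function, for example one that charges more for transversions than for transitions, changes the score without changing the algorithm.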

32 Cost of Evaluating Parsimony
- The score is evaluated on each position independently. Scores are then summed over all positions.
- If there are n nodes, m characters, and k possible values for each character, then the complexity is O(nmk) for traditional parsimony; the weighted recurrence minimizes over k child labels for each of the k parent labels, which gives O(nmk^2)
- By keeping traceback information, we can reconstruct the most parsimonious values at each ancestor node

33 Maximum Parsimony
Position     1 2 3 4 5 6 7 8 9 10
Species 1  - A G G G T A A C T G
Species 2  - A C G A T T A T T A
Species 3  - A T A A T T G T C T
Species 4  - A A T G T T G T C G
How many possible unrooted trees?

34 Maximum Parsimony
How many possible unrooted trees?
Position     1 2 3 4 5 6 7 8 9 10
Species 1  - A G G G T A A C T G
Species 2  - A C G A T T A T T A
Species 3  - A T A A T T G T C T
Species 4  - A A T G T T G T C G
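For reference, the number of distinct unrooted bifurcating trees on n labeled leaves is the double factorial (2n-5)!! = 3 · 5 · … · (2n-5), a standard counting result that the slide itself does not state. A small sketch (the function name is illustrative):

```python
def num_unrooted_trees(n):
    """Number of unrooted bifurcating trees on n >= 3 labeled leaves: (2n-5)!!"""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n - 4, 2):     # multiplies 3 * 5 * ... * (2n - 5)
        count *= k
    return count


trees_for_four_species = num_unrooted_trees(4)
```

For the four species in the example this gives three candidate topologies; the count grows very quickly with n, which is why exhaustive search soon becomes impractical.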

35 Maximum Parsimony How many substitutions? MP

36 Maximum Parsimony
Position  1 2 3 4 5 6 7 8 9 10
1       - A G G G T A A C T G
2       - A C G A T T A T T A
3       - A T A A T T G T C T
4       - A A T G T T G T C G
Substitution counts shown: 0 0 0

37 Maximum Parsimony
Position  1 2 3 4 5 6 7 8 9 10
1       - A G G G T A A C T G
2       - A C G A T T A T T A
3       - A T A A T T G T C T
4       - A A T G T T G T C G
Substitution counts shown: 0 3

38 Maximum Parsimony
Position 2: 1 - G, 2 - C, 3 - T, 4 - A
(Figure: the three candidate trees with ancestral labels for position 2; substitution counts 3, 3, 3)

39 Maximum Parsimony
Position  1 2 3 4 5 6 7 8 9 10
1       - A G G G T A A C T G
2       - A C G A T T A T T A
3       - A T A A T T G T C T
4       - A A T G T T G T C G
Substitution counts shown: 0 3 2

40 Maximum Parsimony
Position  1 2 3 4 5 6 7 8 9 10
1       - A G G G T A A C T G
2       - A C G A T T A T T A
3       - A T A A T T G T C T
4       - A A T G T T G T C G
Substitution counts shown, one row per candidate tree:
          0 3 2 2
          0 3 2 1
          0 3 2 2

41 Maximum Parsimony
Position 4: 1 - G, 2 - A, 3 - A, 4 - G
(Figure: the three candidate trees with ancestral labels for position 4; substitution counts 2, 2, 1)

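42 Maximum Parsimony
Substitution counts per position, one row per candidate tree, with totals:
Position  1 2 3 4 5 6 7 8 9 10  Total
          0 3 2 2 0 1 1 1 1 3   14
          0 3 2 1 0 1 2 1 2 3   15
          0 3 2 2 0 1 2 1 2 3   16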

43 Maximum Parsimony
Position  1 2 3 4 5 6 7 8 9 10
1       - A G G G T A A C T G
2       - A C G A T T A T T A
3       - A T A A T T G T C T
4       - A A T G T T G T C G
Counts    0 3 2 2 0 1 1 1 1 3   Total 14

44 Searching for Trees

45 Searching for the Optimal Tree
- Exhaustive Search
  - Very intensive
- Branch and Bound
  - A compromise
- Heuristic
  - Fast
  - Usually starts with NJ

46 Phylogenetic Tree Assumptions
- Topology: bifurcating
  - Leaves: 1…N
  - Internal nodes: N+1…2N-2
- Lengths t = {t_i}, one for each branch
- Phylogenetic tree = (Topology, Lengths) = (T, t)
(Figure labels: leaf, branch, internal node)

47 Probabilistic Methods
- The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences.
- Background probabilities: q(a)
- Mutation probabilities: P(a|b,t)
- Models for evolutionary mutations
  - Jukes Cantor
  - Kimura 2-parameter model
- Such models are used to derive the probabilities

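48 Jukes Cantor model
- A model for mutation rates
  - Mutation occurs at a constant rate
  - Each nucleotide is equally likely to mutate into any other nucleotide, with rate α
  $R = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}$ (rows and columns ordered A, C, G, T)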

49 Kimura 2-parameter model
- Allows a different rate for transitions and transversions.
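A sketch of the resulting substitution probabilities, using the standard closed form for this model. The parameterization is an assumption, since the slide gives no formulas: `alpha` is the transition rate (A↔G, C↔T) and `beta` is the rate of each of the two possible transversions.

```python
import math

PURINES = {"A", "G"}


def kimura_2p(t, alpha, beta):
    """Return P[b][a] = P(a | b, t) under the Kimura 2-parameter model."""
    e1 = math.exp(-4 * beta * t)
    e2 = math.exp(-2 * (alpha + beta) * t)
    same = 0.25 + 0.25 * e1 + 0.5 * e2
    transition = 0.25 + 0.25 * e1 - 0.5 * e2
    transversion = 0.25 - 0.25 * e1

    def prob(a, b):
        if a == b:
            return same
        # Both purines or both pyrimidines: a transition.
        if (a in PURINES) == (b in PURINES):
            return transition
        return transversion

    return {b: {a: prob(a, b) for a in "ACGT"} for b in "ACGT"}


P = kimura_2p(t=0.1, alpha=2.0, beta=1.0)
```

Setting `alpha == beta` makes all three off-diagonal probabilities equal, which recovers the Jukes Cantor model.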

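50 Mutation Probabilities
- The rate matrix R is used to derive the mutation probability matrix S:
  $S'(t) = S(t)\,R$, $S(0) = I$
- S is obtained by integration. For Jukes Cantor, S(t) has $r_t$ on the diagonal and $s_t$ elsewhere, with
  $r_t = \frac{1}{4}\left(1 + 3e^{-4\alpha t}\right)$, $s_t = \frac{1}{4}\left(1 - e^{-4\alpha t}\right)$
- q can be obtained by setting t to infinity (for Jukes Cantor, q(a) = 1/4 for every nucleotide)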
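As a small numerical check of the Jukes Cantor case, the sketch below builds the 4×4 probability matrix from the standard closed form (diagonal ¼(1 + 3e^{-4αt}), off-diagonal ¼(1 - e^{-4αt})) and looks at its limits. The function name and the nested-dict layout are illustrative.

```python
import math


def jukes_cantor(t, alpha):
    """Return P[b][a] = P(a | b, t) under Jukes Cantor with rate alpha."""
    decay = math.exp(-4 * alpha * t)
    same = 0.25 * (1 + 3 * decay)        # probability of observing the same base
    diff = 0.25 * (1 - decay)            # probability of each specific other base
    return {b: {a: same if a == b else diff for a in "ACGT"} for b in "ACGT"}


P_short = jukes_cantor(t=0.0, alpha=1.0)     # no time elapsed: identity matrix
P_long = jukes_cantor(t=1000.0, alpha=1.0)   # long time: every entry approaches 1/4
```

At t = 0 nothing has changed, and as t grows every row converges to the uniform background distribution, which is the sense in which q is obtained by letting t go to infinity.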

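51 Mutation Probabilities
Both models satisfy the following properties:
- Lack of memory:
  $P(a \mid c, t + t') = \sum_b P(a \mid b, t)\, P(b \mid c, t')$
- Reversibility: there exist stationary probabilities {P_a} such that
  $P_a\, P(b \mid a, t) = P_b\, P(a \mid b, t)$
(Figure: the four nucleotides A, G, T, C)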

52 Probabilistic Approach
- Given P, q, the tree topology, and branch lengths, we can compute the joint probability of the sequences at all nodes
(Figure: tree with nodes x_1, x_2, x_3, x_4, x_5 and branch lengths t_1, t_2, t_3, t_4)

53 Computing the Tree Likelihood
- We are interested in the probability of the observed data given the tree and branch “lengths”
- It is computed by summing over the possible values of the internal nodes
- This can be done efficiently using an upward traversal pass of the tree.

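54 Tree Likelihood Computation
- Define P(L_k | a) = probability of the leaves below node k given that x_k = a
- Init: for leaves, P(L_k | a) = 1 if x_k = a; 0 otherwise
- Iteration: if k is a node with children i and j, then
  $P(L_k \mid a) = \left(\sum_b P(b \mid a, t_i)\, P(L_i \mid b)\right) \left(\sum_c P(c \mid a, t_j)\, P(L_j \mid c)\right)$
- Termination: the likelihood is
  $\sum_a q(a)\, P(L_{root} \mid a)$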
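The upward pass can be sketched for a single alignment column as follows. Several things here are assumptions made for illustration: the Jukes Cantor model is used for P(a|b,t), the background probabilities q are uniform, and a tree is encoded as nested pairs `((left, t_left), (right, t_right))` with leaf names as strings.

```python
import math

ALPHABET = "ACGT"


def jc_prob(a, b, t, alpha=1.0):
    """P(a | b, t) under Jukes Cantor (an assumed substitution model)."""
    decay = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * decay) if a == b else 0.25 * (1 - decay)


def conditional(node, column):
    """Return {a: P(L_k | a)} for the subtree rooted at `node`.

    node: a leaf name, or ((left_subtree, t_left), (right_subtree, t_right)).
    column: dict mapping leaf name -> observed character.
    """
    if isinstance(node, str):
        return {a: 1.0 if column[node] == a else 0.0 for a in ALPHABET}
    (left, t_i), (right, t_j) = node
    P_i = conditional(left, column)
    P_j = conditional(right, column)
    return {
        a: sum(jc_prob(b, a, t_i) * P_i[b] for b in ALPHABET)
        * sum(jc_prob(c, a, t_j) * P_j[c] for c in ALPHABET)
        for a in ALPHABET
    }


def likelihood(root, column, q=0.25):
    """Sum over root labels of q(a) * P(L_root | a), with uniform q."""
    return sum(q * p for p in conditional(root, column).values())


# Three leaves: x and y share a parent, which is joined with z at the root.
tree = (((("x", 0.1), ("y", 0.2)), 0.3), ("z", 0.4))
lik = likelihood(tree, {"x": "A", "y": "A", "z": "G"})
```

For a full alignment, the per-column likelihoods would be multiplied (or their logarithms summed) under the independent-positions assumption.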

55 Maximum Likelihood (ML)
- Score each tree by its log-likelihood, summed over the alignment positions
  - Assumption of independent positions
- Branch lengths t can be optimized
  - Gradient ascent
  - EM
- We look for the highest scoring tree
  - Exhaustive search
  - Sampling methods (Metropolis)

56 Optimal Tree Search
- Perform search over possible topologies
(Figure: for each topology T_1, T_2, T_3, T_4, …, T_n, a likelihood surface over parameter space with local maxima, explored by parametric optimization (EM))

57 Computational Problem
- Such procedures are computationally expensive!
- Computation of optimal parameters, per candidate, requires a non-trivial optimization step.
- We spend non-negligible computation on a candidate, even if it is a low scoring one.
- In practice, such learning procedures can only consider small sets of candidate structures

58 Structural EM
Idea: use the parameters found for the current topology to help evaluate new topologies.
Outline:
- Perform search in (T, t) space.
- Use EM-like iterations:
  - E-step: use the current solution to compute expected sufficient statistics for all topologies
  - M-step: select a new topology based on these expected sufficient statistics

59 The Complete-Data Scenario Suppose we observe H, the ancestral sequences. Define: Find: topology T that maximizes S i,j is a matrix of # of co-occurrences for each pair (a,b) in the taxa i,j F is a linear function of S i,j

60 Expected Likelihood
- Start with a tree (T_0, t_0)
- Compute
Formal justification:
- Define:
Theorem:
Consequence: improvement in expected score ⇒ improvement in likelihood

61 Proof Theorem: u Simple application of Jensen’s inequality

62 Algorithm Outline
Original tree (T_0, t_0)
Unlike standard EM for trees, we compute all possible pairwise statistics
Time: O(N^2 M)
(Outline steps shown: Compute, Weights)

63 Pairwise weights
This stage also computes the branch length for each pair (i,j)
(Outline steps shown: Compute, Weights, Find)

64 Max. Spanning Tree
Fast greedy procedure to find the tree
By construction: Q(T’,t’) ≥ Q(T_0,t_0)
Thus, l(T’,t’) ≥ l(T_0,t_0)
(Outline steps shown: Compute, Weights, Find, Construct bifurcation T_1)

65 Fix Tree
Remove redundant nodes
Add nodes to break up nodes of large degree
This operation preserves likelihood: l(T_1,t’) = l(T’,t’) ≥ l(T_0,t_0)
(Outline steps shown: Compute, Weights, Find, Construct bifurcation T_1)

66 Assessing trees: the Bootstrap
- Often we don’t trust the tree found as the “correct” one.
- Bootstrapping:
  - Sample (with replacement) n positions from the alignment
  - Learn the best tree for each sample
  - Look for tree features that appear frequently across the learned trees.
- For some models this procedure approximates the tree posterior P(T | X_1, …, X_n)
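The resampling loop can be sketched as below. The tree-building step is left abstract: `build_tree` is a hypothetical callable, supplied by the caller, that maps an alignment to a set of hashable tree features (for example, the splits of the inferred tree). The replicate count and seed are arbitrary illustrative defaults.

```python
import random


def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    length = len(next(iter(seqs.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(s[c] for c in cols) for name, s in seqs.items()}


def bootstrap_support(seqs, build_tree, n_replicates=100, seed=0):
    """Fraction of bootstrap replicates in which each tree feature appears.

    build_tree: hypothetical callable mapping an alignment (dict of name ->
    sequence) to a set of hashable features of the tree learned from it.
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_replicates):
        replicate = bootstrap_alignment(seqs, rng)
        for feature in build_tree(replicate):
            counts[feature] = counts.get(feature, 0) + 1
    return {feature: c / n_replicates for feature, c in counts.items()}
```

Whole columns are resampled together, so every replicate keeps the species aligned with each other; only the weighting of positions changes from one replicate to the next.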

67 New Tree
Thm: l(T_1,t_1) ≥ l(T_0,t_0)
These steps are then repeated until convergence
(Outline steps shown: Compute, Weights, Find, Construct bifurcation T_1)

