Advanced programming 236512 Algorithms for reconstructing phylogenetic trees spring 2006 Lecturer: Shlomo Moran, Taub 639, tel 4363 TA: Ilan Gronau,

1 Advanced programming 236512: Algorithms for reconstructing phylogenetic trees, spring 2006
Lecturer: Shlomo Moran, Taub 639, tel 4363
TA: Ilan Gronau, Taub 700, tel 4894
Website: http://webcourse.cs.technion.ac.il/236512/

2 Evolution
Evolution of new organisms is driven by:
• Diversity
  - Different individuals carry different variants of the same basic blueprint
• Mutations
  - The DNA sequence can be changed by single base changes, deletion/insertion of DNA segments, etc.
• Selection bias

3 The Tree of Life
[Figure: the tree of life. Source: Alberts et al.]

4 Primate evolution
A phylogeny, also called a phylogenetic tree, is a tree that describes the sequence of speciation events that led to the formation of a set of current-day species.

5 Theory of Evolution
• Basic idea
  - Speciation events lead to the creation of different species.
  - Speciation is caused by physical separation into groups in which different genetic variants become dominant.
• Any two species share a (possibly distant) common ancestor

6 Phylogenetic trees
• Leaves: current-day species (or taxa, plural of taxon)
• Internal vertices: hypothetical common ancestors
• Edge lengths: "time" from one speciation to the next
[Figure: a tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]

7 Types of Trees
A natural model to consider is that of rooted trees.
[Figure: a rooted tree whose root is labeled "Common Ancestor"]

8 Types of trees
An unrooted tree represents the same phylogeny without the root node.
Usually, data from current-day species does not distinguish between different placements of the root.

9 Rooted versus unrooted trees
[Figure: an unrooted tree with positions a, b, c marked; it represents the three rooted trees Tree a, Tree b and Tree c]

10 Positioning Roots in Unrooted Trees
• We can estimate the position of the root by introducing an outgroup:
  - a set of species that are definitely distant from all the species of interest
[Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant and the outgroup Falcon; the proposed root lies on the edge leading to Falcon]

11 Morphological vs. Molecular
• Classical phylogenetic analysis uses morphological features: number of legs, lengths of legs, etc.
• Modern biological methods allow the use of molecular features
  - Gene sequences
  - Protein sequences
• Analysis is based on homologous sequences (e.g., globins) in different species

12 From sequences to a phylogenetic tree
Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG
There are many possible types of sequences to use (e.g. mitochondrial vs. nuclear proteins).

13 Type of Data
• Distance-based (the projects focus on this method)
  - Input is a matrix of distances between species
  - A distance can be the fraction of residues on which two species disagree, or an alignment score between them, or …
• Character-based (not covered in this project)
  - Examine each character (e.g., residue) separately

14 Constructing trees from distances
• Transform differences between species to numerical distances
• Find a weighted tree that realizes/approximates the distances between the species.
The task is: given a set of species (leaves in a supposed tree) and distances between them, construct a phylogeny which best "fits" the distances.

15 Exact solution: Additive sets
Given a set S of n objects with an n×n distance matrix:
• d(i,i) = 0, and for i ≠ j, d(i,j) > 0
• d(i,j) = d(j,i)
• For all i,j,k it holds that d(i,k) ≤ d(i,j) + d(j,k)
Can we construct a weighted tree which realizes these distances?

16 There is always a tree for 3 objects
For n = 3 there is always a (unique) tree with one internal node v, joined to the leaves i, j, k by edges of lengths a, b, c:

        i      j      k
i       0     a+b    a+c
j              0     b+c
k                     0

Distance metrics on 4 objects may not have a tree.
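
Solving the three equations d(i,j) = a+b, d(i,k) = a+c, d(j,k) = b+c for the edge lengths gives a closed form. The following is a minimal sketch of that computation; the function name `three_leaf_tree` is an illustrative choice, not something from the slides.

```python
def three_leaf_tree(d_ij, d_ik, d_jk):
    """Edge lengths (a, b, c) of the tree joining leaves i, j, k to one internal node v."""
    a = (d_ij + d_ik - d_jk) / 2  # length of edge v-i
    b = (d_ij + d_jk - d_ik) / 2  # length of edge v-j
    c = (d_ik + d_jk - d_ij) / 2  # length of edge v-k
    return a, b, c


a, b, c = three_leaf_tree(5, 6, 7)
print(a, b, c)              # edge lengths
print(a + b, a + c, b + c)  # reproduces the input distances 5, 6, 7
```

The triangle inequality guarantees that all three lengths are nonnegative, which is why a tree always exists for three objects.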

17 The Four Points Condition
Definition: A distance metric on n objects satisfies the four points condition iff any subset of four objects can be labeled i, j, k, l so that:
d(i,k) + d(j,l) = d(i,l) + d(k,j) ≥ d(i,j) + d(k,l)
[Figure: a tree on the four leaves i, j, k, l]
Theorem: A distance metric is additive iff it satisfies the four points condition.
Note: Checking the four points condition directly gives an O(n⁴) algorithm, which is not very efficient.
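
The condition says that, of the three pairwise sums on any four objects, the two largest are equal. A direct O(n⁴) check might look like the sketch below; the function name and the tolerance parameter `tol` are assumptions added for illustration.

```python
from itertools import combinations


def satisfies_four_points(d, tol=1e-9):
    """Check the four points condition on a symmetric distance matrix d (list of lists)."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        # The two largest sums must coincide; the smallest may be anything.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Distances realized by a tree with cherries {0,1} and {2,3},
# pendant edges of length 1 and an internal edge of length 2.
tree_like = [[0, 2, 4, 4],
             [2, 0, 4, 4],
             [4, 4, 0, 2],
             [4, 4, 2, 0]]
# Distances along a 4-cycle 0-1-2-3-0 with unit edges: a metric, but not additive.
cycle = [[0, 1, 2, 1],
         [1, 0, 1, 2],
         [2, 1, 0, 1],
         [1, 2, 1, 0]]
print(satisfies_four_points(tree_like), satisfies_four_points(cycle))
```

The second matrix is an example of a distance metric on 4 objects that has no tree.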

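18 Constructing additive trees: The neighbor joining problem
Let i, j be neighboring leaves in a tree, let v be their parent, and let k be any other vertex. Since d(i,k) = d(i,v) + d(v,k), d(j,k) = d(j,v) + d(v,k) and d(i,j) = d(i,v) + d(v,j), we get:
d(k,v) = ½ (d(i,k) + d(j,k) − d(i,j))
The formula shows that we can compute the distances of v to all other leaves.
[Figure: leaves i and j attached to their parent v, and the path from v to k]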
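
As a small illustration of this reduction, the sketch below computes the distance from the parent v of two neighboring leaves to every other leaf, using d(k,v) = ½(d(i,k) + d(j,k) − d(i,j)). The function name `distances_to_parent` is hypothetical.

```python
def distances_to_parent(d, i, j):
    """Return {k: d(k, v)} where v is the parent of neighboring leaves i and j."""
    return {k: (d[i][k] + d[j][k] - d[i][j]) / 2
            for k in range(len(d)) if k not in (i, j)}


# Tree with cherries {0,1} and {2,3}, pendant edges 1, internal edge 2.
d = [[0, 2, 4, 4],
     [2, 0, 4, 4],
     [4, 4, 0, 2],
     [4, 4, 2, 0]]
print(distances_to_parent(d, 0, 1))  # v is 3 away from leaves 2 and 3 (internal edge 2 + pendant edge 1)
```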

19 Constructing additive trees: The neighbor joining problem
This suggests the following method to construct a tree from a distance matrix:
1. Find neighboring leaves i,j in the tree.
2. Replace i,j by their parent v and recursively construct a tree T for the smaller set.
3. Add i,j as children of v in T.

20 Neighbor Finding
How can we find, from distances alone, a pair of neighboring leaves (also called cherries)?
Closest vertices aren't necessarily neighboring leaves.
[Figure: a tree on leaves A, B, C, D in which the closest pair of leaves is not a cherry]

21 Neighbor Finding: Saitou & Nei method
Theorem (Saitou & Nei): Assume all internal edge weights are positive. If Q(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree.
Definitions: [the formulas defining r(i) and Q(i,j) were not preserved in this transcript]

22 S&N Neighbor Joining Algorithm
• If n = 3, return the tree on three vertices
• Compute Q(i,j) for all i,j
• Choose i,j such that Q(i,j) is minimal
• Create a new vertex v, and set d(i,v), d(j,v) and, for every other k, d(k,v) [the slide's formulas were not preserved in this transcript]
• Remove i,j, and add v to the set of objects
• Recursively construct a tree on the smaller set, then add i,j as children of v, at distances d(i,v) and d(j,v)
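
The steps above can be sketched as follows. Because the slide's own formulas are missing from the transcript, this sketch assumes the standard Saitou & Nei definitions r(i) = Σ_k d(i,k), Q(i,j) = (n−2)·d(i,j) − r(i) − r(j) and d(i,v) = ½·d(i,j) + (r(i) − r(j)) / (2(n−2)); the course slides may state them differently. The vertex names `v0, v1, ...` and `c` are made up for illustration, and the loop is written iteratively rather than recursively.

```python
def neighbor_joining(dist, names):
    """Return tree edges (u, w, length) for an n x n distance matrix with n >= 3."""
    d = {a: {b: float(dist[x][y]) for y, b in enumerate(names)}
         for x, a in enumerate(names)}
    nodes = list(names)
    edges = []
    fresh = 0  # counter for the hypothetical internal-vertex names v0, v1, ...

    while len(nodes) > 3:
        n = len(nodes)
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}
        pairs = [(a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:]]
        # Assumed standard criterion: minimize Q(i,j) = (n-2) d(i,j) - r(i) - r(j).
        i, j = min(pairs, key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])

        v = f"v{fresh}"
        fresh += 1
        d_iv = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges.append((v, i, d_iv))
        edges.append((v, j, d[i][j] - d_iv))

        # Distances from the new vertex: d(k,v) = (d(i,k) + d(j,k) - d(i,j)) / 2.
        d[v] = {v: 0.0}
        for k in nodes:
            if k != i and k != j:
                d[v][k] = d[k][v] = (d[i][k] + d[j][k] - d[i][j]) / 2
        nodes = [k for k in nodes if k != i and k != j] + [v]

    # Base case n = 3: one internal vertex "c" joined to the three remaining nodes.
    x, y, z = nodes
    edges.append(("c", x, (d[x][y] + d[x][z] - d[y][z]) / 2))
    edges.append(("c", y, (d[x][y] + d[y][z] - d[x][z]) / 2))
    edges.append(("c", z, (d[x][z] + d[y][z] - d[x][y]) / 2))
    return edges


# Additive example: cherries {A,B} and {C,D}, pendant edges 1, 2, 3, 4, internal edge 5.
D = [[0, 3, 9, 10],
     [3, 0, 10, 11],
     [9, 10, 0, 7],
     [10, 11, 7, 0]]
for edge in neighbor_joining(D, ["A", "B", "C", "D"]):
    print(edge)
```

On this additive input the sketch recovers the generating tree exactly; recomputing r and Q from scratch in every iteration keeps the code short and gives the O(n³) total running time.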

23 Complexity of S&N Neighbor Joining Algorithm
Initialization: Θ(n²) to compute r(i) and Q(i,j) for all i,j ∈ L.
Each iteration:
• O(n²) to find the minimal Q(i,j).
• O(n) to compute {D(v,k) : k ∈ L} for the new node v, and to update the matrix.
• O(n²) to update the values Q(i,j).
Total of O(n³).
[Figure: leaves i and j joined at v, with the distance D(v,k) to another leaf k]

24 Some remarks on S&N Neighbor Joining Algorithm
• Applicable to matrices which are not additive
• Known to work well in practice
• The algorithm and its variants are the most widely used distance-based algorithms today.
Next we present a more efficient Neighbor Joining algorithm, which is based on LCA distances.

25 Least Common Ancestor distances
Definition: Given a weighted tree T and a specific vertex r in it:
d_T(r; i,j) = distance in T from r to path(i,j).
d_T(r; i,i) = distance in T from r to i.
[Figure: a weighted tree with the vertex r and leaves A, B, C, D, E, annotated with edge weights and LCA distances]
Examples from the figure: d_T(r; A,D) = 3 and d_T(r; A,A) = 7.

26 Least Common Ancestor distances
The distances d_T(r; i,j) can be presented by a matrix:

      A   B   C   D   E
A     7   0   0   3   5
B         8   5   0   0
C             7   0   0
D                 5   3
E                     6

[Figure: the same tree, with the LCA distances marked]
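
The distance from r to path(i,j) can also be computed from ordinary tree distances as d_T(r; i,j) = ½(d(r,i) + d(r,j) − d(i,j)). The sketch below applies this identity to a small weighted tree. The tree itself is an assumption: it was reconstructed to be consistent with the matrix on this slide (internal vertices named `x`, `y`, `z` for illustration), since the figure's edge weights are garbled in the transcript.

```python
def tree_distances(adj, source):
    """Distance from `source` to every vertex of a weighted tree."""
    dist = {source: 0}
    stack = [source]
    while stack:
        u = stack.pop()
        for w, length in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + length
                stack.append(w)
    return dist


def lca_matrix(adj, root, leaves):
    """L[i, j] = distance from root to path(i, j), via (d(r,i) + d(r,j) - d(i,j)) / 2."""
    from_root = tree_distances(adj, root)
    L = {}
    for i in leaves:
        from_i = tree_distances(adj, i)
        for j in leaves:
            L[i, j] = (from_root[i] + from_root[j] - from_i[j]) / 2
    return L


# Assumed tree, chosen to match the slide's matrix.
EDGES = [("r", "x", 3), ("x", "D", 2), ("x", "y", 2), ("y", "A", 2), ("y", "E", 1),
         ("r", "z", 5), ("z", "B", 3), ("z", "C", 2)]
adj = {}
for u, w, length in EDGES:
    adj.setdefault(u, []).append((w, length))
    adj.setdefault(w, []).append((u, length))

L = lca_matrix(adj, "r", "ABCDE")
for i in "ABCDE":
    print(i, [L[i, j] for j in "ABCDE"])
```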

27 LCA Matrices
Definition: A symmetric nonnegative matrix L is an LCA matrix iff
1. For each i: L(i,i) = max_j {L(i,j)}
2. It satisfies the "3 points condition": for each 3 distinct indices i, j, k,
   L(i,j) ≥ min {L(i,k), L(j,k)}
   ("the smallest value appears twice")
[Example over indices i, j, k: L(i,j) = 9, L(i,k) = 6, L(j,k) = 6; the extracted figure also shows the values 11 and 8, whose position is unclear]
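
A direct check of this definition is sketched below. Requiring L(i,j) ≥ min{L(i,k), L(j,k)} for every ordering of three indices is the same as requiring that the two smallest of the three values are equal, which is what the code tests. The function name `is_lca_matrix` is illustrative; the first test matrix is the 5×5 matrix over A, B, C, D, E from the previous slide.

```python
from itertools import combinations


def is_lca_matrix(L):
    """Check symmetry, nonnegativity, diagonal maximality and the 3 points condition."""
    n = len(L)
    for i in range(n):
        for j in range(n):
            if L[i][j] != L[j][i] or L[i][j] < 0:
                return False
        if L[i][i] != max(L[i]):       # condition 1: the diagonal is the row maximum
            return False
    for i, j, k in combinations(range(n), 3):
        smallest, middle, _ = sorted([L[i][j], L[i][k], L[j][k]])
        if smallest != middle:         # condition 2: the smallest value appears twice
            return False
    return True


slide_matrix = [[7, 0, 0, 3, 5],
                [0, 8, 5, 0, 0],
                [0, 5, 7, 0, 0],
                [3, 0, 0, 5, 3],
                [5, 0, 0, 3, 6]]
bad = [[5, 1, 2],
       [1, 5, 3],
       [2, 3, 5]]   # off-diagonal values 1, 2, 3: the smallest appears only once
print(is_lca_matrix(slide_matrix), is_lca_matrix(bad))
```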

28 LCA Matrices
[Example over indices i, j, k: L(i,j) = 9, L(i,k) = 6, L(j,k) = 6]
Theorem: The following conditions are equivalent for an (n−1)×(n−1) matrix L:
1. L is an LCA matrix.
2. There is a weighted tree T with n leaves and a leaf r in T such that for each pair of leaves i,j ≠ r: L(i,j) = d_T(r; i,j)

29 LCA distances ⇒ LCA matrix
There is a weighted tree T s.t. L(i,j) = d_T(r; i,j) ⇒ L is an LCA matrix:
By properties of least common ancestors in trees.
[Figure: tree with root r and leaves i, j, k, showing L(k,i) = L(j,i) ≤ L(k,j)]

30 LCA matrix ⇒ LCA distances
Now we are given an LCA matrix L and need to construct a tree.
The construction uses "maximal off-diagonal" entries:
L(i,j) is a "maximal off-diagonal" entry in row i if L(i,j) = max_k {L(i,k) : k ≠ i}

       1    2    k ...
1     18    9    8   3   7

Example: L(1,2) is the maximal off-diagonal entry in row 1.

31 Maximal off diagonal entries
Lemma: If L(i,j) is the maximal "off-diagonal" entry in both rows i and j in L, then for all k ≠ i,j: L(i,k) = L(j,k).
Proof: By the 3 points condition on {i,j,k}.

       i    j    k ...
i     18    9    8   3   7
j      9   14    8   3   7

Example for i=1, j=2.

32 LCA matrix ⇒ LCA distances: Proof by induction
We now prove by induction on n: L is an (n−1)×(n−1) LCA matrix ⇒ there is a weighted tree T with a root r as in the theorem.
Basis: n = 2. L = [w]. T is a tree with a single edge of weight w.
[Example: L = [4]; T is the single edge r-i of weight 4]

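33 LCA matrix ⇒ LCA distances: n = 3
T is a tree with two leaves (besides r).

       i    j
i     12    3
j      3    7

[Figure: r is joined by an edge of weight 3 to an internal vertex, which is joined to i by an edge of weight 9 and to j by an edge of weight 4]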

34 Induction step
Induction step: n ≥ 3. Let L be an LCA matrix of dimension n−1. We describe an algorithm for constructing the corresponding tree:
1. Find i,j s.t. L(i,j) is the maximal off-diagonal entry in L.

L:     i    j    k ...
i     11    9
j      9   14

(In the example i=1 and j=2)

35 Induction step
2. Let L` be the matrix obtained by removing rows/columns i and j, and inserting row/column v s.t. L`(v,v) = L(i,j), and for k ≠ i,j, L`(v,k) = L(i,k) (= L(j,k)).

L:     1    2    k ...
1     11    9    8   3   7
2      9   14    8   3   7

L`:    v    k ...
v      9    8   3   7

36 Induction Step
To show that L` is an LCA matrix we need a definition and a simple observation:
Definition: Let L be an n×n matrix, and let S ⊆ {1,...,n}. L[S] is the submatrix of L consisting of the rows and columns with indices from S.
Observation 1: If L is an LCA matrix then for every S ⊆ {1,...,n}, L[S] is also an LCA matrix.

37 Induction step
Claim: L` is an LCA matrix of dimension n−2.
Proof: Let S be all leaves except j. Then L` is obtained from L[S] as follows:
1. change the index i to v
2. set L`(v,v) ← L(i,j)
By Observation 1 and the maximality of L(i,j), L` is also an LCA matrix.

L:     1    2    k ...
1     11    9    8   3   7
2      9   14    8   3   7

L`:    v    k ...
v      9    8   3   7

38 Induction step
3. Construct a tree T` for L` (with n−1 leaves).

L`:    v    k ...
v      9    8   3   7

[Figure: the tree T`, in which v is a leaf]

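39 Induction step
4. Add to v two children, i and j, with appropriate edge lengths.

       i    j    k ...
i     11    9    8   3   7
j      9   14    8   3   7

[Figure: T` with i and j attached below v by edges of lengths 2 and 5, i.e. L(i,i) − L(i,j) = 11 − 9 and L(j,j) − L(i,j) = 14 − 9]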

40 Deepest LCA neighbor joining
• If n ≤ 3, return the tree on n vertices
• Prepare a list MAX of size n, s.t. MAX(i) = maximal off-diagonal element in row i
Recursion:
• Find i,j s.t. L(i,j) is the maximal off-diagonal entry of L
• Make the reduction to L` as described
• Update the list MAX (only MAX(v) needs an update!)
• Construct T` for L`
• Add i and j as children of v.
[Figure: T` with i and j attached below v]
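
One way to realize these steps is sketched below, under a few stated assumptions: the input is an exact LCA matrix, the recursion is unrolled into a loop that runs until a single vertex is left (rather than stopping at n ≤ 3), row i is reused for the new vertex v, and the vertex names `v0, v1, ...` and the default root name `r` are made up. Each vertex sits at depth L(x,x) below the root, so joining i and j under v at depth L(i,j) gives edge lengths L(i,i) − L(i,j) and L(j,j) − L(i,j).

```python
def dlca_tree(L, names, root="r"):
    """Return edges (parent, child, length) of a tree realizing the LCA matrix L."""
    L = [row[:] for row in L]   # work on a copy
    label = list(names)         # label[x] = tree vertex currently stored at index x
    alive = list(range(len(names)))
    # MAX list: value of the maximal off-diagonal entry in each row.
    max_off = {x: max((L[x][y] for y in alive if y != x), default=None) for x in alive}
    edges = []
    fresh = 0

    while len(alive) > 1:
        i = max(alive, key=lambda x: max_off[x])   # row holding the global maximum
        j = next(y for y in alive if y != i and L[i][y] == max_off[i])
        depth_v = L[i][j]
        v = f"v{fresh}"
        fresh += 1
        edges.append((v, label[i], L[i][i] - depth_v))
        edges.append((v, label[j], L[j][j] - depth_v))
        # Reduction: drop j, let index i stand for v, set L`(v,v) = L(i,j).
        alive.remove(j)
        del max_off[j]
        label[i] = v
        L[i][i] = depth_v
        # Only MAX(v) needs an update: for every other row k, L(k,i) = L(k,j).
        max_off[i] = max((L[i][y] for y in alive if y != i), default=None)

    (last,) = alive
    edges.append((root, label[last], L[last][last]))
    return edges


slide_matrix = [[7, 0, 0, 3, 5],
                [0, 8, 5, 0, 0],
                [0, 5, 7, 0, 0],
                [3, 0, 0, 5, 3],
                [5, 0, 0, 3, 6]]
for edge in dlca_tree(slide_matrix, "ABCDE"):
    print(edge)
```

Each iteration scans the MAX list and one row, so the loop body costs O(n) and the whole construction O(n²). In this example the last edge has length 0, which means r coincides with the last internal vertex.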

41 Complexity Analysis
Initialization: Constructing MAX takes O(n²).
Let Time(n) be the complexity of the algorithm, given the input matrix L and the list MAX. Time(n) is given by:
• Reducing L to L`: O(n)
• Updating MAX: O(n)
• Constructing T` from L`: Time(n−1)
• Constructing T from T`: O(1)
Time(n) ≤ Time(n−1) + O(n)
Hence Time(n) = O(n²)
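
Unrolling the recurrence makes the last step explicit. In the derivation below, c is an assumed constant bounding the O(n) work per level, and the base case n ≤ 3 is taken to cost O(1):

```latex
\begin{align*}
\mathrm{Time}(n) &\le \mathrm{Time}(n-1) + c\,n \\
                 &\le \mathrm{Time}(n-2) + c\,(n-1) + c\,n \\
                 &\;\;\vdots \\
                 &\le \mathrm{Time}(3) + c \sum_{m=4}^{n} m
                  \;=\; O(1) + c\left(\frac{n(n+1)}{2} - 6\right)
                  \;=\; O(n^2).
\end{align*}
```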

42 Saitou & Nei vs. DLCA methods
• DLCA, like S&N, can be implemented on noisy data (in many ways)
• On exact data, the DLCA and S&N methods have the same (correct) output. They differ on noisy data (which occurs in practice).
• One basic difference: unlike the S&N method, the DLCA algorithm depends on selecting a root. Hence DLCA may produce many different trees on the same input.
Some of the projects will concentrate on this difference.

43 Incremental Reconstruction via Local Queries
Incrementally reconstructing the tree:
[Figure: a topology on taxa a to h with internal vertices 1 to 6, before and after inserting a taxon]
• When inserting a new taxon x into a given topology T, we need to find out to which edge x should be attached.
• We are allowed to ask the 'oracle' local queries LQ(x,v) (x: taxon, v: internal vertex).

44 Local Queries - Motivation
• Asking LQ(x,v) is equivalent to asking the topology of {x, a, b, c}, where v is the center-point of a, b, c in T.
[Figure: the full topology on taxa a to h, and a partial topology on a to e into which f is inserted]
• Such questions can be asked directly (using likelihood) or through a pairwise distance matrix (which will be discussed later).

45 Balancing Vertices
• We'd like to minimize the number of queries required for inserting a new taxon.
  - Lower bound: log₃(|E_T|) (simple adversarial argument)
  - Upper bound: log₂(|E_T|)
• The algorithm which achieves the upper bound uses 'balancing vertices':
  - A balancing vertex in T is an internal vertex which splits T into 3 subtrees of size at most ceil(|T|/2).
  - Using balancing vertices in the local queries, the edge to which a new taxon should be attached can be found in about log₂(|E_T|) queries.

46 Balancing Vertices
• Every tree contains either a single balancing vertex or two adjacent balancing vertices.
• Finding a balancing vertex:
  - Start at some arbitrary vertex v.
    · If v is balancing, stop.
    · Otherwise, continue to the vertex u, adjacent to v in the 'heaviest' subtree.
• The algorithm traverses each edge at most once ⇒ time complexity O(|T|).
[Figure: a tree T with 13 edges; the heaviest subtrees met along the walk have 11, 9 and 7 edges]
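
The walk described above can be sketched as follows. Two assumptions are made: subtree size is counted in edges (including the edge that attaches the subtree to v), which matches the "13 edges / 11 / 9 / 7" annotation in the figure, and the walk starts at an internal vertex. For clarity this sketch recomputes subtree sizes at every step, so it is not the linear-time version; the function names and the example tree are made up.

```python
def build_adjacency(edges):
    adj = {}
    for u, w in edges:
        adj.setdefault(u, []).append(w)
        adj.setdefault(w, []).append(u)
    return adj


def side_size(adj, v, w):
    """Number of edges on w's side of v, counting the edge (v, w) itself."""
    count, stack = 1, [(w, v)]
    while stack:
        x, parent = stack.pop()
        for y in adj[x]:
            if y != parent:
                count += 1
                stack.append((y, x))
    return count


def find_balancing_vertex(adj, start):
    """Walk from the internal vertex `start` towards the heaviest subtree until balanced."""
    total_edges = sum(len(nb) for nb in adj.values()) // 2
    limit = (total_edges + 1) // 2          # ceil(|T| / 2), with |T| counted in edges
    v = start
    while True:
        sizes = {w: side_size(adj, v, w) for w in adj[v]}
        heaviest = max(sizes, key=sizes.get)
        if sizes[heaviest] <= limit:
            return v
        v = heaviest


# A caterpillar with 8 leaves (a..h), 6 internal vertices and 13 edges.
caterpillar = build_adjacency([
    ("v1", "a"), ("v1", "b"), ("v1", "v2"), ("v2", "c"), ("v2", "v3"),
    ("v3", "d"), ("v3", "v4"), ("v4", "e"), ("v4", "v5"), ("v5", "f"),
    ("v5", "v6"), ("v6", "g"), ("v6", "h"),
])
print(find_balancing_vertex(caterpillar, "v1"), find_balancing_vertex(caterpillar, "v6"))
```

Starting from either end of the caterpillar, the heaviest subtree shrinks from 11 to 9 to 7 edges, and the two walks stop at the two adjacent balancing vertices.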

47 A Simple and Efficient Algorithm
• Iteratively add taxa 1,2,…,n to the topology
• When adding taxon x to topology T:
  - If T is trivial (consists of a single edge), attach x to that edge.
  - Otherwise:
    · Find a balancing vertex v of T.
    · Ask query LQ(x,v).
    · Continue recursively on T', the subtree corresponding to the answer of the query.
• Complexity:
  - Adding taxon 1≤x≤n to T takes O(log(x)) queries and O(x) time.
  - Total query complexity: O(n·log(n))
  - Total time complexity: O(n²)

48 Interesting Issues
• Two major issues are raised in this area:
  - Queries do not always have reliable answers
    · Use confidence level for answers
    · Verify the answers
  - Reduce running time to O(n·log(n))
    · Finding balancing vertices leads to high overhead
    · Maybe we don't have to re-compute the balancing vertices in every stage

49 Robustness to Noise in Data
Answering local queries using a distance matrix D:
  - We wish to assess the topology spanned by four taxa: x, a, b, c.
  - Observe the 4×4 submatrix of D over x, a, b, c.
    [Figure: the submatrix over a, b, c, x and the candidate quartet topologies]
  - If D is additive then there is a labeling of the taxa by i, j, k, l s.t.:
    D(i,j) + D(k,l) ≤ D(i,k) + D(j,l) = D(i,l) + D(j,k)
  - The configuration of the quartet is (ij ; kl), and the path separating the two pairs is of length ½(D(i,k) + D(j,l) − D(i,j) − D(k,l)).
  - If D is not additive we set the configuration of the quartet to (ij ; kl), where D(i,j) + D(k,l) is minimal of the three sums.
  - Confidence of prediction can be estimated by the difference between the maximal and minimal sums.
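
A minimal sketch of answering a quartet question from a distance matrix is given below: it picks the pairing with the minimal sum and reports the gap between the maximal and minimal sums as a confidence score. The function name and the return format are illustrative assumptions.

```python
def quartet_topology(D, x, a, b, c):
    """Return (split, confidence) for the quartet {x, a, b, c} under distance matrix D."""
    sums = {
        ((x, a), (b, c)): D[x][a] + D[b][c],
        ((x, b), (a, c)): D[x][b] + D[a][c],
        ((x, c), (a, b)): D[x][c] + D[a][b],
    }
    split = min(sums, key=sums.get)                       # pairing with the minimal sum
    confidence = max(sums.values()) - min(sums.values())  # gap between extreme sums
    return split, confidence


# Additive example: cherries {0,1} and {2,3}, pendant edges 1, internal edge 2.
D = [[0, 2, 4, 4],
     [2, 0, 4, 4],
     [4, 4, 0, 2],
     [4, 4, 2, 0]]
print(quartet_topology(D, 0, 1, 2, 3))
```

On additive data the confidence value equals twice the length of the path separating the two pairs, so short internal edges naturally yield low-confidence answers.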

50 Robustness to Noise in Data
Answering local queries using a distance matrix D:
We can check several quartets of type x, a, b, c to answer a single local query.
Example: To answer LQ(g, 1) we can check all quartets in {g} × {a} × {c,f} × {b,d,e}.
[Figure: a topology on taxa a to f with internal vertices 1 to 4, and the new taxon g]
  - We can choose a representative set of quartets, and answer the local query according to (weighted) majority.
  - If the answer is still inconclusive, we can choose to ask another local query.

51 Improving Running Time
Separator Trees:
  - A deterministic algorithm which inserts a new taxon x into a given topology T can be viewed as a rooted decision tree. Each internal node represents a local query (an internal vertex in T). Each internal node has three outgoing edges corresponding to the three possible answers to the query. Each leaf corresponds to an edge in T.
  - A special case of decision trees are separator trees. The time complexity of the algorithm is the depth of the separator tree.
[Figure: a topology T on taxa a to m with internal vertices 1 to 6, and a separator tree S for it]

52 Improving Running Time
Balanced Separator Trees:
  - A balanced separator tree uses balancing vertices (of the appropriate subtrees of T)
  - Can be constructed in O(n·log(n)) time
  - Inserting a taxon does not drastically harm the balance
  - If we allow some imbalance, we can guarantee that the costly balancing procedure is executed only a few times during construction of the whole topology.
  - Amortized analysis of total time complexity: O(n·log²(n))
[Figure: the topology T on taxa a to m with internal vertices 1 to 6, and a balanced separator tree S for it]

53 Improving Running Time
Bottom-up approach (simple separator trees):
  - Start with the edge-set of T
  - Choose disjoint edge triplets, such that each triplet contains at least one leaf
  - Contract each triplet to a single edge
  - Recursively continue on the reduced topology
[Figure: the topology T on taxa a to m with internal vertices 1 to 6, its successive contractions, and the resulting simple separator tree S]

54 Improving Running Time
Bottom-up approach (simple separator trees):
  - By a simple linear traversal of T you can find Θ(|T|) edge-triplets
  - Topology size is reduced by a constant factor at each stage. Consequently, the depth of the simple separator tree is O(log(n)) and the time complexity is O(n).
  - Insertion of a taxon induces modifications propagating bottom-up through the layers of the separator tree
[Figure: the topology T and its contractions, with the layers labeled IS: {1,2,4,6}, IS: {3}, IS: {5}]

55 Testing Reconstruction Methods on Noisy Data
We'd like to test reconstruction algorithms on actual phylogenetic data.
Problem: Confirmed phylogenetic trees are scarce and small.
Solution: Simulate the data.
1. Generate an edge-weighted tree T under some probabilistic model (Yule-Harding).
2. Choose a random DNA string for the root and simulate evolution on the tree to obtain sequences for all leaves (SeqGen), e.g. ATTCG…, ATACG…, ACTGG….
3. Obtain the pairwise distance matrix D from the sequences (DNAdist from PHYLIP).
4. Run the reconstruction algorithm on D to obtain a tree T'.
5. Compare the topologies of T and T'.

56 The Projects
Project I: The DLCA algorithm
Implement algorithms:
• Saitou & Nei's neighbor joining
• DLCA neighbor-joining
  - mid-point reduction
  - maximal-value reduction
Simulate data:
Use pre-generated trees to simulate the process of evolution (using the SeqGen program). For each tree generate several sequence-sets.
Experiments:
Test the various algorithms on the generated data:
  - Use the DNADIST program (part of the Phylip package) to get a distance matrix corresponding to the sequence-set of the leaves.
  - Execute the algorithms on the distance matrix
  - Check topological accuracy using the RF-score

57 The Projects
Project II: Fast Algorithms Using Local Queries
Implement algorithms:
Implement advanced data structures which support the various algorithms:
• Algorithm using semi-balanced separator trees
• Algorithm using simple separator trees
Simulate data:
Use pre-generated trees and/or a uniform random model
Experiments:
• Test the various algorithms on the generated trees:
  - Use the generated trees to answer the local queries asked by the algorithms.
  - Compare the performance of the different algorithms on this data.

58 The Projects
Project III: Robust Algorithms Using Local Queries
Implement algorithms:
Implement the O(n²) algorithm using O(n·log(n)) queries
Simulate data:
Use pre-generated trees and distance matrices
Experiments:
• Test various approaches on the generated data:
  - Use the distance matrices to answer the local queries asked by the algorithms.
  - Suggest some method of estimating the confidence level of an answer to a query.
  - Check for errors in the reconstructed topology.
• Compare several approaches

59 Grading Scheme
• 10% - work plan
• 60% - final report + submitted code
  Rough distribution of grade:
  - 40% - meeting project requirements
  - 10% - code organization and documentation
  - 10% - innovation and creativity
• 30% - final presentation

60 Schedule
21/3: Introductory meeting
28/3: Deadline for choosing a project
26-30/3: Individual 30-minute meetings with each team to discuss the specification of the project
23-27/4: Individual 60-minute meetings with each team to discuss the work plan and design of the project
2/5: Deadline for submitting the work plan
21-25/5: Individual progress meetings
18-22/6: Concluding 60-minute meetings with each team
27/6, 4/7: Project presentations and submission of final draft
Final submission deadline: to be announced

61 Homework
• Team up in pairs
• Choose a project
• Send me an e-mail containing:
  - The names, ID numbers and e-mails of all students in the group
  - Preferred project + 2nd priority project
  - Two optional dates for the first project meeting (next week)
• Go over the references of your chosen project
Good Luck!

