Advanced programming 236512: Algorithms for reconstructing phylogenetic trees, spring 2006
Lecturer: Shlomo Moran, Taub 639, tel 4363
TA: Ilan Gronau, Taub 700, tel 4894
Website: http://webcourse.cs.technion.ac.il/236512/

2 Evolution
Evolution of new organisms is driven by:
• Diversity
  - Different individuals carry different variants of the same basic blueprint.
• Mutations
  - The DNA sequence can change through single-base changes, deletion/insertion of DNA segments, etc.
• Selection bias

3 The Tree of Life
[Figure: the tree of life. Source: Alberts et al.]

4 Primate evolution
A phylogeny (also called a phylogenetic tree) is a tree that describes the sequence of speciation events that led to the formation of a set of present-day species.

5 Theory of Evolution
• Basic idea
  - Speciation events lead to the creation of different species.
  - Speciation is caused by physical separation into groups in which different genetic variants become dominant.
• Any two species share a (possibly distant) common ancestor.

6 Phylogenetic trees
• Leaves: current-day species (or taxa, the plural of taxon)
• Internal vertices: hypothetical common ancestors
• Edge lengths: "time" from one speciation to the next
[Figure: a tree with leaves Aardvark, Bison, Chimp, Dog, Elephant.]

7 Types of Trees
A natural model to consider is that of rooted trees.
[Figure: a rooted tree whose root is labeled "Common Ancestor".]

8 Types of trees
An unrooted tree represents the same phylogeny without the root node.
Usually, data from current-day species does not distinguish between different placements of the root.

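9 Rooted versus unrooted trees
[Figure: an unrooted tree with three candidate root positions a, b, c, next to the rooted trees Tree a, Tree b and Tree c.]
The unrooted tree represents the three rooted trees.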

10 Positioning Roots in Unrooted Trees
• We can estimate the position of the root by introducing an outgroup:
  - a set of species that are definitely distant from all the species of interest.
[Figure: the tree on Aardvark, Bison, Chimp, Dog, Elephant with outgroup Falcon; the proposed root lies on the edge leading to Falcon.]

11 Morphological vs. Molecular
• Classical phylogenetic analysis uses morphological features: number of legs, lengths of legs, etc.
• Modern biological methods allow the use of molecular features:
  - Gene sequences
  - Protein sequences
• The analysis is based on homologous sequences (e.g., globins) in different species.

12 From sequences to a phylogenetic tree
Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG
There are many possible types of sequences to use (e.g. mitochondrial vs. nuclear proteins).

13 Type of Data
• Distance-based (the project focuses on this method)
  - Input is a matrix of distances between species.
  - A distance can be the fraction of residues on which two species disagree, or an alignment score between them, or …
• Character-based (not covered in this project)
  - Examine each character (e.g., residue) separately.

14 Constructing trees from distances
• Transform differences between species to numerical distances.
• Find a weighted tree that realizes/approximates the distances between the species.
The task is: given a set of species (leaves in a supposed tree) and distances between them, construct a phylogeny which best "fits" the distances.

15 Exact solution: Additive sets
Given a set S of n objects with an n×n distance matrix such that:
• d(i,i) = 0, and for i ≠ j, d(i,j) > 0
• d(i,j) = d(j,i)
• for all i,j,k it holds that d(i,k) ≤ d(i,j) + d(j,k)
Can we construct a weighted tree which realizes these distances?

16 There is always a tree for 3 objects
For n=3: there is always a (unique) tree with one internal node v, whose edges to the leaves i, j, k have weights a, b, c:
        i      j      k
  i     0     a+b    a+c
  j            0     b+c
  k                   0
Distance metrics on 4 objects may not have a tree.
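
The edge weights of the three-leaf tree follow from the matrix by solving its three equations. A short derivation, written here in LaTeX notation:

```latex
\begin{aligned}
d(i,j) &= a + b, \qquad d(i,k) = a + c, \qquad d(j,k) = b + c \\
a &= \tfrac12\bigl(d(i,j) + d(i,k) - d(j,k)\bigr) \\
b &= \tfrac12\bigl(d(i,j) + d(j,k) - d(i,k)\bigr) \\
c &= \tfrac12\bigl(d(i,k) + d(j,k) - d(i,j)\bigr)
\end{aligned}
```

By the triangle inequality all three weights are nonnegative, which is why a tree always exists for n=3.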

17 The Four Points Condition
Definition: A distance metric on n objects satisfies the four points condition iff any subset of four objects can be labeled i,j,k,l so that:
  d(i,k) + d(j,l) = d(i,l) + d(k,j) ≥ d(i,j) + d(k,l)
Theorem: A distance metric is additive iff it satisfies the four points condition.
Note: Checking the four points condition directly gives an O(n^4) algorithm, which is not very efficient.
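
The direct O(n^4) test mentioned in the note can be sketched in Python as follows. The function name and the tolerance parameter are illustrative assumptions; the check relies on the equivalent formulation that, among the three pairwise sums of a quartet, the two largest must be equal.

```python
from itertools import combinations


def satisfies_four_point(d, tol=1e-9):
    """Check the four points condition on a symmetric distance matrix d.

    For every quartet, the two largest of the three pairwise sums
    must be equal (up to the tolerance tol).
    """
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([
            d[i][j] + d[k][l],
            d[i][k] + d[j][l],
            d[i][l] + d[j][k],
        ])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True
```

For example, the matrix of the tree ((A,B),(C,D)) with leaf edges 1, 2, 3, 4 and internal edge 5 passes the test, while perturbing a single entry makes it fail.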

18 Constructing additive trees: The neighbor joining problem
Let i, j be neighboring leaves in a tree, let v be their parent, and let k be any other vertex. The formula
  d(k,v) = ½ [d(i,k) + d(j,k) − d(i,j)]
shows that we can compute the distances from v to all other leaves.
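
The distance from the parent v to any other vertex k follows from additivity along tree paths. A short derivation in LaTeX notation, using only the symbols of this slide:

```latex
\begin{aligned}
d(i,k) &= d(i,v) + d(v,k) \\
d(j,k) &= d(j,v) + d(v,k) \\
d(i,j) &= d(i,v) + d(j,v) \\
\Rightarrow\quad d(i,k) + d(j,k) - d(i,j) &= 2\,d(v,k)
\quad\Rightarrow\quad
d(k,v) = \tfrac12\bigl(d(i,k) + d(j,k) - d(i,j)\bigr)
\end{aligned}
```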

19 Constructing additive trees: The neighbor joining problem
This suggests the following method to construct a tree from a distance matrix:
1. Find neighboring leaves i,j in the tree.
2. Replace i,j by their parent v and recursively construct a tree T for the smaller set.
3. Add i,j as children of v in T.

20 Neighbor Finding
How can we find, from distances alone, a pair of neighboring leaves (also called cherries)?
Closest vertices aren't necessarily neighboring leaves.
[Figure: a tree on leaves A, B, C, D in which the closest pair of leaves is not a cherry.]

21 Neighbor Finding: Saitou & Nei method
Definitions:
  r(i) = Σ_k d(i,k)
  Q(i,j) = (n−2)·d(i,j) − r(i) − r(j)
Theorem (Saitou & Nei): Assume all internal edge weights are positive. If Q(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree.

22 S&N Neighbor Joining Algorithm
• If n = 3, return the tree of three vertices.
• Compute Q(i,j) for all i,j.
• Choose i,j such that Q(i,j) is minimal.
• Create a new vertex v, and set (with r(i) = Σ_k d(i,k)):
    d(i,v) = ½ d(i,j) + [r(i) − r(j)] / [2(n−2)]
    d(j,v) = d(i,j) − d(i,v)
    d(k,v) = ½ [d(i,k) + d(j,k) − d(i,j)]   for every other k
• Remove i,j, and add v to the set of objects.
• Recursively construct a tree on the smaller set, then add i,j as children of v, at distances d(i,v) and d(j,v).
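
A minimal Python sketch of this algorithm, written iteratively rather than recursively. It assumes the standard Saitou and Nei definitions r(i) = Σ_k d(i,k) and Q(i,j) = (n−2)·d(i,j) − r(i) − r(j), together with the usual edge-length formulas. The function name, the edge-list output and the internal labels v0, v1, ... are illustrative choices, not part of the course material.

```python
def neighbor_joining(dist, labels):
    """Return a list of edges (u, v, length) for at least 3 labeled leaves.

    dist is a symmetric list-of-lists; labels[i] names row/column i.
    """
    d = {a: {b: dist[i][j] for j, b in enumerate(labels)}
         for i, a in enumerate(labels)}
    nodes = list(labels)
    edges = []
    count = 0
    while len(nodes) > 3:
        n = len(nodes)
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Pick the pair (i, j) that minimizes Q(i, j).
        pairs = ((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:])
        i, j = min(pairs,
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        v = f"v{count}"
        count += 1
        d_iv = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        edges.append((i, v, d_iv))
        edges.append((j, v, d[i][j] - d_iv))
        # Distances from the new vertex v to every remaining object k.
        d[v] = {v: 0.0}
        for k in nodes:
            if k not in (i, j):
                d[v][k] = d[k][v] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k not in (i, j)] + [v]
    # Base case: three objects joined at one internal vertex.
    a, b, c = nodes
    v = f"v{count}"
    edges.append((a, v, 0.5 * (d[a][b] + d[a][c] - d[b][c])))
    edges.append((b, v, 0.5 * (d[a][b] + d[b][c] - d[a][c])))
    edges.append((c, v, 0.5 * (d[a][c] + d[b][c] - d[a][b])))
    return edges
```

This version recomputes r and Q from scratch in every iteration, which matches the O(n^3) total of the straightforward implementation.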

23 Complexity of S&N Neighbor Joining Algorithm
Initialization: Θ(n^2) to compute r(i) and Q(i,j) for all i,j ∈ L.
Each iteration:
• O(n^2) to find the minimal Q(i,j).
• O(n) to compute {D(v,k): k ∈ L} for the new node v, and to update the matrix.
• O(n^2) to update the values Q(i,j).
Total of O(n^3).

24 Some remarks on S&N Neighbor Joining Algorithm
• Applicable to matrices which are not additive.
• Known to work well in practice.
• The algorithm and its variants are the most widely used distance-based algorithms today.
Next we present a more efficient neighbor joining algorithm, which is based on LCA distances.

25 Least Common Ancestor distances
Definition: Given a weighted tree T and a specific vertex r in it:
  d_T(r; i,j) = distance in T from r to path(i,j)
  d_T(r; i,i) = distance in T from r to i
[Figure: a weighted tree with root r and leaves A, B, C, D, E, showing edge weights and LCA distances; for example d_T(r; A,D) = 3 and d_T(r; A,A) = 7.]

26 Least Common Ancestor distances
The distances d_T(r; i,j) can be presented by a (symmetric) matrix:
       A   B   C   D   E
  A    7   0   0   3   5
  B        8   5   0   0
  C            7   0   0
  D                5   3
  E                    6
[Figure: the tree from the previous slide, with root r and leaves A, B, C, D, E.]
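
When the input is a leaf-to-leaf distance matrix rather than a tree, LCA distances can be computed directly from it. The sketch below uses the identity d_T(r; i,j) = ½[d(r,i) + d(r,j) − d(i,j)], which holds for distances realized by a tree but is not stated on the slides; the function name and index-based interface are illustrative assumptions.

```python
def lca_matrix(d, r):
    """Return the LCA matrix of distance matrix d with respect to object r.

    Entry (i, j) is (d[r][i] + d[r][j] - d[i][j]) / 2, taken over all
    objects other than r, in their original order.
    """
    others = [i for i in range(len(d)) if i != r]
    return [[0.5 * (d[r][i] + d[r][j] - d[i][j]) for j in others]
            for i in others]
```

On the diagonal the formula reduces to d(r,i), in line with the definition d_T(r; i,i).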

27 LCA Matrices
Definition: A symmetric nonnegative matrix L is an LCA matrix iff
1. For each i: L(i,i) = max_j {L(i,j)}
2. It satisfies the "3 points condition": for each 3 distinct indices i, j, k,
   L(i,j) ≥ min {L(i,k), L(j,k)}   ("the smallest value appears twice")
Example (off-diagonal entries): L(i,j) = 9, L(i,k) = L(j,k) = 6.
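
The two conditions of the definition translate into a direct check. This is a minimal sketch; the function name and tolerance are illustrative assumptions, and the 3 points condition is tested in its equivalent form "the two smallest of the three values are equal".

```python
from itertools import combinations


def is_lca_matrix(L, tol=1e-9):
    """Check symmetry, nonnegativity, diagonal maximality and the 3 points condition."""
    n = len(L)
    for i in range(n):
        for j in range(n):
            if L[i][j] < -tol or abs(L[i][j] - L[j][i]) > tol:
                return False
            if L[i][j] > L[i][i] + tol:  # the diagonal must be the row maximum
                return False
    for i, j, k in combinations(range(n), 3):
        a, b, _ = sorted([L[i][j], L[i][k], L[j][k]])
        if abs(a - b) > tol:  # the smallest value must appear twice
            return False
    return True
```

The check takes O(n^3) time because it visits every triple of indices.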

28 LCA Matrices
Theorem: The following conditions are equivalent for an (n-1)×(n-1) matrix L:
1. L is an LCA matrix.
2. There is a weighted tree T with n leaves and a leaf r in T such that for each pair of leaves i,j ≠ r: L(i,j) = d_T(r; i,j)

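29 LCA distances ⇒ LCA matrix
There is a weighted tree T s.t. L(i,j) = d_T(r; i,j) ⇒ L is an LCA matrix.
This follows from properties of least common ancestors in trees: any three leaves i, j, k can be labeled so that L(k,i) = L(j,i) ≤ L(k,j).
[Figure: a tree rooted at r in which j and k share a deeper common ancestor than either does with i.]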

30 LCA matrix ⇒ LCA distances
Now we are given an LCA matrix L and need to construct a tree.
The construction uses "maximal off-diagonal" entries: L(i,j) is a maximal off-diagonal entry in row i if L(i,j) = max_k {L(i,k): k ≠ i}.
Example: in the row 1 = (11, 9, 8, 3, 7), L(1,2) = 9 is the maximal off-diagonal entry.

31 Maximal off diagonal entries
Lemma: If L(i,j) is the maximal off-diagonal entry in both rows i and j of L, then for all k ≠ i,j: L(i,k) = L(j,k).
Proof: By the 3 points condition on {i,j,k}.
Example for i=1, j=2:
        i    j    k ...
  i    11    9    8  3  7
  j     9   14    8  3  7

32 LCA matrix ⇒ LCA distances: Proof by induction
We now prove by induction on n: L is an (n-1)×(n-1) LCA matrix ⇒ there is a weighted tree T with a root r as in the theorem.
Basis: n = 2. L = [w]. T is a tree with a single edge of weight w.
Example: L = [4] gives the tree with the single edge (r, i) of weight 4.

33 LCA matrix ⇒ LCA distances
n = 3: T is a tree with two leaves i, j in addition to r.
[Figure: an example 2×2 LCA matrix on i, j and the corresponding tree rooted at r.]

34 Induction step
Induction step: n ≥ 3. Let L be an LCA matrix of dimension n-1. We describe an algorithm for constructing the corresponding tree:
1. Find i,j s.t. L(i,j) is the maximal off-diagonal entry in L.
Example (here i=1 and j=2): L(i,i) = 11, L(i,j) = 9, L(j,j) = 14.

35 Induction step
2. Let L' be the matrix obtained by removing rows/columns i and j, and inserting row/column v s.t. L'(v,v) = L(i,j), and for k ≠ i,j, L'(v,k) = L(i,k) (= L(j,k)).
L:
        1    2    k ...
  1    11    9    8  3  7
  2     9   14    8  3  7
L':
        v    k ...
  v     9    8  3  7

36 Induction Step
To show that L' is an LCA matrix we need a definition and a simple observation:
Definition: Let L be an n×n matrix, and let S ⊆ {1,...,n}. L[S] is the submatrix of L consisting of the rows and columns with indices from S.
Observation 1: If L is an LCA matrix then for every S ⊆ {1,...,n}, L[S] is also an LCA matrix.

37 Induction step
Claim: L' is an LCA matrix of dimension n-2.
Proof: Let S be all leaves except j. Then L' is obtained from L[S] as follows:
1. change the index i to v
2. set L'(v,v) ← L(i,j)
By Observation 1 and the maximality of L(i,j), L' is also an LCA matrix.
[Figure: the matrices L and L' of the running example.]

38 Induction step
3. Construct a tree T' for L' (with n-1 leaves).
[Figure: the matrix L' and the tree T', in which v is a leaf.]

39 Induction step
4. Add to v two children, i and j, with appropriate edge lengths: L(i,i) − L(i,j) for i and L(j,j) − L(i,j) for j.
Example: with L(i,i) = 11, L(j,j) = 14 and L(i,j) = 9, the edge lengths are 2 and 5.

40 Deepest LCA neighbor joining
• If n ≤ 3, return the tree of n vertices.
• Prepare a list MAX of size n, s.t. MAX(i) = maximal off-diagonal element in row i.
Recursion:
• Find i,j s.t. L(i,j) is the maximal off-diagonal entry of L.
• Make the reduction to L' as described.
• Update the list MAX (only MAX(v) needs an update!).
• Construct T' for L'.
• Add i and j as children of v.
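
A minimal Python sketch of the reduction on exact (noise-free) LCA matrices. For brevity it rescans all pairs in every iteration instead of maintaining the MAX list, so it runs in O(n^3) rather than O(n^2), and it reduces all the way down to a single object instead of stopping at n ≤ 3. The function name, the edge-list output and the labels v0, v1, ... are illustrative assumptions.

```python
def dlca_tree(L, labels, root="r"):
    """Build a tree, as a list of edges (parent, child, length), from an LCA matrix.

    L is a symmetric list-of-lists over the non-root leaves named in labels.
    """
    M = {a: {b: L[i][j] for j, b in enumerate(labels)}
         for i, a in enumerate(labels)}
    nodes = list(labels)
    edges = []
    count = 0
    while len(nodes) > 1:
        # Pair with the maximal off-diagonal entry (deepest common ancestor).
        pairs = ((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:])
        i, j = max(pairs, key=lambda p: M[p[0]][p[1]])
        v = f"v{count}"
        count += 1
        edges.append((v, i, M[i][i] - M[i][j]))
        edges.append((v, j, M[j][j] - M[i][j]))
        # Reduction: v replaces i and j, inheriting row i.
        M[v] = {v: M[i][j]}
        for k in nodes:
            if k not in (i, j):
                M[v][k] = M[k][v] = M[i][k]
        nodes = [k for k in nodes if k not in (i, j)] + [v]
    last = nodes[0]
    edges.append((root, last, M[last][last]))
    return edges
```

On noisy input L(i,k) and L(j,k) generally differ, so the line that copies row i would have to be replaced by some reduction rule; this sketch does not address that case.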

41 Complexity Analysis
Initialization: constructing MAX takes O(n^2).
Let Time(n) be the complexity of the algorithm, given the input matrix L and the list MAX. Time(n) is given by:
• Reducing L to L': O(n)
• Updating MAX: O(n)
• Constructing T' from L': Time(n-1)
• Constructing T from T': O(1)
Time(n) ≤ Time(n-1) + O(n)
Hence Time(n) = O(n^2).
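
The last step can be made explicit by unrolling the recurrence. In LaTeX notation, with c an assumed constant bounding the O(n) work per level:

```latex
\mathrm{Time}(n) \;\le\; \mathrm{Time}(n-1) + c\,n
\;\le\; \mathrm{Time}(3) + c \sum_{m=4}^{n} m
\;\le\; \mathrm{Time}(3) + c\,\frac{n(n+1)}{2}
\;=\; O(n^2)
```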

42 Saitou & Nei vs. DLCA methods
• DLCA, like S&N, can be implemented on noisy data (in many ways).
• On exact data, the DLCA and S&N methods have the same (correct) output. They differ on noisy data (which occurs in practice).
• One basic difference: unlike the S&N method, the DLCA algorithm depends on selecting a root. Hence DLCA may produce many different trees on the same input.
Some of the projects will concentrate on this difference.

43 Incremental Reconstruction via Local Queries
Incrementally reconstructing the tree:
[Figure: a tree on taxa a to h with internal vertices numbered 1 to 6.]
• When inserting a new taxon x into a given topology T, we need to find the edge to which x should be attached.
• We are allowed to ask the 'oracle' local queries LQ(x,v) (x: taxon, v: internal vertex).

44 Local Queries - Motivation
• Asking LQ(x,v) is equivalent to asking for the topology of {x, a, b, c}, where v is the center-point of a, b, c in T.
• Such questions can be asked directly (using likelihood) or through a pairwise distance matrix (which will be discussed later).
[Figure: the tree on taxa a to h, and a smaller topology on a to f into which a new taxon is inserted.]

45 Balancing Vertices
• We'd like to minimize the number of queries required for inserting a new taxon.
  - Lower bound: log_3(|E_T|) (simple adversarial argument).
  - Upper bound: log_2(|E_T|).
• The algorithm which achieves the upper bound uses 'balancing vertices':
  - A balancing vertex in T is an internal vertex which splits T into 3 subtrees of size at most ceil(|T|/2).
  - Using balancing vertices in the local queries, the edge to which a new taxon should be attached can be found in about log_2(|E_T|) queries.

46 Balancing Vertices
• Every tree contains either a single balancing vertex or two adjacent balancing vertices.
• Finding a balancing vertex:
  - Start at some arbitrary vertex v.
    · If v is balancing, stop.
    · Otherwise, continue to the vertex u adjacent to v in the 'heaviest' subtree.
• The algorithm traverses each edge at most once ⇒ time complexity O(|T|).
[Figure: a tree T with 13 edges; the walk passes vertices whose heaviest subtrees have 11, 9 and 7 edges.]
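
The walk toward the heaviest subtree can be sketched in Python as follows. The sketch assumes that subtree size is measured in edges (including the edge that attaches the subtree to v), that |T| is the number of edges of T, and that the tree is given as an adjacency dictionary with at least one internal vertex; these conventions and the function name are assumptions made for illustration.

```python
def find_balancing_vertex(adj):
    """Return a vertex whose removal leaves subtrees of at most ceil(|E|/2) edges.

    adj maps each vertex of an unrooted tree to a list of its neighbours.
    """
    total = sum(len(nb) for nb in adj.values()) // 2  # number of edges
    limit = -(-total // 2)                            # ceil(total / 2)
    start = next(v for v in adj if len(adj[v]) > 1)   # any internal vertex

    # Root the tree at start (breadth-first) and count edges below each vertex.
    parent = {start: None}
    order = [start]
    for v in order:
        for u in adj[v]:
            if u not in parent:
                parent[u] = v
                order.append(u)
    below = {v: 0 for v in adj}
    for v in reversed(order):
        if parent[v] is not None:
            below[parent[v]] += below[v] + 1

    def side(v, u):
        """Edges in the subtree hanging off v through neighbour u."""
        if parent[u] == v:
            return below[u] + 1
        return total - below[v]  # u is the parent of v

    v = start
    while True:
        heavy = max(adj[v], key=lambda u: side(v, u))
        if side(v, heavy) <= limit:
            return v
        v = heavy
```

After the linear-time preprocessing each step of the walk costs time proportional to the degree of the current vertex, and the walk never turns back, so the total is O(|T|).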

47 A Simple and Efficient Algorithm
• Iteratively add taxa 1,2,…,n to the topology.
• When adding taxon x to topology T:
  - If T is trivial (consists of a single edge), attach x to that edge.
  - Otherwise:
    · Find a balancing vertex v of T.
    · Ask query LQ(x,v).
    · Continue recursively on T', the subtree corresponding to the answer of the query.
• Complexity:
  - Adding taxon 1≤x≤n to T takes O(log(x)) queries and O(x) time.
  - Total query complexity: O(n·log(n))
  - Total time complexity: O(n^2)

48 Interesting Issues
• Two major issues are raised in this area:
  - Queries do not always have reliable answers.
    · Use a confidence level for answers.
    · Verify the answers.
  - Reduce running time to O(n·log(n)).
    · Finding balancing vertices leads to high overhead.
    · Maybe we don't have to re-compute the balancing vertices in every stage.

49 Robustness to Noise in Data
Answering local queries using a distance matrix D:
  - We wish to assess the topology spanned by four taxa: x, a, b, c.
  - Observe the 4×4 submatrix of D over x, a, b, c.
  - If D is additive then there is a labeling of the taxa by i, j, k, l s.t.:
      D(i,j) + D(k,l) ≤ D(i,k) + D(j,l) = D(i,l) + D(j,k)
  - The configuration of the quartet is (ij ; kl), and the path separating the two pairs is of length ½[D(i,k) + D(j,l) − D(i,j) − D(k,l)].
  - If D is not additive we set the configuration of the quartet to (ij ; kl), where D(i,j) + D(k,l) is the minimal of the three sums.
  - Confidence of the prediction can be estimated by the difference between the maximal and minimal sums.
[Figure: the three possible quartet topologies on x, a, b, c.]
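
The quartet rule above can be sketched in Python as follows. The function name and the return format are illustrative assumptions; the confidence value is the difference between the maximal and minimal sums, as suggested on the slide.

```python
def quartet_topology(D, x, a, b, c):
    """Return (split, confidence) for the quartet {x, a, b, c}.

    split is a pair of pairs such as ((x, a), (b, c)), chosen so that the
    sum of the two within-pair distances is minimal. confidence is the
    largest of the three sums minus the smallest.
    """
    sums = {
        ((x, a), (b, c)): D[x][a] + D[b][c],
        ((x, b), (a, c)): D[x][b] + D[a][c],
        ((x, c), (a, b)): D[x][c] + D[a][b],
    }
    best = min(sums, key=sums.get)
    confidence = max(sums.values()) - sums[best]
    return best, confidence
```

For an additive D the confidence equals twice the length of the path separating the two pairs, so a value near zero signals a short internal edge or an unreliable answer.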

50 Robustness to Noise in Data
Answering local queries using a distance matrix D:
We can check several quartets of type x, a, b, c to answer a single local query.
Example: To answer LQ(g,1) we can check all quartets in {g} × {a} × {c,f} × {b,d,e}.
  - We can choose a representative set of quartets, and answer the local query according to a (weighted) majority.
  - If the answer is still inconclusive, we can choose to ask another local query.
[Figure: a topology on taxa a to f with internal vertices 1 to 4, and the new taxon g.]

51 Improving Running Time
Separator Trees:
  - A deterministic algorithm which inserts a new taxon x into a given topology T can be viewed as a rooted decision tree.
    · Each internal node represents a local query (an internal vertex in T).
    · Each internal node has three outgoing edges corresponding to the three possible answers to the query.
    · Each leaf corresponds to an edge in T.
  - A special case of decision trees are separator trees. The time complexity of the algorithm is the depth of the separator tree.
[Figure: a topology T on taxa a to m with internal vertices 1 to 6, and a separator tree S for T.]

52 Improving Running Time
Balanced Separator Trees:
  - A balanced separator tree uses balancing vertices (of the appropriate subtrees of T).
  - It can be constructed in O(n·log(n)) time.
  - Inserting a taxon does not drastically harm the balance.
  - If we allow some imbalance, we can guarantee that the costly balancing procedure is executed only a few times during construction of the whole topology.
  - Amortized analysis of total time complexity: O(n·log^2(n))
[Figure: the topology T on taxa a to m and a balanced separator tree S for T.]

53 Improving Running Time
Bottom-up approach (simple separator trees):
  - Start with the edge-set of T.
  - Choose disjoint edge triplets, s.t. each triplet contains at least one leaf.
  - Contract each triplet to a single edge.
  - Recursively continue on the reduced topology.
[Figure: the topology T on taxa a to m, its successive contractions, and the resulting simple separator tree S.]

54 Improving Running Time
Bottom-up approach (simple separator trees):
  - By a simple linear traversal of T you can find Θ(|T|) edge-triplets.
  - Topology size is reduced by a constant factor in each stage.
    ⇒ Depth of the simple separator tree is O(log(n)).
    ⇒ Time complexity is O(n).
  - Insertion of a taxon induces modifications propagating bottom-up through the layers of the separator tree.
[Figure: the layers of the simple separator tree S for T, with independent sets IS: {1,2,4,6}, IS: {3}, IS: {5}.]

55 Testing Reconstruction Methods on Noisy Data
We'd like to test reconstruction algorithms on actual phylogenetic data.
Problem: Confirmed phylogenetic trees are scarce and small.
Solution: Simulate the data.
1. Generate an edge-weighted tree T under some probabilistic model (Yule-Harding).
2. Choose a random DNA string for the root and simulate evolution on the tree to obtain sequences for all leaves (SeqGen), e.g. ATTCG…, ATACG…, ACTGG….
3. Obtain pairwise distances D from the sequences (DNAdist from PHYLIP).
4. Run the reconstruction algorithm on D to obtain a tree T'.
5. Compare the topologies of T and T'.

56 The Projects
Project I: The DLCA algorithm
Implement algorithms:
• Saitou & Nei's neighbor joining
• DLCA neighbor joining
  - mid-point reduction
  - maximal-value reduction
Simulate data:
• Use pre-generated trees to simulate the process of evolution (using the SeqGen program).
• For each tree generate several sequence-sets.
Experiments: Test the various algorithms on the generated data:
  - Use the DNADIST program (part of the PHYLIP package) to get a distance matrix corresponding to the sequence-set of the leaves.
  - Execute the algorithms on the distance matrix.
  - Check topological accuracy using the RF-score.
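
A minimal Python sketch of a topological comparison, assuming that the RF-score refers to the Robinson-Foulds distance: the number of nontrivial leaf bipartitions (splits) present in exactly one of the two trees. The function names and the representation of trees as lists of (u, v) edges over a shared leaf set are illustrative assumptions.

```python
def splits(edges, leaves):
    """Return the nontrivial leaf bipartitions induced by the edges of a tree.

    Each split is stored as the frozenset of leaves on the side that does
    not contain the smallest leaf label.
    """
    leaves = frozenset(leaves)
    ref = min(leaves)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    result = set()
    for u, v in edges:
        # Collect the leaves reachable from v without crossing the edge (u, v).
        stack, seen, side = [v], {u, v}, set()
        while stack:
            w = stack.pop()
            if w in leaves:
                side.add(w)
            for z in adj[w]:
                if z not in seen:
                    seen.add(z)
                    stack.append(z)
        side = frozenset(side)
        if ref in side:
            side = leaves - side
        if 1 < len(side) < len(leaves) - 1:  # both sides have at least 2 leaves
            result.add(side)
    return result


def rf_distance(edges1, edges2, leaves):
    """Number of splits that appear in exactly one of the two trees."""
    return len(splits(edges1, leaves) ^ splits(edges2, leaves))
```

This version recomputes each split with a fresh traversal, so it takes quadratic time in the number of edges; that is adequate for checking small test trees.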

57 The Projects
Project II: Fast Algorithms Using Local Queries
Implement algorithms: Implement advanced data structures which support the various algorithms:
• Algorithm using semi-balanced separator trees
• Algorithm using simple separator trees
Simulate data: Use pre-generated trees and/or a uniform random model.
Experiments:
• Test the various algorithms on the generated trees:
  - Use the generated trees to answer the local queries asked by the algorithms.
  - Compare the performance of the different algorithms on this data.

58 The Projects
Project III: Robust Algorithms Using Local Queries
Implement algorithms: Implement the O(n^2) algorithm using O(n·log(n)) queries.
Simulate data: Use pre-generated trees and distance matrices.
Experiments:
• Test various approaches on the generated data:
  - Use the distance matrices to answer the local queries asked by the algorithms.
  - Suggest some method of estimating the confidence level of an answer to a query.
  - Check for errors in the reconstructed topology.
• Compare several approaches.

59 Grading Scheme
• 10% - work plan
• 60% - final report + submitted code. Rough distribution of grade:
  - 40% - meeting project requirements
  - 10% - code organization and documentation
  - 10% - innovation and creativity
• 30% - final presentation

60 Schedule
21/3 – Introductory meeting
28/3 – Deadline for choosing a project
26-30/3 – Individual 30-minute meetings with each team to discuss the specification of the project
23-27/4 – Individual 60-minute meetings with each team to discuss the work plan and design of the project
2/5 – Deadline for submitting work plan
21-25/5 – Individual progress meetings
18-22/6 – Concluding 60-minute meetings with each team
27/6, 4/7 – Project presentations and submission of final draft
Final submission deadline – To be announced

61 Homework
• Team up in pairs.
• Choose a project.
• Send me an e-mail containing:
  - The names, id numbers, and e-mails of all students in the group
  - Preferred project + 2nd-priority project
  - Two optional dates for the first project meeting (next week)
• Go over the references of your chosen project.
Good Luck!
