Published by Jerome Chambers. Modified over 9 years ago.
1
236503 Project in Programming
A Comparative Study of Evolutionary Tree Reconstruction: Existing Algorithms vs. Integer Programming
Spring 2013
Lecturer: Shlomo Moran
External advisor: Yossi Shiloach
Website: http://webcourse.cs.technion.ac.il/236503/
2
Evolution
Evolution of new organisms is driven by:
• Diversity
  - Different individuals carry different variants of the same basic blueprint.
• Mutations
  - The DNA sequence can change through single-base changes, deletion or insertion of DNA segments, etc.
• Selection bias
3
The Phylogenetic Reconstruction Problem
MPI, June 2012
4
Evolution is modeled by a tree. Species are represented by their DNA sequences, which are strings over {A,G,C,T}.
[Figure: a tree whose nodes are labeled with 7-letter DNA sequences such as AATCCTG, ATAGCTG and AATGGGC.]
5
Phylogenetic Reconstruction
[Figure: the leaf sequences AATCCTG, ATAGCTG, AATGGGC, GAACGTA, AAACCGA, ACCGTTG, TCTGGGA, TCCGGAA, AGCCGTG, GGGGATT.]
6
Phylogenetic Reconstruction
Leaf sequences:
A : AATGGGC
B : AATCCTG
C : ATAGCTG
D : GAACGTA
E : AAACCGA
F : GGGGATT
G : TCTGGGA
H : TCCGGAA
I : AGCCGTG
J : ACCGTTG
Goal: reconstruct the 'true' tree as accurately as possible.
Distance methods: use "evolutionary distances" between sequences.
[Figure: the rooted tree on leaves A-J, and the tree reconstructed from the leaf sequences.]
7
Reconstructing a weighted tree from exact interleaf distances
Exact (additive) distances between leaves → reconstruction algorithm (linear time) → reconstructed tree
[Figure: an unknown edge-weighted tree on leaves A-G, and the reconstructed tree with the same topology and edge weights.]
8
Formal statement of the problem for exact distances
Input: an n×n distance matrix D = (d(i,j)) such that:
• d(i,i) = 0, and d(i,j) > 0 for i ≠ j;
• d(i,j) = d(j,i);
• d(i,k) ≤ d(i,j) + d(j,k) for all i, j, k.
Output: if the distances can be realized by a weighted tree (i.e., the distances are additive), return that tree; otherwise return nothing.
9
Distance-based reconstruction methods (since the 1960s)
[Figure: an edge-weighted tree.]
10
Solution for 3 objects
For n = 3, every distance metric can be realized by a (unique) tree with one internal node v, joined to the leaves i, j, k by edges of lengths a, b, c:

      i     j     k
i     0    a+b   a+c
j           0    b+c
k                 0

A distance metric on 4 objects may not be realizable by any tree.
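The three-leaf case can be solved in closed form by inverting the matrix above. The sketch below is illustrative only; the function name `three_leaf_tree` and the sample distances are assumptions, not part of the slides.

```python
def three_leaf_tree(d_ij, d_ik, d_jk):
    """Return the edge lengths (a, b, c) from the internal node v to i, j, k.

    Solves d(i,j) = a+b, d(i,k) = a+c, d(j,k) = b+c for a, b, c.
    """
    a = (d_ij + d_ik - d_jk) / 2  # length of edge (v, i)
    b = (d_ij + d_jk - d_ik) / 2  # length of edge (v, j)
    c = (d_ik + d_jk - d_ij) / 2  # length of edge (v, k)
    return a, b, c


# Hypothetical distances: d(i,j) = 5, d(i,k) = 7, d(j,k) = 6.
a, b, c = three_leaf_tree(5.0, 7.0, 6.0)
```

The triangle inequality guarantees that a, b and c are non-negative, which is why every metric on three objects has such a tree.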
11
The Four Points Condition
Definition: A distance metric on n objects satisfies the four points condition iff every subset of four objects can be labeled i, j, k, l so that:
d(i,k) + d(j,l) = d(i,l) + d(k,j) ≥ d(i,j) + d(k,l)
[Figure: a tree on the four leaves i, j, k, l.]
Theorem: A distance metric is additive iff it satisfies the four points condition.
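A direct way to test the condition is to check, for every four objects, that the two largest of the three pairwise sums are equal. The sketch below assumes the matrix is given as a list of lists and uses a hypothetical tolerance parameter `tol` for floating-point input; it runs in O(n⁴) time and is meant only as an illustration.

```python
from itertools import combinations


def satisfies_four_points(d, tol=1e-9):
    """Return True if the symmetric matrix d satisfies the four points condition."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        # The three ways to split {i, j, k, l} into two pairs.
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        # The two largest sums must coincide.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Additive example: leaves 0,1 and 2,3 form two cherries joined by an
# internal edge of length 2; all pendant edges have length 1.
tree_metric = [[0, 2, 4, 4],
               [2, 0, 4, 4],
               [4, 4, 0, 2],
               [4, 4, 2, 0]]

# Non-additive example: shortest-path metric of a 4-cycle.
cycle_metric = [[0, 1, 2, 1],
                [1, 0, 1, 2],
                [2, 1, 0, 1],
                [1, 2, 1, 0]]
```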
12
Neighbor Joining
Let i, j be neighboring leaves in a tree, let v be their parent, and let k be any other leaf. Then
d(k,v) = ½ [d(i,k) + d(j,k) − d(i,j)],
so the distances from v to all other leaves can be computed from the leaf distances.
[Figure: leaves i and j attached to their parent v, and another leaf k.]
13
Reconstructing trees by Neighbor Joining algorithms
This suggests the following method for constructing a tree from a distance matrix:
1. Find neighboring leaves i, j in the tree.
2. Replace i, j by their parent v and recursively construct a tree T for the smaller set.
3. Add i, j as children of v in T.
14
Neighbor Finding
How can we find a pair of neighboring leaves (also called a cherry) from distances alone?
The closest pair of vertices is not necessarily a pair of neighboring leaves.
[Figure: a tree on leaves A, B, C, D.]
15
Neighbor Finding: the Saitou & Nei method
Theorem (Saitou & Nei): Assume all internal edge weights are positive. If Q(i,j) is minimal among all pairs of leaves, then i and j are neighboring leaves in the tree.
Definitions (in their standard form): r(i) = Σ_k d(i,k), and Q(i,j) = (n−2)·d(i,j) − r(i) − r(j).
16
S&N Neighbor Joining Algorithm
• If n = 3, return the tree on three vertices.
• Compute Q(i,j) for all i, j.
• Choose i, j such that Q(i,j) is minimal.
• Create a new vertex v, and for every other object k set d(k,v) = ½ [d(i,k) + d(j,k) − d(i,j)].
• Remove i, j, and add v to the set of objects.
• Recursively construct a tree on the smaller set, then add i, j as children of v, at distances d(i,v) and d(j,v).
[Figure: leaves i and j joined to the new vertex v, and another object k.]
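The steps above can be sketched as a short iterative routine. This is an illustrative sketch and not PHYLIP's implementation. It assumes the standard Saitou & Nei quantities r(i) = Σ_k d(i,k) and Q(i,j) = (n−2)·d(i,j) − r(i) − r(j), and the usual branch-length formula d(i,v) = ½·d(i,j) + (r(i) − r(j)) / (2(n−2)). The input format (a dict of dicts) and the internal node names `v0`, `v1`, ... are hypothetical choices, and at least three leaves are assumed.

```python
def neighbor_joining(names, d):
    """Return the edges (node, node, length) of an unrooted tree.

    names: list of at least three leaf labels.
    d: dict of dicts with d[a][b] the distance between labels a != b.
    """
    d = {a: dict(row) for a, row in d.items()}  # work on a copy
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 3:
        n = len(nodes)
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Choose the pair (i, j) minimizing Q(i,j) = (n-2) d(i,j) - r(i) - r(j).
        i, j = min(
            ((a, b) for idx, a in enumerate(nodes) for b in nodes[idx + 1:]),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        v = f"v{next_id}"  # hypothetical name for the new internal vertex
        next_id += 1
        d_iv = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        d_jv = d[i][j] - d_iv
        edges += [(v, i, d_iv), (v, j, d_jv)]
        # Distances from the new vertex to every remaining object.
        d[v] = {}
        for k in nodes:
            if k not in (i, j):
                d[v][k] = d[k][v] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k not in (i, j)] + [v]
    # Base case: three objects joined at one internal vertex.
    i, j, k = nodes
    c = f"v{next_id}"
    edges += [
        (c, i, (d[i][j] + d[i][k] - d[j][k]) / 2),
        (c, j, (d[i][j] + d[j][k] - d[i][k]) / 2),
        (c, k, (d[i][k] + d[j][k] - d[i][j]) / 2),
    ]
    return edges


# Additive distances of a quartet tree: cherries (A,B) and (C,D), pendant
# edges 1, 2, 3, 4 and an internal edge of length 5.
names = ["A", "B", "C", "D"]
pairs = {("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
         ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7}
dist = {a: {} for a in names}
for (a, b), w in pairs.items():
    dist[a][b] = dist[b][a] = w
edges = neighbor_joining(names, dist)
```

On exact additive input such as this quartet, the routine recovers the original edge lengths. Recomputing r and Q from scratch in every round keeps the code short at the cost of the cubic running time.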
17
Complexity of the S&N Neighbor Joining Algorithm
Initialization: Θ(n²) to compute r(i) and Q(i,j) for all i, j ∈ L.
Each iteration:
• O(n²) to find the minimal Q(i,j).
• O(n) to compute {D(v,k) : k ∈ L} for the new node v, and to update the matrix.
• O(n²) to update the values Q(i,j).
Total: O(n³), since there are O(n) iterations.
18
NEEDED: Additive Distances Between DNA Sequences
19
Additive evolutionary distance: the number of substitutions that occurred during the evolution of the sequence.
[Figure: a short sequence evolving from ACAC to CCCC, with the number of substitutions at each site.]
Some substitutions are hidden because later substitutions overwrite them. Therefore, the exact number of substitutions is usually larger than the number of observed changes.
20
Edge weight = expected number of substitutions per site
[Figure: an edge (u, v) with u = AACA…GTCTTCGAGGCCC and v = AGCA…GCCTATGCGACCT; the per-site substitution counts 0100…0200110121001 give 0.321 substitutions per site.]
21
When the exact number of substitutions between any two sequences is known, any algorithm that reconstructs trees from exact distances returns the correct evolutionary tree.
Interleaf distances: sum of edge weights along the path between the two leaves.
[Figure: leaves u and v in a tree with edge weights 0.5, 0.42 and 0.3 shown; d(u,v) = 1.12.]
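As an illustration of "interleaf distance = sum of edge weights", the sketch below computes the full leaf-to-leaf distance matrix of a weighted undirected tree. The edge-list input format, the function name and the example tree are assumptions made for this example.

```python
from collections import defaultdict


def interleaf_distances(edges):
    """Return dist[a][b] for all pairs of leaves of a weighted tree.

    edges: list of (node, node, weight) triples describing an undirected tree.
    """
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    leaves = sorted(x for x in adj if len(adj[x]) == 1)
    dist = {}
    for src in leaves:
        # Iterative DFS: in a tree every node is reached by exactly one path.
        seen = {src: 0.0}
        stack = [src]
        while stack:
            x = stack.pop()
            for y, w in adj[x]:
                if y not in seen:
                    seen[y] = seen[x] + w
                    stack.append(y)
        dist[src] = {t: seen[t] for t in leaves}
    return dist


# Hypothetical quartet tree: cherries (A,B) and (C,D) joined by an edge of length 5.
tree = [("x", "A", 1), ("x", "B", 2), ("y", "C", 3), ("y", "D", 4), ("x", "y", 5)]
D = interleaf_distances(tree)
```

Each traversal costs O(n), so the whole matrix takes O(n²) time for a tree with n leaves.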
22
The expected number of substitutions is estimated from the observed number of substitutions.
What we see is only the observed number of substitutions between pairs of leaf sequences.
23
The estimation is based on a substitution model
The simplest model: the Jukes-Cantor model.
On each tree edge e, each letter mutates to any other letter at the same rate r_e.
The length of an edge is the expected number of mutations per site, i.e. t = 3r.

      A   G   C   T
A     -   r   r   r
G     r   -   r   r
C     r   r   -   r
T     r   r   r   -
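Under the Jukes-Cantor model, the textbook correction from the observed fraction p of differing sites to the expected number of substitutions per site is d = −¾ ln(1 − 4p/3). This formula is standard but is not spelled out on the slide, so the sketch below should be read as an assumption about what the correction step looks like; the function name is hypothetical.

```python
import math


def jc_distance(seq_u, seq_v):
    """Jukes-Cantor estimate of substitutions per site between aligned sequences."""
    if len(seq_u) != len(seq_v) or not seq_u:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    # Observed fraction of sites at which the two sequences differ.
    p = sum(a != b for a, b in zip(seq_u, seq_v)) / len(seq_u)
    if p >= 0.75:
        # Saturation: the correction formula is undefined here.
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


d_same = jc_distance("ACGT", "ACGT")   # no observed change
d_quarter = jc_distance("AAAA", "AAAC")  # one of four sites differs, p = 0.25
```

The corrected distance is always at least p, which reflects the hidden substitutions that the observed count misses.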
24
Standard distance in the K2P model: Δ_total = expected number of substitutions
25
P_uv can be estimated from the observed substitutions between u and v.
u: AACA…GTCTTCGAGGCCC
v: AGCA…GCCTATGCGACCT
26
P_uv can be estimated from the alignment of the sequences at u and v.
u: AACA…GTCTTCGAGGCCC
v: AGCA…GCCTATGCGACCT
27
The expected number of substitutions is estimated from the observed changes by a correction formula.
u: AACA…GTCTTCGAGGCCC
v: AGCA…GCCTATGCGACCT
28
K2P distance estimation process:
u: AACA…GTCTTCGAGGCCC
v: AGCA…GCCTATGCGACCT
Sequences at u and v → (from observed substitutions) an estimate of P_uv → (computation by the assumed model) estimated distance
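As a hedged sketch of this two-step process, the code below uses the textbook Kimura two-parameter formula d = −½ ln(1 − 2P − Q) − ¼ ln(1 − 2Q), where P and Q are the observed fractions of transitions (A↔G, C↔T) and transversions. The names P and Q and the function name are assumptions; the slides may use different notation.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(seq_u, seq_v):
    """Kimura two-parameter estimate of substitutions per site."""
    if len(seq_u) != len(seq_v) or not seq_u:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    transitions = transversions = 0
    # Step 1: count the observed substitutions, split by type.
    for a, b in zip(seq_u, seq_v):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1      # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1    # purine <-> pyrimidine
    P = transitions / len(seq_u)
    Q = transversions / len(seq_u)
    # Step 2: apply the model's correction formula.
    x, y = 1 - 2 * P - Q, 1 - 2 * Q
    if x <= 0 or y <= 0:
        return math.inf  # saturated: the formula is undefined
    return -0.5 * math.log(x) - 0.25 * math.log(y)


# Hypothetical 10-site alignment with a single transition (A -> G).
d_one_transition = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
```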
29
Reconstruction from estimated distances
Exact (additive) distances between species → distance estimation (assuming a DNA substitution model) → estimated distances → reconstruction → reconstructed tree
Challenge: minimize reconstruction errors.
[Figure: the edge-weighted 'true' tree and the reconstructed tree, both on leaves A-G.]
30
Correct and incorrect reconstruction of edges
Each (internal) edge defines a split of the leaves:
• The edge {ABC | DEFG} is correctly reconstructed (present in both T and T').
• The edge {ABCD | EFG} is a false negative (present in T but missing from T').
• The edge {AC | BDEFG} is a false positive (present in T' but not in T).
[Figure: the edge-weighted 'true' tree T and the reconstructed tree T', both on leaves A-G.]
31
Robinson Foulds Distance
Robinson Foulds distance = (false positives + false negatives) / (total number of internal edges)
[Figure: the edge-weighted 'true' tree T and the reconstructed tree T', both on leaves A-G.]
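Once each tree is represented by the set of splits induced by its internal edges, the counts in the formula reduce to set differences. The sketch below is illustrative: it assumes that "total number of internal edges" means the internal edges of both trees together, and the toy input lists only the three splits named in the example, not the full trees.

```python
def split(side, leaves):
    """Canonical, orientation-free representation of a split of the leaf set."""
    side = frozenset(side)
    return frozenset({side, frozenset(leaves) - side})


def rf_counts(true_splits, recon_splits):
    """Return (false positives, false negatives) of the reconstructed tree."""
    false_neg = len(true_splits - recon_splits)  # in T but missing from T'
    false_pos = len(recon_splits - true_splits)  # in T' but not in T
    return false_pos, false_neg


def normalized_rf(true_splits, recon_splits):
    """(FP + FN) divided by the number of internal edges of both trees (assumption)."""
    fp, fn = rf_counts(true_splits, recon_splits)
    total = len(true_splits) + len(recon_splits)
    return (fp + fn) / total if total else 0.0


leaves = "ABCDEFG"
T_splits = {split("ABC", leaves), split("ABCD", leaves)}
T_prime_splits = {split("ABC", leaves), split("AC", leaves)}
fp, fn = rf_counts(T_splits, T_prime_splits)
rf = normalized_rf(T_splits, T_prime_splits)
```

Storing a split as an unordered pair of leaf sets makes {ABC | DEFG} and {DEFG | ABC} compare equal, so the orientation in which an edge is read does not matter.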
32
Formal statement of the problem for estimated distances
Input: an n×n distance matrix whose entries are estimates of tree (additive) distances.
Output: a tree with a small Robinson Foulds distance from the true tree.
33
Project's Goal
• Practice the current algorithm (NJ) for phylogenetic reconstruction by distance methods.
• Simulate the evolution of DNA sequences, and generate evolutionary distances.
• Study a new method for tree reconstruction, based on mixed integer programming with CPLEX.
• Compare the accuracy of this new method with that of Neighbor Joining.
You should use the PHYLIP phylogenetic package for most of the required tasks: http://evolution.genetics.washington.edu/phylip.html
34
Opening Phase (3-4 weeks)
1. Write programs that compute interleaf distance matrices of weighted undirected trees.
2. Use the Neighbor Joining algorithm of PHYLIP to construct trees from (exact or noisy) distance matrices.
3. Use the Treedist program of PHYLIP to compute Robinson Foulds distances between trees.
4. Check the accuracy of the reconstructed trees when the matrices are noisy.
5. Repeat the above with mixed integer programming.
35
Time Line
36
Main Phase (12-13 weeks)
1. Write a program that computes trees from distance matrices using CPLEX.
2. Simulate the evolution of DNA sequences on weighted trees.
3. Compute distances between DNA sequences using PHYLIP.
4. Compare the accuracy of Neighbor Joining to that of your CPLEX algorithm in reconstructing trees from distance matrices.
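For the simulation step, one possible building block is to evolve a sequence along a single edge. The sketch below assumes the Jukes-Cantor model, under which a site differs at the end of an edge of length t (expected substitutions per site) with probability ¾(1 − e^(−4t/3)), and is then equally likely to be any of the three other letters. The function name and the use of an explicit `random.Random` generator are choices made for this example; a whole tree would be simulated by applying the function edge by edge from the root.

```python
import math
import random


def evolve_jc(seq, t, rng):
    """Evolve a DNA sequence along one edge of length t under Jukes-Cantor.

    rng: a random.Random instance, passed in so that runs are reproducible.
    """
    # Probability that a site shows a different letter at the end of the edge.
    p_change = 0.75 * (1 - math.exp(-4 * t / 3))
    out = []
    for base in seq:
        if rng.random() < p_change:
            # Net change: pick one of the three other letters uniformly.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(2013)  # fixed seed, chosen arbitrarily
parent = "ACGT" * 5
child = evolve_jc(parent, 0.3, rng)
unchanged = evolve_jc(parent, 0.0, rng)
```

Sampling the net change per site, rather than every individual substitution event, is enough when only the sequences at the tree nodes are needed.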
37
Grading Scheme
• 10% - work plan
• 60% - final report + submitted code. Rough distribution of this part of the grade:
  - 40% - meeting project requirements
  - 10% - code organization and documentation
  - 10% - innovation and creativity
• 30% - final presentation
38
Schedule
• 3/4: introductory meeting
• 24-30/4: meeting for concluding the 1st phase and opening the main phase
• 23-27/4: individual 60-minute meeting with each team to discuss the work plan and the design of the project
• 10-17/7: submission of final report
• Final submission deadline: to be announced
Other meetings during the semester will be scheduled online as the need arises.
Good luck!