Advanced Programming Project 512236: Optimal Distance Functions for Reconstructing Evolutionary Trees. Spring Semester 2010.


Advanced Programming Project: Optimal Distance Functions for Reconstructing Evolutionary Trees. Spring semester. Staff (e-mail, room, phone): Shlomo Moran, Daniel Dor

Project topic: the effect of distance functions on the reconstruction of evolutionary trees. After the announcements, today we will give a short crash course on: 1. Evolutionary trees: definitions and DNA-based models. 2. Distance-based methods for building evolutionary trees. 3. Distance functions for evolutionary models. 4. Estimating distances between species from the differences between their DNA sequences. After the crash course you will still need to fill in some gaps later in the semester. The projects will be presented during the crash course.

Administration. Prerequisites: Algorithms 1, Probability. Recommended (but not required): Algorithms in Computational Biology. As a rule, projects are done in pairs. Let us know how you have paired up within a week (by e-mail). Choosing a project: there will be two main directions. The initial phase is similar in both directions; the focus on a specific direction comes later (within about a month). (From here on, the slides are in English.)

4 Crash course on evolutionary distances

5 The Phylogenetic Reconstruction Problem

6 Evolution is modeled by DNA sequences which evolve along an evolution tree (phylogeny). All our sequences are DNA sequences over the alphabet {A, G, C, T}. [Figure: a tree in which every node, internal or leaf, carries a DNA sequence such as AATCCTG, ATAGCTG, AATGGGC.]

7 Phylogenetic Reconstruction. [Figure: only the leaf sequences are given: AATCCTG, ATAGCTG, AATGGGC, GAACGTA, AAACCGA, ACCGTTG, TCTGGGA, TCCGGAA, AGCCGTG, GGGGATT.]

8 Phylogenetic Reconstruction. Input sequences: A: AATGGGC, B: AATCCTG, C: ATAGCTG, D: GAACGTA, E: AAACCGA, F: GGGGATT, G: TCTGGGA, H: TCCGGAA, I: AGCCGTG, J: ACCGTTG. Goal: reconstruct the 'true' tree as accurately as possible. [Figure: the true rooted tree and the reconstructed tree, both over the leaves A to J.]

9 Three Methods of Tree Construction. Parsimony: a tree with the minimum number of mutations. Maximum likelihood: finding the "most probable" tree. Distance: a weighted tree that realizes the distances between the species.

10 Distance-Based Reconstruction: exact vs. approximate distances. From the exact distances of the edge-weighted 'true' tree (leaves A to G), the tree is reconstructed in O(n^2) time. In practice the distances are perturbed by noise, and the reconstructed tree may differ from the true one. Major problem: sensitivity to noise.

11 The Algorithmic Aspect. Many algorithms can reconstruct an edge-weighted tree from the exact distances (reconstruction in O(n^2)). In this project we will use the Saitou & Nei Neighbor Joining algorithm, or simply the "NJ algorithm".
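
As a rough illustration of the NJ idea, here is a minimal Python sketch. It is not the implementation the course provides: the function name `neighbor_joining`, the `frozenset`-keyed distance dictionary, and the internal node names `int0`, `int1`, ... are all assumptions made for this example, and the straightforward version below runs in cubic time.

```python
from itertools import combinations


def neighbor_joining(labels, dist):
    """Build an unrooted tree from pairwise distances with neighbor joining.

    labels: list of leaf names.
    dist:   dict mapping frozenset({x, y}) -> distance between x and y.
    Returns a list of (node, node, edge_length) triples.
    """
    d = dict(dist)          # working copy; internal nodes are added to it
    nodes = list(labels)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r[x] = sum of distances from x to every other active node.
        r = {x: sum(d[frozenset((x, y))] for y in nodes if y != x) for x in nodes}
        # Join the pair that minimises (n - 2) * d(x, y) - r[x] - r[y].
        x, y = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        dxy = d[frozenset((x, y))]
        new = f"int{next_id}"   # hypothetical name for the new internal node
        next_id += 1
        len_x = 0.5 * dxy + (r[x] - r[y]) / (2 * (n - 2))
        edges.append((x, new, len_x))
        edges.append((y, new, dxy - len_x))
        # Distances from the new internal node to the remaining nodes.
        for z in nodes:
            if z != x and z != y:
                d[frozenset((new, z))] = 0.5 * (
                    d[frozenset((x, z))] + d[frozenset((y, z))] - dxy
                )
        nodes = [z for z in nodes if z != x and z != y] + [new]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Exact distances of a quartet with leaf edges 1, 2, 4, 5 and internal edge 3
# (A and C on one side of the internal edge, B and D on the other).
example = {
    frozenset("AC"): 3, frozenset("AB"): 8, frozenset("AD"): 9,
    frozenset("CB"): 9, frozenset("CD"): 10, frozenset("BD"): 9,
}
for edge in neighbor_joining(["A", "B", "C", "D"], example):
    print(edge)
```

On exact (additive) distances such as the example, NJ recovers the tree and its edge weights; the interesting question in this project is how it behaves when the distances are only estimates.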

12 The Distance Estimation Aspect. Evolutionary distances: How are they defined? How are they extracted from the DNA sequences? We will show this on a specific model, the Kimura 2-Parameter (K2P) model.

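13 The Kimura 2-Parameter (K2P) model [Kimura80]: each edge (u, v) corresponds to a "rate matrix". K2P generic rate matrix, with rows and columns ordered A, G, C, T, transitions (A↔G, C↔T) at rate α and transversions at rate β: $R = \begin{pmatrix} -(\alpha+2\beta) & \alpha & \beta & \beta \\ \alpha & -(\alpha+2\beta) & \beta & \beta \\ \beta & \beta & -(\alpha+2\beta) & \alpha \\ \beta & \beta & \alpha & -(\alpha+2\beta) \end{pmatrix}$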

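14 K2P standard distance: $\Delta_{total}$ = total substitution rate. The total substitution rate of a K2P rate matrix R is $\Delta_{total}(R) = \alpha + 2\beta$. This is the expected number of mutations per site. It is an additive distance: on a path u, v, w whose edges have rates $\alpha + 2\beta$ and $\alpha' + 2\beta'$, the rate from u to w is $(\alpha + \alpha') + 2(\beta + \beta')$.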

15 The distance $\Delta_{total}(R_{uv}) = d_{K2P}(u,v)$ is estimated from the aligned sequences u = AACA…GTCTTCGAGGCCC and v = AGCA…GCCTATGCGACCT by the K2P total-rate "distance correction" procedure. Since mutations may overwrite each other, this is a "noisy" process.
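
The correction procedure itself is not spelled out on the slide. A minimal sketch, assuming the standard K2P correction d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the observed fractions of sites showing a transition and a transversion; the function names are chosen for this example only.

```python
import math

PURINES = {"A", "G"}   # C and T are the pyrimidines


def transition_transversion_fractions(u, v):
    """Fractions of aligned sites that differ by a transition / a transversion."""
    if len(u) != len(v):
        raise ValueError("sequences must be aligned to equal length")
    ts = tv = 0
    for a, b in zip(u, v):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            ts += 1    # A<->G or C<->T
        else:
            tv += 1    # purine <-> pyrimidine
    return ts / len(u), tv / len(u)


def k2p_distance(u, v):
    """Estimated number of substitutions per site under K2P.

    math.log raises ValueError when the observed fractions are too large
    for the correction to be defined (site saturation).
    """
    p, q = transition_transversion_fractions(u, v)
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


# One transition (A->G) and one transversion (C->A) in ten sites.
print(k2p_distance("ACGTACGTAC", "GCGTACGTAA"))
```

The corrected value is larger than the raw fraction of differing sites (0.2 here), because it accounts for substitutions that were overwritten.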

A basic question: how good is a reconstruction method which uses K2P distances? The performance of tree reconstruction methods is often tested on quartets, which are trees with 4 taxa. A quartet contains a single internal edge (labeled w_sep in the figure), which defines the quartet-split.

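17 A correct reconstruction of the quartet requires finding the true quartet-split. There are 3 possible splits of the taxa A, B, C, D: AB|CD, AC|BD and AD|BC. Distance methods reconstruct the true split by the 4-point condition: if the true split is xy|zw, then $d(x,y) + d(z,w) < d(x,z) + d(y,w) = d(x,w) + d(y,z)$, and the two larger sums exceed the smaller one by twice the weight w_sep of the internal edge. The 4-point condition for noisy distances is: $\hat d(x,y) + \hat d(z,w) < \min\{\hat d(x,z) + \hat d(y,w),\ \hat d(x,w) + \hat d(y,z)\}$.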
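
A minimal sketch of resolving a quartet with the noisy 4-point condition: pick the split whose two within-pair distances have the smallest sum. The function name and the dictionary layout are assumptions for this example.

```python
def resolve_quartet(d):
    """Return the split of {A, B, C, D} with the smallest sum of within-pair distances.

    d maps frozenset({x, y}) -> (possibly noisy) distance between taxa x and y.
    """
    splits = [
        (("A", "B"), ("C", "D")),
        (("A", "C"), ("B", "D")),
        (("A", "D"), ("B", "C")),
    ]
    return min(splits, key=lambda s: d[frozenset(s[0])] + d[frozenset(s[1])])


# Exact distances of a quartet with leaf edges A:1, C:2, B:4, D:5 and
# internal edge 3 separating {A, C} from {B, D}.
exact = {
    frozenset("AC"): 3, frozenset("AB"): 8, frozenset("AD"): 9,
    frozenset("CB"): 9, frozenset("CD"): 10, frozenset("BD"): 9,
}
print(resolve_quartet(exact))
```

With these exact distances the three sums are 12, 18 and 18: the two wrong splits are equal and exceed the true one by twice the internal edge weight, so moderate noise leaves the choice unchanged.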

18 We evaluate the accuracy of the K2P distance estimation by the Split Resolution Test. [Figure: a rooted model quartet with leaves A, B, C, D; t is "evolutionary time".] The diameter of the quartet is 22t.

19 Phase A: simulate evolution along the model quartet, producing sequences at the leaves A, B, C, D.

20 Phase B: reconstruct the split by the 4p condition. Compute the distances between the sequences at A, B, C, D and apply the 4p condition. Was the correct split found? Repeat this process 10,000 times and count the number of failures.

21 The split resolution test was applied to the model quartet with various diameters. For each diameter, mark the fraction (percentage) of the simulations in which the 4p condition failed (next slide).
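
Phases A and B can be sketched as below. This is an illustration only: the quartet used here is hypothetical (true split AB|CD, four equal leaf edges of duration `t_leaf`, an internal edge of duration `t_mid`) and is not the template quartet of the slides; all function and parameter names are assumptions. The K2P substitution probabilities for an edge of duration t are taken from the standard closed form.

```python
import math
import random

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def evolve(seq, alpha, beta, t, rng):
    """Evolve seq for time t under K2P (transition rate alpha, each transversion rate beta)."""
    lam = math.exp(-4 * beta * t)
    mu = math.exp(-2 * (alpha + beta) * t)
    p_ts = 0.25 + 0.25 * lam - 0.5 * mu   # probability of a transition
    p_tv = 0.5 - 0.5 * lam                # probability of either transversion
    out = []
    for x in seq:
        r = rng.random()
        if r < p_ts:
            out.append(TRANSITION[x])
        elif r < p_ts + p_tv:
            out.append(rng.choice(TRANSVERSIONS[x]))
        else:
            out.append(x)
    return "".join(out)


def k2p_distance(u, v):
    """Corrected K2P distance; math.inf when the pair looks saturated."""
    n = len(u)
    p = sum(TRANSITION[a] == b for a, b in zip(u, v)) / n
    q = sum(a != b and TRANSITION[a] != b for a, b in zip(u, v)) / n
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        return math.inf
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


def split_failure_rate(alpha, beta, t_leaf, t_mid, length, trials, seed=0):
    """Fraction of simulations in which the 4p condition misses the split AB|CD."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        # Phase A: simulate evolution from one internal node of the quartet.
        u = "".join(rng.choice("ACGT") for _ in range(length))
        v = evolve(u, alpha, beta, t_mid, rng)
        a = evolve(u, alpha, beta, t_leaf, rng)
        b = evolve(u, alpha, beta, t_leaf, rng)
        c = evolve(v, alpha, beta, t_leaf, rng)
        d = evolve(v, alpha, beta, t_leaf, rng)
        # Phase B: estimate distances and apply the noisy 4p condition.
        true_sum = k2p_distance(a, b) + k2p_distance(c, d)
        other = min(k2p_distance(a, c) + k2p_distance(b, d),
                    k2p_distance(a, d) + k2p_distance(b, c))
        if not true_sum < other:
            failures += 1
    return failures / trials


print(split_failure_rate(alpha=1.0, beta=0.5, t_leaf=0.05, t_mid=0.05,
                         length=500, trials=200))
```

Sweeping `t_leaf` and `t_mid` while keeping their ratio fixed gives a failure-rate curve over the quartet diameter, which is the kind of plot the following slides show.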

22 Performance of K2P distances in resolving quartets, small diameters. [Plot; inset: template quartet.]

23 Performance for larger diameters: "site saturation". [Plot.]

24 When β < α, we can postpone the "site saturation" effect. For this, use another distance function for the same model, $\Delta_{tv}$, which counts only transversions. This is the CFN model [Cavender78, Farris73, Neyman71]. [Figure: the K2P rate matrix with transitions at rate α and transversions at rate β.]

25 Apply the same split resolution test to the transversions-only distance: the aligned sequences u = AACA…GTCTTCGAGGCCC and v = AGCA…GCCTATGCGACCT are passed through the transversions-only distance correction procedure.
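
A minimal sketch of a transversions-only correction, assuming the standard two-state (CFN) formula applied to purine/pyrimidine classes: d = -1/2 ln(1 - 2Q), where Q is the observed fraction of sites showing a transversion. The function name is an assumption for this example.

```python
import math

PURINES = {"A", "G"}   # C and T are the pyrimidines


def transversion_distance(u, v):
    """Estimated number of transversions per site between aligned sequences u and v.

    Transitions (A<->G, C<->T) are ignored entirely. math.log raises
    ValueError once half or more of the sites show a transversion.
    """
    if len(u) != len(v):
        raise ValueError("sequences must be aligned to equal length")
    tv = sum((a in PURINES) != (b in PURINES) for a, b in zip(u, v))
    q = tv / len(u)
    return -0.5 * math.log(1 - 2 * q)


# The A->G transition at the first site is ignored; only C->A at the last site counts.
print(transversion_distance("ACGTACGTAC", "GCGTACGTAA"))
```

Because transversions accumulate more slowly than transitions when β < α, this estimate saturates later, at the price of using less of the signal when distances are small.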

26 Transversions-only performs better on large rates and worse on small rates. [Plot: transversions-only vs. total K2P rate.]

Conclusion: distance-based reconstruction methods should be adaptive: find a distance function d which is good for the input. Project's goal: evaluate the performance of distance functions in reconstructing phylogenies.

28 First step in finding good distance functions (for the K2P model): characterize the available distance functions. Ideally, we would like to use the K2P distance associated with the rate matrix of each edge, but...

29 Rate matrices are hard to observe, hence we use substitution matrices. A finite sequence u = AACA…GTCTTCGAGGCCC evolves into v = AGCA…GCCTATGCGACCT under unknown model parameters α, β; the pair is described by a stochastic substitution matrix $P_{uv}$.
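
In practice only an estimate of $P_{uv}$ is available, obtained by counting aligned site pairs. A minimal sketch, assuming a plain row-normalized count matrix with no smoothing or symmetrization; the function name and the nested-dictionary layout are choices made for this example.

```python
BASES = "ACGT"


def empirical_substitution_matrix(u, v):
    """Estimate P[x][y] = Pr(site is y in v | site is x in u) from aligned sequences."""
    if len(u) != len(v):
        raise ValueError("sequences must be aligned to equal length")
    counts = {x: {y: 0 for y in BASES} for x in BASES}
    for a, b in zip(u, v):
        counts[a][b] += 1
    matrix = {}
    for x in BASES:
        total = sum(counts[x].values())
        # A base that never occurs in u gets an all-zero row.
        matrix[x] = {y: (counts[x][y] / total if total else 0.0) for y in BASES}
    return matrix


estimate = empirical_substitution_matrix("AACC", "AGCC")
print(estimate["A"])
```

With short sequences the estimate is coarse, which is exactly the noise the later slides are concerned with.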

30 Substitution matrices are extended to paths: for a path u, v, w, the substitution matrix of the path is the product $P_{uw} = P_{uv} P_{vw}$.

31 Substitution matrices are converted to distances by a Substitution Rate (SR) function Δ. An SR function needs to satisfy the following for all substitution matrices P, Q in K2P: 1. Δ(PQ) = Δ(P) + Δ(Q) (additivity). 2. Δ(P) > 0 (positivity).

32 To define SR functions which are additive, Δ(PQ) = Δ(P) + Δ(Q), we use some linear algebra.

33 Lemma: there is a matrix U which diagonalizes each K2P substitution matrix P: $U^{-1} P U = \mathrm{diag}(1, \lambda_P, \mu_P, \mu_P)$, where $0 < \lambda_P < 1$ and $0 < \mu_P < 1$.

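34 Let P, Q be two matrices in K2P. Then $U^{-1} P U = \mathrm{diag}(1, \lambda_P, \mu_P, \mu_P)$ and $U^{-1} Q U = \mathrm{diag}(1, \lambda_Q, \mu_Q, \mu_Q)$, hence $U^{-1} PQ\, U = (U^{-1} P U)(U^{-1} Q U) = \mathrm{diag}(1, \lambda_P \lambda_Q, \mu_P \mu_Q, \mu_P \mu_Q)$.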

35 Hence, the functions $D_\lambda(P) = -\ln(\lambda_P)$ and $D_\mu(P) = -\ln(\mu_P)$ are additive distance functions for the K2P model. Proof: $D_\lambda(PQ) = -\ln(\lambda_P \lambda_Q) = -\ln(\lambda_P) - \ln(\lambda_Q) = D_\lambda(P) + D_\lambda(Q)$, and the same holds for $D_\mu(P) = -\ln(\mu_P)$.
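
The additivity can be checked numerically. The sketch below builds two K2P substitution matrices from chosen eigenvalues, multiplies them, and compares both sides of the identity. It assumes bases ordered A, G, C, T, and that λ is the simple eigenvalue and μ the double one; the helper names are hypothetical.

```python
import math


def k2p_matrix(lam, mu):
    """K2P substitution matrix (order A, G, C, T) with eigenvalues 1, lam, mu, mu."""
    same = 0.25 + 0.25 * lam + 0.5 * mu
    ts = 0.25 + 0.25 * lam - 0.5 * mu     # transition probability
    tv = 0.25 - 0.25 * lam                # probability of each transversion
    return [
        [same, ts, tv, tv],
        [ts, same, tv, tv],
        [tv, tv, same, ts],
        [tv, tv, ts, same],
    ]


def matmul(p, q):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def eigenvalues(m):
    """Recover (lam, mu) of a K2P substitution matrix from its first row."""
    ts = m[0][1]
    tv_total = m[0][2] + m[0][3]
    return 1 - 2 * tv_total, 1 - 2 * ts - tv_total


def d_lambda(m):
    return -math.log(eigenvalues(m)[0])


def d_mu(m):
    return -math.log(eigenvalues(m)[1])


P = k2p_matrix(0.8, 0.6)
Q = k2p_matrix(0.9, 0.5)
PQ = matmul(P, Q)
print(d_lambda(PQ), d_lambda(P) + d_lambda(Q))
print(d_mu(PQ), d_mu(P) + d_mu(Q))
```

Each printed pair should agree up to floating-point rounding, since the eigenvalues of the product are the products of the eigenvalues.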

36 Moreover, each positive linear combination of $D_\lambda$ and $D_\mu$ is an additive distance function. Our goal: given a set of input sequences, select the function D which guarantees the best reconstruction of the true tree.
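
As a worked example, the two distances used earlier in the talk can be written as such combinations. This assumes an edge with K2P rates α, β and duration t, and that $\lambda_P$ is the simple eigenvalue and $\mu_P$ the double one; the identification follows from the standard K2P closed form, not from the slides.

```latex
\begin{align*}
\lambda_P &= e^{-4\beta t}, & D_\lambda(P) &= 4\beta t,\\
\mu_P &= e^{-2(\alpha+\beta)t}, & D_\mu(P) &= 2(\alpha+\beta)t,\\
\Delta_{total} &= (\alpha + 2\beta)\,t = \tfrac14 D_\lambda(P) + \tfrac12 D_\mu(P),\\
\Delta_{tv} &= 2\beta t = \tfrac12 D_\lambda(P).
\end{align*}
```

So the total-rate distance weights both eigenvalues, while the transversions-only distance uses $D_\lambda$ alone.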

37 The approximate distance function is defined by the observable, noisy version of the substitution matrices (estimated from finite sequences such as ACGGTCA, ACGGATA, GGGGATT at the nodes u, v, w). We would like to use functions which minimize the influence of the "noise" on the reconstruction. Such a function can be defined and computed analytically for a single distance. Computing it even for small trees looks hard.

38 Summary. We have infinitely many additive distance functions for the K2P model. Which one should we use for the given input DNA sequences? If we had the exact substitution matrices for all pairs of taxa, all functions would be equally good. But we have only finite sequences, whose alignments provide only estimates of the true substitution matrices.

39 The 3 phases of the project. Phase 1: distance functions on simulated quartets (1 month). Phase 2: distance functions on larger simulated trees (1+ month). Phase 3: extensions to real data and/or different models (1 month). Phases 2 and 3 are flexible.

40 Phase I: Quartets (~one month). Study the relevant material in "Towards Optimal....". Write a program (in MATLAB or C..) which computes optimal distance functions as in the above paper. Repeat the "quartet resolution test" given in this presentation, and extend it to include optimal distance functions. Feel free to modify the simulation as you see fit.

41 Phase II: Reconstructing Larger Trees using the Neighbor Joining Algorithm. 1. Study the Neighbor Joining algorithm. 2. Study the Newick tree representation and the Robinson-Foulds measure. 3. Run similar tests, but this time on larger trees. 4. An implementation of NJ and "Tree Templates" can be downloaded from the web. More information will be given later, either via the course site or in a meeting.

42 Phase III: Trees from Real Data. 1. Get homologous DNA sequences from existing databases. 2. Align the sequences using public domain software. 3. Select appropriate distance functions and use them to estimate the distances between the aligned sequences. 4. Use the various distance functions to reconstruct the trees, and compare their performance.