Class 9: Phylogenetic Trees. The Tree of Life, after Ernst Haeckel, 1891.


Class 9: Phylogenetic Trees

The Tree of Life, after Ernst Haeckel, 1891

Evolution
- Many theories of evolution
- Basic idea:
  - Speciation events lead to the creation of different species
  - Speciation is caused by physical separation into groups in which different genetic variants become dominant
- Any two species share a (possibly distant) common ancestor

Phylogenies
- A phylogeny is a tree that describes the sequence of speciation events that led to the formation of a set of current-day species
- Leaves: current-day species
- Nodes: hypothetical most recent common ancestors
- Edge lengths: “time” from one speciation to the next
- [Figure: example tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]

Phylogenetic Tree
- Topology: bifurcating
  - Leaves: 1…N
  - Internal nodes: N+1…2N-2
- [Figure labels: leaf, branch, internal node]

Example: Primate evolution
- [Figure: primate phylogeny with divergence times in millions of years ago (mya)]

How to construct a Phylogeny?
- Until the mid-1950s, phylogenies were constructed by experts based on their opinion (subjective criteria)
- Since then, the focus has been on objective criteria for constructing phylogenetic trees
  - Thousands of articles in the last decades
- Important for many aspects of biology
  - Classification (systematics)
  - Understanding biological mechanisms

Morphological vs. Molecular
- Classical phylogenetic analysis: morphological features
  - Number of legs, lengths of legs, etc.
- Modern biological methods allow the use of molecular features
  - Gene sequences
  - Protein sequences
- Analysis is based on homologous sequences (e.g., globins) in different species

Dangers in Molecular Phylogenies
- We have to remember that gene/protein sequences can be homologous for different reasons:
- Orthologs: sequences diverged after a speciation event
- Paralogs: sequences diverged after a duplication event
- Xenologs: sequences diverged after a horizontal transfer (e.g., by virus)

Dangers of Paralogs
- [Figure: gene tree with a gene duplication followed by speciation events; leaves 1A, 2A, 3A, 3B, 2B, 1B]
- If we only consider 1A, 2B, and 3A...

Types of Trees
- A natural model to consider is that of rooted trees
- [Figure: rooted tree with the common ancestor at the root]

Types of Trees
- Depending on the model, data from current-day species do not distinguish between different placements of the root
- [Figure: two trees that differ only in the placement of the root]

Types of Trees
- An unrooted tree represents the same phylogeny without the root node

Positioning Roots in Unrooted Trees
- We can estimate the position of the root by introducing an outgroup:
  - A set of species that are definitely distant from all the species of interest
- [Figure: tree over Aardvark, Bison, Chimp, Dog, Elephant with the outgroup Falcon and the proposed root]

Types of Data
- Distance-based
  - Input is a matrix of distances between species
  - Can be the fraction of residues they disagree on, or the negated alignment score between them, or …
- Character-based
  - Examine each character (e.g., residue) separately
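As a hedged illustration of the first distance option above, the sketch below computes the fraction of aligned positions at which two sequences disagree and collects the values into a matrix. The names `p_distance` and `distance_matrix` and the toy sequences are assumptions made for this example, not part of the lecture.

```python
def p_distance(s, t):
    """Fraction of aligned positions at which s and t disagree."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t)) / len(s)


def distance_matrix(seqs):
    """Nested dict: matrix[a][b] is the p-distance between sequences a and b."""
    names = list(seqs)
    return {a: {b: p_distance(seqs[a], seqs[b]) for b in names} for a in names}


# Illustrative aligned sequences (hypothetical input).
toy = {"A": "CAGGTA", "B": "CAGACA", "D": "TGCACT"}
matrix = distance_matrix(toy)
```

The result is symmetric with zeros on the diagonal, which is the form that distance-based methods expect as input.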

Simple Distance-Based Method
- Input: distance matrix between species
- Outline:
  - Cluster species together
  - Initially, clusters are singletons
  - At each iteration, combine the two “closest” clusters to get a new one

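UPGMA Clustering
- Let $C_i$ and $C_j$ be clusters, and define the distance between them to be the average pairwise distance between their members: $d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$
- When combining two clusters $C_i$ and $C_j$ to form a new cluster $C_k$, then for every other cluster $C_l$: $d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$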
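A minimal sketch of UPGMA clustering, assuming the usual size-weighted average for the distance from a merged cluster to every other cluster. The input format (a dict of leaf pairs), the nested-tuple tree, and the `height` bookkeeping are choices made for this example only.

```python
from itertools import combinations


def upgma(pair_dist):
    """pair_dist: {(a, b): distance}, each unordered leaf pair given once.
    Returns (tree as nested tuples, height of every cluster)."""
    d = {frozenset(pair): float(v) for pair, v in pair_dist.items()}
    active = sorted({leaf for pair in pair_dist for leaf in pair})
    size = {leaf: 1 for leaf in active}
    height = {leaf: 0.0 for leaf in active}
    while len(active) > 1:
        # Pick the two closest clusters.
        i, j = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        k = (i, j)  # a merged cluster is represented by the pair it joins
        size[k] = size[i] + size[j]
        height[k] = d[frozenset((i, j))] / 2  # node height under a molecular clock
        rest = [c for c in active if c not in (i, j)]
        for c in rest:
            # Size-weighted average of the distances from the two merged clusters.
            d[frozenset((k, c))] = (
                size[i] * d[frozenset((i, c))] + size[j] * d[frozenset((j, c))]
            ) / size[k]
        active = rest + [k]
    return active[0], height


tree, height = upgma({("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6})
```

In this toy input A and B are merged first at height 1, and the merged cluster then joins C at height 3.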

Molecular Clock
- UPGMA implicitly assumes that all distances measure time in the same way

Additivity
- A weaker requirement is additivity
  - In the “real” tree, distances between species are the sum of distances between intermediate nodes
- [Figure: tree with nodes i, j, k and branch lengths a, b, c]

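Consequences of Additivity
- Suppose the input distances are additive
- For any three leaves $i$, $j$, $m$ that meet at an internal node $k$: $d(i,m) = d(i,k) + d(k,m)$, $d(j,m) = d(j,k) + d(k,m)$, $d(i,j) = d(i,k) + d(k,j)$
- Thus $d(k,m) = \frac{1}{2}\left(d(i,m) + d(j,m) - d(i,j)\right)$
- [Figure: leaves i, j, m joined at node k, with branch lengths a, b, c]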

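Neighbor Joining
- Can we use this fact to construct trees?
- Let $D(i,j) = d(i,j) - (r_i + r_j)$, where $r_i = \frac{1}{|L|-2} \sum_{k \in L} d(i,k)$
- Theorem: if $D(i,j)$ is minimal (among all pairs of leaves), then $i$ and $j$ are neighbors in the tree

Neighbor Joining
- Set $L$ to contain all leaves
- Iteration:
  - Choose $i, j$ such that $D(i,j)$ is minimal
  - Create a new node $k$, and set $d(k,m) = \frac{1}{2}\left(d(i,m) + d(j,m) - d(i,j)\right)$ for every other $m \in L$, with branch lengths $d(i,k) = \frac{1}{2}\left(d(i,j) + r_i - r_j\right)$ and $d(j,k) = d(i,j) - d(i,k)$
  - Remove $i, j$ from $L$, and add $k$
- Terminate: when $|L| = 2$, connect the two remaining nodes
- [Figure: leaves i and j joined at the new node k; m is any other leaf]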

Distance Based Methods
- If we make strong assumptions on distances, we can reconstruct trees
- In real life, distances are not additive
- Sometimes they are close to additive
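To make the first point concrete, here is a hedged sketch of the neighbor-joining iteration run on additive distances. It assumes the standard correction $D(i,j) = d(i,j) - (r_i + r_j)$ with $r_i$ the sum of distances from $i$ divided by $|L| - 2$; the internal node labels `n0`, `n1`, the edge-list output, and the toy distances are hypothetical choices for this example.

```python
def neighbor_joining(pair_dist):
    """pair_dist: {(a, b): distance}, each unordered leaf pair given once.
    Returns a list of (node, node, branch_length) edges of an unrooted tree."""
    d = {}
    for (a, b), v in pair_dist.items():
        d.setdefault(a, {})[b] = float(v)
        d.setdefault(b, {})[a] = float(v)
    L = sorted(d)
    edges = []
    while len(L) > 2:
        n = len(L)
        r = {i: sum(d[i][m] for m in L if m != i) / (n - 2) for i in L}
        pairs = [(a, b) for x, a in enumerate(L) for b in L[x + 1:]]
        # Choose the pair minimizing the corrected distance D(i, j).
        i, j = min(pairs, key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        k = f"n{len(edges) // 2}"  # hypothetical label for the new internal node
        d_ik = 0.5 * (d[i][j] + r[i] - r[j])
        edges += [(i, k, d_ik), (j, k, d[i][j] - d_ik)]
        d[k] = {}
        for m in L:
            if m not in (i, j):
                d[k][m] = d[m][k] = 0.5 * (d[i][m] + d[j][m] - d[i][j])
        L = [m for m in L if m not in (i, j)] + [k]
    a, b = L
    edges.append((a, b, d[a][b]))
    return edges


# Additive distances from a tree with pendant branches A:1, B:2, C:3, D:4
# and an internal branch of length 5 separating (A, B) from (C, D).
toy = {("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
       ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7}
edges = neighbor_joining(toy)
```

Because the toy distances are exactly additive, the recovered branch lengths match the tree that generated them; with real data the result is only an estimate.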

Character Based Methods
- We start with a multiple alignment
- Assumptions:
  - All sequences are homologous
  - Each position in the alignment is homologous
  - Positions evolve independently
  - No gaps
- We seek to explain the evolution of each position in the alignment

Parsimony
- Character-based method
- A way to score trees (but not to build trees!)
- Assumptions:
  - Independence of characters (no interactions)
  - The best tree is the one where minimal changes take place

A Simple Example
- What is the parsimony score of the tree over Aardvark, Bison, Chimp, Dog, Elephant with these sequences?
  A: CAGGTA
  B: CAGACA
  C: CGGGTA
  D: TGCACT
  E: TGCGTA

A Simple Example
- Each column is scored separately.
- Let’s look at the first column (C, C, C, T, T):
- The minimal tree has one evolutionary change, between T and C
- [Figure: tree with leaves C, C, C, T, T and the single change marked on one branch]

Evaluating Parsimony Scores
- How do we compute the parsimony score for a given tree?
- Traditional parsimony
  - Each base change has a cost of 1
- Weighted parsimony
  - Each change is weighted by the score c(a,b)

Traditional Parsimony
- Solved independently for each position
- Linear-time solution
- [Figure: example tree with leaf labels a and g and internal candidate sets {a,g} and {a}]
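The set-based labeling in the figure can be sketched with Fitch's procedure, which is one common way to get the linear-time solution; the slide does not name the algorithm, so treat that as an assumption. The tree topology below is also assumed, since the tree figure for the five-sequence example is not preserved. The function names are illustrative.

```python
def fitch(node):
    """Return (candidate state set, change count) for one alignment column.
    A node is either a leaf character or a pair (left, right)."""
    if isinstance(node, str):
        return {node}, 0
    (left_set, left_cost), (right_set, right_cost) = fitch(node[0]), fitch(node[1])
    common = left_set & right_set
    if common:  # children agree on at least one state: no change needed here
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def label(topology, column):
    """Replace each leaf name in the topology by its character in this column."""
    if isinstance(topology, str):
        return column[topology]
    return tuple(label(child, column) for child in topology)


def parsimony_score(topology, seqs):
    """Sum the per-column change counts over all alignment positions."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch(label(topology, {name: s[pos] for name, s in seqs.items()}))[1]
        for pos in range(length)
    )


seqs = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA", "D": "TGCACT", "E": "TGCGTA"}
topology = ((("A", "B"), "C"), ("D", "E"))  # assumed topology
score = parsimony_score(topology, seqs)
first_column = fitch(label(topology, {name: s[0] for name, s in seqs.items()}))[1]
```

Under the assumed topology the first column needs one change, in line with the simple example, and the six columns together need eight.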

Evaluating Weighted Parsimony
- Dynamic programming on the tree
- $S(i,a)$ = cost of the tree rooted at $i$ if $i$ is labeled by $a$
- Initialization: for each leaf $i$, set $S(i,a) = 0$ if $i$ is labeled by $a$; otherwise $S(i,a) = \infty$
- Iteration: if $k$ is a node with children $i$ and $j$, then $S(k,a) = \min_b \left(S(i,b) + c(a,b)\right) + \min_b \left(S(j,b) + c(a,b)\right)$
- Termination: the cost of the tree is $\min_a S(r,a)$, where $r$ is the root
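The recursion above translates almost line by line into code. This is a sketch for a single alignment column; the nested-tuple tree, the function names, and the transition/transversion weights in `ts_tv_cost` are hypothetical choices, not values from the lecture.

```python
import math

ALPHABET = "ACGT"


def weighted_parsimony(node, cost):
    """Return {a: S(node, a)}; a node is a leaf character or a pair (left, right)."""
    if isinstance(node, str):
        return {a: 0.0 if a == node else math.inf for a in ALPHABET}
    left, right = (weighted_parsimony(child, cost) for child in node)
    return {
        a: min(left[b] + cost(a, b) for b in ALPHABET)
        + min(right[b] + cost(a, b) for b in ALPHABET)
        for a in ALPHABET
    }


def tree_cost(root, cost):
    """Termination step: minimize over the label of the root."""
    return min(weighted_parsimony(root, cost).values())


def unit_cost(a, b):
    """Traditional parsimony: every change costs 1."""
    return 0 if a == b else 1


def ts_tv_cost(a, b):
    """Hypothetical weights: transitions (A<->G, C<->T) cost 1, transversions 2."""
    if a == b:
        return 0
    return 1 if {a, b} in ({"A", "G"}, {"C", "T"}) else 2


column = ((("C", "C"), "C"), ("T", "T"))  # one column on an assumed topology
```

With `unit_cost` the recursion reduces to the traditional parsimony score, so the same routine covers both scoring schemes.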

Cost of Evaluating Parsimony
- The score is evaluated on each position independently. Scores are then summed over all positions.
- If there are n nodes, m characters, and k possible values for each character, then the complexity is O(nmk) for traditional parsimony; the weighted recursion minimizes over k child labels for each of k parent labels, so it takes O(nmk^2)
- By keeping traceback information, we can reconstruct the most parsimonious values at each ancestor node

Maximum Parsimony
  Species 1 - A G G G T A A C T G
  Species 2 - A C G A T T A T T A
  Species 3 - A T A A T T G T C T
  Species 4 - A A T G T T G T C G
- How many possible unrooted trees?
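One way to answer the question is the standard count of unrooted bifurcating trees on n labeled leaves, $(2n-5)!! = 1 \cdot 3 \cdot 5 \cdots (2n-5)$. The sketch below assumes that formula; the function name is illustrative.

```python
def num_unrooted_trees(n_leaves):
    """Number of distinct unrooted bifurcating trees on n labeled leaves."""
    if n_leaves < 3:
        raise ValueError("need at least three leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for odd in range(3, 2 * n_leaves - 4, 2):
        count *= odd
    return count


four_species = num_unrooted_trees(4)
```

For the four species above this gives three candidate trees, and the count grows very quickly as leaves are added.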

Maximum Parsimony
- How many substitutions?
- [Figure: the candidate trees, with the maximum parsimony (MP) tree marked]

Maximum Parsimony
  1 - A G G G T A A C T G
  2 - A C G A T T A T T A
  3 - A T A A T T G T C T
  4 - A A T G T T G T C G
- Substitutions are counted site by site on each of the three candidate trees:
  - Site 1 (A, A, A, A): 0, 0, 0
  - Site 2 (G, C, T, A): 3, 3, 3
  - Site 4 (G, A, A, G): 2, 2, 1
- Counts recorded under the alignment for sites 1 to 3: 0, 3, 2
- [Figures: each site's characters mapped onto the three trees]

Searching for Trees

Searching for the Optimal Tree
- Exhaustive search
  - Very intensive
- Branch and bound
  - A compromise
- Heuristic
  - Fast
  - Usually starts with NJ
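For four species, exhaustive search is still cheap: there are only three unrooted topologies. The sketch below scores each of them on the four-species alignment from the maximum parsimony example, using a Fitch-style count as the scoring routine (an assumption; any parsimony scorer would do). Each unrooted tree is written as a pair of pairs split at its internal edge, which does not change the parsimony score.

```python
def fitch(node):
    """Return (candidate state set, change count) for one alignment column."""
    if isinstance(node, str):
        return {node}, 0
    (ls, lc), (rs, rc) = fitch(node[0]), fitch(node[1])
    return (ls & rs, lc + rc) if ls & rs else (ls | rs, lc + rc + 1)


def score(topology, seqs):
    """Total number of changes over all columns for a ((a, b), (c, d)) topology."""
    (a, b), (c, d) = topology
    return sum(
        fitch(((seqs[a][p], seqs[b][p]), (seqs[c][p], seqs[d][p])))[1]
        for p in range(len(seqs[a]))
    )


def quartet_topologies(names):
    """The three unrooted trees on four leaves."""
    a, b, c, d = names
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


SEQS = {
    "1": "AGGGTAACTG",
    "2": "ACGATTATTA",
    "3": "ATAATTGTCT",
    "4": "AATGTTGTCG",
}
scores = {t: score(t, SEQS) for t in quartet_topologies(sorted(SEQS))}
best = min(scores, key=scores.get)
```

On this alignment the tree that groups species 1 with 2 and species 3 with 4 needs the fewest changes. With more species the number of topologies makes this approach impractical, which is why branch and bound or heuristics are used instead.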

Phylogenetic Tree Assumptions
- Topology: bifurcating
  - Leaves: 1…N
  - Internal nodes: N+1…2N-2
- Lengths $t = \{t_i\}$ for each branch
- Phylogenetic tree = (Topology, Lengths) = $(T, t)$
- [Figure labels: leaf, branch, internal node]

Probabilistic Methods
- The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences.
- Background probabilities: q(a)
- Mutation probabilities: P(a|b,t)
- Models for evolutionary mutations
  - Jukes-Cantor
  - Kimura 2-parameter model
- Such models are used to derive the probabilities

Jukes-Cantor model
- A model for mutation rates
  - Mutation occurs at a constant rate
  - Each nucleotide is equally likely to mutate into any other nucleotide, with rate α

Kimura 2-parameter model
- Allows a different rate for transitions and transversions.

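Mutation Probabilities
- The rate matrix $R$ is used to derive the mutation probability matrix $S(t)$
- $S$ is obtained by integration. For Jukes-Cantor: $P(a \mid a, t) = \frac{1}{4}\left(1 + 3e^{-4\alpha t}\right)$ and $P(b \mid a, t) = \frac{1}{4}\left(1 - e^{-4\alpha t}\right)$ for $b \neq a$
- $q$ can be obtained by setting $t$ to infinity

Mutation Probabilities
- Both models satisfy the following properties:
- Lack of memory: $P(a \mid c, t + t') = \sum_b P(a \mid b, t)\, P(b \mid c, t')$
- Reversibility: there exist stationary probabilities $\{P_a\}$ such that $P_a\, P(b \mid a, t) = P_b\, P(a \mid b, t)$
- [Figure: the four nucleotides A, G, T, C]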
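These properties can be checked numerically for the Jukes-Cantor model. The sketch assumes the standard closed form in which a base stays the same with probability $\frac{1}{4}(1 + 3e^{-4\alpha t})$ and changes to each other base with probability $\frac{1}{4}(1 - e^{-4\alpha t})$; the function names and the default `alpha=1.0` are choices made for this example.

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t, alpha=1.0):
    """P(a | b, t) under Jukes-Cantor with rate alpha toward each other base."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def row_sum(b, t):
    """Total probability of ending in any base when starting from b."""
    return sum(jc_prob(a, b, t) for a in BASES)


def two_step(a, c, t1, t2):
    """Sum over the intermediate base: the 'lack of memory' composition."""
    return sum(jc_prob(a, b, t1) * jc_prob(b, c, t2) for b in BASES)
```

Evolving for 0.2 and then 0.5 time units gives the same probabilities as evolving for 0.7 in one step, and as t grows every entry tends to the uniform background value 1/4.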

Probabilistic Approach
- Given $P$, $q$, the tree topology, and the branch lengths, we can compute the probability of the sequences at the nodes of the tree
- [Figure: tree with nodes $x_1, \dots, x_5$ and branch lengths $t_1, \dots, t_4$]

Computing the Tree Likelihood
- We are interested in the probability of the observed data given the tree and branch “lengths”
- Computed by summing over the internal nodes
- This can be done efficiently using an upward traversal pass over the tree

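Tree Likelihood Computation
- Define $P(L_k \mid a)$ = probability of the leaves below node $k$ given that $x_k = a$
- Init: for leaves, $P(L_k \mid a) = 1$ if $x_k = a$; 0 otherwise
- Iteration: if $k$ is a node with children $i$ and $j$, then $P(L_k \mid a) = \sum_{b,c} P(b \mid a, t_i)\, P(L_i \mid b)\, P(c \mid a, t_j)\, P(L_j \mid c)$
- Termination: the likelihood is $\sum_a q(a)\, P(L_{\mathrm{root}} \mid a)$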
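A hedged sketch of this upward pass for a single alignment column. It assumes Jukes-Cantor mutation probabilities with rate 1 and a uniform background q; the tuple encoding of the tree and the branch lengths are hypothetical.

```python
import math

BASES = "ACGT"


def jc_prob(child, parent, t, alpha=1.0):
    """P(child | parent, t) under Jukes-Cantor (assumed mutation model)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if child == parent else 0.25 - 0.25 * decay


def conditional(node):
    """Return {a: P(L_k | a)}. A node is a leaf base or a tuple
    (left, t_left, right, t_right)."""
    if isinstance(node, str):
        return {a: 1.0 if a == node else 0.0 for a in BASES}
    left, t_left, right, t_right = node
    below_left, below_right = conditional(left), conditional(right)
    # The double sum over (b, c) factorizes into a product of two sums.
    return {
        a: sum(jc_prob(b, a, t_left) * below_left[b] for b in BASES)
        * sum(jc_prob(c, a, t_right) * below_right[c] for c in BASES)
        for a in BASES
    }


def likelihood(root, q=None):
    """Termination step: average the root conditionals over the background q."""
    q = q or {a: 0.25 for a in BASES}
    below_root = conditional(root)
    return sum(q[a] * below_root[a] for a in BASES)


tree = (("A", 0.1, "A", 0.2), 0.3, "G", 0.4)  # hypothetical branch lengths
column_likelihood = likelihood(tree)
```

For a full alignment, the per-column likelihoods would be multiplied (or their logarithms summed) under the assumption that positions are independent.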

Maximum Likelihood (ML)
- Score each tree by its likelihood
  - Assumption of independent positions
- Branch lengths t can be optimized
  - Gradient ascent
  - EM
- We look for the highest-scoring tree
  - Exhaustive search
  - Sampling methods (Metropolis)

Optimal Tree Search
- Perform search over possible topologies
- [Figure: topologies $T_1, T_2, T_3, T_4, \dots, T_n$, each with parametric optimization (EM) in its own parameter space, which has local maxima]

Computational Problem
- Such procedures are computationally expensive!
- Computing the optimal parameters for each candidate requires a non-trivial optimization step.
- We spend non-negligible computation on a candidate even if it is a low-scoring one.
- In practice, such learning procedures can only consider small sets of candidate structures

Structural EM
- Idea: use the parameters found for the current topology to help evaluate new topologies.
- Outline:
  - Perform search in (T, t) space.
  - Use EM-like iterations:
    - E-step: use the current solution to compute expected sufficient statistics for all topologies
    - M-step: select a new topology based on these expected sufficient statistics

The Complete-Data Scenario
- Suppose we observe H, the ancestral sequences.
- Define $S_{i,j}$, a matrix of the number of co-occurrences of each pair (a,b) in the taxa i, j
- Find the topology T that maximizes F, a linear function of the $S_{i,j}$

Expected Likelihood
- Start with a tree $(T_0, t_0)$
- Compute the expected statistics under $(T_0, t_0)$
- Formal justification (theorem): an improvement in the expected score implies an improvement in the likelihood

Proof
- The theorem follows from a simple application of Jensen’s inequality

Algorithm Outline
- Start from the original tree $(T_0, t_0)$
- Compute: unlike standard EM for trees, we compute all possible pairwise statistics
- Time: $O(N^2 M)$

Algorithm Outline: Pairwise weights
- Weights: this stage also computes the branch length for each pair (i, j)

Algorithm Outline: Max. Spanning Tree
- Find: a fast greedy procedure finds the tree
- By construction: $Q(T', t') \ge Q(T_0, t_0)$
- Thus, $l(T', t') \ge l(T_0, t_0)$
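The slide only says that the tree is found by a fast greedy procedure. One standard greedy choice is Kruskal's algorithm run on edges sorted by decreasing weight, sketched below with a small union-find. Treat the choice of Kruskal, the function name, and the toy weights as assumptions; in the lecture's setting the weights would be the pairwise scores computed in the previous stage.

```python
def maximum_spanning_tree(weights):
    """weights: {(i, j): w} over unordered node pairs.
    Returns the selected edges as (i, j, w) tuples."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    by_weight = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    for (i, j), w in by_weight:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two components, so no cycle
            parent[root_i] = root_j
            tree.append((i, j, w))
    return tree


# Hypothetical pairwise weights between four nodes.
toy_weights = {("a", "b"): 5.0, ("a", "c"): 1.0, ("b", "c"): 4.0,
               ("c", "d"): 2.0, ("b", "d"): 3.0}
mst = maximum_spanning_tree(toy_weights)
```

The result may contain nodes of high degree, which is why a separate step is needed afterwards to turn it into a bifurcating tree.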

Algorithm Outline: Fix Tree
- Construct bifurcation $T_1$:
  - Remove redundant nodes
  - Add nodes to break large degree
- This operation preserves likelihood: $l(T_1, t') = l(T', t') \ge l(T_0, t_0)$

Assessing trees: the Bootstrap
- Often we don’t trust the tree found as the “correct” one.
- Bootstrapping:
  - Sample (with replacement) n positions from the alignment
  - Learn the best tree for each sample
  - Look for tree features that are frequent across the trees
- For some models this procedure approximates the tree posterior $P(T \mid X_1, \dots, X_n)$
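The resampling step can be sketched as follows. The tree-building routine and the feature test are passed in as functions because the lecture leaves both open; `bootstrap_replicates`, `support`, and their parameters are hypothetical names for this example.

```python
import random


def bootstrap_replicates(seqs, n_replicates, seed=0):
    """Yield alignments built by sampling columns with replacement."""
    rng = random.Random(seed)
    length = len(next(iter(seqs.values())))
    for _ in range(n_replicates):
        columns = [rng.randrange(length) for _ in range(length)]
        # The same column indices are used for every sequence,
        # so each sampled column stays aligned.
        yield {name: "".join(s[c] for c in columns) for name, s in seqs.items()}


def support(seqs, build_tree, has_feature, n_replicates=100, seed=0):
    """Fraction of bootstrap trees in which has_feature(tree) is true."""
    hits = sum(
        bool(has_feature(build_tree(replicate)))
        for replicate in bootstrap_replicates(seqs, n_replicates, seed)
    )
    return hits / n_replicates
```

A typical use would pass a tree builder such as neighbor joining or a parsimony search as `build_tree`, and a test such as "species 1 and 2 are neighbors" as `has_feature`.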

Algorithm Outline: New Tree
- Theorem: $l(T_1, t_1) \ge l(T_0, t_0)$
- These steps (Compute, Weights, Find, Construct bifurcation $T_1$) are then repeated until convergence