Distance-Based Phylogenetic Reconstruction Tutorial #8 © Ilan Gronau, edited by Itai Sharon
Phylogenetic Reconstruction: We'd like to study the evolutionary history of species. Problems: (1) no information regarding extinct species; (2) many possible tree topologies.
Common Terminology: Leaves are the taxa (genes, proteins, species, etc.). Internal nodes are common ancestors. The root is the ancestral node. Edges represent the distance between nodes. [Figure: a rooted tree with leaves A, B, C, D, E.]
Phylogenetic Reconstruction. Approach 1 (character based): Given a probabilistic model (HMM) of evolution, find the tree most likely to yield the known set of species. Problem: finding the maximum-likelihood (ML) tree is very hard. Evolutionary models are very complex, with many parameters, and estimating those parameters with EM runs into many local maxima. Small trees (up to 5 taxa) are relatively easy; big trees (more than 50 taxa) are almost impossible. Approach 2 (distance based): Given ML pairwise (evolutionary) distances between species, find the edge-weighted tree that best describes this metric. Note: ML pairwise distances = ML trees spanning two species.
Distance-Based Reconstruction. Given ML pairwise (evolutionary) distances between species, find the edge-weighted tree that best describes this metric. The input: a distance matrix D satisfying D(i,j) ≥ 0; D(i,i) = 0; D(i,j) = D(j,i); D(i,j) ≤ D(i,k) + D(k,j). The output: an edge-weighted tree T. If D is additive, then D_T = D, where D_T is the metric given by path lengths in T. Otherwise, return a tree best 'fitting' the input D. Note: ML-estimated pairwise distances are usually not additive, but they are 'close' to some additive metric. [Example: a distance matrix over Bear, Raccoon, Weasel, Seal, Dog; its numeric entries are missing from this text.]
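The four conditions on the input matrix can be checked mechanically. The following is a minimal sketch, not part of the original tutorial: the function name `is_metric`, the tolerance `tol`, and both 3-taxon matrices are hypothetical and chosen only for illustration.

```python
from itertools import permutations

def is_metric(D, tol=1e-9):
    """Return True if square matrix D satisfies the four listed conditions."""
    n = len(D)
    for i in range(n):
        # D(i,i) = 0
        if abs(D[i][i]) > tol:
            return False
        for j in range(n):
            # D(i,j) >= 0 and D(i,j) = D(j,i)
            if D[i][j] < -tol or abs(D[i][j] - D[j][i]) > tol:
                return False
    # Triangle inequality: D(i,j) <= D(i,k) + D(k,j)
    for i, j, k in permutations(range(n), 3):
        if D[i][j] > D[i][k] + D[k][j] + tol:
            return False
    return True

# Hypothetical 3-taxon matrices (not taken from the tutorial).
good = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
bad = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]  # violates the triangle inequality: 5 > 1 + 1

print(is_metric(good), is_metric(bad))
```

Passing this check says nothing about additivity; it only confirms that D is a valid input in the sense of the slide.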
Neighbor-Joining Algorithms. Agglomerative (bottom-up) approach:
1. Find a pair of neighboring taxa i,j.
2. Connect them to a new internal vertex v (define the edge weights).
3. Remove i,j from the taxon set and add v (define the distances from v).
4. Return to (1). When only 2 taxa are left, connect them.
Neighbors: taxa connected by a 2-edge path. Consistency: given an additive metric D_T, (a) we always choose a pair of neighbors in T (stage 1), and (b) the reduced distance matrix is consistent with the reduced tree (stage 3). By induction, we eventually reconstruct T.
UPGMA (Unweighted Pair Group Method with Arithmetic Mean). UPGMA algorithm:
1. Find a pair of taxa i,j of minimal distance.
2. Connect them to a new internal vertex v.
3. Remove i,j from the taxon set and add v, with D(v,k) = αD(i,k) + (1-α)D(j,k), where α and 1-α are proportional to the number of 'original' taxa that i and j represent.
4. Return to (1). When only 2 taxa are left, connect them.
Consistency? Given an additive metric D_T, do we always choose a pair of neighbors in T? No: the closest taxon is not necessarily a neighbor. [Example: a distance matrix over taxa a, b, c, d and its tree, on which UPGMA chooses b,c. The surviving entries appear to be D(b,c) = 3, D(b,d) = 15, D(c,d) = 14; the row for a is missing from this text.]
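To see how the closest pair can fail to be a neighbor pair, here is a small sketch on a four-taxon tree. The tree shape ((a,b),(c,d)) and all edge lengths are assumptions made for illustration; they are chosen so that D(b,c), D(b,d) and D(c,d) agree with the entries that survive from the slide, but the original row for a is not known.

```python
from itertools import combinations

# Hypothetical quartet tree ((a,b),(c,d)): one pendant edge per leaf,
# plus a single internal edge joining the two cherries.
pendant = {"a": 13.0, "b": 1.0, "c": 1.0, "d": 13.0}
internal = 1.0
side = {"a": 0, "b": 0, "c": 1, "d": 1}  # which end of the internal edge

def tree_distance(x, y):
    """Path length between leaves x and y in the quartet tree."""
    d = pendant[x] + pendant[y]
    if side[x] != side[y]:
        d += internal  # the path crosses the internal edge
    return d

# Additive metric induced by the tree.
D = {frozenset(p): tree_distance(*p) for p in combinations("abcd", 2)}

# Stage 1 of UPGMA: the pair of minimal distance.
closest = min(D, key=D.get)
print(sorted(closest), D[closest])
```

Here b and c sit on opposite sides of the internal edge, so they are not neighbors, yet they are the pair UPGMA would join first because their pendant edges are short.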
Molecular Clock. Reminder: edge weights correspond to evolutionary distance. If the rate of evolution is universally constant, the root is equidistant from all taxa and the closest taxon-pair is a neighbor-pair. [Figure: tree drawn against a time axis.]
Molecular Clock. Reminder: edge weights correspond to evolutionary distance. If the rate of evolution is different in each branch (as in most observed evolutionary trees), the closest taxon-pair is not necessarily a neighbor-pair. [Figure: tree drawn against a time axis.]
Ultrametric Trees: edge-weighted trees which have a point (the root) equidistant from all leaves. Additive metrics consistent with an ultrametric tree are called ultrametrics. A distance matrix is ultrametric iff it obeys the 3-point condition: "Any subset of three taxa can be labelled i,j,k such that d(i,j) ≤ d(j,k) = d(i,k)".
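The 3-point condition translates directly into a test: for every triple, the two largest of the three pairwise distances must be equal. A minimal sketch follows; the name `is_ultrametric`, the tolerance `tol`, and the two sample matrices are hypothetical.

```python
from itertools import combinations

def is_ultrametric(D, tol=1e-9):
    """Check the 3-point condition on a symmetric distance matrix D."""
    n = len(D)
    for i, j, k in combinations(range(n), 3):
        # Sort the three pairwise distances; the two largest must coincide.
        _, mid, largest = sorted((D[i][j], D[i][k], D[j][k]))
        if abs(largest - mid) > tol:
            return False
    return True

# Hypothetical examples (not from the tutorial).
ultra = [[0, 2, 6], [2, 0, 6], [6, 6, 0]]
not_ultra = [[0, 2, 7], [2, 0, 6], [7, 6, 0]]

print(is_ultrametric(ultra), is_ultrametric(not_ultra))
```

Sorting each triple avoids having to search for the right labelling i,j,k explicitly.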
UPGMA on Ultrametrics. UPGMA algorithm:
1. Find a pair of taxa i,j of minimal distance.
2. Connect them to a new internal vertex v.
3. Remove i,j from the taxon set and add v, with D(v,k) = αD(i,k) + (1-α)D(j,k).
4. Return to (1). When only 2 taxa are left, connect them.
Consistency for ultrametrics: given an ultrametric U_T, (a) we always choose a pair of neighbors in T (stage 1), and (b) the reduced distance matrix is consistent with the reduced tree (stage 3).
UPGMA on Ultrametrics, consistency of stage 3 (the reduced distance matrix is consistent with the reduced tree): If i,j are neighbors in an ultrametric tree, then D(i,k) = D(j,k) for every other taxon k. Or, put differently: if D(i,j) is minimal in an ultrametric, then D(i,k) = D(j,k) for every other taxon k. The reduction formula D(v,k) = αD(i,k) + (1-α)D(j,k) therefore gives D(v,k) = D(i,k) = D(j,k), whatever the value of α. [Figure: neighbors i,j and a third taxon k.]
UPGMA on Ultrametrics, consistency of stage 1 (we always choose a pair of neighbors in T): Assume, to the contrary, that D(i,j) is minimal but i,j are not neighbors. Then the path connecting i and j contains at least 3 edges of non-zero weight. Let v be the least common ancestor (lca) of i and j. At least one side of the path, say the side of j, passes through an internal node strictly below v, and any other leaf k under that node has its lca with j strictly below v. Hence there is a taxon k such that D(j,k) (or D(i,k)) is smaller than D(i,j), contradicting the minimality of D(i,j). [Figure: leaves i, j, k and the vertex v.]
UPGMA on Non-Ultrametric Data. Edge weights are set so that UPGMA always returns an ultrametric tree (we won't prove this). Example: a distance matrix D over Bear, Raccoon, Weasel, Seal, Dog; D is not ultrametric. [The numeric entries of D are missing from this text.]
UPGMA on Non-Ultrametric Data. Example, 1st iteration: Bear and Raccoon are joined under a new vertex B-R (labelled 13 in the tree figure), and the distances from B-R to Weasel, Seal and Dog are computed with α = ½. [Matrices D before and after the reduction: the only entry that survives in this text is D(S,D) = 50.]
UPGMA on Non-Ultrametric Data. Example, 2nd iteration: B-R and Seal are joined under a new vertex B-R-S (labelled 5.25 in the tree figure), and the distances from B-R-S are computed with α = ⅓. [Matrices D before and after the reduction: the surviving entries are D(S,D) = 50 before and D(W,D) = 51 after.]
UPGMA on Non-Ultrametric Data. Example, 3rd iteration: B-R-S and Weasel are joined under a new vertex B-R-S-W (labelled 1.75 in the tree figure), and the distance from B-R-S-W to Dog is computed with α = ¼. [Matrices D before and after the reduction: the surviving entries are D(W,D) = 51 before and D(B-R-S-W, D) = 45¼ after.]
UPGMA on Non-Ultrametric Data. Example, 4th iteration: B-R-S-W and Dog are the only two taxa left, so they are connected. [Tree figure labels: BR 13, BRS = 5.25, BRSW = 1.75; the value of the label BRSWD is missing from this text.]
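The four iterations can be replayed with a short script. This is a sketch under stated assumptions rather than the tutorial's own code: the function `upgma` is hypothetical, each new vertex is placed at height D(i,j)/2 (the usual way of obtaining an ultrametric tree), and the matrix below is an assumed one, chosen only because it reproduces the fragments that survive in the slides (13, 5.25, 1.75, 50, 51, 45¼). The original matrix entries were not recoverable.

```python
def upgma(names, D):
    """Run UPGMA; return the joins as (cluster_i, cluster_j, height)."""
    # Each active cluster is a tuple of the original taxa it represents.
    clusters = [(name,) for name in names]
    dist = {}
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            dist[frozenset((clusters[a], clusters[b]))] = float(D[a][b])
    joins = []
    while len(clusters) > 1:
        # Stage 1: pair of minimal distance.
        pair = min(dist, key=dist.get)
        i, j = sorted(pair, key=clusters.index)
        # Stage 2: new vertex v, placed at height D(i,j)/2 (assumption).
        v = i + j
        joins.append((i, j, dist[pair] / 2))
        # Stage 3: alpha is proportional to the number of taxa in i.
        alpha = len(i) / (len(i) + len(j))
        rest = [c for c in clusters if c not in (i, j)]
        new_dist = {}
        for idx, a in enumerate(rest):
            for b in rest[idx + 1:]:
                new_dist[frozenset((a, b))] = dist[frozenset((a, b))]
        for k in rest:
            new_dist[frozenset((v, k))] = (
                alpha * dist[frozenset((i, k))]
                + (1 - alpha) * dist[frozenset((j, k))]
            )
        clusters, dist = rest + [v], new_dist
    return joins

# Assumed matrix, NOT recovered from the slides.
names = ["Bear", "Raccoon", "Weasel", "Seal", "Dog"]
D = [
    [0, 26, 34, 29, 32],
    [26, 0, 42, 44, 48],
    [34, 42, 0, 44, 51],
    [29, 44, 44, 0, 50],
    [32, 48, 51, 50, 0],
]

joins = upgma(names, D)
for i, j, height in joins:
    print("+".join(i), "|", "+".join(j), "| height", round(height, 3))
```

With this assumed matrix the join heights come out as 13, 18.25, 20 and 22.625. The differences between successive heights, 5.25 and 1.75, agree with the labels that survive from the tree figure, and the last join happens at distance 45¼.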
UPGMA: additional notes. In the reduction formula, D(v,k) can be set to any value within the interval defined by D(i,k) and D(j,k). In particular, D(v,k) = ½(D(i,k) + D(j,k)) gives the WPGMA algorithm, and D(v,k) = min{D(i,k), D(j,k)} gives the 'closest' ultrametric from below (the unique subdominant ultrametric). Run-time analysis: a naïve implementation takes O(n^3); keeping a sorted version of each row in D gives O(n^2 log n); the third variant (the min rule) can be executed in O(n^2).
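The three reduction rules differ only in how D(v,k) is picked from the interval between D(i,k) and D(j,k). A small sketch follows, with a hypothetical helper `reduce_distance` and made-up input values:

```python
def reduce_distance(d_ik, d_jk, n_i=1, n_j=1, rule="upgma"):
    """Distance from the new vertex v to k under one of the three rules.

    n_i and n_j are the numbers of original taxa that i and j represent;
    they matter only for the UPGMA rule.
    """
    if rule == "upgma":
        alpha = n_i / (n_i + n_j)
        return alpha * d_ik + (1 - alpha) * d_jk
    if rule == "wpgma":
        return 0.5 * (d_ik + d_jk)
    if rule == "min":
        return min(d_ik, d_jk)
    raise ValueError(f"unknown rule: {rule}")

# Hypothetical inputs: i represents 2 taxa, j represents 1.
d_ik, d_jk = 38.0, 44.0
results = {
    rule: reduce_distance(d_ik, d_jk, n_i=2, n_j=1, rule=rule)
    for rule in ("upgma", "wpgma", "min")
}
print(results)
```

All three values lie between 38 and 44, as the note about the interval requires; they differ only in how much weight the larger cluster receives.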