Phylogeny
Tree construction methods
Character based:
    Parsimony: Fitch, Sankoff
    Probabilistic: Maximum likelihood
Distance based:
    UPGMA
Maximum Likelihood Method
Input:
    $n$ strings of length $m$ (a multiple alignment)
    Substitution matrix
    Character frequencies
Output: a tree topology with the input strings at the leaves
Maximum Likelihood Method
Input: $n$ strings of length $m$ (a multiple alignment), substitution matrix, character frequencies
For each possible tree topology with leaf labeling $T$:
    For each position $i$ from 1 to $m$:
        $L_i = \sum_{\text{inner-node labelings}} P(\text{specific inner-node labeling})$
    $L_T = L_1 \cdot L_2 \cdots L_m$
Pick the tree with the maximal likelihood.
Maximum Likelihood Computation for a specific tree
Given a tree topology and a leaf labeling, every possible inner-node labeling must be considered.
How many different inner-node labelings exist for the given tree (DNA alphabet)? $4^4 = 256$
[Figure: the example tree with its leaf labels.]
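The count can be checked by brute-force enumeration. This is a minimal sketch, assuming four inner nodes and the four-letter DNA alphabet; the names `ALPHABET`, `N_INNER`, and `labelings` are illustrative only.

```python
from itertools import product

ALPHABET = "ACGT"   # DNA alphabet
N_INNER = 4         # assumed number of inner nodes in the example tree

# Every assignment of one nucleotide to each inner node.
labelings = list(product(ALPHABET, repeat=N_INNER))

print(len(labelings))   # 256
print(labelings[0])     # ('A', 'A', 'A', 'A')
```

The count grows as $4^k$ for $k$ inner nodes, which is why exhaustive enumeration is only practical for very small trees.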
Maximum Likelihood Computation for a specific tree
Compute the likelihood for the given tree.
[Figure: the example tree with one inner-node labeling, and a legend of substitution probabilities with the values 0.1, 0.2, and 1.]
Character frequencies: $P_A = 0.3$, $P_T = 0.3$, $P_C = 0.2$, $P_G = 0.2$
*All other options for inner-node labeling were already computed, and their sum is 0.2.
$L(T) = 0.2 + P_A \cdot P_{A \to T} \cdot P_{T \to C} \cdot P_{C \to T} \cdot P_{C \to G} \cdot P_{T \to C} \cdot P_{A \to G} \cdot P_{G \to A} \cdot P_{G \to G}$
$= 0.2 + 0.3 \cdot 0.1 \cdot 0.2 \cdot 0.2 \cdot 0.1 \cdot 0.2 \cdot 0.2 \cdot 0.2 \cdot 1$
$= 0.2 + 9.6 \cdot 10^{-7} = 0.20000096$
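The arithmetic for this one labeling can be reproduced in a few lines. This is a sketch under stated assumptions: the helper `sub_prob` is hypothetical and simply returns the values that appear in the product above (1 for no change, 0.2 for a change within purines or within pyrimidines, 0.1 otherwise). The edge list follows the order of the factors in $L(T)$.

```python
import math

PURINES = {"A", "G"}

def sub_prob(parent, child):
    """Edge probability; assumed to match the values used in the example."""
    if parent == child:
        return 1.0
    same_class = (parent in PURINES) == (child in PURINES)
    return 0.2 if same_class else 0.1

ROOT_FREQ = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}

# (parent label, child label) pairs, in the order of the factors in L(T).
EDGES = [("A", "T"), ("T", "C"), ("C", "T"), ("C", "G"),
         ("T", "C"), ("A", "G"), ("G", "A"), ("G", "G")]

# Probability of this single inner-node labeling (root label is A).
labeling_prob = ROOT_FREQ["A"] * math.prod(sub_prob(p, c) for p, c in EDGES)

OTHER_LABELINGS = 0.2  # sum over all other labelings, given in the example
likelihood = OTHER_LABELINGS + labeling_prob

print(f"{labeling_prob:.2e}")  # 9.60e-07
print(f"{likelihood:.8f}")     # 0.20000096
```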
UPGMA
UPGMA is a greedy algorithm that constructs a phylogenetic tree from $n$ species and a table $D[n \times n]$ of distances between every pair of species.
Some definitions
Additive distance matrix: a distance matrix is called additive if there exists a tree in which the path length between every two leaves equals the corresponding distance in the matrix. An equivalent definition is the “4 point criterion”, which is easier to verify.
Additive matrix
The “4 point criterion”: a matrix is said to be additive if every 4 objects (species) can be labeled $x, y, z, w$ so that:
$d(x,y) + d(z,w) \le d(x,z) + d(y,w) = d(x,w) + d(y,z)$
[Figure: a four-leaf tree. Leaves $x$ and $y$ hang from one end of the internal edge by edges of length $a$ and $b$; leaves $z$ and $w$ hang from the other end by edges of length $c$ and $d$. The figure also writes the internal edge length as $x$.]
In terms of the edge lengths in the figure:
$(a+b) + (c+d) \le (a+x+c) + (b+x+d) = (a+x+d) + (b+x+c)$
Additive matrix
$d(A,B) + d(C,D) = 12 + 6 = 18$
$d(A,C) + d(B,D) = 14 + 12 = 26$
$d(A,D) + d(B,C) = 14 + 12 = 26$
$d(A,B) + d(C,D) \le d(A,C) + d(B,D) = d(A,D) + d(B,C)$, that is, $18 \le 26 = 26$
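Non-Additive matrix
[Distance matrix over A, B, C, D with entries 2 and 3.]
$d(A,B) + d(C,D) = 2 + 2 = 4$
$d(A,C) + d(B,D) = 2 + 2 = 4$
$d(A,D) + d(B,C) = 2 + 3 = 5$
The two largest sums, 5 and 4, are not equal, so no labeling of the four species satisfies the criterion.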
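The criterion is mechanical to check: for every four taxa, compute the three pair sums and test whether the two largest are equal. This is a minimal sketch; the function names are illustrative, and the two matrices are assumptions that read the individual distances off the order of the terms in the sums above (taxa A, B, C, D map to indices 0 to 3).

```python
from itertools import combinations

def four_point_ok(d, quartet, tol=1e-9):
    """True if the two largest of the three pair sums are equal."""
    a, b, c, e = quartet
    sums = sorted([d[a][b] + d[c][e], d[a][c] + d[b][e], d[a][e] + d[b][c]])
    return abs(sums[2] - sums[1]) <= tol

def is_additive(d, tol=1e-9):
    """Check the four-point criterion on every set of four taxa."""
    return all(four_point_ok(d, q, tol)
               for q in combinations(range(len(d)), 4))

# Assumed entries, taken from the additive example (sums 18, 26, 26).
additive = [[0, 12, 14, 14],
            [12, 0, 12, 12],
            [14, 12, 0, 6],
            [14, 12, 6, 0]]

# Assumed entries, taken from the non-additive example (sums 4, 4, 5).
non_additive = [[0, 2, 2, 2],
                [2, 0, 3, 2],
                [2, 3, 0, 2],
                [2, 2, 2, 0]]

print(is_additive(additive))      # True
print(is_additive(non_additive))  # False
```

Sorting the three sums avoids trying every relabeling of the four taxa: some labeling satisfies the criterion exactly when the two largest sums coincide.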
Ultrametric Distance Matrix A distance matrix is called ultrametric if there exists a tree corresponding to the matrix’s distances, in which all leaves have equal distance from the root. Notice that by definition, ultrametric is a special case of additive.
Some definitions
The “3 point criterion”: like the additive case, ultrametricity has an equivalent definition. A matrix is ultrametric if every 3 taxa can be labeled $x, y, z$ so that:
$d(x,y) \le d(x,z) = d(y,z)$
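In its standard form, the three-point criterion says that among the three distances of any triple, the two largest are equal. The sketch below checks that directly; `is_ultrametric` and both small matrices are hypothetical illustrations, not taken from the slides.

```python
from itertools import combinations

def is_ultrametric(d, tol=1e-9):
    """Three-point criterion: in every triple the two largest distances match."""
    for i, j, k in combinations(range(len(d)), 3):
        _, middle, largest = sorted([d[i][j], d[i][k], d[j][k]])
        if abs(largest - middle) > tol:
            return False
    return True

# Hypothetical 3-taxon matrices for illustration.
clock_like = [[0, 2, 4],
              [2, 0, 4],
              [4, 4, 0]]
not_clock_like = [[0, 4, 6],
                  [4, 0, 8],
                  [6, 8, 0]]

print(is_ultrametric(clock_like))      # True
print(is_ultrametric(not_clock_like))  # False
```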
Some definitions
Ultrametric distance: for example, this is an ultrametric tree:
[Figure: an ultrametric tree.]
UPGMA algorithm
UPGMA: Unweighted Pair Group Method with Arithmetic Mean
Input: a distance matrix $D$. Each cell $[i,j]$ holds the distance $d(i,j)$ between species $i$ and species $j$.
Output: an ultrametric phylogenetic tree $T$ with labeled leaves.
UPGMA algorithm
Input: $D[n \times n]$, a distance matrix
Initialize: $T = \{C_1, \ldots, C_n\}$
While $|T| > 1$, cluster taxa:
    Pick the shortest distance $d(C_i, C_j)$
    $C \leftarrow C_i \cup C_j$
    Define a node at height $d(C_i, C_j) / 2$
    $T \leftarrow T \setminus \{C_i, C_j\}$
    $T \leftarrow T \cup \{C\}$
    Update $D$: $\forall C_k \in T$, $C_k \ne C$:
        $d(C, C_k) = \frac{d(C_i, C_k)\,|C_i| + d(C_j, C_k)\,|C_j|}{|C_i| + |C_j|}$
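The loop above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `upgma`, the nested-tuple tree format, and the tie-breaking rule (first minimum found) are choices made here. The example matrix is also an assumption: it is one matrix consistent with the distances used in the worked example that follows, and its individual entries are not stated in the text.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa with UPGMA.

    names: list of taxon labels; dist: symmetric matrix (list of lists).
    Returns (tree, height): tree is a nested tuple of labels and
    height is the height of the root.
    """
    # Each cluster id maps to (subtree, number of leaves, node height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    d = {frozenset((i, j)): dist[i][j]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)

    while len(clusters) > 1:
        # Pick the closest pair of clusters (ties: first minimum found).
        pair = min(d, key=d.get)
        i, j = sorted(pair)
        tree_i, size_i, _ = clusters.pop(i)
        tree_j, size_j, _ = clusters.pop(j)
        height = d.pop(pair) / 2

        # Size-weighted average distance from the new cluster to the rest.
        for k in clusters:
            d_ik = d.pop(frozenset((i, k)))
            d_jk = d.pop(frozenset((j, k)))
            d[frozenset((next_id, k))] = (
                (d_ik * size_i + d_jk * size_j) / (size_i + size_j))

        clusters[next_id] = ((tree_i, tree_j), size_i + size_j, height)
        next_id += 1

    tree, _, height = next(iter(clusters.values()))
    return tree, height

# Assumed example matrix over taxa A to F (see the note above).
NAMES = ["A", "B", "C", "D", "E", "F"]
D = [[0, 2, 4, 6, 6, 8],
     [2, 0, 4, 6, 6, 8],
     [4, 4, 0, 6, 6, 8],
     [6, 6, 6, 0, 4, 8],
     [6, 6, 6, 4, 0, 8],
     [8, 8, 8, 8, 8, 0]]

tree, root_height = upgma(NAMES, D)
print(tree)
print(root_height)  # 4.0
```

Distances are stored in a dictionary keyed by unordered pairs, so merging two clusters only touches the entries that involve them, which mirrors the statement that all other distances stay the same.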
UPGMA Example
Given the distance matrix below, build a phylogenetic tree using UPGMA.
[Distance matrix over taxa A to F; its entries take the values 2, 4, 6, and 8.]
Example
We begin by choosing a minimal distance and clustering the two nodes it joins (here A and B).
UPGMA Example Then we calculate the distances between our new cluster and all the rest of the nodes, to create an updated distance matrix D. The distances not including our cluster’s nodes remain exactly the same.
Example
The updated distance matrix:
[Distance matrices before and after merging A and B; the updated matrix is over AB, C, D, E, F.]
$d(C, C_k) = \frac{d(C_i, C_k)\,|C_i| + d(C_j, C_k)\,|C_j|}{|C_i| + |C_j|}$
$d(AB, C_k) = \frac{d(A, C_k)\,|A| + d(B, C_k)\,|B|}{|A| + |B|}$
$d(AB, C) = \frac{4\,|A| + 4\,|B|}{|A| + |B|} = 4$
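The update rule is a size-weighted average, which a one-line helper makes explicit. The name `merged_distance` is illustrative; the second call uses hypothetical numbers to show the effect of unequal cluster sizes.

```python
def merged_distance(d_ik, d_jk, size_i, size_j):
    """Size-weighted average distance from the merged cluster to C_k."""
    return (d_ik * size_i + d_jk * size_j) / (size_i + size_j)

# d(AB, C) from the example: |A| = |B| = 1 and both distances are 4.
print(merged_distance(4, 4, 1, 1))  # 4.0

# Hypothetical values: a 2-leaf cluster at distance 6 and a 1-leaf cluster at 9.
print(merged_distance(6, 9, 2, 1))  # 7.0
```

Weighting by cluster size means every original leaf contributes equally to the average, which is the "unweighted" in the method's name.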
Example
Now we repeat the same procedure until only one cluster is left.
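Example
[Distance matrices before and after merging D and E; the updated matrix is over AB, C, DE, F.]
$d(C, C_k) = \frac{d(C_i, C_k)\,|C_i| + d(C_j, C_k)\,|C_j|}{|C_i| + |C_j|}$
$d(DE, AB) = \frac{d(D, AB)\,|D| + d(E, AB)\,|E|}{|D| + |E|} = \frac{6 + 6}{2} = 6$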
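Example
[Distance matrices before and after merging AB and C; the updated matrix is over (AB,C), DE, F.]
$d(C, C_k) = \frac{d(C_i, C_k)\,|C_i| + d(C_j, C_k)\,|C_j|}{|C_i| + |C_j|}$
$d(ABC, DE) = \frac{d(AB, DE)\,|AB| + d(C, DE)\,|C|}{|AB| + |C|} = \frac{6 \cdot 2 + 6 \cdot 1}{3} = 6$
$d(ABC, F) = \frac{d(AB, F)\,|AB| + d(C, F)\,|C|}{|AB| + |C|} = \frac{8 \cdot 2 + 8 \cdot 1}{3} = 8$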
Example
[Distance matrices before and after merging (AB,C) and DE; the updated matrix is over ((AB,C),DE) and F.]
$d(ABCDE, F) = \frac{d(ABC, F)\,|ABC| + d(DE, F)\,|DE|}{|ABC| + |DE|} = \frac{8 \cdot 3 + 8 \cdot 2}{5} = 8$
UPGMA Example
Our output tree! Lovely, isn’t it?
[Figure: the output tree.]
Can there be more than one tree?
UPGMA Example
[The original distance matrix over taxa A to F.]
What can be said about the distance matrix? Is it additive? Ultrametric?
UPGMA downfalls
UPGMA always returns an ultrametric tree: it assumes that all species mutate at the same rate (the molecular clock).
What happens if we try to reconstruct a tree such as this one?
[Figure: the example tree.]
UPGMA downfalls
This tree corresponds to the following distance matrix:
[Distance matrix over taxa A to F; its entries take the values 4, 5, 6, 7, 8, 9, 10, and 11.]
UPGMA downfalls
If we run UPGMA on the matrix shown, we get this output:
[Figure: the UPGMA output tree.]
Compared to the original tree:
[Figure: the original tree.]
UPGMA downfalls
UPGMA returns the right tree if the distance matrix is ultrametric. Even then, we can’t be certain the original tree was also ultrametric.
If the distance matrix $D$ is not ultrametric (in particular, if it is not additive), UPGMA generates a heuristic solution whose leaf-to-leaf distances do not fit $D$.
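A small worked example shows how the fit can fail even for a matrix that is additive but not ultrametric. The three-taxon matrix below is hypothetical and chosen only for round numbers; $d_T$ denotes the leaf-to-leaf distance in the UPGMA tree, which is twice the height of the lowest common ancestor.

```latex
\begin{align*}
&\text{Input: } d(A,B) = 4,\quad d(A,C) = 6,\quad d(B,C) = 8 \\
&\text{Step 1: merge } A, B \text{ at height } 4/2 = 2 \\
&\text{Update: } d(AB, C) = \frac{6 \cdot 1 + 8 \cdot 1}{1 + 1} = 7 \\
&\text{Step 2: merge } AB, C \text{ at height } 7/2 = 3.5 \\
&\text{Result: } d_T(A,B) = 4,\quad d_T(A,C) = d_T(B,C) = 7
\end{align*}
```

The input is additive (a star tree with edge lengths 1, 3, and 5 to A, B, and C reproduces it exactly), yet the UPGMA tree reports 7 for both $d(A,C) = 6$ and $d(B,C) = 8$.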