Phylogenetics I.


Evolution
The evolution of new organisms is driven by:
- Mutations: the DNA sequence can change through single-base changes, deletion or insertion of DNA segments, etc.
- Selection bias

Theory of Evolution
Basic idea: speciation events lead to the creation of different species.
- Speciation is caused by physical separation into groups in which different genetic variants become dominant.
- Any two species share a (possibly distant) common ancestor.

The Tree of Life

Primate evolution
A phylogeny, also called a phylogenetic tree, is a tree that describes the sequence of speciation events that led to the formation of a set of present-day species.

Morphological vs. Molecular
- Classical phylogenetic analysis uses morphological features: number of legs, lengths of legs, etc.
- Modern biological methods make it possible to use molecular features: gene sequences and protein sequences.

Morphological topology (based on McKenna and Bell, 1997)
[Figure: tree with clade labels Archonta, Glires, Ungulata, Carnivora, Insectivora, Xenarthra.]
Leaves, in figure order: Bonobo, Chimpanzee, Man, Gorilla, Sumatran orangutan, Bornean orangutan, Common gibbon, Barbary ape, Baboon, White-fronted capuchin, Slow loris, Tree shrew, Japanese pipistrelle, Long-tailed bat, Jamaican fruit-eating bat, Horseshoe bat, Little red flying fox, Ryukyu flying fox, Mouse, Rat, Vole, Cane-rat, Guinea pig, Squirrel, Dormouse, Rabbit, Pika, Pig, Hippopotamus, Sheep, Cow, Alpaca, Blue whale, Fin whale, Sperm whale, Donkey, Horse, Indian rhino, White rhino, Elephant, Aardvark, Grey seal, Harbor seal, Dog, Cat, Asiatic shrew, Long-clawed shrew, Small Madagascar hedgehog, Hedgehog, Gymnure, Mole, Armadillo, Bandicoot, Wallaroo, Opossum, Platypus.

From sequences to a phylogenetic tree
Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG
There are many possible types of sequences to use (e.g. mitochondrial vs. nuclear proteins).

Mitochondrial topology (based on Pupko et al.)
[Figure: tree with clade labels Perissodactyla, Carnivora, Cetartiodactyla, Chiroptera, Moles+Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia, Primates, Rodentia 1, Rodentia 2, Hedgehogs.]
Leaves, in figure order: Donkey, Horse, Indian rhino, White rhino, Grey seal, Harbor seal, Dog, Cat, Blue whale, Fin whale, Sperm whale, Hippopotamus, Sheep, Cow, Alpaca, Pig, Little red flying fox, Ryukyu flying fox, Horseshoe bat, Japanese pipistrelle, Long-tailed bat, Jamaican fruit-eating bat, Asiatic shrew, Long-clawed shrew, Mole, Small Madagascar hedgehog, Aardvark, Elephant, Armadillo, Rabbit, Pika, Tree shrew, Bonobo, Chimpanzee, Man, Gorilla, Sumatran orangutan, Bornean orangutan, Common gibbon, Barbary ape, Baboon, White-fronted capuchin, Slow loris, Squirrel, Dormouse, Cane-rat, Guinea pig, Mouse, Rat, Vole, Hedgehog, Gymnure, Bandicoot, Wallaroo, Opossum, Platypus.

Nuclear topology (based on Pupko et al. slide; tree by Madsen)
[Figure: tree with four numbered groups (1 to 4) and clade labels Cetartiodactyla, Afrotheria, Chiroptera, Eulipotyphla, Glires, Xenarthra, Carnivora, Perissodactyla, Scandentia + Dermoptera, Pholidota, Primate.]
Leaves, in figure order: Round Eared Bat, Flying Fox, Hedgehog, Mole, Pangolin, Whale, Hippo, Cow, Pig, Cat, Dog, Horse, Rhino, Rat, Capybara, Rabbit, Flying Lemur, Tree Shrew, Human, Galago, Sloth, Hyrax, Dugong, Elephant, Aardvark, Elephant Shrew, Opossum, Kangaroo.

Phylogenetic trees
- Leaves: current-day species (or taxa, the plural of taxon), e.g. Aardvark, Bison, Chimp, Dog, Elephant
- Internal vertices: hypothetical common ancestors
- Edge lengths: "time" from one speciation to the next

Twists in molecular phylogenies
Gene or protein sequences can be homologous for several different reasons:
- Orthologs: sequences that diverged after a speciation event
- Paralogs: sequences that diverged after a duplication event
- Xenologs: sequences that diverged after a horizontal transfer (e.g., by a virus)

Paralogs
Consider the evolutionary tree of three taxa 1, 2, 3, and assume that at some point in the past a gene duplication event occurred.
[Figure: species tree with leaves 1, 2, 3 and a marked gene duplication.]

Paralogs
The evolution of the gene is described by this tree (A and B are the two copies of the same gene).
[Figure: gene tree with a gene duplication followed by speciation events; leaves 1A, 2A, 3A, 3B, 2B, 1B.]

Paralogs
If we happen to consider genes 1A, 2B, and 3A of species 1, 2, 3, we get a wrong tree that does not represent the phylogeny of the host species.
[Figure: gene tree with a gene duplication and speciation events marked S; leaves 1A, 2A, 3A, 3B, 2B, 1B.]

Types of Trees
A natural model to consider is that of rooted trees.
[Figure: rooted tree with the root labelled "Common Ancestor".]

Types of trees
An unrooted tree represents the same phylogeny without the root node.
Depending on the model, data from current-day species do not distinguish between different placements of the root.

Rooted versus unrooted trees
[Figure: rooted trees "Tree a", "Tree b", "Tree c" and an unrooted tree with leaves a, b, c.]
The unrooted tree represents the three rooted trees.

Total numbers of trees
For n taxa:
- Rooted bifurcating trees: $(2n-3)!! = \dfrac{(2n-3)!}{2^{n-2}\,(n-2)!}$
- Unrooted bifurcating trees: $(2n-5)!!$
[Figure: tree shapes.]
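To get a feel for how fast these counts grow, the two formulas can be evaluated directly. The sketch below is only an illustration; the function names are made up for this example and are not part of the slides.

```python
import math

def double_factorial(k: int) -> int:
    """k!! = k * (k - 2) * (k - 4) * ...; taken to be 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def count_rooted(n: int) -> int:
    """Rooted bifurcating trees on n >= 2 labelled leaves: (2n-3)!!."""
    return double_factorial(2 * n - 3)

def count_unrooted(n: int) -> int:
    """Unrooted bifurcating trees on n >= 3 labelled leaves: (2n-5)!!."""
    return double_factorial(2 * n - 5)

def count_rooted_closed_form(n: int) -> int:
    """Same rooted count via (2n-3)! / (2^(n-2) * (n-2)!)."""
    return math.factorial(2 * n - 3) // (2 ** (n - 2) * math.factorial(n - 2))

for n in (3, 4, 5, 10):
    print(n, count_rooted(n), count_unrooted(n))
```

Already for 10 taxa there are tens of millions of rooted trees, which is why exhaustive search over topologies is not practical.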

Positioning Roots in Unrooted Trees
We can estimate the position of the root by introducing an outgroup: a set of species that are definitely distant from all the species of interest.
[Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant and outgroup Falcon, with the proposed root marked.]

Type of Data
- Distance-based: the input is a matrix of distances between species. A distance can be the fraction of residues on which two sequences disagree, or the alignment score between them, or …
- Character-based: examine each character (e.g., residue) separately.

Two methods of tree construction
- Distance: a weighted tree that realizes the distances between the objects.
- Parsimony: a tree with the minimum total number of character changes between nodes.
We start with distance-based methods, considering the following question: given a set of species (leaves in a supposed tree) and the distances between them, construct a phylogeny that best "fits" the distances.

Distance Matrix
Given n species, we can compute the n x n distance matrix $D_{ij}$.
$D_{ij}$ may be defined as the edit distance between a gene in species i and the same gene in species j, where the gene of interest is sequenced for all n species.
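As a rough illustration of such a matrix, the sketch below computes unit-cost edit distances between the four short protein fragments from the "From sequences to a phylogenetic tree" slide. Unit costs for substitution, insertion and deletion are an assumption made here; the slides do not fix a scoring scheme.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance with unit cost per substitution, insertion or deletion."""
    prev = list(range(len(t) + 1))           # distances from "" to prefixes of t
    for i, a in enumerate(s, start=1):
        curr = [i]                           # distance from s[:i] to ""
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,         # delete a
                            curr[j - 1] + 1,     # insert b
                            prev[j - 1] + cost)) # match or substitute
        prev = curr
    return prev[-1]

def distance_matrix(seqs):
    """n x n matrix of pairwise edit distances."""
    return [[edit_distance(s, t) for t in seqs] for s in seqs]

names = ["Rat", "Rabbit", "Gorilla", "Cat"]
seqs = ["QEPGGLVVPPTDA", "QEPGGMVVPPTDA", "QEPGGLVVPPTDA", "REPGGLVVPPTEG"]
D = distance_matrix(seqs)
for name, row in zip(names, D):
    print(f"{name:8s}", row)
```

Real analyses would typically use an alignment-based or model-corrected distance rather than raw edit distance; this only shows the shape of the input that distance-based methods expect.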

The distance between two sequences
- Protein sequences: PAM, BLOSUM
- DNA sequences: Jukes-Cantor, HKY, Kimura 2-Parameter

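General Stationary Time-reversible Model
Rate matrix (rows and columns in the order A, C, G, T; diagonal elements such that rows sum to zero):
$$R = \begin{pmatrix} \cdot & \pi_C r_{CA} & \pi_G r_{GA} & \pi_T r_{TA} \\ \pi_A r_{AC} & \cdot & \pi_G r_{GC} & \pi_T r_{TC} \\ \pi_A r_{AG} & \pi_C r_{CG} & \cdot & \pi_T r_{TG} \\ \pi_A r_{AT} & \pi_C r_{CT} & \pi_G r_{GT} & \cdot \end{pmatrix}$$
Time reversibility: $\pi_i r_{ij} = \pi_j r_{ji}$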

General Stationary Time-reversible Model
$$P(t) = e^{Rt}$$
Given rates, one can find transition probabilities, and vice versa.
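The relation between rates and transition probabilities can be checked numerically. The sketch below approximates the matrix exponential with a truncated Taylor series (a simple choice made here, adequate only for small matrices with modest entries) and applies it to a Jukes-Cantor rate matrix with off-diagonal rate u/3, comparing the result with the well-known Jukes-Cantor closed form. All function names are illustrative.

```python
import math

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(M, terms=40):
    """Approximate e^M by the truncated Taylor series sum_k M^k / k!."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]        # holds M^k / k!, starting at k = 0
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def jukes_cantor_rate_matrix(u):
    """Off-diagonal entries u/3; diagonal -u so that each row sums to zero."""
    return [[-u if i == j else u / 3 for j in range(4)] for i in range(4)]

def transition_probabilities(R, t):
    """P(t) = e^{Rt}."""
    return mat_exp([[r * t for r in row] for row in R])

u, t = 1.0, 0.5
P = transition_probabilities(jukes_cantor_rate_matrix(u), t)
p_same = 0.25 + 0.75 * math.exp(-4 * u * t / 3)   # closed form, same base
p_diff = 0.25 - 0.25 * math.exp(-4 * u * t / 3)   # closed form, different base
print(P[0][0], p_same)
print(P[0][1], p_diff)
```

A production implementation would more likely rely on an eigendecomposition or a library routine for the matrix exponential; the series is used here only to keep the example self-contained.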

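Jukes-Cantor
Rate matrix (diagonal elements such that rows sum to zero, i.e. $-u$):
$$R = \begin{pmatrix} \cdot & u/3 & u/3 & u/3 \\ u/3 & \cdot & u/3 & u/3 \\ u/3 & u/3 & \cdot & u/3 \\ u/3 & u/3 & u/3 & \cdot \end{pmatrix}$$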

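Jukes-Cantor
$$P(\text{no mutation}) = e^{-\frac{4}{3}ut}$$
$$P(\text{at least one mutation}) = 1 - e^{-\frac{4}{3}ut}$$
$$D_s = \tfrac{3}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$$
$$D = ut = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}D_s\right)$$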
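The last two formulas are inverses of each other: one maps the true amount of change ut to the expected fraction of differing sites D_s, the other corrects an observed D_s back to a distance. A minimal sketch, with function names chosen here for illustration:

```python
import math

def expected_difference(ut: float) -> float:
    """D_s = 3/4 * (1 - exp(-4/3 * ut)): expected fraction of differing sites."""
    return 0.75 * (1.0 - math.exp(-4.0 * ut / 3.0))

def jukes_cantor_distance(ds: float) -> float:
    """D = -3/4 * ln(1 - 4/3 * D_s): corrected distance from observed D_s."""
    if not 0.0 <= ds < 0.75:
        # At D_s >= 3/4 the sequences look random and the logarithm is undefined.
        raise ValueError("D_s must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * ds / 3.0)

ds = 1 / 13   # e.g. one differing site out of 13 aligned sites
print(ds, jukes_cantor_distance(ds))
```

The corrected distance is always at least as large as the observed fraction, because multiple substitutions at the same site are hidden in D_s.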

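Kimura 2-Parameter
Rate matrix (rows and columns in the order A, C, G, T; a is the transition rate, b the transversion rate):
$$\begin{pmatrix} \cdot & b & a & b \\ b & \cdot & b & a \\ a & b & \cdot & b \\ b & a & b & \cdot \end{pmatrix}$$
$R = a/(2b)$ = transition/transversion bias (each base has one possible transition and two possible transversions).
Time is scaled so that $a + 2b = 1$ per unit time.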

Kimura 2-Parameter
$$a = \frac{R}{R+1}, \qquad b = \frac{0.5}{R+1}$$
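These two expressions can be sanity-checked numerically: they give a + 2b = 1 and a/(2b) = R. The sketch below also builds the corresponding rate matrix, assuming the usual convention that A<->G and C<->T are transitions; the function names are illustrative only.

```python
def k2p_rates(R: float):
    """Transition rate a and transversion rate b for a given bias R."""
    a = R / (R + 1.0)
    b = 0.5 / (R + 1.0)
    return a, b

def k2p_rate_matrix(R: float):
    """4 x 4 rate matrix in the order A, C, G, T; rows sum to zero."""
    a, b = k2p_rates(R)
    bases = "ACGT"
    transitions = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
    Q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(bases):
        for j, y in enumerate(bases):
            if x != y:
                Q[i][j] = a if (x, y) in transitions else b
        Q[i][i] = -sum(Q[i])   # diagonal is still 0.0 here, so this sums the off-diagonals
    return Q

a, b = k2p_rates(2.0)
print(a, b, a + 2 * b, a / (2 * b))
```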

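HKY (Hasegawa, Kishino, Yano)
Rate matrix (rows and columns in the order A, C, G, T; diagonal elements such that rows sum to zero):
$$R = \begin{pmatrix} \cdot & \mu\pi_C & \mu\kappa\pi_G & \mu\pi_T \\ \mu\pi_A & \cdot & \mu\pi_G & \mu\kappa\pi_T \\ \mu\kappa\pi_A & \mu\pi_C & \cdot & \mu\pi_T \\ \mu\pi_A & \mu\kappa\pi_C & \mu\pi_G & \cdot \end{pmatrix}$$
$\kappa$ = transition/transversion rate ratio (it multiplies the transition entries A<->G and C<->T).
Some rules of thumb:
- Use simpler models with shorter sequences (< 200 bp).
- Otherwise, use a model as complex as necessary.
- Compare results from more than one method.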

Distances in Trees
Edges may have weights reflecting:
- the number of mutations on the evolutionary path from one species to another
- a time estimate for the evolution of one species into another
In a tree T, we often compute $d_{ij}(T)$, the length of the path between leaves i and j.

Distance in Trees: an Example
$$d_{1,4} = 12 + 13 + 14 + 17 + 12 = 68$$
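The path length between two leaves can be computed by walking the unique path that joins them. The figure for this slide is not part of the transcript, so the tree below is hypothetical: only the five edge weights on the path from leaf 1 to leaf 4 are taken from the example, and the internal node names (a to d) and the remaining edges are invented.

```python
def build_tree(edges):
    """Adjacency map {node: {neighbor: weight}} from (u, v, weight) triples."""
    tree = {}
    for u, v, w in edges:
        tree.setdefault(u, {})[v] = w
        tree.setdefault(v, {})[u] = w
    return tree

def path_length(tree, src, dst):
    """Length of the unique path between src and dst in a weighted tree."""
    stack = [(src, None, 0)]                 # (node, node we came from, distance so far)
    while stack:
        node, parent, dist = stack.pop()
        if node == dst:
            return dist
        for nbr, w in tree[node].items():
            if nbr != parent:                # a tree has no cycles, so this suffices
                stack.append((nbr, node, dist + w))
    raise ValueError(f"no path from {src} to {dst}")

# Hypothetical tree: the path 1-a-b-c-d-4 carries the weights 12, 13, 14, 17, 12.
edges = [
    ("1", "a", 12), ("a", "b", 13), ("b", "c", 14), ("c", "d", 17), ("d", "4", 12),
    ("2", "a", 5), ("3", "b", 6), ("5", "c", 7), ("6", "d", 8),
]
tree = build_tree(edges)
print(path_length(tree, "1", "4"))   # 68
```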

Fitting Distance Matrix
Given n species, we can compute the n x n distance matrix $D_{ij}$.
The evolution of these genes is described by a tree that we don't know.
We need an algorithm to construct a tree that best fits the distance matrix $D_{ij}$.

Reconstructing a 3 Leaved Tree
Tree reconstruction for any 3x3 matrix is straightforward.
We have 3 leaves i, j, k and a center vertex c. Observe:
$$d_{ic} + d_{jc} = D_{ij}$$
$$d_{ic} + d_{kc} = D_{ik}$$
$$d_{jc} + d_{kc} = D_{jk}$$

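Reconstructing a 3 Leaved Tree
Adding the first two equations and substituting the third:
$$d_{ic} + d_{jc} = D_{ij}$$
$$d_{ic} + d_{kc} = D_{ik}$$
$$2d_{ic} + d_{jc} + d_{kc} = D_{ij} + D_{ik}$$
$$2d_{ic} + D_{jk} = D_{ij} + D_{ik}$$
$$d_{ic} = (D_{ij} + D_{ik} - D_{jk})/2$$
Similarly,
$$d_{jc} = (D_{ij} + D_{jk} - D_{ik})/2$$
$$d_{kc} = (D_{ki} + D_{kj} - D_{ij})/2$$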
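The three closed-form expressions translate directly into code. A minimal sketch with made-up example distances:

```python
def three_leaf_tree(D_ij: float, D_ik: float, D_jk: float):
    """Edge lengths (d_ic, d_jc, d_kc) from leaves i, j, k to the center vertex c."""
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    return d_ic, d_jc, d_kc

# Illustrative distances (not from the slides).
d_ic, d_jc, d_kc = three_leaf_tree(D_ij=5, D_ik=7, D_jk=8)
print(d_ic, d_jc, d_kc)   # 2.0 3.0 5.0
```

The edge lengths come out non-negative exactly when the three distances satisfy the triangle inequality.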

Trees with > 3 Leaves
An unrooted binary tree with n leaves has 2n-3 edges.
This means that fitting a given tree to a distance matrix D requires solving a system of "n choose 2" equations with 2n-3 variables.
This system does not always have a solution for n > 3.

Additive Distance Matrices
Matrix D is ADDITIVE if there exists a tree T with $d_{ij}(T) = D_{ij}$ for all leaves i, j, and NON-ADDITIVE otherwise.

Distance Based Phylogeny Problem
Goal: reconstruct an evolutionary tree from a distance matrix.
Input: n x n distance matrix $D_{ij}$.
Output: weighted tree T with n leaves fitting D.
If D is additive, this problem has a solution and there is a simple algorithm to solve it.

Using Neighboring Leaves to Construct the Tree
- Find neighboring leaves i and j with parent k.
- Remove the rows and columns of i and j.
- Add a new row and column corresponding to k, where the distance from k to any other leaf m is computed as $D_{km} = (D_{im} + D_{jm} - D_{ij})/2$.
- Compress i and j into k, and iterate the algorithm for the rest of the tree.
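One compression step of this procedure can be sketched as follows. The example assumes that the neighboring pair is already known, and uses a hypothetical additive matrix built from a four-leaf tree (A and B hang off one internal node with edge lengths 2 and 3, C and D off another with lengths 4 and 5, and the middle edge has length 1). The dictionary layout and function names are choices made for this illustration.

```python
def make_matrix(pairs):
    """Symmetric nested dict {a: {b: distance}} from {(a, b): distance}."""
    D = {}
    for (a, b), d in pairs.items():
        D.setdefault(a, {})[b] = d
        D.setdefault(b, {})[a] = d
    return D

def collapse_neighbors(D, i, j, k):
    """Return a new matrix in which neighboring leaves i, j are replaced by parent k."""
    others = [m for m in D if m not in (i, j)]
    reduced = {m: {n: D[m][n] for n in others if n != m} for m in others}
    reduced[k] = {}
    for m in others:
        d_km = (D[i][m] + D[j][m] - D[i][j]) / 2   # D_km = (D_im + D_jm - D_ij) / 2
        reduced[k][m] = d_km
        reduced[m][k] = d_km
    return reduced

D = make_matrix({("A", "B"): 5, ("A", "C"): 7, ("A", "D"): 8,
                 ("B", "C"): 8, ("B", "D"): 9, ("C", "D"): 9})
reduced = collapse_neighbors(D, "A", "B", "k")
print(reduced["k"])   # {'C': 5.0, 'D': 6.0}
```

In the hypothetical tree, the parent of A and B is indeed at distance 1 + 4 = 5 from C and 1 + 5 = 6 from D, so the reduced matrix is again additive.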

Finding Neighboring Leaves
To find neighboring leaves we simply select a pair of closest leaves? WRONG.

Finding Neighboring Leaves
Closest leaves aren't necessarily neighbors: in the figure, i and j are neighbors, but $d_{ij} = 13 > d_{jk} = 12$.
Finding a pair of neighboring leaves is a nontrivial problem!

Neighbor Joining Algorithm
In 1987 Naruya Saitou and Masatoshi Nei developed the neighbor joining algorithm for phylogenetic tree reconstruction.
It finds a pair of leaves that are close to each other but far from the other leaves, and so implicitly finds a pair of neighboring leaves.
Advantages: it works well for additive matrices and also for non-additive ones, and it does not rely on the flawed molecular clock assumption.

Constructing additive trees: The neighbor joining algorithm
Let i, j be neighboring leaves in a tree, let k be their parent, and let m be any other vertex.
The formula $D_{km} = (D_{im} + D_{jm} - D_{ij})/2$ shows that we can compute the distances from k to all other leaves. This suggests the following method to construct a tree from a distance matrix:
- Find neighboring leaves i, j in the tree.
- Replace i, j by their parent k and recursively construct a tree T for the smaller set.
- Add i, j as children of k in T.

Neighbor Finding
How can we find, from distances alone, a pair of nodes that are neighboring leaves?
Closest nodes aren't necessarily neighboring leaves.
[Figure: tree with leaves A, B, C, D.]
Next we show one way to find neighbors from distances.

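Neighbor Finding: Saitou & Nei algorithm
Definitions: [the formulas defining D(i,j) are missing from the transcript]
Theorem (Saitou & Nei): Assume all edge weights are positive. If D(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree. Here D(i,j) is the criterion given by the definitions, not the raw distance between i and j.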
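The definitions for this slide did not survive in the transcript, so the sketch below assumes the standard Saitou & Nei criterion, Q(i,j) = (n - 2) d(i,j) - r_i - r_j with r_i the sum of distances from i to all other leaves; that formula is an assumption here, not text recovered from the slide. The example matrix is a hypothetical additive tree chosen so that d(i,j) = 13 and d(j,k) = 12: leaves i and j are neighbors (edge lengths 11 and 2), k and l are neighbors (edge lengths 6 and 11), and the middle edge has length 4.

```python
from itertools import combinations

def closest_pair(d, leaves):
    """Pair of leaves with the smallest raw distance."""
    return min(combinations(leaves, 2), key=lambda p: d[frozenset(p)])

def neighbor_joining_pair(d, leaves):
    """Pair minimizing the assumed criterion Q(a, b) = (n - 2) d(a, b) - r_a - r_b."""
    n = len(leaves)
    r = {a: sum(d[frozenset((a, b))] for b in leaves if b != a) for a in leaves}

    def q(pair):
        a, b = pair
        return (n - 2) * d[frozenset(pair)] - r[a] - r[b]

    return min(combinations(leaves, 2), key=q)

leaves = ["i", "j", "k", "l"]
d = {frozenset(p): v for p, v in {
    ("i", "j"): 13, ("i", "k"): 21, ("i", "l"): 26,
    ("j", "k"): 12, ("j", "l"): 17, ("k", "l"): 17,
}.items()}

print(closest_pair(d, leaves))            # ('j', 'k'): closest, but not neighbors
print(neighbor_joining_pair(d, leaves))   # ('i', 'j'): a true neighboring pair
```

In this example the criterion ties between (i, j) and (k, l), which are both genuine neighboring pairs; `min` simply returns the first of them.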

Complexity of Neighbor Joining Algorithm
Naive implementation:
- Initialization: $\Theta(L^2)$ to compute d(r,i) and C(i,j) for all i, j ∈ L.
- Each iteration: $O(L^2)$ to find the maximal C(i,j), and $O(L)$ to compute {C(m,k) : m ∈ L} for the new node k.
Total of $O(L^3)$.
[Figure: reference object r, leaf m, new node k, and the quantity C(m,k).]

Complexity of Neighbor Joining Algorithm
Using a heap to store the C(i,j)'s:
Input: distance matrix D = d(i,j), and an arbitrary object r.
- Initialization: $\Theta(L^2)$ to compute the C(i,j)'s and heapify them in a heap H.
- Each iteration:
  - $O(\log L)$ to find and delete the maximal C(i,j) from H.
  - $O(L)$ to add the values {d(k,m)} to D, for all objects m.
  - $O(L)$ to delete {d(m,i), d(m,j)} from D (for all m).
  - $O(L \log L)$ to delete {C(i,m), C(j,m)} from H and add C(k,m) to H, for all objects m.
Total of $O(L^2 \log L)$. (Implementation details are omitted.)

Neighbor Joining Algorithm
- Applicable to matrices that are not additive.
- Known to work well in practice.
- The algorithm and its variants are the most widely used distance-based algorithms today.

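The Four Point Condition
For a quartet of leaves i, j, k, l, compute:
1. $D_{ij} + D_{kl}$
2. $D_{ik} + D_{jl}$
3. $D_{il} + D_{jk}$
[Figure: quartet tree with i, j on one side of the middle edge and k, l on the other.]
Sums 2 and 3 represent the same number: the length of all edges + the middle edge (it is counted twice).
Sum 1 represents a smaller number: the length of all edges – the middle edge.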

The Four Point Condition: Theorem
The four point condition for the quartet i, j, k, l is satisfied if two of these sums are equal and the third sum is smaller than the other two.
Theorem: An n x n matrix D is additive if and only if the four point condition holds for every quartet 1 ≤ i, j, k, l ≤ n.
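The theorem gives a direct, if slow, additivity test: check every quartet. The sketch below assumes D is a symmetric matrix with a zero diagonal, and it accepts the degenerate case where all three sums are equal (a middle edge of length zero), which is slightly more permissive than the strict "smaller" in the statement above. The example matrices are hypothetical.

```python
from itertools import combinations

def four_point_holds(D, i, j, k, l, tol=1e-9):
    """True if the two largest of the three pairwise sums are equal (within tol)."""
    sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
    return abs(sums[2] - sums[1]) <= tol   # the smallest sum is then <= the other two

def is_additive(D, tol=1e-9):
    """Check the four point condition for every quartet of distinct indices."""
    n = len(D)
    return all(four_point_holds(D, *quartet, tol=tol)
               for quartet in combinations(range(n), 4))

# Distances realized by a four-leaf tree with pendant edges 2, 3, 4, 5 and middle edge 1.
additive = [[0, 5, 7, 8],
            [5, 0, 8, 9],
            [7, 8, 0, 9],
            [8, 9, 9, 0]]
# Same matrix with one entry inflated, which breaks the condition.
perturbed = [[0, 5, 7, 8],
             [5, 0, 8, 9],
             [7, 8, 0, 20],
             [8, 9, 20, 0]]
print(is_additive(additive), is_additive(perturbed))   # True False
```

Checking all quartets takes time proportional to n^4, so this is a diagnostic rather than something to run inside a reconstruction loop.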

Least Squares Distance Phylogeny Problem
If the distance matrix D is NOT additive, then we look for a tree T that approximates D the best:
$$\text{Squared Error} = \sum_{i,j} \left(d_{ij}(T) - D_{ij}\right)^2$$
Squared Error is a measure of the quality of the fit between the distance matrix and the tree: we want to minimize it.
Least Squares Distance Phylogeny Problem: finding the best approximation tree T for a non-additive matrix D (NP-hard).
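Evaluating the squared error for one candidate tree is simple once the tree's path lengths are tabulated; the hard part of the problem is the search over trees. The sketch below sums over unordered pairs i < j, which is an assumption about how the sum is meant (summing over ordered pairs would just double the value). Both matrices are hypothetical.

```python
from itertools import combinations

def squared_error(d_tree, D):
    """Sum over unordered pairs i < j of (d_ij(T) - D_ij)^2."""
    n = len(D)
    return sum((d_tree[i][j] - D[i][j]) ** 2
               for i, j in combinations(range(n), 2))

# Path lengths d_ij(T) of a candidate four-leaf tree.
d_tree = [[0, 5, 7, 8],
          [5, 0, 8, 9],
          [7, 8, 0, 9],
          [8, 9, 9, 0]]
# Observed, non-additive distances: two entries differ from the tree by 1.
D_obs = [[0, 4, 7, 8],
         [4, 0, 8, 9],
         [7, 8, 0, 10],
         [8, 9, 10, 0]]
print(squared_error(d_tree, D_obs))   # 2
```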