Phylogenetic trees
Sushmita Roy, BMI/CS 576, Sep 23rd, 2014



Key concepts in this section
- What are phylogenies or phylogenetic trees?
  - Terminology such as extant, ancestral, branch point, branch length, orthologs, paralogs
- Why build phylogenetic trees?
- Algorithms to build phylogenetic trees
  - Distance-based methods
  - Parsimony methods: minimize the number of changes
  - Probabilistic methods: find the tree that best explains the data using probabilistic models

Readings Chapter 7 –

What are phylogenetic trees?
- A tree that describes evolutionary relationships among entities (species, genes, strains)
- This relationship is called a “phylogeny”
- Leaves represent extant (current-day) species
- Internal nodes represent ancestral species
- Phylogenetics: the task of inferring the phylogenetic tree from observations of existing organisms

Why phylogenetic trees?
- Inform multiple sequence alignments
- Identify signatures of sequence conservation
- Understand how organisms are related
  - Do humans and chimpanzees share a more recent common ancestor, or do humans and gorillas?
- Ask how closely organisms are related
  - Humans and chimpanzees shared a common ancestor about 5 million years ago (mya)
- Understand how specific functions/traits have evolved
  - What made us human?
- Conjecture the fate of specific regions of the genome
  - Will the human Y chromosome disappear?

The Tree of Life aims to represent the phylogeny of all species on Earth.

Tracing the evolution of the Ebola virus
- Ebola virus: a lethal human pathogen, fatality rate 78%
- Ebola is spreading now in Africa
  - Until recently, the largest known outbreak was in 1976 (318 cases)
  - This year's outbreak was reported in Feb 2014
  - As of 19 Aug 2014, 1229 deaths have been reported, making it the largest known outbreak in history
- Key questions
  - Where did the pathogen come from?
  - How is it evolving?
- In a 2014 Science paper, researchers reported a whole-genome sequence alignment of 78 Ebola virus samples

Phylogenetic tree of the Ebola virus
[Figure: phylogenetic tree from Gire et al., Science 2014; three recent outbreaks descend from the same ancestor]

Insights gained from sequence comparison (Gire et al., Science 2014)
- “Genetic similarity across the sequenced 2014 samples suggests a single transmission from the natural reservoir, followed by human-to-human transmission during the outbreak”
- “...data suggest that the Sierra Leone outbreak stemmed from the introduction of two genetically distinct viruses from Guinea around the same time...”
- “...the catalog of 395 mutations, including 50 fixed nonsynonymous changes with 8 at positions with high levels of conservation across ebola viruses, provides a starting point for such studies”

Phylogenetic tree basics
- Leaves represent the entities (genes, species, individuals/strains) being compared
  - The term taxon (plural: taxa) refers to these when they represent species and broader classifications of organisms
  - For example, if the taxa are species, the tree is a species tree
- Internal nodes are ancestral units
- Phylogenetic trees can be rooted or unrooted
  - The root represents the common ancestor
- In a rooted tree, the path from the root to a node represents an evolutionary path
  - This gives directionality to evolutionary time
- An unrooted tree specifies relationships among taxa, but does not define evolutionary paths from a common ancestor

Tree basics
[Figure: an unrooted tree and a rooted tree, labeled with branch, branch length, leaf node (extant), and internal node (ancestral)]
- For a species tree, internal nodes represent speciation events
- Each tree topology represents a different evolutionary history
- Branch length describes the evolutionary divergence between two nodes

Orthologs and paralogs
- Orthologs
  - Two sequences in two species that have a common ancestor
  - Diverged due to a speciation event
  - Used to create a “species tree”
- Paralogs
  - Two sequences in the same species that arose from a gene duplication event
  - Captured in a “gene tree”

Tree counting
- A rooted binary tree with n leaf nodes has
  - n-1 internal nodes
  - 2n-2 edges/branches
- An unrooted binary tree with n leaf nodes has
  - n-2 internal nodes
  - 2n-3 edges/branches
  - A root can be added to any of these branches, giving 2n-3 rooted trees for any unrooted tree
- E.g., for n=3 there is one unrooted tree and three rooted trees

Tree counting
[Figure: an unrooted tree, the possible positions for a root, and the resulting rooted trees]

Tree counting
- Instead of adding a root, we could add a branch for the (n+1)th taxon

Tree counting
- A tree with 3 leaves can be grown in (2*3)-3=3 ways to make a tree with 4 leaves
- Each tree with 4 leaves can be grown in (2*4)-3=5 ways to make a tree with 5 leaves
  - So we have 3*5 trees with 5 leaves
- Each tree with 5 leaves can be grown in (2*5)-3=7 ways to make a tree with 6 leaves
  - So we have 3*5*7 trees with 6 leaves
- In general, for n leaves there are $1 \cdot 3 \cdot 5 \cdots (2n-5)$ unrooted trees

Tree counting
- This grows very fast
  - For n=10, there are about 2 million unrooted trees
  - For n=20, there are about $2.2 \times 10^{20}$
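The counting argument can be checked numerically. The following is a minimal sketch, assuming binary trees with labeled leaves; the function names are hypothetical and not part of the lecture.

```python
def count_unrooted_trees(n_leaves: int) -> int:
    """Number of unrooted binary topologies on n labeled leaves: 1*3*5*...*(2n-5)."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, n_leaves):
        # A tree with k leaves has 2k-3 branches, and the next leaf
        # can be attached to any one of them.
        count *= 2 * k - 3
    return count


def count_rooted_trees(n_leaves: int) -> int:
    """A root can be placed on any of the 2n-3 branches of an unrooted tree."""
    return count_unrooted_trees(n_leaves) * (2 * n_leaves - 3)


print(count_unrooted_trees(10))  # 2027025, i.e. about 2 million
print(count_rooted_trees(3))     # 3
```

Python integers have arbitrary precision, so the count for n=20 (about 2.2e20) is computed exactly.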

Constructing phylogenetic trees
- Phylogenetic tree construction: given observations of n taxonomic units, infer the tree that best describes the evolutionary relationships among the units
- Three types of methods
  - Distance-based methods
  - Parsimony methods
  - Probabilistic approaches

Distance-based methods for phylogenetic tree reconstruction
- Given an n x n distance matrix for n units, construct the tree for these n units
- Algorithms
  - UPGMA
  - Neighbor joining
- Assume additivity and sometimes a “molecular clock”
- Additivity means that adding up the branch lengths along the path connecting two nodes in the tree gives the distance between them

Defining distance between sequences
- Fractional alignment mismatch for two sequences i and j: $p_{ij} = m_{ij} / L_{ij}$
  - Gives an estimate of changes per site
  - $m_{ij}$: number of mismatches
  - $L_{ij}$: number of aligned positions compared
  - Assumes that each site has changed at most once, so it underestimates the distance between sequences
  - Assumes all sequences change at the same rate
- Jukes-Cantor distance
  - The simplest evolutionary distance $d_{ij}$ between sequences i and j, computed from the fractional mismatch $p_{ij}$: $d_{ij} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p_{ij}\right)$
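The two distances can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and skipping alignment columns that contain a gap character `-` is an assumption that the slide does not specify.

```python
import math


def p_distance(seq_i: str, seq_j: str) -> float:
    """Fraction of mismatching positions between two aligned sequences."""
    if len(seq_i) != len(seq_j):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: columns with a gap in either sequence are not compared.
    pairs = [(a, b) for a, b in zip(seq_i, seq_j) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    mismatches = sum(a != b for a, b in pairs)
    return mismatches / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4/3 * p)."""
    if p >= 0.75:
        # The logarithm is undefined at or beyond saturation.
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGTACGT", "ACGTACGA")
print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as the raw mismatch fraction, which reflects the correction for multiple changes at the same site.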

UPGMA algorithm for phylogenetic tree reconstruction
- UPGMA: unweighted pair group method using arithmetic averages
- Represent all sequences as the leaf nodes of a tree
- Merge the two closest nodes at a time to create a new node in the tree
  - Set the new node at a height determined by the nodes being merged
  - Recompute the distance between the new node and all other nodes
- Leaf nodes have one sequence; intermediate nodes have multiple sequences
- We call the set of sequences associated with a node i its cluster $C_i$
- Need to compute
  - Distance between two clusters of sequences
  - Height

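Computing distance between clusters
- Let i and j be two nodes
- Let $C_i$ be the cluster of sequences for node i
- Let $C_j$ be the cluster of sequences for node j
- $|C_j|$: number of sequences in $C_j$
- The distance between nodes i and j is the average pairwise distance between their sequences: $d_{ij} = \frac{1}{|C_i||C_j|} \sum_{p \in C_i,\, q \in C_j} d_{pq}$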

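Computing distance from a new node
- Let k be a new node created by merging i and j
- Let $C_i$ be the cluster of sequences for node i
- Let $C_j$ be the cluster of sequences for node j
- The distance $d_{kl}$ between node k and any other node l ($l \neq i$, $l \neq j$) is the average pairwise distance between $C_k = C_i \cup C_j$ and $C_l$
- This is equal to $d_{kl} = \frac{d_{il}|C_i| + d_{jl}|C_j|}{|C_i| + |C_j|}$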

UPGMA algorithm
- Input
  - n sequences
  - Distance matrix for all pairs of the n sequences, $d_{ij}$
- Output
  - Tree T
- Initialization
  - Assign each sequence i to its own cluster $C_i$
  - Define one leaf of T for each sequence
- Iterate until only two clusters remain
  - Find the two clusters $C_i$ and $C_j$ that have the smallest $d_{ij}$
  - Define a new cluster $C_k = C_i \cup C_j$ and compute $d_{kl}$ for every other cluster l
  - Define a node k with daughters i and j, and place it at height $d_{ij}/2$
  - Add k to the set of clusters; remove i and j from it
- Terminate
  - When only two clusters $C_i$ and $C_j$ remain, place the root at height $d_{ij}/2$
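The steps above can be sketched as a short function. This is a minimal illustration, not the course's reference implementation: the name `upgma`, the nested-tuple tree representation, and the tie-breaking rule (first closest pair in cluster-id order) are assumptions. The loop merges until a single cluster is left, which is equivalent to stopping at two clusters and placing the root at half their distance.

```python
from itertools import combinations


def upgma(labels, matrix):
    """Return (tree, root_height); tree is a nested tuple of leaf labels."""
    # Active clusters: id -> (subtree, number of leaves, node height)
    clusters = {i: (label, 1, 0.0) for i, label in enumerate(labels)}
    d = {frozenset((i, j)): matrix[i][j]
         for i, j in combinations(range(len(labels)), 2)}
    next_id = len(labels)
    while len(clusters) > 1:
        # Closest pair of active clusters (ties: first pair in id order).
        i, j = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        (tree_i, n_i, _), (tree_j, n_j, _) = clusters.pop(i), clusters.pop(j)
        height = d[frozenset((i, j))] / 2
        k = next_id
        next_id += 1
        for l in clusters:
            # Size-weighted average of the distances from i and j to l.
            d[frozenset((k, l))] = (n_i * d[frozenset((i, l))]
                                    + n_j * d[frozenset((j, l))]) / (n_i + n_j)
        clusters[k] = ((tree_i, tree_j), n_i + n_j, height)
    tree, _, height = next(iter(clusters.values()))
    return tree, height


labels = ["A", "B", "C", "D", "E"]
matrix = [[0, 8, 8, 5, 3],
          [8, 0, 3, 8, 8],
          [8, 3, 0, 8, 8],
          [5, 8, 8, 0, 5],
          [3, 8, 8, 5, 0]]
print(upgma(labels, matrix))
```

The matrix used here is the one from the UPGMA example slide, so the merge order can be compared against the intermediate matrices shown there.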

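UPGMA example
Initial state:

       A  B  C  D  E
    A  0  8  8  5  3
    B     0  3  8  8
    C        0  8  8
    D           0  5
    E              0

After one merge (A and E):

        AE  B  C  D
    AE   0  8  8  5
    B       0  3  8
    C          0  8
    D             0

[Figure: trees over leaves A, E, D, B, C for the initial state and after one merge, with an example calculation]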
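As a worked instance of the first merge, assuming the standard UPGMA update in which the distance from a merged cluster is the size-weighted average of its members' distances: A and E are merged (their distance, 3, is the smallest), both clusters have size 1, and the original distances are $d_{AB} = d_{EB} = 8$ and $d_{AD} = d_{ED} = 5$.

$$
d_{(AE)B} = \frac{1 \cdot d_{AB} + 1 \cdot d_{EB}}{1 + 1} = \frac{8 + 8}{2} = 8,
\qquad
d_{(AE)D} = \frac{1 \cdot d_{AD} + 1 \cdot d_{ED}}{1 + 1} = \frac{5 + 5}{2} = 5
$$

The new node joining A and E is placed at height $d_{AE}/2 = 3/2 = 1.5$.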

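UPGMA example (cont.)
After two merges (B and C):

        AE  BC  D
    AE   0   8  5
    BC       0  8
    D           0

After three merges (AE and D):

         AED  BC
    AED    0   8
    BC         0

Final state: AED and BC are joined at the root, giving the tree (((A, E), D), (B, C)).
[Figure: trees over leaves A, E, D, B, C after each merge]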

UPGMA relies on the molecular clock assumption
- Sequences diverge at the same rate at all points in the phylogeny
- The distance from any leaf to the root is the same
- If this is true, the distances are said to have an “ultrametric” property
- This assumption is rarely true in practice

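The molecular clock assumption & ultrametric data
- Ultrametric data: for any triplet of sequences i, j, k, the three pairwise distances are either all equal, or two are equal and the remaining one is smaller
- Example of an ultrametric distance matrix:

       A  B  C  D  E
    A  0  8  8  5  3
    B     0  3  8  8
    C        0  8  8
    D           0  5
    E              0

[Figure: the corresponding tree over leaves A, E, D, B, C]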
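The triplet condition is easy to test directly. A minimal sketch, assuming a full symmetric matrix given as a list of lists; the function name and the tolerance parameter are hypothetical.

```python
from itertools import combinations


def is_ultrametric(matrix, tol=1e-9):
    """Check the three-point condition on a symmetric distance matrix."""
    n = len(matrix)
    for i, j, k in combinations(range(n), 3):
        smallest, middle, largest = sorted(
            (matrix[i][j], matrix[i][k], matrix[j][k]))
        # The two largest distances must be equal; the third may be
        # smaller or equal.
        if abs(largest - middle) > tol:
            return False
    return True


example = [[0, 8, 8, 5, 3],
           [8, 0, 3, 8, 8],
           [8, 3, 0, 8, 8],
           [5, 8, 8, 0, 5],
           [3, 8, 8, 5, 0]]
print(is_ultrametric(example))                            # True
print(is_ultrametric([[0, 2, 5], [2, 0, 4], [5, 4, 0]]))  # False
```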

Problems with the molecular clock assumption
[Figure: the actual tree over leaves 1, 2, 3, 4 compared with the tree constructed by UPGMA]

Neighbor joining
- The ultrametric assumption is too strong
  - Most sequences diverge at different rates
- A more relaxed requirement is additivity
  - The distance between a pair of species/nodes equals the sum of the branch lengths on the path between them
- Uses a similar idea to UPGMA to construct trees
  - That is, it considers pairs of nodes and joins them
- Produces unrooted trees

How to select nodes for joining?
- Given all pairwise distances for n sequences, let $d_{ij}$ denote the distance between nodes i and j
- Should we select the node pair with the smallest $d_{ij}$?
- For the tree in the figure, this gives an incorrect tree
[Figure: example tree over leaves A, B, C, D]

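Selecting nodes to join
- Neighbor joining corrects the distance to account for the distances to all other nodes. The corrected distance is denoted $D_{ij}$; in the standard formulation, $D_{ij} = d_{ij} - (r_i + r_j)$
- $r_i$: average distance from all other leaves, in the standard formulation $r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik}$
- $|L|$: number of leaves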

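Defining the distance to a new node
[Figure: nodes i and j joined at a new node k, and an existing node m at unknown distance d_km from k]
- Given $d_{ij}$, $d_{im}$, $d_{jm}$, how do we calculate the distance from an existing node m to the new node k?
- By additivity, $d_{im} = d_{ik} + d_{km}$, $d_{jm} = d_{jk} + d_{km}$ and $d_{ij} = d_{ik} + d_{jk}$, so $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$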

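Updating distances in neighbor joining
- Calculate the distance from a leaf to its parent node so that it takes into account the distances to all other leaves; in the standard formulation, $d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$ and $d_{jk} = d_{ij} - d_{ik}$
- where $r_i = \frac{1}{|L| - 2} \sum_{m \in L} d_{im}$ and L is the set of leaves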

Algorithm for NJ
- Initialization
  - Let T be the set of leaf nodes
  - L = T
  - Estimate $r_i$ for all i in L
  - Estimate $D_{ij}$
- Iteration
  - Pick a pair i, j from L such that $D_{ij}$ is smallest
  - Define a new node k and estimate $d_{km}$ for every other node m in L
  - Estimate $d_{ik}$ and $d_{jk}$; add edges between k and i and between k and j
  - Add k to T; remove i and j from L and add k to L
  - Estimate $D_{mn}$ for all nodes m, n in L
- Terminate
  - When L has two nodes, add the edge between them
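The loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's own code: it uses the standard corrections $r_i = \frac{1}{|L|-2}\sum_k d_{ik}$ and $D_{ij} = d_{ij} - (r_i + r_j)$, the function name and the internal node names (`node0`, `node1`, ...) are hypothetical and must not collide with leaf labels, and ties are broken by taking the first minimal pair.

```python
from itertools import combinations


def neighbor_joining(labels, matrix):
    """Return the unrooted tree as a list of (node_a, node_b, length) edges."""
    d = {frozenset((labels[i], labels[j])): float(matrix[i][j])
         for i, j in combinations(range(len(labels)), 2)}

    def dist(a, b):
        return d[frozenset((a, b))]

    active = list(labels)   # the set L of nodes still to be joined
    edges = []
    next_internal = 0
    while len(active) > 2:
        n = len(active)
        r = {a: sum(dist(a, b) for b in active if b != a) / (n - 2)
             for a in active}
        # Pick the pair with the smallest corrected distance D_ij.
        i, j = min(combinations(active, 2),
                   key=lambda p: dist(*p) - r[p[0]] - r[p[1]])
        k = f"node{next_internal}"
        next_internal += 1
        d_ik = 0.5 * (dist(i, j) + r[i] - r[j])
        d_jk = dist(i, j) - d_ik
        edges += [(i, k, d_ik), (j, k, d_jk)]
        active = [m for m in active if m not in (i, j)]
        for m in active:
            # Distance from the new node k to every remaining node m.
            d[frozenset((k, m))] = 0.5 * (dist(i, m) + dist(j, m) - dist(i, j))
        active.append(k)
    a, b = active
    edges.append((a, b, dist(a, b)))
    return edges


# Additive distances from a tree where A (branch 1) and B (branch 4) are
# neighbors, C (1) and D (4) are neighbors, and the internal branch is 1.
# The closest pair by raw distance is A and C, which are not neighbors.
matrix = [[0, 5, 3, 6],
          [5, 0, 6, 9],
          [3, 6, 0, 5],
          [6, 9, 5, 0]]
for edge in neighbor_joining(["A", "B", "C", "D"], matrix):
    print(edge)
```

On this input the corrected distance selects A and B first, even though A and C have the smallest raw distance, which is the situation the slide on selecting nodes warns about.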

An example with neighbor joining
- Consider 5 sequences: A, B, C, D, E
- Let us infer the tree from their distance matrix using the neighbor joining algorithm
[Table: distance matrix over A, B, C, D, E; the values are not preserved in this transcript]

Can we check for additivity?
- Check for additivity: take four leaves i, j, k, l and the distances $d_{ij}$, $d_{ik}$, $d_{il}$, $d_{jk}$, $d_{jl}$, $d_{kl}$
- The three sums of two distances, $d_{ij} + d_{kl}$, $d_{ik} + d_{jl}$ and $d_{il} + d_{jk}$, should be such that two of them are equal and at least as large as the third
[Figure: the three ways of pairing up the leaves i, j, k, l]
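This four-leaf check can be applied to every quadruple of a distance matrix. A minimal sketch, assuming a full symmetric matrix given as a list of lists; the function name and tolerance are hypothetical.

```python
from itertools import combinations


def is_additive(matrix, tol=1e-9):
    """Check the four-point condition on a symmetric distance matrix."""
    n = len(matrix)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted((matrix[i][j] + matrix[k][l],
                       matrix[i][k] + matrix[j][l],
                       matrix[i][l] + matrix[j][k]))
        # The two largest sums must be equal.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


additive = [[0, 5, 3, 6],
            [5, 0, 6, 9],
            [3, 6, 0, 5],
            [6, 9, 5, 0]]
print(is_additive(additive))      # True: the sums are 10, 12, 12

not_additive = [[0, 5, 2, 6],
                [5, 0, 6, 9],
                [2, 6, 0, 5],
                [6, 9, 5, 0]]
print(is_additive(not_additive))  # False: the sums are 10, 11, 12
```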

Comparing NJ and UPGMA
- UPGMA
  - Rooted tree
  - Assumptions: molecular clock (ultrametric distances) and additivity
- NJ
  - Unrooted tree
  - Assumption: additivity

Rooting a tree
- An unrooted tree can be converted to a rooted tree using an outgroup species
- Outgroup: a species known to be more distantly related to all the other species than they are to each other
- Find the branch where the outgroup attaches to the rest of the tree
- That branch gives the candidate root
[Figure: an unrooted tree with the outgroup and the candidate root marked]
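Outgroup rooting can be illustrated on an edge list. This is a sketch under assumptions: the function name and the nested-tuple output are hypothetical, branch lengths are ignored, and the outgroup is assumed to be a leaf of the unrooted tree.

```python
from collections import defaultdict


def root_with_outgroup(edges, outgroup):
    """Root an unrooted tree, given as (node_a, node_b) pairs, on the outgroup branch."""
    neighbors = defaultdict(list)
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    # A leaf has exactly one neighbor: the node where the outgroup attaches.
    (attach,) = neighbors[outgroup]

    def subtree(node, parent):
        children = [subtree(c, node) for c in neighbors[node] if c != parent]
        return node if not children else tuple(children)

    # The root sits on the branch between the outgroup and the rest of the tree.
    return (outgroup, subtree(attach, outgroup))


edges = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("O", "y")]
print(root_with_outgroup(edges, "O"))  # ('O', (('A', 'B'), 'C'))
```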