Computing a tree Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.



Defining what a "tree" means
Rooted tree (all real trees are rooted): time runs from the root, which represents the ancestral sequence, toward the leaves. Divergence time is the sum of (horizontal) branch lengths.
Unrooted tree (used when the root isn't known): time vaguely radiates out from somewhere near the center.
Parts of a tree: sequences (leaves or tips), branch points, branches, and the root.

A tree has topology and distances
Are these different trees? Topologically, these are the SAME tree.
In general, two trees are the same if they can be interconverted by branch rotations.

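The number of tree topologies grows extremely fast
Each new leaf can be inserted on any branch of the existing tree, so the number of topologies is multiplied by the current number of branches:
  3 leaves: 3 branches, 1 internal node, 1 topology (3 possible insertions)
  4 leaves: 5 branches, 2 internal nodes, 3 topologies (x3) (5 possible insertions)
  5 leaves: 7 branches, 3 internal nodes, 15 topologies (x5) (7 possible insertions)
In general, an unrooted tree with N leaves has:
  2N - 3 branches
  N - 2 internal nodes
  3 x 5 x ... x (2N - 5) topologies, which grows faster than exponentially (summarized on the slide as ~ O(N!))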

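There are many rooted trees for each unrooted tree
For each unrooted tree, there are 2N - 3 times as many rooted trees, where N is the number of leaves, because the root can be placed on any of the 2N - 3 branches.
For 20 leaves this gives 8,200,794,532,637,891,559,375 (about 8.2 x 10^21) rooted topologies.
The larger figure quoted on the slide, 564,480,989,588,730,591,336,960,000,000, equals 20! x 19! / 2^19, which is the count obtained when rooted trees that differ only in the time order of their branch points are also distinguished.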
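The insertion argument behind these counts can be checked with a few lines of Python. This is a minimal sketch; the function names are made up for illustration.

```python
def num_unrooted_topologies(n_leaves):
    """Count unrooted bifurcating topologies for n_leaves labeled leaves."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1  # exactly one topology for 3 leaves
    # Leaf k can be inserted on any branch of a (k - 1)-leaf tree,
    # which has 2(k - 1) - 3 = 2k - 5 branches.
    for k in range(4, n_leaves + 1):
        count *= 2 * k - 5
    return count


def num_rooted_topologies(n_leaves):
    """The root can sit on any of the 2N - 3 branches of an unrooted tree."""
    return num_unrooted_topologies(n_leaves) * (2 * n_leaves - 3)


for n in (3, 4, 5, 10, 20):
    print(n, num_unrooted_topologies(n), num_rooted_topologies(n))
```

Python integers have arbitrary precision, so the counts stay exact even for large N.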

How can you compute a tree?
Many methods are available. We will talk about:
- Distance trees
- Parsimony trees
Others include:
- Maximum-likelihood trees
- Bayesian trees

Distance tree methods
Measure the distance between each pair of sequences.
Use a clustering method to build up a tree, starting with the closest pair.

Distance matrix from alignment (symmetrical, lower left not filled in)
          human  chimp  gorilla  orang
human     0      2/6    4/6
chimp            0      5/6      3/6
gorilla                 0        2/6
orang                            0
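Raw distances like these (differences out of 6 aligned positions) are just the fraction of alignment columns where two sequences differ. A minimal sketch follows. The sequences and function names are invented for illustration, and gap handling is ignored.

```python
def raw_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def distance_matrix(names, seqs):
    """Upper-triangle distances as a dict keyed by (name_i, name_j)."""
    dist = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            dist[(names[i], names[j])] = raw_distance(seqs[i], seqs[j])
    return dist


# Made-up 6-column alignment, for illustration only.
names = ["seq1", "seq2", "seq3"]
seqs = ["ACGTAC", "ACGTTT", "TCGAAC"]
print(distance_matrix(names, seqs))
```

Only the upper triangle is stored, mirroring the symmetrical matrix above.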

Distance matrix methods
These methods are based on a set of pairwise sequence distances, typically computed from a multiple alignment.
They try to build the tree that best matches the distances.
The usual standard for "best match" is least squares: the sum of squared differences between the tree distances and the real pairwise distances.
Let D_m be the real distances and D_t be the tree distances. Find the tree that minimizes:
$$\sum_{i<j} \left( D_t(i,j) - D_m(i,j) \right)^2$$
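As a small illustration of the least-squares criterion, the sketch below scores two candidate sets of tree distances against measured distances. All numbers and names are made up, and the unweighted sum of squares is assumed.

```python
def least_squares_score(measured, tree):
    """Sum of squared differences between tree and measured distances."""
    return sum((tree[pair] - measured[pair]) ** 2 for pair in measured)


# Made-up distances for three leaves, for illustration only.
measured = {("A", "B"): 0.3, ("A", "C"): 0.5, ("B", "C"): 0.6}
tree_1 = {("A", "B"): 0.3, ("A", "C"): 0.6, ("B", "C"): 0.6}
tree_2 = {("A", "B"): 0.4, ("A", "C"): 0.7, ("B", "C"): 0.6}

print(least_squares_score(measured, tree_1))
print(least_squares_score(measured, tree_2))
```

The candidate with the smaller score (here tree_1) matches the measured distances better.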

Enumerate and score all trees
Enumerate every tree topology, fit the least-squares best distances for each topology, and keep the best.
This is not used for distance trees in practice, because there is a much faster way to get very close to the correct tree: the Neighbor-Joining algorithm, one of a general class called hierarchical clustering algorithms.

Sequential clustering approach (UPGMA)

Sequential clustering algorithm
1) Generate a table of pairwise sequence distances and assign each sequence to a list of N tree nodes.
2) Look through the current list of nodes (initially these will all be leaf "nodes") for the pair with the smallest distance.
3) Merge the closest pair, remove the pair of nodes from the list, and add the merged node to the list.
4) Repeat until there is only one node left - it is the root.
The distance between two nodes n1 and n2 is
$$D_{n_1 n_2} = \frac{1}{|n_1|\,|n_2|} \sum_{i \in n_1} \sum_{j \in n_2} d_{ij}$$
(in words, this is the arithmetic average of the distances between all the leaves in one node and all the leaves in the other node)
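The four steps can be sketched in Python as follows. This is an illustrative sketch, not the course's own code. It returns only the topology as nested tuples and does not compute branch lengths, and the distances are made up.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster leaves by average distance; return a nested-tuple tree.

    dist maps frozenset({a, b}) of leaf names to their pairwise distance.
    """
    # Step 1: each active node is (subtree, list of leaf names under it).
    nodes = [(name, [name]) for name in names]

    def node_distance(n1, n2):
        # Arithmetic average over all leaf pairs spanning the two nodes.
        total = sum(dist[frozenset((a, b))] for a in n1[1] for b in n2[1])
        return total / (len(n1[1]) * len(n2[1]))

    while len(nodes) > 1:
        # Step 2: find the closest pair among the current nodes.
        n1, n2 = min(combinations(nodes, 2), key=lambda p: node_distance(*p))
        # Step 3: replace the pair with the merged node.
        nodes.remove(n1)
        nodes.remove(n2)
        nodes.append(((n1[0], n2[0]), n1[1] + n2[1]))
    # Step 4: the last remaining node is the root.
    return nodes[0][0]


# Made-up distances for four leaves, for illustration only.
raw = {("A", "B"): 0.2, ("C", "D"): 0.3, ("A", "C"): 0.6,
       ("A", "D"): 0.7, ("B", "C"): 0.6, ("B", "D"): 0.7}
dist = {frozenset(pair): value for pair, value in raw.items()}
print(upgma(["A", "B", "C", "D"], dist))
```

Recomputing every node distance from the leaf distances keeps the sketch short. A practical implementation would update the distance table after each merge instead.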

Neighbor-Joining Algorithm (side note)
Essentially as for UPGMA, but a correction for the distance to other leaves is made.
Specifically, for sets of leaves i and j, we denote the set of all other leaves as L, and the size of that set as |L|, and we compute the corrected distance D_ij as:
$$D_{ij} = d_{ij} - (r_i + r_j)$$
where r_i is the mean distance from i to all 'other' leaves.
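A small example shows why the correction matters. The sketch below assumes the common textbook normalization, in which r_i is the summed distance from i to every other leaf divided by N - 2. The slide's exact formula is not recoverable from the transcript, so treat this as one plausible reading. The distances are made up and come from a tree in which A and B are neighbors but B sits on a long branch.

```python
from itertools import combinations


def corrected_distances(names, d):
    """Return D[i, j] = d[i, j] - (r_i + r_j) for every pair of leaves.

    d maps frozenset({a, b}) of leaf names to their pairwise distance.
    Assumption: r_i = (sum of distances from i to all other leaves) / (N - 2).
    """
    n = len(names)
    if n < 3:
        raise ValueError("need at least 3 leaves")

    def dist(a, b):
        return d[frozenset((a, b))]

    r = {i: sum(dist(i, k) for k in names if k != i) / (n - 2) for i in names}
    return {(i, j): dist(i, j) - (r[i] + r[j])
            for i, j in combinations(names, 2)}


# Made-up additive distances: A and B are neighbors, as are C and D.
raw = {("A", "B"): 0.5, ("A", "C"): 0.3, ("A", "D"): 0.6,
       ("B", "C"): 0.6, ("B", "D"): 0.9, ("C", "D"): 0.5}
d = {frozenset(pair): value for pair, value in raw.items()}
corrected = corrected_distances(["A", "B", "C", "D"], d)

closest_raw = min(raw, key=raw.get)
closest_corrected = min(corrected, key=corrected.get)
print(closest_raw, closest_corrected)
```

The smallest raw distance joins A with C, which are not neighbors in the underlying tree. The smallest corrected distance picks a true neighbor pair.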

Data structure for a tree
class TreeNode:
The tree itself is made up of TreeNode objects, each of which is connected to other TreeNode objects based on its three attributes.
How do we know a node is a leaf? A root?
A leaf (or tip) has no child nodes. A root has no parent node. All the rest have all three.
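The transcript does not preserve the class body, so the following is only a sketch of what such a class might look like. It assumes the three linking attributes are a parent and two children (here called parent, left and right). The name attribute and the helper methods are additions for illustration.

```python
class TreeNode:
    """One node of a bifurcating tree, linked to its parent and children."""

    def __init__(self, name=None, left=None, right=None):
        self.name = name      # label, used here only for leaves (assumed)
        self.parent = None    # set when this node is attached as a child
        self.left = left
        self.right = right
        for child in (left, right):
            if child is not None:
                child.parent = self

    def is_leaf(self):
        # A leaf (or tip) has no child nodes.
        return self.left is None and self.right is None

    def is_root(self):
        # A root has no parent node.
        return self.parent is None

    def leaf_names(self):
        """Names of all leaves below this node, left to right."""
        if self.is_leaf():
            return [self.name]
        names = []
        for child in (self.left, self.right):
            if child is not None:
                names.extend(child.leaf_names())
        return names


# Build the rooted tree ((A, B), C).
a, b, c = TreeNode("A"), TreeNode("B"), TreeNode("C")
ab = TreeNode(left=a, right=b)
root = TreeNode(left=ab, right=c)
print(root.leaf_names(), a.is_leaf(), root.is_root())
```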

Raw distance correction (DNA)
As two DNA sequences diverge, their raw distance levels off at a maximum of ~0.75 (assuming equal nt frequencies), because two unrelated sequences still match by chance at about 1/4 of positions.
The graph on this slide shows evolutionary distance as a function of raw distance.

Mutational models for DNA
- Jukes-Cantor (JC): all mutations equally likely.
- Kimura 2-parameter (K2P): transitions and transversions have separate rates.
- Generalized Time Reversible (GTR): all changes may have separate rates.
(Models similar to GTR are also available for protein)

Substitution probabilities from G to each nucleotide (G and A are purines; T and C are pyrimidines)
                      G             A    T    C
Jukes-Cantor          1 - 3α        α    α    α
Kimura 2-parameter    1 - α - 2β    α    β    β
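The two models can be written out as full 4 x 4 tables. In this sketch, alpha is the per-step probability of a transition (purine to purine, or pyrimidine to pyrimidine) and beta is the probability of each transversion. Jukes-Cantor is the special case where the two are equal. Function names and parameter values are illustrative.

```python
BASES = "GATC"
PURINES = {"G", "A"}


def is_transition(x, y):
    """True if x -> y stays within purines or within pyrimidines."""
    return (x in PURINES) == (y in PURINES)


def k2p_matrix(alpha, beta):
    """Kimura 2-parameter substitution probabilities, keyed by (from, to)."""
    matrix = {}
    for x in BASES:
        for y in BASES:
            if x == y:
                matrix[x, y] = 1 - alpha - 2 * beta  # no change
            elif is_transition(x, y):
                matrix[x, y] = alpha
            else:
                matrix[x, y] = beta
    return matrix


def jc_matrix(alpha):
    """Jukes-Cantor: every change equally likely, so the diagonal is 1 - 3*alpha."""
    return k2p_matrix(alpha, alpha)


jc = jc_matrix(0.01)
k2p = k2p_matrix(0.02, 0.005)
print([jc["G", y] for y in BASES])
print([k2p["G", y] for y in BASES])
```

Each row sums to 1, since a nucleotide either stays the same or changes to one of the other three.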

Jukes-Cantor model - distance correction
Jukes-Cantor model:
$$D = -\frac{3}{4} \ln\left(1 - \frac{4}{3} D_{raw}\right)$$
D_raw is the raw distance (what we directly measure)
D is the corrected distance (what we want)
ln is natural log
Similar calculations can be made for the other models (but more complex).
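The standard Jukes-Cantor correction is D = -(3/4) ln(1 - (4/3) D_raw). A minimal sketch of it in Python follows. The function name is made up, and raw distances of 0.75 or more are rejected because the logarithm is undefined there.

```python
import math


def jukes_cantor_distance(d_raw):
    """Convert a raw distance (fraction of differing sites) to a JC distance."""
    if not 0 <= d_raw < 0.75:
        raise ValueError("raw distance must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * d_raw / 3)


for d_raw in (0.0, 0.1, 0.3, 0.5, 0.7):
    print(d_raw, round(jukes_cantor_distance(d_raw), 4))
```

For small raw distances the correction is negligible. As the raw distance approaches 0.75, the corrected distance grows without bound.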

Distance trees - summary
Convert each pairwise raw distance to a corrected distance.
Build the tree as before (UPGMA or neighbor-joining).
Notice that these methods do not consider all tree topologies, which is why they are very fast, even for large trees.