Computing a tree Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.

Slides:



Advertisements
Similar presentations
Clustering II.
Advertisements

Computing a tree Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Hierarchical Clustering. Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram – A tree-like diagram that.
Lecture 13 CS5661 Phylogenetics Motivation Concepts Algorithms.
Phylogenetic Trees Lecture 4
Phylogenetics - Distance-Based Methods CIS 667 March 11, 2204.
Phylogenetic reconstruction
Phylogenetic trees Sushmita Roy BMI/CS 576 Sep 23 rd, 2014.
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
Molecular Evolution Revised 29/12/06
UPGMA Algorithm.  Main idea: Group the taxa into clusters and repeatedly merge the closest two clusters until one cluster remains  Algorithm  Add a.
Lecture 7 – Algorithmic Approaches Justification: Any estimate of a phylogenetic tree has a large variance. Therefore, any tree that we can demonstrate.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Phylogenetic Reconstruction: Distance Matrix Methods Anders Gorm Pedersen Molecular Evolution Group Center for.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Distance methods. UPGMA: similar to hierarchical clustering but not additive Neighbor-joining: more sophisticated and additive What is additivity?
In addition to maximum parsimony (MP) and likelihood methods, pairwise distance methods form the third large group of methods to infer evolutionary trees.
We have shown that: To see what this means in the long run let α=.001 and graph p:
CISC667, F05, Lec15, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Phylogenetic Trees (II) Distance-based methods.
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
Phylogenetic Trees Tutorial 6. Measuring distance Bottom-up algorithm (Neighbor Joining) –Distance based algorithm –Relative distance based Phylogenetic.
Multiple sequence alignment
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Distance Matrix Methods: Models of Evolution Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
Phylogeny Tree Reconstruction
Phylogenetic Trees Tutorial 6. Measuring distance Bottom-up algorithm (Neighbor Joining) –Distance based algorithm –Relative distance based Phylogenetic.
Building Phylogenies Distance-Based Methods. Methods Distance-based Parsimony Maximum likelihood.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Distance Matrix Methods Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis.
Phylogenetic Tree Construction and Related Problems Bioinformatics.
Phylogenetic trees Tutorial 6. Distance based methods UPGMA Neighbor Joining Tools Mega phylogeny.fr DrewTree Phylogenetic Trees.
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Phylogenetic trees Sushmita Roy BMI/CS 576
Multiple Sequence Alignments and Phylogeny.  Within a protein sequence, some regions will be more conserved than others. As more conserved,
Molecular evidence for endosymbiosis Perform blastp to investigate sequence similarity among domains of life Found yeast nuclear genes exhibit more sequence.
Phylogenetics Alexei Drummond. CS Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) (B)
Lecture 3: Markov models of sequence evolution Alexei Drummond.
1 Generalized Tree Alignment: The Deferred Path Heuristic Stinus Lindgreen
1 Summary on similarity search or Why do we care about far homologies ? A protein from a new pathogenic bacteria. We have no idea what it does A protein.
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
BINF6201/8201 Molecular phylogenetic methods
Applied Bioinformatics Week 8 Jens Allmer. Practice I.
OUTLINE Phylogeny UPGMA Neighbor Joining Method Phylogeny Understanding life through time, over long periods of past time, the connections between all.
Phylogenetic Prediction Lecture II by Clarke S. Arnold March 19, 2002.
Building phylogenetic trees. Contents Phylogeny Phylogenetic trees How to make a phylogenetic tree from pairwise distances  UPGMA method (+ an example)
Introduction to Phylogenetics
MOLECULAR PHYLOGENETICS Four main families of molecular phylogenetic methods :  Parsimony  Distance methods  Maximum likelihood methods  Bayesian methods.
Calculating branch lengths from distances. ABC A B C----- a b c.
Using traveling salesman problem algorithms for evolutionary tree construction Chantal Korostensky and Gaston H. Gonnet Presentation by: Ben Snider.
Algorithms in Computational Biology11Department of Mathematics & Computer Science Algorithms in Computational Biology Building Phylogenetic Trees.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
Phylogenetic Analysis Gabor T. Marth Department of Biology, Boston College BI420 – Introduction to Bioinformatics Figures from Higgs & Attwood.
Why do trees?. Phylogeny 101 OTUsoperational taxonomic units: species, populations, individuals Nodes internal (often ancestors) Nodes external (terminal,
Gene expression & Clustering. Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species –Dynamic.
Phylogeny Ch. 7 & 8.
Phylogenetic trees Sushmita Roy BMI/CS 576 Sep 23 rd, 2014.
Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram – A tree like diagram that.
Applied Bioinformatics Week 8 Jens Allmer. Theory I.
Tutorial 5 Phylogenetic Trees.
Example Apply hierarchical clustering with d min to below data where c=3. Nearest neighbor clustering d min d max will form elongated clusters!
Distance-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2010.
Statistical stuff: models, methods, and performance issues CS 394C September 3, 2009.
Distance-based methods for phylogenetic tree reconstruction Colin Dewey BMI/CS 576 Fall 2015.
CSCE555 Bioinformatics Lecture 13 Phylogenetics II Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:
Phylogeny - based on whole genome data
Inferring a phylogeny is an estimation procedure.
Sequence comparison: Local alignment
Multiple Alignment and Phylogenetic Trees
Goals of Phylogenetic Analysis
Inferring phylogenetic trees: Distance and maximum likelihood methods
Phylogenetic Trees.
Phylogeny.
Presentation transcript:

Computing a tree Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Review A tree can be used to represent sequence relationships. A sequence tree has a topology and branch lengths (distances). The number of tree topologies grows very fast.

Distance tree methods Measure pairwise distance between each pair of sequences. Use a clustering method to build up a tree, starting with the closest pair.

humanchimpgorillaorang human02/64/6 chimp05/63/6 gorilla02/6 orang0 Distance matrix from alignment (symmetrical, lower left not filled in)

Distance matrix methods Methods based on a set of pairwise sequence distances, typically from a multiple alignment. Try to build the tree that best matches the distances. Usual standard for “best match” is the least squares of the tree distances compared to the real pairwise distances: Let D m be the real distances and D t be the tree distances. Find the tree that minimizes:

Enumerate and score all trees Enumerate every tree topology, fit least-squares best distances for each topology, keep best. Not used for distance trees - there is a much faster way to get very close to correct. Called Neighbor-Joining algorithm, one of a general class called hierarchical clustering algorithms.

Sequential clustering approach (UPGMA)

Sequential clustering algorithm 1)generate a table of pairwise sequence distances and assign each sequence to a list of N tree nodes. 2)look through the current list of nodes (initially these will all be leaf “nodes”) for the pair with the smallest distance. 3)merge the closest pair, remove the pair of nodes from the list and add the merged node to the list. 4)repeat until there is only one node left - it is the root. (in words, this is the arithmetic average of the distances between all the leaves in one node and all the leaves in the other node)

Specifically, for sets of leaves i and j, we denote the set of all other leaves as L, and the size of that set as, and we compute the corrected distance D ij as: Neighbor-Joining Algorithm (side note) Essentially as for UPGMA, but correction for distance to other leaves is made. (the mean distance from i to all 'other' leaves)

class TreeNode: The tree itself is made up of TreeNode objects, each of which is connected to other TreeNode objects based on its three attributes. How do we know a node is a leaf? A root? Data structure for a tree

Raw distance correction DNA As two DNA sequences diverge, it is easy to see that their maximum raw distance is ~0.75 (assuming equal nt frequencies). This graph shows evolutionary distance related to raw distance:

Mutational models for DNA Jukes-Cantor (JC) - all mutations equally likely. Kimura 2-parameter (K2P) - transitions and transversions have separate rates. Generalized Time Reversible (GTR) - all changes may have separate rates. (Models similar to GTR are also available for protein)

GATC G 1-3  A   T   C  GATC G 1-  -2  A   T   C  purinespyrimidines Jukes-Cantor Kimura 2-parameter

Jukes-Cantor model - distance correction Jukes-Cantor model: D raw is the raw distance (what we directly measure) D is the corrected distance (what we want) ln is natural log

Convert each pairwise raw distance to a corrected distance. Build tree as before (UPGMA or neighbor-joining). Notice that these methods don't consider all tree topologies - they are very fast, even for large trees. Distance trees - summary