Reconstructing Phylogenies from Gene-Order Data Overview.

What are Phylogenies?
- Tree of Life
- A DAG (directed acyclic graph) representing the evolution of species

Phylogenetic Analysis Used For…
Phylogenies help biologists understand and predict:
- functions and interactions of genes
- genotype => phenotype
- host/parasite co-evolution
- origins and spread of disease
- drug and vaccine development
- origins and migrations of humans
Example: the RoundUp herbicide was developed with the help of phylogenetic analysis

Gene-Level Phylogeny
Nadeau-Taylor model of evolution
- Assume a discrete set of genes
  - Each gene represents a sequence of nucleotides
  - Genes have polarity (a, -a)
- A species' genome is a sequence of genes
- Rare evolutionary events cause changes in the genome
  - Inversion: (a b c d) => (a -c -b d)
  - Transposition: (a b c d) => (a c d b)
  - Inverted transposition: (a b c d) => (a -d -c b)
  - Insertion: (a b c d) => (a e b c d)
  - Deletion: (a b c d) => (a c d)
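
The five events listed above can be sketched on signed gene lists. This is a minimal illustration, not code from the presentation: it assumes genes are encoded as nonzero integers whose sign is the polarity (a=1, b=2, ...), and the function names and index conventions are hypothetical.

```python
def inversion(genome, i, j):
    """Reverse the block genome[i:j] and flip the sign of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Swap the adjacent blocks genome[i:j] and genome[j:k]."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Swap adjacent blocks; the block genome[j:k] is inverted as it moves forward."""
    moved = [-g for g in reversed(genome[j:k])]
    return genome[:i] + moved + genome[i:j] + genome[k:]


def insertion(genome, i, gene):
    """Insert a new gene before position i."""
    return genome[:i] + [gene] + genome[i:]


def deletion(genome, i):
    """Delete the gene at position i."""
    return genome[:i] + genome[i + 1:]


# The slide's examples, with a=1, b=2, c=3, d=4, e=5.
abcd = [1, 2, 3, 4]
print(inversion(abcd, 1, 3))                  # (a -c -b d)
print(transposition(abcd, 1, 2, 4))           # (a c d b)
print(inverted_transposition(abcd, 1, 2, 4))  # (a -d -c b)
print(insertion(abcd, 1, 5))                  # (a e b c d)
print(deletion(abcd, 1))                      # (a c d)
```

Each function returns a new list, so the input genome is left unchanged.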

Goal of Phylogenetics
Given a set of observed genomes, reconstruct an evolutionary tree
- Leaves are the observed genomes
- Internal nodes are evolutionary steps (missing-link genomes)
- Edges may contain multiple events
Fundamentally impossible to solve without a time machine
- Fossils?
However:
- Of the set of valid trees that include all observed genomes as leaf nodes, the tree containing the minimum number of events (sum of edge weights) is assumed to be closest to the actual tree
- This criterion is maximum parsimony

Tree Construction Techniques
Three primary methods:
- Criterion-based (NP-hard optimization)
  - Relies on an evolutionary model
  - Examples:
    - Breakpoint phylogeny
    - Maximum likelihood, maximum parsimony, minimum evolution
  - Provides good accuracy but is intractable for larger sets of genomes
- Ad hoc / distance-based
  - Relies on pairwise distances
  - Example:
    - Neighbor-joining
  - Runs in polynomial time but is very inaccurate for large sets of genomes
- Meta-methods
  - Examples: disk-covering, quartet-based methods
  - Divide-and-conquer approach

Breakpoint Phylogeny Method
Sankoff-Blanchette (S-B) technique
- Assume an unrooted, binary tree topology, where leaves are genomes
- Basic algorithm:
  - For each circular ordering of genomes…
  - From the bottom up, label each of the N-2 internal nodes with a genome that has minimal distance to each of its neighbors
  - The tree with the minimal sum of edge weights (length) is the most parsimonious
- First problem with S-B: the number of genome orderings grows factorially
  - (N-1)! possible circular orderings: G1 G2 G3 G4 is equivalent to G2 G3 G4 G1
  - Topology (and thus length) of the tree depends solely on the genome ordering

Breakpoint Distance
S-B use breakpoint distance to estimate the distance between two genomes
- Approximates the number of evolutionary events
- Assumes a consistent gene set and sequence length
- Given genomes G1 and G2:
  - If a and b are adjacent in G1 but not in G2, then bp_distance++
  - Example: {a b c d} and {a c d b} have two breakpoints
- Must also take polarity into account…
  - No breakpoint between {a b} and {-b -a}
  - Example: {a b c d} and {-b -a c d}: breakpoint distance is 1
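
The counting rule above can be sketched as follows. This is an illustrative version only: it assumes linear (not circular) genomes over the same gene set, encodes genes as signed integers (a=1, b=2, ...), and treats the adjacency (x, y) as identical to its reverse complement (-y, -x). The function names are hypothetical.

```python
def adjacencies(genome):
    """Return the set of adjacencies of a signed linear genome.

    The pair (x, y) read left to right is the same adjacency as (-y, -x)
    read on the opposite strand, so one canonical form of each is stored.
    """
    result = set()
    for x, y in zip(genome, genome[1:]):
        result.add(max((x, y), (-y, -x)))
    return result


def breakpoint_distance(g1, g2):
    """Count adjacencies of g1 that do not occur in g2."""
    return len(adjacencies(g1) - adjacencies(g2))


# The slide's examples, with a=1, b=2, c=3, d=4.
print(breakpoint_distance([1, 2, 3, 4], [1, 3, 4, 2]))    # 2
print(breakpoint_distance([1, 2], [-2, -1]))              # 0
print(breakpoint_distance([1, 2, 3, 4], [-2, -1, 3, 4]))  # 1
```

With circular genomes the wrap-around adjacency would also have to be counted, which changes the totals for these examples.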

Median Problem for Breakpoints
S-B labels internal nodes by finding a median S among 3 genomes A, B, C, such that:
- D(S,A) + D(S,B) + D(S,C) is minimal
Performed using a TSP:
- Build a fully connected graph with a vertex for each polarity of each gene
- Edge weights assigned as 3 - (number of times the pair of genes is adjacent in the three genomes)
- Run TSP
- Path of the salesman specifies the median
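
For very small gene sets, the median objective can be checked by exhaustive search instead of a TSP. The sketch below is a brute-force stand-in, not the S-B reduction itself: it assumes linear genomes encoded as signed integers, uses a simple adjacency-based breakpoint distance, and all function names are hypothetical. It enumerates n! * 2^n candidates, so it is only usable for a handful of genes.

```python
from itertools import permutations, product


def adjacencies(genome):
    """Canonical adjacency set of a signed linear genome."""
    return {max((x, y), (-y, -x)) for x, y in zip(genome, genome[1:])}


def breakpoint_distance(g1, g2):
    """Count adjacencies of g1 that do not occur in g2."""
    return len(adjacencies(g1) - adjacencies(g2))


def median_score(candidate, genomes):
    """D(S,A) + D(S,B) + D(S,C) for candidate median S."""
    return sum(breakpoint_distance(candidate, g) for g in genomes)


def brute_force_median(genomes):
    """Try every signed permutation and keep one with the lowest score."""
    genes = sorted(abs(g) for g in genomes[0])
    best, best_score = None, None
    for order in permutations(genes):
        for signs in product((1, -1), repeat=len(genes)):
            candidate = [s * g for s, g in zip(signs, order)]
            score = median_score(candidate, genomes)
            if best_score is None or score < best_score:
                best, best_score = candidate, score
    return best, best_score


# Three genomes over genes 1..4 (A=1, B=2, C=3, D=4).
genomes = [[1, 2, 3, 4], [2, 4, -1, -3], [-4, 3, 2, 1]]
median, score = brute_force_median(genomes)
print(median, score)
```

Several candidates may tie for the minimum; this sketch returns the first one found.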

Example Median
Assume gene set = {A, B, C, D}
Assume genomes:
  A B C D
  B D -A -C
  -D C B A
[Figure: graph on vertices A, -A, C, -C, B, -B, D, -D; edges not shown have weight 3]
Adjacency counts u (edge weight = 3 - adjacencies):
  u(A,B)=0   u(A,-B)=1   u(A,C)=0   u(A,-C)=1   u(A,D)=0   u(A,-D)=0
  u(-A,B)=1  u(-A,-B)=0  u(-A,C)=0  u(-A,-C)=0  u(-A,D)=0  u(-A,-D)=1
  u(B,C)=0   u(B,-C)=1   u(B,D)=0   u(B,-D)=0
  u(-B,C)=1  u(-B,-C)=0  u(-B,D)=1  u(-B,-D)=0
  u(C,D)=1   u(C,-D)=0
  u(-C,D)=1  u(-C,-D)=0
If the solution to the TSP is s1, -s1, s2, -s2, …, sn, -sn, then the median is s1, s2, …, sn (include signs)
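
The u values in this example can be reproduced with a short script. The sketch assumes a convention inferred from the table rather than stated on the slide: genomes are linear, and a gene g immediately followed by h contributes one count to the edge between vertices -g and h. Genes are encoded as signed integers and the function names are hypothetical.

```python
def adjacency_counts(genomes):
    """Count, for each vertex pair, how many genomes use that adjacency.

    A gene g followed by h is recorded on the edge {-g, h}, matching a
    tour of the form s1, -s1, s2, -s2, ...
    """
    u = {}
    for genome in genomes:
        for g, h in zip(genome, genome[1:]):
            key = frozenset((-g, h))
            u[key] = u.get(key, 0) + 1
    return u


def edge_weight(u, x, y, num_genomes=3):
    """TSP edge weight: number of genomes minus the adjacency count."""
    return num_genomes - u.get(frozenset((x, y)), 0)


A, B, C, D = 1, 2, 3, 4
u = adjacency_counts([[A, B, C, D], [B, D, -A, -C], [-D, C, B, A]])
print(edge_weight(u, -A, B))  # 2, since u(-A,B)=1
print(edge_weight(u, A, B))   # 3, since u(A,B)=0
```

Pairs that never occur are simply absent from the dictionary, which corresponds to the slide's note that edges not shown have weight 3.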

S-B Algorithm
[Flowchart; recoverable labels: initialization, label, only when nodes have changed, N+2N-2]

S-B Algorithm
S and B propose three different methods for initializing the TSPs in an attempt to reach the global optimum
Second problem with S-B:
- Each tree requires solving multiple TSPs, each of which is NP-hard
- Initial labeling: N-2 TSPs (one per internal node)
- The process repeats an unknown number of times to optimize the internal nodes

Neighbor Joining
A polynomial-time heuristic for tree construction
Given the distances between each pair of genomes (distance matrix)…
Grow a complex tree structure, starting from a star
Basic algorithm:
- Begin with a star topology
- Choose a pair of leaves that are closely related
- Remove these leaves and join them with a new internal node
- Join this new internal node back into the remaining tree
- Repeat until all N-3 new internal nodes have been created
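
The steps above can be sketched with the standard neighbor-joining selection rule, which the slide does not spell out: the pair minimizing Q(i,j) = (n-2)*d(i,j) - r(i) - r(j) is joined, where r is a row sum of the distance matrix. The data layout (a dictionary keyed by frozenset pairs), the node names I0, I1, ..., and the function name are assumptions made for illustration.

```python
def neighbor_joining(names, dist):
    """Return a list of edges (node, node, length) for at least 3 leaves.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    d = dict(dist)
    active = list(names)
    edges = []
    next_id = 0

    def D(a, b):
        return d[frozenset((a, b))]

    while len(active) > 3:
        n = len(active)
        r = {a: sum(D(a, b) for b in active if b != a) for a in active}
        # Pick the pair with the smallest Q value.
        best = None
        for idx, a in enumerate(active):
            for b in active[idx + 1:]:
                q = (n - 2) * D(a, b) - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        new = f"I{next_id}"
        next_id += 1
        # Branch lengths from the joined leaves to the new internal node.
        len_a = 0.5 * D(a, b) + (r[a] - r[b]) / (2 * (n - 2))
        edges += [(a, new, len_a), (b, new, D(a, b) - len_a)]
        # Distances from the new node to every remaining node.
        for c in active:
            if c not in (a, b):
                d[frozenset((new, c))] = 0.5 * (D(a, c) + D(b, c) - D(a, b))
        active = [c for c in active if c not in (a, b)] + [new]

    # Three nodes remain: attach each to the center of the original star.
    a, b, c = active
    center = f"I{next_id}"
    edges.append((a, center, 0.5 * (D(a, b) + D(a, c) - D(b, c))))
    edges.append((b, center, 0.5 * (D(a, b) + D(b, c) - D(a, c))))
    edges.append((c, center, 0.5 * (D(a, c) + D(b, c) - D(a, b))))
    return edges


pairs = {("A", "B"): 5, ("A", "C"): 7, ("A", "D"): 8,
         ("B", "C"): 8, ("B", "D"): 9, ("C", "D"): 9}
tree = neighbor_joining(["A", "B", "C", "D"],
                        {frozenset(k): v for k, v in pairs.items()})
for edge in tree:
    print(edge)
```

The loop creates N-3 new internal nodes and the final step adds the star center, giving the N-2 internal nodes and 2N-3 edges of an unrooted binary tree.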

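Neighbor-Joining
[Figure: star tree with center X and leaves 1-5, S0 = (sum of D)/(N-1) = 45/4; tree after joining leaves 1 and 2 under a new node Y, one of N(N-1)/2 possible pairs]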

Neighbor-Joining

Edge weight approximations can be computed with neighbor-joining
However, it is more accurate to label the internal nodes as in S-B and measure edge lengths from those labels
- Scoring

Moret's Distance Estimators
IEBP estimator
- Approximates event distance from breakpoint distance
- Weights: inversion, transposition, inverted transposition
- Fast but not accurate
Exact-IEBP
- Returns the exact value
- Slow but exact
EDE
- Correction function to improve accuracy of IEBP
EDE used to build the distance matrix
- Set up NJ
- Finding lower bound
- Scoring

EDE Distance Correction
F(x) gives the expected minimum inversion distance after x actual inversions; the non-negative inverse of F maps an observed minimum inversion distance back to an estimate of x

Bounding
Given a distance matrix, a lower bound can be determined
- Tree is at least this size
- Use "twice around the tree"
- Length of tree (sum of edges) is at least 0.5 * (d(1,2) + d(2,3) + … + d(n,1))
Given a constructed tree, an upper bound can be determined
- Label internal nodes
- Sum up all edges using the distance calculator
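
The lower bound can be sketched as half the sum of distances around a circular ordering of the leaves. This is an illustrative version: the nested-dictionary distance matrix and the function name are assumptions, and the example distances are made up (they are additive on a tree of length 15).

```python
def circular_lower_bound(order, dist):
    """Half the sum of distances between consecutive leaves, wrapping around.

    order: leaves in circular order.
    dist:  nested dict, dist[a][b] is the distance between a and b.
    """
    n = len(order)
    total = sum(dist[order[i]][order[(i + 1) % n]] for i in range(n))
    return 0.5 * total


dist = {
    "A": {"B": 5, "C": 7, "D": 8},
    "B": {"A": 5, "C": 8, "D": 9},
    "C": {"A": 7, "B": 8, "D": 9},
    "D": {"A": 8, "B": 9, "C": 9},
}
print(circular_lower_bound(["A", "B", "C", "D"], dist))  # 15.0
print(circular_lower_bound(["A", "C", "B", "D"], dist))  # 16.0
```

Walking around a tree in leaf order traverses every edge twice, which is why the sum of consecutive leaf distances is halved.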

GRAPPA Optimizations
- Gene ordering
  - Given a circular gene ordering
  - Build an S-B tree
  - Swap internal leaf orderings, changing the order
  - Upper bound stays constant (no relabeling), while lower bound changes

GRAPPA
Layered search:
- Build EDE distance matrix
- Build and score NJ tree (provides initial upper bound)
- Enumerate all genome orderings
- For each:
  - Compute lower bound using twice around the tree
  - If LB < UB, add ordering to queue, sorted by LB
    - Requires too much disk space
- Score each tree from the queue in order:
  - Keep track of lowest upper bound
  - Allows for more pruning
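
The layered search can be sketched with a priority queue. This is not GRAPPA's code: `lower_bound` and `score_tree` are hypothetical callables standing in for the twice-around-the-tree bound and the S-B scoring step, and the queue is held in memory here rather than on disk.

```python
import heapq


def layered_search(orderings, lower_bound, score_tree, initial_upper_bound):
    """Queue orderings by lower bound, then score them in that order.

    Returns (best ordering, best score, number of trees scored).
    """
    queue = []
    for idx, ordering in enumerate(orderings):
        lb = lower_bound(ordering)
        if lb < initial_upper_bound:
            # idx breaks ties so orderings themselves are never compared.
            heapq.heappush(queue, (lb, idx, ordering))

    upper, best, scored = initial_upper_bound, None, 0
    while queue:
        lb, _, ordering = heapq.heappop(queue)
        if lb >= upper:
            break  # every remaining ordering has an even larger bound
        score = score_tree(ordering)
        scored += 1
        if score < upper:
            upper, best = score, ordering
    return best, upper, scored
```

Because orderings come off the queue in increasing lower-bound order, the search can stop at the first ordering whose bound reaches the current upper bound.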

GRAPPA
Without layered search:
- Build EDE distance matrix
- Build and score NJ tree (initial upper bound)
- For each genome ordering:
  - Compute lower bound
  - If lower bound < UB:
    - Score tree and compute new upper bound (may do swap-as-you-go to eliminate redundant orderings)
    - If new upper bound < old upper bound, set new upper bound
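
The loop without layered search can be sketched as a bounded enumeration. As an illustration only: `lower_bound` and `score_tree` are hypothetical callables, the swap-as-you-go step is omitted, and circular orderings are generated by fixing the first genome, which yields (N-1)! orderings.

```python
from itertools import permutations


def circular_orderings(genomes):
    """Yield one representative of each rotation class of orderings."""
    first, rest = genomes[0], genomes[1:]
    for perm in permutations(rest):
        yield (first,) + perm


def bounded_search(genomes, lower_bound, score_tree, initial_upper_bound):
    """Score only the orderings whose lower bound beats the upper bound.

    Returns (best ordering, best score, number of trees scored).
    """
    upper, best, scored = initial_upper_bound, None, 0
    for ordering in circular_orderings(genomes):
        if lower_bound(ordering) < upper:
            score = score_tree(ordering)
            scored += 1
            if score < upper:
                upper, best = score, ordering
    return best, upper, scored
```

Unlike the layered variant, this version needs no queue, but it cannot stop early because orderings are not visited in lower-bound order.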

FPGA Implementation
- Software can perform NJ, since that's only done once
- Software can enumerate valid genome orderings
- Scoring should be done in hardware
  - EDE can be performed via BRAM/CLB lookup table
  - Need to implement TSP in hardware
- GRAPPA uses a specialized version of TSP
  - As opposed to chained and simple versions of the Lin-Kernighan heuristic
  - O(n^3)
- Most important question:
  - Map to multi-FPGA architecture?

GRAPPA Version of S-B Algorithm
- Iterative refinement
  - Only refine internal nodes when one of the neighbors has changed in the refinement iteration
- Condensation
  - Gene reduction to speed up TSP for shared subsequences
  - Not used by default
- Exact TSP algorithm
- Initial labeling
  - Uses second approach in S-B paper (nearest neighbors/trees of TSPs)

Parallelism?
- Scoring is very parallel
  - TSP only depends on three nearest nodes
  - Can overlap iterations
- GRAPPA is parallelized for clusters
  - Compute-bound, not communication-bound
- Achieve finer-grain parallelism with FPGAs
  - Problem may turn communication-bound
Research Plan
- GRAPPA analysis (drill-down)
- Get preliminary results for TSP over FPGA
  - SRC implementation (Charlie)
  - Determine granularity vs. communication

Possible HPRC Approach
[Figure: leaf genomes G1-G4 and internal nodes I1-I6; labels: wrap-around (one TSP core), buffered requests]

Possible HPRC Approach
[Figure labels: g5, input species, ancestral group 1, ancestral group 2]

HPRC
- FPGAs
  - Comp. density
- Cost
  - Granularity
- Mesh
  - Load balancing