Parallel implementation and performance of fastDNAml - a program for maximum likelihood phylogenetic inference

Presentation transcript:

Slide 1: Indiana University
Parallel implementation and performance of fastDNAml - a program for maximum likelihood phylogenetic inference
Craig A. Stewart, David Hart, Donald K. Berry, Gary J. Olsen, Eric Wernert, Will Fischer
14 November 2001

Slide 2: License Terms
Please cite as: Stewart, C.A., D. Hart, D.K. Berry, G.J. Olsen, E. Wernert, W. Fischer. Parallel implementation and performance of fastDNAml - a program for maximum likelihood phylogenetic inference. Presentation. Presented at IEEE/ACM SC01 Conference, Nov , Denver, CO. Available from:
Except where otherwise noted, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license. This license includes the following terms: You are free to share – to copy, distribute and transmit the work – and to remix – to adapt the work – under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.

Slide 3: Phylogenetic tree - a depiction of the course of evolution
[Diagram that was originally here removed prior to archiving due to a rights question.]

Slide 4: Evolutionary processes
- Evolution proceeds as a series of bifurcations
- The same techniques work with genes, gene products, and taxa

Slide 5: Rooted and unrooted trees
- Finding the best unrooted tree and finding the root of a tree are two different processes
- Rooting a tree is more a biological problem than a computing problem
- Cytoplasmic Coat Proteins (analysis done from Singapore as part of the iGrid display at SC98)

Slide 6: Why study phylogenetics?
- Useful in understanding disease-causing organisms
- Examples
  - Timing the origin of the HIV-1 pandemic (Korber et al.) +/- 12
  - Fungi and animals
- The original slide deck had a diagram from Korber et al., Timing the Ancestor of the HIV-1 Pandemic Strains. Science 9 June 2000: DOI: /science /1789.full

Slide 7: Availability of large amounts of genetic data makes possible the use of statistical techniques to infer phylogenies, but…

Slide 8: Why is phylogenetic inference an HPC problem?
- The number of bifurcating unrooted trees for n taxa is $\frac{(2n-5)!}{(n-3)!\,2^{n-3}}$
- The problem of searching among trees is NP-complete
- Larger data sets (number of taxa and length of sequences) tend to produce better results
- HPC techniques are required to make large-scale phylogenetic inference practical
- [Table of taxa vs. possible unrooted trees; the values were not preserved in this transcript.]
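
To get a feel for how quickly the search space grows, the tree-count formula on this slide, (2n-5)! / ((n-3)! 2^(n-3)), can be evaluated directly. The following is a minimal sketch; the function name `num_unrooted_trees` is made up for this illustration and is not part of fastDNAml.

```python
from math import factorial

def num_unrooted_trees(n):
    """Number of bifurcating unrooted trees for n labeled taxa (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    # (2n-5)! / ((n-3)! * 2^(n-3)); the division is always exact.
    return factorial(2 * n - 5) // (factorial(n - 3) * 2 ** (n - 3))

if __name__ == "__main__":
    for n in (3, 4, 5, 10, 50):
        print(n, num_unrooted_trees(n))
```

The same number is the product of the odd integers 1, 3, 5, …, (2n-5), which is why adding taxa one at a time multiplies the number of candidate trees at every step.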

Slide 9: Markov model of base substitution
- In any small interval of time there is a small chance of a mutation at any site (sites are independent)
- 4 x 4 matrix for DNA sequences (site-specific)
- Only single-nucleotide changes are considered, not insertions and deletions
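
As an illustration of what such a 4 x 4 matrix looks like, the sketch below builds the transition probabilities for the Jukes-Cantor (JC69) model, the simplest Markov model of base substitution. This is an assumption chosen for brevity: the slides do not say which substitution model fastDNAml uses, and the function name `jc69_matrix` is hypothetical.

```python
import math

BASES = "ACGT"

def jc69_matrix(t):
    """4 x 4 matrix P where P[i][j] is the probability that base i is
    replaced by base j along a branch of length t (expected substitutions
    per site), under the Jukes-Cantor model (an assumed, simplest-case model)."""
    decay = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * decay   # base is unchanged at the end of the branch
    diff = 0.25 - 0.25 * decay   # base has become one specific other base
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

if __name__ == "__main__":
    for row_base, row in zip(BASES, jc69_matrix(0.1)):
        print(row_base, ["%.4f" % p for p in row])
```

Each row sums to 1, a zero-length branch gives the identity matrix, and a very long branch approaches 0.25 in every cell.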

Slide 10: Maximum Likelihood Phylogenetic Inference
- Objective: find the (unrooted) tree that has the highest overall likelihood value
- Branching patterns, branch lengths, and likelihood values are all calculated from the data
- Likelihood values are used only for comparisons
- ML is the most computationally intensive of the mathematically based phylogeny methodologies
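
A toy version of the likelihood calculation may help make this concrete. The sketch below scores the single possible topology for three taxa (a star tree) by summing over the unknown base at the central node and multiplying across independent sites. It assumes the Jukes-Cantor model with equal base frequencies, which is a simplification chosen for this example and not necessarily what fastDNAml does; `jc69_prob` and `star_log_likelihood` are hypothetical names.

```python
import math

def jc69_prob(a, b, t):
    """Probability of base a becoming base b over branch length t (JC69, assumed)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def star_log_likelihood(seqs, branch_lengths):
    """Log-likelihood of aligned sequences on a star tree with one branch per
    sequence. Sites are treated as independent, so per-site logs are summed."""
    total = 0.0
    for column in zip(*seqs):
        site_likelihood = 0.0
        for center in "ACGT":            # sum over the unobserved central base
            p = 0.25                     # equal base frequencies (assumption)
            for base, t in zip(column, branch_lengths):
                p *= jc69_prob(center, base, t)
            site_likelihood += p
        total += math.log(site_likelihood)
    return total

if __name__ == "__main__":
    seqs = ["ACGT", "ACGT", "ACGA"]
    for lengths in ([0.05, 0.05, 0.2], [1.0, 1.0, 1.0]):
        print(lengths, star_log_likelihood(seqs, lengths))
```

The absolute values mean little on their own; as the slide notes, they are useful for comparing candidate trees and branch lengths against each other on the same data.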

Slide 11: fastDNAml
- Based on Felsenstein's DNAml
- Program created by Gary Olsen et al.
  - New search algorithms
  - Parallel code (one of the first parallel phylogenetics codes)
- Olsen is the primary developer of the serial version

Slide 12: Basic fastDNAml algorithm - adding taxa
- Optimize the tree for 3 (randomly chosen) taxa; only one topology is possible
- Randomly pick another taxon; when the i-th taxon is added, (2i-5) trees are possible
- Keep the best (maximum likelihood) tree
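
The (2i-5) count follows from the fact that an unrooted tree with i-1 taxa has 2i-5 branches, and the new taxon can be attached to any one of them. The sketch below enumerates those insertions for a tree stored as a plain list of edges. The representation, the `internalN` node labels, and the `score` callback (standing in for the likelihood evaluation) are all assumptions made for this illustration; taxa are added in the order given, whereas the slides describe a random order.

```python
def add_taxon(edges, new_leaf):
    """Yield every tree obtained by attaching new_leaf to one branch of the tree.
    A tree is a list of (node, node) edges; leaves are taxon names."""
    new_node = "internal%d" % len(edges)   # hypothetical unique label
    for u, v in edges:
        rest = [e for e in edges if e != (u, v)]
        # Split branch (u, v) with a new internal node and hang the leaf on it.
        yield rest + [(u, new_node), (new_node, v), (new_node, new_leaf)]

def stepwise_addition(taxa, score):
    """Start from the only 3-taxon topology, then add one taxon at a time,
    keeping the highest-scoring insertion at each step."""
    tree = [(leaf, "internal0") for leaf in taxa[:3]]
    for leaf in taxa[3:]:
        tree = max(add_taxon(tree, leaf), key=score)
    return tree

if __name__ == "__main__":
    start = [("A", "internal0"), ("B", "internal0"), ("C", "internal0")]
    print(len(list(add_taxon(start, "D"))))   # candidate trees for the 4th taxon
```

Taxon names must not collide with the generated `internalN` labels for this toy representation to stay consistent.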

Slide 13: Basic fastDNAml algorithm - branch rearrangement
- Move any subtree crossing n vertices (if n=1 there are 2i-6 possibilities)
- Keep the best resulting tree
- Repeat this step until local swapping no longer improves the likelihood value
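
The rearrangement step has the shape of a generic hill-climbing loop: score every rearranged tree, keep the best, and stop when nothing improves. The sketch below shows only that control flow. The `neighbors` callback (which would generate the subtree moves) and the `score` callback (which would compute the likelihood) are hypothetical placeholders, and the demonstration uses integers in place of trees.

```python
def rearrange_until_stable(tree, neighbors, score):
    """Repeat: evaluate all rearrangements of the current tree, move to the
    best one if it improves the score, otherwise stop."""
    best, best_score = tree, score(tree)
    while True:
        scored = [(score(candidate), candidate) for candidate in neighbors(best)]
        if not scored:
            return best, best_score
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:      # no rearrangement improves the tree
            return best, best_score
        best, best_score = top, top_score

if __name__ == "__main__":
    # Toy stand-in: "trees" are integers, neighbors are x-1 and x+1.
    result = rearrange_until_stable(
        0,
        neighbors=lambda x: [x - 1, x + 1],
        score=lambda x: -(x - 7) ** 2,
    )
    print(result)
```

Because the loop only ever accepts improvements, it stops at the first tree that none of its neighbors beats, which need not be the best tree overall.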

Slide 14: Basic fastDNAml algorithm - iterate
- Get sequence data for the next taxon
- Add the new taxon (2i-5 possible trees); keep the best
- Try rearrangements; keep the best
- Keep going…
- When all taxa have been added, perform a full tree check (crossing 2 to n vertices)

Slide 15: Because of local effects…
- The search can get stuck in a local optimum rather than the global one
- Must do multiple runs with different randomizations of taxa, and compare the results
- A set of similar trees with similar (high) likelihood values provides some confidence in the results
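
The multiple-run strategy can be sketched as a small driver that shuffles the taxon order, rebuilds the tree for each ordering, and ranks the results by likelihood. Everything here is illustrative: `build_tree` is a hypothetical callback that is assumed to return a `(tree, log_likelihood)` pair, and the default of 10 runs is only a placeholder.

```python
import random

def multiple_runs(taxa, build_tree, runs=10, seed=0):
    """Run the tree search once per random taxon ordering and return the
    (tree, log_likelihood) results sorted from best to worst."""
    rng = random.Random(seed)            # fixed seed so runs are repeatable
    results = []
    for _ in range(runs):
        order = list(taxa)
        rng.shuffle(order)
        results.append(build_tree(order))
    results.sort(key=lambda result: result[1], reverse=True)
    return results

if __name__ == "__main__":
    # Toy stand-in for a real search: the "tree" is just the ordering.
    def toy_build(order):
        return "".join(order), -float(order.index("A"))

    for tree, loglik in multiple_runs("ABCDE", toy_build, runs=5):
        print(tree, loglik)
```

Comparing the top few results, rather than trusting a single run, is what gives the confidence the slide refers to.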

Slide 16: Parallelization of fastDNAml
- At each step, many trees may be analyzed simultaneously
- A tree and its likelihood value are the only communication needed
- High computation/communication ratio: hundreds of thousands of floating-point operations per byte of data transmitted back to the main program in the examples used in the performance analysis
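
The pattern described here, sending out candidate trees and getting back only a likelihood value for each, can be sketched with a worker pool. This is a stand-in only: the slides describe a message-passing implementation, whereas this example uses Python's `concurrent.futures.ThreadPoolExecutor` purely to show the dispatch-and-collect shape. The `score` callback and the function name are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def best_tree_parallel(trees, score, workers=4):
    """Score all candidate trees concurrently and return (best_tree, best_score).
    Only a tree goes out to a worker and only a number comes back."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(score, trees))   # results keep the input order
    best_index = max(range(len(trees)), key=scores.__getitem__)
    return trees[best_index], scores[best_index]

if __name__ == "__main__":
    # Toy stand-in: "trees" are strings and the score is their length.
    print(best_tree_parallel(["t1", "tree22", "t"], score=len))
```

In CPython, threads will not speed up a CPU-bound likelihood calculation; a process pool or a message-passing library would be needed for real gains, which is consistent with the approach the slides describe.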

Slide 17: Overview of parallel program flow

Slide 18: Parallel implementation of fastDNAml
- Program modules
  - Master (generates trees, receives the best tree back from the Foreman at each step)
  - Foreman (dispatches trees to workers, determines the best tree, tracks activity of workers)
  - Worker
  - Monitor (instrumentation)
- New features in fastDNAml
  - Calls to message-passing libraries sequestered in one file
  - Parallel versions include fault tolerance features (useful in large clusters and grid computing)

Slide 19: Performance analysis of fastDNAml
- Used three data sets (50, 101, and 150 taxa) from studies of Microsporidia, having 1858 or 1269 positions
- Performance analyzed on Indiana University's IBM SP, using the serial version as the performance baseline
- Program set to cross 5 vertices in the rearrangement step
- 10 random orderings (three replications each), 1 to 64 processors
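
With the serial version as the baseline, the usual way to summarize such measurements is speedup (serial time divided by parallel time) and efficiency (speedup divided by processor count). The sketch below assumes that is the metric intended; the slides do not spell it out. The timings in the example are invented for illustration and are not results from this study.

```python
def speedup_and_efficiency(serial_seconds, parallel_seconds, processors):
    """Return (speedup, efficiency) relative to a serial baseline."""
    speedup = serial_seconds / parallel_seconds
    efficiency = speedup / processors
    return speedup, efficiency

if __name__ == "__main__":
    # Hypothetical timings, NOT measurements from the talk.
    serial = 6400.0
    for procs, seconds in [(4, 1700.0), (16, 450.0), (64, 125.0)]:
        s, e = speedup_and_efficiency(serial, seconds, procs)
        print("%2d processors: speedup %.1f, efficiency %.2f" % (procs, s, e))
```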

Slide 20: Performance of fastDNAml

Slide 21: Performance of fastDNAml

Slide 22: Other phylogenetics software
- Ceron: maximum likelihood analysis
  - Parallel (PVM) program based on Felsenstein's DNAml
  - fastDNAml as we are using it does more extensive branch swapping
  - Ceron version: speculative calculations based on the assumption that rearrangement won't improve the tree
  - Essentially two different search strategies
- GRAPPA (Bader et al.): breakpoint analysis program; scales well

Slide 23: Why bother with parallel code?
- Why not just achieve a speedup of n on n processors by running n independent jobs?
- Practical benefits of seeing results quickly
- A parallel program permits an assault on much more complicated problems (e.g. protein sequences)

Slide 24: Visualization

Slide 25

Slide 26

Slide 27: Future Plans
- A Condor version of fastDNAml
- Improvements to the tree optimization process
- Protein sequences

Slide 28: Summary
- Significant speedup in time to solution
- Speed enables biologists to choose phylogenetic methodologies on the basis of the quality of results
- Scales well
- Available from:

Slide 29: Acknowledgements
- This work was supported in part by
  - Shared University Research grants from IBM, Inc.
  - The Lilly Endowment for the Indiana Genomics Initiative (INGEN) of Indiana University
- Diagrams for this talk created by W. Leslie Teach, UITS

Slide 30: Thank you. Any questions?