INDIANA UNIVERSITY: Evolutionary Biology and Computational Grids. Craig Stewart, Director, Research and Academic Computing.

Presentation transcript:

Slide 1: Evolutionary Biology and Computational Grids
Craig Stewart, Director, Research and Academic Computing
10 November 1999
Please cite as: Stewart, C.A. Evolutionary Biology and Computational Grids. (Presentation) CASCON Workshop on Computational Biology (Mississauga, Ontario, Canada, 10 Nov 1999). Available from:

Slide 2: Intellectual credits
Collaborators
– National University of Singapore: Tan Tin Wee, Louxin Zhang (NUS), Meena Sakharkar
– ACSys (Advanced Computational SYStems, Australian National University): Markus Buckhorn
– Indiana University: David Hart, Donald K. Berry, Jeffrey Palmer, Will Fischer, Chris Parkinson, Sean Turner, Eric Wernert
Code development
– J. Felsenstein – DNAml (PHYLIP) [U. Washington]
– G. Olsen – fastDNAml [UIUC]
– H. Matsuda, R. Overbeek – initial P4 parallel code [ANL]
– D.K. Berry – PVM and MPI ports [IU]

Slide 3: Outline
Phylogenies
Statistical methods for estimating phylogenies & the fastDNAml program
– Models of DNA replication and evolution
– Algorithm
– Parallelization
Grid computing, HPCC, visualization
What we've learned so far
Future plans

Slide 4: This slide previously contained an image scanned from E. Colbert, The Age of Reptiles. W.W. Norton, NY, NY.

Slide 5: Lots of DNA sequence data
Automation of sequencing process
Many large-scale genomic projects

thermotoga ATTTGCCCCA GAAATTAAAG CAAAAACCCC AGTAAGTTGG GGATGGCAAA AAAGGAAAAT
Tthermophi ATTTGCCCCA GGGGTTCCCG CAAAAACCCC AGTAAGTTGG GGATGGCAGG GGAGGAAAAA
Taquaticus ATTTGCCCCA GGGGTTCCCG CAAAAACCCC AGTAAGTTGG GGATGGCAGG GGAGGAAAAA
deinonema- ATTTGCCCCA GGGATTCCCG CAAAAACCCC AGTAAGTTGG GGATGGCAGG GGAGGAAAAA
ChlamydiaB ATTTTCCCCA GAAATTCCCG AAAAAACCCC AATAAATTGG GGATGGCAGG GGAGGAAGGA
flexistips ATTTTCCCCA CAAAAAAAAG AAAAAACCCC AGTAAGTTGG GGATGGCAGG GGAGGAAAAA
borrelia-b ATTTGCCCCA GAAGTTAAAG CAAAAACCCC AATAAGTTGG GGATGGCAGG GGAGGAAAAA
bacteroide ATTTGCCCCA GAAATTCCCG CAAAAACCCC AGTAAATTGG GGATGGCAGG GGAGGAAAAA
pseudomona ATTTGCCCCA GGGATTCCCG CAAAAACCCC AGTAAGTTGG GGATGGCAGG GGAGGAAAAA
ecoli----- GTTTTCCCCA GAAATTCCCG CAAAAACCCC AGTAAGTTGG GGATGGCAGG GGAGGAAAAA

3B bases in human genome

Slide 6: Statistical Methods in Phylogeny
The availability of large amounts of genetic data makes it possible to apply statistical analysis to that data to create evolutionary phylogenies of organisms, organelles, or gene products.

Slide 7: Confluence of events
Development of computationally intensive methods for estimating phylogenies
Abundance of DNA data
– The limiting factor in scientists' ability to analyze genetic data is often the availability of computer time, not the availability of raw data
Development of Grids as a high performance computing architecture
– The concept of computational grids is dramatically changing the way we think about HPC.
IU's biologists were eating our computers alive

Slide 8: Maximum Likelihood
Typical statistical inference: calculate probability of data given the hypothesis
Phylogenetic tree building: tree, tree lengths, and associated likelihood values all calculated from the data. Likelihood values used only for comparisons
ML is the most computationally intensive of the mathematically-based phylogeny methodologies

Slide 9: Tree estimation

Slide 10: (no text content)

Slide 11: DNA replication
Purines: Adenine & Guanine
Pyrimidines: Thymine & Cytosine

Slide 12: Markov model of base substitution
In a small interval of time $dt$ there is a probability $u\,dt$ that a base at a site is replaced
For any site: $P_{ij}(t) = e^{-ut}\,\delta_{ij} + (1 - e^{-ut})\,\pi_j$, where $\delta_{ij}$ is 1 if $i = j$ and 0 otherwise, and $\pi_j$ is the frequency of base $j$
Treat each site as independent (insertions and deletions are outside the capabilities of this program)
Must correct for empirical base frequencies, unequal rates for transitions and transversions, and/or independent rates for specific changes
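
The substitution formula on this slide can be tried out numerically. The following is a minimal sketch, reading the formula as P_ij(t) = e^(-ut) δ_ij + (1 - e^(-ut)) π_j, with δ_ij the Kronecker delta and π_j the frequency of base j. The function name, the argument names, and the example frequencies are illustrative assumptions, not part of fastDNAml.

```python
import math

BASES = "ACGT"

def substitution_probability(i, j, t, u, base_freqs):
    """Probability that base i is observed as base j after time t.

    u is the replacement rate per unit time and base_freqs maps each
    base to its frequency (hypothetical inputs for illustration).
    """
    decay = math.exp(-u * t)             # probability of no replacement event
    same = 1.0 if i == j else 0.0        # Kronecker delta
    return decay * same + (1.0 - decay) * base_freqs[j]

# Example frequencies (made up for illustration; they sum to 1).
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
row = [substitution_probability("A", j, t=0.5, u=1.0, base_freqs=freqs) for j in BASES]
print(sum(row))  # each row of the transition matrix sums to 1 (up to rounding)
```

At t = 0 the function returns the identity (no change), and as t grows the probabilities approach the base frequencies, which is the behaviour the formula implies.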

Slide 13: fastDNAml's phylogeny construction
Objective: find the tree and branch lengths that have the greatest probability of giving rise to the present day sequences
The number of bifurcating unrooted trees for $n$ taxa is $\frac{(2n-5)!}{(n-3)!\,2^{n-3}}$
For 50 taxa the number of possible trees is $O(10^{74})$
So, build trees incrementally, and search within the space of all possible trees looking for best tree
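
The tree count on this slide is easy to check with exact integer arithmetic. This is a small sketch, reading the formula as (2n-5)! / ((n-3)! 2^(n-3)); the function name is an illustrative choice.

```python
from math import factorial

def unrooted_tree_count(n):
    """Number of bifurcating unrooted trees for n labelled taxa (n >= 3)."""
    if n < 3:
        raise ValueError("an unrooted bifurcating tree needs at least 3 taxa")
    return factorial(2 * n - 5) // (factorial(n - 3) * 2 ** (n - 3))

for n in (3, 4, 5, 10, 50):
    print(n, unrooted_tree_count(n))
```

For n = 3, 4, and 5 this gives 1, 3, and 15 trees, and for 50 taxa the result is a 75-digit number, in line with the order of magnitude quoted on the slide.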

Slide 14: fastDNAml algorithm
Compute the optimal tree for three taxa (chosen randomly) - only one topology possible
Randomly pick another taxon, and consider each of the 2i-5 trees possible by adding this taxon into the first, three-taxa tree, where i is the number of taxa after the addition (three trees when i = 4).
Keep the best (maximum likelihood) tree

Slide 15: Initial steps in tree building

Slide 16: Local branch rearrangement
Move any subtree to a neighboring branch (2i-6 possibilities)
Keep best resulting tree
Repeat this step until local swapping no longer improves likelihood value

Slide 17: Nearest neighbor interchange

Slide 18: Iterate
Get sequence data for next taxon
Add new taxon (2i-5 trees)
Keep best
Local rearrangements (2i-6 trees)
Keep best
Keep going….
When all taxa have been added, perform a full tree check
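
The taxon-addition part of this loop can be sketched in a few lines. The code below is an illustration only, not the fastDNAml source: trees are plain edge lists, all function names are hypothetical, the `score` argument is a placeholder standing in for the likelihood calculation, and the local rearrangement and branch-length optimization steps are left out. It assumes taxon names do not collide with the internal node labels `n0`, `n1`, and so on.

```python
import random

def three_taxon_tree(a, b, c):
    """The single unrooted topology for three taxa, as a list of edges."""
    return [("n0", a), ("n0", b), ("n0", c)]

def insert_on_edge(edges, edge_index, taxon, new_node):
    """Split one edge with new_node and attach taxon to it."""
    u, v = edges[edge_index]
    rest = edges[:edge_index] + edges[edge_index + 1:]
    return rest + [(u, new_node), (new_node, v), (new_node, taxon)]

def stepwise_addition(taxa, score, rng):
    """Add taxa one at a time in random order, keeping the best-scoring tree."""
    order = list(taxa)
    rng.shuffle(order)                        # random order of taxon addition
    tree = three_taxon_tree(*order[:3])
    evaluated = 0
    for i, taxon in enumerate(order[3:], start=4):
        # A tree with i-1 taxa has 2i-5 edges, so there are 2i-5 candidates.
        candidates = [insert_on_edge(tree, k, taxon, f"n{i - 3}")
                      for k in range(len(tree))]
        evaluated += len(candidates)
        tree = max(candidates, key=score)     # keep the best candidate
    return tree, evaluated

taxa = [f"t{k}" for k in range(50)]
# Placeholder score: every tree ties, so the first candidate is kept.
tree, evaluated = stepwise_addition(taxa, lambda t: 0.0, random.Random(1))
print(len(tree), evaluated)
```

Passing a different `random.Random` seed changes the order of addition, which is how repeated runs with different randomizations would be set up in this sketch.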

Slide 19: Because of local effects….
Where you end up sometimes depends on where you start
This process searches a huge space of possible trees, and is thus dependent upon the randomly selected initial taxa
Can get stuck in local optimum, rather than global
Must do multiple runs with different randomizations of taxa, and compare the results
Similar trees and likelihood values provide some confidence

Slide 20: How many calculations are there?
For 50 taxa, there are $\sum_{i=4}^{50} \{(2i-5)+(2i-6)\} = 4{,}559$ trees to evaluate, presuming that no local rearrangement ever produces an improved tree. And each step is fairly computationally intensive.
This algorithm is ideal for parallelization, because communications involve at most a tree and a probability value
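
The total on this slide can be reproduced directly, reading the sum as running over i = 4 to 50 with (2i-5) insertion trees and (2i-6) rearrangement trees per step. The function name is an illustrative choice.

```python
def minimum_trees_evaluated(n_taxa):
    """Trees scored if no local rearrangement ever improves the tree."""
    return sum((2 * i - 5) + (2 * i - 6) for i in range(4, n_taxa + 1))

print(minimum_trees_evaluated(50))  # 4559
```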

Slide 21: Overview of parallel program flow

Slide 22: Geographically distributed computing
The high computation/communication ratio makes this program a good candidate for geographic distribution
Time to completion is a constant forever and ever
The key task is to combine geographically distributed resources so that large jobs can be completed in tolerable (for the biologist) amounts of wall clock time

Slide 23: Programming for geographically distributed computing
Conversion of PVM version to grid-based computations
Load balancing
Handles timeouts, system crashes, etc.
Conversion to MPI/Globus
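
The load-balancing and crash-handling ideas listed here can be illustrated with a small master/worker sketch. This is not the PVM or MPI/Globus code: it uses Python threads in place of remote workers, the names `evaluate_round` and `evaluate` are hypothetical, and it only retries work items whose evaluation raised an error, without real timeout detection. Each work item is one tree and each result is one score, matching the low communication cost described in the talk.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def evaluate_round(candidates, evaluate, workers=4, max_retries=2):
    """Score every candidate tree; return (best_tree, best_score)."""
    best = None
    attempts = {idx: 0 for idx in range(len(candidates))}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Idle workers pull the next tree from the pool's queue, which
        # balances load when evaluations take unequal time.
        pending = {pool.submit(evaluate, tree): idx
                   for idx, tree in enumerate(candidates)}
        while pending:
            for future in as_completed(list(pending)):
                idx = pending.pop(future)
                try:
                    score = future.result()
                except Exception:
                    # Simulated worker failure: hand the tree out again.
                    attempts[idx] += 1
                    if attempts[idx] > max_retries:
                        raise
                    pending[pool.submit(evaluate, candidates[idx])] = idx
                    continue
                if best is None or score > best[1]:
                    best = (candidates[idx], score)
    return best
```

A caller would pass the candidate trees for one addition or rearrangement step together with a scoring function, and use the returned best tree as the starting point for the next step.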

Slide 24: StarTAP

Slide 25: (no text content)

Slide 26: SC98 Demonstration
Indiana University – SP nodes
NUS – SP nodes
ACSys – DEC Workstations
ImmersaDesk on the SC98 show floor as part of the IU/EVL iGRID demonstration

Slide 27: (no text content)

Slide 28: Cytoplasmic Coat Proteins

Slide 29: Performance of fastDNAml

Slide 30: Applications
Better understanding of evolution (Coelacanths)
Medicine
– example: our cousins, the fungi
– classification of genes & gene products
Maintenance of biodiversity

Slide 31: What we've learned so far
We can run the program
We can do productive biology
Security is a headache, especially with PVM
Security is a headache, especially with Globus
The time difference causes some problems, but more benefits in terms of the partnering opportunities

Slide 32: Computing grids and Power Grids
When you plug your hair dryer into an outlet, you don't know how the power was generated or where it came from. Someday you'll plug your laptop into a wall and cycles and storage will be available in a similarly magical fashion, but we're a long way from that (plus it is probably an unrealistic goal for high-end computing).
Before the current electrical power grid, there were regional electrical suppliers
Before the regional electrical suppliers, there were battles over power standards, organizations of power companies, what type of generators were best, etc.

Slide 33: Models for Computational Grids
Geographically distributed organizations (NASA, ASCI)
Alliances and consortia (NCSA, NPACI, CIC)
A new approach: communities of interest

Slide 34: Future Plans
Make the 'evolutionary biology grid' a (periodically available) production service
Enhance MPI/Globus version of code, make code publicly available
Step up a level in parallelization
Key objective: create a geographically-distributed version of fastDNAml that makes possible new advances in understanding of evolutionary biology.

Slide 35: Particular benefits of IBM RS/6000 SPs
Distributed memory 'preadapts' code for an individual SP to a geographically distributed scenario
Excellent interface with storage systems
Luck never hurts: many of our collaborators and potential collaborators have significant IBM installations

Slide 36: Acknowledgements
In addition to the intellectual debts noted at the beginning of this talk, our research has been greatly aided by Sponsored University Research grants from IBM
This work would not have been possible without the cooperation and collaboration of Dr. Jeffrey Palmer and his research group.

Slide 37: Acknowledgements, con't
The phylogeny depicted in slide 4 when this slide deck was presented was taken from E. Colbert, The Age of Reptiles. W.W. Norton, NY, NY. This diagram is not shown in this archived version of the slide show out of respect for copyright.
The graphic of an unrooted tree in slide 9 is adapted from Olsen et al.
Les Teach [IU] created all other graphics for this talk

Slide 38: References
Felsenstein, J. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17:
Olsen, Gary J., H. Matsuda, R. Hagstrom, R. Overbeek. fastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Computer Applications in the Biosciences 10: 41-48

Slide 39: References, con't
Foster, I., and C. Kesselman. The Grid: blueprint for a new computing infrastructure. Morgan Kaufmann Publishers, San Francisco
Baxevanis, A.D., and B.F.F. Ouellette. Bioinformatics: a practical guide to the analysis of genes and proteins. Wiley-Interscience, NY.

Slide 40: Thank you
Any questions?

Slide 41: Except where otherwise noted, the contents of this presentation are © the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license. This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.