1 INDIANA UNIVERSITY
Parallel implementation and performance of fastDNAml - a program for maximum likelihood phylogenetic inference
Craig A. Stewart, David Hart, Donald K. Berry, Gary J. Olsen, Eric Wernert, Will Fischer
stewart@iu.edu
14 November 2001

2 License Terms
Please cite as: Stewart, C.A., D. Hart, D.K. Berry, G.J. Olsen, E. Wernert, W. Fischer. 2001. Parallel implementation and performance of fastDNAml - a program for maximum likelihood phylogenetic inference. Presentation. Presented at IEEE/ACM SC01 Conference, Nov. 10-16, Denver, CO. Available from: http://hdl.handle.net/2022/14004
Except where otherwise noted, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.

3 Phylogenetic tree – a depiction of the course of evolution
(The diagram originally shown here was removed prior to archiving due to a rights question.)

4 Evolutionary processes
Evolution proceeds as a series of bifurcations
The same techniques work with genes, gene products, and taxa

5 Rooted and unrooted trees
Finding the best unrooted tree and finding the root of a tree are two different processes
Rooting a tree is more a biological than a computing problem
Figure: Cytoplasmic Coat Proteins (analysis done from Singapore as part of iGrid display at SC98)

6 Why study phylogenetics?
Useful in understanding disease-causing organisms.
Examples
–Timing origin of HIV-1 pandemic (Korber et al.): 1931 +/- 12
–Fungi and animals
The original slide deck had a diagram from Korber et al. 2000. Timing the Ancestor of the HIV-1 Pandemic Strains. Science 9 June 2000: 1789-1796. DOI:10.1126/science.288.5472.1789 http://www.sciencemag.org/content/288/5472/1789.full

7 Availability of large amounts of genetic data makes possible the use of statistical techniques to infer phylogenies, but…
http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html

8 Why is phylogenetic inference an HPC problem?
The number of bifurcating unrooted trees for n taxa is
$$\frac{(2n-5)!}{(n-3)!\,2^{n-3}}$$
Problem of searching among trees is NP-complete
Larger data sets tend to produce better results (# of taxa and length of sequences)
HPC techniques are required to make large scale phylogenetic inference practical

Taxa    Possible unrooted trees
50      2.8 x 10^74
100     1.7 x 10^182
150     4.2 x 10^301
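
The count formula on this slide can be checked directly with exact integer arithmetic. The sketch below is illustrative only; the function name `num_unrooted_trees` is an assumption and does not come from fastDNAml.

```python
from math import factorial


def num_unrooted_trees(n):
    """Number of bifurcating unrooted trees for n taxa.

    Implements (2n-5)! / ((n-3)! * 2**(n-3)) with exact integers.
    """
    if n < 3:
        raise ValueError("the formula applies to n >= 3 taxa")
    return factorial(2 * n - 5) // (factorial(n - 3) * 2 ** (n - 3))


if __name__ == "__main__":
    # Print the counts for the taxon numbers listed in the slide's table.
    for n in (50, 100, 150):
        print(n, f"{num_unrooted_trees(n):.1e}")
```

For small inputs the function gives 1, 3, and 15 trees for 3, 4, and 5 taxa, which is an easy sanity check on the formula.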

9 Markov model of base substitution
In any small interval of time there is a small chance of a mutation at any site (sites are independent)
4 x 4 matrix for DNA sequences (site-specific)
Only single nucleotide changes are considered, not insertions and deletions
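
To make the 4 x 4 matrix concrete, here is a minimal sketch using the Jukes-Cantor model, the simplest Markov model of base substitution. This is an assumption chosen for illustration: the slide does not name the substitution model, and the model used by fastDNAml may well differ. The names `BASES` and `jukes_cantor_matrix` are hypothetical.

```python
import math

BASES = "ACGT"  # row and column order of the matrix


def jukes_cantor_matrix(t):
    """Return the 4 x 4 matrix of substitution probabilities.

    Entry [i][j] is the probability that base i has become base j after
    a branch of length t (expected substitutions per site), assuming the
    Jukes-Cantor model: equal base frequencies and equal rates.
    """
    decay = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * decay   # base unchanged
    diff = 0.25 - 0.25 * decay   # base changed to one specific other base
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


if __name__ == "__main__":
    for base, row in zip(BASES, jukes_cantor_matrix(0.1)):
        print(base, [round(p, 4) for p in row])
```

Each row sums to 1, a zero-length branch gives the identity matrix, and very long branches approach 0.25 in every entry.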

10 Maximum Likelihood Phylogenetic Inference
Objective: find the (unrooted) tree that has the highest overall likelihood value
Branching patterns, branch lengths, and likelihood values are all calculated from the data
Likelihood values are used only for comparisons
ML is the most computationally intensive of the mathematically-based phylogeny methodologies

11 fastDNAml
Based on Felsenstein’s DNAml
Program created by Gary Olsen et al.
–New search algorithms
–Parallel code (one of the first parallel phylogenetics codes)
Olsen is the primary developer of the serial version

12 Basic fastDNAml algorithm – adding taxa
Optimize the tree for 3 (randomly chosen) taxa; only one topology is possible
Randomly pick another taxon: when the i-th taxon is added, (2i-5) trees are possible, one for each branch of the current tree
Keep the best (maximum likelihood) tree
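
The (2i-5) figure explains why stepwise addition is tractable: the number of trees scored grows quadratically with the number of taxa, while the full tree space grows as the product of the same terms. The sketch below only counts candidates and ignores the rearrangement steps; the function names are assumptions for illustration.

```python
from math import prod


def insertion_points(i):
    """Candidate trees when the i-th taxon joins a tree of i-1 taxa.

    An unrooted bifurcating tree of i-1 taxa has 2(i-1)-3 = 2i-5 branches,
    and the new taxon can be attached to any one of them.
    """
    if i < 4:
        raise ValueError("the first three taxa admit only one topology")
    return 2 * i - 5


def trees_scored_by_stepwise_addition(n):
    """Total candidates scored while adding taxa 4..n one at a time."""
    return sum(insertion_points(i) for i in range(4, n + 1))


def size_of_tree_space(n):
    """All unrooted bifurcating trees on n taxa, as a product of the same terms."""
    return prod(insertion_points(i) for i in range(4, n + 1))


if __name__ == "__main__":
    print(trees_scored_by_stepwise_addition(50))  # thousands of trees
    print(f"{size_of_tree_space(50):.1e}")        # versus the whole tree space
```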

13 Basic fastDNAml algorithm – branch rearrangement
Move any subtree crossing n vertices (if n=1 there are 2i-6 possibilities)
Keep the best resulting tree
Repeat this step until local swapping no longer improves the likelihood value

14 Basic fastDNAml algorithm – iterate
Get sequence data for next taxon
Add new taxon (2i-5 candidate trees); keep best
Rearrangements; keep best
Keep going….
When all taxa have been added, perform a full tree check (crossing 2 to n vertices)
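
The control flow of the add, keep best, rearrange, keep best loop can be sketched generically. This is not the fastDNAml source: every name below is hypothetical, the tree operations are passed in as callables, and the demo replaces trees with a toy problem (orderings of numbers scored by how sorted they are) purely so that the loop can be run.

```python
def improve(tree, neighbours, score):
    """Move to the best neighbouring tree until no neighbour scores higher."""
    while True:
        best = max(neighbours(tree), key=score, default=tree)
        if score(best) <= score(tree):
            return tree
        tree = best


def stepwise_search(items, start, insertions, local_moves, score, full_moves=None):
    """Greedy search: add one item at a time, rearranging after each addition.

    start(first_three)     -> the single initial tree
    insertions(tree, item) -> candidate trees that include the new item
    local_moves(tree)      -> trees reachable by a local rearrangement
    full_moves(tree)       -> wider rearrangements for the final check
    score(tree)            -> value to maximise (a likelihood in the real program)
    """
    tree = start(items[:3])
    for item in items[3:]:
        tree = max(insertions(tree, item), key=score)  # add taxon, keep best
        tree = improve(tree, local_moves, score)       # rearrange, keep best
    if full_moves is not None:
        tree = improve(tree, full_moves, score)        # final full tree check
    return tree


# Toy stand-ins: a "tree" is a tuple of numbers, better when closer to sorted.
def toy_start(first_three):
    return tuple(sorted(first_three))


def toy_insertions(tree, item):
    return [tree[:k] + (item,) + tree[k:] for k in range(len(tree) + 1)]


def toy_swaps(tree):
    out = []
    for k in range(len(tree) - 1):
        t = list(tree)
        t[k], t[k + 1] = t[k + 1], t[k]
        out.append(tuple(t))
    return out


def toy_score(tree):
    inversions = sum(a > b for i, a in enumerate(tree) for b in tree[i + 1:])
    return -inversions


if __name__ == "__main__":
    print(stepwise_search([5, 2, 9, 1, 7], toy_start, toy_insertions,
                          toy_swaps, toy_score, full_moves=toy_swaps))
```

In the toy problem the greedy loop always reaches the sorted order; real tree searches have no such guarantee, which is why the search can stop at a local optimum.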

15 Because of local effects….
Can get stuck in a local optimum rather than the global optimum
Must do multiple runs with different randomizations of taxa, and compare the results
A set of similar trees with similar (high) likelihood values provides some confidence in the results
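
Running the search from several random taxon orderings and comparing the outcomes can be sketched as a small wrapper. The names `multi_start`, `search`, and `score` are assumptions for illustration; `search` stands for any function that turns an ordered list of taxa into a tree.

```python
import random


def multi_start(search, taxa, score, runs=10, seed=0):
    """Run `search` on `runs` random orderings of `taxa`.

    Returns a list of (score, tree) pairs, best first, so that the top
    results can be compared for agreement.
    """
    rng = random.Random(seed)  # fixed seed keeps the orderings reproducible
    results = []
    for _ in range(runs):
        order = list(taxa)
        rng.shuffle(order)
        tree = search(order)
        results.append((score(tree), tree))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


if __name__ == "__main__":
    # Stand-in search and score, used only to exercise the wrapper.
    def toy_search(order):
        return tuple(order)

    def toy_score(tree):
        return -abs(tree[0] - tree[-1])

    for value, tree in multi_start(toy_search, range(6), toy_score, runs=5)[:3]:
        print(value, tree)
```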

16 Parallelization of fastDNAml
At each step, many trees may be analyzed simultaneously
A tree and its likelihood value are the only communication needed
High computation/communication ratio: hundreds of thousands of floating-point operations per byte of data transmitted back to the main program in the examples used in performance analysis

17 Overview of parallel program flow

18 Parallel implementation of fastDNAml
Program modules
–Master (generates trees, receives back from Foreman the best tree at each step)
–Foreman (dispatches trees to workers, determines best tree, tracks activity of workers)
–Worker
–Monitor (instrumentation)
New features in fastDNAml
–Calls to message passing libraries sequestered to one file
–Parallel versions include fault tolerance features (useful in large clusters and grid computing)
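
The dispatch pattern described for the Foreman (send each candidate tree to a worker, collect a likelihood value per tree, report the best) can be sketched in a few lines. This is an assumption-laden illustration, not the fastDNAml code: it uses a thread pool inside one process instead of message passing between processes, it has no fault tolerance or monitor, and `evaluate_round` and `likelihood` are hypothetical names. Python threads also do not speed up CPU-bound work, so the sketch shows the control flow only.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_round(candidate_trees, likelihood, max_workers=4):
    """Score every candidate tree on a pool of workers and return the best.

    Only a tree goes out to each worker and only its likelihood value
    comes back, mirroring the small amount of communication per tree.
    Returns a (tree, value) pair.
    """
    trees = list(candidate_trees)
    if not trees:
        raise ValueError("no candidate trees to evaluate")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        values = list(pool.map(likelihood, trees))  # results keep input order
    return max(zip(trees, values), key=lambda pair: pair[1])


if __name__ == "__main__":
    # Stand-in "trees" and "likelihood" used only to run the sketch.
    print(evaluate_round(["a", "bbb", "cc"], len))
```

A master loop would call `evaluate_round` once per step, first with the candidate trees from adding a taxon and then with the candidates from each rearrangement pass.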

19 Performance analysis of fastDNAml
Used three data sets (50, 101, 150 taxa) from studies of Microsporidia having 1858 or 1269 positions
Performance analyzed on Indiana University’s IBM SP, using the serial version as the baseline for performance
Program set to cross 5 vertices in the rearrangement step
10 random orderings (three replications each), 1 to 64 processors

20 Performance of fastDNAml

21 Performance of fastDNAml

22 Other phylogenetics software
Ceron – maximum likelihood analysis
–Parallel (PVM) program based on Felsenstein’s DNAml
–fastDNAml as we are using it does more extensive branch swapping
–Ceron version: speculative calculations based on the assumption that rearrangement won’t improve the tree
–Essentially two different search strategies
GRAPPA (Bader et al.): breakpoint analysis program; scales well

23 Why bother with parallel code?
Why not just achieve speedup of n on n processors by running n independent jobs?
Practical benefits of seeing results quickly
Parallel program permits assault on much more complicated problems (e.g. protein sequences)

24 Visualization

27 Future Plans
A Condor version of fastDNAml
Improvements to tree optimization process
Protein sequences

28 Summary
Significant speedup in time to solution.
Speed enables biologists to choose phylogenetic methodologies on the basis of the quality of results
Scales well
Available from: www.indiana.edu/~rac/hpc/fastDNAml/index.html

29 Acknowledgements
This work was supported in part by
–Shared University Research grants from IBM, Inc.
–The Lilly Endowment for the Indiana Genomics Initiative (INGEN) of Indiana University. [www.ingen.iu.edu]
Diagrams for this talk created by W. Leslie Teach, UITS

30 Thank you. Any questions?

