Accelerating Evolutionary Molecular Phylogenetic Analyses on the NUS TCG Grid
Hu Yongli, Department of Biochemistry, Yong Loo Lin School of Medicine


What Is Phylogeny?
The science of estimating the evolutionary past
Fossil data
Morphological data
Protein sequence data
DNA sequence data
Etc.
What Is Molecular Phylogeny?
Baldauf, S.L., 2003, Trends Genet. 19(6):345-51 (retrieved on 21 Nov 09)

Maurer-Stroh, S. et al., 2009, Biol. Direct 4:18

Which Software to Use?
PHYLIP, MEGA, PAUP*, PHYLO_WIN, VOSTROG, MAC_CLADE, TURBOTREE, EVOMONY

PHYLIP
Developed in the 1980s
Most commonly used package for inferring phylogenies
Most widely distributed phylogeny package
Used for building the largest number of published phylogenetic trees
Contains a large number of methods and can handle many types of data
Open source
Retrieved on 21 Nov 09
Abdennadher, N. and Boesch, R., 2007, Stud Health Technol Inform. 126:55-64

Building a Protein Phylogenetic Tree
Input sequences:
>protein_1 GJYWLKADWWGGMD…
>protein_2 KKLLDWGGJWGGMD…
>protein_3 KKLLDWGKJWGGME…
>protein_4 GJYWLAADWWGGMS…
Pipeline: seqboot -> protdist -> neighbor -> consense -> drawgram
Output tree with leaves protein_1, protein_2, protein_3, protein_4
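The first stage of the pipeline, seqboot, produces bootstrap replicates by resampling alignment columns with replacement. The following is a minimal Python sketch of that idea only, not PHYLIP's own code; the four short sequences are a hypothetical toy alignment, and the function name `bootstrap_alignment` is invented for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns drawn with replacement."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all sequences must have the same aligned length")
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


# Hypothetical toy alignment (not the sequences from the slides).
toy_alignment = {
    "protein_1": "MKVLAAGIW",
    "protein_2": "MKILAAGLW",
    "protein_3": "MRVLSAGIW",
    "protein_4": "MKVLAAGIF",
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_alignment(toy_alignment, rng) for _ in range(100)]
```

Each replicate keeps the same sequence names and length as the input, which is what lets every replicate be handed to the distance calculation separately.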

Why Protdist?
Most time-consuming step
Building a tree with 178 protein sequences*:
protdist: ~9 hours and 6 minutes
seqboot, neighbor and consense: ~2 minutes each
Can be parallelized and placed on the grid: each of the 100 seqboot output datasets can be used independently for the calculation of protein distances in protdist
* Sun Fire 6800 server with 16 CPUs at 900 MHz and 16 GB RAM
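Because the replicates are independent, the combined seqboot output can be cut into one file per dataset before distribution. The sketch below assumes the replicates sit back to back in one text file and that each begins with a PHYLIP-style header line holding two integers (number of sequences, number of sites); the function name and the sample data are hypothetical.

```python
import re

# A header line is assumed to hold exactly two integers, e.g. "   4   120".
HEADER = re.compile(r"^\s*\d+\s+\d+\s*$")


def split_datasets(text):
    """Split concatenated PHYLIP-style datasets into a list of strings."""
    datasets, current = [], []
    for line in text.splitlines():
        # A new header closes the dataset collected so far.
        if HEADER.match(line) and current:
            datasets.append("\n".join(current) + "\n")
            current = []
        # Skip blank lines before the first dataset; keep them inside one.
        if line.strip() or current:
            current.append(line)
    if current:
        datasets.append("\n".join(current) + "\n")
    return datasets


# Hypothetical two-replicate sample.
sample = (
    "   2    4\n"
    "prot_a    MKVL\n"
    "prot_b    MKIL\n"
    "   2    4\n"
    "prot_a    KVLL\n"
    "prot_b    KILL\n"
)
parts = split_datasets(sample)
```

Sequence lines contain letters, so they cannot be mistaken for a header under this assumption.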

Enabling PHYLIP on the NUS TCG

Steps Taken to Place Meta-PHYLIP on the NUS TCG
Preparing the protdist program in meta-PHYLIP
Data and parameter file preparation
Running meta-PHYLIP on the NUS TCG

Preparing the Protdist Program in Meta-PHYLIP
Downloading PHYLIP 3.68
Compiling the source code on a Linux server*
Testing the functionality of meta-PHYLIP on the NUS atlas-4 Linux computer cluster
* Intel Pentium 4 CPU at 3.00 GHz, 4 GB of RAM, running Slackware 10.0

Steps Taken to Place Meta-PHYLIP on the NUS TCG Grid
Preparing the protdist program in meta-PHYLIP
Data and parameter file preparation
Running meta-PHYLIP on the NUS TCG

Data and Parameter File Preparation (Data Files = input1.dat)
Pipeline: seqboot -> protdist -> neighbor -> consense -> drawgram
Input sequences:
>protein_1 GJYWLKADWWGGMD…
>protein_2 KKLLDWGGJWGGMD…
>protein_3 KKLLDWGKJWGGME…
>protein_4 GJYWLAADWWGGMS…
seqboot output datasets: Seqboot_1, Seqboot_2, Seqboot_3, …, Seqboot_99, Seqboot_100

Data and Parameter File Preparation (Parameter Files = input2.dat)
Parameter file contents:
input1.dat
F
output1.dat
Y
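The four entries of the parameter file read like the answers a menu-driven PHYLIP program expects on standard input: the input file name, F to write to a new file, the output file name, and Y to accept the settings. That reading is an assumption, as are the `Param_1` to `Param_100` names, which are borrowed from the grid diagram. A minimal Python sketch that generates the text:

```python
def parameter_text(data_name="input1.dat", output_name="output1.dat"):
    """Build the parameter file: one response per line, in prompt order."""
    responses = [
        data_name,    # name of the data file to read
        "F",          # assumed: choose "write to a new file"
        output_name,  # name of the output file
        "Y",          # assumed: accept the current settings
    ]
    return "\n".join(responses) + "\n"


# One identical parameter file per bootstrap dataset (names hypothetical).
parameter_files = {f"Param_{i}": parameter_text() for i in range(1, 101)}
```

Keeping the file names fixed means the same parameter text can accompany every dataset, so only the data file differs between work units.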

Steps Taken to Place Meta-PHYLIP on the NUS TCG
Preparing the protdist program in meta-PHYLIP
Data and parameter file preparation
Running meta-PHYLIP on the NUS TCG

Running Meta-PHYLIP on the NUS TCG
Download the parametric study program
Prepare the zipped input file "input.zip" (data + parameter files)
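Bundling the data and parameter files can be done with any zip tool. The sketch below uses Python's standard `zipfile` module and builds the archive in memory; the member names (`Seqboot_i`, `Param_i`) and the flat layout are assumptions, since the slides do not specify what the parametric study program expects inside input.zip.

```python
import io
import zipfile


def build_input_zip(files):
    """Pack a {member name: text} mapping into zip archive bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(files):
            archive.writestr(name, files[name])
    return buffer.getvalue()


# Placeholder contents; real data files would come from seqboot.
files = {}
for i in range(1, 101):
    files[f"Seqboot_{i}"] = f"placeholder dataset {i}\n"
    files[f"Param_{i}"] = "input1.dat\nF\noutput1.dat\nY\n"

payload = build_input_zip(files)  # bytes that could be saved as input.zip
```

Writing `payload` to disk with `open("input.zip", "wb")` would give a file with 200 members, one data file and one parameter file per bootstrap dataset.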

Data Processing on the Grid
Input.zip (100 seqboot output files + parameter files) -> Koala1 (GridMP server)
Work units: Seqboot_1 + Param_1, Seqboot_2 + Param_2, Seqboot_3 + Param_3, …, Seqboot_99 + Param_99, Seqboot_100 + Param_100
Each work unit is processed by Meta-PHYLIP and returns Output1.dat and Output2.dat

Log Files
Parameter file contents:
input1.dat
F
output1.dat
Y

Evaluating the Speedup of Meta-PHYLIP

Evaluation of Speedup
Speedup is explored with:
Datasets of the same protein length but different numbers of protein sequences
Real-life biological datasets
Speedup = T_p / RT_100
RT_100: time (in seconds) from job creation to the return of the last output to the grid server
T_p: total CPU time (in seconds) required to run the program in serial
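As a worked illustration, speedup is taken here as serial CPU time divided by grid turnaround time, the orientation under which values above 1 mean the grid run finished sooner. The function name and the numbers in the example are hypothetical and are not measurements from this study.

```python
def speedup(serial_cpu_seconds, grid_turnaround_seconds):
    """Ratio of serial CPU time (T_p) to grid turnaround time (RT_100)."""
    if grid_turnaround_seconds <= 0:
        raise ValueError("grid turnaround time must be positive")
    return serial_cpu_seconds / grid_turnaround_seconds


# Hypothetical example: 1 hour of serial CPU time, 2 minutes on the grid.
example = speedup(3600.0, 120.0)  # 30.0
```

With 100 independent datasets, a value approaching 100 would require every dataset to run at the same time with no scheduling or transfer delay.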

Speedup Achieved with Datasets of Different Numbers of Sequences
Speedup achieved ranges from 14.1 to 65.0 times
Speedup for small datasets is lower than for larger datasets

Speedup Achieved with Real Biological Data
Speedup achieved ranges from 25.0 to 58.1 times
Speedup for small datasets is lower than for larger datasets

Discussion and Conclusion
Advances in sequencing technology have brought about an explosion of sequence data
Phylogenetic analyses can no longer be carried out within an acceptable time frame
Placing PHYLIP on the grid will greatly increase the rate of molecular phylogenetic analyses
Acceleration depends on the availability of idle computer cycles on grid clients
Important for the study of disease outbreaks and emerging pandemics, especially in disease treatment and pandemic containment
Future challenge: enhance distribution, generality and efficiency
Sanderson, M.J. and Driskell, A.C., 2003, Trends Plant Sci. 8(8):374-379
Maurer-Stroh, S. et al., 2009, Biol. Direct 4:18

Acknowledgements
A/Prof Tan Tin Wee
Mark De Silva
Lim Kuan Siong
Wang Jun Hong
Mohammad Asif Khan
Heiny Tan
All members of BIC

THANK YOU