E-science grid facility for Europe and Latin America
Computational challenges on Grid Computing for workflows applied to Phylogeny
R. Isea 1, E. Montes 2, A. J. Rubio-Montero 2 and R. Mayo 2
1 Fundación IDEA (Venezuela)
2 CIEMAT (Spain)
IWPACBB 2009, Salamanca, June 12th, 2009
Outline
Phylogenetics: a reminder
Challenges in Phylogenetics
  – Computational methods: MrBayes
  – Exploiting Grid technology
MrBayes and Bioinformatics resources on the Grid
The PhyloGrid approach
  – General description and objectives
  – Taverna workflow
  – GridSphere portal
  – Future work: GridWay metascheduler
Some results: HPV case study
Summary and conclusions
Phylogenetics: a reminder
Phylogeny: reconstruction of the evolutionary history (evolutionary tree) of organisms
  – Influence and relationships between species
  – Evolution of selected populations
Applications in Life Sciences, Industry, etc.:
  – Learning the real history of evolution: the Tree of Life
  – Drug discovery
  – Tracing geographical origin and dating the introduction of strains
  – Prediction of gene and protein function
  – Epidemiological studies
[Figure: complete Tree of Life]
[Figure: in July 1837 Darwin drew his first known sketch of an evolutionary tree]
Computational problem: so many trees…
Number of possible labelled topologies with n species or taxa, for rooted and for unrooted trees
[Table: number of taxa vs. number of rooted trees and number of unrooted trees]
Exhaustive enumeration of all possible phylogenies is not computationally feasible
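To make the growth concrete, the sketch below counts labelled tree topologies for n taxa. It assumes strictly bifurcating trees and uses the standard textbook counts, (2n-3)!/(2^(n-2) (n-2)!) for rooted trees and (2n-5)!/(2^(n-3) (n-3)!) for unrooted trees. The function names are illustrative and do not come from the slides.

```python
from math import factorial


def rooted_trees(n: int) -> int:
    """Number of rooted, bifurcating, labelled trees for n >= 2 taxa."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n: int) -> int:
    """Number of unrooted, bifurcating, labelled trees for n >= 3 taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


if __name__ == "__main__":
    # Print a small table: taxa, rooted trees, unrooted trees.
    for n in (3, 4, 5, 10, 20, 50):
        print(n, rooted_trees(n), unrooted_trees(n))
```

With only 10 taxa there are already 34,459,425 rooted topologies, which is why heuristic or sampling methods are used instead of exhaustive search.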
Computational methods
Phenetics: no evolutionary model
  – Distance-matrix based methods (Neighbour-Joining)
Cladistics:
  – Maximum Parsimony (not statistically consistent)
  – Maximum Likelihood
  – Bayesian inference (Markov Chain Monte Carlo): simulation techniques for approximating the posterior probability distribution of trees
MrBayes
  – Sequential and parallel implementations (MPI enabled)
  – High CPU and memory consumption:
      50 taxa: simulation of generations ~ 50 hours on a P4 2.8 GHz
      2900 sequences of HIV-1: a computational challenge
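The idea of approximating a posterior distribution by simulation can be illustrated with a toy random-walk Metropolis sampler. This is only a sketch on a single scalar parameter (the proportion p of differing sites between two sequences, with a uniform prior), not the tree-space algorithm that MrBayes implements. The function names, the step size and the toy data are assumptions chosen for illustration.

```python
import math
import random


def log_posterior(p, k, n):
    """Unnormalised log posterior of p given k differing sites out of n.

    Assumes a uniform prior on (0, 1) and a binomial likelihood.
    """
    if not 0.0 < p < 1.0:
        return -math.inf
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def metropolis(k, n, steps=20000, burn_in=2000, step_size=0.05, seed=1):
    """Random-walk Metropolis sampler; returns the samples kept after burn-in."""
    rng = random.Random(seed)
    p = 0.5
    current = log_posterior(p, k, n)
    samples = []
    for i in range(steps):
        proposal = p + rng.uniform(-step_size, step_size)
        candidate = log_posterior(proposal, k, n)
        # Accept with probability min(1, posterior ratio).
        if candidate >= current or rng.random() < math.exp(candidate - current):
            p, current = proposal, candidate
        if i >= burn_in:
            samples.append(p)
    return samples


if __name__ == "__main__":
    draws = metropolis(k=20, n=100)  # toy data, not from the slides
    print(sum(draws) / len(draws))   # should lie near (k + 1) / (n + 2)
```

For this toy model the exact posterior is a Beta distribution, so the sample mean can be checked against (k + 1) / (n + 2). For trees no such closed form exists, and the chain must run for many generations, which is where the CPU cost comes from.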
Challenges for Bioinformatics
Still a computational problem
  – Part of the scientific community: inefficient local facilities
  – Rise in provision of HPC facilities: additional skills required
A different approach: access computing infrastructures irrespective of their location
Grid Computing
Why Grid Computing?
Grids represent a powerful new tool for e-Science
  – Provide seamless sharing of computing and storage resources
  – Enable the creation of scalable VOs (Virtual Organisations): Biomed VO
  – Service Grids (EGEE, EELA) and Opportunistic Grids
Benefit for applications demanding non-trivial computing capabilities
Local and remote computing and storage facilities
Bioinformatics Grid resources
Wide range of Bioinformatics resources through Web interfaces:
  – Public database projects (genomes, proteins, etc.): EMBL-EBI (UK), NCBI (USA), DDBJ and PDBJ (Japan), etc.
  – Web services for Bioinformatics toolkits: EBI web services, NCBI Entrez Utils, DDBJ, BioMoby services
  – Index/registry servers for Bioinformatics Web services: EMBRACE service registry (BioCatalogue), BioMoby Central Registry
Grid-enabled software packages:
  – EELA-2: grEMBOSS (UNAM)
Grid portals to mask applications
  – Genius, GridSphere
Grid infrastructures & VOs
  – EGEE related: Biomed, GENE, EELA-prod VOs
  – myGrid, caBIG, TeraGrid
How to access MrBayes on the Grid
Simply sending a standard job to a site
  – Software must be preinstalled at the sites
  – Successfully tested in several projects:
      National Grid Service (UK)
      FIRB LIBI “International Laboratory for Bioinformatics” project (Italy)
      BioinfoGRID project
      EELA: MPI version installed and tested at the EELA-CIEMAT site
  – Supported by EELA-2/EGEE sites
Grid bureaucracy: certificates, VOs, etc.
  – Biologists are usually not advanced Grid users
Need for friendly interfaces to Grid facilities
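As an illustration of what "sending a standard job" involves, the sketch below writes a gLite-style JDL description for an MrBayes run. The attribute names (Type, JobType, NodeNumber, Executable, Arguments, StdOutput, StdError, InputSandbox, OutputSandbox) are JDL attributes, but the exact set a given site expects for MPI jobs is an assumption here. The wrapper script `run_mrbayes.sh` and all file names are hypothetical.

```python
def jdl_list(items):
    """Render a Python sequence as a JDL list of quoted strings."""
    return "{" + ", ".join(f'"{item}"' for item in items) + "}"


def build_jdl(executable, arguments, input_files, output_files, nodes=1):
    """Return the text of a JDL job description (sketch, not site-specific)."""
    lines = ['Type = "Job";']
    if nodes > 1:
        # Assumption: MPI jobs are requested through JobType/NodeNumber;
        # a particular middleware release or site may expect other attributes.
        lines.append('JobType = "MPICH";')
        lines.append(f"NodeNumber = {nodes};")
    lines += [
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        f"InputSandbox = {jdl_list(input_files)};",
        f"OutputSandbox = {jdl_list(['std.out', 'std.err'] + list(output_files))};",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical wrapper script and file names.
    print(build_jdl("run_mrbayes.sh", "hpv_l1.nex",
                    ["run_mrbayes.sh", "hpv_l1.nex"],
                    ["results.tar.gz"], nodes=8))
```

The resulting text would be saved to a file and handed to the middleware's submission command from a User Interface machine with a valid proxy certificate. These are the steps a portal can hide from the biologist.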
PhyloGrid aim
Offer the scientific community an easy interface for calculating phylogenies on the Grid, without requiring the user to know the computational procedure:
  – Based on the MPI-enabled version of MrBayes
  – By means of a Taverna workflow
  – Takes advantage of the computational power of current Grid infrastructures
The use of Taverna workflows:
  – Allows multiple database selection
  – Extendable with access to complementary tools (ClustalW-MPI) or other workflows (myExperiment repository)
PhyloGrid architecture
[Architecture diagram. Components: GridSphere Portal + WF Enactor/Engine (portal certificate); gLite UI + Submission WS; gLite Grid (WMS, LFC Catalog, SE, CE, WNs). Protocols shown: HTTPS, SOAP, Grid protocols]
Taverna Workflow Management System
A bioinformatician can easily implement Grid workflows without Grid skills
Public workflow repository (myExperiment)
Several plugins to use WS
  – myGrid, caBIG, GridSAM, BioMoby
  – Many public databases
  – GT4 services and gRavi developer framework
Many tools/plugins
  – File manipulation, format conversion, local and remote execution, visualization applets, tools for accessing WS
PhyloGrid Workflow for MrBayes
  – Input parameters received from the GridSphere portal
  – Conversion of ALN/ClustalW, PHYLIP and MSA formats to NEXUS format
  – Builds the NEXUS file for MrBayes
  – Creates the JDL file
  – Job submission
  – Nested workflow checks Grid job execution
  – Gets the output from the SE
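The "builds NEXUS file for MrBayes" step can be sketched as follows. This is a minimal illustration, not the PhyloGrid workflow's actual code. It assumes the alignment is already available as a mapping from taxon name to aligned DNA sequence, and it appends a `mrbayes` block with commonly used commands (`lset`, `mcmc`, `sump`, `sumt`). The default parameter values, the function name and the toy sequences are assumptions.

```python
def build_nexus(alignment, ngen=100000, samplefreq=100, nchains=4):
    """Return NEXUS text (data block + mrbayes block) for an aligned DNA set.

    `alignment` maps taxon names (no whitespace assumed) to aligned sequences.
    """
    if not alignment:
        raise ValueError("empty alignment")
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    nchar = lengths.pop()
    width = max(len(name) for name in alignment)
    rows = [f"    {name.ljust(width)}  {seq.upper()}"
            for name, seq in alignment.items()]
    parts = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(alignment)} nchar={nchar};",
        "  format datatype=dna missing=? gap=-;",
        "  matrix",
        *rows,
        "  ;",
        "end;",
        "begin mrbayes;",
        "  set autoclose=yes nowarn=yes;",   # run non-interactively
        "  lset nst=6 rates=invgamma;",      # assumed model: GTR + I + gamma
        f"  mcmc ngen={ngen} samplefreq={samplefreq} nchains={nchains};",
        "  sump;",
        "  sumt;",
        "end;",
    ]
    return "\n".join(parts) + "\n"


if __name__ == "__main__":
    # Toy sequences for illustration only; not real HPV data.
    print(build_nexus({"taxonA": "ACGT-A", "taxonB": "ACGTTA"}, ngen=1000))
```

In a workflow, the portal parameters (number of generations, sampling frequency, model) would feed the arguments of such a function, and the resulting file would travel to the Grid with the job.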
GridSphere portal
PhyloGrid web portal built on top of the GridSphere portal framework
  – A Grid portal improves the usability of Grids
      Hiding the complexity of the technology involved
  – A Grid portal improves the utilization of Grids
      Providing an appealing, user-friendly Web interface
      Enforcing Grid utilization policies
      PKI security, etc.
Cohesive Grid portals
[Figure: snapshot of the virtual work area of the PhyloGrid portal with some results]
Future work: GridWay
The JDL job approach
  – Hard to handle job errors within the Taverna workflow
  – A gLite plugin for Taverna is under development
      Taverna must be installed in a UI, or
      remote execution to a UI must be used (Taverna remote workflow enactor)
GridWay metascheduler
  – Characteristics
      Fully compatible with gLite-based Grids (EELA-2, EGEE)
      Better resource selection based on internal statistics
      Automatic migration and rescheduling of failed jobs
      Checkpointing management for long-running tasks
  – Taverna binding implementation: WS GRAM interface deployed over GridWay
      By means of GT4 plugins or by directly implementing a JSDL plugin
HPV case study with PhyloGrid
HPV is a recognized underlying factor in cervical cancer:
  – 90% of cases show infection by some HPV strain
Complete HPV nucleotide sequences are about 8000 bases long:
  – E1, E2, E4-E7 early expression genes and L1, L2 late expression genes
  – HPV classification according to L1 variability (> 100 types)
  – Two different categories with respect to oncogenic potential
Study: check whether this categorization really fits the evolutionary history of HPV
  – 121 HPV sequences
  – Molecular phylogenetic calculations for the L1, L2 and E7 genes
Results obtained with PhyloGrid
Molecular phylogeny of HPV in oncogenes from L1, L2 and E7
121 HPV nucleotide sequences of L1 (the major capsid gene)
[Figure: phylogenetic tree for L1. Broader lines mark differences between this tree and the tree derived from the L2 gene]
Topology similarity score of 85% between L1 and L2
Conflict with the HPV classification based on variability of the L1 gene
Results obtained with PhyloGrid (II)
121 HPV nucleotide sequences of the L1 late expression gene
[Figure: phylogenetic tree for L1. Broader lines mark differences between this tree and the tree derived from the E7 gene]
Results obtained with PhyloGrid (III)
121 HPV nucleotide sequences of the L2 late expression gene
[Figure: phylogenetic tree for L2. Broader lines mark differences between this tree and the tree derived from the E7 gene]
Summary and conclusions
PhyloGrid is a tool for phylogenetic studies on the Grid by means of MPI-enabled MrBayes:
  – Friendly interface (GridSphere portal): no computational or Grid skills required to perform calculations
  – Automation of tasks: Taverna workflow
PhyloGrid takes advantage of the computational power of current Grid infrastructures
  – Allowing phylogenetic analysis on a large scale
  – Reducing the technological divide that part of the scientific community faces in accessing computational platforms such as the Grid
Thanks for your attention. Questions?
Contact
R. Isea 1: raul.isea at gmail.com
E. Montes 2: esther.montes at ciemat.es
A. J. Rubio-Montero 2: antonio.rubio at ciemat.es
R. Mayo 2: rafael.mayo at ciemat.es
1 Fundación IDEA (Venezuela)
2 CIEMAT (Spain)