Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.


Topic summary
Global Process Topics:
– Introduction
– Cases of Study
– Deployment
– Automation System
– Results and Performances Analysis
– Future Plans

Introduction
– What is a Metagenome?
– What is a Metagenome Analysis?
– Why is the Grid a Good Solution?
– What is the Proposed Structure?
– What are the Future Plans?

Introduction: What is a Metagenome?
Term first used by Jo Handelsman and others at the University of Wisconsin.
A metagenome is a collection of genes.
It can be studied as a single gene.
Analysis can be done without isolating the genes and cultivating them in the lab.

Introduction: What is a Metagenome Analysis?
A Metagenome Analysis is the set of steps needed to transform a file containing an encoded metagenome into another file containing information of interest.
This can include:
– Database filtering.
– BLAST alignments.
– BLAST output filtering.
– Creation of Phylogenetic Trees.

Introduction: Why is the Grid a Good Solution?
A Metagenome can be encoded as several hundred thousand sequences.
Sequential processing can take more than a year.
Public databases are continuously changing.
Several coarse-grained steps can be done in parallel.
In a Grid, the global job can be divided into subjobs.
A Metagenome Analysis can be processed in a few days with a Grid Infrastructure.

Cases of Study: Farm Soil Metagenome
This is a sample from a nutrient-rich, moderately contaminated soil environment.
This community is very diverse and complex.
Many as yet unknown enzymes are probably present there.
Its study is very interesting from a biotechnological point of view.

Cases of Study: Whale Fall Metagenome
Whale carcasses are known to be a nutrient-rich environment at the bottom of the ocean.
A heterogeneous mixture of bacteria flourishes there.
It is one of the best examples of marine bacterial communities.

Cases of Study: Sargasso Sea Metagenome
These oceanic samples are taken from surface waters.
They represent the diversity of bacteria that live planktonically.

Cases of Study: Gut Metagenomes
Several metagenomes of the human intestinal microbiota.
This consortium of bacteria helps its host to metabolize many nutrients that would otherwise be indigestible.
It is also involved in other functions:
– Maturation and modulation of the immune response of the host.
– Prevention of infection by bacterial pathogens.

Deployment: Sequential or Parallel jobs? (I)
There are around 150 CEs in the BIOMED and EELA VOs.
Only around 30 of these CEs are able to run MPICH jobs.
The number of usable CEs decreases as the number of required nodes increases.
Full efficiency in MPICH jobs is achieved only occasionally.

Deployment: Sequential or Parallel jobs? (II) (figure only)

Deployment: Selecting CEs (I)
Several jobs are needed
– A single job can take more than a year.
– The Analysis has to be split into several subjobs (often more than 100).
Several CEs are needed
– To decentralize processing, storage and network bandwidth.
A Metagenome Analysis job has requirements
– On software, hardware and configuration.

Deployment: Selecting CEs (II)
Not all available CEs are able to produce results.
Not all available CEs have the same performance.
CEs have to be selected, and jobs distributed among them, according to their performance.
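One way to distribute jobs according to CE performance is sketched below. It is only an illustration: the throughput figures, the CE names and the function `allocate_subjobs` are hypothetical, and proportional sharing with largest-remainder rounding is an assumed policy, not necessarily the one used by the authors.

```python
from typing import Dict


def allocate_subjobs(throughput: Dict[str, float], n_subjobs: int) -> Dict[str, int]:
    """Share n_subjobs among CEs in proportion to measured throughput.

    CEs with zero throughput (no usable results in test runs) get nothing.
    Largest-remainder rounding makes the shares sum to n_subjobs.
    """
    usable = {ce: t for ce, t in throughput.items() if t > 0}
    if not usable:
        raise ValueError("no CE produced results in the test runs")
    total = sum(usable.values())
    exact = {ce: n_subjobs * t / total for ce, t in usable.items()}
    shares = {ce: int(x) for ce, x in exact.items()}
    # Hand the subjobs lost to rounding down to the CEs with the largest remainders.
    leftover = n_subjobs - sum(shares.values())
    by_remainder = sorted(usable, key=lambda ce: exact[ce] - shares[ce], reverse=True)
    for ce in by_remainder[:leftover]:
        shares[ce] += 1
    return shares


# Hypothetical measurements (sequences per hour) from earlier test jobs.
measured = {"ce-a": 300.0, "ce-b": 100.0, "ce-c": 0.0}
plan = allocate_subjobs(measured, n_subjobs=100)
```

With these made-up numbers, the CE that returned no results is excluded and the remaining two receive subjobs in a 3:1 ratio.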

Deployment: Selecting CEs (III) (figure only)

Deployment: Selecting SEs and Replicating Files
All jobs need certain common files.
These files have to be replicated to increase performance and to distribute network bandwidth.
SEs are selected according to their geographical and administrative proximity to the selected CEs, their performance and their configuration.

Deployment: Splitting the global job
The global job has to be broken down into subjobs.
This shortens the lifetime of each subjob, which will:
– Increase interactivity.
– Improve monitoring capabilities.
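The splitting step can be sketched as follows, assuming the metagenome is stored as a FASTA file and that each subjob receives a fixed number of sequences. The function name `split_fasta` and the chunk size are illustrative assumptions, not details from the presentation.

```python
from typing import Iterable, Iterator, List


def split_fasta(lines: Iterable[str], seqs_per_subjob: int) -> Iterator[List[str]]:
    """Yield lists of FASTA lines, each holding at most seqs_per_subjob records."""
    if seqs_per_subjob < 1:
        raise ValueError("seqs_per_subjob must be at least 1")
    chunk: List[str] = []
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header starts a new record; close the chunk if it is already full.
            if count == seqs_per_subjob:
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk


# Tiny made-up input: three records, the second spanning two sequence lines.
example = [">s1", "ACGT", ">s2", "GGCC", "TTAA", ">s3", "ACAC"]
subjobs = list(split_fasta(example, seqs_per_subjob=2))
```

A smaller chunk size gives shorter subjobs at the cost of more submissions to track.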

Automation System: Submitting Jobs
Subjobs are assigned to a list of CEs.
These CEs have been tested beforehand.
Assignment is based on the performance obtained in previous experiments.

Automation System: Monitoring Jobs
The status of each job is checked periodically.
If an error occurs (aborted job, bad results, etc.), the job is automatically resubmitted.
If the job has been running for too long, it is cancelled and resubmitted.
If the job has finished successfully, its CE is recorded for later submissions.
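The monitoring rules above can be read as a small decision function. This is a sketch under assumptions: the status strings (`"aborted"`, `"running"`, `"done"`), the `max_hours` limit and the `Action` names are hypothetical and do not correspond to the actual job states of any particular middleware.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"
    RESUBMIT = "resubmit"
    CANCEL_AND_RESUBMIT = "cancel_and_resubmit"
    RECORD_CE = "record_ce"


def decide(status: str, hours_running: float, max_hours: float,
           results_ok: bool = True) -> Action:
    """Map one polled job state to the action described on the slide."""
    if status == "aborted":
        return Action.RESUBMIT
    if status == "done":
        # A finished job with bad results counts as an error.
        return Action.RECORD_CE if results_ok else Action.RESUBMIT
    if status == "running" and hours_running > max_hours:
        return Action.CANCEL_AND_RESUBMIT
    return Action.WAIT
```

A polling loop would call `decide` for every monitored job at each period and act on the returned value.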

Automation System: Resubmitting Jobs
The CE of each correctly finished job is recorded in a list.
Jobs are resubmitted to a random CE from this list.
If the list does not exist yet, the job is submitted to a random CE.
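A minimal sketch of this resubmission policy follows. The function name `pick_ce` and the CE names are hypothetical; the random generator is passed in as a parameter only so that the choice can be made reproducible.

```python
import random
from typing import Optional, Sequence


def pick_ce(good_ces: Optional[Sequence[str]], all_ces: Sequence[str],
            rng: random.Random) -> str:
    """Pick a random CE from the list of proven CEs, or from all CEs if none exist."""
    if good_ces:
        return rng.choice(list(good_ces))
    # No job has finished correctly yet, so fall back to any CE.
    return rng.choice(list(all_ces))


rng = random.Random(0)
target = pick_ce(["ce-a", "ce-b"], ["ce-a", "ce-b", "ce-c"], rng)
fallback = pick_ce(None, ["ce-a", "ce-b", "ce-c"], rng)
```

Restricting resubmissions to CEs that have already completed a job avoids repeatedly hitting sites that cannot produce results.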

Automation System: Retrieving Results
Once results are available, they are downloaded and the standard outputs are scanned for errors.
A retrieved job is no longer monitored.

Results and Performances: First conclusions
Jobs are too long to run sequentially
– Sargasso Sea Metagenome takes 512 days.
The same job on the Grid takes 13 days to be fully finished.
– Speedup is around 40.
High speed for most jobs (90% in 7 days)
– Speedup is around 80.
– No need to finish all jobs before beginning new stages.
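The first speedup figure follows directly from the two reported times. The short check below uses the usual definition of speedup (sequential time divided by parallel time); the function name is illustrative.

```python
def speedup(sequential_days: float, grid_days: float) -> float:
    """Speedup as the ratio of sequential time to Grid time."""
    return sequential_days / grid_days


# Reported times for the Sargasso Sea metagenome: 512 days sequential, 13 days on the Grid.
full_run = speedup(512, 13)
print(round(full_run, 1))
```

The result is a little under 40, which matches the rounded figure on the slide.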

Results and Performances: Percentage of correctly finished jobs (figure only)

Results and Performances: Sequences processed per hour (figure only)

Future plans
To create several shell scripts with different stages depending on the desired results.
To increase the number of case studies.
To improve the performance of the automation system.
To write a report on the issues found and lessons learnt in the EGEE and EELA infrastructures.

Contact
Gabriel Aparício i Pla
Ignacio Blanquer Espert
Vicente Hernández García
Universitat Politècnica de València
Camí de Vera, s/n
València, Spain