GADU: A System for High-throughput Analysis of Genomes using Heterogeneous Grid Resources. Mathematics and Computer Science Division, Argonne National Laboratory.

Presentation transcript:

GADU: A System for High-throughput Analysis of Genomes using Heterogeneous Grid Resources. Mathematics and Computer Science Division, Argonne National Laboratory. Presented by: Dinanath Sulakhe. Members: Natalia Maltsev, Dinanath Sulakhe, Alex Rodriguez (Bioinformatics group); Mike Wilde, Nika Nefedova, Jens Voeckler, Ian Foster (Globus group).

Recent progress in genomics and experimental biology has brought exponential growth of the biological information available for computational analysis in public genomics databases. Thousands of complete genomes will be available by 2010, along with large amounts of experimental, biochemical, and phenotypic data. [Figure: growth in the number of sequencing projects.]
Analysis of these large amounts of data (genomic, biochemical, functional, phenotypic, etc.) requires:
  Mature data integration and distributed technologies
  Scalable computational resources (Grids, supercomputing)
  New tools and algorithms for pattern recognition and comparison of biosystems at various levels of organization

GADU architecture.

GADU performs:
  Acquisition: acquire genome data from a variety of publicly available databases and store it temporarily on the file system.
  Analysis: run different publicly available tools and in-house tools on the Grid, using the acquired data and data from the Integrated Database.
  Storage: store the parsed data acquired from public databases and the parsed results of the tools and workflows used during analysis.

Public Databases: genomic databases available on the web, e.g. NCBI, PIR, KEGG, EMP, InterPro, etc.

Integrated Database: includes parsed sequence and annotation data from public web sources and the results of the different analysis tools (e.g. BLAST, Blocks, TMHMM). Data flows bidirectionally between GADU and the Integrated Database.

GADU using the Grid: applications are executed as workflows on the Grid (TeraGrid, OSG, DOE Science Grid), and the results are stored in the Integrated Database. A few tools in GADU use the parsed data for further analysis.

Applications (web interfaces) based on the Integrated Database:
  PUMA2: evolutionary analysis of metabolism
  Chisel: protein function analysis tool
  TARGET: targets for structural analysis of proteins
  PATHOS: pathogenic DB for bio-defense research
  PhyloBlocks: evolutionary analysis of protein families

GNARE (Genome Analysis Research Environment) also provides services to other groups: SEED (data acquisition), the Shewanella Consortium (genome analysis), and others.
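To make the acquisition / analysis / storage cycle above concrete, here is a deliberately simplified Python sketch of the data flow; all function names, paths, and result handling are hypothetical and only stand in for what GADU actually does through its Grid workflows and the Integrated Database.

# Hypothetical sketch of GADU's acquire -> analyze -> store cycle.
PUBLIC_SOURCES = ["NCBI", "PIR", "KEGG", "EMP", "InterPro"]
ANALYSIS_TOOLS = ["BLAST", "Blocks", "TMHMM"]

def acquire(sources):
    # Acquisition: pull genome data from public databases onto the local file system.
    return {src: f"/scratch/gadu/{src}.fasta" for src in sources}   # assumed paths

def analyze(data_files, tools):
    # Analysis: each (tool, dataset) pair becomes a Grid workflow; here we only
    # record the result location such a workflow would produce.
    return {(tool, src): path + "." + tool.lower() + ".out"
            for tool in tools for src, path in data_files.items()}

def store(results):
    # Storage: parse each result file and load it into the Integrated Database.
    for (tool, src), path in results.items():
        print(f"load {path} into integrated DB ({tool} results for {src})")

store(analyze(acquire(PUBLIC_SOURCES), ANALYSIS_TOOLS))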

Parallelization of the tools on the Grid. [Fig.: example of a DAG representing the workflow, in which an input set of about a million sequences is broken into chunks of 1000 sequences that are processed in parallel.]
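As an illustration of the splitting step shown in the figure, here is a minimal Python sketch that breaks a large FASTA file into 1000-sequence chunks so that each chunk can run as an independent BLAST job; the file and chunk names are assumptions, not GADU's actual conventions.

# Minimal sketch: split a FASTA file into 1000-sequence chunks, one per Grid job.
def split_fasta(path, chunk_size=1000, out_prefix="chunk"):
    chunk, count, part, outputs = [], 0, 0, []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">") and count == chunk_size:
                outputs.append(write_chunk(chunk, out_prefix, part))
                chunk, count, part = [], 0, part + 1
            if line.startswith(">"):
                count += 1
            chunk.append(line)
    if chunk:
        outputs.append(write_chunk(chunk, out_prefix, part))
    return outputs

def write_chunk(lines, prefix, part):
    name = f"{prefix}_{part:04d}.fasta"   # e.g. chunk_0000.fasta (illustrative naming)
    with open(name, "w") as out:
        out.writelines(lines)
    return name

# e.g. chunk_files = split_fasta("nr.fasta")   # path is illustrative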

Workflow Generator. The Workflow Generator is responsible for producing a workflow suitable for execution in the Grid environment. This is accomplished through the use of the "virtual data language" (VDL). Once the VDL for the workflow is written, VDS converts it into Condor submit files and a DAG that can be submitted at the site selected by the site selector.

Toolchain (from the slide diagram): the textual VDL (VDLt) is converted to its XML form (VDLx) with vdlt2vdlx (and back with vdlx2vdlt); the VDLx is inserted/updated into the VDDB; gendax then generates the DAX.

VDL for the BLAST workflow (excerpt):

TR FileBreaker( input filename, none nodes, output sequences[], none species ) {
  argument = ${species};
  argument = ${filename};
  argument = ${nodes};
  profile globus.maxwalltime = "300";
}

TR BLAST( none OutPre, none evalue, input query[], none type ) {
  argument = ${OutPre};
  argument = ${evalue};
  profile globus.maxwalltime = "300";
}

DV jobNo_1_1separator->FileBreaker( ..., species="Aeropyrum_Pernix" )
....
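The structure that this VDL expands into can be pictured with a small, hypothetical Python sketch: one FileBreaker job whose output chunks fan out into independent BLAST jobs. This is only an illustration of the workflow's shape; the job and file names are assumptions, and the real DAG/DAX is produced by VDS, not by code like this.

# Illustrative sketch of the fan-out workflow described by the VDL above:
# a single FileBreaker job followed by one BLAST job per sequence chunk.
def build_blast_workflow(genome_file, n_chunks):
    jobs = {"FileBreaker": {"input": genome_file, "children": []}}
    for i in range(n_chunks):
        name = f"BLAST_{i:04d}"
        jobs[name] = {"input": f"chunk_{i:04d}.fasta", "children": []}   # assumed chunk names
        jobs["FileBreaker"]["children"].append(name)
    return jobs   # the DAG as an adjacency structure; VDS emits the real thing as a DAX

workflow = build_blast_workflow("Aeropyrum_Pernix.fasta", n_chunks=4)   # file name is illustrative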

Simultaneous use of heterogeneous Grids (OSG, TeraGrid, DOE Science Grid). VDS makes it easier to add new Grids to the pool of grid sites. For each site you need:
  the gatekeeper hostname,
  the batch job manager,
  the remote APP and DATA directory paths,
  the Globus path on the gatekeeper, and
  the GridFTP path.
A GridCat service providing the above information makes it easier to use a site for submitting jobs; a hypothetical site record with these fields is sketched below.
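A minimal sketch of what such a per-site record might look like, written here as a plain Python dictionary rather than in VDS's actual site/pool catalog format; every value below is hypothetical.

# Hypothetical per-site record holding the information listed above.
site = {
    "name": "EXAMPLE_SITE",
    "gatekeeper": "gatekeeper.example.edu/jobmanager-pbs",  # gatekeeper hostname + batch job manager
    "app_dir": "/grid/app/gadu",        # remote APP directory path
    "data_dir": "/grid/data/gadu",      # remote DATA directory path
    "globus_location": "/opt/globus",   # Globus path on the gatekeeper
    "gridftp": "gsiftp://gridftp.example.edu/grid/data/gadu",  # GridFTP path
}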

Implementation of the Site Selector (in development). One challenge in using the Grid reliably for high-throughput analysis is monitoring the state of all Grid sites and how well they have performed for job requests from a given submit host; site selection is a big challenge. We view a site as "available" if our submit host can communicate with it, if it is responding to Globus job-submission commands, and if it will run our jobs promptly, with minimal queuing delays.
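The availability test above can be outlined as a simple probe loop. The sketch below is illustrative Python; can_contact, globus_responds, and recent_queue_delay are hypothetical stand-ins for whatever probes the real site selector uses (network checks, Globus test submissions, per-site job history).

import random

def can_contact(site): return True                          # stub: submit host can reach the site
def globus_responds(site): return True                      # stub: Globus job submission answers
def recent_queue_delay(site): return random.uniform(0, 60)  # stub: recent queuing delay in minutes

MAX_QUEUE_DELAY_MIN = 30   # assumed threshold for "runs our jobs promptly"

def site_is_available(site):
    return (can_contact(site)
            and globus_responds(site)
            and recent_queue_delay(site) < MAX_QUEUE_DELAY_MIN)

def select_site(sites):
    available = [s for s in sites if site_is_available(s)]
    # rank the available sites by how quickly they have run our jobs recently
    return min(available, key=recent_queue_delay) if available else None

best = select_site(["site_a.example.edu", "site_b.example.edu"])   # hypothetical sites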

GADU is fast! (Statistics of running BLAST against the non-redundant protein database.)

On one CPU (wall time): 100 sequences took 66 minutes, so one genome (~4000 sequences) would take 2,640 minutes, or 44 hours. Running the full sequence set this way would take on the order of 1385 days on a single CPU.

On the Grid (using OSG and TeraGrid): the same set of sequences took 7,800 minutes (130 hours, or 5 days 10 hours), so one genome (approximately 4000 sequences) takes about 10 minutes on the Grid.

The number of CPUs used on the Grid at any given time varies with the availability of CPUs and the maximum load the submit host can handle. The maximum used at any one time was 500 CPUs, and the average number running at any given time was about 350 to 400.

Tools run on the Grids regularly: BLAST and Blocks (the non-redundant protein database currently has 3.1 million sequences; frequency: every 2 months). Apart from the regular BLAST and Blocks runs for NR, user-submitted genomes from GNARE are analyzed on the Grid.
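As a quick back-of-the-envelope check of the figures above (a sketch only; Grid throughput depends on how many CPUs are actually available at the time):

# Consistency check of the BLAST timing figures quoted above.
min_per_seq_one_cpu = 66 / 100                  # 100 sequences in 66 minutes on one CPU
one_cpu_genome_min = min_per_seq_one_cpu * 4000
print(one_cpu_genome_min, one_cpu_genome_min / 60)    # 2640.0 minutes, 44.0 hours

grid_genome_min = 10                            # observed: ~10 minutes per genome on the Grid
print(one_cpu_genome_min / grid_genome_min)           # ~264x effective speedup, in line with
                                                      # the 350-400 CPUs running on average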

Problems currently faced by the GADU VO: Very few sites in OSG authenticate GADU VO certificates; on average only about 6 to 8 sites work (at least their free cycles can be used). Job failures such as out-of-memory kills go unnoticed. Site selection is still a problem.

Applications using GADU. In-house applications: PUMA2, TARGET, PATHOS, Chisel, Sentra, PhyloBlocks, SVMMER. We are also building a set of Web services to make GADU available for public use and for other collaborative projects.

Acknowledgements. Bioinformatics Group: Natalia Maltsev (PI), Mark D'Souza, Elizabeth Glass, John Peterson, Mustafa Syed. VDS and others: Mike Wilde, Nika Nefedova, Jens Voeckler, Ian Foster, Rick Stevens. Also thanks to VDT support, Condor support, and Systems at MCS.