GADU: A System for High-throughput Analysis of Genomes Using Heterogeneous Grid Resources
Mathematics and Computer Science Division, Argonne National Laboratory
Presented by: Dinanath Sulakhe
Members: Natalia Maltsev, Dinanath Sulakhe, Alex Rodriguez (Bioinformatics Group); Mike Wilde, Nika Nefedova, Jens Voeckler, Ian Foster (Globus Group)
Recent progress in genomics and experimental biology has brought exponential growth of the biological information available for computational analysis in public genomics databases. Thousands of complete genomes will be available by 2010, along with large amounts of experimental, biochemical, and phenotypic data.
Analysis of large amounts of data (genomic, biochemical, functional, phenotypic, etc.) requires:
- Mature data-integration and distributed technologies
- Scalable computational resources (Grids, supercomputing)
- New tools and algorithms for pattern recognition and comparison of biosystems at various levels of organization
[Fig.: growth in the number of sequencing projects]
GADU performs:
- Acquisition: acquire genome data from a variety of publicly available databases and store it temporarily on the file system.
- Analysis: run publicly available and in-house tools on the Grid, using the acquired data and data from the integrated database.
- Storage: store the parsed data acquired from public databases, together with the parsed results of the tools and workflows used during analysis.
(A conceptual sketch of this acquire/analyze/store cycle follows below.)

Public databases: genomic databases available on the web, e.g. NCBI, PIR, KEGG, EMP, InterPro.

Integrated database includes:
- Parsed sequence and annotation data from public web sources.
- Results of the different analysis tools, e.g. BLAST, BLOCKS, TMHMM.
Data flow between GADU and the integrated database is bidirectional: applications are executed on the Grid (TeraGrid, OSG, DOE Science Grid) as workflows and the results are stored in the integrated database, while a few GADU tools use the parsed data for further analysis.

Applications (web interfaces) based on the integrated database:
- PUMA2: evolutionary analysis of metabolism
- Chisel: protein function analysis tool
- TARGET: targets for structural analysis of proteins
- PATHOS: pathogenic DB for bio-defense research
- PhyloBlocks: evolutionary analysis of protein families

GNARE – Genome Analysis Research Environment. Services to other groups: SEED (data acquisition), Shewanella Consortium (genome analysis), and others.
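As a rough illustration of the cycle above, the sketch below walks through the three stages in Python. It is a minimal conceptual sketch, not GADU's actual code: the function names, the source list, and the paths are hypothetical.

    import os
    import urllib.request

    def acquire(sources: dict, staging_dir: str) -> list:
        """Acquisition: fetch genome data from public databases and stage it
        temporarily on the local file system."""
        os.makedirs(staging_dir, exist_ok=True)
        staged = []
        for name, url in sources.items():
            path = os.path.join(staging_dir, name + ".fasta")
            urllib.request.urlretrieve(url, path)   # e.g. data from NCBI, PIR, KEGG, ...
            staged.append(path)
        return staged

    def analyze(staged: list, tools: list) -> list:
        """Analysis: run each tool over the staged data on the Grid.
        Here we only list the (tool, input) jobs that would be submitted
        as Grid workflows."""
        return [(tool, path) for path in staged for tool in tools]

    def store(results: list) -> None:
        """Storage: parse each tool's output and load it into the
        integrated database (shown here as a placeholder print)."""
        for tool, path in results:
            print(f"would parse {tool} output for {path} and load it into the DB")

    # Example usage (hypothetical source URL):
    #   staged = acquire({"ecoli": "https://example.org/ecoli.faa"}, "/tmp/gadu")
    #   store(analyze(staged, ["blast", "blocks", "tmhmm"]))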
Parallelization of the tools on the Grid
[Fig.: example of a DAG representing the workflow. An input of millions of sequences is split into chunks of ~1000 sequences, and each chunk is processed as an independent Grid job.]
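Below is a minimal Python sketch of the chunking step shown in the figure: a large FASTA file is split into pieces of about 1000 sequences so that each piece can run as a separate Grid job. This is an illustrative stand-in for the FileBreaker transformation used in the actual workflow, not GADU's own code.

    def split_fasta(in_path: str, out_prefix: str, chunk_size: int = 1000) -> list:
        """Write chunk files of at most chunk_size sequences each; return their paths."""
        chunk_paths = []
        current_lines = []
        seq_count = 0
        part = 0

        def flush():
            nonlocal current_lines, seq_count, part
            if not current_lines:
                return
            path = f"{out_prefix}.{part:05d}.fasta"
            with open(path, "w") as out:
                out.writelines(current_lines)
            chunk_paths.append(path)
            current_lines = []
            seq_count = 0
            part += 1

        with open(in_path) as fasta:
            for line in fasta:
                if line.startswith(">"):         # a header starts a new sequence
                    if seq_count == chunk_size:  # current chunk is full
                        flush()
                    seq_count += 1
                current_lines.append(line)
        flush()                                  # write the final partial chunk
        return chunk_paths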
Workflow Generator
The Workflow Generator is responsible for producing a workflow suitable for execution in the Grid environment. This is accomplished with the Virtual Data Language (VDL). Once the VDL for the workflow is written, VDS converts it into Condor submit files and a DAG that can be submitted at the site selected by the site selector.
VDS toolchain: VDLt is converted to VDLx (and back) with vdlt2vdlx / vdlx2vdlt; the VDLx definitions are inserted into or updated in the VDDB; gendax then generates the DAX from the catalog.
VDL for the BLAST workflow (excerpt):

    TR FileBreaker( input filename, none nodes, output sequences[], none species ) {
        argument = ${species};
        argument = ${filename};
        argument = ${nodes};
        profile globus.maxwalltime = "300";
    }

    TR BLAST( none OutPre, none evalue, input query[], none type ) {
        argument = ${OutPre};
        argument = ${evalue};
        profile globus.maxwalltime = "300";
    }

    DV jobNo_1_1separator->FileBreaker( ... species="Aeropyrum_Pernix" )
    ....
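VDS, not the user, produces the concrete DAG, but the shape of the result is easy to picture. The sketch below hand-writes a Condor DAGMan file with the same fan-out structure (split, then N parallel BLAST jobs, then a merge). The job names and submit-file names are hypothetical and are shown only to illustrate the structure of the workflow that ends up being submitted.

    # Hypothetical illustration of the DAG structure (split -> N BLAST jobs -> merge)
    # that VDS generates from the VDL; not GADU's actual generated files.
    def write_blast_dag(dag_path: str, n_chunks: int) -> None:
        with open(dag_path, "w") as dag:
            dag.write("JOB split filebreaker.submit\n")
            for i in range(n_chunks):
                dag.write(f"JOB blast_{i} blast_{i}.submit\n")
            dag.write("JOB merge merge.submit\n")
            for i in range(n_chunks):
                dag.write(f"PARENT split CHILD blast_{i}\n")
                dag.write(f"PARENT blast_{i} CHILD merge\n")

    # Usage: write_blast_dag("blast_workflow.dag", n_chunks=4)
    # then submit with: condor_submit_dag blast_workflow.dag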
Simultaneous Use of Heterogeneous Grids (OSG, TeraGrid, DOE Science Grid)
VDS makes it easier to add new Grids to the pool of Grid sites. For each site you need:
- Gatekeeper hostname
- Batch job manager
- Remote APP and DATA directory paths
- Globus path on the gatekeeper
- GridFTP path
A GridCat service that publishes the above information makes it easier to use a site for submitting jobs. (A sketch of such a site record follows below.)
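As a concrete illustration, the per-site information listed above can be collected into a single record, which is roughly what a GridCat-style service returns for each site. The field names and example values below are hypothetical, not an actual OSG site entry.

    from dataclasses import dataclass

    @dataclass
    class GridSite:
        name: str
        gatekeeper: str       # gatekeeper hostname
        jobmanager: str       # batch job manager, e.g. "jobmanager-pbs"
        app_dir: str          # remote APP directory path
        data_dir: str         # remote DATA directory path
        globus_location: str  # Globus path on the gatekeeper
        gridftp: str          # GridFTP endpoint

    example_site = GridSite(
        name="example-osg-site",
        gatekeeper="gk.example.edu",
        jobmanager="jobmanager-condor",
        app_dir="/osg/app/gadu",
        data_dir="/osg/data/gadu",
        globus_location="/opt/globus",
        gridftp="gsiftp://gk.example.edu",
    )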
Implementation of the Site Selector (in development)
One challenge in using the Grid reliably for high-throughput analysis is monitoring the state of all Grid sites and how well they have performed for job requests from a given submit host; site selection remains a big challenge. We view a site as "available" if our submit host can communicate with it, if it is responding to Globus job-submission commands, and if it will run our jobs promptly, with minimal queuing delays.
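A minimal sketch of that availability test is shown below, assuming the pre-WS GRAM client tools (globus-job-run) are installed on the submit host. The timeout threshold and the first-responsive-site policy are illustrative simplifications; a fuller selector would also rank sites by recent success rate and observed queuing delay.

    import subprocess
    import time

    def site_available(gatekeeper: str, max_seconds: float = 300.0) -> bool:
        """A site counts as available if a trivial Globus job returns quickly."""
        start = time.time()
        try:
            proc = subprocess.run(
                ["globus-job-run", gatekeeper, "/bin/true"],
                timeout=max_seconds,
                capture_output=True,
            )
        except (subprocess.TimeoutExpired, OSError):
            return False  # unreachable, hung, or client tools missing
        return proc.returncode == 0 and (time.time() - start) < max_seconds

    def pick_site(gatekeepers: list):
        """Return the first responsive site, or None if no site is available."""
        for gk in gatekeepers:
            if site_available(gk):
                return gk
        return None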
GADU is Fast! (Statistics from running BLAST)
BLAST database: the non-redundant (NR) protein database, currently about 3.1 million sequences.
On one CPU (walltime):
- 100 sequences took 66 minutes.
- One genome (~4000 sequences) would take 2640 minutes, or 44 hours.
- The full NR set would take roughly 2 million minutes (about 33,000 hours, or 1385 days).
On the Grid (using OSG and TeraGrid):
- The full NR set took 7800 minutes (130 hours, or 5 days 10 hours).
- So one genome (~4000 sequences) takes about 10 minutes on the Grid.
The number of CPUs used on the Grid at any given time varies with CPU availability and the maximum load the submit host can handle. The maximum number of CPUs used at any one time was 500; on average, about 350 to 400 CPUs were running.
Tools run on the Grids regularly: BLAST and BLOCKS against the NR database, every 2 months. In addition, user-submitted genomes from GNARE are analyzed on the Grid.
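The speedup implied by these figures can be read off directly; the short calculation below uses only the numbers quoted on this slide.

    # Speedup implied by the quoted figures (illustrative arithmetic only).
    one_cpu_minutes = 1385 * 24 * 60   # ~1385 days of single-CPU walltime
    grid_minutes = 7800                # ~5 days 10 hours on OSG + TeraGrid

    print(f"single CPU : {one_cpu_minutes:,} min")
    print(f"on the Grid: {grid_minutes:,} min")
    print(f"speedup    : {one_cpu_minutes / grid_minutes:.0f}x")   # roughly 250x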
Problems Currently Faced by the GADU VO
- Very few sites in OSG authenticate GADU VO certificates; on average only about 6 to 8 sites work. (At least free cycles can be used.)
- Job failures such as out-of-memory kills go unnoticed.
- Site selection is still a problem.
Applications Using GADU
In-house applications: PUMA2, TARGET, PATHOS, Chisel, Sentra, PhyloBlocks, SVMMER.
We are also building a set of Web services to make GADU available for public use and for other collaborative projects.
Acknowledgements
Bioinformatics Group: Natalia Maltsev (PI), Mark D'Souza, Elizabeth Glass, John Peterson, Mustafa Syed.
VDS and others: Mike Wilde, Nika Nefedova, Jens Voeckler, Ian Foster, Rick Stevens.
VDT support, Condor support, and Systems at MCS.