28 April, 2005, ISGC 2005, Taiwan
The Efficient Handling of BLAST Applications on the GRID
Hurng-Chun Lee (1) and Jakub Moscicki (2)
(1) Academia Sinica Computing Centre, Taiwan
(2) CERN IT-GD-ED, Switzerland

Outline
The considerations of distributing BLAST jobs
The master-worker computing model of BLAST
  – mpiBLAST
The Gridified BLAST
  – mpiBLAST-g2 vs. DIANE-BLAST
Summary

The considerations of distributing BLAST jobs
BLAST is widely and routinely used for sequence analysis
It is an essential component of most bioinformatics and life science applications
Problem complexity ~ O(S_q × S_d)
  – S_q: the query size
  – S_d: the database size
In most cases, S_d >> S_q
  – e.g. S_q ~ O(MB), S_d ~ O(GB)
  – The cost of moving the query is lower than the cost of moving the database
Database management, storage and sharing issues
  – Replication, archiving
  – Privacy, security
Other considerations for providing a service
  – Scalability, robustness
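
To put rough numbers on the point that moving the query is cheaper than moving the database, here is a minimal Python sketch. The sizes follow the orders of magnitude quoted on the slide (MB versus GB); the 100 Mbit/s link speed and the function name `transfer_seconds` are assumptions made only for illustration.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Time to push size_bytes over a link, ignoring latency and protocol overhead."""
    return size_bytes * 8 / link_bits_per_second


MB = 10**6
GB = 10**9
LINK = 100 * 10**6  # assumed 100 Mbit/s wide-area link (illustrative only)

query_time = transfer_seconds(1 * MB, LINK)     # S_q ~ O(MB)
database_time = transfer_seconds(1 * GB, LINK)  # S_d ~ O(GB)

print(f"query:    {query_time:.2f} s")
print(f"database: {database_time:.2f} s")
print(f"ratio:    {database_time / query_time:.0f}x")
```

Under these assumptions the transfer times differ by the same factor as the sizes, which is why shipping queries to pre-deployed databases is the natural choice.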

The master-worker model of BLAST
Database splitting is the easiest way to distribute BLAST jobs
Fragmenting the database avoids memory swapping
Each sub-task can be 100% independent
Each worker requests tasks from the master (pull model) and runs a normal BLAST search
The individual results can be easily merged by the master process
Report generation (BioSeq fetching)
A multi-query BLAST search can be easily split into multiple independent single-query BLAST searches by a trivial script
  – The master-worker model can also be applied within each single-query search
[Diagram labels: Database, Master, workers, DB fragments, task list, job requesting, result merging, formatdb, blast search, BioSeq fetching]
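
The pull model and the merge step can be sketched in a few lines of Python. This is a toy illustration, not mpiBLAST code: `search_fragment` is a hypothetical stand-in for a real BLAST search, its scoring rule is invented, and threads stand in for remote workers.

```python
import queue
import threading


def search_fragment(query, fragment):
    """Stand-in for a BLAST search of one DB fragment (hypothetical scoring).

    The score is the number of query characters that occur in the record.
    """
    return [(sum(ch in record for ch in query), record) for record in fragment]


def worker(tasks, results, query):
    """Pull fragments from the shared task list until it is empty."""
    while True:
        try:
            frag_id, fragment = tasks.get_nowait()
        except queue.Empty:
            return
        results.put((frag_id, search_fragment(query, fragment)))


def run_master(query, fragments, n_workers=3):
    """Fill the task list, start the workers, then merge their partial results."""
    tasks, results = queue.Queue(), queue.Queue()
    for item in enumerate(fragments):
        tasks.put(item)
    threads = [
        threading.Thread(target=worker, args=(tasks, results, query))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    hits = []
    while not results.empty():
        _, frag_hits = results.get()
        hits.extend(frag_hits)
    # Merge: best score first, ties broken by record text for a stable order.
    return sorted(hits, key=lambda hit: (-hit[0], hit[1]))


fragments = [["AAAA", "ACGG"], ["TTTT"], ["ACGT", "CCCC"]]
for score, record in run_master("ACGT", fragments):
    print(score, record)
```

Because each fragment is searched independently, the merged hit list does not depend on how many workers took part or on which worker handled which fragment.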

mpiBLAST
LANL, US
The MPI implementation of the BLAST master-worker model
Advantages
  – High throughput
  – Load balancing
Running in a local cluster
  – Performance and problem size are still limited by local computing power
  – Simultaneous I/O to a centralized database causes a performance bottleneck
  – Database sharing is still difficult

mpiBLAST-g2
ASCC, Taiwan and PRAGMA
A GT2-enabled parallel BLAST that runs on the Grid
  – GT2 GASSCOPY API
  – MPICH-g2
Enhancements to mpiBLAST by ASCC
Cross-cluster job execution
Remote database sharing
Helper tools for
  – database replication
  – automatic resource specification and job submission (with a static resource table)
  – multi-query job splitting and result merging
Close link with the mpiBLAST development team
  – New mpiBLAST patches can be quickly applied to mpiBLAST-g2
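
Multi-query job splitting amounts to cutting a multi-record FASTA file at each header line. The following Python sketch shows one way to do it; the function name `split_fasta` is hypothetical and this is not the helper tool shipped with mpiBLAST-g2.

```python
def split_fasta(text):
    """Split multi-record FASTA text into one string per record.

    A new record starts at every line beginning with '>'. Blank lines are dropped.
    """
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current) + "\n")
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return records


multi_query = ">q1 first query\nACGT\nGGCC\n>q2\nTTTT\n"
for i, record in enumerate(split_fasta(multi_query)):
    print(f"--- single-query job {i} ---")
    print(record, end="")
```

Each returned string can then be submitted as an independent single-query search, and the per-query reports concatenated in the original order.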

SC2004 mpiBLAST-g2 demonstration
[Figure; visible label: KISTI]

mpiBLAST-g2 current deployment
[Figure, from PRAGMA GOC]

mpiBLAST-g2 Performance Evaluation (perfect case)
Database: est_human ~ 3.5 GBytes
Queries: 441 test sequences ~ 300 KBytes
Overall speedup is approximately linear
[Plots: elapsed time and speedup; curves: Searching + Merging, BioSeq fetching, Overall]

mpiBLAST-g2 Performance Evaluation (worst case)
Database: drosophila NT ~ 122 MBytes
Queries: 441 test sequences ~ 300 KBytes
The overall speedup is limited by the unscalable BioSeq fetching
[Plots: elapsed time and speedup; curves: Searching + Merging, BioSeq fetching, Overall]
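
The contrast between the two cases can be illustrated with an Amdahl-style model in which searching and merging scale with the number of workers while BioSeq fetching does not. This is a simplified sketch; the timings below are hypothetical values chosen for illustration, not the measurements behind the plots.

```python
def overall_speedup(n_workers, t_search, t_fetch):
    """Speedup when searching scales with n_workers but fetching stays serial."""
    serial_total = t_search + t_fetch
    parallel_total = t_search / n_workers + t_fetch
    return serial_total / parallel_total


# Hypothetical single-worker timings in seconds (not measured values).
large_db = dict(t_search=990.0, t_fetch=10.0)  # search dominates
small_db = dict(t_search=50.0, t_fetch=50.0)   # fetching is half the run

for n in (1, 4, 16):
    print(
        n,
        round(overall_speedup(n, **large_db), 2),
        round(overall_speedup(n, **small_db), 2),
    )
```

In this model the speedup can never exceed `(t_search + t_fetch) / t_fetch`, so a small database, where fetching takes a large share of the run, flattens out early no matter how many workers are added.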

Issues of mpiBLAST-g2
A single error will crash the whole job
  – This is inherent to MPICH
  – The error might be due to a transient problem in the loosely coupled Grid environment
An MPI job will start only when all resources are available
  – Resources have different levels of availability
Error recovery is required for
  – providing a robust application service on the Grid
  – using Grid resources efficiently
Asynchronous task dispatching/pulling is needed to use available resources immediately
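
The cost of waiting for all resources can be seen in a toy model, sketched here in Python. It assumes equal-length tasks and ignores communication overhead; the worker arrival times, task count and function names are invented for illustration.

```python
import heapq
import math


def gang_start_finish(arrivals, n_tasks, task_time):
    """All workers start together, once the last one has become available."""
    start = max(arrivals)
    rounds = math.ceil(n_tasks / len(arrivals))
    return start + rounds * task_time


def pull_finish(arrivals, n_tasks, task_time):
    """Each worker starts pulling tasks as soon as it becomes available."""
    free_at = list(arrivals)
    heapq.heapify(free_at)
    finish = 0
    for _ in range(n_tasks):
        start = heapq.heappop(free_at)  # earliest available worker takes the task
        done = start + task_time
        finish = max(finish, done)
        heapq.heappush(free_at, done)
    return finish


arrivals = [0, 0, 30, 120]  # hypothetical times at which four workers become available
print("gang start:", gang_start_finish(arrivals, n_tasks=12, task_time=10))
print("pull model:", pull_finish(arrivals, n_tasks=12, task_time=10))
```

With these numbers the gang-started job finishes at time 150, while the pull model finishes at time 50, before the slowest resource has even become available.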

The DIANE (DIstributed ANalysis Environment)
Lightweight distributed framework for parallel scientific applications in the master-worker model
  – A perfect match for the mpiBLAST computing model
Current applications
  – BLAST for Genomic Sequence Analysis (DIANE-BLAST)
  – Geant4 Simulation for Radiotherapy and Astrophysics
  – Image Rendering
  – Data Analysis for High Energy Physics

DIANE Features
Rapid prototyping
  – Python and CORBA
Error recovery
  – Heartbeat worker health check
  – Resubmission of failed tasks
  – User-defined error recovery methods
No need for outbound connectivity
  – Proxy for workers with only private IPs
Job submitters for
  – Simple fork
  – Condor, LSF, SGE, PBS
  – GT2, LCG, gLite
Pull model
Batch and interactive
[Diagram labels: distributed workers, planner, integrator]
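
The interplay of heartbeat checks and task resubmission can be sketched as follows. This is not DIANE's API: the class `Master`, its method names and the timeout value are hypothetical, and a simulated clock (`now`) replaces real timers so the example is deterministic.

```python
from collections import deque


class Master:
    """Toy master: tracks worker heartbeats and resubmits tasks of silent workers."""

    def __init__(self, tasks, heartbeat_timeout=3):
        self.pending = deque(tasks)
        self.running = {}    # worker id -> task currently assigned
        self.last_seen = {}  # worker id -> time of last heartbeat
        self.done = []
        self.timeout = heartbeat_timeout

    def heartbeat(self, worker, now):
        self.last_seen[worker] = now

    def request_task(self, worker, now):
        """Pull model: a worker asks for work; returns None if nothing is pending."""
        self.heartbeat(worker, now)
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.running[worker] = task
        return task

    def report_done(self, worker, now):
        self.heartbeat(worker, now)
        self.done.append(self.running.pop(worker))

    def check_health(self, now):
        """Requeue the task of every worker whose heartbeat is overdue."""
        for worker, seen in list(self.last_seen.items()):
            if now - seen > self.timeout:
                del self.last_seen[worker]
                task = self.running.pop(worker, None)
                if task is not None:
                    self.pending.append(task)


master = Master(["frag0", "frag1"])
master.request_task("w1", now=0)
master.request_task("w2", now=0)
master.report_done("w1", now=2)
master.check_health(now=5)        # w2 has been silent since t=0: its task is requeued
master.request_task("w1", now=5)  # w1 picks up the resubmitted task
master.report_done("w1", now=6)
print(master.done)
```

The job completes even though one worker disappeared, which is the behaviour a single MPI job cannot offer without a fault-tolerant MPI.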

DIANE-BLAST implementation
Splitting mpiBLAST-g2 into DIANE components
  – Master (Planner and Integrator), Worker
Wrapping each component with Python
  – Hooking the core BLAST C libraries into Python with SWIG
Implementing the DIANE GT2 job submitter
  – For running workers on GT2-enabled clusters
Reusing the databases deployed for mpiBLAST-g2

mpiBLAST-g2 vs. DIANE-BLAST: The Speedup
Query
  – Drosophila chromosome 4
  – size: 1.2 Mbp
DB
  – Drosophila nucleotide sequence database
  – size: 1170 seq., 122 Mbp
  – no. of fragments: 32
Computing resource
  – Available # of CPUs: 12
  – PIII 1.4 GHz
  – 1 GByte memory
[Plots: speedup of mpiBLAST-g2; speedup of DIANE-BLAST]
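
With 32 fragments and 12 CPUs, the fragment count alone caps the achievable speedup. The sketch below computes that cap under simplifying assumptions that are not stated on the slide: every fragment takes the same time to search, each CPU searches one fragment at a time, all CPUs act as workers, and merging and fetching are free.

```python
import math


def ideal_speedup(n_fragments, n_cpus):
    """Upper bound on speedup when equal-cost fragments are processed in rounds."""
    rounds = math.ceil(n_fragments / n_cpus)
    return n_fragments / rounds


for cpus in (4, 8, 12, 16):
    print(cpus, round(ideal_speedup(32, cpus), 2))
```

Under these assumptions 12 CPUs need three rounds to cover 32 fragments, so the bound is 32 / 3, about 10.7 rather than 12. A fragment count that is a multiple of the worker count avoids the partly idle final round.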

mpiBLAST-g2 vs. DIANE-BLAST: The Worker Lifeline
[Plots: worker lifelines for DIANE-BLAST and mpiBLAST-g2, with the annotations below]
DIANE-BLAST task dispatching
  – Handled by DIANE’s task thread
  – Due to the bugs in the current DIANE release
mpiBLAST-g2 task dispatching
  – mpiBLAST-g2 task handling logic

mpiBLAST-g2 vs. DIANE-BLAST: Overall Comparisons
mpiBLAST-g2
  – Master-worker model implemented using the MPICH-g2 libraries
  – Gridification effort: implementing database sharing with the GASSCOPY API; recompilation with the MPICH-g2 and GT2 libraries
  – Error recovery: needs a fault-tolerant MPI
  – Cross-cluster computation: requires outbound connectivity on each worker
  – Performance/throughput: in-cluster performance is as good as that of the original mpiBLAST
DIANE-BLAST
  – Pluggable application for the DIANE master-worker framework
  – Gridification effort: through the gridified DIANE framework
  – Error recovery: task resubmission; tracking the health of each worker
  – Cross-cluster computation: uses a proxy for workers with private IPs
  – Performance/throughput: performance can be tuned by controlling the job thread

Summary
Two grid-enabled BLAST implementations (mpiBLAST-g2 and DIANE-BLAST) were introduced for efficient handling of BLAST jobs on the Grid
Both implementations are based on the master-worker model for distributing BLAST jobs on the Grid
mpiBLAST-g2 has good scalability and speedup in some cases
  – It requires a fault-tolerant MPI implementation for error recovery
  – In the unscalable cases, BioSeq fetching is the bottleneck
DIANE-BLAST provides a flexible mechanism for error recovery
  – Any master-worker workflow can be easily plugged into this framework
  – The job thread control should be improved to achieve good performance and scalability

Thanks for your attention!!