Memory Efficient Pairwise Genome Alignment Algorithm – A Small-Scale Application with Grid Potential
Chao “Bill” Xie, Victor Bolet, Art Vandenberg
Georgia State University, Atlanta, GA 30303, USA
February 22/23, 2006, SURA, Washington DC
Introduction
A small-scale application is studied in the grid environment
Performance is compared across shared-memory, cluster, and grid environments
A pairwise sequence alignment program is chosen as the small-scale application
The basic algorithm is modified into a memory-efficient algorithm
The parallel implementation of pairwise sequence alignment is studied in the different environments
Based on work done by Nova Ahmed, NMI Integration Testbed
Specification of the Distributed Environments
The shared-memory environment is an SGI Origin 2000 machine with 24 CPUs
The cluster environment at UAB is a Beowulf cluster with 8 homogeneous nodes, each with four 550 MHz Pentium III processors and 512 MB of RAM
The grid environment is the same Beowulf cluster with the Globus Toolkit software layer over it
USC HPC resources were used in Summer 2005
The Basic Pairwise Sequence Alignment Algorithm
A two-dimensional array, the Similarity Matrix, stores the two sequences
A match or a mismatch is calculated for each position in the pair of sequences to be matched
Dynamic programming is used
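The dynamic-programming fill described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the slides do not give the scoring rule, so the function name `similarity_matrix`, the scores (match +1, mismatch -1, gap -1), and the floor at zero (a local-alignment convention, chosen because it produces the zero cells that the reduced-memory version relies on) are all assumptions.

```python
def similarity_matrix(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Fill a full similarity matrix by dynamic programming.

    The scoring values and the floor at zero are assumptions made
    for this sketch; the slides do not specify them.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # Row 0 and column 0 stay at zero (empty-prefix boundary).
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            # Match or mismatch for this pair of positions.
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                0,                            # never drop below zero
                score[i - 1][j - 1] + step,   # diagonal: align the two characters
                score[i - 1][j] + gap,        # up: gap in seq_b
                score[i][j - 1] + gap,        # left: gap in seq_a
            )
    return score


# The two short test sequences that appear in the sample output.
matrix = similarity_matrix("tgatggaggt", "gatagg")
best_score = max(max(row) for row in matrix)
```

The full matrix needs (len(seq_a) + 1) × (len(seq_b) + 1) cells, which is what makes long genome sequences costly in memory.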
The Reduced Memory Algorithm
Keep only nonzero elements of the matrix
Memory dynamically allocated as required
New data structure for efficiency

The Parallel Method
The genome sequences are divided among processors
The Similarity Matrix is divided among processors
[Figure: the Similarity Matrix partitioned among processors P1 to P5 over time. Legend: part being computed; computation completed. P_i sends its edge value to P_i+1.]
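The two ideas above, keeping only nonzero cells and handing edge values from one processor to the next, can be sketched together. This is a sequential simulation under stated assumptions, not the authors' MPI implementation: the names `align_block` and `sparse_alignment`, the dictionary used as the sparse structure, the column-block partition, and the scoring (match +1, mismatch -1, gap -1, floor at zero) are all hypothetical. The sketch also passes each edge column whole, whereas a pipelined implementation would send edge values as they are produced so that P_i+1 can start before P_i finishes.

```python
def align_block(seq_a, block_b, left_edge, col_offset,
                match=1, mismatch=-1, gap=-1):
    """Compute one column block of the matrix (the work of one processor).

    left_edge is the last column computed by the previous processor.
    Returns (nonzero cells of this block, right edge column to pass on).
    """
    cells = {}            # (row, column) -> score, nonzero entries only
    prev_col = left_edge  # scores of the column to the left
    for k, ch_b in enumerate(block_b):
        j = col_offset + k + 1              # global 1-based column index
        col = [0] * (len(seq_a) + 1)        # only two columns live at once
        for i in range(1, len(seq_a) + 1):
            step = match if seq_a[i - 1] == ch_b else mismatch
            value = max(0,
                        prev_col[i - 1] + step,  # diagonal
                        col[i - 1] + gap,        # up
                        prev_col[i] + gap)       # left
            col[i] = value
            if value:                            # store nonzero cells only
                cells[(i, j)] = value
        prev_col = col
    return cells, prev_col


def sparse_alignment(seq_a, seq_b, num_procs=4, **scoring):
    """Simulate num_procs processors, each owning a block of columns."""
    block = -(-len(seq_b) // num_procs)   # ceiling division
    edge = [0] * (len(seq_a) + 1)         # boundary column for the first block
    nonzero = {}
    for p in range(num_procs):
        start = p * block
        cells, edge = align_block(seq_a, seq_b[start:start + block],
                                  edge, start, **scoring)
        nonzero.update(cells)             # edge now feeds the next processor
    return nonzero


nonzero_cells = sparse_alignment("tgatggaggt", "gatagg", num_procs=4)
```

Because each block depends only on the edge column of its left neighbour, the result is independent of the number of simulated processors; communication is limited to one edge value per row per block boundary.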
Results
[Figure: Computation time in the shared-memory, cluster, and grid-enabled cluster environments]
[Figure: Computation time in the cluster and grid-enabled cluster environments]
Results
[Figure: Comparison of speedup in the shared-memory, cluster, and grid-enabled cluster environments]
[Figure: Comparison of speedup in the cluster and grid-enabled cluster environments]
UAB multi-cluster
[Figure: Comparison of multi-cluster grid environments. (a) Computation time (b) Speedup]
Running Example (per Nova Ahmed, UAB Beowulf Cluster: Medusa)
Here are the steps for running the genome alignment program on the grid.
First, the sample program, which aligns a very small genome sequence, is tested.
The genome sequences are in t1.txt and t2.txt
The object file is: ar7
Grid-proxy-init, RSL script, globusrun
1. First, grid-proxy-init is run to get the grid certificate:
    Your identity: /O=Grid/OU=UAB Grid/CN=Nova Ahmed
    Enter GRID pass phrase for this identity:
    Creating proxy Done
    Your proxy is valid until: Fri Apr 9 00:54:
2. Then create the RSL script in genome.rsl to run the job:
    & (count=4) (executable=/home/nova/ar7) (jobtype=mpi)
3. The actual program is run on the grid using the globusrun command:
    globusrun -s -r medusa.lab.ac.uab.edu -f ./genome.rsl
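The job description and the globusrun invocation from the steps above can also be assembled programmatically. The sketch below only builds the strings; it does not contact a grid or run anything. The helper names `build_rsl` and `globusrun_command` are hypothetical, and only the RSL attributes and command-line flags that appear in the example are used.

```python
def build_rsl(count, executable, jobtype="mpi"):
    """Return an RSL job description with the three attributes from the example."""
    return f"& (count={count}) (executable={executable}) (jobtype={jobtype})"


def globusrun_command(host, rsl_path):
    """Return the argument list for the globusrun call shown in the example."""
    return ["globusrun", "-s", "-r", host, "-f", rsl_path]


# Values taken from the running example on the Medusa cluster.
rsl_text = build_rsl(4, "/home/nova/ar7")
command = globusrun_command("medusa.lab.ac.uab.edu", "./genome.rsl")
```

Changing `count` is then the only edit needed to repeat a run with a different number of processors.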
Output
NOVA1 MyId = 1 NumProc = 4
[1 : 1 ->2 2] [1 : 2 ->13 3] [1 : 3 ->1 1] [1 : 3 ->11 1]
myid = 1 finished
NOVA1 MyId = 2 NumProc = 4
[2 : 0 ->1 1] [2 : 0 ->11 1] [2 : 2 ->1 1] [2 : 3 ->2 2] [2 : 4 ->2 2] [2 : 4 ->13 3] [2 : 5 ->1 1] [2 : 5 ->13 3]
myid = 2 finished
NOVA1 MyId = 3 NumProc = 4
[3 : 0 ->11 1] [3 : 0 ->21 1] [3 : 1 ->2 2] [3 : 2 ->11 1] [3 : 2 ->31 1] [3 : 3 ->1 1] [3 : 4 ->1 1] [3 : 4 ->12 2] [3 : 4 ->21 1] [3 : 5 ->2 2] [3 : 5 ->12 2] [3 : 5 ->23 3] [3 : 5 ->31 1]
myid = 3 finished
NOVA1 MyId = 0 NumProc = 4
tgatggaggt gatagg
[0 : 0 ->11 1] [0 : 2 ->1 1] [0 : 4 ->11 1] [0 : 5 ->11 1]
Elapsed time is =
myid = 0 finished
// Running the program using longer genome sequences a1-1000, a1-2000, a compared with a2-1000, a2-2000, a2-3000
USC HPC – Summer 2005
[Figure: Computation time in the cluster and grid environments for varying numbers of processors. (a) Small set of sequences (b) Long set of sequences]
USC HPC – Summer 2005
[Figure: Speedup in the cluster and grid environments. (a) Small set of sequences (b) Long set of sequences]
Conclusion
The grid environment shows performance similar to the cluster environment
The grid environment adds little overhead
The shared-memory environment has better speedup than the cluster and grid environments
The shared-memory environment shows a memory limitation when computing large genome sequences
Small-scale applications (as well as large-scale ones) can run efficiently on a grid
Distributed applications with minimal communication among the processors will see benefit in a grid environment, perhaps even across multiple clusters
Future Work
Additional work in a SURAgrid environment that includes multiple clusters
Test data that provides a more computation-intensive challenge for grid environments
Adapt the application to the grid environment so that it uses less inter-process communication
Acknowledgements
This material is based in part upon work supported by:
– National Science Foundation under Grant No. ANI NMI Integration Testbed Program. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
– SURA Grant SURA SURAgrid Application Development & Documentation
Thanks to
– Nova Ahmed, currently Georgia Tech Computer Science PhD program, for original work carried out as part of NMI Integration Testbed Program
– John-Paul Robinson and University of Alabama at Birmingham for access to medusa cluster
– Jim Cotillier, Shelley Henderson, University of Southern California, for access to HPC resources
– Chao “Bill” Xie, Georgia State Computer Science PhD program, for continuing Nova Ahmed’s work
– Victor Bolet, Georgia State Information Systems & Technology Advanced Campus Services unit, for support of Georgia State’s SURAgrid nodes
– John McGee, RENCI.org, for discussions of approach using globus