Published by Alannah Parsons. Modified over 9 years ago.
1 Parallel Genomic Sequence-Searching on an Ad-Hoc Grid: Experiences, Lessons Learned, and Implications
Mark K. Gardner (Virginia Tech), Wu-chun Feng (Virginia Tech), Jeremy Archuleta (U. Utah), Heshan Lin (NCSU), Xiaosong Ma (NCSU & ORNL)
Nominated for Best Paper Award, SC 2006, Tampa, FL
2 Overview
- StorCloud Demo of SC|05
  - I/O throughput competition for real-world scientific applications
  - When: Sun., Nov. 13 to Thu., Nov. 17, 2005
  - Some slides adapted from the StorCloud presentation “mpiBLAST on the GreenGene Distributed Supercomputer” (Wu Feng et al.)
- Story
  - Built an ad-hoc grid (GreenGene) with 3048 processors for intensive genomic sequence search (searching NT against NT with mpiBLAST)
- Team
  - Institutions: LANL, NCSU, U. Utah, and Virginia Tech
  - Vendors: Intel, Panta Systems, and Foundry Networks
3 GreenGene Grid: How?
[Figure: sites joined into the grid: Intel (Dupont), SC2005 showroom floor, U. Utah, Virginia Tech]
4 Outline
- About BLAST and mpiBLAST
- Motivation
- Planning
  - Estimate resource requirements
  - What kind of grid do we need?
- System design
  - Hardware architecture
  - Software architecture
- Results
- Conclusion
5 What is BLAST?
- Basic Local Alignment Search Tool
- Ubiquitous sequence database search tool used in molecular biology
- Given a query DNA or amino-acid (AA) sequence, BLAST
  - Finds similar sequences in the database
  - Reports the statistical significance of similarities between query and database
- Newly sequenced genomes are typically BLAST-searched against a database of known genes
  - Similar sequences may have similar functions in a new organism
6 BLAST at the Core of Sequence DB Search
- Widely used: approximately 75%-90% of all compute cycles in life sciences are devoted to BLAST searches
- But it is
  - Computationally demanding, O(n^2) (a variant of the string-matching algorithm)
  - Dependent on holding the sequence database in memory to perform efficiently
- Challenge: sequence databases are growing exponentially
7 mpiBLAST Algorithm: Querying the Database
- Open-source BLAST parallelization (developed at LANL)
- Parallel approach: segment the database and distribute it across the cluster
- Advantage: delivers super-linear speedup by avoiding repeated I/O
- Limitation: performs poorly on searches with large output volume because of the result-merging bottleneck
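The database-segmentation idea can be illustrated with a small single-process sketch. It is not mpiBLAST code and uses no MPI: the function names, the round-robin split, and the toy k-mer similarity score are all assumptions made for illustration. It only shows that searching fragments independently and merging the per-worker hit lists gives the same ranking as searching the whole database, and that the merge is the one step every result has to pass through.

```python
from typing import Dict, List, Tuple

Hit = Tuple[int, str]  # (toy score, sequence name)

def fragment_database(db: Dict[str, str], n_fragments: int) -> List[Dict[str, str]]:
    """Split the database round-robin into n_fragments pieces (hypothetical scheme)."""
    fragments: List[Dict[str, str]] = [dict() for _ in range(n_fragments)]
    for i, (name, seq) in enumerate(sorted(db.items())):
        fragments[i % n_fragments][name] = seq
    return fragments

def shared_kmers(a: str, b: str, k: int = 3) -> int:
    """Toy similarity: number of distinct k-mers the two sequences share."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)

def search_fragment(query: str, fragment: Dict[str, str]) -> List[Hit]:
    """What one worker does: score the query against its own fragment only."""
    return [(shared_kmers(query, seq), name) for name, seq in fragment.items()]

def merge_results(per_worker: List[List[Hit]], top: int = 3) -> List[Hit]:
    """What the master does: combine all worker hit lists and keep the best ones."""
    merged = [hit for hits in per_worker for hit in hits]
    merged.sort(key=lambda h: (-h[0], h[1]))
    return merged[:top]

def parallel_style_search(query: str, db: Dict[str, str], n_workers: int) -> List[Hit]:
    fragments = fragment_database(db, n_workers)
    return merge_results([search_fragment(query, f) for f in fragments])
```

Because every hit funnels through `merge_results`, output volume rather than search time can become the limiting factor, which is the bottleneck the slide describes.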
8 mpiBLAST-PIO: Enhancing Efficiency
- Optimizations transferred from pioBLAST
  - Research prototype developed at NCSU and ORNL [Lin et al., IPDPS05]
  - Dramatically improves search throughput and scalability
- Uses parallel I/O techniques to remove the result-merging bottleneck
  - Results are buffered and written concurrently by workers
- Enhances output processing to reduce communication volume
- Largely used in the SC StorCloud demo
9 Why Sequence-Search the NT Database Against Itself?
- From a biological perspective
  - Aids in understanding which genetic codes are unique and which are redundant
  - Enables a number of useful studies, from organism “barcoding” to gene function and evolution
- From a computer science perspective
  - Provides a pertinent demonstration of mpiBLAST-PIO's scalability to larger problems (NT is one of the largest sequence databases)
  - Can potentially generate huge output data
  - Enables realization of an advanced indexing structure that tracks relationships among sequences in the database
  - Such indexing structures can provide
    - Up to 100x speedup in search times with little loss of sensitivity
    - Up to 20x compression of the database using phylogenetic methods
10 Resource Estimation
- Why do we care?
  - To evaluate the feasibility of the project
  - To make better scheduling decisions
- What is the complexity of the problem?
  - Intuitively: estimate by sequence length
- NT composition
11 Sequence-Length-Based Estimation
- Simple linear extrapolation makes the job look like “mission impossible”
  - Because of “hard queries”: intensive computation, large quantities of intermediate results
- Fortunately,
  - The correlation between sequence length and resource requirements is weak because BLAST employs heuristics
  - G1 sequences are well behaved, and a large portion of sequences belong to G1
  - Searches of hard queries can be sped up with more memory
- Sampling NT sequence searches
12 Better Predictor?
- Hit-based rather than length-based?
- Two-phase BLAST search
  - First phase: find hits at the word level
  - Second phase: extend matched words in both directions to find the maximal segment pair (longest local matching substring)
  - The first phase is much less expensive to compute than the second
- Modified the BLAST algorithm to collect the number of hits in the first phase
- Attractive: uses internal knowledge of the BLAST algorithm
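The two phases can be sketched in a few lines. This is a deliberately simplified illustration, not the BLAST implementation: it matches words exactly and extends without gaps until the first mismatch, whereas real BLAST scores neighborhoods of words and stops extension with a score drop-off. The function names and the word size `w=3` are assumptions. The hit count returned by the first phase is the quantity the slide proposes as a predictor.

```python
from typing import List, Optional, Tuple

def find_word_hits(query: str, subject: str, w: int = 3) -> List[Tuple[int, int]]:
    """Phase 1: every (i, j) where query[i:i+w] occurs exactly at subject[j:j+w]."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            hits.append((i, j))
    return hits

def extend_hit(query: str, subject: str, i: int, j: int, w: int = 3) -> Tuple[int, int, int]:
    """Phase 2 (simplified): grow the word match left and right until a mismatch.

    Returns (query_start, subject_start, length) of the extended segment.
    """
    left = 0
    while (i - left - 1 >= 0 and j - left - 1 >= 0
           and query[i - left - 1] == subject[j - left - 1]):
        left += 1
    right = 0
    while (i + w + right < len(query) and j + w + right < len(subject)
           and query[i + w + right] == subject[j + w + right]):
        right += 1
    return (i - left, j - left, w + left + right)

def two_phase_search(query: str, subject: str, w: int = 3) -> Tuple[int, Optional[Tuple[int, int, int]]]:
    """Return the phase-1 hit count and the longest extended segment (or None)."""
    hits = find_word_hits(query, subject, w)
    segments = {extend_hit(query, subject, i, j, w) for i, j in hits}
    best = max(segments, key=lambda s: s[2], default=None)
    return len(hits), best
```

For example, `two_phase_search("GGACGTACC", "TTACGTATT")` finds four word hits, three of which extend to the same five-letter segment `ACGTA`.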
13 Number of Hits Not a Better Predictor
- Linear regression on data collected from 500 sequences
  - Y: output size, execution time; X: length, number of hits
- Number of hits is not necessarily better
  - Difference in mean squared errors < 5%
  - High correlation (0.9942) between number of hits and sequence length
- Sequence length is much easier to collect
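A comparison of this kind can be sketched as follows, assuming ordinary least squares with a single predictor; the slides do not say which regression tooling was used, and the helper names here are hypothetical. The sketch fits each predictor separately against the same target, compares the mean squared errors, and reports the correlation between the two predictors.

```python
from math import sqrt
from typing import Dict, Sequence, Tuple

def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean squared error of the least-squares line fitted to (xs, ys)."""
    slope, intercept = fit_line(xs, ys)
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def compare_predictors(lengths, hits, target) -> Dict[str, float]:
    """Compare sequence length and hit count as predictors of one target."""
    mse_len, mse_hits = mse(lengths, target), mse(hits, target)
    worst = max(mse_len, mse_hits)
    return {
        "mse_length": mse_len,
        "mse_hits": mse_hits,
        # Relative difference between the two errors (0.0 when both fits are exact).
        "relative_gap": abs(mse_len - mse_hits) / worst if worst else 0.0,
        "corr_length_hits": pearson(lengths, hits),
    }
```

When the two predictors are almost perfectly correlated, as reported on the slide, the two fits carry nearly the same information, so the cheaper measurement (sequence length) is the natural choice.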
14 What Kind of Grid Do We Need?
- Existing grid frameworks (such as Globus) are not what we want
  - Not available or not well tested on Mac OS X and 64-bit Linux
  - mpiBLAST-PIO not ported to Globus
  - Steep learning curve for installation and configuration
- Homemade grid software written from scratch
  - Fits just our needs
  - Easy to deploy, allows full control
15 Hardware Architecture
- Heterogeneous environment
- Interoperability is a big concern

| Cluster      | Organization  | Architecture              | Memory | #Procs  | File System |
|--------------|---------------|---------------------------|--------|---------|-------------|
| System X     | Virginia Tech | Dual 2.3GHz PowerPC 970FX | 4GB    | 2200    | NFS         |
| TunnelArch   | Univ. of Utah | Dual AMD Opteron 240 CPU  | 4GB    | 126     | PVFS        |
| TunnelArch   | Univ. of Utah | Dual AMD Opteron 244 CPU  | 2GB    | 128     | PVFS        |
| Dupon        | Intel         | Quad core                 | N/A    | 512/256 | NFS         |
| Jarrel       | Intel         | Dual 3.4GHz Intel P4      | 2GB    | 20      | NFS         |
| Blade Center | Intel         | Dual 2.66GHz Intel Xeon   | 2GB    | 28      | NFS         |
| Panta        | Panta Systems | Four AMD Opteron 246HE    | 2GB    | 32      | NFS         |
17 Software Architecture
- Hierarchical design
  - SuperMaster: assigns queries, fetches results, balances load
  - GroupMaster: fetches queries, performs searches
  - How to choose the group size?
- Challenges: heterogeneity, scalability, fault tolerance
[Figure: a SuperMaster above GroupMasters, each group with its own NT replica]
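The hierarchy can be illustrated with a small in-process simulation. The real system was built from Perl, bash, ssh, and rsync scripts that are not reproduced in these slides, so everything below is an assumption: the class and method names, the pull-style interaction in which GroupMasters ask the SuperMaster for work, and the local queue depth of 2. The point of the sketch is that with pull-based assignment a faster group simply comes back for work more often, which balances load across heterogeneous clusters without the SuperMaster tracking anyone's speed.

```python
from collections import deque

class SuperMaster:
    """Hands out query batches on request and collects results (hypothetical API)."""

    def __init__(self, batches):
        self.pending = deque(batches)
        self.results = {}

    def fetch(self, max_batches):
        """Give the caller up to max_batches pending batches."""
        out = []
        while self.pending and len(out) < max_batches:
            out.append(self.pending.popleft())
        return out

    def report(self, batch, result):
        self.results[batch] = result

class GroupMaster:
    """Keeps a small local queue topped up, then searches one batch per step."""

    def __init__(self, name, master, queue_depth=2):
        self.name = name
        self.master = master
        self.queue_depth = queue_depth
        self.local = deque()
        self.done = 0

    def step(self):
        """Process one batch; return False when no work is left anywhere."""
        if len(self.local) < self.queue_depth:
            self.local.extend(self.master.fetch(self.queue_depth - len(self.local)))
        if not self.local:
            return False
        batch = self.local.popleft()
        self.master.report(batch, f"searched by {self.name}")  # stand-in for a real search
        self.done += 1
        return True

def simulate(n_batches, speeds):
    """speeds maps group name -> batches that group can process per round."""
    master = SuperMaster([f"batch{i}" for i in range(n_batches)])
    groups = [GroupMaster(name, master) for name in speeds]
    progressed = True
    while progressed:
        progressed = False
        for group in groups:
            for _ in range(speeds[group.name]):
                progressed = group.step() or progressed
    return {g.name: g.done for g in groups}, master.results
```

Calling `simulate(12, {"fast": 3, "slow": 1})` completes all twelve batches, with the faster group taking the larger share.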
18 Heterogeneity and Accessibility
- Uses only four existing, cross-platform tools
  - Perl, ssh, rsync, bash
  - 5 scripts totaling only 458 lines
  - Fast deployment on Unix-like systems
- Customized mpiBLAST-PIO
  - System X needs special care
    - Porting issues because of Mac OS and PowerPC
    - Implemented pseudo-parallel-write to improve output performance on NFS
19 Design for Scalability
- Manage thousands of processors efficiently with a loosely coupled, hierarchical design
  - Reduces load on the SuperMaster
  - Passive SuperMaster: easy to add GroupMasters and regroup processors, and avoids security holes
  - Allows incremental system start
- Hide WAN latency by queuing queries locally
  - Prevents “bubbles in the pipeline”
- Ensure data integrity with MD5 checksums
  - A silent error every 500GB [Paxson 1999]
- Alleviate network bandwidth constraints with compression (compression ratio 1:5 ~ 1:7)
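The checksum and compression steps can be combined in a short sketch. The slides name MD5 but not the compressor or the transfer format, so the use of `zlib`, the choice to checksum the compressed bytes, and the function names `pack` and `unpack` are assumptions for illustration only.

```python
import hashlib
import zlib
from typing import Tuple

def pack(payload: bytes) -> Tuple[str, bytes]:
    """Compress a result file and compute an MD5 digest of what will be sent."""
    compressed = zlib.compress(payload, 6)
    return hashlib.md5(compressed).hexdigest(), compressed

def unpack(digest: str, compressed: bytes) -> bytes:
    """Verify the digest before trusting the data, then decompress."""
    if hashlib.md5(compressed).hexdigest() != digest:
        # A silent transfer error: the caller should fetch the file again.
        raise ValueError("checksum mismatch")
    return zlib.decompress(compressed)

def compression_ratio(payload: bytes) -> float:
    """Original size divided by compressed size."""
    _, compressed = pack(payload)
    return len(payload) / len(compressed)
```

Verifying the digest on the receiving side turns a silent corruption into an explicit error that can trigger a retransfer. The ratio obtained depends entirely on the data; text-heavy search output tends to compress well.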
20 Fault Tolerance
- Serious: mean time to failure < 10 hours on machines with thousands of processors [Reed 2004]
- Re-execution rather than checkpoint-restart
  - Primary issue: query state management
  - Maintain all query states in the file system
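One way to keep query states in the file system is to represent each batch as an empty marker file that moves between state directories. The slides do not describe the actual layout, so the directory names (`pending`, `running`, `done`) and all function names below are hypothetical. The sketch relies on `os.rename` being atomic within one local file system, so two workers cannot claim the same batch; recovery after a failure is just moving in-flight batches back to `pending` so they are searched again.

```python
import os
import tempfile
from typing import List, Optional

STATES = ("pending", "running", "done")

def init_store(root: str, batch_ids: List[str]) -> None:
    """Create the state directories and one empty marker file per pending batch."""
    for state in STATES:
        os.makedirs(os.path.join(root, state), exist_ok=True)
    for batch in batch_ids:
        with open(os.path.join(root, "pending", batch), "w"):
            pass

def _move(root: str, batch: str, src: str, dst: str) -> None:
    os.rename(os.path.join(root, src, batch), os.path.join(root, dst, batch))

def claim(root: str) -> Optional[str]:
    """Atomically take the next pending batch, or return None if none is left."""
    for batch in sorted(os.listdir(os.path.join(root, "pending"))):
        try:
            _move(root, batch, "pending", "running")
            return batch
        except FileNotFoundError:
            continue  # another worker claimed it first
    return None

def complete(root: str, batch: str) -> None:
    _move(root, batch, "running", "done")

def requeue_running(root: str) -> List[str]:
    """After a failure, return every in-flight batch to pending for re-execution."""
    lost = sorted(os.listdir(os.path.join(root, "running")))
    for batch in lost:
        _move(root, batch, "running", "pending")
    return lost

def demo():
    """Claim two batches, finish one, then simulate a crash and recover."""
    with tempfile.TemporaryDirectory() as root:
        init_store(root, ["b1", "b2", "b3"])
        first, _second = claim(root), claim(root)
        complete(root, first)
        lost = requeue_running(root)
        pending = sorted(os.listdir(os.path.join(root, "pending")))
        done = sorted(os.listdir(os.path.join(root, "done")))
        return lost, pending, done
```

Because the state lives in files rather than in a process, the coordinator itself can be restarted without losing track of which batches still need to run.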
21 Results
- Finished 1/7 of NT in one day
- Coalesced sequences into batches targeting 30 minutes of search time
- Execution statistics
  - Output size: 600KB ~ 7GB per batch, 284.2KB per sequence
  - Execution time: 6 secs ~ 1.6 hours per batch, average 9 mins per batch
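Coalescing sequences into batches with a target search time can be sketched as a greedy pass over length-based estimates. The slides do not give the batching rule or the cost model, so the linear `secs_per_base` coefficient and the function name are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

def make_batches(
    seq_lengths: Sequence[Tuple[str, int]],
    secs_per_base: float,
    target_secs: float = 30 * 60,
) -> List[List[str]]:
    """Greedily group sequences so each batch's estimated time stays near the target.

    seq_lengths is a list of (sequence name, length in bases). A sequence whose
    own estimate exceeds the target still gets a batch to itself.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_secs = 0.0
    for name, length in seq_lengths:
        estimate = length * secs_per_base
        # Close the current batch if adding this sequence would overshoot the target.
        if current and current_secs + estimate > target_secs:
            batches.append(current)
            current, current_secs = [], 0.0
        current.append(name)
        current_secs += estimate
    if current:
        batches.append(current)
    return batches
```

The wide spread of actual batch times on the slide (6 seconds to 1.6 hours against a 30-minute target) is consistent with length being only a rough predictor: a batch containing hard queries can run far longer than its estimate.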
22 Conclusion
- Not able to take advantage of existing grid software
- Homemade grid software did work
  - Enables rapid development and deployment
  - Portable to Unix-like platforms
- Identified hard queries for bio research
- Future work
  - Extend the framework to support more general applications
  - Better resource estimation