Slide 1: Massively Parallel Genomic Sequence Search on Blue Gene/P
Heshan Lin (NCSU), Pavan Balaji (ANL), Ruth Poole, Carlos Sosa (IBM), Xiaosong Ma (NCSU & ORNL), Wu Feng (VT)

Slide 2: Sequence Search
- Fundamental tool in computational biology
- Searches for similarities between a set of query sequences and sequence databases (DBs)
- Analogous to web search engines
- Example application: predicting the functions of newly discovered sequences

                Web Search Engine      Sequence Search
  Input         Keyword(s)             Query sequence(s)
  Search space  Internet               Known sequence database
  Output        Related web pages      DB sequences similar to the query
  Sorted by     Closeness & rank       Similarity score

Slide 3: Exponentially Growing Demands of Large-Scale Sequence Search
- Genomic databases are growing faster than the compute capability of a single CPU; CPU clock scaling has hit the power wall
- Many biological problems require large-scale sequence search
- Example: "Finding Missing Genes"
  - Compute: 12,000+ cores across 7 supercomputer centers worldwide
  - WAN: shared Gigabit Ethernet
  - Feat: completed in 10 days what would normally take years to finish
  - Storage Challenge Award

Slide 4: Challenge: Scaling of Sequence Search
- Performance of a massively parallel sequence search tool, mpiBLAST, on a BG/L prototype
- Workload: search 28,014 Arabidopsis thaliana queries against the NCBI NT DB
- (Chart: measured parallel efficiencies of 93%, 70%, and 55% as the system scales up)

Slide 5: Our Major Contributions
- Identification of key design issues for scalable sequence search on next-generation massively parallel supercomputers
- I/O and scheduling optimizations that allow efficient sequence searching across tens of thousands of processors
- An extended mpiBLAST prototype on BG/P running on up to 32,768 compute cores at 93% efficiency

Slide 6: Road Map
- Introduction
- Background
- Optimizations
- Performance Results
- Conclusion

Slide 7: BLAST & mpiBLAST
- BLAST (Basic Local Alignment Search Tool)
  - De facto standard tool for sequence search
  - Computationally intensive: O(n^2) worst case
  - Algorithm: heuristics-based alignment
  - Execution time of a BLAST job is hard to predict
- mpiBLAST (the database-segmentation idea is sketched below)
  - Originally master-worker with database segmentation
  - Enhanced for large-scale deployments: parallel I/O [Lin05, Ching06], hierarchical scheduling [Gardner06], integrated I/O & scheduling [Thorsen07]
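The database-segmentation idea can be illustrated with a toy MPI program: every rank caches a different DB fragment, the same query is broadcast to all ranks, and the master concatenates the per-fragment hits. This is only a minimal sketch, not the actual mpiBLAST code; search_fragment() is a hypothetical stub standing in for the BLAST kernel, and the merge step is reduced to concatenation.

```c
/* Minimal sketch of database-segmented search in the spirit of mpiBLAST's
 * master-worker design (not the actual mpiBLAST code). */
#include <mpi.h>
#include <stdio.h>

#define RESULT_MAX 4096

/* Hypothetical stub: search one query against the locally cached DB fragment. */
static void search_fragment(int rank, const char *query, char *result)
{
    snprintf(result, RESULT_MAX, "hits from fragment %d for query %.16s", rank, query);
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* The master (rank 0) streams queries to all workers; each worker holds a
       distinct DB fragment, so the same query is searched against every fragment. */
    char query[64] = "ACGTACGTACGT";
    MPI_Bcast(query, sizeof(query), MPI_CHAR, 0, MPI_COMM_WORLD);

    char result[RESULT_MAX];
    search_fragment(rank, query, result);

    if (rank != 0) {
        MPI_Send(result, RESULT_MAX, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else {
        printf("%s\n", result);            /* master's own fragment */
        for (int src = 1; src < nprocs; src++) {
            MPI_Recv(result, RESULT_MAX, MPI_CHAR, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%s\n", result);        /* "merge" = concatenate in this toy sketch */
        }
    }
    MPI_Finalize();
    return 0;
}
```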

Slide 8: Blue Gene/P System Overview
- 1,024 nodes (4,096 cores) per rack
- Compute node: quad-core PowerPC 450, 2 GB memory
- Scalable I/O system: compute and I/O nodes organized into PSETs
- Five networks: 3D torus, global collective, global interrupt, 10-Gigabit Ethernet, control
- Execution modes: SMP, DUAL, VN
- Packaging hierarchy (from the system diagram): node card = 32 nodes; rack = 32 node cards; full system = 72 racks (~1 PetaFlops)

Slide 9: Road Map
- Introduction
- Background
- Optimizations
- Performance Results
- Conclusion

Slide 10: Current mpiBLAST Scheduling
- Scheduling hierarchy (from the diagram): a supermaster dispatches queries to one master per database partition; each master drives a small, fixed set of workers that cache DB fragments (f1, f2, ...) and search the queries assigned to them (q_i, q_j, ...)
- Limitations:
  - Fixed worker-to-master mapping
  - High overhead with fine-grained load balancing

Slide 11: Scheduling Optimization Overview
- Allow an arbitrary number of workers to be mapped to a master
- Hide load-balancing overhead with query prefetching (a worker-side sketch follows below)
- (Diagram: supermaster over one master per partition, each master now managing four workers, with the next query prefetched while the current one is searched)
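A worker-side sketch of the query-prefetching idea, under an assumed request/reply protocol between worker and master (the tags, the empty-query termination convention, and blast_search() are illustrative, not mpiBLAST's actual protocol): the request for query i+1 is posted before the search of query i starts, so the scheduling round trip is hidden behind computation.

```c
/* Worker-side fragment only; the master side is omitted. */
#include <mpi.h>
#include <string.h>

#define TAG_REQ   1
#define TAG_QUERY 2
#define QUERY_MAX 4096

/* Hypothetical placeholder for the BLAST compute kernel. */
static void blast_search(const char *query) { (void)query; }

void worker_loop(int master_rank)
{
    char current[QUERY_MAX], next[QUERY_MAX];
    MPI_Request req;

    /* Get the first query synchronously. */
    MPI_Send(NULL, 0, MPI_CHAR, master_rank, TAG_REQ, MPI_COMM_WORLD);
    MPI_Recv(current, QUERY_MAX, MPI_CHAR, master_rank, TAG_QUERY,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    while (current[0] != '\0') {            /* empty query = termination (assumed) */
        /* Prefetch: request the next query and receive it asynchronously
           while the current search is still running. */
        MPI_Send(NULL, 0, MPI_CHAR, master_rank, TAG_REQ, MPI_COMM_WORLD);
        MPI_Irecv(next, QUERY_MAX, MPI_CHAR, master_rank, TAG_QUERY,
                  MPI_COMM_WORLD, &req);

        blast_search(current);              /* overlaps with the Irecv above */

        MPI_Wait(&req, MPI_STATUS_IGNORE);
        memcpy(current, next, QUERY_MAX);
    }
}
```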

Slide 12: Mapping on BG/P
- DB fragments are cached in the workers; queries are streamed across the partitions
- One output file per partition
- Results are merged and written to GPFS through the I/O nodes
- Configuration example: with a partition size of 128 processes, 32,768 cores yield 32768/128 = 256 partitions
- (Diagram: compute-node partitions, each served by an I/O node, writing File 1 ... File i to GPFS)

Slide 13: I/O Characteristics
- Irregularity:
  - Output data from a process is unstructured and non-contiguous
  - I/O data distribution varies from query to query
  - Query search times are imbalanced across workers
- Straightforward non-contiguous I/O is bad for the BG/P I/O system:
  - Too many requests create contention at the I/O node
  - GPFS is less optimized for small-chunk, irregular I/O [Yu06]

Slide 14: Existing Option 1: Collective I/O
- Optimizes accesses from N processes with two-phase I/O: merge small, non-contiguous I/O requests into large, contiguous ones (sketched below)
- Advantages: excellent I/O throughput; highly optimized on Blue Gene [Yu06]
- Disadvantage: synchronization between processes
- Suitable for applications with synchronization in the compute kernel and balanced computation time between I/O phases
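A minimal sketch of what a per-query collective write looks like in MPI-IO; the file path, partition communicator, and offsets are assumed to have been set up as described on the surrounding slides. The key property is that MPI_File_write_at_all is collective, so fast workers block until the slowest worker in the partition reaches the call.

```c
#include <mpi.h>

void write_collective(MPI_Comm partition_comm, const char *path,
                      const char *buf, int len, MPI_Offset my_offset)
{
    MPI_File fh;
    MPI_File_open(partition_comm, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective: ALL ranks in partition_comm must call this, even with len = 0.
       This is the synchronization cost the slides point out - a worker that
       finished its query early still waits here for the slowest one. */
    MPI_File_write_at_all(fh, my_offset, buf, len, MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}
```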

Slide 15: Option 1 Implementation: WorkerCollective
Per-query flow (reconstructed from the diagram, workers 1-3 plus the master):
1. Each worker searches q_i and merges its output block information
2. Workers send e-values and block sizes to the master
3. The master calculates the offsets for q_i and sends them back
4. Workers exchange data and collectively write the output of q_i to the shared output file; workers that finish their search early must wait at the collective write

Slide 16: Existing Option 2: Optimized Independent I/O
- Optimized with data sieving: read-modify-write, i.e., read in large chunks and modify only the target regions (sketched below)
- Advantages: incurs no synchronization; performs well for dense non-contiguous requests
- Disadvantages: introduces redundant data accesses; causes lock contention through false sharing
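A sketch of the optimized independent-I/O alternative: a worker's scattered output blocks for one query are described with a derived datatype and written with a single independent call, letting ROMIO's data sieving turn them into large read-modify-write accesses. The hint names ("romio_ds_write", "ind_wr_buffer_size") are real MPICH/ROMIO hints, but their availability and defaults depend on the MPI library; the surrounding setup is illustrative, and note that MPI_File_open and MPI_File_set_view are still collective calls even though the write itself is not.

```c
#include <mpi.h>

/* Each worker has nblocks output pieces at scattered file displacements. */
void write_independent(MPI_Comm comm, const char *path,
                       const char *buf, int nblocks,
                       const int *blocklens, const MPI_Aint *file_disps)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_ds_write", "enable");       /* data sieving on writes */
    MPI_Info_set(info, "ind_wr_buffer_size", "4194304");  /* 4 MB sieving buffer */

    MPI_Datatype filetype;
    MPI_Type_create_hindexed(nblocks, blocklens, file_disps, MPI_CHAR, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_File_set_view(fh, 0, MPI_CHAR, filetype, "native", info);

    int total = 0;
    for (int i = 0; i < nblocks; i++) total += blocklens[i];

    /* Independent write of a non-contiguous file region: ROMIO's data sieving
       turns it into large read-modify-write chunks - no synchronization with
       other workers, but redundant I/O and possible lock contention (false
       sharing) on GPFS. */
    MPI_File_write(fh, buf, total, MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Info_free(&info);
}
```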

Slide 17: Option 2 Implementation: WorkerIndividual
Per-query flow (reconstructed from the diagram):
1. Each worker searches q_i and merges its output block information
2. Workers send e-values and block sizes to the master
3. The master calculates the offsets for q_i and sends them back
4. Each worker writes its own output of q_i independently to the shared output file and moves on to the next query

Slide 18: Our Design: Asynchronous Two-Phase I/O
- Goal: achieve high I/O throughput without forcing synchronization
- Asynchronously aggregate small, non-contiguous I/O requests into large chunks
- Current implementation: the master selects a write leader for each query; the write leader aggregates the requests from the other workers

Slide 19: Our Design Implementation: WorkerMerge
Per-query flow (reconstructed from the diagram; a sketch follows below):
1. Each worker searches q_i and merges its output block information
2. Workers send e-values and block sizes to the master
3. The master calculates the offsets for q_i and sends them to the selected write leader
4. Non-leader workers send their data to the write leader and continue with the next query
5. The write leader writes the aggregated output of q_i to the output file
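A sketch of the WorkerMerge flow for a single query, under assumed message tags and simplified buffer handling (the real implementation uses non-blocking communication and incremental writes, as the next slide discusses): the write leader pulls the other workers' blocks into one buffer and issues a single large independent write at the offset assigned by the master, while non-leaders just send their block and move on. The ordering and merging of blocks is simplified here to plain concatenation.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define TAG_RESULT 10

void write_leader(MPI_File fh, MPI_Offset base_offset,
                  const char *my_buf, int my_len,
                  const int *peer_ranks, const int *peer_lens, int npeers)
{
    /* Phase 1: aggregate everyone's blocks into one contiguous buffer. */
    int total = my_len;
    for (int i = 0; i < npeers; i++) total += peer_lens[i];
    char *agg = malloc(total);
    memcpy(agg, my_buf, my_len);

    int pos = my_len;
    for (int i = 0; i < npeers; i++) {
        MPI_Recv(agg + pos, peer_lens[i], MPI_CHAR, peer_ranks[i],
                 TAG_RESULT, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        pos += peer_lens[i];
    }

    /* Phase 2: one large, contiguous, independent write - no partition-wide
       synchronization, only the workers that searched this query take part. */
    MPI_File_write_at(fh, base_offset, agg, total, MPI_CHAR, MPI_STATUS_IGNORE);
    free(agg);
}

void write_peer(int leader_rank, const char *my_buf, int my_len)
{
    /* Non-leader workers hand their block to the leader and continue with
       the next (prefetched) query. */
    MPI_Send(my_buf, my_len, MPI_CHAR, leader_rank, TAG_RESULT, MPI_COMM_WORLD);
}
```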

Slide 20: Discussion
- WorkerMerge implementation concerns:
  - Non-blocking MPI communication vs. threads
  - Incremental writes to avoid memory overflow: only a limited amount of data is collected and written at a time
- Would split-collective I/O help? (see the sketch below)
  - MPI_File_write_all_begin / MPI_File_write_all_end
  - But MPI_File_set_view is synchronizing by definition
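For reference, a minimal sketch of the split-collective option raised above, using the explicit-offset variant (MPI_File_write_at_all_begin/end) so that no per-query MPI_File_set_view is needed; blast_search_next_query() is a hypothetical placeholder for the computation overlapped with the write. The operation is still collective, so this only hides latency behind computation, it does not remove the synchronization among workers.

```c
#include <mpi.h>

/* Hypothetical placeholder: search the next (prefetched) query. */
static void blast_search_next_query(void) { }

void write_split_collective(MPI_File fh, MPI_Offset offset,
                            const char *buf, int len)
{
    /* Begin the collective write; buf must not be modified until ..._end. */
    MPI_File_write_at_all_begin(fh, offset, buf, len, MPI_CHAR);

    blast_search_next_query();          /* overlap computation with the write */

    MPI_File_write_at_all_end(fh, buf, MPI_STATUS_IGNORE);
}
```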

Slide 21: Road Map
- Introduction
- Background
- Optimizations
- Performance Results
- Conclusion

Slide 22: Comparing I/O Strategies – Single Partition
- Experimental setup:
  - Database: NT (over 6 million sequences, 23 GB raw size)
  - Query: 512 sequences randomly sampled from the database
  - Metric: overall execution time
- Result: WorkerMerge (WM) outperforms WorkerCollective (WC) and WorkerIndividual (WI) by factors of 2.7 and 4.9, respectively

Slide 23: Comparing I/O Strategies – Multiple Partitions
- Experimental setup:
  - Database: NT
  - Query: 2,048 sequences randomly sampled from the database
  - Partition size fixed at 128 processes; the number of partitions is scaled

Slide 24: Scalability on a Real Problem
- Discovering "missing genes"
  - Database: 16M microbial sequences
  - Query: 0.25M randomly sampled sequences
- 93% parallel efficiency
- All-to-all search of the entire DB completed within 12 hours

Slide 25: Road Map
- Introduction
- Background
- Optimizations
- Performance Results
- Conclusion

Slide 26: Conclusion
- Blue Gene/P is well suited for massively parallel sequence search when the software is designed properly
- We proposed I/O and scheduling optimizations for scalable sequence searching across tens of thousands of processors
- The mpiBLAST prototype scales efficiently (93% efficiency) on 32,768 BG/P cores
- For non-contiguous I/O with imbalanced compute kernels, collective I/O without synchronization is desirable

Slide 27: References
[Thorsen07] O. Thorsen, K. Jiang, A. Peters, B. Smith, H. Lin, W. Feng, and C. Sosa, "Parallel Genomic Sequence-Search on a Massively Parallel System," ACM Int'l Conference on Computing Frontiers, 2007.
[Yu06] H. Yu, R. Sahoo, C. Howson, G. Almasi, J. Castanos, M. Gupta, J. Moreira, J. Parker, T. Engelsiepen, R. Ross, R. Thakur, R. Latham, and W. Gropp, "High Performance File I/O for the BlueGene/L Supercomputer," 12th IEEE Int'l Symposium on High-Performance Computer Architecture (HPCA-12), 2006.
[Gardner06] M. Gardner, W. Feng, J. Archuleta, H. Lin, and X. Ma, "Parallel Genomic Sequence-Searching on an Ad-Hoc Grid: Experiences, Lessons Learned, and Implications," IEEE/ACM Int'l Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2006. Best Paper Finalist.
[Lin05] H. Lin, X. Ma, P. Chandramohan, A. Geist, and N. Samatova, "Efficient Data Access for Parallel BLAST," IEEE Int'l Parallel and Distributed Processing Symposium (IPDPS), 2005.
[Ching06] A. Ching, W. Feng, H. Lin, X. Ma, and A. Choudhary, "Exploring I/O Strategies for Parallel Sequence Database Search Tools with S3aSim," ACM Int'l Symposium on High Performance Distributed Computing (HPDC), 2006.

Slide 28: Thank You. Questions?