RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications
Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
Cluster 2015, Chicago, Illinois
Motivation
– Sequencing costs are decreasing
– Available data is increasing!
– Parallel processing is inevitable!
[Figures: sequencing cost and data growth trends, adapted from external sources]
Typical Analysis on Genomic Data
Single Nucleotide Polymorphism (SNP) calling
– Alignment File-1 (Reference: AGCGTACC): Read-1 AGCG, Read-2 GCGG, Read-3 GCGTA, Read-4 CGTTCC
– Alignment File-2: Read-1 AGAG, Read-2 AGAGT, Read-3 GAGT, Read-4 GTTCC
A single SNP may cause a Mendelian disease! (Figure adapted from Wikipedia)
Existing Solutions for Implementation
– Serial tools: SAMtools, VCFtools, BEDtools (file merging, sorting, etc.); VarScan (SNP calling)
– Parallel implementations: TurboBLAST (searching local alignments), SEAL (read mapping and duplicate removal), Biodoop (statistical analysis)
– Middleware systems: Hadoop (not designed for the specific needs of genomic data; limited programmability)
– Genome Analysis Toolkit (GATK): designed for genomic data processing; provides special data traversal patterns; limited parallelization for some of its tools
Our Goal
We want to develop a middleware system that
– Is specific to parallel genomic data processing
– Allows parallelization of a variety of genomic algorithms
– Works with different popular genomic data formats
– Allows use of existing programs
Challenges
– Load imbalance due to the nature of genomic data: it is not just an array of A, G, C, and T characters
– High overhead of tasks
– I/O contention
[Figure: coverage variance]
Background: PAGE (IPDPS '14)
PAGE: a Map-Reduce-like middleware for easy parallelization of genomic applications
Mappers and reducers are executable programs
– Allows us to exploit existing applications
– No restriction on programming language
Parallel Genomic Applications
RE-PAGE: a Map-Reduce-like middleware for easy parallelization of data-intensive genomic applications (like PAGE)
Main goals (unlike PAGE):
– Decrease I/O contention by employing a distributed file system
– Balance the workload of data-intensive tasks
– Avoid data transfers
Execution Model
[Figure: RE-PAGE execution model]
RE-PAGE
Mappers and reducers are executable programs
– Allows us to exploit existing applications
– No restriction on programming language
Applicability
– The algorithm must be safe to parallelize by processing different regions of the genome independently
– e.g., SNP calling, statistical tools, and others
RE-PAGE Parallelization
RE-PAGE can parallelize all applications with the following property (as in PAGE):
Let M be a map task, and let R, R1, and R2 be regions such that R is the concatenation of R1 and R2. Then
M(R) = M(R1) ⊕ M(R2), where ⊕ is the reduction function.
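The property above can be sketched in a few lines. This is an illustrative example, not RE-PAGE's actual implementation: the map task M counts nucleotide frequencies in a region, and ⊕ merges two partial counts, so applying M to a whole region equals reducing the results on its sub-regions.

```python
# Illustrative sketch of the parallelization property M(R) = M(R1) ⊕ M(R2).
# Here M counts nucleotide occurrences and ⊕ merges two partial counts.
from collections import Counter

def map_task(region):
    """M: count nucleotide occurrences in a genomic region (a string here)."""
    return Counter(region)

def reduce_op(a, b):
    """⊕: merge two partial results by summing counts."""
    return a + b

r1, r2 = "AGCG", "TACC"
whole = map_task(r1 + r2)                       # M(R)
parts = reduce_op(map_task(r1), map_task(r2))   # M(R1) ⊕ M(R2)
assert whole == parts
```

Any analysis with this decomposition property (e.g., per-region SNP calling or coverage statistics) can be run on regions in parallel and combined afterwards.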
Domain-Specific Data Chunks
Heuristic: data in the same genomic location/region is often related and will most likely be processed together in many types of genomic data analysis
Construct data chunks according to genomic region
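A minimal sketch of region-based chunking, under the assumption (all names hypothetical) that aligned reads carry a chromosome and position: reads mapping to the same genomic region land in the same chunk, rather than being split at arbitrary byte offsets.

```python
# Hypothetical sketch: group reads into chunks by genomic region
# instead of fixed-size byte blocks.

def chunk_by_region(reads, region_size=1_000_000):
    """reads: iterable of (chrom, position, data) tuples from an alignment file."""
    chunks = {}
    for chrom, pos, data in reads:
        key = (chrom, pos // region_size)   # region index on the chromosome
        chunks.setdefault(key, []).append((chrom, pos, data))
    return chunks

reads = [("chr1", 100, "AGCG"), ("chr1", 999_999, "GCGG"), ("chr1", 1_500_000, "GCGTA")]
chunks = chunk_by_region(reads)
# the first two reads share region ("chr1", 0); the third falls in ("chr1", 1)
```

Because regions differ in coverage, the resulting chunks naturally vary in size, which is exactly the load-imbalance challenge the scheduling schemes address.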
Proposed Replication Method
– Replication is needed to increase data locality
– Replicating all chunks onto all nodes is not feasible
– Depending on the analysis we want to perform, some genomic regions can be more important than others
General idea: replicate important regions more than others
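One way to realize "replicate important regions more" is to allocate a replica budget in proportion to a per-region importance score. The scoring, budget, and bounds below are illustrative assumptions, not RE-PAGE's actual policy.

```python
# Hypothetical sketch of importance-weighted replication: regions deemed
# more important for the target analysis receive more replicas, clamped
# to [min_rep, max_rep]. Scores and budget are illustrative.

def replication_factors(importance, total_budget, min_rep=1, max_rep=5):
    """importance: dict region -> score; total_budget: total replicas to place."""
    total = sum(importance.values())
    factors = {}
    for region, score in importance.items():
        raw = round(total_budget * score / total)   # proportional share
        factors[region] = max(min_rep, min(max_rep, raw))
    return factors

imp = {"chr1:0": 0.6, "chr1:1": 0.3, "chr2:0": 0.1}
print(replication_factors(imp, total_budget=10))
```

Clamping to a minimum of one replica keeps every region available, while the maximum cap keeps hot regions from consuming the whole budget.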
Proposed Scheduling Schemes
Problem definition
– Chunks can be of varying sizes and can have varying numbers of replicas
– Tasks are data intensive: data transfer costs outweigh data processing costs
General approach
– Avoid remote processing
– Take advantage of the variety in replication factors and data sizes
– Master & worker architecture
We propose three scheduling schemes
– Largest Chunk First (LCF)
– Help the Busiest Node (HBN)
– Effective Memory Management (EMM)
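To make the "avoid remote processing" idea concrete, here is a sketch in the spirit of Largest Chunk First: the master gives an idle worker the largest still-unprocessed chunk among the replicas that worker holds locally, and never assigns remote chunks. The data structures and function are hypothetical, not RE-PAGE's actual scheduler.

```python
# Illustrative LCF-style assignment: pick the largest unprocessed chunk
# that is local to the requesting worker; return None if no local work
# remains (the worker idles rather than processing remotely).

def lcf_assign(worker, local_chunks, done):
    """local_chunks: dict worker -> list of (chunk_id, size_mb) replicas on that node."""
    candidates = [(cid, sz) for cid, sz in local_chunks[worker] if cid not in done]
    if not candidates:
        return None
    cid, _ = max(candidates, key=lambda c: c[1])  # largest local chunk first
    done.add(cid)
    return cid

local = {"w1": [("c1", 64), ("c2", 128)], "w2": [("c2", 128), ("c3", 32)]}
done = set()
first = lcf_assign("w1", local, done)   # w1 takes its largest local chunk
second = lcf_assign("w2", local, done)  # w2 takes what remains locally
```

Scheduling large chunks early leaves the small ones to smooth out load at the end of the run, which matters precisely because chunk sizes vary.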
Experiments (1)
Setup: 32 nodes (256 cores); 2000 chunks; average chunk size 32 MB; STD of chunk sizes 24 MB; replication factor 3; processing speed 1 MB/sec
For reference, real genomic data: average chunk size 68 MB; STD of chunk sizes 63 MB
[Figures: varying STD of data blocks; varying computation speed]
Experiments (2)
Comparison with a centralized approach
– Computation power: 32 nodes (256 cores)
– Replication factor: 3
– Application: Coverage Analyzer
Experiments (3): Parallel Scalability
– Coverage Analyzer: 15 SAM files (47 GB), replication factor 3
– Unified Genotyper: 40 BAM files (51 GB), replication factor 3 (RE-PAGE only)
[Figures: scalability results, with speedups of 2.2x, 4.2x, 7.1x, and 9.9x]
Summary
RE-PAGE: a middleware for developing parallel data-intensive genomic applications
– Programming: employs executables of genomic applications; can parallelize a wide range of applications
– Performance: keeps data in a distributed file system; minimizes data transfers; employs an intelligent replication method
RE-PAGE outperforms Hadoop and GATK and shows good parallel scalability
Observation: prohibiting remote tasks improves performance when chunks have varying sizes and tasks are data intensive
Thank you!