RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications
Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
Cluster 2015, Chicago, Illinois

Motivation
– Sequencing costs are decreasing, and the available data is increasing!
– Parallel processing is inevitable!
(Figures adapted from www.genome.gov/sequencingcosts and www.nlm.nih.gov/about/2015CJ.html)

Typical Analysis on Genomic Data
Single Nucleotide Polymorphism (SNP) calling
Reference: AGCGTACC (positions 1-8)
Alignment File-1 reads: Read-1 AGCG, Read-2 GCGG, Read-3 GCGTA, Read-4 CGTTCC
Alignment File-2 reads: Read-1 AGAG, Read-2 AGAGT, Read-3 GAGT, Read-4 GTTCC
A single SNP may cause a Mendelian disease!
(Figure adapted from Wikipedia)

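To make the slide's pileup figure concrete, here is a minimal pileup-and-vote sketch of SNP calling in Python. The read start positions and the min_coverage threshold are my assumptions for illustration; real callers such as VarScan or GATK's Unified Genotyper use quality-aware statistical models.

```python
from collections import Counter

# Minimal, illustrative SNP calling: pile up aligned reads column by
# column and flag positions where the consensus base disagrees with
# the reference. This only conveys the idea, not a production caller.
def call_snps(reference, reads, min_coverage=2):
    """reads: list of (start, sequence) with 1-based start positions."""
    pileup = {pos: Counter() for pos in range(1, len(reference) + 1)}
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos <= len(reference):
                pileup[pos][base] += 1
    snps = []
    for pos, counts in pileup.items():
        if sum(counts.values()) < min_coverage:
            continue  # too few overlapping reads to call this position
        consensus, _ = counts.most_common(1)[0]
        if consensus != reference[pos - 1]:
            snps.append((pos, reference[pos - 1], consensus))
    return snps

# Reads resembling Alignment File-2 (start positions are assumed):
reads = [(1, "AGAG"), (1, "AGAGT"), (2, "GAGT"), (4, "GTTCC")]
print(call_snps("AGCGTACC", reads))  # -> [(3, 'C', 'A')]
```
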
Existing Solutions for Implementation
Serial tools
– SAMtools, VCFtools, BEDTools: file merging, sorting, etc.
– VarScan: SNP calling
Parallel implementations
– TurboBLAST: searching local alignments
– SEAL: read mapping and duplicate removal
– Biodoop: statistical analysis
Middleware systems
– Hadoop: not designed for the specific needs of genomic data; limited programmability
– Genome Analysis Toolkit (GATK): designed for genomic data processing; provides special data traversal patterns; limited parallelization for some of its tools

Our Goal
We want to develop a middleware system that
– is specific to parallel genomic data processing,
– allows parallelization of a variety of genomic algorithms,
– works with different popular genomic data formats, and
– allows the use of existing programs.

Challenges
– Load imbalance due to the nature of genomic data: it is not just an array of A, G, C, and T characters; coverage varies widely across positions.
– High overhead of tasks
– I/O contention
(Figure: coverage variance across genomic positions)

Background: PAGE (IPDPS'14)
PAGE: a Map-Reduce-like middleware for easy parallelization of genomic applications
Mappers and reducers are executable programs:
– allows us to exploit existing applications
– no restriction on programming language

Parallel Genomic Applications
RE-PAGE: a Map-Reduce-like middleware for easy parallelization of data-intensive genomic applications (like PAGE)
Main goals (unlike PAGE):
– decrease I/O contention by employing a distributed file system
– balance the workload of data-intensive tasks
– avoid data transfers

Execution Model
(Figure: RE-PAGE execution model diagram)

RE-PAGE
Mappers and reducers are executable programs (see the invocation sketch below):
– allows us to exploit existing applications
– no restriction on programming language
Applicability:
– the algorithm must be safe to parallelize by processing different regions of the genome independently
– examples: SNP calling, statistical tools, and others

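Since the slides specify only that mappers are external executables, here is a hedged sketch of how a middleware might invoke one per chunk. The executable name and CLI flags are illustrative assumptions, not RE-PAGE's actual interface.

```python
import subprocess

# Hedged sketch: run an existing genomic tool as a "mapper" over one
# data chunk/region. Flag names below are assumptions for illustration.
def run_mapper(executable, chunk_path, region, output_path):
    subprocess.run(
        [executable, "--input", chunk_path,
         "--region", region, "--output", output_path],
        check=True)  # raise if the external program fails
    return output_path

# Hypothetical usage:
# run_mapper("./snp_caller", "chunk_chr1_00.bam", "chr1:1-1000000", "part-00.out")
```
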
RE-PAGE Parallelization
PAGE (and likewise RE-PAGE) can parallelize all applications with the following property.
Let M be a map task, and let R, R1, and R2 be regions such that R is the concatenation of R1 and R2. Then

    M(R) = M(R1) ⊕ M(R2),

where ⊕ is the reduction function.

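A small sketch of this decomposability property in code, using a map task of my own choosing (counting aligned bases per region) rather than one from the paper:

```python
# Check M(R) = M(R1) (+) M(R2) for a simple, decomposable map task.
def map_task(reads):
    """M: total number of aligned bases over the region's reads."""
    return sum(len(seq) for _, seq in reads)

def reduce_fn(a, b):
    """The reduction function (+); here, plain addition."""
    return a + b

r1 = [(1, "AGCG"), (2, "GCGG")]   # reads falling in region R1
r2 = [(5, "TACC"), (6, "ACC")]    # reads falling in region R2
r = r1 + r2                        # R: concatenation of R1 and R2

assert map_task(r) == reduce_fn(map_task(r1), map_task(r2))
print("property holds:", map_task(r))  # -> property holds: 15
```
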
Domain-Specific Data Chunks
Heuristic: data from the same genomic location/region is likely related and will most likely be processed together in many types of genomic data analysis.
Therefore, construct data chunks according to genomic region (see the sketch below).

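A minimal sketch of region-based chunking, assuming fixed-width genomic windows; RE-PAGE's actual chunking policy may differ, and the 1 Mb window size is an assumption:

```python
from collections import defaultdict

# Hedged sketch: group aligned records into chunks keyed by genomic
# window, so reads from the same region land in the same chunk.
def chunk_by_region(records, window=1_000_000):
    """records: iterable of (chromosome, position, record)."""
    chunks = defaultdict(list)
    for chrom, pos, rec in records:
        chunks[(chrom, pos // window)].append(rec)
    return chunks

# e.g. chunk_by_region([("chr1", 150, "r1"), ("chr1", 999_999, "r2")])
# puts both records in chunk ("chr1", 0).
```
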
Proposed Replication Method
– Replication is needed to increase data locality, but replicating every chunk to every node is not feasible.
– Depending on the target analysis, some genomic regions can be more important than others.
– General idea: replicate important regions more than others (a sketch follows).

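One plausible reading of importance-weighted replication, sketched below; the scoring function, bounds, and linear mapping are all my assumptions, not the paper's policy:

```python
# Hedged sketch: map an application-supplied importance score in
# [0, 1] to a replica count in [min_rep, max_rep].
def replication_factor(importance, min_rep=1, max_rep=5):
    return min_rep + round(importance * (max_rep - min_rep))

print(replication_factor(0.9))  # "important" region -> 5 replicas
print(replication_factor(0.1))  # less important region -> 1 replica
```
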
Proposed Scheduling Schemes
Problem definition:
– chunks can vary in size and in number of replicas
– tasks are data-intensive: data transfer costs outweigh data processing costs
General approach:
– avoid remote processing
– take advantage of the variety in replication factors and data sizes
– master & worker architecture
We propose three scheduling schemes (see the LCF sketch below):
– Largest Chunk First (LCF)
– Help the Busiest Node (HBN)
– Effective Memory Management (EMM)

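The slides name the schemes without pseudocode, so here is a minimal sketch of one plausible reading of Largest Chunk First: an idle worker takes the largest still-unprocessed chunk among those it holds a local replica of, so large chunks start early and nothing is fetched remotely. The data structures and tie-breaking are assumptions.

```python
import heapq

# Hedged LCF sketch: pick the largest unclaimed local chunk.
def lcf_assign(local_chunks, done):
    """local_chunks: heap of (-size, chunk_id) replicated on this worker."""
    while local_chunks:
        neg_size, chunk_id = heapq.heappop(local_chunks)
        if chunk_id not in done:   # master marks chunks as claimed
            done.add(chunk_id)
            return chunk_id        # process locally; no remote transfer
    return None                    # no local work left; worker idles

# Hypothetical usage: worker holding chunks of sizes 70, 30, 50 MB
heap = [(-70, "c1"), (-30, "c2"), (-50, "c3")]
heapq.heapify(heap)
print(lcf_assign(heap, set()))     # -> "c1" (the 70 MB chunk first)
```
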
Experiments (1)
Setup: 32 nodes (256 cores); average data chunk size: 32 MB; STD of chunk sizes: 24 MB; number of chunks: 2,000; replication factor: 3; processing speed: 1 MB/sec.
For comparison, real genomic data: average chunk size 68 MB; STD of chunk sizes 63 MB.
(Figures: varying STD of data blocks; varying computation speed)

Experiments (2)
Comparison with a centralized approach
Setup: 32 nodes (256 cores); replication factor: 3; application: Coverage Analyzer.
(Figure: comparison results)

Experiments (3): Parallel Scalability
– Coverage Analyzer: 15 SAM files (47 GB); replication factor: 3
– Unified Genotyper: 40 BAM files (51 GB); replication factor: 3 (only RE-PAGE)
(Figures: scalability results, with callouts of 2.2x, 4.2x, 7.1x, and 9.9x speedups)

Summary
RE-PAGE: a middleware for developing parallel, data-intensive genomic applications
– Programming: employs executables of genomic applications; can parallelize a wide range of applications
– Performance: keeps data in a distributed file system; minimizes data transfer; employs an intelligent replication method
RE-PAGE outperforms Hadoop and GATK and shows good parallel scalability.
Observation: prohibiting remote tasks increases performance when chunks vary in size and tasks are data-intensive.

Thank you!