RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications
Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
Cluster 2015, Chicago, Illinois

Motivation
– Sequencing costs are decreasing, and the available data is increasing!
– Parallel processing is inevitable!
(Figures adapted from www.genome.gov/sequencingcosts and www.nlm.nih.gov/about/2015CJ.html)

Typical Analysis on Genomic Data
Single Nucleotide Polymorphism (SNP) calling
Reference: AGCGTACC (positions 1-8)
Alignment File-1 reads: Read-1 AGCG, Read-2 GCGG, Read-3 GCGTA, Read-4 CGTTCC
Alignment File-2 reads: Read-1 AGAG, Read-2 AGAGT, Read-3 GAGT, Read-4 GTTCC
A single SNP may cause a Mendelian disease!
(Figure adapted from Wikipedia)

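To make the slide's pileup figure concrete, here is a minimal pileup-and-vote sketch of SNP calling in Python. The read start positions and the min_coverage threshold are my assumptions for illustration; real callers such as VarScan or GATK's Unified Genotyper use quality-aware statistical models.

```python
from collections import Counter

# Minimal, illustrative SNP calling: pile up aligned reads column by
# column and flag positions where the consensus base disagrees with
# the reference. This only conveys the idea, not a production caller.
def call_snps(reference, reads, min_coverage=2):
    """reads: list of (start, sequence) with 1-based start positions."""
    pileup = {pos: Counter() for pos in range(1, len(reference) + 1)}
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos <= len(reference):
                pileup[pos][base] += 1
    snps = []
    for pos, counts in pileup.items():
        if sum(counts.values()) < min_coverage:
            continue  # too few overlapping reads to call this position
        consensus, _ = counts.most_common(1)[0]
        if consensus != reference[pos - 1]:
            snps.append((pos, reference[pos - 1], consensus))
    return snps

# Reads resembling Alignment File-2 (start positions are assumed):
reads = [(1, "AGAG"), (1, "AGAGT"), (2, "GAGT"), (4, "GTTCC")]
print(call_snps("AGCGTACC", reads))  # -> [(3, 'C', 'A')]
```
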
Existing Solutions for Implementation
Serial tools
– SAMtools, VCFtools, BEDTools: file merging, sorting, etc.
– VarScan: SNP calling
Parallel implementations
– TurboBLAST: searching local alignments
– SEAL: read mapping and duplicate removal
– Biodoop: statistical analysis
Middleware systems
– Hadoop: not designed for the specific needs of genomic data; limited programmability
– Genome Analysis Toolkit (GATK): designed for genomic data processing; provides special data traversal patterns; limited parallelization for some of its tools

Our Goal
We want to develop a middleware system that
– is specific to parallel genomic data processing,
– allows parallelization of a variety of genomic algorithms,
– works with different popular genomic data formats, and
– allows the use of existing programs.

Challenges
– Load imbalance due to the nature of genomic data: it is not just an array of A, G, C, and T characters; coverage varies widely across positions.
– High overhead of tasks
– I/O contention
(Figure: coverage variance across genomic positions)

Background: PAGE (IPDPS'14)
PAGE: a Map-Reduce-like middleware for easy parallelization of genomic applications
Mappers and reducers are executable programs:
– allows us to exploit existing applications
– no restriction on programming language

Parallel Genomic Applications
RE-PAGE: a Map-Reduce-like middleware for easy parallelization of data-intensive genomic applications (like PAGE)
Main goals (unlike PAGE):
– decrease I/O contention by employing a distributed file system
– balance the workload of data-intensive tasks
– avoid data transfers

Execution Model
(Figure: RE-PAGE execution model diagram)

RE-PAGE
Mappers and reducers are executable programs (see the invocation sketch below):
– allows us to exploit existing applications
– no restriction on programming language
Applicability:
– the algorithm must be safe to parallelize by processing different regions of the genome independently
– examples: SNP calling, statistical tools, and others

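Since the slides specify only that mappers are external executables, here is a hedged sketch of how a middleware might invoke one per chunk. The executable name and CLI flags are illustrative assumptions, not RE-PAGE's actual interface.

```python
import subprocess

# Hedged sketch: run an existing genomic tool as a "mapper" over one
# data chunk/region. Flag names below are assumptions for illustration.
def run_mapper(executable, chunk_path, region, output_path):
    subprocess.run(
        [executable, "--input", chunk_path,
         "--region", region, "--output", output_path],
        check=True)  # raise if the external program fails
    return output_path

# Hypothetical usage:
# run_mapper("./snp_caller", "chunk_chr1_00.bam", "chr1:1-1000000", "part-00.out")
```
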
RE-PAGE Parallelization
PAGE (and likewise RE-PAGE) can parallelize all applications with the following property.
Let M be a map task, and let R, R1, and R2 be regions such that R is the concatenation of R1 and R2. Then

    M(R) = M(R1) ⊕ M(R2),

where ⊕ is the reduction function.

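A small sketch of this decomposability property in code, using a map task of my own choosing (counting aligned bases per region) rather than one from the paper:

```python
# Check M(R) = M(R1) (+) M(R2) for a simple, decomposable map task.
def map_task(reads):
    """M: total number of aligned bases over the region's reads."""
    return sum(len(seq) for _, seq in reads)

def reduce_fn(a, b):
    """The reduction function (+); here, plain addition."""
    return a + b

r1 = [(1, "AGCG"), (2, "GCGG")]   # reads falling in region R1
r2 = [(5, "TACC"), (6, "ACC")]    # reads falling in region R2
r = r1 + r2                        # R: concatenation of R1 and R2

assert map_task(r) == reduce_fn(map_task(r1), map_task(r2))
print("property holds:", map_task(r))  # -> property holds: 15
```
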
Domain-Specific Data Chunks
Heuristic: data from the same genomic location/region is likely related and will most likely be processed together in many types of genomic data analysis.
Therefore, construct data chunks according to genomic region (see the sketch below).

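A minimal sketch of region-based chunking, assuming fixed-width genomic windows; RE-PAGE's actual chunking policy may differ, and the 1 Mb window size is an assumption:

```python
from collections import defaultdict

# Hedged sketch: group aligned records into chunks keyed by genomic
# window, so reads from the same region land in the same chunk.
def chunk_by_region(records, window=1_000_000):
    """records: iterable of (chromosome, position, record)."""
    chunks = defaultdict(list)
    for chrom, pos, rec in records:
        chunks[(chrom, pos // window)].append(rec)
    return chunks

# e.g. chunk_by_region([("chr1", 150, "r1"), ("chr1", 999_999, "r2")])
# puts both records in chunk ("chr1", 0).
```
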
Proposed Replication Method
– Replication is needed to increase data locality, but replicating every chunk to every node is not feasible.
– Depending on the target analysis, some genomic regions can be more important than others.
– General idea: replicate important regions more than others (a sketch follows).

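One plausible reading of importance-weighted replication, sketched below; the scoring function, bounds, and linear mapping are all my assumptions, not the paper's policy:

```python
# Hedged sketch: map an application-supplied importance score in
# [0, 1] to a replica count in [min_rep, max_rep].
def replication_factor(importance, min_rep=1, max_rep=5):
    return min_rep + round(importance * (max_rep - min_rep))

print(replication_factor(0.9))  # "important" region -> 5 replicas
print(replication_factor(0.1))  # less important region -> 1 replica
```
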
Proposed Scheduling Schemes
Problem definition:
– chunks can vary in size and in number of replicas
– tasks are data-intensive: data transfer costs outweigh data processing costs
General approach:
– avoid remote processing
– take advantage of the variety in replication factors and data sizes
– master & worker architecture
We propose three scheduling schemes (see the LCF sketch below):
– Largest Chunk First (LCF)
– Help the Busiest Node (HBN)
– Effective Memory Management (EMM)

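The slides name the schemes without pseudocode, so here is a minimal sketch of one plausible reading of Largest Chunk First: an idle worker takes the largest still-unprocessed chunk among those it holds a local replica of, so large chunks start early and nothing is fetched remotely. The data structures and tie-breaking are assumptions.

```python
import heapq

# Hedged LCF sketch: pick the largest unclaimed local chunk.
def lcf_assign(local_chunks, done):
    """local_chunks: heap of (-size, chunk_id) replicated on this worker."""
    while local_chunks:
        neg_size, chunk_id = heapq.heappop(local_chunks)
        if chunk_id not in done:   # master marks chunks as claimed
            done.add(chunk_id)
            return chunk_id        # process locally; no remote transfer
    return None                    # no local work left; worker idles

# Hypothetical usage: worker holding chunks of sizes 70, 30, 50 MB
heap = [(-70, "c1"), (-30, "c2"), (-50, "c3")]
heapq.heapify(heap)
print(lcf_assign(heap, set()))     # -> "c1" (the 70 MB chunk first)
```
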
Experiments (1)
Setup: 32 nodes (256 cores); average data chunk size: 32 MB; STD of chunk sizes: 24 MB; number of chunks: 2,000; replication factor: 3; processing speed: 1 MB/sec.
For comparison, real genomic data: average chunk size 68 MB; STD of chunk sizes 63 MB.
(Figures: varying STD of data blocks; varying computation speed)

Experiments (2)
Comparison with a centralized approach
Setup: 32 nodes (256 cores); replication factor: 3; application: Coverage Analyzer.
(Figure: comparison results)

Experiments (3): Parallel Scalability
– Coverage Analyzer: 15 SAM files (47 GB); replication factor: 3
– Unified Genotyper: 40 BAM files (51 GB); replication factor: 3 (only RE-PAGE)
(Figures: scalability results, with callouts of 2.2x, 4.2x, 7.1x, and 9.9x speedups)

Summary
RE-PAGE: a middleware for developing parallel, data-intensive genomic applications
– Programming: employs executables of genomic applications; can parallelize a wide range of applications
– Performance: keeps data in a distributed file system; minimizes data transfer; employs an intelligent replication method
RE-PAGE outperforms Hadoop and GATK and shows good parallel scalability.
Observation: prohibiting remote tasks increases performance when chunks vary in size and tasks are data-intensive.

Thank you!