Published by Anastasia Hutchinson. Modified over 9 years ago.
1
PAGE: A Framework for Easy Parallelization of Genomic Applications
Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
IPDPS 2014, Phoenix, Arizona
2
Motivation
– Sequencing costs are decreasing
*Adapted from genome.gov/sequencingcosts
3
Motivation
Big data problem
– The 1000 Genomes Project has already produced 200 TB of data
– Parallel processing is inevitable!
*Adapted from https://www.nlm.nih.gov/about/2015CJ.html
4
Typical Analysis on Genomic Data
Single Nucleotide Polymorphism (SNP) calling
Reference (positions 1-8): AGCGTACC
Alignment File-1
  Read-1  AGCG
  Read-2  GCGG
  Read-3  GCGTA
  Read-4  CGTTCC
Alignment File-2
  Read-1  AGAG
  Read-2  AGAGT
  Read-3  GAGT
  Read-4  GTTCC
A single SNP may cause Mendelian disease!
*Adapted from Wikipedia
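As a rough illustration of SNP calling, the sketch below piles up reads against the reference and reports positions where the most common base differs from the reference. The read start positions are assumptions (the slide text does not preserve them), and the majority-vote rule with a `min_support` threshold is a simplification chosen for illustration, not the method used by VarScan or GATK.

```python
from collections import Counter
from typing import Dict, List, Tuple

REFERENCE = "AGCGTACC"  # reference from the slide, positions 1-8

# (assumed 1-based start position, read bases); the start positions are hypothetical
FILE_1: List[Tuple[int, str]] = [(1, "AGCG"), (2, "GCGG"), (2, "GCGTA"), (3, "CGTTCC")]
FILE_2: List[Tuple[int, str]] = [(1, "AGAG"), (1, "AGAGT"), (2, "GAGT"), (4, "GTTCC")]


def call_snps(reference: str, reads: List[Tuple[int, str]], min_support: int = 2) -> Dict[int, str]:
    """Return {position: base} where the majority base differs from the reference."""
    pileup: Dict[int, Counter] = {}
    for start, bases in reads:
        for offset, base in enumerate(bases):
            pileup.setdefault(start + offset, Counter())[base] += 1

    snps: Dict[int, str] = {}
    for pos, counts in sorted(pileup.items()):
        base, support = counts.most_common(1)[0]
        # Require several agreeing reads so that a single sequencing error is not called
        if base != reference[pos - 1] and support >= min_support:
            snps[pos] = base
    return snps


snps_file_1 = call_snps(REFERENCE, FILE_1)
snps_file_2 = call_snps(REFERENCE, FILE_2)
```

Under these assumed alignments, the isolated mismatches in File-1 are each seen in a single read and are not called, while position 3 in File-2 is supported by three reads.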
5
Outline
– Motivation
– Existing Solutions for Implementation
– Our Work
– Experimental Evaluation
– Conclusion
6
Existing Solutions for Implementation
Serial tools
– SamTools, VCFTools, BedTools: file merging, sorting, etc.
– VarScan: SNP calling
Parallel implementations
– Turboblast: searching local alignments
– SEAL: read mapping and duplicate removal
– Biodoop: statistical analysis
Middleware systems
– Hadoop
    – Not designed for the specific needs of genetic data
    – Limited programmability
– Genome Analysis Tool Kit (GATK)
    – Designed for genetic data processing
    – Provides special data traversal patterns
    – Limited parallelization for some of its tools
7
Outline
– Motivation
– Existing Solutions for Implementation
– Our Work
– Experimental Evaluation
– Conclusion
8
Our Goal
We want to develop a middleware system that
– Is specific to parallel genetic data processing
– Allows parallelization of a variety of algorithms for genetic data
– Works with different popular genetic data formats
– Allows the use of existing programs
9
Challenges
Load imbalance due to the nature of genomic data
– It is not just an array of A, G, C and T characters
– [Figure: coverage variance]
High overhead of tasks
I/O contention
10
Our Work
PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications
Mappers and reducers are executable programs
– Allows us to exploit existing applications
– No restriction on programming language
11
Intra-dependent Processing
Each file is processed independently
– Input: File-1, File-2, ..., File-m
– For each File-i, map tasks process Region-1 ... Region-n and produce intermediate outputs O-i_1 ... O-i_n
– A reduce task combines O-i_1 ... O-i_n into Output-i
12
Inter-dependent Processing
Each map task processes a particular region of ALL files
– Map tasks process Region-1, ..., Region-k, ..., Region-n of the input files and produce O_1, ..., O_k, ..., O_n
– A reduce task combines O_1 ... O_n into a single output
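The difference between the two execution models can be sketched with toy data. This is a minimal illustration, not PAGE's implementation: the function names, the in-memory strings standing in for alignment files, and the G-counting map task are all hypothetical.

```python
from functools import reduce
from typing import Callable, Dict, List, Tuple

Region = Tuple[int, int]  # half-open interval [start, end), 0-based


def count_g(sequences: List[str], region: Region) -> int:
    """Toy map task: count G bases inside the region across the given sequences."""
    start, end = region
    return sum(seq[start:end].count("G") for seq in sequences)


def add(left: int, right: int) -> int:
    """Toy reduction: sum the partial counts."""
    return left + right


def intra_dependent(files: Dict[str, str], regions: List[Region],
                    map_fn: Callable, reduce_fn: Callable) -> Dict[str, int]:
    """One map/reduce pipeline per file; one output per file."""
    return {
        name: reduce(reduce_fn, (map_fn([data], region) for region in regions))
        for name, data in files.items()
    }


def inter_dependent(files: Dict[str, str], regions: List[Region],
                    map_fn: Callable, reduce_fn: Callable) -> int:
    """Each map task sees one region of ALL files; a single output overall."""
    all_data = list(files.values())
    return reduce(reduce_fn, (map_fn(all_data, region) for region in regions))


files = {"file1": "AGCGTACC", "file2": "AGAGTTCC"}
regions = [(0, 4), (4, 8)]

per_file = intra_dependent(files, regions, count_g, add)
combined = inter_dependent(files, regions, count_g, add)
```

In the intra-dependent case the result keeps one entry per input file, while the inter-dependent case yields a single value computed over all files.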
13
What Can PAGE Parallelize?
PAGE can parallelize all applications that have the following property:
– M is the map task
– R, R_1 and R_2 are three regions such that R is the concatenation of R_1 and R_2
– M(R) = M(R_1) ⊕ M(R_2), where ⊕ is the reduction function
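The property M(R) = M(R_1) ⊕ M(R_2) can be checked on a toy map task. In this sketch the map task lists mismatches against a reference and the reduction is list concatenation; both are assumptions picked for illustration, as are the two short sequences.

```python
from typing import List, Tuple

Region = Tuple[int, int]          # half-open interval [start, end), 0-based
Mismatch = Tuple[int, str, str]   # (position, reference base, sample base)


def map_task(reference: str, sample: str, region: Region) -> List[Mismatch]:
    """M: report every position in the region where the sample differs from the reference."""
    start, end = region
    return [
        (pos, reference[pos], sample[pos])
        for pos in range(start, end)
        if reference[pos] != sample[pos]
    ]


def reduction(left: List[Mismatch], right: List[Mismatch]) -> List[Mismatch]:
    """The reduction function: concatenate results of adjacent regions in order."""
    return left + right


reference = "AGCGTACC"
sample = "AGAGTTCC"

whole = map_task(reference, sample, (0, 8))                 # M(R)
split = reduction(map_task(reference, sample, (0, 4)),      # M(R_1)
                  map_task(reference, sample, (4, 8)))      # M(R_2)
```

Here `whole` and `split` are equal, so this map task satisfies the property. A map task whose result at one position depends on bases far outside its region would not decompose this way.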
14
Data Partitioning
Data is NOT packaged into equal-size data blocks as in Hadoop
– Each application has a different way of reading the data
– Equal-size data block packaging ignores nucleotide base location information
The genome is divided into regions and each map task is assigned to a region
– Takes location information into account
– The map task is responsible for accessing its particular region of the input files
    – Region-based access is a common feature of many genomic tools (GATK, SamTools)
15
Genome Partition
PAGE provides two data partitioning methods
– By-locus partitioning: Chromosomes are divided into regions
– By-chromosome partitioning: Chromosomes preserve their unity
[Figure: Chr-1 to Chr-6 under each partitioning method]
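The two partitioning methods can be sketched as functions that turn chromosome lengths into region strings. This is an illustrative sketch, not PAGE's code: the function names, the chromosome lengths, and the fixed `region_size` parameter are hypothetical. The `chrom:start-end` form with 1-based inclusive coordinates follows the region syntax accepted by samtools.

```python
from typing import Dict, List


def by_chromosome(lengths: Dict[str, int]) -> List[str]:
    """By-chromosome partitioning: each chromosome stays whole, one region each."""
    return [f"{chrom}:1-{length}" for chrom, length in lengths.items()]


def by_locus(lengths: Dict[str, int], region_size: int) -> List[str]:
    """By-locus partitioning: split each chromosome into regions of at most region_size bases."""
    regions = []
    for chrom, length in lengths.items():
        for start in range(1, length + 1, region_size):
            end = min(start + region_size - 1, length)
            regions.append(f"{chrom}:{start}-{end}")
    return regions


# Toy chromosome lengths, far smaller than real ones
lengths = {"chr1": 1000, "chr2": 450}

chromosome_regions = by_chromosome(lengths)
locus_regions = by_locus(lengths, region_size=400)
```

By-locus partitioning yields more, smaller map tasks, which gives a scheduler more room to balance load; by-chromosome partitioning suits tools that need a whole chromosome at once.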
16
Task Scheduling
PAGE provides two types of scheduling schemes.
Static
– Each processor is responsible for regions of equal length.
– All map tasks must finish before the reduce tasks are executed.
Dynamic
– Map and reduce tasks are assigned by a master process.
– Reduce tasks can start once enough intermediate results are available.
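Why the choice of scheme matters under load imbalance can be seen in a small simulation. The sketch below is an assumption-laden model, not PAGE's scheduler: static scheduling is modeled as giving each worker an equal-sized contiguous block of regions, dynamic scheduling as a master handing the next region to whichever worker becomes free first, and the per-region costs are made up. Reduce tasks are not modeled.

```python
import heapq
from typing import List


def static_makespan(costs: List[int], workers: int) -> int:
    """Each worker gets a contiguous block with the same number of equal-length regions."""
    per_worker = -(-len(costs) // workers)  # ceiling division
    loads = [0] * workers
    for index, cost in enumerate(costs):
        loads[index // per_worker] += cost
    return max(loads)


def dynamic_makespan(costs: List[int], workers: int) -> int:
    """A master hands the next region to the worker that becomes free earliest."""
    free_at = [0] * workers
    heapq.heapify(free_at)
    for cost in costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)


# Hypothetical processing times: regions have equal length but uneven coverage
region_costs = [9, 8, 1, 1, 1, 1, 1, 2]

static_time = static_makespan(region_costs, workers=2)
dynamic_time = dynamic_makespan(region_costs, workers=2)
```

With these made-up costs, the static assignment puts both expensive regions on the same worker, while the dynamic assignment spreads them out and finishes sooner.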
17
Applications Developed Using PAGE
We parallelized 4 applications
– VarScan: SNP detection
– Realigner Target Creator: Detects insertions/deletions in alignment files
– Indel Realigner: Applies local realignment to improve the quality of alignment files
– Unified Genotyper: SNP detection
18
Sample Application Development with PAGE
Serial execution command of the VarScan software
– samtools mpileup -b file_list -f reference | java -jar VarScan.jar mpileup2snp
To parallelize VarScan with PAGE, the user needs to define:
– Genome Partition: By-Locus
– Scheduling Scheme: Dynamic (or Static)
– Execution Model: Inter-dependent
– Map command: samtools mpileup -b file_list -r regionloc -f reference | java -jar VarScan.jar mpileup2snp >outputloc
– Reduction: the cat shell command
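How the map command template might be expanded per region can be sketched as follows. This is only an illustration of substituting `regionloc` and `outputloc`; it is not how PAGE itself is configured or invoked. The helper names, region strings, and output file names are hypothetical, and the commands are built as strings without being executed.

```python
from typing import List


def map_command(region: str, output_path: str,
                file_list: str = "file_list", reference: str = "reference") -> str:
    """Fill the slide's map command template for one region (regionloc) and output (outputloc)."""
    return (
        f"samtools mpileup -b {file_list} -r {region} -f {reference} "
        f"| java -jar VarScan.jar mpileup2snp > {output_path}"
    )


def reduce_command(partial_outputs: List[str], final_output: str) -> str:
    """Reduction from the slide: concatenate the per-region outputs with cat."""
    return "cat " + " ".join(partial_outputs) + f" > {final_output}"


# Hypothetical regions and output names
regions = ["chr1:1-400", "chr1:401-800"]
outputs = [f"out_{index}.txt" for index in range(len(regions))]

map_commands = [map_command(region, out) for region, out in zip(regions, outputs)]
final_command = reduce_command(outputs, "snps.txt")
```

Each generated map command differs from the serial command only by the added `-r` region argument and the output redirection, which is what lets an unmodified existing tool act as the mapper.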
19
Outline
– Motivation
– Existing Solutions for Implementation
– Our Work
– Experimental Evaluation
– Conclusion
20
Experiments
Experimental Setup
– In our cluster, each node has 12 GB of memory and 8 cores (2.53 GHz)
– We obtained the data from the 1000 Genomes Project
– We evaluated PAGE with 4 applications
– We compared PAGE with Hadoop Streaming and GATK
21
Comparison with GATK
Indel Realigner tool of GATK
– [Figure: Scalability, data size: 11 GB] GATK 3.3x, PAGE 9x
– [Figure: Data Size Impact, # of cores: 128]
22
Comparison with GATK
Unified Genotyper tool of GATK
– [Figure: Scalability, data size: 34 GB] GATK 10.9x, PAGE 12.8x
– [Figure: Data Size Impact, # of cores: 128]
23
Comparison with Hadoop Streaming
VarScan application
– [Figure: Scalability, data size: 52 GB] Hadoop Streaming 6.9x, PAGE 12.7x
– [Figure: Data Size Impact, # of cores: 128]
24
Summary of Experimental Results
Speedup when the computing power is increased 16 times

                    Indel Realigner   Unified Genotyper   VarScan   Realigner Target Creator
PAGE                9x                12.8x               12.7x     14.1x
GATK                3.3x              10.9x               -         -
Hadoop Streaming    -                 -                   6.9x      -
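One way to read these numbers is as parallel efficiency, that is, the reported speedup divided by the 16x increase in computing power. Efficiency is a derived quantity that does not appear on the slides; the sketch below only performs the arithmetic on the reported speedups and assumes 16x would be the ideal.

```python
from typing import Dict

SCALE_FACTOR = 16  # computing power was increased 16 times

# Speedups as reported in the summary table
speedups: Dict[str, Dict[str, float]] = {
    "PAGE": {
        "Indel Realigner": 9.0,
        "Unified Genotyper": 12.8,
        "VarScan": 12.7,
        "Realigner Target Creator": 14.1,
    },
    "GATK": {"Indel Realigner": 3.3, "Unified Genotyper": 10.9},
    "Hadoop Streaming": {"VarScan": 6.9},
}


def parallel_efficiency(speedup: float, scale: int = SCALE_FACTOR) -> float:
    """Fraction of the ideal (linear) speedup that was achieved."""
    return speedup / scale


efficiency = {
    system: {app: round(parallel_efficiency(value), 2) for app, value in apps.items()}
    for system, apps in speedups.items()
}

for system, apps in efficiency.items():
    for app, value in apps.items():
        print(f"{system:17s} {app:25s} {value:.2f}")
```

Read this way, PAGE reaches between roughly 56% and 88% of linear speedup across the four applications.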
25
Conclusion
We developed a middleware that
– Easily parallelizes genomic applications
– Has high applicability
    – No restriction on programming language or data format
    – Allows the use of existing applications
– Lets the user control the parallel execution while hiding the details
    – Alternative scheduling schemes, execution models and data partitioning types
– Scales well
26
Thank you for listening …
Questions