PARALLEL PROCESSING OF LARGE SCALE GENOMIC DATA Candidacy Examination 08/26/2014 Mucahid Kutlu.


1 PARALLEL PROCESSING OF LARGE SCALE GENOMIC DATA Candidacy Examination 08/26/2014 Mucahid Kutlu

2 Motivation Sequencing costs are decreasing rapidly, which turns genomic analysis into a big-data problem. Parallel processing is inevitable! *Adapted from genome.gov/sequencingcosts and https://www.nlm.nih.gov/about/2015CJ.html

3 Typical Analysis on Genomic Data Single Nucleotide Polymorphism (SNP) calling
Reference (positions 1-8): AGCGTACC
Alignment File-1 reads: AGCG, GCGG, GCGTA, CGTTCC
Alignment File-2 reads: AGAG, AGAGT, GAGT, GTTCC
A single SNP may cause a Mendelian disease! *Adapted from Wikipedia
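
The idea on this slide can be sketched as a toy consensus caller (an illustrative sketch only, not VarScan's or GATK's actual algorithm; the types and function names are hypothetical): at each position, count the bases the aligned reads report and flag a SNP wherever the majority base disagrees with the reference.

```c
#include <string.h>

/* Toy SNP caller: reads are (start, sequence) pairs aligned to a reference;
 * a position is a SNP when the most frequent aligned base differs from the
 * reference base at that position. Illustrative sketch only. */
typedef struct { int start; const char *seq; } read_t;

char consensus_base(const read_t *reads, int n_reads, int pos) {
    int counts[4] = {0};                  /* counts for A, C, G, T */
    const char *bases = "ACGT";
    for (int i = 0; i < n_reads; i++) {
        int off = pos - reads[i].start;   /* offset of pos inside this read */
        if (off >= 0 && off < (int)strlen(reads[i].seq)) {
            const char *p = strchr(bases, reads[i].seq[off]);
            if (p) counts[p - bases]++;
        }
    }
    int best = 0;
    for (int b = 1; b < 4; b++)
        if (counts[b] > counts[best]) best = b;
    return bases[best];
}

/* A SNP is called where consensus and reference disagree. */
int is_snp(const read_t *reads, int n_reads, const char *ref, int pos) {
    return consensus_base(reads, n_reads, pos) != ref[pos];
}
```

Real callers weigh base qualities and coverage depth; this sketch only captures the core comparison against the reference.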

4 Existing Solutions for Implementation
Serial tools
– SamTools, VCFTools, BedTools: file merging, sorting, etc.
– VarScan: SNP calling
Parallel implementations
– TurboBLAST: searching local alignments
– SEAL: read mapping and duplicate removal
– Biodoop: statistical analysis
Middleware systems
– Hadoop: not designed for the specific needs of genomic data; limited programmability
– Genome Analysis Toolkit (GATK): designed for genomic data processing and provides special data traversal patterns, but parallelization is limited for some of its tools

5 Main Goal of My Thesis We want to develop middleware systems that
– are specific to parallel genomic data processing,
– allow parallelization of a variety of genomic algorithms,
– work with different popular genomic data formats, and
– ease programming, since most developers are biologists, not computer scientists.

6 Papers During My PhD Study
– Mucahid Kutlu, Gagan Agrawal. "Cluster-based SNP Calling on Large-Scale Genome Sequencing Data," 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2014) (accepted; 19.1% acceptance rate)
– Mucahid Kutlu, Gagan Agrawal. "PAGE: A Framework for Easy PArallelization of GEnomic Applications," 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2014) (accepted; 21.1% acceptance rate)
– Mucahid Kutlu, Gagan Agrawal, Oguz Kurt. "Fault Tolerant Parallel Data-Intensive Algorithms," High Performance Computing (HiPC), 2012 (25.1% acceptance rate)
– Mucahid Kutlu, Gagan Agrawal, Oguz Kurt. "Fault Tolerant Parallel Data-Intensive Algorithms," High Performance and Distributed Computing (HPDC), 2012 (poster paper)
– "RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications" (to be submitted)

7 Outline Motivation & Background Current Work – PAGE: A Framework for Easy PArallelization of GEnomic Applications – RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications Future Work

8 Our Work PAGE: a Map-Reduce-like middleware for easy parallelization of genomic applications. Mappers and reducers are executable programs – this allows us to exploit existing applications – no restriction on the programming language.

9 Intra-dependent Processing Each file is processed independently: for each input file, map tasks cover Region-1 through Region-n, and a reduce step combines their outputs into that file's output. [Figure: per-file map/reduce pipelines for File-1 through File-m, producing Output-1 through Output-m]

10 Inter-dependent Processing Each map task processes a particular region of ALL input files; a single reduce step combines the per-region outputs into one output. [Figure: map tasks over Region-1 through Region-n of all input files feeding one reduce]

11 Data Partitioning Data is NOT packaged into equal-size data blocks as in Hadoop – each application has a different way of reading the data, and equal-size block packaging ignores nucleotide base location information. Instead, the genome structure is divided into regions and each map task is assigned a region – this takes location information into account, and the map task is responsible for accessing the corresponding region of the input files. This is a common feature of many genomic tools (GATK, SamTools).

12 Genome Partition PAGE provides two data partitioning methods – by-locus partitioning: chromosomes are divided into regions – by-chromosome partitioning: chromosomes preserve their unity. [Figure: Chr-1 through Chr-6 split into regions vs. kept whole]
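
By-locus partitioning can be sketched as follows (a minimal sketch; the region size and chromosome-length table are hypothetical inputs, not PAGE's actual parameters): each chromosome is cut into fixed-length regions, and each map task is handed one (chromosome, start, end) triple.

```c
/* By-locus partitioning sketch: split each chromosome into fixed-size
 * regions; each region becomes one map task's assignment.
 * Returns the number of regions written into `out`. */
typedef struct { int chrom; long start; long end; } region_t;

int partition_by_locus(const long *chrom_len, int n_chroms,
                       long region_size, region_t *out) {
    int n = 0;
    for (int c = 0; c < n_chroms; c++) {
        for (long s = 0; s < chrom_len[c]; s += region_size) {
            long e = s + region_size;
            if (e > chrom_len[c]) e = chrom_len[c];  /* last region may be short */
            out[n].chrom = c; out[n].start = s; out[n].end = e;
            n++;
        }
    }
    return n;
}
```

By-chromosome partitioning is the degenerate case where the region size equals each chromosome's length.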

13 Challenges Load imbalance due to the nature of genomic data – a genome is not just an array of A, G, C, and T characters; coverage varies widely across regions. High overhead of tasks. I/O contention.

14 Task Scheduling PAGE provides two scheduling schemes. Static: each processor is responsible for regions of equal total length; all map tasks must finish before the reduce tasks execute. Dynamic: map and reduce tasks are assigned by a master process; reduce tasks can start as soon as enough intermediate results are available.

15 Sample Application Development with PAGE Serial execution command of the VarScan software:
– samtools mpileup –b file_list -f reference | java -jar VarScan.jar mpileup2snp
To parallelize VarScan with PAGE, the user needs to define:
– Genome partition: by-locus
– Scheduling scheme: dynamic (or static)
– Execution model: inter-dependent
– Map command: samtools mpileup –b file_list -r regionloc -f reference | java -jar VarScan.jar mpileup2snp > outputloc
– Reduction: the cat bash shell command

16 Experiments Experimental setup – each node in our cluster has 12 GB memory and 8 cores (2.53 GHz) – we obtained the data from the 1000 Genomes Project – we evaluated PAGE with 4 applications: VarScan (SNP detection), Realigner Target Creator (detects insertions/deletions in alignment files), Indel Realigner (applies local realignment to improve the quality of alignment files), and Unified Genotyper (SNP detection).

17 Comparison with GATK Scalability (data size: 34 GB) and data size impact (# of cores: 128) against the Unified Genotyper tool of GATK; PAGE achieves speedups of up to 10.9x and 12.8x. [Figure: scalability and data size impact charts]

18 Comparison with Hadoop Streaming Scalability (data size: 52 GB) and data size impact (# of cores: 128) for the VarScan application; PAGE achieves speedups of up to 6.9x and 12.7x. [Figure: scalability and data size impact charts]

19 Outline Motivation & Background Current Work – PAGE: A Framework for Easy PArallelization of GEnomic Applications – RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications Future Work

20 RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications In this study, we improve our middleware PAGE in several respects. Main goal: less I/O contention. Main approach: – utilizing distributed disks – an intelligent replication technique – a scheduling scheme that minimizes network traffic.

21 Execution Model [Figure: RE-PAGE execution model]

22 Allowing Remote Processing or Not?
Advantages:
– Better workload balance
Disadvantages:
– As the number of nodes increases, network traffic will increase
– Data transfer has a larger impact as computation becomes more data intensive
– Data transfer can be problematic for large-scale data

23 Proposed Scheduling Schemes General idea: replicate data and prohibit remote processing – replication increases the number of local tasks per node and helps decrease workload imbalance. Data chunks can have varying sizes and varying replication factors. Master & worker approach. We propose 3 scheduling schemes – Factoring – Help the Busiest Node (HBN) – Effective Memory Management (EMM).
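
The factoring scheme, in its textbook form (the exact variant used here may differ), hands out work in rounds of shrinking batches: in each round every worker receives roughly remaining/(2 * workers) units, so early rounds move bulk work cheaply while the small late batches smooth out imbalance.

```c
/* Factoring self-scheduling sketch: compute per-worker batch sizes.
 * In each round, every one of `workers` workers is assigned
 * ceil(remaining / (2 * workers)) tasks, until none remain.
 * Returns the number of rounds; batch[i] holds round i's per-worker size. */
int factoring_batches(long total, int workers, long *batch, int max_rounds) {
    long remaining = total;
    int rounds = 0;
    while (remaining > 0 && rounds < max_rounds) {
        long b = (remaining + 2L * workers - 1) / (2L * workers); /* ceiling */
        batch[rounds++] = b;
        long handed = b * workers;                /* one batch per worker */
        remaining -= (handed < remaining) ? handed : remaining;
    }
    return rounds;
}
```

For 100 tasks on 4 workers this yields per-worker batches of 13, 6, 3, 2, 1 across five rounds; a master process would assign these batches as workers become idle.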

24 Proposed Replication Method Replicating all chunks onto all nodes is not feasible. Depending on the analysis we want to perform, some genomic regions can be more important than others for the target analysis. General idea: replicate important regions more than others.
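
The idea can be illustrated with a simple proportional rule (an illustrative sketch; RE-PAGE's actual replication policy is not specified on this slide): each region's replication factor scales with its importance score, clamped between one copy and the number of nodes.

```c
/* Weighted replication sketch: map an importance score in [0, 1] to a
 * replication factor between 1 and max_copies (e.g. the node count).
 * Important regions get more replicas; every region keeps at least one. */
int replication_factor(double importance, int max_copies) {
    if (importance < 0.0) importance = 0.0;   /* clamp the score */
    if (importance > 1.0) importance = 1.0;
    return 1 + (int)(importance * (max_copies - 1) + 0.5); /* rounded */
}
```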

25 Replication & Distribution [Figure: replication and distribution of data chunks across nodes]

26 Scheduling Scheme Evaluation Experiments on real data – 32 nodes (256 cores), 20 BAM files (21 GB). All 3 scheduling schemes are better than random scheduling; Factoring is the best across all experiments.

27 Work Stealing vs. Our Approach Synthetic application with a fixed data chunk size and varying execution time. The performance comparison shown is Work Stealing / Our Approach: as processing becomes more data intensive, our approach gives better results!

28 Data Size Impact Unified Genotyper, 32 nodes (256 cores). As the data size increases, WS-3 becomes better than WS-1, and RE-PAGE becomes better than WS-3. [Figure: data size impact chart; annotations +3%, +7%, +4%, -1%]

29 Scalability Evaluation Coverage Analyzer and Unified Genotyper applications. [Figure: scalability charts; speedup annotations 4.2x, 7.1x, 2.2x, 9.9x]

30 Outline Motivation & Background Current Work – PAGE: A Framework for Easy PArallelization of GEnomic Applications – RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Applications Future Work

31 Future Work An API to develop parallel genomic applications for memory-constrained architectures. Processing compressed genomic data.

32 API for Memory-Constrained Architectures We have employed CPUs so far, but co-processors can also be useful for genomic applications. The trend in computing technologies is toward more cores with smaller memory per core – e.g., the Intel Many Integrated Core (MIC) architecture.

33 Proposed Work An API that helps users implement parallel genomic applications on memory-constrained architectures. In this work, executables are not used; the developer writes map-reduce functions in the C programming language. The middleware helps the developer in 3 ways – data reading from BAM and FASTA files – memory utilization – parallel execution and task scheduling.

34 Execution Flow Input data is compressed, map tasks process the compressed data into intermediate results, and reduce tasks produce the final result. [Figure: execution flow from input data through compression, map, and reduce]

35 Data Reading The middleware reads the data from files and generates genome matrices, which are compressed inputs to the map tasks. A genome matrix can be of two types – sequence based: each row keeps a read-sequence – location based: keeps the data in mpileup format; each row of the matrix keeps the information for a different location.

36 Genome Matrices Sequence Based vs. Location Based. [Figure: example sequence-based and location-based genome matrices]
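
One way to picture the two layouts (a hypothetical sketch; the field names are illustrative, not the middleware's actual types): a sequence-based row holds one read, while a location-based row holds the mpileup-style pileup of every sample at one genomic location.

```c
/* Hypothetical sketches of the two genome matrix row layouts. */

/* Sequence based: one row per read-sequence. */
typedef struct {
    long start;           /* alignment start position */
    const char *bases;    /* the read's base string   */
} seq_row_t;

/* Location based (mpileup-style): one row per genomic location,
 * holding the bases every sample aligned to that location. */
typedef struct {
    int chromosome;
    long position;
    int num_samples;
    const char **pileup;  /* pileup[s] = bases from sample s here */
} loc_row_t;

/* Accessor in the spirit of the API on the next slide: fetch one
 * sample's pileup string, or NULL if the index is out of range. */
const char *pileup_for_sample(const loc_row_t *row, int s) {
    return (s >= 0 && s < row->num_samples) ? row->pileup[s] : 0;
}
```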

37 Optimization of Memory Utilization To decrease memory usage, we apply two techniques – selective loading – transparent compression.

38 Selective Loading Each read-sequence in SAM/BAM files consists of 11 mandatory fields plus optional fields – sequence ID, location, base sequence, strand, and others. For many applications, we do not need all of them – for counting bases, sequence IDs can be ignored. We load only the parts we need.
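
Selective loading can be sketched with a field bitmask (the flags and loader below are hypothetical, not the middleware's real interface): the application names the fields it needs, and the loader skips the rest.

```c
/* Selective loading sketch: the application requests fields via a
 * bitmask, and the loader materializes only those, skipping the rest. */
enum {
    FIELD_ID     = 1 << 0,
    FIELD_POS    = 1 << 1,
    FIELD_SEQ    = 1 << 2,
    FIELD_STRAND = 1 << 3
};

typedef struct { char *id; long pos; char *seq; char strand; } record_t;

/* Hypothetical loader: copy only the requested parts of a raw record,
 * leaving sentinel values for everything the caller did not ask for. */
void load_selected(record_t *dst, const record_t *raw, unsigned wanted) {
    dst->id     = (wanted & FIELD_ID)     ? raw->id     : 0;
    dst->pos    = (wanted & FIELD_POS)    ? raw->pos    : -1;
    dst->seq    = (wanted & FIELD_SEQ)    ? raw->seq    : 0;
    dst->strand = (wanted & FIELD_STRAND) ? raw->strand : '?';
}
```

A base-counting application, for instance, would pass only FIELD_POS | FIELD_SEQ and never pay the memory cost of the sequence IDs.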

39 Transparent Compression Main idea: the genome matrices keep the data in compressed format, but the developer can access the data through our API as if it were uncompressed. Compression technique: to be investigated.
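
Transparent access can be illustrated with run-length encoding (just one candidate scheme; as the slide says, the actual compression technique is still to be investigated): the matrix stores (base, run length) pairs, and an accessor returns the base at any logical index as if the data were a plain string.

```c
/* Transparent-compression sketch using run-length encoding: data is
 * stored as (base, run length) pairs, but base_at() exposes it as if
 * it were an uncompressed base string. */
typedef struct { char base; int len; } run_t;

char base_at(const run_t *runs, int n_runs, long index) {
    for (int i = 0; i < n_runs; i++) {
        if (index < runs[i].len)
            return runs[i].base;   /* index falls inside this run */
        index -= runs[i].len;      /* skip past this run */
    }
    return 'N';                    /* out of range */
}
```

The developer indexes the matrix normally; the decompression cost is hidden inside the accessor.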

40 Sample Map Task
void *map_coveragedepth(location_based_genome_matrix gm)   /* input: genome matrix */
{
    int i, j, position, chromosome;
    char *sequence;
    reduce_object *total;                                  /* reduce object */
    for (i = 0; i < gm.number_of_results; i++) {
        /* methods we provide */
        position   = getPosition_from_lbgm(gm.code[i], selected_parts);
        chromosome = get_chromosome_from_lbgm(gm.code[i], selected_parts);
        for (j = 0; j < gm.num_samples; j++) {
            sequence = get_base_sequence_for_sample_n(gm.code[i], selected_parts,
                                                      gm.num_samples, j);
            count_num_bases(sequence);
            add_results_to_reduce_object(total, position, chromosome, sequence);
        }
    }
    return (void *)total;
}

41 Open Questions How should map and reduce tasks be scheduled? How should the intermediate results be kept in memory? – The location-based genome matrix structure is useful for decreasing the size of the intermediate results. – Many applications (e.g., SNP calling) need no iterative computation: reduction is just concatenation of the intermediate results, so they can be written to disk as they are produced.

42 A Middleware for Processing Compressed Genomic Data Compression is useful for archiving; however, it decreases processing performance. There is an enormous number of compression methods for genomic data – there is no need for another one. Our goal: a middleware that helps users process compressed data without fully decompressing it.
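
One common way to avoid full decompression (a general sketch of the idea, not this middleware's design, which is future work) is block-level random access, as in BAM's BGZF container: the file is compressed in independent blocks, and an index maps a genomic region to the few blocks that must actually be inflated.

```c
/* Block-index sketch: given compressed blocks covering [start, end)
 * genomic ranges, find how many blocks (and which first one) must be
 * decompressed to serve a query; the rest stay compressed on disk. */
typedef struct { long start, end; } block_t;

int blocks_needed(const block_t *blk, int n, long qs, long qe, int *first) {
    int count = 0;
    *first = -1;
    for (int i = 0; i < n; i++) {
        if (blk[i].end > qs && blk[i].start < qe) {  /* overlaps the query */
            if (*first < 0) *first = i;
            count++;
        }
    }
    return count;
}
```

With such an index, a map task assigned one region touches only its own blocks, which also keeps the per-task I/O proportional to the region size.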

43 Execution Model [Figure: execution model of the compressed-data middleware]

44 THANKS!

