Published by Francine Stanley. Modified over 8 years ago.
1
Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data
Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang
The Ohio State University
HiCOMB 2014, May 19th, Phoenix, Arizona
2
Outline
Introduction
Sequence Data Format Converter Design
Experimental Results
Conclusion
3
Explosion of Next-Generation Sequencing Data
NGS Advantages
–Faster and cheaper, e.g., over one billion short reads per instrument run
–More accurate: higher resolution and deeper coverage
Challenges
–Urgent need to turn raw data into knowledge
–Parallelism is the key
4
Historical Trends in Storage Prices vs. DNA Sequencing Costs
Reported by Lincoln Stein
5
Varieties of NGS Data Formats
Different Formats
–SAM (Sequence Alignment/Map): the de facto text format for storing large nucleotide sequence alignments
–BAM (Binary Alignment/Map): the compressed, indexable, binary form of the SAM format; indexing is supported by a BAI (BAM Index) file
–Other formats: BED (Browser Extensible Data), FASTA, FASTQ, WIG (wiggle), GFF (General Feature Format), etc.
6
Analysis Pipeline
Current Pipeline
–Parallelism mainly focuses on the analysis steps, e.g., SNP discovery and BLAST
Reality
–Cross-utilization problem: the sequencing data is often not in the input format a tool expects
–Some other analysis steps stay sequential
–Need to remove these other sequential bottlenecks
7
Motivation: Removing Other Sequential Bottlenecks
Parallel Format Conversion
–Current format conversion commonly uses a single core
–Downstream tools often cannot be reused across different aligners
–Not hard to implement, but important to scale out
Parallelizing Certain Statistical Analysis Steps
–E.g., parallel analysis on the histogram data
8
Framework
Sequence Data Format Converter
–Input: SAM/BAM
–Output: BAM/SAM, FASTA, FASTQ, BED, BEDGRAPH, JSON and YAML
Statistical Analysis Module
–Parallelize other statistical analysis steps
–E.g., non-local means (NL-Means) and false discovery rate (FDR) computation
Only the first component is discussed today
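As a rough illustration of what a single conversion involves, the sketch below turns one aligned SAM record into a six-column BED line. It is not the converter described in these slides: the function name `sam_to_bed` and the sample read are made up for the example. It relies only on the SAM conventions that POS is 1-based and that bit 0x10 of FLAG marks a reverse-strand read, and on BED intervals being 0-based and half-open.

```python
import re

# CIGAR operations; only M, D, N, = and X advance along the reference.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")

def sam_to_bed(line):
    """Convert one SAM alignment line to a BED line, or None if unmapped."""
    fields = line.rstrip("\n").split("\t")
    qname, flag, rname = fields[0], int(fields[1]), fields[2]
    pos, mapq, cigar = int(fields[3]), fields[4], fields[5]
    if flag & 0x4 or rname == "*" or cigar == "*":
        return None  # unmapped read: no genomic interval to report
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar)
                  if op in REF_CONSUMING)
    start = pos - 1  # SAM POS is 1-based, BED start is 0-based
    strand = "-" if flag & 0x10 else "+"
    return "\t".join([rname, str(start), str(start + ref_len),
                      qname, mapq, strand])

# Hypothetical sample record (11 mandatory SAM fields).
sample = "read1\t16\tchr1\t101\t60\t5M2I3M1D4M\t*\t0\t0\tACGTACGTACGTAC\t*"
print(sam_to_bed(sample))
```

Because each record is converted without looking at any other record, this kind of step can run independently on every partition of the input.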
9
Outline
Introduction
Sequence Data Format Converter Design
Experimental Results
Conclusion
10
Sequence Data Format Converter
3 Converter Instances
–SAM Format Converter
–BAM Format Converter
–Preprocessing-Optimized SAM Format Converter
Supports partial format conversion on a specific chromosome region
11
SAM Format Converter
No communication among processes after partitioning
Partitioning is the key step for parallelization
Extensibility and programmability
12
Partitioning Algorithm
Key: each SAM record is delimited by a line break
1. Initial even partitioning
2. Adjust partition boundaries by detecting line breaks
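The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `partition_records` is hypothetical, the input is any seekable binary stream, and SAM header lines are ignored. Each interior boundary starts at an even byte offset and is then moved forward to the first position just past a line break, so no record is split between two partitions.

```python
import io

def partition_records(stream, size, num_parts):
    """Return (start, end) byte ranges whose boundaries fall on line breaks."""
    # Step 1: initial even partitioning by byte offset.
    cuts = [size * i // num_parts for i in range(num_parts + 1)]
    # Step 2: push each interior boundary forward to the start of a line.
    for i in range(1, num_parts):
        if cuts[i] > 0:
            # Start one byte early: if that byte is already "\n",
            # readline() consumes only it and the boundary stays put.
            stream.seek(cuts[i] - 1)
            stream.readline()
            cuts[i] = stream.tell()
    return list(zip(cuts[:-1], cuts[1:]))

# Toy stand-in for a SAM body: four newline-terminated records.
data = b"r1\tchr1\t100\nr2\tchr1\t250\nr3\tchr2\t40\nr4\tchr2\t900\n"
parts = partition_records(io.BytesIO(data), len(data), 3)
chunks = [data[start:end] for start, end in parts]
for (start, end), chunk in zip(parts, chunks):
    print(start, end, chunk.count(b"\n"), "records")
```

Once the ranges are fixed, each process can read and convert its own range with no further coordination. A range may be empty when one record is longer than an even share of the file.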
13
BAM Format Converter
Challenge
–No explicit delimiter
–Even partitioning -> unparsable records
Solution: add a preprocessing phase
–Partition data by supporting random access
–The preprocessing phase cannot be parallelized because of the third-party API
14
BAMX and BAIX
BAMX (BAM eXtended) File
–Transform each varying-length BAM record into a regular-layout BAMX record
–Align varying-length BAM fields by padding
BAIX (BAI eXtended) File
–Index file of the BAMX file
–Store the alignment starting positions in BAM (logically) and in BAMX (physically)
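The idea of a regular layout can be illustrated with a small sketch. The actual BAMX field layout is not given in these slides, so everything here is an assumption made for illustration: whole records (rather than individual fields) are padded, each slot carries a 4-byte length prefix, and the names `to_fixed_layout` and `read_record` are hypothetical. The point is only that once every record occupies the same number of bytes, record k starts at byte k × width and can be read without scanning.

```python
import struct

def to_fixed_layout(records):
    """Pack non-empty list of byte records into equal-width, zero-padded slots."""
    width = 4 + max(len(r) for r in records)  # 4-byte length + longest payload
    out = bytearray()
    for r in records:
        slot = struct.pack("<I", len(r)) + r
        out += slot.ljust(width, b"\x00")  # pad to the common width
    return width, bytes(out)

def read_record(blob, width, k):
    """Random access: slot k starts at byte offset k * width."""
    slot = blob[k * width:(k + 1) * width]
    (length,) = struct.unpack_from("<I", slot)
    return slot[4:4 + length]

# Toy variable-length records standing in for BAM records.
records = [b"ACGT", b"AC", b"ACGTACGTAC"]
width, blob = to_fixed_layout(records)
print(width, len(blob), read_record(blob, width, 1))
```

The trade-off is extra space for padding in exchange for boundaries that any process can compute on its own.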
15
Partial Conversion
If only a subset is of interest, there is no need for full conversion
Based on the BAIX file
–Given logical alignment starting and ending positions, locate the physical starting and ending positions in the BAMX file (by binary search)
–Evenly partition the subset and proceed in parallel
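A minimal sketch of this lookup, assuming the index can be viewed as a sorted list of alignment starting positions with one entry per record (the real BAIX structure is not described in these slides). The helper names `locate_subset` and `split_evenly` are hypothetical. Binary search finds the range of record numbers inside the requested region, and that range is then divided evenly among the processes.

```python
from bisect import bisect_left, bisect_right

def locate_subset(starts, lo, hi):
    """Half-open range of record numbers whose start lies in [lo, hi]."""
    return bisect_left(starts, lo), bisect_right(starts, hi)

def split_evenly(first, last, num_procs):
    """Divide record numbers [first, last) into num_procs contiguous ranges."""
    n = last - first
    return [(first + n * i // num_procs, first + n * (i + 1) // num_procs)
            for i in range(num_procs)]

# Toy index: sorted alignment starting positions on one chromosome.
starts = [100, 250, 250, 400, 900, 1500]
first, last = locate_subset(starts, 250, 900)
ranges = split_evenly(first, last, 2)
print(first, last, ranges)
```

With fixed-width records, a record number converts to a physical byte offset by multiplying by the record width. The sketch selects records by starting position only, so a read that begins before the region but overlaps it is not included.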
16
Preprocessing-Optimized SAM Format Converter
Main Ideas
–Preprocessing can also optimize the SAM format conversion
–Such preprocessing can be parallelized because of the easy partitioning on the SAM format
(Figure labels: M procs, N procs, M × N target files)
17
Outline
Introduction
Sequence Data Format Converter Design
Parallelization of Statistical Analysis Steps
Experimental Results
Conclusion
18
Experimental Setup
Dataset
–Whole genome DNA-sequencing of three mouse samples
–Approximately 125 million sequences providing about 40-fold coverage of the genome
–In the SAM/BAM format
Cluster
–8 GB memory
–Up to 32 8-core machines (256 cores in total)
19
Performance of SAM Format Converter
Input: 100 GB SAM data
Output: BED, BEDGRAPH and FASTA
20
Performance of BAM Format Converter
Input: 117 GB BAM data
Output: BED, BEDGRAPH and FASTA
21
SAM Format Converter Comparison: Preprocessing-Optimized vs. Original
Input: 15.7 GB BAM data
Output: BED, BEDGRAPH and FASTA
22
Outline
Introduction
Sequence Data Format Converter Design
Parallelization of Statistical Analysis Steps
Experimental Results
Conclusion
23
Conclusion
In the NGS analysis pipeline, the overall latency cannot be reduced unless all sequential bottlenecks are removed
The first framework that can easily support parallel sequence format conversion in a distributed environment
–SAM format converter
–BAM format converter
–Preprocessing-optimized SAM format converter