Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data
Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang
The Ohio State University
HiCOMB 2014, May 19th, Phoenix, Arizona
Outline
– Introduction
– Sequence Data Format Converter Design
– Experimental Results
– Conclusion
Explosion of Next-Generation Sequencing Data
NGS Advantages
– Faster and cheaper
  E.g., over one billion short reads per instrument run
– More accurate: higher resolution and deeper coverage
Challenges
– Urgent need for turning raw data into knowledge
– Parallelism is the key
Historical Trends in Storage Prices vs. DNA Sequencing Costs
[Figure: reported by Lincoln Stein]
Varieties of NGS Data Formats
Different Formats
– SAM (Sequence Alignment/Map)
  The de facto text format for storing large nucleotide sequence alignments
– BAM (Binary Alignment/Map)
  The compressed, indexable, binary form of the SAM format
  Indexing is supported by the BAI (BAM Index) file
– Other formats
  BED (Browser Extensible Data), FASTA, FASTQ, WIG (wiggle), GFF (General Feature Format), etc.
Analysis Pipeline
Current Pipeline
– Parallelism mainly focuses on the analysis steps, e.g., SNP discovery and BLAST
Reality
– Cross-utilization problem: the sequencing data is often not in the input format a tool expects
– Some other analysis steps stay sequential
– Need to remove the other sequential bottlenecks
Motivation: Removing Other Sequential Bottlenecks
Parallel Format Conversion
– Current format conversion commonly uses a single core
– Current downstream tools often cannot be interchanged across different aligners
– Not hard to implement, but important to scale out
Parallelizing Certain Statistical Analysis Steps
– E.g., parallel analysis on the histogram data
Framework
Sequence Data Format Converter
– Input: SAM/BAM
– Output: BAM/SAM, FASTA, FASTQ, BED, BEDGRAPH, JSON and YAML
Statistical Analysis Module
– Parallelizes other statistical analysis steps
– E.g., non-local means (NL-Means) and false discovery rate (FDR) computation
Note: only the first component is discussed today
Sequence Data Format Converter
Three Converter Instances
– SAM Format Converter
– BAM Format Converter
– Preprocessing-Optimized SAM Format Converter
Supports partial format conversion for a specific chromosome region
SAM Format Converter
– No communication among processes after partitioning
– Partitioning is the key step for parallelization
– Extensibility and programmability
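Because no communication is needed after partitioning, each process can apply a per-record conversion function to its own partition. As an illustration only, here is a minimal sketch of one such function, SAM to BED. The name `sam_record_to_bed` is hypothetical and is not the framework's API. It assumes the standard SAM column order, and using MAPQ as the BED score is a choice made for this sketch.

```python
import re

# CIGAR operations as defined in the SAM specification.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that consume bases on the reference sequence.
_REF_CONSUMING = set("MDN=X")


def sam_record_to_bed(line):
    """Convert one SAM alignment line to a BED6 line.

    Returns None for header lines and unmapped reads.
    Hypothetical helper for illustration, not the framework's API.
    """
    if line.startswith("@"):
        return None  # SAM header line
    fields = line.rstrip("\n").split("\t")
    qname, flag, rname, pos, mapq, cigar = fields[:6]
    flag = int(flag)
    if flag & 0x4 or rname == "*" or cigar == "*":
        return None  # unmapped read or no alignment information
    start = int(pos) - 1  # SAM POS is 1-based, BED start is 0-based
    ref_len = sum(int(n) for n, op in _CIGAR_OP.findall(cigar)
                  if op in _REF_CONSUMING)
    end = start + ref_len  # BED end is exclusive
    strand = "-" if flag & 0x10 else "+"  # 0x10 marks a reverse-strand read
    # Assumption of this sketch: MAPQ is reused as the BED score column.
    return "\t".join([rname, str(start), str(end), qname, mapq, strand])
```

Since the function only looks at a single record, it can run unchanged on every partition, which is what makes the converter easy to extend to other output formats.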
Partitioning Algorithm
Key: each SAM record is delimited by a line break
1. Initial even partitioning
2. Adjust partition boundaries by detecting line breaks
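The two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function name `partition_by_lines` is hypothetical, and the sketch assumes a local file with `\n` line endings.

```python
import os


def partition_by_lines(path, num_parts):
    """Split a line-delimited file into byte ranges aligned to record starts.

    Returns a list of (start, end) offsets. Every non-empty range begins
    at the start of a line and ends just after a newline (or at EOF).
    """
    size = os.path.getsize(path)
    # Step 1: initial even partitioning by byte offset.
    cuts = [size * i // num_parts for i in range(num_parts + 1)]
    # Step 2: move each interior cut forward to the next line start.
    with open(path, "rb") as f:
        for i in range(1, num_parts):
            if cuts[i] == 0:
                continue  # offset 0 is already a line start
            # Reading from the byte before the cut leaves the cut unchanged
            # when it already follows a newline; otherwise it advances to
            # the start of the next line.
            f.seek(cuts[i] - 1)
            f.readline()
            cuts[i] = f.tell()
    return list(zip(cuts[:-1], cuts[1:]))
```

Each process can then seek to its own `start` and read up to `end` without coordinating with the others. A range may be empty when a single line spans more than one initial partition.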
BAM Format Converter
Challenge
– No explicit record delimiter
– Even partitioning -> unparsable records
Solution: add a preprocessing phase
– Partition data by supporting random access
– The preprocessing phase cannot be parallelized because of the third-party API
BAMX and BAIX
BAMX (BAM eXtended) File
– Transform each varying-length BAM record into a regular-layout BAMX record
– Align varying-length BAM fields by padding
BAIX (BAI eXtended) File
– Index file of the BAMX file
– Store the alignment starting positions in BAM (logically) and in BAMX (physically)
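To illustrate why padding enables random access, here is a minimal sketch of a regular record layout. It is not the actual BAMX on-disk format, which the slides do not specify. The helper names, the NUL padding byte, and the three example fields are all assumptions made for this sketch.

```python
def to_fixed_layout(records):
    """Pad every field to its column-wide maximum width.

    `records` is a list of tuples of bytes fields. Returns (widths, blob),
    where every record occupies exactly sum(widths) bytes in `blob`.
    """
    num_fields = len(records[0])
    widths = [max(len(rec[i]) for rec in records) for i in range(num_fields)]
    rows = [
        b"".join(field.ljust(w, b"\x00") for field, w in zip(rec, widths))
        for rec in records
    ]
    return widths, b"".join(rows)


def read_record(blob, widths, index):
    """Fetch record `index` directly by offset, without scanning earlier ones."""
    size = sum(widths)
    row = blob[index * size:(index + 1) * size]
    fields, offset = [], 0
    for w in widths:
        # Assumption of this sketch: fields never end in a NUL byte,
        # so stripping the padding recovers the original value.
        fields.append(row[offset:offset + w].rstrip(b"\x00"))
        offset += w
    return fields
```

With a fixed record size, record `i` starts at byte `i * sum(widths)`. Any even split of record indices therefore falls on record boundaries, which is what the delimiter-free BAM layout cannot offer directly.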
Partial Conversion
If only a subset is of interest, there is no need for full conversion
Based on the BAIX file
– Given logical alignment starting and ending positions, locate the physical starting and ending positions in the BAMX file (by binary search)
– Evenly partition the subset and proceed in parallel
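The lookup and the even split can be sketched as follows. This is a minimal illustration that assumes the index for one chromosome is a sorted list of alignment start positions, and that the position of a record in that list is its record number in the fixed-layout file. Both function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def locate_subset(logical_starts, region_start, region_end):
    """Binary-search a sorted list of alignment start positions.

    Returns the half-open record range [lo, hi) whose start positions
    fall within [region_start, region_end].
    """
    lo = bisect_left(logical_starts, region_start)
    hi = bisect_right(logical_starts, region_end)
    return lo, hi


def even_partitions(lo, hi, num_parts):
    """Split the record range [lo, hi) into num_parts near-equal sub-ranges."""
    n = hi - lo
    return [(lo + n * i // num_parts, lo + n * (i + 1) // num_parts)
            for i in range(num_parts)]
```

For example, with start positions `[100, 150, 150, 300, 420, 900]` and the region 150 to 420, `locate_subset` returns `(1, 5)`, and `even_partitions(1, 5, 2)` returns `[(1, 3), (3, 5)]`. Multiplying a record number by the fixed record size gives the byte offset each process would seek to.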
Preprocessing-Optimized SAM Format Converter
Main Ideas
– Preprocessing can also optimize the SAM format conversion
– Such preprocessing can be parallelized because the SAM format is easy to partition
[Figure labels: M procs, N procs, M × N target files]
Outline
– Introduction
– Sequence Data Format Converter Design
– Parallelization of Statistical Analysis Steps
– Experimental Results
– Conclusion
Experimental Setup
Dataset
– Whole-genome DNA sequencing of three mouse samples
– Approximately 125 million sequences providing about 40-fold coverage of the genome
– In the SAM/BAM format
Cluster
– 8 GB memory
– Up to 32 8-core machines (256 cores in total)
Performance of SAM Format Converter
Input: 100 GB SAM data
Output: BED, BEDGRAPH and FASTA
Performance of BAM Format Converter
Input: 117 GB BAM data
Output: BED, BEDGRAPH and FASTA
SAM Format Converter Comparison: Preprocessing-Optimized vs. Original
Input: 15.7 GB BAM data
Output: BED, BEDGRAPH and FASTA
Conclusion
In the NGS analysis pipeline, the overall latency cannot be reduced unless all sequential bottlenecks are removed
The first framework that can easily support parallel sequence format conversion in a distributed environment
– SAM format converter
– BAM format converter
– Preprocessing-optimized SAM format converter