SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data - Aditi Thuse.

MapReduce Model
Two phases: Map and Reduce
Map: takes input as key-value pairs and generates intermediate output; the output is stored in intermediate storage
Reduce: produces the final set of output
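
As a rough illustration of the two phases, the sketch below counts bases across a few reads in plain Python. It does not use Hadoop or Spark; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample reads are made up for the example.

```python
from collections import defaultdict


def map_phase(records):
    """Map: turn each (key, value) input pair into intermediate (key, value) pairs."""
    for _read_id, sequence in records:
        for base in sequence:
            yield base, 1


def shuffle(pairs):
    """Intermediate storage: group the mapper output by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into the final output."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input: (read_id, sequence) pairs.
reads = [("read1", "ACGT"), ("read2", "AACG")]
base_counts = reduce_phase(shuffle(map_phase(reads)))
print(base_counts)
```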

Apache Spark
Cluster computing framework
Master/slave architecture: a central coordinator and workers
Supports in-memory and on-disk computation
Resilient distributed datasets (RDDs): transformations and actions
Programmers can perform iterative operations on their data without writing intermediate results to disk.
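
The difference between transformations and actions can be sketched with a toy class. `ToyRDD` is a hypothetical stand-in written for this example and is not Spark's API; it runs on one machine and only mimics the idea that transformations are recorded lazily and an action triggers the computation in memory.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new ToyRDD, nothing is computed yet.
        return ToyRDD(self._data, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also lazy.
        return ToyRDD(self._data, self._steps + (("filter", predicate),))

    def _iterate(self):
        # Chain the recorded steps as iterators; no intermediate result is written out.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        # Action: runs the whole chain and returns the results as a list.
        return list(self._iterate())

    def count(self):
        # Action: runs the chain and counts the results.
        return sum(1 for _ in self._iterate())


reads = ToyRDD(["ACGT", "AC", "GGGTTT"])
lengths = reads.map(len)                          # nothing runs here
long_lengths = lengths.filter(lambda n: n >= 4)   # still nothing runs
print(long_lengths.collect())
print(lengths.count())
```

Because each transformation returns a new object, `lengths` can be reused for a second action without recomputing or storing anything on disk in between.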

BWA - Burrows-Wheeler Aligner
Open source
Maps sequence reads to a genome
Widely used alignment tool
Algorithms:
  BWA-Backtrack: reads < 100 bp
  BWA-SW: 70 bp to 1 Mbp
  BWA-MEM: 70 bp to 1 Mbp
Parallel implementation: supports shared-memory machines
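
BWA takes its name from the Burrows-Wheeler transform. The following is a naive sketch of that transform and its inverse on a short string, using sorted rotations. It is meant only to show the idea, is not how BWA builds its index, and would be far too slow for a genome; the `$` sentinel is a common convention assumed here.

```python
def burrows_wheeler_transform(text, sentinel="$"):
    """Return the BWT of text: the last column of its sorted rotations."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    text += sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(bwt, sentinel="$"):
    """Rebuild the original text by repeatedly prepending the BWT column and sorting."""
    table = [""] * len(bwt)
    for _ in range(len(bwt)):
        table = sorted(bwt[i] + table[i] for i in range(len(bwt)))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


transformed = burrows_wheeler_transform("GATTACA")
print(transformed)
print(inverse_bwt(transformed))
```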

Input and Output
Input: accepts the FASTQ format
Output: SAM file (Sequence Alignment/Map)
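
For reference, a FASTQ record spans four lines: an `@` header with the read identifier, the sequence, a `+` separator, and a quality string of the same length as the sequence. The sketch below parses such records in plain Python; `FastqRecord`, `parse_fastq` and the sample reads are names and data invented for the example, and the parser assumes sequences are not wrapped over several lines.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines):
    """Yield one FastqRecord per four-line FASTQ entry (blank lines are ignored)."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per read")
    for i in range(0, len(rows), 4):
        header, sequence, separator, quality = rows[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # The identifier is the first word after '@'.
        yield FastqRecord(header[1:].split()[0], sequence, quality)


sample = """@read1/1
ACGT
+
IIII
@read2/1
GGCA
+
FFFF
"""
records = list(parse_fastq(sample.splitlines()))
print(records)
```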

SparkBWA
Integration of BWA into the Spark framework
Objectives:
  Increase performance and scalability.
  Keep SparkBWA compatible with different versions of BWA.
  Perform sequence alignments efficiently while keeping the implementation details completely hidden from researchers; an API is provided.

System Design - RDD creation
Input data are prepared for the Map phase
An RDD is created from the FASTQ input files
Data are distributed across the computing nodes
For paired-end reads, two RDDs are created
Issue: the two reads of a pair share the same identifier
A transformation is applied to the data:
  Join and sortByKey: <read_id, Tuple<read_content1, read_content2>>
  SortHDFS: <read_id, merged_content>
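
The Join and sortByKey approach for paired-end reads can be sketched without a cluster. In the code below, plain Python lists of (read_id, read_content) pairs stand in for the two RDDs, and `join_by_key` and `sort_by_key` are hypothetical helpers that imitate a key-value join followed by a sort on the key. The read identifiers and sequences are invented, and the SortHDFS variant is not shown.

```python
def join_by_key(left, right):
    """Inner join of two lists of (key, value) pairs on the key."""
    right_index = {}
    for key, value in right:
        right_index.setdefault(key, []).append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_index.get(key, [])
    ]


def sort_by_key(pairs):
    """Sort (key, value) pairs by key so both mates keep the input order."""
    return sorted(pairs, key=lambda pair: pair[0])


# One list per FASTQ file of the pair; the order may differ after distribution.
reads_1 = [("read2", "GGCA"), ("read1", "ACGT")]
reads_2 = [("read1", "TTGA"), ("read2", "CCAT")]

# Result shape: <read_id, Tuple<read_content1, read_content2>>
paired = sort_by_key(join_by_key(reads_1, reads_2))
print(paired)
```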

System Design - MAP
Mappers apply the sequence alignment algorithm from BWA to the RDDs.
The BWA source code is written in C
Spark supports Scala, Java and Python
The Java Native Interface (JNI) is used
This avoids any modification of the original BWA source code
The reference genome is shared among all computing nodes

System Design - MAP
Two software layers:
  The BWA layer
  A second layer that processes the RDDs, passes input to the BWA layer, and collects partial results
Two levels of parallelism:
  Regular mode: Map processes are distributed across the cluster
  Hybrid mode: each Map process is parallelized using threads

System Design - Reduce
Creates the output as a SAM file
Merges all outputs into one file
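
One possible way to merge partial outputs into a single SAM file is sketched below. It relies on the fact that SAM header lines start with `@`, and it assumes, for the example only, that every mapper repeats the same header in its partial output. `merge_sam_parts` and the sample lines are hypothetical; this is not SparkBWA's actual Reduce code.

```python
def merge_sam_parts(parts):
    """Concatenate partial SAM outputs, keeping each header line ('@...') only once."""
    merged_header = []
    merged_alignments = []
    for part in parts:
        for line in part:
            if line.startswith("@"):
                if line not in merged_header:
                    merged_header.append(line)
            else:
                merged_alignments.append(line)
    return merged_header + merged_alignments


# Hypothetical partial results from two mappers, one list of lines each.
header = "@SQ\tSN:chr1\tLN:1000"
part_0 = [header, "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"]
part_1 = [header, "read2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tGGCA\tFFFF"]

merged = merge_sam_parts([part_0, part_1])
print("\n".join(merged))
```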

Evaluation
Algorithm       Tool       Parallelization technology
BWA-backtrack   pBWA       MPI
BWA-backtrack   SEAL       Hadoop
BWA-backtrack   SparkBWA   Spark
BWA-MEM         BWA        pthreads
BWA-MEM         BigBWA
BWA-MEM         Halvade
BWA: shared-memory threaded version
Message Passing Interface (MPI) is a standardized and portable message-passing system designed by a group of researchers from academia and industry to function on a wide variety of parallel computing architectures.

Dataset
Tag   Name                Number of reads   Read length (bp)   Size (GiB)
D1    NA12750/ERR000589   12×10^6           51                 3.4
D2    HG00096/SRR062634   24.1×10^6         100                11.8
D3    150140/SRR642648    98.8×10^6                            48.3

RDD creation

Execution time - BWA-MEM algorithm
Regular mode: each mapper runs sequentially
Hybrid mode: more threads per mapper
SparkBWA hybrid mode should be the preferred option only in those cases where memory limitations do not allow all the cores in each node to be used.
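
The rule of thumb above can be written as a small calculation. The sketch assumes that each mapper needs a fixed amount of memory and that a node is best used when all of its cores are busy; `plan_node` and all the numbers are hypothetical and are not taken from the evaluation.

```python
def plan_node(cores, node_memory_gib, mapper_memory_gib):
    """Pick regular or hybrid mode for one node from its cores and memory."""
    max_mappers_by_memory = int(node_memory_gib // mapper_memory_gib)
    if max_mappers_by_memory < 1:
        raise ValueError("not enough memory for a single mapper")
    if max_mappers_by_memory >= cores:
        # Memory is not the limit: one sequential mapper per core.
        return {"mode": "regular", "mappers": cores, "threads_per_mapper": 1}
    # Memory limits the number of mappers: give each one several threads.
    mappers = max_mappers_by_memory
    return {"mode": "hybrid", "mappers": mappers, "threads_per_mapper": cores // mappers}


print(plan_node(cores=16, node_memory_gib=64, mapper_memory_gib=4))
print(plan_node(cores=16, node_memory_gib=64, mapper_memory_gib=10))
```

With 4 GiB per mapper, sixteen mappers fit in memory and regular mode uses every core; with 10 GiB per mapper only six fit, so hybrid mode runs six mappers with two threads each.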

Execution time - BWA-backtrack algorithm

Execution time - BWA-MEM algorithm

Thank You!
