SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data - Aditi Thuse.


MapReduce Model
- Two phases: Map and Reduce
- Map: takes input as key-value pairs and generates intermediate output
- The intermediate output is stored in intermediate storage
- Reduce: produces the final set of output
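The two phases can be illustrated with a small single-machine sketch. It is not a distributed implementation: the function names (`map_phase`, `shuffle`, `reduce_phase`, `base_counts`) and the nucleotide-counting task are assumptions chosen for illustration, and the in-memory dictionary merely stands in for the intermediate storage.

```python
from collections import defaultdict


def map_phase(records):
    """Map: turn each (read_id, sequence) input pair into intermediate (base, 1) pairs."""
    for _read_id, sequence in records:
        for base in sequence:
            yield base, 1


def shuffle(pairs):
    """Group intermediate pairs by key; this dict stands in for intermediate storage."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse the values of each key into one final output value."""
    return {key: sum(values) for key, values in groups.items()}


def base_counts(records):
    """Run Map, then the grouping step, then Reduce."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    reads = [("r1", "ACGT"), ("r2", "AACC")]  # hypothetical input records
    print(base_counts(reads))
```

In a real framework the Map and Reduce functions run on many nodes and the grouping step moves data between them; only the shape of the computation is shown here.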

Apache Spark
- Cluster computing framework
- Master/slave architecture: a central coordinator and workers
- Supports in-memory and on-disk computation
- Resilient distributed datasets (RDDs)
- Transformations and actions
- Programmers can perform iterative operations on their data without writing intermediary results to disk.
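The split between transformations and actions can be sketched with a toy class. This is a single-machine stand-in, not the Spark API: `ToyRDD` and its methods are hypothetical names, and the only point illustrated is that transformations merely record work while an action triggers it, with no intermediate result written out in between.

```python
class ToyRDD:
    """A toy, single-machine stand-in for an RDD (an assumption, not Spark's class)."""

    def __init__(self, source, steps=()):
        self._source = source        # a re-iterable collection, e.g. a list
        self._steps = tuple(steps)   # recorded transformations, not yet executed

    def map(self, fn):
        """Transformation: record a map step and return a new ToyRDD."""
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        """Transformation: record a filter step and return a new ToyRDD."""
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def _iterate(self):
        """Chain the recorded steps lazily over the source."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        """Action: run the recorded steps and return all results."""
        return list(self._iterate())

    def count(self):
        """Action: run the recorded steps and count the results."""
        return sum(1 for _ in self._iterate())


if __name__ == "__main__":
    lengths = ToyRDD(["ACGT", "AC", "ACGTAC"]).map(len).filter(lambda n: n >= 4)
    print(lengths.collect(), lengths.count())
```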

BWA - Burrows-Wheeler Aligner
- Open source
- Maps sequence reads to a genome
- Widely used alignment tool
- Algorithms:
  - BWA-backtrack: reads < 100 bp
  - BWA-SW: 70 bp to 1 Mbp
  - BWA-MEM: 70 bp to 1 Mbp
- Parallel implementation supports shared-memory machines

Input and Output
- Input: FASTQ format
- Output: SAM (Sequence Alignment/Map) file
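For readers unfamiliar with the input format, a FASTQ record spans four lines: an `@`-prefixed identifier, the sequence, a `+` separator, and a quality string of the same length as the sequence. The parser below is a minimal sketch that assumes unwrapped records (one line per sequence); `FastqRecord` and `parse_fastq` are illustrative names, not part of BWA or SparkBWA.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four non-empty lines of FASTQ text."""
    rows = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    for start in range(0, len(rows), 4):
        header, sequence, separator, quality = rows[start:start + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record at line {start + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # The read identifier is the first token after the leading '@'.
        yield FastqRecord(header[1:].split()[0], sequence, quality)


if __name__ == "__main__":
    text = ["@r1 sample\n", "ACGT\n", "+\n", "IIII\n"]  # hypothetical record
    for record in parse_fastq(text):
        print(record)
```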

SparkBWA
- Integration of BWA into the Spark framework
- Objectives:
  - Increase performance and scalability
  - Keep SparkBWA compatible with different versions of BWA
  - Perform sequence alignments efficiently while keeping the implementation details completely hidden from researchers
- An API is provided

System Design - RDD creation
- Input data are prepared for the Map phase
- An RDD is created from the FASTQ input files
- Data is distributed across the computing nodes
- For paired-end reads, two RDDs are created
- Issue: with paired-end reads, the same identifier refers to two reads
- A transformation is applied to the data:
  - Join and sortByKey: <read_id, Tuple<read_content1, read_content2>>
  - SortHDFS: <read_id, merged_content>
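The join-and-sort step for paired-end reads can be sketched in plain Python. This only mimics the shape of the result, `<read_id, Tuple<read_content1, read_content2>>`, on one machine; `pair_reads` is a hypothetical helper, and the real system performs the equivalent operations on distributed RDDs.

```python
def pair_reads(reads_1, reads_2):
    """Join two collections of (read_id, read_content) on read_id, then sort by key.

    Reads whose identifier appears in only one input are dropped, as in an inner join.
    """
    content_2 = dict(reads_2)
    joined = [
        (read_id, (content_1, content_2[read_id]))
        for read_id, content_1 in reads_1
        if read_id in content_2
    ]
    joined.sort(key=lambda pair: pair[0])  # stands in for sortByKey
    return joined


if __name__ == "__main__":
    first = [("r2", "CCCC"), ("r1", "AAAA")]   # hypothetical reads from file 1
    second = [("r1", "TTTT"), ("r2", "GGGG")]  # hypothetical mates from file 2
    for read_id, (mate_1, mate_2) in pair_reads(first, second):
        print(read_id, mate_1, mate_2)
```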

System Design - MAP
- Mappers apply the sequence alignment algorithm from BWA to the RDDs
- The BWA source code is written in C
- Spark supports Scala, Java and Python
- The Java Native Interface (JNI) is used, which avoids any modification of the original BWA source code
- The reference genome is shared among all computing nodes

System Design - MAP (continued)
- Creates output in SAM files
- Two software layers:
  - BWA layer
  - SparkBWA layer: processes the RDDs, passes input to the BWA layer, collects partial results
- Two levels of parallelism:
  - Regular mode: Map processes are distributed across the cluster
  - Hybrid mode: each Map process is additionally parallelized using threads

System Design - Reduce
- Merges all outputs into one file
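The Reduce step, merging partial SAM outputs into one file, can be sketched as follows. It relies only on the fact that SAM header lines start with `@`; the choice to keep the header of the first partial output and drop the rest is an assumption for illustration, and `merge_sam` is not SparkBWA's actual function.

```python
def merge_sam(partials):
    """Merge partial SAM outputs, given as lists of lines, into one list of lines.

    The header (lines starting with '@') is taken from the first partial output
    only, so it is not repeated in the merged result.
    """
    merged = []
    for index, lines in enumerate(partials):
        for line in lines:
            if line.startswith("@") and index > 0:
                continue  # skip repeated header lines from later partial outputs
            merged.append(line)
    return merged


if __name__ == "__main__":
    # Hypothetical, abbreviated SAM lines from two mappers.
    part_0 = ["@SQ\tSN:chr1\tLN:1000", "r1\t0\tchr1\t10"]
    part_1 = ["@SQ\tSN:chr1\tLN:1000", "r2\t0\tchr1\t50"]
    print("\n".join(merge_sam([part_0, part_1])))
```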

Evaluation
Tools compared, by algorithm (parallelization technology in parentheses):
- BWA-backtrack: pBWA (MPI), SEAL (Hadoop), SparkBWA (Spark)
- BWA-MEM: BWA (pthreads), BigBWA, Halvade
Notes:
- BWA: the shared-memory threaded version
- Message Passing Interface (MPI) is a standardized and portable message-passing system designed by a group of researchers from academia and industry to function on a wide variety of parallel computing architectures.

Dataset

Tag  Name                Number of reads  Read length (bp)  Size (GiB)
D1   NA12750/ERR000589   12×10^6          51                3.4
D2   HG00096/SRR062634   24.1×10^6        100               11.8
D3   150140/SRR642648    98.8×10^6        -                 48.3

RDD creation

Execution time - BWA-MEM algorithm
- Regular mode: each mapper runs sequentially
- Hybrid mode: several threads per mapper
- SparkBWA hybrid mode should be the preferred option only in those cases where memory limitations do not allow all the cores in each node to be used.
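The trade-off between the two modes can be made concrete with a small planning sketch. Everything here is an assumption for illustration: the function `plan_mappers`, the idea that each mapper needs a fixed amount of memory, and the example node sizes are hypothetical and do not come from the presentation.

```python
def plan_mappers(cores_per_node, memory_per_node_gib, memory_per_mapper_gib):
    """Return (mode, mappers_per_node, threads_per_mapper) for one node.

    Regular mode: one single-threaded mapper per core, if memory allows it.
    Hybrid mode: fewer mappers fit in memory, so each one gets several threads.
    """
    if cores_per_node < 1 or memory_per_node_gib <= 0 or memory_per_mapper_gib <= 0:
        raise ValueError("cores and memory sizes must be positive")
    mappers_by_memory = int(memory_per_node_gib // memory_per_mapper_gib)
    if mappers_by_memory < 1:
        raise ValueError("a node cannot hold even one mapper")
    mappers = min(cores_per_node, mappers_by_memory)
    threads = cores_per_node // mappers  # integer division may leave some cores idle
    mode = "regular" if threads == 1 else "hybrid"
    return mode, mappers, threads


if __name__ == "__main__":
    # Hypothetical nodes: enough memory for one mapper per core, then not enough.
    print(plan_mappers(16, 64, 4))
    print(plan_mappers(16, 32, 8))
```

With enough memory for one mapper per core the sketch picks regular mode; when memory limits the number of mappers, the remaining cores are used as extra threads per mapper.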

Execution time - BWA-backtrack algorithm

Execution time - BWA-MEM algorithm

Thank You!