Human SNPs from short reads in hours using cloud computing

Ben Langmead (1,2), Michael C. Schatz (2), Jimmy Lin (3), Mihai Pop (2), Steven L. Salzberg (2)

1 Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
2 Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
3 The iSchool, College of Information Studies, University of Maryland, College Park, MD, USA

Abstract

As growth in short read sequencing throughput vastly outpaces improvements in microprocessor speed, there is a critical need to accelerate common tasks, such as short read alignment and SNP calling, via large-scale parallelization. Crossbow is a software tool that combines the speed of the short read aligner Bowtie [1] with the accuracy of the SOAPsnp [2] consensus and SNP caller within a cloud computing environment. Crossbow aligns reads and makes highly accurate SNP calls from a dataset comprising 38-fold coverage of the human genome in under 1 day on a local 40-core cluster, and in under 3 hours using a 320-core cluster rented from Amazon's Elastic Compute Cloud (EC2) [3] service. Crossbow's ability to run on EC2 means that users need not own or operate an expensive computer cluster in order to run Crossbow. Crossbow is available from http://bowtie-bio.sourceforge.net/crossbow under the Artistic license.

Crossbow Design

Crossbow is a new software tool for efficient and accurate whole genome genotyping. It aligns and calls SNPs from 38-fold coverage of short reads from a human in less than 3 hours on a 320-core cluster rented from Amazon's EC2 service, condensing over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster. Running on standard software (Hadoop) and hardware (EC2 instances) makes it easier for other researchers to reproduce our results or other results obtained with Crossbow.

Crossbow builds upon a parallel software framework called Hadoop [4], an open source implementation of the MapReduce programming model first described by scientists at Google [5]. Hadoop has become a popular tool for computation over very large datasets and is used at companies including Google, Yahoo, IBM, and Amazon. Hadoop requires that programs be expressed as a series of Map and Reduce steps operating on tuples of data. Though not all programs are easily expressed this way, Hadoop programs gain many benefits: in general, they need not deal with the particulars of how work and data are distributed across the cluster, or with how to recover from failures; Hadoop handles this.

The insight behind Crossbow is that alignment and SNP calling can be framed as a series of Map, Sort, and Reduce steps: the Map step is short read alignment, the Sort step bins and sorts alignments according to the genomic position aligned to, and the Reduce step calls SNPs for a given partition.

Crossbow Flow

1. The user first uploads reads to a filesystem visible to the Hadoop cluster. If the Hadoop cluster is in EC2, the filesystem might be an S3 bucket; if the Hadoop cluster is local, it might be an NFS share.
2. The Map step is short read alignment. Many instances of Bowtie run in parallel across the cluster; input tuples are reads and output tuples are alignments. A cluster may consist of any number of nodes, and Hadoop handles the details of routing data, distributing and invoking programs, providing fault tolerance, etc.
3. The Sort step bins alignments according to a primary key (the genome partition) and sorts them according to a secondary key (the offset into that partition). This is handled efficiently by Hadoop.
4. The Reduce step calls SNPs for each reference partition. Many instances of SOAPsnp run in parallel across the cluster; input tuples are the sorted alignments for a partition and output tuples are SNP calls.
5. Results are stored in the cluster's filesystem, then automatically archived and downloaded to the client machine. SNP calls are provided in SOAPsnp's format.

(Figure: steps involved in running Crossbow using Amazon's EC2 and S3 services.)
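To make the Map/Sort/Reduce framing concrete, the following is a minimal sketch in the style of Hadoop Streaming, where the mapper and reducer are ordinary executables exchanging tab-separated tuples on standard input and output. The partition size, the field layout, the script name (a hypothetical crossbow_sketch.py), and the align()/call_snps() stand-ins (placeholders for Bowtie and SOAPsnp) are assumptions for illustration, not Crossbow's actual formats or interfaces.

    #!/usr/bin/env python
    """Hypothetical sketch of Crossbow's Map -> Sort -> Reduce framing in
    Hadoop Streaming style. Field layouts and the align()/call_snps()
    stand-ins are illustrative assumptions, not Crossbow's real formats
    or the real interfaces of Bowtie and SOAPsnp."""
    import sys

    PARTITION_SIZE = 1_000_000  # bases per reference partition (assumed)

    def align(read):
        # Stand-in for Bowtie: a real mapper would stream reads through a
        # bowtie process and parse its alignment output.
        return "chr22", hash(read) % 50_000_000, read

    def call_snps(key, alignments):
        # Stand-in for SOAPsnp: a real reducer would pipe the partition's
        # sorted alignments into soapsnp and emit its SNP calls.
        chrom, partition = key
        return [f"{chrom}\t{partition}\t{len(alignments)} alignments seen"]

    def mapper():
        """Map step: align each read and emit a tuple keyed by
        (chromosome, partition, offset) so Hadoop can bin alignments by
        partition (primary key) and sort by offset (secondary key)."""
        for line in sys.stdin:
            read = line.rstrip("\n")
            chrom, pos, alignment = align(read)
            part, off = divmod(pos, PARTITION_SIZE)
            print(f"{chrom}\t{part}\t{off}\t{alignment}")

    def reducer():
        """Reduce step: lines arrive grouped by (chromosome, partition)
        and sorted by offset; call SNPs once per partition."""
        current, block = None, []
        for line in sys.stdin:
            chrom, part, off, alignment = line.rstrip("\n").split("\t", 3)
            if (chrom, part) != current:
                if block:
                    print("\n".join(call_snps(current, block)))
                current, block = (chrom, part), []
            block.append((int(off), alignment))
        if block:
            print("\n".join(call_snps(current, block)))

    if __name__ == "__main__":
        {"map": mapper, "reduce": reducer}[sys.argv[1]]()

As a local smoke test, a Unix sort approximates Hadoop's shuffle: python crossbow_sketch.py map < reads.txt | sort -k1,2 -k3,3n | python crossbow_sketch.py reduce.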
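On a cluster, the binning and sorting of the Sort step come from Hadoop's shuffle: a streaming job would typically partition on the (chromosome, partition) key fields and apply a numeric secondary sort on the offset field. The launcher below sketches one plausible configuration; the S3 paths, the script name, and the exact option spellings (old mapred-style streaming options, which vary across Hadoop versions) are assumptions, not Crossbow's actual invocation.

    #!/usr/bin/env python
    """Illustrative launcher: submit the mapper/reducer sketch above as a
    Hadoop Streaming job. All paths and option spellings are assumptions."""
    import subprocess

    subprocess.run([
        "hadoop", "jar", "hadoop-streaming.jar",
        # Treat the first three tab-separated fields as the map output key.
        "-D", "stream.num.map.output.key.fields=3",
        # Partition on fields 1-2 (chromosome, genome partition) only...
        "-D", "mapred.text.key.partitioner.options=-k1,2",
        # ...and sort by fields 1-2, then numerically by field 3 (offset).
        "-D", "mapred.output.key.comparator.class="
              "org.apache.hadoop.mapred.lib.KeyFieldBasedComparator",
        "-D", "mapred.text.key.comparator.options=-k1,2 -k3,3n",
        "-partitioner",
        "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner",
        "-input", "s3://mybucket/reads",   # S3 on EC2; HDFS/NFS locally
        "-output", "s3://mybucket/snps",
        "-mapper", "crossbow_sketch.py map",
        "-reducer", "crossbow_sketch.py reduce",
        "-file", "crossbow_sketch.py",     # ship the script to worker nodes
    ], check=True)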
Whole-Human Resequencing with Crossbow

Crossbow was used to align and call SNPs from the set of 2.7 billion reads sequenced from a Han Chinese male by Wang et al. [6]. Previous work demonstrated that SNPs called from this dataset by SOAPsnp are highly concordant with genotypes determined via an Illumina 1M BeadChip assay of the same individual [2]. Reads were downloaded from a mirror of the YanHuang site (http://yh.genomics.org.cn). The reads cover the assembled human genome sequence to 38-fold coverage. They consist of 2.02 billion unpaired reads with sizes ranging from 25 to 44 bp, and 658 million paired-end reads. The most common unpaired read lengths are 35 and 40 bp, comprising 73.0% and 17.4% of unpaired reads respectively. Cost, timing, and accuracy results are summarized below. SNPs produced by Crossbow exhibit similar agreement with the BeadChip calls as did the SOAPsnp study.

Timing and cost on EC2 (each worker node provides 8 CPU cores):

                       1 master,    1 master,    1 master,
                       10 workers   20 workers   40 workers
    Worker CPU cores   80           160          320
    Wall clock time    6h:30m       4h:33m       2h:53m
    Cost               $61.60       $84.00       $98.40

Accuracy:

    Simulated                        True # sites   Crossbow sensitivity   Crossbow precision
    Human Chr. 22, simulated SNPs    46,...         ...%                   99.14%
    Human Chr. X, simulated SNPs     102,...        ...%                   99.64%

    Real                                       # sites   Autosomal agreement   Chr. X agreement
    Whole human, vs. Illumina 1M BeadChip      1.04M     99.5%                 99.6%

Simulated reads are paired-end reads with simulated SNPs, including both known HapMap SNPs, which SOAPsnp handles specially, and novel SNPs. "Real" reads are the reads from the Wang et al. study. Agreement is calculated as the number of correct calls at genotyped sites divided by the number of genotyped sites.

References

1. Langmead B, Trapnell C, Pop M, Salzberg SL (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3): R25.
2. Li R, Li Y, et al. (2009). SNP detection for massively parallel whole-genome resequencing. Genome Res 19(6): 1124-1132.
3. Amazon Elastic Compute Cloud (EC2). http://aws.amazon.com/ec2/
4. Apache Hadoop. http://hadoop.apache.org/
5. Dean J, Ghemawat S (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1): 107-113.
6. Wang J, Wang W, et al. (2008). The diploid genome sequence of an Asian individual. Nature 456(7218): 60-65.