High Throughput Sequence Analysis with MapReduce Michael Schatz June 18, 2009 JCVI Informatics Seminar

Outline
1. Parallel Computing Resources
   - Amazon Cloud Services
   - Hadoop & MapReduce
2. MapReduce Showcase
   - Read Mapping: CloudBurst
   - Mapping & SNP Discovery: Crossbow
   - De novo Genome Assembly
3. Summary and Questions

A Brief History of the Amazon Cloud
Urban legend: additional capacity is added every fall for the holiday shopping season and sits underutilized the rest of the year...
Official story: Amazon is a technology company. Different divisions of Amazon are loosely coupled, and Amazon Web Services is the 3rd business division, alongside the Retail & Seller businesses.


Amazon Web Services
Elastic Compute Cloud (EC2)
- On-demand computing power
- Support for Windows, Linux, & OpenSolaris
Simple Storage Service (S3)
- Scalable data storage
- 17¢ / GB transfer fee, 15¢ / GB monthly fee
Elastic MapReduce (EMR)
- Point-and-click Hadoop workflows
- Computation runs on EC2

EC2 Pricing and Models

              Small        High-CPU Medium   High-CPU Extra Large
CPU (GHz)     --           --                --
RAM           1.7 GB       1.7 GB            7 GB
Local Disk    150 GB       350 GB            1690 GB
Price         10¢ / hour   20¢ / hour        80¢ / hour

- Instances boot on demand from a virtual disk image
- Stock images are available for common configurations & operating systems
- Custom images are "easy" to create
- Reserved Instances: pay a small upfront fee for a lower hourly rate

AWS Use Model
[Diagram: a desktop client connected to S3 and a pool of EC2 instances (EC2-1 through EC2-5)]
After machines spool up, ssh to each instance like any other server. Use S3 for persistent data storage, with a very fast interconnect to EC2.

MapReduce is the parallel distributed framework invented by Google for large data computations.
- Data and computations are spread over thousands of computers, processing petabytes of data each day (Dean and Ghemawat, 2004)
- PageRank, inverted index construction, log analysis, distributed grep, ...
Hadoop is the leading open-source implementation of MapReduce.
- Sponsored by Yahoo, Google, Amazon, and other major vendors
- Clusters with 10k nodes, petabytes of data
Hadoop MapReduce
- Benefits: scalable, efficient, reliable; easy to program; open source
- Challenges: redesigning / retooling applications; not Sun Grid, not MPI

K-mer Counting with MapReduce
Application developers focus on 2 (+1 internal) functions:
- Map: input -> key, value pairs
- Shuffle: group together pairs with the same key
- Reduce: key, value-list -> output
Map, Shuffle & Reduce all run in parallel.
[Example: from the reads ATGAACCTTA, GAACAACTTA, and TTTAGGCAAC, map emits each 3-mer with count 1; shuffle groups identical 3-mers (e.g., AAC -> 1,1,1,1; TTA -> 1,1,1); reduce sums each list (AAC:4, TTA:3, CTT:2, GAA:2, CAA:2, ...)]
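To make the data flow concrete, here is a minimal local simulation of the three functions in Python (an illustrative sketch, not Hadoop code: on a real cluster the map and reduce functions are supplied to Hadoop, which performs the shuffle itself).

```python
from itertools import groupby

def map_kmers(read, k=3):
    """Map: emit (k-mer, 1) for each k-mer in the read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1

def shuffle(pairs):
    """Shuffle: group pairs by key (Hadoop does this between map and reduce)."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, [v for _, v in group]

def reduce_counts(key, values):
    """Reduce: sum the value list for each k-mer."""
    return key, sum(values)

reads = ["ATGAACCTTA", "GAACAACTTA", "TTTAGGCAAC"]  # reads from the slide
pairs = [kv for r in reads for kv in map_kmers(r)]
for kmer, values in shuffle(pairs):
    key, count = reduce_counts(kmer, values)
    print(f"{key}:{count}")  # AAC:4, TTA:3, CTT:2, GAA:2, ...
```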

Hadoop Use Model
[Diagram: a desktop client submits jobs to a master node, which coordinates slave nodes 1-5]
Hadoop Distributed File System (HDFS)
- Data files partitioned into large chunks (64 MB), replicated on multiple nodes
- NameNode stores metadata information (block locations, directory structure)
The master node (JobTracker) schedules and monitors work on the slaves.
- Computation moves to the data

Hadoop for Computational Biology?
Biological research requires an increasing amount of data. Big data computing requires thinking in parallel; Hadoop may be an enabling technology.
Examples: 1000 Genomes, Human Microbiome, Global Ocean Survey.

Short Read Mapping
Recent studies of entire human genomes analyzed billions of reads:
- Asian individual genome: 3.3 billion 35 bp reads, 104 GB (Wang et al., 2008)
- African individual genome: 4.0 billion 35 bp reads, 144 GB (Bentley et al., 2008)
Alignment computation required >10,000 CPU hours*
- Alignments are "embarrassingly parallel" by read
- Variant detection is parallel by chromosome region
[Figure: short reads tiled across a reference sequence; columns where the subject's reads consistently differ from the reference identify variants]

Seed and Extend
Good alignments must contain significant exact alignments:
- Use exact alignments to seed the search for longer inexact alignments
- Minimal exact alignment length = l/(k+1) bp for a read of length l with at most k differences (e.g., a 36 bp read with at most 2 differences must contain an exact seed of at least 12 bp)
RMAP (Smith et al, 2008)
- Catalog k-mers in the reads as seeds
- Stream across the reference sequence
- Count mismatches of reads anchored by seeds
Expensive to scale to large experiments or highly sensitive searches... if only there were a way to parallelize execution.

CloudBurst
1. Map: Catalog k-mers. Emit the k-mers in the genome (e.g., human chromosome 1) and in the reads.
2. Shuffle: Collect seeds. Conceptually build a hash table of k-mers and their occurrences.
3. Reduce: End-to-end alignment. If a read aligns end-to-end with <= k errors, record the alignment (e.g., Read 1 -> Chromosome 1; Read 2 -> Chromosome 1).

Schatz, MC (2009) CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics 25:1363-1369.
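The following Python sketch illustrates the seed-and-extend scheme (an assumption-laden simplification, not CloudBurst's actual code: the function names are invented, and the extension step scores mismatches only, whereas CloudBurst also handles indels).

```python
def seed_map(tag, seq, seed_len):
    """Map: emit (seed, (tag, offset)) for every seed-length k-mer.
    tag is 'genome' or a read identifier."""
    for i in range(len(seq) - seed_len + 1):
        yield seq[i:i + seed_len], (tag, i)

def seed_reduce(occurrences, genome, reads, k):
    """Reduce (per seed): pair genome hits with read hits that share the
    seed and keep end-to-end, mismatch-only alignments with <= k errors."""
    genome_hits = [off for tag, off in occurrences if tag == "genome"]
    read_hits = [(tag, off) for tag, off in occurrences if tag != "genome"]
    for g_off in genome_hits:
        for read_id, r_off in read_hits:
            read = reads[read_id]
            start = g_off - r_off  # genome position of read base 0
            if start < 0 or start + len(read) > len(genome):
                continue
            errors = sum(a != b for a, b in zip(read, genome[start:start + len(read)]))
            if errors <= k:
                yield read_id, start, errors

# Toy usage: group seed hits by key, then reduce each group.
from collections import defaultdict
genome = "CCATAGGCTATATGCG"
reads = {"read1": "AGGCTAT"}
by_seed = defaultdict(list)
for seed, occ in list(seed_map("genome", genome, 4)) + list(seed_map("read1", reads["read1"], 4)):
    by_seed[seed].append(occ)
hits = {h for occs in by_seed.values() for h in seed_reduce(occs, genome, reads, k=1)}
print(hits)  # {('read1', 4, 0)}: read1 aligns at genome offset 4 with 0 errors
```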

CloudBurst Results
Evaluated running time on a local 24-core cluster:
- Running time increases linearly with the number of reads
Compared to RMAP:
- Highly sensitive alignments achieve better than the linear 24x speedup
- Produces identical results in a fraction of the time

EC2 Evaluation
CloudBurst running times for mapping 7M reads to human chromosome 22 with at most 4 mismatches on the local and EC2 clusters. The 24-core Amazon High-CPU Medium Instance EC2 cluster is faster than the 24-core Small Instance EC2 cluster and the 24-core local dedicated cluster. The 96-core cluster is 3.5x faster than the 24-core cluster, and 100x faster than serial RMAP.

CloudBurst Reflections
CloudBurst efficiently reports every k-difference alignment of every read, but many applications only need the best alignment. Finding just the best alignment needs an in-memory index:
- Suffix tree (>51 GB)
- Suffix array (>15 GB)
- Seed hash tables (>15 GB); many variants, incl. spaced seeds

Burrows-Wheeler Transform
A reversible permutation of the characters in a text: BWT(T) is the index for T.
[Figure: the text T, the Burrows-Wheeler Matrix BWM(T) of its sorted rotations, and its last column BWT(T)]
The LF property implicitly encodes the Suffix Array.

Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA; 1994. Technical Report 124.
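A minimal construction sketch in Python makes the definition concrete (illustrative only; real indexes such as Bowtie's never materialize the full rotation matrix).

```python
def bwt(t):
    """BWT(T): the last column of the lexicographically sorted rotations
    of T with a '$' sentinel appended."""
    t += "$"
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(row[-1] for row in rotations)

print(bwt("banana"))  # annb$aa
```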

Bowtie: Ultrafast Short Read Aligner
- Quality-aware backtracking of the BWT to rapidly find the best alignment(s) for each read
- BWT precomputed once, easy to distribute, and analyzed in RAM (3 GB for the whole human genome)
- Support for paired-end alignment, quality guarantees, etc.

Langmead B, Trapnell C, Pop M, Salzberg SL. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10:R25.

Bowtie algorithm
[Animation: the query AATGATACGGCGACCACCGAGATCTA is matched against BWT(Reference) character by character from right to left, each step narrowing the range of matching BWM rows; with a mismatched query (AATGTTACGGCGACCACCGAGATCTA) the range empties and Bowtie backtracks, substituting a different base and resuming the search.]

Comparison to MAQ & SOAP
Performance and sensitivity of Bowtie v0.9.6, SOAP v1.10, and Maq v0.6.6 when aligning 8.84M reads from the 1000 Genomes Project [NCBI Short Read Archive: SRR001115], trimmed to 35 base pairs. The "soap.contig" version of the SOAP binary was used. SOAP could not be run on the PC because SOAP's memory footprint exceeds the PC's physical memory. For the SOAP comparison, Bowtie was invoked with "-v 2" to mimic SOAP's default matching policy (which allows up to 2 mismatches in the alignment and disregards quality values). For the Maq comparison, Bowtie was run with its default policy, which mimics Maq's default of allowing up to 2 mismatches in the first 28 bases and enforcing an overall limit of 70 on the sum of the quality values at all mismatched positions. To make Bowtie's memory footprint more comparable to Maq's, Bowtie was invoked with "-z" in all experiments to ensure only the forward or mirror index is resident in memory at one time.

Crossbow: Rapid Whole Genome SNP Analysis
Align billions of reads and find SNPs, reusing existing software components via Hadoop Streaming:
- Map: Bowtie. Emits (chromosome region, alignment) pairs.
- Shuffle: Hadoop. Groups and sorts alignments by region.
- Reduce: SoapSNP (Li et al, 2009). Scans alignments for divergent columns, accounting for sequencing error and known SNPs.

Preliminary Results: Simulated Data

                      Chr 22       Chr X
Read Data             62 M, 45x    200 M, 45x
Workers               2 X-Large    8 X-Large
Running Time          37 min       40 min
Cost                  $2.40        $6.40
Predicted SNPs        39,251       77,844
True Positive Rate    98.7%        99.6%
False Positive Rate   1.0%         0.1%

Reads were simulated using real reads as templates; SNPs were simulated using allele information from HapMap plus 10% novel SNPs.

Preliminary Results: Whole Genome
Asian Individual Genome (SE: 2.0 B reads, 24x; PE: 1.3 B reads, 15x)

Step             Time         Workers         Cost
Data Loading     --           --              $10.65
Data Transfer    1 hour       39+1 Small      $4.00
Preprocessing    0.5 hours    40+1 X-Large    $16.40
Alignment        1.5 hours    40+1 X-Large    $49.20
Variant Calling  1.0 hours    40+1 X-Large    $32.80
End-to-end       4 hours                      $113.05

Goal: reproduce the analysis by Wang et al. for ~$100 in an afternoon.

Short Read Genome Assembly
Several new assemblers have been developed for short read data, all variations on compressed de Bruijn graphs:
- Velvet (Zerbino & Birney, 2008)
- EULER-USR (Chaisson et al, 2009)
- ALLPATHS (Butler et al, 2008)
- ABySS (Simpson et al, 2009)
Short read assembler outline:
1. Construct the compressed de Bruijn graph
2. Remove sequencing error from the graph
3. Use mate-pairs to resolve ambiguities
Successful for small to medium genomes: 2 Mbp bacteria up to a 39 Mbp fungal assembly.

de Bruijn Graph Construction
D_k = (V, E)
- V = all length-k mers occurring in the reads (k < l)
- E = directed edges between consecutive k-mers (k-mers that overlap by k-1 bases in a read)
[Figure: a set of reads; the resulting de Bruijn graph of 3-mers; and the compressed graph, in which each non-branching path is merged into a single node (e.g., ATC -> TCT compresses to ATCT, while the branch node CTG stays separate)]
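A minimal Python sketch of this construction, following the slide's definition with k-mers as nodes (an illustration, not an assembler's implementation):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build D_k: nodes are k-mers, with an edge a -> b whenever a and b
    are consecutive (overlapping by k-1 bases) in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph

# Non-branching chains (one in-edge, one out-edge) can then be merged
# to form the compressed graph.
g = de_bruijn(["ATCTG", "TGACT"], 3)
print(dict(g))  # {'ATC': {'TCT'}, 'TCT': {'CTG'}, 'TGA': {'GAC'}, 'GAC': {'ACT'}}
```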

Genome Assembly with MapReduce
The new short read assemblers require tremendous computation:
- Velvet on 48x coverage of the 2 Mbp S. suis genome requires >2 GB of RAM
- ABySS on 42x coverage of a human genome required ~4 days on 168 cores
Challenge: how do we efficiently implement assembly algorithms in the (restricted) MapReduce environment, where different nodes of the graph are stored on different machines?

Parallel Path Compression
List ranking problem: the general problem of parallel linked-list operations.
Pointer jumping (Wyllie, 1979):
- Add a tail pointer to each node
- In each round, advance (double) the tail pointer
- Collect and concatenate all the nodes in a simple path
- O(log S) parallel cycles
Examples: B. anthracis, 268,925 -> 19+1 cycles; human, 37,172 -> 16+1 cycles.
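A sequential Python sketch of pointer jumping (illustrative only; in the MapReduce setting each iteration of the while loop is one parallel round over all nodes, which is why O(log S) rounds suffice):

```python
def list_rank(nxt):
    """Pointer jumping (Wyllie, 1979): each round, every node's pointer
    jumps ahead to its successor's successor, doubling the distance
    covered, so every node learns its distance to the tail in O(log n)
    rounds. nxt maps each node to its successor (None at the tail)."""
    rank = {v: (0 if nxt[v] is None else 1) for v in nxt}
    ptr = dict(nxt)
    while any(p is not None for p in ptr.values()):
        new_rank, new_ptr = {}, {}          # synchronous update, as in one
        for v in ptr:                       # parallel round
            w = ptr[v]
            if w is None:
                new_rank[v], new_ptr[v] = rank[v], None
            else:
                new_rank[v], new_ptr[v] = rank[v] + rank[w], ptr[w]
        rank, ptr = new_rank, new_ptr
    return rank

print(list_rank({"a": "b", "b": "c", "c": None}))  # {'a': 2, 'b': 1, 'c': 0}
```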

Summary
1. Hadoop is well suited to big data biological computation
2. Hadoop Streaming allows easy scaling of existing software
3. Cloud computing is an attractive platform to augment resources
4. Look for many cloud computing & MapReduce solutions this year

Acknowledgements
Mihai Pop, Ben Langmead, Jimmy Lin, Steven Salzberg

Thank You!

Burrows-Wheeler Transform
The property that makes BWT(T) reversible is the LF Mapping:
- The i-th occurrence of a character in the Last column is the same text occurrence as the i-th occurrence in the First column
- LF(r): the function taking L row r to the corresponding F row
[Figure: T with the L and F columns of BWM(T); an occurrence of rank 2 in L maps to the occurrence of the same rank in F]

Burrows-Wheeler Transform
Recreating T from BWT(T): start in the first row and apply LF repeatedly, accumulating predecessors along the way until the original T is recovered.
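Sketched in Python, with the bwt construction repeated from the earlier sketch so the block is self-contained (illustrative only):

```python
from collections import Counter

def bwt(t):
    """BWT(T), as in the earlier construction sketch."""
    t += "$"
    return "".join(r[-1] for r in sorted(t[i:] + t[:i] for i in range(len(t))))

def inverse_bwt(L):
    """Recreate T from BWT(T) by applying LF repeatedly, accumulating
    predecessors along the way."""
    counts = Counter(L)
    C, total = {}, 0
    for ch in sorted(counts):        # C[ch] = # characters in L smaller than ch
        C[ch], total = total, total + counts[ch]
    seen, LF = Counter(), []
    for ch in L:                     # LF[r] = C[ch] + rank of this occurrence
        LF.append(C[ch] + seen[ch])
        seen[ch] += 1
    r, chars = 0, []                 # row 0 of BWM(T) begins with '$'
    for _ in range(len(L) - 1):      # collect every character except '$'
        chars.append(L[r])
        r = LF[r]
    return "".join(reversed(chars))

print(inverse_bwt(bwt("banana")))    # banana
```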

Exact Matching
LFc(r, c) does the same thing as LF(r), but it ignores row r's actual final character and "pretends" it is c.
[Figure: the L and F columns, illustrating LFc(5, g) = 8]

Exact Matching
Start with a range (top, bot) encompassing all rows and repeatedly apply LFc:
- top = LFc(top, qc); bot = LFc(bot, qc)
- qc = the next character to the left in the query

Ferragina P, Manzini G: Opportunistic data structures with applications. FOCS, IEEE Computer Society; 2000.

Exact Matching
If the range becomes empty (top = bot), the query suffix (and therefore the query as a whole) does not occur in the text.
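Putting the pieces together, a minimal backward-search sketch in Python (illustrative: the rank query here scans the string directly, while a real FM-index such as Bowtie's uses precomputed occurrence tables):

```python
from collections import Counter

def backward_search(bw, query):
    """Count occurrences of query by shrinking a BWM row range (top, bot)
    with top = LFc(top, qc) and bot = LFc(bot, qc) for each query
    character, right to left."""
    counts = Counter(bw)
    C, total = {}, 0
    for ch in sorted(counts):        # C[ch] = # characters smaller than ch
        C[ch], total = total, total + counts[ch]
    def occ(ch, r):                  # rank: occurrences of ch in bw[0:r]
        return bw[:r].count(ch)
    top, bot = 0, len(bw)            # range initially covers all rows
    for qc in reversed(query):       # next character to the left
        if qc not in C:
            return 0
        top = C[qc] + occ(qc, top)   # top = LFc(top, qc)
        bot = C[qc] + occ(qc, bot)   # bot = LFc(bot, qc)
        if top == bot:               # empty range: suffix does not occur
            return 0
    return bot - top                 # number of matches in the text

print(backward_search("annb$aa", "ana"))  # 2 ("annb$aa" is BWT of "banana")
```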