SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign.

Presentation transcript:

SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign 10/05/2009, CAMDA 2009, Chicago

Challenge of NGS Alignment. Sequences are short (25 to 76 bp). The data sets are large and still increasing. BLAST? BLAST is built for transaction-style, long queries, whereas an NGS aligner handles batch, short queries. We need an index!

The NGS Aligner War: Where are you?

NGS Aligner Classification: Standalone Algorithms.
Hash reads: Eland, RMAP, MAQ, SHRiMP … Pros: less RAM, less overhead. Cons: wasteful genome scan.
Hash genome: SOAP, PASS, Mosaik, BFAST … Pros: fast, scales up well. Cons: big RAM, heavy overhead.
Index genome (Burrows-Wheeler): Bowtie, BWA.

NGS Aligner Classification: Parallel Algorithms. Options and the things to consider for each:
Multi-thread: hard to scale up to many cores.
Cluster computing: load balancing, fault tolerance.
Cloud computing: restricted programming interface.

Programming Model of Cloud Computing: MapReduce. The developer supplies two functions, map and reduce. All values v with the same key k are reduced together. A simple framework usually scales up well.
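The programming model on this slide can be illustrated with a toy, single-process sketch. It is not Hadoop code and not part of SeqMapReduce; the names `run_mapreduce`, `map_kmers`, and `sum_counts` are hypothetical, and counting 3-mers is only a stand-in workload chosen for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy in-memory imitation of the MapReduce model (no distribution)."""
    groups = defaultdict(list)
    # Map: every input record yields zero or more (k, v) pairs.
    for record in records:
        for k, v in map_fn(record):
            groups[k].append(v)
    # Reduce: all v with the same k are reduced together.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}


def map_kmers(read, k=3):
    # Hypothetical map function: emit (k-mer, 1) for every window of the read.
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def sum_counts(key, values):
    # Hypothetical reduce function: total the counts collected for one key.
    return sum(values)


counts = run_mapreduce(["ACGTAC", "CGTACG"], map_kmers, sum_counts)
```

In a real framework the grouping step is the distributed shuffle, which is why the developer only has to write the two functions.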

Why Is Cloud Computing Attractive? It fits data-intensive computing (DIC), and NGS alignment is DIC in nature. Hadoop is an open-source cloud computing system with built-in load balancing and fault tolerance, and it is easy to program.

Cloud-Based NGS Aligners.
Hash reads: SeqMapReduce.
Hash genome: none yet; hash/index genome will be next.
Hash both: CloudBurst.
SeqMapReduce hashes all reads in RAM on every node. CloudBurst hashes the reads and the genome, but not in RAM.

The SeqMapReduce Framework

Inside SeqMapReduce: Pre-processing. Format the genome: format once, use every time. Bases at the end are duplicated.
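One plausible reading of "bases at the end are duplicated" is that the formatted genome is cut into chunks whose ends overlap, so a read that straddles a chunk boundary can still be found inside a single chunk. The sketch below works under that assumption only; `split_with_overlap`, `chunk_len`, and the choice of an overlap of `read_len - 1` bases are hypothetical and not taken from the slides.

```python
def split_with_overlap(genome, chunk_len, read_len):
    """Cut genome into (offset, chunk) pairs whose ends overlap.

    Assumption: repeating the last read_len - 1 bases of each chunk at the
    start of the next one keeps every read_len window inside some chunk.
    """
    overlap = read_len - 1
    if chunk_len <= overlap:
        raise ValueError("chunk_len must exceed read_len - 1")
    step = chunk_len - overlap
    chunks = []
    for start in range(0, len(genome), step):
        chunks.append((start, genome[start:start + chunk_len]))
        if start + chunk_len >= len(genome):
            break  # the last chunk already reaches the end of the genome
    return chunks


chunks = split_with_overlap("ACGTACGTAC", chunk_len=6, read_len=3)
```

Keeping the offset with each chunk lets a hit inside a chunk be translated back to a genome coordinate.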

Inside SeqMapReduce: Map Phase (Seed & Filtering). Divide a read into K parts. If there are M mismatches, at least (K-M) parts match exactly. For example, with K=4 and M=2, at least 4-2=2 parts match exactly, which gives C(4,2)=6 combinations, so only 6 hash tables are needed. Genome sequences are scanned for potential hits, which then go to mismatch counting.
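The seed rule above can be sketched as follows. This is an illustrative sketch, not the SeqMapReduce source: `split_read` and `seed_keys` are hypothetical names, and splitting the read into near-equal contiguous parts is an assumption. Each combination of K-M parts corresponds to one hash table.

```python
from itertools import combinations


def split_read(read, k):
    """Split a read into k contiguous parts of near-equal length."""
    base, extra = divmod(len(read), k)
    parts, start = [], 0
    for i in range(k):
        end = start + base + (1 if i < extra else 0)
        parts.append(read[start:end])
        start = end
    return parts


def seed_keys(read, k=4, m=2):
    """Return one seed key per combination of (k - m) parts.

    With at most m mismatches, at most m parts can contain a mismatch,
    so at least k - m parts match exactly and some key matches in full.
    """
    parts = split_read(read, k)
    return {
        combo: tuple(parts[i] for i in combo)
        for combo in combinations(range(k), k - m)
    }


keys = seed_keys("ACGTACGTTTGGAACC")  # 16 bp read, K=4, M=2
```

For K=4 and M=2 the dictionary has six entries, one for each pair of parts, which matches the six hash tables mentioned on the slide.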

Inside SeqMapReduce: Reduce Phase. Aggregating intermediate results; post-processing; duplicate detection; mismatch counting; final output report.

Inside SeqMapReduce: Mismatch Counting. Naive way: simple counting, O(N). Faster way: count mismatches with bit operations, using bit-wise XOR (exclusive or).

Mismatch counting with bit operations. Given the original R (read) and G (genome):
W = R XOR G
Define 2 constants: W1 = …, W2 = …
X = W & W1 (keep 10, clear 01, 11 => 10)
Y = W & W2 (keep 01, clear 10, 11 => 01)
Then Y = Y << 1
N = POPCNT(X | Y)
[Worked example: bit patterns of W, W1, W2, X = W & W1, Y = W & W2, Y << 1, and X | Y.]
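The bit trick above can be tried out in a few lines. This is a minimal sketch under stated assumptions: the 2-bit base encoding in `CODE` is hypothetical, and because the slide's values for W1 and W2 are not preserved, the masks below are assumed to select the high bit and the low bit of every 2-bit field. Python integers stand in for machine words, and `bin(...).count("1")` plays the role of POPCNT.

```python
# Assumed 2-bit encoding of bases (not given on the slide).
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    word = 0
    for base in seq:
        word = (word << 2) | CODE[base]
    return word


def count_mismatches(read, ref):
    """Count mismatching bases between two equal-length sequences."""
    if len(read) != len(ref):
        raise ValueError("sequences must have equal length")
    n = len(read)
    w2 = (4 ** n - 1) // 3   # assumed W2: 0101...01, low bit of each field
    w1 = w2 << 1             # assumed W1: 1010...10, high bit of each field
    w = pack(read) ^ pack(ref)   # W = R XOR G; a nonzero field is a mismatch
    x = w & w1                   # keep 10, clear 01, 11 => 10
    y = (w & w2) << 1            # keep 01, clear 10, then move it to the high bit
    return bin(x | y).count("1")  # one set bit per mismatching base


n_mismatches = count_mismatches("ACGT", "ACGA")
```

Folding both bits of a field onto its high bit is what prevents a `11` difference from being counted twice.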

Web Service of SeqMapReduce

Input format: a .zip of FASTA-format reads. Reads can be uploaded through the web site. Supports 13 model organisms. Supports reads longer than 32 bp. Up to 5 mismatches. No indels in the current version (will update soon). Output in ELAND format. Free of charge for academics. Users: small labs that want quick results but cannot afford expensive hardware and software.

Results on CAMDA 2009 Data Sets. Pol II ChIP-seq: FC201WVA_ _s_5 (4.5 million reads). IFNg-stimulated STAT1 ChIP-seq: FC302MA_ _s_1 (6.2 million reads). Platform: Illinois Cloud Computing Testbed (CCT); each node has 64-bit 2.6 GHz CPUs, 16 GB RAM, and 2 TB storage. 2 mismatches are allowed. Accuracy: 95% of results are the same as MAQ.

Speed-Up. [Plots: run time vs. number of cores for the Pol II data set and for the STAT1 data set.] Speed-up is quasi-linear in the number of cores. Ave overhead time: 67.22 s. Ave overhead time: … s.

Scale-Up. Columns: size, size ratio, run time, ratio.
STAT1: 6.2 million, … second, 1.03.
Pol II: 4.5 million, 354 second.
RAM requirement: about 50 MB per million reads. Can scale up to tens of millions of reads with several GB of RAM.

Comparison to CloudBurst. Why is CloudBurst slow? It hashes both the reads and the genome with the Hadoop system hash function, and it does no filtering in the Map phase, which causes heavy I/O to the Reduce phase.

Results on Amazon EC2. Speed-up is similar to that on the UIUC Hadoop cluster, but run times are slower. Large Standard Instances were chosen. Cost: $99.01.

Future Plans. Apply to bisulfite reads for genome-wide methylation analysis. Web-based visualization of short-read alignments.

Acknowledgements: UIUC Cloud Test Bed; Michael Schatz; CAMDA organizers. This work is supported by NSF DBI (SZ). Thank you!