Ultrafast and memory-efficient alignment of short reads to the human genome Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg Center for Bioinformatics.

Slides:

Advertisements

Similar presentations

John Dorband, Yaacov Yesha, and Ashwin Ganesan Analysis of DNA Sequence Alignment Tools.

Advertisements

Information Retrieval in Practice

SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign.

Longest Common Subsequence

Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Cloud Computing Resource provisioning Keke Chen. Outline  For Web applications statistical Learning and automatic control for datacenters  For data.

BLAST Sequence alignment, E-value & Extreme value distribution.

A new method of finding similarity regions in DNA sequences Laurent Noé Gregory Kucherov LORIA/UHP Nancy, France LORIA/INRIA Nancy, France Corresponding.

Fast and accurate short read alignment with Burrows–Wheeler transform

1 Omics Achim Tresch UoC / MPIPZ Cologne treschgroup.de/OmicsModule1415.html

High Throughput Sequencing Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520.

Sequence Alignment in DNA Under the Guidance of : Prof. Kolin Paul Presented By: Lalchand Gaurav Jain.

Next Generation Sequencing, Assembly, and Alignment Methods

Multithreaded FPGA Acceleration of DNA Sequence Mapping Edward Fernandez, Walid Najjar, Stefano Lonardi, Jason Villarreal UC Riverside, Department of Computer.

1 ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases Xiaochun Yang, Honglei Liu, Bin Wang Northeastern University, China.

Ultrafast and memory-efficient alignment of short DNA sequences to the human genome Ben Langmead, Cole Trapnell, Mihai Pop, Steven L Salzberg 林恩羽宋曉亞陳翰平.

June 3, 2015Windows Scheduling Problems for Broadcast System 1 Amotz Bar-Noy, and Richard E. Ladner Presented by Qiaosheng Shi.

GNANA SUNDAR RAJENDIRAN JOYESH MISHRA RISHI MISHRA FALL 2008 BIOINFORMATICS Clustering Method for Repeat Analysis in DNA sequences.

Authors: Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox Publish: HPDC'10, June 20–25, 2010, Chicago, Illinois, USA ACM Speaker: Jia Bao Lin.

Bowtie2: Extending Burrows-Wheeler-based read alignment to longer reads and gapped alignments Ben Langmead 1, 2, Mihai Pop 1, Rafael A. Irizarry 2 and.

Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler Ben Langmead Based on work by Juha Kärkkäinen.

Indexed Search Tree (Trie) Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.

The Extraction of Single Nucleotide Polymorphisms and the Use of Current Sequencing Tools Stephen Tetreault Department of Mathematics and Computer Science.

Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.

SNAP: Fast, accurate sequence alignment enabling biological applications Ravi Pandya, Microsoft Research ASHG 10/19/2014.

Ultrafast and memory-efficient alignment of short DNA sequences to the human genome Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg Center.

Sequence alignment, E-value & Extreme value distribution

Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers

MapReduce in the Clouds for Science CloudCom 2010 Nov 30 – Dec 3, 2010 Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox {tgunarat, taklwu,

Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy Sep 27 th,

Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.

Sequence Alignment and Phylogenetic Prediction using Map Reduce Programming Model in Hadoop DFS Presented by C. Geetha Jini (07MW03) D. Komagal Meenakshi.

Presented by Mario Flores, Xuepo Ma, and Nguyen Nguyen.

Genome & Exome Sequencing Read Mapping Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520.

BLAST What it does and what it means Steven Slater Adapted from pt.

Human SNPs from short reads in hours using cloud computing Ben Langmead 1, 2, Michael C. Schatz 2, Jimmy Lin 3, Mihai Pop 2, Steven L. Salzberg 2 1 Department.

Panagiotis Antonopoulos Microsoft Corp Ioannis Konstantinou National Technical University of Athens Dimitrios Tsoumakos.

MES Genome Informatics I - Lecture V. Short Read Alignment

Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.

High Throughput Sequence Analysis with MapReduce Michael Schatz June 18, 2009 JCVI Informatics Seminar.

BLAST: A Case Study Lecture 25. BLAST: Introduction The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters.

Aligning Reads Ramesh Hariharan Strand Life Sciences IISc.

TopHat Mi-kyoung Seo. Today’s paper..TopHat Cole Trapnell at the University of Washington's Department of Genome Sciences Steven Salzberg Center.

Introduction to Modeling and Algorithms in Life Sciences Ananth Grama Purdue University

Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College.

Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce Leonidas Akritidis Panayiotis Bozanis Department of Computer & Communication.

Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data.

Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.

Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.

CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.

Short Read Mapping On Post Genomics Datasets

Massive Semantic Web data compression with MapReduce Jacopo Urbani, Jason Maassen, Henri Bal Vrije Universiteit, Amsterdam HPDC ( High Performance Distributed.

Author: Haoyu Song, Murali Kodialam, Fang Hao and T.V. Lakshman Publisher/Conf. : IEEE International Conference on Network Protocols (ICNP), 2009 Speaker:

P.M. VanRaden and D.M. Bickhart Animal Genomics and Improvement Laboratory, Agricultural Research Service, USDA, Beltsville, MD, USA

Lecture 15 Algorithm Analysis

Page 1 A Platform for Scalable One-pass Analytics using MapReduce Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy SIGMOD 2011 IDS Fall Seminar 2011.

High Throughput Sequencing

RNAseq: a Closer Look at Read Mapping and Quantitation

1 BWT Arrays and Mismatching Trees: A New Way for String Matching with k Mismatches 1Yangjun Chen, 2Yujia.

Burrows-Wheeler Transformation Review

SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data - Aditi Thuse.

Indexing Graphs for Path Queries with Applications in Genome Research

VCF format: variants c.f. S. Brown NYU

CSC2431 February 3rd 2010 Alecia Fowler

Next-generation sequencing - Mapping short reads

Lecture 14 Algorithm Analysis

Maximize read usage through mapping strategies

BIOINFORMATICS Fast Alignment

Next-generation sequencing - Mapping short reads

CS 6293 Advanced Topics: Translational Bioinformatics

Presentation transcript:

Ultrafast and memory-efficient alignment of short reads to the human genome Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA Abstract Bowtie 1 is a fast and memory-efficient program for aligning short reads to mammalian genomes. Burrows-Wheeler indexing allows Bowtie to align more than 25 million 35-bp reads per CPU hour to the human genome in a memory footprint of as little as 1.1 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a quality-aware search algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve greater alignment speed. Bowtie is free, open source software available for download from Burrows-Wheeler Transformation Burrows-Wheeler Transform (BWT) Burrows-Wheeler Matrix Bowtie builds a genome index based in the Burrows-Wheeler Transformatopn (BWT) 2 and FM Index 3. The Burrows-Wheeler Transformation of a text T, BWT(T), is constructed as shown to the right. The Burrows- Wheeler Matrix of T is the matrix whose rows are all distinct cyclic rotations of T$ sorted lexicographically ($ is “less than” all other characters). BWT(T) is the sequence of characters in the last column of this matrix. An index for T consists of BWT(T) and some auxiliary data structures. For efficiency reasons, Bowtie indexes both the genome (the “forward” index) and its reverse (the “mirror” index). For the human genome, the total size of the Bowtie is smaller than other popular indexing schemes: Bowtie IndexSuffix TreeSuffix Array 1.1 gigabytes (2.2 incl. mirror index) >35 gigabytes>12 gigabytes k-mer Hash Tables the 2 nd occurrence of a in the last column… …corresponds to the same text character as the 2 nd occurrence of a in the first column The Burrows-Wheeler Matrix has a property called the LF mapping: the i th occurrence of character X in the last column corresponds to the same text character as the i th occurrence of X in the first column. This property underlies algorithms that use the BWT to navigate or search the text. LF Mapping Exact Matching Reversing the Transformation To recreate T from BWT( T ), start with i = 0 and T = BWT [0] and repeatedly apply rule 2 : T = BWT[ LF( i ) ] + T; i = LF( i ) Where LF( i ) maps row i to the row whose first character corresponds to i’s last character according to the LF Mapping: Final T To match a query string Q in text T using BWT(T), repeatedly apply rule 3 : top = LF(top, qc); bot = LF(bot, qc) Where qc is the next character in Q (right-to-left) and LF( i, qc) maps row i to the row whose first character corresponds to i’s last character as if it were qc In progressive rounds, top / bot delimit rows beginning with progressively longer suffixes of Q a a ac ca a a c aa a acaacg aaa acaacg aaa acaacg aaa To allow mismatches in alignments, Bowtie extends on the above algorithm by, when necessary, “backtracking” to previously matched query characters and making substitutions. Bowtie currently searches the space of legal alignments using a quality-aware, greedy, depth-first search. Other search strategies (e.g. best-first) are also possible. Sequence: Phred Quals: (higher number = higher confidence) GCCATACGGATTAGCC GCCATACGGACTAGCC GCCATACGGGCTAGCC Inexact Matching and Quality Awareness Bowtie Index is Small Above: Backtracking scenarios leading to 3 distinct 1-mismatch alignments for “aaa” Crossbow: Bowtie for Cloud ComputingResults References Bowtie index for human genome can be built on a computer with 2 GB of RAM in less than a day (right), though indexing is faster with more RAM (left). When aligning on computers with multiple processor cores, Bowtie’s throughput increases significantly. We align 8.84 million 35-bp reads from the 1000 Genomes project pilot (SRR001115) to the human genome (NCBI build 36.3). Bowtie 0.9.6, SOAP , and Maq are used. Each is set to report at most one alignment per read. Maq and SOAP are run with default parameters and Bowtie is run with parameters to mimic those programs’ defaults. Results are obtained on two platforms: a desktop workstation (“PC”) with 2.4 GHz Intel Core 2 processor and 2 gigabytes of RAM; and a large-memory server (“server”) with a 4-core 2.4 GHz AMD Opteron processor and 32 gigabytes of RAM. Bowtie makes sacrifices in terms of sensitivity and quality of alignments reported in order to achieve greater performance. For example, Bowtie is not guaranteed in all cases to report the “best” alignment for reads that align inexactly. Also, with its default settings, Bowtie may fail to align a small number of reads with valid inexact alignments. If stronger guarantees are desired, Bowtie supports options that increase accuracy and sensitivity at the cost of some performance. Bowtie’s sensitivity in terms of reads aligned is comparable to SOAP’s. Maq reports some 3- mismatch alignments along with 2-mismatch alignments, making it more sensitive than Bowtie. Maq and SOAP generally perform better with respect to Bowtie when (a) fewer reads align, (b) reads are lower quality, (c) reads are longer, but Bowtie still outperforms by at least an order of magnitude. Because Bowtie indexes are compact, they are easy to distribute over the Internet, store, and reuse. This amortizes the cost of building the index. Pre-built indexes for model organisms including human, chimp, mouse and dog can be downloaded from Bowtie supports standard FASTQ and FASTA input formats, and comes with a conversion program that allows Bowtie output to be used with Maq’s consensus generator and SNP caller. Bowtie does not yet find paired-end or gapped alignments, though both improvements are planned for the future. Bowtie can align reads as short as 4 bases and as long as 1024 bases. The input to a single run of Bowtie may comprise a mixture of reads with different lengths. In a typical Crossbow/AWS session, the client will: 1.Upload input data (reads in FASTQ format) to S3 2.Recruit EC2 cluster of N nodes & start Hadoop job 3.Wait while job runs 4.Download results (SNP calls) from S3 With the advent of cloud computing infrastructure like Hadoop 6 and Amazon Web Services 7 (AWS), it is now possible to solve highly data-intensive problems efficiently without owning or maintaining a computer cluster. Hadoop is a free open source software layer written in Java that allows programs adhering to the MapReduce 8 programming model to run on most clusters. AWS allows paying customers to rent large, Hadoop-compatible clusters on demand at a per-node hourly rate. Bioinformatics analyses cast in the Hadoop MapReduce model can exploit these technologies to achieve a high degree of portability, scalability, and convenience. Crossbow is a Bowtie-based variation-detection pipeline written in the Hadoop MapReduce model. Detecting genetic variations from a set of reads can be factored into a Map function: alignment, and a Reduce function: variant detection over a stretch of the reference. Crossbow’s Map function takes key/value pairs representing reads with qualities and aligns them using Bowtie. The Mapper’s output is 0 or more key/value pairs representing alignments for the read. The Reduce function takes the set of all alignments lying along a particular stretch of the reference, combines them into a multiple alignment, examines each column of the multiple alignment and outputs detected variants. In a recent experiment, Crossbow was used to align and call SNPs from 14.3x coverage worth of human Illumina reads in about 1 hour and 10 minutes on a 20-node cluster rented from Amazon EC2. This time includes steps 2-4; step 1 is very time-consuming, but that cost can be amortized by uploading reads incrementally as they are sequenced, or it can be eliminated entirely by running Crossbow on a local cluster. The experiment cost $32 to conduct. 1.Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. To appear. 2.Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA 1994, Technical Report Ferragina P, Manzini G: Opportunistic data structures with applications. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science. IEEE Computer Society; Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics 2008, 24(5): Dean, J. and Ghemawat, S MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), EC2 = Amazon’s Elastic Compute Cloud 9 service, S3 = Amazon’s Simple Storage Service 10. (empty range indicates there are no such alignments)