Qq q q q q q q q q q q q q q q q q q q Background: DNA Sequencing Goal: Acquire individual’s entire DNA sequence Mechanism: Read DNA fragments and reconstruct.

Slides:



Advertisements
Similar presentations
Indexing DNA Sequences Using q-Grams
Advertisements

Key idea: SHM identifies matching by incrementally shifting the read against the reference Mechanism: Use bit-wise XOR to find all matching bps. Then use.
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.
M. Waldvogel, G. Varghese, J. Turner, B. Plattner Presenter: Shulin You UNIVERSITY OF MASSACHUSETTS, AMHERST – Department of Electrical and Computer Engineering.
Fast and accurate short read alignment with Burrows–Wheeler transform
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Next Generation Sequencing, Assembly, and Alignment Methods
Multithreaded FPGA Acceleration of DNA Sequence Mapping Edward Fernandez, Walid Najjar, Stefano Lonardi, Jason Villarreal UC Riverside, Department of Computer.
1 ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases Xiaochun Yang, Honglei Liu, Bin Wang Northeastern University, China.
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome Ben Langmead, Cole Trapnell, Mihai Pop, Steven L Salzberg 林恩羽 宋曉亞 陳翰平.
An Efficient Parallel Approach for Identifying Protein Families from Large-scale Metagenomics Data Changjun Wu, Ananth Kalyanaraman School of Electrical.
Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.
Indexed Search Tree (Trie) Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
The Extraction of Single Nucleotide Polymorphisms and the Use of Current Sequencing Tools Stephen Tetreault Department of Mathematics and Computer Science.
Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays Nazif Cihan Tas CMSC 838 Presentation.
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg Center.
Recap Don’t forget to – pick a paper and – me See the schedule to see what’s taken –
Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers
Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy Sep 27 th,
Accelerating Read Mapping with FastHASH †† ‡ †† Hongyi Xin † Donghyuk Lee † Farhad Hormozdiari ‡ Samihan Yedkar † Can Alkan § Onur Mutlu † † † Carnegie.
A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora Benjamin Arai Computer Science and Engineering Department.
Presented by Mario Flores, Xuepo Ma, and Nguyen Nguyen.
Massively Parallel Mapping of Next Generation Sequence Reads Using GPUs Azita Nouri, Reha Oğuz Selvitopi, Özcan Öztürk, Onur Mutlu, Can Alkan Bilkent University,
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
Aligning Reads Ramesh Hariharan Strand Life Sciences IISc.
SIGNAL PROCESSING FOR NEXT-GEN SEQUENCING DATA RNA-seq CHIP-seq DNAse I-seq FAIRE-seq Peaks Transcripts Gene models Binding sites RIP/CLIP-seq.
@ Carnegie Mellon Databases Inspector Joins Shimin Chen Phillip B. Gibbons Todd C. Mowry Anastassia Ailamaki 2 Carnegie Mellon University Intel Research.
DNA alphabet DNA is the principal constituent of the genome. It may be regarded as a complex set of instructions for creating an organism. Four different.
JM - 1 Introduction to Bioinformatics: Lecture III Genome Assembly and String Matching Jarek Meller Jarek Meller Division of Biomedical.
BNFO 615 Usman Roshan. Short read alignment Input: – Reads: short DNA sequences (upto a few hundred base pairs (bp)) produced by a sequencing machine.
Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College.
Short Read Mapper Evan Zhen CS 124. Introduction Find a short sequence in a very long DNA sequence Motivation – It is easy to sequence everyone’s genome,
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
DES Analysis and Attacks CSCI 5857: Encoding and Encryption.
P.M. VanRaden and D.M. Bickhart Animal Genomics and Improvement Laboratory, Agricultural Research Service, USDA, Beltsville, MD, USA
Short read alignment BNFO 601. Short read alignment Input: –Reads: short DNA sequences (upto a few hundred base pairs (bp)) produced by a sequencing machine.
Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.
Computational Challenges in BIG DATA 28/Apr/2012 China-Korea-Japan Workshop Takeaki Uno National Institute of Informatics & Graduated School for Advanced.
From Reads to Results Exome-seq analysis at CCBR
Parallelization of mrFAST on GPGPU Hongyi Xin, Donghyuk Lee Milestone II.
Genome Read In-Memory (GRIM) Filter: Fast Location Filtering in DNA Read Mapping using Emerging Memory Technologies Jeremie Kim 1, Damla Senol 1, Hongyi.
RNAseq: a Closer Look at Read Mapping and Quantitation
FastHASH: A New Algorithm for Fast and Comprehensive Next-generation Sequence Mapping Hongyi Xin1, Donghyuk Lee1, Farhad Hormozdiari2, Can Alkan3, Onur.
Nanopore Sequencing Technology and Tools:
1Carnegie Mellon University
Homology Search Tools Kun-Mao Chao (趙坤茂)
Genomic Data Clustering on FPGAs for Compression
The short-read alignment in distributed memory environment
Genome Read In-Memory (GRIM) Filter Fast Location Filtering in DNA Read Mapping with Emerging Memory Technologies Jeremie Kim, Damla Senol, Hongyi Xin,
Nanopore Sequencing Technology and Tools:
Nanopore Sequencing Technology and Tools for Genome Assembly:
GateKeeper: A New Hardware Architecture
Jin Zhang, Jiayin Wang and Yufeng Wu
Genome Read In-Memory (GRIM) Filter:
Homology Search Tools Kun-Mao Chao (趙坤茂)
GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping
Accelerator Architecture in Computational Biology and Bioinformatics, February 24th, 2018, Vienna, Austria Exploring Speed/Accuracy Trade-offs in Hardware.
CSC2431 February 3rd 2010 Alecia Fowler
Next-generation sequencing - Mapping short reads
Accelerating Approximate Pattern Matching with Processing-In-Memory (PIM) and Single-Instruction Multiple-Data (SIMD) Programming Damla Senol Cali1, Zülal.
Maximize read usage through mapping strategies
BIOINFORMATICS Fast Alignment
Next-generation sequencing - Mapping short reads
CS 6293 Advanced Topics: Translational Bioinformatics
Homology Search Tools Kun-Mao Chao (趙坤茂)
By Nuno Dantas GateKeeper: A New Hardware Architecture for Accelerating Pre-Alignment in DNA Short Read Mapping Mohammed Alser §, Hasan Hassan †, Hongyi.
BitMAC: An In-Memory Accelerator for Bitvector-Based
SneakySnake: A New Fast and Highly Accurate Pre-Alignment Filter
Presentation transcript:

qq q q q q q q q q q q q q q q q q q q Background: DNA Sequencing Goal: Acquire individual’s entire DNA sequence Mechanism: Read DNA fragments and reconstruct it  Break DNA into pieces and store them as strings  Compare the strings to a known reference DNA string -- Search for matching coordinates in reference DNA  Stitch fragments together in corresponding order Difficulties: Individuals have mutations including  Mismatch, insertions and deletions; must tolerate Reference DNA Mis- match base pair (bp) Challenge of Next-generation DNA Sequencing Next-generation DNA Sequencing:  Instead of reading fewer long fragments, read many short fragments in parallel  This pushes the challenge to computation Challenge:  Shorter but many reads: billions of them  Mapping a fragment to entire reference genome is costly: cost does not reduce vs. a long fragment, and may increase for a shorter fragment  More potential mapping locations: harder to search for all possible matches in the reference DNA -- Even harder when mutations are allowed Requirement:  Algorithm that is fast and efficient which can process enormous amount of data Existing Mapping Tools Suffix tree or prefix tree based alignment tools:  Newer tools use Burrows-Wheeler transformation -- Bowtie, BWA, SOAPv2  Advantage -- Fast in finding the exact match without mutations  Disadvantages -- Very slow when mutations are allowed -- Not comprehensive: does not search for all possible locations Hash table based alignment tools:  Use hash table for filtering non-matching coordinates -- mrFAST, mrsFAST  Advantage -- Comprehensive, and fast when comprehensive  Disadvantage -- Slower in searching for just the exact match AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTT mrFAST Flow Chart AAAAAAAAAAAA CCCCCCCCCCCC TTTTTTTTTTTT Reference DNA Database Reference DNA Database AAAAAATAACAACCCCCCCCCCCCTTTTTTTTTTTT String Compare String Compare AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTCGAT AAAAAAAAAAAACCCCCCCCTCCCTTTTTTTTTTTT Hash Table (HT) Stores coordinates (coord.) of segments in reference DNA Hash Table (HT) Stores coordinates (coord.) of segments in reference DNA Divide fragment into segments 2.Check HT to get segments’ coordinates in reference DNA 3.Retrieve reference DNA strings at the coordinates 4.Compare fragment to reference DNA strings String compare: Compare every base pair  very slow mrFAST: Two Key Components Hash table (HT):  Stores coordinates of segments in reference DNA AAAAAAAAAAAA AAAAAAAAAAAC AAAAAAAAAAAG AAAAAAAAAAAT TTTTTTTTTTTTTT … Segments Coordinate list String Compare:  Compare input fragment against reference DNA  Check for mutations: mismatches, insertions and deletions (allow e mutations)  Need to compare every base pair  very slow String comparisons take too long  95% of execution time Our First Observation Problem with mrFAST:  Slow: 5 hours to process 1M fragments (108 bp) Our goal:  Reduce the execution time while maintaining comprehensiveness FastHASH Overview: Two key components:  Adjacency Filtering: Reject obviously non-matching coordinates at early stage to avoid unnecessary expensive string comparisons  Cheap segment selection: Reduce the absolute number of coordinates that are subject to examination Current Result:  38x speedup for 1M fragments compared to mrFAST Most string comparisons are useless: result in no match …AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTTT… AAAAAAAAAAAACCCCCCCCCCCCTTTTTTTTTTTT Input string Reference string m m m+12 m+24 Observation: If perfect match, consecutive segments should be at consecutive coordinates! Idea: For a coordinate, check if consecutive coordinates are in the coordinate lists of consecutive segments  If yes  Do string comparison  If no  No need for string comparison m m m m m+12 m+24 TTTTTTTTTTTT CCCCCCCCCCCC AAAAAAAAAAAA ? ? ? Do > e coordinate lists contain consecutive coordinates? Goal: Reduce the number of string comparisons Adjacency Filtering (AF) Effect of Adjacency Filtering String comparisons are drastically reduced: 3.7x speedup Cheap Segment Selection (CSS) Observation: Hash table is imbalanced  Cheap segments: Segments that have few coordinates in hash table  Expensive segments: Segments that have many coordinates  lead to slow execution during AF Idea: Select cheapest segments within a fragment  Selecting the cheapest e+1 segments guarantees comprehensiveness (at least one has no error) Example: If e = 1, select the cheapest 2 segments Effect of CSS: The number of coordinates examined AAAAAAAAAAAAACGTAACCTTAAAACCCATTTACC ExpensiveCheapCheapest Run time (s) Conclusions  Adjacency Filtering provides 3.7x speedup  Adjacency Filtering + Cheap Segment Selection provides 38x speedup CPU Execution Time 100% 6.4% Input fragment set:  Fragment length: 108 base-pairs  Fragment size: 1 million  Number of errors: 3 mismatches, insertions or deletions Intel i / 16 GB DRAM Preliminary GPU Execution Time of FastHASH Run time (s) Nvidia Tesla C2070 Conclusion  GPU provides 1.44x speedup (early result) Ongoing work  Schedule work better on GPU for higher speedup FastHASH: A New Algorithm for Fast and Comprehensive Next-generation Sequence Mapping Hongyi Xin 1, Donghyuk Lee 1, Farhad Hormozdiari 2, Can Alkan 3, Onur Mutlu 1 1 Departments of Computer Science and Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 2 Department of Computer Science, University of California Los Angeles, CA 3 Department of Genome Sciences, University of Washington, Seattle, WA Next-generation DNA Sequencing and the State-of-the-art Sequence Mapping Tools FastHASH mrFAST Preliminary Results fragment segments coordinates Reference strings  Each segment looked up in HT to get coordinate list  For each coordinate in the list, look up reference string  expensive n n n n+25 … … Adjacency Filtering becomes the bottleneck We can speed this up by avoiding the probing of long coordinate lists Our Second Observation coordinate coordinate list Our Goal and FastHASH Fragment