Short Read Mapper
Evan Zhen, CS 124

Introduction
– Goal: find a short sequence in a very long DNA sequence
– Motivation: it is easy to sequence everyone’s genome, but finding where a particular short subsequence is located is not an easy task

Problem
– Treat it as a standard string search problem, except that the strings contain only the characters A, T, C, G
– Given a short string S of length L and a very large reference string R, find all positions in R where S occurs with at most D mismatches across its L characters

Solution 1 – simple search
– Match substring S against reference string R, character by character
– If a character mismatches, ignore the character in R and keep comparing
– If all characters match with at most D mismatches, return the position in R
– Example:
    R - GGCTACCTTTTAACGATC
    S - TACCTTTT
    Match at position 0? No
    Match at position 1? No
    Match at position 2? No
    Match at position 3? Yes
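The simple search can be sketched in Java, the language the slides say was used. This is a minimal sketch rather than the author's code: it assumes a "mismatch" means a substituted character, as in the Problem slide, and the names `SimpleSearch` and `findMatches` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SimpleSearch {
    /** Returns every start position in r where s aligns with at most d mismatched characters. */
    static List<Integer> findMatches(String r, String s, int d) {
        List<Integer> positions = new ArrayList<>();
        // Slide s along r one position at a time.
        for (int start = 0; start + s.length() <= r.length(); start++) {
            int mismatches = 0;
            // Stop comparing early once the mismatch budget d is exceeded.
            for (int i = 0; i < s.length() && mismatches <= d; i++) {
                if (r.charAt(start + i) != s.charAt(i)) {
                    mismatches++;
                }
            }
            if (mismatches <= d) {
                positions.add(start);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // The example from the slide, with no mismatches allowed.
        System.out.println(findMatches("GGCTACCTTTTAACGATC", "TACCTTTT", 0)); // prints [3]
    }
}
```

Every search rescans all of R, so the cost is roughly length(R) × L character comparisons per query in the worst case.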

Solution 2 – map + index
– Map the reference string R into indexes
– Let p = partition string, where length(p) < L
– For every position in R, store that position together with the string p that starts there
– With this map, searching for S is just a lookup
– To compensate for mismatches, change characters in S and search again
    – Example: instead of searching for “AAT”, search for “GAT”
– Purpose: the same map serves multiple searches, so R does not need to be processed more than once
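The "change characters in S and search again" step could look something like the following. The slides do not say how the variants were generated, so this sketch assumes the simplest case of exactly one substituted character; `SubstitutionVariants` and `singleSubstitutions` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class SubstitutionVariants {
    private static final char[] ALPHABET = {'A', 'C', 'G', 'T'};

    /** Returns every string that differs from s by exactly one substituted character. */
    static List<String> singleSubstitutions(String s) {
        List<String> variants = new ArrayList<>();
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char original = chars[i];
            for (char c : ALPHABET) {
                if (c != original) {
                    chars[i] = c;
                    variants.add(new String(chars));
                }
            }
            chars[i] = original; // restore before moving to the next position
        }
        return variants;
    }

    public static void main(String[] args) {
        // "AAT" has 3 positions x 3 alternative bases = 9 variants, including "GAT".
        System.out.println(singleSubstitutions("AAT"));
    }
}
```

Under this assumption a string of length n yields 3n variants for a single mismatch, and the count grows quickly as more mismatches are allowed, so each query turns into many lookups.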

Solution 2 – map + index
– Structure of map: hashtable
    – Key = partition string p
    – Value = list of all positions of p in R
– Building the map
    – Read R, character by character
    – At each position, store the position under the key p that starts there
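A minimal sketch of building such a map in Java, assuming a `HashMap` from each length-k substring of R to the list of positions where it starts. The class name `PartitionIndex`, the method `build` and the parameter `k` (standing for length(p)) are hypothetical, not taken from the original implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PartitionIndex {
    /** Maps every length-k substring of r to the list of positions where it starts. */
    static Map<String, List<Integer>> build(String r, int k) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int pos = 0; pos + k <= r.length(); pos++) {
            String p = r.substring(pos, pos + k);
            // Create the position list on first sight of p, then append this position.
            index.computeIfAbsent(p, key -> new ArrayList<>()).add(pos);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = build("GGCTACCTTTTAACGATC", 5);
        System.out.println(index.get("GGCTA")); // prints [0]
        System.out.println(index.get("CTACC")); // prints [2]
    }
}
```

The map holds one entry per position of R, each with its own substring key, which is where the memory cost of this approach comes from.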

Solution 2 – map + index
– Example:
    R - GGCTACCTTTTAACGATC
    p - length = 5

    Key (p)   Value (positions)
    GGCTA     0
    GCTAC     1
    CTACC     2
    ...

    S - CTACCTTTTA
– To find S, break S into partitions of length(p), look up each partition, and check that the positions found are offset from one another the same way the partitions are in S
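The lookup step can be sketched as follows for the exact-match case only; mismatch handling is left out. It assumes S splits into non-overlapping partitions of length k and that a hit requires partition j to appear at start + j × k. `IndexSearch`, `build` and `find` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IndexSearch {
    /** Maps every length-k substring of r to the list of positions where it starts. */
    static Map<String, List<Integer>> build(String r, int k) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int pos = 0; pos + k <= r.length(); pos++) {
            index.computeIfAbsent(r.substring(pos, pos + k), key -> new ArrayList<>()).add(pos);
        }
        return index;
    }

    /** Returns the positions where s occurs exactly, using only lookups in the index. */
    static List<Integer> find(Map<String, List<Integer>> index, String s, int k) {
        if (s.isEmpty() || s.length() % k != 0) {
            throw new IllegalArgumentException("length of s must be a positive multiple of k");
        }
        List<Integer> hits = new ArrayList<>();
        // Each position of the first partition is a candidate start for s.
        for (int start : index.getOrDefault(s.substring(0, k), List.of())) {
            boolean consistent = true;
            // Every later partition must be found exactly `offset` characters after start.
            for (int offset = k; offset < s.length(); offset += k) {
                List<Integer> positions =
                        index.getOrDefault(s.substring(offset, offset + k), List.of());
                if (!positions.contains(start + offset)) {
                    consistent = false;
                    break;
                }
            }
            if (consistent) {
                hits.add(start);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = build("GGCTACCTTTTAACGATC", 5);
        // S splits into CTACC (found at 2) and TTTTA (found at 7 = 2 + 5).
        System.out.println(find(index, "CTACCTTTTA", 5)); // prints [2]
    }
}
```

The length check reflects the limitation that S must break into equal partition sizes. `positions.contains` is a linear scan; storing positions in a set would make that check faster at the cost of more memory.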

Comparison
– Simple search
    – Pro: easy to implement
    – Con: can potentially be very slow
– Map + index
    – Pro: faster than simple search when performing multiple searches on the same R
    – Con: hard to implement
    – Con: can potentially require a lot of memory for storing the indexes

Why do these solutions work?
– Both search for a subsequence in a larger sequence
– Both handle mismatches
    – Simple search: ignores characters in R (i.e. handles “insertion” types)
    – Map + index: because the map is partitioned, insertion types are hard to detect, so the subsequence is adjusted instead (i.e. handles “mutation” types)

Implementation
– Nothing hard-coded: constants such as the required length of S, the length of partitions and the max number of mismatches can easily be changed
– Used Java: later realized it was a bad idea, but by then it was too late to rewrite in a different language
– Simple search: easily implemented
– Map + index
    – Map easily handled with a hashtable
    – Accounting for mismatches was challenging

Analysis
– Used Nick’s simulator to generate a reference sequence 1 million characters long
    – Generating map: ~15 sec
    – Simple search: ~0.2 sec
    – Index search: ~0.22 sec
– Odd that the index search is slower
    – Possible reason: the way I handle mismatches

Conclusion
– Limitations
    – Subsequence S must break down into equal partition sizes for the map
    – Because it’s written in Java, possible memory limitations
– To-do
    – Find a better way to handle mismatches in the index search
– Future work
    – Different algorithms
    – Rewrite in a different language

Thank You