Read Alignment Algorithms. www.strandls.com

Presentation transcript:

Read Alignment Algorithms

The Problem
Given a very long reference sequence of length n and several short strings (reads) of length m each, with m << n, find the best matching location for each read in the reference. The best location is the one that minimizes the number of mismatches, provided the number of mismatches is at most, say, 5% of m. We ignore insertions and deletions for the moment; those will come later.
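As a baseline for the problem statement above, here is a minimal brute-force sketch in Python. The function names and the `max_fraction` parameter are hypothetical, chosen only for illustration; the scan takes time proportional to n times m per read, which is exactly the cost an index is meant to avoid.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def best_location(reference: str, read: str, max_fraction: float = 0.05):
    """Return (position, mismatches) for the placement with the fewest
    mismatches, or None if even that placement exceeds the budget.

    max_fraction is an illustrative parameter standing in for the
    "at most, say, 5% of m" condition.
    """
    m = len(read)
    budget = int(max_fraction * m)
    best = None
    for pos in range(len(reference) - m + 1):
        d = hamming(reference[pos:pos + m], read)
        if best is None or d < best[1]:
            best = (pos, d)
    if best is None or best[1] > budget:
        return None
    return best


# One mismatch in a read of length 8, with a deliberately loose 25% budget.
print(best_location("GGGGACGTACGTGGGG", "ACGTACGA", max_fraction=0.25))  # (4, 1)
```

Ties are broken in favour of the leftmost position; a real aligner would report or score multiple equally good locations.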

Indexing the Reference
What if we do not allow any mismatches at all? Pre-process the reference sequence so that each query (finding the best matching location of a read) can be answered in time proportional to m and independent of n. The resulting data structure is called an index. Suffix trees are one possible index: a suffix tree is a compacted trie of all suffixes of the reference sequence, with a $ marker appended at the end.
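To make the idea of an index concrete, the following sketch builds a plain (uncompacted) trie of all suffixes using nested dictionaries. This is an assumption-laden simplification: it uses quadratic space and is only usable on toy inputs, whereas a real suffix tree compacts unary paths. The names `build_suffix_trie` and `find_exact` are hypothetical.

```python
def build_suffix_trie(reference: str) -> dict:
    """Trie of all suffixes of reference + '$', as nested dicts.

    Each leaf stores the starting position of its suffix under the
    key "pos" (single-character keys are edges, so there is no clash).
    """
    text = reference + "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start
    return root


def find_exact(trie: dict, read: str) -> list:
    """Positions in the reference where read occurs exactly."""
    node = trie
    for ch in read:  # m steps, independent of n
        if ch not in node:
            return []
        node = node[ch]
    # Every leaf below this node is a suffix that starts with the read.
    positions, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "pos":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("CGACG")
print(find_exact(trie, "CG"))  # [0, 3]
```

Walking down from the root costs one step per read character; only the final step of listing all occurrences depends on how many there are.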

Suffix Trees
[Figure: a suffix tree for the reference (labelled CGACG on the slide), with edges labelled by A, C, G and T, and a query traced down from the root.]

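Space Required by Suffix Trees
At most n-1 internal nodes plus n leaves, so at most 2n-1 nodes. That means 2n-2 tree pointers plus n pointers into the reference, so ~3n pointers: 36GB! (This figure is consistent with n of about 3 billion and 4-byte pointers.) Can we make this smaller?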

Indexing the Reference with Mismatches
What if we allow mismatches? We put the query through the suffix tree but get stuck and cannot proceed further. Next, we resume after dropping the first character of the query, but without redoing the work already done. How?

Suffix Links in Suffix Trees
[Figure: a suffix tree for the reference (labelled CGACG on the slide) with suffix links added, and a query traced through it.]

Indexing with Mismatches (Contd)
Take an internal node A, where x is the string leading down from the root to A and A branches into xa and xb. Let x = cy, where c is the first character of x. Then there exists a node B such that y is the string leading down from the root to B, and the suffix link from A leads to B. So if you get stuck, you follow the suffix link in constant time and continue from where you left off, to find the longest perfect-match substring starting at each position in the read. Alternatively, find all substrings of a certain minimum length that match. Then check explicitly for the number of mismatches at each of these locations.
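The last two steps (find substrings of a minimum length that match exactly, then check mismatches explicitly) can be sketched without a suffix tree. The example below swaps in a dictionary of all length-k substrings of the reference as the exact-match index; that substitution, the parameter k and the function names are assumptions made for brevity, not the suffix-link method described above.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every length-k substring of the reference to its positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def align_with_mismatches(reference, read, index, k, max_mismatches):
    """Seed with exact length-k matches, then verify each candidate.

    Returns (start, mismatches) for the best verified placement,
    or None if no candidate is within max_mismatches.
    """
    m = len(read)
    candidates = set()
    for offset in range(m - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset  # where the whole read would begin
            if 0 <= start <= len(reference) - m:
                candidates.add(start)
    best = None
    for start in sorted(candidates):
        window = reference[start:start + m]
        d = sum(1 for a, b in zip(window, read) if a != b)
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (start, d)
    return best


reference = "GGGGACGTACGTGGGG"
index = build_kmer_index(reference, 4)
print(align_with_mismatches(reference, "ACGTACGA", index, 4, 1))  # (4, 1)
```

Only placements that share at least one exact seed with the read are verified, so a read whose mismatches break every length-k window would be missed; choosing k relative to the mismatch budget controls that risk.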

Space Required by Suffix Trees & Links
n-1 internal nodes plus n leaves, so 2n-1 nodes, plus n-1 suffix links. That gives 2n-2 tree pointers and n-1 suffix-link pointers (3n-3 in all), plus n pointers into the reference, so ~4n pointers: 48GB! Can we make this smaller? Can we fit this tree into an array?

A Succinct Data Structure
The Reference: CGAC$
All circular shifts, sorted lexicographically as in the Burrows-Wheeler Transform (here $ sorts after A, C, G, T):
AC$CG
CGAC$
C$CGA
GAC$C
$CGAC
Store only the first and last columns and the links back to the reference. Used in bzip.

A Succinct Data Structure
The Reference: CGAC$
Sorted circular shifts: AC$CG, CGAC$, C$CGA, GAC$C, $CGAC
The reference can be reconstructed from the first and last columns. Claim: the ith G in the first column corresponds to the ith G in the last column! Likewise for A, C and T.
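The claim is enough to rebuild the reference from the two columns alone, as the sketch below illustrates. It sorts rotations naively (fine for toy inputs only), and the `ORDER` table that places $ after A, C, G, T is an assumption read off the row order shown on the slide; all function names are hypothetical.

```python
# Assumed character order: '$' sorts last, matching the slide's table.
ORDER = {"A": 0, "C": 1, "G": 2, "T": 3, "$": 4}


def first_and_last_columns(reference: str):
    """Sort all circular shifts of reference + '$'; return the two columns."""
    text = reference + "$"
    rows = sorted((text[i:] + text[:i] for i in range(len(text))),
                  key=lambda row: [ORDER[c] for c in row])
    return "".join(r[0] for r in rows), "".join(r[-1] for r in rows)


def occurrence_numbers(column: str):
    """Label each character with its occurrence number: ('C', 2) = 2nd C."""
    seen, labelled = {}, []
    for ch in column:
        seen[ch] = seen.get(ch, 0) + 1
        labelled.append((ch, seen[ch]))
    return labelled


def reconstruct(first: str, last: str) -> str:
    """Rebuild the reference using only the claim: the i-th occurrence of a
    character in the last column is the i-th occurrence in the first."""
    row_in_first = {key: row for row, key in enumerate(occurrence_numbers(first))}
    last_labels = occurrence_numbers(last)
    row = first.index("$")  # this row is '$' followed by the reference
    recovered = []
    for _ in range(len(first) - 1):
        label = last_labels[row]      # character just before this row's start
        recovered.append(label[0])
        row = row_in_first[label]     # jump to the row starting with it
    return "".join(reversed(recovered))


first, last = first_and_last_columns("CGAC")
print(first, last)               # ACCG$ G$ACC
print(reconstruct(first, last))  # CGAC
```

The walk recovers the reference from right to left, one character per step, which is why the result is reversed at the end.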

Proof of Claim
• yG < xG if and only if Gy < Gx; that's it!
• So given a G in the first column, say corresponding to the string Gx:
– Its rank r is trivial to find because the first column is sorted; just store counts for all 4 characters
– We need to locate the corresponding G in the last column
– In other words, the index of the string xG in the table
– Which is the rth G in the last column [The Select Query]
• So given a G in the last column, say corresponding to the string xG:
– Find its rank r among G's in the last column [The Rank Query]
– We need to locate the corresponding G in the first column
– In other words, the index of the string Gx in the table
– Which is the rth G in the first column, trivial to find

Select and Rank Queries
• Given a binary array:
– SELECT: given index i, find the ith 1
– RANK: given index i, find how many 1s precede this location
• Use a separate array for each of the 4 characters
• RANK is easy: just keep counts at milestones spaced Δ apart and answer queries by traversing to the nearest milestone in time Δ
– 4n/Δ bytes of storage, O(Δ) time
• SELECT needs a bit more: keep milestones at every Δth rank (Δ-rank milestones)
– Go to the nearest rank milestone and traverse from there
– May need to traverse quite a bit though
– So need an extra data structure to get to the next 1, which you store at Δ milestones
– So 8n/Δ bytes of storage, O(Δ) time
• Of course we need the 4 n-bit binary arrays as well
• So 4n bits + 48n/Δ bytes and O(Δ) time
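A minimal sketch of the milestone idea for a single binary array follows. The class and attribute names are hypothetical, and the select side is deliberately simplified: it scans forward from a rank milestone and omits the extra "jump to the next 1" structure mentioned above, so its scan can be long when 1s are sparse.

```python
class MilestoneBitVector:
    """Binary array with counts stored every delta positions (for RANK)
    and positions stored for every delta-th 1 (for SELECT)."""

    def __init__(self, bits, delta):
        self.bits = list(bits)
        self.delta = delta
        self.ones_before = [0]   # ones_before[k] = 1s in bits[0 : k*delta]
        self.select_marks = []   # select_marks[k] = index of the (k*delta + 1)-th 1
        count = 0
        for i, bit in enumerate(self.bits):
            if bit:
                if count % delta == 0:
                    self.select_marks.append(i)
                count += 1
            if (i + 1) % delta == 0:
                self.ones_before.append(count)

    def rank(self, i):
        """Number of 1s strictly before index i (0 <= i <= len(bits))."""
        k = i // self.delta
        count = self.ones_before[k]
        for j in range(k * self.delta, i):  # fewer than delta steps
            count += self.bits[j]
        return count

    def select(self, r):
        """Index of the r-th 1, counting from 1; assumes that many 1s exist."""
        pos = self.select_marks[(r - 1) // self.delta]
        remaining = (r - 1) % self.delta
        while remaining:  # scan forward past 'remaining' further 1s
            pos += 1
            if self.bits[pos]:
                remaining -= 1
        return pos


vector = MilestoneBitVector([1, 0, 1, 1, 0, 0, 1, 0, 1, 1], delta=3)
print(vector.rank(4), vector.select(4))  # 3 6
```

Larger Δ shrinks both milestone tables and lengthens the scan, which is the space/time dial the slide's final totals express.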

String Matching using Rank-Selects
• Given a string Gx
• Assume inductively we have the band B of indices in the table corresponding to suffixes that begin with x
• We want the band B' of suffixes that begin with Gx
• Take the band B, look at the last column, identify the ranks of the first and last G in the last column within the band, and find their corresponding first column indices; that's the band
– All doable using RANK alone
• At the end you have the band containing all suffixes which begin with Gx
• Unless of course there are none, in which case the band will vanish at some point
• We can use this to find matches for, say, all length 16 substrings of a read
• So 4n bits + 48n/Δ bytes and O(mΔ) time per read
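The band update can be sketched as follows. For brevity the rank of a character is computed by counting directly in a prefix of the last column, which stands in for the milestone-based RANK; that shortcut, the `ORDER` table placing $ last, and the function names are all assumptions for illustration.

```python
# Assumed character order: '$' sorts last.
ORDER = {"A": 0, "C": 1, "G": 2, "T": 3, "$": 4}


def build_last_column(reference: str) -> str:
    """Last column of the sorted table of circular shifts (naive sort)."""
    text = reference + "$"
    rows = sorted((text[i:] + text[:i] for i in range(len(text))),
                  key=lambda row: [ORDER[c] for c in row])
    return "".join(r[-1] for r in rows)


def band_for(pattern: str, last: str):
    """Half-open band (lo, hi) of table rows that begin with pattern,
    or None if the band vanishes. Pattern is assumed to use A, C, G, T."""
    # first_row[c] = first table row whose first-column character is c.
    first_row, total = {}, 0
    for c in sorted(ORDER, key=ORDER.get):
        first_row[c] = total
        total += last.count(c)
    lo, hi = 0, len(last)  # the empty string matches every row
    for c in reversed(pattern):  # extend the match one character leftwards
        lo = first_row[c] + last[:lo].count(c)  # RANK of c before lo
        hi = first_row[c] + last[:hi].count(c)  # RANK of c before hi
        if lo >= hi:
            return None
    return lo, hi


last = build_last_column("CGACG")
print(last)                   # G$ACCG
print(band_for("CG", last))   # (1, 3)
print(band_for("CGC", last))  # None
```

The pattern is consumed from its last character to its first, so each step prepends one character to the string whose band is already known.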

Identifying Indices in the Reference
• We still have to go from a band in the table to indices in the reference
• 4n bytes if we store them explicitly
• We can use the same trick: store explicitly at Δ milestones
• Then, if a row holds the string Gx and corresponds to index i in the reference, we can move to the row holding xG, which corresponds to index i+1, and so on till we get to a milestone
• 4n/Δ bytes storage
• Time per index is O(Δ)
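A small sketch of the milestone walk is given below. It stores the "move to the row for index i+1" step as an explicit table `next_row`, which a real index would instead compute from rank and select queries; that table, the `ORDER` map placing $ last, and the function names are assumptions made to keep the example self-contained.

```python
# Assumed character order: '$' sorts last.
ORDER = {"A": 0, "C": 1, "G": 2, "T": 3, "$": 4}


def build_locator(reference: str, delta: int):
    """Return (next_row, milestones, n) for the sorted table of shifts.

    next_row[r]  : row of the shift starting one position later in the text
                   (stored explicitly here; a stand-in for rank/select).
    milestones   : row -> reference index, kept only where index % delta == 0.
    """
    text = reference + "$"
    n = len(text)
    starts = sorted(range(n),
                    key=lambda i: [ORDER[c] for c in text[i:] + text[:i]])
    row_of = {start: row for row, start in enumerate(starts)}
    next_row = [row_of[(start + 1) % n] for start in starts]
    milestones = {row: start for row, start in enumerate(starts)
                  if start % delta == 0}
    return next_row, milestones, n


def locate(row, next_row, milestones, n):
    """Reference index of a table row, found by walking to a milestone."""
    steps = 0
    while row not in milestones:  # at most delta steps
        row = next_row[row]
        steps += 1
    return (milestones[row] - steps) % n


next_row, milestones, n = build_locator("CGACG", delta=2)
print([locate(r, next_row, milestones, n) for r in range(n)])
```

Index 0 is always a milestone, so a walk that runs off the end of the text wraps around and stops there.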

Sorting Circular Shifts
• It remains to describe the construction of the table in the first place
• Given a string S = x_0 x_1 x_2 ... $:
– Consider the string S' of triplets starting at positions 0 and 1 mod 3: (x_0 x_1 x_2) (x_3 x_4 x_5) ... followed by (x_1 x_2 x_3) (x_4 x_5 x_6) ...
– Note (x_2 x_3 x_4) and other triplets starting at 2 mod 3 are missing
– Rename S' in an order-preserving way, so identical triplets get the same number and distinct triplets get different numbers
– Recursively sort the suffixes of S'
• How does x_0 x_1 x_2 ... compare to x_1 x_2 x_3 ...?
– Already available from the recursion
• How does x_0 x_1 x_2 ... compare to x_2 x_3 x_4 ...?
– Compare x_0 with x_2, and then x_1 x_2 ... with x_3 x_4 ...
– We have info for comparing all pairs of suffixes!
– Sort the 2 mod 3 suffixes and then merge them in
– Time T(n) = T(2n/3) + O(n), which is O(n)
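For the running time, a short derivation under two stated assumptions: there is a single recursive call, on the renamed string S' whose length is about 2n/3, and the renaming, the sorting of the 2 mod 3 suffixes and the merge together cost at most cn for some constant c (for example with radix sort). Unrolling the recurrence gives a geometric series:

```latex
T(n) \le T\!\left(\tfrac{2n}{3}\right) + cn
     \le cn \left( 1 + \tfrac{2}{3} + \left(\tfrac{2}{3}\right)^{2} + \cdots \right)
     = \frac{cn}{1 - \tfrac{2}{3}}
     = 3cn
     = O(n)
```

The total work is therefore linear in n, because each level of recursion handles a string only two thirds as long as the previous one.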

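A Generalization: Difference Covers
[Figure: the string marked at positions v, 2v, 3v, ..., with a set D of indices mod v selected in each block.]
Take a set D of indices mod v. The string built from these indices has size |D|n/v, and the time taken to create it is O(n |D|). Sorting suffixes of this string gives the sorted order of all suffixes which begin at indices j such that j mod v is in D.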

A Generalization: Difference Covers
D is a Difference Cover if the distances between beads in D generate 0, 1, ..., v-1. Then, for any 2 indices i and j, (i-j) mod v is the distance between some two beads in D, so some shift x < v moves both i and j onto beads in D.
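The definition is easy to check mechanically for small v. The sketch below (function names hypothetical) tests whether a set is a difference cover and finds the shift x for a given pair of indices by trying every value below v.

```python
def is_difference_cover(D, v) -> bool:
    """True if the differences between elements of D, taken mod v,
    produce every value 0, 1, ..., v-1."""
    residues = {d % v for d in D}
    differences = {(a - b) % v for a in residues for b in residues}
    return differences == set(range(v))


def common_shift(i, j, D, v):
    """Smallest x in [0, v) such that i + x and j + x both land in D mod v,
    or None if no such shift exists."""
    residues = {d % v for d in D}
    for x in range(v):
        if (i + x) % v in residues and (j + x) % v in residues:
            return x
    return None


print(is_difference_cover({0, 1}, 3))     # True
print(is_difference_cover({1, 2, 4}, 7))  # True
print(is_difference_cover({0, 1}, 4))     # False
print(common_shift(2, 5, {0, 1}, 3))      # 1
```

With D = {0, 1} and v = 3 this matches the triplet construction used for sorting: suffixes starting at 0 and 1 mod 3 are sorted recursively, and any two suffixes can be compared after shifting both by fewer than 3 characters.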

A Generalization: Difference Covers
There exists a Difference Cover of size 1.5 * sqrt(v)!

Thank you