Searching strings using the waves An efficient index structure for string databases Ingmar Brouns Jacob Kleerekoper.



Overview Introduction A lower bound on the edit distance What is a wavelet? A refinement of the lowerbound MRS index Searching using the MRS index Questions

String searching Searching in a string database S = {s_1, …, s_d}. Range search: find all substrings of S within distance r of a search string q. The error rate is denoted by ε = r / |q|. k-nearest-neighbor search: find the k substrings of S closest to q.

Question from Adriano What is the query range r in ε = r / |q|? In a range search, r is the maximum edit distance a database substring may have from q and still be a valid result.

f(s) is a frequency vector. Σ is an alphabet of σ characters and s is a string over Σ. f(s) = [v_1, …, v_σ], where v_i is the frequency of the i-th letter of Σ in s. The sum of v_1, …, v_σ equals the length of s.

Example of f(s) Σ = {A, C, G, T}, s = CTACATCGATCGATCAG. #A = 5, #C = 5, #G = 3, #T = 4, so f(s) = [5, 5, 3, 4]. Sum of v_1, …, v_σ = length of s = 17.
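
The frequency vector can be computed in a few lines. This is a minimal sketch; the function name `frequency_vector` and the default alphabet order "ACGT" are assumptions made for illustration, not names from the article.

```python
from collections import Counter

def frequency_vector(s, alphabet="ACGT"):
    """Return f(s): the count of each alphabet letter in s, in alphabet order."""
    counts = Counter(s)
    unknown = set(counts) - set(alphabet)
    if unknown:
        raise ValueError(f"letters outside the alphabet: {sorted(unknown)}")
    return [counts[ch] for ch in alphabet]

s = "CTACATCGATCGATCAG"
f = frequency_vector(s)
print(f)                  # [5, 5, 3, 4]
print(sum(f) == len(s))   # True: the entries always sum to |s|
```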

Question from Laurence “As a result of lemma 1, the transformation of a string of length n lie on the σ-1 dimensional plane that passes through the point [n, 0, …, 0] and is perpendicular to the normal vector [1, …, 1]” Why is that the case?

Answer to Laurence Take a string s with |s| = 4 and an alphabet Σ = {A, B, C}. This string might be AAAA, BBBB, CCCC, or any of the 78 other strings of length 4 over Σ. The f(s) of the first three are [4, 0, 0], [0, 4, 0] and [0, 0, 4]. Every f(s) lies on the same 2D plane in 3D space, with equation v_1 + v_2 + v_3 = 4, whose normal vector is [1, 1, 1]. This holds in general for every alphabet size σ and string length n: the plane v_1 + … + v_σ = n always has normal vector [1, …, 1] and passes through the point [n, 0, …, 0].
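
Written as one line, the argument is that "the entries sum to n" is the same statement as "f(s) minus the point [n, 0, …, 0] is perpendicular to [1, …, 1]". The notation follows the slides; only the dot-product form is added here.

```latex
\[
  \sum_{i=1}^{\sigma} v_i = n
  \quad\Longleftrightarrow\quad
  \bigl(f(s) - [n, 0, \ldots, 0]\bigr) \cdot [1, 1, \ldots, 1] = 0
\]
```

For example, with σ = 3 and n = 4, the vector [2, 1, 1] gives (2 - 4) + 1 + 1 = 0.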

Edit operations on s f(s) = [v_1, ..., v_σ]. Insert: v_i := v_i + 1. Delete: v_i := v_i - 1. Replace: v_i := v_i + 1 and v_j := v_j - 1 with i ≠ j.

The σ-dimensional space The space in which all possible points [v_1, ..., v_σ] exist. Take points u and v in the σ-dimensional space. We call u and v neighbors if u can be obtained from v using one edit operation.

Frequency distance FD_1 Take u and v, the frequency vectors of two strings over the same alphabet (points in the σ-dimensional space). The frequency distance FD_1(u, v) between u and v is the minimum number of steps needed to get from u to v, jumping to a neighbor point at each step.

Frequency distance vs. edit distance The edit distance ED(s_1, s_2) of strings s_1 and s_2 is the minimum number of edit operations needed to turn s_1 into s_2. FD_1(f(s_1), f(s_2)) ≤ ED(s_1, s_2). Example: s_1 = AC, s_2 = CA, f(s_1) = [1, 1, 0, 0], f(s_2) = [1, 1, 0, 0], so FD_1 = 0 while ED = 2 (two replaces, or one insert and one delete).
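
For reference, the edit distance itself is usually computed with the standard dynamic-programming recurrence. The sketch below assumes unit costs for insert, delete and replace; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1, s2):
    """Unit-cost edit distance (insert, delete, replace) by dynamic programming."""
    prev = list(range(len(s2) + 1))          # distances from "" to prefixes of s2
    for i, a in enumerate(s1, start=1):
        curr = [i]                           # distance from s1[:i] to ""
        for j, b in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # replace, or match at no cost
        prev = curr
    return prev[-1]

print(edit_distance("AC", "CA"))   # 2
```

AC and CA contain the same letters, so their frequency vectors are identical even though two edit operations are needed to turn one into the other.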

Frequency distance vs. edit distance FD_1(f(s_1), f(s_2)) ≤ ED(s_1, s_2). Proof: every edit operation on a string moves its frequency vector to a neighbor point. A single insert or delete corresponds to rule 1 or 2, and a replace corresponds to rule 3: v_i := v_i + 1 and v_j := v_j - 1. An edit script of ED operations therefore gives a path of ED steps from f(s_1) to f(s_2), and FD_1, the length of the shortest such path, cannot be longer. The inequality can be strict: an insert plus a delete costs two edit operations, but FD_1 can cover both with one rule 3 step. So FD_1 is a lower bound on ED.

The lower bound of ED Take strings q and s over alphabet Σ, and let r be the range (the maximum ED in a range search). If r < FD_1(f(q), f(s)) then r < ED(q, s). Computing ED costs O(nm) time, but FD_1 costs only O(σ).

Computing FD_1 Take frequency vectors u and v of two strings over the same alphabet Σ. We collect the total positive distance (pos) and the total negative distance (neg): for every letter i in Σ, if u_i > v_i we add the difference u_i - v_i to pos, otherwise we add v_i - u_i to neg. Return the maximum of pos and neg.

How does it work? u = [2, 10], v = [8, 1]. u_1 < v_1, so add 8 - 2 to neg. u_2 > v_2, so add 10 - 1 to pos. Now pos = 9, neg = 6. Return 9: 6 replaces and 3 inserts.
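
The procedure translates directly into code. This is a minimal sketch; the function name `fd1` is an assumption, and the inputs are plain lists of counts in a fixed alphabet order.

```python
def fd1(u, v):
    """Frequency distance: the larger of the total surplus and the total deficit of u over v."""
    if len(u) != len(v):
        raise ValueError("frequency vectors must have the same length")
    pos = sum(a - b for a, b in zip(u, v) if a > b)   # letters u has too many of
    neg = sum(b - a for a, b in zip(u, v) if b > a)   # letters u has too few of
    return max(pos, neg)

print(fd1([2, 10], [8, 1]))              # 9
print(fd1([1, 1, 0, 0], [1, 1, 0, 0]))   # 0, the AC versus CA case
```

Each replace step cancels one unit of surplus against one unit of deficit, and the remainder needs inserts or deletes, which is why the maximum of the two totals is enough.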

Improving the lower bound We've established a lower bound on the edit distance, namely the frequency distance. But we can improve this lower bound by incorporating more information than how often letters occur: we would like to know more about when they occur.

Wavelets Wavelet transform. Problem with the Fourier transform: it represents the frequencies in a signal, but we do not know when these frequencies occur. The wavelet transform shows both time and frequencies. It is used in all sorts of signal processing (compression), e.g. JPEG2000.

How does this affect us? Suppose we have some signal. We can encode this signal by recursively taking the average of a part of the signal, and then the difference between the averages of the two halves of this part.

Wavelets Now with strings: AT. Frequency vector (average) = [1, 0, 0, 1]. Detail = [1, 0, 0, 0] - [0, 0, 0, 1] = [1, 0, 0, -1]. Note that the detail tells us that the first half was an A and the second half was a T.

Wavelets (Adriano) Worked example for TCACTTAG with Σ = {A, C, G, T}, built up level by level.
Level 0 (single letters): T = [0, 0, 0, 1], C = [0, 1, 0, 0], A = [1, 0, 0, 0], C = [0, 1, 0, 0], T = [0, 0, 0, 1], T = [0, 0, 0, 1], A = [1, 0, 0, 0], G = [0, 0, 1, 0]
Level 1 sums: TC = [0, 1, 0, 1], AC = [1, 1, 0, 0], TT = [0, 0, 0, 2], AG = [1, 0, 1, 0]
Level 2 sums: TCAC = [1, 2, 0, 1], TTAG = [1, 0, 1, 2]
Level 3 sum: TCACTTAG = [2, 2, 1, 3]
Level 3 detail (first half minus second half): [0, 2, -1, -1]
Level 2 details: [-1, 0, 0, 1], [-1, 0, -1, 2]
Level 1 details: [0, -1, 0, 1], [1, -1, 0, 0], [0, 0, 0, 0], [1, 0, -1, 0]
The transform keeps the top-level sum [2, 2, 1, 3] together with the details of every level.

The k-th wavelet transformation Definition 4: Let s = c_0 ... c_{n-1} be a string from the alphabet {α_1, ..., α_σ}. The k-th level wavelet transformation ψ_k(s), 0 ≤ k ≤ log_2 n, of s is defined as ψ_k(s) = [v_{k,0}, ..., v_{k,n/2^k - 1}], where v_{k,i} = [A_{k,i}, B_{k,i}].

The 0-th wavelet transformation The 0-th wavelet transformation describes the original string: ψ_0(s) = [v_{0,0}, ..., v_{0,n-1}], where v_{0,i} = [A_{0,i}, B_{0,i}]. For TCACTTAG that is v_{0,0}, ..., v_{0,7}, with A_{0,0}, ..., A_{0,7}.

The log_2 n wavelet transformation The article uses only the first and second wavelet coefficients, which correspond to the log_2 n wavelet transformation: ψ_k(s) = [v_{k,0}, ..., v_{k,n/2^k - 1}] with k = log_2 n, so only v_{log_2 n,0}. For TCACTTAG that is v_{3,0}. A_{3,0} = A_{2,0} + A_{2,1}; A_{2,0} = A_{1,0} + A_{1,1}, A_{2,1} = A_{1,2} + A_{1,3}; A_{1,0} = A_{0,0} + A_{0,1}, A_{1,1} = A_{0,2} + A_{0,3}, etc. A_{3,0} = [2, 2, 1, 3] and B_{3,0} = A_{2,0} - A_{2,1} = [0, 2, -1, -1].
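
The recurrences can be sketched as a bottom-up loop. The function name `wavelet_levels` is made up for this example, and two things are assumptions rather than statements from the slides: the string length is taken to be a power of two, and the level-0 detail B_{0,i} is set to the zero vector.

```python
def wavelet_levels(s, alphabet="ACGT"):
    """Return levels[k] = list of (A, B) pairs, i.e. the k-th level transformation of s."""
    n = len(s)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes the length is a power of two")
    zero = [0] * len(alphabet)
    # Level 0: A is the unit vector of each letter, B is assumed to be zero.
    level = [([int(ch == a) for a in alphabet], zero[:]) for ch in s]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            left, right = level[i][0], level[i + 1][0]
            A = [x + y for x, y in zip(left, right)]   # A_{k,i} = A_{k-1,2i} + A_{k-1,2i+1}
            B = [x - y for x, y in zip(left, right)]   # B_{k,i} = A_{k-1,2i} - A_{k-1,2i+1}
            nxt.append((A, B))
        level = nxt
        levels.append(level)
    return levels

top_A, top_B = wavelet_levels("TCACTTAG")[3][0]
print(top_A, top_B)   # [2, 2, 1, 3] [0, 2, -1, -1]
```

Only the last level is needed for the first and second coefficients, so an implementation could equally count letters in the two halves directly.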

Theorem 3 (Bogdan) String s with coefficients [A, B], where A = [a_1, ..., a_σ] and B = [b_1, ..., b_σ]. How can an edit operation influence A and B? Replace in the first half: a_i := a_i + 1, a_j := a_j - 1, b_i := b_i + 1, b_j := b_j - 1. Replace in the second half: a_i := a_i + 1, a_j := a_j - 1, b_i := b_i - 1, b_j := b_j + 1. Delete & insert: a_i := a_i ± 1, b_j := b_j ± 1.

Theorem 3 Delete on a string of even length: a_i := a_i - 1, b_i := b_i - 1, b_j := b_j + 2. Example: AABA with A = [3, 1], B = [1, -1] becomes ABA with A = [2, 1], B = [0, 1]. Insert on a string of odd length: a_i := a_i + 1, b_i := b_i + 1, b_j := b_j - 2. Example: ABA with A = [2, 1], B = [0, 1] becomes AABA with A = [3, 1], B = [1, -1].

The lower bound So if we have ψ(s_i) and ψ(s_j), the five steps listed in Theorem 3 can be used to walk from ψ(s_i) to ψ(s_j). (These are two points in 2σ-dimensional space.) FD_2(ψ(s_i), ψ(s_j)) is then the length of the shortest legal path from ψ(s_i) to ψ(s_j) using these steps.

The MRS index structure A table of trees T_{i,j} forms the index structure. A column stands for string s_j of database S = {s_1, ..., s_d}, with 1 ≤ j ≤ d. A row stands for a resolution (or window size) 2^i, with a ≤ i ≤ a + l - 1, where l is the number of resolution levels in the index. Each tree T_{i,j} consists of several minimum bounding rectangles (MBRs), each containing several wavelet coefficients, depending on the given capacity c of the MBR.

Building the MRS index (1) Take string s_1 = CTAGTCGA. Let's build the tree T_{2,1}, given c = 3: window size w = 4 (= 2^i), string number j = 1. Take a window of size w and slide it along s_1. The first MBR contains the 1st and 2nd coefficients of the first c substrings in the window: {φ(CTAG), φ(TAGT), φ(AGTC)}. The next MBR: {φ(GTCG), φ(TCGA)}.

Building the MRS index (2) s_1 = CTAGTCGA, c = 3. So T_{2,1} = {φ(CTAG), φ(TAGT), φ(AGTC)}, {φ(GTCG), φ(TCGA)}. Next, T_{3,1} = {φ(CTAGTCGA)}, which consists of only 1 MBR. Normally |s_j| is much larger than 2^(a + l - 1) (the maximum resolution). Et cetera.
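
The sliding-window construction of one tree can be sketched as follows. All names here (`first_two_coefficients`, `build_tree`, the dictionary keys) are hypothetical, and the tree is reduced to a flat list of MBRs; each MBR stores only its lower and upper corner, the start position of its first window, and how many windows it covers.

```python
def first_two_coefficients(w, alphabet="ACGT"):
    """Return [A, B] of window w as one point in 2*sigma dimensions."""
    half = len(w) // 2
    first = [w[:half].count(a) for a in alphabet]
    second = [w[half:].count(a) for a in alphabet]
    A = [x + y for x, y in zip(first, second)]   # sum of the two halves
    B = [x - y for x, y in zip(first, second)]   # first half minus second half
    return A + B

def build_tree(s, w, c, alphabet="ACGT"):
    """Slide a window of size w along s and pack every c consecutive points into one MBR."""
    points = [first_two_coefficients(s[i:i + w], alphabet)
              for i in range(len(s) - w + 1)]
    mbrs = []
    for start in range(0, len(points), c):
        group = points[start:start + c]
        mbrs.append({
            "start": start,                            # position of the first window
            "count": len(group),                       # number of windows in this MBR
            "low": [min(col) for col in zip(*group)],  # lower corner of the box
            "high": [max(col) for col in zip(*group)], # upper corner of the box
        })
    return mbrs

for mbr in build_tree("CTAGTCGA", w=4, c=3):
    print(mbr)
```

For CTAGTCGA this yields two MBRs, covering the windows CTAG, TAGT, AGTC and GTCG, TCGA respectively.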

Some remarks on the index structure Searching for a subquery of length 2^i? Just take row R_i = {T_{i,1}, ..., T_{i,d}} of the table. Take a query string q and an MBR B: FD(q, B) is the minimum of all FD(q, s) with s ∈ B, so if r ≤ FD(q, B) then r ≤ FD(q, s) for all s ∈ B. Wavelet coefficients of substrings obtained by sliding the window are very close to each other, so the set of coefficients in an MBR is highly clustered.

Range Queries We are searching for all sequences that are within an edit distance of r from the query string Easy case: the index contains a resolution that exactly fits the size of the query string

Range queries The query string has length 2^a. For the corresponding row in the index, for every database sequence, we compute the FD of the query string to the MBRs. If r ≤ FD(q, B) then r ≤ FD(q, s) for all s ∈ B. If r < FD(q, B) then r < ED(q, s) for every s ∈ B. So if FD(q, B) > r, we drop B.

Range queries However, we may have some false positives: r ≥ FD(q, B) does not guarantee that r ≥ ED(q, s) for every s ∈ B. That's why we have to post-process all strings in the boxes that were not dropped, i.e. those with r ≥ FD(q, B) (e.g. by dynamic programming).
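
The filter-then-verify idea can be sketched end to end on a single string. This is a simplified illustration, not the authors' algorithm: it uses plain frequency vectors instead of wavelet coefficients, only considers windows with the same length as q, builds the boxes on the fly instead of reading them from an index, and bounds the distance to a box from its two corners. That corner-based bound is an assumption introduced here; it never exceeds the true minimum FD_1 over the box, so no valid result is dropped. All function names are hypothetical.

```python
def freq(s, alphabet="ACGT"):
    return [s.count(a) for a in alphabet]

def fd1_to_box(u, low, high):
    """Lower bound on FD_1(u, v) for any v with low <= v <= high (componentwise)."""
    pos = sum(max(0, a - h) for a, h in zip(u, high))   # surplus that no v in the box can absorb
    neg = sum(max(0, l - a) for a, l in zip(u, low))    # deficit that every v in the box forces
    return max(pos, neg)

def edit_distance(s1, s2):
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]
        for j, b in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]

def range_search(s, q, r, c=3):
    """Return (position, window) pairs of s with edit distance to q of at most r."""
    w, fq = len(q), freq(q)
    n_windows = len(s) - w + 1
    hits = []
    for g in range(0, n_windows, c):
        group = range(g, min(g + c, n_windows))
        vecs = [freq(s[i:i + w]) for i in group]
        low = [min(col) for col in zip(*vecs)]
        high = [max(col) for col in zip(*vecs)]
        if fd1_to_box(fq, low, high) > r:
            continue                      # filter step: the whole box is dropped
        hits.extend((i, s[i:i + w]) for i in group
                    if edit_distance(q, s[i:i + w]) <= r)   # post-processing step
    return hits

print(range_search("CTAGTCGA", "TAGT", r=0))   # [(1, 'TAGT')]
```

In this run the second box (GTCG, TCGA) is dropped without computing any edit distance, while the first box survives the filter and is verified window by window.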

Range queries Now what if there is no row with a resolution corresponding to the length of the query string? We partition the query string: we take the longest possible suffix whose resolution exists in the index, and continue doing this iteratively, so we get q_1, q_2, ..., q_t.
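
A greedy partitioning along these lines might look as follows. The function name `partition_query` is hypothetical, and the handling of a remainder shorter than the smallest window 2^a is an assumption: the sketch simply returns it separately, since the slides do not say what happens to it.

```python
def partition_query(q, a, l):
    """Split q from the right into pieces of length 2^i with a <= i <= a + l - 1.

    Returns (leftover, pieces): pieces are in left-to-right order, and leftover
    is the prefix of q that is shorter than the smallest window 2^a.
    """
    sizes = [2 ** i for i in range(a + l - 1, a - 1, -1)]   # available windows, largest first
    pieces = []
    end = len(q)
    while end >= sizes[-1]:
        w = next(size for size in sizes if size <= end)     # longest suffix that fits
        pieces.append(q[end - w:end])
        end -= w
    pieces.reverse()
    return q[:end], pieces

print(partition_query("ABCDEFGHIJKLM", a=2, l=2))   # ('A', ['BCDE', 'FGHIJKLM'])
```

With resolutions 4 and 8 available, a query of length 13 is cut into an 8-letter suffix, a 4-letter piece before it, and one leftover letter.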

Range queries

Nearest neighbour queries Given some query, we search for the k closest substrings in the database. Phase 1: look up the set of k closest MBRs to the query; r_1 is the k-th smallest edit distance to strings in the set. Phase 2: RangeSearch(q, r_1); return the k closest strings. Why phase 2? FD_box10 ≤ FD_box11, FD_10 ≤ ED_10 and FD_11 ≤ ED_11, but this does not guarantee that ED_10 ≤ ED_11.

Questions (Peter) It is nice that they can prove that the MRS index does not incur any false drops (Theorem 4), but is this also true in a practical sense? If r ≤ FD(q, B) then r ≤ FD(q, s) for all s ∈ B. If r < FD(q, B) then r < ED(q, s) for every s ∈ B. So if FD(q, B) > r, we drop B.

Questions (Lee) The article focuses on substring matching. What adaptations would we need for whole matching? Determine [A,B] for every entire string in the DB Determine FD of q to each string

Questions (Bogdan) The definition of FD(q,B) in section 3.3 says that the distance between a query transformation and a box is the minimum of the distances between the query transformation and the transformations in that box. It is also mentioned in the same section that for each box (MBR) only the lower/higher end points and the starting location of the first substring contained in that MBR are stored as part of the index. Further on, in section 4.4, a part of the range query algorithm implies the computation of FD(q,B) for various (query, MBR) pairs. However, since we only have the lower/higher end points for each MBR, how is it possible to compute FD(q,B) without retrieving all the substrings s_i that are in the box B from the disk? I could think of alternatively defining FD(q,B) with a formula involving only the lower/higher end points of the box B, but this is not what the authors are suggesting/using.

Questions(Bogdan)

Questions from Marjolijn I read the article several times, and I don't understand wavelet coefficients. What does this mean and how can it be used for string comparison? Do the wavelet coefficients depend on the data itself or only on the frequency of the appearance of the data? Answer: probably clear from Ingmar's explanation.

Questions from Marjolijn The article says that they use the edit distance. But they also mention the weighted edit distance (ED); why don't they use this one? Does it take more calculation time? Answer: in the FD you no longer know whether there was a delete + insert or a replace.

Questions from Marjolijn Is the algorithm with the edit distance useful for data like DNA, where we know for sure that several changes depend on each other and occur more often than other ones? Same answer as previous slide: no, not if you take these special occasions into account.