An Efficient Index Structure for String Databases
Tamer Kahveci, Ambuj K. Singh (Department of Computer Science, University of California, Santa Barbara)
Presented by Atul Ugalmugale / Nikita Rasam

Issue
Quickly find all substrings in a large database that are similar to a given query string, using a small index structure. In some applications we store, search, and analyze long sequences of discrete characters, which we call "strings". There is a frequent need to find similarities in genetic data, web data, and event sequences.

Applications
Information Retrieval: a typical application is text searching; given a large collection of documents and some text keywords, we want to find the documents which contain these keywords. When searching keywords on the web, by "mtallica" we usually mean "metallica".

Computational Biology: the problem is similar here; we have a long DNA sequence and want to find subsequences in it that approximately match a query sequence.
…ATGCATACGATCGATT…
…TGCAATGGCTTAGCTA…
Animal species from the same family are bound to have more similar DNA.

Video data can be viewed as an event sequence if some pre-specified set of events is detected and stored as a sequence. Searching for similar event subsequences can then be used to find related video segments.

Limitations of existing approaches
– String search algorithms proposed so far are in-memory algorithms.
– They scan the whole database for each query.
– The size of a string database grows faster than the available memory capacity, and extensive memory requirements make these search techniques impractical.
– They suffer from disk I/Os when the database is too large.
– Performance deteriorates for long query patterns.

Similarity Metrics
The difference between two strings s1 and s2 is generally defined as the minimum number of edit operations needed to transform s1 into s2, called the edit distance ED(s1, s2). Edit operations:
– Insert
– Delete
– Replace

Suppose we have two strings x and y, e.g. x = kitten, y = sitting, and we want to transform x into y. A closer look:
k i t t e n
s i t t i n g
1st step: kitten → sitten (Replace k with s)
2nd step: sitten → sittin (Replace e with i)
3rd step: sittin → sitting (Insert g)
Edit distance = 3
What is the edit distance between "survey" and "surgery"?
s u r v e y → s u r g e y (Replace, +1)
s u r g e y → s u r g e r y (Insert, +1)
Edit distance = 2

In the general version of edit distance, different operations may have different costs, or the costs may depend on the characters involved. For example, replacement could be more expensive than insertion, or replacing "a" with "o" could be less expensive than replacing "a" with "k". This is called weighted edit distance.

Global Alignment
The global alignment (or similarity) of s1 and s2 is defined as the maximum-valued alignment of s1 and s2.
– Given two strings S1 and S2, a global alignment is obtained by inserting spaces into S1 or S2 (including at the ends) so that they have the same length, and then writing one against the other.
Example: qacdbd & qawdb
qac_dbd
qa_wdb_
Edits and alignments are dual:
– A sequence of edits can be converted into a global alignment.
– An alignment can be converted into a sequence of edits.

Local Alignment
Given two strings X and Y, find two substrings x and y of X and Y, respectively, such that their alignment score (in the global sense) is maximum over all pairs of such substrings (empty substrings are allowed).
S(x, y) = +2 if x = y; -2 if x ≠ y; -1 if x = '_' or y = '_'
X = pqraxabcstvq
Y = yxaxbacsll
x = axabcs
y = axbacs
a x a b _ c s
a x _ b a c s
Score = +8

String Matching Problem
Whole matching: find the edit distance ED(q, s) between a data string s and a query string q.
Substring matching: consider all substrings s[i:j] of s which are close to the query string.
Two types of queries:
– Range search seeks all substrings of s within an edit distance of r to a given query q (r is the query range).
– k-nearest neighbor search seeks the k closest substrings of s to q.

Challenges in solving the substring matching problem
– Finding the edit distance is very costly in terms of both time and space.
– The strings in the database may be very long.
– The database size for most applications grows exponentially.
New approach to overcome these challenges:
– Define a lower-bounding distance for substring searching.
– Improve this lower bound by using the idea of wavelet transformation.
– Use the MRS index structure based on the aforementioned distance formulations.

A dynamic programming algorithm for computing the edit distance
Problem: find the edit distance between strings x and y.
Create a (|x|+1) × (|y|+1) matrix C, where C[i][j] represents the minimum number of operations to match x[1..i] with y[1..j]. The matrix is constructed as follows:
C[i][0] = i
C[0][j] = j
C[i][j] = min{ C[i-1][j-1] + cost (replace), C[i][j-1] + 1 (insert), C[i-1][j] + 1 (delete) }
where cost = 0 if x[i] = y[j], else 1.
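The recurrence above can be sketched directly in Python (the function name is ours):

```python
def edit_distance(x, y):
    """Compute ED(x, y) with the (|x|+1) x (|y|+1) DP matrix C."""
    m, n = len(x), len(y)
    # C[i][j] = minimum number of edits to match x[:i] with y[:j]
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i          # delete all of x[:i]
    for j in range(n + 1):
        C[0][j] = j          # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            C[i][j] = min(C[i - 1][j - 1] + cost,  # replace (or match)
                          C[i][j - 1] + 1,         # insert
                          C[i - 1][j] + 1)         # delete
    return C[m][n]

print(edit_distance("kitten", "sitting"))   # 3
print(edit_distance("survey", "surgery"))   # 2
```

For strings of length m and n this takes O(mn) time and space; the space can be reduced to O(min(m, n)) by keeping only two rows.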

How do we perform substring search?
The same dynamic programming algorithm can be used to find the most similar substrings of a query string q. The difference is that we set C[0][j] = 0 for all j, since any text position could be the potential start of a match. If the similarity distance bound is k, we report all positions j where C[m][j] ≤ k (m = |q| is the last row).
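A minimal sketch of this substring-matching variant (function name ours), keeping only two rows of the matrix:

```python
def substring_matches(text, q, k):
    """Return 0-indexed end positions j in text where some substring
    ending at j is within edit distance k of q (C[0][j] = 0 variant)."""
    m = len(q)
    prev = [0] * (len(text) + 1)      # row 0: a match may start anywhere
    for i in range(1, m + 1):
        cur = [i] + [0] * len(text)   # C[i][0] = i
        for j in range(1, len(text) + 1):
            cost = 0 if q[i - 1] == text[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # replace (or match)
                         cur[j - 1] + 1,      # insert
                         prev[j] + 1)         # delete
        prev = cur
    # prev now holds the last row C[m][*]
    return [j - 1 for j in range(1, len(text) + 1) if prev[j] <= k]

print(substring_matches("abcdef", "cde", 0))   # exact occurrence ends at index 4
```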

Frequency Vector
Let s be a string over the alphabet Σ = {α1, ..., ασ}. Let n_i be the number of occurrences of the character α_i in s, for 1 ≤ i ≤ σ. The frequency vector is f(s) = [n1, ..., nσ].
Example:
– s = AATGATAG
– f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]
If f(s) = [v1, ..., vσ] is the frequency vector of s, then Σ_{i=1}^{σ} v_i = |s|.
An edit operation on s has one of the following effects on f(s), for 1 ≤ i, j ≤ σ and i ≠ j:
– v_i := v_i + 1
– v_i := v_i - 1
– v_i := v_i + 1 and v_j := v_j - 1

Effect of Edit Operations on Frequency Vector
Delete: decreases an entry by 1.
Insert: increases an entry by 1.
Replace: Insert + Delete.
Example:
– s = AATGATAG => f(s) = [4, 0, 2, 2]
– (delete G) s = AAT.ATAG => f(s) = [4, 0, 1, 2]
– (insert C) s = AACTATAG => f(s) = [4, 1, 1, 2]
– (A → C) s = ACCTATAG => f(s) = [3, 2, 1, 2]
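The frequency vector and the effect of the three edit operations can be illustrated with a small sketch (function name and DNA alphabet ordering are ours):

```python
def freq_vector(s, alphabet="ACGT"):
    """f(s): occurrence count of each alphabet character in s."""
    return [s.count(a) for a in alphabet]

print(freq_vector("AATGATAG"))  # [4, 0, 2, 2]
# A single edit operation changes f(s) in at most two entries, by 1 each:
print(freq_vector("AATATAG"))   # delete G -> [4, 0, 1, 2]
print(freq_vector("AACTATAG"))  # insert C -> [4, 1, 1, 2]
print(freq_vector("ACCTATAG"))  # A -> C   -> [3, 2, 1, 2]
```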

Frequency Distance
Let u and v be integer points in σ-dimensional space. The frequency distance FD1(u, v) between u and v is defined as the minimum number of steps needed to go from u to v (or equivalently from v to u) by moving to a neighboring point at each step.
Let s1 and s2 be two strings over the alphabet Σ = {α1, ..., ασ}. Then
FD1(f(s1), f(s2)) ≤ ED(s1, s2).

An Approximation to ED: Frequency Distance (FD1)
FD1(f(s1), f(s2)) = max{pos, neg}, where pos and neg are the total positive and negative differences between the vectors.
FD1(f(s1), f(s2)) ≤ ED(s1, s2).
Example:
s = AATGATAG => f(s) = [4, 0, 2, 2]
q = ACTTAGC => f(q) = [2, 2, 1, 2]
– pos = (4-2) + (2-1) = 3
– neg = (2-0) = 2
– FD1(f(s), f(q)) = 3
– ED(q, s) = 4

Frequency Distance Calculation
/* u and v are σ-dimensional integer points */
Algorithm FD1(u, v):
  posDist := negDist := 0
  for i := 1 to σ:
    if u[i] > v[i] then posDist := posDist + (u[i] - v[i])
    else negDist := negDist + (v[i] - u[i])
  return max{posDist, negDist}
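The same calculation as a short Python sketch (function name ours):

```python
def fd1(u, v):
    """Frequency distance: max of total positive and total negative
    componentwise differences between integer points u and v."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)

# Example from the previous slide: FD1 = 3 <= ED = 4
print(fd1([4, 0, 2, 2], [2, 2, 1, 2]))  # 3
```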

Wavelet Vector Computation
Let s = c1 c2 ... cn be a string over the alphabet Σ = {α1, ..., ασ}. The k-th level wavelet transformation ψk(s), 0 ≤ k ≤ log2(n), of s is defined as
ψk(s) = [v_{k,0}, ..., v_{k,(n/2^k)-1}], where v_{k,i} = [A_{k,i}, B_{k,i}] and, for 0 ≤ i ≤ (n/2^k) - 1:
A_{k,i} = f(c_i) if k = 0; A_{k-1,2i} + A_{k-1,2i+1} if 0 < k ≤ log2(n)
B_{k,i} = 0 if k = 0; A_{k-1,2i} - A_{k-1,2i+1} if 0 < k ≤ log2(n)

Using Local Information: Wavelet Decomposition of Strings
s = AATGATAC => f(s) = [4, 1, 1, 2]
s = AATG + ATAC = s1 + s2
f(s1) = [2, 0, 1, 1]
f(s2) = [2, 1, 0, 1]
ψ1(s) = f(s1) + f(s2) = [4, 1, 1, 2]
ψ2(s) = f(s1) - f(s2) = [0, -1, 1, 0]

Wavelet Decomposition of a String: General Idea
A_{i,j} = f(s[j·2^i : (j+1)·2^i - 1]) — first wavelet coefficient
B_{i,j} = A_{i-1,2j} - A_{i-1,2j+1} — second wavelet coefficient

Wavelet Transformation: Example
s = TCAC, n = |s| = 4, alphabet order [A, C, T]
ψ0(s) = [v_{0,0}, v_{0,1}, v_{0,2}, v_{0,3}]
      = [(A_{0,0}, B_{0,0}), (A_{0,1}, B_{0,1}), (A_{0,2}, B_{0,2}), (A_{0,3}, B_{0,3})]
      = [(f(T), 0), (f(C), 0), (f(A), 0), (f(C), 0)]
      = [([0,0,1], 0), ([0,1,0], 0), ([1,0,0], 0), ([0,1,0], 0)]
ψ1(s) = [([0,1,1], [0,-1,1]), ([1,1,0], [1,-1,0])]
ψ2(s) = [([1,2,1], [-1,0,1])]
In each pair, the first component is the first wavelet coefficient (sum) and the second is the second wavelet coefficient (difference).
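A sketch of the transformation (names ours; assumes |s| is a power of two, as in the example):

```python
def freq(c, alphabet):
    """Frequency vector of a single character."""
    return [1 if c == a else 0 for a in alphabet]

def wavelet_levels(s, alphabet):
    """All levels psi_0 .. psi_log2(n); each level is a list of (A, B) pairs."""
    level = [(freq(c, alphabet), [0] * len(alphabet)) for c in s]  # psi_0
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            a1, a2 = level[i][0], level[i + 1][0]
            A = [x + y for x, y in zip(a1, a2)]  # first coefficient: sum
            B = [x - y for x, y in zip(a1, a2)]  # second coefficient: difference
            nxt.append((A, B))
        level = nxt
        levels.append(level)
    return levels

levels = wavelet_levels("TCAC", "ACT")
print(levels[2])  # [([1, 2, 1], [-1, 0, 1])]
```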

Wavelet Distance Calculation

Maximum Frequency Distance Calculation
FD(s1, s2) = max{ FD1(f(s1), f(s2)), FD2(ψ(s1), ψ(s2)) }
where FD1 is the frequency distance and FD2 is the wavelet distance.

MRS-Index Structure Creation
Slide a window of length w = 2^a over the data string s1 and transform each window into its wavelet coefficients. Each group of c consecutive transformed windows (c = box capacity) is enclosed in a minimum bounding rectangle (MBR), labeled T_{a,1} for string s1 at resolution 2^a. The same construction is repeated at different resolutions (w = 2^a, 2^{a+1}, ...), giving one row of MBRs per resolution; together these rows form the MRS-index structure.
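A simplified sketch of the construction (names ours): for brevity it maps each window to its level-0 frequency vector only, rather than the full wavelet coefficients used by the MRS-index.

```python
def build_mbrs(s, w, c, alphabet="ACGT"):
    """Slide a window of length w over s, map each window to its frequency
    vector, and enclose every c consecutive vectors in an MBR (min/max per
    dimension). Returns a list of (low_corner, high_corner) pairs."""
    vectors = [[s[i:i + w].count(a) for a in alphabet]
               for i in range(len(s) - w + 1)]
    mbrs = []
    for start in range(0, len(vectors), c):
        group = vectors[start:start + c]
        lo = [min(v[d] for v in group) for d in range(len(alphabet))]
        hi = [max(v[d] for v in group) for d in range(len(alphabet))]
        mbrs.append((lo, hi))
    return mbrs

print(build_mbrs("AAAACCCC", w=4, c=3)[0])  # ([2, 0, 0, 0], [4, 2, 0, 0])
```

Because consecutive windows differ by one character, their frequency vectors differ by at most one unit in two entries, which is why the resulting MBRs are highly clustered.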

MRS-index properties
Relative MBR volume (precision) decreases when
– c increases.
– w decreases.
MBRs are highly clustered.

Frequency Distance to an MBR
Let q be a query string of length 2^i, where a ≤ i ≤ a + l - 1. Given an MBR B, we define
FD(q, B) = min_{s ∈ B} FD(q, s).

Range Search Algorithm

Range Queries
1. Partition the query string q into subqueries q1, q2, ... at the various resolutions available in the index.
2. Perform a partial range query for each subquery on the corresponding row of the index structure, refining ε after each step.
3. Read the disk pages corresponding to the final result set and postprocess to eliminate false retrievals.
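The paper's exact partitioning procedure depends on the resolutions stored in the index; a greedy largest-first sketch (names ours) illustrates step 1:

```python
def partition_query(qlen, window_sizes):
    """Greedily split a query length into available window sizes, largest
    first. Any remainder shorter than the smallest window stays unindexed
    and must be handled during postprocessing."""
    parts = []
    for w in sorted(window_sizes, reverse=True):
        while qlen >= w:
            parts.append(w)
            qlen -= w
    return parts, qlen  # (subquery lengths, leftover length)

print(partition_query(1536, [128, 256, 512, 1024]))  # ([1024, 512], 0)
```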

K-Nearest Neighbor Algorithm

k-Nearest Neighbor Query [KSF+96, SK98]
Example with k = 3: r = edit distance to the 3rd closest substring found so far.

Experimental Settings
w = {128, 256, 512, 1024}.
Human chromosomes from ( ):
– chr02, chr18, chr21, chr22
– Plotted results are from the chr18 dataset.
Queries are selected randomly from the data set, with query lengths |q| ≥ 512.
An NFA-based technique [BYN99] is implemented for comparison.

Experimental Results 1: Effect of Box Capacity (10-NN)
The cost of the MRS-index increases as the box capacity increases, but remains much lower than that of the NFA technique for all these box capacities. Although using the second wavelet coefficient slightly improves performance for the same box capacity, the size of the index structure doubles; for the same amount of memory, the single-coefficient version performs better.

Experimental Results 2: Effect of Window Size (10-NN)
The MRS-index structure outperforms the NFA technique for all window sizes. The performance of the MRS-index structure itself improves as the window size increases.

Experimental Results 3: k-NN Queries
Although the performance of the MRS-index structure drops for large values of k, it still performs better than the NFA technique. Speedups of up to 45 were achieved for 10 nearest neighbors; the speedup for 200 nearest neighbors is 3. As the number of nearest neighbors increases, the performance of the MRS-index structure approaches that of the NFA technique.

Experimental Results 4: Range Queries
The MRS-index structure performed up to 12 times faster than the NFA technique. Its performance improved when the queries were selected from different data strings; this is because DNA strings have high self-similarity. The performance of the MRS-index structure deteriorates as the error rate increases, because the size of the candidate set grows with the error rate.

Discussion
– In-memory (index size is 1–2% of the database size).
– Lossless search.
– 3 to 45 times faster than the NFA technique for k-NN queries.
– 2 to 12 times faster than the NFA technique for range queries.
– Can be used to speed up any previously defined technique.

THANK YOU