E.G.M. Petrakis, Searching Signals and Patterns. Given a query Q and a collection of N objects O_1, O_2, …, O_N, search the collection for objects that match Q exactly or approximately. The ideal method should be fast (faster than sequential scanning), correct (returns all qualifying objects), and dynamic (allows insertions, deletions, and updates).
Similarity Queries. Range queries: find all objects within distance e of the query, D(Q,O_i) < e, where the distance function D and the tolerance e are user defined. Nearest-neighbor (NN) queries: find the k objects most similar to the query. All-pairs (“spatial join”) queries: find all pairs of objects O_i, O_j within distance e of each other, D(O_i,O_j) < e.
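As a point of reference, the three query types can be written as plain sequential scans, which is the baseline an index is expected to beat. This is a minimal sketch: the Euclidean distance and the function names `range_query`, `knn_query`, and `all_pairs` are assumptions made for illustration and do not come from the slides.

```python
import math
from itertools import combinations

def euclidean(a, b):
    # Assumed distance function D; any user-defined distance could be passed instead.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(q, objects, e, D=euclidean):
    # Indices of all objects within distance e of the query.
    return [i for i, o in enumerate(objects) if D(q, o) < e]

def knn_query(q, objects, k, D=euclidean):
    # Indices of the k objects closest to the query.
    return sorted(range(len(objects)), key=lambda i: D(q, objects[i]))[:k]

def all_pairs(objects, e, D=euclidean):
    # All index pairs (i, j), i < j, whose objects lie within distance e of each other.
    return [(i, j) for i, j in combinations(range(len(objects)), 2)
            if D(objects[i], objects[j]) < e]

objects = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
in_range = range_query((0.0, 0.0), objects, 1.5)
nearest = knn_query((4.0, 4.0), objects, 1)
close_pairs = all_pairs(objects, 1.5)
```

Each function touches every object (or every pair), so the cost grows linearly (or quadratically) with N.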
Similarity Queries (cont'd). Whole matching: the whole query Q matches an object O_i; for example, the images are 512x512 and the query is 512x512. Partial matching: the query specifies only a part of an object, so the task is to find the parts of objects that match the query; for example, the images are 512x512 and the query is 32x32.
Object Types. 1D signals: time sequences, scientific data, digitized voice or music. 2D signals: digitized images (gray scale, color), video clips. General objects: text, multimedia documents.
Applications. In many applications, searching for similar patterns helps in prediction, decision making, data mining, etc. Examples: financial, marketing, and production data (1D signals); scientific databases; DNA/genome databases; audio databases; image and video databases.
Queries. Find companies whose stock prices move similarly or show a similar pattern of growth. Find products with similar selling patterns. Find whether a musical score is similar to one of the copyrighted scores. Find images that look like a sunset. Find X-rays showing a lung tumor.
Indexing [Agrawal et al. 93]. To search faster than sequential scanning, the objects are indexed: extract f features from each object and use a spatial access method (SAM) to index the resulting feature vector; search the SAM to retrieve promising objects; clean up the response. The indexing method must be correct (i.e., have no “misses”), have small space overhead, and be dynamic.
Mapping Objects to Space. Objects are mapped to points. A query Q becomes a sphere with radius e.
Mapping Objects to Points. F( ): mapping function. D_f: object distance in feature space. D: object distance in actual space. How should F( ) and D_f be selected? Ideally, D_f(Q_i,O_j) = D(Q_i,O_j), i.e., the mapping preserves distances. At a minimum, the mapping should guarantee no misses.
GEMINI [Faloutsos 96]. GEMINI: Generic Multimedia Indexing. 1. Define F( ): the mapping of objects to f features (objects become vectors). 2. Determine the distance function D_f in the f-dimensional feature space. 3. Guarantee correctness: prove that D_f <= D. 4. Apply a SAM (e.g., an R-tree) to index the f-dimensional vectors. 5. Apply the search algorithm to eliminate false drops.
Search Algorithm. Problem: retrieve all objects satisfying D(Q,O) < e. Retrieve the points satisfying D_f(Q,O_j) < e. Retrieve the corresponding actual objects S. Keep only those satisfying D(Q,S) < e (discard false alarms).
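The filter-and-refine search can be sketched as follows. This is an illustration under stated assumptions rather than the method's reference implementation: a linear scan over feature vectors stands in for the SAM, D and D_f are both Euclidean, and the feature map `first_two` (keep the first two coordinates) is a hypothetical choice that happens to lower-bound the Euclidean distance.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_search(query, objects, F, e, D=euclidean, D_f=euclidean):
    """Return the indices i with D(query, objects[i]) < e."""
    fq = F(query)
    # Filter: a linear scan over feature vectors stands in for the SAM search.
    candidates = [i for i, o in enumerate(objects) if D_f(fq, F(o)) < e]
    # Refine: compute the actual distance and discard false alarms.
    return [i for i in candidates if D(query, objects[i]) < e]

def first_two(x):
    # Hypothetical feature map: dropping coordinates can only shrink the
    # Euclidean distance, so D_f <= D holds.
    return x[:2]

random.seed(0)
objects = [[random.gauss(0, 1) for _ in range(8)] for _ in range(200)]
query = [random.gauss(0, 1) for _ in range(8)]

hits = range_search(query, objects, first_two, e=3.0)
brute = [i for i, o in enumerate(objects) if euclidean(query, o) < 3.0]
```

Because the feature distance never exceeds the actual distance, the filtered search returns the same answer as the sequential scan; only the number of actual distance computations differs.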
Lower Bounding. Lemma: to guarantee no false dismissals, F( ) should satisfy D_f(Q,O_i) <= D(Q,O_i) for all Q, O_i. Proof: we must show that every object that qualifies for the query is also retrieved in the feature space. If O_i qualifies, then D(Q,O_i) <= e; since D_f(Q,O_i) <= D(Q,O_i), we have D_f(Q,O_i) <= e, so O_i is retrieved.
Indexing 1D Signals. Find all signals S = (s_1, s_2, …, s_n) within distance e of Q = (q_1, q_2, …, q_n): D(Q,S) < e, where s_i, q_i are the amplitudes at time i. D is defined as the Euclidean distance, $D(Q,S) = \left(\sum_{i=1}^{n} (q_i - s_i)^2\right)^{1/2}$. Apply GEMINI. But how are F( ) and D_f( ) defined?
Definition of F, D_f. The DFT maps a signal s = (s_1, s_2, …, s_n) to its frequency spectrum S = (S_1, S_2, …, S_n). F( ) keeps the first f_c Fourier coefficients, where f_c is the “cut-off” frequency (e.g., f_c = 5). Signals become points in a space of f = 2f_c dimensions (because the coefficients are complex numbers). D_f is defined as the Euclidean distance over the retained coefficients, $D_f(Q,S) = \left(\sum_{i=1}^{f_c} |Q_i - S_i|^2\right)^{1/2}$.
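D_f Lower Bounds D. Let S, Q be the DFTs of the signals s, q. Parseval's theorem: the energy in the time and frequency domains is the same, $\sum_{i=1}^{n} |s_i|^2 = \sum_{i=1}^{n} |S_i|^2$ (with the DFT normalized by $1/\sqrt{n}$). Because the DFT is linear, this implies that $D(s,q) = D(S,Q)$, and therefore D_f(Q,S) <= D(q,s), because D_f is computed using only f_c <= n of the terms.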
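The lower-bounding argument can be checked numerically. The sketch below assumes a unitary DFT (scaled by 1/sqrt(n)) so that Parseval's theorem holds with equality, and uses a direct O(n^2) transform from the standard library for self-containment; the names `dft`, `dist`, and `features` are illustrative only.

```python
import cmath
import math
import random

def dft(x):
    """Unitary DFT (1/sqrt(n) scaling), computed directly in O(n^2)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / math.sqrt(n)
        for k in range(n)
    ]

def dist(a, b):
    # Euclidean distance; abs() handles both real and complex entries.
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

def features(x, fc):
    # F( ): keep the first fc Fourier coefficients.
    return dft(x)[:fc]

random.seed(1)
n, fc = 64, 5
s = [random.gauss(0, 1) for _ in range(n)]
q = [random.gauss(0, 1) for _ in range(n)]

d_time = dist(s, q)                              # D(q, s) in the time domain
d_freq = dist(dft(s), dft(q))                    # D(Q, S) over all n coefficients
d_feat = dist(features(s, fc), features(q, fc))  # D_f over the first fc coefficients
```

Up to rounding error, `d_time` and `d_freq` agree, and `d_feat` is never larger than either, which is the property the index relies on.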
Experiments. Faster than sequential scanning for all data set sizes. With more coefficients, the search is slower but more accurate. The trade-off reaches an equilibrium for f = 3 or 4.
Intuition. For the majority of 1D signals, only a few frequencies have high amplitudes. If we index only the first few coefficients (f_c < 5 or 10), we shall have only a few false drops. R-trees can handle up to 20 dimensions for point data.
NN Queries [Korn et al. 98]. Find the k nearest neighbors (k-NNs) of a query Q: 1. Search the SAM to find the k-NNs using D_f [e.g., Rous95]. 2. Compute the actual distance D for these k objects. 3. Let E = max{D(q,s_i)}, 1 <= i <= k. 4. Issue a range query with radius E on the SAM, D_f(q,s) <= E, and retrieve a new set of objects. 5. Compute their actual distances D(q,s). 6. Output the nearest k objects.
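The six steps can be sketched as follows. This is a minimal illustration, assuming Euclidean D and D_f, a sorted linear scan in place of the SAM's k-NN and range searches, and a hypothetical feature map that keeps the first three coordinates; none of these choices come from the slides.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_search(query, objects, F, k, D=euclidean, D_f=euclidean):
    """Return the indices of the k objects nearest to query under D."""
    fq = F(query)
    feats = [F(o) for o in objects]
    # Step 1: k-NN in feature space (a sorted scan stands in for the SAM).
    first = sorted(range(len(objects)), key=lambda i: D_f(fq, feats[i]))[:k]
    # Steps 2-3: actual distances of these k objects; E is the largest.
    E = max(D(query, objects[i]) for i in first)
    # Step 4: range query of radius E in feature space.
    candidates = [i for i in range(len(objects)) if D_f(fq, feats[i]) <= E]
    # Steps 5-6: rank the candidates by actual distance and keep the best k.
    return sorted(candidates, key=lambda i: D(query, objects[i]))[:k]

random.seed(2)
objects = [[random.gauss(0, 1) for _ in range(8)] for _ in range(300)]
query = [random.gauss(0, 1) for _ in range(8)]

result = knn_search(query, objects, lambda x: x[:3], k=5)
brute = sorted(range(len(objects)), key=lambda i: euclidean(query, objects[i]))[:5]
```

The first pass only serves to find a radius E that is guaranteed to enclose the true k nearest neighbors; the second pass is an ordinary range query.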
Correctness of NN Algorithm. Lemma: the algorithm has no misses. Proof: let s_l be the true l-th nearest neighbor of q, l <= k; we show that s_l is retrieved. Suppose the algorithm did not retrieve s_l. Then the range query (step 4) missed it: D_f(q,s_l) > E. From lower bounding, D(q,s_l) >= D_f(q,s_l) > E (*). However, the k objects found in step 1 all have actual distance at most E from q, so the l-th nearest neighbor satisfies D(q,s_l) <= E, which contradicts (*).
Partial Matching [Faloutsos 94]. Problem: given N data sequences S_1, S_2, …, S_N and a query Q, locate the data subsequences that match the query sequence; for example, locate stock prices with similar monthly patterns of growth. Approach: extract f features, apply a SAM, etc.
Methodology. Slide a matching window of length w along each signal (length(S) - w + 1 positions). Assume a minimum query length w: the method handles any query of length at least w, and shorter queries are of no interest. Longer queries are split into pieces of length w (w-queries).
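A short sketch of the two window operations follows. The function names are illustrative, and the handling of a query whose length is not a multiple of w (the leftover tail is dropped) is an assumption made here, not something the slides specify.

```python
def sliding_windows(signal, w):
    # All length-w subsequences of the signal: len(signal) - w + 1 positions.
    return [signal[i:i + w] for i in range(len(signal) - w + 1)]

def split_query(query, w):
    # Consecutive, non-overlapping pieces of length w.
    # Assumption: a leftover tail shorter than w is dropped.
    return [query[i:i + w] for i in range(0, len(query) - w + 1, w)]

signal = list(range(10))
windows = sliding_windows(signal, 4)
pieces = split_query(signal, 4)
```

Each window would then be mapped to a feature point, so a signal of length n contributes n - w + 1 points to the index.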
Splitting a Query. [Figure: the sequences S = (s_1, s_2, s_3) and S' = (s'_1, s'_2) and the query Q = (q_1, q_2) mapped to the feature space with axes F_1 and F_2; the query points q_1 and q_2 are each shown with radius e.]
Indexing Subsequences. I-naive method: index all w-trails individually. This is inefficient in terms of space and speed: a 1:f increase in storage and a tall, slow R-tree. ST-index: index the w-trails in groups. Successive w-trails are similar, so they are grouped in the f-dimensional feature space and the index stores rectangles containing similar trails.
Grouping of Subsequences. Organize the w-trails in the f-dimensional space into rectangles so that disk accesses are minimized. A fixed number of points per rectangle is possible, but which number is optimal? Smaller rectangles mean fewer disk accesses: a rectangle with sides L = (l_1, l_2, …, l_n) causes Π(l_i + 0.5) accesses, so a rectangle containing m points causes Π(l_i + 0.5)/m accesses per point.
I-Adaptive Algorithm. Map the points of the w-trails to rectangles in the f-dimensional space. Assign the first point of a w-trail to a rectangle. For each successive point: if including it increases the cost of the rectangle, start a new rectangle; otherwise include it in the same rectangle.
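The greedy grouping can be sketched as below. Several things are assumed for illustration: the cost of a rectangle is taken to be the per-point estimate Π(l_i + 0.5)/m, the feature values are already scaled to the unit cube, and the names `marginal_cost` and `adaptive_grouping` are hypothetical.

```python
import math

def marginal_cost(lo, hi, m):
    # Estimated disk accesses per point for a rectangle with corners lo, hi
    # that holds m points: prod(l_i + 0.5) / m.
    return math.prod(h - l + 0.5 for l, h in zip(lo, hi)) / m

def adaptive_grouping(points):
    """Group successive trail points into rectangles; returns (lo, hi, count) triples."""
    groups = []
    lo = hi = None
    m = 0
    for p in points:
        if m > 0:
            # Rectangle that would result from including p.
            new_lo = [min(a, b) for a, b in zip(lo, p)]
            new_hi = [max(a, b) for a, b in zip(hi, p)]
            if marginal_cost(new_lo, new_hi, m + 1) <= marginal_cost(lo, hi, m):
                lo, hi, m = new_lo, new_hi, m + 1
                continue
            # Including p would raise the cost: close the current rectangle.
            groups.append((lo, hi, m))
        # Start a new rectangle at p.
        lo, hi, m = list(p), list(p), 1
    if m > 0:
        groups.append((lo, hi, m))
    return groups

points = [(0.0, 0.0), (0.01, 0.0), (0.02, 0.01), (0.9, 0.9), (0.91, 0.9)]
groups = adaptive_grouping(points)
```

In this example the jump from (0.02, 0.01) to (0.9, 0.9) would inflate the rectangle, so the first three points form one group and the last two form another.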
Naïve Method. Fixed number of points per rectangle.
I-Adaptive Method. Variable number of points per rectangle. Smaller rectangles, fewer disk accesses.
Range Queries [Petrakis 02]. Input: query Q, distances D and D_f, tolerance e. Output: signals S satisfying D(Q,S) <= e. 1. Decompose Q = (q_1, q_2, …, q_n). 2. For each q_i, apply the range query D_f(q_i,s_j) <= e and store the results in A_i. 3. Compute A from the sets A_i. 4. For each S in A, compute D(Q,S). 5. Output the sequences satisfying D(Q,S) <= e.
NN Queries [Petrakis 02]. Input: query Q, distances D and D_f, number k. Output: the k sequences most similar to Q. 1. Decompose Q = (q_1, q_2, …, q_n). 2. Apply a k-NN query for each q_i: retrieve k distinct w-trails (incremental k-NN search [Hjaltason 99]) and compute e_i, their maximum distance from Q. 3. Compute e = min{e_i}. 4. Apply a range query D(Q,S) <= e. 5. Output the k sequences closest to Q.
References
R. Agrawal, C. Faloutsos, A. Swami, “Efficient Similarity Search in Sequence Databases”, Proc. of FODO Conf., Oct. 1993.
C. Faloutsos, M. Ranganathan, Y. Manolopoulos, “Fast Subsequence Matching in Time-Series Databases”, Proc. of SIGMOD, May 1994.
P. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, Z. Protopapas, “Fast and Effective Retrieval of Medical Tumor Shapes”, IEEE TKDE, Vol. 11, 1998.
E.G.M. Petrakis, “Fast Retrieval by Spatial Structure in Image DataBases”, Journal of Visual Languages and Computing, 2002 (to appear).
N. Roussopoulos, S. Kelley, F. Vincent, “Nearest-Neighbor Queries”, Proc. ACM SIGMOD, May 1995.
G. R. Hjaltason, H. Samet, “Distance Browsing in Spatial Databases”, ACM Trans. on Database Systems, 24(2):265–318, 1999.