

E.G.M. Petrakis: Searching Signals and Patterns

- Given a query Q and a collection of N objects O_1, O_2, …, O_N, search exactly or approximately
- The ideal method should be:
  - Fast: faster than sequential scanning
  - Correct: returns all qualifying objects
  - Dynamic: allows for insertions, deletions, updates

Similarity Queries

- Range queries: find all objects within distance e from the query
  - D(Q,I) < e, where D, e: user defined
- Nearest Neighbor (NN) queries: find the k most similar objects
- All-pairs ("spatial join") queries: find all pairs of objects O_i, O_j within distance e of each other: D(O_i,O_j) < e
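The three query types can be sketched as sequential scans, which is the baseline any index has to beat. This is only an illustrative sketch: the Euclidean distance, the function names and the toy data are assumptions, and the comparison uses <= rather than the strict < of the slide.

```python
import heapq
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (assumed D)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_query(query, objects, eps, dist=euclidean):
    """Indices of all objects within distance eps of the query."""
    return [i for i, obj in enumerate(objects) if dist(query, obj) <= eps]


def knn_query(query, objects, k, dist=euclidean):
    """Indices of the k objects closest to the query, nearest first."""
    return heapq.nsmallest(
        k, range(len(objects)), key=lambda i: dist(query, objects[i])
    )


def all_pairs_query(objects, eps, dist=euclidean):
    """All index pairs (i, j), i < j, whose objects lie within eps of each other."""
    return [
        (i, j)
        for i, j in combinations(range(len(objects)), 2)
        if dist(objects[i], objects[j]) <= eps
    ]


# Toy data, made up for illustration.
objects = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.0, 2.0)]
query = (0.0, 0.5)

print(range_query(query, objects, eps=1.2))
print(knn_query(query, objects, k=2))
print(all_pairs_query(objects, eps=1.0))
```

Every call above touches all N objects (or all N(N-1)/2 pairs), which is the cost the indexing methods in the following slides try to avoid.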

Similarity Queries (cont'd)

- Whole matching: the whole query Q matches an object O_i
  - the image is 512x512, the query is 512x512
- Partial matching: the query specifies only a part of an object
  - find parts of objects that match the query
  - the images are 512x512, the query is 32x32

Object Types

- 1D signals:
  - time sequences
  - scientific data
  - digitized voice or music
- 2D signals:
  - digitized images (gray scale, color)
  - video clips
- General objects:
  - text, multimedia documents

Applications

- In many applications, searching for similar patterns helps in predictions, decision making, data mining etc.
- Financial
- Marketing & production of 1D signals
- Scientific databases
- DNA/genome databases
- Audio databases
- Image and video databases

Queries

- Find companies whose stock prices move similarly or follow a similar pattern of growth
- Find products with similar selling patterns
- Find whether a musical score is similar to one of the copyrighted scores
- Find images that look like a sunset
- Find X-rays showing a lung tumor

Indexing [Agrawal et al. 93]

- To search faster than sequential scanning, the objects are indexed
- Extract f features from each object and apply a SAM (spatial access method) to index the resulting feature vector
- Search the SAM to retrieve promising objects
- Clean up the response
- The indexing method must be correct (i.e., have no "misses"), have small space overhead and be dynamic

Mapping Objects to Space

- Objects are mapped to points
- A query Q becomes a sphere with radius e

Mapping Objects to Points

- F(): mapping function
- D_f: object distance in feature space
- D: object distance in actual space
- Selection of F() and D_f?
  - Ideally, D_f(Q_i,O_j) = D(Q_i,O_j)
  - The mapping preserves the distances
  - The mapping should guarantee no misses

GEMINI [Faloutsos 96]

- GEMINI: Generic Multimedia Indexing
  1. Define F(): mapping of objects to f features (objects become vectors)
  2. Determine the distance function D_f in the f-dimensional feature space
  3. Guarantee correctness: prove that D_f <= D
  4. Apply a SAM (e.g., R-tree) to index the f-dimensional vectors
  5. Apply the Search Algorithm to eliminate false drops

Search Algorithm

- Problem: retrieve all objects satisfying D(Q,O) < e
  - Retrieve the points satisfying D_f(Q,O) < e from the SAM
  - Retrieve the actual objects S corresponding to these points
  - Keep only those satisfying D(Q,S) < e (discard false alarms)
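The filter-and-refine steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the feature map (keep the first f coordinates) is a hypothetical stand-in for F(), chosen only because it trivially lower-bounds the Euclidean distance, and a linear scan over the feature vectors stands in for the SAM.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (assumed D and D_f)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def extract_features(obj, f=2):
    """Hypothetical F(): keep the first f coordinates.

    Dropping coordinates can only remove non-negative terms from the sum,
    so the feature distance never exceeds the actual distance.
    """
    return tuple(obj[:f])


def gemini_range_search(query, objects, eps, f=2):
    """Return (candidates, answers) for a range query with tolerance eps."""
    q_feat = extract_features(query, f)
    # Filter step: a linear scan over feature vectors stands in for the SAM.
    candidates = [
        i
        for i, obj in enumerate(objects)
        if euclidean(q_feat, extract_features(obj, f)) <= eps
    ]
    # Refine step: compute the actual distance and discard false alarms.
    answers = [i for i in candidates if euclidean(query, objects[i]) <= eps]
    return candidates, answers


# Toy data, made up for illustration; object 2 is a false alarm.
objects = [(0, 0, 0, 0), (1, 0, 0, 0), (0, 0, 3, 0), (4, 4, 0, 0)]
candidates, answers = gemini_range_search((0, 0, 0, 0), objects, eps=1.5)
print(candidates, answers)
```

Object 2 passes the filter because its first two coordinates match the query, but its actual distance is 3, so the refine step drops it. No qualifying object can be lost in the filter step as long as the feature distance lower-bounds the actual one.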

Lower Bounding

- Lemma: to guarantee no false dismissals, F() should satisfy
  - D_f(Q,O_i) <= D(Q,O_i) for all Q, O_i
- Proof: prove that if an object qualifies for the query, it will be retrieved in the feature space
  - If O_i qualifies, then D(Q,O_i) <= e; since D_f(Q,O_i) <= D(Q,O_i), we have D_f(Q,O_i) <= e, so O_i is retrieved

Indexing 1D Signals

- Find all signals S = (s_1, s_2, …, s_n) within distance e from Q = (q_1, q_2, …, q_n)
  - D(Q,S) < e
  - s_i, q_i: amplitudes at time i
  - D is defined as the Euclidean distance: $D(Q,S) = \left( \sum_{i=1}^{n} |s_i - q_i|^2 \right)^{1/2}$
- Apply GEMINI
  - But how are F() and D_f() defined?

Definition of F, D_f

- The DFT maps a signal s = (s_1, s_2, …, s_n) to its frequency spectrum S = (S_1, S_2, …, S_n)
- F() takes the first f_c Fourier coefficients
  - f_c: "cut-off" frequency (e.g., f_c = 5)
- Signals become points in an f = 2f_c dimensional space (because the coefficients S_i are complex numbers)
- D_f is defined as the Euclidean distance over the retained coefficients: $D_f(Q,S) = \left( \sum_{i=1}^{f_c} |S_i - Q_i|^2 \right)^{1/2}$

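D_f Lower Bounds D

- Let S, Q be the DFTs of s, q
- Parseval's theorem: the energy in the time and frequency domains is the same
- This implies that $\sum_{i=1}^{n} |s_i|^2 = \sum_{i=1}^{n} |S_i|^2$ and, by linearity of the DFT, $\sum_{i=1}^{n} |s_i - q_i|^2 = \sum_{i=1}^{n} |S_i - Q_i|^2$ (for the DFT normalized by $1/\sqrt{n}$)
- D_f(Q,S) <= D(q,s) because D_f is computed using only f_c <= n of these non-negative terms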
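The lower-bounding property can be checked numerically with a small sketch. It assumes the unitary DFT (normalized by 1/sqrt(n)), so that Parseval's theorem holds without extra constants; the function names and the two toy signals are illustrative only.

```python
import cmath
import math


def dft(signal):
    """Unitary DFT: the 1/sqrt(n) factor makes Parseval's theorem hold exactly."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        / math.sqrt(n)
        for k in range(n)
    ]


def time_distance(s, q):
    """Euclidean distance D between two signals in the time domain."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))


def feature_distance(s, q, fc):
    """D_f: Euclidean distance over the first fc (complex) DFT coefficients."""
    S, Q = dft(s), dft(q)
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(S[:fc], Q[:fc])))


# Toy signals, made up for illustration.
s = [1.0, 2.0, 4.0, 3.0, 2.0, 1.0, 0.0, 1.0]
q = [1.5, 2.5, 3.0, 3.0, 1.0, 1.0, 0.5, 0.0]

full = time_distance(s, q)
for fc in range(1, len(s) + 1):
    # D_f grows towards D as more coefficients are kept, but never exceeds it.
    assert feature_distance(s, q, fc) <= full + 1e-9
```

With all n coefficients kept, the feature distance equals the time-domain distance up to rounding, which is Parseval's theorem applied to the difference signal.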

Experiments

- Faster than sequential scanning for all set sizes
- Slower but more accurate for more coefficients
- The trade-off reaches an equilibrium for f = 3 or 4

Intuition

- For the majority of 1D signals there will be a few frequencies with high amplitudes
- If we index only the first few f_c coefficients (f_c < 5 or 10) we shall have only a few false drops
- R-trees can handle up to 20 dimensions for point data

NN Queries [Korn et al. 98]

- Find the k-NNs of query Q:
  1. Search the SAM to find the k-NNs [e.g., Rous95] using D_f
  2. Compute D for all these k objects
  3. Let E = max{D(q,s_i)}, 1 <= i <= k
  4. Issue a range query D_f(q,s) <= E on the SAM and retrieve a new set of objects
  5. Compute their actual distances D(q,s)
  6. Output the nearest k objects
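The six steps can be sketched as below. As an assumption of this sketch, the feature map keeps only the first f coordinates (a hypothetical F() that lower-bounds the Euclidean distance), and linear scans replace the SAM searches.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def features(obj, f=1):
    """Hypothetical lower-bounding F(): keep the first f coordinates."""
    return tuple(obj[:f])


def knn_two_step(query, objects, k, f=1):
    """k-NN search that uses feature distances to prune, then actual distances."""
    qf = features(query, f)
    feats = [features(o, f) for o in objects]
    # Step 1: k-NN in feature space (a linear scan stands in for the SAM).
    first = heapq.nsmallest(
        k, range(len(objects)), key=lambda i: euclidean(qf, feats[i])
    )
    # Steps 2-3: actual distances of these k objects; E is the largest one.
    E = max(euclidean(query, objects[i]) for i in first)
    # Step 4: range query with radius E in feature space.
    candidates = [i for i in range(len(objects)) if euclidean(qf, feats[i]) <= E]
    # Steps 5-6: rank the candidates by actual distance and keep the best k.
    return heapq.nsmallest(
        k, candidates, key=lambda i: euclidean(query, objects[i])
    )


# Toy data, made up for illustration: object 0 looks close in feature
# space (first coordinate only) but is far away in the actual space.
objects = [(0.0, 5.0), (0.1, 0.1), (1.0, 0.0), (3.0, 0.0)]
query = (0.0, 0.0)
print(knn_two_step(query, objects, k=2))
```

Step 1 alone would return objects 0 and 1; the range query of step 4 brings object 2 back into the candidate set, so the final answer matches a full scan.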

Correctness of NN Algorithm

- Lemma: the algorithm has no misses
- Proof (by contradiction): let s_l be the true l-th NN of q, with l <= k, and suppose the algorithm did not retrieve it
  - Then the range query (step 4) has missed it: D_f(q,s_l) > E
  - From lower bounding: D(q,s_l) >= D_f(q,s_l) > E (*)
  - However, the k objects found in step 1 all have actual distance at most E, so the true k-th NN s_k satisfies D(q,s_k) <= E, and therefore D(q,s_l) <= D(q,s_k) <= E, which contradicts (*)

Partial Matching [Faloutsos94]

- Problem: given N data sequences S_1, S_2, …, S_N and a query Q, locate data subsequences that match a query subsequence
  - locate stock prices with similar monthly patterns of growth
  - extract f features, apply a SAM etc.

Methodology

- Locate a matching window of length w on the signal (length(S) - w + 1 positions)
- Assume a minimum query length w
  - the method handles any query of length at least w
  - shorter queries are of no interest
- Longer queries are split into queries of length w
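The window extraction and query splitting can be sketched as follows. The function names are illustrative, and dropping a trailing remainder shorter than w is a simplifying assumption of this sketch, not something the slide specifies.

```python
def sliding_windows(signal, w):
    """All length(S) - w + 1 windows of length w, paired with their start offsets."""
    return [
        (start, signal[start:start + w])
        for start in range(len(signal) - w + 1)
    ]


def split_query(query, w):
    """Split a query into consecutive, non-overlapping pieces of length w.

    Assumption of this sketch: a trailing remainder shorter than w is dropped.
    """
    if len(query) < w:
        raise ValueError("queries shorter than w are not supported")
    return [query[i:i + w] for i in range(0, len(query) - w + 1, w)]


signal = [1, 2, 3, 4, 5]
print(sliding_windows(signal, w=3))
print(split_query([1, 2, 3, 4, 5, 6, 7], w=3))
```

Each window would then be mapped to a point in the feature space, so a signal of length n contributes n - w + 1 points.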

Splitting a Query

- Mapping sequences S = (s_1, s_2, s_3) and S' = (s'_1, s'_2) and query Q = (q_1, q_2)

[Figure: the points s_1, s_2, s_3 and s'_1, s'_2 and the query points q_1, q_2, each with radius e, plotted in the feature space with axes F_1 and F_2]

Indexing Subsequences

- I-naive method: index all w-trails
  - Inefficient in terms of space and speed
  - 1:f increase in storage, tall, slow R-tree
- ST-index: index the w-trails in groups
  - Subsequent trails are similar
  - Grouping in the f-dimensional feature space
  - Index rectangles containing similar trails

Grouping of Subsequences

- Organize w-trails in the f-dimensional space in rectangles so that disk accesses are minimized
- Fixed number of points per rectangle, but which is the optimal number?
- Smaller rectangles, fewer disk accesses
  - a rectangle L = (l_1, l_2, …, l_n) causes Π(l_i + 0.5) accesses
  - an m-point rectangle causes Π(l_i + 0.5)/m accesses per point

I-Adaptive Algorithm

- Map the points of w-trails to rectangles in the f-dimensional space
- Assign the first point of a w-trail to a rectangle
- For each successive point, if it increases the cost of the rectangle, start a new rectangle; otherwise include it in the same rectangle
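The greedy grouping can be sketched as below. The sketch takes the cost of a rectangle to be the per-point estimate Π(l_i + 0.5)/m from the previous slide and assumes the feature values are already scaled to the unit hypercube; the function names and the toy trail are illustrative.

```python
import math


def marginal_cost(points):
    """Estimated disk accesses per point for the bounding rectangle of `points`.

    Uses prod(l_i + 0.5) / m, where l_i are the rectangle's side lengths
    and m is the number of points it contains.
    """
    dims = range(len(points[0]))
    sides = [max(p[d] for p in points) - min(p[d] for p in points) for d in dims]
    return math.prod(side + 0.5 for side in sides) / len(points)


def adaptive_grouping(trail):
    """Greedily split a trail of feature points into rectangles (groups of points)."""
    groups = [[trail[0]]]
    for point in trail[1:]:
        current = groups[-1]
        if marginal_cost(current + [point]) > marginal_cost(current):
            # Including the point would raise the per-point cost: start a new rectangle.
            groups.append([point])
        else:
            current.append(point)
    return groups


# Toy trail in a 2-dimensional feature space, made up for illustration.
trail = [(0.10, 0.10), (0.12, 0.11), (0.13, 0.12), (0.90, 0.85), (0.91, 0.86)]
groups = adaptive_grouping(trail)
print([len(g) for g in groups])
```

Nearby points keep lowering the per-point cost because the rectangle barely grows while m increases; the jump to (0.90, 0.85) would inflate the rectangle, so a new one is started there.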

Naïve Method

- Fixed number of points per rectangle

I-Adaptive Method

- Variable number of points per rectangle
- Smaller rectangles, fewer disk accesses

Range Queries [Petrakis 02]

- Input: query Q, distances D, D_f, tolerance e
- Output: signals S satisfying D(Q,S) <= e
  1. Decompose Q = (q_1, q_2, …, q_n)
  2. Apply D_f(q_i,s_j) <= e, store results in A_i
  3. Compute the candidate set A from the A_i [formula missing in the source]
  4. For each S in A compute D(Q,S)
  5. Output sequences satisfying D(Q,S) <= e

NN Queries [Petrakis 02]

- Input: query Q, distances D, D_f, number k
- Output: the k sequences most similar to Q
  1. Decompose Q = (q_1, q_2, …, q_n)
  2. Apply a k-NN query for each q_i
     - Retrieve k distinct w-trails (incremental k-NN search) [Hjaltason 99]
     - Compute e_i, their maximum distance from Q
  3. Compute e = min{e_i}
  4. Apply a range query D(Q,S) <= e
  5. Output the k sequences closest to Q

References

- R. Agrawal, C. Faloutsos, A. Swami: "Efficient Similarity Search in Sequence Databases", Proc. of FODO Conf., Oct. 1993
- C. Faloutsos, M. Ranganathan, Y. Manolopoulos: "Fast Subsequence Matching in Time-Series Databases", Proc. of SIGMOD, May 1994
- P. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, Z. Protopapas: "Fast and Effective Retrieval of Medical Tumor Shapes", IEEE TKDE, Vol. 10, 1998
- E.G.M. Petrakis: "Fast Retrieval by Spatial Structure in Image DataBases", Journal of Visual Languages and Computing, 2002 (to appear)
- N. Roussopoulos, S. Kelley, F. Vincent: "Nearest-Neighbor Queries", Proc. ACM SIGMOD, May 1995
- G. R. Hjaltason, H. Samet: "Distance Browsing in Spatial Databases", ACM Trans. on Database Systems, 24(2):265–318, 1999