Fast Subsequence Matching in Time-Series Databases.


Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos. Presented by George Liu / Luis L. Perez

Time series? Definition. Applications: financial markets, weather forecasting, healthcare.

What kind of problem are we trying to solve?
Whole sequence matching: Given a database S with n sequences, all of them equally long, and a query sequence Q of the same length, find all sequences in S that match Q.
Subsequence matching: Given a database S with n sequences, with potentially different lengths, and a query sequence Q, find all sequences in S that contain a subsequence matching Q.

Useful notation
Given a sequence S: Len(S) denotes the length of the sequence. S[i] denotes the ith element. S[i:j] denotes the subsequence between S[i] and S[j].
Given two sequences, S and Q: D(S,Q) denotes the (Euclidean) distance between S and Q.
Distance bound e: max. distance for two sequences to be considered “equal”.

Naïve approaches
Sequential scanning: clearly infeasible.
R-tree: might work, but dimensionality is extremely high (proportional to sequence length), so performance is poor.
What can we do to improve performance?

Dimensionality reduction. Redundant data, lots of patterns. Feature extraction. Data transformation: Cosine, Wavelet, Fourier <-- we'll focus on this.

Discrete Fourier Transform (DFT). Maps a sequence x in the time domain to a sequence X in the frequency domain. Reversible! Fast and easy-to-implement algorithms. Energy preservation property: a key concept in dimensionality reduction. Just keep the first 2 or 3 coefficients.

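Parseval's theorem. Let S and Q be the original sequences, and S' and Q' the same sequences after applying the DFT. When all coefficients are kept, D(S,Q) = D(S',Q'). Why is this important? When only the first few coefficients are kept, the distance in feature space can only underestimate the true distance. Remember the bound e: D(S,Q) < e ---> D(S',Q') < e. We will get no false dismissals.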
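The lower-bounding property can be checked numerically. The following is a minimal sketch, assuming a DFT scaled by 1/sqrt(n) (so that Parseval's theorem holds as an equality) and an illustrative choice of k = 3 kept coefficients. The helper names dft, distance and features are made up for this example and do not come from the slides.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) DFT scaled by 1/sqrt(n), so distances are preserved."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


def distance(a, b):
    """Euclidean distance between two equal-length real or complex sequences."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))


def features(x, k=3):
    """Keep only the first k DFT coefficients (k = 3 is an assumption)."""
    return dft(x)[:k]


s = [1.0, 2.0, 4.0, 3.0, 5.0, 6.0, 4.0, 2.0]
q = [1.5, 2.5, 3.0, 3.5, 4.0, 6.5, 3.0, 2.5]

full_time = distance(s, q)                   # distance in the time domain
full_freq = distance(dft(s), dft(q))         # same distance, all coefficients
reduced = distance(features(s), features(q)) # distance on truncated features
```

Truncation only drops non-negative terms from the sum of squares, so reduced can never exceed full_time. A range query of radius e in feature space therefore returns every true match, plus possibly some false alarms.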

Subsequence Matching
The problem: You are given a collection of N sequences of real numbers (S1, S2, ..., SN), potentially of different lengths. The user specifies a query subsequence Q and the tolerance e, the max. acceptable dissimilarity. You want to return all the sequences, along with the correct offsets k, that contain a subsequence matching the query within tolerance e.
Solutions: many!

Possible Solutions
1) Brute force method - Sequentially scan every possible subsequence of the data sequences for a match.
2) I-Naive - Transform all subsequences to points in feature space and store those points in an R*-tree.
3) ST-Index - Transform all subsequences to points in feature space. Store MBRs of sub-trails in an R*-tree.
Note: I-Naive and ST-Index are similar in the initial steps.
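As a baseline, the brute force method can be sketched as follows. This is an illustrative version only: the function names and the toy data are assumptions, and a window is reported when its Euclidean distance to the query is at most eps.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def sequential_scan(sequences, query, eps):
    """Return (sequence index, offset, distance) for every matching window."""
    m = len(query)
    hits = []
    for i, seq in enumerate(sequences):
        # Slide a window of the query's length over every possible offset.
        for k in range(len(seq) - m + 1):
            d = euclidean(seq[k:k + m], query)
            if d <= eps:
                hits.append((i, k, d))
    return hits


data = [[0, 1, 2, 3, 2, 1, 0], [5, 5, 1, 2, 3, 9]]
matches = sequential_scan(data, [1, 2, 3], eps=0.5)
```

Every window of every sequence is compared against the query, which is the cost the index-based methods try to avoid.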

Possible Solutions: I-naive
*Assume that the min. query length is w. w changes according to the application (e.g., stock-market applications interested in weekly/monthly patterns have a larger w).
Procedure:
1) Use the "sliding window" to find every subsequence of length w in a sequence.
2) Use the DFT to map each subsequence of size w to a point in feature space.
3) A trail of Len(S)-w+1 points is produced.
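Steps 1 to 3 can be sketched as below. The sketch assumes that each window is reduced to its first k DFT coefficients and that each complex coefficient is flattened into two real features. The names window_features and trail, and the values w = 4 and k = 2, are illustrative choices rather than anything prescribed by the slides.

```python
import cmath
import math


def window_features(window, k=2):
    """First k DFT coefficients (scaled by 1/sqrt(n)) as 2*k real numbers."""
    n = len(window)
    feats = []
    for f in range(k):
        c = sum(window[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        c /= math.sqrt(n)
        feats.extend((c.real, c.imag))
    return feats


def trail(seq, w, k=2):
    """One feature point per window offset: Len(S) - w + 1 points in total."""
    return [window_features(seq[i:i + w], k) for i in range(len(seq) - w + 1)]


s = [3.0, 4.0, 6.0, 5.0, 7.0, 8.0, 6.0, 4.0, 3.0, 5.0]
points = trail(s, w=4)
```

Consecutive windows overlap in all but one element, so consecutive points of the trail tend to lie close together in feature space.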

Possible Solutions: I-naive
Procedure cont.:
4) Store all the points of the trails in feature space in a spatial access method (R*-tree).
5) When presented with a query of length w and tolerance e, extract the features of the query and perform a spatial range query with radius e.
6) Discard false alarms by retrieving all those subsequences and calculating their actual distance from the query.
Note: Very, very slow approach. Worse than sequential scan. You have a large R*-tree (tall and slow).

Possible Solutions: ST-Index
*Assume that the min. query length is w. w changes according to the application (e.g., stock-market applications interested in weekly/monthly patterns have a larger w).
Procedure:
1) Use the "sliding window" to find every subsequence of length w in a sequence.
2) Use the DFT to map each subsequence of size w to a point in feature space.
3) A trail of Len(S)-w+1 points is produced.

Possible Solutions: ST-Index
Procedure cont.:
4) Divide the trail of points in feature space into sub-trails (algorithm mentioned later).
5) Represent each sub-trail by its MBR (minimum bounding rectangle).
6) Store the MBRs in a spatial access method (e.g., R*-tree).

MBRs in F-Dimension


Insertions
Problem: How do we divide these trails into sub-trails?
Two heuristics:
1) Every sub-trail has a predetermined, fixed number of points. (I-fixed)
2) Every sub-trail has a predetermined, fixed length. (I-fixed)
Solution: Use an "adaptive heuristic." (I-adaptive)

I-adaptive Algorithm - Based on the idea of the marginal cost of a point in terms of disk accesses.
Marginal cost (mc) = Disk accesses of a given MBR / k points in the given MBR
Algorithm:
  Assign the first point of the trail to a sub-trail.
  FOR each successive point
    IF it increases the marginal cost of the current sub-trail
      THEN start another sub-trail
      ELSE include it in the current sub-trail
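The greedy loop above can be sketched as follows. The slides do not say how the disk accesses of an MBR are estimated, so this sketch assumes a common estimate for feature points normalized to the unit hypercube: the product of (side length + 0.5) over all dimensions. The function names and sample points are hypothetical.

```python
def disk_accesses(lo, hi):
    """Assumed cost estimate: product of (side length + 0.5) per dimension."""
    cost = 1.0
    for a, b in zip(lo, hi):
        cost *= (b - a) + 0.5
    return cost


def split_trail(points):
    """Greedily group consecutive trail points into sub-trails."""
    subtrails = []
    current = [points[0]]
    lo, hi = list(points[0]), list(points[0])
    for p in points[1:]:
        # MBR of the current sub-trail if p were added to it.
        new_lo = [min(a, b) for a, b in zip(lo, p)]
        new_hi = [max(a, b) for a, b in zip(hi, p)]
        old_mc = disk_accesses(lo, hi) / len(current)
        new_mc = disk_accesses(new_lo, new_hi) / (len(current) + 1)
        if new_mc > old_mc:
            # Adding p raises the marginal cost: close this sub-trail.
            subtrails.append(current)
            current, lo, hi = [p], list(p), list(p)
        else:
            current.append(p)
            lo, hi = new_lo, new_hi
    subtrails.append(current)
    return subtrails


trail_points = [(0.10, 0.10), (0.12, 0.11), (0.13, 0.13), (0.80, 0.85), (0.82, 0.86)]
groups = split_trail(trail_points)
```

With these sample points, the jump from the third to the fourth point inflates the MBR enough to raise the cost per point, so a new sub-trail starts there.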

I-adaptive Algorithm

Searching
Consider the window length w and distance bound e. Let Q be the query sequence. If Len(Q) = w, it's all good.
Algorithm Search_Short:
  Use the DFT to map Q to a point q in feature space.
  Build a sphere of radius e around q (the query region).
  Using the index, retrieve all the sub-trails whose MBRs intersect the query region.
  Examine the corresponding subsequences and throw away false alarms.
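The filtering step of Search_Short amounts to a sphere-rectangle intersection test. The sketch below replaces the R*-tree with a plain linear scan over a list of MBRs, which is an assumption made only to keep the example self-contained. The sub-trail identifiers and coordinates are invented.

```python
import math


def min_dist_to_mbr(q, lo, hi):
    """Smallest Euclidean distance from point q to the box [lo, hi]."""
    total = 0.0
    for x, a, b in zip(q, lo, hi):
        if x < a:
            total += (a - x) ** 2
        elif x > b:
            total += (x - b) ** 2
    return math.sqrt(total)


def search_short(q, mbrs, eps):
    """Ids of sub-trails whose MBR intersects the sphere of radius eps around q."""
    return [sid for sid, lo, hi in mbrs if min_dist_to_mbr(q, lo, hi) <= eps]


# Each entry: (sub-trail id, lower corner, upper corner) in feature space.
mbrs = [
    ("trail-A", (0.0, 0.0), (1.0, 1.0)),
    ("trail-B", (3.0, 3.0), (4.0, 4.0)),
]
candidates = search_short((1.5, 1.0), mbrs, eps=0.6)
```

The returned sub-trails are only candidates. The subsequences they cover still have to be compared with the query in the time domain to discard false alarms.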

Searching
Now, what if Len(Q) > w? Requires more analysis, but basically we assume that Len(Q) = p*w, so we can split Q into p subsequences of length w. What about the radius? r = e/sqrt(p)
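The radius can be justified with a short argument, sketched here under the assumption that Q and a candidate subsequence S of the same length are each cut into p aligned pieces q_i and s_i of length w (the piece names are introduced only for this illustration):

```latex
D^2(S, Q) = \sum_{i=1}^{p} D^2(s_i, q_i) \le e^2
\;\Longrightarrow\;
\min_{1 \le i \le p} D^2(s_i, q_i) \le \frac{e^2}{p}
\;\Longrightarrow\;
\min_{1 \le i \le p} D(s_i, q_i) \le \frac{e}{\sqrt{p}}
```

If every piece were farther than e/sqrt(p) from its counterpart, the squared distances would sum to more than e^2. So any true match has at least one piece inside a sphere of radius e/sqrt(p), and searching with that radius causes no false dismissals.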

Searching
So we have...
Algorithm Search_Long:
  Break the query Q into p sub-queries of length w, each with radius e/sqrt(p).
  Retrieve from the index all the sub-trails whose MBRs intersect at least one of the sub-query regions.
  Examine the corresponding subsequences and discard false alarms.
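The structure of Search_Long can be illustrated without an index. The sketch below works directly in the time domain and scans all offsets for each piece, which is a simplification: in the method described above, the per-piece test would be a range query on the MBR index. It also assumes Len(Q) is an exact multiple of w. Function names and data are illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def search_long(seq, query, w, eps):
    """Offsets k where seq[k:k+Len(Q)] is within eps of query."""
    m = len(query)
    p = m // w                      # number of pieces; assumes m == p * w
    radius = eps / math.sqrt(p)     # reduced radius for each piece
    candidates = set()
    for j in range(p):
        piece = query[j * w:(j + 1) * w]
        for k in range(len(seq) - m + 1):
            start = k + j * w       # where piece j would sit for offset k
            if euclidean(seq[start:start + w], piece) <= radius:
                candidates.add(k)
    # Post-processing: compute the full distance to discard false alarms.
    return sorted(k for k in candidates if euclidean(seq[k:k + m], query) <= eps)


seq = [0, 1, 2, 3, 4, 3, 2, 1, 0, 1, 2, 3]
query = [2.1, 3.0, 3.9, 3.0]
offsets = search_long(seq, query, w=2, eps=0.5)
```

An offset becomes a candidate as soon as one piece falls within the reduced radius. The final full-distance check is what removes offsets where only part of the query matches.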

Experimental results

Experimental results
Stock price database with ~300,000 points
1 number = 4 bytes
DFT keeping first 3 coefficients (actually 6)
w = 512 bytes
R*-tree

Experimental results
Space: Naïve methods: 24 MB. This method: 5 KB.
Time - “short” queries (Len(Q) = w): 3 to 100 times better response times
Time - “long” queries (Len(Q) > w): 10 to 100 times better response times