Fast Subsequence Matching in Time-Series Databases Christos Faloutsos M. Ranganathan Yannis Manolopoulos Department of Computer Science and ISR University.


Fast Subsequence Matching in Time-Series Databases Christos Faloutsos M. Ranganathan Yannis Manolopoulos Department of Computer Science and ISR University of Maryland at College Park Presented by Rui Li

Abstract Goal: To find an efficient indexing method to locate time series in a database Main Idea: –Map each time series into a small set of multidimensional rectangles in feature space –Rectangles can be readily indexed using traditional spatial access methods, e.g., R*-tree

Introduction Hot Problem: Searching similar patterns in time-series databases Applications: –financial, marketing and production time series, e.g. stock prices –scientific databases, e.g. weather, geological, environmental data

Introduction (cont.) Similarity Queries: –Whole Matching –Subsequence Matching: partial matching; report the matching time series along with the offset

Introduction (cont.) Whole Matching (Previous Work) –Use a distance-preserving transform (e.g., DFT) to extract f features from each time series (e.g., the first f DFT coefficients), and then map each series to a point in the f-dimensional feature space –A spatial access method (e.g., an R*-tree) can then be used to answer approximate queries

Introduction (cont.) Subsequence Matching (Goal) –Map time series into rectangles in feature space –Spatial access methods as the eventual indexing mechanism

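Background To guarantee no false dismissals for range queries, the feature extraction function F() should satisfy the lower-bounding condition $D_{\mathrm{feature}}(F(O_1), F(O_2)) \le D(O_1, O_2)$, i.e., the distance in feature space never exceeds the true distance. Parseval Theorem: –The DFT preserves the Euclidean distance between two time series, so keeping only the first few DFT coefficients can only underestimate it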
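The lower-bounding property can be checked numerically. The following is a minimal sketch, assuming an orthonormal DFT (scaled by 1/sqrt(n)) and two made-up sample series; the helper names `dft`, `features`, and `euclidean` are illustrative and do not come from the slides.

```python
import cmath
import math

def dft(x):
    """Naive orthonormal DFT (scaled by 1/sqrt(n)) so Parseval holds with equality."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) / math.sqrt(n)
        for k in range(n)
    ]

def euclidean(a, b):
    """Euclidean distance; abs() makes it work for complex coefficients too."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))

def features(x, f):
    """Feature vector: only the first f DFT coefficients (an illustrative choice)."""
    return dft(x)[:f]

# Two made-up series of equal length.
s1 = [1.0, 2.0, 4.0, 3.0, 5.0, 4.0, 6.0, 7.0]
s2 = [2.0, 2.5, 3.0, 4.0, 4.0, 5.5, 5.0, 8.0]

d_time = euclidean(s1, s2)                            # true distance
d_freq = euclidean(dft(s1), dft(s2))                  # same value, by Parseval
d_feat = euclidean(features(s1, 2), features(s2, 2))  # lower bound on d_time
```

Because dropping coefficients only removes non-negative terms from the sum of squares, `d_feat` can never exceed `d_time`, which is what rules out false dismissals.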

Proposed Method Mapping each time series to a trail in feature space –Use a sliding window of size w and place it at every possible offset –For each such placement of the window, extract the features of the subsequence inside the window –A time series of length L is mapped to a trail in feature space, consisting of L-w+1 points: one point for each offset
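The sliding-window mapping can be sketched as follows. This is an illustration under assumptions: the features are the real and imaginary parts of the first f orthonormal DFT coefficients, and the function names and the toy series are hypothetical.

```python
import cmath
import math

def dft_features(window, f):
    """Real and imaginary parts of the first f DFT coefficients (2*f dimensions)."""
    n = len(window)
    feats = []
    for k in range(f):
        c = sum(window[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        c /= math.sqrt(n)
        feats.extend([c.real, c.imag])
    return tuple(feats)

def trail(series, w, f):
    """One feature point per window placement: offsets 0 .. len(series) - w."""
    return [dft_features(series[i:i + w], f) for i in range(len(series) - w + 1)]

# Toy series of length L = 20 with window size w = 8.
series = [float((3 * i) % 7) for i in range(20)]
points = trail(series, w=8, f=2)
```

With L = 20 and w = 8 the trail has L-w+1 = 13 points, each with 2f = 4 coordinates.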

Example1

Example2 (a) a sample stock-price time series (b) its trail in the feature space of the 0-th and 1-st DFT coefficients (c) its trail of the 1-st and 2-nd DFT coefficients

Proposed Method (cont.) Indexing the trails –Simply storing the individual points of the trail in an R*-tree is inefficient –Exploit the fact that successive points of the trail tend to be similar, i.e., the contents of the sliding window in nearby offsets tend to be similar –Divide the trail into sub-trails and represent each of them with its minimum bounding (hyper)-rectangle (MBR) –Store only a few MBRs

Proposed Method (cont.) Indexing the trails (cont.) –Can guarantee ‘no false dismissals’: when a query arrives, all the MBRs that intersect the query region are retrieved, i.e., all the qualifying sub-trails are retrieved, plus some false alarms
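A small sketch of the retrieval step, assuming 2-D feature points and a fixed grouping of three points per sub-trail (an arbitrary choice made here for brevity). The names `mbr`, `min_dist`, and `candidates` are hypothetical, and a linear scan stands in for the R*-tree.

```python
import math

def mbr(points):
    """Minimum bounding rectangle of a sub-trail as (low corner, high corner)."""
    dims = range(len(points[0]))
    low = tuple(min(p[d] for p in points) for d in dims)
    high = tuple(max(p[d] for p in points) for d in dims)
    return low, high

def min_dist(q, box):
    """Smallest Euclidean distance from point q to any point of the box."""
    low, high = box
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, low, high)))

def candidates(boxes, q, eps):
    """Indices of MBRs that intersect the query sphere (center q, radius eps)."""
    return [i for i, box in enumerate(boxes) if min_dist(q, box) <= eps]

trail_points = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.1), (4.0, 4.0), (4.2, 4.5), (4.1, 5.0)]
sub_trails = [trail_points[0:3], trail_points[3:6]]
boxes = [mbr(st) for st in sub_trails]

q, eps = (1.2, 0.3), 0.5
hits = candidates(boxes, q, eps)
```

Every trail point within eps of q lies inside its own MBR, so that MBR must intersect the query sphere. The retrieved set can therefore contain false alarms, such as the point (0.5, 0.2) here, but it cannot miss a qualifying point.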

Return to example1 ε

Proposed Method (cont.) Indexing the trails (cont.) –Map a time series into a set of rectangles in feature space –Each MBR corresponds to a sub-trail

Proposed Method (cont.) For each MBR we have to store –$t_{start}$, $t_{end}$, which are the offsets of the first and last such positionings –A unique identifier for each time series –The extent of the MBR in each dimension, i.e., $F_{1,low}, F_{1,high}, F_{2,low}, F_{2,high}, \ldots$ Store the MBRs in an R*-tree –Recursively group the MBRs into parent MBRs, grandparent MBRs, etc.
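The stored record might be represented along these lines. This is only a sketch: the class name `SubTrailEntry`, its field names, and the helper `parent_mbr` are assumptions for illustration, and a real R*-tree chooses its groupings far more carefully than this simple union.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SubTrailEntry:
    """Hypothetical leaf entry: one MBR covering one sub-trail."""
    seq_id: int               # identifier of the time series
    t_start: int              # offset of the first window in the sub-trail
    t_end: int                # offset of the last window in the sub-trail
    low: Tuple[float, ...]    # lower extent of the MBR in each dimension
    high: Tuple[float, ...]   # upper extent of the MBR in each dimension

def parent_mbr(entries):
    """Bounding rectangle that encloses a group of child MBRs."""
    dims = range(len(entries[0].low))
    low = tuple(min(e.low[d] for e in entries) for d in dims)
    high = tuple(max(e.high[d] for e in entries) for d in dims)
    return low, high

entries = [
    SubTrailEntry(seq_id=7, t_start=0, t_end=41, low=(0.0, 0.1), high=(0.4, 0.5)),
    SubTrailEntry(seq_id=7, t_start=42, t_end=90, low=(0.3, 0.2), high=(0.9, 0.6)),
]
parent = parent_mbr(entries)
```

Applying `parent_mbr` repeatedly to groups of rectangles gives the parent and grandparent levels mentioned above.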

Example1 (cont.) –assuming a fan-out of 4

Proposed Method (cont.) The structure of a leaf node and a non-leaf node

Proposed Method (cont.) Two questions –Insertions: when a new time series is inserted, what is a good way to divide its trail into sub-trails –Queries: how to handle queries, especially the ones that are longer than the sliding window

Proposed Method (cont.) Insertion – Dividing trails into sub-trails –Goal: Optimal division so that the number of disk accesses is minimized

Example3 fixed heuristic adaptive heuristic

Proposed Method (cont.) Insertion (cont.) –Group trail-points into sub-trails by means of an adaptive heuristic –Based on a greedy algorithm, using a cost function to estimate the number of disk accesses for each of the options

Proposed Method (cont.) Insertion (cont.) –The cost function: $DA(\vec{L}) = \prod_{i=1}^{n} (L_i + 0.5)$, where $\vec{L} = (L_1, \ldots, L_n)$ are the sides of the n-dimensional MBR of a node in an R-tree –The marginal cost of each point: $mc = DA(\vec{L}) / k$, where k is the number of points in this MBR

Proposed Method (cont.) Insertion (cont.) –Algorithm:
Assign the first point of the trail to a sub-trail (would be a predefined small MBR)
FOR each successive point
  IF it increases the marginal cost of the current sub-trail
  THEN start a new sub-trail
  ELSE include it into the current sub-trail
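The greedy split can be sketched in a few lines. The sketch assumes the cost estimate DA = product of (side + 0.5) over all dimensions with feature coordinates scaled to roughly the unit cube, and it starts each sub-trail from a degenerate single-point MBR rather than a predefined small one. The function names are illustrative.

```python
def cost(low, high):
    """Estimated disk accesses for an MBR: product of (side length + 0.5)."""
    da = 1.0
    for l, h in zip(low, high):
        da *= (h - l) + 0.5
    return da

def split_trail(points):
    """Greedy division: open a new sub-trail when the marginal cost would rise."""
    sub_trails = []
    current = [points[0]]
    low = high = points[0]  # degenerate MBR (assumption of this sketch)
    for p in points[1:]:
        new_low = tuple(min(a, b) for a, b in zip(low, p))
        new_high = tuple(max(a, b) for a, b in zip(high, p))
        old_mc = cost(low, high) / len(current)
        new_mc = cost(new_low, new_high) / (len(current) + 1)
        if new_mc > old_mc:
            sub_trails.append(current)        # close the current sub-trail
            current, low, high = [p], p, p    # start a new one at p
        else:
            current.append(p)
            low, high = new_low, new_high
    sub_trails.append(current)
    return sub_trails

# Three nearby points followed by a jump to a distant pair.
pts = [(0.0, 0.0), (0.01, 0.01), (0.02, 0.0), (0.9, 0.9), (0.91, 0.9)]
groups = split_trail(pts)
```

On this toy trail the jump to (0.9, 0.9) would raise the marginal cost from about 0.09 to 0.49, so the heuristic closes the first sub-trail there and produces two groups.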

Proposed Method (cont.) Insertion (cont.) –The algorithm may not work well under certain circumstances –The algorithm’s goal is to minimize the size of each MBR; why not use clustering techniques instead?

Proposed Method (cont.) Searching – Queries longer than w –If Len(Q)=w, the search algorithm proceeds as follows: map Q to a point q in the feature space, so that the query corresponds to a sphere with center q and radius ε; retrieve the sub-trails whose MBRs intersect the query region; examine the corresponding time series, and discard the false alarms
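The final post-processing step, discarding false alarms, amounts to checking the true distance at every offset covered by a retrieved sub-trail. A minimal sketch follows, assuming the candidate is given by its first and last offsets; the function name `verify` and the toy data are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def verify(series, t_start, t_end, query, eps):
    """Offsets in [t_start, t_end] whose window really lies within eps of the query."""
    w = len(query)
    return [t for t in range(t_start, t_end + 1)
            if dist(series[t:t + w], query) <= eps]

series = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0, 4.0]
query = [2.0, 3.0, 4.0]

# Suppose the index returned a sub-trail covering offsets 0..7 of this series.
matches = verify(series, 0, 7, query, eps=0.1)
```

Only offsets 1 and 7 survive; the other six offsets in the retrieved sub-trail are false alarms.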

Proposed Method (cont.) Searching (cont.) –If Len(Q)>w, consider the following Lemma: Consider two sequences Q and S of the same length Len(Q)=Len(S)=p*w. Consider their p disjoint subsequences $q_i = Q[i \cdot w + 1 : (i+1) \cdot w]$ and $s_i = S[i \cdot w + 1 : (i+1) \cdot w]$, where $0 \le i \le p-1$. If Q and S agree within tolerance ε, then at least one of the pairs $(q_i, s_i)$ of corresponding subsequences agrees within tolerance $\varepsilon / \sqrt{p}$

Proposed Method (cont.) Searching (cont.) –If Len(Q)>w, the search algorithm proceeds as follows: the query time series Q is broken into p sub-queries, which correspond to p spheres in the feature space with radius $\varepsilon / \sqrt{p}$; retrieve the sub-trails whose MBRs intersect at least one of the sub-query regions; examine the corresponding subsequences of the time series, and discard the false alarms
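The decomposition of a long query can be illustrated as below. The sketch assumes the sub-query radius eps/sqrt(p) from the lemma in the original paper and, for simplicity, ignores any trailing part of the query shorter than w. The function names and the toy sequences are made up.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sub_queries(query, w):
    """Break the query into p = len(query) // w disjoint pieces of length w."""
    p = len(query) // w
    return [query[i * w:(i + 1) * w] for i in range(p)]

def lemma_holds(q, s, w, eps):
    """If q and s agree within eps, some piece pair agrees within eps / sqrt(p)."""
    if dist(q, s) > eps:
        return True  # premise is false, nothing to check
    p = len(q) // w
    radius = eps / math.sqrt(p)
    return any(dist(qi, si) <= radius
               for qi, si in zip(sub_queries(q, w), sub_queries(s, w)))

Q = [0.0] * 8
S = [0.1, 0.0, 0.0, 0.2, 0.0, 0.1, 0.0, 0.0]

pieces = sub_queries(Q, w=4)          # p = 2 sub-queries
ok = lemma_holds(Q, S, w=4, eps=0.3)  # second piece is within 0.3 / sqrt(2)
```

Here the first pair of pieces is about 0.224 apart, which exceeds the reduced radius of about 0.212, but the second pair is only 0.1 apart, so S would still be retrieved through the second sub-query.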

Experiments Experiments were run on a stock prices database of 329,000 points. Only the first 3 frequencies of the DFT are used; thus the feature space has 6 dimensions (real and imaginary parts of each retained DFT coefficient). Sliding window size w=512

Experiments (cont.) Query time series were generated by taking random offsets into the time series and obtaining subsequences of length Len(Q) from those offsets

Experiments (cont.) Four groups of experiments were carried out –Comparison of the proposed method against the method that has sub-trails with only one point each –Experiments to compare the response time –Experiments with queries longer than w –Experiments with larger databases

Related Works (citations) –Continuous queries over data streams –Similarity indexing with M-tree/SS-tree, etc. –Efficient time series matching by wavelets –Fast similarity search in the presence of noise, scaling, and translation in time-series databases

Thank you!