A Query Adaptive Data Structure for Efficient Indexing of Time Series Databases Presented by Stavros Papadopoulos.



Time Series and Similarity Search. Time series: a sequence (ordered collection) of real values, X = x_1, x_2, ..., x_n, where n can be very large. Similarity: given a query sequence q, a database S of N sequences S_1, S_2, ..., S_N, a distance measure D and a tolerance threshold ε, two time series x and y are regarded as similar within range ε when D(x, y) ≤ ε.

Time Series and Similarity Search. Whole matching: given a collection of N data sequences of real numbers S_1, S_2, …, S_N and a query sequence Q, we want to find those data sequences that are within distance ε from Q. The data and query sequences must have the same length. Subsequence matching: given N data sequences S_1, S_2, …, S_N of arbitrary lengths, a query sequence Q and a tolerance ε, we want to identify the data sequences S_i that contain matching subsequences, and report them along with the offsets within each data sequence that best match the query sequence.
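The two query types can be illustrated with a brute-force sequential scan. This is only a sketch: the slides do not fix the distance measure D, so Euclidean distance is assumed here, and the function names `whole_matching` and `subsequence_matching` are made up for illustration.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences (assumed choice of D)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def whole_matching(database, query, eps):
    """Return the indices of data sequences within distance eps of the query."""
    return [i for i, s in enumerate(database)
            if len(s) == len(query) and euclidean(s, query) <= eps]

def subsequence_matching(database, query, eps):
    """Return (sequence index, offset) pairs whose subsequence is within eps of the query."""
    m = len(query)
    hits = []
    for i, s in enumerate(database):
        for off in range(len(s) - m + 1):
            if euclidean(s[off:off + m], query) <= eps:
                hits.append((i, off))
    return hits
```

A sequential scan like this is the baseline that the indexing methods discussed in the rest of the talk try to beat.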

Dimensionality Reduction. Instead of sequentially scanning the database to find similar sequences, an indexing method is required to reduce the query time. Every time series of length n is regarded as a point in n-dimensional space. Indexing structures are used: R-trees and their variants (the R*-tree is the most commonly used). R-trees scale badly as dimensionality increases. To search time series databases efficiently, each sequence is therefore represented as a multidimensional vector, and its dimensionality is reduced to a degree at which index structures can be applied efficiently.

Dimensionality Reduction. The following dimensionality reduction techniques are known: Discrete Fourier Transform (DFT), Discrete Wavelet Transform (DWT), Piecewise Aggregate Approximation (PAA), Singular Value Decomposition (SVD), Chebyshev Polynomials (Cheb), Piecewise Linear Approximation (PLA), Adaptive Piecewise Constant Approximation (APCA) and Symbolic Aggregate Approximation (SAX).

Dimensionality Reduction using DFT. Let X = (x_1, x_2, …, x_n) be a time series. We take its Fourier transform and keep only the first f_c coefficients.
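A minimal sketch of this reduction in plain Python follows. The 1/sqrt(n) normalization is an assumption (the slide's formula is not in the transcript); it is chosen so that the transform preserves Euclidean distance. The names `dft`, `reduce_dft` and `reduced_distance` are illustrative only.

```python
import cmath
import math

def dft(x):
    """Orthonormal discrete Fourier transform (naive O(n^2) version)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / math.sqrt(n)
            for f in range(n)]

def reduce_dft(x, fc):
    """Keep only the first fc DFT coefficients of x."""
    return dft(x)[:fc]

def reduced_distance(x, y, fc):
    """Euclidean distance between two series measured on their first fc coefficients."""
    cx, cy = reduce_dft(x, fc), reduce_dft(y, fc)
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(cx, cy)))
```

With this normalization the distance computed on the first f_c coefficients never exceeds the true Euclidean distance, which is what makes the reduced representation usable for indexing.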

Dimensionality Reduction using DFT. Motivation behind using DFT: in most applications the data does not exhibit rapid changes (e.g. stock data). Therefore, the energy of a time series vector is concentrated in the lower frequencies, and the first DFT coefficients carry the information for these lower frequencies. The reconstruction error when using only a few coefficients is small.

Dimensionality Reduction using DFT (figure only).

Dimensionality Reduction Techniques (figures): each panel shows an original series X and its approximation X' under DFT, DWT, SVD, Cheb, PLA, PAA, APCA and SAX; the SAX panel renders the approximation as a symbol string over the alphabet a, b, c, d.

Dimensionality Reduction. Basically, dimensionality reduction is a technique for approximating the original sequence of length n by another sequence of much smaller length. Any dimensionality reduction technique potentially suffers from two problems. False alarms occur when objects that appear to be close in the index space are actually distant; they are removed in a post-processing stage. False dismissals occur when qualifying objects are missed because they appear distant in the index space; they cannot be tolerated by the system.

Evaluating the different techniques. The aforementioned techniques guarantee no false dismissals. False alarms occur because every transform of an n-dimensional point to a point in a space of reduced dimensionality only approximates the true location of the transformed point. CPU time is highly dependent on the implementation; a more accurate and objective measure of the effectiveness of a dimensionality reduction technique is its pruning power.
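The no-false-dismissal guarantee is usually argued through a lower-bounding condition. The block below states it in general and then for the DFT case; the symbols T (the reduction transform) and D_index (the distance in the reduced space) are notation introduced here, and the DFT is assumed to be normalized so that it is orthonormal.

```latex
% Lower-bounding condition: the index-space distance never exceeds the true distance,
% so every sequence within \varepsilon of the query is also within \varepsilon in the index.
D_{\mathrm{index}}\bigl(T(x), T(y)\bigr) \le D(x, y)
\quad\Longrightarrow\quad
\Bigl( D(x, y) \le \varepsilon \;\Rightarrow\; D_{\mathrm{index}}\bigl(T(x), T(y)\bigr) \le \varepsilon \Bigr)

% DFT case (orthonormal transform, Parseval): dropping coefficients can only shrink the sum.
\sum_{t=1}^{n} (x_t - y_t)^2
  \;=\; \sum_{f=0}^{n-1} \lvert X_f - Y_f \rvert^2
  \;\ge\; \sum_{f=0}^{f_c - 1} \lvert X_f - Y_f \rvert^2
```

The gap between the two sides of the last inequality is exactly where false alarms come from: the fewer coefficients kept, the looser the bound.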

Subsequence similarity search: FRM. FRM allows similarity matching between time series of variable size. It predefines a window length ω and divides every time series of length n in the database into n-ω+1 sliding windows. It indexes all these subsequences, with pointers to the original sequence they belong to. It divides the query sequence into disjoint windows of length ω (⌊n/ω⌋ of them for a query of length n). For each query window, the index is searched for similar subsequences. The candidate set is then examined further in order to remove any false alarms.
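The two ways of cutting a sequence into windows can be sketched as follows. The helper names are hypothetical; each function returns (offset, window) pairs so that a window can be traced back to its position in the original sequence.

```python
def sliding_windows(seq, w):
    """All windows of length w, one per starting offset: len(seq) - w + 1 windows."""
    return [(off, seq[off:off + w]) for off in range(len(seq) - w + 1)]

def disjoint_windows(seq, w):
    """Non-overlapping windows of length w: len(seq) // w windows, the tail is dropped."""
    return [(off, seq[off:off + w]) for off in range(0, len(seq) - w + 1, w)]

# FRM as described above: sliding windows over the data, disjoint windows over the query.
data_windows = sliding_windows(list(range(10)), 4)
query_windows = disjoint_windows(list(range(10)), 4)
```

For a sequence of length 10 and ω = 4 this gives 7 sliding windows but only 2 disjoint ones, which shows how much larger the index becomes when the data side is cut into sliding windows.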

Subsequence similarity search: FRM. Disadvantage: when an MBR (minimum bounding rectangle) qualifies, all the points in this MBR are included in the candidate set.

Subsequence similarity search: DUAL MATCH. It is the dual of FRM. It predefines a window length ω and divides every time series of length n in the database into ⌊n/ω⌋ disjoint windows. It indexes all these subsequences, with pointers to the original sequence they belong to. It divides the query sequence into sliding windows of length ω (n-ω+1 of them for a query of length n). For each query window, the index is searched for similar subsequences. The candidate set is then examined further in order to remove any false alarms. The authors' experiments show that DUAL MATCH outperforms FRM.
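The DUAL MATCH flow can be sketched without a real index, as below. Everything here is a simplification under stated assumptions: a plain list stands in for the R*-tree, windows are compared directly rather than through a reduced representation, Euclidean distance is assumed, and the query is assumed to be at least 2ω-1 long so that every matching subsequence fully contains at least one disjoint data window. The function name `dual_match` is illustrative.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length sequences (assumed distance measure)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def dual_match(data, query, w, eps):
    """Return the start offsets in data of subsequences within eps of query."""
    m = len(query)
    # "Index": disjoint windows of the data sequence, keyed by their offset.
    index = [(i, data[i:i + w]) for i in range(0, len(data) - w + 1, w)]
    candidates = set()
    # Probe the index with every sliding window of the query.
    for j in range(m - w + 1):
        qwin = query[j:j + w]
        for i, dwin in index:
            start = i - j  # where the whole query would begin in the data
            if 0 <= start <= len(data) - m and dist(qwin, dwin) <= eps:
                candidates.add(start)
    # Post-processing: compare full subsequences to remove false alarms.
    return sorted(s for s in candidates if dist(data[s:s + m], query) <= eps)
```

The candidate step can over-report but, under the length assumption above, it does not miss a true match, because the distance between two aligned windows is never larger than the distance between the full sequences that contain them.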

Dimensionality reduction and query time. Two factors affect similarity search queries. Index search: the larger the preselected dimensionality, the larger the search time in the R*-tree. False alarms: the smaller the preselected dimensionality, the lower the accuracy and the larger the number of false alarms. Note that a false alarm is expensive, since we have to fetch the 'false' time series from the database and compare it sequentially with the query sequence.

Question: is there an optimal dimensionality and, if there is, how do we select it?

The authors’ point of view. The problem of finding the optimal trade-off between accuracy and index search time has not been addressed yet. The authors who proposed the various dimensionality reduction techniques conduct experiments that test the pruning power and CPU time at different dimensionalities. They find the optimal dimensionality empirically, by counting the number of page accesses after issuing all the queries. But what do we do when an application supports online queries, or when we do not have all the queries in advance? A dynamic solution must be found!

A naïve solution. We keep multiple R*-trees with different dimensionalities, e.g. 2, 4, 6, …, 16, each of which indexes the whole database. We adopt a heuristic function to count the page accesses and define proper thresholds. We start by using the R*-tree with the lowest dimensionality, e.g. the 2-dimensional one. As queries arrive, if the thresholds are exceeded, we change the tree we use, e.g. from the 2-dimensional to the 4-dimensional.
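The switching logic of the naïve solution might look like the sketch below. The slides leave the heuristic function and the thresholds open, so the false-alarm ratio over a small window of queries is used here purely as a stand-in; the class name, its parameters and their default values are all hypothetical, and the R*-trees themselves are not modelled.

```python
class NaiveDimensionSelector:
    """Chooses which of several fixed-dimensionality indexes answers the next query."""

    def __init__(self, dims=(2, 4, 6, 8, 10, 12, 14, 16), threshold=0.5, window=2):
        self.dims = list(dims)      # one full index is assumed to exist per entry
        self.pos = 0                # start with the lowest dimensionality
        self.threshold = threshold  # assumed limit on the false-alarm ratio
        self.window = window        # number of queries between decisions
        self._candidates = 0
        self._false_alarms = 0
        self._queries = 0

    @property
    def current_dim(self):
        return self.dims[self.pos]

    def record_query(self, n_candidates, n_true_matches):
        """Record one query's outcome and return the dimensionality to use next."""
        self._candidates += n_candidates
        self._false_alarms += n_candidates - n_true_matches
        self._queries += 1
        if self._queries < self.window:
            return self.current_dim
        ratio = self._false_alarms / self._candidates if self._candidates else 0.0
        if ratio > self.threshold and self.pos < len(self.dims) - 1:
            self.pos += 1           # switch to the next, higher-dimensional tree
        self._candidates = self._false_alarms = self._queries = 0
        return self.current_dim
```

Note that the decision is global: one noisy region of the database is enough to push every query onto the higher-dimensional tree.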

A naïve solution (cont.). Advantages: we can dynamically adjust the dimensionality of our index as queries arrive; we don’t have to decide the dimensionality in advance; it can work well with online queries. Disadvantages: increased space. Moreover, if only a fraction of the database is affected by the queries and produces false alarms, the whole tree will still upgrade its dimensionality unnecessarily, and the unaffected fraction of the database will then be indexed with a tree of higher dimensionality than is actually needed.

Convention for the purposes of our discussion. ‘Simple’ time series: a time series that concentrates its energy in its low frequencies. ‘Complicated’ time series: a time series that concentrates its energy in its high frequencies.

Observations. In a dataset, there can be simple as well as complicated time series. A single time series can contain simple as well as complicated subsequences, e.g. there might be some ‘white’ noise in part of a time series. The query can be simple or complicated. The query determines where the false alarms occur, since it is the query that falls in an MBR and causes false alarms there.

Conclusions. Our solution must be query adaptive, since we search the index according to the queries, and the queries are the ones that cause the false alarms. Since not all the indexed subsequences are of the same complexity, our structure must support different dimensionalities for different portions of the database. This means that our structure should support MBRs of different dimensionalities.

Proposed solution. We will use a modification of the R*-tree, together with the DUAL MATCH technique for similarity matching. We do not have any knowledge about our dataset, nor about the nature of the queries. The queries can be offline or online. Note that for online queries, most of the time, applications just use an active window, which can be regarded as a sliding window. The structure is adjusted over time, as new queries arrive.

The structure. The structure starts as a 2-dimensional R*-tree. Every f_c-dimensional point has a pointer to its original time series as well as to its complete transformation vector. With every node/MBR we associate a variable dim that holds the dimensionality of that node (initially dim=2).

The structure. We search the index based on the dimensionality of every node. We keep a heuristic function for the false alarms in every leaf MBR and define a threshold. If the threshold is exceeded, we increase the dimensionality of the corresponding leaf MBR (e.g. from dim=2 to dim=4). This is achieved by retrieving more coefficients from the transformation vector that the pointer refers to.
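A leaf that adapts its own dimensionality could be sketched as follows. This is not the authors' implementation: the false-alarm heuristic is reduced to a simple counter, the step of 2 and the threshold are assumed values, and the full transformation vectors are held in memory as real-valued lists instead of being reached through pointers.

```python
class LeafMBR:
    """Leaf node whose bounding rectangle uses only the first `dim` coefficients."""

    def __init__(self, vectors, dim=2, max_dim=16, step=2, threshold=5):
        self.vectors = vectors      # complete transformation vectors of the indexed points
        self.dim = dim              # current dimensionality of this leaf
        self.max_dim = min(max_dim, len(vectors[0]))
        self.step = step            # assumed increment per upgrade
        self.threshold = threshold  # assumed false-alarm limit
        self.false_alarms = 0

    def bounds(self):
        """Lower and upper corner of the MBR in the current dimensionality."""
        low = [min(v[k] for v in self.vectors) for k in range(self.dim)]
        high = [max(v[k] for v in self.vectors) for k in range(self.dim)]
        return low, high

    def report_false_alarms(self, count):
        """Accumulate false alarms; upgrade the leaf once the threshold is exceeded."""
        self.false_alarms += count
        if self.false_alarms > self.threshold and self.dim < self.max_dim:
            # Pull more coefficients from the stored transformation vectors.
            self.dim = min(self.dim + self.step, self.max_dim)
            self.false_alarms = 0
        return self.dim
```

Because `bounds` is recomputed from the stored vectors, upgrading a leaf needs no access to the original time series, only to coefficients that are already kept alongside each point.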

The structure. At some point, all (or most) of the leaf MBRs of a particular internal MBR may have increased their dimensionality. At that point we have to increase the dimensionality of the internal MBR as well; this is not trivial (it will be discussed later). The increase can propagate up to the root.
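One possible promotion rule for an internal node is sketched below. The slides list the required percentage of upgraded children as an open question, so the fraction of 0.75, the step of 2 and the function name `maybe_promote` are placeholders, and the harder part (recomputing the parent's rectangle in the new dimensions) is left out.

```python
def maybe_promote(parent_dim, child_dims, fraction=0.75, step=2):
    """Return the parent's new dimensionality given its children's dimensionalities.

    The parent is upgraded by `step` once at least `fraction` of its children
    use a higher dimensionality than the parent itself.
    """
    if not child_dims:
        return parent_dim
    upgraded = sum(1 for d in child_dims if d > parent_dim)
    if upgraded / len(child_dims) >= fraction:
        return parent_dim + step
    return parent_dim

def propagate(path_dims, children_per_level):
    """Apply the rule bottom-up along a root-to-leaf path; index 0 is the root."""
    dims = list(path_dims)
    for level in range(len(dims) - 1, -1, -1):
        dims[level] = maybe_promote(dims[level], children_per_level[level])
    return dims
```

Applying the rule from the lowest internal level upwards is what lets an upgrade reach the root; in a real tree the children's dimensionalities at each level would be re-read after the level below has been updated.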

The structure. Eventually we end up with an R*-tree that has different dimensionalities for different subtrees. Benefit: the subsequences that were creating the false alarms in the beginning have now increased their dimensionality and no longer cause false alarms, at the cost of an increased index search time for them. The subsequences that didn’t cause false alarms in the beginning haven’t changed their dimensionality, and therefore the index search time for these subsequences remains unchanged.

The structure. Worst case: there are two possibilities, either the dataset subsequences are all complicated or the query sequences are all complicated. In these cases the whole R*-tree may end up with a uniform dimensionality over its MBRs, and it may even reach the maximum dimensionality (i.e. 16). Even in the worst case, our contribution is that the optimal dimensionality of the R*-tree can be determined dynamically.

Concerns. How do we merge subtrees when the dimensionality of their children MBRs has increased? How do we handle updates (insertions, deletions)? What percentage of the children MBRs must be upgraded before the dimensionality of their parent MBR is upgraded? What will the heuristic function for false alarms be? How about increasing the dimensionality from 2 to 6 instead of from 2 to 4? There should be special handling for PLA, SAX, PAA and APCA. We should be careful with merges and splits with respect to the fan-out, since increasing the dimensionality increases the space as well. How do we decide whether to decrease the dimensionality? If we show that we outperform the original R*-tree for all the dimensionality reduction techniques, then we have a good contribution (the experimental section can be large).

Thank you! For those who are interested, I have an extended bibliography for the issues covered.