A Query Adaptive Data Structure for Efficient Indexing of Time Series Databases Presented by Stavros Papadopoulos
Time Series and Similarity Search Time Series A sequence (ordered collection) of real values. X=x 1,x 2,..,x n where n can be very large Similarity Given a query sequence q, a database S of N sequences S 1, S 2,..., S N, a distance measure D and a tolerance threshold ε, two time series are regarded as similar within range ε when D(x, y) ≤ ε
Time Series and Similarity Search Whole matching Given a collection of N data sequences of real numbers S 1, S 2, …, S N and a query sequence Q, we want to find those data sequences that are within distance ε from Q. The data and query sequences must have the same length Subsequence matching Given N data sequences S 1, S 2,…, S N of arbitrary lengths, a query sequence Q and a tolerance ε, we want to identify the data sequences S i that contain matching subsequences. Report those data sequences, along with the correct offsets within the data sequences that best match the query sequence
Dimensionality Reduction Instead of sequential scanning the database in order to find similarity, an indexing method is required to reduce the query time Every time series of length n is regarded as a point in the n- dimensional space Indexing structures are used: R-Trees and their variants (the R*-tree is the most commonly used data structure) R-trees scale badly with the increase of dimensionality To efficiently search time series databases each sequence is represented as a multidimensional vector and dimensionality is reduced to a degree that index structures can be applied efficiently
Dimensionality Reduction There are the following known dimensionality reduction techniques: Fourier Transform (DFT) Wavelet Transform (DWT) Piecewise Aggregate Approximation (PAA) Singular Value Decomposition (SVD) Chebyshev Polynomials (Cheb) Piecewise Linear Approximation (PLA) Adaptive Piecewise Constant Approximation (APCA) Symbolic Aggregate Approximation (SAX)
Dimensionality Reduction using DFT Let X=(x 1,x 2,…,x n ) be a time series We take the Fourier transform We keep only the first f c coefficients
Dimensionality Reduction using DFT Motivation behind using DFT: In most applications the data does not exhibit rapid changes (e.g. stock data) Therefore, the energy of a time series vector is concentrated in the lower frequencies. The first DFT coefficients contain this information for the lower frequencies The reconstruction error using only a few coefficients is not big
Dimensionality Reduction using DFT
Dimensionality Reduction Techniques X X' X X' X X' DFTDWTSVD
Dimensionality Reduction Techniques X X' X X' X X' ChebPLAPAA
X X' aaaaaabbbbbccccccbbccccdddddddd dcbadcba X X Dimensionality Reduction Techniques APCASAX
Dimensionality Reduction Basically, dimensionality reduction is a technique for approximating the original sequence of size n by another sequence of much smaller length Any dimensionality reduction technique potentially suffers from two problems: False alarms Occur when objects that appear to be close in the index space are actually distant. False alarms are removed in a post-processing stage. False dismissals Occur when qualifying objects are missed because they appear distant in the index space. They cannot be tolerated by the system.
The aforementioned techniques guarantee no false dismissals False alarms occur because every transform of a n- dimensional point to a point in a space of reduced dimensionality approximates the true location of the transformed point. CPU time is highly dependent on the implementation. A more accurate and objective measure of the effectiveness of a dimensionality reduction technique is the pruning power Evaluating the different techniques
Subsequence similarity search FRM Allows similarity matching between time series of variable size Predefines a window length ω It divides every time series in the database into n-ω sliding windows It indexes all these subsequences with pointers that point to the original sequence they belong to It divides the query sequence into n/ω disjoint windows Then they search for all the query subsequences the similar subsequences in the index The candidate set is furthermore examined in order to remove any false alarms
Subsequence similarity search FRM Disadvantage: When a MBR is found, we include all the points in this MBR in the candidate set
Subsequence similarity search DUAL MATCH It is the dual of FRM Predefines a window length ω It divides every time series in the database into n/ω disjoint windows It indexes all these subsequences with pointers that point to the original sequence they belong to It divides the query sequence into n-ω sliding windows Then they search for all the query subsequences the similar subsequences in the index The candidate set is furthermore examined in order to remove any false alarms It is shown by the authors’ experiments that DUAL MATCH outperforms FRM
Dimensionality reduction and query time Two factors that affect similarity search queries: Index search The larger the preselected dimensionality, the larger the search time in the R*-tree False alarms The smaller the preselected dimensionality, the lower the accuracy and the larger the number of false alarms. Note that a false alarm is expensive since we have to fetch the ‘false’ time series from the database and sequentially compare it with the query sequence.
Question ? Is there an optimal dimensionality and if there is how do we select it?
The authors’ point of view The problem of finding the optimal combination between accuracy and index search time has not been addressed yet The authors that proposed the various dimensionality reduction techniques conduct experiments and test the pruning power and CPU time when using different dimensionalities They empirically find the optimal dimensionality counting the number of page accesses, after they issue all the queries But what do we do when we have an application that supports online queries or when we do not have all the queries in advance? A dynamic solution must be found!!
A naïve solution We keep multiple R*-trees with different dimensionalities, e.g. 2, 4, 6, …, 16 Each one indexes the whole database We adopt a heuristic function to count the page accesses We define proper thresholds We start by using the R*-tree with the lowest dimensionality, e.g. 2-dimensonal As queries arrive, if the thresholds are exceeded, we change the tree we use, e.g. from the 2-dimensional to the 4-dimensional
A naïve solution (cont.) Advantages We can dynamically adjust the dimensionality of our index as queries arrive We don’t have to decide the dimensionality in advance It can work well with online queries Disadvantages Increased space Suppose that there is only a fraction of the database that is affected by the queries and result in false alarms, then the whole tree will unnecessarily upgrade its dimensionality The unaffected fraction of the database will then be indexed with a tree of higher dimensionality than it is actually needed
Convention for the purposes of our discussion ‘Simple’ time series A time series that concentrates its energy in its low frequencies ‘Complicated’ time series A time series that concentrates its energy in its high frequencies
Observations In a dataset, there could exist simple time series as well as complicated times series A single time series can contain simple subsequences as well as complicated subsequences, e.g. there might be some ‘white’ noise in a time series The query could be simple or complicated The query defines where we will have the false alarms, since the query is the one that falls in an MBR and causes false alarms there
Conclusions It is necessary for our solution to be query adaptive, since we search the index according to the queries and the queries are the ones that cause the alarms Since not all the indexed subsequences are of same complexity, our structure must support different dimensionalities for different portions of the database This means that our structure should support MBRs of different dimensionalities
Proposed solution We will use a modification of the R*-tree We will also use the DUAL MATCH technique for similarity matching We do not have any knowledge about our dataset, not about the nature of the queries The queries can be offline or online Note that for online queries, most of the time, applications just use an active window, which can be regarded as a sliding window The structure is being adjusted throughout time, as new queries arrive
The structure The structure starts as a 2- dimensional R*-tree Every f c -dimensional point has a pointer to its original time series as well as to its complete transformation vector Along with every node/MBR, we associate a variable dim that holds the dimensionality of the current node dim=2
The structure We search the index based on the dimensionality of every node We keep a heuristic function for the false alarms in every leaf MBR and we define a threshold If the threshold is exceeded, we increase the dimensionality of the corresponding leaf MBR This is achieved by retrieving more coefficients from the transformation vector pointed by the pointer dim=4 dim=2
dim=4 The structure At some point the dimensionality of all (or most of) the leaf MBRs of a particular internal MBR may increase their dimensionality At that point we have to increase the dimensionality of the internal MBR as well this is not trivial (it will be discussed later) This can propagate up to the root dim=2 dim=4 dim=6dim=2
The structure At some point we will result in a R*-tree which has different dimensionality for different subtrees Benefit The subsequences that were creating the false alarms in the beginning have now increased their dimensionality and they do not cause alarms any more the index search time has increased The subsequences that didn’t cause alarms in the beginning haven’t changed their dimensionality and therefore the index search time for these subsequences has remained unchanged
The structure Worst case There are two possibilities: The dataset subsequences are all complicated The query sequences are all complicated In the above cases the whole R*-tree may have a uniform dimensionality over its MBRs It may even reach the maximum dimensionality (i.e.16 ) Even in the worst case we have the contribution that we can dynamically define the optimal dimensionality for our R*-tree
Concerns How do we merge subtrees when the dimensionality of their children MBRs has increased? Updates. Insertions, Deletions What will the percentage of the children MBRs that should decide when to upgrade the dimensionality of their parent MBR be? What will the heuristic function of false alarms be? How about increasing the dimensionality from 2 to 6 instead of increasing it from 2 to 4? There should be a special handling for PLA, SAX, PAA and APCA We should be careful with merges and splits according to the fan out, since when we increase the dimensionality we increase the space as well How do we decide if we want to decrease the dimensionality? If we prove that we outperform the original R*-tree for all the dimensionality reduction techniques then we have a good contribution (the experimental section can be large)
Thank You ! For those that are interested, I have an extended bibliography for the issues covered