Time Series Indexing II
Time Series Data A time series is a collection of observations made sequentially in time (plot: value axis vs. time axis)
TS Databases A Time Series Database stores a large number of time series Similarity queries Exact match or sub-sequence match Range or nearest neighbor But first we should define the similarity model E.g., D(X,Y) for X = x_1, x_2, ..., x_n and Y = y_1, y_2, ..., y_n
Similarity Models Euclidean and Lp based Edit Distance and LCS based Probabilistic (using Markov Models) Landmarks The appropriate similarity model depends on the application
Euclidean model Query Q (n datapoints) against each series S (n datapoints) in the Database Euclidean Distance between two time series Q = {q_1, q_2, ..., q_n} and S = {s_1, s_2, ..., s_n}: D(Q,S) = sqrt( sum_{i=1..n} (q_i - s_i)^2 ) Results are ranked by distance to Q
Similarity Retrieval Range Query Find all time series S where D(Q,S) <= epsilon Nearest Neighbor query Find the k most similar time series to Q A method to answer the above queries: Linear scan ... very slow A better approach: GEMINI
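The linear-scan baseline can be sketched in a few lines of Python (a minimal illustration; the function names are invented for this example):

```python
import numpy as np

def euclidean(q, s):
    """D(Q, S) = sqrt(sum_i (q_i - s_i)^2)."""
    return float(np.linalg.norm(np.asarray(q, float) - np.asarray(s, float)))

def range_query(db, q, eps):
    """Linear scan: all series S with D(Q, S) <= eps."""
    return [s for s in db if euclidean(q, s) <= eps]

def knn_query(db, q, k):
    """Linear scan: the k series closest to Q."""
    return sorted(db, key=lambda s: euclidean(q, s))[:k]
```

Every query touches every series, which is why the slide calls this "very slow" and motivates GEMINI.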
GEMINI Solution: 'Quick-and-dirty' filter: extract m features (numbers, e.g., avg., etc.) map into a point in m-d feature space organize points with an off-the-shelf spatial access method ('SAM') discard false alarms
GEMINI Range Queries Build an index for the database in a feature space using an R-tree Algorithm RangeQuery(Q, epsilon) 1. Project the query Q into a point in the feature space 2. Find all candidate objects in the index within epsilon 3. Retrieve from disk the actual sequences 4. Compute the actual distances and discard false alarms
GEMINI NN Query Algorithm K_NNQuery(Q, K) 1. Project the query Q into the same feature space 2. Find the candidate K nearest neighbors in the index 3. Retrieve from disk the actual sequences pointed to by the candidates 4. Compute the actual distances and record the maximum epsilon_max 5. Issue a RangeQuery(Q, epsilon_max) 6. Compute the actual distances, keep the K closest
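The filter-and-refine skeleton can be sketched in Python. This is only an illustration: it uses the magnitudes of the first m DFT coefficients as features (a choice discussed later in these slides) and a linear scan in feature space instead of an R-tree, so only the algorithmic steps match the slide:

```python
import numpy as np

def features(x, m=2):
    """Feature vector: magnitudes of the first m DFT coefficients."""
    X = np.fft.fft(np.asarray(x, float)) / np.sqrt(len(x))
    return np.abs(X[:m])

def gemini_knn(db, q, k, m=2):
    """GEMINI k-NN skeleton: candidates in feature space, then refine."""
    q = np.asarray(q, float)
    qf = features(q, m)
    # steps 2-3: candidate k nearest neighbors in feature space
    cand = sorted(db, key=lambda s: np.linalg.norm(features(s, m) - qf))[:k]
    # step 4: actual distances of the candidates; record the maximum
    eps = max(np.linalg.norm(np.asarray(s, float) - q) for s in cand)
    # steps 5-6: range query with eps, keep the k closest by actual distance
    inrange = [s for s in db if np.linalg.norm(np.asarray(s, float) - q) <= eps]
    return sorted(inrange, key=lambda s: np.linalg.norm(np.asarray(s, float) - q))[:k]
```

The final range query over actual distances guarantees an exact answer even though the feature-space candidates are approximate.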
GEMINI GEMINI works when: D_feature(F(x), F(y)) <= D(x, y) Note: the closer the feature distance is to the actual distance, the better (fewer false alarms)
Problem How to extract the features? How to define the feature space? Fourier transform Wavelets transform Averages of segments (Histograms or APCA)
Fourier transform DFT (Discrete Fourier Transform) Transform the data from the time domain to the frequency domain highlights the periodicities SO?
DFT A: several real sequences are periodic Q: Such as? A: sales patterns follow seasons; the economy follows a 50-year cycle; temperature follows daily and yearly cycles Many real signals follow (multiple) cycles
How does it work? Decomposes the signal into a sum of sine (and cosine) waves. Q: How to assess 'similarity' of x = {x_0, x_1, ..., x_(n-1)} with a wave?
How does it work? A: consider the waves with frequency 0, 1, ...; use the inner-product (~cosine similarity) E.g., freq. f = 0 (constant wave), freq. f = 1 (one full cycle: sin(t * 2*pi/n))
How does it work? A: consider the waves with frequency 0, 1, ...; use the inner-product (~cosine similarity) E.g., freq. f = 2 (two full cycles)
How does it work? 'basis' functions: sines of freq = 1, ..., n and cosines of freq = 1, 2, ...
How does it work? Basis functions are actually n-dim vectors, orthogonal to each other ‘similarity’ of x with each of them: inner product DFT: ~ all the similarities of x with the basis functions
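The inner-product view can be checked numerically. A small sketch (the toy signal and the `similarity` helper are invented for this example): only the frequencies actually present in the signal respond.

```python
import numpy as np

n = 8
t = np.arange(n)
# a toy signal: a frequency-1 sine plus a half-strength frequency-2 cosine
x = np.sin(2 * np.pi * 1 * t / n) + 0.5 * np.cos(2 * np.pi * 2 * t / n)

def similarity(sig, f, wave):
    """Inner product of the signal with a sine/cosine wave of frequency f."""
    return float(np.dot(sig, wave(2 * np.pi * f * t / n)))

# only the two component frequencies respond; each inner product = amplitude * n/2
print(similarity(x, 1, np.sin), similarity(x, 2, np.cos), similarity(x, 3, np.sin))
```

The basis waves are orthogonal, so each inner product isolates one component's amplitude.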
How does it work? Since e^{jf} = cos(f) + j sin(f) (j = sqrt(-1)), we finally have the complex-exponential form of the DFT (next slide)
DFT: definition Discrete Fourier Transform (n-point): X_f = (1/sqrt(n)) * sum_{t=0..n-1} x_t e^{-j 2 pi t f / n}, f = 0, 1, ..., n-1 inverse DFT: x_t = (1/sqrt(n)) * sum_{f=0..n-1} X_f e^{+j 2 pi t f / n}
DFT: definition Good news: available in all symbolic math packages, e.g., in Mathematica: x = [1,2,1,2]; X = Fourier[x]; Plot[ Abs[X] ];
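The same example with numpy instead of Mathematica (the fft output is scaled here by 1/sqrt(n) so that energy is preserved, matching the convention in these slides):

```python
import numpy as np

x = np.array([1.0, 2.0, 1.0, 2.0])
X = np.fft.fft(x) / np.sqrt(len(x))  # 1/sqrt(n) scaling: energy-preserving DFT
print(np.abs(X))                     # amplitude spectrum |X_f|
# symmetry property for real input: X_f = conjugate(X_{n-f})
print(np.allclose(X[1], np.conj(X[3])))
```

For this alternating signal, all the energy sits at frequency 0 (the average) and frequency n/2.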
DFT: properties Observation - SYMMETRY property: X_f = (X_{n-f})* ( "*": complex conjugate: (a + b j)* = a - b j ) Thus we use only the first half of the numbers
DFT: Amplitude spectrum Amplitude A_f = |X_f| Intuition: strength of frequency f (example plot: a periodic time series shows a spike at freq. 12 in its spectrum)
DFT: Amplitude spectrum excellent approximation, with only 2 frequencies! so what?
DFT: Amplitude spectrum excellent approximation, with only 2 frequencies! so what? A1: compression A2: pattern discovery A3: forecasting
DFT: Parseval's theorem sum_t ( x_t^2 ) = sum_f ( |X_f|^2 ) I.e., DFT preserves the 'energy' or, alternatively: it does an axis rotation of x = {x_0, x_1}
Lower Bounding lemma Using Parseval's theorem we can prove the lower bounding property! So, apply DFT to each time series, keep the first 3-10 coefficients as a vector and use an R-tree to index the vectors The R-tree works with Euclidean distance, OK
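A quick numerical check of the lower-bounding property (random series and the first 4 complex DFT coefficients as features; a sketch, not part of the original method description):

```python
import numpy as np

def dft_features(v, k=4):
    """First k complex coefficients of the energy-preserving DFT."""
    return (np.fft.fft(np.asarray(v, float)) / np.sqrt(len(v)))[:k]

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
y = rng.standard_normal(32)

d_true = np.linalg.norm(x - y)                              # actual distance
d_feat = np.linalg.norm(dft_features(x) - dft_features(y))  # feature distance
print(d_feat <= d_true)  # guaranteed by Parseval: dropping terms only shrinks the sum
```

Dropping coefficients drops non-negative terms from the Parseval sum, so the feature distance can never exceed the true distance: no false dismissals.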
Wavelets - DWT DFT is great - but, how about compressing opera? (baritone, silence, soprano?)
Wavelets - DWT Solution #1: Short-window Fourier transform But: how short should the window be?
Wavelets - DWT Answer: multiple window sizes! -> DWT
Haar Wavelets subtract the sum of the left half from the right half repeat recursively for quarters, eighths, ...
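The recursive scheme can be sketched as follows (an illustration only; this un-normalized variant uses pairwise averages and half-differences, one common sign/scaling convention among several):

```python
import numpy as np

def haar(x):
    """Un-normalized Haar DWT: pairwise averages and differences,
    repeated recursively on the averages (length must be a power of two)."""
    x = np.asarray(x, float)
    out = []
    while len(x) > 1:
        out.append((x[0::2] - x[1::2]) / 2.0)  # detail: half the left-right difference
        x = (x[0::2] + x[1::2]) / 2.0          # smooth: pairwise average
    out.append(x)                              # final overall average
    return np.concatenate(out[::-1])
```

The output packs the overall average first, then details from coarsest to finest level.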
Wavelets - construction Start with the input signal x_0 x_1 x_2 x_3 x_4 x_5 x_6 x_7
Wavelets - construction level 1: from x_0 ... x_7, compute pairwise sums s_{1,0}, s_{1,1}, ... ('+') and differences d_{1,0}, d_{1,1}, ... ('-')
Wavelets - construction level 2: repeat on the sums s_{1,0}, s_{1,1}, ... to get s_{2,0}, ... and d_{2,0}, ...
Wavelets - construction etc., until a single overall sum remains
Wavelets - construction Q: map each coefficient on the time-freq. plane (t, f)
Wavelets - Drill: Q: baritone/silence/soprano - what does the DWT look like on the time-freq. plane?
Wavelets - Drill: Q: baritone/soprano - DWT?
Wavelets - construction Observation1: ‘+’ can be some weighted addition ‘-’ is the corresponding weighted difference (‘Quadrature mirror filters’) Observation2: unlike DFT/DCT, there are *many* wavelet bases: Haar, Daubechies-4, Daubechies-6,...
Advantages of Wavelets Better compression (better RMSE with the same number of coefficients) closely related to the processing of the mammalian eye and ear Good for progressive transmission handles spikes well usually fast to compute (O(n)!)
Feature space Keep the d most "important" wavelet coefficients Normalize and keep the largest Lower bounding lemma: the same as DFT
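A sketch of this feature extraction (orthonormal Haar coefficients, keep the d largest; the helper names are invented). One caveat: because the kept positions differ per series, using the largest rather than the first coefficients needs extra care in the index for the lower bound to hold:

```python
import numpy as np

def haar_coeffs(x):
    """Orthonormal Haar DWT (sums/differences scaled by 1/sqrt(2)),
    so energy is preserved as with the DFT."""
    x = np.asarray(x, float)
    out = []
    while len(x) > 1:
        out.append((x[0::2] - x[1::2]) / np.sqrt(2))
        x = (x[0::2] + x[1::2]) / np.sqrt(2)
    out.append(x)
    return np.concatenate(out[::-1])

def keep_largest(coeffs, d):
    """Feature vector: keep the d largest-magnitude coefficients, zero the rest."""
    kept = np.zeros_like(coeffs)
    idx = np.argsort(np.abs(coeffs))[-d:]
    kept[idx] = coeffs[idx]
    return kept
```

Because the transform is orthonormal, distances between kept-coefficient vectors never exceed true Euclidean distances.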
PAA and APCA Another approach: segment the time series into equal parts, store the average value for each part. Use an index to store the averages and the segment end points
Feature Spaces X -> X' by DFT (Agrawal, Faloutsos, Swami 1993) X -> X' by DWT, using Haar wavelets Haar 0 - Haar 7 (Chan & Fu 1999) X -> X' by SVD, using eigenwaves 0 - 7 (Korn, Jagadish, Faloutsos 1997)
Piecewise Aggregate Approximation (PAA) Original time series (n-dimensional vector): S = {s_1, s_2, ..., s_n} n'-segment PAA representation (n'-d vector): S' = {sv_1, sv_2, ..., sv_n'} The PAA representation satisfies the lower bounding lemma (Keogh, Chakrabarti, Mehrotra and Pazzani, 2000; Yi and Faloutsos 2000)
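A minimal PAA sketch (assumes the series length is divisible by the number of segments):

```python
import numpy as np

def paa(s, n_seg):
    """PAA: split s into n_seg equal-length segments, store each segment's average."""
    return np.asarray(s, float).reshape(n_seg, -1).mean(axis=1)
```

Each feature is just the mean of one fixed-width window, which is what makes PAA so cheap to compute and index.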
Can we improve upon PAA? Adaptive Piecewise Constant Approximation (APCA) n'-segment PAA representation (n'-d vector): S' = {sv_1, sv_2, ..., sv_n'} n'/2-segment APCA representation (n'-d vector): S'' = {sv_1, sr_1, sv_2, sr_2, ..., sv_M, sr_M} (M is the number of segments = n'/2; sv_i is the segment's value, sr_i its right endpoint)
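One way to get an APCA-like adaptive segmentation is a greedy bottom-up merge. This sketch is NOT the paper's near-optimal wavelet-based algorithm, just an illustration of letting segment boundaries adapt to the data:

```python
import numpy as np

def apca(s, n_seg):
    """Greedy APCA-like sketch: start with one point per segment and repeatedly
    merge the adjacent pair whose merge increases reconstruction error least.
    Returns (segment value, right endpoint) pairs."""
    segs = [[v] for v in map(float, s)]

    def err(seg):
        a = np.asarray(seg)
        return float(np.sum((a - a.mean()) ** 2))

    while len(segs) > n_seg:
        costs = [err(segs[i] + segs[i + 1]) - err(segs[i]) - err(segs[i + 1])
                 for i in range(len(segs) - 1)]
        i = int(np.argmin(costs))
        segs[i:i + 2] = [segs[i] + segs[i + 1]]  # merge the cheapest pair

    out, end = [], 0
    for seg in segs:
        end += len(seg)
        out.append((float(np.mean(seg)), end))
    return out
```

On a piecewise-flat signal the adaptive boundaries snap to the level changes, which is exactly where PAA's fixed boundaries waste error budget.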
Reconstruction error APCA approximates the original signal better than PAA (lower reconstruction error) Improvement factor = (reconstruction error of PAA) / (reconstruction error of APCA)
APCA Representation can be computed efficiently Near-optimal representation can be computed in O(n log n) time Optimal representation can be computed in O(n^2 M) time (Koudas et al.)
Distance Measure Exact (Euclidean) distance D(Q,S) between query Q and series S Lower bounding distance D_LB(Q',S), computed from the APCA representation Q' of the query: D_LB(Q',S) <= D(Q,S)
Index on 2M-dimensional APCA space Any feature-based index structure can be used (e.g., R-tree, X-tree, Hybrid Tree) (figure: APCA points S1-S9 grouped into MBRs R1-R4, with the corresponding tree of nodes R1-R4 over leaves S1-S9)
k-nearest neighbor Algorithm Visit index nodes in order of MINDIST(Q,R) to the query Q For any node U of the index structure with MBR R, MINDIST(Q,R) <= D(Q,S) for any data item S under U
Index Modification for MINDIST Computation APCA point S = {sv_1, sr_1, sv_2, sr_2, ..., sv_M, sr_M} APCA rectangle S = (L, H) where L = {smin_1, sr_1, smin_2, sr_2, ..., smin_M, sr_M} and H = {smax_1, sr_1, smax_2, sr_2, ..., smax_M, sr_M}
MBR Representation in time-value space We can view the MBR R = (L,H) of any node U as two APCA representations L = {l_1, l_2, ..., l_(N-1), l_N} and H = {h_1, h_2, ..., h_(N-1), h_N}
Regions M regions associated with each MBR; boundaries of the ith region: value range from l_(2i-1) to h_(2i-1), time range from l_(2i-2)+1 to h_(2i)
Regions The ith region is active at time instant t if it spans across t The value s_t of any time series S under node U at time instant t must lie in one of the regions active at t (Lemma 2)
MINDIST Computation For time instant t, MINDIST(Q, R, t) = min over regions G active at t of MINDIST(Q, G, t) Example at t1: MINDIST(Q,R,t1) = min(MINDIST(Q, Region1, t1), MINDIST(Q, Region2, t1)) = min((q_t1 - h_1)^2, (q_t1 - h_3)^2) = (q_t1 - h_1)^2 Overall: MINDIST(Q,R) = sqrt( sum over t of MINDIST(Q,R,t) ) Lemma 3: MINDIST(Q,R) <= D(Q,C) for any time series C under node U
Images - color what is an image? A: 2-d array
Images - color Color histograms, and distance function
Images - color Mathematically, the distance function is a quadratic form: D^2(x, y) = (x - y)^T A (x - y), where the matrix entry a_ij gives the similarity between colors i and j (the off-diagonal entries are the 'cross-talk')
Images - color Problem: ‘cross-talk’: Features are not orthogonal -> SAMs will not work properly Q: what to do? A: feature-extraction question
Images - color possible answer: avg red, avg green, avg blue it turns out that this lower-bounds the histogram distance -> no cross-talk; SAMs are applicable
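The avg-RGB feature extraction is a one-liner (a sketch; `avg_rgb` is an invented name):

```python
import numpy as np

def avg_rgb(img):
    """Map an H x W x 3 image to 3 features: (avg red, avg green, avg blue)."""
    return np.asarray(img, float).reshape(-1, 3).mean(axis=0)
```

Three orthogonal features per image, so a standard SAM over them works without the cross-talk problem of full histograms.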
Images - color performance: (plot: time vs. selectivity) the avg-RGB filter is much faster than sequential scan
Images - shapes distance function: Euclidean, on the area, perimeter, and 20 'moments' Q: how to normalize them?
Images - shapes distance function: Euclidean, on the area, perimeter, and 20 ‘moments’ (Q: how to normalize them? A: divide by standard deviation)
Images - shapes distance function: Euclidean, on the area, perimeter, and 20 'moments' Q: other 'features' / distance functions?
Images - shapes distance function: Euclidean, on the area, perimeter, and 20 ‘moments’ (Q: other ‘features’ / distance functions? A1: turning angle A2: dilations/erosions A3:... )
Images - shapes distance function: Euclidean, on the area, perimeter, and 20 ‘moments’ Q: how to do dim. reduction?
Images - shapes distance function: Euclidean, on the area, perimeter, and 20 ‘moments’ Q: how to do dim. reduction? A: Karhunen-Loeve (= centered PCA/SVD)
Images - shapes Performance: ~10x faster (plot: log(# of I/Os) vs. # of features kept, compared against keeping all features)