Time Series Indexing II


Time Series Data A time series is a collection of observations made sequentially in time. [Figure: a time series plotted with time on the x-axis and value on the y-axis.]

TS Databases A Time Series Database stores a large number of time series. Similarity queries: exact match or subsequence match; range or nearest neighbor. But first we should define the similarity model, e.g., D(X,Y) for X = x1, x2, …, xn and Y = y1, y2, …, yn.

Similarity Models Euclidean and Lp based; Edit Distance and LCS based; probabilistic (using Markov models); landmarks. The appropriate similarity model depends on the application.

Euclidean model [Figure: a query Q compared against a database of sequences of n datapoints each; example distances 0.98, 0.07, 0.21, 0.43 give ranks 4, 1, 2, 3.] Euclidean Distance between two time series Q = {q1, q2, …, qn} and S = {s1, s2, …, sn}: D(Q,S) = sqrt( Σi (qi − si)² )

Similarity Retrieval Range Query: find all time series S where D(Q,S) ≤ ε. Nearest Neighbor query: find the k most similar time series to Q. One method to answer the above queries: linear scan (very slow). A better approach: GEMINI.

GEMINI Solution: a 'quick-and-dirty' filter: extract m features (numbers, e.g., avg, etc.); map into a point in m-d feature space; organize points with an off-the-shelf spatial access method ('SAM'); discard false alarms.

GEMINI Range Queries Build an index for the database in a feature space using an R-tree. Algorithm RangeQuery(Q, ε): Project the query Q into a point in the feature space. Find all candidate objects in the index within ε. Retrieve from disk the actual sequences. Compute the actual distances and discard false alarms.
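The filter-and-refine steps above can be sketched in a few lines of Python, leaving out the R-tree and doing the filter step with a linear pass over the feature vectors. The choice of the first few DFT amplitudes as features is an illustrative assumption; any feature map satisfying the lower-bounding condition works.

```python
import numpy as np

def features(x, m=4):
    """Map a time series to m features: the first m DFT amplitudes.
    With the orthonormal (1/sqrt(n)) scaling, the distance between two
    feature vectors lower-bounds the Euclidean distance of the series."""
    return np.abs(np.fft.fft(x, norm="ortho"))[:m]

def range_query(query, database, eps, m=4):
    """GEMINI range query: filter in feature space (no false dismissals,
    thanks to the lower bound), then discard false alarms with the true
    Euclidean distance."""
    query = np.asarray(query, dtype=float)
    fq = features(query, m)
    results = []
    for s in database:
        s = np.asarray(s, dtype=float)
        if np.linalg.norm(features(s, m) - fq) <= eps:   # filter step
            if np.linalg.norm(s - query) <= eps:         # refine step
                results.append(s)
    return results
```

In a real system the filter step would be a range search on the R-tree rather than a scan.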

GEMINI NN Query Algorithm K_NNQuery(Q, K): Project the query Q into the same feature space. Find the candidate K nearest neighbors in the index. Retrieve from disk the actual sequences pointed to by the candidates. Compute the actual distances and record the maximum, εmax. Issue a RangeQuery(Q, εmax). Compute the actual distances and keep the K closest.
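The same algorithm can be sketched without an index, again using the first few DFT amplitudes as an illustrative feature map. The key point is that εmax, taken from the actual distances of the feature-space candidates, is an upper bound on the distance of the true K-th nearest neighbor, so the final range query cannot miss any true neighbor.

```python
import heapq
import numpy as np

def knn_query(query, database, k, m=4):
    """GEMINI k-NN sketch following the three phases of K_NNQuery."""
    def feat(x):
        return np.abs(np.fft.fft(x, norm="ortho"))[:m]
    query = np.asarray(query, dtype=float)
    fq = feat(query)
    # Phase 1: k candidates nearest in feature space.
    cand = heapq.nsmallest(k, database,
                           key=lambda s: np.linalg.norm(feat(s) - fq))
    # Phase 2: eps_max = max actual distance among the candidates.
    eps_max = max(np.linalg.norm(np.asarray(s) - query) for s in cand)
    # Phase 3: range query with eps_max, then keep the k closest.
    in_range = [s for s in database
                if np.linalg.norm(np.asarray(s) - query) <= eps_max]
    return sorted(in_range,
                  key=lambda s: np.linalg.norm(np.asarray(s) - query))[:k]
```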

GEMINI GEMINI guarantees no false dismissals when: Dfeature(F(x), F(y)) <= D(x, y). Note that the closer the feature distance is to the actual one, the better (fewer false alarms).

Problem How to extract the features? How to define the feature space? Fourier transform; wavelet transform; averages of segments (histograms or APCA).

Fourier transform DFT (Discrete Fourier Transform) Transform the data from the time domain to the frequency domain highlights the periodicities SO?

DFT A: several real sequences are periodic. Q: Such as? A: sales patterns follow seasons; the economy follows 50-year cycles; temperature follows daily and yearly cycles. Many real signals follow (multiple) cycles.

How does it work? Decomposes the signal into a sum of sine (and cosine) waves. [Figure: x = {x0, x1, ..., xn-1} plotted against time 0 ... n-1.] Q: How to assess the 'similarity' of x with a wave?

How does it work? A: consider the waves with frequency 0, 1, ...; use the inner product (~cosine similarity). [Figure: the waves with freq. f = 0 and freq. f = 1, i.e., sin(t · 2π/n), over time 0 ... n-1.]

How does it work? Similarly for higher frequencies. [Figure: the wave with freq. f = 2 over time 0 ... n-1.]

How does it work? The 'basis' functions: [Figure: cosine and sine waves of freq. 1 and freq. 2 over time 0 ... n-1.]

How does it work? Basis functions are actually n-dim vectors, orthogonal to each other ‘similarity’ of x with each of them: inner product DFT: ~ all the similarities of x with the basis functions

How does it work? Since e^{jφ} = cos(φ) + j·sin(φ) (with j = sqrt(-1)), we finally arrive at the complex-exponential form of the DFT:

DFT: definition Discrete Fourier Transform (n-point): X_f = (1/sqrt(n)) · Σ_{t=0..n-1} x_t · e^{-j·2π·t·f/n}. Inverse DFT: x_t = (1/sqrt(n)) · Σ_{f=0..n-1} X_f · e^{+j·2π·t·f/n}.

DFT: definition Good news: available in all symbolic math packages, e.g., in Mathematica: x = [1,2,1,2]; X = Fourier[x]; Plot[ Abs[X] ];
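The same three lines have a direct numpy analogue; `norm="ortho"` gives the 1/sqrt(n) scaling used in the definition above (Mathematica's `Fourier` uses the same scaling, up to the sign convention in the exponent):

```python
import numpy as np

x = np.array([1.0, 2.0, 1.0, 2.0])
X = np.fft.fft(x, norm="ortho")  # orthonormal n-point DFT
print(np.abs(X))                 # amplitude spectrum: [3. 0. 1. 0.]
```

Only frequencies 0 (the average) and 2 (the alternation with period 2) are nonzero, as expected for this input.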

DFT: properties Observation, the SYMMETRY property: for a real input, X_f = (X_{n-f})* (where '*' denotes the complex conjugate: (a + b·j)* = a - b·j). Thus we use only the first half of the coefficients.
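The symmetry property is easy to check numerically on any real sequence:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(8)       # any real-valued sequence
X = np.fft.fft(x)
# X_f equals the complex conjugate of X_{n-f}, for f = 1 .. n-1
for f in range(1, len(x)):
    assert np.allclose(X[f], np.conj(X[len(x) - f]))
```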

DFT: AMPLITUDE SPECTRUM Amplitude A_f = |X_f|. Intuition: the strength of frequency f. [Figure: a time series with a strong cycle of about 12 time units, and its amplitude spectrum with a spike at the corresponding frequency.]

DFT: Amplitude spectrum excellent approximation, with only 2 frequencies! so what? A1: compression A2: pattern discovery A3: forecasting

DFT: Parseval's theorem Σ_t x_t² = Σ_f |X_f|². I.e., the DFT preserves the 'energy'; alternatively: it performs an axis rotation. [Figure: a 2-d point x = {x0, x1} under rotated axes.]
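Parseval's theorem can be verified directly with the orthonormal DFT, which is exactly the "axis rotation" view:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(16)
X = np.fft.fft(x, norm="ortho")     # orthonormal DFT: a pure rotation
energy_time = np.sum(x ** 2)        # energy in the time domain
energy_freq = np.sum(np.abs(X) ** 2)  # energy in the frequency domain
assert np.isclose(energy_time, energy_freq)
```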

Lower Bounding lemma Using Parseval's theorem we can prove the lower bounding property! So, apply the DFT to each time series, keep the first 3-10 coefficients as a vector, and use an R-tree to index the vectors. The R-tree works with Euclidean distance, OK.
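The lemma follows because truncating an orthonormal DFT can only drop energy, never add it. A quick numerical check (in practice the real and imaginary parts of each kept coefficient would be stored as separate R-tree dimensions):

```python
import numpy as np

def dft_features(x, m):
    # Keep the first m complex coefficients of the orthonormal DFT.
    return np.fft.fft(x, norm="ortho")[:m]

rng = np.random.default_rng(3)
x, y = rng.standard_normal(64), rng.standard_normal(64)
for m in (3, 5, 10):
    d_feat = np.linalg.norm(dft_features(x, m) - dft_features(y, m))
    # Feature distance never exceeds the true distance: no false dismissals.
    assert d_feat <= np.linalg.norm(x - y) + 1e-9
```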

Wavelets - DWT DFT is great, but how about compressing opera? (baritone, silence, soprano?) [Figure: a signal whose frequency content changes over time.]

Wavelets - DWT Solution #1: short-window Fourier transform. But: how short should the window be?

Wavelets - DWT Answer: multiple window sizes! -> DWT

Haar Wavelets Subtract the sum of the left half from the right half; repeat recursively for quarters, eighths, ...
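The recursion can be written bottom-up as pairwise averages (the smooth part) and pairwise half-differences (the detail part); this is one common unnormalized variant of the Haar DWT, differing from the slide's sum/difference wording only in scaling:

```python
import numpy as np

def haar_dwt(x):
    """Haar DWT of a sequence whose length is a power of two.
    Returns [details level 1, details level 2, ..., overall average]."""
    x = np.asarray(x, dtype=float)
    coeffs = []
    while len(x) > 1:
        s = (x[0::2] + x[1::2]) / 2.0   # smooth: pairwise averages
        d = (x[0::2] - x[1::2]) / 2.0   # detail: pairwise half-differences
        coeffs.append(d)
        x = s                           # recurse on the smooth part
    coeffs.append(x)                    # single remaining average
    return coeffs

def haar_idwt(coeffs):
    """Exact inverse: a = s + d, b = s - d at each level."""
    x = coeffs[-1]
    for d in reversed(coeffs[:-1]):
        out = np.empty(2 * len(x))
        out[0::2] = x + d
        out[1::2] = x - d
        x = out
    return x
```

The last coefficient is the overall average of the input, matching the construction slides that follow.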

Wavelets - construction [Figure: the input samples x0 x1 x2 x3 x4 x5 x6 x7.]

Wavelets - construction Level 1: pairwise smooth and detail coefficients. [Figure: s1,i ('+' nodes) and d1,i ('-' nodes) computed from x0 ... x7.]

Wavelets - construction Level 2: repeat on the smooth coefficients. [Figure: s2,i and d2,0 computed on top of level 1.]

Wavelets - construction Etc., until a single smooth coefficient remains. [Figure: the full coefficient tree s2,0, d2,0, d1,0, d1,1, ... over x0 ... x7.]

Wavelets - construction Q: map each coefficient onto the time-frequency plane. [Figure: the coefficients s2,0, d2,0, d1,0, d1,1, ... placed as tiles in the time-frequency (t, f) plane.]

Wavelets - Drill: Q: baritone/silence/soprano: what does the DWT look like? [Figure: the time-value signal and its time-frequency tiling.]

Wavelets - Drill: Q: baritone/soprano: what does the DWT look like? [Figure: the time-value signal and its time-frequency tiling.]

Wavelets - construction Observation 1: '+' can be some weighted addition, with '-' the corresponding weighted difference ('quadrature mirror filters'). Observation 2: unlike DFT/DCT, there are *many* wavelet bases: Haar, Daubechies-4, Daubechies-6, ...

Advantages of Wavelets Better compression (better RMSE with the same number of coefficients); closely related to the processing of the mammalian eye and ear; good for progressive transmission; handle spikes well; usually fast to compute (O(n)!).

Feature space Keep the d most "important" wavelet coefficients: normalize and keep the largest. Lower bounding lemma: the same as for the DFT.

PAA and APCA Another approach: segment the time series into equal parts and store the average value for each part. Use an index to store the averages and the segment end points.
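The equal-segment version (PAA, detailed two slides below) fits in a few lines; the lower-bounding distance scales the feature-space distance by the segment length:

```python
import numpy as np

def paa(x, n_segments):
    """Piecewise Aggregate Approximation: the mean of each equal-length
    segment. Assumes len(x) is divisible by n_segments."""
    x = np.asarray(x, dtype=float)
    return x.reshape(n_segments, -1).mean(axis=1)

def paa_dist(px, py, seg_len):
    """Distance between two PAA vectors that lower-bounds the Euclidean
    distance between the original series (segment length = n / n')."""
    return np.sqrt(seg_len * np.sum((px - py) ** 2))
```

For example, `paa([1, 2, 3, 4], 2)` yields `[1.5, 3.5]`.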

Feature Spaces [Figure: a time series X and its approximation X' under three transforms: DFT (Agrawal, Faloutsos, Swami 1993), DWT with Haar wavelets 0-7 (Chan & Fu 1999), and SVD with eigenwaves 0-7 (Korn, Jagadish, Faloutsos 1997).]

Piecewise Aggregate Approximation (PAA) [Figure: an original time series S = {s1, s2, …, sn} (an n-dimensional vector) and its n'-segment PAA representation (an n'-d vector) {sv1, sv2, …, svn'}.] The PAA representation satisfies the lower bounding lemma (Keogh, Chakrabarti, Mehrotra and Pazzani, 2000; Yi and Faloutsos 2000).

Can we improve upon PAA? Adaptive Piecewise Constant Approximation (APCA): use segments of varying length. [Figure: an n'-segment PAA representation {sv1, …, svn'} vs. an n'/2-segment APCA representation.] S = {sv1, sr1, sv2, sr2, …, svM, srM}, where svi is the value of the ith segment, sri its right endpoint, and M is the number of segments (= n'/2).

APCA approximates the original signal better than PAA [Figure: reconstruction errors of PAA vs. APCA on sample signals; improvement factors of 1.69, 3.77, 1.21, 1.03, 3.02, 1.75.]

APCA Representation can be computed efficiently A near-optimal representation can be computed in O(n log n) time; the optimal representation can be computed in O(n²M) (Koudas et al.).

Distance Measure Exact (Euclidean) distance D(Q,S); lower bounding distance DLB(Q',S). [Figure: a query Q, its APCA projection Q', and a sequence S, with DLB(Q',S) computed between Q' and S.]

Index on 2M-dimensional APCA space Each APCA representation is a point in 2M-d space; any feature-based index structure can be used (e.g., R-tree, X-tree, Hybrid Tree). [Figure: sequences S1 … S9 grouped into MBRs R1 … R4 in the 2M-dimensional APCA space.] The k-nearest-neighbor search traverses the nodes of the index in order of their distance from the query, where the node distance is MINDIST: the minimum distance from the query to any point on the node's boundary.

k-nearest neighbor Algorithm [Figure: query Q and MBRs R1 … R4 with MINDIST(Q,R2), MINDIST(Q,R3), MINDIST(Q,R4).] For any node U of the index structure with MBR R, MINDIST(Q,R) ≤ D(Q,S) for any data item S under U.

Index Modification for MINDIST Computation An APCA point S = {sv1, sr1, sv2, sr2, …, svM, srM} is stored as an APCA rectangle S = (L, H), where L = {smin1, sr1, smin2, sr2, …, sminM, srM} and H = {smax1, sr1, smax2, sr2, …, smaxM, srM}, with smini and smaxi the minimum and maximum values within the ith segment. [Figure: segments with their (smin, smax) ranges and right endpoints sr1 … sr4.]

MBR Representation in time-value space We can view the MBR R = (L,H) of any node U as two APCA representations L = {l1, l2, …, l(2M-1), l2M} and H = {h1, h2, …, h(2M-1), h2M}. [Figure: with M = 3, L = {l1, …, l6} and H = {h1, …, h6} bound REGIONs 1-3 in the time-value plane.]

Regions M regions are associated with each MBR. Boundaries of the ith region: values between l(2i-1) and h(2i-1), time instants between l(2i-2)+1 and h2i. [Figure: the three regions in the time-value plane bounded by l1 … l6 and h1 … h6.]

Regions The ith region is active at time instant t if it spans across t. The value st of any time series S under node U at time instant t must lie in one of the regions active at t (Lemma 2). [Figure: the regions active at example time instants t1 and t2.]

MINDIST Computation For time instant t, MINDIST(Q, R, t) = min over regions G active at t of MINDIST(Q, G, t). Example: MINDIST(Q, R, t1) = min(MINDIST(Q, Region1, t1), MINDIST(Q, Region2, t1)) = min((qt1 - h1)², (qt1 - h3)²) = (qt1 - h1)². MINDIST(Q,R) combines the per-instant distances over all time instants t. Lemma 3: MINDIST(Q,R) ≤ D(Q,C) for any time series C under node U.

Images - color Q: what is an image? A: a 2-d array of pixel values.

Images - color Color histograms, and distance function

Images - color Mathematically, the distance function between two color histograms h and g is a quadratic form: D(h, g) = (h - g)ᵀ · A · (h - g), where the matrix A captures the cross-similarity between colors.

Images - color Problem: 'cross-talk': the features are not orthogonal, so SAMs will not work properly. Q: what to do? A: it is a feature-extraction question.

Images - color Possible answers: avg red, avg green, avg blue. It turns out that this lower-bounds the histogram distance, hence no cross-talk, and SAMs are applicable.
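A minimal sketch of the avg-RGB feature extraction; the tiny 2x2 images here are made-up examples for illustration:

```python
import numpy as np

def avg_rgb(image):
    """3-d feature vector for an H x W x 3 image:
    average red, average green, average blue."""
    return np.asarray(image, dtype=float).reshape(-1, 3).mean(axis=0)

red = np.zeros((2, 2, 3)); red[:, :, 0] = 255        # pure red image
mixed = np.zeros((2, 2, 3))
mixed[0, :, 0] = 255                                  # top row red
mixed[1, :, 2] = 255                                  # bottom row blue
print(avg_rgb(red))    # [255.   0.   0.]
print(avg_rgb(mixed))  # [127.5   0.  127.5]
```

These 3-d points can then be indexed with any spatial access method.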

Images - color [Figure: time performance vs. selectivity; sequential scan compared with the avg-RGB index.]

Images - shapes Distance function: Euclidean, on the area, perimeter, and 20 'moments'. (Q: how to normalize them? A: divide by the standard deviation.)

Images - shapes Distance function: Euclidean, on the area, perimeter, and 20 'moments'. (Q: other 'features' / distance functions? A1: turning angle; A2: dilations/erosions; A3: ...)

Images - shapes Distance function: Euclidean, on the area, perimeter, and 20 'moments'. Q: how to do dimensionality reduction? A: Karhunen-Loeve (= centered PCA/SVD).

Images - shapes Performance: ~10x faster. [Figure: log(# of I/Os) vs. # of features kept, compared with keeping all features.]