Indexing Time Series.

Slides:



Advertisements
Similar presentations
Alexander Gelbukh Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 6 (book chapter 12): Multimedia.
Advertisements

Indexing Time Series Based on original slides by Prof. Dimitrios Gunopulos and Prof. Christos Faloutsos with some slides from tutorials by Prof. Eamonn.
Time Series II.
CMU SCS : Multimedia Databases and Data Mining Lecture #23: DSP tools – Fourier and Wavelets C. Faloutsos.
Wavelets Fast Multiresolution Image Querying Jacobs et.al. SIGGRAPH95.
CMU SCS : Multimedia Databases and Data Mining Lecture #22: DSP tools – Fourier and Wavelets C. Faloutsos.
Transform Techniques Mark Stamp Transform Techniques.
Engineering Mathematics Class #15 Fourier Series, Integrals, and Transforms (Part 3) Sheng-Fang Huang.
Fourier Transform (Chapter 4)
Spatial and Temporal Data Mining
Mining Time Series.
Multimedia DBs.
Time Series Indexing II. Time Series Data
Indexing Time Series. Time Series Databases A time series is a sequence of real numbers, representing the measurements of a real variable at equal time.
Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)
Efficient Similarity Search in Sequence Databases Rakesh Agrawal, Christos Faloutsos and Arun Swami Leila Kaghazian.
Data Mining: Concepts and Techniques Mining time-series data.
CS490D: Introduction to Data Mining Prof. Chris Clifton
Distance Functions for Sequence Data and Time Series
Multimedia DBs. Time Series Data
1. 2 General problem Retrieval of time-series similar to a given pattern.
Based on Slides by D. Gunopulos (UCR)
Basic Concepts and Definitions Vector and Function Space. A finite or an infinite dimensional linear vector/function space described with set of non-unique.
Spatial and Temporal Data Mining
Euripides G.M. PetrakisIR'2001 Oulu, Sept Indexing Images with Multiple Regions Euripides G.M. Petrakis Dept.
Digital Image Processing Final Project Compression Using DFT, DCT, Hadamard and SVD Transforms Zvi Devir and Assaf Eden.
Spatial and Temporal Databases Efficiently Time Series Matching by Wavelets (ICDE 98) Kin-pong Chan and Ada Wai-chee Fu.
Discrete Time Periodic Signals A discrete time signal x[n] is periodic with period N if and only if for all n. Definition: Meaning: a periodic signal keeps.
Fast Subsequence Matching in Time-Series Databases Christos Faloutsos M. Ranganathan Yannis Manolopoulos Department of Computer Science and ISR University.
Time Series I.
Goals For This Class Quickly review of the main results from last class Convolution and Cross-correlation Discrete Fourier Analysis: Important Considerations.
Multimedia and Time-series Data
Image Processing Fourier Transform 1D Efficient Data Representation Discrete Fourier Transform - 1D Continuous Fourier Transform - 1D Examples.
Transforms. 5*sin (2  4t) Amplitude = 5 Frequency = 4 Hz seconds A sine wave.
WAVELET TRANSFORM.
1 Chapter 5 Image Transforms. 2 Image Processing for Pattern Recognition Feature Extraction Acquisition Preprocessing Classification Post Processing Scaling.
Wavelet-based Coding And its application in JPEG2000 Monia Ghobadi CSC561 final project
A Query Adaptive Data Structure for Efficient Indexing of Time Series Databases Presented by Stavros Papadopoulos.
Mining Time Series.
Fast Subsequence Matching in Time-Series Databases Author: Christos Faloutsos etc. Speaker: Weijun He.
EE104: Lecture 5 Outline Review of Last Lecture Introduction to Fourier Transforms Fourier Transform from Fourier Series Fourier Transform Pair and Signal.
DCT.
E.G.M. PetrakisSearching Signals and Patterns1  Given a query Q and a collection of N objects O 1,O 2,…O N search exactly or approximately  The ideal.
2005/12/021 Fast Image Retrieval Using Low Frequency DCT Coefficients Dept. of Computer Engineering Tatung University Presenter: Yo-Ping Huang ( 黃有評 )
7- 1 Chapter 7: Fourier Analysis Fourier analysis = Series + Transform ◎ Fourier Series -- A periodic (T) function f(x) can be written as the sum of sines.
Euripides G.M. PetrakisIR'2001 Oulu, Sept Indexing Images with Multiple Regions Euripides G.M. Petrakis Dept. of Electronic.
Indexing Time Series. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Data Mining Multimedia Databases Text databases Image and.
Fourier Transform.
Time Series Sequence Matching Jiaqin Wang CMPS 565.
Indexing Time Series. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Time Series databases Text databases.
CS 376b Introduction to Computer Vision 03 / 17 / 2008 Instructor: Michael Eckmann.
By Dr. Rajeev Srivastava CSE, IIT(BHU)
FastMap : Algorithm for Indexing, Data- Mining and Visualization of Traditional and Multimedia Datasets.
CS654: Digital Image Analysis Lecture 11: Image Transforms.
Chapter 13 Discrete Image Transforms
The Frequency Domain Digital Image Processing – Chapter 8.
Fourier Transform (Chapter 4) CS474/674 – Prof. Bebis.
Digital Image Processing Lecture 8: Fourier Transform Prof. Charlene Tsai.
Keogh, E. , Chakrabarti, K. , Pazzani, M. & Mehrotra, S. (2001)
Time Series Indexing II
Fast Subsequence Matching in Time-Series Databases.
Singular Value Decomposition and its applications
k is the frequency index
An Example of 1D Transform with Two Variables
Distance Functions for Sequence Data and Time Series
Data Mining: Concepts and Techniques — Chapter 8 — 8
k is the frequency index
Data Mining: Concepts and Techniques — Chapter 8 — 8
Chapter 15: Wavelets (i) Fourier spectrum provides all the frequencies
Data Mining: Concepts and Techniques — Chapter 8 — 8
Presentation transcript:

Indexing Time Series

Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Time Series databases Text databases Image and video databases

Time Series Databases A time series is a sequence of real numbers, representing the measurements of a real variable at equal time intervals Stock prices Volume of sales over time Daily temperature readings ECG data A time series database is a large collection of time series

Time Series Data A time series is a collection of observations made sequentially in time. 25.1750 25.2250 25.2500 25.2750 25.3250 25.3500 25.4000 25.2000 .. 24.6250 24.6750 24.7500 50 100 150 200 250 300 350 400 450 500 23 24 25 26 27 28 29 value axis time axis

Time Series Problems (from a database perspective) The Similarity Problem X = x1, x2, …, xn and Y = y1, y2, …, yn Define and compute Sim(X, Y) E.g. do stocks X and Y have similar movements? Retrieve efficiently similar time series (Indexing for Similarity Queries)

Types of queries whole match vs sub-pattern match range query vs nearest neighbors all-pairs query

Examples Find companies with similar stock prices over a time interval Find products with similar sell cycles Cluster users with similar credit card utilization Find similar subsequences in DNA sequences Find scenes in video streams

distance function: by expert (eg, Euclidean distance) day $price 1 365 distance function: by expert (eg, Euclidean distance)

Problems Define the similarity (or distance) function Find an efficient algorithm to retrieve similar time series from a database (Faster than sequential scan) The Similarity function depends on the Application

Metric Distances What properties should a similarity distance have? D(A,B) = D(B,A) Symmetry D(A,A) = 0 Constancy of Self-Similarity D(A,B) >= 0 Positivity D(A,B)  D(A,C) + D(B,C) Triangular Inequality

Euclidean Similarity Measure View each sequence as a point in n-dimensional Euclidean space (n = length of each sequence) Define (dis-)similarity between sequences X and Y as p=1 Manhattan distance p=2 Euclidean distance

Euclidean model Query Q Database Distance 0.98 0.07 0.21 0.43 Rank 4 1 n datapoints Database n datapoints Distance 0.98 0.07 0.21 0.43 Rank 4 1 2 3 S Q Euclidean Distance between two time series Q = {q1, q2, …, qn} and S = {s1, s2, …, sn}

Advantages Easy to compute: O(n) Allows scalable solutions to other problems, such as indexing clustering etc...

Similarity Retrieval Range Query Nearest Neighbor query Find all time series S where Nearest Neighbor query Find all the k most similar time series to Q A method to answer the above queries: Linear scan … very slow A better approach GEMINI

GEMINI Solution: Quick-and-dirty' filter: extract m features (numbers, eg., avg., etc.) map into a point in m-d feature space organize points with off-the-shelf spatial access method (‘SAM’) retrieve the answer using a NN query discard false alarms

GEMINI Range Queries Build an index for the database in a feature space using an R-tree Algorithm RangeQuery(Q, e) Project the query Q into a point q in the feature space Find all candidate objects in the index within e Retrieve from disk the actual sequences Compute the actual distances and discard false alarms

GEMINI NN Query Algorithm K_NNQuery(Q, K) Project the query Q in the same feature space Find the candidate K nearest neighbors in the index Retrieve from disk the actual sequences pointed to by the candidates Compute the actual distances and record the maximum Issue a RangeQuery(Q, emax) Compute the actual distances, return best K

GEMINI GEMINI works when: Dfeature(F(x), F(y)) <= D(x, y) Note that, the closer the feature distance to the actual one, the better.

Problem How to extract the features? How to define the feature space? Fourier transform Wavelets transform Averages of segments (Histograms or APCA) Chebyshev polynomials .... your favorite curve approximation...

Fourier transform DFT (Discrete Fourier Transform) Transform the data from the time domain to the frequency domain highlights the periodicities SO?

DFT A: several real sequences are periodic Q: Such as? A: sales patterns follow seasons; economy follows 50-year cycle (or 10?) temperature follows daily and yearly cycles Many real signals follow (multiple) cycles

How does it work? value x ={x0, x1, ... xn-1} s ={s0, s1, ... sn-1} Decomposes signal to a sum of sine and cosine waves. Q:How to assess ‘similarity’ of x with a (discrete) wave? value x ={x0, x1, ... xn-1} s ={s0, s1, ... sn-1} time 1 n-1

How does it work? Freq=1/period value value A: consider the waves with frequency 0, 1, ...; use the inner-product (~cosine similarity) Freq=1/period 1 n-1 time value freq. f=1 (sin(t * 2 p/n) ) 1 n-1 time value freq. f=0

How does it work? A: consider the waves with frequency 0, 1, ...; use the inner-product (~cosine similarity) 1 n-1 time value freq. f=2

How does it work? 1 n-1 cosine, f=1 sine, freq =1 1 n-1 1 n-1 1 n-1 ‘basis’ functions 1 n-1 cosine, f=1 sine, freq =1 1 n-1 cosine, f=2 sine, freq = 2 1 n-1 1 n-1

How does it work? Basis functions are actually n-dim vectors, orthogonal to each other ‘similarity’ of x with each of them: inner product DFT: ~ all the similarities of x with the basis functions

How does it work? Since ejf = cos(f) + j sin(f) (j=sqrt(-1)), we finally have:

DFT: definition Discrete Fourier Transform (n-point): inverse DFT

DFT: properties Observation - SYMMETRY property: Xf = (Xn-f )* ( “*”: complex conjugate: (a + b j)* = a - b j ) Thus we use only the first half numbers

DFT: Amplitude spectrum Intuition: strength of frequency ‘f’ count Af freq: 12 time freq. f

DFT: Amplitude spectrum excellent approximation, with only 2 frequencies! so what?

n = 128 The graphic shows a time series with 128 points. 0.4995 0.5264 0.5523 0.5761 0.5973 0.6153 0.6301 0.6420 0.6515 0.6596 0.6672 0.6751 0.6843 0.6954 0.7086 0.7240 0.7412 0.7595 0.7780 0.7956 0.8115 0.8247 0.8345 0.8407 0.8431 0.8423 0.8387 … Raw Data The graphic shows a time series with 128 points. The raw data used to produce the graphic is also reproduced as a column of numbers (just the first 30 or so points are shown). C 20 40 60 80 100 120 140 n = 128

1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928 0.1635 0.1602 0.0992 0.1282 0.1438 0.1416 0.1400 0.1412 0.1530 0.0795 0.1013 0.1150 0.1801 0.1082 0.0812 0.0347 0.0052 0.0017 0.0002 ... Fourier Coefficients 0.4995 0.5264 0.5523 0.5761 0.5973 0.6153 0.6301 0.6420 0.6515 0.6596 0.6672 0.6751 0.6843 0.6954 0.7086 0.7240 0.7412 0.7595 0.7780 0.7956 0.8115 0.8247 0.8345 0.8407 0.8431 0.8423 0.8387 … Raw Data We can decompose the data into 64 pure sine waves using the Discrete Fourier Transform (just the first few sine waves are shown). The Fourier Coefficients are reproduced as a column of numbers (just the first 30 or so coefficients are shown). C 20 40 60 80 100 120 140 . . . . . . . . . . . . . .

We have discarded of the data. C n = 128 N = 8 Cratio = 1/16 C’ Truncated Fourier Coefficients 0.4995 0.5264 0.5523 0.5761 0.5973 0.6153 0.6301 0.6420 0.6515 0.6596 0.6672 0.6751 0.6843 0.6954 0.7086 0.7240 0.7412 0.7595 0.7780 0.7956 0.8115 0.8247 0.8345 0.8407 0.8431 0.8423 0.8387 … Raw Data Fourier Coefficients 1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928 0.1635 0.1602 0.0992 0.1282 0.1438 0.1416 0.1400 0.1412 0.1530 0.0795 0.1013 0.1150 0.1801 0.1082 0.0812 0.0347 0.0052 0.0017 0.0002 ... C 1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928 n = 128 N = 8 Cratio = 1/16 C’ 20 40 60 80 100 120 140 We have discarded of the data.

Sorted Truncated Fourier Coefficients 1.5698 1.0485 0.7160 0.8406 0.3709 0.1670 0.4667 0.1928 0.1635 0.1302 0.0992 0.1282 0.2438 0.2316 0.1400 0.1412 0.1530 0.0795 0.1013 0.1150 0.1801 0.1082 0.0812 0.0347 0.0052 0.0017 0.0002 ... Fourier Coefficients 0.4995 0.5264 0.5523 0.5761 0.5973 0.6153 0.6301 0.6420 0.6515 0.6596 0.6672 0.6751 0.6843 0.6954 0.7086 0.7240 0.7412 0.7595 0.7780 0.7956 0.8115 0.8247 0.8345 0.8407 0.8431 0.8423 0.8387 … Raw Data C 1.5698 1.0485 0.7160 0.8406 0.2667 0.1928 0.1438 0.1416 C’ 20 40 60 80 100 120 140 Instead of taking the first few coefficients, we could take the best coefficients

DFT: Parseval’s theorem sum( xt 2 ) = sum ( | X f | 2 ) Ie., DFT preserves the ‘energy’ or, alternatively: it does an axis rotation: x1 x = {x0, x1} x0

Lower Bounding lemma Using Parseval’s theorem we can prove the lower bounding property! So, apply DFT to each time series, keep first 3-10 coefficients as a vector and use an R-tree to index the vectors R-tree works with euclidean distance, OK.

Wavelets - DWT DFT is great - but, how about compressing opera? (baritone, silence, soprano?) value time

Wavelets - DWT Solution#1: Short window Fourier transform But: how short should be the window?

Wavelets - DWT Answer: multiple window sizes! -> DWT

Haar Wavelets subtract sum of left half from right half repeat recursively for quarters, eightths ... Basis functions are step functions with different lenghts

Wavelets - construction x0 x1 x2 x3 x4 x5 x6 x7

Wavelets - construction ....... level 1 d1,0 d1,1 s1,1 + - x0 x1 x2 x3 x4 x5 x6 x7

Wavelets - construction level 2 d2,0 s1,0 ....... d1,0 d1,1 s1,1 + - x0 x1 x2 x3 x4 x5 x6 x7

Wavelets - construction etc ... s2,0 d2,0 s1,0 ....... d1,0 d1,1 s1,1 + - x0 x1 x2 x3 x4 x5 x6 x7

Wavelets - construction Q: map each coefficient on the time-freq. plane f s2,0 d2,0 t s1,0 ....... d1,0 d1,1 s1,1 + - x0 x1 x2 x3 x4 x5 x6 x7

Wavelets - construction Q: map each coefficient on the time-freq. plane f s2,0 d2,0 t s1,0 ....... d1,0 d1,1 s1,1 + - x0 x1 x2 x3 x4 x5 x6 x7

Wavelets - Drill: Q: baritone/silence/soprano - DWT? f t time value

Wavelets - Drill: Q: baritone/soprano - DWT? f t time value

Wavelets - construction Observation1: ‘+’ can be some weighted addition ‘-’ is the corresponding weighted difference (‘Quadrature mirror filters’) Observation2: unlike DFT/DCT, there are *many* wavelet bases: Haar, Daubechies-4, Daubechies-6, ...

Advantages of Wavelets Better compression (better RMSE with same number of coefficients) closely related to the processing of the mammalian eye and ear Good for progressive transmission handle spikes well usually, fast to compute (O(n)!)

Feature space Keep the d most “important” wavelets coefficients Normalize and keep the largest Lower bounding lemma: the same as DFT