Presentation on theme: "Multimedia DBs. Time Series Data" — Presentation transcript:

1 Multimedia DBs

2 Time Series Data A time series is a collection of observations made sequentially in time. [Figure: an example time series plotted with a time axis (x) and a value axis (y).]

3 PAA and APCA Feature extraction for GEMINI: Fourier, wavelets. Another approach: segment the time series into equal parts and store the average value of each part. Use an index to store the averages and the segment end points.

4 Feature Spaces [Figure: a time series X and its reconstruction X' in three feature spaces.] DFT (Agrawal, Faloutsos, Swami 1993); DWT with Haar wavelets Haar 0 – Haar 7 (Chan & Fu 1999); SVD with eigenwave 0 – eigenwave 7 (Korn, Jagadish, Faloutsos 1997).

5 Piecewise Aggregate Approximation (PAA) Original time series (n-dimensional vector): S = {s_1, s_2, ..., s_n}. n'-segment PAA representation (n'-dimensional vector): S' = {sv_1, sv_2, ..., sv_n'}. [Figure: a series divided into 8 equal-length segments with averages sv_1 ... sv_8 plotted against the time and value axes.] The PAA representation satisfies the lower-bounding lemma (Keogh, Chakrabarti, Mehrotra and Pazzani, 2000; Yi and Faloutsos, 2000).
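The segment-averaging step translates directly into code. A minimal sketch (the function name and the synthetic series are illustrative, not from the slides; np.array_split handles a series length that is not divisible by the number of segments):

```python
import numpy as np

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: the mean value of each equal-length segment."""
    s = np.asarray(series, dtype=float)
    segments = np.array_split(s, n_segments)          # equal parts (the last ones may differ by one point)
    return np.array([seg.mean() for seg in segments])

# Usage: reduce a 512-point series to an 8-dimensional feature vector.
x = np.sin(np.linspace(0, 10, 512))
print(paa(x, 8))
```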

6 Can we improve upon PAA? Adaptive Piecewise Constant Approximation (APCA). n'-segment PAA representation (n'-dimensional vector): S' = {sv_1, sv_2, ..., sv_n'}. n'/2-segment APCA representation (n'-dimensional vector): S' = {sv_1, sr_1, sv_2, sr_2, ..., sv_M, sr_M}, where M is the number of segments = n'/2, sv_i is the mean value of segment i and sr_i is its right end point. [Figure: the same series approximated by 8 equal-length PAA segments sv_1 ... sv_8 and by 4 variable-length APCA segments (sv_1, sr_1) ... (sv_4, sr_4).]
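Given the segment boundaries, the APCA vector itself is cheap to assemble. A minimal sketch, assuming sr_i is the (0-based, inclusive) right end point of segment i; the helper name is illustrative:

```python
import numpy as np

def apca_from_breaks(series, breaks):
    """Build the APCA vector {sv_1, sr_1, ..., sv_M, sr_M} from given right end points."""
    s = np.asarray(series, dtype=float)
    rep, start = [], 0
    for right in breaks:                              # breaks: sorted inclusive right end points
        rep.extend([s[start:right + 1].mean(), right])   # sv_i, sr_i
        start = right + 1
    return np.array(rep)
```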

7 Reconstruction error: PAA vs. APCA. APCA approximates the original signal better than PAA. Improvement factor = (reconstruction error of PAA) / (reconstruction error of APCA); factors observed on the example datasets: 1.69, 3.02, 1.21, 1.75, 3.77, 1.03.

8 APCA representation can be computed efficiently A near-optimal representation can be computed in O(n log n) time. The optimal representation can be computed in O(n^2 M) time (Koudas et al.).
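For concreteness, here is a sketch of an O(n^2 M) dynamic program for an error-optimal M-segment piecewise-constant approximation, assuming "optimal" means minimum total squared reconstruction error; the code is illustrative, not the authors' implementation:

```python
import numpy as np

def optimal_apca_breaks(series, M):
    """Right end points of the M segments minimizing squared reconstruction error."""
    s = np.asarray(series, dtype=float)
    n = len(s)
    ps = np.concatenate(([0.0], np.cumsum(s)))        # prefix sums give the error of
    ps2 = np.concatenate(([0.0], np.cumsum(s * s)))   # any segment in O(1) time

    def sse(i, j):                                    # error of replacing s[i..j] by its mean
        tot, tot2, length = ps[j + 1] - ps[i], ps2[j + 1] - ps2[i], j - i + 1
        return tot2 - tot * tot / length

    INF = float("inf")
    cost = np.full((M + 1, n), INF)                   # cost[m][j]: best error for s[0..j] with m segments
    back = np.zeros((M + 1, n), dtype=int)
    for j in range(n):
        cost[1][j] = sse(0, j)
    for m in range(2, M + 1):
        for j in range(m - 1, n):
            for i in range(m - 2, j):                 # previous part ends at i, last segment is s[i+1..j]
                c = cost[m - 1][i] + sse(i + 1, j)
                if c < cost[m][j]:
                    cost[m][j], back[m][j] = c, i
    breaks, j = [n - 1], n - 1                        # recover the segment right end points
    for m in range(M, 1, -1):
        j = back[m][j]
        breaks.append(j)
    return sorted(breaks)
```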

9 Distance Measure Exact (Euclidean) distance D(Q, S) between the query Q and a time series S in the original space; lower-bounding distance D_LB(Q', S) between the query and the APCA representation in the feature space. [Figure: Q and S compared point by point in the original space; Q' and the APCA segments of S compared in the feature space.]
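A sketch of one way to compute the lower-bounding distance: project the query onto the segments of S's APCA representation and take a segment-length-weighted Euclidean distance. This follows the usual APCA lower bound; treat the exact weighting as an assumption rather than the paper's definition:

```python
import numpy as np

def d_lb(query, apca_rep):
    """Lower-bounding distance between query Q and the APCA vector [sv_1, sr_1, ..., sv_M, sr_M]."""
    q = np.asarray(query, dtype=float)
    rep = np.asarray(apca_rep, dtype=float)
    values, rights = rep[0::2], rep[1::2].astype(int)
    total, start = 0.0, 0
    for sv, right in zip(values, rights):
        length = right - start + 1
        qv = q[start:right + 1].mean()                # query projected onto this segment of S
        total += length * (qv - sv) ** 2
        start = right + 1
    return np.sqrt(total)
```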

10 Index on the 2M-dimensional APCA space Any feature-based index structure can be used (e.g., R-tree, X-tree, Hybrid Tree). [Figure: APCA points S1 ... S9 in the 2M-dimensional space, grouped into minimum bounding rectangles R2, R3, R4 under the root entry R1, together with the corresponding tree of nodes.]

11 k-nearest neighbor algorithm For any node U of the index structure with MBR R, MINDIST(Q, R) <= D(Q, S) for any data item S under U. [Figure: query Q and the MBRs R2, R3, R4, with the distances MINDIST(Q, R2), MINDIST(Q, R3), MINDIST(Q, R4).]
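The property above is what makes the usual best-first traversal correct: entries are expanded in order of MINDIST, and a data item whose exact distance has already been computed can be reported once no pending entry can beat it. A minimal sketch, assuming a hypothetical node layout (is_leaf_item, children) and caller-supplied distance functions:

```python
import heapq

def knn_search(root, query, k, mindist, exact_dist):
    """Best-first k-NN: pop entries by priority; data items are pushed with their exact distance."""
    heap, counter, results = [(0.0, 0, root)], 1, []
    while heap and len(results) < k:
        dist, _, entry = heapq.heappop(heap)
        if entry.is_leaf_item:                        # exact distance was attached at push time
            results.append((dist, entry))
        else:
            for child in entry.children:
                d = exact_dist(query, child) if child.is_leaf_item else mindist(query, child)
                heapq.heappush(heap, (d, counter, child))   # counter breaks ties between entries
                counter += 1
    return results
```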

12 Index Modification for MINDIST Computation APCA point: S = {sv_1, sr_1, sv_2, sr_2, ..., sv_M, sr_M}. APCA rectangle: S = (L, H), where L = {smin_1, sr_1, smin_2, sr_2, ..., smin_M, sr_M} and H = {smax_1, sr_1, smax_2, sr_2, ..., smax_M, sr_M}. [Figure: for each segment i, smin_i and smax_i bound the segment value sv_i from below and above.]
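A sketch of how such a rectangle could be built, under the assumption that smin_i / smax_i are the minimum / maximum of the original series within segment i (the helper name is illustrative):

```python
import numpy as np

def apca_rectangle(series, apca_rep):
    """Build L and H for the APCA rectangle of a series, reusing its segment end points."""
    s = np.asarray(series, dtype=float)
    rights = np.asarray(apca_rep, dtype=float)[1::2].astype(int)
    L, H, start = [], [], 0
    for right in rights:
        L.extend([s[start:right + 1].min(), right])   # smin_i, sr_i
        H.extend([s[start:right + 1].max(), right])   # smax_i, sr_i
        start = right + 1
    return np.array(L), np.array(H)
```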

13 MBR Representation in time-value space We can view the MBR R = (L, H) of any node U as two APCA-style sequences L = {l_1, l_2, ..., l_(N-1), l_N} and H = {h_1, h_2, ..., h_(N-1), h_N}. [Figure: L = {l_1, ..., l_6} and H = {h_1, ..., h_6} plotted against the time and value axes, outlining REGION 1, REGION 2 and REGION 3.]

14 Regions M regions are associated with each MBR. Boundaries of the i-th region: values between l_(2i-1) and h_(2i-1), time instants between l_(2i-2) + 1 and h_(2i). [Figure: REGION 1 – REGION 3 drawn in time-value space from the boundaries l_1 ... l_6 and h_1 ... h_6.]

15 Regions The i-th region is active at time instant t if it spans across t. The value s_t of any time series S under node U at time instant t must lie in one of the regions active at t (Lemma 2). [Figure: the three regions in time-value space; at time instants t1 and t2 different subsets of the regions are active.]

16 MINDIST Computation For time instant t, MINDIST(Q, R, t) = min over regions G active at t of MINDIST(Q, G, t). Example: MINDIST(Q, R, t1) = min(MINDIST(Q, Region 1, t1), MINDIST(Q, Region 2, t1)) = min((q_t1 - h_1)^2, (q_t1 - h_3)^2) = (q_t1 - h_1)^2. Overall, MINDIST(Q, R) = sqrt(sum over all time instants t of MINDIST(Q, R, t)). Lemma 3: MINDIST(Q, R) <= D(Q, C) for any time series C under node U.
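A sketch of the per-instant computation, using the region boundaries from slide 14 (1-based time instants; the point-to-region squared distance is taken as zero when q_t lies inside the region's value range, otherwise the squared gap to the nearer bound — an assumption consistent with the example above):

```python
import numpy as np

def mindist(query, L, H):
    """MINDIST(Q, R) for MBR R = (L, H); L and H are length-2M arrays (0-based indexing)."""
    q = np.asarray(query, dtype=float)
    M = len(L) // 2
    total = 0.0
    for t in range(1, len(q) + 1):                    # 1-based time instants
        best = float("inf")
        for i in range(1, M + 1):
            t_lo = L[2 * i - 3] + 1 if i > 1 else 1   # l_(2i-2) + 1; region 1 starts at t = 1
            t_hi = H[2 * i - 1]                       # h_(2i)
            if not (t_lo <= t <= t_hi):
                continue                              # region i is not active at t
            v_lo, v_hi = L[2 * i - 2], H[2 * i - 2]   # l_(2i-1), h_(2i-1)
            qt = q[t - 1]
            d = 0.0 if v_lo <= qt <= v_hi else min((qt - v_lo) ** 2, (qt - v_hi) ** 2)
            best = min(best, d)
        if best < float("inf"):
            total += best
    return np.sqrt(total)
```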

17 Approximate Search A simpler definition of the distance in the feature space is the following: D_LB(Q', S). But there is one problem... what?

18 Multimedia DBs A multimedia database also stores images. Again, similarity queries (content-based retrieval): extract features, index them in feature space, and answer similarity queries using GEMINI. Again, average values help!

19 Images - color What is an image? A: a 2-d array of pixel values.

20 Images - color Color histograms, and a distance function between them.

21 Images - color Mathematically, the distance function is d_hist^2(x, y) = (x - y)^T A (x - y), where x and y are the color histograms and the matrix A = [a_ij] captures the similarity between color bins i and j (see the QBIC filtering slides below).
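A minimal sketch of this quadratic-form distance; the 3-bin similarity matrix and the histograms below are made-up examples:

```python
import numpy as np

def hist_dist_sq(x, y, A):
    """Quadratic-form histogram distance d^2 = (x - y)^T A (x - y)."""
    z = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(z @ A @ z)

A = np.array([[1.0, 0.8, 0.1],     # cross-talk: perceptually similar colors get large a_ij
              [0.8, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
x = np.array([0.5, 0.3, 0.2])      # normalized histograms (bins sum to 1)
y = np.array([0.3, 0.5, 0.2])
print(hist_dist_sq(x, y, A))
```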

22 Images - color Problem: 'cross-talk': the features are not orthogonal, so SAMs (spatial access methods) will not work properly. Q: what to do? A: it is a feature-extraction question.

23 Images - color Possible answer: avg red, avg green, avg blue. It turns out that this lower-bounds the histogram distance, so there is no cross-talk and SAMs are applicable.
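A sketch of the 3-d average-color feature (the lower-bounding property, with the appropriate scaling constant, is worked out in the QBIC filtering slides below):

```python
import numpy as np

def avg_rgb(image):
    """image: H x W x 3 array of RGB values; returns the mean R, G, B over all pixels."""
    return np.asarray(image, dtype=float).reshape(-1, 3).mean(axis=0)
```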

24 Images - color Performance: [Figure: query time vs. selectivity, comparing filtering with avg RGB against a sequential scan.]

25 Images - shapes distance function: Euclidean, on the area, perimeter, and 20 'moments' (Q: how to normalize them?)

26 Images - shapes distance function: Euclidean, on the area, perimeter, and 20 ‘moments’ (Q: how to normalize them? A: divide by standard deviation)

27 Images - shapes distance function: Euclidean, on the area, perimeter, and 20 'moments' (Q: other 'features' / distance functions?)

28 Images - shapes distance function: Euclidean, on the area, perimeter, and 20 'moments' (Q: other 'features' / distance functions? A1: turning angle; A2: dilations/erosions; A3: ...)

29 Images - shapes distance function: Euclidean, on the area, perimeter, and 20 ‘moments’ Q: how to do dim. reduction?

30 Images - shapes distance function: Euclidean, on the area, perimeter, and 20 ‘moments’ Q: how to do dim. reduction? A: Karhunen-Loeve (= centered PCA/SVD)
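A sketch of the Karhunen-Loeve transform via SVD on centered data (the dimensions and the random example matrix are illustrative):

```python
import numpy as np

def karhunen_loeve(X, k):
    """Center the N x d feature matrix, then project onto the top-k principal directions."""
    mean = X.mean(axis=0)
    Xc = X - mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[:k].T                      # d x k matrix of principal directions
    return Xc @ basis, basis, mean

# Usage: project 22-d shape features (area, perimeter, 20 moments) down to, say, 4 dimensions.
X = np.random.rand(100, 22)
Z, basis, mean = karhunen_loeve(X, 4)
```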

31 Images - shapes Performance: ~10x faster. [Figure: log(# of I/Os) vs. number of features kept, compared against keeping all features.]

32 Is d(u,v) = sqrt((u-v)^T A (u-v)) a metric? Write x^T A x = Σ_ij x_i x_j A_ij = Σ_i λ_i x_i^2, where λ_i is the i-th eigenvalue of A and x_i is the projection of x along the i-th eigenvector. Then d(u,v) = sqrt((u-v)^T A (u-v)) = sqrt(Σ_i λ_i (u_i - v_i)^2). Clearly d(u,v) >= 0, d(u,u) = 0, and d(u,v) = d(v,u). The triangle inequality d(u,w) <= d(u,v) + d(v,w) holds provided sqrt(Σ_i λ_i (u_i - w_i)^2) <= sqrt(Σ_i λ_i (u_i - v_i)^2) + sqrt(Σ_i λ_i (v_i - w_i)^2), i.e. sqrt(Σ_i (√λ_i u_i - √λ_i w_i)^2) <= sqrt(Σ_i (√λ_i u_i - √λ_i v_i)^2) + sqrt(Σ_i (√λ_i v_i - √λ_i w_i)^2). This is exactly the metric (triangle-inequality) condition for the L2 norm applied to the scaled vectors (√λ_i x_i), which requires all λ_i >= 0, i.e. A positive semi-definite, so that the √λ_i are real.

33 Filtering in QBIC Histogram column vectors x, y of length n, with Σ_i x_i = 1 and Σ_i y_i = 1. Difference z = x - y, so Σ_i z_i = 0. Contribution of each color bin to a smaller set of colors: V^T = (c_1, c_2, ..., c_n), where each c_i is a column vector of length 3. Then x_avg = V^T x and y_avg = V^T y are column vectors of length 3.

34 Filtering in QBIC Distances: d_avg^2 = (x_avg - y_avg)^T (x_avg - y_avg) = (V^T z)^T (V^T z) = z^T V V^T z = z^T W z, and d_hist^2 = z^T A z. Claim: d_hist^2 >= λ_1 d_avg^2, where λ_1 is the smallest eigenvalue of the generalized eigenproblem A'z = λW'z.
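The filtering constant λ_1 can be precomputed once per color-similarity matrix. A sketch using scipy's generalized eigenvalue solver; how A' and W' are obtained from A and W is shown on the next two slides, and here they are simply taken as inputs:

```python
import numpy as np
from scipy.linalg import eigvals

def filtering_constant(A_prime, W_prime):
    """Smallest real generalized eigenvalue lambda_1 of A' z = lambda W' z."""
    lam = eigvals(A_prime, W_prime)       # generalized eigenvalues (complex dtype in general)
    return float(np.min(lam.real))

# Filtering during search: if lambda_1 * d_avg^2 already exceeds the query radius
# epsilon^2, the expensive d_hist computation can be skipped with no false dismissals.
```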

35 Filtering in QBIC Rewrite z to remove the extra condition that Σ_i z_i = 0: z' becomes an (n-1)-dimensional column vector such that z^T A z = z'^T A' z' and z^T W z = z'^T W' z', where A' and W' are (n-1)x(n-1) matrices. It remains to show that z'^T A' z' >= λ_1 z'^T W' z'.

36 Proof of z'^T A' z' >= λ_1 z'^T W' z' Minimize z'^T A' z' with respect to z', subject to the constraint z'^T W' z' = C. This is the same as minimizing, with respect to z', the Lagrangian z'^T A' z' - λ(z'^T W' z' - C). Differentiating with respect to z' and setting the derivative to 0 gives A'z' = λW'z', so λ and z' must be an eigenvalue and eigenvector, respectively, of the generalized problem A'z' = λW'z'.

37 Proof of z'^T A' z' >= λ_1 z'^T W' z' At any such stationary point, z'^T A' z' = λ z'^T W' z' = λC. To minimize z'^T A' z' we must choose the smallest eigenvalue λ_1, so the minimum of z'^T A' z' over z', subject to z'^T W' z' = C, equals λ_1 C. Hence if z'^T W' z' = C > 0 then z'^T A' z' >= λ_1 C, and if z'^T W' z' = 0 then z'^T A' z' >= 0 because A' is positive semi-definite. In either case d_hist^2 >= λ_1 d_avg^2.

