Hongtao Cheng. Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences. Authors: Flip Korn, H. V. Jagadish, Christos Faloutsos. From ACM SIGMOD 1997.


2 Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences
Authors: Flip Korn, H. V. Jagadish, Christos Faloutsos
From ACM SIGMOD 1997

3 Outline
• Introduction
• Alternative methods
• SVD and SVDD
• Experiments
• Conclusions and summary

4 Introduction - datasets
Datasets we are dealing with:
• An N x M matrix: each of the N rows is a time sequence, each of the M columns is a point in time, and x_{i,j} is the value of sequence i at time j.
• The matrix is huge (gigabytes).
• The number of rows N is much larger than the number of columns M: N ≈ O(10^6), M ≈ O(100).
• Updates to the matrix are rare or absent.

5 Introduction – a sample database
• Query 1: What was the amount of sales to GHI Inc. on July 11, 1996?
• Query 2: Find the total sales to business customers (ABC, DEF, GHI, and KLM) for the week ending July 12, 1996.
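
With the data viewed as a customers-by-days matrix, the two queries reduce to a single-cell lookup and a sum over a sub-matrix. A minimal Python sketch, assuming a hypothetical layout (rows are customers, columns are days) and made-up sales figures over a shortened three-day window; none of the numbers come from the paper:

```python
# Hypothetical sample data: rows are customers, columns are days.
customers = ["ABC", "DEF", "GHI", "KLM", "Smith"]
days = ["1996-07-10", "1996-07-11", "1996-07-12"]
sales = [
    [1, 1, 1],
    [2, 2, 2],
    [1, 1, 1],
    [5, 5, 5],
    [0, 0, 1],
]

def cell(customer, day):
    """Query 1 style: value of a single cell x[i][j]."""
    return sales[customers.index(customer)][days.index(day)]

def total(selected_customers, selected_days):
    """Query 2 style: aggregate (sum) over a set of rows and columns."""
    return sum(cell(c, d) for c in selected_customers for d in selected_days)

q1 = cell("GHI", "1996-07-11")
q2 = total(["ABC", "DEF", "GHI", "KLM"], days)
```

Both query types only need random access to individual cells, which is the property the rest of the talk tries to preserve under compression.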

6 Introduction …
The reality:
• Data is compressed.
• Accessing specific data is very difficult.
• Decision support and data mining require the ability to perform ad hoc queries.
Solution:
• A "processing run" over the data (inefficient, limited, accurate).
• Quick reconstruction from compressed data (efficient, "random access", some loss of accuracy).
SVD is the technique chosen in this paper.

7 Alternative methods
• String compression
• Clustering
• Spectral methods
• SVD and SVDD

8 String Compression (lossless)
• Algorithms: Lempel-Ziv, Huffman coding, arithmetic coding.
• The whole database must be uncompressed to get the value of a single cell of the matrix.
• Works fine with a continuous stream of queries.
• Enhancement:
    • Segment the data and compress each segment independently.
    • Works when most queries follow a particular form.
• Not effective for true ad hoc querying.
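
The segmentation enhancement can be sketched as follows. This is an illustration only, assuming Python's standard `zlib` module (DEFLATE, which combines LZ77 with Huffman coding) as the compressor and a hypothetical comma-separated row encoding; the function names are not from the paper:

```python
import zlib

def compress_segments(rows, rows_per_segment):
    """Compress each group of rows independently."""
    segments = []
    for start in range(0, len(rows), rows_per_segment):
        chunk = rows[start:start + rows_per_segment]
        text = "\n".join(",".join(str(v) for v in row) for row in chunk)
        segments.append(zlib.compress(text.encode("ascii")))
    return segments

def read_cell(segments, rows_per_segment, i, j):
    """Decompress only the segment that holds row i, then pick column j."""
    text = zlib.decompress(segments[i // rows_per_segment]).decode("ascii")
    row = text.split("\n")[i % rows_per_segment]
    return int(row.split(",")[j])

# Hypothetical 100 x 8 matrix of small integers.
rows = [[(i * 7 + j) % 10 for j in range(8)] for i in range(100)]
segments = compress_segments(rows, rows_per_segment=10)
value = read_cell(segments, 10, 42, 3)
```

A single-cell query still pays for decompressing a whole segment, and a query that cuts across many segments (for example one day for all customers) touches all of them, which is why this layout only helps when queries follow the segmentation.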

9 Clustering
• Algorithm: to get the value of cell x_{i,j}, find the cluster representative for the i-th customer and return its j-th entry. In short, x_{i,j} = f(i, j).
• Widely used for grouping in information retrieval, in pattern matching, and for statistical analysis in the social and natural sciences.
• Does not scale up in our case.
• An off-the-shelf clustering method is used for the experiments.
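
The lookup x_{i,j} = f(i, j) described above can be sketched in a few lines of Python. The cluster assignment and representative vectors below are made-up placeholders standing in for the output of whatever clustering method is used:

```python
# Hypothetical output of some off-the-shelf clustering step.
assignment = [0, 0, 1, 1, 0]          # cluster id of each customer (row)
representatives = [                   # one M-length vector per cluster
    [1.0, 1.0, 0.0],
    [5.0, 5.0, 5.0],
]

def f(i, j):
    """Approximate x[i][j] by the j-th entry of row i's cluster representative."""
    return representatives[assignment[i]][j]

approx_cell = f(2, 1)
```

Only the k representative vectors and one cluster id per row need to be stored, so every row in a cluster is reconstructed identically.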

10 Spectral Methods
• Algorithms: DFT (discrete Fourier transform) and related methods (DCT, DWT).
• Widely used in signal processing.
• Comparison with SVD:
    • Spectral methods perform poorly on spikes or abrupt jumps in the input signal; SVD handles these well.
    • SVD can be applied to heterogeneous M-dimensional vectors; spectral methods cannot.
• The DCT is used for the experiments.
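
The weakness on spikes can be seen in a small experiment. The sketch below is a plain-Python orthonormal DCT-II written for illustration (not the implementation used in the paper); it keeps only the first k coefficients of a sequence and measures the worst reconstruction error. The two test signals are made up:

```python
import math

def dct(x):
    """Orthonormal DCT-II of a sequence."""
    m = len(x)
    out = []
    for k in range(m):
        scale = math.sqrt((1 if k == 0 else 2) / m)
        out.append(scale * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / m)
                               for n in range(m)))
    return out

def idct(coeffs, m):
    """Inverse of dct(); missing high-frequency coefficients count as zero."""
    return [sum(math.sqrt((1 if k == 0 else 2) / m) * c
                * math.cos(math.pi * (n + 0.5) * k / m)
                for k, c in enumerate(coeffs))
            for n in range(m)]

def max_error(x, k):
    """Largest absolute error after keeping only the first k coefficients."""
    approx = idct(dct(x)[:k], len(x))
    return max(abs(a - b) for a, b in zip(x, approx))

smooth = [math.sin(math.pi * n / 16) for n in range(16)]  # slow bump
spike = [0.0] * 16
spike[5] = 1.0                                            # single abrupt jump

smooth_err = max_error(smooth, 4)
spike_err = max_error(spike, 4)
```

With the same four coefficients, the spike is reconstructed far worse than the smooth bump, because a spike spreads its energy across all frequencies.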

11 SVD and SVDD
• SVD: singular value decomposition
• Usage:
    • Statistical analysis
    • Text retrieval
    • Pattern recognition
    • Dimensionality reduction
    • Face recognition
• Particularly useful in linear regression and matrix approximation.

12 SVD – intuition behind SVD
• In an N x M matrix X, the values x_{i,j} can be grouped together into what is called a "pattern" or "principal component".
• For M = 2 (Figure 1), x' gives the "best" axis on which to project the values.
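
For the M = 2 case, the "best" axis is the first right singular vector of X. One simple way to approximate it, swapped in here purely for illustration, is power iteration on the 2 x 2 matrix X^T X; the sample points and the function name `best_axis` are hypothetical:

```python
import math

def best_axis(points, iterations=100):
    """First right singular vector of an N x 2 matrix via power iteration on X^T X."""
    a = sum(x * x for x, _ in points)   # entries of the symmetric 2 x 2 matrix X^T X
    b = sum(x * y for x, y in points)
    d = sum(y * y for _, y in points)
    v = (1.0, 0.0)                      # assumed not orthogonal to the answer
    for _ in range(iterations):
        w = (a * v[0] + b * v[1], b * v[0] + d * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    return v

# Made-up points lying close to the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9)]
axis = best_axis(points)
```

For these points the resulting unit vector lies close to the diagonal direction, so projecting each point onto it keeps most of the information in a single coordinate.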

13 Algorithm of SVD - Theorem
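
The slide's formula did not survive in this transcript. For reference, the standard statement of the SVD theorem, written in the U, Λ, V notation of the neighbouring slides (r, the rank of X, is a symbol assumed here):

```latex
X = U \, \Lambda \, V^{T}, \qquad
U \in \mathbb{R}^{N \times r},\quad
\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_r),\quad
V \in \mathbb{R}^{M \times r},
\qquad U^{T} U = V^{T} V = I_r, \quad
\lambda_1 \ge \dots \ge \lambda_r > 0 .
```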

14 Algorithm of SVD – an example
Observation:
• U → customer-to-pattern similarity matrix.
• V → day-to-pattern similarity matrix.
• v_j → unit vectors that correspond to the directions for optimal projection of the given set of points.
• The i-th row vector of U x Λ → the coordinates of the i-th data vector ("customer").

15 Algorithm of SVD
• V and Λ are pinned in memory.
• Reconstructing a cell requires O(k) compute time, independent of N and M.
• Only one disk access is required to perform this reconstruction.
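
A sketch of the cell reconstruction these bullets refer to, assuming the usual truncated-SVD formula x̂_{i,j} = Σ_{m=1..k} λ_m · u_{i,m} · v_{j,m}. The function name and the tiny rank-1 example are illustrative only; in the setting of the slide, `u_row` would be the one row of U fetched from disk while `lambdas` and `v` stay in memory:

```python
import math

def reconstruct(u_row, lambdas, v, j):
    """Approximate x[i][j] from row i of U, the k singular values, and V (O(k) work)."""
    return sum(lam * u_im * v[j][m]
               for m, (lam, u_im) in enumerate(zip(lambdas, u_row)))

# Hand-built factors of the rank-1 matrix [[3, 4], [6, 8]] (k = 1).
lambdas = [5 * math.sqrt(5)]
u = [[1 / math.sqrt(5)], [2 / math.sqrt(5)]]
v = [[0.6], [0.8]]

x_hat = reconstruct(u[1], lambdas, v, 1)   # cell in row 1, column 1
```

The loop runs over the k retained components only, which is why the cost does not depend on N or M.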

16 Algorithm of SVDD
• SVDD: Singular Value Decomposition with Deltas.
• Maintain a set of triples of the form (row, column, delta).
• Delta is the difference between the actual value and the value that SVD reconstructs.
• Cleans up gross errors.

17 Algorithm of SVDD …
• Data structure of SVDD:
    • U
    • k_opt eigenvalues Λ
    • V
    • Additionally, store γ_{k_opt} triples of the form (row, column, delta).
• Reconstruction:
    • One disk access to fetch the i-th row of U.
    • One disk access to fetch the delta (using a hash table).
• Tradeoff: the outlier data must be stored.
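
The delta mechanism can be sketched with a Python dictionary playing the role of the hash table. This is an illustration under stated assumptions, not the paper's code: outliers are chosen here as the cells with the largest absolute reconstruction error, `budget` is a hypothetical name for the number of triples kept, and the factors are those of a clean rank-1 matrix used as a stand-in for a real SVD:

```python
import math

def estimate(i, j, u, lambdas, v):
    """Plain SVD reconstruction of x[i][j]."""
    return sum(lam * u[i][m] * v[j][m] for m, lam in enumerate(lambdas))

def build_deltas(x, u, lambdas, v, budget):
    """Keep (row, column) -> delta for the `budget` worst-reconstructed cells."""
    errors = []
    for i, row in enumerate(x):
        for j, actual in enumerate(row):
            delta = actual - estimate(i, j, u, lambdas, v)
            errors.append((abs(delta), (i, j), delta))
    errors.sort(reverse=True)
    return {cell: delta for _, cell, delta in errors[:budget]}

def svdd_lookup(i, j, u, lambdas, v, deltas):
    """SVD estimate, corrected by the stored delta when the cell is an outlier."""
    return estimate(i, j, u, lambdas, v) + deltas.get((i, j), 0.0)

# Stand-in factors for [[3, 4], [6, 8]]; the data has one outlier at (1, 0).
lambdas = [5 * math.sqrt(5)]
u = [[1 / math.sqrt(5)], [2 / math.sqrt(5)]]
v = [[0.6], [0.8]]
x = [[3.0, 4.0], [9.0, 8.0]]

deltas = build_deltas(x, u, lambdas, v, budget=1)
fixed = svdd_lookup(1, 0, u, lambdas, v, deltas)
```

Cells without a stored triple fall back to the plain SVD estimate, so the extra lookup only changes the answer for the outliers.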

18 Experiment
• Two types of queries:
    • Specific data element query
    • Aggregation query
• Two datasets:
    • Phone100k (0.2 gigabytes)
    • Stocks (341 Kbytes)
• Error measurement method: RMSPE
• Four compression methods:
    • Hierarchical clustering: (b*k*M + N*b) bytes for k clusters
    • DCT: (N*k*b) bytes for k coefficients
    • SVD: (N*k + k + k*M) bytes for k principal components
    • SVDD: (N*k + k + k*M + D*O(b)) bytes for k principal components
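
The space formulas above translate directly into a compression-ratio calculation. The sketch below assumes b is the number of bytes per stored value and, because the slide omits b from the SVD formula, multiplies the SVD count by b as an explicit assumption. The sizes are hypothetical, not those of the paper's datasets:

```python
def clustering_bytes(n, m, k, b):
    """k representative vectors of M values plus one cluster id per row."""
    return b * k * m + n * b

def dct_bytes(n, k, b):
    """k coefficients per row."""
    return n * k * b

def svd_bytes(n, m, k, b):
    """U (N x k), k singular values, V (M x k); the factor b is assumed here."""
    return (n * k + k + k * m) * b

def compression_ratio(n, m, b, compressed_bytes):
    """Size of the raw N x M matrix divided by the compressed size."""
    return (n * m * b) / compressed_bytes

n, m, b, k = 100_000, 100, 4, 10   # hypothetical sizes
ratio = compression_ratio(n, m, b, svd_bytes(n, m, k, b))
```

Because N is much larger than M, the N*k term dominates, so the ratio is roughly M / k for both SVD and DCT.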

19 Output comparison of the four methods
• SVDD did best; k_opt = k_max.
• DCT did not do well. It did better on "stocks" than on "phone2000".
• Plain SVD and clustering were close to each other.
• SVDD gives satisfactory results (10:1 CR, 2% ER; 50:1 CR, 10% ER).

20 Errors of SVD and SVDD
• The worst-case error of SVD is very large.
• SVDD bounds the error well.

21 Observation of errors
• Steep initial drop in error.
• Most matrix cells have an error substantially smaller than the mean error (RMSPE).
• SVDD gets rid of the worst-case cell errors and gives a close approximation.

22 Error for aggregate queries (SVDD)
• Normalized query error Q_err.
• 50 queries, with approximately 10% of the data cells included.
• The error was well under 0.5%, even at a 50:1 CR.
• Estimates of answers to aggregate queries can also be obtained through sampling.
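
One plausible reading of the normalized query error, assumed here rather than taken from the slide, is the relative error of the aggregate: |f(X) - f(X̂)| / |f(X)|, where f is the aggregate function. A minimal Python sketch with made-up matrices:

```python
def aggregate(matrix, rows, cols):
    """Sum of the selected cells (the aggregate f)."""
    return sum(matrix[i][j] for i in rows for j in cols)

def query_error(actual, approx, rows, cols):
    """Relative error of the aggregate computed on the approximate matrix."""
    exact = aggregate(actual, rows, cols)
    return abs(exact - aggregate(approx, rows, cols)) / abs(exact)

actual = [[10.0, 20.0], [30.0, 40.0]]
approx = [[11.0, 19.0], [29.0, 41.0]]   # each cell is off by 1

q_err = query_error(actual, approx, rows=[0, 1], cols=[0, 1])
cell_err = query_error(actual, approx, rows=[0], cols=[0])
```

In this toy case the per-cell errors cancel in the sum, which illustrates why aggregate queries can be much more accurate than single-cell queries on the same compressed data.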

23 Scale-up (SVDD)
• Error is around 2% at a 10:1 CR.
• The graphs are homogeneous.

24 Scale-up (SVD vs. SVDD)
• The error of SVD increases with dataset size.
• The error of SVDD remains constant with dataset size.

25 Conclusion
• The lossy compression problem and its solutions:
    • Signal processing
    • Pattern recognition
    • Information retrieval (clustering)
    • Matrix algebra (SVD)
• SVD algorithm
• SVDD properties:
    • Excellent compression rate and satisfactory results.
    • Bounds the worst-case error of individual data values well.
    • Requires only three passes over the dataset.
    • Dimensionality reduction of the given dataset.
    • Arbitrary vectors can be handled without additional effort.

