Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences
Authors: Flip Korn, H. V. Jagadish, Christos Faloutsos
From ACM SIGMOD 1997
Presented by Hongtao Cheng
Outline
- Introduction
- Alternative Methods
- SVD and SVDD
- Experiments
- Conclusions and Summary
Introduction - Datasets
- The data is an N x M matrix: the N rows are time sequences (e.g., customers), the M columns are time points, and x_i,j is the value of sequence i at time j.
- The matrix is huge (gigabytes).
- The number of rows far exceeds the number of columns: N >> M, with N = O(10^6) and M = O(100).
- Updates to the matrix are rare or nonexistent.
Introduction - A Sample Database
- Query 1: What was the amount of sales to GHI Inc. on July 11, 1996?
- Query 2: Find the total sales to business customers (ABC, DEF, GHI, and KLM) for the week ending July 12, 1996.
Introduction - The Reality
- The data is stored compressed, so accessing a specific value is difficult.
- Decision support and data mining require the ability to perform ad hoc queries.
- Two approaches:
  - A "processing run" over the full data: inefficient and limited, but accurate.
  - Quick reconstruction of compressed data: efficient and "random access", but with some loss of accuracy.
- SVD is the technique chosen in this paper.
Alternative Methods
- String compression
- Clustering
- Spectral methods
- SVD and SVDD
String Compression (lossless)
- Algorithms: Lempel-Ziv, Huffman coding, arithmetic coding.
- To read the value of a single cell, the whole database must be decompressed.
- Works fine for a continuous stream of sequential queries.
- Enhancement: segment the data and compress each segment independently; this helps when most queries follow a particular form.
- Not effective for truly ad hoc querying.
Clustering
- Algorithm: to get the value of cell x_i,j, find the cluster representative for the i-th customer and return its j-th entry. In short, x_i,j = f(i, j).
- Widely used in information retrieval for grouping and pattern matching, and in the social and natural sciences for statistical analysis.
- Does not scale up in our setting.
- An off-the-shelf clustering method is used in the experiments.
Spectral Methods
- Algorithms: the DFT (discrete Fourier transform) and related transforms (DCT, DWT).
- Widely used in signal processing.
- Comparison with SVD:
  - Spectral methods perform poorly on spikes or abrupt jumps in the input signal; SVD handles these well.
  - SVD can be applied to heterogeneous M-dimensional vectors; spectral methods cannot.
- The DCT is used in the experiments.
SVD and SVDD
SVD: singular value decomposition. Uses include:
- Statistical analysis
- Text retrieval
- Pattern recognition and dimensionality reduction
- Face recognition
Particularly useful in linear regression and matrix approximation.
SVD - The Intuition
- In an N x M matrix X, correlated values can be grouped together into a "pattern" or "principal component".
- For M = 2 (Figure 1 in the paper), the axis x' gives the "best" direction onto which to project the points.
Algorithm of SVD - Theorem
Algorithm of SVD - An Example
- U is the customer-to-pattern similarity matrix; its i-th row gives the coordinates of the i-th data vector ("customer") in pattern space.
- V is the day-to-pattern similarity matrix; its columns v_j are unit vectors giving the directions for optimal projection of the given set of points.
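These roles can be seen in a small numpy sketch (the data here is illustrative, not the paper's example):

```python
import numpy as np

# Toy "customer x day" matrix: 4 customers, 3 days (illustrative data only).
X = np.array([[1.0, 2.0, 1.0],
              [2.0, 4.0, 2.0],
              [3.0, 6.0, 3.0],
              [1.0, 1.0, 5.0]])

# Thin SVD: X = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rows of U: each customer's coordinates in "pattern" space.
# Rows of Vt (columns of V): the day-to-pattern directions (unit vectors).
print(U.shape, s.shape, Vt.shape)   # (4, 3) (3,) (3, 3)

# The factorization reproduces X.
assert np.allclose(U @ np.diag(s) @ Vt, X)
```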
Algorithm of SVD - Reconstruction
- V and the singular values are small and can be pinned in memory.
- Reconstructing a single cell requires O(k) compute time, independent of N and M.
- Only one disk access (to fetch the i-th row of U) is required to perform this reconstruction.
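A sketch of the O(k) single-cell reconstruction, assuming the truncated factors are already in memory (variable and function names here are mine, not the paper's):

```python
import numpy as np

def reconstruct_cell(U_row, lam, V, j, k):
    """Approximate x[i, j] from the i-th row of U (the one value fetched
    from disk), the k largest singular values lam, and the in-memory V.
    Cost is O(k), independent of N and M."""
    return sum(U_row[p] * lam[p] * V[j, p] for p in range(k))

# Illustrative data.
rng = np.random.default_rng(0)
X = rng.random((100, 10))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

k = 3
i, j = 7, 4
approx = reconstruct_cell(U[i], s, V, j, k)
rank_k = (U[:, :k] * s[:k]) @ Vt[:k]   # full rank-k approximation, for checking
assert np.isclose(approx, rank_k[i, j])
```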
Algorithm of SVDD
- SVDD: singular value decomposition with deltas.
- In addition to the SVD, maintain a set of triples of the form (row, column, delta), where delta is the difference between the actual value and the value the SVD reconstructs.
- The deltas clean up the gross errors.
Algorithm of SVDD (continued)
Data structures:
- U, the k_opt eigenvalues, and V
- Additionally, the stored triples of the form (row, column, delta)
Reconstruction of a cell:
- One disk access to fetch the i-th row of U
- One disk access to fetch the delta, if any (via a hash table)
Tradeoff: extra space to store the outlier data.
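A minimal sketch of the SVDD idea: keep the cells with the largest residuals as (row, column, delta) triples in a hash table, and apply the delta at reconstruction time. Function and parameter names are assumptions for illustration:

```python
import numpy as np

def svdd_compress(X, k, num_deltas):
    """Rank-k SVD plus (row, column, delta) triples for the worst cells."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k]
    resid = X - (Uk * sk) @ Vtk
    # Pick the num_deltas cells with the largest absolute residual.
    worst = np.argsort(np.abs(resid), axis=None)[::-1][:num_deltas]
    rows, cols = np.unravel_index(worst, X.shape)
    deltas = {(int(r), int(c)): float(resid[r, c]) for r, c in zip(rows, cols)}
    return Uk, sk, Vtk, deltas

def svdd_cell(Uk, sk, Vtk, deltas, i, j):
    """One 'disk access' for row i of U, plus a hash-table delta lookup."""
    value = float(Uk[i] @ (sk * Vtk[:, j]))
    return value + deltas.get((i, j), 0.0)
```

Cells covered by a stored triple are reconstructed exactly, which is how SVDD bounds the worst-case error.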
Experiments
Two types of queries:
- Specific data element queries
- Aggregate queries
Two datasets:
- phone100k (0.2 GB)
- stocks (341 KB)
Error measure: RMSPE (root mean squared percentage error).
Space costs of the four compression methods:
- Hierarchical clustering: (b*k*M + N*b) bytes for k clusters
- DCT: (N*k*b) bytes for k coefficients
- SVD: (N*k + k + k*M) bytes for k principal components
- SVDD: (N*k + k + k*M + D*O(b)) bytes for k principal components plus D deltas
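The space formulas above can be made concrete by plugging in sample parameters. All values here are illustrative assumptions (N = 10^6 rows, M = 100 columns, k = 10 components, b = 4 bytes per stored value, D = 10,000 deltas at roughly 3b bytes per triple), not the paper's measured numbers:

```python
# Space costs in bytes, following the formulas on the slide.
# All parameter values below are illustrative assumptions.
N, M, k, b, D = 10**6, 100, 10, 4, 10_000

clustering = b * k * M + N * b          # k representatives + a label per row
dct        = N * k * b                  # k coefficients per row
svd        = (N * k + k + k * M) * b    # U, singular values, V
svdd       = svd + D * 3 * b            # plus (row, column, delta) triples

original = N * M * b
for name, size in [("clustering", clustering), ("DCT", dct),
                   ("SVD", svd), ("SVDD", svdd)]:
    print(f"{name:10s} {size:>12,d} bytes  ({original / size:5.1f}:1)")
```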
Output Comparison of the Four Methods
- SVDD did best (with k_opt = k_max).
- The DCT did not do well; it did better on "stocks" than on "phone2000".
- Plain SVD and clustering were close to each other.
- SVDD gives satisfactory results: about 2% error at 10:1 compression, and about 10% error at 50:1 compression.
Errors of SVD and SVDD
- The worst-case error of plain SVD is very large.
- SVDD bounds the worst-case error well.
Observations on the Errors
- There is a steep initial drop in error as k grows.
- Most matrix cells have an error substantially smaller than the mean error (RMSPE).
- SVDD eliminates the worst-case cell errors and gives a close approximation.
Error for Aggregate Queries (SVDD)
- Measured by the normalized query error Qerr, over 50 queries covering approximately 10% of the data cells.
- The error was well under 0.5%, even at 50:1 compression.
- Estimates of answers to aggregate queries can also be obtained through sampling.
Scale-up (SVDD)
- The error stays around 2% at 10:1 compression as the dataset grows.
- The error graphs are homogeneous across dataset sizes.
Scale-up (SVD vs. SVDD)
- The error of plain SVD increases with dataset size.
- The error of SVDD remains constant with dataset size.
Conclusion
The lossy compression problem and candidate solutions:
- Signal processing (spectral methods)
- Pattern recognition
- Information retrieval (clustering)
- Matrix algebra (SVD)
SVDD properties:
- Excellent compression ratio with satisfactory reconstruction quality
- Bounds the worst-case error of individual data values well
- Only three passes over the dataset
- Performs dimensionality reduction of the given dataset
- Arbitrary vectors can be handled without additional effort