Constructing Optimal Wavelet Synopses Dimitris Sacharidis Timos Sellis
outline
introduction
background
–wavelet basics
–example
wavelet synopses
–example
–error metrics
–optimal synopses
–interesting issues
data streams
–models
–streaming wavelet synopses
epilogue
introduction
analyzing massive multi-dimensional datasets
–complex aggregate queries over large parts of the data
–exploratory nature
–promptness over accuracy, but with guarantees
–resort to approximate query processing over precomputed synopses (e.g., histograms, samples, wavelets)
numerous data management applications need to continuously generate, process and analyze data on-line
–the data streaming paradigm
–summarize in real time, using small space and in one pass
–provide approximate query answers with quality guarantees
provide useful data summarization
–need to measure inaccuracy; the right measure is application dependent
wavelet basics
wavelet decomposition is a mathematical tool for the hierarchical decomposition of functions
–applications in signal/image processing
used extensively as a data reduction tool in db scenarios:
–selectivity estimation for large aggregate queries
–fast approximate query answers
–general purpose streaming synopsis
features
–efficient: runs in linear time and space (vs. ~$N^2$ for histograms)
–high compression ratio, small-B property
–generalizes to multiple dimensions
example
assume a data vector d of 8 values
wavelet tree (a.k.a. error tree)
–every node contributes positively to the leaves in its left subtree and negatively to the leaves in its right subtree
iteratively perform pair-wise averaging and semi-differencing
–the intermediate averages are not needed: the overall average and the detail coefficients suffice to reconstruct d
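The pair-wise averaging and semi-differencing step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `haar_decompose` is hypothetical, and because the slide's 8-value vector is not reproduced in the text, the sample input is the 4-value vector d = {2, 10, 12, 8} from the unrestricted-synopses example.

```python
def haar_decompose(data):
    """Haar-decompose a vector whose length is a power of two.

    Returns [overall average, detail coefficients from coarsest to finest],
    i.e. the coefficients in error-tree order.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    averages = [float(x) for x in data]
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Semi-difference: half the gap between the left and the right value.
        level = [(left - right) / 2 for left, right in pairs]
        averages = [(left + right) / 2 for left, right in pairs]
        # Coarser levels go in front of the finer ones collected so far.
        details = level + details
    return averages + details


print(haar_decompose([2, 10, 12, 8]))  # [8.0, -2.0, -4.0, 2.0]
```

With this sign convention each detail coefficient is added for leaves in its left subtree and subtracted for leaves in its right subtree, e.g. 8 + (-2) + (-4) = 2 for the first value.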
wavelet synopses
any set of B coefficients constitutes a B-term wavelet synopsis
–stored as (index, value) pairs
–implicitly all non-stored coefficients are set to zero
introduces reconstruction error
–per point estimate error: $e_i = |d_i - \hat{d}_i|$, where $\hat{d}_i$ is the value reconstructed from the synopsis
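A small sketch of how a B-term synopsis yields point estimates and errors. The dictionary representation and the names `reconstruct` and `point_errors` are assumptions made for illustration; the coefficients are the Haar coefficients of d = {2, 10, 12, 8} in error-tree order (overall average first, then details from coarsest to finest).

```python
def reconstruct(synopsis, n):
    """Rebuild n values from a {index: value} synopsis.

    Coefficients missing from the synopsis are treated as zero.
    """
    coeffs = [synopsis.get(i, 0.0) for i in range(n)]
    values = [coeffs[0]]          # start from the overall average
    width = 1
    while width < n:
        details = coeffs[width:2 * width]
        expanded = []
        for value, detail in zip(values, details):
            # Left child adds the detail, right child subtracts it.
            expanded.extend((value + detail, value - detail))
        values = expanded
        width *= 2
    return values


def point_errors(data, synopsis):
    """Absolute error of each reconstructed value."""
    estimates = reconstruct(synopsis, len(data))
    return [abs(d - est) for d, est in zip(data, estimates)]


d = [2, 10, 12, 8]
full = {0: 8.0, 1: -2.0, 2: -4.0, 3: 2.0}   # all coefficients: exact
two_term = {0: 8.0, 2: -4.0}                # a 2-term synopsis
print(reconstruct(full, 4))       # [2.0, 10.0, 12.0, 8.0]
print(point_errors(d, two_term))  # [2.0, 2.0, 4.0, 0.0]
```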
measuring accuracy
use some norm to aggregate the individual errors (vector of data values d, vector of point errors e)
$L_2$ norm: $\sum_i e_i^2$ is the sum squared error (sse)
–sse = 224
$L_\infty$ norm: $\max_i e_i$ is the maximum absolute error
–max-abs-error = 10
generalized to any weighted $L_p$ norm: $\sum_i w_i e_i^p$
–e.g. max-rel-error = $\max_i (1/d_i)\, e_i$ = 10/4 = 250%
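The metrics above can be computed as in the sketch below. The function names are illustrative, the relative-error helper assumes no data value is zero, and the sample numbers are not the slide's (its data and error vectors are not reproduced in the text): they use d = {2, 10, 12, 8} approximated everywhere by its overall average 8.

```python
def sse(errors):
    """Sum squared error."""
    return sum(e * e for e in errors)


def max_abs_error(errors):
    """Maximum absolute error."""
    return max(errors)


def max_rel_error(errors, data):
    """Maximum relative error, i.e. weights 1/|d_i| (assumes d_i != 0)."""
    return max(e / abs(d) for e, d in zip(errors, data))


def weighted_lp(errors, weights, p):
    """Weighted aggregate sum_i w_i * e_i^p."""
    return sum(w * e ** p for w, e in zip(weights, errors))


d = [2, 10, 12, 8]
estimates = [8, 8, 8, 8]
e = [abs(x - y) for x, y in zip(d, estimates)]   # [6, 2, 4, 0]

print(sse(e))               # 56
print(max_abs_error(e))     # 6
print(max_rel_error(e, d))  # 3.0, i.e. 300%
```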
optimal synopses
a B-term wavelet synopsis can be optimized for any error metric
sse-optimal synopses are straightforward
–the wavelet transformation is orthonormal (after normalization), so by Parseval’s theorem the $L_2$ norm is preserved
–choose the B coefficients with the highest absolute (normalized) value
synopses optimal for other (weighted or unweighted) $L_p$ norms require superlinear (quadratic) time in N
–dynamic programming over the wavelet tree
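A minimal sketch of the sse-optimal selection, under one stated assumption about the normalization: with coefficients produced by averaging and semi-differencing, a coefficient whose subtree covers s leaves contributes c² · s to the squared $L_2$ norm, so coefficients are ranked by |c| · √s. The function names are hypothetical, and the input is the decomposition of d = {2, 10, 12, 8} in error-tree order.

```python
import math


def support(index, n):
    """Number of leaves under the coefficient at `index` (error-tree order)."""
    return n if index == 0 else n >> (index.bit_length() - 1)


def sse_optimal_synopsis(coeffs, budget):
    """Keep the `budget` coefficients with the largest normalized magnitude."""
    n = len(coeffs)
    ranked = sorted(
        range(n),
        key=lambda i: abs(coeffs[i]) * math.sqrt(support(i, n)),
        reverse=True,
    )
    return {i: coeffs[i] for i in sorted(ranked[:budget])}


def synopsis_sse(coeffs, synopsis):
    """sse of a synopsis, computed from the dropped coefficients (Parseval)."""
    n = len(coeffs)
    return sum(c * c * support(i, n)
               for i, c in enumerate(coeffs) if i not in synopsis)


coeffs = [8, -2, -4, 2]           # Haar coefficients of [2, 10, 12, 8]
best = sse_optimal_synopsis(coeffs, 2)
print(best)                        # {0: 8, 2: -4}
print(synopsis_sse(coeffs, best))  # 24
```

The value 24 agrees with reconstructing from the two kept coefficients, which gives [4, 12, 8, 8] and point errors 2, 2, 4, 0.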
interesting issues
I/O efficiency issues when dealing with massive multi-dimensional datasets [M. Jahangiri, D. Sacharidis, C. Shahabi ’05]
–during transformation try to minimize I/Os
–efficient maintenance as new data are appended (requires more than just some updating)
how about optimizing for workloads of range-sum queries?
–no known results (without using the prefix-sum array)
–ranges overlap arbitrarily, so no easy dynamic programming formulation exists
working over data streams
main challenges when data are streaming:
–stream items are only seen once
–require small working space
–process stream items quickly
–provide an answer quickly with quality guarantees
two models depending on how a data vector a is rendered
time series model
–stream elements are vector values of type (i, a[i]) and appear ordered in i (e.g., time)
turnstile model
–stream elements are updates of type (i, ±u), which imply a[i] ← a[i] ± u, and, further, do not appear ordered in i
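To make the two models concrete, the sketch below renders the same vector a from a time series stream and from a turnstile stream. It materializes a in full, which a real streaming algorithm cannot afford; the function names and the sample streams are illustrative assumptions.

```python
def render_time_series(stream, n):
    """Each element (i, value) sets a[i]; elements must arrive ordered in i."""
    a = [0] * n
    last = -1
    for i, value in stream:
        if i <= last:
            raise ValueError("time series elements must be ordered in i")
        a[i] = value
        last = i
    return a


def render_turnstile(stream, n):
    """Each element (i, delta) applies a[i] <- a[i] + delta, in any order."""
    a = [0] * n
    for i, delta in stream:
        a[i] += delta
    return a


time_series = [(0, 2), (1, 10), (2, 12), (3, 8)]
turnstile = [(2, 12), (0, 5), (3, 8), (0, -3), (1, 10)]

print(render_time_series(time_series, 4))  # [2, 10, 12, 8]
print(render_turnstile(turnstile, 4))      # [2, 10, 12, 8]
```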
streaming wavelet synopses
time series model
–each new item affects only O(log N) coefficients (those on its leaf-to-root path in the wavelet tree)
–a large number of coefficients have finalized values
–can perform bottom-up dynamic programming (but the space required is prohibitive)
–greedy techniques should be deployed instead
turnstile model
–even optimizing for the sse is hard [G. Cormode, M. Garofalakis, D. Sacharidis ’06]
–other error metrics have not been studied
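The sketch below illustrates why a single stream item touches only the coefficients on one leaf-to-root path. It is not a synopsis algorithm: it keeps all N coefficients in memory, which is exactly what a streaming method must avoid. The name `apply_update` is hypothetical, and coefficients are assumed to be unnormalized and stored in error-tree order.

```python
def apply_update(coeffs, i, u):
    """Apply the update a[i] <- a[i] + u directly on the Haar coefficients.

    Returns the indices of the coefficients that changed.
    """
    n = len(coeffs)
    coeffs[0] += u / n            # the overall average always changes
    touched = [0]
    span = n                      # leaves covered by a node at this level
    first = 1                     # index of the first detail at this level
    while span > 1:
        node = first + i // span
        # Leaves in the left half add to the detail, the right half subtracts.
        sign = 1 if (i % span) < span // 2 else -1
        coeffs[node] += sign * u / span
        touched.append(node)
        first *= 2
        span //= 2
    return touched


coeffs = [0.0] * 4
for i, u in [(0, 2), (1, 10), (2, 12), (3, 8)]:
    touched = apply_update(coeffs, i, u)

print(coeffs)   # [8.0, -2.0, -4.0, 2.0]
print(touched)  # [0, 1, 3]: log2(4) details plus the overall average
```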
wavelet synopses are a highly successful data summarization technique
yet, several problems remain open:
–optimizing for range query workloads
–greedy (time-series) streaming algorithms
–other metrics for general (turnstile) streaming data
thank you!
unrestricted wavelet synopses
the retained coefficients can assume any value, not restricted to their decomposed value (an even harder optimization problem!)
quick example: optimize for max-abs-error, d = {2, 10, 12, 8} and B = 1
–restricted synopsis: keep the overall average, 8; m.a.e. = 6
–unrestricted synopsis: keep the overall average but change its value to 7; m.a.e. = 5
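The quick example can be checked numerically. The sketch assumes, as in the example, that only the root (overall average) coefficient is kept, so every value is estimated by the same constant. The constant minimizing the maximum absolute error is then the midpoint of the smallest and largest data values. The function name is illustrative.

```python
def max_abs_error_with_root_only(data, root_value):
    """m.a.e. when every value is estimated by the single kept coefficient."""
    return max(abs(x - root_value) for x in data)


d = [2, 10, 12, 8]

restricted_value = sum(d) / len(d)          # decomposed value: the average
unrestricted_value = (min(d) + max(d)) / 2  # midpoint of min and max

print(restricted_value, max_abs_error_with_root_only(d, restricted_value))      # 8.0 6.0
print(unrestricted_value, max_abs_error_with_root_only(d, unrestricted_value))  # 7.0 5.0
```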