
1 Approximate Query Processing Using Wavelets
Kaushik Chakrabarti (Univ. of Illinois), Minos Garofalakis (Bell Labs), Rajeev Rastogi (Bell Labs), Kyuseok Shim (KAIST and AITrc)
Presented By: Charanmai Koorapati Ramesh Harika Guniganti

2 Agenda
- Introduction
- Motivation
- Prior Work
- Wavelet Decomposition
- Building Wavelet Synopses
- Processing Relational Queries
- Experimental Study
- Quality Metrics
- Query Execution Times
- Conclusion

3 Decision Support Systems
- Comparative sales figures between one week and the next
- Projected revenue figures based on new product sales assumptions
- The consequences of different decision alternatives, given past experience in a context that is described

4 Motivation
DSS users pose very complex queries to the underlying DBMS that require complex operations over gigabytes or terabytes of disk-resident data.
[Figure: an SQL query sent to the decision support system returns an exact answer, with long response times.]

5 Exact answers are NOT always required.
- The user may prefer a fast, approximate answer.
[Figure: the SQL query over the GB/TB decision support system returns an exact answer with long response times; a "transformed" query over compact data synopses (KB/MB) returns an approximate answer fast.]

6 Approximate Query Processing
A viable solution for dealing with:
- Huge amounts of data
- High query complexities
- Increasingly stringent response-time requirements

7 Prior Work
- Sampling-based techniques. Limitations: join operator on two uniform samples; non-aggregate queries.
- Histogram-based techniques. Limitations: storage overhead and construction cost needed to achieve reasonable error rates for high-dimensional data sets.

8 Wavelet-Based Techniques
A wavelet is a mathematical function used to divide a given function or continuous-time signal into different frequency components and to study each component with a resolution that matches its scale.
This paper extends the scope of earlier work, establishing the viability and effectiveness of wavelets as a generic approximate query processing tool for modern high-dimensional DSS applications.

9 Approximate Query Processing Using Wavelets
A novel approach consisting of two steps:
- Multidimensional Haar wavelets: effective, compact synopses
- Novel query processing algorithms: fast and accurate approximate query answers

10 Wavelet Decomposition/Transform
One-dimensional Haar wavelets
Data vector A = [2, 2, 5, 7]
Wavelet transform W_A = [4, -2, 0, -1]; each entry of W_A is a wavelet coefficient.

Resolution   Averages       Detail Coefficients
2            [2, 2, 5, 7]   -
1            [2, 6]         [0, -1]
0            [4]            [-2]
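
The table above can be reproduced with a short sketch of pairwise averaging and differencing. The function name `haar_decompose` is hypothetical, and the sketch assumes the input length is a power of two and that each detail coefficient is half of (first value minus second value), which is the sign convention implied by the slide's numbers.

```python
def haar_decompose(data):
    """Return the unnormalized 1-D Haar transform of `data`.

    The result lists the overall average first, followed by the detail
    coefficients ordered from the coarsest to the finest resolution.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = [float(x) for x in data]
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Detail coefficient: half the difference of each adjacent pair.
        level_details = [(a - b) / 2 for a, b in pairs]
        # Next coarser resolution: the average of each adjacent pair.
        averages = [(a + b) / 2 for a, b in pairs]
        # Coarser details go in front of the finer ones collected so far.
        details = level_details + details
    return averages + details


print(haar_decompose([2, 2, 5, 7]))
```

For the slide's data vector this yields the coefficients 4, -2, 0, -1 (as floats).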

11 Normalized Wavelet Transform
To equalize the importance of all the wavelet coefficients, we normalize the final entries of W_A by dividing each wavelet coefficient by √(2^l), where l is the level of resolution at which the coefficient appears.
Thus W_A = [4, -2, 0, -1/√2]
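
A minimal sketch of this normalization, assuming the coefficients are stored in the usual order (overall average, then details from coarsest to finest) so that the level can be read off the array index. The function name `normalize` and the index-to-level mapping are assumptions made for illustration.

```python
import math


def normalize(coeffs):
    """Divide each coefficient by sqrt(2**l), where l is its resolution level."""
    normalized = []
    for i, c in enumerate(coeffs):
        # Indices 0 and 1 sit at level 0, indices 2..3 at level 1,
        # indices 4..7 at level 2, and so on.
        level = max(i.bit_length() - 1, 0)
        normalized.append(c / math.sqrt(2 ** level))
    return normalized


print(normalize([4, -2, 0, -1]))
```

The first two coefficients are left unchanged (level 0), while the two finest-level coefficients are divided by √2, matching the slide's result.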

12 Multidimensional Haar Wavelets
Standard decomposition: first, fix an ordering for the data dimensions (say 1, 2, ..., d), and then apply the complete one-dimensional wavelet transform to each one-dimensional "row" of array cells along dimension k, for all k = 1, 2, ..., d.
Nonstandard decomposition: given an ordering for the data dimensions (1, 2, ..., d), perform one step of pairwise averaging and differencing for each one-dimensional row of array cells along dimension k, for each k = 1, ..., d. This process is repeated recursively only on the quadrant containing averages across all dimensions.
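
The nonstandard decomposition can be sketched for the two-dimensional case as follows. This is an illustration under stated assumptions, not the authors' implementation: the grid is square with a power-of-two side, rows are processed before columns, details are half-differences, and the names `pairwise_step` and `nonstandard_decompose` are hypothetical.

```python
def pairwise_step(row):
    """One averaging/differencing step: averages first, then details."""
    avgs = [(row[i] + row[i + 1]) / 2 for i in range(0, len(row), 2)]
    dets = [(row[i] - row[i + 1]) / 2 for i in range(0, len(row), 2)]
    return avgs + dets


def nonstandard_decompose(grid):
    """Nonstandard Haar decomposition of a square 2^m x 2^m grid."""
    out = [[float(v) for v in row] for row in grid]
    size = len(out)
    while size > 1:
        # One step along the first dimension: each row of the active quadrant.
        for r in range(size):
            out[r][:size] = pairwise_step(out[r][:size])
        # One step along the second dimension: each column of that quadrant.
        for c in range(size):
            col = pairwise_step([out[r][c] for r in range(size)])
            for r in range(size):
                out[r][c] = col[r]
        # Recurse only on the quadrant that holds averages in all dimensions.
        size //= 2
    return out


print(nonstandard_decompose([[1, 3], [5, 7]]))
```

The top-left entry of the result is the overall average of the grid; every other entry is a detail coefficient.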

13 Nonstandard Decomposition

14 Example Decomposition of a 4×4 Array

15 Multidimensional Haar Coefficients - Semantics and Representation

16 Support Regions and Signs for 16 Nonstandard 2-Dimensional Haar Basis Functions

17 A Haar wavelet coefficient can be represented by the triple W = <R, S, v>, where:
1) W.R is the d-dimensional support hyper-rectangle of W. Along each dimension j, 1 <= j <= d, it has a low boundary value W.R.bound[j].lo and a high boundary value W.R.bound[j].hi.
Coefficient W contributes to each data cell A[i_1, ..., i_d] satisfying the condition W.R.bound[j].lo <= i_j <= W.R.bound[j].hi for all dimensions j, 1 <= j <= d.

18 2) W.S stores the sign information for all d-dimensional quadrants of W.R. The two elements of the sign vector of coefficient W along dimension j are denoted by W.S.sign[j].lo and W.S.sign[j].hi, corresponding to the lower and upper half of W.R's extent along dimension j. The sign for a quadrant is computed as the product of the d sign entries that map to that quadrant.
3) W.v is the (scalar) magnitude of coefficient W. This is exactly the quantity that W contributes (with the appropriate sign) to all data array cells enclosed in W.R.
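
The triple representation can be sketched as a small data structure together with the rule for a coefficient's contribution to one cell. The class and field names (`Coefficient`, `bounds`, `signs`, `value`) are hypothetical stand-ins for W.R, W.S and W.v, and splitting each extent at its midpoint is an assumption that holds for Haar supports of even length.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Coefficient:
    bounds: List[Tuple[int, int]]  # W.R: inclusive (lo, hi) per dimension
    signs: List[Tuple[int, int]]   # W.S: (lower-half sign, upper-half sign)
    value: float                   # W.v: magnitude of the coefficient


def contribution(w, cell):
    """Signed amount that coefficient `w` adds to data cell `cell`."""
    sign = 1
    for (lo, hi), (s_lo, s_hi), i in zip(w.bounds, w.signs, cell):
        if not lo <= i <= hi:
            return 0.0  # the cell lies outside the support hyper-rectangle
        mid = (lo + hi + 1) // 2  # first index of the upper half
        sign *= s_lo if i < mid else s_hi
    return sign * w.value


# The four coefficients of the 1-D example A = [2, 2, 5, 7].
coeffs = [
    Coefficient([(0, 3)], [(1, 1)], 4),    # overall average
    Coefficient([(0, 3)], [(1, -1)], -2),  # coarsest detail
    Coefficient([(0, 1)], [(1, -1)], 0),   # detail for cells 0..1
    Coefficient([(2, 3)], [(1, -1)], -1),  # detail for cells 2..3
]
reconstructed = [sum(contribution(w, (i,)) for w in coeffs) for i in range(4)]
print(reconstructed)
```

Summing the contributions of all coefficients for each cell recovers the original data vector, which is a quick check that the sign and support conventions are consistent.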

19 Building Wavelet-Coefficient Synopses
Capturing the d-dimensional array A_R (joint frequency distribution) from relational table R ("set of tuples" ROLAP representation).
[Figure: a relation with attributes Attr1 and Attr2 in ROLAP representation, next to its joint data distribution array indexed 0 to 3 along each attribute.]
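
A minimal sketch of building the joint frequency distribution array from a set of tuples, assuming every attribute takes integer values in the range 0 to its domain size minus one. The function name `joint_frequency_array` and the sample relation are hypothetical.

```python
from collections import Counter


def joint_frequency_array(tuples, domain_sizes):
    """Build the d-dimensional joint frequency array as nested lists.

    Cell [i_1][i_2]...[i_d] holds the number of tuples whose attribute
    values are exactly (i_1, i_2, ..., i_d).
    """
    counts = Counter(tuples)

    def build(prefix, dims):
        if not dims:
            return counts.get(prefix, 0)
        return [build(prefix + (v,), dims[1:]) for v in range(dims[0])]

    return build((), tuple(domain_sizes))


# Hypothetical relation with two attributes, each with domain {0, 1, 2, 3}.
relation = [(0, 1), (0, 1), (2, 3), (1, 0)]
array = joint_frequency_array(relation, (4, 4))
print(array)
```

The wavelet decomposition is then applied to this array rather than to the tuples themselves.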

20 What is the size of the wavelet-coefficient synopsis?

21 Processing Relational Queries in the Wavelet-Coefficient Domain
Reduce relations into compact wavelet-coefficient synopses.
- Compressed domain (FAST): query the wavelet synopses in the wavelet domain to obtain query results in the wavelet domain, then render the final approximate results.
- Relation domain (SLOW): render approximate relations from the wavelet synopses, then query them in the relation domain to obtain the final approximate results.

22 Wavelet Query Processing
Each operator (e.g., select, project, join, aggregates):
- input: set of wavelet coefficients
- output: set of wavelet coefficients
Finally, the rendering step:
- input: set of wavelet coefficients
- output: (multi)set of tuples
[Figure: select, project, and join operators chained together, each passing a set of coefficients to the next, followed by a render step.]

23 Quick Review of Notations

24 Selection Operator (select)

25 Selection -- Relational Domain
In the relational domain, we are interested in only those cells inside the query range. In the wavelet domain, we are interested in only the coefficients that contribute to those cells.
[Figure: joint data distribution array over dimensions D1 and D2 with the query range marked, alongside the relation.]
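
The idea of keeping only the coefficients that contribute to cells in the query range can be sketched as an overlap test on support regions. This is a simplified illustration: the dictionary layout, the names `overlaps` and `select`, and the sample coefficients are hypothetical, and the sketch only filters coefficients without adjusting the boundaries or signs of the ones it keeps.

```python
def overlaps(bounds, query):
    """True if the support region intersects the query range in every dimension."""
    return all(lo <= q_hi and q_lo <= hi
               for (lo, hi), (q_lo, q_hi) in zip(bounds, query))


def select(coefficients, query):
    """Keep only the coefficients whose support overlaps the query range."""
    return [w for w in coefficients if overlaps(w["bounds"], query)]


# Hypothetical coefficients over a 4x4 array; bounds are inclusive (lo, hi).
coefficients = [
    {"name": "w0", "bounds": [(0, 3), (0, 3)]},
    {"name": "w1", "bounds": [(0, 1), (0, 1)]},
    {"name": "w2", "bounds": [(2, 3), (0, 1)]},
    {"name": "w3", "bounds": [(2, 3), (2, 3)]},
]
query = [(2, 3), (0, 1)]  # range on D1, then range on D2
kept = [w["name"] for w in select(coefficients, query)]
print(kept)
```

Only the coefficient covering the whole array and the one covering the queried quadrant survive; the other two cannot contribute to any cell in the range.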

26 Approximate Query Execution Engine Process for Select

27 Selection -- Wavelet Domain
[Figure: support regions and signs (+/-) of wavelet coefficients over dimensions D1 and D2, with the query range marked.]

28 Projection Operator (project)

29 Projection - Wavelet Domain

30 Join Operator (join)

31 Equi-Join -- Relational Domain
Relational domain: join count = 7*3 = (A1-A3)*(B2+B3)
Wavelet domain: A1*B2 + A1*B3 - A3*B2 - A3*B3
Consider all pairs of coefficients: (1) check joinability (overlap in the join dimension(s)), (2) compute output coefficients.
Coefficients A1 (+) and A3 (-) contribute to the joined cell of Relation 1; coefficients B2 (+) and B3 (+) contribute to the joined cell of Relation 2.
[Figure: joint data distributions of Relation 1 (dimensions D1, D2) and Relation 2 (dimensions D1, D3), joined along D1.]
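
The identity behind the join can be checked numerically: multiplying the two cell values gives the same result as summing the products of all pairs of contributing coefficients. The coefficient magnitudes below are hypothetical, chosen only so that the cells evaluate to the 7 and 3 shown on the slide.

```python
from itertools import product

# (sign, magnitude) of each coefficient contributing to the joined cell.
rel1 = [(+1, 10.0), (-1, 3.0)]  # A1 (+) and A3 (-): cell value 10 - 3 = 7
rel2 = [(+1, 1.0), (+1, 2.0)]   # B2 (+) and B3 (+): cell value 1 + 2 = 3


def cell_value(contribs):
    """Value of a cell as the signed sum of its contributing coefficients."""
    return sum(sign * v for sign, v in contribs)


def join_count_from_pairs(left, right):
    """Sum the product of every pair of signed coefficient contributions."""
    return sum((s1 * v1) * (s2 * v2)
               for (s1, v1), (s2, v2) in product(left, right))


relational = cell_value(rel1) * cell_value(rel2)
wavelet = join_count_from_pairs(rel1, rel2)
print(relational, wavelet)
```

Both computations give 21, which is why the join can be carried out pair by pair on coefficients without first reconstructing the cells.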

32

33

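34 Equi-Join -- Wavelet Domain
Join output coefficient: v = v1 * v2
[Figure: two input coefficients with values v1 and v2, one over dimensions D1 and D2 and the other over D1 and D3, combined into a join output coefficient over D1, D2, and D3.]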

35 Experimental Study
- Improved answer quality
- Low synopsis construction costs
- Fast query execution

36 Error Metrics for Set-Valued Query Answers
We need an error metric for (multi)sets that accounts for both:
- differences in element frequencies
- differences in element values
Proposed solutions:
- MAC (Match-And-Compare) Error [IP99]: based on perfect bipartite graph matching
- EMD (Earth Mover's Distance) Error [CGR00, RTG98]: based on bipartite network flows

37 Query Execution Times

38 SELECT-JOIN-SUM Query Errors on Real-Life Data

39 SELECT Query Errors on Real-Life Data

40 SELECT-SUM Query Errors on Real-Life Data

41 Conclusion
- Multidimensional wavelets are an effective tool for general-purpose approximate query processing in modern, high-dimensional applications.
- The query processing algorithms operate directly on the wavelet-coefficient synopses of relational data, allowing very fast processing of arbitrarily complex queries entirely in the wavelet-coefficient domain.
- An extensive experimental study with synthetic as well as real-life data sets verifies the effectiveness of the wavelet-based approach compared to both sampling and histograms.

42 Questions??? THANK YOU

