Progressive Approximate Aggregate Queries with a Multi-Resolution Tree Structure Iosif Lazaridis, Sharad Mehrotra University of California, Irvine SIGMOD 2001, Santa Barbara

Talk Outline
- Aggregate Queries
- Motivation for Approximate Answering
- Multi-Resolution Aggregate Tree (MRA-Tree)
- Progressive Algorithm with Error Bounds
- Experimental Evaluation
- Summary and Future Work

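Aggregate Queries
[Figure: data points with values 9, 6, 3, 8, 2, 7 in space S; the query region Q covers the points with values 2, 7 and 6]
- min_Q = 2
- max_Q = 7
- count_Q = 3
- sum_Q = 2 + 7 + 6 = 15
- avg_Q = 15/3 = 5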
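As a minimal sketch of the exact (linear scan) evaluation behind this slide, the Python below recomputes the five aggregates. The point values come from the slide; the coordinates, the `Point` and `Rect` types, and the function name `exact_aggregates` are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: float
    y: float
    value: float


@dataclass(frozen=True)
class Rect:
    x_lo: float
    y_lo: float
    x_hi: float
    y_hi: float

    def contains_point(self, p: Point) -> bool:
        return self.x_lo <= p.x <= self.x_hi and self.y_lo <= p.y <= self.y_hi


def exact_aggregates(points, query):
    """Linear scan: test every point against Q and aggregate the matches."""
    values = [p.value for p in points if query.contains_point(p)]
    if not values:
        return None
    total = sum(values)
    return {
        "min": min(values),
        "max": max(values),
        "count": len(values),
        "sum": total,
        "avg": total / len(values),
    }


# Values are from the slide; the coordinates are hypothetical.
points = [
    Point(1, 5, 9), Point(4, 4, 6), Point(8, 1, 3),
    Point(9, 6, 8), Point(3, 2, 2), Point(5, 3, 7),
]
query = Rect(2, 1, 6, 5)  # hypothetical Q covering the points valued 6, 2, 7
result = exact_aggregates(points, query)
```

The cost of this scan grows with the size of the dataset, which is the motivation for the approximate technique in the rest of the talk.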

Evaluating Aggregate Queries
- Exact answering
  - Scan all points of D, checking each against Q
  - Retrieve the points in Q via a multi-dimensional index on D
- Both the linear scan and the index scan can be very expensive
- Approximate answering
  - Many applications (selectivity estimation, data analysis, visualization) do not require exact answers

Motivating Examples
- "My boss needs to see the income aggregates in 10 minutes!"
- "How many tanks 10 miles from me?"

Techniques for Approximate Aggregate Queries
- Online estimation (interactive)
  - Sampling
- Offline estimation (data synopsis)
  - Sampling, histograms, wavelets
- Our technique:
  - Online estimator via a scan of a modified multi-dimensional index (MRA-Tree)
  - Allows incremental tradeoff of accuracy for response time, with guaranteed error bounds

Multi-Resolution Aggregate Tree (MRA-Tree)
- An MRA-Tree can be instantiated with any of the popular multi-dimensional index trees (R-Tree, quadtree, Hybrid tree, etc.)
- A non-leaf node contains, for each of its subtrees, four aggregates {MIN, MAX, COUNT, SUM}
- A leaf node contains the actual data points
- Tree operations are identical to those of the plain (non-MRA) tree, except that the aggregates must be maintained
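The node layout described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `Agg`, `Node` and `summarize` are hypothetical, the spatial part of the index (bounding regions, splitting) is left out, and aggregates are computed bottom-up in one pass, whereas a real index would update them along the insertion path.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Agg:
    """The four aggregates stored per subtree."""
    min_v: float = math.inf
    max_v: float = -math.inf
    count: int = 0
    total: float = 0.0

    def add(self, v: float) -> None:
        self.min_v = min(self.min_v, v)
        self.max_v = max(self.max_v, v)
        self.count += 1
        self.total += v

    def merge(self, other: "Agg") -> None:
        self.min_v = min(self.min_v, other.min_v)
        self.max_v = max(self.max_v, other.max_v)
        self.count += other.count
        self.total += other.total


@dataclass
class Node:
    values: List[float] = field(default_factory=list)      # leaf: data values
    children: List["Node"] = field(default_factory=list)   # non-leaf: subtrees
    child_aggs: List[Agg] = field(default_factory=list)    # non-leaf: one Agg per subtree


def summarize(node: Node) -> Agg:
    """Fill in child_aggs bottom-up and return the aggregate of the whole subtree."""
    agg = Agg()
    for v in node.values:
        agg.add(v)
    node.child_aggs = [summarize(c) for c in node.children]
    for a in node.child_aggs:
        agg.merge(a)
    return agg


# Hypothetical data: two leaves under one root.
root = Node(children=[Node(values=[2, 7, 6]), Node(values=[9, 3, 8])])
root_agg = summarize(root)
```

After `summarize(root)`, the root holds one `Agg` per child, so a query that fully contains a child can use that child's aggregates without descending into it.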

MRA-Tree Example
[Figure: an example MRA-tree; each non-leaf node entry stores (min, max, count, sum) for its subtree, and the leaf nodes hold the data values]

Progressive Algorithm Outline
- We want
  - The best answer for a given time
  - The shortest time for a given precision of the answer
  - To refine an answer at will, trading time for precision
- How we achieve it
  - Do a prioritized traversal of the nodes of the MRA-tree
  - Maintain an estimate of the answer, E(agg_Q)
  - Maintain a 100% confidence interval I = [L, H], such that L ≤ agg_Q ≤ H

Generic Algorithm (1)
[Figure: the four possible relationships between a query Q and a node N: Q and N are disjoint; Q contains N; Q is contained in N; Q partially overlaps N]
- Two sets of nodes:
  - NP (partial contribution to the query)
  - NC (complete contribution)
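For axis-aligned rectangles, the four cases on this slide can be told apart with a few comparisons. The sketch below assumes 2D rectangles and closed boundaries; the names `Rect`, `Relation` and `classify` are hypothetical.

```python
from enum import Enum
from typing import NamedTuple


class Rect(NamedTuple):
    x_lo: float
    y_lo: float
    x_hi: float
    y_hi: float


class Relation(Enum):
    DISJOINT = "disjoint"            # N contributes nothing
    Q_CONTAINS_N = "Q contains N"    # N contributes completely (goes to NC)
    Q_INSIDE_N = "Q is contained"    # N contributes partially (goes to NP)
    PARTIAL = "partial overlap"      # N contributes partially (goes to NP)


def contains(outer: Rect, inner: Rect) -> bool:
    return (outer.x_lo <= inner.x_lo and inner.x_hi <= outer.x_hi
            and outer.y_lo <= inner.y_lo and inner.y_hi <= outer.y_hi)


def disjoint(a: Rect, b: Rect) -> bool:
    return a.x_hi < b.x_lo or b.x_hi < a.x_lo or a.y_hi < b.y_lo or b.y_hi < a.y_lo


def classify(q: Rect, n: Rect) -> Relation:
    if disjoint(q, n):
        return Relation.DISJOINT
    if contains(q, n):  # checked first, so Q == N counts as a complete contribution
        return Relation.Q_CONTAINS_N
    if contains(n, q):
        return Relation.Q_INSIDE_N
    return Relation.PARTIAL
```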

Generic Algorithm (2)
- Initialize NP with the root
- At each iteration: remove one node N from NP and, for each child N_child of N:
  - discard it, if N_child is disjoint from Q
  - insert it into NP, if Q is contained in or partially overlaps N_child
  - “insert” it into NC, if Q contains N_child (we only need to maintain agg_NC)
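A minimal sketch of this loop, specialized to COUNT, might look like the following. Several things here are assumptions rather than details from the slides: rectangles are plain tuples, leaf points are tested individually against Q when their leaf is expanded, the next node is simply popped from the end of NP (no prioritization), and the tree and query are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x_lo, y_lo, x_hi, y_hi)


def disjoint(a: Rect, b: Rect) -> bool:
    return a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]


def contains(outer: Rect, inner: Rect) -> bool:
    return (outer[0] <= inner[0] and inner[2] <= outer[2]
            and outer[1] <= inner[1] and inner[3] <= outer[3])


@dataclass
class Node:
    rect: Rect
    count: int  # COUNT aggregate of the subtree
    children: List["Node"] = field(default_factory=list)
    points: List[Tuple[float, float]] = field(default_factory=list)  # leaf only


def progressive_count(q: Rect, root: Node):
    """Yield the interval (L, H) for count_Q, before and after each iteration."""
    np_nodes = [root]  # NP: nodes that may contribute partially
    count_nc = 0       # agg_NC: all we keep of the NC set
    yield count_nc, count_nc + sum(n.count for n in np_nodes)
    while np_nodes:
        node = np_nodes.pop()  # assumption: no traversal policy applied here
        for x, y in node.points:  # leaf: resolve each point exactly
            if contains(q, (x, y, x, y)):
                count_nc += 1
        for child in node.children:
            if disjoint(q, child.rect):
                continue                   # discard
            if contains(q, child.rect):
                count_nc += child.count    # "insert" into NC
            else:
                np_nodes.append(child)     # insert into NP
        yield count_nc, count_nc + sum(n.count for n in np_nodes)


# Hypothetical tree: a root with three leaf children.
a = Node((0, 0, 4, 8), 3, points=[(1, 1), (2, 6), (3, 3)])
b = Node((4, 0, 8, 3), 2, points=[(5, 2), (7, 1)])
c = Node((4, 5, 8, 8), 1, points=[(6, 6)])
root = Node((0, 0, 8, 8), 6, children=[a, b, c])
q = (2, 0, 8, 4)
bounds = list(progressive_count(q, root))
```

In this example the interval narrows from (0, 6) to (2, 5) after the root is expanded, because child `b` is counted whole and child `c` is discarded, and it closes at (3, 3) once leaf `a` is read. Stopping the generator early gives a coarser but still valid interval.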

Generic Algorithm (3)
[Figure legend: node in NP, node in NC]
- To instantiate the algorithm for {MIN, MAX, COUNT, SUM, AVG}:
  - Error bounds
    - Interval I = [L, H] such that L ≤ agg_Q ≤ H
  - Traversal policy
    - Which node from NP to explore next? Minimize |I|
  - Estimation
    - Provide an estimate of the answer: E(agg_Q)

MIN (and MAX)
[Figure: two NP nodes with minima 3 and 9, and two NC nodes with minima 4 and 5]
- Interval
  - min_NC = min{4, 5} = 4
  - min_NP = min{3, 9} = 3
  - L = min{min_NC, min_NP} = 3
  - H = min_NC = 4
  - hence, I = [3, 4]
- Estimate
  - Lower bound: E(min_Q) = L = 3
- Traversal
  - Choose N ∈ NP such that min_N = min_NP
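The MIN rules above translate directly into a few lines of Python. This is a sketch with hypothetical function names; treating an empty NC as H = infinity is an assumption the slide does not cover.

```python
import math


def min_bounds(mins_nc, mins_np):
    """Return (L, H) for min_Q from the per-node minima in NC and NP."""
    min_nc = min(mins_nc, default=math.inf)
    min_np = min(mins_np, default=math.inf)
    low = min(min_nc, min_np)  # no point in Q can be smaller than this
    high = min_nc              # points in NC nodes are certainly in Q
    return low, high


def next_node_for_min(mins_np):
    """Traversal policy: index of the NP node whose minimum equals min_NP."""
    return min(range(len(mins_np)), key=lambda i: mins_np[i])


low, high = min_bounds(mins_nc=[4, 5], mins_np=[3, 9])
estimate = low  # the slide uses the lower bound as the estimate
```

Expanding the node with minimum 3 is the only step that can raise L, which is why the traversal picks it. MAX follows by symmetry, swapping the roles of the bounds.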

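COUNT (and SUM)
[Figure: two NP nodes, with count 10 (20% overlap with Q) and count 8 (25% overlap with Q), and two NC nodes with counts 6 and 9]
- Interval
  - count_NC = 9 + 6 = 15
  - count_NP = 8 + 10 = 18
  - L = count_NC = 15
  - H = count_NC + count_NP = 33
  - hence, I = [15, 33]
- Estimate
  - E(count_Q) = L + 0.25 × 8 + 0.2 × 10 = 19
- Traversal
  - Choose N ∈ NP such that count_N ≥ count_M, ∀ M ∈ NP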
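The same numbers can be reproduced with a short sketch. The function names are hypothetical, and the overlap fractions are taken as given inputs; reading them as the share of each node's region that falls inside Q is an assumption based on the percentages in the figure.

```python
def count_bounds(counts_nc, counts_np):
    """Return (L, H) for count_Q."""
    low = sum(counts_nc)             # NC nodes lie entirely inside Q
    high = low + sum(counts_np)      # NP nodes contribute at most all their points
    return low, high


def count_estimate(counts_nc, np_nodes):
    """np_nodes is a list of (count, overlap_fraction) pairs."""
    return sum(counts_nc) + sum(count * frac for count, frac in np_nodes)


def next_node_for_count(counts_np):
    """Traversal policy: index of the NP node with the largest count."""
    return max(range(len(counts_np)), key=lambda i: counts_np[i])


np_nodes = [(8, 0.25), (10, 0.20)]
low, high = count_bounds([9, 6], [c for c, _ in np_nodes])
estimate = count_estimate([9, 6], np_nodes)
```

The node with the largest count contributes most to the width H - L, so expanding it first shrinks the interval fastest. SUM works the same way with per-node sums in place of counts, assuming non-negative values.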

AVG
[Figure: node A partially overlaps Q (in NP); node B is contained in Q (in NC)]

Node | min | max | count | sum
A    | 5   | 10  | 5     | 35
B    | –   | –   | 10    | 55

- Interval
  - Current avg_NC = 55/10 = 5.5
  - Distribution of values in A: {5, 5, 5, 10, 10}
  - Maximum possible: (55 + 2 × 10) / (10 + 2) = 6.25
  - Minimum possible: (55 + 3 × 5) / (10 + 3) = 5.38
  - hence, I = [5.38, 6.25]
- Estimate
  - E(avg_Q) = E(sum_Q) / E(count_Q)
- Traversal
  - max count_N
  - max{(max_N − avg_NC), (avg_NC − min_N)}
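The interval arithmetic in this example can be checked with the sketch below. It is a simplification that only follows the worked example: it handles a single NP node, assumes every value in that node equals either its minimum or its maximum (the {5, 5, 5, 10, 10} distribution), and requires a non-empty NC and max > min. The function name and parameters are hypothetical, and this is not claimed to be the general bound from the paper.

```python
def avg_interval(sum_nc, count_nc, node_min, node_max, node_count, node_sum):
    """Return (L, H) for avg_Q given agg_NC and one partially overlapping node."""
    # Two-valued distribution: n_lo values equal node_min, n_hi equal node_max,
    # with n_lo + n_hi = node_count and n_lo*node_min + n_hi*node_max = node_sum.
    n_hi = (node_sum - node_count * node_min) / (node_max - node_min)
    n_lo = node_count - n_hi
    avg_nc = sum_nc / count_nc
    # Highest average: only the large values fall inside Q.
    high = avg_nc
    if node_max > avg_nc:
        high = (sum_nc + n_hi * node_max) / (count_nc + n_hi)
    # Lowest average: only the small values fall inside Q.
    low = avg_nc
    if node_min < avg_nc:
        low = (sum_nc + n_lo * node_min) / (count_nc + n_lo)
    return low, high


def avg_traversal_score(node_min, node_max, avg_nc):
    """Second traversal heuristic on the slide: largest deviation from avg_NC."""
    return max(node_max - avg_nc, avg_nc - node_min)


# Node A (in NP) and node B (in NC) from the table.
low, high = avg_interval(sum_nc=55, count_nc=10,
                         node_min=5, node_max=10, node_count=5, node_sum=35)
score = avg_traversal_score(node_min=5, node_max=10, avg_nc=55 / 10)
```

With the table's values the function gives n_hi = 2 and n_lo = 3, matching the two tens and three fives used in the maximum and minimum calculations.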

Experiments
- Synthetic datasets, 2-4D
- Real datasets: 2D spatial (USGS) and 4D (UCI KDD Forest Cover)
- MRA-quadtree and MRA-Rtree indices
- We study
  - MRA-tree vs. “plain” tree
  - MRA-tree vs. online sampling
  - Accuracy of estimation
  - Scalability with database size

MRA-Quadtree (Nodes Visited)

MRA-Quadtree (Error Reduction) Absolute Relative Error =

MRA-Rtree (2D, USGS) I/O Performance DB Size =

Estimation vs. Maximum Error (4D, Forest Cover, sel. 16% / axis)

MRA-Rtree vs. Online Sampling Estimation Accuracy (4D, Forest Cover)

Database Size (3D Synthetic, exact, 10% spatial sel.)

Summary
- The MRA-Tree is a modified multi-dimensional index for approximate answering of aggregate queries
- For exact answers
  - Faster than the “plain” index
- Advantages over offline estimators
  - Progressively improving answers
  - Error bounds
- Advantages over sampling
  - Better estimate for the same I/O
- The algorithm scales gracefully with database size

Future Work (QUASAR Project, UC Irvine)
- Scalability with high dimensionality, by using a dedicated high-D index structure
- Scalability in high-update-rate environments
- Approximate query processing of general SQL queries using dedicated data structures, similar to the MRA-tree
