Online Aggregation 2011-4-15 Liu Long


2 Online Aggregation 2011-4-15 Liu Long

3 Aggregation Operations related to aggregating data in DBMS –AVG –SUM –COUNT

4 An Example Select AVG(grade) from ENROLL; A “fancy” interface: + Query Results AVG 3.262574342

5 A Better Approach Don’t process in batch! Online aggregation: represent the final result

6 Content Question & Motivation The Approach –Goals –Implementation –Evaluation Summary & Future Work

7 Content Question & Motivation The Approach –Goals –Implementation –Evaluation Summary & Future Work

8 Ideal Approach Select AVG(grade) from ENROLL GROUP BY major;

9 Goals & Requirements Continuous output –Non-blocking query plans Time/precision control Fairness/partiality

10 Content Question & Motivation The Approach –Goals –Implementation –Evaluation Summary & Future Work A Naïve Approach; Modify the database engine

11 A Naïve Approach SELECT running_avg(grade), running_confidence(grade), running_interval(grade) FROM ENROLL;

12 A Naïve Approach Drawbacks: –No grouping –No guarantee of continuous output –No guarantee of fairness (or control over partiality)
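
As a rough illustration of what a running aggregate such as running_avg computes, here is a minimal Python sketch. The generator name and the sample grades are assumptions made for illustration; this is not the presenter's implementation.

```python
def running_avg(values):
    """Yield (tuples seen, running mean) after every tuple.

    Hypothetical helper: it emits an estimate per tuple instead of
    blocking until the whole input has been consumed.
    """
    total = 0.0
    for n, v in enumerate(values, start=1):
        total += v
        yield n, total / n


# Made-up grades, purely for illustration.
grades = [4.0, 3.0, 2.0, 3.0]
estimates = list(running_avg(grades))
```

The estimate is only meaningful as a preview of the final answer if the tuples arrive in an order uncorrelated with the aggregated attribute, which is why the naïve approach gives no guarantees by itself.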

13 1. Random Access to Data Heap Scan –OK if clustering is uncorrelated to agg & grouping attrs Index Scan –can scan an index on attrs uncorrelated to agg or grouping Sampling –could introduce new sampling access methods (e.g. Olken’s work, ’93)

14 User Interface

15 2. Group By & Distinct Fair, Non-Blocking Group By/Distinct –Can’t sort! Sorting blocks, and sorting is unfair –Must use hash-based techniques –Hybrid hashing (84)! Especially for duplicate elimination. –“Hybrid Cache” (96) even better. Memoization, sorting
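
A minimal sketch of why hashing keeps Group By non-blocking: every incoming tuple updates its group's running state in a hash table, so an estimate for every group seen so far is available at any moment. The function name and the sample rows are assumptions for illustration; this is plain in-memory hashing, not hybrid hashing or the Hybrid Cache.

```python
from collections import defaultdict


def online_group_avg(rows):
    """rows: iterable of (group, value) pairs.

    Yields a snapshot {group: running average} after each row,
    so no sort (and no blocking) is needed before output starts.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, value in rows:
        sums[group] += value
        counts[group] += 1
        yield {g: sums[g] / counts[g] for g in sums}


# Made-up rows, purely for illustration.
rows = [("CS", 4.0), ("Math", 2.0), ("CS", 3.0)]
snapshots = list(online_group_avg(rows))
```

A sort-based plan would emit nothing until the sort finishes and would then finish one group completely before starting the next, which is the blocking and unfairness the slide refers to.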

16 User Interface

17 3. Index Striding For fair Group By: –want random tuple from Group 1, random tuple from Group 2, ... –Solution: one access method opens many cursors in index, one per group. Fetch round-robin. –Can control speed by weighting the schedule
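
The round-robin fetch with a weighted schedule can be sketched as follows. The function name, the integer weights (tuples fetched per round), and the treatment of a non-positive weight as "stop this group" are all assumptions for illustration, with Python iterators standing in for index cursors.

```python
def index_stride(cursors, weights=None):
    """cursors: dict mapping group -> iterable of that group's tuples.
    weights: optional dict mapping group -> tuples to fetch per round
             (default 1). A weight <= 0 drops the group entirely.

    Yields (group, tuple) pairs, visiting the groups round-robin.
    """
    weights = weights or {}
    active = {g: iter(c) for g, c in cursors.items()}
    while active:
        for g in list(active):          # copy: we delete while looping
            w = weights.get(g, 1)
            if w <= 0:
                del active[g]           # assumed "stopped" group
                continue
            for _ in range(w):
                try:
                    yield g, next(active[g])
                except StopIteration:
                    del active[g]       # this group's cursor is exhausted
                    break


# Made-up cursors, purely for illustration.
cursors = {"A": [1, 2, 3], "B": [10, 20]}
fair = list(index_stride(cursors))
weighted = list(index_stride(cursors, {"A": 2}))
```

With equal weights every group advances at the same rate regardless of its size; raising a group's weight makes its estimate converge faster at the expense of the others.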

18 User Interface

19 4. Join Algorithms Non-Blocking Joins

20 5. Query Optimization Entirely avoid sorting in an online aggregation system Extend “Interesting Orders” (79) to online aggregation User control vs. performance? –Running multiple versions of a query (Rdb, 96)

21 6. Aggregate Functions Add a 4th aggregate function –SUM, COUNT, AVG Use the formulae provided in the Appendix Extended aggs need to return running confidence intervals

22 7. API Current API uses built-in methods –Three basic functions: speedUpGroup, slowDownGroup, stopGroup, e.g. select stopGroup(val) Very flexible. Easy to code –The fourth function: setSkipFactor(val, int)

23 User Interface Inter-tuple speed is critical!!

24 8. Statistical Issues Confidence Intervals for SQL aggs –given an estimate, probability p that we’re within ε of the right answer 3 types of estimates –Conservative (Hoeffding’s inequality) –Large-Sample (Central Limit Theorems) –Deterministic Previous work + new results from Peter Haas

25 Conservative confidence interval Based on Hoeffding’s inequality Valid for all
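
For a running AVG over values known to lie in a bounded range [lo, hi], Hoeffding's inequality gives P(|avg_n − μ| ≥ ε) ≤ 2·exp(−2nε² / (hi − lo)²). Solving for ε at confidence p yields the half-width computed below. This is a sketch of the standard bound under the assumption of known value bounds; the function name and the example numbers are illustrative, and the slides' own appendix formulae are not reproduced here.

```python
import math


def hoeffding_half_width(n, p, lo, hi):
    """Half-width eps such that the running average of n samples is
    within eps of the true mean with probability at least p,
    assuming every value lies in [lo, hi]."""
    return (hi - lo) * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * n))


# Illustrative numbers: grades in [0, 4], 95% confidence.
eps_100 = hoeffding_half_width(100, 0.95, 0.0, 4.0)
eps_400 = hoeffding_half_width(400, 0.95, 0.0, 4.0)
```

The bound uses only the range of the data, not its observed variance, which is why it tends to be wide; quadrupling n halves the half-width.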

26 Large-Sample confidence interval Based on central limit theorems (CLT) Appropriate when n is small enough
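
A CLT-based interval for a running AVG takes the form ε = z_p · s / √n, where s is the sample standard deviation of the tuples seen so far and z_p is the (1 + p)/2 quantile of the standard normal distribution. The sketch below is a textbook version under those assumptions (Welford's update for the variance, the n − 1 divisor, hypothetical names); it is not taken from the slides' appendix.

```python
import math
from statistics import NormalDist


def clt_running_intervals(values, p=0.95):
    """Yield (n, running mean, half-width) for n >= 2.

    The half-width is z * s / sqrt(n), an approximate confidence
    interval that relies on the central limit theorem.
    """
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)
    n, mean, m2 = 0, 0.0, 0.0
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)        # Welford's running sum of squares
        if n >= 2:
            s = math.sqrt(m2 / (n - 1))
            yield n, mean, z * s / math.sqrt(n)


# Made-up values, purely for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
intervals = list(clt_running_intervals(sample))
last_n, last_mean, last_eps = intervals[-1]
```

Because it uses the observed variance rather than the worst-case range, this interval is usually much tighter than the conservative one, at the price of holding only approximately.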

27 Deterministic confidence interval With probability 1 Useful only when n is very large
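
A deterministic interval for AVG can be sketched by assuming the total row count and bounds on the values are known: the unseen tuples can pull the final average no lower than if they were all at the minimum and no higher than if they were all at the maximum. The function name and the numbers are illustrative assumptions.

```python
def deterministic_avg_bounds(seen_sum, n_seen, n_total, lo, hi):
    """Bounds that contain the final AVG with probability 1,
    assuming every value lies in [lo, hi] and n_total is known."""
    remaining = n_total - n_seen
    lower = (seen_sum + remaining * lo) / n_total
    upper = (seen_sum + remaining * hi) / n_total
    return lower, upper


# Illustrative numbers: 4 of 10 rows seen, values in [0, 4].
early = deterministic_avg_bounds(12.0, 4, 10, 0.0, 4.0)
final = deterministic_avg_bounds(30.0, 10, 10, 0.0, 4.0)
```

The width is (remaining / n_total) · (hi − lo), so the interval only becomes tight once most of the table has been scanned, which matches the slide's remark that it is useful only for very large n.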

28 Appendix: formulae

29 Content Question & Motivation The Approach –Goals –Implementation –Evaluation Summary & Future Work

30 Evaluation Testbed –prototype in Postgres95 –96MB main memory –1GB disk Dataset –University of Wisconsin –Single table –1,547,606 rows, about 316.6MB

31 Pacing study Select AVG(grade), Interval(0.99) From ENROLL; Traditional plan

32 Access Methods, Big Group Select AVG(grade), Interval(0.95) From ENROLL GROUP BY college; College L: 925K tuples

33 Access Methods, Small Group Surprise! Select AVG(grade), Interval(0.95) From ENROLL GROUP BY college; College S: 15K tuples

34 Content Question & Motivation The Approach –Goals –Implementation –Evaluation Summary & Future Work

35 Summary A new approach to aggregation! –Implementation of a RDBMS engine –Continuous output –Hash-based group by –Duplicate elimination –Index striding –New APIs –User Interface

36 Future Work Better UI –online data visualization graphical aggregate great for panning, zooming Checkpointing/continuation –also continuous data streams Extension of statistical results: –simultaneous confidence intervals

37 What makes it outstanding? Leading a new subject Applicable to practical problems Sufficient and broad analysis –Join algorithms –Various statistical estimation theories Implementation with real system A little bit of luck

38 Online Aggregation With MR Progress Estimation Multi-job MapReduce Statistical model for MapReduce Join Algorithms

39 Thank you! 2011-4-15 Liu Long

