Online Aggregation Liu Long
Aggregation Operations related to aggregating data in DBMS –AVG –SUM –COUNT
An Example Select AVG(grade) from ENROLL; A “fancy” interface: + Query Results AVG
A Better Approach Don’t process in batch! Online aggregation: represent the final result
Content Question & Motivation The Approach –Goals –Implementation –Evaluation Summary & Future Work
Ideal Approach Select AVG(grade) from ENROLL GROUP BY major;
Goals & Requirements Continuous output –Non-blocking query plans Time/precision control Fairness/partiality
Content Question & Motivation The Approach –Goals –Implementation –Evaluation Summary & Future Work A Naïve Approach Modify the database engine
A Naïve Approach SELECT running_avg(grade), running_confidence(grade), running_interval(grade) FROM ENROLL;
A Naïve Approach Drawbacks: –No grouping –No guarantee of continuous output –No guarantee of fairness (or control over partiality)
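The running-aggregate idea can be sketched outside the database as a generator that emits a new estimate after every tuple. This is only an illustration: the function name `running_avg` mirrors the slide, but the Python signature, the `seed` parameter, and the in-memory shuffle (standing in for a randomized scan) are assumptions.

```python
import random

def running_avg(values, seed=0):
    """Yield (tuples_seen, running_average) after every tuple.

    Assumption: shuffling an in-memory list stands in for a scan that
    delivers tuples in random order.
    """
    rows = list(values)
    random.Random(seed).shuffle(rows)
    total = 0.0
    for n, v in enumerate(rows, start=1):
        total += v
        yield n, total / n  # an estimate is available after each tuple

grades = [3.0, 4.0, 2.0, 3.5, 2.5]
estimates = list(running_avg(grades))
```

The last element of `estimates` is the exact average; every earlier element is an estimate the user could already have looked at.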
1. Random Access to Data Heap Scan –OK if clustering uncorrelated to agg & grouping attrs Index Scan –can scan an index on attrs uncorrelated to agg or grouping Sampling: –could introduce new sampling access methods (e.g. Olken's work, '93)
User Interface
2. Group By & Distinct Fair, Non-Blocking Group By/Distinct –Can't sort! sorting blocks, sorting is unfair –Must use hash-based techniques –“Hybrid hashing” (84)! especially for duplicate elimination. –“Hybrid Cache” (96) even better. Memorization sorting
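A hash-based Group By stays non-blocking because each tuple updates its group's state in place, so a snapshot of every group can be emitted at any time. The sketch below is a minimal in-memory illustration, not the hybrid hashing or Hybrid Cache algorithms named on the slide; `online_group_avg` is a hypothetical name.

```python
from collections import defaultdict

def online_group_avg(rows):
    """rows: iterable of (group, value) pairs.

    Yield a snapshot {group: running average} after every row.
    No sort is needed: the hash table is updated tuple by tuple.
    """
    state = defaultdict(lambda: [0, 0.0])  # group -> [count, sum]
    for group, value in rows:
        entry = state[group]
        entry[0] += 1
        entry[1] += value
        yield {g: total / count for g, (count, total) in state.items()}

rows = [("CS", 4.0), ("Math", 3.0), ("CS", 2.0)]
snapshots = list(online_group_avg(rows))
```

A sort-based plan would produce nothing until the whole input had been sorted; here the first snapshot appears after the first tuple.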
User Interface
3. Index Striding For fair Group By: –want random tuple from Group 1, random tuple from Group 2, ... –Solution: one access method opens many cursors in index, one per group. Fetch round-robin. –Can control speed by weighting the schedule
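The round-robin schedule can be sketched with one iterator per group. This is a simplified model under stated assumptions: a Python list per group stands in for an index range, tuples within a group are taken in list order rather than at random, and `index_stride` and its `weights` argument are hypothetical names, with a weight of zero dropping the group from the schedule.

```python
def index_stride(groups, weights=None):
    """groups: dict mapping group -> list of values (one 'cursor' each).

    Visit the groups round-robin and fetch weights[g] tuples per visit
    (default 1). A weight <= 0 removes the group from the schedule.
    """
    weights = weights or {}
    cursors = {g: iter(vals) for g, vals in groups.items()}
    while cursors:
        for g in list(cursors):       # copy: cursors may shrink below
            w = weights.get(g, 1)
            if w <= 0:
                del cursors[g]
                continue
            for _ in range(w):
                try:
                    yield g, next(cursors[g])
                except StopIteration:  # this group's range is exhausted
                    del cursors[g]
                    break

groups = {"L": [1, 2, 3, 4], "S": [10, 20]}
fair = list(index_stride(groups))
favor_small = list(index_stride(groups, weights={"S": 2}))
```

With equal weights the small group gets as many early tuples as the large one, which a plain scan would not guarantee; raising a group's weight speeds it up at the expense of the others.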
User Interface
4. Join Algorithms Non-Blocking Joins
5. Query Optimization Entirely avoid sorting in an online aggregation system Extend “Interesting Orders” (79) to online aggregation User control vs. performance? –Running multiple versions of a query (Rdb, 96)
6. Aggregate Functions Add a 4th aggregate function –SUM, COUNT, AVG Use the formulae provided in the Appendix Extended aggs need to return running confidence intervals
7. API Current API uses built-in methods –Three basic functions: speedupGroup, slowDownGroup, stopGroup e.g. select stopGroup(val) Very flexible. Easy to code –The fourth function: setSkipFactor(val, int)
User Interface Inter-tuple speed is critical!!
8. Statistical Issues Confidence Intervals for SQL aggs –given an estimate, probability p that we're within ε of the right answer 3 types of estimates –Conservative (Hoeffding's inequality) –Large-Sample (Central Limit Theorems) –Deterministic Previous work + new results from Peter Haas
Conservative confidence interval Based on Hoeffding's inequality Valid for all n ≥ 1
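For a running AVG over values known to lie in a range [a, b], Hoeffding's inequality gives P(|avg_n − μ| ≥ ε) ≤ 2·exp(−2nε²/(b−a)²). Setting the right-hand side to 1 − p and solving for ε yields the half-width computed below. The function name and the assumption that the bounds a and b are known in advance are illustrative only.

```python
import math

def hoeffding_half_width(n, p, lo, hi):
    """Half-width eps such that the running average of n sampled values,
    each in [lo, hi], is within eps of the true mean with probability >= p.

    From Hoeffding: 2 * exp(-2 * n * eps**2 / (hi - lo)**2) = 1 - p.
    """
    return (hi - lo) * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * n))

# Grades assumed to lie in [0, 4]; 95% confidence after 100 tuples.
eps_100 = hoeffding_half_width(100, 0.95, 0.0, 4.0)
eps_400 = hoeffding_half_width(400, 0.95, 0.0, 4.0)
```

The bound uses no distributional assumption beyond the value range, which is why it holds for any n but tends to be wide; quadrupling n halves the width.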
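Large-Sample confidence interval Based on central limit theorems (CLT) Appropriate when n is large enough for the CLT approximation to be accurate, yet small enough that the tuples seen so far behave like a sample drawn with replacement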
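A CLT-based interval for a running AVG takes the form avg_n ± z_p·s_n/√n, where s_n is the sample standard deviation and z_p is the (1 + p)/2 quantile of the standard normal distribution. The sketch below uses Python's standard `statistics` module; `clt_interval` is a hypothetical name, and the exact formulae in the Appendix may differ in detail.

```python
import math
from statistics import NormalDist, mean, stdev

def clt_interval(sample, p):
    """Return (running average, half-width) for confidence level p.

    Assumption: the sample behaves like independent draws, so the
    average is approximately normal for moderately large n.
    """
    n = len(sample)
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)  # e.g. about 1.96 for p = 0.95
    half = z * stdev(sample) / math.sqrt(n)
    return mean(sample), half

sample = [2.0, 3.0, 4.0] * 50          # 150 observed grades
avg, half = clt_interval(sample, 0.95)
```

Because the half-width depends on the observed spread rather than the worst-case range, it is usually much tighter than a conservative bound, at the price of holding only approximately.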
Deterministic confidence interval With probability 1 Useful only when n is very large
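A deterministic interval follows from worst-case reasoning: if n of N tuples have been seen and every value lies in [a, b], the unseen tuples can shift the final AVG only so far. The sketch assumes a single table with no selection predicate and a known N and value range; `deterministic_avg_bounds` is a hypothetical name.

```python
def deterministic_avg_bounds(seen_sum, n, total_rows, lo, hi):
    """Bounds that contain the final AVG with probability 1.

    The (total_rows - n) unseen values each lie in [lo, hi], so the final
    sum lies between seen_sum + remaining*lo and seen_sum + remaining*hi.
    """
    remaining = total_rows - n
    return ((seen_sum + remaining * lo) / total_rows,
            (seen_sum + remaining * hi) / total_rows)

# 1000 grades in [0, 4], running average 3.0 so far.
early = deterministic_avg_bounds(300.0, 100, 1000, 0.0, 4.0)
late = deterministic_avg_bounds(2970.0, 990, 1000, 0.0, 4.0)
```

After 10% of the table the bounds are nearly as wide as the value range; only when almost all tuples have been read do they become informative.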
Appendix: formulae
Evaluation Testbed –prototype in Postgres95 –96MB main memory –1GB disk Dataset –University of Wisconsin –Single table –1,547,606 rows, about 316.6MB
Pacing study Select AVG(grade), Interval(0.99) From ENROLL; Traditional plan
Access Methods, Big Group Select AVG(grade), Interval(0.95) From ENROLL GROUP BY college; College L: 925K tuples
Access Methods, Small Group Surprise! Select AVG(grade), Interval(0.95) From ENROLL GROUP BY college; College S: 15K tuples
Summary A new approach to aggregation! –Implementation of an RDBMS engine –Continuous output –Hash-based group by –Duplicate elimination –Index striding –New APIs –User Interface
Future Work Better UI –online data visualization graphical aggregate great for panning, zooming Checkpointing/continuation –also continuous data streams Extension of statistical results: –simultaneous confidence intervals
What makes it outstanding? Leading a new subject Applicable to practical problems Sufficient and broad analysis –Join algorithms –Various statistical estimation theories Implementation with real system A little bit of luck
Online Aggregation With MR Progress Estimation Multi-job MapReduce Statistical model for MapReduce Join Algorithms
Thank you! Liu Long