Online Aggregation 2011-4-15 Liu Long


Aggregation Operations related to aggregating data in DBMS –AVG –SUM –COUNT

An Example Select AVG(grade) from ENROLL; A “fancy” interface: + Query Results AVG 3.262574342

A Better Approach Don’t process in batch! Online aggregation: continuously present a running estimate of the final result

Content Question & Motivation The Approach –Goals –Implementation –Evaluation Summary & Future Work

Ideal Approach Select AVG(grade) from ENROLL GROUP BY major;

Goals & Requirements Continuous output –Non-blocking query plans Time/precision control Fairness/partiality

Content Question & Motivation The Approach –Goals –Implementation –Evaluation Summary & Future Work A Naïve Approach Modify the database engine

A Naïve Approach SELECT running_avg(grade), running_confidence(grade), running_interval(grade) FROM ENROLL;
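The running aggregate in the query above can be sketched outside the database as a streaming computation. This is a minimal illustration, not the presentation's implementation: the function name mirrors running_avg from the slide, but the tuple it yields and the use of Welford's update for the variance are assumptions made here.

```python
from typing import Iterable, Iterator, Tuple


def running_avg(values: Iterable[float]) -> Iterator[Tuple[int, float, float]]:
    """Yield (n, running mean, running sample variance) after each value.

    Uses Welford's single-pass update, so no value has to be stored.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        variance = m2 / (n - 1) if n > 1 else 0.0
        yield n, mean, variance


# Hypothetical grades, consumed one tuple at a time.
grades = [3.0, 4.0, 2.0, 3.0]
history = list(running_avg(grades))
```

Each element of history is an estimate a user could already look at; the estimate is only meaningful if the values arrive in an order unrelated to the grade itself.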

A Naïve Approach Drawbacks: –No grouping –No guarantee of continuous output –No guarantee of fairness (or control over partiality)

1. Random Access to Data Heap Scan –OK if clustering is uncorrelated to agg & grouping attrs Index Scan –can scan an index on attrs uncorrelated to agg or grouping attrs Sampling –could introduce new sampling access methods (e.g. Olken’s work, ’93)

User Interface

2. Group By & Distinct Fair, Non-Blocking Group By/Distinct –Can’t sort! Sorting blocks, and sorting is unfair –Must use hash-based techniques –Hybrid hashing (84), especially for duplicate elimination –“Hybrid Cache” (96) even better (memoization, sorting)
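To see why hashing keeps Group By non-blocking, here is a small sketch that updates a per-group hash table as each tuple arrives and can report every group's running average at any time. It uses a plain in-memory dictionary rather than hybrid hashing, and the function name and row layout are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, Tuple


def online_group_avg(
    rows: Iterable[Tuple[Hashable, float]]
) -> Iterator[Dict[Hashable, float]]:
    """After each input row, yield a snapshot of the running AVG per group."""
    sums: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for group, value in rows:
        # One hash lookup per tuple; nothing waits for the input to end.
        sums[group] += value
        counts[group] += 1
        yield {g: sums[g] / counts[g] for g in sums}


# Hypothetical (major, grade) tuples.
rows = [("CS", 4.0), ("Math", 3.0), ("CS", 2.0)]
snapshots = list(online_group_avg(rows))
```

A sort-based Group By could not emit the first snapshot until the whole input had been read and sorted, which is the blocking behaviour the slide rules out.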

User Interface

3. Index Striding For fair Group By: –want random tuple from Group 1, random tuple from Group 2, ... –Solution: one access method opens many cursors in index, one per group. Fetch round-robin. –Can control speed by weighting the schedule
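The round-robin schedule described above can be sketched as follows. Plain Python iterables stand in for the per-group index cursors, and the name index_stride, the weights mapping, and the rule that a weight of 0 stops a group are all assumptions made for the illustration.

```python
from typing import Any, Dict, Hashable, Iterable, Iterator, Mapping, Optional, Tuple


def index_stride(
    cursors: Mapping[Hashable, Iterable[Any]],
    weights: Optional[Mapping[Hashable, int]] = None,
) -> Iterator[Tuple[Hashable, Any]]:
    """Fetch tuples round-robin from one cursor per group.

    weights[g] is the number of tuples fetched from group g per round
    (default 1); a weight of 0 or less drops the group entirely.
    """
    w: Dict[Hashable, int] = dict(weights or {})
    open_cursors = {g: iter(c) for g, c in cursors.items() if w.get(g, 1) > 0}
    while open_cursors:
        for group in list(open_cursors):  # copy: we delete while iterating
            for _ in range(w.get(group, 1)):
                try:
                    yield group, next(open_cursors[group])
                except StopIteration:
                    del open_cursors[group]  # this group is exhausted
                    break


# Hypothetical cursors: a large group "L" and a small group "S".
cursors = {"L": [1, 2, 3, 4], "S": [10, 20]}
fair_order = list(index_stride(cursors))
weighted_order = list(index_stride(cursors, {"S": 2}))
```

With equal weights the small group receives as many early tuples as the large one, so its estimate converges as fast; raising a weight speeds that group up at the expense of the others.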

User Interface

4. Join Algorithms Non-Blocking Joins

5. Query Optimization Entirely avoid sorting in an online aggregation system Extend “Interesting Orders”(79) to online aggregation User control vs. performance? –Running multiple versions of a query(Rdb, 96)

6. Aggregate Functions Add a 4th aggregate function –SUM, COUNT, AVG Use the formulae provided in the Appendix Extended aggs need to return running confidence intervals

7. API Current API uses built-in methods –Three basic functions: speedupGroup, slowDownGroup, stopGroup, e.g. select stopGroup(val) Very flexible. Easy to code –The fourth function: setSkipFactor(val, int)

User Interface Inter-tuple speed is critical!!

8. Statistical Issues Confidence Intervals for SQL aggs –given an estimate, probability p that we’re within ε of the right answer 3 types of estimates –Conservative (Hoeffding’s inequality) –Large-Sample (Central Limit Theorems) –Deterministic Previous work + new results from Peter Haas

Conservative confidence interval Based on Hoeffding’s inequality Valid for all n ≥ 1
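As an illustration of a conservative interval, the sketch below applies the textbook form of Hoeffding's inequality to a running AVG. It assumes the values are known to lie in [lo, hi] and that tuples are retrieved in random order; it is not necessarily the exact formula from the presentation's appendix, and the function name is hypothetical.

```python
import math


def hoeffding_half_width(n: int, p: float, lo: float, hi: float) -> float:
    """Half-width eps with P(|running_avg - true_avg| <= eps) >= p.

    Solves 2 * exp(-2 * n * eps**2 / (hi - lo)**2) = 1 - p for eps.
    """
    if n < 1 or not 0.0 < p < 1.0 or hi < lo:
        raise ValueError("need n >= 1, 0 < p < 1 and lo <= hi")
    return (hi - lo) * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * n))


# Hypothetical setting: grades in [0, 4], 95% confidence.
eps_1000 = hoeffding_half_width(1000, 0.95, 0.0, 4.0)
eps_4000 = hoeffding_half_width(4000, 0.95, 0.0, 4.0)
```

The bound holds for any n, but it only uses the value range, so it is usually wider than an interval that looks at the observed variance. Quadrupling n halves the width.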

Large-Sample confidence interval Based on central limit theorems (CLT) Appropriate when n is large enough for the CLT approximation to hold, yet small relative to the table size
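A large-sample interval for AVG can be sketched with the usual normal approximation. This is the standard CLT interval, assuming the tuples seen so far behave like a random sample; the function name is hypothetical and the appendix formulae of the presentation may differ in detail.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def clt_interval(sample: Sequence[float], p: float) -> Tuple[float, float]:
    """Approximate interval that contains the true AVG with probability ~p."""
    n = len(sample)
    if n < 2 or not 0.0 < p < 1.0:
        raise ValueError("need at least 2 values and 0 < p < 1")
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)  # two-sided normal quantile
    centre = mean(sample)
    eps = z * stdev(sample) / math.sqrt(n)
    return centre - eps, centre + eps


# Hypothetical sample of 100 grades.
sample = [2.0, 3.0, 4.0, 3.0] * 25
low, high = clt_interval(sample, 0.95)
```

Unlike the Hoeffding bound, the coverage here is only approximate, which is why it should not be trusted for very small n.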

Deterministic confidence interval Contains the final answer with probability 1 Useful only when n is very large
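One simple way to obtain bounds that hold with probability 1 is to assume every unseen tuple takes the smallest or the largest possible value. The sketch below does this for AVG, assuming the total row count and the value range [lo, hi] are known; the function name is hypothetical and this is an illustration, not the presentation's formula.

```python
from typing import Tuple


def deterministic_avg_bounds(
    seen_sum: float, n_seen: int, n_total: int, lo: float, hi: float
) -> Tuple[float, float]:
    """Bounds that are certain to contain the final AVG over n_total rows."""
    if not 0 <= n_seen <= n_total or n_total < 1 or hi < lo:
        raise ValueError("need 0 <= n_seen <= n_total, n_total >= 1, lo <= hi")
    remaining = n_total - n_seen
    lower = (seen_sum + remaining * lo) / n_total  # all unseen rows at lo
    upper = (seen_sum + remaining * hi) / n_total  # all unseen rows at hi
    return lower, upper


# Hypothetical table of 1000 grades in [0, 4], running average 3.0.
early = deterministic_avg_bounds(300.0, 100, 1000, 0.0, 4.0)
late = deterministic_avg_bounds(2970.0, 990, 1000, 0.0, 4.0)
```

After 10% of the table the bounds are (0.3, 3.9), which says almost nothing; after 99% they tighten to roughly (2.97, 3.01). This matches the slide's remark that such intervals are useful only when n is very large.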

Appendix: formulae

Evaluation Testbed –prototype in Postgres95 –96MB main memory –1GB disk Dataset –University of Wisconsin –Single table –1,547,606 rows, about 316.6MB

Pacing study Select AVG(grade), Interval(0.99) From ENROLL; Traditional plan

Access Methods, Big Group Select AVG(grade), Interval(0.95) From ENROLL GROUP BY college; College L: 925K tuples

Access Methods, Small Group Surprise! Select AVG(grade), Interval(0.95) From ENROLL GROUP BY college; College S: 15K tuples

Summary A new approach to aggregation! –Implementation in an RDBMS engine –Continuous output –Hash-based group by –Duplicate elimination –Index striding –New APIs –User Interface

Future Work Better UI –online data visualization graphical aggregate great for panning, zooming Checkpointing/continuation –also continuous data streams Extension of statistical results: –simultaneous confidence intervals

What makes it outstanding? Leading a new subject Applicable to practical problems Sufficient and broad analysis –Join algorithms –Various statistical estimation theories Implementation with real system A little bit of luck

Online Aggregation With MR Progress Estimation Multi-job MapReduce Statistical model for MapReduce Join Algorithms

Thank you! 2011-4-15 Liu Long
