Published by Easter Fitzgerald. Modified over 9 years ago.
2
Online Aggregation 2011-4-15 Liu Long
3
Aggregation Operations related to aggregating data in DBMS –AVG –SUM –COUNT
4
An Example Select AVG(grade) from ENROLL; A “fancy” interface: + Query Results AVG 3.262574342
5
A Better Approach Don’t process in batch! Online aggregation: represent the final result
6
Content Question & Motivation The Approach –Goals –Implementation –Evaluation Summary & Future Work
7
Content Question & Motivation The Approach –Goals –Implementation –Evaluation Summary & Future Work
8
Ideal Approach Select AVG(grade) from ENROLL GROUP BY major;
9
Goals & Requirements Continuous output –Non-blocking query plans Time/precision control Fairness/partiality
10
Content Question & Motivation The Approach –Goals –Implementation –Evaluation Summary & Future Work A Naïve Approach Modify the database engine
11
A Naïve Approach SELECT running_avg(grade), running_confidence(grade), running_interval(grade) FROM ENROLL;
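A minimal sketch of what a running aggregate such as running_avg computes: an updated estimate after every input row rather than one value at the end. The Python function name and the sample grades are illustrative assumptions, not part of the original system.

```python
from typing import Iterable, Iterator, Tuple


def running_avg(values: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Yield (rows_seen, average_so_far) after every input row."""
    total = 0.0
    for n, value in enumerate(values, start=1):
        total += value
        yield n, total / n


# Hypothetical grades; each step refines the estimate of the final AVG.
estimates = list(running_avg([4.0, 2.0, 3.0, 3.0]))
```

The caller can stop consuming the generator as soon as the estimate is good enough, which is the behavior a batch AVG cannot offer.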
12
A Naïve Approach Drawbacks: –No grouping –No guarantee of continuous output –No guarantee of fairness (or control over partiality)
13
1. Random Access to Data Heap Scan –OK if clustering uncorrelated to agg & grouping attrs Index Scan –can scan an index on attrs uncorrelated to agg or grouping Sampling –could introduce new sampling access methods (e.g. Olken's work, '93)
14
User Interface
15
2. Group By & Distinct Fair, Non-Blocking Group By/Distinct –Can't sort! sorting blocks, sorting is unfair –Must use hash-based techniques –Hybrid hashing (84)! especially for duplicate elimination. –“Hybrid Cache” (96) even better. Memorization sorting
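A minimal sketch of why hashing keeps Group By non-blocking: each row updates a hash table keyed by group, so a per-group estimate is available after every row, whereas a sort would have to finish first. The function name and the sample rows are assumptions for illustration; this is not the hybrid hashing or Hybrid Cache algorithm itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def online_group_avg(rows: Iterable[Tuple[str, float]]) -> Iterator[Dict[str, float]]:
    """After each (group, value) row, yield the running average per group."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for key, value in rows:
        sums[key] += value
        counts[key] += 1
        # Snapshot of every group seen so far; nothing waits for end of input.
        yield {k: sums[k] / counts[k] for k in sums}


# Hypothetical (major, grade) rows.
snapshots = list(online_group_avg([("CS", 4.0), ("Math", 2.0), ("CS", 2.0)]))
```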
16
User Interface
17
3. Index Striding For fair Group By: –want random tuple from Group 1, random tuple from Group 2, ... –Solution: one access method opens many cursors in index, one per group. Fetch round-robin. –Can control speed by weighting the schedule
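The round-robin schedule described above can be sketched as follows. In-memory lists stand in for the per-group index cursors, and the function name, the weights argument, and the seed are all assumptions for illustration; a weight of 0 is treated here as a stopped group.

```python
import random
from typing import Dict, Iterator, List, Tuple


def index_stride(groups: Dict[str, List[float]],
                 weights: Dict[str, int],
                 seed: int = 0) -> Iterator[Tuple[str, float]]:
    """Fetch tuples round-robin, one 'cursor' per group, weighted per group."""
    rng = random.Random(seed)
    cursors = {}
    for key, values in groups.items():
        if weights.get(key, 1) <= 0:
            continue  # weight 0: this group is never fetched
        shuffled = values[:]
        rng.shuffle(shuffled)  # random order within the group
        cursors[key] = iter(shuffled)
    while cursors:
        for key in list(cursors):
            # A group with weight w gets up to w tuples per round.
            for _ in range(weights.get(key, 1)):
                try:
                    yield key, next(cursors[key])
                except StopIteration:
                    del cursors[key]  # group exhausted
                    break


groups = {"L": [1.0] * 6, "S": [2.0] * 2}
fair_order = [k for k, _ in index_stride(groups, {"L": 1, "S": 1})]
fast_s_order = [k for k, _ in index_stride(groups, {"L": 1, "S": 2})]
```

With equal weights the small group S receives tuples as often as the large group L until it is exhausted, which a plain scan would not guarantee.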
18
User Interface
19
4. Join Algorithms Non-Blocking Joins
20
5. Query Optimization Entirely avoid sorting in an online aggregation system Extend “Interesting Orders” (79) to online aggregation User control vs. performance? –Running multiple versions of a query (Rdb, 96)
21
6. Aggregate Functions Add a 4th aggregate function –SUM, COUNT, AVG Use the formulae provided in the Appendix Extended aggs need to return running confidence intervals
22
7. API Current API uses built-in methods –Three basic functions: speedupGroup, slowDownGroup, stopGroup e.g. select StopGroup( val ) Very flexible. Easy to code –The fourth function: setSkipFactor(val, int)
23
User Interface Inter-tuple speed is critical!!
24
8. Statistical Issues Confidence Intervals for SQL aggs –given an estimate, probability p that we're within ε of the right answer 3 types of estimates: Conservative (Hoeffding's inequality), Large-Sample (Central Limit Theorems), Deterministic Previous work + new results from Peter Haas
25
Conservative confidence interval Based on Hoeffding's inequality Valid for all n ≥ 1
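As a hedged illustration, the textbook two-sided Hoeffding bound for the mean of n values known to lie in [lo, hi] gives a half-width ε = (hi − lo)·sqrt(ln(2/(1 − p)) / (2n)). The appendix formulae of the talk are not reproduced in this transcript, so the exact form used there may differ; the function name and the value bounds below are assumptions.

```python
import math


def hoeffding_half_width(n: int, p: float, lo: float, hi: float) -> float:
    """Half-width eps such that |running_avg - true_avg| <= eps with prob >= p.

    Assumes every value lies in [lo, hi]; holds for any sample size n >= 1.
    """
    return (hi - lo) * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * n))


# Hypothetical: grades in [0, 4], 95% confidence, 1000 rows seen so far.
eps_1000 = hoeffding_half_width(1000, 0.95, 0.0, 4.0)
eps_4000 = hoeffding_half_width(4000, 0.95, 0.0, 4.0)
```

Quadrupling the number of rows halves the interval, which is the usual 1/sqrt(n) behavior.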
26
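Large-Sample confidence interval Based on central limit theorems (CLT) Appropriate when n is small enough that the rows seen so far behave like a sample drawn with replacement, but large enough for the CLT approximation to be accurate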
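A minimal sketch of a CLT-based interval for a running AVG, using the standard normal-approximation half-width ε = z·s/sqrt(n), where s is the sample standard deviation and z is the (1 + p)/2 quantile of the standard normal. This is the textbook form and is assumed here; the talk's appendix may state it differently. The function name and sample data are illustrative.

```python
import math
from statistics import NormalDist, stdev
from typing import Sequence


def clt_half_width(sample: Sequence[float], p: float) -> float:
    """Approximate half-width of a confidence-p interval for the mean."""
    n = len(sample)
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)  # about 1.96 for p = 0.95
    return z * stdev(sample) / math.sqrt(n)


# Hypothetical sample of 100 grades seen so far.
grades = [2.0, 4.0] * 50
eps_clt = clt_half_width(grades, 0.95)
```

Unlike the Hoeffding bound, this interval uses the observed spread of the data, so it is typically tighter, but it only holds approximately.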
27
Deterministic confidence interval With probability 1 Useful only when n is very large
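One way to see why a probability-1 interval only helps late in the scan: if the table size and hard bounds on the values are known, the final AVG must lie between the cases where every unseen row takes the minimum or the maximum value. The sketch below assumes exactly that (known row count and bounds); the function name and numbers are hypothetical.

```python
from typing import Tuple


def deterministic_avg_bounds(seen_sum: float, n: int, total_rows: int,
                             lo: float, hi: float) -> Tuple[float, float]:
    """Bounds that contain the final AVG with certainty.

    Assumes total_rows is known and every value lies in [lo, hi].
    """
    remaining = total_rows - n
    lower = (seen_sum + remaining * lo) / total_rows
    upper = (seen_sum + remaining * hi) / total_rows
    return lower, upper


# Hypothetical: grades in [0, 4], table of 20 rows.
half_seen = deterministic_avg_bounds(30.0, 10, 20, 0.0, 4.0)
all_seen = deterministic_avg_bounds(60.0, 20, 20, 0.0, 4.0)
```

With half the rows seen the interval still spans half the value range; it collapses to a point only when the scan completes.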
28
Appendix: formulae
29
Content Question & Motivation The Approach –Goals –Implementation –Evaluation Summary & Future Work
30
Evaluation Testbed –prototype in Postgres95 –96MB main memory –1GB disk Dataset –University of Wisconsin –Single table –1,547,606 rows, about 316.6MB
31
Pacing study Select AVG(grade), Interval(0.99) From ENROLL; Traditional plan
32
Access Methods, Big Group Select AVG(grade), Interval(0.95) From ENROLL GROUP BY college; College L: 925K tuples
33
Access Methods, Small Group Surprise! Select AVG(grade), Interval(0.95) From ENROLL GROUP BY college; College S: 15K tuples
34
Content Question & Motivation The Approach –Goals –Implementation –Evaluation Summary & Future Work
35
Summary A new approach to aggregation! –Implementation of an RDBMS engine –Continuous output –Hash-based group by –Duplicate elimination –Index striding –New APIs –User Interface
36
Future Work Better UI –online data visualization graphical aggregate great for panning, zooming Checkpointing/continuation –also continuous data streams Extension of statistical results: –simultaneous confidence intervals
37
What makes it outstanding? Leading a new subject Applicable to practical problems Sufficient and broad analysis –Join algorithms –Various statistical estimation theories Implementation with real system A little bit of luck
38
Online Aggregation With MR Progress Estimation Multi-job MapReduce Statistical model for MapReduce Join Algorithms
39
Thank you! 2011-4-15 Liu Long