Online Aggregation. Liu Long, 2011-4-15.

Presentation transcript:

Online Aggregation Liu Long

Aggregation Operations related to aggregating data in DBMS –AVG –SUM –COUNT

An Example Select AVG(grade) from ENROLL; A “fancy” interface: [figure: Query Results, AVG]

A Better Approach Don’t process in batch! Online aggregation: represent the final result

Content Question & Motivation The Approach –Goals –Implementation –Evaluation Summary & Future Work

Ideal Approach Select AVG(grade) from ENROLL GROUP BY major;

Goals & Requirements Continuous output –Non-blocking query plans Time/precision control Fairness/partiality

Content Question & Motivation The Approach –Goals –Implementation (A Naïve Approach; Modify the database engine) –Evaluation Summary & Future Work

A Naïve Approach SELECT running_avg(grade), running_confidence(grade), running_interval(grade) FROM ENROLL;

A Naïve Approach Drawbacks: –No grouping –No guarantee of continuous output –No guarantee of fairness (or control over partiality)

1. Random Access to Data Heap Scan –OK if clustering is uncorrelated to agg & grouping attrs Index Scan –can scan an index on attrs uncorrelated to agg or grouping attrs Sampling –could introduce new sampling access methods (e.g. Olken’s work, ’93)

User Interface

2. Group By & Distinct Fair, Non-Blocking Group By/Distinct –Can’t sort! Sorting blocks; sorting is unfair –Must use hash-based techniques –Hybrid hashing (84)! Especially for duplicate elimination –“Hybrid Cache” (96) even better (memoization, sorting)
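
A minimal Python sketch of why hashing keeps Group By non-blocking: each incoming tuple updates one hash-table entry, and a running aggregate for its group is available immediately. The function name `running_group_avg` and the sample rows are hypothetical, not taken from the slides or the prototype.

```python
from collections import defaultdict


def running_group_avg(tuples):
    """Yield (group, running_avg, count) after every input tuple.

    Hypothetical helper: state is kept in hash tables keyed by group,
    so no sort (and no blocking) is needed before the first output.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, value in tuples:
        sums[group] += value
        counts[group] += 1
        yield group, sums[group] / counts[group], counts[group]


# Illustrative (made-up) ENROLL rows: (major, grade)
rows = [("CS", 3.0), ("EE", 4.0), ("CS", 4.0), ("EE", 2.0)]
updates = list(running_group_avg(rows))
```

The estimates are only meaningful as samples if tuples arrive in an order uncorrelated with the aggregated attribute, which is what the access-method slides address.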

User Interface

3. Index Striding For fair Group By: –want a random tuple from Group 1, a random tuple from Group 2, ... –Solution: one access method opens many cursors in the index, one per group. Fetch round-robin. –Can control speed by weighting the schedule
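
The round-robin schedule can be sketched in Python as below. This is an illustration under assumptions, not the prototype's code: `index_stride`, the `cursors` mapping, and the `weights` mapping are hypothetical names, and plain lists stand in for per-group index cursors.

```python
def index_stride(cursors, weights=None):
    """Fetch round-robin from one cursor per group.

    cursors: dict mapping group -> iterable of that group's tuples
             (a stand-in for one index cursor per group).
    weights: optional dict mapping group -> tuples fetched per turn;
             a weight of 0 drops the group from the schedule.
    """
    weights = weights or {}
    active = {group: iter(cursor) for group, cursor in cursors.items()}
    while active:
        for group in list(active):
            quota = weights.get(group, 1)
            if quota <= 0:
                del active[group]  # weight 0 behaves like stopping the group
                continue
            for _ in range(quota):
                try:
                    yield group, next(active[group])
                except StopIteration:
                    del active[group]  # this group's cursor is exhausted
                    break


# Made-up example: a large group "L" and a small group "S"
cursors = {"L": [1, 2, 3, 4], "S": [10, 20]}
fair = list(index_stride(cursors))
favor_s = list(index_stride(cursors, weights={"S": 2}))
```

With equal weights, the small group receives as many early updates as the large one; raising a group's weight speeds up that group at the expense of the others.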

User Interface

4. Join Algorithms Non-Blocking Joins

5. Query Optimization Entirely avoid sorting in an online aggregation system Extend “Interesting Orders” (79) to online aggregation User control vs. performance? –Running multiple versions of a query (Rdb, 96)

6. Aggregate Functions Add a 4th aggregate function –SUM, COUNT, AVG Use the formulae provided in the Appendix Extended aggs need to return running confidence intervals

7. API Current API uses built-in methods –Three basic functions: speedUpGroup, slowDownGroup, stopGroup, e.g. select stopGroup(val) Very flexible. Easy to code –The fourth function: setSkipFactor(val, int)

User Interface Inter-tuple speed is critical!!

8. Statistical Issues Confidence Intervals for SQL aggs –given an estimate, probability p that we’re within ε of the right answer 3 types of estimates –Conservative (Hoeffding’s inequality) –Large-Sample (Central Limit Theorems) –Deterministic Previous work + new results from Peter Haas

Conservative confidence interval Based on Hoeffding’s inequality Valid for all n
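
As a sketch of the conservative case for AVG: if every value lies in a known range [a, b] and the n tuples seen so far behave like a random sample, Hoeffding's inequality gives P(|estimate - true mean| >= eps) <= 2 exp(-2 n eps^2 / (b - a)^2). Setting the right-hand side to 1 - p and solving for eps yields the half-width computed below. The function name and the grade range 0 to 4 are assumptions for illustration; the slides defer the exact formulae to the appendix.

```python
import math


def hoeffding_half_width(n, p, a, b):
    """Half-width eps such that the running average of n sampled values,
    each bounded in [a, b], is within eps of the true mean with
    probability at least p (by Hoeffding's inequality).
    """
    if n < 1 or not 0.0 < p < 1.0 or b < a:
        raise ValueError("need n >= 1, 0 < p < 1 and a <= b")
    return (b - a) * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * n))


# Assumed example: grades bounded in [0, 4], 95% confidence
eps_100 = hoeffding_half_width(100, 0.95, 0.0, 4.0)
eps_400 = hoeffding_half_width(400, 0.95, 0.0, 4.0)
```

The bound needs no distributional assumption beyond the range, which is why it holds for every n but tends to be wide; it shrinks as 1/sqrt(n), so four times as many tuples halve the interval.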

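Large-Sample confidence interval Based on central limit theorems (CLT) Appropriate when n is large enough for the CLT approximation to be accurate, yet small enough relative to the table that the tuples seen so far can be treated as independent samples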
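
A hedged Python sketch of a CLT-based running interval for AVG: the half-width is z * s / sqrt(n), where s is the sample standard deviation so far and z is the standard normal quantile at (1 + p) / 2. The function name is hypothetical, the variance is maintained with Welford's one-pass update (a choice made here, not stated in the slides), and the sketch ignores any finite-population correction.

```python
import math
from statistics import NormalDist


def running_clt_intervals(values, p=0.95):
    """Yield (n, running_mean, half_width) after each value, from n = 2 on.

    Assumes values arrive in random order, so the first n of them can be
    treated as a random sample.
    """
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)  # e.g. about 1.96 for p = 0.95
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # running sum of squared deviations
        if n >= 2:
            s = math.sqrt(m2 / (n - 1))
            yield n, mean, z * s / math.sqrt(n)


# Made-up grades
results = list(running_clt_intervals([2.0, 4.0, 3.0, 3.0]))
last_n, last_mean, last_half_width = results[-1]
```

Unlike the Hoeffding bound, this interval only holds approximately, but it adapts to the observed spread of the data and is typically much tighter.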

Deterministic confidence interval With probability 1 Useful only when n is very large
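
One way to see why the deterministic interval only becomes useful late in the scan, sketched in Python for AVG over a table of known size with no selection predicate (both are assumptions of this sketch, as are the function and parameter names): the unseen tuples can each contribute at least a and at most b, so the final average is bracketed with certainty.

```python
def deterministic_avg_bounds(partial_sum, n, total_rows, a, b):
    """Bounds that contain the final AVG with probability 1.

    partial_sum: sum of the n values seen so far
    total_rows:  total number of rows that will be aggregated
    a, b:        known lower and upper bounds on every value
    """
    if not 0 <= n <= total_rows or total_rows == 0:
        raise ValueError("need 0 <= n <= total_rows and total_rows > 0")
    remaining = total_rows - n
    low = (partial_sum + remaining * a) / total_rows
    high = (partial_sum + remaining * b) / total_rows
    return low, high


# Assumed example: 10 of 20 rows seen, grades bounded in [0, 4]
halfway = deterministic_avg_bounds(30.0, 10, 20, 0.0, 4.0)
finished = deterministic_avg_bounds(60.0, 20, 20, 0.0, 4.0)
```

The width of the bracket is (total_rows - n) * (b - a) / total_rows, so it stays wide until most of the table has been read.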

Appendix: formulae

Content Question & Motivation The Approach –Goals –Implementation –Evaluation Summary & Future Work

Evaluation Testbed –prototype in Postgres95 –96MB main memory –1GB disk Dataset –University of Wisconsin –Single table –1,547,606 rows, about 316.6MB

Pacing study Select AVG(grade), Interval(0.99) From ENROLL; Traditional plan

Access Methods, Big Group Select AVG(grade), Interval(0.95) From ENROLL GROUP BY college; College L: 925K tuples

Access Methods, Small Group Surprise! Select AVG(grade), Interval(0.95) From ENROLL GROUP BY college; College S: 15K tuples

Content Question & Motivation The Approach –Goals –Implementation –Evaluation Summary & Future Work

Summary A new approach to aggregation! –Implementation of an RDBMS engine –Continuous output –Hash-based group by –Duplicate elimination –Index striding –New APIs –User Interface

Future Work Better UI –online data visualization graphical aggregate great for panning, zooming Checkpointing/continuation –also continuous data streams Extension of statistical results: –simultaneous confidence intervals

What makes it outstanding? Leading a new subject Applicable to practical problems Sufficient and broad analysis –Join algorithms –Various statistical estimation theories Implementation with a real system A little bit of luck

Online Aggregation With MR Progress Estimation Multi-job MapReduce Statistical model for MapReduce Join Algorithms

Thank you! Liu Long