Online aggregation Joseph M. Hellerstein University of California, Berkeley Peter J. Haas IBM Research Division Helen J. Wang University of California,

Presentation transcript:

Online Aggregation. Joseph M. Hellerstein, University of California, Berkeley; Peter J. Haas, IBM Research Division; Helen J. Wang, University of California, Berkeley. Presented by: Arjav Dave

Overview: Introduction, Goals, Implementation, Performance Evaluation, Future Work

Problems with Aggregation: Aggregation is performed in batch mode: a query is submitted, the system processes a large volume of data over a long period of time, and then the final answer is returned. Users are forced to wait without feedback while the query is being processed. Aggregate queries are generally used to get a rough picture of a large body of data, yet they are computed with painstaking precision.

What Is Online Aggregation and Why? It permits users to observe the progress of their aggregation queries and to control execution on the fly.

Example: Consider the query: select avg(final_grade) from grades where course_name = 'cse186'; Without an index, the query scans all records before returning the answer.

Example Using Online Aggregation: The user can stop query processing once the running result is within the specified confidence interval.
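The stopping rule on this slide can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the tuples arrive in random order, uses a normal-approximation (large-sample) interval, and the names `running_avg`, `avg_until`, `report_every` and `max_half_width` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def running_avg(values, confidence=0.95, report_every=100):
    """Yield (n, running mean, CI half-width) while scanning values.

    Assumes values arrive in random order, so the first n of them
    behave like a random sample of the whole input.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided normal quantile
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # Welford's running sum of squared deviations
        if n >= 2 and n % report_every == 0:
            std = math.sqrt(m2 / (n - 1))
            yield n, mean, z * std / math.sqrt(n)


def avg_until(values, max_half_width, confidence=0.95):
    """Stop as soon as the interval is narrow enough; return the last estimate.

    Returns None if no estimate was ever reported (too few values).
    """
    last = None
    for n, mean, half in running_avg(values, confidence):
        last = (n, mean, half)
        if half <= max_half_width:
            break
    return last


if __name__ == "__main__":
    rng = random.Random(0)
    grades = [rng.uniform(0, 100) for _ in range(100_000)]
    print(avg_until(grades, max_half_width=1.0))
```

In an interactive system the `break` would be triggered by the user pressing the stop button rather than by a fixed threshold.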

Interface with Groups: Consider a query with a GROUP BY clause that produces 6 groups in its output. The interface shows 6 stop signs, one per group, each of which can be used to stop processing of that group. Such an interface is easy to use for non-statistical users.

Usability Goals: Continuous Observation: Statistical, graphical and other intuitive interfaces should be available, and interfaces must be extensible for each aggregate function. Control of Time/Precision: The user should be able to terminate processing at any time, controlling the trade-off between time and precision. Control of Fairness/Partiality: Users should be able to control the relative rates at which different running aggregates are updated.

Performance Goals: Minimize Time to Accuracy: Minimize the time required to produce a useful estimate of the final answer. Minimize Time to Completion: Minimize the time required to produce the final answer. Pacing: The running aggregates should be updated at a regular rate, but not so frequently that they overburden the user or the user interface.

Building a System for Online Aggregation: Two approaches: the naive approach, and modifying a DBMS.

Naive Approach: Consider the query: select running_avg(final_grade), running_confidence(final_grade), running_interval(final_grade) from grades; POSTGRES supports user-defined functions, which can be used to define simple running aggregates such as these. For complex aggregates, performance and functional issues arise.

Modifying a DBMS: Modify the database engine to support online aggregation. Random access to data: ▫Necessary to obtain statistically meaningful estimates of the precision of running aggregates. Access methods: ▫Heap Scan ▫Index Scan ▫Sampling from Indices

Types of Access Methods: Heap Scan: ▫Records are generally stored in effectively random order, so a plain scan already fetches them in random order. ▫In clustered files there may be some logical order, so another access method should be chosen for queries over the clustering attributes. Index Scan: ▫Returns tuples ordered by some attribute, or in groups based on some attribute. Sampling from Indices: ▫Ideal for producing meaningful confidence intervals. ▫Less efficient than the other two.

Fair, Non-Blocking Group By: The traditional technique sorts and then groups, but sorting blocks. Instead, hash the input relation on its grouping columns. However, hashing does not perform well as the number of groups increases. Hybrid hashing, or its optimized version Hybrid Cache, can be used. The same technique can be used for DISTINCT queries.
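The hash-based alternative to sort-then-group can be illustrated with an in-memory dictionary keyed on the grouping column. This is only a sketch under the assumption that all groups fit in memory (the case where plain hashing works well); `online_group_avg` and `report_every` are hypothetical names, and the hybrid variants mentioned on the slide are not shown.

```python
from collections import defaultdict


def online_group_avg(rows, report_every=2):
    """Non-blocking GROUP BY for AVG.

    rows is an iterable of (group_key, value) pairs. After every
    report_every input rows, yield a snapshot {group: running average},
    so estimates are available long before the input is exhausted.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, (key, value) in enumerate(rows, start=1):
        sums[key] += value   # hash on the grouping column, no sort needed
        counts[key] += 1
        if i % report_every == 0:
            yield {k: sums[k] / counts[k] for k in sums}


if __name__ == "__main__":
    rows = [("a", 1.0), ("b", 10.0), ("a", 3.0), ("b", 20.0)]
    for snapshot in online_group_avg(rows):
        print(snapshot)
```

Because each group is updated only when one of its tuples happens to arrive, small groups are refreshed rarely, which is the fairness problem the next slide addresses.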

Index Striding: With hashing alone, updates for a group with few members will be very infrequent. Index striding fetches tuples from the different groups in round-robin fashion, so that all groups are served fairly. Advantages: ▫Output is updated according to default or user settings. ▫Delivery of tuples for a group can be stopped, so delivery of tuples for the other groups becomes much faster.
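The round-robin idea behind index striding can be sketched as follows. This is an assumption-laden illustration: a real system would open one index cursor per group value, whereas here each group is simply a Python iterable, and `index_stride` and the shared `stopped` set are hypothetical names.

```python
def index_stride(groups, stopped=None):
    """Yield (group, tuple) pairs, visiting the groups round-robin.

    groups maps a group key to an iterable of its tuples (a stand-in
    for a per-group index cursor). stopped is a set the caller may
    add to while iterating; a stopped group is dropped from the
    rotation, so the remaining groups are served faster.
    """
    stopped = stopped if stopped is not None else set()
    cursors = {g: iter(rows) for g, rows in groups.items()}
    while cursors:
        for g in list(cursors):
            if g in stopped:
                del cursors[g]
                continue
            try:
                yield g, next(cursors[g])
            except StopIteration:
                del cursors[g]  # this group is exhausted


if __name__ == "__main__":
    groups = {"big": [1, 2, 3, 4], "small": [10]}
    print(list(index_stride(groups)))
```

Note that the small group gets its tuple in the first round instead of waiting behind the large group, and adding a key to `stopped` plays the role of pressing that group's stop sign.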

Non-Blocking Join Algorithms: Sort-Merge Join: Not acceptable, because sorting blocks. Merge Join: Not acceptable, because it relies on access methods that fetch tuples in sorted order. Hybrid Hash Join: Acceptable if the inner relation is small, as it blocks while hashing the inner relation. Pipelined Hash Join: Less efficient than hybrid hash join, but efficient if both relations are large. Nested-Loops Join: Not useful for joins with a large, non-indexed inner relation.
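To make the contrast with hybrid hash join concrete, here is a small in-memory sketch of a pipelined (symmetric) hash join. It is an illustration only, assuming both hash tables fit in memory and that the two inputs are read alternately; the function name and the key-extractor parameters are hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest

_DONE = object()  # sentinel marking an exhausted input


def pipelined_hash_join(left, right, left_key, right_key):
    """Equi-join two inputs without blocking on either one.

    Each arriving tuple is inserted into its own side's hash table and
    immediately probed against the other side's table, so matches are
    produced as soon as both partners have been seen.
    """
    left_table = defaultdict(list)
    right_table = defaultdict(list)
    for l, r in zip_longest(left, right, fillvalue=_DONE):
        if l is not _DONE:
            k = left_key(l)
            left_table[k].append(l)
            for match in right_table.get(k, ()):
                yield l, match
        if r is not _DONE:
            k = right_key(r)
            right_table[k].append(r)
            for match in left_table.get(k, ()):
                yield match, r


if __name__ == "__main__":
    students = [(1, "ann"), (2, "bob"), (3, "cy")]
    grades = [(2, 90), (1, 80), (1, 70)]
    for pair in pipelined_hash_join(students, grades,
                                    lambda s: s[0], lambda g: g[0]):
        print(pair)
```

The price of never blocking is that both inputs are kept in hash tables, which fits the slide's remark that this method is less efficient than hybrid hash join.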

Optimization: Avoid sorting. The cost function for blocking sub-operations (e.g., hashing the inner relation in a hybrid hash join) should be exponential in their processing time. Maximize user control. The trade-off between the output rate and the time to completion needs to be evaluated.

Aggregate Functions: New aggregate functions must be defined to return running confidence intervals. The query executor must be modified to provide running aggregate values for display. An API must be provided to control the rate, e.g. stopGroup, speedupGroup, slowDownGroup, setSkipFactor.

Running Confidence Intervals: The precision of running aggregates is given by running confidence intervals. Three types: ▫Conservative Confidence Intervals ▫Large-Sample Confidence Intervals ▫Deterministic Confidence Intervals
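As a rough guide to what the three interval types look like for a running AVG, the formulas below use standard results rather than the paper's exact expressions, which may differ in detail. The notation is an assumption introduced here: $n$ tuples seen so far out of $N$, running mean $\bar{Y}_n$, running sum $S_n$, sample standard deviation $s_n$, confidence level $p$, and known bounds $a \le y_i \le b$ on the aggregated values.

```latex
% Large-sample interval (central limit theorem); z_p is the
% (1+p)/2 quantile of the standard normal distribution.
\bar{Y}_n \pm z_p \, \frac{s_n}{\sqrt{n}}

% Conservative interval (Hoeffding's inequality); holds for any n >= 1
% but needs the bounds a and b, and is typically wider.
\bar{Y}_n \pm (b - a) \sqrt{\frac{1}{2n} \ln \frac{2}{1 - p}}

% Deterministic interval; contains the final average with certainty,
% obtained by assuming every unseen tuple takes its extreme value.
\left[ \frac{S_n + (N - n)\,a}{N}, \; \frac{S_n + (N - n)\,b}{N} \right]
```

The first two shrink at a rate of $1/\sqrt{n}$, while the deterministic interval only becomes tight as $n$ approaches $N$.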

Future Work: Enhancing the user interface, support for nested queries, checkpointing and continuation, tracking online queries.

QUESTIONS?