An Adaptive Query Execution Engine for Data Integration Zachary Ives, Daniela Florescu, Marc Friedman, Alon Levy, Daniel S. Weld University of Washington.


An Adaptive Query Execution Engine for Data Integration Zachary Ives, Daniela Florescu, Marc Friedman, Alon Levy, Daniel S. Weld University of Washington Slides by Peng Li, Modified by Rachel Pottinger Modified by Ben Zhu, Yidan Liu Presenter: Ben Zhu Discussion Leader: Yidan Liu

Outline Motivations for Tukwila Tukwila Architecture Interleaving of planning and execution Adaptive Query Operators Dynamic Collector Double Pipelined Join Summary

Why data integration? The goal is to provide a uniform query interface to a multitude of data sources. The key advantage of data integration is that it frees users from having to do the following: locate the sources relevant to their query interact with each source independently manually combine the data from the different sources

The main challenges in the design of data integration systems (DISs): Query reformulation The construction of wrapper programs Query optimizers and efficient query execution engines

The need for adaptivity Absence of statistics Unpredictable data arrival characteristics Overlap and redundancy among sources Optimizing the time to the initial answers to the query Network bandwidth generally constrains the data sources to be “small”

Tukwila Architecture

Interleaving of planning and execution Novel features of Tukwila: The optimizer can create a partial plan if essential statistics are missing or uncertain The optimizer generates both annotated operator trees and appropriate event-condition-action rules The optimizer conserves the state of its search space when it calls the execution engine, so it is able to resume optimization in an incremental fashion

Interleaving of planning and execution – Query plans Operators in Tukwila are organized into pipelined units called fragments A plan includes a partially-ordered set of fragments and a set of global rules A fragment consists of a fully pipelined tree of physical operators and a set of local rules. At the end of a fragment, results are materialized, and the rest of the plan can be re-optimized or rescheduled

Interleaving of planning and execution - Rules Rules are the key mechanism for implementing several kinds of adaptive behavior in Tukwila Re-optimization: the optimizer’s cardinality estimate for the fragment’s result is significantly different from the actual size → re-invoke the optimizer Contingent planning: check properties of the result to select the next fragment Adaptive operators: policy for memory overflow resolution and collectors Rescheduling: reschedule if a source times out

Interleaving of planning and execution - Rule format When event if condition then actions When closed(frag1) if card(join1)>2*est_card(join1) then replan An event triggers a rule, causing it to check its condition. If the condition is true, the rule fires, executing the actions.
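
The event-condition-action rules above can be sketched as a small dispatcher. This is an illustrative sketch, not Tukwila's implementation; the names (`Rule`, `fire_event`, `stats`) are hypothetical, and the example encodes the slide's rule "when closed(frag1) if card(join1) > 2*est_card(join1) then replan".

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    event: str                      # e.g. "closed(frag1)"
    condition: Callable[[], bool]   # checked when the event arrives
    actions: list                   # callables run if the condition holds

rules = {}                          # event name -> list of matching rules

def register(rule):
    rules.setdefault(rule.event, []).append(rule)

def fire_event(event):
    # An event triggers each matching rule, causing it to check its
    # condition; if the condition is true, the rule fires and all of
    # its actions are executed.
    for rule in rules.get(event, []):
        if rule.condition():
            for action in rule.actions:
                action()

# Example rule from the slide (statistics values are made up):
stats = {"card": 5000, "est_card": 1000}
log = []
register(Rule("closed(frag1)",
              lambda: stats["card"] > 2 * stats["est_card"],
              [lambda: log.append("replan")]))
fire_event("closed(frag1)")
# log is now ["replan"]
```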

Discussion 1 For one of the following motivating situations of Tukwila Absence of statistics Unpredictable data arrival characteristics Overlap and redundancy among sources Optimizing the time to initial answers Q1: Can you give some examples where the chosen topic matters? Q2: If you were a member of the Tukwila team, what rules or policies would you use to deal with the problem? Discussion Form 4 groups (3-4 people per group, one topic per team) Discuss Q1 and Q2 for one topic (5-7 minutes)

Interleaving of planning and execution – Query Execution Executes a query plan (basic function) Gathers statistics about each operation Handles exception conditions or re-invokes the optimizer

Interleaving of planning and execution – Event Handling Interprets rules attached to query execution plan Execution system may generate events at any time, which are fed into an event queue For each event, the event handler uses a hash table to find all matching rules in the active set For each active rule, it evaluates the conditions and if they are satisfied, all of the rule’s actions are executed

Adaptive Query Operators - Dynamic Collectors Performs a union over a large number of overlapping sources Standard union operators can’t handle errors or decide to ignore slow mirror data sources Tukwila treats the task as a primitive operator so that it can provide guidance about the order in which data sources should be accessed A key distinguishing aspect: the flexibility to contact only some of the sources

Adaptive Query Operators Double Pipelined Join Issues with conventional joins Sort-merge joins & indexed joins: cannot be pipelined; they require an initial sorting or indexing step Nested loops joins and hash joins: follow an asymmetric execution model For nested loops joins, we must wait for the entire inner table to be transmitted initially before pipelining begins For hash joins, we must load the entire inner relation into a hash table before we can pipeline.

Adaptive Query Operators Double Pipelined Join Challenges in the context of data integration Optimizer may not know the relative size of each relation The time to first tuple is important, so it may be better to use the larger data source as the inner relation if it sends data faster The time to first tuple is extended by the hash join’s non-pipelined behavior when it is reading the inner relation

Adaptive Query Operators Double Pipelined Hash Join Symmetric and incremental Produces tuples almost immediately and masks slow data source transmission rates Data-driven in behavior: each join relation sends tuples through the join operator as quickly as possible. At any time, all of the data encountered so far has been joined and the resulting tuples have already been output Trade-off is that it MUST hold hash tables for BOTH relations in memory
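
The symmetric, incremental behavior described above can be sketched in a few lines (this is an illustrative sketch, not Tukwila's actual operator): each arriving tuple is inserted into its own side's hash table and immediately probes the other side's table, so join results stream out as tuples arrive, in whatever order the sources send them.

```python
from collections import defaultdict

def double_pipelined_join(stream):
    """stream yields ('L', tuple) or ('R', tuple) in arrival order;
    tuples join on position 0 (the key)."""
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    other = {"L": "R", "R": "L"}
    for side, tup in stream:
        key = tup[0]
        tables[side][key].append(tup)           # 1. insert into own table
        for match in tables[other[side]][key]:  # 2. probe the other table
            left, right = (tup, match) if side == "L" else (match, tup)
            yield left + right[1:]              # emit joined tuple at once

# Interleaved arrivals from two sources (hypothetical data):
arrivals = [("L", (1, "a")), ("R", (1, "x")), ("R", (2, "y")), ("L", (2, "b"))]
print(list(double_pipelined_join(arrivals)))
# -> [(1, 'a', 'x'), (2, 'b', 'y')]
```

Note the trade-off the slide mentions: both hash tables stay in memory for the duration of the join.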

Adaptive Query Operators Double Pipelined Hash Join Two problems occur It follows a data-driven, bottom-up execution model, while Tukwila is top-down, iterator-based Multithreading: the join consists of separate threads for the output, left child, and right child As each child reads tuples, it places them into a small tuple transfer queue It requires enough memory to hold both join relations

Adaptive Query Operators Double Pipelined Hash Join Handling memory overflow Take some portion of the hash table and swap it to disk Incremental left flush: switch to a strategy of reading only tuples from the right-side relation and as necessary flush a bucket from the left-side relation’s hash table when system runs out of memory – gradually degrade into hybrid hash, flushing buckets lazily Incremental symmetric flush: pick a bucket to flush to disk and flush the bucket from both sources Incremental left flush will perform fewer disk I/Os Incremental symmetric flush may have reduced latency since both relations continue to be processed in parallel
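
The two flush policies above might be sketched roughly as follows; the per-bucket layout and the largest-bucket victim heuristic are illustrative assumptions, not details from the paper.

```python
def pick_victim(buckets):
    # Choose the largest bucket to flush (a plausible heuristic,
    # not one specified by the paper).
    return max(buckets, key=lambda b: len(buckets[b]))

def incremental_left_flush(left_buckets, disk):
    # Flush one bucket of the left-side hash table to disk; afterwards
    # the join reads only right-side tuples, probing what remains of
    # the left table (gradually degrading into hybrid hash).
    victim = pick_victim(left_buckets)
    disk.append(("L", victim, left_buckets.pop(victim)))

def incremental_symmetric_flush(left_buckets, right_buckets, disk):
    # Flush the SAME bucket from both hash tables, so both relations
    # can keep being processed in parallel (reduced latency, but two
    # buckets' worth of I/O per overflow).
    victim = pick_victim(left_buckets)
    disk.append(("L", victim, left_buckets.pop(victim, [])))
    disk.append(("R", victim, right_buckets.pop(victim, [])))
```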

Discussion 2 When this paper was written, perhaps it was okay to claim that “the sizes of most data integration queries are expected to be only moderately large”. But does this hold today, especially with the coming era of “cloud computing”? Specifically: 1. Do double pipelined hash joins seem efficient enough for today’s data? 2. Would you use double pipelined hash joins in non-data integration applications?

Summary Challenges for data integration and motivations for Tukwila General Tukwila architecture Interleaving of planning and execution Dynamic collector operator Double pipelined hash join

Eddies: Continuously Adaptive Query Processing Ron Avnur, Joseph M. Hellerstein University of California, Berkeley Previously presented by Hongrae Lee Modified by Ben Zhu, Yidan Liu Presenter: Ben Zhu Discussion Leader: Yidan Liu

Outline Introduction Reorderability of plans Rivers and Eddies Routing tuples in Eddies Summary

Static Query Processing Traditional query processing scheme 1. Optimizing a query 2. Executing a static query plan This traditional scheme is not appropriate for Large-scale, widely-distributed information resources or Massively parallel database systems!

New Requirements Increased complexity in large-scale system Hardware and workload Data User interface We want query execution plans To be reoptimized regularly during query processing To allow system to adapt dynamically to fluctuations in computing resources, data characteristics, and user preferences

Run-Time Fluctuations Three properties that vary during query processing The costs of operators Their selectivities The rates at which tuples arrive from the inputs The first and third issues commonly occur in wide-area environments; the second commonly arises due to correlations between predicates and the order of tuple delivery These issues may become more common in cluster (shared-nothing) systems

Discussion 1 "We favor adaptability over best-case performance" 1. Does this seem reasonable? In this case? In general? 2. Is adaptivity needed only when the best case is unattainable, or could it also be a general strategy in regular query processing? Do you think it is good or bad to apply it in traditional query processing? Please give reasons or examples to support your opinions.

Eddy

Two Challenges for This Scheme How can we reorder operators? Reorderability of plans How should we route tuples? Routing tuples in Eddies

Reorderability of Plans Synchronization Barriers One task waits for other tasks to finish Barriers limit concurrency, and hence performance It is desirable to minimize the overhead of synchronization barriers in a dynamic performance environment Issues affecting the overhead: the frequency of barriers and the gap between arrival times of the two inputs at the barrier

Reorderability of Plans Moments of Symmetry Points at a barrier where the order of the inputs to a join can often be changed without modifying any state in the join Allow reordering of the inputs to a single binary operator The combination of commutativity and moments of symmetry allows for very aggressive reordering of a plan tree

Join Algorithms and Reordering Constraints on reordering Unindexed join input is ordered before the indexed input Preserving the ordered inputs Some join algorithms work only for equijoins Join algorithms in Eddy We favor join algorithms with Frequent moments of symmetry Adaptive or nonexistent barriers Minimal ordering constraints → Rules out hybrid hash join, merge joins, and nested loops joins – Choice: Ripple Join Frequently-symmetric versions of traditional iteration, hashing and indexing schemes Favors adaptivity over best-case performance

Ripple Joins Ripple joins Have moments of symmetry at each “corner” of a rectangular ripple Are designed to allow changing rates for each input Offer attractive adaptivity features at a modest overhead in performance and memory footprint (Variants: block, index, and hash ripple joins)
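
A toy sketch of a square (block) ripple join, assuming equal input rates for simplicity: each step pulls one new tuple from each input and joins it against everything seen so far on the other side, so each completed square corner is a moment of symmetry at which the rates could be re-adjusted. The names and data are illustrative, not from the paper.

```python
def ripple_join(R, S, key=lambda t: t[0]):
    """Square block ripple join of iterables R and S on `key`."""
    seen_r, seen_s = [], []
    for r, s in zip(R, S):                 # equal rates, for simplicity
        # Join the new r against all of S seen so far, including new s.
        seen_s.append(s)
        for s2 in seen_s:
            if key(r) == key(s2):
                yield (r, s2)
        seen_r.append(r)
        # Join the new s against previously seen r tuples
        # (the (r, s) pair was already checked above).
        for r2 in seen_r[:-1]:
            if key(r2) == key(s):
                yield (r2, s)

R = [(1, "r1"), (2, "r2")]
S = [(2, "s1"), (1, "s2")]
print(sorted(ripple_join(R, S)))
# -> [((1, 'r1'), (1, 's2')), ((2, 'r2'), (2, 's1'))]
```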

Rivers and Eddies River A shared-nothing parallel query processing framework Pre-optimization A heuristic pre-optimizer must choose how to initially pair off relations into joins An eddy in the River Is implemented via a module in a river Encapsulates the scheduling of its participating operators Explicitly merges multiple unary and binary operators into a single n-ary operator within a query plan A tuple is associated with a tuple descriptor containing a vector of Ready and Done bits

Routing Tuples in Eddies An eddy module Directs the flow of tuples from the inputs through the various operators to the output Provides the flexibility to allow each tuple to be routed individually through the operators The routing policy determines the efficiency

Naïve eddy Tuples enter the eddy with low priority, and when returned to the eddy from an operator are given high priority → tuples flow completely through the eddy before new tuples Prevents being ‘clogged’ with new tuples Edges in a River DFG → fixed-size queue → back-pressure in a fluid flow Production along the input to any edge is limited by the rate of consumption at the output Tuples are routed to the low-cost operator first Cost-aware policy Selectivity-unaware policy

Learning Selectivity: Lottery Scheduling To track both Consumption (determined by cost) Production (determined by cost and selectivity) Lottery Scheduling Maintain ‘tickets’ for an operator An operator’s chance of receiving the tuple ∝ its count of tickets The eddy can track (learn) an ordering of the operators that gives good overall efficiency (Figure: ticket debits and credits flow between the eddy and each operator)
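
One reading of the ticket scheme, as a toy sketch (class and method names are illustrative): the eddy credits an operator a ticket each time it routes a tuple to it, and debits a ticket each time the operator returns a tuple, so fast consumers with low selectivity accumulate tickets and win more lotteries. The floor of one ticket is an assumption to keep every operator reachable.

```python
import random

class LotteryRouter:
    def __init__(self, operators, start=100):
        self.tickets = {op: start for op in operators}

    def on_route(self, op):
        # Eddy routes a tuple TO the operator: credit one ticket
        # (rewards fast consumption).
        self.tickets[op] += 1

    def on_return(self, op):
        # Operator returns a tuple to the eddy: debit one ticket
        # (penalizes high selectivity / high production).
        self.tickets[op] = max(1, self.tickets[op] - 1)

    def choose(self, eligible, rng=random):
        # Weighted lottery over ticket counts.
        total = sum(self.tickets[op] for op in eligible)
        pick = rng.uniform(0, total)
        for op in eligible:
            pick -= self.tickets[op]
            if pick <= 0:
                return op
        return eligible[-1]

# Hypothetical run: both operators consume 30 tuples, but "select_a"
# filters most out (returns 5) while "join_b" passes most through (25).
router = LotteryRouter(["select_a", "join_b"])
for _ in range(30):
    router.on_route("select_a")
for _ in range(5):
    router.on_return("select_a")
for _ in range(30):
    router.on_route("join_b")
for _ in range(25):
    router.on_return("join_b")
# select_a ends with more tickets, so it tends to see tuples first.
```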

Responding to Dynamic Fluctuations With a simple cumulative lottery, eddies could not adaptively react over time to changes in performance and data characteristics Use a window scheme instead of a point scheme Banked tickets for running the lottery Escrow tickets for measuring efficiency during the window At the beginning of the window, the value of the escrow account replaces the banked account, and the escrow account is reset It ensures that operators “re-prove themselves” each window
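
The banked/escrow window scheme might be sketched like this (illustrative names): lotteries draw on tickets banked in the previous window while new winnings accumulate in escrow, and at each window boundary the escrow balance replaces the bank and is reset.

```python
class WindowedTickets:
    def __init__(self, operators, start=100):
        self.banked = {op: start for op in operators}   # used in lotteries
        self.escrow = {op: 0 for op in operators}       # earned this window

    def reward(self, op, n=1):
        # Efficiency measured during the current window goes to escrow.
        self.escrow[op] += n

    def weight(self, op):
        # The lottery draws only on banked tickets.
        return self.banked[op]

    def roll_window(self):
        # At the window boundary, escrow replaces the bank and resets,
        # forcing operators to "re-prove themselves" each window.
        self.banked = self.escrow
        self.escrow = {op: 0 for op in self.banked}
```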

Some Experimental Results

Discussion 2 If you were to design an adaptive query processor, what could be the possible tradeoffs you need to balance? Which would you rather use: Tukwila or Eddies? Why?

Summary Eddies are A query processing mechanism that allows fine-grained, adaptive, online optimization Beneficial in unpredictable query processing environments Challenges To develop eddy ‘ticket’ policies that can be formally proved to converge quickly To attack the remaining static aspects To harness the parallelism and adaptivity available to us in rivers To explore the application of eddies and rivers to the generic space of dataflow programming

Thank you