Adaptive Query Processing Part I


Adaptive Query Processing Part I
Zachary G. Ives, University of Pennsylvania
CIS 650 – Database & Information Systems, October 20, 2008
Slides co-authored by Amol Deshpande & Vijayshankar Raman

Query Processing: Adapting to the World
- Data independence facilitates modern DBMS technology
  - Separates specification ("what") from implementation ("how")
  - Optimizer maps declarative query → algebraic operations
- Platforms, data, and conditions are constantly changing: query processing adapts the implementation strategy to runtime conditions
  - Static applications → dynamic environments: d(app)/dt << d(env)/dt
- A big part of the appeal of DBMS technology!

Query Optimization and Processing (As Established in System R [SAC+'79])
[Figure: catalog statistics for Professor, Course, Student – cardinalities, index lo/hi keys]
  > UPDATE STATISTICS
  > SELECT * FROM Professor P, Course C, Student S
    WHERE P.pid = C.pid AND S.sid = C.sid
- Plan selection via dynamic programming + pruning heuristics

Traditional Optimization Is Breaking
- In traditional settings:
  - Queries over many tables
  - Unreliability of traditional cost estimation
  - Success & maturity make problems more apparent, critical
- In new environments:
  - e.g., data integration, web services, streams, P2P, sensor nets, hosting
  - Unknown and dynamic characteristics for data and runtime
  - Increasingly aggressive sharing of resources and computation
  - Interactivity in query processing
- Note two distinct themes lead to the same conclusion:
  - Unknowns: even static properties often unknown in new environments, and often unknowable a priori
  - Dynamics: d(env)/dt can be very high
- Motivates intra-query adaptivity

A Call for Greater Adaptivity
- System R adapted query processing as stats were updated
  - Measurement/analysis: periodic
  - Planning/actuation: once per query
  - Improved through the late 90s (see [Graefe '93] [Chaudhuri '98]): better measurement, models, search strategies
- INGRES adapted execution many times per query
  - Each tuple could join with relations in a different order
  - Different plan space, overheads, frequency of adaptivity
  - Didn't match applications & performance at that time
- Recent work considers adaptivity in new contexts
- (Speaker note, ZI: the claim was made that both have zero-overhead actuation, but is that really true for INGRES?)
- Query Post-Mortem reveals different relational plan spaces:
  - System R is direct: each query mapped to a single relational algebra statement
  - INGRES' decision space generates a union of plans over horizontal partitions; this "super-plan" is not materialized – it is recomputed via FindMin

Low-Overhead Adaptivity: Non-pipelined Execution

Late Binding; Staged Execution
[Figure: pipelined plan – NLJ of A and R, followed by sorts and merge joins (MJ) with B and C; the sorts are materialization points]
- Normal execution: pipelines separated by materialization points, e.g., at a sort, GROUP BY, etc.
- Materialization points make natural decision points where the next stage can be changed with little cost:
  - Re-run the optimizer at each point to get the next stage
  - Choose among a precomputed set of plans – parametric query optimization [INSS'92, CG'94, …]

Mid-query Reoptimization [KD'98, MRS+04]
[Figure: the same plan, with the MJ/sort stage over B and C replaced by an HJ after a plan switch]
- Choose checkpoints at which to monitor cardinalities (Where?)
  - Balance overhead and opportunities for switching plans
- If the actual cardinality is too different from the estimate, re-optimize to switch to a new plan (When?)
  - Avoid unnecessary plan re-optimization (where the plan doesn't change)
- Try to maintain previous computation during plan switching (How?)

Where to Place Checkpoints?
[Figure: the plan annotated with a Lazy checkpoint above a sort (materialization point) and an Eager checkpoint inside the NLJ pipeline over A and R]
- More checkpoints → more opportunities for switching plans
- Overhead of (simple) monitoring is small [SLMK'01]
- Consideration: it is easier to switch plans at some checkpoints than others
  - Lazy checkpoints: placed above materialization points; no work need be wasted if we switch plans here
  - Eager checkpoints: can be placed anywhere; may have to discard some partially computed results
    - Useful where optimizer estimates have high uncertainty

When to Re-optimize?
- Suppose the actual cardinality differs from the estimate: how large a difference should trigger re-optimization?
- Idea: do not re-optimize if the current plan is still the best
  - Heuristics-based [KD'98]: e.g., re-optimize only when the re-optimization cost is small relative to the time left to finish execution
  - Validity range [MRS+04]: a precomputed range of a parameter (e.g., a cardinality) within which the plan remains optimal (a sketch follows below)
    - Place eager checkpoints where the validity range is narrow
    - Re-optimize if the value falls outside this range
    - Variation: bounding boxes [BBD'05]
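
A minimal sketch of the validity-range test described above, assuming a checkpoint carries its optimizer estimate and a precomputed [low, high] range; the class name, fields, and numbers are illustrative, not from [MRS+04]:

```python
# Hypothetical validity-range checkpoint (illustrative; not the actual [MRS+04] code).
from dataclasses import dataclass

@dataclass
class Checkpoint:
    name: str              # label for the monitored intermediate result
    estimated_card: float  # optimizer's cardinality estimate
    low: float             # validity range: the current plan remains optimal
    high: float            #   as long as low <= actual cardinality <= high

def should_reoptimize(cp: Checkpoint, actual_card: float) -> bool:
    """Trigger re-optimization only if the observed cardinality falls outside
    the precomputed validity range, i.e., the plan may no longer be optimal."""
    return not (cp.low <= actual_card <= cp.high)

# Invented numbers: estimate 1,000 tuples, validity range [200, 5,000].
cp = Checkpoint("sort-above-NLJ", estimated_card=1_000, low=200, high=5_000)
print(should_reoptimize(cp, actual_card=80_000))  # True  -> switch plans
print(should_reoptimize(cp, actual_card=1_500))   # False -> keep the current plan
```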

How to Reoptimize
- Getting a better plan:
  - Plug in the actual cardinality information acquired during this query (possibly as histograms), and re-run the optimizer
- Reusing work when switching to the better plan:
  - Treat fully computed intermediate results as materialized views – everything that is under a materialization point
  - Note: it is optional for the optimizer to use these in the new plan
- Other approaches are possible (e.g., query scrambling [UFA'98])
- Aside: it is not always easy to exploit cardinalities over intermediate results. E.g., given the actual |σ_ab(R)| and |σ_bc(R)|, what is the best estimate for |σ_abc(R)|? Recent work uses maximum entropy [MMKTHS'05]

Pipelined Execution Part I: Adaptive Selection Ordering

Adaptive Selection Ordering
- Complex predicates on single relations are common
  - e.g., on an employee relation:
    ((salary > 120000) AND (status = 2)) OR
    ((salary between 90000 and 120000) AND (age < 30) AND (status = 1)) OR …
- Selection ordering problem: decide the order in which to evaluate the individual predicates against the tuples
- We focus on conjunctive predicates (containing only ANDs)
- Example query:
    select * from R
    where R.a = 10 and R.b < 20 and R.c like '%name%';

Basics: Static Optimization
- Find a single order of the selections to be used for all tuples
- Query:
    select * from R
    where R.a = 10 and R.b < 20 and R.c like '%name%';
- Query plans considered: pipelines applying the three selections to R in some order, e.g.
    R → [R.a = 10] → [R.b < 20] → [R.c like …] → result
    R → [R.b < 20] → [R.c like …] → [R.a = 10] → result
- 3! = 6 distinct plans possible

Static Optimization
- Cost metric: CPU instructions
- Computing the cost of a plan: need to know the costs and the selectivities of the predicates
    R → [R.a = 10] → R1 → [R.b < 20] → R2 → [R.c like …] → R3 → result
    costs: c1, c2, c3    selectivities: s1, s2, s3
    cost per tuple: c1 + s1*c2 + s1*s2*c3 (independence assumption)
    cost(plan) = |R| * (c1 + s1*c2 + s1*s2*c3)
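
The cost formula above is easy to verify numerically. A small sketch, with made-up costs and selectivities (not from the slides), showing how the per-tuple cost accumulates under the independence assumption and how the order changes the total:

```python
# Worked example of cost(plan) = |R| * (c1 + s1*c2 + s1*s2*c3) under independence.
def plan_cost(num_tuples, costs, selectivities):
    """Expected cost of applying selections in the given order: each predicate
    is evaluated only on the fraction of tuples that passed all earlier ones."""
    per_tuple, pass_fraction = 0.0, 1.0
    for c, s in zip(costs, selectivities):
        per_tuple += pass_fraction * c   # tuples reaching this predicate pay its cost
        pass_fraction *= s               # fraction surviving it
    return num_tuples * per_tuple

# Invented numbers: |R| = 1,000,000, costs (1, 2, 10), selectivities (0.1, 0.5, 0.9).
print(plan_cost(1_000_000, (1, 2, 10), (0.1, 0.5, 0.9)))   # 1,700,000.0
print(plan_cost(1_000_000, (10, 2, 1), (0.9, 0.5, 0.1)))   # 12,250,000.0 – same predicates, worse order
```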

Static Optimization
- Rank ordering algorithm for independent selections [IK'84]:
  - Apply the predicates in decreasing order of rank (1 – s) / c, where s = selectivity, c = cost
- For correlated selections: NP-hard under several different formulations, e.g., when given a random sample of the relation
- Greedy algorithm, shown to be 4-approximate [BMMNW'04] (sketched below):
  - Apply the selection with the highest (1 – s) / c
  - Compute the selectivities of the remaining selections over the result (conditional selectivities)
  - Repeat
- Conditional plans? [DGHM'05]
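
Hedged sketches of the two orderings named above: rank ordering for independent selections and the greedy 4-approximation for correlated ones. The predicate functions, sample tuples, and the sample-based conditional-selectivity estimate are illustrative assumptions, not the papers' code:

```python
# Illustrative sketches of the two static ordering algorithms (not the papers' code).

def rank_order(preds):
    """[IK'84], independent selections: evaluate in decreasing rank (1 - s) / c.
    preds is a list of (predicate_fn, cost, selectivity) triples."""
    return sorted(preds, key=lambda p: (1 - p[2]) / p[1], reverse=True)

def greedy_order(pred_fns, costs, sample):
    """Greedy 4-approximation [BMMNW'04] for correlated selections: repeatedly pick
    the predicate with the highest (1 - s) / c, where s is its *conditional*
    selectivity over the sample tuples that survived the predicates chosen so far."""
    order, remaining, survivors = [], list(range(len(pred_fns))), list(sample)
    while remaining:
        def cond_sel(i):
            if not survivors:
                return 0.0
            return sum(pred_fns[i](t) for t in survivors) / len(survivors)
        best = max(remaining, key=lambda i: (1 - cond_sel(i)) / costs[i])
        order.append(best)
        survivors = [t for t in survivors if pred_fns[best](t)]
        remaining.remove(best)
    return order

# Toy usage on (a, b, c) tuples; predicates mirror the example query above.
preds = [lambda t: t[0] == 10, lambda t: t[1] < 20, lambda t: "name" in t[2]]
sample = [(10, 5, "a name"), (10, 25, "x"), (3, 5, "name!"), (10, 5, "name")]
print(greedy_order(preds, costs=[1, 1, 1], sample=sample))       # e.g., [0, 1, 2]
print([f"{(1 - s) / c:.2f}" for _, c, s in
       rank_order([(preds[0], 1, 0.05), (preds[1], 1, 0.1), (preds[2], 1, 0.2)])])
```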

Adaptive Greedy [BMMNW'04]
- Context: pipelined query plans over streaming data
- Example: three independent predicates, each costing 1 unit
    R.a = 10 (initial estimated selectivity 0.05), R.b < 20 (0.1), R.c like … (0.2)
- Optimal execution plan orders by selectivities (because the costs are identical):
    R → [R.a = 10] → R1 → [R.b < 20] → R2 → [R.c like …] → R3 → result

Adaptive Greedy [BMMNW'04]
- Monitor the selectivities over the recent past (sliding window)
- Re-optimize if the predicates are not ordered by selectivities
[Figure: the pipelined plan R → [R.a = 10] → [R.b < 20] → [R.c like …] → result, with a Profile built by randomly sampling tuples and a Reoptimizer module]
- Randomly sample tuples into a Profile; estimate the selectivities of the predicates over the tuples of the profile
- Reoptimizer: IF the current plan is not optimal w.r.t. these new selectivities THEN re-optimize using the Profile (a sketch follows below)
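
A compressed sketch of the control loop above, assuming a fixed sampling rate and a sliding-window profile: sampled tuples are evaluated against all predicates, selectivities are estimated over the window, and the order is recomputed when it is no longer sorted by selectivity. This simplification monitors only marginal selectivities (unit costs); conditional selectivities are covered on the next slide:

```python
# Illustrative A-Greedy-style monitor (simplified; not the exact [BMMNW'04] algorithm).
import random
from collections import deque

class AGreedySketch:
    def __init__(self, pred_fns, sample_rate=0.01, window=1000):
        self.preds = pred_fns
        self.order = list(range(len(pred_fns)))  # current evaluation order
        self.sample_rate = sample_rate
        self.profile = deque(maxlen=window)      # sliding window over sampled tuples

    def process(self, tuple_):
        if random.random() < self.sample_rate:
            # Profiled tuple: evaluated against ALL predicates (this is the extra overhead).
            row = tuple(bool(p(tuple_)) for p in self.preds)
            self.profile.append(row)
            self._maybe_reoptimize()
            return tuple_ if all(row) else None
        # Normal path: short-circuit evaluation in the current order.
        for i in self.order:
            if not self.preds[i](tuple_):
                return None
        return tuple_

    def _selectivity(self, i):
        if not self.profile:
            return 1.0
        return sum(row[i] for row in self.profile) / len(self.profile)

    def _maybe_reoptimize(self):
        # Simplified violation test: with unit costs the order should be by increasing
        # (marginal) selectivity; the real algorithm also checks conditional selectivities.
        sels = [self._selectivity(i) for i in self.order]
        if any(sels[k] > sels[k + 1] for k in range(len(sels) - 1)):
            self.order.sort(key=self._selectivity)

# Usage: ag = AGreedySketch(preds); kept = [t for t in stream if ag.process(t) is not None]
```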

Adaptive Greedy [BMMNW'04]: Correlated Selections
- Must monitor conditional selectivities
- Over the randomly sampled Profile, monitor:
  - selectivities sel(R.a = 10), sel(R.b < 20), sel(R.c like …)
  - conditional selectivities such as
    sel(R.b < 20 | R.a = 10)
    sel(R.c like … | R.a = 10)
    sel(R.c like … | R.a = 10 and R.b < 20)
- Reoptimizer: uses conditional selectivities to detect violations; uses the profile to re-optimize
- O(n²) selectivities need to be monitored
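
A small sketch of the conditional-selectivity bookkeeping above, estimated from the same kind of sampled profile; the profile format (one row of booleans per sampled tuple) and the example values are assumptions for illustration:

```python
# Sketch: estimating the conditional selectivities A-Greedy monitors from a sampled
# profile, where each row records which predicates one sampled tuple satisfied.
def conditional_selectivity(profile, i, given):
    """sel(pred_i | all predicates in `given` satisfied), estimated over the profile."""
    matching = [row for row in profile if all(row[j] for j in given)]
    if not matching:
        return 1.0  # no evidence in the profile; fall back to a default
    return sum(row[i] for row in matching) / len(matching)

# Invented profile over (R.a = 10, R.b < 20, R.c like '%name%'):
profile = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1), (1, 1, 1)]
print(conditional_selectivity(profile, 1, given=[0]))     # sel(R.b < 20 | R.a = 10) = 0.75
print(conditional_selectivity(profile, 2, given=[0, 1]))  # sel(R.c like ... | R.a = 10, R.b < 20) ≈ 0.67
```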

Adaptive Greedy [BMMNW'04]
- Advantages:
  - Can adapt very rapidly
  - Handles correlations
  - Theoretical guarantees on performance [MBMW'05] – not known for any other AQP algorithms
- Disadvantages: may have high runtime overheads
  - Profile maintenance: must evaluate a (random) fraction of tuples against all operators
  - Detecting optimality violations
  - Reoptimization cost: can require multiple passes over the profile

Eddies [AH'00]
- Query processing as routing of tuples through operators
[Figure: a traditional pipelined plan R → [R.a = 10] → [R.b < 20] → [R.c like …] → result, versus pipelined execution using an eddy that sits between R, the three selection operators, and the result]
- An eddy operator:
  - Intercepts tuples from sources and output tuples from operators
  - Executes the query by routing source tuples through operators
  - Encapsulates all aspects of adaptivity in a "standard" dataflow operator: measure, model, plan & actuate

Eddies [AH'00]: Routing an R Tuple
[Animation frames: tuples r1 = (a: 15, b: 10, c: "name") and r2 = (a: 10, b: 15, c: "name") flowing between the eddy and the operators R.a = 10, R.b < 20, R.c like …]
- Each tuple carries two bit vectors, one bit per operator, used to decide the validity and need of applying operators:
  - ready bit i: 1 → operator i can be applied; 0 → operator i can't be applied
  - done bit i: 1 → operator i has been applied; 0 → operator i hasn't been applied
- For a query with only selections, ready = complement(done)
- r1 enters with ready = 111, done = 000; the eddy routes it to R.b < 20, which it satisfies, so it returns with ready = 101, done = 010; routed next to R.a = 10, it does not satisfy the predicate and is discarded, and the eddy looks at the next tuple
- r2 satisfies every operator it is routed to; once done = 111, it is sent to the output

Eddies [AH'00]
- Adapting the order is easy:
  - Just change the operators to which tuples are sent
  - Can be done on a per-tuple basis
  - Can be done in the middle of a tuple's "pipeline"
- How are the routing decisions made? Using a routing policy (a toy eddy is sketched below)
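
A toy selections-only eddy in the spirit of the figures above: each tuple carries ready/done bit vectors and the eddy picks its next operator per tuple through a pluggable routing policy. This is an illustrative sketch, not the Telegraph implementation; the class and policy names are invented:

```python
# Toy selections-only eddy (illustrative; not the Telegraph/[AH'00] implementation).
class Eddy:
    def __init__(self, operators, routing_policy):
        self.ops = operators             # list of predicate functions
        self.route = routing_policy      # (tuple, ready, done) -> operator index
        self.policy_feedback = None      # optional hook for adaptive policies

    def process(self, tuple_):
        n = len(self.ops)
        done = [False] * n
        ready = [True] * n               # selections only: ready = complement(done)
        while not all(done):
            i = self.route(tuple_, ready, done)   # per-tuple routing decision
            passed = self.ops[i](tuple_)
            done[i], ready[i] = True, False
            if self.policy_feedback:
                self.policy_feedback(i, passed)
            if not passed:
                return None              # tuple eliminated; move on to the next tuple
        return tuple_                    # done = 1...1: send to the output

# Simplest policy: always pick the lowest-numbered operator that is still ready.
naive_policy = lambda t, ready, done: next(i for i, r in enumerate(ready) if r)

preds = [lambda t: t[0] == 10, lambda t: t[1] < 20, lambda t: "name" in t[2]]
eddy = Eddy(preds, naive_policy)
print(eddy.process((15, 10, "name")))   # None: fails R.a = 10
print(eddy.process((10, 15, "name")))   # (10, 15, 'name'): passes all three
```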

Routing Policies that Have Been Studied
- Deterministic [D03]:
  - Monitor costs & selectivities continuously
  - Re-optimize periodically using rank ordering (or A-Greedy for correlated predicates)
- Lottery scheduling [AH00]:
  - Each operator runs in a thread with an input queue
  - "Tickets" assigned according to tuples input / output
  - Route a tuple to the next eligible operator with room in its queue, based on the number of "tickets" and "backpressure" (sketched below)
- Content-based routing [BBDW05]:
  - Different routes for different plans based on attribute values
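
A rough sketch of the lottery-scheduling idea above, assuming the toy eddy from the previous sketch: an operator gains a ticket when the eddy hands it a tuple and gives one back when it passes the tuple on, so cheap, selective operators accumulate tickets and win the routing lottery more often. Per-operator threads, queues, and backpressure are omitted:

```python
import random

class LotteryPolicy:
    """Simplified lottery-scheduling router in the spirit of [AH00]: gain a ticket when
    handed a tuple, give one back when the tuple passes, so fast/selective operators
    accumulate tickets. Per-operator threads, queues, and backpressure are omitted."""
    def __init__(self, num_ops):
        self.tickets = [1] * num_ops     # everyone starts with one ticket

    def __call__(self, tuple_, ready, done):
        eligible = [i for i, r in enumerate(ready) if r]
        choice = random.choices(eligible, weights=[self.tickets[i] for i in eligible], k=1)[0]
        self.tickets[choice] += 1        # ticket earned for consuming the tuple
        return choice

    def on_result(self, i, passed):
        if passed:                       # tuple returned to the eddy: hand a ticket back
            self.tickets[i] = max(1, self.tickets[i] - 1)

# Plugging it into the toy eddy from the previous sketch (hypothetical wiring):
#   policy = LotteryPolicy(len(preds)); eddy = Eddy(preds, policy); eddy.policy_feedback = policy.on_result
```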

Next Time…
- For joins, there is a very peculiar characteristic: cost goes up with state size
- We can make short-term decisions that create new state but seem efficient initially
- Later, that state can make EVERYTHING more expensive!