L4.2.2. Distributed Query Optimization Algorithms: System R and R*, Hill Climbing and SDD-1

Presentation transcript:

Distributed Query Optimization Algorithms
• System R and R*
• Hill Climbing and SDD-1

System R (Centralized) Algorithm
• Simple (one-relation) queries are executed according to the best access path.
• Joins are executed in three steps:
  - Determine the possible orderings of the joins
  - Determine the cost of each ordering
  - Choose the join ordering with the minimal cost
• For joins, two join methods are considered:
  - Nested loops
  - Merge join
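The three join steps above (enumerate orderings, cost each one, keep the cheapest) can be sketched as a brute-force search. This is only an illustration: the relations R, S, and T, their cardinalities, the selectivities, and the toy cost model (sum of intermediate result sizes) are all hypothetical, and the sketch does not model access paths or join methods.

```python
from itertools import permutations

# Hypothetical statistics, not taken from the slides.
CARD = {"R": 1000, "S": 100, "T": 10}
SEL = {frozenset(("R", "S")): 0.01, frozenset(("S", "T")): 0.1}

def order_cost(order, card, selectivity):
    """Toy cost of a left-deep join order: sum of intermediate result sizes."""
    joined = {order[0]}
    rows = card[order[0]]
    total = 0.0
    for rel in order[1:]:
        # Combine the selectivities of all predicates linking rel to the
        # relations already joined; 1.0 means a Cartesian product.
        sel = 1.0
        for other in joined:
            sel *= selectivity.get(frozenset((rel, other)), 1.0)
        rows = rows * card[rel] * sel
        total += rows
        joined.add(rel)
    return total

def best_order(card, selectivity):
    """Enumerate every ordering and keep the one with minimal cost."""
    return min(permutations(card), key=lambda o: order_cost(o, card, selectivity))

best = best_order(CARD, SEL)
print(best, order_cost(best, CARD, SEL))
```

With these made-up numbers, the orderings that start with a Cartesian product are by far the most expensive, which is why an optimizer tries to avoid them.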

System R Algorithm -- Example
Query: names of employees working on the CAD/CAM project
• Assume
  - EMP has an index on ENO
  - ASG has an index on PNO
  - PROJ has an index on PNO and an index on PNAME

System R Algorithm -- Example
• Choose the best access path to each relation
  - EMP: sequential scan (no selection on EMP)
  - ASG: sequential scan (no selection on ASG)
  - PROJ: index on PNAME (there is a selection on PROJ based on PNAME)
• Determine the best join ordering
  - EMP ⋈ ASG ⋈ PROJ
  - ASG ⋈ PROJ ⋈ EMP
  - PROJ ⋈ ASG ⋈ EMP
  - ASG ⋈ EMP ⋈ PROJ
  - EMP × PROJ ⋈ ASG
  - PROJ × EMP ⋈ ASG
  - Select the best ordering based on the join costs evaluated according to the two join methods
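The six candidate orderings are the permutations of the three relations. A minimal sketch follows, assuming join predicates only between EMP and ASG and between ASG and PROJ (the `JOIN_EDGES` set is that assumption). It shows why two of the orderings begin with a Cartesian product. The function name `render` is illustrative.

```python
from itertools import permutations

# Assumed join graph: EMP-ASG and ASG-PROJ are the only join predicates.
JOIN_EDGES = {frozenset(("EMP", "ASG")), frozenset(("ASG", "PROJ"))}

def render(order):
    """Render a left-deep order, writing 'x' where the next relation shares
    no join predicate with any relation already joined."""
    joined = [order[0]]
    text = order[0]
    for rel in order[1:]:
        connected = any(frozenset((rel, j)) in JOIN_EDGES for j in joined)
        text += (" JOIN " if connected else " x ") + rel
        joined.append(rel)
    return text

ORDERINGS = [render(p) for p in permutations(("EMP", "ASG", "PROJ"))]
for line in ORDERINGS:
    print(line)
```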

System R Example (cont'd)
• Best total join order is one of
  - (ASG ⋈ EMP) ⋈ PROJ
  - (PROJ ⋈ ASG) ⋈ EMP
[Figure: tree of alternative join orders over EMP, ASG, and PROJ]

System R Algorithm
• (PROJ ⋈ ASG) ⋈ EMP has a useful index on the select attribute and direct access to the join attributes of ASG and EMP.
• Final plan:
  - Select PROJ using the index on PNAME
  - Then join with ASG using the index on PNO
  - Then join with EMP using the index on ENO

System R* Distributed Query Optimization
• Total-cost minimization: the cost function includes local processing as well as transmission.
• Algorithm
  - For each relation in the query tree, find the best access path
  - For the join of n relations, find the optimal join ordering strategy
  - Each local site optimizes its local query processing

Data Transfer Strategies
• Ship-whole: the entire relation is shipped and stored as a temporary relation. If the merge join algorithm is used, temporary storage is not needed and the join can be done in pipeline mode.
• Fetch-as-needed: tuples of the inner relation are fetched only as required; this is equivalent to a semijoin of the inner relation with each outer-relation tuple.

Join Strategy 1
• Join of the outer (external) relation R with the inner (internal) relation S. Let LC be the local processing cost, CC the data transfer cost, and s the average number of tuples of S that match one tuple of R.
• Strategy 1: ship the entire outer relation to the site of the inner relation
  TC = LC(get R) + CC(size(R)) + LC(get s tuples from S) * card(R)

Join Strategy 2
• Ship the entire inner relation to the site of the outer relation
  TC = LC(get S) + CC(size(S)) + LC(store S) + LC(get R) + LC(get s tuples from S) * card(R)

Join Strategy 3
• Fetch tuples of the inner relation as needed for each tuple of the outer relation (A is the join attribute)
  TC = LC(get R) + CC(len(A)) * card(R) + LC(get s tuples from S) * card(R) + CC(s * len(S)) * card(R)

Join Strategy 4
• Move both relations to a third site and join there
  TC = LC(get R) + LC(get S) + CC(size(S)) + LC(store S) + CC(size(R)) + LC(get s tuples from S) * card(R)
• Conceptually, the algorithm does an exhaustive search among all alternatives and selects the one that minimizes total cost.
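To compare the four strategies side by side, the TC formulas can be evaluated directly. The sketch below is only illustrative: the linear forms chosen for LC and CC, their constants, and the statistics in `STATS` are assumptions, not values from the slides. Only the structure of each total-cost expression follows the formulas above.

```python
from dataclasses import dataclass

@dataclass
class JoinStats:
    card_r: int   # tuples in the outer relation R
    len_r: int    # bytes per R tuple
    card_s: int   # tuples in the inner relation S
    len_s: int    # bytes per S tuple
    len_a: int    # bytes of the join attribute A
    s: float      # average number of S tuples matching one R tuple

def lc(nbytes: float) -> float:
    """Assumed local processing cost: 1 unit per byte."""
    return 1.0 * nbytes

def cc(nbytes: float) -> float:
    """Assumed transfer cost: fixed message overhead plus 2 units per byte."""
    return 10.0 + 2.0 * nbytes

def total_costs(st: JoinStats) -> dict:
    size_r = st.card_r * st.len_r
    size_s = st.card_s * st.len_s
    probe = lc(st.s * st.len_s)  # LC(get s tuples from S)
    return {
        "ship outer": lc(size_r) + cc(size_r) + probe * st.card_r,
        "ship inner": lc(size_s) + cc(size_s) + lc(size_s) + lc(size_r)
                      + probe * st.card_r,
        "fetch as needed": lc(size_r) + cc(st.len_a) * st.card_r
                           + probe * st.card_r
                           + cc(st.s * st.len_s) * st.card_r,
        "third site": lc(size_r) + lc(size_s) + cc(size_s) + lc(size_s)
                      + cc(size_r) + probe * st.card_r,
    }

# Hypothetical statistics: a small outer relation and a large inner relation.
STATS = JoinStats(card_r=100, len_r=10, card_s=1000, len_s=20, len_a=4, s=2.0)
COSTS = total_costs(STATS)
CHEAPEST = min(COSTS, key=COSTS.get)
print(COSTS, CHEAPEST)
```

Under this cost model, shipping the small outer relation is cheapest. Moving both relations to a third site costs more than either ship-whole strategy here, because it pays for both transfers.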

Hill Climbing Algorithm
• Inputs: query graph, locations of relations, and relation statistics
• Initial solution: for each candidate result site, cost the strategy that ships all relations to that site; the least costly one is denoted ES0, and its site is the chosen site
• Split ES0 into
  - ES1: ship one relation of a join to the site of the other relation
  - ES2: these two relations are joined locally and the result is transmitted to the chosen site
• If cost(ES1) + cost(ES2) + LC > cost(ES0), select ES0; otherwise select ES1 and ES2
• The process can be applied recursively to ES1 and ES2 until no further benefit is obtained

Hill Climbing Algorithm - Example
Query: Π_SAL(σ_PNAME="CAD/CAM"(PROJ) ⋈_PNO ASG ⋈_ENO EMP ⋈_TITLE PAY)
• Ignore the local processing cost
• Length of tuples is 1 for all relations
• Relation sizes and locations:
  - Site 1: EMP (8)
  - Site 2: PAY (4)
  - Site 3: PROJ (1)
  - Site 4: ASG (10)
• ES0 cost =
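The choice of the initial solution can be sketched for this example. The sketch assumes that shipping a relation costs exactly its size, which follows from tuple length 1 and ignored local processing, and that there is no per-message overhead. The function name `initial_solution` is illustrative.

```python
def initial_solution(size_at_site):
    """For each candidate result site, cost = total size of the relations
    shipped in from all other sites; return the cheapest site and all costs."""
    total = sum(size_at_site.values())
    costs = {site: total - size for site, size in size_at_site.items()}
    best_site = min(costs, key=costs.get)
    return best_site, costs

# Relation sizes per site as given in the example.
SITES = {"Site1": 8, "Site2": 4, "Site3": 1, "Site4": 10}
BEST_SITE, CANDIDATE_COSTS = initial_solution(SITES)
print(BEST_SITE, CANDIDATE_COSTS)
```

Under these assumptions the cheapest candidate is the site that already holds the largest relation, since everything else must be moved to it.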

HCA - Example
[Figure: ES0 compared with Solution 1 and Solution 2, each splitting the strategy into ES1, ES2, and ES3 over Site 1 EMP(8), Site 2 PAY(4), Site 3 PROJ(1), and Site 4 ASG(10)]
• ES0 is the "BEST"

Hill Climbing Algorithm - Comments
• Greedy algorithm: determines an initial feasible solution and iteratively tries to improve it.
• If there are local minima, it may not find the global minimum.
• If the optimal solution has a high initial cost, it will not be found, since it will not be chosen as the initial feasible solution.
[Figure: Site 1 EMP(8), Site 2 PAY(4), Site 3 PROJ(1), Site 4 ASG(10)]

SDD-1 Algorithm
• The SDD-1 algorithm generalizes the hill-climbing algorithm to determine an ordering of beneficial semijoins; it uses statistics on the database, called database profiles.
• Cost of a semijoin:
  Cost(R SJ_A S) = C_MSG + C_TR * size(Π_A(S))
• Benefit is the cost of transferring the irrelevant tuples of R that the semijoin eliminates:
  Benefit(R SJ_A S) = (1 - SF_SJ(S.A)) * size(R) * C_TR
• A semijoin is beneficial if cost < benefit.
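The cost and benefit formulas translate directly into two small functions. In this sketch the default values C_MSG = 0 and C_TR = 1 and the sample numbers are assumptions chosen for illustration; the function names are hypothetical.

```python
def semijoin_cost(proj_size, c_msg=0.0, c_tr=1.0):
    """Cost(R SJ_A S) = C_MSG + C_TR * size(PI_A(S))."""
    return c_msg + c_tr * proj_size

def semijoin_benefit(sf, size_r, c_tr=1.0):
    """Benefit(R SJ_A S) = (1 - SF_SJ(S.A)) * size(R) * C_TR."""
    return (1 - sf) * size_r * c_tr

def is_beneficial(sf, size_r, proj_size, c_msg=0.0, c_tr=1.0):
    """A semijoin is beneficial if its cost is lower than its benefit."""
    return semijoin_cost(proj_size, c_msg, c_tr) < semijoin_benefit(sf, size_r, c_tr)

# Hypothetical numbers: a selective semijoin with a small shipped column.
print(semijoin_benefit(0.25, 1000), semijoin_cost(100), is_beneficial(0.25, 1000, 100))
# Hypothetical numbers: an unselective semijoin with a large shipped column.
print(is_beneficial(0.9, 1000, 400))
```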

SDD-1: The Algorithm
• The initialization phase generates all beneficial semijoins and an execution strategy that includes only local processing.
• The most beneficial semijoin is selected; statistics are modified and new beneficial semijoins are selected.
• The above step is repeated until no beneficial semijoins are left.
• Assembly site selection chooses the site at which to perform local operations.
• Postoptimization removes unnecessary semijoins.

SDD-1 - Example
SELECT *
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
[Figure: join graph with EMP at Site 1, ASG at Site 2, and PROJ at Site 3; EMP and ASG are joined on ENO, ASG and PROJ on PNO]

SDD-1 - First Iteration
• SJ1: ASG SJ EMP, benefit = (1 - 0.3) * 3000 = 2100, cost = 120
• SJ2: ASG SJ PROJ, benefit = (1 - 0.4) * 3000 = 1800, cost = 200
• SJ3: EMP SJ ASG, benefit = (1 - 0.8) * 1500 = 300, cost = 400
• SJ4: PROJ SJ ASG, benefit = 0, cost = 400
• SJ1 is selected
• ASG is reduced: ASG' = ASG SJ EMP, with size 3000 * 0.3 = 900
• The semijoin selectivity factors of ASG' (abbreviated G') are reduced; they are approximated by
  - SF_SJ(G'.ENO) = 0.8 * 0.3 = 0.24
  - SF_SJ(G'.PNO) = 1.0 * 0.3 = 0.3
  - size(G'.ENO) = 400 * 0.3 = 120
  - size(G'.PNO) = 120

SDD-1 - Second & Third Iterations
Second iteration
• SJ2: ASG' SJ PROJ, benefit = (1 - 0.4) * 900 = 540, cost = 200
• SJ3: EMP SJ ASG', benefit = (1 - 0.24) * 1500 = 1140, cost = 120
• SJ4: PROJ SJ ASG', benefit = (1 - 0.3) * 2000 = 1400, cost = 120
→ SJ4 is selected
  - PROJ' = PROJ SJ ASG'
  - size(PROJ') = 2000 * 0.3 = 600
  - SF_SJ(J') = 0.4 * 0.3 = 0.12
  - size(J'.PNO) = 200 * 0.3 = 60
Third iteration
• SJ2: ASG' SJ PROJ', benefit = (1 - 0.12) * 900 = 792, cost = 60
• SJ3: EMP SJ ASG', benefit = (1 - 0.24) * 1500 = 1140, cost = 120
→ SJ3 is selected; it reduces the size of EMP (E) to 1500 * 0.24 = 360
→ Finally, SJ2 is selected, reducing the size of ASG (G) to 108
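The iterations can be replayed with a small greedy loop. This is a sketch under several stated assumptions rather than the SDD-1 implementation. It assumes C_MSG = 0 and C_TR = 1, so that a semijoin costs the size of the shipped join column. The profile values are read off the benefit and cost figures of the example. "Most beneficial" is interpreted as largest benefit minus cost, and each semijoin is applied at most once. The statistics update multiplies the reduced relation's size, selectivity factors, and column sizes by the applied selectivity factor, mirroring the approximation used in the example. Postoptimization is not modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Semijoin:
    name: str
    target: str   # relation being reduced (R in R SJ_A S)
    source: str   # relation whose join column is shipped (S)
    attr: str     # join attribute A

def run_sdd1(size, sf, col_size, candidates):
    """Repeatedly apply the beneficial semijoin with the largest net gain."""
    size, sf, col_size = dict(size), dict(sf), dict(col_size)
    remaining = list(candidates)
    order = []
    while remaining:
        scored = []
        for sj in remaining:
            key = (sj.source, sj.attr)
            benefit = (1 - sf[key]) * size[sj.target]
            cost = col_size[key]            # C_MSG = 0, C_TR = 1 (assumed)
            if cost < benefit:              # keep only beneficial semijoins
                scored.append((benefit - cost, sj))
        if not scored:
            break
        _, best = max(scored, key=lambda pair: pair[0])
        factor = sf[(best.source, best.attr)]
        size[best.target] *= factor
        for key in sf:                      # update the reduced relation's profile
            if key[0] == best.target:
                sf[key] *= factor
                col_size[key] *= factor
        order.append(best.name)
        remaining.remove(best)
    return order, size

SIZE = {"EMP": 1500, "ASG": 3000, "PROJ": 2000}
SF = {("EMP", "ENO"): 0.3, ("ASG", "ENO"): 0.8,
      ("ASG", "PNO"): 1.0, ("PROJ", "PNO"): 0.4}
COL_SIZE = {("EMP", "ENO"): 120, ("ASG", "ENO"): 400,
            ("ASG", "PNO"): 400, ("PROJ", "PNO"): 200}
CANDIDATES = [
    Semijoin("SJ1", "ASG", "EMP", "ENO"),
    Semijoin("SJ2", "ASG", "PROJ", "PNO"),
    Semijoin("SJ3", "EMP", "ASG", "ENO"),
    Semijoin("SJ4", "PROJ", "ASG", "PNO"),
]

ORDER, FINAL_SIZE = run_sdd1(SIZE, SF, COL_SIZE, CANDIDATES)
print(ORDER, FINAL_SIZE)
```

Under these assumptions the loop selects SJ1, SJ4, SJ3, and then SJ2, and ends with the reduced relation sizes reported in the example.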

Local Optimization
• Each site optimizes the plan to be executed at that site
• This is a centralized query optimization problem

SDD-1 - Assembly Site Selection
• After reduction
  - EMP is at site 1 with size 360
  - ASG is at site 2 with size 108
  - PROJ is at site 3 with size 600
→ Site 3 is chosen as the assembly site, since it holds the largest amount of data
• SJ4 is removed in postoptimization.
• Resulting strategy:
  - (ASG SJ EMP) SJ PROJ → site 3
  - (EMP SJ ASG) → site 3
  - Join at site 3