Query optimization in distributed database systems.

Presentation transcript:

Query optimization in distributed database systems

2 Framework for query optimization
The selection of a query processing strategy involves:
– determining the physical copies of the fragments upon which to execute the query
– selecting the order of execution of the operations; in particular, this involves determining a "good" sequence of joins
– selecting the method for executing each operation

3 Transmission cost
Transmission requirements are neutral with respect to systems; they are typically a function of the amount of data transmitted among sites.
The optimization of a distributed query can be partitioned into two independent problems:
– the distribution of the access strategy among sites, which is done considering transmission only
– the determination of local access strategies at each site, which uses the traditional methods of centralized databases
Transmission cost: TC(x) = C0 + C1 * x, where x is the amount of data transmitted, C0 is the fixed cost of initiating a transmission between two sites, and C1 is the cost per unit of data transmitted
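
The transmission cost formula can be sketched in a few lines of Python. This is only an illustration: the function name `transmission_cost` and the constants chosen for C0 and C1 are hypothetical and are not taken from the slides.

```python
def transmission_cost(x, c0=100, c1=2):
    """TC(x) = C0 + C1 * x: cost of shipping x units of data between two sites.

    c0 and c1 are made-up illustrative constants, not measured values.
    """
    if x < 0:
        raise ValueError("the amount of data must be non-negative")
    return c0 + c1 * x


# Shipping a whole relation: the amount of data is card(R) * size(R).
card_r, size_r = 10_000, 50          # hypothetical profile of R
cost = transmission_cost(card_r * size_r)
print(cost)
```

Because C0 is paid once per transmission, one large transfer is cheaper under this model than many small ones carrying the same total amount of data.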

4 Database Profile
Database profile:
– the number of tuples in each relation Ri: card(Ri)
– the size of each attribute A: size(A)
– the size of Ri, size(Ri), which is the sum of the sizes of its attributes
– for each attribute A in each relation Ri: the number of distinct values appearing in Ri, val(A[Ri]), and the maximum and minimum values
[Figure: two local database systems, LDBS1 and LDBS2; labels: Supply1, Dept1, Supply2, Dept2]

5 Database Profile
Supply: card(Supply) = [missing]
Dept: card(Dept) = 30

6 Database Profile
Supply1: card(Supply1) = [missing], site(Supply1) = 1
Dept1: card(Dept1) = 10, site(Dept1) = 2

7 Profile of partial results of algebraic operations - SELECTION
Let S denote the result of performing a unary operation over a relation R.
Cardinality: to each selection we associate a selectivity factor ρ, which is the fraction of tuples satisfying it. For a simple selection attribute = value (A = v), ρ can be estimated as ρ = 1/val(A[R]), under the assumption that the values are uniformly distributed. Thus card(S) = ρ * card(R)

8 Profile of partial results of algebraic operations - SELECTION
Size: selection does not affect the size of relations: size(S) = size(R)
Distinct values: depends on the selection criterion. Consider an attribute B which is not used in the selection formula. val(B[S]) can be determined by analogy with the following problem: given n = card(R) objects uniformly distributed over m = val(B[R]) colors, how many different colors c = val(B[S]) are selected if we take just r = card(S) objects?

9 Profile of partial results of algebraic operations - SELECTION
Yao approximation:
c(n, m, r) = r            for r < m/2
c(n, m, r) = (r + m)/3    for m/2 ≤ r < 2m
c(n, m, r) = m            for r ≥ 2m
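
The selection profile can be sketched as follows. The function names `yao_distinct` and `selection_profile` and the sample numbers are hypothetical. The boundary cases r = m/2 and r = 2m are assigned to the middle and last branch respectively, which is an assumption; the piecewise function is continuous at both points, so the choice does not change the result.

```python
def yao_distinct(n, m, r):
    """Approximate number of distinct colors among r objects taken from
    n objects spread uniformly over m colors. The approximation itself
    does not use n; it is kept only to mirror the notation c(n, m, r)."""
    if r < m / 2:
        return r
    if r < 2 * m:
        return (r + m) / 3
    return m


def selection_profile(card_r, val_a, val_b):
    """Profile of S = SL A=v (R) under the uniform-distribution assumption.

    Returns (card(S), val(B[S])) for an attribute B not used in the selection.
    """
    card_s = card_r / val_a              # rho * card(R) with rho = 1 / val(A[R])
    val_b_s = yao_distinct(card_r, val_b, card_s)
    return card_s, val_b_s


# Hypothetical relation: 10,000 tuples, 100 distinct A values, 30 distinct B values.
print(selection_profile(10_000, 100, 30))
```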

10 Profile of partial results of algebraic operations - PROJECTION
Let S denote the result of performing a unary operation over a relation R.
Cardinality: projection affects the cardinality of operands, since duplicates are eliminated from the result. This effect is difficult to evaluate; the following three rules can be applied:
– If the projection involves a single attribute A, set card(S) = val(A[R])
– If the product Π val(Ai[R]), taken over all Ai ∈ Attr(S), is less than card(R), where Attr(S) are the attributes in the result of the projection, set card(S) = Π val(Ai[R]) over all Ai ∈ Attr(S)

11 Profile of partial results of algebraic operations - PROJECTION
– If the projection includes a key of R, set card(S) = card(R)
Note that if the system does not eliminate duplicates, the cardinality of the result is the same as the cardinality of the operand relation.
Size: the size of the result of a projection is reduced to the sum of the sizes of the attributes in its specification
Distinct values: the distinct values of the projected attributes are the same as in the operand relation
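
The three projection rules can be combined into one estimator, sketched below. The function name, its parameters, and the sample statistics are hypothetical. The slides do not say what to do when none of the rules applies; falling back to card(R), which is always an upper bound, is an assumption of this sketch.

```python
from math import prod


def projection_cardinality(card_r, val, attrs, key_attrs=(),
                           eliminate_duplicates=True):
    """Estimate card(S) for S = PJ attrs (R).

    val maps an attribute name to val(A[R]); key_attrs names a key of R.
    """
    if not eliminate_duplicates:
        return card_r                      # duplicates kept: cardinality unchanged
    if key_attrs and set(key_attrs) <= set(attrs):
        return card_r                      # projection includes a key of R
    if len(attrs) == 1:
        return val[attrs[0]]               # single attribute A: val(A[R])
    product = prod(val[a] for a in attrs)
    if product < card_r:
        return product                     # product of distinct-value counts
    return card_r                          # assumption: fall back to the upper bound


stats = {"city": 10, "grade": 20, "id": 1000}    # hypothetical val(A[R]) values
print(projection_cardinality(1000, stats, ["city", "grade"]))
```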

12 Profile of partial results of algebraic operations – GROUP BY
Let S denote the result of the grouping, G the attributes on which the grouping is performed, and AF the aggregate functions to be evaluated.
Cardinality: we give an upper bound on the cardinality of S: card(S) ≤ Π val(Ai[R]), taken over all Ai ∈ G
Size: for all attributes A appearing in G, size(S.A) = size(R.A)
Distinct values: for all attributes A appearing in G, val(A[S]) = val(A[R])

13 Profile of partial results of algebraic operations – UNION
Let T denote the union of R and S.
Cardinality: card(T) ≤ card(R) + card(S). Equality holds when duplicates are not eliminated
Size: size(T) = size(R) = size(S)
Distinct values: an upper bound is val(A[T]) ≤ val(A[R]) + val(A[S])

14 Profile of partial results of algebraic operations – DIFFERENCE
Let T denote the difference of R and S.
Cardinality: max(0, card(R) - card(S)) ≤ card(T) ≤ card(R)
Size: size(T) = size(R) = size(S)
Distinct values: an upper bound is val(A[T]) ≤ val(A[R])

15 Profile of partial results of algebraic operations – CARTESIAN PRODUCT
Let T denote the Cartesian product of R and S.
Cardinality: card(T) = card(R) * card(S)
Size: size(T) = size(R) + size(S)
Distinct values: the distinct values of attributes are the same as in the operand relations

16 Profile of partial results of algebraic operations – JOIN
Consider the join T = R JN A=B S.
Cardinality: estimating the cardinality of T precisely is very complex. We can give an upper bound, card(T) ≤ card(R) * card(S), but this value is usually much higher than the actual cardinality.
Assuming that all the values of A in R also appear as values of B in S and vice versa, and that the two attributes are both uniformly distributed over the tuples of R and S, we have card(T) = (card(R) * card(S)) / val(A[R])
If one of the two attributes, say A, is a key of R, then val(A[R]) = card(R) and therefore card(T) = card(S)

17 Profile of partial results of algebraic operations – JOIN
Size: size(T) = size(R) + size(S). In the case of a natural join, the size of the join attribute must be subtracted from the size of the result
Distinct values:
– if A is a join attribute, an upper bound is val(A[T]) ≤ min(val(A[R]), val(B[S]))
– if A is not a join attribute, an upper bound is val(A[T]) ≤ val(A[R]) + val(B[S])
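
The join estimates can be collected into one helper, sketched below under the uniform-distribution assumption stated on the slides. The function name `join_profile`, its parameters, and the sample values are hypothetical.

```python
def join_profile(card_r, card_s, size_r, size_s, val_a_r, val_b_s,
                 size_join_attr=0, natural=False):
    """Estimated profile of T = R JN A=B S.

    Assumes every A value of R also occurs as a B value of S and vice versa,
    with both attributes uniformly distributed.
    """
    card_t = card_r * card_s / val_a_r
    size_t = size_r + size_s
    if natural:
        size_t -= size_join_attr           # the join attribute is kept only once
    return {
        "card": card_t,
        "card_upper_bound": card_r * card_s,
        "size": size_t,
        "val_join_attr_upper_bound": min(val_a_r, val_b_s),
    }


# Hypothetical profiles for R and S.
print(join_profile(1000, 5000, 40, 60, val_a_r=100, val_b_s=100))
```

When A is a key of R, val(A[R]) equals card(R), so the same formula returns card(S).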

18 Profile of partial results of algebraic operations – SEMIJOIN
Consider the semijoin T = R SJ A=B S.
Cardinality: the estimation of the cardinality of T is similar to that of a selection operation. We denote by ρ the selectivity of the semijoin operation, which measures the fraction of the tuples of R that belong to the result. The estimation is the following: ρ = val(A[S]) / val(dom[A]), that is, the fraction of the values of the domain of A that appear in S. Given ρ, card(T) = ρ * card(R)

19 Profile of partial results of algebraic operations – SEMIJOIN
Size: the size of the result of a semijoin is the same as the size of its first operand: size(T) = size(R)
Distinct values: the number of distinct values of attributes which do not belong to the semijoin specification can be estimated using Yao's formula with n = card(R), m = val(A[R]), and r = card(T). If A is the only attribute appearing in the semijoin specification, then val(A[T]) = ρ * val(A[R])
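
A minimal sketch of the semijoin profile follows. It assumes that the selectivity is the fraction of the domain of A whose values appear in S, that is, ρ = val(A[S]) / val(dom(A)). The function name and the sample numbers are hypothetical.

```python
def semijoin_profile(card_r, size_r, val_a_r, val_a_s, val_dom_a):
    """Estimated profile of T = R SJ A=B S.

    Assumption: rho = val(A[S]) / val(dom(A)), the fraction of domain
    values of the join attribute that occur in S.
    """
    rho = val_a_s / val_dom_a
    return {
        "rho": rho,
        "card": rho * card_r,      # card(T) = rho * card(R)
        "size": size_r,            # size(T) = size(R)
        "val_a": rho * val_a_r,    # val(A[T]) = rho * val(A[R])
    }


# Hypothetical numbers: S holds 25 of the 100 possible join values.
print(semijoin_profile(card_r=1000, size_r=40, val_a_r=50,
                       val_a_s=25, val_dom_a=100))
```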

20 Architecture of a Query Processor
[Figure: phases of query processing; labels: query, Parser, internal rep., Query Rewrite, Query Optimizer, Catalog, plan, Plan Refinement, query execution plan, Query Execution Engine, Base data, result]

21 Architecture of a Query Processor
– Parser: the query is parsed and translated into an internal representation (flex and bison can be used to construct an SQL parser)
– Query Rewrite: transforms a query in order to carry out optimizations that are good regardless of the physical state of the system (elimination of redundant predicates, unnesting of subqueries, simplification of expressions). Query rewrite is carried out by a rule engine
– Query Optimizer: carries out optimizations that depend on the physical state of the system. The optimizer decides which indexes to use, which method to use for each operation, and in which order to execute the operations of a query

22 Architecture of a Query Processor
– Query Optimizer: in a distributed system, the optimizer must also decide at which site each operation is to be executed. It enumerates alternative plans and chooses the best plan using a cost estimation model
– Plan: specifies precisely how the query is to be executed. The nodes are operators, and every operator carries out one particular operation. The edges represent consumer-producer relationships between operators
– Plan Refinement: transforms the plan into an executable plan. In DB2 this transformation involves the generation of assembler-like code to evaluate expressions and predicates efficiently

23 Query evaluation plan
[Figure: query evaluation plan distributed over several sites; operator labels: PJ A1, NLJ A2=B2, scan temp, receive, Site 0, Inxscan(A), PJ A3, send, Scan(B), SL C=cos, PJ B3, send]

24 Query evaluation plan
– Fragment reducers: a set of unary operations which apply to the same fragment are collected into programs
– Binary operations: joins and unions
– Optimization graph: nodes represent reduced fragments, and joins (unions) are represented by edges (hypernodes)
[Figure: optimization graph with nodes A and B connected by an edge labeled A2=B2]

25 Query Optimization (1)
Plan enumeration with Dynamic Programming
Input: SPJ query q on relations R1, ..., Rn
Output: A query plan for q
for i = 1 to n do {
    optPlan({Ri}) = accessPlans(Ri)
    prunePlans(optPlan({Ri}))
}
for i = 2 to n do {
    for all S ⊆ {R1, ..., Rn} such that |S| = i do {
        optPlan(S) = ∅

26 Query Optimization (2)
        for all O ⊂ S do {
            optPlan(S) = optPlan(S) ∪ joinPlans(optPlan(O), optPlan(S − O))
            prunePlans(optPlan(S))
        }
    }
}
return optPlan({R1, ..., Rn})
Problem: alternative plans cannot be immediately pruned
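
The enumeration loop can be sketched in Python as follows. Everything beyond the loop structure is an assumption made for illustration: each relation has a single access plan (a full scan costing card(R)), every join uses the same hypothetical selectivity, the cost of a join is the cost of its inputs plus the size of its output, and pruning keeps only the cheapest plan, which is the single-site simplification that the slide's "Problem" note warns about.

```python
from itertools import combinations


def access_plans(rel, card):
    # Hypothetical: one access plan per relation, a scan costing card(R).
    return [{"tree": rel, "card": card[rel], "cost": card[rel]}]


def join_plans(left_plans, right_plans, selectivity=0.01):
    # Combine every left plan with every right plan (toy cost model).
    plans = []
    for l in left_plans:
        for r in right_plans:
            out = l["card"] * r["card"] * selectivity
            plans.append({"tree": (l["tree"], r["tree"]), "card": out,
                          "cost": l["cost"] + r["cost"] + out})
    return plans


def prune_plans(plans):
    # Keep only the cheapest plan; a distributed optimizer would have to
    # keep more alternatives (for example, one per result site).
    return [min(plans, key=lambda p: p["cost"])] if plans else []


def optimize(card):
    rels = sorted(card)
    opt = {}
    for r in rels:
        opt[frozenset([r])] = prune_plans(access_plans(r, card))
    for i in range(2, len(rels) + 1):
        for combo in combinations(rels, i):
            s = frozenset(combo)
            opt[s] = []
            for k in range(1, i):                     # all proper subsets O of S
                for o in map(frozenset, combinations(combo, k)):
                    opt[s] = opt[s] + join_plans(opt[o], opt[s - o])
                    opt[s] = prune_plans(opt[s])
    return opt[frozenset(rels)]


best = optimize({"R": 1000, "S": 100, "T": 10})[0]    # hypothetical cardinalities
print(best["tree"], best["cost"])
```

With these toy numbers the cheapest plan joins the two small relations S and T first and brings in R last.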

27 Query Optimization (3)
Optimization criteria:
– Classic cost model (total time, total resource consumption): estimate the cost of every individual operator of the plan and then sum up these costs; this model is useful for estimating the overall throughput of a system
– Mean response time model: estimate the lowest response time of a query
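
The difference between the two criteria can be illustrated with a toy plan tree. The sketch below assumes that sibling subplans run fully in parallel at different sites and that an operator starts only after all of its inputs have finished (no pipelining); the plan structure and the operator costs are hypothetical.

```python
def total_cost(plan):
    """Classic model: sum the cost of every operator in the plan."""
    return plan["cost"] + sum(total_cost(c) for c in plan.get("children", []))


def response_time(plan):
    """Response-time model under the stated assumptions: an operator waits
    for its slowest input, and its inputs run in parallel."""
    children = plan.get("children", [])
    return plan["cost"] + max((response_time(c) for c in children), default=0)


# Hypothetical plan: a join at one site fed by scans at two other sites.
plan = {"op": "join", "cost": 5, "children": [
    {"op": "scan A", "cost": 10},
    {"op": "scan B", "cost": 4},
]}
print(total_cost(plan), response_time(plan))
```

A plan can therefore have a lower response time than another plan while consuming more total resources, which is why the two criteria can rank plans differently.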

28 Query Execution Techniques
– Row blocking: the implementation of the send and receive operators is based on TCP/IP or UDP; the idea is to ship tuples block by block rather than one at a time
– Optimization of multicasts: forward the data from site to site instead of sending it twice from the source (NY → Berlin → Poznan)
– Joins with horizontally partitioned data: (A1 ∪ A2) JN B or (A1 JN B) ∪ (A2 JN B). If A and B are both partitioned, then there are more plans
– Semijoin and Bloomjoin programs

29 Semijoin Programs
A semijoin between R and S over two attributes A and B can be used to compute the join, since (R SJ A=B S) JN A=B S is equal to R JN A=B S. The semijoin program:
1. Send PJ B (S) to the site of R at a cost C0 + C1 * size(B) * val(B[S])
2. Compute the semijoin on R at a null cost; let R' = R SJ A=B S
3. Send R' to the site of S at a cost C0 + C1 * size(R) * card(R')
4. Compute the join at the site of S at a null cost
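
The cost of the four steps can be compared with the cost of simply shipping all of R to the site of S. The sketch below assumes, as the slide does, that local processing is free. The comparison with direct shipping, the function names, and all numbers are hypothetical additions for illustration.

```python
def semijoin_program_cost(c0, c1, size_b, val_b_s, size_r, card_r_reduced):
    """Transmission cost of the semijoin program (steps 1 and 3)."""
    step1 = c0 + c1 * size_b * val_b_s         # ship PJ B (S) to the site of R
    step3 = c0 + c1 * size_r * card_r_reduced  # ship R' to the site of S
    return step1 + step3


def direct_join_cost(c0, c1, size_r, card_r):
    """Transmission cost of shipping the whole of R to the site of S."""
    return c0 + c1 * size_r * card_r


# Hypothetical profile: the semijoin keeps 1,000 of the 10,000 tuples of R.
c0, c1 = 10, 1
with_semijoin = semijoin_program_cost(c0, c1, size_b=4, val_b_s=100,
                                      size_r=50, card_r_reduced=1000)
without_semijoin = direct_join_cost(c0, c1, size_r=50, card_r=10_000)
print(with_semijoin, without_semijoin)
```

The semijoin program pays the fixed cost C0 twice, so under this model it is worthwhile only when the reduction of R outweighs the cost of shipping the projection of S.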

30 Reducers
– Semijoin programs can be regarded as reducers, i.e., operations that can be applied to reduce the cardinality of their operands
– Let RED(Q, R) denote the set of reducer programs that can be built for a given relation R in a given query Q
– There is one reducer program, an element of RED(Q, R), which reduces R more than all other programs: the full reducer
– The problem: find full reducers for all the relations of a query (a difficult task)
– Acyclic (tree) queries versus cyclic queries

31 Reducers
Is it possible to bound the length of the full reducer?
– Tree queries: YES. The bound on the length of the full reducer is n-1, where n is the number of nodes of the tree
– Cyclic queries: NO. The length of the "best" reducer is linearly bounded by the number of tuples of some relations of the query
– The best reducer is not necessarily a full reducer

32 Example (1)
Cyclic query
[Figure: tables R, S, T and a query graph over S, R, T with edges labeled A=A, B=B, C=C]
The final result is the empty relation; the length of the reducer is 3*(m-1), where m is the number of tuples

33 Example (2)
Acyclic query
[Figure: tables R, S, T and a query graph over S, R, T with edges labeled B=B, C=C]
The final result: one tuple (a, x)

34 Testing the graph for cycles
There are two cases in which cycles can be broken without changing the meaning of the query:
1. In the cycle (R.A=S.B), (S.B=T.C), (T.C=R.A), in which R, S, T are relation names and A, B, C are attributes, any one of the edges can be dropped, as any edge can be obtained from the remaining ones by transitivity.
2. In the cycle (R.A=S.B), (S.B=T.C), (T.C=R.D), we can substitute (R.A=R.D) for (T.C=R.D) because, by transitivity, T.C must equal R.A; the remaining graph contains two edges, (R, S) and (S, T), and is acyclic, because an interrelation clause has been substituted by an intrarelation clause
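
One possible way to automate this test is sketched below. It does not follow a procedure given in the slides: it swaps in the GYO reduction, a standard acyclicity test for hypergraphs. Attributes linked by equality clauses are first merged into equivalence classes with union-find, which captures the transitivity argument of both cases; each relation then becomes a hyperedge over the classes it touches. All function names and the input encoding (pairs of (relation, attribute) tuples) are hypothetical.

```python
def attribute_classes(clauses):
    """Union-find: attributes connected by equality clauses share one root."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for left, right in clauses:
        parent[find(left)] = find(right)
    return {attr: find(attr) for attr in list(parent)}


def is_tree_query(clauses):
    """GYO reduction: the query is acyclic iff every hyperedge disappears."""
    edges = {}
    for (rel, _name), root in attribute_classes(clauses).items():
        edges.setdefault(rel, set()).add(root)
    changed = True
    while changed and edges:
        changed = False
        for verts in edges.values():
            # Drop classes that occur in exactly one relation.
            lonely = {v for v in verts
                      if sum(v in e for e in edges.values()) == 1}
            if lonely:
                verts -= lonely
                changed = True
        for rel in list(edges):
            # Drop relations that are empty or covered by another relation.
            covered = any(o != rel and edges[rel] <= edges[o] for o in edges)
            if covered or not edges[rel]:
                del edges[rel]
                changed = True
    return not edges


case1 = [(("R", "A"), ("S", "B")), (("S", "B"), ("T", "C")), (("T", "C"), ("R", "A"))]
case2 = [(("R", "A"), ("S", "B")), (("S", "B"), ("T", "C")), (("T", "C"), ("R", "D"))]
cyclic = [(("R", "A"), ("S", "A")), (("S", "B"), ("T", "B")), (("T", "C"), ("R", "C"))]
print(is_tree_query(case1), is_tree_query(case2), is_tree_query(cyclic))
```

The first two inputs are the two breakable cycles described above; the third joins R, S, and T on three different attributes and cannot be reduced.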