L4: Query Processing and Optimization. 4.1 Query Processing (Query Decomposition, Data Localization); 4.1 Query Optimization.

Presentation transcript:

L4: Query Optimization (1) - 1 L4: Query Processing and Optimization
- 4.1 Query Processing
  - Query Decomposition
  - Data Localization
- 4.1 Query Optimization

L4: Query Optimization (1) - 2 Query Processing
- Any high-level query (SQL) on a database must be processed, optimized, and executed by the DBMS.
- The high-level query is scanned and parsed to check for syntactic correctness.
- An internal representation of the query is created, which is either a query tree or a query graph.
- The DBMS then devises an execution strategy for retrieving the result of the query. (An execution strategy is a plan for executing the query by accessing the data and storing the intermediate results.)
- The process of choosing one out of the many execution strategies is known as query optimization.

L4: Query Optimization (1) - 3 Query Processor
- A query processor (QP) is a module in the DBMS that processes and optimizes a high-level query and generates an execution strategy for it.
- For a DDBMS, the QP also performs data localization for the query based on the fragmentation scheme, and generates an execution strategy that incorporates the communication operations involved in processing the query.

L4: Query Optimization (1) - 4 Query Optimizer
- Queries expressed in SQL can have multiple equivalent relational algebra query expressions.
- The distributed query optimizer must select the ordering of relational algebra operations, the sites at which to process data, and possibly the way data should be transferred. This makes distributed query processing significantly more difficult.

L4: Query Optimization (1) - 5 Complexity of Relational Algebra Operations
- The relational algebra is used to express the output of the query. The complexity of relational algebra operations plays a role in defining some of the principles of query optimization. All complexity measures are based on the cardinality n of the relation.

  Operation                                        Complexity
  Select, Project (without duplicate elimination)  O(n)
  Project (with duplicate elimination), Group      O(n log n)
  Join, Semi-join, Division, Set Operators         O(n log n)
  Cartesian Product                                O(n^2)

- This was given in the book (p194). It is oversimplified.

L4: Query Optimization (1) - 6 Characteristics of Query Processors
- Statistics
  - fragment cardinality and size
  - size and number of distinct values for each attribute
  - detailed histograms of attribute values for better selectivity estimation
- Decision sites
  - one site or several sites participate in the selection of the strategy
- Exploitation of network topology
  - wide area network: communication cost
  - local area network: parallel execution

L4: Query Optimization (1) - 7 Characteristics of Query Processors
- Exploitation of replicated fragments
  - larger number of possible strategies
- Use of semijoins
  - reduce the size of data transfers
  - increase the number of messages and the local processing
  - good for fast or slow networks?

L4: Query Optimization (1) - 8 Layers of Query Processing
Input: Calculus Query on Distributed Relations
- CONTROL SITE
  - QUERY DECOMPOSITION (uses GLOBAL SCHEMA); output: Algebra Query on Distributed Relations
  - DATA LOCALIZATION (uses FRAGMENT SCHEMA); output: Fragment Query
  - GLOBAL OPTIMIZATION (uses STATISTICS ON FRAGMENTS); output: Optimized Fragment Query With Communication Operations
- LOCAL SITE
  - LOCAL OPTIMIZATION (uses LOCAL SCHEMA); output: Optimized Local Queries

L4: Query Optimization (1) - 9 Query Decomposition
- Normalization
  - Convert from a general language (SQL) to a "standard" form (e.g., relational algebra).
  - The query qualification is written in a normalized form (CNF or DNF) for subsequent manipulation.
- Analysis
  - The query is analyzed for syntactic and semantic correctness.
- Simplification
  - Redundant predicates are eliminated to obtain simplified queries.
- Restructuring
  - The calculus query is translated to an optimal algebraic query representation.

L4: Query Optimization (1) - 10 Query Decomposition: Normalization
- There are two possible forms of representing the predicates in the query qualification: Conjunctive Normal Form (CNF) or Disjunctive Normal Form (DNF)
  - CNF: $(p_{11} \vee p_{12} \vee \ldots \vee p_{1n}) \wedge \ldots \wedge (p_{m1} \vee p_{m2} \vee \ldots \vee p_{mn})$
  - DNF: $(p_{11} \wedge p_{12} \wedge \ldots \wedge p_{1n}) \vee \ldots \vee (p_{m1} \wedge p_{m2} \wedge \ldots \wedge p_{mn})$
  - ORs are mapped into union
  - ANDs are mapped into join or selection
- Lexical and syntactic analysis
  - check validity
  - check for attributes and relations
  - type checking on the qualification

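L4: Query Optimization (1) - 11 Example
SELECT A, C
FROM R, S
WHERE (R.B = 1 AND S.D = 2) OR (R.C > 3 AND S.D = 2)
- Conjunctive normal form of the qualification: $(R.B = 1 \vee R.C > 3) \wedge S.D = 2$
- Query tree: $\pi_{A,C}(\sigma_{(R.B = 1 \vee R.C > 3) \wedge S.D = 2}(R \times S))$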

L4: Query Optimization (1) - 12 Query Decomposition: Analysis
- Queries are rejected because
  - the attributes or relations are not defined in the global schema; or
  - the operations used in the qualification are semantically incorrect.
- Semantic correctness can be determined with a query graph only for queries that do not use disjunction or negation.
- One node of the query graph represents the result and the others represent operand relations. An edge between two operand nodes represents a join, and an edge between an operand node and the result node represents a projection.

L4: Query Optimization (1) - 13 Analysis: Detect invalid expressions
- E.g.: SELECT * FROM R WHERE R.A = 3 is rejected when R does not have an attribute "A".

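L4: Query Optimization (1) - 14 Query Graph and Join Graph
SELECT Ename, Resp
FROM E, G, J
WHERE E.ENo = G.ENO AND G.JNO = J.JNO AND JNAME = 'CAD' AND DUR >= 36 AND Title = 'Prog'
Query graph (E = EMP, G = ASG, J = PROJ):
- Nodes: EMP, ASG, PROJ, Result
- Join edges: EMP and ASG (E.ENo = G.ENO); ASG and PROJ (G.JNO = J.JNO)
- Projection edges: EMP to Result (Ename); ASG to Result (Resp)
- Selection labels: Title = 'Prog' on EMP; DUR >= 36 on ASG; JNAME = 'CAD' on PROJ
Join graph:
- EMP and ASG (E.ENo = G.ENO); ASG and PROJ (G.JNO = J.JNO)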

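L4: Query Optimization (1) - 15 Disconnected Query Graph
- A semantically incorrect conjunctive multivariable query without negation has a query graph that is not connected.
SELECT Ename, Resp
FROM E, G, J
WHERE E.ENo = G.ENO AND JNAME = 'CAD' AND DUR >= 36 AND Title = 'Prog'
Query graph (E = EMP, G = ASG, J = PROJ):
- Join edge: EMP and ASG (E.ENo = G.ENO)
- Projection edges: EMP to Result (Ename); ASG to Result (Resp)
- Selection labels: Title = 'Prog' on EMP; DUR >= 36 on ASG; JNAME = 'CAD' on PROJ
- PROJ has no edge to any other node, so the graph is disconnected.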

L4: Query Optimization (1) - 16 Simplification: Eliminating Redundancy
- Elimination of redundant predicates using well-known idempotency rules:
  - $p \wedge p = p$
  - $p \vee p = p$
  - $p \vee true = true$
  - $p \vee false = p$
  - $p \wedge true = p$
  - $p \wedge false = false$
  - $p_1 \wedge (p_1 \vee p_2) = p_1$
  - $p_1 \vee (p_1 \wedge p_2) = p_1$
- Such redundant predicates arise when a user query is enriched with several predicates to incorporate view-relation correspondence and to ensure semantic integrity and security.

L4: Query Optimization (1) - 17 Eliminating Redundancy -- An Example
Original query:
SELECT TITLE
FROM E
WHERE (NOT (TITLE = 'Programmer') AND (TITLE = 'Programmer' OR TITLE = 'Elec.Engr') AND NOT (TITLE = 'Elec.Engr')) OR ENAME = 'J.Doe';
Simplified query:
SELECT TITLE
FROM E
WHERE ENAME = 'J.Doe';

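L4: Query Optimization (1) - 18 Eliminating Redundancy -- An Example
- Let p1 be TITLE = 'Programmer', p2 be TITLE = 'Elec.Engr', and p3 be ENAME = 'J.Doe'.
- Let the query qualification be $(\neg p_1 \wedge (p_1 \vee p_2) \wedge \neg p_2) \vee p_3$.
- The disjunctive normal form of the query is
  $(\neg p_1 \wedge p_1 \wedge \neg p_2) \vee (\neg p_1 \wedge p_2 \wedge \neg p_2) \vee p_3$
  $= (false \wedge \neg p_2) \vee (\neg p_1 \wedge false) \vee p_3$
  $= false \vee false \vee p_3$
  $= p_3$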
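The simplification above can be checked by brute force over all truth assignments. The following is a minimal sketch in Python; the function names `qualification`, `simplified`, and `equivalent` are illustrative and not part of the slides.

```python
from itertools import product


def qualification(p1: bool, p2: bool, p3: bool) -> bool:
    """Original qualification: (not p1 and (p1 or p2) and not p2) or p3."""
    return ((not p1) and (p1 or p2) and (not p2)) or p3


def simplified(p1: bool, p2: bool, p3: bool) -> bool:
    """Simplified qualification after removing the contradictory conjunct."""
    return p3


def equivalent() -> bool:
    """Compare both forms on all 2^3 truth assignments."""
    return all(
        qualification(*values) == simplified(*values)
        for values in product([False, True], repeat=3)
    )


print(equivalent())
```

A truth table is only practical for a handful of predicates; a real simplifier would apply the idempotency rules symbolically.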

L4: Query Optimization (1) - 19 Query Decomposition: Rewriting
- Rewriting the calculus query in relational algebra:
  - a straightforward transformation from relational calculus to relational algebra, and
  - restructuring of the relational algebra expression to improve performance

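L4: Query Optimization (1) - 20 Rewriting -- Transformation Rules (I)
- Commutativity of binary operations:
  - $R \times S \Leftrightarrow S \times R$
  - $R \cup S \Leftrightarrow S \cup R$
  - $R \bowtie S \Leftrightarrow S \bowtie R$
- Associativity of binary operations:
  - $(R \times S) \times T \Leftrightarrow R \times (S \times T)$
  - $(R \bowtie S) \bowtie T \Leftrightarrow R \bowtie (S \bowtie T)$
- Idempotence of unary operations (grouping of projections and selections):
  - $\pi_{A'}(\pi_{A''}(R)) \Leftrightarrow \pi_{A'}(R)$ for $A' \subseteq A'' \subseteq A$
  - $\sigma_{p_1(A_1)}(\sigma_{p_2(A_2)}(R)) \Leftrightarrow \sigma_{p_1(A_1) \wedge p_2(A_2)}(R)$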

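L4: Query Optimization (1) - 21 Rewriting -- Transformation Rules (II)
- Commuting selection with projection:
  - $\pi_{A_1, \ldots, A_n}(\sigma_{p(A_p)}(R)) \Leftrightarrow \pi_{A_1, \ldots, A_n}(\sigma_{p(A_p)}(\pi_{A_1, \ldots, A_n, A_p}(R)))$
- Commuting selection with binary operations:
  - $\sigma_{p(A_i)}(R \times S) \Leftrightarrow (\sigma_{p(A_i)}(R)) \times S$
  - $\sigma_{p(A_i)}(R \bowtie S) \Leftrightarrow (\sigma_{p(A_i)}(R)) \bowtie S$
  - $\sigma_{p(A_i)}(R \cup S) \Leftrightarrow \sigma_{p(A_i)}(R) \cup \sigma_{p(A_i)}(S)$
- Commuting projection with binary operations:
  - $\pi_C(R \times S) \Leftrightarrow \pi_A(R) \times \pi_B(S)$, where $C = A \cup B$
  - $\pi_C(R \bowtie S) \Leftrightarrow \pi_C(R) \bowtie \pi_C(S)$
  - $\pi_C(R \cup S) \Leftrightarrow \pi_C(R) \cup \pi_C(S)$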

L4: Query Optimization (1) - 22 An SQL Query and Its Query Tree
SELECT Ename
FROM J, G, E
WHERE G.Eno = E.ENo AND G.JNo = J.JNo AND ENAME <> 'J.Doe' AND JName = 'CAD/CAM' AND (Dur = 12 OR Dur = 24)
Query tree (root first):
- $\pi_{ENAME}$
- $\sigma_{Dur=12 \vee Dur=24}$
- $\sigma_{JNAME="CAD/CAM"}$
- $\sigma_{ENAME \ne "J.DOE"}$
- $PROJ \bowtie_{JNO} (ASG \bowtie_{ENO} EMP)$

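L4: Query Optimization (1) - 23 Query Decomposition: Rewriting
Rewritten query tree, with selections and projections pushed toward the leaves:
- Root: $\pi_{ENAME}(X \bowtie_{JNO} Y)$
- $X = \pi_{JNO}(\sigma_{JNAME="CAD/CAM"}(PROJ))$
- $Y = \pi_{JNO, ENAME}(Y_1 \bowtie_{ENO} Y_2)$
- $Y_1 = \pi_{JNO, ENO}(\sigma_{Dur=12 \vee Dur=24}(ASG))$
- $Y_2 = \pi_{ENO, ENAME}(\sigma_{ENAME \ne "J.DOE"}(EMP))$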

L4: Query Optimization (1) - 24 Data Localization
Input: algebraic query on distributed relations
- Determine which fragments are involved
- Localization program
  - substitute for each global query its materialization program
  - optimize

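L4: Query Optimization (1) - 25 Data Localization -- An Example
- EMP is fragmented into
  - $EMP1 = \sigma_{ENO \le "E3"}(EMP)$
  - $EMP2 = \sigma_{"E3" < ENO \le "E6"}(EMP)$
  - $EMP3 = \sigma_{ENO > "E6"}(EMP)$
- ASG is fragmented into
  - $ASG1 = \sigma_{ENO \le "E3"}(ASG)$
  - $ASG2 = \sigma_{ENO > "E3"}(ASG)$
- Localized query tree (root first):
  - $\pi_{ENAME}$
  - $\sigma_{Dur=12 \vee Dur=24}$
  - $\sigma_{JNAME="CAD/CAM"}$
  - $\sigma_{ENAME \ne "J.DOE"}$
  - $PROJ \bowtie_{JNO} ((ASG1 \cup ASG2) \bowtie_{ENO} (EMP1 \cup EMP2 \cup EMP3))$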

L4: Query Optimization (1) - 26 Reduction with Selection
- EMP is fragmented into
  - $EMP1 = \sigma_{ENO \le "E3"}(EMP)$
  - $EMP2 = \sigma_{"E3" < ENO \le "E6"}(EMP)$
  - $EMP3 = \sigma_{ENO > "E6"}(EMP)$
- Rule: given relation $R$ and $F_R = \{R_1, R_2, \ldots, R_n\}$ where $R_j = \sigma_{p_j}(R)$,
  $\sigma_{p_i}(R_j) = \emptyset$ if $\forall x \in R: \neg(p_i(x) \wedge p_j(x))$
- Query: SELECT * FROM EMP WHERE ENO = "E5"
- Generic query: $\sigma_{ENO="E5"}(EMP1 \cup EMP2 \cup EMP3)$
- Reduced query: $\sigma_{ENO="E5"}(EMP2)$
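A minimal Python sketch of this reduction, assuming each fragment predicate can be described as an interval (low, high] over the numeric part of ENO. The dictionary `EMP_FRAGMENTS` and the helper names are hypothetical; they only mirror the three EMP fragments defined on the slide.

```python
import math

# Fragment predicates as (low, high] intervals on the numeric part of ENO.
EMP_FRAGMENTS = {
    "EMP1": (-math.inf, 3),  # ENO <= "E3"
    "EMP2": (3, 6),          # "E3" < ENO <= "E6"
    "EMP3": (6, math.inf),   # ENO > "E6"
}


def may_contain(interval, value):
    """True if a tuple with this ENO value can belong to the fragment."""
    low, high = interval
    return low < value <= high


def reduce_selection(fragments, value):
    """Keep only fragments whose predicate does not contradict ENO = value."""
    return [name for name, interval in fragments.items()
            if may_contain(interval, value)]


print(reduce_selection(EMP_FRAGMENTS, 5))  # ENO = "E5" only needs EMP2
```

Fragments that are dropped correspond to the selections that the rule proves empty, so they never need to be accessed.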

L4: Query Optimization (1) - 27 Reduction with join
- EMP is fragmented into
  - $EMP1 = \sigma_{ENO \le "E3"}(EMP)$
  - $EMP2 = \sigma_{"E3" < ENO \le "E6"}(EMP)$
  - $EMP3 = \sigma_{ENO > "E6"}(EMP)$
- ASG is fragmented into
  - $ASG1 = \sigma_{ENO \le "E3"}(ASG)$
  - $ASG2 = \sigma_{ENO > "E3"}(ASG)$
- Query: SELECT * FROM EMP, ASG WHERE EMP.ENO = ASG.ENO
- Query on global relations: $ASG \bowtie_{ENO} EMP$
- Generic query: $(ASG1 \cup ASG2) \bowtie_{ENO} (EMP1 \cup EMP2 \cup EMP3)$

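L4: Query Optimization (1) - 28 Reduction with Join (I)
- Rule: $(R_1 \cup R_2) \bowtie S \Leftrightarrow (R_1 \bowtie S) \cup (R_2 \bowtie S)$
- Generic query: $(ASG1 \cup ASG2) \bowtie_{ENO} (EMP1 \cup EMP2 \cup EMP3)$
- After distributing the join over the unions, the query is the union of six joins:
  - $ASG1 \bowtie_{ENO} EMP1$
  - $ASG1 \bowtie_{ENO} EMP2$
  - $ASG2 \bowtie_{ENO} EMP2$
  - $ASG1 \bowtie_{ENO} EMP3$
  - $ASG2 \bowtie_{ENO} EMP3$
  - $ASG2 \bowtie_{ENO} EMP1$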

L4: Query Optimization (1) - 29 Reduction with Join (II)
- Rule: given $R_i = \sigma_{p_i}(R)$ and $R_j = \sigma_{p_j}(R)$,
  $R_i \bowtie R_j = \emptyset$ if $\forall x \in R_i, \forall y \in R_j: \neg(p_i(x) \wedge p_j(y))$
- Reduced query: $(ASG1 \bowtie_{ENO} EMP1) \cup (ASG2 \bowtie_{ENO} EMP2) \cup (ASG2 \bowtie_{ENO} EMP3)$
- Reduction with join
  1. Distribute join over union
  2. Eliminate unnecessary work
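The two reduction steps can be sketched in Python as follows, assuming the fragment predicates are (low, high] intervals on the numeric part of ENO. The names `EMP`, `ASG`, `overlaps`, and `useful_joins` are illustrative only.

```python
import math
from itertools import product

# Fragment predicates as (low, high] intervals on the numeric part of ENO.
EMP = {"EMP1": (-math.inf, 3), "EMP2": (3, 6), "EMP3": (6, math.inf)}
ASG = {"ASG1": (-math.inf, 3), "ASG2": (3, math.inf)}


def overlaps(a, b):
    """Two (low, high] intervals intersect iff max(lows) < min(highs)."""
    return max(a[0], b[0]) < min(a[1], b[1])


def useful_joins(left, right):
    """Distribute the join over the unions, then drop provably empty joins."""
    return [(l, r) for l, r in product(left, right)
            if overlaps(left[l], right[r])]


for emp_fragment, asg_fragment in useful_joins(EMP, ASG):
    print(emp_fragment, "join", asg_fragment)
```

Of the six candidate fragment joins, only the three whose ENO ranges overlap survive, which matches the reduced query on the slide.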

L4: Query Optimization (1) - 30 Reduction for VF
- Find useless intermediate relations
- Relation $R$ defined over attributes $A = \{A_1, A_2, \ldots, A_n\}$ is vertically fragmented as $R_i = \pi_{A'}(R)$, where $A' \subseteq A$
- $\pi_{K,D}(R_i)$ is useless if the set of projection attributes $D$ is not in $A'$
- Fragments:
  - $EMP1 = \pi_{ENO, ENAME}(EMP)$
  - $EMP2 = \pi_{ENO, TITLE}(EMP)$
- Query: SELECT ENAME FROM EMP
- Generic query: $\pi_{ENAME}(EMP1 \bowtie_{ENO} EMP2)$
- Reduced query: $\pi_{ENAME}(EMP1)$

L4: Query Optimization (1) - 31 Reduction for DHF
- Distribute joins over unions
- Apply the join reduction for horizontal fragmentation
- Fragments:
  - $EMP1 = \sigma_{TITLE="Programmer"}(EMP)$
  - $EMP2 = \sigma_{TITLE \ne "Programmer"}(EMP)$
  - $ASG1 = ASG \ltimes_{ENO} EMP1$
  - $ASG2 = ASG \ltimes_{ENO} EMP2$
- Query: SELECT * FROM EMP, ASG WHERE ASG.ENO = EMP.ENO AND EMP.TITLE = "Mech. Eng."
- Generic query: $(ASG1 \cup ASG2) \bowtie_{ENO} \sigma_{TITLE="Mech. Eng."}(EMP1 \cup EMP2)$

L4: Query Optimization (1) - 32 Reduction for DHF (II)
- Selection first: $(ASG1 \cup ASG2) \bowtie_{ENO} \sigma_{TITLE="Mech. Eng."}(EMP2)$
- Joins over union: $(ASG1 \bowtie_{ENO} \sigma_{TITLE="Mech. Eng."}(EMP2)) \cup (ASG2 \bowtie_{ENO} \sigma_{TITLE="Mech. Eng."}(EMP2))$
- Reduced query: $ASG2 \bowtie_{ENO} \sigma_{TITLE="Mech. Eng."}(EMP2)$

L4: Query Optimization (1) - 33 Reduction for HF (hybrid fragmentation)
- Remove empty relations generated by contradicting selections on horizontal fragments;
- Remove useless relations generated by projections on vertical fragments;
- Distribute joins over unions in order to isolate and remove useless joins.

L4: Query Optimization (1) - 34 Reduction for HF -- An Example
- Fragments:
  - $EMP1 = \sigma_{ENO \le "E4"}(\pi_{ENO, ENAME}(EMP))$
  - $EMP2 = \sigma_{ENO > "E4"}(\pi_{ENO, ENAME}(EMP))$
  - $EMP3 = \pi_{ENO, TITLE}(EMP)$
- Query: SELECT ENAME FROM EMP WHERE ENO = "E5"
- Generic query: $\pi_{ENAME}(\sigma_{ENO="E5"}((EMP1 \cup EMP2) \bowtie_{ENO} EMP3))$
- Reduced query: $\pi_{ENAME}(\sigma_{ENO="E5"}(EMP2))$

L4: Query Optimization (1) - 35 Why Optimization – An Example
- Database: EMP(eno, ename, title), ASG(eno, jno, resp, dur)
- Query: find the names of the employees who are managing a project.
- SQL query:
  SELECT ename
  FROM EMP e, ASG g
  WHERE e.Eno = g.Eno AND resp = 'manager'
- RA tree: $\pi_{Ename}(\sigma_{resp="manager" \wedge EMP.Eno=ASG.Eno}(EMP \times ASG))$

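L4: Query Optimization (1) - 36 Example - Strategies
- Fragment schema:
  - $EMP1 = \sigma_{ENO \le 100}(EMP)$ at site 1
  - $EMP2 = \sigma_{ENO > 100}(EMP)$ at site 2
  - $ASG1 = \sigma_{ENO \le 100}(ASG)$ at site 3
  - $ASG2 = \sigma_{ENO > 100}(ASG)$ at site 4
- Query site: Site 5
- Plan A:
  - $ASG1' = \sigma_{resp="manager"}(ASG1)$ and $ASG2' = \sigma_{resp="manager"}(ASG2)$ are computed at the ASG sites and sent to the EMP sites
  - $EMP1 \bowtie_{ENO} ASG1'$ and $EMP2 \bowtie_{ENO} ASG2'$ are computed at the EMP sites and sent to Site 5
  - Site 5 computes the union of the two join results
- Plan B:
  - All fragments are sent to Site 5, which computes $(EMP1 \cup EMP2) \bowtie_{ENO} \sigma_{resp="manager"}(ASG1 \cup ASG2)$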

L4: Query Optimization (1) - 37 Example – DB Statistics & Costs
- Database statistics
  - EMP has 400 tuples
  - ASG has 1000 tuples
  - there are 20 managers in ASG
  - the data is uniformly distributed among sites
  - ASG and EMP are locally clustered on attributes RESP and ENO, respectively
- Costs
  - tuple access $t_{acc}$ = 1 unit
  - tuple transfer $t_{trans}$ = 10 units

L4: Query Optimization (1) - 38 Costs for Example Plan
- The cost of Plan A:
  - Produce ASG' = 20 * $t_{acc}$ = 20 (processing locally)
  - Transfer ASG' = 20 * $t_{trans}$ = 200 (transfer to EMP site)
  - Produce EMP' = (10 + 10) * $t_{acc}$ * 2 = 40 (join at the EMP site)
  - Transfer EMP' = 20 * $t_{trans}$ = 200 (send to Site 5)
  - Total cost = 460
- The cost of Plan B:
  - Transfer EMP = 400 * $t_{trans}$ = 4,000 (send EMP to Site 5)
  - Transfer ASG = 1000 * $t_{trans}$ = 10,000 (send ASG to Site 5)
  - Produce ASG' = 1000 * $t_{acc}$ = 1,000 (selection at Site 5)
  - Join EMP and ASG' = 400 * 20 * $t_{acc}$ = 8,000 (join at Site 5)
  - Total cost = 23,000
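The arithmetic for the two plans can be reproduced with a short Python sketch. The constants come from the statistics slide; the function and variable names are illustrative.

```python
T_ACC = 1        # cost of one tuple access
T_TRANS = 10     # cost of one tuple transfer
EMP_TUPLES = 400
ASG_TUPLES = 1000
MANAGERS = 20    # ASG tuples with resp = "manager"


def plan_a_cost() -> int:
    """Select at the ASG sites, join at the EMP sites, ship results to Site 5."""
    produce_asg = MANAGERS * T_ACC          # local selection on the clustered attribute
    transfer_asg = MANAGERS * T_TRANS       # ship selected ASG tuples to the EMP sites
    produce_emp = (10 + 10) * T_ACC * 2     # joins at the two EMP sites
    transfer_emp = MANAGERS * T_TRANS       # ship join results to Site 5
    return produce_asg + transfer_asg + produce_emp + transfer_emp


def plan_b_cost() -> int:
    """Ship both relations to Site 5 and do all processing there."""
    transfer_emp = EMP_TUPLES * T_TRANS
    transfer_asg = ASG_TUPLES * T_TRANS
    produce_asg = ASG_TUPLES * T_ACC        # selection scans all of ASG
    join = EMP_TUPLES * MANAGERS * T_ACC    # join of EMP with the selected ASG tuples
    return transfer_emp + transfer_asg + produce_asg + join


print(plan_a_cost(), plan_b_cost())
```

Changing `T_TRANS` shows how strongly the gap between the plans depends on the relative cost of communication.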

L4: Query Optimization (1) - 39 Query Optimization
- Problems in query optimization
  1. Determining the physical copies of the fragments upon which to execute the fragment query expressions (also known as materialization)
  2. Selecting the order of execution of operations
  3. Selecting the method for executing each operation
- These problems are not independent. For instance, the choice of the best materialization for a query depends on the order in which operations are executed. Nevertheless, they are treated as independent. Further:
  - We bypass (1) by taking materialization for granted
  - We bypass (3) by clustering all operations at the same site and treating them as a problem that depends on the local database system

L4: Query Optimization (1) - 40 Query Optimization - Objectives
- The selection among alternative query execution strategies is made based on predetermined objectives
- Two main objectives:
  - minimize the total processing time (total cost)
    - the network and the computers at the nodes do not get heavily loaded
    - response time cannot be guaranteed
  - minimize the response time
    - allocation must facilitate parallel execution of the query
    - throughput may decrease, and the total cost can be higher than when total cost is minimized
- Total processing time (cost) is the sum of all the time (cost) incurred in executing the query (CPU, I/O, data transfer)
- Response time is the elapsed time from the initiation until the completion of the query

L4: Query Optimization (1) - 41 Optimization Algorithms – The Issues
- Cost model
  - cost components
  - weights for each component
  - costs for primitive operations
- Search space
  - the set of equivalent algebra expressions (query trees)
- Search strategies
  - how do we move inside the search space?
  - exhaustive search, heuristics, …

L4: Query Optimization (1) - 42 Cost Models
- The cost measures are I/O and CPU for centralized DBMSs, and I/O, CPU, and data transfer costs for DDBMSs
- Total cost = CPU cost + I/O cost + communication cost
  - CPU cost: $C_{cpu} \cdot \#insts$
  - I/O cost: $C_{i/o} \cdot \#i/os$
  - Communication cost: $C_{msg} \cdot \#msgs + C_{tr} \cdot \#bytes$
  - $C_{cpu}$, $C_{i/o}$, $C_{tr}$, and $C_{msg}$ are all assumed to be constants
- Response time = sum over sequential operations of
  - $C_{cpu} \cdot s\_\#insts$
  - $C_{i/o} \cdot s\_\#i/os$
  - $C_{msg} \cdot s\_\#msgs + C_{tr} \cdot s\_\#bytes$
  - $s\_x$ stands for the maximum number of sequential x's that need to be executed to process the query
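A minimal Python sketch of the two cost formulas. The class name `Coefficients`, the function names, and the coefficient values in the usage line are hypothetical and chosen only for illustration; the slides do not give concrete constants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coefficients:
    c_cpu: float  # cost per CPU instruction
    c_io: float   # cost per disk I/O
    c_msg: float  # fixed cost per message
    c_tr: float   # cost per byte transferred


def weighted_cost(c: Coefficients, insts, ios, msgs, nbytes) -> float:
    """Weighted sum of CPU, I/O, and communication work."""
    return c.c_cpu * insts + c.c_io * ios + c.c_msg * msgs + c.c_tr * nbytes


def total_cost(c: Coefficients, insts, ios, msgs, nbytes) -> float:
    """Total cost: all work is counted, wherever and whenever it happens."""
    return weighted_cost(c, insts, ios, msgs, nbytes)


def response_time(c: Coefficients, s_insts, s_ios, s_msgs, s_bytes) -> float:
    """Response time: only work that must happen sequentially is counted."""
    return weighted_cost(c, s_insts, s_ios, s_msgs, s_bytes)


coefficients = Coefficients(c_cpu=1, c_io=2, c_msg=3, c_tr=4)  # made-up weights
print(total_cost(coefficients, insts=10, ios=5, msgs=2, nbytes=1))
```

The two functions share one formula and differ only in which counts are passed in, which is why parallel execution can lower response time without lowering total cost.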

L4: Query Optimization (1) - 43 Intermediate Result Size
- The size of the intermediate relations produced during execution guides the selection of the execution strategy
- This is useful in selecting an execution strategy that reduces data transfer
- The sizes of intermediate relations need to be estimated based on the cardinalities of relations and the lengths of attributes
- For a relation $R\{A_1, A_2, \ldots, A_n\}$ fragmented as $R_1, R_2, \ldots, R_n$, the statistical data typically collected are
  - $len(A_i)$, the length of attribute $A_i$ in bytes
  - $min(A_i)$ and $max(A_i)$ for ordered domains
  - $card(dom(A_i))$, the number of unique values in $dom(A_i)$
  - $card(R_j)$, the number of tuples in each fragment

L4: Query Optimization (1) - 44 Intermediate Size Estimation
- Join selectivity factor: $SF_J(r, s) = card(r \bowtie s) / (card(r) \cdot card(s))$
- Selection selectivity factor: $SF_S(F) = card(\sigma_F(r)) / card(r)$
- $size(r) = card(r) \cdot len(r)$
- Cardinality of intermediate relations
  - $SF_S(A = value) = 1 / card(dom(A))$
  - $SF_S(A > value) = (max(A) - value) / (max(A) - min(A))$
  - $SF_S(A < value) = (value - min(A)) / (max(A) - min(A))$
  - $SF_S(p(A_i) \wedge p(A_j)) = SF_S(p(A_i)) \cdot SF_S(p(A_j))$
  - $SF_S(p(A_i) \vee p(A_j)) = SF_S(p(A_i)) + SF_S(p(A_j)) - SF_S(p(A_i)) \cdot SF_S(p(A_j))$
  - $SF_S(A \in \{values\}) = SF_S(A = value) \cdot card(\{values\})$
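The selectivity formulas translate directly into code. The sketch below is in Python with illustrative function names; like the formulas themselves, it relies on values being uniformly distributed and on predicates being independent.

```python
def sf_equal(domain_cardinality: int) -> float:
    """SF_S(A = value) = 1 / card(dom(A))."""
    return 1 / domain_cardinality


def sf_greater(value: float, min_a: float, max_a: float) -> float:
    """SF_S(A > value) = (max(A) - value) / (max(A) - min(A))."""
    return (max_a - value) / (max_a - min_a)


def sf_less(value: float, min_a: float, max_a: float) -> float:
    """SF_S(A < value) = (value - min(A)) / (max(A) - min(A))."""
    return (value - min_a) / (max_a - min_a)


def sf_and(sf_i: float, sf_j: float) -> float:
    """Conjunction of two independent predicates."""
    return sf_i * sf_j


def sf_or(sf_i: float, sf_j: float) -> float:
    """Disjunction of two independent predicates."""
    return sf_i + sf_j - sf_i * sf_j


def estimated_cardinality(card_r: int, selectivity: float) -> float:
    """card(sigma_F(r)) = SF_S(F) * card(r)."""
    return selectivity * card_r


# Hypothetical example: 1000 tuples, attribute ranging over [0, 100].
print(estimated_cardinality(1000, sf_greater(75, 0, 100)))
```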

L4: Query Optimization (1) - 45 Intermediate Size Estimation (II)
- Projection: $card(\pi_a(r)) = card(r)$
- Cartesian product: $card(r \times s) = card(r) \cdot card(s)$
- Join
  - $card(r \bowtie_{A=B} s) = card(s)$ if A is a key in r and B is a foreign key in s
  - $card(r \bowtie_{A=B} s) = SF_J(r, s) \cdot card(r) \cdot card(s)$ in general
- Union
  - upper bound: $card(r) + card(s)$
  - lower bound: $\max\{card(r), card(s)\}$

L4: Query Optimization (1) - 46 Cost of Processing Primitive Operations
- Selection
- Projection
- Union
- Join
  - nested-loops
  - sort-merge
  - hash-based
- For distributed joins, the semi-join is proposed as a way to perform the join

L4: Query Optimization (1) - 47 Semi-join
- To compute $R \bowtie S$: $R' = \pi_A(R)$, then $S' = S \ltimes R'$, then the result is $R \bowtie S'$
1. The join is replaced with a project, followed by a semi-join, and then a join
2. The project and join operations are done at one site, and the semi-join at another site
3. Amount of data transferred: $|R'| + |S'|$
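The three steps can be sketched in Python over in-memory lists of dictionaries. The helper names and the sample relations are hypothetical, and the amount of data transferred is counted in values and tuples rather than bytes to keep the sketch short.

```python
def project(relation, attr):
    """R' = projection of R on the join attribute (duplicates removed)."""
    return {row[attr] for row in relation}


def semijoin(relation, values, attr):
    """S' = tuples of S whose join attribute appears in R'."""
    return [row for row in relation if row[attr] in values]


def join(left, right, attr):
    """Nested-loops equi-join on attr."""
    return [{**x, **y} for x in left for y in right if x[attr] == y[attr]]


def semijoin_program(r, s, attr):
    """Compute R join S with a semi-join; return (result, units transferred)."""
    r_prime = project(r, attr)             # at R's site, shipped to S's site
    s_prime = semijoin(s, r_prime, attr)   # at S's site, shipped back to R's site
    result = join(r, s_prime, attr)        # at R's site
    return result, len(r_prime) + len(s_prime)


R = [{"A": 1, "x": "a"}, {"A": 2, "x": "b"}]
S = [{"A": 2, "y": "c"}, {"A": 3, "y": "d"}, {"A": 2, "y": "e"}]
result, transferred = semijoin_program(R, S, "A")
print(result, transferred)
```

Shipping S whole would move three tuples here; the semi-join moves two join-attribute values and two tuples, so the saving depends on how selective the semi-join is and on how wide the tuples are.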

L4: Query Optimization (1) - 48 Semi-join versus Join
- Using a semi-join increases local processing costs because a relation must be scanned twice (join, project)
- When joining intermediate relations produced during a semi-join, one cannot exploit indices on the base relations
- A semi-join may not be a good choice when communication costs are low

L4: Query Optimization (1) - 49 Search Space
- The search space is characterized by alternative execution plans
- Most optimizers focus on join trees
- For N relations, there are O(N!) equivalent join trees
SELECT ENAME, RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO AND ASG.PNO = PROJ.PNO
Equivalent join trees:
- $(EMP \bowtie_{ENO} ASG) \bowtie_{PNO} PROJ$
- $EMP \bowtie_{ENO} (ASG \bowtie_{PNO} PROJ)$
- $(EMP \times PROJ) \bowtie_{PNO, ENO} ASG$

L4: Query Optimization (1) - 50 Restricting Search Space
- O(N!) is large
- Considering join methods, the search space is even bigger
- Restrict by means of heuristics
  - ignore Cartesian products
  - …
- Restrict the shape of the join tree
  - only consider deep trees
  - …
[Figure: join trees over R1, R2, R3, R4 illustrating a deep tree, a left-deep tree, and a bushy tree]
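The effect of both restrictions can be illustrated with a short Python sketch that enumerates left-deep join orders for the EMP, ASG, PROJ query and optionally discards orders that need a Cartesian product. The names `JOIN_EDGES`, `connected`, and `left_deep_orders` are illustrative.

```python
from itertools import permutations

# Join predicates of the example query: EMP.ENO = ASG.ENO and ASG.PNO = PROJ.PNO.
JOIN_EDGES = {frozenset({"EMP", "ASG"}), frozenset({"ASG", "PROJ"})}


def connected(joined_so_far, relation):
    """True if relation shares a join predicate with an already joined relation."""
    return any(frozenset({r, relation}) in JOIN_EDGES for r in joined_so_far)


def left_deep_orders(relations, allow_cartesian=False):
    """Enumerate left-deep join orders, one per permutation of the relations."""
    orders = []
    for order in permutations(relations):
        needs_product = any(
            not connected(order[:i], order[i]) for i in range(1, len(order))
        )
        if allow_cartesian or not needs_product:
            orders.append(order)
    return orders


relations = ["EMP", "ASG", "PROJ"]
print(len(left_deep_orders(relations, allow_cartesian=True)))
print(len(left_deep_orders(relations)))
```

With three relations the restriction removes only the two orders that start by combining EMP and PROJ, but the pruning grows quickly with the number of relations.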

L4: Query Optimization (1) - 51 Search Strategy
- How to move in the search space to find the optimal plan
- Deterministic
  - start from base relations and build plans by adding relations at each step
  - dynamic programming: breadth-first
  - greedy: depth-first
- Randomized
  - search for the optimal plan around a particular starting point
    - simulated annealing
    - iterative improvement
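A minimal Python sketch of the dynamic programming strategy for left-deep plans follows. It makes several assumptions that are not in the slides: the cost of a plan is the sum of its intermediate result cardinalities, the PROJ cardinality and both join selectivities are made-up values, and the names `CARD`, `SELECTIVITY`, and `best_left_deep_plan` are hypothetical.

```python
from itertools import combinations

CARD = {"EMP": 400, "ASG": 1000, "PROJ": 100}  # PROJ cardinality is assumed
SELECTIVITY = {                                # assumed join selectivity factors
    frozenset({"EMP", "ASG"}): 1 / 400,
    frozenset({"ASG", "PROJ"}): 1 / 100,
}


def join_cardinality(left_card, left_relations, relation):
    """Estimate the result size of joining one more relation onto a partial plan."""
    selectivity = 1.0
    for joined in left_relations:
        selectivity *= SELECTIVITY.get(frozenset({joined, relation}), 1.0)
    return left_card * CARD[relation] * selectivity


def best_left_deep_plan(relations):
    """Breadth-first DP: best plan for every subset of size 1, then 2, and so on."""
    # best[subset] = (cost so far, result cardinality, join order)
    best = {frozenset({r}): (0.0, CARD[r], (r,)) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, size)):
            for relation in subset:
                rest = subset - {relation}
                cost, card, order = best[rest]
                new_card = join_cardinality(card, rest, relation)
                candidate = (cost + new_card, new_card, order + (relation,))
                if subset not in best or candidate[0] < best[subset][0]:
                    best[subset] = candidate
    return best[frozenset(relations)]


cost, cardinality, order = best_left_deep_plan(["EMP", "ASG", "PROJ"])
print(order, cost)
```

Because only the best plan per subset is kept, the plan that starts with the EMP and PROJ Cartesian product is discarded as soon as a cheaper way to join all three relations is found.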

L4: Query Optimization (1) - 52 Search Strategies -- Example
[Figure: join trees over R1, R2, R3, R4 illustrating the deterministic strategy and the randomized strategy]

L4: Query Optimization (1) - 53 INGRES CQO
- Uses a dynamic query optimization technique that recursively breaks up a calculus query (SQL) into smaller, manageable queries
- A multivariable query is first decomposed into a sequence of queries that have a unique variable in common
- Each monovariable query is processed by optimizing the access to a single relation
- The algorithm first executes unary operations and tries to minimize the sizes of intermediate results when ordering binary operations

L4: Query Optimization (1) - 54 INGRES CQO Algorithm - Detachment
The query
  SELECT Q.B, R.C, T.D
  FROM O, Q, R, T
  WHERE p1(O.X) AND p2(O.X, Q.W, R.U, T.V);
is detached into the subqueries
  SELECT O.X INTO O'
  FROM O
  WHERE p1(O.X);
and
  SELECT Q.B, R.C, T.D
  FROM O', Q, R, T
  WHERE p2(O'.X, Q.W, R.U, T.V);

L4: Query Optimization (1) - 55 INGRES CQO Algorithm - Substitution
- An n-variable query that cannot be detached is replaced by a set of (n-1)-variable queries by using tuple substitution.
- Consider R(v); then q(v, x, y, w) is replaced by the set of queries $\{q'(t, x, y, w) \mid t \in R\}$
- After multiple substitutions, a set of monovariable queries is generated, and these are then executed in a pipelined fashion

L4: Query Optimization (1) - 56 INGRES CQO Algorithm - Example
- E(ENO, ENAME, TITLE), G(ENO, JNO, RESP, DUR), J(JNO, JNAME, BUDGET)
- q1: SELECT ENAME FROM E, G, J WHERE E.ENO = G.ENO AND G.JNO = J.JNO AND JNAME = "CAD";
- Detachment:
  - q11: SELECT JNO INTO JVAR FROM J WHERE JNAME = "CAD";
  - q': SELECT ENAME FROM E, G, JVAR WHERE E.ENO = G.ENO AND G.JNO = JVAR.JNO;
- q' is further detached into
  - q12: SELECT G.ENO INTO GVAR FROM G, JVAR WHERE G.JNO = JVAR.JNO;
  - q13: SELECT E.ENAME FROM E, GVAR WHERE E.ENO = GVAR.ENO;

L4: Query Optimization (1) - 57 INGRES CQO Algorithm - Example
- Substitution
- The order of processing q is q11 -> q12 -> q13
- q12 is replaced by the set of queries
  $\{q12_t = $ SELECT G.ENO INTO GVAR FROM G WHERE G.JNO = t.JNO $\mid \forall t \in JVAR\}$
- q13 is replaced by the set of queries
  $\{q13_t = $ SELECT ENAME FROM E WHERE E.ENO = t.ENO $\mid \forall t \in GVAR\}$
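The detachment and substitution steps for q1 can be mimicked in Python over small in-memory relations. The sample tuples and the function names are hypothetical; only the relation and attribute names come from the slides.

```python
# Sample data (made up) for E(ENO, ENAME), G(ENO, JNO), J(JNO, JNAME).
E = [{"ENO": "E1", "ENAME": "Ann"}, {"ENO": "E2", "ENAME": "Bob"},
     {"ENO": "E3", "ENAME": "Cy"}]
G = [{"ENO": "E1", "JNO": "J1"}, {"ENO": "E2", "JNO": "J2"},
     {"ENO": "E3", "JNO": "J1"}]
J = [{"JNO": "J1", "JNAME": "CAD"}, {"JNO": "J2", "JNAME": "DB"}]


def q11(j):
    """Detached monovariable query: SELECT JNO INTO JVAR FROM J WHERE JNAME = 'CAD'."""
    return [{"JNO": row["JNO"]} for row in j if row["JNAME"] == "CAD"]


def q12(g, jvar):
    """Tuple substitution: one monovariable query on G per tuple t of JVAR."""
    gvar = []
    for t in jvar:
        gvar.extend({"ENO": row["ENO"]} for row in g if row["JNO"] == t["JNO"])
    return gvar


def q13(e, gvar):
    """Tuple substitution: one monovariable query on E per tuple t of GVAR."""
    names = []
    for t in gvar:
        names.extend(row["ENAME"] for row in e if row["ENO"] == t["ENO"])
    return names


# Processing order q11 -> q12 -> q13.
print(q13(E, q12(G, q11(J))))
```

Each inner list comprehension plays the role of one substituted query: the value from t is a constant, so only a single relation is scanned.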

L4: Query Optimization (1) - 58 Distributed INGRES Query Optimization Algorithm
- Let there be n relations $R_1, R_2, \ldots, R_n$ involved in an n-variable query. $R_i^j$ denotes the fragment of $R_i$ stored at site j (m sites). The data transfer cost of sending #bytes to k sites is $CC_k(\#bytes)$
- Broadcast network
  - $CC_k(\#bytes) = CC_1(\#bytes)$
  - if $\max_{j=1,\ldots,m} \left( \sum_{i=1}^{n} size(R_i^j) \right) > \max_{i=1,\ldots,n} size(R_i)$
  - then the processing site is the site j that has the largest amount of data
  - else $R_p$ is the largest relation and the sites of $R_p$ are the processing sites

L4: Query Optimization (1) - 59 Distributed INGRES Query Optimization Algorithm
- Point-to-point network
  - $CC_k(\#bytes) = k \cdot CC_1(\#bytes)$
  - The choice of $R_p$ that minimizes data transfer is the largest relation; partition $R_p$ to increase parallelism
  - Let the sites be placed in decreasing order of useful data for the query: $\sum_{i=1}^{n} size(R_i^j) > \sum_{i=1}^{n} size(R_i^{j+1})$
  - The number of sites k at which processing needs to be done is then given by
    - if $\sum_{i \ne p} (size(R_i) - size(R_i^1)) > size(R_p^1)$ then k = 1
    - else k is the largest j such that $\sum_{i \ne p} (size(R_i) - size(R_i^j)) \le size(R_p^j)$
  - This rule chooses a site as a processing site only if the amount of data it receives is smaller than the amount of data it would send out if it were not a processing site. Step 3.3 transfers all the fragments to their processing sites. In Step 3.4, MVQ' is executed.

L4: Query Optimization (1) - 60 Distributed INGRES Query Optimization Example
- Consider $J \bowtie_{JN} G$, where J and G are fragmented. Assume the following allocation and sizes of fragments.

         Site1  Site2  Site3  Site4  Total
  J
  G
  Total

- Point-to-point network: send each $J_i$ to site 3
- Broadcast network: broadcast G to sites 1, 2, and 4