Advanced Databases: Query Processing

DCC030 - TCC: Advanced Databases (Computer Science)
DCC049 - TSI: Advanced Databases (Information Systems)
DCC842 - Databases (Graduate)
MIRELLA M. MORO

Introduction: Query Tree
A query usually has N possible execution strategies.
- Query optimization = choosing a suitable strategy
- Execution strategy → query plan
The generated code can be:
- Executed directly (interpreted mode)
- Stored and executed later (compiled mode)
Figure: Typical steps during the execution of a high-level query (Elmasri/Navathe, 4th ed.).

Summary: Query Processing
1. Translating SQL → relational algebra
3. Algorithms for operations
   - External sorting
   - SELECT, JOIN, PROJECT, aggregation, OUTER JOIN
2. Query optimization
   - Using heuristics
   - Using cost estimates (selectivity)
Material based on: Elmasri/Navathe, 4th ed., ch. 15 / 6th ed., ch. 19

Selectivity and Cost Estimates
- Cost components
- Cost functions for SELECT
- Cost functions for JOIN

Using Selectivity and Cost Estimates in Query Optimization
4/6/2019
Cost-based query optimization: estimate and compare the costs of executing a query under different execution strategies, and choose the strategy with the lowest cost estimate (in contrast to heuristic query optimization).
Issues:
- Cost function
- Number of execution strategies to be considered
Elmasri & Navathe, Fundamentos de Banco de Dados, 4a ed.

Cost Components for Query Execution
Cost components:
1. Access cost to secondary storage (detailed on the next slide)
2. Storage cost: intermediate files
3. Computation cost: in-memory operations on buffers (sorting records, merging records for a join, etc.)
4. Memory usage cost: number of buffers needed
5. Communication cost: shipping the query and its results
The emphasis depends on the setting:
- Large databases: main emphasis on (1)
- Small databases: main emphasis on (3)

Cost Components for Query Execution
Access cost to secondary storage:
- Searching for, reading, and writing data blocks
- Type of access structure: ordering, hashing, primary/secondary index
- Block allocation: contiguous or scattered
For most DBMSs, this is the cost to be minimized.

Catalog Information Used in Cost Functions
Information about the size of a file:
- Number of records (tuples) (r)
- Record size (R)
- Number of blocks (b)
- Blocking factor (bfr): number of records that fit in one block
Information about indexes and indexing attributes of a file:
- Number of levels (x) of each multilevel index
- Number of first-level index blocks (b_I1)
- Number of distinct values (d) of an attribute
- Selectivity (sl) of an attribute: fraction of records satisfying an equality condition on the attribute
- Selection cardinality (s) of an attribute, s = sl * r: average number of records that satisfy an equality selection condition on that attribute
Example, for a key attribute: d = r, sl = 1/r, and s = 1.
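As a rough illustration of how these catalog values relate, the sketch below derives sl and s from r and d. It assumes that values are uniformly distributed over the d distinct values (so sl = 1/d), which is a simplifying assumption; the function names and the sample file size are illustrative only.

```python
from fractions import Fraction


def selectivity(d: int) -> Fraction:
    """sl under the uniform-distribution assumption: each of the d values is equally likely."""
    return Fraction(1, d)


def selection_cardinality(r: int, d: int) -> Fraction:
    """s = sl * r: average number of records matching an equality condition."""
    return selectivity(d) * r


# Hypothetical file with r = 10,000 records.
r = 10_000
print(selection_cardinality(r, d=r))    # key attribute: d = r, so s = 1
print(selection_cardinality(r, d=125))  # 125 distinct values -> 80
print(selection_cardinality(r, d=2))    # 2 distinct values -> 5000
```

Exact fractions are used so that s comes out as a whole number whenever d divides r.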

Examples of Cost Functions for SELECT
S1. Linear search (brute force):
    C_S1a = b
    For an equality condition on a key, C_S1b = b/2 (*) if the record is found; otherwise C_S1a = b.
S2. Binary search:
    C_S2 = log2(b) + ⌈s/bfr⌉ - 1
    For an equality condition on a unique (key) attribute, C_S2 = log2(b) (**)
S3. Using a primary index (S3a) or a hash key (S3b) to retrieve a single record:
    C_S3a = x + 1
    C_S3b = 1 for static or linear hashing
    C_S3b = 2 for extendible hashing
(*) On average, a record is found after searching half of the table's keys.
(**) For a search on the key attribute, s = 1.

Examples of Cost Functions for SELECT (cont.)
S4. Using an ordering index to retrieve multiple records with an inequality condition (<, >, ≠, ≤, ≥):
    For a comparison condition on a key field with an ordering index, C_S4 = x + (b/2) (*)
S5. Using a clustering index to retrieve multiple records on an equality condition:
    C_S5 = x + ⌈s/bfr⌉ (**)
S6. Using a secondary (B+-tree) index:
    Equality on a key attribute: C_S6a = x + 1
    Equality on a nonkey attribute: C_S6b = x + 1 + s (***)
    Inequality condition (<, >, ≠, ≤, ≥): C_S6c = x + (b_I1/2) + (r/2) (****)
(*) On average, half of the records satisfy the condition.
(**) The s matching records occupy ⌈s/bfr⌉ blocks.
(***) s records in s blocks, because the index is secondary and the file is not ordered by the nonkey field.
(****) On average, half of the records satisfy the condition: half of the first-level index blocks are accessed, plus half of the records are accessed through the index.

Examples of Cost Functions for SELECT (cont.)
S7. Conjunctive selection: use either S1 or one of the methods S2 to S6. In the latter case, use one condition to retrieve the records, then check in the memory buffer whether each retrieved record satisfies the remaining conditions of the conjunction.
S8. Conjunctive selection using a composite index: same as S3a, S5, or S6a, depending on the type of index.
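The SELECT cost functions can be written down as small functions, which makes it easier to try different statistics. This is only a sketch: the function names are hypothetical, costs are in block accesses, and the `extra` parameter is an assumption added here so that both the x + 1 + s and the x + s forms of the secondary-index equality cost can be evaluated.

```python
import math


def cost_linear(b, key_found=False):
    # S1: scan all b blocks, or b/2 on average when an equality on a key finds the record
    return b / 2 if key_found else b


def cost_binary(b, s=1, bfr=1):
    # S2: log2(b) to reach the first match, plus the extra blocks holding the s matches
    return math.log2(b) + math.ceil(s / bfr) - 1


def cost_primary_index(x):
    # S3a: x index levels plus one data block
    return x + 1


def cost_ordering_index_range(x, b):
    # S4: on average half of the file satisfies the comparison
    return x + b / 2


def cost_clustering_index(x, s, bfr):
    # S5: the s matching records are stored contiguously in ceil(s/bfr) blocks
    return x + math.ceil(s / bfr)


def cost_secondary_index_eq(x, s, extra=1):
    # S6 equality: one block access per matching record; extra=0 gives the x + s form
    return x + extra + s


def cost_secondary_index_range(x, b_i1, r):
    # S6 inequality: half of the first-level index blocks plus half of the records
    return x + b_i1 / 2 + r / 2


print(cost_linear(2000, key_found=True))          # 1000.0
print(cost_ordering_index_range(3, 2000))         # 1003.0
print(cost_secondary_index_range(2, 4, 10_000))   # 5004.0
```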

Example of Using Cost Functions
Schema
  EMPLOYEE (SSN, fname, minit, lname, bdate, address, sex, salary, super_ssn, dno)
Statistics
  r_E = 10,000 records
  b_E = 2000 disk blocks
  bfr_E = 5 records/block
Access paths
  Salary: clustering index, x_salary = 3, average selection cardinality s_salary = 20
  SSN: secondary index, x_ssn = 4 (s_ssn = 1)
  Dno: secondary index, x_dno = 2, b_I1dno = 4, d_dno = 125 distinct values, s_dno = 80
  Sex: secondary index, x_sex = 1, d_sex = 2 distinct values, s_sex = 5000
Queries (what is the cost of each?)
  σ ssn='123456789' (EMPLOYEE)
  σ dno>5 (EMPLOYEE)
  σ dno=5 (EMPLOYEE)
  σ dno=5 AND salary>30000 AND sex='F' (EMPLOYEE)

σ ssn='123456789' (EMPLOYEE)
S1. Linear search (brute force): C_S1b = b/2 (for a key attribute) → cost = 2000/2 = 1000
S6. Secondary (B+-tree) index, equality comparison: C_S6a = x + s → cost = 4 + 1 = 5

σ dno>5 (EMPLOYEE)
S1. Linear search (brute force): C_S1a = b → cost = 2000
S6. Secondary (B+-tree) index, comparison condition such as >, <, >=, or <=: C_S6c = x + (b_I1/2) + (r/2) → cost = 2 + 4/2 + 10000/2 = 5004

σ dno=5 (EMPLOYEE)
S1. Linear search (brute force): C_S1a = b → cost = 2000
S6. Secondary (B+-tree) index, equality comparison (using the x + s form of the cost): C_S6a = x + s → cost = 2 + 80 = 82

σ dno=5 AND salary>30000 AND sex='F' (EMPLOYEE)
S1. Linear search (brute force): C_S1a = b → cost = 2000
dno=5 :: S6, secondary (B+-tree) index, equality comparison: C_S6a = x + s → cost = 2 + 80 = 82
salary>30000 :: S4, ordering index to retrieve multiple records: C_S4 = x + (b/2) → cost = 3 + 2000/2 = 1003
sex='F' :: S6, secondary (B+-tree) index: x + s → cost = 1 + 5000 = 5001
The optimizer chooses S6a on the secondary index on Dno: the condition dno=5 retrieves the records, and the other conditions (salary>30000 and sex='F') are checked for each selected record after it is brought into main memory.
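The choice made for the conjunctive selection amounts to taking the minimum over the candidate access paths. A minimal sketch, using the statistics of this example and the x + s form of the secondary-index cost; the dictionary keys are just descriptive labels:

```python
# Estimated block accesses for each way of starting the conjunctive selection.
b = 2000  # blocks in EMPLOYEE

candidates = {
    "linear search (S1)": b,
    "dno = 5 via secondary index (S6)": 2 + 80,               # x_dno + s_dno
    "salary > 30000 via index on salary (S4)": 3 + b // 2,    # x_salary + b/2
    "sex = 'F' via secondary index (S6)": 1 + 5000,           # x_sex + s_sex
}

# The optimizer keeps the cheapest access path; the remaining conditions
# are then checked in memory on the records it retrieves.
best = min(candidates, key=candidates.get)
print(best, candidates[best])  # dno = 5 via secondary index (S6) 82
```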

Examples of Cost Functions for JOIN
Join selectivity (js):
    js = |R ⋈_C S| / |R × S| = |R ⋈_C S| / (|R| * |S|)
- If there is no join condition C, js = 1.
- If no tuples from the relations satisfy condition C, js = 0.
- In general, 0 <= js <= 1.
Size of the result file after the join operation:
    |R ⋈_C S| = js * |R| * |S|
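A small sketch of the two directions of this definition: computing js from a known result size, and estimating the result size from js. The function names and the sample sizes are hypothetical; the sample assumes an equi-join in which every tuple of R matches exactly one tuple of S.

```python
from fractions import Fraction


def join_selectivity(result_size: int, r_size: int, s_size: int) -> Fraction:
    """js = |R join S| / (|R| * |S|)."""
    return Fraction(result_size, r_size * s_size)


def join_result_size(js: Fraction, r_size: int, s_size: int) -> Fraction:
    """|R join S| = js * |R| * |S|."""
    return js * r_size * s_size


# Hypothetical sizes: |R| = 10,000 and |S| = 125; each R tuple joins with
# exactly one S tuple, so the result has |R| tuples.
js = join_selectivity(10_000, 10_000, 125)
print(js)                                  # 1/125
print(join_result_size(js, 10_000, 125))   # 10000
```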

Examples of Cost Functions for JOIN (cont.)
J1. Nested-loop join (using R for the outer loop):
    C_J1 = b_R + (b_R * b_S) + ((js * |R| * |S|) / bfr_RS)
The last term is the cost of writing the resulting file to disk.

Examples of Cost Functions for JOIN (cont.)
J2. Single-loop join (using an access structure to retrieve the matching records)
If an index exists for the join attribute B of S, with x_B index levels, we can retrieve each record s in R and then use the index to retrieve all the matching records t from S that satisfy t[B] = s[A]. The cost depends on the type of index, as detailed on the next slide.

Examples of Cost Functions for JOIN (cont.)
J2. Single-loop join: the cost depends on the type of index
- Secondary index: C_J2a = b_R + (|R| * (x_B + s_B)) + ((js * |R| * |S|) / bfr_RS)
- Clustering index: C_J2b = b_R + (|R| * (x_B + (s_B / bfr_B))) + ((js * |R| * |S|) / bfr_RS)
- Primary index: C_J2c = b_R + (|R| * (x_B + 1)) + ((js * |R| * |S|) / bfr_RS)
- Hash key on one of the two join attributes (B of S): C_J2d = b_R + (|R| * h) + ((js * |R| * |S|) / bfr_RS)
Here s_B is the selection cardinality of the join attribute B in S, and h >= 1 is the average number of block accesses needed to retrieve a record given its hash key value. In every formula the last term is the cost of writing the result.

Examples of Cost Functions for JOIN (cont.)
J3. Sort-merge join:
    C_J3a = C_S + b_R + b_S + ((js * |R| * |S|) / bfr_RS)
where C_S is the cost of sorting the files.
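The join cost functions J1 to J3 share the same result-writing term, so they can be sketched as a few small functions. Everything below is illustrative: the function names and the small sample relations are hypothetical, and the sort cost is passed in as a number rather than computed.

```python
from fractions import Fraction


def write_cost(js, r, s, bfr_rs):
    # Blocks needed to write the join result: (js * |R| * |S|) / bfr_RS
    return js * r * s / bfr_rs


def cost_nested_loop(b_r, b_s, w):
    # J1: read R once, and read all of S once per block of R
    return b_r + b_r * b_s + w


def cost_single_loop_secondary(b_r, r, x_b, s_b, w):
    # J2 with a secondary index on S.B: x_B + s_B accesses per record of R
    return b_r + r * (x_b + s_b) + w


def cost_single_loop_primary(b_r, r, x_b, w):
    # J2 with a primary index on S.B: x_B + 1 accesses per record of R
    return b_r + r * (x_b + 1) + w


def cost_sort_merge(sort_cost, b_r, b_s, w):
    # J3: sorting cost plus one pass over each file
    return sort_cost + b_r + b_s + w


# Hypothetical relations: R has 1000 records in 100 blocks, S has 50 records
# in 10 blocks, js = 1/50, and 4 result records fit in a block.
w = write_cost(Fraction(1, 50), 1000, 50, 4)   # 250 blocks
print(cost_nested_loop(100, 10, w))            # 1350
print(cost_single_loop_primary(100, 1000, 1, w))  # 2350
print(cost_sort_merge(0, 100, 10, w))          # 360 (sort_cost = 0 assumes both files are already sorted)
```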

Join and Buffers
J1: Nested-loop (for each R, retrieve S and test the join condition)
- Read as many blocks as possible at a time into memory from the outer-loop file
- Number of buffers available for the outer file = n_B - 2 (1 buffer is needed to read the other file and 1 for the result)
- Read one block from the INNER file, then probe the OUTER blocks held in memory
Example: EMPLOYEE ⋈ dno=dnumber DEPARTMENT
Cost with EMPLOYEE as outer:
    blocks accessed for the outer file = b_E
    blocks accessed for the inner file on each pass = b_D
    number of times (n_B - 2) blocks of the outer file are loaded = ⌈b_E / (n_B - 2)⌉
    cost = number of block accesses = b_E + b_D * ⌈b_E / (n_B - 2)⌉
Cost with DEPARTMENT as outer:
    number of block accesses = b_D + b_E * ⌈b_D / (n_B - 2)⌉

Join and Buffers
J1: Nested-loop (for each R, retrieve S and test the join condition)
EMPLOYEE ⋈ dno=dnumber DEPARTMENT
n_B = 7 blocks (buffers)
EMPLOYEE: r_E = 6000 records in b_E = 2000 disk blocks
DEPARTMENT: r_D = 50 records in b_D = 10 disk blocks
Cost with EMPLOYEE as outer:
    block accesses = b_E + b_D * ⌈b_E / (n_B - 2)⌉ = 2000 + 10 * ⌈2000/5⌉ = 6000
Cost with DEPARTMENT as outer:
    block accesses = b_D + b_E * ⌈b_D / (n_B - 2)⌉ = 10 + 2000 * ⌈10/5⌉ = 4010
→ It is much better to use the smaller relation as the outer one.
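The buffer-aware nested-loop cost can be checked with a few lines of code. This is a sketch only; the function name is hypothetical, and the cost of writing the result is left out, as in the figures of this example.

```python
import math


def block_nested_loop_cost(b_outer: int, b_inner: int, n_b: int) -> int:
    """Block accesses when n_b - 2 buffers hold chunks of the outer file."""
    if n_b < 3:
        raise ValueError("need at least 3 buffers: outer chunk, inner block, result")
    # Number of times a chunk of the outer file is loaded; the inner file
    # is scanned once per chunk.
    chunks = math.ceil(b_outer / (n_b - 2))
    return b_outer + b_inner * chunks


# EMPLOYEE: 2000 blocks, DEPARTMENT: 10 blocks, 7 buffers.
print(block_nested_loop_cost(2000, 10, 7))  # EMPLOYEE as outer -> 6000
print(block_nested_loop_cost(10, 2000, 7))  # DEPARTMENT as outer -> 4010
```

With the smaller file as outer, the larger file is scanned fewer times, which is where the saving comes from.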

Join and Selection Factor
J2: Single-loop join (for each R, probe S for matching values)
Join selection factor (equi-join): fraction (percentage) of records in one file that will be joined with records in the other file.
Example: DEPARTMENT ⋈ mgr_ssn=ssn EMPLOYEE
If indexes exist on both join attributes, the cost depends on the direction:
1. From EMPLOYEE to DEPARTMENT: b_E + r_E * (x_mgr_ssn + 1)
2. From DEPARTMENT to EMPLOYEE: b_D + r_D * (x_ssn + 1)

Join and Selection Factor
J2: Single-loop join (for each R, probe S for matching values)
DEPARTMENT ⋈ mgr_ssn=ssn EMPLOYEE
DEPARTMENT: r_D = 50 records in b_D = 10 disk blocks
    JS_DE = 50/50 = 1 → every department has a manager in EMPLOYEE
EMPLOYEE: r_E = 6000 records in b_E = 2000 disk blocks
    JS_ED = 50/6000 → only 50 employees are managers in DEPARTMENT
Index on mgr_ssn of DEPARTMENT: x_mgr_ssn = 2
Index on ssn of EMPLOYEE: x_ssn = 4
EMPLOYEE → probe DEPARTMENT using the index on mgr_ssn:
    b_E + r_E * (x_mgr_ssn + 1) = 2000 + 6000 * 3 = 20,000
DEPARTMENT → probe EMPLOYEE using the index on ssn:
    b_D + r_D * (x_ssn + 1) = 10 + 50 * 5 = 260
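The two directions of the single-loop join can be compared with one small function. A minimal sketch with a hypothetical function name; as in the figures of this example, the cost of writing the result is not included.

```python
def single_loop_cost(b_outer: int, r_outer: int, x_inner: int) -> int:
    """Read the outer file once, then do one index probe (x levels + 1 data block)
    per outer record."""
    return b_outer + r_outer * (x_inner + 1)


# EMPLOYEE (2000 blocks, 6000 records) probing DEPARTMENT via the index on mgr_ssn (2 levels)
print(single_loop_cost(2000, 6000, 2))  # 20000
# DEPARTMENT (10 blocks, 50 records) probing EMPLOYEE via the index on ssn (4 levels)
print(single_loop_cost(10, 50, 4))      # 260
```

Looping over the file with the higher join selection factor (DEPARTMENT here) avoids the many probes that find no match.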

Multiple Relation Queries and Join Ordering
A query joining n relations has n-1 join operations, and hence can have a large number of different join orders when the algebraic transformation rules are applied.
Current query optimizers typically limit the structure of a (join) query tree to left-deep (or right-deep) trees.
Left-deep tree: a binary tree in which the right child of each non-leaf node is always a base relation.
- Amenable to pipelining
- Can use any access path on the base relation (the right child) when executing the join
Figure: left-deep join tree over the base relations R1, R2, R3, R4.
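To get a feel for the size of the search space even under the left-deep restriction, the sketch below enumerates left-deep trees as nested tuples, one per ordering of the relations. The representation (tuples of the form `(left_subtree, base_relation)`) and the function name are choices made for this illustration only.

```python
from functools import reduce
from itertools import permutations


def left_deep_trees(relations):
    """Yield one left-deep join tree per ordering of the relations.

    Each join node is a pair (left, right) whose right child is always
    a base relation name.
    """
    for order in permutations(relations):
        yield reduce(lambda left, right: (left, right), order)


trees = list(left_deep_trees(["R1", "R2", "R3", "R4"]))
print(len(trees))  # 24 orderings for 4 relations (4!)
print(trees[0])    # ((('R1', 'R2'), 'R3'), 'R4')
```

A cost-based optimizer would attach a cost estimate to each candidate and keep the cheapest; the number of orderings grows as n!, which is why real optimizers prune rather than enumerate exhaustively.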

Example of Using Cost Functions
Schema
  EMPLOYEE (SSN, fname, minit, lname, bdate, address, sex, salary, super_ssn, dno)
  DEPARTMENT (dnumber, dname, mgr_ssn, mgr_start_date)
Statistics
  r_E = 10,000 records, b_E = 2000 disk blocks, bfr_E = 5 records/block
  r_D = 125 records, b_D = 13 disk blocks
Access paths
  SSN: secondary index, x_ssn = 4 (s_ssn = 1)
  Dno: secondary index, x_dno = 2, b_I1dno = 4, d_dno = 125 distinct values, s_dno = 80
  Dnumber: primary index, x_dnumber = 1 level
  Mgr_ssn: secondary index, average selection cardinality s_mgr_ssn = 1, levels x_mgr_ssn = 2
Query
  EMPLOYEE ⋈ dno=dnumber DEPARTMENT, with js = 1/125 and bfr_ED = 4 records per block
Stop and think about the possible answers.

EMPLOYEE ⋈ dno=dnumber DEPARTMENT
J1: C_J1 = b_R + (b_R * b_S) + ((js * |R| * |S|) / bfr_RS)
J2:
- Secondary index: C_J2a = b_R + (|R| * (x_B + s_B)) + ((js * |R| * |S|) / bfr_RS)
- Clustering index: C_J2b = b_R + (|R| * (x_B + (s_B / bfr_B))) + ((js * |R| * |S|) / bfr_RS)
- Primary index: C_J2c = b_R + (|R| * (x_B + 1)) + ((js * |R| * |S|) / bfr_RS)
- Hash key on one of the two join attributes (B of S): C_J2d = b_R + (|R| * h) + ((js * |R| * |S|) / bfr_RS)

J1 (nested-loop join) with EMPLOYEE as outer loop:
    C_J1 = b_E + (b_E * b_D) + ((js * |E| * |D|) / bfr_ED)
         = 2000 + (2000 * 13) + ((1/125 * 10,000 * 125) / 4) = 30,500
J1 with DEPARTMENT as outer loop:
    C_J1 = b_D + (b_D * b_E) + ((js * |D| * |E|) / bfr_ED)
         = 13 + (13 * 2000) + ((1/125 * 10,000 * 125) / 4) = 28,513
J2 (single-loop join) with EMPLOYEE as outer loop (primary index on dnumber):
    C_J2c = b_E + (|E| * (x_dnumber + 1)) + ((js * |E| * |D|) / bfr_ED)
          = 2000 + (10,000 * 2) + ((1/125 * 10,000 * 125) / 4) = 24,500
J2 with DEPARTMENT as outer loop (secondary index on dno):
    C_J2a = b_D + (|D| * (x_dno + s_dno)) + ((js * |D| * |E|) / bfr_ED)
          = 13 + (125 * (2 + 80)) + ((1/125 * 10,000 * 125) / 4) = 12,763

Summary of the estimates:
- J1 (nested-loop join), EMPLOYEE as outer loop: C_J1 = 30,500
- J1, DEPARTMENT as outer loop: C_J1 = 28,513
- J2 (single-loop join), EMPLOYEE as outer loop (primary index on dnumber): C_J2c = 24,500
- J2, DEPARTMENT as outer loop (secondary index on dno): C_J2a = 12,763
NOTE: with 15 buffers or more, DEPARTMENT (13 blocks) fits entirely in memory, leaving 1 buffer for the result and 1 for EMPLOYEE. Each file is then read only once:
    C_J2 = b_E + b_D + ((js * r_E * r_D) / bfr_ED) = 2000 + 13 + 2500 = 4513
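The comparison can be reproduced by tabulating the plans and sorting them by estimated cost. This is a sketch under the statistics of this example; r_E = 10,000 is taken from the arithmetic of the example, and the plan labels are descriptive strings chosen here.

```python
from fractions import Fraction

# Statistics of the example (r_E = 10,000 as implied by the computed costs).
r_e, b_e = 10_000, 2000
r_d, b_d = 125, 13
x_dnumber, x_dno, s_dno = 1, 2, 80
js, bfr_ed = Fraction(1, 125), 4

# Cost of writing the join result, common to every plan.
write = js * r_e * r_d / bfr_ed  # 2500 blocks

plans = {
    "J1, EMPLOYEE outer": b_e + b_e * b_d + write,
    "J1, DEPARTMENT outer": b_d + b_d * b_e + write,
    "J2, EMPLOYEE outer (primary index on dnumber)": b_e + r_e * (x_dnumber + 1) + write,
    "J2, DEPARTMENT outer (secondary index on dno)": b_d + r_d * (x_dno + s_dno) + write,
    "DEPARTMENT held in memory (15+ buffers)": b_e + b_d + write,
}

# Print the plans from cheapest to most expensive.
for name, cost in sorted(plans.items(), key=lambda item: item[1]):
    print(f"{int(cost):>6}  {name}")
```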

Suggested Exercises
DCCbda/pdfs/exercicios/exercicios-otimizacao.pdf
Ramakrishnan, 3rd ed.:
- 14.4 and 14.6: given two relations and their statistics, calculate the cost of joining them
- 15.4, items 1) and 2): given a relation, an SQL statement, and index options, determine the cost of the best plan
