C. Faloutsos Query Optimization – part 1 Carnegie Mellon Univ. Dept. of Computer Science 15-415 - Database Applications C. Faloutsos Query Optimization – part 1
General Overview - rel. model Relational model - SQL Functional Dependencies & Normalization Physical Design Indexing Query optimization Transaction processing 15-415 - C. Faloutsos
Overview of a DBMS casual Naïve user user DBA DML parser DML precomp. DDL parser trans. mgr buffer mgr catalog Data-files 15-415 - C. Faloutsos
Overview - detailed Why q-opt? Equivalence of expressions C. Faloutsos Overview - detailed Why q-opt? Equivalence of expressions Cost estimation Cost of indices Join strategies 15-415 - C. Faloutsos CMU - 15-415
Why Q-opt? SQL: ~declarative good q-opt -> big difference eg., seq. Scan vs B-tree index, on P=1,000 pages 15-415 - C. Faloutsos
Q-opt steps bring query in internal form (eg., parse tree) … into ‘canonical form’ (syntactic q-opt) generate alt. plans estimate cost; pick best 15-415 - C. Faloutsos
Q-opt - example p s TAKES STUDENT select name from STUDENT, TAKES where c-id=‘415’ and STUDENT.ssn=TAKES.ssn 15-415 - C. Faloutsos
Q-opt - example p STUDENT TAKES s p Canonical form s STUDENT TAKES 15-415 - C. Faloutsos
Hash join; merge join; nested loops; Q-opt - example STUDENT TAKES s p Hash join; merge join; nested loops; Index; seq scan 15-415 - C. Faloutsos
Overview - detailed Why q-opt? Equivalence of expressions C. Faloutsos Overview - detailed Why q-opt? Equivalence of expressions Cost estimation Cost of indices Join strategies 15-415 - C. Faloutsos CMU - 15-415
Equivalence of expressions A.k.a.: syntactic q-opt in short: perform selections and projections early More details: see transf. rules in text 15-415 - C. Faloutsos
Equivalence of expressions Q: How to prove a transf. rule? A: use RTC, to show that LHS = RHS, eg: 15-415 - C. Faloutsos
Equivalence of expressions 15-415 - C. Faloutsos
Equivalence of expressions 15-415 - C. Faloutsos
Equivalence of expressions Q: how to disprove a rule?? 15-415 - C. Faloutsos
Equivalence of expressions Selections perform them early break a complex predicate, and push simplify a complex predicate (‘X=Y and Y=3’) -> ‘X=3 and Y=3’ 15-415 - C. Faloutsos
Equivalence of expressions Projections perform them early (but carefully…) Smaller tuples Fewer tuples (if duplicates are eliminated) project out all attributes except the ones requested or required (e.g., joining attr.) 15-415 - C. Faloutsos
Equivalence of expressions Joins Commutative , associative Q: n-way join - how many diff. orderings? 15-415 - C. Faloutsos
Equivalence of expressions Joins - Q: n-way join - how many diff. orderings? A: Catalan number ~ 4^n Exhaustive enumeration: too slow. 15-415 - C. Faloutsos
Q-opt steps bring query in internal form (eg., parse tree) … into ‘canonical form’ (syntactic q-opt) generate alt. plans estimate cost; pick best 15-415 - C. Faloutsos
Cost estimation Eg., find ssn’s of students with an ‘A’ in 415 (using seq. scanning) How long will a query take? CPU (but: small cost; decreasing; tough to estimate) Disk (mainly, # block transfers) How many tuples will qualify? (what statistics do we need to keep?) 15-415 - C. Faloutsos
Cost estimation Statistics: for each relation ‘r’ we keep … Sr #1 #2 #3 #nr Statistics: for each relation ‘r’ we keep nr : # tuples; Sr : size of tuple in bytes 15-415 - C. Faloutsos
Cost estimation Statistics: for each relation ‘r’ we keep … Sr Statistics: for each relation ‘r’ we keep … V(A,r): number of distinct values of attr. ‘A’ (recently, histograms, too) #1 #2 #3 … #nr 15-415 - C. Faloutsos
Derivable statistics fr: blocking factor = max# records/block (=?? ) … fr Sr #1 #2 #br fr: blocking factor = max# records/block (=?? ) br: # blocks (=?? ) SC(A,r) = selection cardinality = avg# of records with A=given (=?? ) 15-415 - C. Faloutsos
Derivable statistics fr: blocking factor = max# records/block (= B/Sr ; B: block size in bytes) br: # blocks (= nr / fr ) 15-415 - C. Faloutsos
Derivable statistics SC(A,r) = selection cardinality = avg# of records with A=given (= nr / V(A,r) ) (assumes uniformity...) – eg: 10,000 students, 10 colleges – how many students in SCS? 15-415 - C. Faloutsos
Additional quantities we need: For index ‘i’: fi: average fanout (~50-100) HTi: # levels of index ‘i’ (~2-3) ~ log(#entries)/log(fi) LBi: # blocks at leaf level HTi 15-415 - C. Faloutsos
Statistics Where do we store them? How often do we update them? 15-415 - C. Faloutsos
Q-opt steps bring query in internal form (eg., parse tree) … into ‘canonical form’ (syntactic q-opt) generate alt. plans selections; sorting; projections joins estimate cost; pick best 15-415 - C. Faloutsos
Cost estimation + plan generation … fr Sr #1 #2 #br Selections – eg., select * from TAKES where grade = ‘A’ Plans? 15-415 - C. Faloutsos
Cost estimation + plan generation … fr Sr #1 #2 #br Plans? seq. scan binary search (if sorted & consecutive) index search if an index exists 15-415 - C. Faloutsos
Cost estimation + plan generation … fr Sr #1 #2 #br seq. scan – cost? br (worst case) br/2 (average, if we search for primary key) 15-415 - C. Faloutsos
Cost estimation + plan generation … fr Sr #1 #2 #br binary search – cost? if sorted and consecutive: ~log(br) + SC(A,r)/fr (=blocks spanned by qual. tuples) 15-415 - C. Faloutsos
Cost estimation + plan generation … fr Sr #1 #2 #br estimation of selection cardinalities SC(A,r): non-trivial – details later 15-415 - C. Faloutsos
Cost estimation + plan generation … fr Sr #1 #2 #br method#3: index – cost? levels of index + blocks w/ qual. tuples ... case#1: primary key case#2: sec. key – clustering index case#3: sec. key – non-clust. index 15-415 - C. Faloutsos
Cost estimation + plan generation … fr Sr #1 #2 #br method#3: index – cost? levels of index + blocks w/ qual. tuples .. case#1: primary key – cost: HTi + 1 HTi 15-415 - C. Faloutsos
Cost estimation + plan generation Sr method#3: index – cost? levels of index + blocks w/ qual. tuples #1 fr #2 case#2: sec. key – clustering index HTi + SC(A,r)/fr … HTi #br 15-415 - C. Faloutsos
Cost estimation + plan generation … fr Sr #1 #2 #br method#3: index – cost? levels of index + blocks w/ qual. tuples ... case#3: sec. key – non-clust. index HTi + SC(A,r) (actually, pessimistic...) 15-415 - C. Faloutsos
Cost estimation – arithm. examples find accounts with branch-name = ‘Perryridge’ account(branch-name, balance, ...) 15-415 - C. Faloutsos
Arithm. examples – cont’d n-account = 10,000 tuples f-account = 20 tuples/block V(balance, account) = 500 distinct values V(branch-name, account) = 50 distinct values for branch-index: fanout fi = 20 15-415 - C. Faloutsos
Arithm. examples Q1: cost of seq. scan? A1: 500 disk accesses Q2: assume a clustering index on branch-name – cost? 15-415 - C. Faloutsos
Cost estimation + plan generation Sr method#3: index – cost? levels of index + blocks w/ qual. tuples #1 fr #2 case#2: sec. key – clustering index HTi + SC(A,r)/fr … HTi #br 15-415 - C. Faloutsos
Arithm. examples A2: HTi + SC(branch-name, account)/f-account HTi: 50 values, with index fanout 20 -> HT=2 levels (log(50)/log(20) = 1+) SC(..)= # qual records = nr/V(A,r) = 10,000/50 = 200 tuples spanning 200/20 blocks = 10 blocks 15-415 - C. Faloutsos
Arithm. examples A2 final answer: 2+10= 12 block accesses (vs. 500 block accesses of seq. scan) footnote: in all fairness seq. disk accesses: ~2msec or less random disk accesses: ~10msec 15-415 - C. Faloutsos
Q-opt steps bring query in internal form (eg., parse tree) … into ‘canonical form’ (syntactic q-opt) generate alt. plans selections; sorting; projections joins estimate cost; pick best 15-415 - C. Faloutsos