Monday, 5/13/2002 Hash table indexes, query optimization


CSE 544: Lecture 13
Monday, 5/13/2002
Hash table indexes, query optimization

Outline
- Hash tables: chapter 10
- Query optimization: chapters 13, 14
- Chaudhuri, An overview of query optimization in relational systems

Hash Tables
- Secondary storage hash tables are much like main memory ones
- Recall basics:
  - There are n buckets
  - A hash function f(k) maps a key k to {0, 1, …, n-1}
  - Store in bucket f(k) a pointer to the record with key k
- Secondary storage: bucket = block, use overflow blocks when needed

Hash Table Example
- Assume 1 bucket (block) stores 2 keys + pointers
- h(e)=0, h(b)=h(f)=1, h(g)=2, h(a)=h(c)=3
  [Buckets] 0: e | 1: b, f | 2: g | 3: a, c

Searching in a Hash Table
- Search for a:
  - Compute h(a)=3
  - Read bucket 3
  - 1 disk access
  [Buckets] 0: e | 1: b, f | 2: g | 3: a, c

Insertion in Hash Table
- Place in the right bucket, if there is space
- E.g. h(d)=2
  [Buckets] 0: e | 1: b, f | 2: g, d | 3: a, c

Insertion in Hash Table
- Create an overflow block, if there is no space
- E.g. h(k)=1
- More overflow blocks may be needed
  [Buckets] 0: e | 1: b, f, then overflow block: k | 2: g, d | 3: a, c
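The search and insertion slides above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Block` and `StaticHashTable` are made up for this sketch, blocks live in memory rather than on disk, and the hash values are hard-coded from the slide example instead of being computed by a real hash function.

```python
class Block:
    """One block holding up to `capacity` keys, chained to an overflow block."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = []
        self.overflow = None


class StaticHashTable:
    def __init__(self, n_buckets, capacity, h):
        self.h = h  # hash function: key -> bucket number in [0, n_buckets)
        self.buckets = [Block(capacity) for _ in range(n_buckets)]

    def insert(self, key):
        block = self.buckets[self.h(key)]
        # Walk the overflow chain until a block with free space is found.
        while len(block.keys) == block.capacity:
            if block.overflow is None:
                block.overflow = Block(block.capacity)
            block = block.overflow
        block.keys.append(key)

    def search(self, key):
        """Return (found, number of blocks read)."""
        block, reads = self.buckets[self.h(key)], 0
        while block is not None:
            reads += 1
            if key in block.keys:
                return True, reads
            block = block.overflow
        return False, reads


# Hash values copied from the slide example (not a real hash function).
H = {"e": 0, "b": 1, "f": 1, "g": 2, "a": 3, "c": 3, "d": 2, "k": 1}
table = StaticHashTable(n_buckets=4, capacity=2, h=H.__getitem__)
for key in "ebfgac":
    table.insert(key)
table.insert("d")  # fits in bucket 2
table.insert("k")  # bucket 1 is full, so an overflow block is created
print(table.search("a"))  # (True, 1)
print(table.search("k"))  # (True, 2)
```

The second lookup needs two block reads because k sits in the overflow block of bucket 1, which is the degradation the next slide refers to.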

Hash Table Performance
- Excellent, if no overflow blocks
- Degrades considerably when the number of keys exceeds the number of buckets (i.e. many overflow blocks)

Extensible Hash Table
- Allows the hash table to grow, to avoid performance degradation
- Assume a hash function h that returns numbers in {0, …, 2^k – 1}
- Start with n = 2^i << 2^k, only look at the first i most significant bits
- Remark: the textbook looks at the least significant bits first (no big deal)

Extensible Hash Table
- E.g. i=1, n=2, k=4
- Note: we only look at the first bit (0 or 1)
  [Directory, i=1] 0 → {0(010)}, bits used: 1 | 1 → {1(011)}, bits used: 1

Insertion in Extensible Hash Table
  [Directory, i=1] 0 → {0(010)}, bits used: 1 | 1 → {1(011), 1(110)}, bits used: 1

Insertion in Extensible Hash Table
- Now insert 1010
- Need to extend table, split blocks
- i becomes 2
  [Directory, i=1] 0 → {0(010)}, bits used: 1 | 1 → {1(011), 1(110)} plus 1(010), which does not fit, bits used: 1

Insertion in Extensible Hash Table
- Now insert 1110
  [Directory, i=2] 00, 01 → {0(010)}, bits used: 1 | 10 → {10(11), 10(10)}, bits used: 2 | 11 → {11(10)}, bits used: 2

Insertion in Extensible Hash Table
- Now insert 0000, then 0101
- Need to split block
  [Directory, i=2] 00, 01 → {0(010), 0(000)} plus 0(101), which does not fit, bits used: 1 | 10 → {10(11), 10(10)}, bits used: 2 | 11 → {11(10)}, bits used: 2

Insertion in Extensible Hash Table
- After splitting the block
  [Directory, i=2] 00 → {00(10), 00(00)}, bits used: 2 | 01 → {01(01)}, bits used: 2 | 10 → {10(11), 10(10)}, bits used: 2 | 11 → {11(10)}, bits used: 2
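The insertion sequence on these slides can be replayed with a small sketch. It assumes what the slides state (k=4 bit hash values, 2 keys per block, the first i most significant bits index the directory); the class names and the choice to raise an error instead of allocating overflow blocks are assumptions made only to keep the example short.

```python
class Block:
    def __init__(self, depth):
        self.depth = depth  # number of leading bits shared by all keys in this block
        self.keys = []


class ExtensibleHashTable:
    def __init__(self, k=4, capacity=2):
        self.k, self.capacity = k, capacity
        self.i = 1                             # bits used by the directory
        self.directory = [Block(1), Block(1)]  # 2**i entries

    def _index(self, h):
        return h >> (self.k - self.i)          # first i bits of h

    def insert(self, h):
        block = self.directory[self._index(h)]
        if len(block.keys) < self.capacity:
            block.keys.append(h)
            return
        if block.depth == self.i:
            if self.i == self.k:
                raise OverflowError("all k bits used; an overflow block would be needed")
            # Extend the table: double the directory, duplicating every entry.
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.i += 1
        # Split the full block on its next bit.
        block.depth += 1
        sibling = Block(block.depth)
        old_keys, block.keys = block.keys, []
        for key in old_keys:
            bit = (key >> (self.k - block.depth)) & 1
            (sibling if bit else block).keys.append(key)
        # Redirect the directory entries whose distinguishing bit is 1.
        for idx, b in enumerate(self.directory):
            if b is block and (idx >> (self.i - block.depth)) & 1:
                self.directory[idx] = sibling
        self.insert(h)  # retry; this may trigger another split


t = ExtensibleHashTable()
for h in (0b0010, 0b1011, 0b1110, 0b1010, 0b0000, 0b0101):
    t.insert(h)
for idx, b in enumerate(t.directory):
    print(format(idx, "02b"), [format(key, "04b") for key in b.keys], b.depth)
```

After the six insertions the directory has four entries, each pointing to its own block, which matches the final state shown on the last slide of the sequence.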

Performance Extensible Hash Table
- No overflow blocks: access always one read
- However, collisions (when hash values are identical) may require overflow blocks
- BUT:
  - Extensions can be costly and disruptive
  - After an extension the table may no longer fit in memory
- Performs poorly on skewed data (but good hash functions usually take care of that)

Linear Hash Table
- Idea: extend only one entry at a time
- Problem: n is no longer a power of 2
- Let i be such that 2^(i-1) < n <= 2^i
- After computing h(k), use the last i bits:
  - If the last i bits represent a number >= n, change the msb from 1 to 0 (get a number < n)

Linear Hash Table Example
- n=3, i=2
  [Buckets] 00: (01)00, (11)00 | 01: (01)11 (BIT FLIP) | 10: (10)10
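The bucket-selection rule can be written as a short function. This is a sketch under the assumption that buckets are numbered 0 to n-1, as in the example figure, so a bit pattern is flipped exactly when it names a bucket that does not exist yet; the function name `bucket_of` is made up for illustration.

```python
def bucket_of(h, n):
    """Bucket number in [0, n) for hash value h in a linear hash table with n buckets."""
    i = max(1, (n - 1).bit_length())  # smallest i with n <= 2**i
    m = h & ((1 << i) - 1)            # last i bits of h
    if m >= n:                        # bucket m does not exist yet
        m -= 1 << (i - 1)             # flip the most significant bit from 1 to 0
    return m


# The four keys of the example, with n = 3 buckets.
for h in (0b0100, 0b1100, 0b0111, 0b1010):
    print(format(h, "04b"), "->", format(bucket_of(h, 3), "02b"))
```

For the key 0111 the last two bits are 11, which is not a valid bucket when n=3, so the function returns 01. This is the "BIT FLIP" case marked in the example.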

Linear Hash Table Example
- Insert 1000: overflow blocks…
  [Buckets, i=2] 00: (01)00, (11)00, then overflow block: (10)00 | 01: (01)11 | 10: (10)10

Linear Hash Tables
- Extension: independent of overflow blocks
- Extend n := n+1 when the average number of records per block exceeds (say) 80% of the block capacity

Linear Hash Table Extension
- From n=3 to n=4
  [Before, i=2] 00: (01)00, (11)00 | 01: (01)11 | 10: (10)10
  [After, i=2] 00: (01)00, (11)00 | 01: (empty) | 10: (10)10 | 11: (01)11
- Only need to touch one block (which one ?)
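One way to check the step from n=3 to n=4 is to recompute the bucket of every key under the new n and see which buckets changed. The sketch below does exactly that. It is deliberately naive (a real implementation would not rehash everything), and the helper names `address` and `extend` are assumptions of this example.

```python
def address(h, n):
    """Bucket in [0, n): last i bits of h, with the msb flipped if the bucket does not exist."""
    i = max(1, (n - 1).bit_length())
    m = h & ((1 << i) - 1)
    return m - (1 << (i - 1)) if m >= n else m


def extend(buckets):
    """Add one bucket, redistribute all keys, and report which buckets changed."""
    n = len(buckets) + 1
    new_buckets = [[] for _ in range(n)]
    for bucket in buckets:
        for h in bucket:
            new_buckets[address(h, n)].append(h)
    changed = [
        b for b in range(n)
        if new_buckets[b] != (buckets[b] if b < len(buckets) else [])
    ]
    return new_buckets, changed


before = [[0b0100, 0b1100], [0b0111], [0b1010]]  # n = 3, keys from the slide
after, changed = extend(before)
print([[format(h, "04b") for h in b] for b in after])
print("buckets whose contents changed:", changed)
```

On the slide's data the only existing bucket whose contents change is 01: its key (01)11 moves to the new bucket 11.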

Linear Hash Table Extension
- From n=3 to n=4 finished
  [Buckets, i=2] 00: (01)00, (11)00 | 01: (empty) | 10: (10)10 | 11: (01)11
- Extension from n=4 to n=5 (new bit)
- Need to touch every single block (why ?)

Optimization Overview
- S. Chaudhuri, An overview of query optimization in relational systems
- Three components:
  - A search space (given by algebraic laws)
  - A cost estimation technique
  - An enumeration algorithm
- Two philosophies:
  - Heuristics-based optimizations
  - Cost-based optimizations

Algebraic Laws
- Commutative and associative laws:
  - R ∪ S = S ∪ R, R ∪ (S ∪ T) = (R ∪ S) ∪ T
  - R ∩ S = S ∩ R, R ∩ (S ∩ T) = (R ∩ S) ∩ T
  - R ⋈ S = S ⋈ R, R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T
- Distributive laws:
  - R ⋈ (S ∪ T) = (R ⋈ S) ∪ (R ⋈ T)

Algebraic Laws
- Laws involving selection:
  - σ_{C AND C’}(R) = σ_C(σ_{C’}(R)) = σ_C(R) ∩ σ_{C’}(R)
  - σ_{C OR C’}(R) = σ_C(R) ∪ σ_{C’}(R)
  - σ_C(R ⋈ S) = σ_C(R) ⋈ S, when C involves only attributes of R
  - σ_C(R – S) = σ_C(R) – S
  - σ_C(R ∪ S) = σ_C(R) ∪ σ_C(S)
  - σ_C(R ∩ S) = σ_C(R) ∩ S

Algebraic Laws
- Example: R(A, B, C, D), S(E, F, G)
  - σ_{F=3}(R ⋈_{D=E} S) = ?
  - σ_{A=5 AND G=9}(R ⋈_{D=E} S) = ?
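One possible way to work the two example queries, using only the selection laws from the previous slide plus commutativity of the join. F and G are attributes of S and A is an attribute of R, so each condition can be pushed to the side that owns its attribute:

```latex
\begin{align*}
\sigma_{F=3}(R \bowtie_{D=E} S) &= R \bowtie_{D=E} \sigma_{F=3}(S) \\
\sigma_{A=5 \wedge G=9}(R \bowtie_{D=E} S)
  &= \sigma_{A=5}\bigl(\sigma_{G=9}(R \bowtie_{D=E} S)\bigr) \\
  &= \sigma_{A=5}(R) \bowtie_{D=E} \sigma_{G=9}(S)
\end{align*}
```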

Algebraic Laws
- Laws involving projections:
  - Π_M(R ⋈ S) = Π_N(Π_P(R) ⋈ Π_Q(S)), where N, P, Q are appropriate subsets of attributes of M
  - Π_M(Π_N(R)) = Π_{M,N}(R)
- Example: R(A, B, C, D), S(E, F, G)
  - Π_{A,B,G}(R ⋈_{D=E} S) = Π_?(Π_?(R) ⋈_{D=E} Π_?(S))
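One way to fill in the question marks in the projection example: each inner projection must keep the attributes needed in the final result and the attribute used by the join condition D=E, so R keeps A, B, D and S keeps E, G.

```latex
\[
\Pi_{A,B,G}(R \bowtie_{D=E} S)
  = \Pi_{A,B,G}\bigl(\Pi_{A,B,D}(R) \bowtie_{D=E} \Pi_{E,G}(S)\bigr)
\]
```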

Heuristic Based Optimizations
- Query rewriting based on algebraic laws
- Result in better queries most of the time
- Main heuristics:
  - Push selections down the tree

Heuristic Based Optimizations
  [Figure: two equivalent plans]
  Before: Π_pname(σ_{price>100 AND city=“Seattle”}(Product ⋈_{maker=name} Company))
  After: Π_pname(σ_{price>100}(Product) ⋈_{maker=name} σ_{city=“Seattle”}(Company))
- The earlier we process selections, the fewer tuples we need to manipulate higher up in the tree (but may cause us to […] indexes).
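The effect of pushing selections down can be seen on a toy instance. The rows below are invented purely for illustration (only the relation and attribute names come from the slide), and the nested-loop `join` helper is an assumption of this sketch, not a statement about how any system executes the plan.

```python
# Made-up sample data for Product(pname, price, maker) and Company(name, city).
product = [
    {"pname": "gizmo", "price": 150, "maker": "Acme"},
    {"pname": "widget", "price": 50, "maker": "Acme"},
    {"pname": "gadget", "price": 300, "maker": "Bolt"},
]
company = [
    {"name": "Acme", "city": "Seattle"},
    {"name": "Bolt", "city": "Portland"},
]


def join(left, right):
    """Nested-loop join on maker = name."""
    return [{**p, **c} for p in left for c in right if p["maker"] == c["name"]]


# Plan 1: join first, then apply the whole selection.
joined = join(product, company)
plan1 = [t["pname"] for t in joined if t["price"] > 100 and t["city"] == "Seattle"]

# Plan 2: push each selection below the join.
expensive = [p for p in product if p["price"] > 100]
seattle = [c for c in company if c["city"] == "Seattle"]
pushed = join(expensive, seattle)
plan2 = [t["pname"] for t in pushed]

print(plan1, plan2)              # same answer from both plans
print(len(joined), len(pushed))  # tuples produced by the join in each plan
```

Both plans return the same product names, but on this data the join in the second plan produces one tuple instead of three.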

Cost-Based Optimization
- Main optimization unit: set of joins, i.e. a single select-from-where block
- Hence: the join reordering problem
- Optimization methods:
  - Dynamic programming (System R, 1977), for joins: conceptually cleanest
  - Rule-based optimizations, for arbitrary queries:
    - Volcano → SQL Server
    - Starburst → DB2

Join Trees
- R1 ⋈ R2 ⋈ … ⋈ Rn
- Join tree: [Figure: a join tree with leaves R3, R1, R2, R4]
- A join tree represents a plan. An optimizer needs to inspect many (all ?) join trees

Types of Join Trees
- Left deep (or left-linear): [Figure: left-deep join tree with leaves R4, R2, R5, R3, R1]

Types of Join Trees
- Bushy: [Figure: bushy join tree with leaves R3, R2, R4, R1, R5]

Problem
- Given: a query R1 ⋈ R2 ⋈ … ⋈ Rn
- Assume we have a function cost() that gives us the cost of every join tree
- Find the best join tree for the query

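Justification for Handling Joins Separately
- Select A From R1, R2, …, Rn Where C GroupBy B Having G
  [Figure: plan with σ_G at the root, above the grouping γ_B, above a join tree over R1, …, R5 in which the selections σ_C1 and σ_C2 have been pushed down]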

Dynamic Programming
- Idea: for each subset of {R1, …, Rn}, compute the best plan for that subset
- Compute best plan as follows:
  - Step 1: for {R1}, {R2}, …, {Rn}
  - Step 2: for {R1,R2}, {R1,R3}, …, {Rn-1, Rn}
  - …
  - Step n: for {R1, …, Rn}
- A subset of {R1, …, Rn} is also called a subquery

Dynamic Programming
- For each subquery Q ⊆ {R1, …, Rn} compute the following:
  - Size(Q)
  - A best plan for Q: Plan(Q)
  - The cost of that plan: Cost(Q)
- Additional complication:
  - Consider interesting orders
  - For each subquery Q ⊆ {R1, …, Rn}, generate one plan for each interesting order

Dynamic Programming
- Step 1: For each {Ri} do:
  - Size({Ri}) = B(Ri)
  - Plan({Ri}) = Ri
  - Cost({Ri}) = (cost of scanning Ri)

Dynamic Programming
- Step i: For each Q ⊆ {R1, …, Rn} of cardinality i do:
  - Compute Size(Q) (later…)
  - For every pair of subqueries Q’, Q’’ s.t. Q = Q’ ∪ Q’’, compute cost(Plan(Q’) ⋈ Plan(Q’’))
  - Cost(Q) = the smallest such cost
  - Plan(Q) = the corresponding plan

Dynamic Programming
- Return Plan({R1, …, Rn})
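The steps above can be sketched as a short program. The slides leave Size(Q) and the cost function open, so this sketch fills them in with loudly labeled assumptions: Size(Q) is the product of the base sizes times an assumed selectivity per joined pair, the cost of a join is the cost of its two inputs plus the size of its result, the cost of scanning Ri is taken to be B(Ri), and only disjoint splits Q’, Q’’ are tried. Interesting orders are ignored. The function name `best_join_plan` and the sample numbers are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def best_join_plan(sizes, selectivity):
    """sizes: {relation: B(R)}; selectivity: {frozenset({R, S}): fraction}, 1 if absent."""
    rels = sorted(sizes)
    size, cost, plan = {}, {}, {}

    # Step 1: single relations (assumption: scanning Ri costs B(Ri)).
    for r in rels:
        q = frozenset([r])
        size[q], cost[q], plan[q] = sizes[r], sizes[r], r

    # Step i: all subqueries of cardinality i = 2, ..., n.
    for i in range(2, len(rels) + 1):
        for subset in combinations(rels, i):
            q = frozenset(subset)
            # Assumed size estimate: base sizes times pairwise selectivities.
            est = 1
            for r in subset:
                est *= sizes[r]
            for a, b in combinations(subset, 2):
                est *= selectivity.get(frozenset([a, b]), 1)
            size[q] = est
            # Try every split of Q into two non-empty disjoint subqueries.
            best = None
            for j in range(1, i):
                for left in combinations(subset, j):
                    lq = frozenset(left)
                    rq = q - lq
                    c = cost[lq] + cost[rq] + size[q]  # assumed cost model
                    if best is None or c < best[0]:
                        best = (c, (plan[lq], plan[rq]))
            cost[q], plan[q] = best

    full = frozenset(rels)
    return plan[full], cost[full]


# Hypothetical statistics: R joins S, and S joins T.
sizes = {"R": 1000, "S": 100, "T": 10}
selectivity = {
    frozenset(["R", "S"]): Fraction(1, 100),
    frozenset(["S", "T"]): Fraction(1, 10),
}
plan, cost = best_join_plan(sizes, selectivity)
print(plan, cost)
```

Plans are represented as nested pairs, so a result of the form `(X, (Y, Z))` means Y and Z are joined first. Exact fractions are used for selectivities so that ties between equal-cost plans are not decided by floating-point rounding.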

Dynamic Programming
- Heuristics for Reducing the Search Space:
  - Restrict to left linear trees
  - Restrict to trees “without cartesian product”

Rule-based Optimizations
- Volcano:
  - Main idea: let programmers define rewrite rules, based on the algebraic laws
  - System searches for “best plan” by applying laws repeatedly
  - Need to avoid cycles, etc.
  - Join-reordering becomes harder, but can handle other operators too
- Starburst:
  - Same, but keep larger nodes, corresponding to one select-from-where block
  - Apply rewrite rules inter-blocks
  - Do dynamic programming inside blocks