CS4432: Database Systems II, Lecture #10: Indexing & Hashing. Professor Elke A. Rundensteiner
Chapter 4 - INDEXING Wrap-up:
1. B+-tree Odds and Ends
2. Hashing (briefly)
B+Tree Example (figure: tree below the Root; n = ...)
Comparison: B-tree vs. indexed sequential file

Indexed sequential file:
- Less space, so lookup faster
- Inserts managed by overflow area
- Requires periodic reorganization
- Unpredictable performance

B-tree:
- Consumes more space, so lookup slower
- Each insert/delete potentially restructures
- Built-in restructuring
- Predictable performance
B-trees are better ...
- The DBA does not know when to reorganize
- The DBA does not know how full to load the pages of a new index
A la buffering ... Is LRU a good policy for B+tree buffers? Of course not! We should try to keep the root in memory at all times (and perhaps some nodes from the second level).
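As a rough illustration of that idea, here is a minimal sketch (hypothetical class and method names, not any particular DBMS's buffer manager) of an LRU buffer pool that exempts pinned pages, such as the B+tree root and selected second-level nodes, from eviction:

```python
from collections import OrderedDict

class PinnedLRUPool:
    """LRU buffer pool that never evicts pages marked as pinned
    (e.g., the B+tree root and some second-level nodes)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # page_id -> page contents, in LRU order
        self.pinned = set()          # page_ids exempt from eviction

    def pin(self, page_id):
        self.pinned.add(page_id)

    def get(self, page_id, read_from_disk):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)      # mark as most recently used
            return self.pages[page_id]
        if len(self.pages) >= self.capacity:
            self._evict()
        self.pages[page_id] = read_from_disk(page_id)
        return self.pages[page_id]

    def _evict(self):
        # Evict the least recently used *unpinned* page.
        for victim in self.pages:
            if victim not in self.pinned:
                del self.pages[victim]
                return
        raise RuntimeError("all buffered pages are pinned")
```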
Interesting problem: for a B+tree, how large should n be? ... (n is the number of keys per node)
Assumptions: n children per node and N records in the database.
(1) Time to read a B-tree node from disk: (t_seek + t_read * n) msec
(2) Once the node is in main memory, use binary search to locate the key: (a + b * log_2 n) msec
(3) Need to search (read) log_n(N) tree nodes
(4) t_search = (t_seek + t_read * n + a + b * log_2 n) * log_n(N)
We can get f(n) = time to find a record. (figure: f(n) plotted against n, with a minimum at n_opt)
Find n_opt by solving f'(n) = 0.
What happens to n_opt as:
- the disk gets faster?
- the CPU gets faster? ...
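As a sanity check on this optimization, a small sketch that evaluates f(n) from formula (4) above and finds n_opt by brute force; the timing constants below are made-up placeholders, not measured values:

```python
import math

# Placeholder timing constants (milliseconds) -- purely illustrative.
T_SEEK, T_READ = 10.0, 0.05     # disk seek, per-key transfer
A, B = 0.001, 0.0005            # binary-search constants
N = 10_000_000                  # records in the database

def f(n):
    """Estimated lookup time: (node I/O + in-memory binary search)
    multiplied by the number of levels, log_n(N)."""
    levels = math.log(N) / math.log(n)
    return (T_SEEK + T_READ * n + A + B * math.log2(n)) * levels

# Brute-force search for the n that minimizes f(n).
n_opt = min(range(2, 10_000), key=f)
print(n_opt, f(n_opt))
```

Re-running this with smaller T_SEEK (a faster disk) or smaller a, b (a faster CPU) shows how n_opt shifts.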
Bulk Loading of a B+ Tree: given a large collection of records, create a B+ tree on them.
Method 1: repeatedly insert records (slow).
Method 2: bulk loading (more efficient).
Bulk Loading of a B+ Tree. Initialization:
- Sort all data entries
- Insert a pointer to the first (leaf) page into a new (root) page
(figure: Root pointing to sorted pages of data entries 3*, 4*, 6*, ..., 44*, not yet in the B+ tree)
Bulk Loading (Contd.)
- Index entries for leaf pages are always entered into the right-most index page.
- When this page fills up, it splits. (The split may go up the right-most path to the root.)
- Faster than repeated inserts, especially when one considers locking!
(figures: the index grows along the right-most path as the sorted data entry pages 3*, 4*, 6*, ..., 44* are added)
Summary of Bulk Loading
Method 1: multiple inserts.
- Slow.
- Does not give sequential storage of leaves.
Method 2: bulk loading.
- Has advantages for concurrency control.
- Fewer I/Os during the build.
- Leaves will be stored sequentially (and linked).
- Can control the "fill factor" on pages.
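A simplified bottom-up sketch of Method 2 under the assumptions above: entries are sorted, leaves are packed sequentially, and each level of index nodes is built over the level below. The node representation and fan-out here are hypothetical, not the textbook's exact algorithm.

```python
def bulk_load(entries, fanout=4):
    """Build a B+tree one level at a time from already-sorted entries.
    An internal node is represented as (separator_keys, children)."""
    entries = sorted(entries)                     # Initialization: sort all data entries
    # Pack leaves sequentially; the slice size controls the fill factor directly.
    leaves = [entries[i:i + fanout] for i in range(0, len(entries), fanout)]
    level = [(leaf[0], leaf) for leaf in leaves]  # (smallest key, node)

    # Repeatedly group the nodes of the current level under new parents,
    # always filling the right-most index node, until one root remains.
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            keys = [k for k, _ in group[1:]]      # separator keys
            parents.append((group[0][0], (keys, [n for _, n in group])))
        level = parents
    return level[0][1]                            # the root node

root = bulk_load([20, 3, 11, 35, 44, 9, 13, 6, 22, 31, 4, 36, 10, 23, 12, 38, 41])
```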
Hashing (figure: a key is fed to h(key), which selects one of the buckets; each bucket is typically 1 disk block)
Example hash function. Key = 'x_1 x_2 ... x_n', an n-byte character string; we have b buckets.
h: add x_1 + x_2 + ... + x_n, then compute the sum modulo b.
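A direct rendition of this example hash function (summing the bytes of the key and reducing modulo b), as a sketch:

```python
def h(key: str, b: int) -> int:
    """Add the byte values x_1 + x_2 + ... + x_n, then take the sum modulo b."""
    return sum(key.encode()) % b

# e.g., with b = 100 buckets:
bucket = h("smith", 100)
```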
This may not be the best function ... Read Knuth Vol. 3 if you really need to select a good function.
Good hash function: the expected number of keys per bucket is the same for all buckets.
Within a bucket: do we keep keys sorted? Yes, if CPU time is critical and inserts/deletes are not too frequent.
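If keys are kept sorted within a bucket, lookups can use binary search; a minimal sketch using Python's bisect module, where the bucket is just an in-memory list standing in for a block:

```python
import bisect

def bucket_insert(bucket, key):
    """Insert key while keeping the bucket sorted (costs a shift on every insert)."""
    bisect.insort(bucket, key)

def bucket_lookup(bucket, key):
    """Binary search within the sorted bucket."""
    i = bisect.bisect_left(bucket, key)
    return i < len(bucket) and bucket[i] == key
```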
Next: an example to illustrate inserts, overflows, and deletes (using a hash function h(K)).
EXAMPLE: 2 records per bucket.
INSERT: h(a) = 1, h(b) = 2, h(c) = 1, h(d) = ..., then h(e) = 1 overflows its bucket.
(figure: buckets holding a, c, b, d, with e placed in a chained overflow block)
EXAMPLE: deletion. Delete e, then f; maybe move "g" up from the overflow block into the freed slot.
(figure: buckets holding a, b, c, d, e, f, g before and after the deletions)
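A toy version of this example as a sketch: a static hash table with 2 records per bucket, overflow blocks chained onto a full bucket, and deletion that may move a key (like "g" above) back up from the overflow block. Names and structure are illustrative only.

```python
BUCKET_CAPACITY = 2

class Bucket:
    def __init__(self):
        self.keys = []
        self.overflow = None        # chained overflow bucket, if any

class StaticHashTable:
    def __init__(self, num_buckets, h):
        self.buckets = [Bucket() for _ in range(num_buckets)]
        self.h = h                  # hash function mapping key -> bucket number

    def insert(self, key):
        b = self.buckets[self.h(key)]
        while len(b.keys) >= BUCKET_CAPACITY:   # bucket full: follow/extend the chain
            if b.overflow is None:
                b.overflow = Bucket()
            b = b.overflow
        b.keys.append(key)

    def delete(self, key):
        b = self.buckets[self.h(key)]
        while b is not None:
            if key in b.keys:
                b.keys.remove(key)
                # maybe move a key up from the overflow bucket into the freed slot
                if b.overflow and b.overflow.keys:
                    b.keys.append(b.overflow.keys.pop(0))
                return True
            b = b.overflow
        return False

# Usage with a made-up hash function:
table = StaticHashTable(num_buckets=5, h=lambda k: sum(str(k).encode()) % 5)
for k in ["a", "b", "c", "d", "e"]:
    table.insert(k)
table.delete("e")
```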
Rule of thumb: try to keep space utilization between 50% and 80%.
Utilization = (# keys used) / (total # keys that fit)
- If < 50%, we are wasting space.
- If > 80%, overflows become significant (this depends on how good the hash function is and on the number of keys per bucket).
How do we cope with growth?
- Overflows and reorganizations
- Dynamic hashing: extensible hashing, others ...
Extensible hashing, idea 1: (a) use i of the b bits output by the hash function; i grows over time ...
(figure: of the b bits of h(K), only the leading i are used)
Extensible hashing, idea 2: (b) use a directory that maps h(K)[i] (the first i bits) to a bucket.
Example: h(k) is 4 bits; 2 keys per bucket. Successive inserts fill a bucket, the bucket splits, the directory doubles, and i grows; after inserting 1001, i = 3.
(figures: directory and bucket contents at each step; the specific key and i values appear only in the figures)
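A compact sketch consistent with this running example: 4-bit hash values, 2 keys per bucket, and a directory indexed by the first i bits that doubles whenever a full bucket's local depth equals the global depth i. This is an illustrative reconstruction, not the slides' exact procedure.

```python
BUCKET_SIZE = 2      # 2 keys per bucket, as in the example
HASH_BITS = 4        # h(k) is 4 bits

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtensibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]      # indexed by the first i bits of h(k)

    def _dir_index(self, hval):
        # Idea 1: use only i = global_depth of the HASH_BITS bits.
        return hval >> (HASH_BITS - self.global_depth)

    def insert(self, hval):
        b = self.directory[self._dir_index(hval)]
        if len(b.keys) < BUCKET_SIZE:
            b.keys.append(hval)
            return
        if b.local_depth == self.global_depth:       # no spare bit: double the directory
            self.directory = [d for d in self.directory for _ in (0, 1)]
            self.global_depth += 1
        # Split the full bucket on one more bit and re-insert its keys.
        old_keys, depth = b.keys, b.local_depth + 1
        b0, b1 = Bucket(depth), Bucket(depth)
        for i, d in enumerate(self.directory):
            if d is b:
                bit = (i >> (self.global_depth - depth)) & 1
                self.directory[i] = b1 if bit else b0
        for k in old_keys + [hval]:
            self.insert(k)                           # may trigger further splits

# Insert some 4-bit hash values, e.g. 1001 as on the slide:
t = ExtensibleHash()
for hv in [0b0001, 0b1010, 0b1100, 0b0101, 0b1111, 0b1001]:
    t.insert(hv)
print(t.global_depth)                                # the directory depth i after the inserts
```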
Extensible hashing: deletion. Merge blocks and cut the directory if possible (reverse of the insert procedure).
Extensible hashing: summary
+ Can handle growing files, with less wasted space and with no full reorganizations
- Indirection (not bad if the directory is in memory)
- Directory doubles in size (now it fits in memory, now it does not)
Indexing vs Hashing: hashing is good for probes given a key, e.g., SELECT ... FROM R WHERE R.A = 5
Indexing vs Hashing: indexing (including B-trees) is good for range searches, e.g., SELECT ... FROM R WHERE R.A > 5
The BIG picture ...
- Chapters 2 & 3: Storage, records, blocks ...
- Chapters 4 & 5: Access Mechanisms (indexes, B-trees, hashing, multi-key)
- Chapters 6 & 7: Query Processing