Yan Huang - CSCI5330 Database Implementation – Access Methods (1/14/2005)
This is a modified version of Prof. Hector Garcia-Molina's slides. All copyrights belong to the original author.

Basic Concepts
- Search key: set of attributes used to look up records in a file.
[Figure: a value is looked up through an index ("?") to reach the record; an index entry pairs a search key with a pointer.]

Index Evaluation Metrics
- Access types supported efficiently, e.g.:
  - Point query: find "Tom"
  - Range query: find students whose age is between 20 and 40
- Access time
- Update time
- Space overhead

Ordered Indices
- In an ordered index, index entries are stored sorted on the search-key value. E.g., author catalog in a library.

Primary index
- Also called clustering index: the index and the file are in the same order on the search key.
- The search key of a primary index is usually but not necessarily the primary key.
[Figure: sequential file (10, 20, ..., 100) and an index on the search key (10, 30, 50, ..., 230) in the same order.]

Secondary index
- Non-clustering index: the index and the file are in different orders on the search key.
[Figure: index entries 10, 20, 30, 40, 50, 60, 70, ... pointing into a file stored in the order 50, 30, 70, 20, 40, 80, 10, 100, 60, 90.]

Dense Index
- Dense index: contains an index record for every search-key value.
[Figure: sequential file (10, 20, ..., 100) with dense index entries 10, 20, 30, ..., 120.]

Sparse Index
- Sparse index: contains index records for only some search-key values.
- Applicable when records are sequentially ordered on the search key.
[Figure: sequential file (10, 20, ..., 100) with sparse index entries 10, 30, 50, 70, 90, ..., 230, one per block.]
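
A minimal sketch of a sparse-index lookup, assuming two records per block and one index entry per block (the first key of the block), as in the figure. The names `blocks`, `sparse_keys`, and `sparse_lookup` are illustrative, not from the slides.

```python
from bisect import bisect_right

# Sequential file: blocks of two records, sorted on the search key.
blocks = [[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]]
# Sparse index: one entry per block, holding the first key of that block.
sparse_keys = [block[0] for block in blocks]

def sparse_lookup(key):
    """Return (block_number, found) for key using the sparse index."""
    pos = bisect_right(sparse_keys, key) - 1  # last index entry <= key
    if pos < 0:
        return None, False  # key is smaller than every indexed key
    # Only this one block has to be read and scanned.
    return pos, key in blocks[pos]
```

The lookup works only because the file is sorted on the search key: the index entry just below the key identifies the single block that could hold it.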

Secondary indexes
- A sparse index does not make sense here: the file is ordered on the sequence field, not on the secondary key.
[Figure: file stored in sequence-field order with a sparse index (30, 20, 80, 100, 90, ...) on the secondary key.]

Multilevel Index
[Figure: sequential file (10, 20, ..., 100); first-level sparse index (10, 30, 50, ..., 230); sparse 2nd level (10, 90, 170, 250, 330, 410, 490, 570).]

Multilevel Index: Secondary indexes
- Lowest level is dense.
- Other levels are sparse.
[Figure: file ordered on the sequence field; dense lowest level (10, 20, 30, 40, 50, 60, 70, ...); sparse high level (10, 50, 90, ...).]

Conventional indexes
Advantage:
- Simple
- Index is a sequential file, good for scans
Disadvantage:
- Inserts expensive

Outline:
- Conventional indexes
- B+-Tree (NEXT)

NEXT: Another type of index
- Give up on sequentiality of index
- Try to get "balance"

B+Tree Example (n=4)
[Figure: root (100); non-leaf nodes (30) and (120, 150, 180); leaves (3, 5, 11), (30, 35), (100, 101, 110), (120, 130), (150, 156, 179), (180, 200).]

Sample non-leaf node (keys 57, 81, 95)
- Pointer 1: to keys k < 57
- Pointer 2: to keys 57 <= k < 81
- Pointer 3: to keys 81 <= k < 95
- Pointer 4: to keys k >= 95
- Key is moved (not copied) from lower-level non-leaf node to upper-level non-leaf node.
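
A small sketch of how a search picks the pointer to follow in a non-leaf node, assuming the keys are kept sorted and pointer slots are numbered from 0. `child_slot` is a hypothetical helper name.

```python
from bisect import bisect_right

def child_slot(keys, k):
    """Index of the pointer to follow for search key k in a non-leaf node.

    Slot 0 covers k < keys[0]; slot j covers keys[j-1] <= k < keys[j];
    the last slot covers k >= keys[-1].
    """
    return bisect_right(keys, k)

keys = [57, 81, 95]
slots = [child_slot(keys, k) for k in (10, 57, 80, 81, 95, 200)]
```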

Sample leaf node (keys 57, 81, 95)
- Reached from a non-leaf node.
- Pointers: to record with key 57; to record with key 81; to record with key 95; to next leaf in sequence.
- Key is copied (not moved) from leaf node to non-leaf node.

[Figure: node notation. Leaf: keys 30, 35. Non-leaf: key 30.]

Size of nodes:
- n pointers
- n-1 keys

Don't want nodes to be too empty. Use at least:
- Root: 2 pointers
- Non-leaf: ceil(n/2) pointers
- Leaf: ceil((n-1)/2) keys

Full node vs. min. node (n=4)
[Figure: full non-leaf (120, 150, 180) and full leaf (3, 5, 11); minimum non-leaf (30) and minimum leaf (30, 35). The leaf's sequence pointer counts even if null.]

B+tree rules (tree of order n)
(1) All leaves at same lowest level (balanced tree)
(2) Pointers in leaves point to records, except for the "sequence pointer"
(3) Number of pointers/keys for B+tree:

                     Max ptrs   Max keys   Min ptrs->data   Min keys
Non-leaf (non-root)  n          n-1        ceil(n/2)        ceil(n/2) - 1
Leaf (non-root)      n          n-1        ceil((n-1)/2)    ceil((n-1)/2)
Root                 n          n-1        2                1
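
The occupancy limits can be tabulated for a concrete n. This is a sketch that assumes the minimums are rounded up (ceilings), which matches the n=4 minimum nodes shown earlier; `bplus_limits` is a hypothetical helper.

```python
from math import ceil

def bplus_limits(n):
    """Pointer/key limits for a B+tree of order n (n pointers per node)."""
    return {
        "non_leaf": {"max_ptrs": n, "max_keys": n - 1,
                     "min_ptrs": ceil(n / 2), "min_keys": ceil(n / 2) - 1},
        "leaf": {"max_ptrs": n, "max_keys": n - 1,
                 "min_ptrs_to_data": ceil((n - 1) / 2),
                 "min_keys": ceil((n - 1) / 2)},
        "root": {"max_ptrs": n, "max_keys": n - 1,
                 "min_ptrs": 2, "min_keys": 1},
    }

limits4 = bplus_limits(4)
limits5 = bplus_limits(5)
```

For n=4 a non-leaf node needs at least 2 pointers and a leaf at least 2 keys; for n=5 the minimums are 3 pointers and 2 keys.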

Insert into B+tree
(a) simple case: space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root

(a) Insert key = 32 (n=4)
[Figure: root (100), non-leaf (30), leaves (3, 5, 11) and (30, 31). Key 32 fits in the second leaf, which becomes (30, 31, 32).]

(b) Insert key = 7 (n=4)
[Figure: leaf (3, 5, 11) is full. It splits into (3, 5) and (7, 11), and key 7 is copied into the parent, which becomes (7, 30).]

(c) Insert key = 160 (n=4)
[Figure: leaf (150, 156, 179) splits into (150, 156) and (160, 179). The parent (120, 150, 180) is full, so it splits into (120, 150) and (180), and key 160 moves up to the root, which becomes (100, 160).]

(d) New root, insert 45 (n=4)
[Figure: root (10, 20, 30) over leaves (1, 2, 3), (10, 12), (20, 25), (30, 32, 40). The last leaf splits into (30, 32) and (40, 45); the old root overflows and splits into (10, 20) and (40); key 30 moves up into a new root.]
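
The two kinds of split differ only in what happens to the separator key. The sketch below works on key lists alone (pointers are omitted for brevity, which is a simplification) and takes the overflowing node's keys as input; `split_leaf` and `split_non_leaf` are hypothetical names.

```python
def split_leaf(keys):
    """Split an overflowing leaf; the separator is COPIED up (it stays in the right leaf)."""
    mid = (len(keys) + 1) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right[0], right

def split_non_leaf(keys):
    """Split an overflowing non-leaf node; the separator is MOVED up (kept in neither half)."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# Case (b): leaf (3, 5, 11) receives 7.
case_b = split_leaf([3, 5, 7, 11])
# Case (c): non-leaf (120, 150, 180) receives 160.
case_c = split_non_leaf([120, 150, 160, 180])
# Case (d): old root (10, 20, 30) receives 40.
case_d = split_non_leaf([10, 20, 30, 40])
```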

Deletion from B+tree
(a) Simple case (no example)
(b) Coalesce with neighbor (sibling)
(c) Re-distribute keys
(d) Cases (b) or (c) at non-leaf

(b) Coalesce with sibling: delete 50 (n=5)
[Figure: parent (10, 40, 100) over leaves (10, 20, 30) and (40, 50). After deleting 50, the remaining key 40 is merged into the sibling, giving (10, 20, 30, 40), and 40 is removed from the parent.]

(c) Redistribute keys: delete 50 (n=5)
[Figure: parent (10, 40, 100) over leaves (10, 20, 30, 35) and (40, 50). After deleting 50, key 35 moves from the sibling, giving (10, 20, 30) and (35, 40), and the parent key 40 becomes 35.]

(d) Non-leaf coalesce: delete 37 (n=5)
[Figure: root (25) over non-leaf nodes (10, 20) and (30, 40); leaves (1, 3), (10, 14), (20, 22), (25, 26), (30, 37), (40, 45). After deleting 37, leaf (30) merges with (40, 45); the non-leaf node underflows and coalesces with its sibling, pulling 25 down to form the new root (10, 20, 25, 40).]

B+tree deletions in practice
- Often, coalescing is not implemented.
- Too hard and not worth it!

Index Definition in SQL
- Create an index:
    create index <index-name> on <relation-name> (<attribute-list>)
  E.g.: create index gindex on country(gdp);
- To drop an index:
    drop index <index-name>
  E.g.: drop index gindex;

Multi-key Index
- Motivation: find records where DEPT = "Toy" AND SAL > 50k

Strategy I:
- Use one index, say Dept (I1).
- Get all Dept = "Toy" records and check their salary.

Strategy II:
- Use 2 indexes; manipulate pointers.
[Figure: intersect the pointers from the Dept index for "Toy" with the pointers from the Sal index for Sal > 50k.]

Strategy III:
- Multiple key index.
- One idea: an index I1 on the first key whose entries point to indexes I2, I3, ... on the second key.

Example
- Example record: Name=Joe, DEPT=Sales, SAL=15k
[Figure: Dept index (Art, Sales, Toy); each department entry points to its own Salary index (e.g., 10k, 15k, 17k, 21k and 12k, 15k, 15k, 19k).]

For which queries is this index good?
- Find RECs Dept = "Sales" AND SAL = 20k
- Find RECs Dept = "Sales" AND SAL > 20k
- Find RECs Dept = "Sales"
- Find RECs SAL = 20k
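
A sketch of the two-level idea for comparing these queries, assuming a Dept index whose entries lead to per-department salary lists kept sorted. Only Joe's record comes from the slide; the other records and all function names are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# (name, dept, salary in thousands); only Joe's record appears on the slide.
RECORDS = [("Joe", "Sales", 15), ("Ann", "Sales", 12), ("Bob", "Sales", 19),
           ("Eve", "Toy", 17), ("Sam", "Toy", 21), ("Kim", "Art", 10),
           ("Lee", "Art", 15)]

def build(records):
    """First level: dept -> second level: (sorted salaries, names in the same order)."""
    idx = {}
    for name, dept, sal in sorted(records, key=lambda r: r[2]):
        sals, names = idx.setdefault(dept, ([], []))
        sals.append(sal)
        names.append(name)
    return idx

def in_dept(idx, dept, sal_eq=None, sal_gt=None):
    """Queries that fix DEPT: one first-level probe, then a search in one salary index."""
    sals, names = idx.get(dept, ([], []))
    if sal_eq is not None:
        return names[bisect_left(sals, sal_eq):bisect_right(sals, sal_eq)]
    if sal_gt is not None:
        return names[bisect_right(sals, sal_gt):]
    return list(names)

def salary_only(idx, sal):
    """SAL = sal without DEPT: every second-level index has to be probed."""
    return [n for dept in idx for n in in_dept(idx, dept, sal_eq=sal)]

idx = build(RECORDS)
```

The first three queries start from a single department entry, so they are cheap. The last one cannot use the first level at all and must visit every salary index.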

Interesting application: Geographic Data
- DATA: <X1,Y1, Attributes>, <X2,Y2, Attributes>, ...
[Figure: points plotted in the x-y plane.]

Queries:
- What city is at <Xi,Yi>?
- What is within 5 miles from <Xi,Yi>?
- Which is the closest point to <Xi,Yi>?

Example
[Figure: points a through o in the x-y plane and a tree that partitions them on coordinate values (10, 15, 20, 25, 30, 35, 40, ...). Annotations: "Search points near f", "Search points near b".]

Queries
- Find points with Yi > 20
- Find points with Xi < 5
- Find points "close" to i = <12,38>
- Find points "close" to b = <7,24>

Many types of geographic index structures have been suggested
- Quad Trees
- R Trees

Two more types of multi-key indexes
- Grid
- Bitmap index

Grid Index
[Figure: a two-dimensional grid with Key 1 values V1, V2, ..., Vn on one axis and Key 2 values X1, X2, ..., Xn on the other. Each cell leads to the records with that combination, e.g. the cell for key1=V3, key2=X2.]

CLAIM
- Can quickly find records with:
  - key 1 = Vi AND key 2 = Xj
  - key 1 = Vi
  - key 2 = Xj
- And also ranges, e.g., key 1 >= Vi AND key 2 < Xj

But there is a catch with Grid Indexes!
- How is the grid index stored on disk? Like an array: the row for V1 (X1, X2, X3, X4), then the row for V2, then V3, ...
- Problem: need regularity so we can compute the position of the <Vi,Xj> entry.

Solution: Use Indirection
- The grid contains only pointers to buckets; the buckets hold the entries.
[Figure: grid with rows V1 to V4 and columns X1 to X3; each cell points to a bucket.]

With indirection:
- Grid can be regular without wasting space.
- We do have the price of indirection.

Can also index grid on value ranges
- Linear scale for Salary: 0-20K -> 1, 20K-50K -> 2, 50K-infinity -> 3
- Linear scale for Dept: Toy -> 1, Sales -> 2, Personnel -> 3
[Figure: grid with one column per salary range and one row per department.]
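
A sketch of a grid with linear scales and bucket indirection. Assumptions: each salary range includes its lower bound, cells are numbered from 0 rather than 1, and the cell-to-bucket assignment in `grid` is invented for illustration (several cells share a bucket, which is what indirection allows).

```python
from bisect import bisect_right

SALARY_BOUNDS = [20_000, 50_000]        # ranges: 0-20K, 20K-50K, 50K and up
DEPTS = ["Toy", "Sales", "Personnel"]   # linear scale for the second key

def cell(dept, salary):
    """Map a (dept, salary) pair to grid coordinates using the linear scales."""
    return DEPTS.index(dept), bisect_right(SALARY_BOUNDS, salary)

# grid[row][col] holds a bucket id, not records; sparse cells can share a bucket.
grid = [[0, 0, 1],
        [2, 3, 3],
        [4, 4, 4]]
buckets = {i: [] for i in range(5)}

def insert(name, dept, salary):
    row, col = cell(dept, salary)
    buckets[grid[row][col]].append((name, dept, salary))

def lookup(dept, salary):
    """One grid access plus one bucket access; filter because buckets are shared."""
    row, col = cell(dept, salary)
    return [r for r in buckets[grid[row][col]] if r[1] == dept and r[2] == salary]

insert("Joe", "Sales", 15_000)
insert("Ann", "Sales", 60_000)   # hypothetical record
```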

Grid files
+ Good for multiple-key search
- Space, management overhead (nothing is free)
- Need partitioning ranges that evenly split keys

Example Grid File for account
- Two attributes as search key: branch-name and balance.
- Divide branch-name into non-uniform intervals.
- Divide balance into non-uniform intervals.
- Query: branch-name < Central and 10k <= balance < 50k?
- What about Central <= branch-name < Townsend and 50k <= balance?

Example Grid File for account (Cont.)
[Figure: grid array whose cells point to buckets, e.g. Bj and Bk.]

Grid Files (Cont.)
- Linear scales must be chosen to distribute records uniformly across cells. Otherwise there will be too many overflow buckets.
- Periodic reorganization to increase grid size will help, but reorganization can be very expensive.
- Space overhead of the grid array can be high.

Bitmap Indices
- Another index type that can be used for queries on multiple search keys.

Bitmap Indices (Cont.)
[Figure: a relation with one bitmap per unique value of gender and one per unique value of income-level. Each bitmap has size = table size (one bit per record). Annotation: the income-level value of record 3 is L1.]

Bitmap Indices (Cont.)
Some properties of bitmap indices:
- Number of bitmaps for each attribute?
- Size of each bitmap?
- When is the bitmap matrix sparse, and what attributes are good for bitmap indices?

Bitmap Indices (Cont.)
- Bitmap indices are generally very small compared with the relation size.
- E.g., if a record is 100 bytes, the space for a single bitmap is 1/800 of the space used by the relation.
- If the number of distinct attribute values is 8, the bitmap index is only 1% of the relation size.
- What about insertion? Deletion?
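
The arithmetic behind these figures, written out for a relation of N records (N is a symbol introduced here for the derivation):

```latex
\[
\frac{\text{one bitmap}}{\text{relation}}
  = \frac{N \cdot 1\ \text{bit}}{N \cdot 100\ \text{bytes} \cdot 8\ \text{bits/byte}}
  = \frac{1}{800},
\qquad
8 \cdot \frac{1}{800} = \frac{1}{100} = 1\%.
\]
```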

Bitmap Indices Queries
- Sample query: males with income level L1
    10010 AND 10100 = 10000
- What about the number of males with income level L1? Even faster: count the 1 bits in the result; no records need to be fetched.

Bitmap Indices Queries
Queries are answered using bitmap operations:
- Intersection (and)
- Union (or)
- Complementation (not)
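
A sketch of the three bitmap operations using Python integers as bitmaps. The column values below are hypothetical, chosen only so that the bitmaps for "m" and "L1" come out as 10010 and 10100 as in the sample query; the leftmost bit stands for the first record.

```python
def bitmap(column, value):
    """Bitmap for one attribute value; the leftmost bit is the first record."""
    bits = 0
    for v in column:
        bits = (bits << 1) | int(v == value)
    return bits

gender = ["m", "f", "f", "m", "f"]
income = ["L1", "L2", "L1", "L4", "L3"]

males = bitmap(gender, "m")        # 10010
level1 = bitmap(income, "L1")      # 10100

both = males & level1              # intersection (and)
either = males | level1            # union (or)
mask = (1 << len(gender)) - 1      # keep only one bit per record
not_male = ~males & mask           # complementation (not)
count = bin(both).count("1")       # answers the count query without touching records

print(format(both, "05b"), count)  # prints: 10000 1
```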

Hashing
[Figure: key -> h(key) -> bucket containing <key>. Buckets are typically 1 disk block.]

Two alternatives
(1) key -> h(key) -> buckets that hold the records themselves.
(2) key -> h(key) -> index entries (key, pointer) -> record.
- Alt (2) for "secondary" search key

Example hash function
- Key = 'x1 x2 ... xn', an n-byte character string
- Have b buckets
- h: add x1 + x2 + ... + xn, then compute the sum modulo b
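
The example function is short enough to write out directly. This sketch assumes the key's bytes are taken from its UTF-8 encoding, a detail the slide does not specify.

```python
def h(key: str, b: int) -> int:
    """Add the byte values of the key and reduce the sum modulo b buckets."""
    return sum(key.encode("utf-8")) % b

bucket_ab = h("ab", 4)   # (97 + 98) % 4
bucket_ba = h("ba", 4)   # same bytes in another order, so the same bucket
```

Because addition ignores byte order, any two keys made of the same characters collide, which is one reason this may not be the best function.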

This may not be the best function ...
- Good hash function: the expected number of keys/bucket is the same for all buckets.

Within a bucket:
- Do we keep keys sorted?
- Yes, if CPU time is critical and inserts/deletes are not too frequent.

Next: example to illustrate inserts, overflows, deletes
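
EXAMPLE: 2 records/bucket
- INSERT: h(a) = 1, h(b) = 2, h(c) = 1, h(d) = 0
- Then h(e) = 1: bucket 1 already holds a and c, so e goes to an overflow block.
[Figure: buckets 0 to 3. Bucket 0 holds d; bucket 1 holds a, c with e in overflow; bucket 2 holds b; bucket 3 is empty.]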
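
The insert sequence can be replayed with a small bucket class that chains an overflow block when full. The hash values are the ones given on the slide; the `Bucket` class itself is an illustrative assumption, not the slides' data structure.

```python
class Bucket:
    """A fixed-capacity block with a chain of overflow blocks."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.keys = []
        self.overflow = None  # next block in the overflow chain, if any

    def insert(self, key):
        if len(self.keys) < self.capacity:
            self.keys.append(key)
            return
        if self.overflow is None:
            self.overflow = Bucket(self.capacity)
        self.overflow.insert(key)

H = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}   # hash values from the slide
table = [Bucket() for _ in range(4)]
for k in "abcde":
    table[H[k]].insert(k)
```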

EXAMPLE: deletion
- Delete: e, f
[Figure: buckets 0 to 3 holding keys a through g, before and after the deletions. After a deletion, maybe move "g" up.]

Rule of thumb:
- Try to keep space utilization between 50% and 80%.
- Utilization = (# keys used) / (total # keys that fit)
- If < 50%, wasting space.
- If > 80%, overflows significant; depends on how good the hash function is and on # keys/bucket.

How do we cope with growth?
- Overflows and reorganizations
- Dynamic hashing
  - Extensible
  - Linear

Extensible hashing: two ideas
(a) Use i of the b bits output by the hash function; i grows over time.
[Figure: h(K) = 00110101 (b bits), of which the first i bits are used.]
(b) Use a directory: h(K)[i] selects a directory entry, which points to a bucket.

Example: h(k) is 4 bits; 2 keys/bucket
- Start: i = 1. Bucket 0 holds 0001; bucket 1 holds 1001, 1100.
- Insert 1010: bucket 1 overflows, so a new directory with i = 2 (entries 00, 01, 10, 11) is built. The bucket splits into 10 (1001, 1010) and 11 (1100).

Example continued
- Insert 0111, then 0000 (i = 2): 0111 joins 0001; 0000 then overflows that bucket, which splits into 00 (0000, 0001) and 01 (0111). The directory does not grow.

Example continued
- Insert 1001: bucket 10 (1001, 1010) overflows, so the directory doubles to i = 3 (entries 000 to 111). The bucket splits into 100 (1001, 1001) and 101 (1010).
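
The example can be replayed with a compact sketch of extensible hashing. Assumptions: the directory uses the first (high-order) i bits of the hash value, each bucket records how many bits it distinguishes (`depth`, a name introduced here), and more than `capacity` identical hash values are not handled, since that would need overflow blocks.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth   # number of leading hash bits this bucket distinguishes
        self.keys = []

class ExtensibleHash:
    def __init__(self, hash_bits=4, capacity=2):
        self.b = hash_bits
        self.capacity = capacity
        self.i = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _slot(self, h):
        return h >> (self.b - self.i)   # first i bits of the b-bit hash value

    def insert(self, h):
        bucket = self.directory[self._slot(h)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(h)
            return
        if bucket.depth == self.i:
            # Directory doubles: each entry is duplicated side by side.
            self.directory = [bk for bk in self.directory for _ in range(2)]
            self.i += 1
        # Split the full bucket on its next bit.
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        old_keys, bucket.keys = bucket.keys, []
        for slot, bk in enumerate(self.directory):
            if bk is bucket and (slot >> (self.i - bucket.depth)) & 1:
                self.directory[slot] = sibling
        for k in old_keys + [h]:
            self.insert(k)

table = ExtensibleHash()
for value in [0b0001, 0b1001, 0b1100, 0b1010, 0b0111, 0b0000, 0b1001]:
    table.insert(value)
```

After the seven inserts the directory has 8 entries (i = 3), but entries 000/001, 010/011, and 110/111 still share buckets, which is why only the overflowing bucket had to be split each time.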

Extensible hashing: deletion
- Option 1: no merging of blocks.
- Option 2: merge blocks and cut the directory if possible (reverse of the insert procedure).

Deletion example:
- Run through the insert example in reverse!

Extensible hashing: Summary
+ Can handle growing files
  - with less wasted space
  - with no full reorganizations
- Indirection (not bad if the directory is in memory)
- Directory doubles in size (now it fits, now it does not)

Linear hashing
- Another dynamic hashing scheme
Two ideas:
(a) Use the i low-order bits of the hash; i grows.
[Figure: hash value 01110101 (b bits), of which the last i bits are used.]
(b) File grows linearly.

Example: b=4 bits, i=2, 2 keys/bucket
- Bucket 00 holds 0000, 1010; bucket 01 holds 0101, 1111; buckets 10 and 11 are future growth buckets.
- m = 01 (max used block)
- Insert 0101: it belongs in bucket 01, which is full. Can have overflow chains!
Rule:
- If h(k)[i] <= m, then look at bucket h(k)[i]
- else, look at bucket h(k)[i] - 2^(i-1)
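
The addressing rule is a one-liner in code. A sketch, with `bucket_for` as a hypothetical name; h is the hash value, i the number of low-order bits in use, and m the highest bucket number currently allocated.

```python
def bucket_for(h, i, m):
    """Bucket number for hash value h under the linear hashing rule."""
    low = h & ((1 << i) - 1)          # the i low-order bits of h
    if low <= m:
        return low                    # that bucket already exists
    return low - (1 << (i - 1))       # otherwise drop the leading 1 of the i bits

# State from the example: i = 2, m = 01.
placements = {format(h, "04b"): bucket_for(h, 2, 0b01)
              for h in (0b0000, 0b1010, 0b0101, 0b1111)}
```

With m = 01, keys ending in 10 and 11 are redirected to buckets 00 and 01, which is why 1010 sits with 0000 and 1111 sits with 0101.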

Example (continued): b=4 bits, i=2, 2 keys/bucket
[Figure: the file grows one bucket at a time. Bucket 10 is added and receives 1010 from bucket 00; bucket 11 is added and receives 1111 from bucket 01; bucket 01 keeps 0101, 0101. m = max used block.]

Example Continued: How to grow beyond this?
[Figure: with m = 11 (max used block), i goes from 2 to 3. Buckets are then addressed with 3 bits, and new buckets 100, 101, 110, 111 are added one at a time; e.g., 0101 moves to bucket 101.]

When do we expand file?
- Keep track of: U = (# used slots) / (total # of slots)
- If U > threshold, then increase m (and maybe i)

Linear Hashing: Summary
+ Can handle growing files
  - with less wasted space
  - with no full reorganizations
+ No indirection like extensible hashing
- Can still have overflow chains

Example: BAD CASE
[Figure: the buckets near the end of the file are very full while the buckets at the split position are very empty. Need to move m here, but that would waste space.]

Summary
Hashing
- How it works
- Dynamic hashing
  - Extensible
  - Linear

Indexing vs Hashing
- Hashing is good for probes given a key, e.g.:
    SELECT … FROM R WHERE R.A = 5

Indexing vs Hashing
- INDEXING is good for range searches, e.g.:
    SELECT … FROM R WHERE R.A > 5