Index Structures Chapter 13 of GUW September 16, 2019 ICS 541: Index Structures
Objectives Different ways of organizing blocks What is the best way to organize record in blocks to minimize: Query cost Exact match Partial match Range Join Insertion cost Deletion cost Update cost Storage cost September 16, 2019 ICS 541: Index Structures
- Lecture outline Basic Concepts Index on Sequential files Secondary Indexes B-Trees Hash Tables September 16, 2019 ICS 541: Index Structures
- Basic Concepts … Disk Index blocks Data blocks Empty blocks September 16, 2019 ICS 541: Index Structures
… - Basic Concepts … Index blocks result Query Data blocks September 16, 2019 ICS 541: Index Structures
… - Basic Concepts Block pointer Record pointer Record pointers take more space than block pointers. Why? September 16, 2019 ICS 541: Index Structures
- Indexes Sequential Files Dense Index Sparse Index Multiple level of Index Index with Duplicate Search Keys Managing Indexes During Data Modification September 16, 2019 ICS 541: Index Structures
-- Sequential Files Sequential File 10 20 30 40 50 60 70 80 90 100 September 16, 2019 ICS 541: Index Structures
-- Dense Index Sequential File Dense Index 10 20 30 40 50 60 70 80 90 Search key 50 60 70 80 60 50 80 70 90 100 110 120 100 90 September 16, 2019 ICS 541: Index Structures
-- Sparse Index Sequential File Sparse Index 10 20 30 40 50 60 70 80 90 110 130 150 60 50 80 70 170 190 210 230 100 90 September 16, 2019 ICS 541: Index Structures
-- Dense Vs Sparse Index: Example Relation Relation R with 1,000,000 tuples A block of size 4096 bytes (4k) 10 R tuples per block Data space required => 1000000/10 * 4k = 400MB. Dense index record size: 30 Bytes for search key + 8 bytes for record pointer Can fit 100 index records per block Dense index space = 1000000/100 * 4k = 40MB. Binary search cost log2(10000) = 14 disk accesses at most Keeping blocks (1/2, ¼, ¾, 1/8, …) in memory can lower disk access. Sparse index 1000 index blocks = 4MB Binary search cost log2(1000) = 10 disk accesses at most September 16, 2019 ICS 541: Index Structures
-- Multiple level of Index Sequential File Sparse 2nd level 20 10 10 90 170 250 10 30 50 70 40 30 90 110 130 150 330 410 490 570 60 50 80 70 170 190 210 230 100 90 September 16, 2019 ICS 541: Index Structures
-- Contiguous sequential file R1 K1 say: 1024 B per block if we want K3 block: get it at offset (3-1)1024 = 2048 bytes R2 K2 K3 R3 K4 R4 September 16, 2019 ICS 541: Index Structures
-- Sparse vs. Dense Tradeoff Less index space per record can keep more of index in memory Dense Can tell if any record exists without accessing file Must be used for secondary index September 16, 2019 ICS 541: Index Structures
--Terms Index sequential file Search key ( primary key) Primary index (on Sequencing field) Secondary index Dense index (all Search Key values in) Sparse index Multi-level index September 16, 2019 ICS 541: Index Structures
-- Duplicate keys … 10 10 20 20 30 30 40 45 September 16, 2019 ICS 541: Index Structures
Dense index, one way to implement? -- Duplicate keys … Dense index, one way to implement? 10 10 10 10 10 10 10 10 20 10 20 10 20 20 30 20 30 20 20 20 30 30 30 30 30 30 30 30 45 40 45 40 September 16, 2019 ICS 541: Index Structures
… -- Duplicate keys … 10 10 20 20 30 30 40 45 Dense index, better way? September 16, 2019 ICS 541: Index Structures
careful if looking for 20 or 30! … -- Duplicate keys …. 10 10 20 20 30 Sparse index, one way? careful if looking for 20 or 30! 10 10 10 20 20 10 30 30 20 30 45 40 September 16, 2019 ICS 541: Index Structures
place first new key from block … -- Duplicate keys … place first new key from block 10 should this be 40? 10 20 30 20 10 30 30 20 30 45 40 Sparse index, another way? September 16, 2019 ICS 541: Index Structures
a a a b … -- Duplicate keys Incase of primary index may point to first instance of each value only File Index a a a . b September 16, 2019 ICS 541: Index Structures
-- Managing Indexes During Data Modification Deletion from Sparse Index Insertion into Sparse Index September 16, 2019 ICS 541: Index Structures
--- Deletion from sparse index … 20 10 10 30 50 40 30 70 60 50 90 110 130 80 70 150 September 16, 2019 ICS 541: Index Structures
… --- Deletion from sparse index … delete record 40 20 10 10 30 50 40 30 70 60 50 90 110 130 80 70 150 September 16, 2019 ICS 541: Index Structures
… --- Deletion from sparse index … delete record 30 20 10 10 40 30 50 40 30 70 60 50 90 110 130 80 70 150 September 16, 2019 ICS 541: Index Structures
… ---- Deletion from sparse index … delete records 30 & 40 20 10 10 50 70 30 50 40 30 70 60 50 90 110 130 80 70 150 September 16, 2019 ICS 541: Index Structures
… --- Deletion from dense index … 20 10 10 20 30 40 30 40 60 50 50 60 70 80 70 80 September 16, 2019 ICS 541: Index Structures
… --- Deletion from dense index delete record 30 20 10 10 20 40 40 30 30 40 40 60 50 50 60 70 80 70 80 September 16, 2019 ICS 541: Index Structures
--- Insertion, sparse index case … 20 10 10 30 40 30 60 50 40 60 September 16, 2019 ICS 541: Index Structures
… --- Insertion, sparse index case … insert record 34 20 10 10 30 40 30 34 our lucky day! we have free space where we need it! 60 50 40 60 September 16, 2019 ICS 541: Index Structures
… --- Insertion, sparse index case … insert record 15 20 10 15 20 30 10 30 40 30 60 50 40 Illustrated: Immediate reorganization Variation: insert new block (chained file) update index 60 September 16, 2019 ICS 541: Index Structures
… --- Insertion, sparse index case insert record 25 20 10 25 overflow blocks (reorganize later...) 10 30 40 30 60 50 40 60 September 16, 2019 ICS 541: Index Structures
Similar Often more expensive . . . --- Insertion, dense index case September 16, 2019 ICS 541: Index Structures
- Secondary Indexes Design of Secondary Indexes Duplicate Values and Secondary Indexes Applications of Secondary Indexes Indirection in Secondary Indexes September 16, 2019 ICS 541: Index Structures
-- Design of Secondary Indexes … Sequence field 50 30 70 20 40 80 10 100 60 90 September 16, 2019 ICS 541: Index Structures
does not make sense! … -- Design of Secondary Indexes … 30 50 20 70 80 Sequence field Sparse index does not make sense! 50 30 30 20 80 100 70 20 90 ... 40 80 10 100 60 90 September 16, 2019 ICS 541: Index Structures
… -- Design of Secondary Indexes … Sequence field Dense index 10 20 30 40 50 60 70 ... 50 30 10 50 90 ... sparse high level 70 20 40 80 10 100 60 90 September 16, 2019 ICS 541: Index Structures
… -- Design of Secondary Indexes Lowest level is dense Record pointers Other levels are sparse Block pointers September 16, 2019 ICS 541: Index Structures
-- Duplicate Values and Secondary Indexes … 10 20 40 20 40 10 40 10 40 30 September 16, 2019 ICS 541: Index Structures
-- Duplicate Values and Secondary Indexes … one option... 10 20 10 20 40 20 Problem: excess overhead! disk space search time 20 30 40 40 10 40 10 40 ... 40 30 September 16, 2019 ICS 541: Index Structures
-- Duplicate Values and Secondary Indexes … another option... 10 20 10 40 20 20 Problem: variable size records in index! 40 10 30 40 40 10 40 30 September 16, 2019 ICS 541: Index Structures
-- Duplicate Values and Secondary Indexes … 10 20 10 20 30 40 40 20 50 60 ... 40 10 40 10 40 30 buckets September 16, 2019 ICS 541: Index Structures
---Why “bucket” idea is useful … Indexes Records Name: primary EMP (name,dept,floor,...) Dept: secondary Floor: secondary September 16, 2019 ICS 541: Index Structures
Query: Get employees in (Toy Dept) ^ (2nd floor) … ---Why “bucket” idea is useful … Query: Get employees in (Toy Dept) ^ (2nd floor) Dept. index EMP Floor index Toy 2nd Intersect toy bucket and 2nd Floor bucket to get set of matching EMP’s. Used for text retrieval September 16, 2019 ICS 541: Index Structures
This idea used in text information retrieval. … ---Why “bucket” idea is useful This idea used in text information retrieval. Documents Inverted lists cat dog ...the cat is fat ... ...was raining cats and dogs... ...Fido the dog ... September 16, 2019 ICS 541: Index Structures
-- Conventional indexes Advantage: - Simple - Index is sequential file good for scans Disadvantage: - Inserts expensive, and/or - Lose sequentiality & balance September 16, 2019 ICS 541: Index Structures
--- Example Index (sequential) 10 39 31 35 36 32 38 34 33 20 30 40 50 continuous free space 10 39 31 35 36 32 38 34 33 overflow area (not sequential) 20 30 40 50 60 70 80 90 September 16, 2019 ICS 541: Index Structures
-- Next Another type of index Btree Has Schemes Give up on sequentiality of index Try to get “balance” Btree Has Schemes September 16, 2019 ICS 541: Index Structures
B+Tree Example n=3 Root 100 120 150 180 30 3 5 11 120 130 180 200 30 35 100 101 110 150 156 179 September 16, 2019 ICS 541: Index Structures
-- Sample non-leaf 57 81 95 To keys 57> and <=81 To keys >95 To keys <57 September 16, 2019 ICS 541: Index Structures
-- Sample leaf node: 57 81 95 with key 57 with key 81 To record From non-leaf node to next leaf in sequence 57 81 95 with key 57 with key 81 To record with key 85 September 16, 2019 ICS 541: Index Structures
Non-leaf: (n+1)/2 pointers Leaf: (n+1)/2 pointers to data -- Size of Nodes n keys n + 1 Pointers Use at least Non-leaf: (n+1)/2 pointers Leaf: (n+1)/2 pointers to data September 16, 2019 ICS 541: Index Structures
--- Example: n = 3 Full node min. node Non-leaf Leaf 120 150 180 30 3 11 30 35 counts even if null September 16, 2019 ICS 541: Index Structures
-- B+tree rules tree of order n All leaves at same lowest level (balanced tree) Pointers in leaves point to records except for “sequence pointer” September 16, 2019 ICS 541: Index Structures
-- Insert into B+tree (a) simple case (b) leaf overflow space available in leaf (b) leaf overflow (c) non-leaf overflow (d) new root September 16, 2019 ICS 541: Index Structures
--- Insert key = 32 n=3 100 30 3 5 11 30 31 32 September 16, 2019 ICS 541: Index Structures
n=3 --- Insert key = 7 100 30 7 3 5 11 30 31 3 5 7 September 16, 2019 ICS 541: Index Structures
n=3 --- Insert key = 160 100 160 120 150 180 180 150 156 179 180 200 160 179 September 16, 2019 ICS 541: Index Structures
--- New root, insert 45 n=3 30 new root 10 20 30 40 1 2 3 10 12 20 25 32 40 40 45 September 16, 2019 ICS 541: Index Structures
-- Deletion from B+tree Simple case - no example (b) Coalesce with neighbor (sibling) (c) Re-distribute keys (d) Cases (b) or (c) at non-leaf September 16, 2019 ICS 541: Index Structures
n=4 --- Coalesce with sibling: Delete 50 10 40 100 40 10 20 30 40 50 September 16, 2019 ICS 541: Index Structures
n=4 --- Redistribute key: Delete 50 10 40 100 35 10 20 30 35 40 50 September 16, 2019 ICS 541: Index Structures
n=4 --- Non-leaf coalesce: Delete 37 new root 25 25 10 20 30 40 40 30 26 1 3 10 14 20 22 30 37 40 45 September 16, 2019 ICS 541: Index Structures
--- B+tree deletions in practice Often, coalescing is not implemented Too hard and not worth it! September 16, 2019 ICS 541: Index Structures
- Hashing key h(key) <key> Buckets (typically 1 disk block) . September 16, 2019 ICS 541: Index Structures
. . (1) key h(key) -- Two alternatives … records September 16, 2019 ICS 541: Index Structures
(2) key h(key) … -- Two alternatives record Index Alt (2) for “secondary” search key September 16, 2019 ICS 541: Index Structures
-- Example hash function … Key = ‘x1 x2 … xn’ n byte character string Have b buckets h: add x1 + x2 + ….. Xn compute sum modulo b September 16, 2019 ICS 541: Index Structures
… -- Example hash function This may not be best function … Read Knuth Vol. 3 if you really need to select a good function. Good hash Expected number of function: keys/bucket is the same for all buckets September 16, 2019 ICS 541: Index Structures
-- Within a bucket Do we keep keys sorted? Yes if CPU time critical, and Inserts/Deletes not too frequent September 16, 2019 ICS 541: Index Structures
-- Example to illustrate inserts, overflows, deletes h(K) September 16, 2019 ICS 541: Index Structures
… -- Example to illustrate insert … 1 2 3 INSERT: h(a) = 1 h(b) = 2 h(c) = 1 h(d) = 0 d a c b e h(e) = 1 September 16, 2019 ICS 541: Index Structures
Delete: e f c … -- Example to illustrate delete … a b d c d e f g 1 2 1 2 3 a d b d c c e maybe move “g” up f g September 16, 2019 ICS 541: Index Structures
--- Rule of thumb Try to keep space utilization between 50% and 80% Utilization = # keys used total # keys that fit If < 50%, wasting space If > 80%, overflows significant depends on how good hash function is & on # keys/bucket September 16, 2019 ICS 541: Index Structures
-- How do we cope with growth? Static hashing Overflows and reorganizations Very expensive Solution Dynamic hashing Extensible Linear September 16, 2019 ICS 541: Index Structures
-- Extensible hashing: two ideas .. (a) Use i of b bits output by hash function b h(K) use i grows over time…. 00110101 September 16, 2019 ICS 541: Index Structures
… -- Extensible hashing: two ideas (b) Use directory h(K)[i ] to bucket . . September 16, 2019 ICS 541: Index Structures
--- Example: h(k) is 4 bits; 2 keys/bucket … New directory 2 00 01 10 11 i = 1 i = 0001 1 1 1001 1 1100 1010 1100 Insert 1010 September 16, 2019 ICS 541: Index Structures
… --- Example: h(k) is 4 bits; 2 keys/bucket … 0000 0111 0001 i = 2 00 01 10 11 1 0001 0111 2 1001 1010 Insert: 0111 0000 2 1100 September 16, 2019 ICS 541: Index Structures
… --- Example: h(k) is 4 bits; 2 keys/bucket 000 001 010 011 100 101 110 111 3 i = 0000 2 i = 0001 2 00 01 10 11 0111 2 1001 1010 2 1001 1010 Insert: 1001 2 1100 September 16, 2019 ICS 541: Index Structures
--- Extensible hashing: deletion Two option: No merging of blocks Merge blocks and cut directory if possible September 16, 2019 ICS 541: Index Structures
---- Deletion example Run thru insert example in reverse! September 16, 2019 ICS 541: Index Structures
-- Summary: Extensible hashing Can handle growing files - No full reorganizations + Indirection (Not bad if directory in memory) Directory doubles in size (Now it fits, now it does not) - September 16, 2019 ICS 541: Index Structures
-- Summary: Extensible hashing Advantage No reorganization is needed One disk access per record Disadvantage Doubling bucket array is expensive The size of the bucket array may no longer fit into memory The number of bucket may be much bigger than the blocks Example: Splitting records can only be done in higher bits. September 16, 2019 ICS 541: Index Structures
-- Linear hashing 01110101 Another dynamic hashing scheme Two ideas: (a) Use i low order bits of hash 01110101 grows b i (b) File grows linearly September 16, 2019 ICS 541: Index Structures
Example b=4 bits, i =2, 2 keys/bucket 0101 can have overflow chains! insert 0101 Future growth buckets 0000 0101 1010 1111 00 01 10 11 m = 01 (max used block) If h(k)[i ] m, then look at bucket h(k)[i ] else, look at bucket h(k)[i ] - 2i -1 Rule September 16, 2019 ICS 541: Index Structures
Example b=4 bits, i =2, 2 keys/bucket 0101 insert 0101 1111 0101 Future growth buckets 11 0000 1010 0101 10 1010 1111 00 01 10 11 m = 01 (max used block) September 16, 2019 ICS 541: Index Structures
Example Continued: How to grow beyond this? 100 101 110 111 3 i = 2 0000 100 0101 101 0101 1010 1111 0101 00 01 10 11 . . . m = 11 (max used block) September 16, 2019 ICS 541: Index Structures
--- When do we expand file? Keep track of: # used slots total # of slots = U If U > threshold then increase m (and maybe i ) September 16, 2019 ICS 541: Index Structures
-- Summary: Linear Hashing Can handle growing files - with less wasted space - with no full reorganizations No indirection like extensible hashing + + Can still have overflow chains - September 16, 2019 ICS 541: Index Structures
-- Hashing Summary Hashing - How it works - Dynamic hashing - Extensible - Linear September 16, 2019 ICS 541: Index Structures
-- Indexing Vs Hashing … Hashing good for probes given key e.g., SELECT … FROM R WHERE R.A = 5 September 16, 2019 ICS 541: Index Structures
… -- Indexing vs Hashing INDEXING (Including B Trees) good for Range Searches: e.g., SELECT FROM R WHERE R.A > 5 September 16, 2019 ICS 541: Index Structures
- Reading list Chapter 13 of GUW B-tree (WebCt) September 16, 2019 ICS 541: Index Structures
END September 16, 2019 ICS 541: Index Structures