1 More on Indexes: Secondary Indexes, B-Trees
Source: our textbook, slides by Hector Garcia-Molina
2 Secondary Indexes
• Sometimes we want multiple indexes on a relation.
  - Ex: search Candies(name, manf) both by name and by manufacturer
• Typically the file is sorted on the key (ex: name), and the primary index is on that field.
• A secondary index is on any other attribute (ex: manf).
• A secondary index also facilitates finding records, but it cannot rely on the records being sorted on its search key.
3 Sparse Secondary Index?
• No!
• Since records are not sorted on that key, we cannot predict the location of a record from the location of any other record.
• Thus secondary indexes are always dense.
4 Figure (sequence field): a sparse index does not make sense!
5 Design of Secondary Indexes
• Always dense, usually with duplicates
• Consists of key-pointer pairs ("key" means search key, not relation key)
• Entries in the index file are sorted by key
• Therefore a second-level index built on top of it can be sparse
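The design above can be sketched in a few lines of Python. This is only an illustration under assumptions not in the slides: the record values, the block capacity `ENTRIES_PER_BLOCK`, and the helper `lookup` are all hypothetical, and list positions stand in for disk pointers.

```python
from bisect import bisect_left, bisect_right

# Hypothetical data file of (name, manf) records, stored sorted by name.
records = [("A", "Y"), ("B", "X"), ("C", "Y"), ("D", "Z"), ("E", "X")]

# Dense first level: one (search key, record position) entry per record,
# sorted by search key, duplicates included.
first_level = sorted((manf, pos) for pos, (_name, manf) in enumerate(records))

# Sparse second level: only the first key of each first-level block.
ENTRIES_PER_BLOCK = 2  # assumed block capacity, for illustration only
second_level = [(first_level[i][0], i)
                for i in range(0, len(first_level), ENTRIES_PER_BLOCK)]

def lookup(manf):
    """Return all records with the given manf, using the first level only."""
    keys = [k for k, _pos in first_level]
    lo, hi = bisect_left(keys, manf), bisect_right(keys, manf)
    return [records[pos] for _key, pos in first_level[lo:hi]]
```

Because the first level is sorted, the second level needs only one entry per first-level block, which is what makes it sparse.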
6 Secondary indexes
[Figure: data file ordered on its sequence field; dense first-level index; sparse second-level index]
7 Secondary Index and Duplicate Keys
• The scheme in the previous diagram wastes space in the presence of duplicate keys
• If a search key value appears n times in the data file, then there are n entries for it in the index.
8 Duplicate values & secondary indexes
One option... Problem: excess overhead!
  - disk space
  - search time
9 Buckets
• To avoid repeating values, use a level of indirection
• Put buckets between the secondary index file and the data file
• One entry in the index for each search key K; its pointer goes to a location in a "bucket file", called the bucket for K
• The bucket holds pointers to all records with search key K
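A minimal sketch of the bucket scheme, assuming an in-memory list as the data file and list offsets as pointers. The names `build_bucket_index` and `bucket_for` and the sample records are hypothetical.

```python
# Hypothetical data file of (name, manf) records, sorted by name.
records = [("A", "Y"), ("B", "X"), ("C", "Y"), ("D", "Z"), ("E", "X")]

def build_bucket_index(records, field):
    """Build (index, bucket_file) for the given field position."""
    groups = {}
    for pos, rec in enumerate(records):
        groups.setdefault(rec[field], []).append(pos)
    bucket_file = []  # flat sequence of record pointers
    index = []        # one (key, offset into bucket_file) entry per distinct key
    for key in sorted(groups):
        index.append((key, len(bucket_file)))
        bucket_file.extend(groups[key])
    return index, bucket_file

def bucket_for(index, bucket_file, key):
    """Return the record pointers stored in the bucket for key."""
    for i, (k, start) in enumerate(index):
        if k == key:
            # The bucket ends where the next key's bucket begins.
            end = index[i + 1][1] if i + 1 < len(index) else len(bucket_file)
            return bucket_file[start:end]
    return []

index, bucket_file = build_bucket_index(records, field=1)
```

Each distinct key appears once in `index`, no matter how many records share it; only the pointers are repeated.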
10 Duplicate values & secondary indexes
• Buckets save space as long as search keys are larger than pointers and the average key appears at least twice
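The space condition can be checked with a rough byte count. This sketch assumes a simplified cost model that is not in the slides: every index entry costs one key plus one pointer, every bucket slot costs one pointer, and block overhead is ignored. The function names and sample sizes are hypothetical.

```python
def index_bytes_without_buckets(num_records, key_size, ptr_size):
    # One (key, pointer) entry per record.
    return num_records * (key_size + ptr_size)

def index_bytes_with_buckets(num_records, num_distinct, key_size, ptr_size):
    # One (key, pointer) entry per distinct key, plus one pointer per record
    # in the bucket file.
    return num_distinct * (key_size + ptr_size) + num_records * ptr_size

# 1000 records, each key appearing twice on average (500 distinct keys).
big_keys = (index_bytes_without_buckets(1000, 20, 8),
            index_bytes_with_buckets(1000, 500, 20, 8))
small_keys = (index_bytes_without_buckets(1000, 4, 8),
              index_bytes_with_buckets(1000, 500, 4, 8))
```

Under this model, with keys appearing twice on average, buckets win when keys are larger than pointers (`big_keys`) and lose when keys are smaller (`small_keys`).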
11 Why the “bucket” idea is useful
[Figure: Emp(name, dept, floor, ...) records and their indexes]
  - name: primary
  - dept: secondary
  - floor: secondary
12 Query: SELECT name FROM Emp WHERE dept = 'Toy' AND floor = 2
[Figure: dept index, Emp file, floor index]
• Intersect the Toy dept bucket and the floor 2 bucket to get the set of matching Emp records
• Saves disk I/Os
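The bucket intersection for this query can be sketched as follows. The Emp rows and the helper `build_buckets` are hypothetical; sets of list positions stand in for buckets of record pointers.

```python
# Hypothetical Emp(name, dept, floor) records.
emp = [("Ann", "Toy", 2), ("Bob", "Toy", 1), ("Cy", "Shoe", 2), ("Di", "Toy", 2)]

def build_buckets(records, field):
    """Map each value of the field to the set of record positions holding it."""
    buckets = {}
    for pos, rec in enumerate(records):
        buckets.setdefault(rec[field], set()).add(pos)
    return buckets

dept_buckets = build_buckets(emp, 1)
floor_buckets = build_buckets(emp, 2)

# Intersect the pointer sets first; only the survivors are fetched.
matches = dept_buckets.get("Toy", set()) & floor_buckets.get(2, set())
names = [emp[pos][0] for pos in sorted(matches)]
```

Only the records in `matches` are read from the data file, which is where the disk I/O saving comes from.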
13 Summary of Indexes So Far
• Advantages:
  - simple
  - index is a sequential file, good for scans
• Disadvantages:
  - either inserts are expensive
  - or we lose sequentiality (cf. next slide)
• Instead, use the B-tree data structure to implement the index
14 Example
[Figure: index (sequential) with continuous free space, and an overflow area (not sequential)]
15 B-Trees
• Several related data structures
• Key features are:
  - automatically adjust the number of index levels as the size of the data file changes
  - storage on blocks is managed to keep every block between half full and full => no overflow blocks needed
• We'll actually study B+ trees
16 B-Tree Structure
• An example of a balanced search tree: every root-to-leaf path has the same length
• Each node (vertex) in the tree is a block, which contains search keys and pointers
• Parameter n: the largest value such that n+1 pointers and n keys fit in one block
  - Ex: If block size is 4096 bytes, keys are 4 bytes, and pointers are 8 bytes, then n = 340, since 4n + 8(n+1) ≤ 4096 gives n ≤ 340.67.
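The choice of n can be computed directly. A small sketch, assuming no per-block header; the function name `max_keys_per_block` is hypothetical.

```python
def max_keys_per_block(block_size, key_size, pointer_size):
    """Largest n such that n keys and n+1 pointers fit in one block.

    Solves n*key_size + (n+1)*pointer_size <= block_size for integer n.
    """
    return (block_size - pointer_size) // (key_size + pointer_size)

n = max_keys_per_block(block_size=4096, key_size=4, pointer_size=8)
```

With the slide's sizes, 340 keys and 341 pointers use 4088 of the 4096 bytes.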
17 Constraints on B-Tree Nodes
• Keys in leaf nodes are copies of keys from the data file, in sorted order
• Root contains between 2 and n+1 index node pointers
• Each internal node contains between ⌈(n+1)/2⌉ and n+1 index node pointers
• Each non-leaf node consists of ptr_1, key_1, ptr_2, key_2, …, key_{m-1}, ptr_m, where ptr_i points to the index node with keys k such that key_{i-1} ≤ k < key_i
18 Constraints (cont'd)
• Each leaf contains between ⌊(n+1)/2⌋ and n data record pointers, plus a "next leaf" pointer
• Associated with each data record pointer is a key, and the pointer points to the data record with that key
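The occupancy limits can be tabulated for any n. This sketch assumes the textbook's rounding convention (round up for internal nodes, round down for leaves); the function name `pointer_bounds` is hypothetical.

```python
import math

def pointer_bounds(n):
    """Return (min, max) pointer counts per node type for parameter n.

    Leaf counts cover data record pointers only; the next-leaf pointer
    is extra.
    """
    return {
        "root": (2, n + 1),
        "internal": (math.ceil((n + 1) / 2), n + 1),
        "leaf": ((n + 1) // 2, n),
    }

small = pointer_bounds(3)
large = pointer_bounds(340)
```

For n = 340, an internal node holds between 171 and 341 pointers and a leaf between 170 and 340 record pointers.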
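19 Example B-tree nodes with n =
[Figure: a leaf and a non-leaf node, each in textbook notation and in a more concise notation]
• Leaf: pointers to the record with key 30 and to the record with key 35
• Non-leaf: pointers to the part of the tree with keys < 30 and to the part of the tree with keys ≥ 30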
20 Sample non-leaf
[Figure: non-leaf node with keys 57, 81, 95 and four pointers]
  - to keys k < 57
  - to keys 57 ≤ k < 81
  - to keys 81 ≤ k < 95
  - to keys k ≥ 95
21 Sample leaf node
[Figure: leaf node with keys 57, 81, 85]
  - from non-leaf node
  - to next leaf in sequence
  - to record with key 57
  - to record with key 81
  - to record with key 85
22 [Figure: full node vs. min. node, for non-leaf and leaf nodes, n= ; a pointer counts even if null]
23 B-Tree Example, n=
[Figure: tree from the root down to the leaves, … to records …]
24 Insert into B+tree
(a) simple case
  - space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root
25 (a) Insert key = 32 n=
26 (b) Insert key = 7 n=
27 (c) Insert key = 160 n=
28 (d) New root, insert 45 n= new root
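Of the four insertion cases, the simple case (a) and leaf overflow (b) can be sketched at the level of a single leaf. This is a simplified illustration, not the slides' exact procedure: it handles keys only (no record pointers), does not update the parent, and assumes the larger half stays in the left leaf on a split. The function name `insert_into_leaf` and the value n = 3 are assumptions.

```python
from bisect import insort
import math

def insert_into_leaf(keys, key, n):
    """Insert key into a sorted leaf that holds at most n keys.

    Returns (left_keys, right_keys, separator). right_keys and separator
    are None when no split is needed. On a split, the separator is the
    smallest key of the new right leaf; it must be inserted into the parent.
    """
    keys = list(keys)
    insort(keys, key)
    if len(keys) <= n:
        return keys, None, None          # case (a): space available
    split = math.ceil((n + 1) / 2)       # case (b): leaf overflow
    left, right = keys[:split], keys[split:]
    return left, right, right[0]

N = 3  # assumed value of n for this example
no_split = insert_into_leaf([30, 31], 32, N)
with_split = insert_into_leaf([3, 5, 11], 7, N)
```

Inserting the separator into the parent can overflow that node in turn, which is case (c); if the overflow reaches the root, a new root is created, which is case (d).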
29 Deletion from B-tree
(a) Simple case - no example
(b) Coalesce with neighbor (sibling)
(c) Re-distribute keys
(d) Cases (b) or (c) at non-leaf
30 (b) Coalesce with sibling
  - Delete n=4 40
31 (c) Redistribute keys
  - Delete n=4 35
32 (d) Non-leaf coalesce
  - Delete 37
[Figure: n= ; new root]
33 B-tree deletions in practice
• Often, coalescing is not implemented
  - Too hard and not worth it!
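For reference, the leaf-level deletion cases (a) to (c) can be sketched as below. This is a simplified illustration under several assumptions: only the left sibling is considered, one key is moved on redistribution, keys stand in for key-pointer pairs, and the parent's keys are not updated. The function name `delete_from_leaf` and the sample keys are hypothetical.

```python
def delete_from_leaf(left_sibling, leaf, key, n):
    """Delete key from leaf and repair underflow using the left sibling.

    Returns (case, new_left_sibling, new_leaf). A leaf must keep at least
    floor((n+1)/2) keys.
    """
    min_keys = (n + 1) // 2
    leaf = [k for k in leaf if k != key]
    if len(leaf) >= min_keys:
        return "simple", list(left_sibling), leaf
    if len(left_sibling) > min_keys:
        # Sibling can spare a key: move its largest key over.
        return "redistribute", left_sibling[:-1], [left_sibling[-1]] + leaf
    # Sibling is at minimum: merge both leaves into one.
    return "coalesce", left_sibling + leaf, []

simple = delete_from_leaf([10, 20], [40, 50, 60], 50, n=4)
redistribute = delete_from_leaf([10, 20, 30], [40, 50], 50, n=4)
coalesce = delete_from_leaf([10, 20], [40, 50], 50, n=4)
```

After a coalesce, the parent loses a pointer and may itself underflow, which is case (d).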
34 Applications of B-Trees
• B-tree is used to implement indexes
• The data record pointers in the leaves correspond to the data record pointers in sequential indexes
• Some example uses:
  - B-tree search key is primary key for data file, leaf pointers form a dense index on the file
  - B-tree search key is primary key for data file, leaf pointers form a sparse index on the file
  - B-tree search key is not primary key, leaf pointers form a dense index on the file
35 B-Trees with Duplicate Keys
Change definition of B-tree:
• If key K appears in an internal node, then K is the smallest "new" key in the subtree S rooted at the pointer that follows K in the node
• "New" means K does not appear in the part of the B-tree to the left of S but it does appear in S
• Allow null key in certain situations
36 Example B-Tree with Duplicates
37 Lookup in B-Trees
• Assume no duplicate keys.
• Assume B-tree is a dense index.
• To find the record with key K, search starting at the root and ending at a leaf:
  - if the current node is not a leaf and has keys K_1, K_2, …, K_m (m ≤ n), find the largest key K_i in the sequence that is ≤ K (take i = 0 if K < K_1)
  - follow the (i+1)-st pointer to a node at the next level and repeat
  - when a leaf node is reached, find the key with value K and follow the associated pointer to the data record
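The lookup procedure can be sketched as follows, assuming no duplicate keys and a dense index. The `Node` class, the sample tree, and the record strings are hypothetical; Python object references stand in for block pointers.

```python
from bisect import bisect_right

class Node:
    """A B+ tree node: non-leaves have children, leaves have records."""
    def __init__(self, keys, children=None, records=None):
        self.keys = keys
        self.children = children  # len(keys) + 1 child nodes, or None for a leaf
        self.records = records    # one record per key, or None for a non-leaf

def lookup(root, key):
    """Return the record with the given key, or None if it is absent."""
    node = root
    while node.children is not None:
        # bisect_right counts the keys <= key, i.e. the i of the largest
        # K_i <= key; the (i+1)-st pointer is children[i] in 0-based indexing.
        node = node.children[bisect_right(node.keys, key)]
    for k, rec in zip(node.keys, node.records):
        if k == key:
            return rec
    return None

leaves = [Node([3, 5], records=["r3", "r5"]),
          Node([7, 11], records=["r7", "r11"]),
          Node([13, 17], records=["r13", "r17"])]
root = Node([7, 13], children=leaves)
```

Each loop iteration reads one node, so the number of block reads equals the number of levels in the tree.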
38 Range Queries with B-Trees
• Range query: a query in which a range of values is sought. Examples:
  - SELECT * FROM R WHERE R.k > 40;
  - SELECT * FROM R WHERE R.k >= 10 AND R.k <= 25;
• To find all keys in the range [a,b]:
  - Do a lookup on a: leads to the leaf where a could be
  - Search the leaf for all keys ≥ a
  - If we find a key > b, we are done
  - Else follow the next-leaf pointer and continue searching in the next leaf
  - Continue until finding a key > b or no more leaves
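The range scan can be sketched as below. The `Leaf` and `Inner` classes and the sample tree are hypothetical, and the function returns the matching keys rather than following record pointers.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None  # next-leaf pointer

class Inner:
    def __init__(self, keys, children):
        self.keys = keys
        self.children = children

def range_query(root, a, b):
    """Return all keys k with a <= k <= b, in sorted order."""
    node = root
    while isinstance(node, Inner):
        # Descend to the leaf where a could be.
        node = node.children[bisect_right(node.keys, a)]
    result = []
    while node is not None:
        for k in node.keys:
            if k > b:
                return result      # passed the end of the range
            if k >= a:
                result.append(k)
        node = node.next           # continue in the next leaf
    return result

l1, l2, l3 = Leaf([3, 5]), Leaf([7, 11]), Leaf([13, 17])
l1.next, l2.next = l2, l3
root = Inner([7, 13], [l1, l2, l3])
```

After the initial descent, the scan touches only leaves, moving left to right along the next-leaf chain.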
39 Efficiency of B-Trees
• B-trees allow lookup, insertion and deletion of records with very few disk I/Os
• Number of disk I/Os is the number of levels in the B-tree plus the cost of any reorganization
• If n is at least 10, then splitting/merging blocks will be rare and usually limited to the leaves
• For typical sizes of keys, pointers, blocks and files, 3 levels suffice (see next slide)
• Also can keep the root block of the B-tree in memory
40 Size of B-Tree
• Assume
  - 4096 bytes per block
  - 4 bytes per key (e.g., integer)
  - 8 bytes per pointer
  - no header info in the block
• Then n = 340 (can keep n keys and n+1 pointers in a block)
• Assume on average a block has 255 pointers
• Count:
  - one node at level 1 (the root)
  - 255 nodes at level 2
  - 255*255 = 65,025 nodes at level 3 (leaves)
  - each leaf has 255 pointers, so the total number of records is 255*65,025 = 16,581,375, more than 16 million
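The count above generalizes to any fan-out and number of levels. A small sketch, assuming every node (including the root) holds the same average number of pointers; the function name `btree_capacity` is hypothetical.

```python
def btree_capacity(pointers_per_block, levels):
    """Return (nodes per level, total records) for a uniform-fan-out tree."""
    nodes_per_level = [pointers_per_block ** i for i in range(levels)]
    # Every leaf contributes pointers_per_block record pointers.
    records = nodes_per_level[-1] * pointers_per_block
    return nodes_per_level, records

nodes, records = btree_capacity(pointers_per_block=255, levels=3)
```

Under the same assumption, adding a fourth level multiplies the record count by 255 again.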