Indexes
A heap file allows record retrieval by specifying the rid, or by scanning all records sequentially.
Sometimes records must be retrieved by specifying the values in one or more fields (semantic search or value-based query), e.g., "Find all students in the CS dept"; "Find students with gpa > 3".
Indexes are files (separate from the data file they index) that enable answering these value-based queries efficiently.
Indexes contain "search keys", k, which are values from the attribute being indexed, and "data entries", k*, which lead us to the records containing the search key value (usually pointers).
Index Classification
Primary vs. secondary: If the search key contains the clustered primary key, the index is called a primary index; otherwise it is called a secondary index.
Clustered vs. unclustered: If the ordering (closeness) of the data records matches the ordering of the data entries, the index is called a clustered index.
- A file can be clustered on at most 1 attribute (search key).
- The cost of retrieving data records through an index varies greatly depending on whether the index is clustered or not!
Primary Index
PRIMARY INDEX: I(k,p)
k = "key" field values from the ordered or clustered field of the file, with the uniqueness property (each value can occur at most once).
p = pointer to the page containing the record with value k.
Primary indexes can be either:
- DENSE: every record is indexed, or
- NON-DENSE: only the key values of records at the beginning of a page are indexed (the anchor record of the page), and the pointer is then a page number only.

Example: Assume the blocking factor (bfr) is 2, which means 2 records/page.

STUDENT
|S#|SNAME |LCODE |pg
|17|BAID  |NY2091|1
|25|CLAY  |NJ5101|1
|32|THAISZ|NJ5102|2
|38|GOOD  |FL6321|2
|57|BROWN |NY2092|3
|83|THOM  |ND3450|3

Non-dense Primary Index on S#
|S#|pg
|17| 1
|32| 2
|57| 3

Dense Primary Index on S# (pg and offset together form the RID)
|S#|pg|offset
|17| 1| 0
|25| 1| 1
|32| 2| 0
|38| 2| 1
|57| 3| 0
|83| 3| 1

Inserting and deleting are major problems:
- records must be moved to maintain ordering
- anchors change (in the non-dense case)
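The two lookups can be sketched as follows. This is a minimal illustration over the STUDENT example, not a storage-engine implementation; the in-memory `pages` dictionary and the function names `lookup_nondense` and `lookup_dense` are assumptions made for the sketch.

```python
from bisect import bisect_right

# STUDENT file from the example: page number -> records (S#, SNAME, LCODE), bfr = 2.
pages = {
    1: [(17, "BAID", "NY2091"), (25, "CLAY", "NJ5101")],
    2: [(32, "THAISZ", "NJ5102"), (38, "GOOD", "FL6321")],
    3: [(57, "BROWN", "NY2092"), (83, "THOM", "ND3450")],
}

# Non-dense index: one (anchor key, page) entry per page.
nondense = [(recs[0][0], pg) for pg, recs in sorted(pages.items())]

# Dense index: one (key, (page, offset)) entry per record.
dense = [(rec[0], (pg, off))
         for pg, recs in sorted(pages.items())
         for off, rec in enumerate(recs)]

def lookup_nondense(key):
    """Find the last anchor <= key, then search inside that page."""
    i = bisect_right([k for k, _ in nondense], key) - 1
    if i < 0:
        return None
    pg = nondense[i][1]
    for rec in pages[pg]:
        if rec[0] == key:
            return rec
    return None

def lookup_dense(key):
    """Find the exact index entry and follow its RID (page, offset)."""
    keys = [k for k, _ in dense]
    i = bisect_right(keys, key) - 1
    if i < 0 or keys[i] != key:
        return None
    pg, off = dense[i][1]
    return pages[pg][off]
```

The non-dense index has half as many entries here but must search within the page it lands on; the dense index can tell that a key is absent without touching the data file.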
Clustering Index
Like a primary index, except that the attribute need not be a key:
- the file must be clustered on the attribute, k
- the pointer for any k is the address of the 1st page with that k-value

ENROLL2
|S#|C#|GRADE|pg
|17|6 | 96  |1
|25|6 | 76  |1
|32|6 | 62  |2
|38|6 | 98  |2
|32|6 | 91  |3
|25|7 | 68  |3
|32|8 | 89  |4
|17|9 | 95  |4

Dense Clustering_Index on C#
|C#|pg|
|6 | 1|
|7 | 3|
|8 | 4|
|9 | 4|

Non-dense Clustering_Index on C# (indexing new anchor records only)
|C#|pg|
|6 | 1|
|8 | 4|

There's no more search overhead with this 2nd type of non-dense clustering index, but:
- How can you know which page has C#=7? (search pages starting at pg=1)
- How can you know which page has C#=9? (search pages starting at pg=4)
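The two questions above can be answered with a forward scan from the nearest anchor. The sketch below assumes an in-memory `pages` dictionary holding only the C# values of ENROLL2; `first_page_with` is a hypothetical helper name.

```python
from bisect import bisect_right

# C# values stored on each ENROLL2 page (file is clustered on C#).
pages = {1: [6, 6], 2: [6, 6], 3: [6, 7], 4: [8, 9]}

# Non-dense clustering index: only new anchor values are indexed.
nondense = [(6, 1), (8, 4)]

def first_page_with(c):
    """Start at the page of the largest indexed anchor <= c, then scan forward."""
    i = bisect_right([k for k, _ in nondense], c) - 1
    if i < 0:
        return None
    pg = nondense[i][1]
    while pg in pages:
        if c in pages[pg]:
            return pg
        if pages[pg][0] > c:  # the page already starts past c, so c is absent
            return None
        pg += 1
    return None
```

For C#=7 the scan starts at page 1 and reads pages 1, 2 and 3 before finding the value; for C#=9 it starts at page 4 and finds it immediately.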
Secondary Index
These indexes are the same as the previous ones, except that:
- the file need not be clustered on k
- p points to the page or record containing k
- every record must be indexed (dense)

ENROLL (unclustered on C#)
|S#|C#|GRADE|pg
|32|8 | 89  |1
|25|6 | 76  |1
|32|6 | 62  |2
|25|8 | 86  |2
|38|6 | 98  |3
|32|7 | 91  |3
|17|5 | 96  |4
|25|7 | 68  |4
|17|8 | 95  |5

Option 1: If there are multiple occurrences of k, use multiple index entries for that k.

Secondary_Index, Option 1 on C#
|C#|pg
|5 | 4
|6 | 1
|6 | 2
|6 | 3
|7 | 3
|7 | 4
|8 | 1
|8 | 2
|8 | 5

Option 2: Use repeating groups of pointers (requires variable-length pointer lists).

|C#|page
|5 | 4
|6 | 1,2,3
|7 | 3,4
|8 | 1,2,5

Option 3: Use 1 index entry for each value, with 1 pointer to a "list" or "linked list" of record pointers (1 level of indirection).

Secondary_Index, Option 3 on C# (same ENROLL file)
|C#| page
|5 | -->|4|
|6 | -->|1|->|2|->|3|
|7 | -->|3|->|4|
|8 | -->|1|->|2|->|5|
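As a rough sketch, the index layouts can be built directly from the ENROLL pages listed above. The tuple layout and the names `option1` and `option2` are assumptions for illustration; Option 3 stores the same page lists as Option 2 but behind one pointer per key.

```python
from collections import defaultdict

# ENROLL file (not clustered on C#): (S#, C#, GRADE, page).
enroll = [
    (32, 8, 89, 1), (25, 6, 76, 1),
    (32, 6, 62, 2), (25, 8, 86, 2),
    (38, 6, 98, 3), (32, 7, 91, 3),
    (17, 5, 96, 4), (25, 7, 68, 4),
    (17, 8, 95, 5),
]

# Option 1: one (C#, page) index entry per record, sorted by C#.
option1 = sorted((c, pg) for _, c, _, pg in enroll)

# Option 2: one entry per C# value holding a variable-length list of pages.
grouped = defaultdict(list)
for c, pg in option1:
    grouped[c].append(pg)
option2 = dict(grouped)
```

Here `option2[6]` is `[1, 2, 3]`: three data pages must be read to fetch all records with C#=6, which is typical for an unclustered index.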
Multi-level Index (made up of an index on an index)
Since any index is itself a file clustered on the key, k, it can have a primary or clustering index on it (constituting the second level of the multilevel index).

STUDENT
|S#|SNAME |LCODE |pg
|17|BAID  |NY2091|1
|25|CLAY  |NJ5101|1
|32|THAISZ|NJ5102|2
|38|GOOD  |FL6321|2
|57|BROWN |NY2092|3
|83|THOM  |ND3450|3
|91|PARK  |MN7334|4
|94|SIVA  |OR1123|4

S#-index (nondense, primary)
|S#|pg|pg (of index file)
|17| 1|1
|32| 2|1
|57| 3|2
|91| 4|2

2nd_LEVEL (a second-level, nondense index)
|S#|pg|
|17| 1|
|57| 2|
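A lookup through the two levels might be sketched like this, assuming the index and data pages are held in small Python structures; `floor_entry` and `find_page` are hypothetical helper names.

```python
from bisect import bisect_right

# Second-level index: (anchor S#, page of the first-level index file).
level2 = [(17, 1), (57, 2)]

# First-level index file: index page -> [(anchor S#, data page)].
level1 = {1: [(17, 1), (32, 2)], 2: [(57, 3), (91, 4)]}

# S# values stored on each STUDENT data page.
data = {1: [17, 25], 2: [32, 38], 3: [57, 83], 4: [91, 94]}

def floor_entry(entries, key):
    """Return the entry with the largest anchor <= key, or None."""
    i = bisect_right([k for k, _ in entries], key) - 1
    return entries[i] if i >= 0 else None

def find_page(key):
    """Descend level 2 -> level 1 -> data page; None if the key is absent."""
    top = floor_entry(level2, key)
    if top is None:
        return None
    # Never None here: the first anchor on that index page equals top's anchor.
    entry = floor_entry(level1[top[1]], key)
    pg = entry[1]
    return pg if key in data[pg] else None
```

Each level reads exactly one page, so a lookup costs one page read per index level plus one data page read.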
Index Classification (Contd.)
Dense vs. sparse: If there is at least one index entry per existing attribute value, the index is called dense, else sparse.

Data File (Name, age, bonus)
Ashby, 25, 3000
Basu, 33, 4003
Bristow, 30, 2007
Cass, 50, 5004
Daniels, 22, 6003
Jones, 40, 6003
Smith, 44, 3000
Tracy, 44, 5004

Sparse Index on Name (anchor records of each page): Ashby, Cass, Smith
Dense Index on Age: 22, 25, 30, 33, 40, 44, 50

Every sparse index must be clustered! Sparse indexes are smaller.

Tree-structured indexing techniques support both range searches (AKA inequality searches) and equality searches.
- ISAM (a variation of multilevel clustering): static structure.
- B+ tree: dynamic; adjusts gracefully under inserts and deletes.
ISAM
[Figure: an index page holding index entries 1, 2, ..., m, each of the form <k,k*>]
- 1 index entry per page of the data file, of the form <k,k*>, sorted on the attribute value, k. k* points to the 1st page (possibly) containing k.
- Provides alternate entry points into the file; faster than binary search, which has just one entry point.
- The index file may still be quite large. But we can apply the idea repeatedly!
[Figure: non-leaf (inode) pages above the leaf pages; the leaf level consists of primary pages with overflow pages chained to them]
Leaf pages contain data entries, <k,k*>. In inodes, k* = indirect ptr.
Example ISAM Tree
Each node can hold 2 (k,k*) entries. In any internal node or inode (non-leaf), add a ptr for key values < the first k-value.

Root:    [40,40*]
Inodes:  [20,20* 33,33*]  [51,51* 63,63*]
Leaves:  [10,10* 15,15*] [20,20* 27,27*] [33,33* 37,37*] [40,40* 46,46*] [51,51* 55,55*] [63,63* 97,97*]
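Insert k=23, Insert k=48, Insert k=41, Insert k=42

Index pages
Root:    [40,40*]
Inodes:  [20,20* 33,33*]  [51,51* 63,63*]

Primary leaf pages
[10,10* 15,15*] [20,20* 27,27*] [33,33* 37,37*] [40,40* 46,46*] [51,51* 55,55*] [63,63* 97,97*]

Overflow pages
- Insert k=23: leaf [20,20* 27,27*] is full. Need overflow page: [23,23*]
- Insert k=48: leaf [40,40* 46,46*] is full. Need overflow page: [48,48*]
- Insert k=41: goes into the same overflow page: [48,48* 41,41*]
- Insert k=42: that overflow page is now full. Need overflow page: [42,42*], chained after [48,48* 41,41*]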
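Deleting 42, Deleting 51, Deleting 97

Tree before the deletions:
Root:      [40,40*]
Inodes:    [20,20* 33,33*]  [51,51* 63,63*]
Leaves:    [10,10* 15,15*] [20,20* 27,27*] [33,33* 37,37*] [40,40* 46,46*] [51,51* 55,55*] [63,63* 97,97*]
Overflow:  [23,23*] under leaf [20,20* 27,27*]; [48,48* 41,41*] -> [42,42*] under leaf [40,40* 46,46*]

Deleting 42 empties its overflow page, deleting 51 leaves [55,55*] in its leaf, and deleting 97 leaves [63,63*] in its leaf. The index pages are not modified.
Note that 51 still appears in the index levels, but 51* is no longer in the leaf!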
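The insert and delete sequences above can be replayed with a small sketch. It flattens the two index levels into one sorted list of separator keys (an assumption made to keep the example short) and models each overflow chain as a list of pages; `insert`, `delete` and `find_leaf` are illustrative names.

```python
from bisect import bisect_right

CAPACITY = 2  # entries per page, as in the example tree

# Flattened index keys of the example (root 40; inodes 20, 33 and 51, 63).
# The index is static: inserts and deletes never change it.
separators = [20, 33, 40, 51, 63]
primary = [[10, 15], [20, 27], [33, 37], [40, 46], [51, 55], [63, 97]]
overflow = [[] for _ in primary]  # per leaf: chain of overflow pages

def find_leaf(key):
    """Index of the primary leaf page whose key range covers key."""
    return bisect_right(separators, key)

def insert(key):
    i = find_leaf(key)
    if len(primary[i]) < CAPACITY:
        primary[i].append(key)
        primary[i].sort()
        return
    chain = overflow[i]
    if not chain or len(chain[-1]) == CAPACITY:
        chain.append([])  # need a new overflow page
    chain[-1].append(key)

def delete(key):
    i = find_leaf(key)
    if key in primary[i]:
        primary[i].remove(key)
        return
    for page in overflow[i]:
        if key in page:
            page.remove(key)
            break
    overflow[i] = [page for page in overflow[i] if page]  # free empty overflow pages

for k in (23, 48, 41, 42):
    insert(k)
after_inserts = [[list(page) for page in chain] for chain in overflow]

for k in (42, 51, 97):
    delete(k)
```

After the deletions, `separators` still contains 51 even though no leaf holds 51*, which is the static-index behavior the slide points out.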
B+ Tree: The Most Widely Used Index
- Keeps the tree height-balanced.
- Minimum 50% occupancy (except for root). Each node contains m entries, where d ≤ m ≤ 2d. d is called the degree or order of the index.
- Supports equality and range searches efficiently.
[Figure: index entries ("direct search set" or "index set") at the top; data entries ("sequence set") at the leaf level]
Example B+ Tree (d=2)
Search begins at the root, and key comparisons direct it to a leaf.
Examples: search for 5; search for 15; search for all data entries ≥ 24.

Root:    [13 17 24 30]
Leaves:  [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

Leaves are doubly linked for fast sequential search in either direction.
15 is not in the file!
Example B+ Tree (contd.)
Search for all data entries < 23 (note, this is the reason for the double linkage).

Root:    [13 17 24 30]
Leaves:  [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]
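The searches on these two slides can be sketched over the example tree. Adjacent positions in the `leaves` list stand in for the sibling links, which is an assumption of this sketch; `find_leaf`, `search`, `scan_ge` and `scan_lt` are illustrative names.

```python
from bisect import bisect_right

# Example B+ tree (d=2): one root page over five linked leaves.
root_keys = [13, 17, 24, 30]
leaves = [[2, 3, 5, 7], [14, 16], [19, 20, 22], [24, 27, 29], [33, 34, 38, 39]]

def find_leaf(key):
    """Key comparisons at the root pick the child pointer to follow."""
    return bisect_right(root_keys, key)

def search(key):
    """Equality search: descend to one leaf and look there."""
    return key in leaves[find_leaf(key)]

def scan_ge(low):
    """Range search >= low: find the leaf, then follow next pointers."""
    i = find_leaf(low)
    out = [k for k in leaves[i] if k >= low]
    for leaf in leaves[i + 1:]:
        out.extend(leaf)
    return out

def scan_lt(high):
    """Range search < high: find the leaf, then follow previous pointers."""
    i = find_leaf(high)
    out = [k for k in leaves[i] if k < high]
    for leaf in reversed(leaves[:i]):
        out = leaf + out
    return out
```

With these definitions `search(15)` is `False`, matching the slide, and `scan_lt(23)` walks backward from the leaf holding 19*, 20*, 22*.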
Inserting a Data Entry into a B+ Tree
- Find the correct leaf L.
- Put the data entry in L.
  - If L has enough space, done!
  - Else, must split L (into L and a new node L2): redistribute entries evenly and copy up (promote) the middle key (the middle value that was promoted is now the anchor key for L2).
- This can happen recursively (e.g., if there is no space for the promoted middle value in the inode to which it is promoted).
  - To split an inode, redistribute entries evenly, but push up (promote) the middle key.
- So promote means: copy up at a leaf; move up at an inode.
- Splits "grow" the tree; only a root split increases height.
  - Only tree growth is possible: wider, or 1 level taller at the top.
Inserting 8*
Leaf split: no room for 8 in the leaf [2* 3* 5* 7*], so split.
  [2* 3* 5* 7*] + 8*  ->  [2* 3*] [5* 7* 8*]
  Entry to be inserted in parent node: 5. (Note that 5 is copied up and continues to appear in the new leaf node, L2, as anchor value.)
Index split: no room for 5 in the inode [13 17 24 30], so split and move 17 up.
  [5 13 17 24 30]  ->  [5 13] [24 30], with 17 moved up
  Entry to be inserted in parent node: 17. (Note that 17 is moved up and appears only once in the index. Contrast this with a leaf split.)
Observe how minimum occupancy is guaranteed in both leaf and index page splits.
Note the difference between copy-up (leaf) and move-up (inode).
B+ Tree Before Inserting 8*
Root:    [13 17 24 30]
Leaves:  [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

After Inserting 8*
Root:    [17]
Inodes:  [5 13]  [24 30]
Leaves:  [2* 3*] [5* 7* 8*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

Note the height increase, and that balance and occupancy are maintained.
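The difference between copy-up and move-up can be sketched with two small functions. They operate on key lists only and ignore child pointers, which is a simplification; `split_leaf` and `split_index` are hypothetical names.

```python
def split_leaf(entries, d):
    """Split an overfull leaf (2d+1 entries).

    Returns (left, right, copied_up_key). The promoted key is COPIED up:
    it stays in the new right leaf as its anchor value.
    """
    entries = sorted(entries)
    left, right = entries[:d], entries[d:]
    return left, right, right[0]

def split_index(keys, d):
    """Split an overfull index node (2d+1 keys).

    Returns (left, right, moved_up_key). The promoted key is MOVED up:
    it appears in neither half afterwards.
    """
    keys = sorted(keys)
    return keys[:d], keys[d + 1:], keys[d]

# Replaying the insertion of 8* into the example tree (d=2):
leaf_left, leaf_right, promoted = split_leaf([2, 3, 5, 7, 8], d=2)
idx_left, idx_right, new_root_key = split_index([13, 17, 24, 30, promoted], d=2)
```

Both halves of each split end up with at least d entries, which is how the 50% minimum occupancy is preserved.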
Deleting a Data Entry from a B+ Tree
- Start at the root, find the leaf L where the entry belongs.
- Remove the entry.
  - If L is at least half-full, done!
  - If L has only d-1 entries:
    - Try to re-distribute, borrowing from a sibling (an adjacent node with the same parent as L).
    - If re-distribution fails, merge L and a sibling.
- A merge could propagate to the root, decreasing the height.
Example Tree After Inserting 8*
Root:    [17]
Inodes:  [5 13]  [24 30]
Leaves:  [2* 3*] [5* 7* 8*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

Then Deleting 19*, 20*
Root:    [17]
Inodes:  [5 13]  [27 30]
Leaves:  [2* 3*] [5* 7* 8*] [14* 16*] [22* 24*] [27* 29*] [33* 34* 38* 39*]

Deleting 19* is easy.
Deleting 20* is done with re-distribution of 24* (and revision of the anchor value, from 24 to 27, in the inode).
... And Then Deleting 24*
Before:
Root:    [17]
Inodes:  [5 13]  [27 30]
Leaves:  [2* 3*] [5* 7* 8*] [14* 16*] [22* 24*] [27* 29*] [33* 34* 38* 39*]

Must merge: after the delete, leaf [22*] is below minimum occupancy and its sibling [27* 29*] has nothing to spare, so the two leaves merge into [22* 27* 29*].
Observe the "toss" of index entry 27. The inode [30] is now below minimum occupancy, so merge it with its sibling [5 13]; index entry 17 can be "pulled down" (sibling merge, followed by pull-down).

After:
Root:    [5 13 17 30]
Leaves:  [2* 3*] [5* 7* 8*] [14* 16*] [22* 27* 29*] [33* 34* 38* 39*]
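The choice between re-distribution and merge can be sketched as below. This version redistributes entries evenly between the two siblings, which is one common policy and an assumption here; `fix_underflow` and `merge_index` are hypothetical names, and child pointers are ignored.

```python
def fix_underflow(left, right, d):
    """Repair two adjacent sibling leaves after a delete left one with d-1 entries.

    Returns ("redistribute", new_left, new_right, new_anchor) when the siblings
    together hold at least 2d entries, else ("merge", merged_leaf).
    """
    total = sorted(left + right)
    if len(total) >= 2 * d:
        mid = len(total) // 2
        new_left, new_right = total[:mid], total[mid:]
        return ("redistribute", new_left, new_right, new_right[0])
    return ("merge", total)

def merge_index(left_keys, pulled_down_key, right_keys):
    """Merge two sibling inodes, pulling the separating key down from the parent."""
    return left_keys + [pulled_down_key] + right_keys

# Deleting 20* leaves [22*] next to [24* 27* 29*]: re-distribute, anchor becomes 27.
step1 = fix_underflow([22], [24, 27, 29], d=2)

# Deleting 24* leaves [22*] next to [27* 29*]: must merge, index entry 27 is tossed.
step2 = fix_underflow([22], [27, 29], d=2)

# Inode [30] is now underfull: merge with sibling [5 13] and pull 17 down.
new_root = merge_index([5, 13], 17, [30])
```

The pulled-down 17 replaces the old root, so the tree loses one level and is back to height 1 above the leaves.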
Summary
- Tree indexes are ideal for range searches and equality searches.
- ISAM is a static structure.
  - Only leaf pages are modified; overflow pages are needed.
  - Overflow chains can degrade performance unless the size of the data set and the data distribution stay constant.
- B+ tree is a dynamic structure.
  - Inserts/deletes leave the tree height-balanced.
  - High fanout (F) means depth is rarely more than 3 or 4.
  - Almost always better than maintaining a sorted file.
  - Typically, 67% occupancy on average.
  - Usually preferable to ISAM; adjusts to growth gracefully.
- Most widely used index in database management systems because of its versatility. One of the most optimized components of a DBMS.
- Caution! There is much variation in implementation.
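The claim about depth can be checked with a little arithmetic. The fanout of 100 and the page counts below are assumed values chosen for illustration, not figures from these slides; `levels_needed` is a hypothetical helper.

```python
def levels_needed(num_leaf_pages, fanout):
    """Number of index levels above the leaves needed to reach every leaf page.

    Each index level multiplies the number of reachable pages by the fanout.
    """
    levels = 0
    reach = 1
    while reach < num_leaf_pages:
        reach *= fanout
        levels += 1
    return levels

# Assumed values: fanout F = 100.
three_levels = levels_needed(1_000_000, 100)    # 100**3 = 1,000,000 leaf pages
four_levels = levels_needed(100_000_000, 100)   # 100**4 = 100,000,000 leaf pages
```

Under these assumptions, three index levels already address a million leaf pages and four address a hundred million, which is why the depth rarely exceeds 3 or 4.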