12.1 Chapter 12: Indexing and Hashing Spring 2009 Sections , , Problems , 12.7, 12.8, 12.13, 12.15, 12.18
Basic Concepts. Indexing - used to speed up access to data. Search key - the attribute or set of attributes used to look up records in a file. Index file - consists of records (index entries) of the form (search-key, pointer). Two kinds of indices: ordered, where search keys are stored in some sorted order; and hash, where search keys are distributed uniformly across “buckets” using a hash function. Evaluation criteria: the access types supported efficiently (e.g., records with a specified value in the attribute, or records with a value falling in a specified range of values); record access, insertion, and deletion times; index overhead.
Ordered Indices. Index entries are sorted on the search-key value. Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file; this is often the primary key. Secondary index: an index whose search key specifies an order different from the file's sequential order. Dense index: an index entry appears for every search-key value in the file.
12.4 A Sparse Index. How do we insert/delete records when there is an index? E.g., insert a record for Othertown? E.g., delete the A-110 record? The A-215 record?
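A minimal sketch of a lookup through a sparse index, to make the questions above concrete. The block layout, the account numbers, and the names `blocks`, `sparse_index`, and `lookup` are assumptions made for illustration; they are not taken from the slide's figure.

```python
from bisect import bisect_left

# Illustrative data file: records sorted on the search key (branch name)
# and grouped into blocks. The block boundaries are made up.
blocks = [
    [("Brighton", "A-217"), ("Downtown", "A-101"), ("Downtown", "A-110")],
    [("Mianus", "A-215"), ("Perryridge", "A-102"), ("Perryridge", "A-201")],
    [("Redwood", "A-222"), ("Round Hill", "A-305")],
]

# Sparse index: one (search-key, block number) entry per block.
sparse_index = [(block[0][0], number) for number, block in enumerate(blocks)]
index_keys = [key for key, _ in sparse_index]


def lookup(key):
    """Return the records with the given search key via the sparse index."""
    # Start at the last block whose first key is strictly smaller than the
    # search key (or at block 0), so a key whose records straddle two blocks
    # is not missed.
    start = max(bisect_left(index_keys, key) - 1, 0)
    matches = []
    for block in blocks[start:]:
        for record in block:
            if record[0] == key:
                matches.append(record)
            elif record[0] > key:
                return matches  # file is sorted, nothing further can match
    return matches
```

Under this layout, an insert for Othertown would not change the index, since no block's first key changes. Deleting the A-215 record would change the first key of the second block, so its index entry would have to be updated.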
12.5 Multilevel Index. If a primary index does not fit in memory, access becomes expensive. Use a sparse index on a dense index to reduce the number of disk accesses. Outer index: a sparse index of the primary index. Inner index: the primary index file. Store the outer index in main memory. Insertion/deletion?
12.6 Secondary Indices. Used to search on some attribute other than the primary key, e.g., the balance field of account. Secondary indices have to be dense.
12.7 B+-Tree Index Files. Problems with indexed-sequential files: performance degrades as the file grows (many overflow blocks), and the entire file needs periodic reorganization. Typical node (size $n$): $K_i$ are the search-key values (ordered within a node); $P_i$ are pointers to children (for non-leaf nodes) or to buckets of records (for leaf nodes).
12.8 B+-Tree Index Files. Properties: all paths from root to leaf are of the same length; the root node has between 2 and $n$ children; non-leaf nodes other than the root have between $\lceil n/2 \rceil$ and $n$ children (pointers); leaf nodes have between $\lceil (n-1)/2 \rceil$ and $n-1$ values; insertions and deletions are done in logarithmic time. Automatic reorganization with small, local changes. Here $\lceil n/2 \rceil$ is the smallest integer $\ge n/2$.
12.9 Non-Leaf Nodes in B+-Trees. Non-leaf nodes form a multi-level sparse index on the leaf nodes. Properties: all the search keys in the subtree to which $P_1$ points are less than $K_1$; for $2 \le i \le n-1$, all the search keys in the subtree to which $P_i$ points have values greater than or equal to $K_{i-1}$ and less than $K_i$; $P_n$ points to the subtree whose search keys have values $\ge K_{n-1}$. E.g., for $n = 3$ the components of a node are $P_1\ K_1\ P_2\ K_2\ P_3$.
12.10 Leaf Nodes in B+-Trees. For $i = 1, 2, \ldots, n-1$, $P_i$ points either to a file record with search-key value $K_i$, or to a bucket of pointers to file records, each record having search-key value $K_i$ (the bucket structure is needed only if the search key is not a primary key).
12.11 B+-tree for account ($n = 3$). The root has at least 2 children. Other non-leaf nodes have between 2 and 3 children ($\lceil n/2 \rceil$ and $n$). Leaf nodes have between 1 and 2 values ($\lceil (n-1)/2 \rceil$ and $n-1$). Queries: how would you find Downtown and Round Hill?
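One way to trace the two queries is the sketch below. The tree is hand-built for $n = 3$ and is an assumption (the slide's figure is not reproduced here), as are the `Node` class, the `find` function, and the account numbers stored in the buckets.

```python
from bisect import bisect_right


class Node:
    """A B+-tree node: non-leaf nodes have children, leaf nodes have buckets."""

    def __init__(self, keys, children=None, buckets=None):
        self.keys = keys          # sorted search-key values K_1, K_2, ...
        self.children = children  # P_1, P_2, ... (non-leaf nodes only)
        self.buckets = buckets    # one bucket of records per key (leaves only)


def find(root, key):
    """Return (bucket or None, number of nodes visited) for a search key."""
    node, visited = root, 1
    while node.children is not None:
        # Keys < K_1 follow P_1; keys in [K_(i-1), K_i) follow P_i;
        # keys >= the last key follow the last pointer.
        node = node.children[bisect_right(node.keys, key)]
        visited += 1
    if key in node.keys:
        return node.buckets[node.keys.index(key)], visited
    return None, visited


# Hand-built example tree (n = 3); contents are illustrative only.
leaf1 = Node(["Brighton", "Downtown"], buckets=[["A-217"], ["A-101", "A-110"]])
leaf2 = Node(["Mianus"], buckets=[["A-215"]])
leaf3 = Node(["Perryridge"], buckets=[["A-102", "A-201"]])
leaf4 = Node(["Redwood", "Round Hill"], buckets=[["A-222"], ["A-305"]])
left = Node(["Mianus"], children=[leaf1, leaf2])
right = Node(["Redwood"], children=[leaf3, leaf4])
root = Node(["Perryridge"], children=[left, right])
```

In this tree, `find(root, "Downtown")` goes left at the root and left again, while `find(root, "Round Hill")` goes right twice; both visit three nodes because every root-to-leaf path has the same length.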
12.12 B+-tree with $n = 5$. Leaf nodes have between 2 and 4 values ($\lceil (n-1)/2 \rceil$ and $n-1$, with $n = 5$). Non-leaf nodes other than the root have between 3 and 5 children ($\lceil n/2 \rceil$ and $n$, with $n = 5$). The root has at least 2 children.
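The occupancy bounds for any node size can be checked with a short helper. This is only a sketch of the formulas above; the function name `bplus_bounds` and the dictionary keys are made up for illustration.

```python
def ceil_half(x):
    """Smallest integer >= x / 2, using integer arithmetic only."""
    return -(-x // 2)


def bplus_bounds(n):
    """Return (min, max) occupancy for each kind of node in a B+-tree of size n."""
    return {
        "root_children": (2, n),                      # root that is not a leaf
        "nonleaf_children": (ceil_half(n), n),        # ceil(n/2) .. n
        "leaf_values": (ceil_half(n - 1), n - 1),     # ceil((n-1)/2) .. n-1
    }
```

For example, `bplus_bounds(5)` gives 3 to 5 children for non-leaf nodes and 2 to 4 values for leaves, matching the numbers on this slide.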
12.13 Efficiency of Queries on B+-Trees. Processing a query: traverse from the root to a leaf node. With $K$ search-key values, the path length is at most $\lceil \log_{\lceil n/2 \rceil}(K) \rceil$. A node is generally the same size as a disk block. With 1 million search-key values and $n = 100$, at most $\lceil \log_{50}(1{,}000{,}000) \rceil = 4$ nodes are accessed in a lookup. In a balanced binary tree (from CS 132), about 20 nodes are accessed in a lookup. The difference is significant since every node access may need a disk I/O.
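The arithmetic on this slide can be reproduced in a few lines. The function names are assumptions for illustration; the formulas are the ones stated above.

```python
import math


def max_bplus_path(num_keys, n):
    """Upper bound on nodes visited in a B+-tree lookup: ceil(log_{ceil(n/2)} K)."""
    fanout = math.ceil(n / 2)  # minimum number of children of a non-leaf node
    return math.ceil(math.log(num_keys) / math.log(fanout))


def balanced_binary_height(num_keys):
    """Approximate nodes visited in a balanced binary tree: ceil(log_2 K)."""
    return math.ceil(math.log2(num_keys))
```

With 1,000,000 keys and $n = 100$, the first function returns 4 and the second returns 20, which is the comparison the slide makes.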
12.14 Insertion in B+-Trees. A record for Perryridge? Follow the tree and add the record to the bucket. A record for Othertown? Put the key to the right of Mianus and add the record to the database. A record for Clearview? We need to add a new node.
12.15 Insertion in B+-Trees. Splitting a node: take the $n$ (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place the first $\lceil n/2 \rceil$ in the original node, and the rest in a new node. Let the new node be $p$, and let $k$ be the least key value in $p$. Insert $(k, p)$ in the parent of the node being split. If the parent is full, split it and propagate the split further up. The splitting proceeds upwards until a node that is not full is found. In the worst case the root node is split, increasing the height of the tree by 1. Example: the result of inserting Clearview in the node containing Brighton and Downtown; now there must be an entry for Downtown in the next level up.
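The leaf-split step can be sketched as follows. This covers only the split of one full leaf, not the propagation to the parent; the function name `split_leaf` and the pointer labels `p1`, `p2`, `p3` are hypothetical.

```python
import math
from bisect import insort


def split_leaf(entries, new_entry, n):
    """Split a full leaf of a B+-tree of size n while inserting new_entry.

    entries: the leaf's n-1 (search-key, pointer) pairs in sorted order.
    Returns (original node entries, new node entries, key k for the parent).
    """
    pairs = list(entries)
    insort(pairs, new_entry)          # the n pairs, in sorted order
    keep = math.ceil(n / 2)           # first ceil(n/2) stay in the original node
    original, new_node = pairs[:keep], pairs[keep:]
    k = new_node[0][0]                # least key in the new node goes to the parent
    return original, new_node, k


# The slide's example with n = 3: insert Clearview into (Brighton, Downtown).
original, new_node, k = split_leaf(
    [("Brighton", "p1"), ("Downtown", "p2")], ("Clearview", "p3"), n=3
)
```

Here `original` holds Brighton and Clearview, `new_node` holds Downtown, and `k` is Downtown, which is the entry that has to appear in the next level up.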
12.16 Insertion in B+-Trees. Before and after inserting “Clearview”. Now try: “Dashfield”.
12.17 Deletion in B+-Trees. Find the record to be deleted, and remove it from the main file and from the bucket (if present). Remove the (search-key value, pointer) pair from the leaf node if there is no bucket or if the bucket has become empty. If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then merge them: insert all the search-key values in the two nodes into a single node (the one on the left) and delete the other node; then delete the pair $(K_{i-1}, P_i)$, where $P_i$ is the pointer to the deleted node, from its parent, recursively using the above procedure. Otherwise, if the node has too few pointers due to the removal, but the entries in the node and a sibling do not fit into a single node, then redistribute the pointers between the node and a sibling, and update the corresponding search-key value in the node's parent. Deletions cascade upwards until a node with $\lceil n/2 \rceil$ or more pointers is found.
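The choice between merging and redistributing can be sketched for two sibling leaves. This is a simplified illustration that handles keys only and leaves the parent update to the caller; the function name `fix_underfull_leaf` and its return convention are assumptions.

```python
import math


def fix_underfull_leaf(left, right, n):
    """Decide how to repair two adjacent sibling leaves after a deletion.

    left, right: sorted key lists of the siblings (left keys < right keys).
    Returns (action, new_left, new_right) with action in
    {"ok", "merge", "redistribute"}.
    """
    min_keys = math.ceil((n - 1) / 2)
    max_keys = n - 1
    if len(left) >= min_keys and len(right) >= min_keys:
        return "ok", left, right
    combined = left + right
    if len(combined) <= max_keys:
        # Everything fits in one node: keep the left node, delete the right.
        # The caller must then remove the right node's entry from the parent.
        return "merge", combined, []
    # Too many entries for one node: share them between the two siblings.
    # The caller must then set the parent's separating key to new_right[0].
    half = math.ceil(len(combined) / 2)
    return "redistribute", combined[:half], combined[half:]
```

With $n = 5$, siblings holding 3 and 1 keys merge into one node of 4, while siblings holding 4 and 1 keys are redistributed into nodes of 3 and 2.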
12.18 Examples of B+-Tree Deletion. Before and after deleting “Downtown”. Removing the leaf node containing “Downtown” did not leave its parent with too few pointers, so the cascaded deletions did not go beyond the parent.
12.19 Examples of B+-Tree Deletion (Cont.). Delete “Perryridge”: the node with “Perryridge” becomes underfull (empty) and is merged with its sibling. As a result, the “Perryridge” node's parent becomes underfull and is merged with its sibling (and an entry is deleted from their parent). The root node then has only one child and is deleted.
12.20 Example of B+-tree Deletion (Cont.). Delete “Perryridge” from the earlier example: the parent of the leaf containing Perryridge becomes underfull and borrows a pointer from its left sibling. The search-key value in the parent's parent changes as a result.
12.21 B+-Tree File Organization. Index file degradation is addressed using B+-tree indices. Data file degradation is addressed using B+-tree file organization. Leaf nodes in a B+-tree file organization store records instead of pointers. Records use more space than pointers. Try to keep at least entries in each sibling (data) node
12.22 B-Tree Index File. Similar to a B+-tree, but search-key values appear only once. (Figure: B+-tree on the same data.)
12.23 B-Tree Index Files (Cont.). Advantages: fewer tree nodes; may find a search-key value before reaching a leaf node. Disadvantages: only a small fraction of all search-key values are found early; non-leaf nodes are larger, so $n$ is smaller and the B-tree is deeper; insertion and deletion are more complicated; implementation is harder. Typically, the advantages of B-trees do not outweigh the disadvantages.
12.24 Static Hashing. Bucket: a unit of storage containing one or more records (typically a disk block). Hash file organization: obtain the bucket of a record directly from its search-key value using a hash function. Hash function: h(K) = B, where K is a search-key value and B is a bucket address. The hash function is used to locate records for access, insertion, and deletion. If records with different search-key values are mapped to the same bucket, the bucket is searched sequentially to locate a record.
12.25 Examples of Hash File Organization. Assume 10 buckets, and let a→1, b→2, ... Method 1: h(k) returns the representation of the first letter in k, mod 10. E.g., h(Perryridge) = 6, h(Brighton) = 2. Is this a good hash function? Method 2: h(k) returns the sum of the representations of the characters in k, mod 10. E.g., h(Perryridge) = 5, h(Brighton) = 3 (taking each letter mod 10: B→2, r→8, i→9, g→7, h→8, t→0, o→5, n→4; the sum is 43, and 43 mod 10 = 3). An ideal hash function is uniform (each bucket is assigned the same number of search-key values from the set of all possible values) and random (irrespective of the actual distribution of search-key values).
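Both methods are easy to try out. The sketch below assumes letters are mapped case-insensitively and that non-letters (such as the space in "Round Hill") are ignored in method 2; the function names are made up for illustration.

```python
def letter_value(ch):
    """Map a -> 1, b -> 2, ..., z -> 26 (case-insensitive)."""
    return ord(ch.lower()) - ord("a") + 1


def h_first_letter(key, buckets=10):
    """Method 1: value of the first letter, mod the number of buckets."""
    return letter_value(key[0]) % buckets


def h_letter_sum(key, buckets=10):
    """Method 2: sum of all letter values, mod the number of buckets."""
    return sum(letter_value(c) for c in key if c.isalpha()) % buckets
```

Method 1 sends every key with the same first letter to the same bucket, so a file with many branch names starting with the same letter would overflow that bucket. Method 2 depends on every character and tends to spread such keys out.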
12.26 Example of Hash File Organization. Hash file for account, using branch-name as key and method 2.
12.27 Handling Bucket Overflows. Overflow chaining: the overflow buckets of a given bucket are chained together in a linked list. This scheme is called closed hashing. An alternative, open hashing (the data indexed by the hash goes in the next available slot), is not suitable for databases.
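Overflow chaining can be sketched with a small in-memory model. The classes `Bucket` and `StaticHashFile`, the bucket capacity, and the hash function `char_sum` are assumptions for illustration; a real hash file would store buckets as disk blocks.

```python
class Bucket:
    """A fixed-capacity bucket with a link to its overflow bucket, if any."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []      # (search-key, record) pairs
        self.overflow = None   # next bucket in the overflow chain


def char_sum(key):
    """A deterministic toy hash: sum of character codes."""
    return sum(ord(c) for c in key)


class StaticHashFile:
    def __init__(self, num_buckets, capacity, hash_fn=char_sum):
        self.capacity = capacity
        self.hash_fn = hash_fn
        self.buckets = [Bucket(capacity) for _ in range(num_buckets)]

    def _home(self, key):
        return self.buckets[self.hash_fn(key) % len(self.buckets)]

    def insert(self, key, record):
        bucket = self._home(key)
        # Walk the chain to the first bucket with free space, adding a new
        # overflow bucket at the end of the chain when every bucket is full.
        while len(bucket.records) >= bucket.capacity:
            if bucket.overflow is None:
                bucket.overflow = Bucket(self.capacity)
            bucket = bucket.overflow
        bucket.records.append((key, record))

    def lookup(self, key):
        # Search the home bucket and its whole overflow chain sequentially.
        found, bucket = [], self._home(key)
        while bucket is not None:
            found.extend(rec for k, rec in bucket.records if k == key)
            bucket = bucket.overflow
        return found

    def chain_length(self, key):
        length, bucket = 0, self._home(key)
        while bucket is not None:
            length += 1
            bucket = bucket.overflow
        return length
```

With a capacity of 2 records per bucket, inserting three records with the same search key fills the home bucket and creates one overflow bucket, so a lookup for that key reads two buckets.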
12.28 Hash Indices. Hashing can be used both for file organization and to create an index. This is a secondary index (not on the primary key).
12.29 Deficiencies of Static Hashing. The hash function h maps search keys to a fixed set of bucket addresses, but databases grow with time. If the initial number of buckets is too small, performance will degrade due to overflows. If the file size at some point in the future is anticipated and the number of buckets is allocated accordingly, a significant amount of space will be wasted initially. If the database shrinks, space will again be wasted. Expensive option: periodic reorganization of the file with a new hash function. There are also techniques that allow a dynamic number of buckets, which are good for databases that grow and shrink in size; we will skip these. Ordered Indexing versus Hashing. Hashing is usually better at retrieving records with a specified key value. Ordered indices are preferred if range queries are common.
12.30 Index Definition in SQL. To create an index: create index <index-name> on <relation-name> (<attribute-list>). E.g., create index b-index on branch(branch-name); create index b-index using btree on branch(branch-name); create index b-index using hash on branch(branch-name). To drop an index: drop index <index-name>.
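The create and drop statements can be tried with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the table layout is made up, and the names use underscores (`b_index`, `branch_name`) because hyphens are not valid in unquoted SQL identifiers. SQLite is used here only for convenience; it does not accept a `using` clause, so the btree/hash variants above are not shown.

```python
import sqlite3

# In-memory database with an illustrative branch relation.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE branch (branch_name TEXT, branch_city TEXT, assets INTEGER)"
)


def index_names(connection):
    """List the names of all indices recorded in SQLite's catalog."""
    rows = connection.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
    return [name for (name,) in rows]


# Create an index on branch(branch_name), then drop it again.
conn.execute("CREATE INDEX b_index ON branch (branch_name)")
after_create = index_names(conn)

conn.execute("DROP INDEX b_index")
after_drop = index_names(conn)
```

After the `CREATE INDEX` statement the catalog lists `b_index`; after `DROP INDEX` it lists no indices.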