Published by Reginald Russell. Modified over 9 years ago.
1
Temple University – CIS Dept. CIS331 – Principles of Database Systems V. Megalooikonomou Indexing and Hashing I (based on notes by Silberschatz, Korth, and Sudarshan and notes by C. Faloutsos at CMU)
2
General Overview - rel. model Relational model - SQL Formal & commercial query languages Functional Dependencies Normalization Physical Design Indexing
3
Indexing- overview primary / secondary indices index-sequential (ISAM) B - trees, B+ - trees hashing static hashing dynamic hashing
4
Basic Concepts Indexing mechanisms speed up access to desired data E.g., author catalog in library Search Key - attribute or set of attributes used to look up records in a file An index file consists of records (called index entries) of the form (search-key, pointer) Index files are typically much smaller than the original file Two basic kinds of indices: Ordered indices: search keys are stored in sorted order Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”
5
Indexing once the records are stored in a file, how do you search efficiently? (e.g., ssn=123?)
6
Indexing
7
once the records are stored in a file, how do you search efficiently? brute force: retrieve all records, report the qualifying ones better: use indices (pointers) to locate the records directly
8
Indexing – main idea:
9
Measuring ‘goodness’ retrieval time? insertion / deletion? space overhead? reorganization? range queries?
10
Main concepts search keys are sorted in the index file and point to the actual records primary vs. secondary indices Clustering (sparse) vs non-clustering (dense) indices
11
Indexing Primary key index: on primary key (no duplicates)
12
Indexing secondary key index: duplicates may exist Address-index
13
Indexing secondary key index: typically, with ‘postings lists’ Postings lists
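A minimal sketch of a secondary-key index with postings lists, assuming records are held in a Python list and a "pointer" is simply a position in that list. The sample records and the name `build_postings` are hypothetical and only serve as illustration.

```python
from collections import defaultdict

# Hypothetical file of records: (ssn, name, address)
records = [
    (123, "smith", "main str"),
    (234, "jones", "forbes ave"),
    (345, "tomson", "main str"),
    (456, "stevens", "forbes ave"),
    (567, "smith", "forbes ave"),
]

def build_postings(records, field):
    """Map each value of `field` to the list of record positions holding it."""
    postings = defaultdict(list)
    for pos, rec in enumerate(records):
        postings[rec[field]].append(pos)
    return dict(postings)

# Secondary index on the address field: duplicates share one postings list
address_index = build_postings(records, field=2)
matches = [records[pos] for pos in address_index["main str"]]
```

Each distinct address appears once in the index, and its postings list collects the pointers to every record with that address.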
14
Main concepts – cont’d Clustering (= sparse) index: records are physically sorted on that key (and not all key values are needed in the index) Non-clustering (=dense) index: the opposite E.g.:
15
Indexing- Sparse index Clustering/sparse index on ssn >=123 >=456
16
Sparse Index Files Sparse Index: contains index records for only some search-key values Applicable when records are sequentially ordered on search-key To locate a record with search-key value K we: Find index record with largest search-key value <= K Search file sequentially starting at the record to which the index record points Less space and less maintenance overhead for insertions and deletions Generally slower than dense index for locating records Good tradeoff: sparse index with an index entry for every block in file, corresponding to least search-key value in the block
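The lookup procedure above can be sketched as follows, assuming the "good tradeoff" layout (one index entry per block, holding the least key of that block). The block size, the sample keys and the name `sparse_lookup` are assumptions made for the example.

```python
from bisect import bisect_right

BLOCK_SIZE = 3  # records per block (assumed)

# File sorted on the search key, cut into blocks
keys = [101, 123, 150, 222, 301, 456, 470, 512, 600]
blocks = [keys[i:i + BLOCK_SIZE] for i in range(0, len(keys), BLOCK_SIZE)]

# Sparse index: the least search-key value of every block
sparse_index = [block[0] for block in blocks]

def sparse_lookup(k):
    """Return (block_no, offset) of key k, or None if k is not in the file."""
    i = bisect_right(sparse_index, k) - 1  # largest index entry <= k
    if i < 0:
        return None                        # k is smaller than every key
    for offset, key in enumerate(blocks[i]):  # sequential scan inside the block
        if key == k:
            return (i, offset)
    return None
```

With one entry per block the sequential scan never leaves the block that the index entry points to, which is why this layout costs one block read after the index search.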
17
Indexing – Dense Index Non-clustering / dense index
18
Summary (columns: Dense, Sparse) Primary: usual Secondary: usual, rare All combinations are possible… at most one sparse/clustering index as many as desired dense indices usually: one primary-key index (maybe clustering) and a few secondary-key indices (non-clustering)
19
Indexing- overview primary / secondary indices index-sequential (ISAM) B - trees, B+ - trees hashing static hashing dynamic hashing
20
ISAM What if index is too large to search sequentially? use a multilevel index…
21
ISAM >=123 >=456 block
22
ISAM - observations if index is too large, store it on disk and keep index-on-the-index usually two levels of indices, one first-level entry per disk block (why?)
23
ISAM - Multilevel Index
24
ISAM - observations What about insertions/deletions? >=123 >=456 124; peterson; fifth ave.
25
ISAM - observations What about insertions/deletions? 124; peterson; fifth ave. overflows Problems?
26
ISAM - observations What about insertions/deletions? 124; peterson; fifth ave. overflows overflow chains may become very long - what to do?
27
ISAM - observations What about insertions/deletions? 124; peterson; fifth ave. overflows overflow chains may become very long - thus: shut-down & reorganize start with ~80% utilization
28
ISAM - observations if index is too large, store it on disk and keep index on the index (in memory) usually two levels of indices, one first-level entry per disk block typically, blocks: 80% full initially (why? what are potential problems / inefficiencies?)
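A minimal sketch of a two-level lookup, assuming sorted data blocks, a first-level index with one entry per data block, and a small in-memory index on top of it. Fan-outs, key values and all function names are hypothetical; overflow blocks are not modelled.

```python
from bisect import bisect_right

def build_level(entries, fanout):
    """Cut sorted (key, ptr) entries into blocks; return the blocks and one
    (least key, block number) entry per block for the level above."""
    blocks = [entries[i:i + fanout] for i in range(0, len(entries), fanout)]
    upper = [(block[0][0], n) for n, block in enumerate(blocks)]
    return blocks, upper

def find_slot(entries, k):
    """Pointer of the last entry whose key is <= k (None if k precedes all)."""
    i = bisect_right([key for key, _ in entries], k) - 1
    return entries[i][1] if i >= 0 else None

data_keys = list(range(100, 1000, 50))  # 18 sorted keys; record == key here
data_blocks, level1_entries = build_level([(k, k) for k in data_keys], fanout=3)
level1_blocks, top_index = build_level(level1_entries, fanout=2)

def isam_lookup(k):
    """Top index (in memory) -> first-level index block -> data block."""
    b1 = find_slot(top_index, k)
    if b1 is None:
        return None
    b0 = find_slot(level1_blocks[b1], k)  # never None: k >= least key of block b1
    for key, rec in data_blocks[b0]:
        if key == k:
            return rec
    return None
```

A lookup touches one first-level index block and one data block, regardless of how many data blocks the file has.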
29
So far … indices (like ISAM) suffer in the presence of frequent updates sequential scan using primary index is efficient, but a sequential scan using a secondary index is expensive each record access may fetch a new block from disk alternative indexing structure: B - trees
30
Overview primary / secondary indices multilevel (ISAM) B - trees, B+ - trees hashing static hashing dynamic hashing
31
B-trees the most successful family of index schemes (B-trees, B+-trees, B*-trees) can be used for primary/secondary, clustering/non-clustering index they are balanced “n-way” search trees
32
B-trees Disadvantage of indexed-sequential files: performance degrades as file grows, since many overflow blocks get created. Periodic reorganization of entire file is required Advantage of B+-tree index files: automatic self-reorganization with small, local changes, in the face of insertions and deletions. Reorganization of entire file is not required Disadvantage of B+-trees: extra insertion and deletion overhead, space overhead Advantages of B+-trees outweigh disadvantages, and they are used extensively
33
B-trees E.g., B-tree of order 3 (i.e., at most 3 pointers from each node): root [6 | 9]; leaves [1 3] (<6), [7] (>6, <9), [13] (>9)
34
B-tree properties: each node, in a B-tree of order n: keys are kept in order at most n pointers at least ⌈n/2⌉ pointers (except root) all leaves at the same level if number of pointers is k, then node has exactly k-1 keys (keys v1, v2, …, vn-1; pointers p1, …, pn)
35
Properties “block aware” nodes: each node -> disk page O(log (N)) for everything! (ins/del/search) typically, if the order n = 50 - 100, then 2 - 3 levels utilization >= 50%, guaranteed; on average 69%
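A rough back-of-the-envelope check of the "2 - 3 levels" claim, assuming completely full nodes: a full B-tree of order n with h levels has (n^h - 1)/(n - 1) nodes of n - 1 keys each, so it holds n^h - 1 keys. The function name `min_levels` is hypothetical, and real trees are only partly full, so they may need one more level.

```python
def min_levels(num_keys, order):
    """Smallest number of levels of a B-tree of the given order that can
    hold num_keys keys when every node is completely full."""
    levels = 1
    capacity = order - 1            # a single full node
    while capacity < num_keys:
        levels += 1
        capacity = order ** levels - 1
    return levels

# With order 100, three levels are enough for almost a million keys
levels_for_a_million = min_levels(999_999, 100)
```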
36
Queries Algorithm for exact match query? (e.g., ssn=8?) Tree T0: root [6 | 9]; leaves [1 3] (<6), [7] (>6, <9), [13] (>9)
40
Queries Algorithm for exact match query? (e.g., ssn=8?) Tree T0: root [6 | 9]; leaves [1 3] (<6), [7] (>6, <9), [13] (>9) H steps (= disk accesses), where H is the height of the tree
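A minimal sketch of the exact-match search on tree T0, assuming an in-memory `Node` class (hypothetical) in which a node with no children is a leaf. The counter of visited nodes stands in for the number of disk accesses.

```python
from bisect import bisect_left

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys                 # sorted keys of this node
        self.children = children or []   # empty list means leaf

def search(node, k):
    """Return (found, nodes_visited) for an exact-match query on key k."""
    visited = 0
    while node is not None:
        visited += 1
        i = bisect_left(node.keys, k)    # first key >= k
        if i < len(node.keys) and node.keys[i] == k:
            return True, visited
        # descend into the i-th subtree, or stop at a leaf
        node = node.children[i] if node.children else None
    return False, visited

# Tree T0 from the slides (order 3)
T0 = Node([6, 9], [Node([1, 3]), Node([7]), Node([13])])
```

For ssn=8 the search follows the middle pointer of the root, inspects leaf [7], and reports a miss after two node visits, which is the height of T0.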
42
Queries what about range queries? (e.g., 5<salary<8) Proximity/ nearest neighbor searches? (e.g., salary ~ 8 )
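One way to answer the range query on a B-tree is an in-order traversal that skips subtrees lying entirely outside the range. This is a sketch under the same assumptions as the slides' tree T0; the `Node` class and `range_query` are hypothetical names.

```python
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []   # empty list means leaf

def range_query(node, lo, hi):
    """Yield, in sorted order, every key k with lo < k < hi."""
    for i, key in enumerate(node.keys):
        # subtree i holds keys smaller than `key`; skip it if all are <= lo
        if node.children and lo < key:
            yield from range_query(node.children[i], lo, hi)
        if lo < key < hi:
            yield key
        if key >= hi:
            return                       # everything further right is too large
    if node.children:
        yield from range_query(node.children[-1], lo, hi)

T0 = Node([6, 9], [Node([1, 3]), Node([7]), Node([13])])
result = list(range_query(T0, 5, 8))     # the slides' query 5 < salary < 8
```

Note that the traversal has to move up and down between levels to produce keys in order, which is the back-tracking that B+-trees later avoid.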
45
B-trees: Insertion Insert in leaf; on overflow, push middle up (recursively) split: preserves B - tree properties
46
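B-trees Easy case: Tree T0; insert ‘8’ T0: root [6 | 9]; leaves [1 3] (<6), [7] (>6, <9), [13] (>9)
47
B-trees Tree T0; insert ‘8’ Result: root [6 | 9]; leaves [1 3], [7 8], [13]
48
B-trees Hardest case: Tree T0; insert ‘2’ T0: root [6 | 9]; leaves [1 3] (<6), [7] (>6, <9), [13] (>9); 2 belongs in leaf [1 3], which is full
49
B-trees Hardest case: Tree T0; insert ‘2’ leaf [1 2 3] overflows; push middle (2) up; the leaf splits into [1] and [3]
50
B-trees Hardest case: Tree T0; insert ‘2’ root [2 | 6 | 9] overflows; push middle (6) up
51
B-trees Hardest case: Tree T0; insert ‘2’ Final state: root [6]; second level [2] and [9]; leaves [1], [3], [7], [13]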
52
B-trees - insertion Q: What if there are two middles? (e.g., order 4) A: either one is fine
53
B-trees: Insertion Insert in leaf; on overflow, push middle up (recursively – ‘propagate split’) split: preserves all B - tree properties (!!) notice how it grows: height increases when root overflows & splits Automatic, incremental re-organization (contrast with ISAM!)
54
Pseudo-code
INSERTION OF KEY ’K’
  find the correct leaf node ’L’;
  if ( ’L’ overflows ) {
    split ’L’, by pushing the middle key upstairs to parent node ’P’;
    if ( ’P’ overflows ) {
      repeat the split recursively;
    }
  } else {
    add the key ’K’ in node ’L’; /* maintaining the key order in ’L’ */
  }
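The pseudo-code above can be sketched in Python as follows. This is one possible reading, not the course's reference implementation: it assumes order 3, distinct keys, an in-memory `Node` class, and a helper `dump` that exists only to inspect the result. All names are hypothetical.

```python
from bisect import bisect_left

ORDER = 3  # at most ORDER pointers, hence at most ORDER - 1 keys per node

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []   # empty list means leaf

def _insert(node, k):
    """Insert k below node; return (middle_key, new_right_node) if the
    node had to split, otherwise None."""
    i = bisect_left(node.keys, k)
    if node.children:
        split = _insert(node.children[i], k)
        if split is None:
            return None
        mid, right = split               # child split: absorb the pushed-up key
        node.keys.insert(i, mid)
        node.children.insert(i + 1, right)
    else:
        node.keys.insert(i, k)           # leaf: keep the keys in order
    if len(node.keys) <= ORDER - 1:
        return None                      # no overflow
    m = len(node.keys) // 2              # overflow: push the middle key up
    mid = node.keys[m]
    right = Node(node.keys[m + 1:], node.children[m + 1:])
    node.keys, node.children = node.keys[:m], node.children[:m + 1]
    return mid, right

def insert(root, k):
    """Insert k and return the (possibly new) root."""
    split = _insert(root, k)
    if split is None:
        return root
    mid, right = split                   # root split: the tree grows one level
    return Node([mid], [root, right])

def dump(node):
    """Nested (keys, children) view of a tree, for inspection."""
    return (node.keys, [dump(c) for c in node.children])

def make_t0():
    return Node([6, 9], [Node([1, 3]), Node([7]), Node([13])])

easy = dump(insert(make_t0(), 8))
hard = dump(insert(make_t0(), 2))
```

Inserting 8 only touches one leaf, while inserting 2 propagates the split up to the root and increases the height, matching the easy and hardest cases on the slides.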
55
Overview primary / secondary indices multilevel (ISAM) B – trees Dfn, Search, insertion, deletion B+ - trees hashing
56
Deletion Rough outline of algorithm: Delete key; on underflow, may need to merge In practice, some implementors just allow underflows to happen…
57
B-trees – Deletion Easiest case: Tree T0; delete ‘3’ T0: root [6 | 9]; leaves [1 3] (<6), [7] (>6, <9), [13] (>9)
58
B-trees – Deletion Easiest case: Tree T0; delete ‘3’ Result: root [6 | 9]; leaves [1], [7], [13]
59
B-trees – Deletion Case1: delete a key at a leaf – no underflow Case2: delete non-leaf key – no underflow Case3: delete leaf-key; underflow, and ‘rich sibling’ Case4: delete leaf-key; underflow, and ‘poor sibling’
60
B-trees – Deletion Case1: delete a key at a leaf – no underflow (delete 3 from T0) T0: root [6 | 9]; leaves [1 3] (<6), [7] (>6, <9), [13] (>9)
61
B-trees – Deletion Case2: delete a key at a non-leaf – no underflow (e.g., delete 6 from T0) T0: root [6 | 9]; leaves [1 3] (<6), [7] (>6, <9), [13] (>9) Delete & promote, i.e.:
62
B-trees – Deletion Case2: delete a key at a non-leaf – no underflow (e.g., delete 6 from T0) 6 is removed from the root: root [ _ | 9]; leaves [1 3], [7], [13] Delete & promote, i.e.:
63
B-trees – Deletion Case2: delete a key at a non-leaf – no underflow (e.g., delete 6 from T0) Delete & promote, i.e.: key 3 moves up from leaf [1 3] into the root
64
B-trees – Deletion Case2: delete a key at a non-leaf – no underflow (e.g., delete 6 from T0) FINAL TREE: root [3 | 9]; leaves [1] (<3), [7] (>3, <9), [13] (>9)
65
B-trees – Deletion Case2: delete a key at a non-leaf – no underflow (eg., delete 6 from T0) Q: How to promote? A: pick the largest key from the left sub-tree (or the smallest from the right sub-tree) Observation: Every deletion eventually becomes a deletion of a leaf key
67
B-trees – Deletion Case1: delete a key at a leaf – no underflow Case2: delete non-leaf key – no underflow Case3: delete leaf-key; underflow, and ‘rich sibling’ Case4: delete leaf-key; underflow, and ‘poor sibling’
68
B-trees – Deletion Case3: underflow & ‘rich sibling’ (e.g., delete 7 from T0) T0: root [6 | 9]; leaves [1 3] (<6), [7] (>6, <9), [13] (>9) Delete & borrow, i.e.:
69
B-trees – Deletion Case3: underflow & ‘rich sibling’ (e.g., delete 7 from T0) After deleting 7: root [6 | 9]; leaves [1 3], [ ] (underflow), [13] Delete & borrow, i.e.: leaf [1 3] is the rich sibling
70
B-trees – Deletion Case3: underflow & ‘rich sibling’ ‘rich’ = can give a key, without underflowing ‘borrowing’ a key: always THROUGH the PARENT!
71
B-trees – Deletion Case3: underflow & ‘rich sibling’ (e.g., delete 7 from T0) Delete & borrow, i.e.: moving 3 directly from the rich sibling [1 3] into the empty leaf? NO!!
72
B-trees – Deletion Case3: underflow & ‘rich sibling’ (e.g., delete 7 from T0) root [6 | 9]; leaves [1 3], [ ], [13] Delete & borrow, i.e.:
73
B-trees – Deletion Case3: underflow & ‘rich sibling’ (e.g., delete 7 from T0) Delete & borrow, i.e.: key 6 moves down from the parent into the empty leaf: root [ _ | 9]; leaves [1 3], [6], [13]
74
B-trees – Deletion Case3: underflow & ‘rich sibling’ (e.g., delete 7 from T0) Delete & borrow, through the parent: key 3 moves up from the rich sibling FINAL TREE: root [3 | 9]; leaves [1] (<3), [6] (>3, <9), [13] (>9)
75
B-trees – Deletion Case1: delete a key at a leaf – no underflow Case2: delete non-leaf key – no underflow Case3: delete leaf-key; underflow, and ‘rich sibling’ Case4: delete leaf-key; underflow, and ‘poor sibling’
76
B-trees – Deletion Case4: underflow & ‘poor sibling’ (e.g., delete 13 from T0) T0: root [6 | 9]; leaves [1 3] (<6), [7] (>6, <9), [13] (>9)
77
B-trees – Deletion Case4: underflow & ‘poor sibling’ (e.g., delete 13 from T0) After deleting 13: root [6 | 9]; leaves [1 3], [7], [ ] (underflow)
78
B-trees – Deletion Case4: underflow & ‘poor sibling’ (e.g., delete 13 from T0) root [6 | 9]; leaves [1 3], [7], [ ] A: merge w/ ‘poor’ sibling [7]
79
B-trees – Deletion Case4: underflow & ‘poor sibling’ (e.g., delete 13 from T0) Merge, by pulling a key from the parent exact reversal from insertion: ‘split and push up’, vs. ‘merge and pull down’ i.e.:
80
B-trees – Deletion Case4: underflow & ‘poor sibling’ (e.g., delete 13 from T0) A: merge w/ ‘poor’ sibling: key 9 is pulled down from the parent and merged with [7]
81
B-trees – Deletion Case4: underflow & ‘poor sibling’ (e.g., delete 13 from T0) FINAL TREE: root [6]; leaves [1 3] (<6), [7 9] (>6)
82
B-trees – Deletion Case4: underflow & ‘poor sibling’ -> ‘pull key from parent, and merge’ Q: What if the parent underflows? A: repeat recursively
83
B-tree deletion - pseudocode
DELETION OF KEY ’K’
  locate key ’K’, in node ’N’
  if ( ’N’ is a non-leaf node ) {
    delete ’K’ from ’N’;
    find the immediately larger key ’K1’; /* which is guaranteed to be on a leaf node ’L’ */
    copy ’K1’ in the old position of ’K’;
    invoke this DELETION routine on ’K1’ from the leaf node ’L’;
  } else { /* ’N’ is a leaf node */
    ... (next slide..)
  }
84
B-tree deletion - pseudocode
  /* ’N’ is a leaf node */
  if ( ’N’ underflows ) {
    let ’N1’ be the sibling of ’N’;
    if ( ’N1’ is "rich" ) { /* i.e., N1 can lend us a key */
      borrow a key from ’N1’ THROUGH the parent node;
    } else { /* N1 is 1 key away from underflowing */
      MERGE: pull the key from the parent ’P’, and merge it with the keys of ’N’ and ’N1’ into a new node;
      if ( ’P’ underflows ) {
        repeat recursively
      }
    }
  }
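A partial sketch of the leaf-level step (borrow or merge), assuming order 3 so that a node needs at least one key. It deliberately omits the recursive repair of an underflowing parent and the non-leaf "promote" case. `Node`, `fix_underflow` and `delete_leaf_key` are hypothetical names, and the choice of the left sibling whenever one exists is an assumption.

```python
MIN_KEYS = 1  # order 3: at least 2 pointers, i.e. at least 1 key per node

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []   # empty list means leaf

def fix_underflow(parent, i):
    """Repair leaf parent.children[i], which has just underflowed."""
    child = parent.children[i]
    j = i - 1 if i > 0 else i + 1        # sibling: the left one if it exists
    sib = parent.children[j]
    sep = min(i, j)                      # parent key separating child and sibling
    if len(sib.keys) > MIN_KEYS:         # rich sibling: borrow THROUGH the parent
        if j < i:
            child.keys.insert(0, parent.keys[sep])
            parent.keys[sep] = sib.keys.pop()
        else:
            child.keys.append(parent.keys[sep])
            parent.keys[sep] = sib.keys.pop(0)
    else:                                # poor sibling: pull a key down and merge
        left, right = (sib, child) if j < i else (child, sib)
        left.keys += [parent.keys.pop(sep)] + right.keys
        parent.children.pop(sep + 1)     # the right node is absorbed by the left

def delete_leaf_key(parent, i, k):
    """Delete k from leaf parent.children[i] and repair an underflow."""
    leaf = parent.children[i]
    leaf.keys.remove(k)
    if len(leaf.keys) < MIN_KEYS:
        fix_underflow(parent, i)

def make_t0():
    return Node([6, 9], [Node([1, 3]), Node([7]), Node([13])])

case3 = make_t0(); delete_leaf_key(case3, 1, 7)    # rich sibling [1 3]
case4 = make_t0(); delete_leaf_key(case4, 2, 13)   # poor sibling [7]
```

Deleting 7 rotates 3 up and 6 down, and deleting 13 pulls 9 down into a merged leaf, the same final trees as in Case3 and Case4.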
85
B-trees in practice In practice: no empty leaves; pointers to records Theory: root [6 | 9]; leaves [1 3] (<6), [7] (>6, <9), [13] (>9)
86
B-trees in practice In practice: no empty leaves; pointers to records Practice: the same tree, root [6 | 9]; leaves [1 3], [7], [13]
87
B-trees in practice In practice: every key in the tree (1, 3, 6, 7, 9, 13) carries a pointer to its record in the data file (columns: Ssn, …)
88
B-trees in practice In practice, the formats are: - leaf nodes: (v1, rp1, v2, rp2, … vn, rpn) - Non-leaf nodes: (p1, v1, rp1, p2, v2, rp2, …)
89
Overview primary / secondary indices multilevel (ISAM) B – trees B+ - trees hashing
90
B+ trees - Motivation B-tree – print keys in sorted order: root [6 | 9]; leaves [1 3] (<6), [7] (>6, <9), [13] (>9)
91
B+ trees - Motivation B-tree needs back-tracking – how to avoid it? root [6 | 9]; leaves [1 3] (<6), [7] (>6, <9), [13] (>9)
92
Solution: B+-trees Facilitate sequential ops They string all leaf nodes together AND Replicate keys from non-leaf nodes, to make sure every key appears at the leaf level !!
93
B+ trees root [6 | 9]; leaves, linked left to right: [1 3] (<6) -> [6 7] (>=6, <9) -> [9 13] (>=9)
94
B+-Trees (Cont.) A B+-tree is a rooted tree satisfying the following properties: All paths from root to leaf are of the same length Each node that is not a root or a leaf has between ⌈n/2⌉ and n children A leaf node has between ⌈(n–1)/2⌉ and n–1 values Special cases: If the root is not a leaf, it has at least 2 children If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) values
95
B+-Tree Node Structure Typical node: P1, K1, P2, …, Pn–1, Kn–1, Pn Ki are the search-key values Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes). The search-keys in a node are ordered K1 < K2 < K3 < … < Kn–1
96
Leaf Nodes in B+-Trees - Properties For i = 1, 2, …, n–1, pointer Pi either points to a file record with search-key value Ki, or to a bucket of pointers to file records, each record having search-key value Ki. Only need bucket structure if search-key does not form a primary key. If Li, Lj are leaf nodes and i < j, Li’s search-key values are less than Lj’s search-key values Pn points to next leaf node in search-key order
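A minimal sketch of why the leaf chain helps range queries, assuming a `Leaf` class (hypothetical) whose `next` field plays the role of Pn. For brevity the scan starts at the leftmost leaf; in a full tree the starting leaf would be found by descending from the root.

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys      # sorted search-key values
        self.next = None      # plays the role of P_n: next leaf in key order

def link(leaves):
    """Chain the leaves left to right and return the first one."""
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves[0]

def scan(first_leaf, lo, hi):
    """Collect keys k with lo <= k <= hi by walking the leaf chain."""
    out = []
    leaf = first_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out    # keys are sorted: nothing further can match
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out

# Leaves of the B+-tree example from the slides
head = link([Leaf([1, 3]), Leaf([6, 7]), Leaf([9, 13])])
in_range = scan(head, 5, 9)
```

Once the first qualifying leaf is reached, the scan only follows `next` pointers and never goes back up the tree.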
97
Non-Leaf Nodes in B+-Trees - Properties Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m pointers: All the search-keys in the subtree to which P1 points are less than K1 For 2 <= i <= m – 1, all the search-keys in the subtree to which Pi points have values greater than or equal to Ki–1 and less than Ki All the search-keys in the subtree to which Pm points are greater than or equal to Km–1
98
B-Tree vs B+-Tree B-tree (above) and B+-tree (below) on same data
99
B+ tree insertion
INSERTION OF KEY ’K’
  find the correct leaf node ’L’;
  insert search-key value to ’L’ such that the keys are in order;
  if ( ’L’ overflows ) {
    split ’L’;
    insert (i.e., COPY) smallest search-key value of new node to parent node ’P’;
    if ( ’P’ overflows ) {
      repeat the B-tree split procedure recursively;
      /* Notice: the B-TREE split; NOT the B+ -tree */
    }
  }
100
B+-tree insertion – cont’d /* ATTENTION: a split at the LEAF level is handled by COPYING the middle key upstairs; A split at a higher level is handled by PUSHING the middle key upstairs */
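The copy-versus-push distinction can be sketched as below, assuming order 3 (at most 2 keys per node), an in-memory `Node` class, and no leaf chaining to keep the example short. All names are hypothetical, and this is an illustration of the rule rather than a complete B+-tree.

```python
from bisect import bisect_right

ORDER = 3  # at most ORDER pointers; leaves hold at most ORDER - 1 keys

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []   # empty list means leaf

def _insert(node, k):
    """Insert k below node; return (separator, new_right_node) on a split."""
    if not node.children:                          # leaf level
        node.keys.insert(bisect_right(node.keys, k), k)
        if len(node.keys) <= ORDER - 1:
            return None
        m = len(node.keys) // 2
        right = Node(node.keys[m:])
        node.keys = node.keys[:m]
        return right.keys[0], right                # COPY: key also stays in the leaf
    i = bisect_right(node.keys, k)                 # keys >= K_i go to the right
    split = _insert(node.children[i], k)
    if split is None:
        return None
    sep, right = split
    node.keys.insert(i, sep)
    node.children.insert(i + 1, right)
    if len(node.keys) <= ORDER - 1:
        return None
    m = len(node.keys) // 2
    sep = node.keys[m]                             # PUSH: key leaves this node
    right = Node(node.keys[m + 1:], node.children[m + 1:])
    node.keys, node.children = node.keys[:m], node.children[:m + 1]
    return sep, right

def insert(root, k):
    split = _insert(root, k)
    if split is None:
        return root
    sep, right = split
    return Node([sep], [root, right])

def dump(node):
    return (node.keys, [dump(c) for c in node.children])

# B+-tree from the slides: root [6 | 9]; leaves [1 3], [6 7], [9 13]
tree = Node([6, 9], [Node([1, 3]), Node([6, 7]), Node([9, 13])])
after = dump(insert(tree, 8))
```

After inserting 8, key 7 appears both in the new root and in leaf [7 8] (copied at the leaf split), whereas it appears only once among the non-leaf nodes (pushed at the non-leaf split).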
101
B+ trees - insertion root [6 | 9]; leaves [1 3] (<6) -> [6 7] (>=6, <9) -> [9 13] (>=9) E.g., insert ‘8’
102
B+ trees - insertion E.g., insert ‘8’ 8 belongs in leaf [6 7], which is full
103
B+ trees - insertion E.g., insert ‘8’ leaf [6 7 8] overflows: COPY middle upstairs
104
B+ trees - insertion E.g., insert ‘8’ COPY middle upstairs: the leaf splits into [6] and [7 8], and 7 is copied into the parent, giving root [6 | 7 | 9]
105
B+ trees - insertion E.g., insert ‘8’ COPY middle upstairs: root [6 | 7 | 9]; leaves [1 3], [6], [7 8], [9 13] Non-leaf overflow – just PUSH the middle
106
B+ trees - insertion E.g., insert ‘8’ FINAL TREE: root [7]; second level [6] (<7) and [9] (>=7); leaves [1 3] -> [6] -> [7 8] -> [9 13]
107
B-Trees vs B+-Trees Advantages of B-Tree indices: May use fewer tree nodes than a corresponding B+-Tree. Sometimes possible to find search-key value before reaching leaf node. Disadvantages of B-Tree indices: Only small fraction of all search-key values are found early Non-leaf nodes are larger, so fan-out is reduced. Thus B-Trees typically have greater depth than corresponding B+-Tree Insertion and deletion more complicated than in B+-Trees Implementation is harder than B+-Trees. Typically, advantages of B-Trees do not outweigh disadvantages
108
B*-tree In B-trees, worst case util. = 50%, if we have just split all the pages how to increase the utilization of B - trees? … with B* - trees!
109
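B-trees and B*-trees E.g., Tree T0; insert ‘2’ T0: root [6 | 9]; leaves [1 3] (<6), [7] (>6, <9), [13] (>9); 2 belongs in leaf [1 3], which is full
110
B*-trees: deferred split! Instead of splitting, LEND keys to sibling! (through PARENT, of course!) root [6 | 9]; leaves [1 2 3] (overflow), [7], [13]
111
B*-trees: deferred split! Instead of splitting, LEND keys to sibling! (through PARENT, of course!) 3 moves up to the parent and 6 moves down to the sibling FINAL TREE: root [3 | 9]; leaves [1 2] (<3), [6 7] (>3, <9), [13] (>9)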
112
B*-trees: deferred split! Notice: shorter, more packed, faster tree It’s a rare case, where space utilization and speed improve together BUT: What if the sibling has no room for our ‘lending’?
113
B*-trees: deferred split! BUT: What if the sibling has no room for our ‘lending’? A: 2-to-3 split: get the keys from the sibling, pool them with ours (and a key from the parent), and split in 3. Details: too messy (and even worse for deletion)
114
Conclusions all B-tree variants can be used for any type of index: primary/secondary, sparse (clustering), or dense (non-clustering) All have excellent, O(log N) worst-case performance for ins/del/search It’s the prevailing indexing method
115
Overview ordered indices primary / secondary indices index-sequential multilevel (ISAM) B - trees, B+ - trees hashing static hashing dynamic hashing