CS4433 Database Systems Indexing.

Why Do We Learn This?
Goal: find the desired information (by value) in the database very quickly, e.g., with the author catalog in a library.
Topics: indexing, common properties of indexes, B+ trees, hash tables.

What is Indexing?
An index entry is a “labeled” pointer to an item (or a collection of items) that satisfies some common property.
Examples in the real world?

Theoretically, an Index Is …
An index on a file speeds up selections on the search key attribute(s).
Search key = any subset of the attributes of a relation; these are the attributes used to look up records in a file.
A search key is not the same as a key (a minimal set of attributes that uniquely identifies a tuple (record) in a relation).
An index file consists of records (called index entries).
Entries in an index have the form (K, R), where:
  K: the search key
  R: a pointer to the record, OR a record id, OR a list of record ids
Index files are typically much smaller than the original file.

Types of Indexes
Ordered/Hash
  Ordered indices: index entries are stored sorted on the search key value, e.g., the author catalog in a library.
  Hash indices: index entries are distributed uniformly across “buckets” based on the search key, using a “hash function”.
Clustered/Unclustered
  Clustered = data records are sorted in search-key order.
  Unclustered = data records are not sorted in search-key order.
Dense/Sparse
  Dense = an index record appears for every search-key value in the file.
  Sparse = index records appear for only some search-key values.
Primary/Secondary
  Primary = on the primary key.
  Secondary = on any other key.
  Some textbooks interpret these terms differently.
B+ tree / Hash table / …

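Clustered, Dense Index
Clustered: the file is sorted on the index attribute.
Dense: the index is a sequence of (key, pointer) pairs, one per key.
[Figure: dense index with entries 10, 20, ..., 80 pointing into data blocks (10, 20), (30, 40), (50, 60), (70, 80)]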

Clustered, Dense Index index on ID attribute of instructor relation

Dense Index Files (Cont.)
Dense index on dept_name, with instructor file sorted on dept_name

Clustered, Sparse Index
Sparse index: contains index records for only some search-key values, e.g., one key per data block.
Applicable when records are sequentially ordered on the search key.
Saves space, but sacrifices some lookup efficiency.
[Figure: sparse index with entries 10, 30, 50, 70, 90, 110, 130, 150, one per data block; the first data blocks are (10, 20), (30, 40), (50, 60), (70, 80)]

Sparse Index Files
To locate a record with search-key value K we:
  Find the index record with the largest search-key value ≤ K.
  Search the file sequentially, starting at the record to which the index record points.

Sparse Index Files (Cont.)
Compared to dense indices: Less space and less maintenance overhead for insertions and deletions. Generally slower than dense index for locating records. Good tradeoff: sparse index with an index entry for every block in file, corresponding to least search-key value in the block.
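The sparse lookup procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `blocks` list stands in for a data file sorted on the search key, the index holds one entry per block (the least key in that block), and the names `blocks`, `sparse_index`, and `sparse_lookup` are made up for the example.

```python
from bisect import bisect_right

# Hypothetical data file: blocks of records sorted on the search key.
blocks = [[10, 20], [30, 40], [50, 60], [70, 80]]

# Sparse index: one (least key in block, block number) entry per block.
sparse_index = [(block[0], block_no) for block_no, block in enumerate(blocks)]
index_keys = [key for key, _ in sparse_index]


def sparse_lookup(k):
    """Return (block_no, offset) of key k, or None if k is absent."""
    # Find the index entry with the largest search-key value <= k.
    pos = bisect_right(index_keys, k) - 1
    if pos < 0:
        return None  # k is smaller than every key in the file
    block_no = sparse_index[pos][1]
    # Scan sequentially, starting at the block the index entry points to.
    for offset, key in enumerate(blocks[block_no]):
        if key == k:
            return (block_no, offset)
        if key > k:
            break
    return None
```

Because there is one index entry per block, the sequential scan never has to leave the block that the index entry points to.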

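Clustered Index with Duplicate Keys
Dense index: each entry points to the first record with that key.
[Figure: dense index over a sorted data file with duplicate key values; each index entry points to the first record with its key]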

Unclustered Indexes
Often used for indexing attributes other than the primary key.
Always dense (why?)
The locality of values has been broken!
[Figure: index entries sorted on keys 10, 20, 30, pointing to records scattered through an unsorted data file]

Clustered vs. Unclustered Index
[Figure: two panels, CLUSTERED and UNCLUSTERED, each showing index entries (index file) pointing to data records (data file)]

Secondary Indices Example
Secondary index on the salary field of instructor.
An index record points to a bucket that contains pointers to all the actual records with that particular search-key value.
Secondary indices have to be dense.
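A minimal sketch of such a bucket-based secondary index, assuming a small in-memory list in place of the instructor file. The rows, the field position `SALARY`, and the helper names are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical instructor rows (ID, name, dept_name, salary), stored in ID order.
instructors = [
    (101, "A", "Physics", 90000),
    (102, "B", "History", 60000),
    (103, "C", "Physics", 75000),
    (104, "D", "Music", 60000),
]

SALARY = 3  # position of the salary field in each row


def build_secondary_index(rows, field):
    """Map each search-key value to a bucket of record positions."""
    buckets = defaultdict(list)
    for position, row in enumerate(rows):
        buckets[row[field]].append(position)
    # Keep the index entries sorted on the search key (an ordered index).
    return dict(sorted(buckets.items()))


salary_index = build_secondary_index(instructors, SALARY)


def names_with_salary(salary):
    """Follow the bucket pointers to the actual records."""
    return [instructors[pos][1] for pos in salary_index.get(salary, [])]
```

The index is dense: every salary value that occurs in the file has an entry, because the file itself is not ordered on salary.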

Primary and Secondary Indices
Indices offer substantial benefits when searching for records.
BUT: updating indices imposes overhead on database modification: when a file is modified, every index on the file must be updated.
A sequential scan using the primary index is efficient, but a sequential scan using a secondary index is expensive:
  Each record access may fetch a new block from disk.
  A block fetch requires about 5 to 10 milliseconds, versus about 100 nanoseconds for a memory access.

Multilevel Index
If the primary index does not fit in memory, access becomes expensive.
Solution: treat the primary index kept on disk as a sequential file and construct a sparse index on it.
  outer index: a sparse index of the primary index
  inner index: the primary index file
If even the outer index is too large to fit in main memory, yet another level of index can be created, and so on.
Indices at all levels must be updated on insertion or deletion from the file.

Multilevel Index (Cont.)

Index Update: Deletion
If deleted record was the only record in the file with its particular search-key value, the search-key is deleted from the index also. Single-level index entry deletion: Dense indices – deletion of search-key is similar to file record deletion. Sparse indices – if an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key order). If the next search-key value already has an index entry, the entry is deleted instead of being replaced.

Index Update: Insertion
Single-level index insertion: Perform a lookup using the search-key value appearing in the record to be inserted. Dense indices – if the search-key value does not appear in the index, insert it. Sparse indices – if index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created. If a new block is created, the first search-key value appearing in the new block is inserted into the index. Multilevel insertion and deletion: algorithms are simple extensions of the single-level algorithms

Secondary Indices
Frequently, one wants to find all the records whose values in a certain field (which is not the search key of the primary index) satisfy some condition.
  Example 1: in the instructor relation stored sequentially by ID, we may want to find all instructors in a particular department.
  Example 2: as above, but where we want to find all instructors with a specified salary or with salary in a specified range of values.
We can have a secondary index with an index record for each search-key value.

B+ Trees
What's wrong with a sequential index?
  Pros: easy and fast to access.
  Cons: hard to maintain the sequential property upon updates.
    Periodic reorganization of the entire file is required.
    Performance degrades as the file grows, since many overflow blocks get created.
B+ tree intuition:
  Give up sequentiality of the index and try to get “balance” through dynamic reorganization.
  The tree automatically reorganizes itself with small, local changes in the face of insertions and deletions.
  Reorganization of the entire file is not required to maintain performance.
(Minor) disadvantage of B+-trees: extra insertion and deletion overhead, and space overhead.

Example of B+-Tree

B+-Tree Index Files (Cont.)
A B+-tree is a rooted tree satisfying the following properties:
  All paths from root to leaf are of the same length.
  Parameter d = the degree (order).
  Each node has between d and 2d keys (except the root).
  Each interior node that is not the root has pointers to between d+1 and 2d+1 children.
  A leaf node has between d and 2d pointers to records, plus one pointer to the next leaf.
Special cases:
  If the root is not a leaf, it has at least 2 children.
  If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and 2d values.

B+-Tree Node Structure
Typical node:
  Ki are the search-key values.
  Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
The search keys in a node are ordered: K1 < K2 < K3 < … < Kn–1
(Initially assume no duplicate keys; duplicates are addressed later.)
[Figure: node with keys K1, K2, K3 and pointers p1, p2, p3, p4; p1 covers keys in [X, K1), p2 covers [K1, K2), p3 covers [K2, K3), p4 covers [K3, Y)]

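B+ Trees Basics
Internal node: keys 30, 120, 240; the child pointers after 30, 120, and 240 cover the key ranges [30, 120), [120, 240), and [240, Y).
Leaf: keys 40, 50, 60, each with a pointer to its record, plus a pointer to the next leaf.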

Leaf Nodes in B+-Trees
Properties of a leaf node:
  For i = 1, 2, …, 2d, pointer Pi points to a file record with search-key value Ki.
  P2d+1 points to the next leaf node in search-key order.

Non-Leaf Nodes in B+-Trees
Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a full non-leaf node with 2d+1 pointers (P1, K1, P2, …, K2d, P2d+1):
  All the search keys in the subtree to which P1 points are less than K1.
  For 2 ≤ i ≤ 2d, all the search keys in the subtree to which Pi points have values greater than or equal to Ki–1 and less than Ki.
  All the search keys in the subtree to which P2d+1 points have values greater than or equal to K2d.

Searching a B+ Tree
Point queries with exact key values:
  Start at the root.
  Proceed down to the leaf.
  Example: Select name From people Where age = 25
Range queries:
  As above, then traverse the leaves sequentially.
  Example: Select name From people Where 20 <= age and age <= 30

Queries on B+-Trees
Find record with search-key value V.
  C = root
  While C is not a leaf node {
    Let i be the least value s.t. V ≤ Ki.
    If no such i exists, set C = last non-null pointer in C
    Else { if (V = Ki) set C = Pi+1 else set C = Pi }
  }
  Let i be the least value s.t. Ki = V.
  If there is such a value i, follow pointer Pi to the desired record.
  Else no record with search-key value V exists.
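The search procedure, together with the range-query traversal over the leaves, can be sketched in Python as follows. The `Node` class is a hypothetical in-memory stand-in for a disk block, and the sample tree is only the (20, 60) subtree of the d = 2 example used in these slides; record values such as "r30" are placeholders.

```python
from bisect import bisect_right


class Node:
    """Hypothetical in-memory B+-tree node (a sketch, not a disk layout)."""

    def __init__(self, keys, children=None, records=None):
        self.keys = keys          # sorted search-key values
        self.children = children  # child nodes (non-leaf nodes only)
        self.records = records    # one record per key (leaf nodes only)
        self.next_leaf = None     # next leaf in search-key order

    @property
    def is_leaf(self):
        return self.children is None


def find_leaf(root, v):
    """Walk from the root down to the leaf that may contain v."""
    node = root
    while not node.is_leaf:
        # bisect_right counts the keys <= v, which is the 0-based child slot:
        # v < K1 -> first child, K1 <= v < K2 -> second child, and so on.
        node = node.children[bisect_right(node.keys, v)]
    return node


def find(root, v):
    """Point query: return the record with key v, or None."""
    leaf = find_leaf(root, v)
    for key, record in zip(leaf.keys, leaf.records):
        if key == v:
            return record
    return None


def range_query(root, lo, hi):
    """Range query: descend for lo, then follow next-leaf pointers."""
    leaf = find_leaf(root, lo)
    result = []
    while leaf is not None:
        for key, record in zip(leaf.keys, leaf.records):
            if key > hi:
                return result
            if key >= lo:
                result.append(record)
        leaf = leaf.next_leaf
    return result


# Internal node (20, 60) over three leaves, chained by next-leaf pointers.
leaves = [
    Node([10, 15, 18], records=["r10", "r15", "r18"]),
    Node([20, 30, 40, 50], records=["r20", "r30", "r40", "r50"]),
    Node([60, 65], records=["r60", "r65"]),
]
for left, right in zip(leaves, leaves[1:]):
    left.next_leaf = right
root = Node([20, 60], children=leaves)
```

With this tree, `find(root, 30)` descends through the second child pointer, while `range_query(root, 20, 30)` descends once for the lower bound and then scans leaves until it passes the upper bound.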

B+ Tree Example (d = 2; the root holds a single key)
Queries:
  Select name From person Where age = 30
  (Where age >= 30)
  (Where 20 <= age and age <= 30)
[Figure: B+ tree with root (80), internal nodes (20, 60) and (100, 120, 140), and leaves (10, 15, 18), (20, 30, 40, 50), (60, 65), (80, 85, 90); each leaf key points to its data record]

B+ Tree Design
How large is d?
Each block must have space for 2d search keys and 2d+1 pointers.
Pick d as large as possible such that a node still fits into a block.
Example 14.10:
  Key size = 4 bytes
  Pointer size = 8 bytes
  Block size = 4096 bytes
  2d x 4 + (2d+1) x 8 <= 4096, i.e., 24d + 8 <= 4096
  So, d = 170
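The same calculation as a small Python helper, in case other block or key sizes are of interest. The function name `max_degree` is made up for this sketch; it simply solves the inequality from the example for d.

```python
def max_degree(block_size, key_size, pointer_size):
    """Largest d such that 2d keys and 2d+1 pointers fit in one block."""
    # 2d*key + (2d+1)*ptr <= block  =>  d <= (block - ptr) / (2*(key + ptr))
    return (block_size - pointer_size) // (2 * (key_size + pointer_size))


d = max_degree(block_size=4096, key_size=4, pointer_size=8)
```

For the sizes in the example this gives d = 170, since 24 x 170 + 8 = 4088 bytes fit in a 4096-byte block and d = 171 would need 4112 bytes.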

B+ Trees in Practice
Typical order: 100. Typical fill factor: 67%.
  Average fan-out = 133
Typical capacities:
  Height 4: 133^4 = 312,900,721 records
  Height 3: 133^3 = 2,352,637 records
Can often hold the top levels in the buffer pool:
  Level 1 = 1 page = 8 Kbytes
  Level 2 = 133 pages = 1 MByte
  Level 3 = 17,689 pages = 133 MBytes
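These figures can be checked with a few lines of Python. The sketch assumes that the 67% fill factor means roughly two thirds of the 2d = 200 entries per node, which gives the fan-out of 133 quoted above; the function names are illustrative.

```python
ORDER = 100                      # d
FAN_OUT = (2 * ORDER * 2) // 3   # 2d entries per node, about 2/3 full


def records_at_height(height):
    """Number of records reachable through a tree of the given height."""
    return FAN_OUT ** height


def pages_at_level(level):
    """Number of pages (nodes) on a level, counting the root as level 1."""
    return FAN_OUT ** (level - 1)
```

Level 3 has 133 x 133 = 17,689 pages, which is why the top three levels can usually stay in the buffer pool.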

Insertion in a B+ Tree
Insert (K, P):
  Find the leaf where K belongs and insert.
  If no overflow (2d keys or fewer), halt.
  If overflow (2d+1 keys), split the node and insert the middle key into the parent:
    Example with d = 2: (K1 K2 K3 K4 K5) with pointers P0 … P5 splits into (K1 K2) with P0 P1 P2 and (K4 K5) with P3 P4 P5; (K3, pointer to the new node) goes to the parent.
    If the node is a leaf, keep K3 in the right node too.
  When the root splits, the new root has only 1 key; that is why the root is a special case for the degree requirement.
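A minimal sketch of the leaf-level part of this procedure, assuming leaves are plain sorted Python lists of keys and d = 2. The function `insert_into_leaf` is hypothetical; it does not update the parent, it only reports which separator key would have to be inserted there.

```python
from bisect import insort

D = 2  # degree: a leaf holds between D and 2*D keys


def insert_into_leaf(leaf, key):
    """Insert key into a sorted leaf (a list of keys).

    Returns (left, None, None) when no split is needed, otherwise
    (left, separator, right), where separator goes to the parent.
    """
    keys = list(leaf)
    insort(keys, key)
    if len(keys) <= 2 * D:
        return keys, None, None
    # Overflow (2d+1 keys): the first d keys stay, the rest move right.
    left, right = keys[:D], keys[D:]
    # In a leaf the separator (K3 above) is kept in the right node too.
    return left, right[0], right
```

Inserting 25 into the full leaf (20, 30, 40, 50) yields (20, 25) and (30, 40, 50), with 30 as the separator for the parent. This matches the d = 2 walkthrough in these slides.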

Insertion in a B+ Tree: Insert K=19
Root: (80). Internal nodes: (20, 60) and (100, 120, 140).
Leaves: (10, 15, 18) (20, 30, 40, 50) (60, 65) (80, 85, 90)

Insertion in a B+ Tree: After Insertion
Leaves: (10, 15, 18, 19) (20, 30, 40, 50) (60, 65) (80, 85, 90)

Insertion in a B+ Tree: Now Insert K=25
Tree as above; 25 belongs in leaf (20, 30, 40, 50).

Insertion in a B+ Tree: After Insertion
Leaves: (10, 15, 18, 19) (20, 25, 30, 40, 50) (60, 65) (80, 85, 90)
The leaf (20, 25, 30, 40, 50) now holds 2d+1 = 5 keys: overflow.

Insertion in a B+ Tree: Now Split
Leaf (20, 25, 30, 40, 50) splits into (20, 25) and (30, 40, 50).

Insertion in a B+ Tree: After the Split
Root: (80). Internal nodes: (20, 30, 60) and (100, 120, 140).
Leaves: (10, 15, 18, 19) (20, 25) (30, 40, 50) (60, 65) (80, 85, 90)

Deletion from a B+ Tree: Delete 30
Root: (80). Internal nodes: (20, 30, 60) and (100, 120, 140).
Leaves: (10, 15, 18, 19) (20, 25) (30, 40, 50) (60, 65) (80, 85, 90)

Deletion from a B+ Tree: After Deleting 30
Internal node (20, 30, 60): the key 30 may change to 40, or not.
Leaves: (10, 15, 18, 19) (20, 25) (40, 50) (60, 65) (80, 85, 90)

Deletion from a B+ Tree: Delete 25
Tree as above; 25 is in leaf (20, 25).

Deletion from a B+ Tree: After Deleting 25
Leaves: (10, 15, 18, 19) (20) (40, 50) (60, 65) (80, 85, 90)
Leaf (20) has fewer than d = 2 keys. Need to rebalance: rotate.

Deletion from a B+ Tree: Now Delete 40
After the rotation, internal nodes: (19, 30, 60) and (100, 120, 140).
Leaves: (10, 15, 18) (19, 20) (40, 50) (60, 65) (80, 85, 90)

Deletion from a B+ Tree: After Deleting 40
Leaves: (10, 15, 18) (19, 20) (50) (60, 65) (80, 85, 90)
Rotation not possible. Need to merge nodes.

Deletion from a B+ Tree: Final Tree
Root: (80). Internal nodes: (19, 60) and (100, 120, 140).
Leaves: (10, 15, 18) (19, 20, 50) (60, 65) (80, 85, 90)
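The two rebalancing steps in this walkthrough, rotation and merge, can be sketched for leaves as follows. This is a simplified illustration: leaves are plain lists, only the left sibling is considered (a full implementation would also try the right sibling), and the function name `fix_underflow` is made up.

```python
D = 2  # minimum number of keys in a non-root leaf


def fix_underflow(left, leaf):
    """Rebalance an underflowing leaf using its left sibling.

    Returns (new_left, new_leaf, new_separator). After a merge, new_leaf
    and new_separator are None: the parent drops one key and one pointer.
    """
    if len(left) > D:
        # Rotate: borrow the largest key of the left sibling.
        new_left, new_leaf = left[:-1], [left[-1]] + leaf
        return new_left, new_leaf, new_leaf[0]
    # The sibling is at minimum occupancy: merge the two leaves.
    return left + leaf, None, None
```

After deleting 25, the leaf (20) borrows 19 from (10, 15, 18, 19) and the parent key becomes 19. After deleting 40, the leaf (50) merges with (19, 20) because that sibling has only d keys.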

B Tree
Idea: avoid duplicate keys.
Have record pointers in non-leaf nodes.
[Figure: non-leaf node (K1, P1, K2, P2, K3, P3); each Pi points to the record with key Ki, and the child pointers lead to keys x < K1, K1 < x < K2, K2 < x < K3, and x > K3]

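B-Tree Example (d = 2)
Sequence pointers are not useful now!
[Figure: B-tree with root (65, 125); second level (25, 45), (85, 105), (145, 165); leaves (10, 20), (30, 40), (50, 60), (70, 80), (90, 100), (110, 120), (130, 140), (150, 160), (170, 180)]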

Hash Tables
Recall the basics:
  There are n buckets.
  A hash function f(k) maps a key k to {0, 1, …, n-1}.
  Store in bucket f(k) a pointer to the record with key k.
Secondary storage: bucket = block; use overflow blocks when needed.

Hash Table Example
Assume 1 bucket (block) stores 2 keys + pointers.
h(e)=0, h(b)=h(f)=1, h(g)=2, h(a)=h(c)=3
  Bucket 0: e
  Bucket 1: b, f
  Bucket 2: g
  Bucket 3: a, c

Searching in a Hash Table
Search for a:
  Compute h(a)=3.
  Read bucket 3.
  1 disk access.

Insertion in Hash Table
Place the key in the right bucket, if there is space.
E.g. h(d)=2: bucket 2 becomes g, d.

Insertion in Hash Table
Create an overflow block if there is no space.
E.g. h(k)=1: bucket 1 (b, f) is full, so k goes into an overflow block.
More overflow blocks may be needed.

Hash Table Performance
Excellent, if there are no overflow blocks.
Degrades considerably when the number of keys exceeds the number of buckets (i.e. many overflow blocks).
Other problems:
  Memory requirement
  Dynamic maintenance
  Equality queries only!
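To make the cost of overflow blocks concrete, here is a small sketch of a static hash table in which each bucket is a chain of blocks holding 2 keys each. The hash values in `H` are the ones given in the example slides; the class and method names are assumptions for this illustration. `search` returns the number of blocks read.

```python
BLOCK_CAPACITY = 2  # keys per block, as in the example

# Hash values taken from the example: h(e)=0, h(b)=h(f)=1, ...
H = {"e": 0, "b": 1, "f": 1, "g": 2, "a": 3, "c": 3, "d": 2, "k": 1}


class StaticHashTable:
    def __init__(self, n):
        # Each bucket is a chain of blocks; the first one is the primary block.
        self.buckets = [[[]] for _ in range(n)]

    def insert(self, key):
        chain = self.buckets[H[key]]
        if len(chain[-1]) == BLOCK_CAPACITY:
            chain.append([])  # no space left: create an overflow block
        chain[-1].append(key)

    def search(self, key):
        """Return the number of blocks read to find key, or None if absent."""
        for reads, block in enumerate(self.buckets[H[key]], start=1):
            if key in block:
                return reads
        return None


table = StaticHashTable(4)
for key in "ebfgacdk":
    table.insert(key)
```

Searching for `a` costs one block read, while searching for `k` costs two, because `k` landed in an overflow block of bucket 1.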

Extensible Hash Table
Allows the hash table to grow, to avoid performance degradation.
Assume a hash function h that returns numbers in {0, …, 2^k – 1}.
Start with n = 2^i << 2^k (the size of the hash table); only look at the first i most significant bits.
E.g. i=1, n=2, k=4; the first i bits (i = 1) select the entry:
  Entry 0 -> block 0(010), bits used: 1
  Entry 1 -> block 1(011), bits used: 1

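Insertion in Extensible Hash Table
After inserting 1110 (i=1):
  Entry 0 -> block 0(010), bits used: 1
  Entry 1 -> block 1(011), 1(110), bits used: 1

Now insert 1010
The block for entry 1 already holds 1(011) and 1(110), so it cannot take 1(010).
Need to extend the table and split blocks; i becomes 2, so n=4.

Doubling the hash table (i=2)
  Entries 00 and 01 -> block 0(010), bits used: 1
  Entry 10 -> block 10(11), 10(10), bits used: 2
  Entry 11 -> block 11(10), bits used: 2

Now insert 0000, then 0101
The block 0(010), 0(000) is full, so inserting 0(101) requires splitting the block; i=2 is unchanged.

After splitting the block (i=2)
  Entry 00 -> block 00(10), 00(00), bits used: 2
  Entry 01 -> block 01(01), bits used: 2
  Entry 10 -> block 10(11), 10(10), bits used: 2
  Entry 11 -> block 11(10), bits used: 2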

Extensible Hash Table Performance
No overflow blocks: access always takes one read.
BUT:
  Extensions can be costly and disruptive.
  After an extension, the table may no longer fit in memory.
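The insertion walkthrough can be reproduced with a compact Python sketch. It assumes k = 4-bit hash values that are already computed and distinct, 2 keys per block, and a directory indexed by the first i bits. The class names and the `depth` field (the "bits used" number in the figures) are choices made for this example.

```python
K = 4         # hash values are K-bit numbers
CAPACITY = 2  # keys per block


class Bucket:
    def __init__(self, depth):
        self.depth = depth  # number of leading bits shared by all keys here
        self.keys = []


class ExtensibleHash:
    def __init__(self):
        self.i = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _slot(self, h):
        return h >> (K - self.i)  # first i most significant bits

    def insert(self, h):
        bucket = self.directory[self._slot(h)]
        if len(bucket.keys) < CAPACITY:
            bucket.keys.append(h)
            return
        if bucket.depth == self.i:
            # Double the table: old entry j becomes entries 2j and 2j+1.
            self.directory = [b for b in self.directory for _ in range(2)]
            self.i += 1
        # Split the full block on one more bit.
        bucket.depth += 1
        new = Bucket(bucket.depth)
        old_keys, bucket.keys = bucket.keys, []
        for slot, b in enumerate(self.directory):
            if b is bucket and (slot >> (self.i - bucket.depth)) & 1:
                self.directory[slot] = new
        for key in old_keys:
            self.directory[self._slot(key)].keys.append(key)
        self.insert(h)  # retry; may split again if all keys moved together


table = ExtensibleHash()
for h in [0b0010, 0b1011, 0b1110, 0b1010, 0b0000, 0b0101]:
    table.insert(h)
```

Inserting 1010 doubles the directory (i becomes 2), while inserting 0101 only splits a block, since that block used fewer bits than the directory.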

Linear Hash Table
Idea: extend only one entry at a time.
Problem: n is no longer a power of 2.
Let i be the number of bits necessary to address n buckets: 2^(i-1) < n <= 2^i
After computing h(k), use the last i bits:
  If the last i bits represent a number >= n, change the most significant of those bits from 1 to 0 (to get a number < n).
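The bucket-addressing rule can be written as a short Python function. This is a sketch under the assumption that `h` is the already computed hash value; the name `bucket_number` is made up for the example.

```python
def bucket_number(h, n):
    """Map hash value h to one of n buckets (n need not be a power of 2)."""
    i = max(1, (n - 1).bit_length())  # smallest i with n <= 2**i
    m = h & ((1 << i) - 1)            # last i bits of h
    if m >= n:
        # The most significant of the i bits must be 1 here; flip it to 0.
        m -= 1 << (i - 1)
    return m
```

With n = 3 buckets, a key whose hash ends in 11 has no bucket 11 yet, so it is stored in bucket 01. Once the table has grown to n = 4, the same key maps to bucket 11.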

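Linear Hash Table Example
n=3, i=2. Buckets: 00 holds (01)00, (11)00; 01 is empty; 10 holds (10)10.
Insert (01)11: there is no bucket 11, so BIT FLIP 11 → 01; the key goes to bucket 01.

Linear Hash Table Example
Insert 1000: overflow blocks…
Bucket 00 already holds (01)00 and (11)00, so (10)00 goes into an overflow block. Bucket 01 holds (01)11; bucket 10 holds (10)10.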

Linear Hash Tables
Extension: independent of overflow blocks.
Extend n := n+1 when the average number of records per block exceeds (say) 80% of the block capacity.

Linear Hash Table Extension
From n=3 to n=4: only need to touch one block (which one?)
Before (n=3, i=2): 00 holds (01)00, (11)00; 01 holds (01)11; 10 holds (10)10.
After (n=4, i=2): 00 holds (01)00, (11)00; 01 is empty; 10 holds (10)10; 11 holds (01)11.

Linear Hash Table Extension
From n=3 to n=4: finished.
Extension from n=4 to n=5 (new bit):
  Need to touch every single block (why?)
  Need to look at the last 3 bits, which affects all keys.
[Figure: table with n=4, i=2: 00 holds (01)00, (11)00; 01 is empty; 10 holds (10)10; 11 holds (01)11]
