Index Structures [13]
Dr. Kalpakis, CMSC 661 Principles of Database Systems
http://www.csee.umbc.edu/~kalpakis/Courses/661
Terms
- Search key: values for a collection of fields, used to find records with "matching" values of those fields
- Index entry: a pair (search key value, database pointer)
- Index structure: a data structure that allows us to locate the records that satisfy a search key condition
- Primary vs secondary index: a primary index returns the database address of a record; a secondary index returns the value of the record's primary key
- Dense vs sparse
- Clustered vs non-clustered
Indexes on sequential files
- An index file consists of index entry records
- A sorted file of records is called a sequential file; the sorted file itself is called the data file
- Records can be stored in contiguous blocks in sequence (enabling binary search!) or in a linked list of blocks
- The index file is sorted in the same order as the data file
- Dense indexes
  - Have an index entry for each record in the data file
  - The index file is likely much smaller than the data file and may fit in main memory
  - Useful for finding records and answering some queries without reading the (likely larger) data file
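Since a dense index is itself sorted on the search key, a lookup is just a binary search over its entries. A minimal sketch (the entries and block-pointer strings are made-up illustrations, not from the slides):

```python
import bisect

# Dense index: one (search key, database pointer) entry per data-file
# record, kept in the same sorted order as the data file.
index = [(3, "blk0:0"), (11, "blk0:1"), (30, "blk1:0"), (35, "blk1:1")]
keys = [k for k, _ in index]

def lookup(k):
    """Return the database pointer for search key k, or None."""
    pos = bisect.bisect_left(keys, k)
    if pos < len(index) and keys[pos] == k:
        return index[pos][1]
    return None

print(lookup(30))  # → blk1:0
print(lookup(31))  # → None
```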
Indexes on sequential files
- Three major approaches to reorganizing a data file:
  - Use overflow blocks
  - Insert new blocks in the sequential order
  - Move records between adjacent blocks
- Since an index file, dense or sparse, is sorted on its search key, it is itself a sequential file
  - Apply the same principles to it as to the data file
Secondary indexes
- Primary indexes determine the location of the indexed records
- Secondary indexes do not determine the location of the indexed records
- Secondary indexes are always dense, typically with duplicates
- Indirection for secondary indexes
  - To avoid repeating the search key for duplicates, keep for each distinct search key value a bucket with references to all the records having that value
- This idea naturally leads to inverted indexes in Information Retrieval
  - For each word, keep a secondary index of the documents that contain it
  - Combine all these indexes into one
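The bucket-per-key idea behind inverted indexes can be sketched as follows (the document ids and texts are invented for illustration):

```python
from collections import defaultdict

# Inverted index: for each word, a bucket (set) of references to the
# documents containing it -- one combined index for all words.
docs = {1: "b-trees index blocks", 2: "hashing buckets blocks", 3: "index buckets"}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        inverted[word].add(doc_id)

print(sorted(inverted["blocks"]))                    # → [1, 2]
print(sorted(inverted["index"] & inverted["buckets"]))  # AND query → [3]
```

Answering a conjunctive query is then just a bucket intersection, without touching the documents themselves.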
Conventional indexes
- Advantages
  - Simple
  - The index is a sequential file
  - Good for scans
- Disadvantages
  - Inserts are expensive
  - They lose sequential order
  - They become imbalanced
B-trees
- An index that organizes its blocks into a tree
- A B-tree with parameter n
  - Is a balanced tree, since all leaves are required to be at the same level
  - Each node has space for n search keys and n+1 pointers to blocks of the tree
  - Each internal node has at least ⌈(n+1)/2⌉ children
  - Each leaf has at least ⌊(n+1)/2⌋ keys from the data file
  - The root has at least two children
B-trees
- The keys in each node are stored in increasing sorted order
- The keys in the leaves
  - Are copies of keys from the data file
  - Appear across all the leaves in increasing sorted order
- Each key K in a non-leaf node
  - Is flanked by two pointers to subtrees
  - The subtree to its left contains keys < K
  - The subtree to its right contains keys >= K
  - K is the smallest key appearing among the leaves of the right subtree
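The invariant above (keys < K to the left, keys >= K to the right) means the child to descend into is found with a right-biased binary search. A small sketch, using keys from the lecture's example tree:

```python
import bisect

def child_index(node_keys, k):
    """Index of the subtree to follow for search key k, under the
    invariant: subtree left of K holds keys < K, right holds keys >= K."""
    # The number of node keys that are <= k selects the child.
    return bisect.bisect_right(node_keys, k)

print(child_index([100, 120, 150], 99))   # → 0 (keys < 100)
print(child_index([100, 120, 150], 120))  # → 2 (keys in [120, 150))
print(child_index([100, 120, 150], 180))  # → 3 (keys >= 150)
```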
B-tree example (n=3)
- Root with key 100; internal nodes with keys (30) and (120, 150, 180)
- Leaves: (3, 5, 11), (30, 35), (100, 101, 110), (120, 130), (150, 156, 179), (180, 200)
B-tree example - non-leaf node
- A non-leaf node with keys 57, 81, 95 (reached by a pointer from its parent) and four pointers:
  - to keys k < 57, to keys 57 <= k < 81, to keys 81 <= k < 95, and to keys k >= 95
B-tree example - leaf node
- A leaf node with keys 57, 81, 95 (reached by a pointer from a non-leaf node)
- Pointers to the records with keys 57, 81, and 95, plus a pointer to the next leaf in sequence
Operations on B-trees
- Lookup
  - Equality queries: traverse the tree from the root down to the appropriate leaf, using the keys stored at the nodes to decide which subtree(s) to visit
  - Range queries, in which records with search keys in a given range are sought
- Insertion
  - Find the leaf block B where search key K belongs
  - If B is not full, insert K there
  - If B is full, create a new block B', distribute the keys between the two blocks, and recursively insert a search key for B' into the parent of B; update the keys that appear on the paths from B (and B') to the root
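The lookup steps above can be sketched for an in-memory tree. This is a simplified illustration, not the disk-based block structure: the node classes are assumptions, and the leaf keys mirror part of the lecture's n=3 example, with leaves chained for range scans.

```python
import bisect

class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None  # link to the next leaf in key order

class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # routing keys
        self.children = children  # len(keys) + 1 children

def find_leaf(node, k):
    # Route down using the < K / >= K invariant at each internal node.
    while isinstance(node, Internal):
        node = node.children[bisect.bisect_right(node.keys, k)]
    return node

def lookup(root, k):
    return k in find_leaf(root, k).keys

def range_query(root, lo, hi):
    """All keys k with lo <= k <= hi, scanning the chained leaves."""
    leaf, out = find_leaf(root, lo), []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out

l1, l2, l3 = Leaf([3, 5, 11]), Leaf([30, 35]), Leaf([100, 101, 110])
l1.next, l2.next = l2, l3
root = Internal([30, 100], [l1, l2, l3])
print(lookup(root, 35))            # → True
print(range_query(root, 10, 101))  # → [11, 30, 35, 100, 101]
```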
Operations on B-trees
- Deletion
  - To delete search key K, locate the leaf block B that contains K
  - Delete the entry for K from B
  - If B now has fewer than the required keys/pointers:
    - If a sibling node B' has an extra entry it could give to B, move it to B
    - Else, coalesce by moving all the entries from B to one of its siblings and recursively delete the entry for B from its parent
  - Update the keys along the paths from the modified blocks to the root
- Coalescing is often not implemented
  - Too hard for little benefit
B-trees vs indexed sequential files
- Concurrency control is harder in B-trees
- B-trees consume more space
- A static index saves enough accesses to allow for reorganization; on the other hand:
  - The DBA does not know when to reorganize
  - The DBA does not know how full to load the pages of a new index
- Buffering
  - A B-tree has fixed buffer requirements
  - A static index must read several overflow blocks to be efficient (large, variable-size buffers are needed for this)
  - LRU is bad for B-trees (always keep the root block in memory)
- B-trees are better
Interesting problems
- What is a good index structure when records tend to be inserted with keys larger than existing values, and we want to remove older data?
- What is the best n, as disks get bigger, main memories increase, and CPUs get faster?
Static hashing
- Maintain an array of buckets, each bucket being a chain of blocks
- Use a hash function h to compute the bucket index for record(s) with a given search key K
- Search the chain of blocks of bucket h(K) for the sought records
- Inserts/deletes are basic linked-list inserts/deletes
- Collisions can degrade performance by producing a few long chains
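A minimal sketch of the static hashing scheme above, with a Python list standing in for each bucket's chain of blocks (the bucket count, hash function, and records are illustrative choices, not prescribed by the slides):

```python
NBUCKETS = 4
buckets = [[] for _ in range(NBUCKETS)]  # each bucket: a chain of entries

def h(key):
    return hash(key) % NBUCKETS

def insert(key, record):
    buckets[h(key)].append((key, record))

def lookup(key):
    # Search the chain of bucket h(key) for matching records.
    return [rec for k, rec in buckets[h(key)] if k == key]

insert(42, "rec-a")
insert(42, "rec-b")
insert(7, "rec-c")
print(lookup(42))  # → ['rec-a', 'rec-b']
print(lookup(99))  # → []
```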
Static hashing
- Uses a fixed hash function and a fixed number of buckets
- If #buckets is too small, performance degrades due to frequent overflows and long chains
- If #buckets is made large to anticipate future growth, secondary storage is wasted initially, and whenever the data file shrinks
- One rather expensive option is to periodically rebuild the hash index
- Another option is to devise hashing that allows the hash function or #buckets to change dynamically with small overhead
Extensible hashing
- Uses a hash function h that generates values over [0, 2^b)
- At any time it uses only a prefix (the i high-order bits) of the hash
  - i is initially 0 and changes as the size of the data file changes
- Maintains a bucket array with 2^i pointers to buckets
  - The i-bit prefix h(K)[i] is used to index into the bucket array
  - Multiple entries of the bucket array may point to the same bucket
- Each bucket j stores with it a bucket prefix length i_j
  - All the keys it contains have the same value for their i_j-bit prefix
  - There are 2^(i-i_j) pointers to bucket j from the bucket array
- The number of distinct buckets changes dynamically due to coalescing and splitting of buckets
Extensible hashing (example figure)
Operations with extensible hashing
- Lookup of the bucket for search key K
  - Use the i high-order bits h(K)[i] of the hash to index into the bucket array and follow the pointer to the bucket
- Inserting search key K into bucket j
  - If there is room in bucket j, insert the record there
  - Else split bucket j and attempt the insertion again (use overflow buckets if it is full again)
- Deleting search key K from bucket j
  - Delete K from bucket j
  - Remove bucket j if it becomes empty, by nullifying the relevant bucket array entry
  - Coalescing of buckets can be done: a bucket can coalesce only with a "buddy" bucket having the same prefix length i_j and the same (i_j - 1)-bit prefix, if it is present
  - Decreasing the bucket array size is also possible
Splitting buckets in extensible hashing
- Splitting bucket j depends on the number of pointers to it in the bucket array
- If i > i_j (more than one pointer to bucket j)
  - Allocate a new bucket z with bucket prefix length i_j + 1, and set the bucket prefix length of j to i_j + 1
  - Make the highest half of the bucket array entries pointing to j now point to z
  - Remove and reinsert each record of bucket j
- If i = i_j (only one pointer to bucket j)
  - Increment i and double the size of the bucket array, replacing each entry by two entries that point to the same bucket
  - Now i > i_j, so the previous case applies
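The insert-and-split machinery can be sketched compactly. This is a simplified in-memory illustration of the slides' scheme (high-order-bit prefixes, directory doubling, bucket splitting); the hash function, bucket capacity, and inserted keys are assumptions, and redistribution simply appends rather than using overflow blocks.

```python
B = 4    # hash values lie in [0, 2**B)
CAP = 2  # records per bucket

class Bucket:
    def __init__(self, prefix_len):
        self.prefix_len = prefix_len  # i_j in the slides
        self.items = []

i = 0
directory = [Bucket(0)]  # bucket array with 2**i entries

def h(key):
    return key % (2 ** B)  # stand-in hash function

def dir_index(key):
    return h(key) >> (B - i)  # the i high-order bits of the hash

def insert(key):
    global i, directory
    b = directory[dir_index(key)]
    if len(b.items) < CAP:
        b.items.append(key)
        return
    if b.prefix_len == i:  # only one pointer to b: double the directory
        directory = [bkt for bkt in directory for _ in range(2)]
        i += 1
    # Split b: a new bucket z takes the upper half of b's directory entries.
    b.prefix_len += 1
    z = Bucket(b.prefix_len)
    ptrs = [j for j, bkt in enumerate(directory) if bkt is b]
    for j in ptrs[len(ptrs) // 2:]:
        directory[j] = z
    old, b.items = b.items, []
    for k in old + [key]:  # redistribute old records, then the new one
        directory[dir_index(k)].items.append(k)

for k in [1, 9, 13, 5]:
    insert(k)
print(i, [b.items for b in directory])  # → 1 [[1, 5], [9, 13]]
```

Inserting 13 overflows the single initial bucket, so the directory doubles (i becomes 1) and the bucket splits on the first hash bit.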
Extensible hashing - example
- Initial extensible hash structure, bucket size = 2 records
Extensible hashing - example
- After inserting one Brighton and two Downtown records
- After inserting Mianus
Extensible hashing - example
- After inserting three Perryridge records
Extensible hashing - example
- Hash structure after insertion of Redwood and Round Hill records
Extensible hashing
- Advantages
  - Hash performance does not degrade with growth of the file
  - Minimal space overhead
  - No full reorganizations
- Disadvantages
  - An extra level of indirection to find the desired record (not bad, if the bucket array is in memory)
  - The bucket array may itself become very big (larger than memory), requiring a tree structure to locate the desired record in the structure
  - Changing the size of the bucket array is an expensive operation
- Linear hashing is an alternative mechanism that avoids these disadvantages, at the possible cost of more bucket overflows
Linear hashing
- Uses a hash function h that generates values over [0, 2^b)
- Uses the suffix h(K)[i] of the i low-order bits of the hash h(K)
- Maintains an array of n buckets
- Attempts to limit the load factor r/n, where r = #records
- Linear hashing has all the advantages of extensible hashing, and
  - Grows the #buckets more slowly
  - Is less expensive when changing the size of the bucket array
  - Long chains are still possible
Operations for linear hashing
- Lookup of the bucket for search key K
  - If m = h(K)[i] < n then return bucket m, else return bucket m - 2^(i-1)
- Insert a record with search key K
  - Look up the bucket j for K
  - If bucket j has space, insert the record there; else chain a new block to the bucket and insert the record there
  - If the load factor is above a threshold, add a new bucket to the bucket array; if the new bucket's number is 1a_2a_3...a_i, split the records of bucket 0a_2a_3...a_i with the new bucket
  - If n = 2^i then increment i by 1
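The lookup rule above can be written out directly: take the i low-order bits of the hash, and if the resulting number m does not yet correspond to a real bucket (m >= n), drop the top bit, i.e. use bucket m - 2^(i-1). A small sketch with illustrative hash values:

```python
def bucket_for(hk, i, n):
    """Linear-hashing bucket number for a record whose hash is hk,
    given i bits in use and n buckets currently allocated."""
    m = hk & ((1 << i) - 1)  # the i low-order bits of the hash
    return m if m < n else m - (1 << (i - 1))

# With i = 2 bits in use and n = 3 buckets (numbered 0..2):
print(bucket_for(0b0101, 2, 3))  # m = 1 < n → bucket 1
print(bucket_for(0b0111, 2, 3))  # m = 3 >= n → bucket 3 - 2 = 1
print(bucket_for(0b0110, 2, 3))  # m = 2 < n → bucket 2
```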
Indexing vs hashing
- Hashing is good for probes given a key:
  SELECT * FROM R WHERE R.K=5;
- Indexing is good for range queries:
  SELECT * FROM R WHERE R.K BETWEEN 5 AND 10;
Index definition in SQL
CREATE [UNIQUE] INDEX idxname ON rname(attrs);
DROP INDEX idxname;