CS4433 Database Systems Indexing.

Why Do We Learn This? To find the desired information (by value) in the database very quickly, e.g., with the author catalog in a library. Indexing topics: common properties of indexes, B+ trees, hash tables.

What is Indexing? A "labeled" pointer to an item (or a collection of items) that satisfies some common property. Examples in the real world?

Theoretically, an Index is … An index on a file speeds up selections on the search key attribute(s). Search key = any subset of the attributes of a relation; these are the attributes used to look up records in a file. A search key is not the same as a key (a minimal set of attributes that uniquely identifies a tuple (record) in a relation). An index file consists of records (called index entries). Entries in an index have the form (K, R), where K is the search key value and R is a pointer to the record, a record id, or a list of record ids. Index files are typically much smaller than the original file.

Types of Indexes. Ordered/Hash: ordered indices store index entries sorted on the search key value (e.g., author catalog in library); hash indices distribute index entries uniformly across "buckets" based on the search key, using a "hash function". Clustered/Unclustered: clustered = the records in the file are sorted in search-key order; unclustered = they are not. Dense/Sparse: dense = an index record appears for every search-key value in the file; sparse = index records appear for only some search-key values. Primary/Secondary: primary = on the primary key; secondary = on any other search key. Some textbooks interpret these differently. Implementation: B+ tree / hash table / …

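Clustered, Dense Index. Clustered: the file is sorted on the index attribute. Dense: the index is a sequence of (key, pointer) pairs, one per search-key value. [Figure: index entries for keys 10, 20, ..., 80, each pointing to the record with that key in the sorted data file.]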

Clustered, Dense Index index on ID attribute of instructor relation

Dense Index Files (Cont.) Dense index on dept_name, with instructor file sorted on dept_name

Clustered, Sparse Index. Sparse index: contains index records for only some search-key values, e.g., one key per data block. Applicable only when records are sequentially ordered on the search key. Saves space at the cost of lookup efficiency. [Figure: index entries for keys 10, 30, 50, 70, 90, 110, 130, 150, one per data block; the data blocks shown hold records 10 20, 30 40, 50 60, and 70 80.]

Sparse Index Files. To locate a record with search-key value K: find the index record with the largest search-key value ≤ K, then search the file sequentially starting at the record to which that index record points.
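A minimal sketch of this lookup, assuming a hypothetical in-memory file of four blocks with one sparse index entry per block. The names `blocks`, `sparse_keys`, and `sparse_lookup` are illustrative and do not come from the slides. It finds the index entry with the largest key not exceeding K, then scans forward from the block that entry points to.

```python
from bisect import bisect_right

# Hypothetical data file: blocks of (key, payload) records sorted on the key.
blocks = [
    [(10, "a"), (20, "b")],
    [(30, "c"), (40, "d")],
    [(50, "e"), (60, "f")],
    [(70, "g"), (80, "h")],
]

# Sparse index: one entry per block, holding the least key in that block.
sparse_keys = [block[0][0] for block in blocks]  # [10, 30, 50, 70]


def sparse_lookup(k):
    """Return the record with key k, or None if no such record exists."""
    pos = bisect_right(sparse_keys, k) - 1  # largest index key <= k
    if pos < 0:
        return None  # k is smaller than every key in the file
    # Sequential scan starting at the block the index entry points to.
    for block in blocks[pos:]:
        for key, payload in block:
            if key == k:
                return (key, payload)
            if key > k:
                return None  # passed the place where k would be
    return None
```

Because the index holds one entry per block, the scan never has to read more than the block the entry points to when the key exists.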

Sparse Index Files (Cont.) Compared to dense indices: Less space and less maintenance overhead for insertions and deletions. Generally slower than dense index for locating records. Good tradeoff: sparse index with an index entry for every block in file, corresponding to least search-key value in the block.

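Clustered Index with Duplicate Keys. Dense index: each entry points to the first record with that key. [Figure: dense index with entries 10, 20, ..., 80 over a sorted data file in which keys such as 10 and 20 occur more than once.]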

Unclustered Indexes. Often used for indexing attributes other than the primary key. Always dense (why?). The locality of values has been broken! [Figure: sorted index entries for keys 10, 20, 30 pointing into a data file whose records are not stored in key order.]

Clustered vs. Unclustered Index. [Figure: two panels, CLUSTERED and UNCLUSTERED, each showing index entries (index file) pointing to data records (data file).]

Secondary Indices Example. Secondary index on salary field of instructor. Index record points to a bucket that contains pointers to all the actual records with that particular search-key value. Secondary indices have to be dense.
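The bucket idea can be sketched as follows, assuming a small made-up instructor file stored in ID order. The records, the field layout, and the function name `build_secondary_index` are hypothetical. Each index entry maps one salary value to a bucket of record positions.

```python
from collections import defaultdict

# Hypothetical instructor file, stored in ID order: (ID, name, salary).
instructors = [
    (101, "Avery", 60000),
    (102, "Brook", 80000),
    (103, "Casey", 40000),
    (104, "Devon", 80000),
    (105, "Emery", 60000),
]


def build_secondary_index(records, field):
    """Map each value of the given field to a bucket of record positions."""
    buckets = defaultdict(list)
    for rid, record in enumerate(records):
        buckets[record[field]].append(rid)
    # Sort the entries so the index is ordered on the search key.
    return dict(sorted(buckets.items()))


salary_index = build_secondary_index(instructors, field=2)
# Every salary value has an entry (dense), and each entry is a bucket of
# pointers, because the file itself is not sorted on salary.
names_at_80k = [instructors[rid][1] for rid in salary_index[80000]]
```

The index has to be dense here because records with equal salaries are scattered through the file, so no entry can stand in for its neighbors.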

Primary and Secondary Indices. Indices offer substantial benefits when searching for records. BUT: updating indices imposes overhead on database modification: when a file is modified, every index on the file must be updated. A sequential scan using the primary index is efficient, but a sequential scan using a secondary index is expensive, because each record access may fetch a new block from disk. A block fetch requires about 5 to 10 milliseconds, versus about 100 nanoseconds for a memory access.

Multilevel Index If primary index does not fit in memory, access becomes expensive. Solution: treat primary index kept on disk as a sequential file and construct a sparse index on it. outer index – a sparse index of primary index inner index – the primary index file If even outer index is too large to fit in main memory, yet another level of index can be created, and so on. Indices at all levels must be updated on insertion or deletion from the file.

Multilevel Index (Cont.)

Index Update: Deletion If deleted record was the only record in the file with its particular search-key value, the search-key is deleted from the index also. Single-level index entry deletion: Dense indices – deletion of search-key is similar to file record deletion. Sparse indices – if an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key order). If the next search-key value already has an index entry, the entry is deleted instead of being replaced.

Index Update: Insertion Single-level index insertion: Perform a lookup using the search-key value appearing in the record to be inserted. Dense indices – if the search-key value does not appear in the index, insert it. Sparse indices – if index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created. If a new block is created, the first search-key value appearing in the new block is inserted into the index. Multilevel insertion and deletion: algorithms are simple extensions of the single-level algorithms

Secondary Indices Frequently, one wants to find all the records whose values in a certain field (which is not the search-key of the primary index) satisfy some condition. Example 1: In the instructor relation stored sequentially by ID, we may want to find all instructors in a particular department Example 2: as above, but where we want to find all instructors with a specified salary or with salary in a specified range of values We can have a secondary index with an index record for each search-key value

B+ Trees. What's wrong with a sequential index? Pros: easy and fast to access. Cons: hard to maintain the sequential property upon updates; periodic reorganization of the entire file is required, and performance degrades as the file grows, since many overflow blocks get created. B+ tree intuition: give up sequentiality of the index and try to get "balance" by dynamic reorganization. The tree automatically reorganizes itself with small, local changes in the face of insertions and deletions, so reorganization of the entire file is not required to maintain performance. (Minor) disadvantages of B+-trees: extra insertion and deletion overhead, and space overhead.

Example of B+-Tree

B+-Tree Index Files (Cont.) A B+-tree is a rooted tree satisfying the following properties. All paths from root to leaf are of the same length. Parameter d = the degree (order). Each node other than the root has between d and 2d keys. Each interior node that is neither the root nor a leaf has between d+1 and 2d+1 pointers to children. A leaf node has between d+1 and 2d+1 pointers: one per key to a record, plus one to the next leaf. Special cases: if the root is not a leaf, it has at least 2 children; if the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and 2d values.

B+-Tree Node Structure. Typical node: the K_i are the search-key values, and the P_i are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes). The search keys in a node are ordered: K_1 < K_2 < K_3 < ... < K_{n-1}. (Initially assume no duplicate keys; duplicates are addressed later.) [Figure: node with keys K1, K2, K3 and pointers p1, p2, p3, p4 covering the key ranges [X, K1), [K1, K2), [K2, K3), [K3, Y).]

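B+ Trees Basics. Internal node: keys 30, 120, 240; the pointers after each key lead to subtrees with keys in [30, 120), [120, 240), and [240, Y). Leaf: keys 40, 50, 60; each key has a pointer to the record with that key (40, 50, 60), and the last pointer leads to the next leaf.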

Leaf Nodes in B+-Trees. Properties of a leaf node: for i = 1, 2, ..., 2d, pointer P_i points to a file record with search-key value K_i; P_{2d+1} points to the next leaf node in search-key order.

Non-Leaf Nodes in B+-Trees. Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m pointers (m ≤ 2d+1): all the search keys in the subtree to which P_1 points are less than K_1; for 2 ≤ i ≤ m-1, all the search keys in the subtree to which P_i points have values greater than or equal to K_{i-1} and less than K_i; all the search keys in the subtree to which P_m points have values greater than or equal to K_{m-1}. [Figure: full node with keys K_1 .. K_{2d} and pointers P_1, P_2, ..., P_{2d+1}.]

Searching a B+ Tree. Point queries with exact key values: start at the root and proceed down to the leaf. Example: Select name From people Where age = 25. Range queries: descend as above, then traverse the leaves sequentially. Example: Select name From people Where 20 <= age and age <= 30.

Queries on B+-Trees. Find record with search-key value V:
  C = root
  While C is not a leaf node {
    Let i be the least value s.t. V ≤ K_i.
    If no such i exists, set C = last non-null pointer in C
    Else { if (V = K_i) set C = P_{i+1} else set C = P_i }
  }
  Let i be the least value s.t. K_i = V.
  If there is such a value i, follow pointer P_i to the desired record.
  Else no record with search-key value V exists.
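The search procedure can be sketched in Python as below. The `Leaf` and `Inner` classes are assumptions made for illustration, and the tree is a simplified two-level one (a root with keys 20, 60, 80 over four leaves), not the exact tree from the slides. `bisect_right` chooses the same child as the rule "go to P_{i+1} if V = K_i, else P_i".

```python
from bisect import bisect_left, bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.records = [f"rec{k}" for k in keys]  # stand-ins for record pointers
        self.next = None  # pointer to the next leaf in key order


class Inner:
    def __init__(self, keys, children):
        self.keys = keys
        self.children = children  # len(children) == len(keys) + 1


def find_leaf(root, v):
    """Descend from the root to the leaf that would contain v."""
    node = root
    while isinstance(node, Inner):
        # Child i covers keys[i-1] <= v < keys[i].
        node = node.children[bisect_right(node.keys, v)]
    return node


def point_query(root, v):
    leaf = find_leaf(root, v)
    i = bisect_left(leaf.keys, v)
    if i < len(leaf.keys) and leaf.keys[i] == v:
        return leaf.records[i]
    return None  # no record with search-key value v


def range_query(root, lo, hi):
    """Return records with lo <= key <= hi by walking the leaf chain."""
    leaf, out = find_leaf(root, lo), []
    while leaf is not None:
        for k, r in zip(leaf.keys, leaf.records):
            if k > hi:
                return out
            if k >= lo:
                out.append(r)
        leaf = leaf.next
    return out


# Simplified example tree (hypothetical shape).
leaves = [Leaf([10, 15, 18]), Leaf([20, 30, 40, 50]), Leaf([60, 65]), Leaf([80, 85, 90])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Inner([20, 60, 80], leaves)
```

A point query touches one node per level, while a range query pays that cost once and then follows the next-leaf pointers.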

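B+ Tree Example (d = 2; the root is the exception and holds 1 key). Queries: Select name From person Where age = 30 (or Where age >= 30, or Where 20 <= age and age <= 30). [Figure: root with key 80; internal nodes with keys 20 60 and 100 120 140; leaves 10 15 18, 20 30 40 50, 60 65, and 80 85 90, each key pointing to the data record with the same value.]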

B+ Tree Design. How large is d? Each block must have space for 2d search keys and 2d+1 pointers, so pick d as large as possible such that a node still fits into a block. Example 14.10: key size = 4 bytes, pointer size = 8 bytes, block size = 4096 bytes. Then 2d x 4 + (2d+1) x 8 <= 4096, i.e., 24d + 8 <= 4096, so d = 170.
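The same arithmetic as a small helper, assuming only the node layout stated on the slide (2d keys and 2d+1 pointers per block). The function name `max_degree` is illustrative.

```python
def max_degree(block_size, key_size, pointer_size):
    """Largest d with 2d*key_size + (2d+1)*pointer_size <= block_size."""
    # Rearranged: 2d * (key_size + pointer_size) <= block_size - pointer_size
    return (block_size - pointer_size) // (2 * (key_size + pointer_size))


def node_bytes(d, key_size, pointer_size):
    """Space used by a full node of degree d."""
    return 2 * d * key_size + (2 * d + 1) * pointer_size


d = max_degree(block_size=4096, key_size=4, pointer_size=8)
```

With these sizes a full node of degree 170 uses 4088 bytes, and degree 171 would need 4112 bytes, which no longer fits.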

B+ Trees in Practice. Typical order: 100. Typical fill-factor: 67%. Average fan-out = 133. Typical capacities: height 4: 133^4 = 312,900,721 records; height 3: 133^3 = 2,352,637 records. Can often hold top levels in buffer pool: Level 1 = 1 page = 8 Kbytes; Level 2 = 133 pages = about 1 Mbyte; Level 3 = 17,689 pages = about 138 MBytes.
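These capacities follow directly from the fan-out. A short check, taking the fan-out of 133 from the slide as given (the function names are illustrative):

```python
FAN_OUT = 133  # average fan-out quoted on the slide


def max_records(height):
    """Records reachable through a tree of the given height."""
    return FAN_OUT ** height


def pages_at_level(level):
    """Number of pages at a level, counting the root as level 1."""
    return FAN_OUT ** (level - 1)


capacities = {h: max_records(h) for h in (3, 4)}
top_levels = [pages_at_level(level) for level in (1, 2, 3)]
```

Each extra level multiplies capacity by the fan-out, which is why trees over hundreds of millions of records are only about four levels deep.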

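Insertion in a B+ Tree. Insert (K, P): find the leaf where K belongs and insert. If there is no overflow (2d keys or fewer), halt. If there is overflow (2d+1 keys), split the node and insert the middle key into the parent. Shown for d = 2: the node K1 K2 K3 K4 K5 with pointers P0 .. P5 becomes a left node K1 K2 (P0 P1 P2) and a right node K4 K5 (P3 P4 P5), and (K3, pointer to the right node) goes to the parent. If the node is a leaf, keep K3 in the right node too. When the root splits, the new root has only 1 key; that is why the root is a special case for the degree requirement.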
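A minimal sketch of the split rule, working on plain sorted key lists rather than real nodes. The function names and the list representation are assumptions, and pointers are omitted. The leaf version copies the middle key up and keeps it in the right node; the internal version moves it up.

```python
from bisect import insort


def insert_into_leaf(keys, k, d):
    """Insert k into a sorted leaf.

    Returns (left, right, separator). When no split is needed,
    right and separator are None and left is the updated leaf.
    """
    keys = list(keys)
    insort(keys, k)
    if len(keys) <= 2 * d:
        return keys, None, None  # no overflow
    left, right = keys[:d], keys[d:]
    # Leaf split: the separator is copied up and stays in the right node.
    return left, right, right[0]


def split_internal(keys, d):
    """Split an overflowing internal node holding 2d+1 sorted keys.

    The middle key moves up and is kept in neither half.
    """
    return keys[:d], keys[d], keys[d + 1:]
```

For example, `insert_into_leaf([20, 30, 40, 50], 25, d=2)` splits into `[20, 25]` and `[30, 40, 50]` with separator 30.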

Insertion in a B+ Tree: Insert K=19. [Figure: root 80; internal nodes 20 60 and 100 120 140; leaves 10 15 18, 20 30 40 50, 60 65, 80 85 90.]

Insertion in a B+ Tree: After Insertion. [Figure: the first leaf is now 10 15 18 19; no overflow.]

Insertion in a B+ Tree: Now Insert K=25. [Figure: 25 belongs in the leaf 20 30 40 50.]

Insertion in a B+ Tree: After Insertion. [Figure: that leaf now holds 20 25 30 40 50, which is 2d+1 = 5 keys.]

Insertion in a B+ Tree: Now Split. [Figure: the overflowing leaf 20 25 30 40 50 is about to be split.]

Insertion in a B+ Tree: After the Split. [Figure: the leaf has become 20 25 and 30 40 50; key 30 was added to the parent, which is now 20 30 60.]

Deletion from a B+ Tree: Delete 30. [Figure: root 80; internal nodes 20 30 60 and 100 120 140; leaves 10 15 18 19, 20 25, 30 40 50, 60 65, 80 85 90.]

Deletion from a B+ Tree: After Deleting 30. The parent key 30 may change to 40, or not. [Figure: the third leaf is now 40 50.]

Deletion from a B+ Tree: Delete 25. [Figure: 25 is in the leaf 20 25.]

Deletion from a B+ Tree: After deleting 25, need to rebalance: rotate. [Figure: the leaf holds only 20, fewer than d = 2 keys; its left sibling 10 15 18 19 can spare a key.]

Deletion from a B+ Tree: Now Delete 40. [Figure: after the rotation the leaves are 10 15 18 and 19 20, and the parent is 19 30 60; 40 is in the leaf 40 50.]

Deletion from a B+ Tree: After deleting 40, rotation is not possible. Need to merge nodes. [Figure: the leaf holds only 50, and both neighbors, 19 20 and 60 65, have just d keys.]

Deletion from a B+ Tree: Final Tree. [Figure: root 80; internal nodes 19 60 and 100 120 140; leaves 10 15 18, 19 20 50, 60 65, 80 85 90.]
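The rotate-or-merge decision in this walkthrough can be sketched on a single parent and its leaves. This is a simplified model under stated assumptions: leaves are plain sorted lists, `seps` holds the parent's keys, the function name `delete_key` is illustrative, and underflow in the parent itself is not propagated upward.

```python
from bisect import bisect_right


def delete_key(leaves, seps, k, d):
    """Delete k from the leaves under one parent, modifying both lists in place.

    leaves: sorted key lists, one per child of the parent
    seps:   the parent's keys, with len(seps) == len(leaves) - 1
    """
    i = bisect_right(seps, k)  # index of the leaf that would hold k
    leaf = leaves[i]
    if k not in leaf:
        return
    leaf.remove(k)
    if len(leaf) >= d or len(leaves) == 1:
        return  # no underflow; the parent key may stay as it is
    if i > 0 and len(leaves[i - 1]) > d:
        # Rotate: borrow the largest key of the left sibling.
        leaf.insert(0, leaves[i - 1].pop())
        seps[i - 1] = leaf[0]
    elif i + 1 < len(leaves) and len(leaves[i + 1]) > d:
        # Rotate: borrow the smallest key of the right sibling.
        leaf.append(leaves[i + 1].pop(0))
        seps[i] = leaves[i + 1][0]
    elif i > 0:
        # Merge this leaf into its left sibling and drop the separator.
        leaves[i - 1].extend(leaf)
        del leaves[i]
        del seps[i - 1]
    else:
        # Leftmost leaf: merge the right sibling into it.
        leaf.extend(leaves[i + 1])
        del leaves[i + 1]
        del seps[i]


# The left subtree from the walkthrough, with d = 2.
leaves = [[10, 15, 18, 19], [20, 25], [30, 40, 50], [60, 65]]
seps = [20, 30, 60]
for key in (30, 25, 40):
    delete_key(leaves, seps, key, d=2)
```

Deleting 30 needs no rebalancing, deleting 25 triggers a rotation from the left sibling, and deleting 40 forces a merge because neither neighbor can spare a key.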

B Tree. Idea: avoid duplicate keys by having record pointers in non-leaf nodes. [Figure: node with keys K1, K2, K3; each key K_i carries a pointer P_i to the record with that key, and the child pointers lead to keys < K1, K1 < x < K2, K2 < x < K3, and > K3.]

B-Tree Example (d = 2). Sequence pointers are not useful now! [Figure: root 65 125; internal nodes 25 45, 85 105, 145 165; leaves 10 20, 30 40, 50 60, 70 80, 90 100, 110 120, 130 140, 150 160, 170 180.]

Hash Tables Recall basics There are n buckets A hash function f(k) maps a key k to {0, 1, …, n-1} Store in bucket f(k) a pointer to record with key k Secondary storage: bucket = block, use overflow blocks when needed

Hash Table Example. Assume 1 bucket (block) stores 2 keys + pointers. h(e)=0, h(b)=h(f)=1, h(g)=2, h(a)=h(c)=3. [Figure: bucket 0 holds e; bucket 1 holds b, f; bucket 2 holds g; bucket 3 holds a, c.]

Searching in a Hash Table. Search for a: compute h(a)=3, read bucket 3. 1 disk access.

Insertion in Hash Table. Place in the right bucket, if there is space. E.g., h(d)=2. [Figure: bucket 2 now holds g, d.]

Insertion in Hash Table. Create an overflow block, if there is no space. E.g., h(k)=1. More overflow blocks may be needed. [Figure: bucket 1 holds b, f and has an overflow block holding k.]
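A small sketch of this scheme, assuming blocks of capacity 2 and using the hash values from the example as a lookup table. The class name `StaticHashTable` and the list-of-blocks representation are illustrative. `search` also reports how many blocks it had to read, which is what grows when overflow chains get long.

```python
BLOCK_CAPACITY = 2  # keys per block, as in the example


class StaticHashTable:
    def __init__(self, n_buckets, hash_fn):
        self.hash_fn = hash_fn
        # Each bucket is a chain of blocks: the primary block, then overflow blocks.
        self.buckets = [[[]] for _ in range(n_buckets)]

    def insert(self, key):
        chain = self.buckets[self.hash_fn(key)]
        if len(chain[-1]) == BLOCK_CAPACITY:
            chain.append([])  # create an overflow block
        chain[-1].append(key)

    def search(self, key):
        """Return (found, blocks_read)."""
        chain = self.buckets[self.hash_fn(key)]
        for blocks_read, block in enumerate(chain, start=1):
            if key in block:
                return True, blocks_read
        return False, len(chain)


# Hash values taken from the example slides.
h = {"e": 0, "b": 1, "f": 1, "g": 2, "a": 3, "c": 3, "d": 2, "k": 1}
table = StaticHashTable(4, h.__getitem__)
for key in "ebfgacdk":
    table.insert(key)
```

Here `table.search("a")` needs one block read, while `table.search("k")` needs two because k sits in an overflow block.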

Hash Table Performance Excellent, if no overflow blocks Degrades considerably when number of keys exceeds the number of buckets (i.e. many overflow blocks) Other problems: Memory requirement Dynamic maintenance Equality queries only!

Extensible Hash Table. Allows the hash table to grow, to avoid performance degradation. Assume a hash function h that returns numbers in {0, …, 2^k - 1}. Start with n = 2^i << 2^k (n is the size of the hash table) and only look at the first i most significant bits. E.g., i=1, n=2, k=4. [Figure: directory with i=1; entry 0 points to a block holding 0(010), entry 1 points to a block holding 1(011); each block is labeled 1, the number of bits it uses.]

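Insertion in Extensible Hash Table. Insert 1110. [Figure: the block for entry 1 now holds 1(011), 1(110).]

Insertion in Extensible Hash Table. Now insert 1010. Need to extend the table and split blocks: i becomes 2, so n=4. [Figure: the block for entry 1 would have to hold 1(011), 1(110), 1(010), which exceeds its capacity.]

Insertion in Extensible Hash Table. After inserting 1010 and doubling the hash table, i=2. [Figure: entries 00 and 01 both point to the block holding 0(010), labeled 1; entry 10 points to a block holding 10(11), 10(10), labeled 2; entry 11 points to a block holding 11(10), labeled 2.]

Insertion in Extensible Hash Table. Now insert 0000, then 0101. Need to split a block. [Figure: i=2; the block shared by entries 00 and 01 holds 0(010), 0(000) and cannot take 0(101); the other blocks are unchanged.]

Insertion in Extensible Hash Table. After splitting the block. [Figure: i=2; entry 00 points to 00(10), 00(00), labeled 2; entry 01 points to 01(01), labeled 2; entry 10 points to 10(11), 10(10), labeled 2; entry 11 points to 11(10), labeled 2.]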
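The insertion sequence above can be replayed with a compact sketch. Assumptions: hash values are given directly as 4-bit strings, blocks hold 2 keys, and the names `Bucket` and `ExtensibleHash` are illustrative. The sketch does not handle more equal hash values than one block can hold.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth  # number of leading bits this block is known by
        self.keys = []


class ExtensibleHash:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.i = 1  # number of leading bits used by the directory
        self.directory = [Bucket(1), Bucket(1)]

    def _slot(self, key):
        return int(key[: self.i], 2)  # first i most significant bits

    def insert(self, key):
        bucket = self.directory[self._slot(key)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(key)
            return
        if bucket.depth == self.i:
            # Double the directory: every old entry becomes two adjacent entries.
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.i += 1
        # Split the full block on its next bit.
        bucket.depth += 1
        new_bucket = Bucket(bucket.depth)
        old_keys, bucket.keys = bucket.keys, []
        for slot, b in enumerate(self.directory):
            if b is bucket and (slot >> (self.i - bucket.depth)) & 1:
                self.directory[slot] = new_bucket
        # Redistribute; this recurses if one block is still too full.
        for k in old_keys + [key]:
            self.insert(k)


table = ExtensibleHash()
for key in ["0010", "1011", "1110", "1010", "0000", "0101"]:
    table.insert(key)
layout = [b.keys for b in table.directory]
```

Inserting 1010 doubles the directory because the full block already uses all i bits, whereas inserting 0101 only splits a block, since that block was still shared by two directory entries.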

Performance Extensible Hash Table No overflow blocks: access always one read BUT: Extensions can be costly and disruptive After an extension table may no longer fit in memory

Linear Hash Table. Idea: extend only one entry at a time. Problem: n is no longer a power of 2. Let i be the number of bits necessary to address n buckets: 2^(i-1) < n <= 2^i. After computing h(k), use the last i bits. If the last i bits represent a number >= n, change the msb from 1 to 0 (to get a number < n).
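The addressing rule can be written as a short function. The name `linear_hash_bucket` is illustrative, and the hash value is passed in as an integer.

```python
def linear_hash_bucket(h, n):
    """Map hash value h to a bucket number in range(n)."""
    i = (n - 1).bit_length()  # smallest i with 2**(i-1) < n <= 2**i
    m = h & ((1 << i) - 1)    # keep the last i bits
    if m >= n:
        m -= 1 << (i - 1)     # flip the most significant of those bits to 0
    return m


# With n = 3 buckets (i = 2), a hash ending in 11 is redirected to bucket 01.
example = linear_hash_bucket(0b0111, 3)
```

Flipping the top bit is safe because any value that is at least n must have that bit set, and clearing it yields a number below 2^(i-1), which is below n.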

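Linear Hash Table Example (n=3, i=2). Insert (01)11. Bit flip: 11 → 01, because 11 is not a valid bucket number. [Figure: bucket 00 holds (01)00, (11)00; bucket 01 holds (01)11; bucket 10 holds (10)10.]

Linear Hash Table Example. Insert 1000: overflow blocks… [Figure: bucket 00 holds (01)00, (11)00 and has an overflow block holding (10)00; bucket 01 holds (01)11; bucket 10 holds (10)10.]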

Linear Hash Tables. Extension: independent of overflow blocks. Extend n := n+1 when the average number of records per block exceeds (say) 80% of the block capacity.

Linear Hash Table Extension. From n=3 to n=4. Only need to touch one block (which one?). [Figure: before, with i=2: bucket 00 holds (01)00, (11)00; bucket 01 holds (01)11; bucket 10 holds (10)10. After: (01)11 has moved from bucket 01 to the new bucket 11.]

Linear Hash Table Extension. From n=3 to n=4 finished. Extension from n=4 to n=5 (new bit): need to touch every single block (why?). Need to look at the last 3 bits, which affects all keys. [Figure: i=2; bucket 00 holds (01)00, (11)00; bucket 01 is empty; bucket 10 holds (10)10; bucket 11 holds (01)11.]