Indexing. 421: Database Systems - Index Structures 2 Cost Model for Data Access q Data should be stored such that it can be accessed fast q Evaluation.

Slides:



Advertisements
Similar presentations
B+-Trees and Hashing Techniques for Storage and Index Structures
Advertisements

Database Management Systems, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Chapter 9.
1 Tree-Structured Indexes Module 4, Lecture 4. 2 Introduction As for any index, 3 alternatives for data entries k* : 1. Data record with key value k 2.
ICS 421 Spring 2010 Indexing (1) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 02/18/20101Lipyeow Lim.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8 “How index-learning turns no student pale Yet.
1 Overview of Storage and Indexing Chapter 8 (part 1)
1 File Organizations and Indexing Module 4, Lecture 2 “How index-learning turns no student pale Yet holds the eel of science by the tail.” -- Alexander.
Tree-Structured Indexes Lecture 5 R & G Chapter 9 “If I had eight hours to chop down a tree, I'd spend six sharpening my ax.” Abraham Lincoln.
Tree-Structured Indexes. Introduction v As for any index, 3 alternatives for data entries k* : À Data record with key value k Á Â v Choice is orthogonal.
1 Tree-Structured Indexes Yanlei Diao UMass Amherst Feb 20, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
1 Overview of Storage and Indexing Yanlei Diao UMass Amherst Feb 13, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8 “How index-learning turns no student pale Yet.
1 Lecture 20: Indexes Friday, February 25, Outline Representing data elements (12) Index structures (13.1, 13.2) B-trees (13.3)
1 Tree-Structured Indexes Chapter Introduction  As for any index, 3 alternatives for data entries k* :  Data record with key value k   Choice.
1 Overview of Storage and Indexing Chapter 8 1. Basics about file management 2. Introduction to indexing 3. First glimpse at indices and workloads.
1 B+ Trees. 2 Tree-Structured Indices v Tree-structured indexing techniques support both range searches and equality searches. v ISAM : static structure;
DBMS Internals: Storage February 27th, Representing Data Elements Relational database elements: A tuple is represented as a record CREATE TABLE.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Chapter 9.
Tree-Structured Indexes. Range Searches ``Find all students with gpa > 3.0’’ –If data is in sorted file, do binary search to find first such student,
Storage and Indexing February 26 th, 2003 Lecture 19.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.
 B+ Tree Definition  B+ Tree Properties  B+ Tree Searching  B+ Tree Insertion  B+ Tree Deletion.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8 “How index-learning turns no student pale Yet holds.
1 Overview of Storage and Indexing Chapter 8 (part 1)
Storage and Indexing1 Overview of Storage and Indexing.
1 Overview of Storage and Indexing Chapter 8 “How index-learning turns no student pale Yet holds the eel of science by the tail.” -- Alexander Pope ( )
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.
1 Overview of Storage and Indexing Chapter 8. 2 Data on External Storage  Disks: Can retrieve random page at fixed cost  But reading several consecutive.
Overview of Storage and Indexing Content based on Chapter 4 Database Management Systems, (Third Edition), by Raghu Ramakrishnan and Johannes Gehrke. McGraw.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8 “How index-learning turns no student pale Yet.
Tree-Structured Indexes Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY courtesy of Joe Hellerstein for some slides.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Chapter 9.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 B+-Tree Index Chapter 10 Modified by Donghui Zhang Nov 9, 2005.
Storage and Indexing. How do we store efficiently large amounts of data? The appropriate storage depends on what kind of accesses we expect to have to.
Data on External Storage – File Organization and Indexing – Cluster Indexes - Primary and Secondary Indexes – Index data Structures – Hash Based Indexing.
1 Tree-Structured Indexes Chapter Introduction  As for any index, 3 alternatives for data entries k* :  Data record with key value k   Choice.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Content based on Chapter 10 Database Management Systems, (3 rd.
I/O Cost Model, Tree Indexes CS634 Lecture 5, Feb 12, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.
Tree-Structured Indexes Chapter 10
Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8 Jianping Fan Dept of Computer Science UNC-Charlotte.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Chapter 10.
1 Overview of Storage and Indexing Chapter 8. 2 Review: Architecture of a DBMS  A typical DBMS has a layered architecture.  The figure does not show.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.
Tree-Structured Indexes. Introduction As for any index, 3 alternatives for data entries k*: – Data record with key value k –  Choice is orthogonal to.
Tree-Structured Indexes
COP Introduction to Database Structures
Indexing ? Why ? Need to locate the actual records on disk without having to read the entire table into memory.
CS222P: Principles of Data Management Notes #6 Index Overview and ISAM Tree Index Instructor: Chen Li.
Lecture 12 Lecture 12: Indexing.
B+-Trees and Static Hashing
File Organizations and Indexing
Tree-Structured Indexes
CS222/CS122C: Principles of Data Management Notes #07 B+ Trees
Tree-Structured Indexes
B+Trees The slides for this text are organized into chapters. This lecture covers Chapter 9. Chapter 1: Introduction to Database Systems Chapter 2: The.
Tree-Structured Indexes
Storage and Indexing May 17th, 2002.
Indexing 1.
CS222/CS122C: Principles of Data Management Notes #6 Index Overview and ISAM Tree Index Instructor: Chen Li.
Storage and Indexing.
General External Merge Sort
Tree-Structured Indexes
Indexing February 28th, 2003 Lecture 20.
Tree-Structured Indexes
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #05 Index Overview and ISAM Tree Index Instructor: Chen Li.
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #06 B+ trees Instructor: Chen Li.
CS222P: Principles of Data Management UCI, Fall Notes #06 B+ trees
Presentation transcript:

Indexing

421: Database Systems - Index Structures 2 Cost Model for Data Access q Data should be stored such that it can be accessed fast q Evaluation of Access Methods based on measuring the number of page I/O’s I disk access in general more costly than CPU costs I CPU costs considered to be negligible in comparison with I/O I Analysis ignores gains of pre-fetching blocks of pages; thus, even I/O cost is only approximated. I Average-case analysis; based on several simplistic assumptions.  Good enough to show the overall trends!

421: Database Systems - Index Structures 3 Typical Operations q Scan over all records I SELECT * FROM Employee q Equality Search I SELECT * FROM Employee WHERE eid = 100 q Range Search I SELECT * FROM Employee WHERE age > 30 and age <= 50 q Insert  INSERT INTO Employee VALUES (23, ‘lilly’, 37) q Delete I DELETE FROM Employee WHERE eid = 100 I DELETE FROM Employee WHERE age >30 AND age <= 50 q Update I Delete+insert

421: Database Systems - Index Structures 4 Alternative File Organizations Various alternative file organizations exist; each is ideal for some operations, not so good in others: q Heap files: I Linked, unordered list of all pages of the file (e.g., per relation) I Suitable when typical access is a file scan retrieving all records. I Costs for equality search (read on avg. half the pages) and range search (read all pages) is high. I Cost for insert low (insert anywhere) I Cost for delete/update is cost of executing WHERE clause q Sorted Files: I Records are ordered according to one or more attributes of the relation I Outperforms heap files for equality and range queries on the ordering attribute (find first qualifying page with binary search in log2(number-of-pages) I Also good for ordered output I High insert and delete/update costs

421: Database Systems - Index Structures 5 Indexes q Even sorted file only support queries on sorted attributes. q In order to speed up selections on any collection of attributes, we can build an index for a relation over this collection. I Additional information that helps finding specific tuples faster I We call the collection of attributes over which the index is built the search key attributes for the index. I Any subset of the attributes of a relation can be the search key for an index on the relation. I Search key is not the same as primary key / key candidate (minimal set of attributes that uniquely identify a record in a relation).

421: Database Systems - Index Structures 6 B+ Tree: The Most Widely Used Index q Each node/leaf represents one page I Since the page is the transfer unit to disk q Leafs contain data entries (denoted as k*) I For now, assume each data entry represents one tuple. The data entry consists of two parts l Value of the search key l Record identifier (rid = (page-id, slot)) q Root and inner nodes have auxiliary index entries Root * 3*5* 7*14*16* 19*20*22*24*27* 29*33*34* 38* 39* 13 Index for Sailors On attribute sid

421: Database Systems - Index Structures 7 B+ Tree (contd.) q keep tree height-balanced. I Each path from root to tree has the same height q F = fanout = number of children for each node (~ number of index entries stored in node) q N = # leaf pages q Insert/delete at log F N cost; q Minimum 50% occupancy (except for root). I Each node contains d <= m <= 2d entries. I The parameter d is called the order of the tree. q Supports equality and range-searches efficiently. Index Entries Data Entries

421: Database Systems - Index Structures 8 Example B+ Tree q Example tree has height 2  Assume “ Select * from Emp where eid = 5 ” q Search begins at root, and key comparisons direct it to a leaf q Search for 5*, 15*, all data entries >= 24*... q Good for equality search AND range queries Root * 3*5* 7*14*16* 19*20*22*24*27* 29*33*34* 38* 39* 13

421: Database Systems - Index Structures 9 Inserting a Data Entry q Find correct leaf L. q Put data entry onto L. I If L has enough space, done! I Else, must split L (into L and a new node L2) l Redistribute entries evenly, copy up middle key. l Insert index entry pointing to L2 into parent of L. q This can happen recursively I To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.) q Splits “grow” tree; root split increases height. I Tree growth: gets wider or one level taller at top.

421: Database Systems - Index Structures 10 Inserting 8* into Example B+ Tree 2* 3* 5* 7* Assume that inner pages Can only contain 4 index entries 2* 3*5* 7* Insert into Leaf with leaf split Insert into internal node with node split 8*

421: Database Systems - Index Structures 11 Example: After Inserting 8*  Notice that root was split, leading to increase in height.  In this example, we can avoid split by redistributing entries; however, this is usually not done in practice. 2*3* Root *16* 19*20*22*24*27* 29*33*34* 38* 39* 135 7*5*8*

421: Database Systems - Index Structures 12 Data Entry k* in Index q An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k. q Three alternatives: I (1) Data tuple with key value k (direct indexing) I (2) (indirect indexing) I (3) (indirect indexing) q Choice of alternative for data entries is orthogonal to the indexing technique (B-tree, hashing etc.)

421: Database Systems - Index Structures 13 Alternatives for Data Entries q Alternative 1: (direct indexing) I If this is used, index structure is a file organization for data records (like sorted files). I At most one direct index on a given collection of data records. (Otherwise, data records duplicated, leading to redundant storage and potential inconsistency.) I If data records very large, # of pages containing data entries is high. Implies size of auxiliary information in the index is also large, typically. I NOTE: FOR THE REST OF THIS COURSE WE WILL NOT CONSIDER DIRECT INDEXING ANYMORE Index entries Data entries = data records

421: Database Systems - Index Structures 14 Alternatives for Data Entries (Contd.) q Alternatives 2 and 3: (indirect indexing) I Data entries typically much smaller than data records. So, better than direct indexing with large data records, especially if search keys are small. I If more than one index is required on a given file, at most one direct index; rest must use indirect indexing I Alternative 3 more compact than Alternative 2, but leads to variable sized data entries even if search keys are of fixed length. Index entries Data entries Data records

421: Database Systems - Index Structures 15 Index Classification q Primary vs. secondary: If search key contains primary key, then called primary index. I Unique index: Search key contains a candidate key. q Clustered vs. unclustered: If order of data records is the same as, or `close to’, order of data entries, then called clustered index. I Clustered index can be of alternatives 1, 2 and 3 ! I A file can be clustered on at most one search key. I Cost of retrieving data records through index varies greatly based on whether index is clustered or not!

421: Database Systems - Index Structures 16 Clustered vs. Unclustered Index q Assume data itself (the real tuples) is stored in a Heap file. I To build clustered index, first sort the Heap file (with some free space on each page for future inserts). I Overflow pages may be needed for inserts. (Thus, order of data records is `close to’, but not identical to, the sort order.) Index entries Data entries (Data file) Data Records CLUSTERED UNCLUSTERED

421: Database Systems - Index Structures 17 B+-tree cost example I Relation R(A,B,C,D,E,F) I A and B are int (each 6 Bytes), C-F is char[40] (160 Bytes) l Size of tuple: 172 Bytes I 200,000 tuples I Each data page has 4 K and is around 80% full l 200,000*172/(0.8*4000) = pages I Values of B are within [0;19999] uniform distribution I Non-clustered B-tree for attribute B, alternative (2) I An index page has 4K and intermediate pages are filled between 50% - 100% I The size of an rid = 10 Bytes I The size of a pointer in intermediate pages: 8 Bytes I Index entry in root and intermediate pages: size(key)+size(pointer) = 6 Bytes + 8 Bytes = 14 Bytes

421: Database Systems - Index Structures 18 Size of B+tree I The average number of rids per data entry l Number of tuples / different values (if uniform) (Example 200,000/20,000 = 10) I The average length per data entry: l Key value + #rids * size of rid (Example: *10 = 106) I The average number of data entries per leaf page: l Fill-rate * page-size / length of data entry l Example: 0.75*4000 / 106 = 28 entries per page I The estimated number of leaf pages: l Number of entries = number of different values / #entries per page l Example / 28 = 715 I Number of entries intermediate page: l Fill-rate * page-size /length of index entry l Min fill-rate: 0.5, max fill rate: 1 l Example: 0.5 * 4000 / 14 = 143 entries ; 1* 4000/14 = 285 entries I Height is 3: the root has between three and four children l Three children: each child has around 715/3 = 238 entries l Four children: each child has around 715/4 = 179 entries

421: Database Systems - Index Structures 19 B+ Trees in Practice q Typical order d of inner nodes: 100 (I.e., an inner node has between 100 and 200 index entries) I Typical fill-factor: 67%. I average fanout = 133 q Leaf nodes have often less entries since data entries larger (rids) q Typical capacities (order of inner nodes 100, leaf with 100 rids): I Height 4: = 312,900,721 records I Height 3: = 2,352,637 records I Height 2: = 17,680 records q Can often hold top levels in buffer pool: I Level 1 (root) = 1 page = 4 Kbytes I Level 2 = 133 pages = 0.5 Mbyte I Level 3 = 17,689 pages = 70 MBytes

421: Database Systems - Index Structures 20 Index in DB2 q Simple I Create index ind1 on Sailors(sid); I drop index ind1;  Index also good for referential integrity (uniqueness) I Create unique index ind1 on Sailors(name) q Additional attributes I Create unique index ind1 on Sailors(sid) include (name) I Index only on sid I Data entry contains key value (sid) + name + rid  SELECT name FROM Sailors WHERE sid = 100 l Can be answered without accessing Sailors relation! q Clustered index I Create index ind1 on Sailors(sid) cluster

421: Database Systems - Index Structures 21 Index in DB2 q Index on multiple attributes: I Create index ind1 on Sailors(Age,Rating); I Order is important: l Here data entries are first ordered by age Sailors with the same age are then ordered by rating I Supports: l SELECT * FROM Sailors WHERE age = 20; l SELECT * FROM Sailors WHERE age = 20 AND rating < 5; I Does not support l SELECT * FROM Sailors WHERE rating < 5;

421: Database Systems - Index Structures 22

421: Database Systems - Index Structures 23 Summary for B+-trees q Tree-structured indexes are ideal for range- searches, also good for equality searches. I High fanout (F) means depth rarely more than 3 or 4. I Almost always better than maintaining a sorted file. q Can have several indices on same tables (over different attributes) q Most widely used index in database management systems because. One of the most optimized components of a DBMS.

421: Database Systems - Index Structures 24 File Organizations q Hashed Files:. I File is a collection of buckets. Bucket = primary page plus zero or more overflow pages. I Hashing function h: h(r) = bucket in which record r belongs. h looks at only some of the fields of r, called the search fields. I Best for equality search (only one page access and maybe access to overflow page) I No advantage for range queries I Fast insert I Cost on delete depends on cost for WHERE clause

421: Database Systems - Index Structures 25 More comments on B+-trees q Corresponding delete operations exist that might merge subtrees q B+-trees for predicate locking q Locking B+-trees I To allow concurrent access to the B-tree, internal locking protocol used (non 2PL -- in case of abort: logical undo!!!) I Special Index Locks used to implement “predicate- locking”