CPSC 404, Laks V.S. Lakshmanan1 Tree-Structured Indexes (Brass Tacks) Chapter 10 Ramakrishnan and Gehrke (Sections 10.3-10.8)

Slides:



Advertisements
Similar presentations
B+-Trees and Hashing Techniques for Storage and Index Structures
Advertisements

Database Management Systems, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Chapter 9.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 18 Indexing Structures for Files.
1 Tree-Structured Indexes Module 4, Lecture 4. 2 Introduction As for any index, 3 alternatives for data entries k* : 1. Data record with key value k 2.
Multilevel Indexing and B+ Trees
ICS 421 Spring 2010 Indexing (1) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 02/18/20101Lipyeow Lim.
CS4432: Database Systems II
B+-tree and Hashing.
Tree-Structured Indexes Lecture 5 R & G Chapter 9 “If I had eight hours to chop down a tree, I'd spend six sharpening my ax.” Abraham Lincoln.
Tree-Structured Indexes Lecture 17 R & G Chapter 9 “If I had eight hours to chop down a tree, I'd spend six sharpening my ax.” Abraham Lincoln.
Tree-Structured Indexes. Introduction v As for any index, 3 alternatives for data entries k* : À Data record with key value k Á Â v Choice is orthogonal.
1 Tree-Structured Indexes Yanlei Diao UMass Amherst Feb 20, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
FALL 2004CENG 351 File Structures1 Multilevel Indexing and B+ Trees Reference: Folk et al. Chapters: 9.1, 9.2, 10.1, 10.2,10.3,10.10.
1 Tree-Structured Indexes Chapter Introduction  As for any index, 3 alternatives for data entries k* :  Data record with key value k   Choice.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Range Searches  `` Find all students with gpa > 3.0 ’’  If data is in sorted file, do.
Tree-Structured Indexes Lecture 5 R & G Chapter 10 “If I had eight hours to chop down a tree, I'd spend six sharpening my ax.” Abraham Lincoln.
1 B+ Trees. 2 Tree-Structured Indices v Tree-structured indexing techniques support both range searches and equality searches. v ISAM : static structure;
CS4432: Database Systems II
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Chapter 9.
Tree-Structured Indexes. Range Searches ``Find all students with gpa > 3.0’’ –If data is in sorted file, do binary search to find first such student,
Introduction to Database Systems1 B+-Trees Storage Technology: Topic 5.
 B+ Tree Definition  B+ Tree Properties  B+ Tree Searching  B+ Tree Insertion  B+ Tree Deletion.
Tree-Structured Indexes Lecture 5 R & G Chapter 10 “If I had eight hours to chop down a tree, I'd spend six sharpening my ax.” Abraham Lincoln.
Adapted from Mike Franklin
1.1 CS220 Database Systems Tree Based Indexing: B+-tree Slides courtesy G. Kollios Boston University via UC Berkeley.
Tree-Structured Indexes Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY courtesy of Joe Hellerstein for some slides.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Tree- and Hash-Structured Indexes Selected Sections of Chapters 10 & 11.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Chapter 9.
1 Indexing. 2 Motivation Sells(bar,beer,price )Bars(bar,addr ) Joe’sBud2.50Joe’sMaple St. Joe’sMiller2.75Sue’sRiver Rd. Sue’sBud2.50 Sue’sCoors3.00 Query:
B+ Tree Index tuning--. overview B + -Tree Scalability Typical order: 100. Typical fill-factor: 67%. –average fanout = 133 Typical capacities (root at.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 B+-Tree Index Chapter 10 Modified by Donghui Zhang Nov 9, 2005.
1 Database Systems ( 資料庫系統 ) November 1, 2004 Lecture #8 By Hao-hua Chu ( 朱浩華 )
Storage and Indexing. How do we store efficiently large amounts of data? The appropriate storage depends on what kind of accesses we expect to have to.
1 Tree-Structured Indexes Chapter Introduction  As for any index, 3 alternatives for data entries k* :  Data record with key value k   Choice.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Content based on Chapter 10 Database Management Systems, (3 rd.
I/O Cost Model, Tree Indexes CS634 Lecture 5, Feb 12, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
Tree-Structured Indexes Chapter 10
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Chapter 10.
Tree-Structured Indexes R & G Chapter 10 “If I had eight hours to chop down a tree, I'd spend six sharpening my ax.” Abraham Lincoln.
Tree-Structured Indexes. Introduction As for any index, 3 alternatives for data entries k*: – Data record with key value k –  Choice is orthogonal to.
Multilevel Indexing and B+ Trees
Multilevel Indexing and B+ Trees
Database Systems (資料庫系統)
Tree-Structured Indexes
Tree-Structured Indexes
COP Introduction to Database Structures
Tree-Structured Indexes (Brass Tacks)
B+-Trees and Static Hashing
Tree-Structured Indexes
CS222/CS122C: Principles of Data Management Notes #07 B+ Trees
Introduction to Database Systems Tree Based Indexing: B+-tree
Tree-Structured Indexes
Tree-Structured Indexes
B+Trees The slides for this text are organized into chapters. This lecture covers Chapter 9. Chapter 1: Introduction to Database Systems Chapter 2: The.
Tree-Structured Indexes
Adapted from Mike Franklin
Tree-Structured Indexes
Indexing 1.
Database Systems (資料庫系統)
Database Systems (資料庫系統)
General External Merge Sort
Tree-Structured Indexes
Tree-Structured Indexes
Indexing February 28th, 2003 Lecture 20.
Tree-Structured Indexes
Tree-Structured Indexes
Tree-Structured Indexes
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #06 B+ trees Instructor: Chen Li.
CS222P: Principles of Data Management UCI, Fall Notes #06 B+ trees
Presentation transcript:

CPSC 404, Laks V.S. Lakshmanan1 Tree-Structured Indexes (Brass Tacks) Chapter 10 Ramakrishnan and Gehrke (Sections )

CPSC 404, Laks V.S. Lakshmanan2 What will I learn from this set of lectures? v How do B+trees work (for search)? v How can I tune B+trees for performance? v How can I maintain its balance against inserts and deletes? v How do I build a B+tree from scratch? v How can I handle key values that are very long – e.g., long song names or long names of people?

CPSC 404, Laks V.S. Lakshmanan3 B+ Tree: The Most Widely Used Index v Insert/delete at log F N cost; keep tree height- balanced. (F = fanout, N = # leaf pages) v Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries. The parameter d is called the order of the tree. v Supports equality and range-searches efficiently. Index Entries Data Entries ("Sequence set") (Direct the search)

CPSC 404, Laks V.S. Lakshmanan4 Example B+ Tree v Search begins at root, and key comparisons direct it to a leaf (as in ISAM) v Binary search within a node v Search for 5*, 15*, all data entries >= 28*... Root * 3*5* 7*14*16* 19*20*22*24*27* 29*33*34* 38* 39* 13 insert

CPSC 404, Laks V.S. Lakshmanan5 B+Trees basics Node NN’s sibling 50,000 The key 50,000 separates or discriminates between N and its sibling. Plays a crucial role in search and update maintenance. Call 50,000 the separator/discriminator of N and its sibling.

CPSC 404, Laks V.S. Lakshmanan6 B+ Trees in Practice v Typical order: 100. Typical fill-factor: ln 2 = 66.5% (approx). –average fanout = 2 x 100 x 66.5% = 133 v Typical capacities: –Height 4: 1334 = 312,900,721 pages. –Height 3: 1333 = 2,352,637 pages v Can often hold top levels in buffer pool: –Level 1 = 1 page = 8 Kbytes –Level 2 = 133 pages = 1 MByte (approx.) –Level 3 = 17,689 pages = 133 MBytes (approx.) –Level 4 = 2,352,637 pages = GBytes (approx.) –Level 5 = 312,900,721 pages = tera bytes! (approx.) v For typical orders (d ~ ), a shallow B+tree can accommodate very large files. How tall a B+tree do we need to cover all of Canada’s taxpayer records?

CPSC 404, Laks V.S. Lakshmanan7 B+ Trees in practice v Suppose a node is implemented as a block, a block is 4 K, a key is 12 bytes, whereas a pointer is 8 bytes. –For a file occupying b (logical) blocks, what is the min/max/avg height of a B+tree index? –If we have m bytes of RAM, how many levels of the B+tree can we prefetch to speed up performance?

CPSC 404, Laks V.S. Lakshmanan8 Inserting a Data Entry into a B+ Tree v Find correct leaf L. v Put data entry onto L. – If L has enough space, done ! – Else, must split L (into L and a new node L2) u Redistribute entries evenly, copy up middle key. u Insert index entry pointing to L2 (i.e., the middle key copied up) into parent of L. v This can happen recursively – To split index node, redistribute entries evenly, but bump up middle key. (Contrast with leaf splits.) v Splits “grow” tree; root split increases height. – Tree growth: gets wider, maybe even one level taller.

CPSC 404, Laks V.S. Lakshmanan9 Inserting 8* into Example B+ TreeExample B+ Tree v Observe how minimum occupancy is guaranteed in both leaf and index pg splits. v Note difference between copy-up and bump-up; Why do we handle leaf page split and index page split differently? 2* 3*5* 7* 8* 5 Entry to be inserted in parent node. (Note that 5 is continues to appear in the leaf.) s copied up and appears once in the index. Contrast Entry to be inserted in parent node. (Note that 17 is bumped up and only this with a leaf split.)

CPSC 404, Laks V.S. Lakshmanan10 Example B+ Tree After Inserting 8* v Notice that root was split, leading to increase in height. v In this example, we can avoid split by re-distributing entries; however, this is usually not done in practice. 2*3* Root *16* 19*20*22*24*27* 29*33*34* 38* 39* 135 7*5*8* delete What would it look like?

CPSC 404, Laks V.S. Lakshmanan11 Deleting a Data Entry from a B+ Tree v Start at root, find leaf L where entry belongs. v Remove the entry. – If L is at least half-full, done! – If L has only d-1 entries, u Try to re-distribute, borrowing from sibling (adjacent node with same parent as L). Adjust the key that separates L and its sibling. u If re-distribution fails, merge L and sibling. v If merge occurred, must delete separator entry (discriminating L & its sibling) from parent of L. v Merge could propagate to root, decreasing height. Question: Can L’s occupancy ever drop below d-1?

CPSC 404, Laks V.S. Lakshmanan12 Example Tree After (Inserting 8* Then) Deleting 19* and 20*...Inserting 8* v Deleting 19* is easy. v Deleting 20* is done with re-distribution. Notice how new middle key is copied up. 2*3* Root *16* 33*34* 38* 39* 135 7*5*8*22*24* 27 27*29*

CPSC 404, Laks V.S. Lakshmanan13... And Then Deleting 24* v Must merge. v Observe `toss’ of index entry (on right), and `pull down’ of index entry (below) *27* 29*33*34* 38* 39* 2* 3* 7* 14*16* 22* 27* 29* 33*34* 38* 39* 5*8* Root

CPSC 404, Laks V.S. Lakshmanan14 Example of Non-leaf Redistribution v Tree is shown below during deletion of 24*. (What could be a possible initial tree?) v In contrast to previous example, can re- distribute entry from left child of root to right child. Root *16* 17*18* 20*33*34* 38* 39* 22*27*29*21* 7*5*8* 3*2*

CPSC 404, Laks V.S. Lakshmanan15 After Re-distribution v Intuitively, entries are re-distributed by “pushing through” the splitting entry in the parent node. v It suffices to re-distribute index entry with key 20; we’ve re-distributed 17 as well for illustration.

CPSC 404, Laks V.S. Lakshmanan16 Optimization 1: Prefix Key Compression v Important to increase fan-out. (Why?) v Key values in index entries only `direct traffic’; can often compress them. – E.g., If we have adjacent index entries with search key values Dannon Yogurt, David Smith and Devarakonda Murthy, we can abbreviate David Smith to Dav. (The other keys can be compressed too...) u Is this correct? Not quite! What if there is a data entry Davey Jones? (Can only compress David Smith to Davi) u In general, while compressing, must leave each index entry greater than every key value (in any subtree) to its left. v Insert/delete must be suitably modified. v This idea works for any field/attribute whose data type is a long string: e.g., customer code, shipping tracking number, etc.

CPSC 404, Laks V.S. Lakshmanan17 Analyzing insert/delete time. v Given a B+tree index with height h, what can you say about best case, average case, and worst case I/O? v It must have something to do with h. v But is is exactly h? v What exactly constitutes best, averege, and worst case?

CPSC 404, Laks V.S. Lakshmanan18 Optimization 2: Bulk Loading of a B+ Tree v If we have a large collection of records, and we want to create a B+ tree on some field, doing so by repeatedly inserting records is very slow. v Bulk Loading can be done much more efficiently. v Initialization: Sort all data entries, insert pointer to first (leaf) page in a new (root) page. 3* 4* 6*9*10*11*12*13* 20*22* 23*31* 35* 36*38*41*44* Root Sorted pages of data entries; not yet in B+ tree

CPSC 404, Laks V.S. Lakshmanan19 Bulk Loading (contd.) 3* 4* 6*9*10*11*12*13* 20*22* 23*31* 35* 36*38*41*44* Root 6

CPSC 404, Laks V.S. Lakshmanan20 Bulk Loading (contd.) 3* 4* 6*9*10*11*12*13* 20*22* 23*31* 35* 36*38*41*44* Root 610

CPSC 404, Laks V.S. Lakshmanan21 Bulk Loading (contd.) 3* 4* 6*9*10*11*12*13* 20*22* 23*31* 35* 36*38*41*44*

CPSC 404, Laks V.S. Lakshmanan22 Bulk Loading (contd.) 3* 4* 6*9*10*11*12*13* 20*22* 23*31* 35* 36*38*41*44* *

CPSC 404, Laks V.S. Lakshmanan23 Bulk Loading (contd.) 3* 4* 6*9*10*11*12*13* 20*22* 23*31* 35* 36*38*41*44* *

CPSC 404, Laks V.S. Lakshmanan24 Bulk Loading (Contd.) v Index entries for leaf pages always entered into right-most index page just above leaf level. When this fills up, it splits. (Split may go up right-most path to the root.) v Much faster than repeated inserts, especially when one considers locking! 3* 4* 6*9*10*11*12*13* 20*22* 23*31* 35* 36*38*41*44* Root Data entry pages not yet in B+ tree * 4* 6*9*10*11*12*13* 20*22* 23*31* 35* 36*38*41*44* 6 Root not yet in B+ tree Data entry pages

CPSC 404, Laks V.S. Lakshmanan25 Bulk Loading v How would you estimate time taken to build a B+tree index from scratch? –Naïve approach of repeated insertions. –Bulk loading. v Naïve approach: each insert is potentially a random block seek; cannot read/insert next data block until after previous block has been inserted. Very slow. v Bulk loading: dominant factor – sorting data (or data entries as appropriate). Followed by another sequential read (w/ proper buffer management). –Explored in exercises.

CPSC 404, Laks V.S. Lakshmanan26 Summary of Bulk Loading v Option 1: multiple inserts. – Slow. – Does not ensure sequential storage of leaves. v Option 2: bulk loading – Has advantages for concurrency control. – Fewer I/Os during build. – Leaves will be stored sequentially (and linked, of course). – Can control “fill factor” on pages.

CPSC 404, Laks V.S. Lakshmanan27 Summary v Tree-structured indexes are ideal for range-searches, also good for equality searches. v ISAM is a static structure. – Only leaf pages modified; overflow pages needed. – Overflow chains can degrade performance unless size of data set and data distribution stay constant. v B+ tree is a dynamic structure. – Inserts/deletes leave tree height-balanced; log F N cost. – High fanout (F) means depth (i.e., height) rarely more than 3 or 4. – Almost always better than simply maintaining a sorted file.

CPSC 404, Laks V.S. Lakshmanan28 Summary (Contd.) v Key compression increases fanout, reduces height. v Bulk loading can be much faster than repeated inserts for creating a B+ tree on a large data set. v Most widely used index in database management systems because of its versatility. One of the most optimized components of a DBMS.