CS 405G: Introduction to Database Systems
22. Index
Chen Qian, University of Kentucky

12/4/2015, Chen Qian, University of Kentucky

Review
The unit of disk read and write is a block (also called a page).
Disk access time is composed of seek time, rotational delay, and data transfer time.

Review
A row in a table, when stored on disk, is called a record.
Records come in two types: fixed-length and variable-length.

Review
In an abstract sense, a file is a set of "records" on a disk.
In reality, a file is a set of disk pages, and each record lives on a page.
Physical record ID (RID): a tuple of …

Today's Topic
How to locate data in a file fast?
Introduction to indexing
Tree-based indexes
  ISAM: Indexed Sequential Access Method
  B+-tree

Basics
Given a value, locate the record(s) with this value:
  SELECT * FROM R WHERE A = value;
  SELECT * FROM R, S WHERE R.A = S.B;
Other search criteria, e.g.:
  Range search: SELECT * FROM R WHERE A > value;
  Keyword search, e.g. "database indexing"

Dense and sparse indexes
Dense: one index entry for each search key value.
Sparse: one index entry for each page. Records must be clustered according to the search key.
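To make the distinction concrete, here is a minimal Python sketch. It assumes a file modeled as a list of pages whose keys are sorted across pages; the names `pages`, `build_sparse_index`, `build_dense_index`, and `sparse_lookup` and the key values are illustrative and not from the slides.

```python
from bisect import bisect_right

# Hypothetical clustered file: each inner list is one page, keys sorted across pages.
pages = [[100, 108, 119], [121, 123, 129], [192, 197, 199], [200, 202, 210]]

def build_sparse_index(pages):
    """One index entry per page: the smallest key stored on that page."""
    return [page[0] for page in pages]

def build_dense_index(pages):
    """One index entry per search key value: key -> page number."""
    return {key: page_no for page_no, page in enumerate(pages) for key in page}

def sparse_lookup(index, pages, key):
    """Return the page number holding `key`, or None if the key is absent."""
    page_no = bisect_right(index, key) - 1   # last page whose first key <= key
    if page_no < 0:
        return None                          # key is smaller than every stored key
    # The sparse index only narrows the search to one page; the page must be read.
    return page_no if key in pages[page_no] else None

sparse = build_sparse_index(pages)
dense = build_dense_index(pages)
```

In this sketch the sparse index has one entry per page, while the dense index has one entry per key. Only the dense index can answer "does key 198 exist?" without reading a data page.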

Dense versus sparse indexes
Index size: a sparse index is smaller.
Requirement on records: records must be clustered for a sparse index.
Lookup: a sparse index is smaller and may fit in memory; a dense index can directly tell whether a record exists.
Update: easier for a sparse index.

Tree-Structured Indexes: Introduction
Tree-structured indexing techniques support both range selections and equality selections.
ISAM (Indexed Sequential Access Method): static structure; early index technology.
B+-tree: dynamic; adjusts gracefully under inserts and deletes.

Motivation for Index
"Find all students with gpa > 3.0"
If the data file is sorted, do a binary search. The cost of binary search in a database can be quite high. Why?
Simple idea: create an "index" file, and do the binary search on the (smaller) index file.
[Figure: an index file with entries k1, k2, …, kN pointing to Page 1, Page 2, Page 3, …, Page N of the data file. An index page has the layout P0, K1, P1, K2, P2, …, Km, Pm, where each (Ki, Pi) pair is an index entry.]
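One way to approach the "Why?" is to count page reads, since each probe of a binary search over a file on disk touches a different page. The sketch below is a rough estimate only. The file size and the number of index entries per page are made-up values, and the function name is hypothetical.

```python
def binary_search_page_reads(num_pages):
    """Worst-case probes of a binary search over num_pages sorted pages."""
    return num_pages.bit_length()   # equals floor(log2(n)) + 1 for n >= 1

# Hypothetical sizes, chosen only for illustration.
DATA_PAGES = 1_000_000
ENTRIES_PER_INDEX_PAGE = 100        # one (key, pointer) entry per data page

index_pages = -(-DATA_PAGES // ENTRIES_PER_INDEX_PAGE)   # ceiling division

direct_cost = binary_search_page_reads(DATA_PAGES)
indexed_cost = binary_search_page_reads(index_pages) + 1  # +1 to read the data page
print(direct_cost, indexed_cost)
```

Under these assumptions the search drops from 20 page reads to 15. The larger benefit is that the much smaller index file has a better chance of fitting in memory, where probes cost no disk I/O at all.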

ISAM
What if an index is still too big? Put another (sparse) index on top of it. The result is, more or less, ISAM (Indexed Sequential Access Method).
[Figure: index blocks (100, 200, …; …, 123, …; …, 996, …) stacked above data blocks (100, 108, 119, …; …, 129, …; 192, 197, …; 200, 202, …; 901, 907, …; 996, 997, …). Example: look up 197.]

Updates with ISAM
Overflow chains and empty data blocks degrade performance.
Worst case: most records go into one long chain.
Example: insert … (overflow block). Example: delete …
[Figure: the same index blocks and data blocks as in the ISAM lookup example, with an overflow block attached to a data block.]
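The degradation can be seen in a toy model. This is a sketch under simplifying assumptions and not a real ISAM implementation: a single index level, blocks that hold bare keys, and a hypothetical class name `IsamFile`. The index is built once and never changes; inserts that do not fit go to an overflow chain, and deletes can leave empty blocks behind.

```python
from bisect import bisect_right

class IsamFile:
    """Toy ISAM: a fixed sparse index over primary data blocks plus overflow chains."""

    def __init__(self, sorted_keys, block_capacity):
        self.capacity = block_capacity
        # Primary blocks are filled once at build time and never split afterwards.
        self.blocks = [sorted_keys[i:i + block_capacity]
                       for i in range(0, len(sorted_keys), block_capacity)]
        self.index = [block[0] for block in self.blocks]   # static index
        self.overflow = [[] for _ in self.blocks]           # one chain per block

    def _block_for(self, key):
        return max(bisect_right(self.index, key) - 1, 0)

    def insert(self, key):
        b = self._block_for(key)
        if len(self.blocks[b]) < self.capacity:
            self.blocks[b].append(key)
            self.blocks[b].sort()
        else:
            self.overflow[b].append(key)   # index is untouched; the chain grows

    def delete(self, key):
        b = self._block_for(key)
        for store in (self.blocks[b], self.overflow[b]):
            if key in store:
                store.remove(key)          # the block may become empty but is kept
                return True
        return False

    def lookup_cost(self, key):
        """Blocks read below the index: 1 primary block plus overflow blocks scanned."""
        b = self._block_for(key)
        if key in self.blocks[b]:
            return 1
        chain_blocks = -(-len(self.overflow[b]) // self.capacity)  # ceiling division
        return 1 + chain_blocks
```

Repeated inserts into the same key range make `lookup_cost` grow with the chain length, while the index itself stays exactly as it was built.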

A Note of Caution
ISAM is an old-fashioned idea. B+-trees are usually better, as we'll see. But ISAM is a good place to start to understand the idea of indexing.
Upshot: don't brag about being an ISAM expert on your resume, but do understand how ISAM works and its tradeoffs with B+-trees.

B+-tree
A hierarchy of intervals
Balanced (more or less): good performance guarantee
Disk-based: one node per page; large fan-out
[Figure: example B+-tree. Max fan-out: 4]

Sample B+-tree nodes
Non-leaf node: its pointers lead to keys 100 <= k < 120, to keys 120 <= k < 150, to keys 150 <= k < 180, and to keys 180 <= k.
Leaf node: its pointers lead to the records with these k values (or the records are stored directly in the leaves), plus one pointer to the next leaf node in sequence.
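The interval rule for a non-leaf node maps directly onto a binary search over its separator keys. A minimal sketch, assuming the node stores the separators 120, 150, 180 implied by the intervals above; `child_index` is a hypothetical helper name.

```python
from bisect import bisect_right

def child_index(separator_keys, k):
    """Index of the child pointer to follow for search key k.

    With separators [120, 150, 180], child 1 covers 120 <= k < 150, and so on.
    Child 0 covers every key below the first separator.
    """
    return bisect_right(separator_keys, k)

separators = [120, 150, 180]
for k in (100, 119, 120, 179, 180):
    print(k, "-> child", child_index(separators, k))
```

Using `bisect_right` rather than `bisect_left` is what sends a key equal to a separator to the right-hand child, matching the `<=` on the lower bound of each interval.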

B+-tree balancing properties
Height constraint: all leaves are at the same lowest level.
Fan-out constraint: all nodes are at least half full (except the root).

            Max # pointers   Max # keys   Min # active pointers   Min # keys
Non-leaf    f                f - 1        …                       …
Root        f                f - 1        2                       1
Leaf        f                f - 1        …                       …
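One way to pin down "at least half full" is the following sketch. The maximums and the root minimums (at least 2 pointers and 1 key) come from the table above. The rounding used for the non-leaf and leaf minimums is an assumed, commonly used convention, and textbooks differ in the details. The function name `occupancy_bounds` is hypothetical.

```python
import math

def occupancy_bounds(f):
    """Pointer/key bounds per node type for max fan-out f (rounding is an assumption)."""
    return {
        "non-leaf": {"max_pointers": f, "max_keys": f - 1,
                     "min_pointers": math.ceil(f / 2),
                     "min_keys": math.ceil(f / 2) - 1},
        "root":     {"max_pointers": f, "max_keys": f - 1,
                     "min_pointers": 2, "min_keys": 1},
        "leaf":     {"max_pointers": f, "max_keys": f - 1,
                     "min_pointers": math.floor(f / 2),
                     "min_keys": math.floor(f / 2)},
    }

for node_type, bounds in occupancy_bounds(4).items():
    print(node_type, bounds)
```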

Lookups
SELECT * FROM R WHERE k = 179;
SELECT * FROM R WHERE k = 32; (not found)

Range query
SELECT * FROM R WHERE k > 32 AND k < 179;
Look up 32, then follow the next-leaf pointers.
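Point lookups and range queries can both be sketched on a small hand-built tree. The classes `Leaf` and `Inner`, the function names, and all key values except 32 and 179 are assumptions made for illustration; record pointers are omitted, so the leaves hold only keys.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys          # sorted search keys (record pointers omitted)
        self.next = None          # next leaf in key order

class Inner:
    def __init__(self, keys, children):
        self.keys = keys          # separator keys
        self.children = children  # len(children) == len(keys) + 1

def find_leaf(node, k):
    """Descend from the root to the leaf whose key range contains k."""
    while isinstance(node, Inner):
        node = node.children[bisect_right(node.keys, k)]
    return node

def lookup(root, k):
    return k in find_leaf(root, k).keys

def range_query(root, lo, hi):
    """Return all keys with lo < k < hi by walking next-leaf pointers."""
    leaf, out = find_leaf(root, lo), []
    while leaf is not None:
        for k in leaf.keys:
            if k >= hi:
                return out        # keys are sorted, so nothing further qualifies
            if k > lo:
                out.append(k)
        leaf = leaf.next
    return out

# Hypothetical tree with max fan-out 4; the key values are made up.
leaves = [Leaf([10, 20]), Leaf([30, 35]), Leaf([100, 101, 110]),
          Leaf([120, 130]), Leaf([150, 156, 179]), Leaf([180, 200])]
for a, b in zip(leaves, leaves[1:]):
    a.next = b
root = Inner([100], [Inner([30], leaves[0:2]),
                     Inner([120, 150, 180], leaves[2:6])])

print(lookup(root, 179), lookup(root, 32))
print(range_query(root, 32, 179))
```

The range query descends the tree only once, for the lower bound, and then never revisits a non-leaf node. That is the purpose of the next-leaf pointers.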

Insertion
Insert a record with search key value …
Look up where the inserted key (32) should go, and insert it right there.
Max fan-out: 4

Another insertion example
Insert a record with search key value …
Oops, the node is already full!

Node splitting
Yikes, this node is also already full!
Max fan-out: 4

More node splitting
In the worst case, node splitting can "propagate" all the way up to the root of the tree (not illustrated here).
Splitting the root introduces a new root of fan-out 2 and causes the tree to grow "up" by one level.

Insertion: B+-tree insert
Find the correct leaf L.
Put the data entry onto L.
If L has enough space, done!
Else, split L (into L and a new node L2):
  Distribute entries evenly and copy up the middle key.
  Insert an index entry pointing to L2 into the parent of L.
This can happen recursively.
Tree growth: the tree gets wider and (sometimes) one level taller at the top.
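The steps above can be sketched as a short recursive insert. This is a minimal illustration under stated assumptions, not a production implementation: nodes hold bare keys, next-leaf pointers are left out, and the class names `Node` and `BPlusTree` are hypothetical. The treatment of a non-leaf split, where the middle key moves up instead of being copied, follows the usual textbook convention and is not spelled out in the steps above.

```python
from bisect import bisect_right, insort

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children          # None for a leaf

    @property
    def is_leaf(self):
        return self.children is None

class BPlusTree:
    def __init__(self, max_fanout=4):
        self.f = max_fanout               # a node holds at most f - 1 keys
        self.root = Node([])

    def insert(self, key):
        split = self._insert(self.root, key)
        if split is not None:             # the root itself split: grow one level
            sep, right = split
            self.root = Node([sep], [self.root, right])

    def _insert(self, node, key):
        """Insert key below node; return (separator, new_right_node) if node split."""
        if node.is_leaf:
            insort(node.keys, key)
        else:
            i = bisect_right(node.keys, key)
            split = self._insert(node.children[i], key)
            if split is None:
                return None
            sep, right = split
            node.keys.insert(i, sep)
            node.children.insert(i + 1, right)
        if len(node.keys) <= self.f - 1:
            return None                   # enough space, done
        mid = len(node.keys) // 2
        if node.is_leaf:
            # Leaf split: the middle key is copied up and stays in the new leaf.
            right = Node(node.keys[mid:])
            sep = right.keys[0]
            node.keys = node.keys[:mid]
        else:
            # Non-leaf split: the middle key moves up and leaves both halves.
            sep = node.keys[mid]
            right = Node(node.keys[mid + 1:], node.children[mid + 1:])
            node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
        return sep, right

    def height(self):
        h, node = 1, self.root
        while not node.is_leaf:
            h, node = h + 1, node.children[0]
        return h

    def leaf_keys(self):
        """All keys in order, collected by walking the children."""
        def walk(n):
            if n.is_leaf:
                return list(n.keys)
            return [k for c in n.children for k in walk(c)]
        return walk(self.root)
```

Note that the only place the height changes is in `insert`, when the root splits. Every other split makes the tree wider, which is why all leaves stay at the same level.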

Deletion
Delete a record with search key value …
Look up the key to be deleted, and delete it.
Oops, the node is too empty! If a sibling has more than enough keys, steal one!
Max fan-out: 4

Stealing from a sibling
Remember to fix the key in the least common ancestor.

Another deletion example
Delete a record with search key value …
Cannot steal from siblings. Then coalesce (merge) with a sibling!
Max fan-out: 4

Coalescing
Remember to delete the appropriate key from the parent.
Deletion can "propagate" all the way up to the root of the tree (not illustrated here).
When the root becomes empty, the tree "shrinks" by one level.
Max fan-out: 4

Deletion: B+-tree delete
Start at the root and find the leaf L where the entry belongs.
Remove the entry.
If L is at least half full, done!
If L has only d-1 entries (d being the minimum number of entries a node must hold):
  Try to redistribute, borrowing from a sibling (an adjacent node with the same parent as L).
  If redistribution fails, merge L and the sibling.
If a merge occurred, delete the entry (pointing to L or the sibling) from the parent of L.
Tree shrink: the tree gets narrower and (sometimes) one level lower at the top.

Example B+ Tree - Inserting 8*
In this example, we can avoid the split by redistributing entries; however, this is usually not done in practice.
Notice that the root was split, leading to an increase in height.
[Figure: the tree before and after the insert. Leaf entries before: 2* 3* 5* 7* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*. Leaf entries after: 2* 3* | 5* 7* 8* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*.]

Example Tree (including 8*)
Delete 19* and 20* ...
[Figure: the tree after inserting 8*, with leaf entries 2* 3* | 5* 7* 8* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*.]

Example Tree (including 8*): Delete 19* and 20* ...
Deleting 19* is easy.
Deleting 20* is done with redistribution. Notice how the middle key is copied up.
[Figure: after both deletions, the affected leaves are 22* 24* and 27* 29*, separated by key 27 in their parent.]

... And Then Deleting 24*
Must merge.
Observe the "toss" of an index entry (key 27, on the right) and the "pull down" of an index entry (below).
[Figure: after the merge, the leaf entries are 2* 3* | 5* 7* 8* | 14* 16* | 22* 27* 29* | 33* 34* 38* 39*.]
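The three deletions in this example (19*, 20*, then 24*) can be replayed with a sketch restricted to one parent node and its leaf children. The function name `delete_from_leaf` is hypothetical. The separator keys 24 and 30 and the minimum of 2 entries per leaf are assumptions taken from the textbook version of this tree. The sketch also assumes the parent has at least two leaves, and it stops at the parent, so the "pull down" from the root is not modeled.

```python
def delete_from_leaf(parent_keys, leaves, leaf_idx, key, min_keys):
    """Delete key from leaves[leaf_idx]; fix underflow by borrowing or merging.

    parent_keys[i] separates leaves[i] and leaves[i + 1]. Returns the action taken.
    """
    leaf = leaves[leaf_idx]
    leaf.remove(key)
    if len(leaf) >= min_keys:
        return "done"
    # Try to borrow from an adjacent sibling under the same parent.
    if leaf_idx > 0 and len(leaves[leaf_idx - 1]) > min_keys:
        leaf.insert(0, leaves[leaf_idx - 1].pop())
        parent_keys[leaf_idx - 1] = leaf[0]          # fix the separator
        return "redistributed"
    if leaf_idx + 1 < len(leaves) and len(leaves[leaf_idx + 1]) > min_keys:
        leaf.append(leaves[leaf_idx + 1].pop(0))
        parent_keys[leaf_idx] = leaves[leaf_idx + 1][0]
        return "redistributed"
    # Neither sibling can spare a key: merge and drop one separator from the parent.
    left = leaf_idx - 1 if leaf_idx > 0 else leaf_idx
    leaves[left].extend(leaves[left + 1])
    del leaves[left + 1]
    del parent_keys[left]
    return "merged"

parent_keys = [24, 30]                                # assumed separators
leaves = [[19, 20, 22], [24, 27, 29], [33, 34, 38, 39]]
for key in (19, 20, 24):
    print(key, delete_from_leaf(parent_keys, leaves, 0, key, min_keys=2))
print(parent_keys, leaves)
```

After the merge the parent is left with a single key, which is below its own minimum. That underflow is what triggers the next step up the tree in the full algorithm.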

Performance analysis
How many I/Os are required for each operation?
  h, the height of the tree (more or less)
  Plus one or two to manipulate actual records
  Plus O(h) for reorganization (should be very rare if f is large)
  Minus one if we cache the root in memory
How big is h?
  Roughly the logarithm of N to base fan-out, where N is the number of records
  B+-tree properties guarantee that fan-out is at least f/2 for all non-root nodes
  Fan-out is typically large (in the hundreds): many keys and pointers can fit into one block
  A 4-level B+-tree is enough for typical tables
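The claim that four levels suffice can be checked with a small calculation. The sketch below treats every node as having the same fan-out, which is a simplification, and the record counts and fan-out values are illustrative assumptions. The function name `estimated_height` is hypothetical.

```python
def estimated_height(num_records, fanout):
    """Smallest h with fanout**h >= num_records (integer arithmetic, no float log)."""
    h, reach = 1, fanout
    while reach < num_records:
        h, reach = h + 1, reach * fanout
    return h

# Hypothetical table of one billion records.
N = 10**9
print(estimated_height(N, 200))   # nodes full, max fan-out f = 200
print(estimated_height(N, 100))   # nodes half full, fan-out f / 2 = 100
```

With these numbers a full tree needs 4 levels and a half-full tree needs 5, so even the guaranteed worst case adds only one level.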

B+-tree in practice
Complex reorganization for deletion is often not implemented (e.g., Oracle, Informix). Instead, nodes are left less than half full and the index is reorganized periodically.
Most commercial DBMSs use B+-trees instead of hashing-based indexes because B+-trees handle range queries.

The Halloween Problem
Story from the early days of System R…
UPDATE Payroll SET salary = salary * 1.1 WHERE salary >= …;
There is a B+-tree index on Payroll(salary).
The update never stopped (why?)
Solutions?
  Scan the index in reverse
  Before the update, scan the index to create a complete "to-do" list
  During the update, maintain a "done" list
  Tag every row with a transaction/statement id
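A small simulation suggests why the update never stopped. The sketch models the index as a sorted Python list scanned in ascending order; the salaries, the threshold of 100,000, and the function names are hypothetical. `max_steps` is an artificial cap added only so that the demonstration terminates.

```python
from bisect import insort

def naive_index_scan_update(salaries, threshold, max_steps):
    """Scan the salary index in ascending order, updating entries in place.

    Each raise re-inserts the row ahead of the scan position, so the scan meets it
    again. `max_steps` is an artificial cap so that this demo terminates.
    """
    index = sorted((s, rid) for rid, s in enumerate(salaries))
    updates, pos = 0, 0
    while pos < len(index) and updates < max_steps:
        salary, rid = index[pos]
        if salary >= threshold:
            del index[pos]
            insort(index, (salary * 1.1, rid))  # lands at or after pos
            updates += 1
        else:
            pos += 1
    return updates

def todo_list_update(salaries, threshold):
    """Collect the qualifying row ids first, then update each exactly once."""
    todo = [rid for rid, s in enumerate(salaries) if s >= threshold]
    for rid in todo:
        salaries[rid] *= 1.1
    return len(todo)

payroll = [50_000.0, 120_000.0, 150_000.0]
print(naive_index_scan_update(payroll, 100_000, max_steps=1000))
print(todo_list_update(payroll, 100_000))
```

The naive scan uses up every step it is allowed, because each raised row reappears ahead of the cursor. The "to-do" list version touches each qualifying row exactly once.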

B+-tree versus ISAM
ISAM is more static; B+-tree is more dynamic.
ISAM is more compact (at least initially): fewer levels and I/Os than a B+-tree.
Over time, ISAM may not be balanced, so it cannot provide guaranteed performance as a B+-tree does.

B+-tree versus B-tree
B-tree: why not store records (or record pointers) in non-leaf nodes? These records can be accessed with fewer I/Os.
Problems?
  Storing more data in a node decreases fan-out and increases h.
  Records in leaves require more I/Os to access.
  The vast majority of the records live in leaves!
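The fan-out argument can be made concrete with back-of-the-envelope numbers. Everything in this sketch is an assumption chosen for illustration: a 4096-byte page, 8-byte keys, 8-byte child pointers, and an extra 8-byte record pointer per key in B-tree non-leaf nodes. Node headers are ignored, and the helper names are hypothetical.

```python
def fanout(page_bytes, entry_bytes):
    """How many entries of the given size fit into one page."""
    return page_bytes // entry_bytes

def levels_needed(num_keys, fan_out):
    """Smallest number of levels whose reach covers num_keys."""
    levels, reach = 1, fan_out
    while reach < num_keys:
        levels, reach = levels + 1, reach * fan_out
    return levels

# Hypothetical sizes in bytes.
PAGE_BYTES = 4096
KEY_BYTES, CHILD_PTR_BYTES, RECORD_PTR_BYTES = 8, 8, 8

bplus_fanout = fanout(PAGE_BYTES, KEY_BYTES + CHILD_PTR_BYTES)
btree_fanout = fanout(PAGE_BYTES, KEY_BYTES + CHILD_PTR_BYTES + RECORD_PTR_BYTES)

N = 10**9
print(bplus_fanout, levels_needed(N, bplus_fanout))
print(btree_fanout, levels_needed(N, btree_fanout))
```

Under these assumptions the extra record pointer cuts the fan-out from 256 to 170, which is enough to push a one-billion-key index from 4 levels to 5.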

Beyond ISAM, B-, and B+-trees
Other tree-based indexes: R-trees and variants, GiST, etc.
Hashing-based indexes: extensible hashing, linear hashing, etc.
Text indexes: inverted-list index, suffix arrays, etc.
Other tricks: bitmap index, bit-sliced index, etc.
How about indexing for subgraph search?

Summary
Two types of queries: key search and range query
B+-tree operations: search; insert (split child); delete (redistribution)
B+-tree sorting
Next: disk-based sorting algorithms

Homework 4
Online this afternoon!