CS4432: Database Systems II – Basic Indexing
Indexing: helps retrieve data more quickly for certain queries, e.g. SELECT * FROM Emp WHERE salary = 1000000; the index maps a search-key value (here 1,000,000) to the matching record. (Chapter 13)
Topics: Sequential Index Files (Chapter 13.1); Secondary Indexes (Chapter 13.2)
Sequential File
Sequential File + Dense Index: every record appears in the index.
Sequential File + Sparse Index: only the first record of each block appears in the index.
Sequential File + Sparse 2nd-level index.
Note: the DATA FILE and the INDEX are "ordered files". Question: how would we lay them out on disk? Contiguous layout on disk? Block-chained layout on disk?
Questions: do we want to build a dense 2nd-level index on top of a dense 1st-level index? Can we even do this?
Notes on pointers: (1) a block pointer BP (used in a sparse index) can be smaller than a record pointer RP (used in a dense index).
(2) With keys K1..K4 and records R1..R4 stored contiguously at, say, 1024 B per block: to fetch the block of K3, read at offset (3-1)*1024 = 2048 bytes. Note: if the file is contiguous, we can omit the pointers entirely.
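The offset computation above can be sketched as follows (a minimal sketch; the 1024 B block size and 1-based key numbering are taken from the slide):

```python
BLOCK_SIZE = 1024  # bytes per block, as on the slide

def block_offset(i: int) -> int:
    """Byte offset of the block holding key Ki in a contiguous file (1-based)."""
    return (i - 1) * BLOCK_SIZE

print(block_offset(3))  # block of K3 starts at byte 2048
```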
Sparse vs. Dense Tradeoff. Sparse: less index space per record, so more of the index can be kept in memory (later: sparse is better for insertions). Dense: can tell whether a record exists without accessing the data file (later: dense is needed for secondary indexes).
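A toy model of this tradeoff (a sketch with made-up keys, four two-record blocks): the dense index answers an existence test from the index alone, while the sparse index must fetch a data block.

```python
from bisect import bisect_right

# Data file: four blocks of two sorted records each (keys only).
data_file = [[10, 20], [30, 40], [50, 60], [70, 80]]

# Dense index: one entry per record -> can answer "does key exist?" alone.
dense = sorted(k for block in data_file for k in block)

# Sparse index: one entry per block (the first key of each block).
sparse = [block[0] for block in data_file]

def exists_dense(key):
    i = bisect_right(dense, key) - 1
    return i >= 0 and dense[i] == key          # no data-file access needed

def exists_sparse(key):
    i = bisect_right(sparse, key) - 1
    return i >= 0 and key in data_file[i]      # must read the data block

print(exists_dense(45), exists_sparse(40))  # False True
```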
Terms: index-sequential file; search key (≠ primary key); primary index (on the sequencing field); secondary index; dense index (contains all search-key values); sparse index; multi-level index.
Next: duplicate keys; deletion/insertion; secondary indexes.
Duplicate keys
Duplicate keys, dense index, one way: point to every record, repeating duplicate keys in the index.
Duplicate keys, dense index, a better way: point to each distinct value only!
Duplicate keys, sparse index: point to the start of each block; careful if looking for 20 or 30!
Duplicate keys, sparse index, another way: place the first new key from each block in the index. Should this entry be 40?
Summary (duplicate values, primary index): the index may point to the first instance of each value only.
Next: duplicate keys; deletion/insertion; secondary indexes.
Deletion from sparse index
Deletion from sparse index: delete record 40 (not the first in its block, so no index change is needed).
Deletion from sparse index: delete record 30; it was the first record of its block, so its index entry is updated to the new first key, 40.
Deletion from sparse index: delete records 30 & 40.
Deletion from dense index
Deletion from dense index: delete record 30; the entry for 30 is removed and 40 shifts up.
Insertion, sparse index case
Insertion, sparse index case: insert a record; our lucky day, we have free space exactly where we need it!
Insertion, sparse index case: insert a record; immediate reorganization. Other variations?
Just illustrated: immediate reorganization. Now a variation: insert a new block (chained file).
Insertion, sparse index case: insert a record into an overflow block (reorganize later...).
Insertion, dense index case: similar, but often more expensive...
Next: duplicate keys; deletion/insertion; secondary indexes.
Secondary indexes (the data file is sequenced on a different field). Can I make a secondary index sparse?
Secondary indexes: a sparse index does not make sense here, because the file is not ordered on the secondary key!
Secondary indexes: the first level must be a dense index! Is a sparse higher level allowed?
With secondary indexes: the lowest level is dense; the other levels are sparse. Also: the pointers are record pointers (not block pointers; not computed).
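A sketch of why the first level must be dense (hypothetical mini-file; the records sit in file order, not key order, so only a dense key-to-record-pointer level works):

```python
# Heap file in arbitrary order: (record id, secondary key). Hypothetical data.
records = [("r0", 30), ("r1", 10), ("r2", 20), ("r3", 10)]

# Dense first level of a secondary index: sorted (key, record pointer) pairs,
# one per record, duplicates included. Record pointers, not block pointers.
secondary = sorted((key, rid) for rid, key in records)

def lookup(key):
    """All record pointers for a given secondary-key value."""
    return [rid for k, rid in secondary if k == key]

print(lookup(10))  # ['r1', 'r3']
```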
Duplicate values & secondary indexes
Duplicate values & secondary indexes, one option: repeat the key once per record pointer. Problem: excess overhead, in both disk space and search time!
Duplicate values & secondary indexes, another option: store each key once, followed by all of its record pointers. Problem: variable-size records in the index!
Duplicate values & secondary indexes, another idea: chain together the records with the same key. Problems: need to add chain fields to the data records for each index; need to follow the chain to find all matching records.
Summary: Conventional Indexes. Basic ideas: sparse, dense, multi-level. Duplicate keys. Deletion/insertion. Secondary indexes.
Multi-level Index Structures: the first level is dense if the file is non-sequential; the higher levels are always sparse.
Sequential indexes: pros/cons? Advantages: simple; the index is a sequential file, good for scans; search is efficient for static data. Disadvantages: inserts are expensive, and/or we lose sequentiality & balance, and then search time becomes unpredictable.
Example: sequential index with continuous free space and an overflow area (not sequential).
Another type of index: give up "sequentiality" of the index; get predictable performance under updates; always keep the "tree" balanced; automate restructuring under updates.
B+Tree example (n = 3), root shown.
Sample non-leaf node with keys 57, 81, 95: the first pointer leads to keys k < 57, the second to keys 57 ≤ k < 81, the third to keys 81 ≤ k < 95, and the last to keys k ≥ 95.
Sample leaf node: pointers to the records with keys 57, 81, and 85, plus a pointer from the parent non-leaf node and a pointer to the next leaf in sequence.
In the textbook's notation (n = 3): leaf and non-leaf node layouts.
Size of nodes: n+1 pointers and n keys (fixed).
We don't want nodes to be too empty. Use at least: non-leaf: ⌈(n+1)/2⌉ pointers; leaf: ⌊(n+1)/2⌋ pointers to data.
Full node vs. minimum node (n = 3): a full non-leaf holds 3 keys and 4 pointers, a minimum non-leaf ⌈(n+1)/2⌉ = 2 pointers; a full leaf holds 3 key-pointer pairs, a minimum leaf ⌊(n+1)/2⌋ = 2 pointers to data. Pointer slots count even if null.
B+tree rules (tree of order n): (1) all leaves are at the same lowest level (balanced tree); (2) pointers in leaves point to records, except for the "sequence pointer".
(3) Number of pointers/keys for a B+tree of order n: non-leaf (non-root): max n+1 pointers, max n keys; min ⌈(n+1)/2⌉ pointers, min ⌈(n+1)/2⌉ - 1 keys. Leaf (non-root): max n+1 pointers, max n keys; min ⌊(n+1)/2⌋ pointers to data, min ⌊(n+1)/2⌋ keys. Root: max n+1 pointers, max n keys; min 2 pointers, min 1 key.
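The table above can be generated mechanically; a small sketch computing the min/max counts for a given n from the textbook formulas:

```python
import math

def bptree_bounds(n):
    """(max ptrs, max keys, min ptrs, min keys) per node type, order n."""
    half_up = math.ceil((n + 1) / 2)    # ceil((n+1)/2)
    half_down = (n + 1) // 2            # floor((n+1)/2)
    return {
        "non-leaf": (n + 1, n, half_up, half_up - 1),
        "leaf":     (n + 1, n, half_down, half_down),  # min ptrs are to data
        "root":     (n + 1, n, 2, 1),
    }

for node, (max_p, max_k, min_p, min_k) in bptree_bounds(3).items():
    print(f"{node}: max {max_p} ptrs / {max_k} keys, min {min_p} ptrs / {min_k} keys")
```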
B+Tree example: searches.
Insert into B+tree: (a) simple case, space available in the leaf; (b) leaf overflow; (c) non-leaf overflow; (d) new root.
(a) Insert key = 32 (n = 3).
(b) Insert key = 7 (n = 3): leaf overflow.
(c) Insert key = 160 (n = 3): non-leaf overflow.
(d) Insert key = 45 (n = 3): new root.
Recap: Insert Data into B+ Tree. Find the correct leaf L and put the data entry onto L. If L has enough space, done! Else, split L (into L and a new node L2): redistribute the entries evenly and copy up the middle key, then insert an index entry pointing to L2 into the parent of L. This can happen recursively: to split an index node, redistribute entries evenly, but push up the middle key (contrast with leaf splits). Splits "grow" the tree; a root split increases its height. Tree growth: the tree gets wider, or one level taller at the top.
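The recap above can be sketched as a minimal B+tree insert (my own simplified structures, keys only, order n = 3; leaf splits copy the middle key up, non-leaf splits push it up):

```python
ORDER = 3  # max keys per node (n = 3)

class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.keys = []
        self.children = []   # child nodes (non-leaf only)
        self.next = None     # sequence pointer (leaf only)

def insert(root, key):
    """Insert key; return the (possibly new) root."""
    split = _insert(root, key)
    if split:                      # root split: tree grows one level taller
        mid, right = split
        new_root = Node(leaf=False)
        new_root.keys = [mid]
        new_root.children = [root, right]
        return new_root
    return root

def _insert(node, key):
    if node.leaf:
        i = len([k for k in node.keys if k < key])
        node.keys.insert(i, key)
        if len(node.keys) <= ORDER:
            return None            # simple case: space available
        # Leaf overflow: split, COPY the middle key up.
        half = (len(node.keys) + 1) // 2
        right = Node(leaf=True)
        right.keys = node.keys[half:]
        node.keys = node.keys[:half]
        right.next, node.next = node.next, right
        return right.keys[0], right
    i = len([k for k in node.keys if k <= key])
    split = _insert(node.children[i], key)
    if split is None:
        return None
    mid, right_child = split
    node.keys.insert(i, mid)
    node.children.insert(i + 1, right_child)
    if len(node.keys) <= ORDER:
        return None
    # Non-leaf overflow: split, PUSH the middle key up (it moves, not copies).
    half = len(node.keys) // 2
    up = node.keys[half]
    right = Node(leaf=False)
    right.keys = node.keys[half + 1:]
    right.children = node.children[half + 1:]
    node.keys = node.keys[:half]
    node.children = node.children[:half + 1]
    return up, right

root = Node()
for k in [10, 20, 30, 40, 15, 25, 35]:
    root = insert(root, k)
print(root.keys)
```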
Deletion from B+tree: (a) simple case; (b) coalesce with a neighbor (sibling); (c) re-distribute keys; (d) cases (b) or (c) at a non-leaf.
(a) Delete key = 11 (n = 4).
(b) Coalesce with sibling (n = 4).
(c) Redistribute keys (n = 4).
(d) Coalesce, then non-leaf coalesce: delete 37 (n = 4); new root.
B+tree deletions in practice: often, coalescing is not implemented at all. Too hard, and not worth it!
Delete Data from B+ Tree. Start at the root and find the leaf L where the entry belongs; remove the entry. If L is still at least half-full, done! If L has only d-1 entries (d = minimum occupancy), try to re-distribute, borrowing from a sibling (an adjacent node with the same parent as L). If re-distribution fails, merge L with the sibling; a merge requires deleting the entry (pointing to L or the sibling) from the parent of L. Merges can propagate up to the root, decreasing the tree's height.
Comparison: B-trees vs. static indexed sequential files. Concurrency control is harder in B-trees. A B-tree consumes more space. With a static index, the DBA does not know when to reorganize, or how full to load the pages of a new index. Buffering: a B-tree has fixed buffer requirements; a static index must read several overflow blocks to be efficient (large & variable-size buffers needed).
Speaking of buffering: is LRU a good policy for B+tree buffers? Of course not! We should try to keep the root in memory at all times (and perhaps some nodes from the second level).
Comparison: B-tree vs. indexed sequential file. Indexed sequential file: less space, so lookup is faster; inserts are managed by an overflow area; periodic reorganization required; unpredictable performance. B-tree: consumes more space, so lookup is slower; each insert/delete potentially restructures; restructuring is built in; predictable performance.
Interesting problem: for a B+tree, how large should n be? (n is the number of keys per node.)
Assumptions: n children per node, N records in the database. (1) Time to read a B-tree node from disk: (t_seek + t_read * n) msec. (2) Once in main memory, binary search locates the key: (a + b * log2 n) msec. (3) We need to search (read) log_n N tree nodes. (4) So t_search = (t_seek + t_read * n + a + b * log2 n) * log_n N.
This gives f(n) = time to find a record. Find n_opt by solving f'(n) = 0. What happens to n_opt as disks get faster? As CPUs get faster?
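A numeric sketch of this question (all timing constants below are made-up assumptions, not measured values):

```python
import math

t_seek, t_read = 10.0, 0.01    # msec: seek time, per-key transfer (assumed)
a, b = 0.0001, 0.00005         # msec: binary-search constants (assumed)
N = 10_000_000                 # records in the database (assumed)

def f(n):
    """t_search: per-node cost times the number of levels, log_n N."""
    per_node = t_seek + t_read * n + a + b * math.log2(n)
    return per_node * (math.log(N) / math.log(n))

n_opt = min(range(2, 5000), key=f)
print(n_opt, round(f(n_opt), 2))
```

Re-running with a smaller t_seek moves n_opt downward, which answers the "disk gets faster" question under these assumed constants; the CPU terms a and b are so small here that they barely move the optimum.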
Bulk Loading of B+ Tree. Given a large collection of records, create a B+ tree for it. Method 1: repeatedly insert the records: slow. Method 2: bulk loading: more efficient.
Bulk Loading of B+ Tree. Initialization: sort all data entries, then insert a pointer to the first (leaf) page into a new (root) page. [Figure: sorted pages of data entries 3*, 4*, 6*, 9*, 10*, 11*, 12*, 13*, 20*, 22*, 23*, 31*, 35*, 36*, 38*, 41*, 44*, not yet in the B+ tree, under a new root.]
Bulk Loading (contd.): index entries for the leaf pages are always entered into the right-most index page. When this page fills up, it splits, and the split may propagate up the right-most path to the root. [Figure: the same sorted data-entry pages being attached to the tree level by level.]
Summary of Bulk Loading. Method 1, multiple inserts: slow; does not give sequential storage of the leaves. Method 2, bulk loading: has advantages for concurrency control; fewer I/Os during the build; the leaves are stored sequentially (and linked); can control the "fill factor" on pages.
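The bottom-up build can be sketched as follows (my own simplified representation: each node is a (min_key, payload) pair, and FANOUT is an assumed page capacity):

```python
FANOUT = 4  # assumed max entries per node

def chunk(xs, size):
    return [xs[i:i + size] for i in range(0, len(xs), size)]

def bulk_load(sorted_keys):
    """Build the tree level by level, bottom-up; returns leaves first, root last."""
    # Leaves: packed left to right, so they end up stored sequentially.
    level = [(leaf[0], leaf) for leaf in chunk(sorted_keys, FANOUT)]
    levels = [level]
    while len(level) > 1:
        # Each parent's keys are the min keys of its children after the first.
        level = [(group[0][0], [min_key for min_key, _ in group[1:]])
                 for group in chunk(level, FANOUT)]
        levels.append(level)
    return levels

levels = bulk_load(list(range(1, 33)))   # 32 sorted keys
print(len(levels))                       # 3 levels: leaves, inner, root
```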