
CSE 232A Graduate Database Systems


1 CSE 232A Graduate Database Systems
Arun Kumar. Topic 2: Indexing and Sorting. Chapters 10, 11, and 13 of the Cow Book. Slide ACKs: Jignesh Patel, Paris Koutris

2 Motivation for Indexing
Consider the following SQL query over Movies (M) with schema (MovieID, Name, Year, Director):

SELECT * FROM Movies WHERE Year=2017

Q: How to obtain the matching records from the file?
Heap file? Need to do a linear scan! O(N) I/O and CPU.
"Sorted" file? Binary search! O(log2(N)) I/O and CPU.
Indexing helps retrieve records faster for selective predicates, including range predicates:

SELECT * FROM Movies WHERE Year>=2000 AND Year<2010

3 Another View of Storage Manager
[Figure: layered architecture. Access Methods (Heap File, Sorted File, B+-tree Index, Hash Index) sit atop the Buffer Manager, which sits atop the I/O Manager (which issues the I/O accesses); the Concurrency Control Manager and Recovery Manager interact with these layers.]

4 Indexing: Outline
Overview and Terminology; B+ Tree Index; Hash Index; Learned Index

5 Indexing
Index: a data structure to speed up record retrieval.
Search Key: attribute(s) on which the file is indexed; also called the Index Key (used interchangeably).
- Any permutation of any subset of a relation's attributes can be the index key for an index.
- The index key need not be a primary/candidate key.
Two main types of indexes:
- B+ Tree index: good for both range and equality search.
- Hash index: good for equality search.

6 Overview of Indexes
Need to consider the efficiency of search, insert, and delete; indexes are primarily optimized to reduce (disk) I/O cost.
B+ Tree index: O(log_F(N)) I/O and CPU cost for equality search (N: number of "data entries"; F: "fanout" of a non-leaf node). Range search, Insert, and Delete all start with an equality search.
Hash index: O(1) I/O and CPU cost for equality search. Insert and Delete start with an equality search. Not "good" for range search!

7 What is Stored in the Index?
Two things: search/index key values and data entries.
Alternatives for the data entries for a given key value k:
- AltRecord: actual data records of the file that match k
- AltRID: <k, RID of a record that matches k>
- AltRIDlist: <k, list of RIDs of records that match k>
API for operations on records:
- Search (IndexKey); could be a predicate for a B+ Tree
- Insert (IndexKey, data entry)
- Delete (IndexKey); could be a predicate for a B+ Tree
(The three alternatives are sketched below.)
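To make the three alternatives concrete, here is a minimal Python sketch; the dataclasses and the RID-as-tuple convention are illustrative assumptions, not the Cow Book's definitions:

```python
from dataclasses import dataclass
from typing import Any

RID = tuple  # (PageID, SlotID) of a record, per the usual convention

@dataclass
class AltRecord:      # the data entry IS the matching record itself
    key: Any
    record: tuple

@dataclass
class AltRID:         # <k, RID of one record that matches k>
    key: Any
    rid: RID

@dataclass
class AltRIDList:     # <k, list of RIDs of all records that match k>
    key: Any
    rids: list
```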

8 Overview of B+ Tree Index
Non-leaf (index) pages hold index entries of the form (IndexKey value, PageID); leaf pages hold data entries, e.g., AltRID: (IndexKey value, RID).
Non-leaf pages do not contain data values; they contain [d, 2d] index keys, where d is the order parameter; only the root may have [1, d) keys.
Height-balanced tree.
Leaf pages are in sorted order of IndexKey and are connected as a doubly linked list.
Q: What is the difference between a "B+ Tree" and a "B Tree"?

9 Overview of Hash Index
[Figure: a hash function h maps a SearchKey to one of N buckets (0 to N-1); each bucket has a primary bucket page plus overflow pages.]
Bucket pages hold data entries (same 3 alternatives).
The hash function helps obtain O(1) search time.

10 Trade-offs of Data Entry Alternatives
Pros and cons of the alternatives for data entries:
AltRecord: the entire file is stored as an index! If records are long, the data entries of the index are large and search time could be high.
AltRID and AltRIDlist: data entries are typically smaller than records; often faster for equality search. AltRIDlist has more compact data entries than AltRID, but its entries are variable-length.
Q: A file can have at most one AltRecord index. Why?

11 More Indexing-related Terminology
Composite Index: the IndexKey has > 1 attribute.
Primary Index: the IndexKey contains the primary key.
Secondary Index: any index that is not a primary index.
Unique Index: the IndexKey contains a candidate key. All primary indexes are unique indexes!
Example schema: Movies (MovieID, Name, Year, Director, IMDB_URL), where IMDB_URL is a candidate key. Which kinds are these: an index on MovieID? On Year? On Director? On IMDB_URL? On (Year, Name)?

12 More Indexing-related Terminology
Clustered index: the order in which records are laid out is the same as (or "very close to") the order of the IndexKey domain. Matters for (range) search performance!
AltRecord implies the index is clustered. Why? In practice, clustered almost always implies AltRecord.
In practice, a file is clustered on at most 1 IndexKey.
Unclustered index: an index that is not clustered.
Example: Movies (MovieID, Name, Year, Director, IMDB_URL). Is an index on Year clustered? On (Year, Name)?

13 Indexing: Outline
Overview and Terminology; B+ Tree Index; Hash Index; Learned Index

14 B+ Tree Index: Search
[Figure: example B+ tree of height 1, order 2. Root index keys: 13, 17, 24, 30. Leaf data entries, left to right: (2*, 3*, 5*, 7*), (14*, 16*), (19*, 20*, 22*), (24*, 27*, 29*), (33*, 34*, 38*, 39*).]
Given SearchKey k, start from the root; compare k with the IndexKeys in the non-leaf/index entries; descend to the correct child; keep descending like so till a leaf node is reached.
Comparison within non-leaf nodes: binary/linear search.
Examples: search 7*; 8*; 24*; range [19*, 33*]. (A sketch of the descent follows.)
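A minimal Python sketch of this descent on the example tree above; the Node class and in-memory lists are assumptions standing in for pages and PageIDs:

```python
import bisect

class Node:
    def __init__(self, keys, children=None, entries=None):
        self.keys = keys          # sorted IndexKeys in this page
        self.children = children  # non-leaf: child "pointers" (PageIDs)
        self.entries = entries    # leaf: data entries, shown as 19*, 20*, ...

def search(node, k):
    """Descend from the root to the leaf that may contain SearchKey k."""
    while node.children is not None:           # still at a non-leaf page
        i = bisect.bisect_right(node.keys, k)  # keys[i-1] <= k < keys[i]
        node = node.children[i]
    return node

# The example tree: root keys (13, 17, 24, 30) over five leaves
leaves = [Node([], entries=[2, 3, 5, 7]), Node([], entries=[14, 16]),
          Node([], entries=[19, 20, 22]), Node([], entries=[24, 27, 29]),
          Node([], entries=[33, 34, 38, 39])]
root = Node([13, 17, 24, 30], children=leaves)
assert 7 in search(root, 7).entries       # search 7*: found
assert 8 not in search(root, 8).entries   # search 8*: not found
# Range [19*, 33*]: search 19, then follow the leaves' next pointers.
```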

15 B+ Tree Index: Page Format
Non-leaf page: m index keys K1 < K2 < ... < Km and m+1 page pointers P0, P1, ..., Pm. P0 points to a page with values < K1; P1 to values with K1 <= value < K2; P2 to values with K2 <= value < K3; and so on; Pm points to values >= Km. Order = m/2.
Binary search on the index entries identifies the first data page, then scan; the index file is much smaller than the data file, hence this is faster.
Index entries are always of this type; data entries could be in any of the 3 alternative forms.
Leaf page: n data entries (record 1, record 2, ..., record n, or (key, RID/RID-list) pairs), plus previous-page and next-page pointers for the doubly linked leaf list.

16 B+ Trees in Practice
Typical order value: 100 (so a non-leaf node can have up to 200 index keys).
Typical occupancy: 67%; so the typical "fanout" F = 133.
Computing the tree's capacity using the fanout:
- Height 1 stores 133 leaf pages
- Height 4 stores 133^4 = 312,900,721 leaf pages
Typically, the higher levels of the B+ Tree are cached in the buffer pool:
- Level 0 (root) = 1 page = 8 KB
- Level 1 = 133 pages ~ 1 MB
- Level 2 = 17,689 pages ~ 138 MB, and so on
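A back-of-envelope check of these numbers; the 67% occupancy, fanout of 133, and 8KB pages are the slide's assumptions:

```python
F, PAGE = 133, 8 * 1024        # fanout and page size in bytes
print(F ** 1)                  # height 1: 133 leaf pages
print(F ** 4)                  # height 4: 312,900,721 leaf pages
print(1 * PAGE)                # level 0 (root): one 8 KB page
print(F * PAGE / 2**20)        # level 1: 133 pages, ~1 MB
print(F**2 * PAGE / 2**20)     # level 2: 17,689 pages, ~138 MB
```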

17 B+ Tree Index: Insert
Search for the correct leaf L; insert the data entry into L; if L has enough space, done!
Otherwise, must split L (into L and a new leaf L'):
- Redistribute entries evenly; copy up the middle key
- Insert an index entry pointing to L' into the parent of L
A split might have to propagate upwards recursively:
- To split a non-leaf node, redistribute entries evenly, but push up the middle key (not copy up, as in leaf splits!)
Splits "grow" the tree; a root split increases the height. Tree growth: it gets wider, or one level taller at the top. (A sketch of the two split rules follows.)
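A sketch of the two split rules, assuming order d = 2 and Python lists standing in for pages; the helper names are hypothetical:

```python
def split_leaf(entries):
    """Split an overflowing leaf; the middle key is COPIED up."""
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right, right[0]   # copy-up key also stays in the new leaf

def split_nonleaf(keys, children):
    """Split an overflowing non-leaf; the middle key is PUSHED up."""
    mid = len(keys) // 2
    return (keys[:mid], children[:mid + 1]), \
           (keys[mid + 1:], children[mid + 1:]), keys[mid]  # key removed here

# Insert 8* into the full leaf (2*, 3*, 5*, 7*):
print(split_leaf([2, 3, 5, 7, 8]))   # ([2, 3], [5, 7, 8], 5): 5 is copied up
```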

18 B+ Tree Index: Insert
Example: Insert 8* into the example tree (root keys 13, 17, 24, 30; leftmost leaf (2*, 3*, 5*, 7*)).
[Figure: the leftmost leaf overflows and splits into (2*, 3*) and (5*, 7*, 8*); the middle key 5 is copied up (and continues to appear in the leaf) as an entry to be inserted into the parent node. The parent then overflows as well: Split! Height++.]

19 B+ Tree Index: Insert
Example: Insert 8* (contd.) Inserting key 5 into the parent node overflows it too.
[Figure: the non-leaf node (5, 13, 17, 24, 30) splits; the middle key 17 is pushed up into a new root (and only appears once in the index), leaving children (5, 13) and (24, 30).]
Minimum occupancy is guaranteed in both leaf and non-leaf page splits.

20 B+ Tree Index: Insert
Example: Insert 8* (final state).
[Figure: new root with key 17; left child (5, 13) over leaves (2*, 3*), (5*, 7*, 8*), (14*, 16*); right child (24, 30) over leaves (19*, 20*, 22*), (24*, 27*, 29*), (33*, 34*, 38*, 39*).]
The recursive splitting went up to the root; the height went up by 1.
Splitting is somewhat expensive; is it avoidable? We can redistribute data entries with the left or right sibling, if there is space!

21 Insert: Leaf Node Redistribution
Example: Insert 8*.
[Figure: instead of splitting, 8* is placed in the right sibling, which becomes (8*, 14*, 16*), and the parent's separator key 13 is updated to 8.]
Redistributing data entries with a sibling improves page occupancy at the leaf level and avoids too many splits; but it is usually not used for non-leaf node splits:
- Could increase I/O cost (checking siblings)
- Propagating internal splits gives better amortization
- Pointer management headaches

22 B+ Tree Index: Delete
Start at the root, find the leaf L where the entry belongs.
Remove the entry; if L is at least half-full, done! Else, if L has only d-1 entries:
- Try to redistribute, borrowing from a sibling L'
- If redistribution fails, merge L and L' into a single leaf
If a merge occurred, must delete the entry (pointing to L or its sibling) from the parent of L.
A merge might have to propagate upwards recursively to the root, which decreases the height by 1.

23 B+ Tree Index: Delete
Examples: Delete 22*; then Delete 20*; then Delete 24*?
Deleting 22* is easy: its leaf (19*, 20*, 22*) stays at least half-full.
Deleting 20* is followed by redistribution at the leaf level: the leaf (19*) borrows 24* from its sibling (24*, 27*, 29*). Note how the middle key 27 is copied up into the parent.
[Figure: tree after both deletes: root key 17; left child (5, 13) over leaves (2*, 3*), (5*, 7*, 8*), (14*, 16*); right child now keyed (27, 30) over leaves (19*, 24*), (27*, 29*), (33*, 34*, 38*, 39*).]

24 B+ Tree Index: Delete
Example: Delete 24*. Must merge leaf nodes! The leaf (19*) cannot borrow from its sibling (27*, 29*), so they merge into (19*, 27*, 29*); in the non-leaf node, remove the index entry with key value 27.
Need to merge recursively upwards! The under-full non-leaf node merges with its sibling, and the root's index entry 17 is pulled down.
[Figure: resulting tree: new root (5, 13, 17, 30) over leaves (2*, 3*), (5*, 7*, 8*), (14*, 16*), (19*, 27*, 29*), (33*, 34*, 38*, 39*). The height decreased by 1.]

25 Delete: Non-leaf Node Redistribution
Suppose this is the state of the tree when deleting 24*:
[Figure: root (22); left child (5, 13, 17, 20) over leaves (2*, 3*), (5*, 7*, 8*), (14*, 16*), (17*, 18*), (20*, 21*); right child (30) over leaves (22*, 27*, 29*), (33*, 34*, 38*, 39*).]
Instead of merging the root's children, we can also redistribute an entry from the left child of the root to the right child.
We have now considered redistribution of leaf pages and merging of leaf and non-leaf pages.

26 Delete: After Redistribution
Rotate IndexKeys through the parent node: it suffices to redistribute the index entry with key 20; for illustration, 17 is also redistributed.
[Figure: root (17); left child (5, 13) over leaves (2*, 3*), (5*, 7*, 8*), (14*, 16*); right child (20, 22, 30) over leaves (17*, 18*), (20*, 21*), (22*, 27*, 29*), (33*, 34*, 38*, 39*).]

27 Delete: Redistribution Preferred
Unlike Insert, where redistribution is discouraged for non-leaf nodes, Delete prefers redistribution over merging at both the leaf and non-leaf levels. Why?
- Files usually grow, not shrink; deletions are rare!
- High chance of redistribution success (high fanouts)
- Only need to propagate changes to the parent node

28 Handling Duplicates/Repetitions
Many data entries could have the same IndexKey value. Related to AltRIDlist vs AltRID for data entries. Also, a single data entry could still span multiple pages.
Solution 1: all data entries with a given IndexKey value reside on a single page. Use "overflow" pages if needed (not inside the leaf list).
Solution 2: allow repeated IndexKey values among data entries. Modify Search appropriately. Use the RID to get a unique composite key!

29 Order Concept in Practice
In practice, the order (d) concept is replaced by a physical space criterion: at least half-full.
Non-leaf pages can typically hold many more entries than leaf pages, since leaf pages could have long data records (AltRecord) or RID lists (AltRIDlist).
Often, different nodes could have different numbers of entries:
- Variable-sized IndexKeys
- AltRecord and variable-sized data records
- AltRIDlist can lead to different numbers of data entries sharing an IndexKey value

30 B+ Tree Index: Bulk Loading
Given an existing file we want to index, multiple record-at-a-time Inserts are wasteful (too many IndexKeys!). Bulk loading avoids this overhead and reduces I/O cost:
1) Sort the data entries by IndexKey (AltRecord sorts the file!)
2) Create an empty root page; copy the leftmost IndexKey of the leftmost leaf page to the root (non-leaf) and assign the child
3) Go left to right; insert only the leftmost IndexKey from each leaf page into the index as usual (NB: far fewer keys! See the sketch below.)
4) When a non-leaf page fills up, follow the usual Split procedure and recurse upwards, if needed
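A tiny sketch of why step 3 inserts so few keys; the helper name is an assumption, and leaf pages are modeled as sorted lists:

```python
def separator_keys(sorted_leaf_pages):
    """One index entry per leaf page instead of one per data entry."""
    # the leftmost leaf needs no separator; every other leaf contributes
    # its leftmost key as the index entry that points to it
    return [page[0] for page in sorted_leaf_pages[1:]]

leaves = [[2, 3], [5, 7], [14, 16], [17, 18], [20, 21], [22, 27]]
print(separator_keys(leaves))   # [5, 14, 17, 20, 22]: one key per page,
                                # not one per data entry
```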

31 B+ Tree Index: Bulk Loading
[Figure: sorted pages of data entries (2*, 3*), (5*, 7*), (14*, 16*), (17*, 18*), (20*, 21*), (22*, 27*), ...; the successive leftmost keys 5, 14, 17, 20, 22 are inserted into the non-leaf level. No redistribution! Insert IndexKey 17: split; push up 14. Insert IndexKey 22: split again; push up 20; Height++. And so on.]

32 Indexing: Outline
Overview and Terminology; B+ Tree Index; Hash Index; Learned Index

33 Overview of Hash Indexing
Reduces search cost to nearly O(1). Good for equality search (but not for range search).
Many variants: static hashing; extendible hashing; linear hashing, etc. (we will not discuss these last variants).

34 Static Hashing
[Figure: a hash function h, e.g., h(k) = a*k + b, maps SearchKey k to bucket h(k) mod N; buckets 0, 1, 2, ..., N-1 each have a primary bucket page plus overflow pages.]
N is fixed; primary bucket pages are never deallocated.
Bucket pages contain data entries (same 3 alternatives).
Search: overflow pages help handle hash collisions. The average search cost is O(1) + #overflow pages.

35 Static Hashing: Example
MovieID | Name | Year | Director
20 | Inception | 2010 | Christopher Nolan
15 | Avatar | 2009 | Jim Cameron
52 | Gravity | 2013 | Alfonso Cuaron
74 | Blue Jasmine | 2013 | Woody Allen
Hash function with a = 1, b = 0: h(k) = k; bucket = k MOD 3 (N = 3). Say AltRecord, and only one record fits per page!
15 MOD 3 = 0: (15, Avatar, ...) goes to bucket 0; 52 MOD 3 = 1: (52, Gravity, ...) to bucket 1; 20 MOD 3 = 2: (20, Inception, ...) to bucket 2; 74 MOD 3 = 2: (74, Blue Jasmine, ...) goes to an overflow page of bucket 2.
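A minimal sketch of this exact example in Python; the toy structures are assumptions (a list of tuples per bucket, where entries beyond the first model overflow pages):

```python
N = 3
buckets = [[] for _ in range(N)]

def h(k):                 # a = 1, b = 0
    return k

def insert(record):
    buckets[h(record[0]) % N].append(record)   # MovieID is the search key

def search(k):            # equality search: O(1) + overflow chain length
    return [r for r in buckets[h(k) % N] if r[0] == k]

for rec in [(20, "Inception", 2010), (15, "Avatar", 2009),
            (52, "Gravity", 2013), (74, "Blue Jasmine", 2013)]:
    insert(rec)

print(search(74))   # 74 MOD 3 = 2: shares bucket 2 with Inception,
                    # so it sits on that bucket's overflow "page"
```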

36 Static Hashing: Insert and Delete
Insert: equality search; find space on the primary bucket page. If there is not enough space, add an overflow page.
Delete: equality search; delete the record. If an overflow page becomes empty, remove it. Primary bucket pages are never removed!

37 Static Hashing: Issues
Since N is fixed, the number of overflow pages might grow and degrade search cost; deletes waste a lot of space. A full reorganization is expensive and could block query processing.
Skew in hashing:
- Could be due to a "bad" hash function that does not "spread" values; but this issue is well-studied/solved.
- Could be due to skew in the data (duplicates); this could cause more overflow pages and is difficult to resolve.
Extendible (dynamic) hashing helps resolve the first two issues.

38 Extendible Hashing
Idea: instead of hashing directly to data pages, maintain a dynamic directory of pointers to data pages. The directory can grow and shrink; it is chosen to double/halve. Search I/O cost: 1 (2 if the directory is not cached).
To search, check the last GD bits of h(k), where GD is the directory's Global Depth; each bucket also has a Local Depth (LD).
[Figure: directory with GD = 2 and entries 00, 01, 10, 11 pointing to data pages: Bucket A (4*, 12*, 32*, 16*), Bucket B (1*, 13*, 41*, 17*), Bucket C (10*), Bucket D (7*, 15*), each with LD = 2.]
Example: Search 17*: 17 is 10001; last 2 bits are 01, so Bucket B. Search 42*: ...10, so Bucket C.

39 Extendible Hashing: Insert
Search for k; if the data page has space, add the record; done.
If the data page has no more space:
- If LD < GD: create a Split Image of the bucket; LD++ for both the bucket and its split image; insert the record properly.
- If LD == GD: create the Split Image and LD++ for both buckets; insert the record; but also double the directory and GD++! Duplicate the other directory pointers properly.
The directory typically grows in spurts. (A sketch of this logic follows.)
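A sketch of the directory bookkeeping only, under simplifying assumptions: keys hash to themselves, buckets are key lists of capacity 4 as in the slides' figures, and a real implementation would re-split if all keys still collide:

```python
CAP = 4                     # data entries per bucket page, as in the figures

class Bucket:
    def __init__(self, ld):
        self.ld, self.keys = ld, []   # Local Depth and data entries

def insert(directory, gd, k):
    b = directory[k & ((1 << gd) - 1)]        # last GD bits of h(k) = k
    if len(b.keys) < CAP:
        b.keys.append(k)
        return directory, gd
    if b.ld == gd:                            # LD == GD: directory doubles
        directory, gd = directory + directory, gd + 1
    b.ld += 1
    image = Bucket(b.ld)                      # split image of bucket b
    pending, b.keys = b.keys + [k], []
    for i in range(len(directory)):           # rewire half of b's pointers
        if directory[i] is b and (i >> (b.ld - 1)) & 1:
            directory[i] = image
    for key in pending:                       # redistribute on last LD bits
        directory[key & ((1 << gd) - 1)].keys.append(key)
    return directory, gd

A, B, C, D = Bucket(2), Bucket(2), Bucket(2), Bucket(2)
A.keys, B.keys, C.keys, D.keys = [4, 12, 32, 16], [1, 13, 41, 17], [10], [7, 15]
directory, gd = [A, B, C, D], 2               # the starting state above
directory, gd = insert(directory, gd, 24)     # LD == GD: GD -> 3, dir doubles
print(gd, A.keys)                             # 3 [32, 16, 24]; A2 got [4, 12]
```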

40 Extendible Hashing: Insert
Example: Insert 24* (binary 11000; last GD = 2 bits are 00). No space in Bucket A (4*, 12*, 32*, 16*)! Need to split and LD++. Since LD was = GD, GD++ and the directory doubles.
[Figure: GD = 3; using the last 3 bits, Bucket A (000) keeps 32*, 16* and gains 24*, while split image Bucket A2 (100) gets 4*, 12*.]

41 Extendible Hashing: Insert
Example: Insert 18* (...010): goes to Bucket C (10*); there is space.
Example: Insert 25* (...001): Bucket B (1*, 13*, 41*, 17*) is full. Need to split Bucket B! Since LD < GD, the directory does not double this time. Only LD++ on the old bucket and its split image, and modify the directory pointers.
[Figure: GD = 3; Bucket B (001, LD = 3) holds 1*, 41*, 17* and receives 25*; split image Bucket B2 (101, LD = 3) holds 13*; Bucket C (10*, 18*) and Bucket D (7*, 15*) keep LD = 2; Buckets A (32*, 16*, 24*) and A2 (4*, 12*) have LD = 3.]

42 NB: Never does a bucket’s LD become > GD!
Extendible Hashing: Delete Search for k; delete record on data page If data page becomes empty, we can merge with another data page with same last GD-1 bits (its “historical split image”) and do LD--; update direc. ptrs If all splits images get merged back and all buckets have LD = GD-1, shrink direc. by half; GD-- NB: Never does a bucket’s LD become > GD!

43 Extendible Hashing: Delete
[Figure: GD = 2; Bucket A (4*, 12*), Bucket B (1*, 13*, 41*, 17*), Bucket C (26*), Bucket D (7*, 15*).]
Example: Delete 41* (...01): removed from Bucket B.
Example: Delete 26* (...10): Bucket C is now empty. Can merge with A; LD--.
Q: Why did we pick A to merge with?
In practice, deletes and, thus, bucket merges are rare; so directory shrinking is even more uncommon.

44 Static Hashing vs Extendible Hashing
Q: Why not let N in static hashing grow and shrink too?
The extendible hashing directory is typically much smaller than the data pages; with static hashing, reorganizing all the data pages is far more expensive.
Hashing skew is a common issue for both: in static hashing, it can lead to a large number of overflow pages; in extendible hashing, it can blow up the directory size (this is OK).
If too many data entries share a search key (duplicates), overflow pages are needed for extendible hashing too!

45 Indexing: Outline
Overview and Terminology; B+ Tree Index; Hash Index; Learned Index

46 Statistical Machine Learning 101
An "ML model" is a program/function to approximate a hidden/unknown function (from input/features to output/target) based on a set of data points.
The set of functions representable by an ML model is called its hypothesis space.
Examples: Linear Regression, Polynomial Regression.

47 Statistical Machine Learning 101
Most ML models are parametric; an ML model's learning algorithm is a program to help pick a specific function from its hypothesis space, i.e., to set the parameter values.
The accuracy of an ML model is quantified by how well the picked function "fits" the observed data.
Two kinds of accuracy: "train" vs "test". Train accuracy is on the same data seen by the learning algorithm; test accuracy is on held-out data and evaluates "generalization".
Different ML models represent different sets of functions and thus typically yield different accuracy.

48 Artificial Neural Networks (ANNs)
An extremely powerful, versatile, and popular ML model.
Can be viewed as a generalization of linear regression: append a non-linearity after a linear regression (the composite is called a "neuron"); line up many neurons (a "layer"); stack many such layers from input to output.
Universal function approximator! A special form of ANNs can even simulate arbitrary Turing Machines!

49 (Switch to Tim Kraska’s slides)
Artificial Neural Networks (ANNs) Different forms of ANNs tailored for different “raw” data representations for input and/or output Multi-Layer Perceptrons (MLPs) deal with a “flat” numeric vector for input and/or output Recurrent Neural Networks (RNNs) tailored for sequentially structured inputs (text, time series, etc.), Convolutional Neural Networks (CNNs) for images/spatially structured data, etc. Bottomline: ANNs can help approximate any function! (Switch to Tim Kraska’s slides)
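The learned-index idea those slides develop (Kraska et al., 2018) can be sketched in a few lines: learn a model from key to position in the sorted data, then correct the prediction with a local search bounded by the model's maximum error. The single linear model below is a simplification; the paper uses a hierarchy of models (the "recursive model index"):

```python
import bisect

keys = [2, 3, 5, 7, 14, 16, 19, 20, 22, 24, 27, 29, 33, 34, 38, 39]
n = len(keys)

# Least-squares fit of position ~ a*key + b (closed form, no libraries)
xs, ys = keys, range(n)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
b = my - a * mx
err = max(abs(y - (a * x + b)) for x, y in zip(xs, ys))  # max model error

def lookup(k):
    guess = int(a * k + b)                   # predicted position of k
    lo = max(0, guess - int(err) - 1)        # only search inside the
    hi = min(n, guess + int(err) + 2)        # model's error window
    i = lo + bisect.bisect_left(keys[lo:hi], k)
    return i if i < n and keys[i] == k else None

assert lookup(24) == keys.index(24) and lookup(8) is None
```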

50 Sorting: Outline
Overview and Warm-up; Multi-way External Merge Sort (EMS); EMS and B+ trees

51 Motivation for Sorting
The user's SQL query has an ORDER BY clause!
Sorting is the first step of bulk loading a B+ tree index.
Sorting is used in the implementations of many relational ops: project, join, set ops, group by aggregate, etc. (next topic!).
Q: But sorting is well known; why should a DBMS bother? Often, the file (relation) to be sorted will not fit in RAM! Hence, "External" Sorting.

52 External Sorting: Overview
Goal: given a relation R with N pages, SortKey A, and M buffer pages (often, M << N), sort R on A to get a sorted R'.
Idea: the sorting algorithm should be disk page I/O-aware!
Desiderata:
- High efficiency, i.e., low I/O cost, even for very large N
- Use sequential I/Os rather than random I/Os as much as possible (AMAP)
- Interleave I/O and computation (DMA); reduce CPU cost too
NB: I/O-aware sorting is also a key part of the implementation of MapReduce/Hadoop!

53 Warm-up: 2-way External Merge Sort
Idea: make Merge Sort I/O-aware!
Sort phase: read each page into buffer memory; do an "internal" sort (use any popular fast sorting algorithm, e.g., quicksort); write it back to disk (a sorted "run").
Merge phase: read 2 runs, merge them on the fly, and write out a new double-length run; recurse till the whole file is one run!
NB: the sort phase is 1 pass; the merge phase is often multi-pass!

54 Warm-up: 2-way External Merge Sort
[Figure: input file of N = 7 pages: (3,4), (6,2), (9,4), (8,7), (5,6), (3,1), (2). Pass 0 (sort phase): sorted 1-page runs (3,4), (2,6), (4,9), (7,8), (5,6), (1,3), (2). Pass 1: 2-page runs (2,3,4,6), (4,7,8,9), (1,3,5,6), (2). Pass 2: 4-page runs (2,3,4,4,6,7,8,9), (1,2,3,5,6). Pass 3: one 8-page run (1,2,2,3,3,4,4,5,6,6,7,8,9): the whole file is sorted!]
Number of passes: 1 (sort phase) + ⌈log2(N)⌉ (merge phase). Each pass does 1 read and 1 write of the whole file: 2N page I/Os per pass.
I/O cost of 2-way EMS for N = 7 pages: 2 × 7 × 4 = 56.
Q: How to reduce this cost further? Make the merge tree wider, i.e., reduce the number of levels; also start out with runs of size B pages, essentially chopping off the base of the binary tree!

55 Sorting: Outline
Overview and Warm-up; Multi-way External Merge Sort (EMS); EMS and B+ trees

56 Multi-way EMS: Motivation
Q: How many buffer pages does 2-way EMS use?
Sort phase: 2 (1 for reading, 1 for writing). Merge phase: 3 (1 for each input run; 1 for the merged output). So 2-way EMS uses only 3 buffer pages!
Idea: why not exploit more buffer pages (say, B >> 3)?
Sort phase: read B pages at a time (not just 1 at a time)! Write out sorted runs of length B pages each (not just 1). The I/O cost of the sort phase is still the same, but it now produces only ⌈N/B⌉ runs.

57 Multi-way EMS: B-way Merge Phase
Idea: in 2-way EMS, we merge 2 sorted runs at a time; in multi-way EMS, we merge B-1 sorted runs at a time!
[Figure: B-1 input buffer pages (INPUT 1, INPUT 2, ..., INPUT B-1) and 1 OUTPUT buffer page, streaming runs from and to disk; total # buffer pages used: B.]
The # passes for the Merge Phase reduces to ⌈log_{B-1}(⌈N/B⌉)⌉.
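A minimal in-memory sketch of one (B-1)-way merge step; Python lists stand in for sorted on-disk runs, and the heap holds each run's current head, mimicking one input buffer page per run:

```python
import heapq

def merge_runs(runs):
    """Merge up to B-1 sorted runs; the heap holds each run's current head."""
    heap = []
    for rid, run in enumerate(runs):
        it = iter(run)
        head = next(it, None)
        if head is not None:
            heap.append((head, rid, it))     # rid breaks ties in the heap
    heapq.heapify(heap)
    out = []                                 # stands in for the output buffer
    while heap:
        val, rid, it = heapq.heappop(heap)
        out.append(val)                      # "write" the smallest head
        nxt = next(it, None)                 # refill that run's input buffer
        if nxt is not None:
            heapq.heappush(heap, (nxt, rid, it))
    return out

# Three of the 2-page runs from the earlier example:
print(merge_runs([[2, 3, 4, 6], [4, 7, 8, 9], [1, 3, 5, 6]]))
```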

58 Multi-way EMS I/O Cost
Overall, # passes = 1 + ⌈log_{B-1}(⌈N/B⌉)⌉; I/O cost per pass = 2N; so the total I/O cost of EMS = 2N × (1 + ⌈log_{B-1}(⌈N/B⌉)⌉).
Example: a file with 100M records of length 0.5KB each; the page size is 8KB; the number of buffer pages for EMS is B = 1000.
Number of pages N = 100M × 0.5KB / 8KB = 6.25M.
Total I/O cost of EMS = 2 × 6.25M × (1 + ⌈log_999(6250)⌉) = 2 × 6.25M × (1 + 2) = 37.5M. (Note: only the ceiling of the log is needed!)
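The example's arithmetic as a sketch; the float-based log needs care when the number of runs is an exact power of B-1:

```python
from math import ceil, log

def ems_io_cost(n_pages, b_pages):
    """Total I/O = 2N * (1 + ceil(log_{B-1}(ceil(N/B)))) page I/Os."""
    runs = ceil(n_pages / b_pages)               # runs after the sort phase
    merge_passes = ceil(log(runs, b_pages - 1))  # each pass merges B-1 runs
    return 2 * n_pages * (1 + merge_passes)

N = ceil(100_000_000 * 0.5 / 8)   # 100M 0.5KB records on 8KB pages: 6.25M
print(ems_io_cost(N, 1000))       # 2 * 6.25M * (1 + 2) = 37,500,000
```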

59 Multi-way EMS I/O Cost
Total number of passes:
N (pages) | Naive 2-way EMS | B=1K | B=10K | B=100K
1M | 21 | 3 | 2 | 2
10M | 25 | 3 | 2 | 2
100M | 28 | 3 | 3 | 2
1B | 31 | 4 | 3 | 2
(With an 8KB page, B = 100K buffer pages is about 782MB of RAM, and N = 1B pages is a 7.5TB file!)
Only 2 passes to sort up to 74.5TB! (2 is the lower bound for EMS when the file does not fit in RAM.)

60 Multi-way EMS: Improvements
While already efficient, some key algorithmic and systems-oriented improvements have been made to multi-way EMS to reduce the overall runtime (not just the counted I/O cost).
Three prominent improvements: replacement sort (aka heap sort) as the internal sort; "blocked" I/O; double buffering.

61 Improvement 1: Replacement Sort
In standard EMS, quicksort is used during the Sort Phase; it produces runs of length B pages each.
Replacement sort is an alternative for the Sort Phase: it produces runs of average length 2B pages each, so the number of runs is reduced on average to ⌈N/(2B)⌉.
It maintains a sorted heap in B-2 pages, with 1 page for reading input and 1 for sorted output.
Slightly higher CPU cost, but significantly lower I/O cost. (We are skipping the details of this algorithm; a toy sketch follows.)
New total I/O cost = 2N × (1 + ⌈log_{B-1}(⌈N/(2B)⌉)⌉).
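Since the details are skipped here, this is only a toy sketch of replacement selection; single values stand in for pages, and heap_size plays the role of the B-2 heap pages:

```python
import heapq

def replacement_selection(stream, heap_size):
    """Emit sorted runs whose average length exceeds heap_size."""
    it = iter(stream)
    heap = [x for _, x in zip(range(heap_size), it)]
    heapq.heapify(heap)
    runs, run, frozen = [], [], []
    while heap:
        smallest = heapq.heappop(heap)
        run.append(smallest)                 # append to the current run
        nxt = next(it, None)
        if nxt is not None:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)    # can still join this run
            else:
                frozen.append(nxt)           # too small; wait for next run
        if not heap:                         # current run is complete
            runs.append(run)
            run, heap, frozen = [], frozen, []
            heapq.heapify(heap)
    if run:
        runs.append(run)
    return runs

# Values of the earlier 2-way EMS example, fed through a 3-slot heap:
print(replacement_selection([3, 4, 6, 2, 9, 4, 8, 7, 5, 6, 3, 1, 2], 3))
# -> [[3, 4, 6, 9], [2, 4, 5, 6, 7, 8], [1, 2, 3]]: runs longer than 3
```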

62 Improvement 2: "Blocked" I/O
The Merge Phase did not recognize the distinction between sequential I/O and random I/O! The time difference is not reflected in simply counting I/Os.
Idea: read a "block" of b pages of each run at a time!
So only ⌊B/b⌋ - 1 runs can be merged at a time; b controls the trade-off of # passes vs time per pass.
"Fan-in" of the Merge Phase: F = ⌊B/b⌋ - 1.
New total I/O cost = 2N × (1 + ⌈log_F(⌈N/B⌉)⌉), or 2N × (1 + ⌈log_F(⌈N/(2B)⌉)⌉) with replacement sort.

63 Improvement 3: Double Buffering
Most machines have DMA, which enables I/O-CPU parallelism. It is trivially feasible to exploit DMA in the Sort Phase, but in the Merge Phase the CPU is blocked by I/O for the runs.
Idea: allocate double the buffers for each run; while the CPU processes one set, read pages (I/O) into the other set!
So only ⌊B/(2b)⌋ - 1 runs can be merged at a time; the new fan-in of the Merge Phase is F = ⌊B/(2b)⌋ - 1.
New total I/O cost = 2N × (1 + ⌈log_F(⌈N/B⌉)⌉), or 2N × (1 + ⌈log_F(⌈N/(2B)⌉)⌉) with replacement sort.

64 Sorting: Outline
Overview and Warm-up; Multi-way External Merge Sort (EMS); EMS and B+ trees

65 Using B+ Tree for EMS
Suppose we already have a B+ tree index with the SortKey equal to (or a prefix of) the IndexKey. The data entries of the B+ tree are already in sorted order!
Q: Is it a "good" idea to simply read the leaf level of the B+ tree to achieve the EMS then?
It depends: on whether the index is clustered or not! Clustered: good idea! Unclustered: might be really bad!

66 Using Clustered B+ Tree for EMS
Go down the tree to reach the left-most leaf; scan the leaf pages (data entries) left to right.
If AltRecord, done! Otherwise, retrieve the data pages pointed to by successive data entries.
[Figure: index level over data entries over data pages; an almost-clustered file. Useful for a range predicate + ORDER BY.]
Notes: with Alternative 1 (AltRecord), the index may occupy more pages than the base data, since on average a B+ tree page is only 67% full; also, tuples within a page may not be sorted.
I/O cost if AltRecord: height + # leaf pages. I/O cost otherwise: height + # leaf pages + # data pages.
Either way, the I/O cost is often << from-scratch EMS!

67 Using Unclustered B+ Tree for EMS
Unclustered means not AltRecord! (Why?)
Same procedure as for the clustered B+ tree; same I/O "cost" as for a clustered tree with AltRID/AltRIDlist, but with many back-to-back random I/Os: thrashing!
Usually much slower than from-scratch EMS.
Q: But when is this faster than from-scratch EMS?

68 External Sorting as Competitive Sport!
The geekiest "sport" in the world: sortbenchmark.org

