Indexing: Why?
- We need to locate the actual records on disk without having to read the entire table into memory.
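Single Attribute Index (p1)
[Figure: a relation with values a_1, a_2, ..., a_i, ..., a_n of attribute A, an index over the same values, and blocks b_1, b_2, ..., b_i, ..., b_n]
- Equality queries: A = val
- Range queries: A > low, A < high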
Where does the data live?
Index files for a relation R can occur in three forms:
(1) Data entries store the actual data records of R. The index file provides both indexing and storage.
(2) Data entries store pairs <k, rid>:
    - k: a value of the search key
    - rid: the rid of a record having search key value k
    - The actual data record is stored somewhere else.
(3) Data entries store pairs <k, rid-list>:
    - k: a value of the search key
    - rid-list: the list of rids of all records with search key value k
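A minimal sketch of the three forms, built over a toy relation. The records, the rids ("r1", ...) and the choice of "age" as search key are made up for illustration; real systems store these entries in pages on disk, not in Python lists.

```python
from collections import defaultdict

# Toy relation R, keyed by rid. All values here are hypothetical.
records = {
    "r1": {"name": "Ann", "age": 30},
    "r2": {"name": "Bob", "age": 25},
    "r3": {"name": "Cyd", "age": 30},
}

# Form (1): the data entries ARE the records, kept in search-key order.
alt1 = sorted(records.values(), key=lambda rec: rec["age"])

# Form (2): one <k, rid> pair per record; the record lives elsewhere.
alt2 = sorted((rec["age"], rid) for rid, rec in records.items())

# Form (3): one <k, rid-list> pair per distinct search key value.
groups = defaultdict(list)
for k, rid in alt2:
    groups[k].append(rid)
alt3 = sorted(groups.items())

print(alt2)  # pairs sorted by key, then rid
print(alt3)  # one entry per distinct key
```

Form (3) stores each key value once, which saves space when many records share a key, at the price of variable-length data entries.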
Primary / Secondary Index
- An index is said to be a primary index if its search key contains the primary key.
- Otherwise, the index is a secondary index on that relation.
- Careful about terminology! "Primary" and "key" have different meanings here!
Clustered vs Unclustered Index
- Clustered index: the data records in the file are organized in the same way as the data entries in the index.
    - If the data is stored in the index, then the index is clustered by definition. This is option (1) from the previous slide.
    - Otherwise, the data file must be sorted to match the index organization.
- Un-clustered index: the organization of the data entries in the index is independent of the organization of the data records.
    - This can only arise with options (2) and (3).
- A file storing a relation R can have only one clustered index, but many un-clustered indices. Why?
Clustered vs. Unclustered Index
- Suppose that Alternative (2) is used for data entries, and that data records are stored in a heap file.
- To build a clustered index, first sort the heap file (leaving some free space on each page for future inserts).
- Overflow pages may be needed for inserts. (Thus, the order of data records is "close to", but not identical to, the sort order.)
[Figure: CLUSTERED vs. UNCLUSTERED. In both cases, index entries direct the search for data entries in the index file, and the data entries point to data records in the data file.]
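To see why clustering matters for range queries, here is a rough sketch that counts how many distinct data pages a range scan touches. The page size, the number of records and the scattering function are arbitrary assumptions chosen to keep the example small and deterministic.

```python
PAGE_SIZE = 4   # records per data page (assumed)
N = 32          # records with keys 0..31 (assumed)

def pages_touched(position_of, low, high):
    """Count distinct data pages holding records with low <= key <= high."""
    return len({position_of(k) // PAGE_SIZE for k in range(low, high + 1)})

def clustered_pos(k):
    # Clustered: the data file is sorted, so key k sits at position k.
    return k

def unclustered_pos(k):
    # Unclustered: records are scattered. (k * 5) % N is a fixed
    # permutation of 0..N-1 standing in for an arbitrary heap order.
    return (k * 5) % N

print(pages_touched(clustered_pos, 8, 15))    # 2
print(pages_touched(unclustered_pos, 8, 15))  # 7
```

With the clustered layout the eight matching records share two pages; with the scattered layout almost every record costs its own page read.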
Single Attribute Index (p2)
- Sparse indexes
    - Require an index entry for every n tuples (comprising a block)
    - Require the tuples to be laid out across blocks in search-key order
- Dense indexes
    - Require an index entry for every tuple
    - Since all search entries are inside the index, the order of the data file is free
- Memory-resident indexes are faster and thus preferred!
[Figure: index over values a_1, a_2, ..., a_i, ..., a_n answering A = val, A > low, A < high]
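The difference between the two lookups can be sketched as follows. The block contents are made up, and a Python dict stands in for the dense index structure; this is an illustration of the idea, not how a DBMS lays out either index.

```python
from bisect import bisect_right

# Data file sorted on A, three tuples per block (toy sizes, assumed).
blocks = [[2, 5, 7], [11, 13, 17], [19, 23, 29]]

# Sparse index: one entry per block, holding the block's first key.
sparse = [blk[0] for blk in blocks]

# Dense index: one entry per tuple, mapping key -> (block number, slot).
dense = {key: (b, s) for b, blk in enumerate(blocks) for s, key in enumerate(blk)}

def sparse_lookup(val):
    """Pick the only block that could hold val, then scan inside it."""
    b = bisect_right(sparse, val) - 1
    if b < 0 or val not in blocks[b]:
        return None
    return (b, blocks[b].index(val))

def dense_lookup(val):
    """The index alone answers the lookup; no block scan is needed."""
    return dense.get(val)

print(sparse_lookup(13), dense_lookup(13))  # (1, 1) (1, 1)
```

The sparse lookup only works because the blocks are in key order; the dense lookup would still work if the blocks were shuffled.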
B Trees
- B Trees implement the idea of a binary search tree, but with a few hundred pointers per node, not just two.
- B+ Trees are an extension of the B Tree (balanced tree):
    - Copies of the keys are stored in the internal nodes
    - The keys and records are stored in the leaves
    - A leaf node may include a pointer to the next leaf node to speed up sequential access
- When we say B Tree, we really mean B+ Tree
- Widely used in databases and file systems
- B Trees support equality as well as range queries
B+ Tree Indexes
[Figure: non-leaf pages above a row of leaf pages, which are sorted by search key]
- Leaf pages contain data entries, and are chained (prev & next)
- Non-leaf pages contain index entries, which are only used to direct searches
- Layout of a non-leaf page: P_0 K_1 P_1 K_2 P_2 ... K_m P_m (pointers P_i alternating with keys K_i)
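B-Tree Example
[Figure: three-level tree. Root node: 63. Intermediate nodes: (36) and (84, 91). Leaf nodes hold the keys 15, 57, 63, 76, 87, 92, 100, are chained left to right with the last next-pointer null, and point to the data records.]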
Meaning of Internal Node
For an internal node with keys 84 and 91, the three pointers lead to:
- left: key < 84
- middle: 84 ≤ key < 91
- right: 91 ≤ key
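Meaning of Leaf Nodes
For a leaf node with keys 63 and 76:
- each key comes with a pointer to its record (e.g. the pointer to record 63)
- one further pointer leads to the next leaf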
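Equality Predicates
Query: key = 87
[Figure: search on the example tree. Root: 63. Intermediate nodes: (36) and (84, 91). Leaf keys: 15, 36, 57, 63, 76, 87, 92, 100, with the last next-pointer null.]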
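Range Predicates
Query: 57 ≤ key < 95
[Figure: search on the example tree. Root: 63. Intermediate nodes: (36) and (84, 91). Leaf keys: 15, 36, 57, 63, 76, 87, 92, 100, with the last next-pointer null.]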
Example B+ Tree
- Note how the data entries in the leaf level are sorted
[Figure:
  Root: 17 (left subtree: entries <= 17; right subtree: entries > 17)
  Index nodes: (5, 13) and (27, 30)
  Leaves: 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29* | 33* 34* 38* 39*]
- Find 28*? 29*? All > 15* and < 30*?
- Insert/delete: find the data entry in the leaf, then change it. Sometimes the parent, or even further ancestors, must be adjusted.
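The three searches asked for on this slide can be traced with a small in-memory sketch of the example tree. The class names and helper functions are hypothetical, and the routing rule (a key equal to a separator goes to the right child) is an assumption consistent with the "84 ≤ key < 91" convention used earlier.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None          # chain to the next leaf

class Inner:
    def __init__(self, keys, children):
        self.keys = keys          # separator keys
        self.children = children  # one more child than keys

# Leaves of the example tree, chained left to right.
leaves = [Leaf(k) for k in ([2, 3], [5, 7, 8], [14, 16],
                            [22, 24], [27, 29], [33, 34, 38, 39])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right

root = Inner([17], [Inner([5, 13], leaves[:3]), Inner([27, 30], leaves[3:])])

def find_leaf(node, key):
    """Descend from the root to the leaf that could contain key."""
    while isinstance(node, Inner):
        node = node.children[bisect_right(node.keys, key)]
    return node

def contains(key):
    return key in find_leaf(root, key).keys

def range_search(low, high):
    """Return all keys with low < key < high, following the leaf chain."""
    leaf, out = find_leaf(root, low), []
    while leaf is not None:
        for k in leaf.keys:
            if k >= high:
                return out
            if k > low:
                out.append(k)
        leaf = leaf.next
    return out

print(contains(28), contains(29))  # False True
print(range_search(15, 30))        # [16, 22, 24, 27, 29]
```

The range search descends the tree only once, to the leaf for the lower bound, and then walks the leaf chain until it passes the upper bound.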
General B-Trees
- Number of keys: n
- Number of pointers: n + 1
- All leaves at the same depth
- All (key, record pointer) pairs in the leaves
- Node size should be at least:
    - Root: 2 pointers
    - Internal nodes: (n+1)/2 pointers
    - Leaf nodes: (n+1)/2 pointers to data
[Figure: internal and leaf nodes at maximum and minimum occupancy, using the keys 5, 15, 21, 31, 42, 56]
Definitions:
- Order: n
- Fanout: average number of pointers out of a node
- (½ Order) < Fanout < Order
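The occupancy bounds can be written down as a small helper. The slide gives (n+1)/2 without saying how to round when n is even; the sketch below assumes the common textbook convention of rounding up for internal nodes and down for leaves, and the function names are made up.

```python
def max_pointers(n: int) -> int:
    """A full node of order n holds n keys and n + 1 pointers."""
    return n + 1

def min_pointers(n: int, kind: str) -> int:
    """Minimum number of pointers a node of order n must hold."""
    if kind == "root":
        return 2
    if kind == "internal":
        return (n + 2) // 2   # ceil((n + 1) / 2): rounding up is an assumption
    if kind == "leaf":
        return (n + 1) // 2   # floor((n + 1) / 2): rounding down is an assumption
    raise ValueError(f"unknown node kind: {kind}")

for n in (3, 4):
    print(n, max_pointers(n), min_pointers(n, "internal"), min_pointers(n, "leaf"))
```

For odd n the two conventions agree with the slide exactly, e.g. n = 3 gives a minimum of 2 pointers for both internal and leaf nodes.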
Rules for B-Trees
- Two constants, MINIMUM and MAXIMUM, determine the number of entries stored in a node
- Rule 1: Every node (other than the root) has at least MINIMUM entries
- Rule 2: Every node has at most MAXIMUM entries
- These two constants govern when overfull nodes should be split and when overly sparse nodes should be merged.
Rules for B-Trees (cont)
- Rule 3: Each node of a B-tree contains a partially filled array of entries, sorted from smallest to largest
- Rule 4: The number of subtrees below a non-leaf node is one more than the number of entries in the node
    - Example: if a node has 10 entries, it has 11 children
    - Entries in the subtrees are organized according to Rule 5
- Rule 5: For any non-leaf node: the entries are ordered
- Rule 6: Every leaf has the same depth
- Consequence: automatic rebalancing upon insertion/deletion
- On-line demo: https://www.cs.usfca.edu/~galles/visualization/BTree.html
Cost of B-Tree Operations
- Height of B-Tree: H
- Assume no duplicates
- Assume no blocks in memory
- What is the random I/O cost of:
    - Insertion:
    - Deletion:
    - Equality search:
    - Range search:
- Now assume the root and intermediate nodes are in memory, but not the leaf nodes and data blocks
    - What are the I/O costs?
B+ Trees in Practice
- Typical order: 200. Typical fill-factor: 67%.
    - Average fanout = 133
- Typical capacities:
    - Height 2: 133^2 = 17,689 entries
    - Height 3: 133^3 = 2,352,637 entries
    - Height 4: 133^4 = 312,900,721 entries
- Can often hold the top levels in the buffer pool:
    - Level 1 = 1 page = 8 Kbytes
    - Level 2 = 133 pages ≈ 1 MByte
    - Level 3 = 17,689 pages ≈ 133 MBytes
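The arithmetic behind these figures can be reproduced in a few lines. The sketch assumes that the 67% fill-factor stands for two thirds (which is what yields a fanout of 133) and takes the 8 Kbyte page size from the slide; the variable names are made up.

```python
ORDER = 200
PAGE_KBYTES = 8                 # page size taken from the slide

# Assumption: "67%" is two thirds, rounded down to a whole fanout.
fanout = ORDER * 2 // 3         # 133

# Entries reachable in a tree of the given height.
capacity = {h: fanout ** h for h in (2, 3, 4)}

# Pages on each of the top three levels, and their total size in Kbytes.
pages = {level: fanout ** (level - 1) for level in (1, 2, 3)}
kbytes = {level: n * PAGE_KBYTES for level, n in pages.items()}

for h, c in capacity.items():
    print(f"height {h}: {c:,} entries")
for level in pages:
    print(f"level {level}: {pages[level]:,} pages, {kbytes[level]:,} Kbytes")
```

The MByte figures quoted for levels 2 and 3 are rounded; the printed Kbyte totals show the exact values.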