CACHE-CONSCIOUS INDEXES

Slides:



Advertisements
Similar presentations
Hash-Based Indexes Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY.
Advertisements

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11.
Dr. Kalpakis CMSC 661, Principles of Database Systems Index Structures [13]
CS4432: Database Systems II
CS CS4432: Database Systems II Basic indexing.
B+-tree and Hashing.
Last Time –Main memory indexing (T trees) and a real system. –Optimize for CPU, space, and logging. But things have changed drastically! Hardware trend:
Cache Conscious Indexing for Decision-Support in Main Memory Pradip Dhara.
1 Database indices Database Systems manage very large amounts of data. –Examples: student database for NWU Social Security database To facilitate queries,
CSE 326: Data Structures B-Trees Ben Lerner Summer 2007.
Making B+-Trees Cache Conscious in Main Memory
Indexing and Hashing (emphasis on B+ trees) By Huy Nguyen Cs157b TR Lee, Sin-Min.
1 Multiway trees & B trees & 2_4 trees Go&Ta Chap 10.
 B+ Tree Definition  B+ Tree Properties  B+ Tree Searching  B+ Tree Insertion  B+ Tree Deletion.
ICS 220 – Data Structures and Algorithms Week 7 Dr. Ken Cosh.
Spring 2006 Copyright (c) All rights reserved Leonard Wesley0 B-Trees CMPE126 Data Structures.
B+ Trees COMP
ALGORITHMS FOR ISNE DR. KENNETH COSH WEEK 6.
Fractal Prefetching B + -Trees: Optimizing Both Cache and Disk Performance Author: Shimin Chen, Phillip B. Gibbons, Todd C. Mowry, Gary Valentin Members:
School of Engineering and Computer Science Victoria University of Wellington Copyright: Xiaoying Gao, Peter Andreae, VUW B Trees and B+ Trees COMP 261.
B+ Trees  What if you have A LOT of data that needs to be stored and accessed quickly  Won’t all fit in memory.  Means we have to access your hard.
B-TREE. Motivation for B-Trees So far we have assumed that we can store an entire data structure in main memory What if we have so much data that it won’t.
1 Query Processing Part 3: B+Trees. 2 Dense and Sparse Indexes Advantage: - Simple - Index is sequential file good for scans Disadvantage: - Insertions.
Database Applications (15-415) DBMS Internals- Part III Lecture 13, March 06, 2016 Mohammad Hammoud.
COMP261 Lecture 23 B Trees.
TCSS 342, Winter 2006 Lecture Notes
B/B+ Trees 4.7.
Multiway Search Trees Data may not fit into main memory
Topics 10: Cache Conscious Indexes
CS522 Advanced database Systems
COP Introduction to Database Structures
CSE 332 Data Abstractions B-Trees
Extra: B+ Trees CS1: Java Programming Colorado State University
B+ Trees What are B+ Trees used for What is a B Tree What is a B+ Tree
Hash-Based Indexes Chapter 11
Lecture 22 Binary Search Trees Chapter 10 of textbook
B+ Tree.
B-Trees Disk Storage What is a multiway tree? What is a B-tree?
Interval Heaps Complete binary tree.
Database Applications (15-415) DBMS Internals- Part III Lecture 15, March 11, 2018 Mohammad Hammoud.
Data Structures and Algorithms
Introduction to Database Systems
B+ Trees What are B+ Trees used for What is a B Tree What is a B+ Tree
B-Trees.
CS222: Principles of Data Management Notes #8 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
Hash-Based Indexes Chapter 10
B- Trees D. Frey with apologies to Tom Anastasio
CS222P: Principles of Data Management Notes #8 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
Virtual Memory Hardware
Hash-Based Indexes Chapter 11
Index tuning Hash Index.
Multiway Trees Searching and B-Trees Advanced Tree Structures
B-Trees Disk Storage What is a multiway tree? What is a B-tree?
B-Trees Disk Storage What is a multiway tree? What is a B-tree?
Database Systems (資料庫系統)
LINEAR HASHING E0 261 Jayant Haritsa Computer Science and Automation
Database Design and Programming
CSE 373: Data Structures and Algorithms
Multiway Search Tree (MST)
Lecture 13: Computer Memory
CSE 373 Data Structures and Algorithms
CSE 373: Data Structures and Algorithms
Priority Queues CSE 373 Data Structures.
CS703 - Advanced Operating Systems
Chapter 11 Instructor: Xin Zhang
Indexing, Access and Database System Architecture
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #07 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
Heaps.
Index Structures Chapter 13 of GUW September 16, 2019
Presentation transcript:

CACHE-CONSCIOUS INDEXES Jayant Haritsa Computer Science and Automation Indian Institute of Science

Prior Main Memory Indexes Binary search, T-Tree(balanced binary tree with several elements in a node) [1986]: minimize space overhead not cache conscious, poor search Enhanced B+-tree [1983]: 100% filled, hard-code node search more cache conscious, better search (node = cache line), but child pointers required Hashing: problem for ordered access, skewed data Cache-Sensitive Search (CSS) Tree: VLDB99

CSS-Trees Basic Idea: Eliminate pointers by embedding a complete search tree into an array (directory array on top of sorted data array). structure in an array internal nodes leaf nodes 1 15 16 30 31 80 1 2 3 4 5 49 50 64 6-10 11-15 16-20 21-25 26-30 child nodes of node i : 5*i+1 to 5*i+5 The values in the nodes are node identifiers, NOT key values. Total number of keys is 65 * 4 = 260 (leaves go from nodes 16 to 80 = 65 nodes). The tree is built bottom-up using the values of the underlying sorted-array. The real structure is this: The directory only goes from 0 to 15. Then there is a *separate array* which is fully sorted in key order (i.e 31 to 80 to 16 to30). Therefore, when you use the directory check whether the offset is within directory or outside. If outside, use the logical mapping (I and II) to find the equivalent offset in the sorted array. [Better to read the CSS paper itself rather than this writeup. 2017] internal node 31-55 56-80 leaf node A CSS-Tree (m = 4)

Tree Structure A complete (m+1)- ary search tree All nodes of tree are stored contiguously All nodes are by addressed by offset from start address of first node Children of internal node numbered b, are numbered from b(m+1)+1 to b(m+1)+m+1 Key number i maps to node number i/m By this we can calculate address of any key Built bottom-up Updates require rebuilding of index Complete (m+1)- ary tree is a tree in which nodes are filled level by level in left to right order

Utilization of a Cache line Cache line size = 32 bytes B+-Tree node CSS-Tree node K1 K2 K3 K4 K5 K6 K7 K8 P0 K1 P1 K2 P2 K3 P3 4 branches 9 branches Benefits for CSS-Trees: double number of keys per node fewer levels, less space search 20% faster Cache line size is 32 here, therefore the B+-tree node has one empty slot

Can we make B+-Trees more cache conscious? B+-Trees vs. CSS-Trees Search Performance CSS-Trees ? B+-Trees Update Performance Can we make B+-Trees more cache conscious?

CSB-TREES CSB – cache-sensitive B-trees

Tree Structure Balanced multiway search tree CSB+ -Tree of order d contains m keys, where d<= m <= 2*d. All child nodes of a node in one node group. Node group is stored contiguously and therefore each node in the group can be accessed using an offset to the first node in the group Instead of m+1 pointers only one pointer to first child is needed  partial pointer elimination technique  more keys per node

Node Structure Each node size = cache line Internal Node Leaf Node nKeys: number of keys in the node firstChild: pointer to first child node keyList[2d]: list of keys Leaf Node Set of <key, tupleID> pairs LSP / RSP: left/right sibling pointers Why is nKeys required? To do binary search? Reasons for having both left/right sibling pointers: 1) Can perhaps use in merge operations to identify neighbors ? 2) Can use such that searches have less interference with updates ? i.e. while update is happening at low value, do search from high value downwards (interesting research problem?)

CSB+- tree example (Order 2) K14 - - b b+t t K22 - - K11 K12 K13 K21 K23 Contiguous Node group contiguously stored

A CSB+-Tree of Order 1 [1,2] keys per node, node group max size = 3 22 Node group 7 30 3 13 19 25 33 Node size = 64 bytes, Order 1 means number of keys in each node is [1,2] and that there are most 3 nodes in each nodegroup 2 3 5 7 12 13 16 19 20 22 24 25 27 30 31 33 36 39

Searching a CSB+-Tree Similar to B+ -Tree Search for: 16 22 7 30 3 13 19 25 33 2 3 5 7 12 13 16 19 20 22 24 25 27 30 31 33 36 39

Node Search Techniques Basic approach: Binary search using standard while loop Code expansion approach: Loop unfolding by using if-then-else Pad unused slots with highest possible key  no counter arithmetic Uniform approach: Make the binary unfolding “right-deep” Single traversal function Variable approach: Traversal function dependent on key cardinality Offline compute optimal traversal functions Inline appropriate function at run time

Example of Uniform approach Search for value ‘a’ expanded code for max key cardinality Keys K1,K2,K3,K4 if (a > K2) { if (a > K3) if (a > K4) return p5; else return p4; } return p3; if (a > K1) return p2; return p1; K2 K1 K3 Larger If node is not full, less chance that control goes inside it K4 Smaller

Insertion b2 Parent b1 Parent n1 n3 n4 n1 n2 n3 n4 n2 This much of space is reallocated Initially b1 was pointing to space where n1,n3,n4 are contiguously stored Now b2 is pointing to new allocated space where n1,n2 n3,n4 are contiguously stored If the parent is full then parent will also spilt and process is repeated

Insert Cost Split cost higher since whenever there is a new split, CSB have to create a new node group, whereas B-trees have to only create a new node

Example Insertion of 6 22 7 30 3 13 19 25 33 2 3 5 7 12 13 16 19 20 22 24 25 27 30 31 33 36 39 6

Insert 6 (contd) 22 7 30 3 13 19 25 33 2 3 5 7 12 13 16 19 20 22 24 25 27 30 31 33 36 39 2 3 5 6 7

Insert 6 (contd) 22 7 30 3 6 13 19 25 33 12 13 16 19 20 22 24 25 27 30 31 33 36 39 2 3 5 6 7

Insert 17 22 7 30 3 6 13 19 25 33 17 12 13 16 19 20 22 24 25 27 30 31 33 36 39 2 3 5 6 7

Insert 17 (contd) 22 7 30 3 6 13 19 25 33 12 13 16 17 20 22 24 25 27 30 31 33 36 39 2 3 5 6 7 19 20 22

Insert 17 (contd) 22 7 30 17 3 6 13 19 25 33 12 13 16 17 24 25 27 30 31 33 36 39 2 3 5 6 7 19 20 22

Insert 17 (contd) 22 7 30 3 6 13 19 25 33 3 6 13 17 19 12 13 16 17 24 25 27 30 31 33 36 39 2 3 5 6 7 19 20 22

Insert 17 (contd) 22 17 7 30 25 33 3 6 13 19 12 13 16 17 24 25 27 30 31 33 36 39 2 3 5 6 7 19 20 22

Insert 17 (contd) 22 7 17 30 25 33 3 6 13 19 12 13 16 17 24 25 27 30 31 33 36 39 2 3 5 6 7 19 20 22

Deletion Deletion is done lazily It may result in space underuse (less than 50%) but does not require adjustment of tree

SEGMENTED CSB-TREES

Motivation As cache line sizes increase, impact of insertion cost is higher since node groups become proportionally larger Example: cache line size of 128 bytes  30 Keys  Node group size = 128 * (30+1)  4KB  Copying of 4 KB for each split 128 bytes  30Keys because 4 bytes for nKeys, 4 bytes for child pointer + 30 * 4 bytes for keys

Segmented CSB+-Trees divide child nodes into segments nodes in each segment are stored contiguously store pointers to each child segment in parent node 3 7 13 19 2 3 5 7 12 13 16 19 20 22 Benefit: copying less data on a split Problem: Search becomes slower

FULL CSB-TREES

Full CSB+-Trees preallocate space for a full node group benefits: cheaper copying on a split move the nodes into the free space less memory allocation overhead triggered by node group overflow, not node overflow space overhead: 30% Note that the free space is spatially close

PERFORMANCE

Benefit of CSB+-Trees Search is almost as efficient as CSS-Trees n is no of leaf nodes, c is cache line size (64 bytes), m = no of slots in node Incremental updates similar to Btrees creating new node groups when split Cache Misses Typical Value (m=16) B+-Tree log2n/(log2m -1) log2n/3.0 CSS-Tree log2n/log2(m+1) log2n/4.1 CSB+-Tree log2n/log2(m-1) log2n/3.9 Here, n is number of leaf nodes being indexed, m is number of slots in node, which is equal to cache-line-size/size of object (key, ptr). In current case, cache-line-size is 64 and object size is 4, hence m = 16. Therefore, the values in the Typical Value column are 3, 4.1, and 3.9.

Update Cost = Search Cost + Split Cost Update Analysis Update Cost = Search Cost + Split Cost Cache lines Accessed Typical Values on a Split (in cache lines) for m = 16 B+-Tree 2 2 CSB+-Tree (m-1)*2 30 CSB+-Tree (t seg) (m-2t+1)*2/t 13 CSB+-Tree (full) (m-1)/2 7.5 Here, m = 16.

If enough space, choose Full CSB+-Tree Feature Comparison Search Update Space MM Overhead B+-Tree slower faster medium medium CSB+-Tree faster slower lower higher CSB+-Tree (t seg) medium medium lower higher CSB+-Tree (full) faster faster higher lower If enough space, choose Full CSB+-Tree MM Overhead is Memory Management Overhead If searches > updates (e.g., on-line shopping), choose variants of CSB+-Tree

Looking into the Future widening gap between CPU and memory speed operating system: 32-bit to 64-bit larger data set: longer search paths Cache sensitive indexing structure even more useful in the future

END CC INDEXES E0 261