Improving Index Performance through Prefetching
Shimin Chen, Phillip B. Gibbons†, and Todd C. Mowry
School of Computer Science, Carnegie Mellon University
†Information Sciences Research Center, Bell Laboratories

Databases and the Memory Hierarchy
Traditional focus: buffer pool management (DRAM as a cache for disk).
Important focus today: processor cache performance (SRAM as a cache for DRAM), e.g., [Ailamaki et al., VLDB '99], etc.
[Figure: the memory hierarchy from CPU through L1 cache, L2/L3 cache, main memory, and disk; levels grow larger, slower, and cheaper moving away from the CPU.]

Index Structures
Used extensively in databases to accelerate performance (selections, joins, etc.).
Common implementation: B+-Trees, built from non-leaf nodes and leaf nodes.
[Figure: a B+-Tree with non-leaf nodes above a level of leaf nodes.]

B+-Tree Indices: Common Access Patterns
Search: locate a single tuple.
Range scan: locate a collection of tuples within a range.

Cache Performance of B+-Tree Indices
A main-memory B+-Tree containing 10M keys:
Search: 100K random searches.
Scan: 100 range scans of 1M keys each, starting at random keys.
Detailed simulations based on a Compaq ES40 system.
Most of the execution time is wasted on data cache misses: 65% for searches, 84% for range scans.
[Figure: execution time broken down into data cache stalls, other stalls, and busy time.]

B+-Trees: Optimizing Search for Cache vs. Disk
To minimize the number of data transfers (I/O or cache misses), the optimal node width equals the natural data transfer size:
for disk: the disk page size (~8 KB);
for cache: the cache line size (~64 bytes).
Cache-optimized trees therefore have much narrower nodes and greater height, and their search performance is more sensitive to changes in branching factor.
[Figure: a wide, shallow disk-optimized tree vs. a narrow, tall cache-optimized tree.]

Previous Work: "Cache-Sensitive B+-Trees" (CSB+-Trees)
Rao and Ross [SIGMOD 2000].
Key insight: nearly all child pointers can be eliminated by restricting the data layout, roughly doubling the branching factor of cache-line-sized non-leaf nodes.
[Figure: node layouts of B+-Trees vs. CSB+-Trees; in the CSB+-Tree, each node's children are laid out so that per-child pointers are unnecessary.]

Impact of CSB+-Trees on Search Performance
Search is 15% faster due to the reduction in tree height.
However: update performance is worse [Rao & Ross, SIGMOD '00], and range scan performance does not improve.
There is still significant room for improvement.
[Figure: execution time of B+-Tree vs. CSB+-Tree, broken into data cache stalls, other stalls, and busy time.]

Latency Tolerance in Modern Memory Hierarchies
Modern processors overlap multiple simultaneous cache misses; e.g., the Compaq ES40 supports 8 off-chip misses per processor.
Prefetch instructions allow software to fully exploit this parallelism.
What dictates performance is NOT simply the number of cache misses, but rather the amount of exposed miss latency.
[Figure: prefetch instructions (e.g., pref 0(r2), pref 4(r7)) issued by the CPU keep multiple misses in flight across the L1 cache, L2/L3 cache, and main memory.]

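The deck does not show the software side of issuing prefetches; as a minimal sketch (not from the talk), here is how a C program can keep several misses in flight using GCC's __builtin_prefetch, the portable analogue of the slide's "pref" instructions. The prefetch distance of 16 is illustrative, not tuned.

    #include <stddef.h>

    /* Issue prefetches well ahead of use so the memory system overlaps
     * the misses instead of serializing them. */
    long dot_with_prefetch(const int *a, const int *b, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            /* Fetch data needed ~16 iterations ahead.  Prefetches past
             * the end of the arrays are harmless: prefetch instructions
             * do not fault. */
            __builtin_prefetch(&a[i + 16], 0 /* read */, 1 /* low locality */);
            __builtin_prefetch(&b[i + 16], 0, 1);
            sum += (long)a[i] * b[i];
        }
        return sum;
    }
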
Our Approach
New proposal: "Prefetching B+-Trees" (pB+-Trees): use prefetching to reduce the amount of exposed miss latency.
Key challenge: data dependences caused by chasing pointers.
Benefits: significant performance gains for searches, range scans, and even updates (!); complementary to CSB+-Trees.

Overview
Prefetching Searches
Prefetching Range Scans
Experimental Results
Conclusions

Example: Search Where Node Width = 1 Line
We suffer one full cache miss at each level of the tree (64B lines; 4B keys, pointers, and tupleIDs; 4 levels in the B+-Tree; cold cache).
[Figure: timeline in cycles showing one full cache miss per level.]

Same Example Where Node Width = 2 Lines
The tree has fewer levels, but we now suffer additional misses at each node: the extra misses per node dominate the reduction in the number of levels.
[Figure: timelines in cycles comparing one-line and two-line nodes.]

How Things Change with Prefetching
Fetch all lines within a node in parallel: the number of misses stays the same, but the exposed miss latency shrinks.
[Figure: timelines in cycles with the lines of each node fetched in parallel.]

pB+-Trees: Using Prefetching to Improve Search
Basic idea: make nodes wider than the natural data transfer size (e.g., 8 cache lines wide), and prefetch all lines of a node before searching in the node (see the sketch below).
Improved search performance: larger branching factors and shallower trees, while the cost of accessing each node increases only slightly.
Reduced space overhead: primarily due to fewer non-leaf nodes.
Update performance: ???

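A minimal C sketch of this idea, under assumptions: the node layout (NODE_KEYS, the leaf test via children[0]) is hypothetical, and a real implementation sizes the arrays so a node fills exactly NODE_LINES cache lines.

    #include <stddef.h>

    #define LINE_SIZE  64   /* cache line size in bytes (as in the talk) */
    #define NODE_LINES 8    /* node width in cache lines */
    #define NODE_KEYS  60   /* illustrative capacity */

    /* Hypothetical wide pB+-Tree node; field layout is illustrative. */
    typedef struct node {
        int          nkeys;
        int          keys[NODE_KEYS];
        struct node *children[NODE_KEYS + 1];  /* NULL in leaf nodes */
    } node_t;

    /* Prefetch every cache line of a node before touching it, so its
     * w misses are serviced in parallel rather than one after another. */
    static inline void prefetch_node(const node_t *n)
    {
        const char *p = (const char *)n;
        for (int i = 0; i < NODE_LINES; i++)
            __builtin_prefetch(p + i * LINE_SIZE, 0, 3);
    }

    /* Search sketch: prefetch each node in full, then search within it. */
    const node_t *search(const node_t *root, int key)
    {
        for (const node_t *n = root; n != NULL; ) {
            prefetch_node(n);
            int i = 0;
            while (i < n->nkeys && key >= n->keys[i])  /* linear scan for brevity */
                i++;
            if (n->children[0] == NULL)                /* reached a leaf */
                return n;
            n = n->children[i];
        }
        return NULL;
    }
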
Range Scan Cache Behavior: Normal B+-Trees
Steps in a range scan: search for the starting leaf node, then traverse the leaves until the end is found.
We suffer a full cache miss for each leaf node!
[Figure: timeline in cycles of serialized cache misses during leaf traversal.]

If Prefetching Wider Nodes
e.g., node width = 2 lines.
Exposed miss latency is reduced by up to a factor of the node width.
A definite improvement, but can we do better?
[Figure: timelines in cycles comparing the two node widths.]

The Ideal Case
Overlap misses until all of the latency is hidden, or until we run out of bandwidth.
How can we achieve this?
[Figure: timeline in cycles with all leaf misses fully overlapped.]

The Pointer-Chasing Problem
To reach the leaf we want to prefetch, we must first chase the intervening pointers; prefetching through the pointer chain still exposes the full latency at each node. The ideal case requires prefetching the target directly.
[Figure: the node currently being visited vs. the node we want to prefetch; prefetching through pointer chasing vs. prefetching directly.]

Our Solution: Jump-Pointer Arrays
Put the leaf addresses in an array, and prefetch directly using these jump pointers.
Back pointers from the leaves into the array are needed to initialize prefetching.

Our Solution: Jump-Pointer Arrays (continued)
[Figure: timeline showing leaf misses fully overlapped when prefetching through the jump-pointer array.]

External Jump-Pointer Arrays: Efficient Updates
Implement the jump-pointer array as a chunked linked list:
the impact of an insertion is limited to its chunk;
deletions simply leave empty slots;
empty slots are actively interleaved during bulkload and chunk splits.
The back pointer from a leaf into the jump-pointer array is now a hint: it points to the correct chunk, but may require a local search within the chunk to initialize prefetching.

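A minimal C sketch of this structure and of a range scan driven by it, under assumptions: CHUNK_SLOTS, PREFETCH_K, and the cursor interface are illustrative, and real code would prefetch every line of each wide leaf and stop at the scan's end key.

    #include <stddef.h>

    #define CHUNK_SLOTS 64  /* jump pointers per chunk (illustrative) */
    #define PREFETCH_K   3  /* prefetch distance in leaves (illustrative) */

    struct leaf;            /* a (wide) leaf node, as sketched earlier */

    /* One chunk of the external jump-pointer array: a fixed array of leaf
     * addresses (NULL marks an empty slot left by a deletion or interleaved
     * at bulkload time), linked into a list of chunks. */
    typedef struct chunk {
        struct leaf  *jump[CHUNK_SLOTS];
        struct chunk *next;
    } chunk_t;

    /* Advance a cursor (*c, *i) to the next non-empty slot. */
    static int next_slot(chunk_t **c, int *i)
    {
        while (*c) {
            if (++(*i) >= CHUNK_SLOTS) { *c = (*c)->next; *i = -1; continue; }
            if ((*c)->jump[*i] != NULL) return 1;
        }
        return 0;
    }

    /* Range scan sketch: keep a prefetch cursor PREFETCH_K leaves ahead
     * of the visit cursor.  (c, i) is the hint position just before the
     * first leaf to visit. */
    void range_scan(chunk_t *c, int i, void (*visit)(struct leaf *))
    {
        chunk_t *pc = c; int pi = i;   /* prefetch cursor */
        for (int k = 0; k < PREFETCH_K && next_slot(&pc, &pi); k++)
            __builtin_prefetch(pc->jump[pi], 0, 3);  /* prime the pipeline */
        while (next_slot(&c, &i)) {
            if (next_slot(&pc, &pi))                 /* stay K leaves ahead */
                __builtin_prefetch(pc->jump[pi], 0, 3);
            visit(c->jump[i]);
        }
    }
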
Alternative Design: Internal Jump-Pointer Arrays
B+-Trees already contain structures that point to the leaf nodes: the parents of the leaves ("bottom non-leaf nodes"). By linking these together, we can use them as a jump-pointer array.
Tradeoff: no back pointers are needed, the structure is simpler to maintain, and it consumes less space (though the external array's overhead is already <1%); but it is less flexible, since the chunk size is fixed by the B+-Tree structure.

Overview
Prefetching Searches
Prefetching Range Scans
Experimental Results: search performance, range scan performance, update performance
Conclusions

Experimental Framework
Results are for a main-memory database environment (we are extending this work to disk-based environments).
Executables: we added prefetch instructions to the C source code by hand, then used gcc to generate optimized MIPS executables containing the prefetch instructions.
Performance measurement: detailed, cycle-by-cycle simulations.
Machine model: based on the Compaq ES40 system, with slightly updated parameters.

Simulation Parameters
Pipeline parameters:
  Clock rate: 1 GHz
  Issue width: 4 insts/cycle
  Functional units: 2 Int, 2 FP, 2 Mem, 1 Branch
  Reorder buffer size: 64 insts
  Integer multiply/divide: 12/76 cycles
  All other integer: 1 cycle
  FP divide/square root: 15/20 cycles
  All other FP: 2 cycles
  Branch prediction scheme: gshare
Memory parameters:
  Line size: 64 bytes
  Primary data cache: 64 KB, 2-way set-assoc.
  Primary instruction cache: 64 KB, 2-way set-assoc.
  Miss handlers: 32 for data, 2 for inst
  Unified secondary cache: 2 MB, direct-mapped
  Primary-to-secondary miss latency: 15 cycles (plus contention)
  Primary-to-memory miss latency: 150 cycles (plus contention)
  Main memory bandwidth: 1 access per 10 cycles
The simulator models all the gory details, including memory system contention.

Index Search Performance
100K random searches after bulkload; nodes 100% full (except the root); warm caches.
pB+-Trees achieve a 27-47% speedup vs. B+-Trees and 14-34% vs. CSB+-Trees.
The optimal node width is 8 cache lines.
pB+-Trees and CSB+-Trees are complementary: p8CSB+-Trees are best.
[Figure: search time (M cycles) vs. # of tupleIDs in the trees for B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p8CSB+.]

Same Search Experiments with Cold Caches
100K random searches after bulkload; nodes 100% full (except the root); cold caches (i.e., caches cleared after each search).
Large discrete steps appear within each curve. What is happening here?
[Figure: search time (M cycles) vs. # of tupleIDs in the trees for the same tree types.]

Analysis of Cold Cache Search Behavior
The height of the tree dominates performance (the effect is blurred in the warm cache case). Among trees of the same height, the smaller the node size, the better.
[Table: # of levels in the trees, for each tree type (B+, CSB+, p2B+, p4B+, p8B+, p16B+, p8CSB+) at 10K, 30K, 100K, 300K, 1M, 3M, and 10M keys.]

Index Range Scan Performance
100 scans starting at random locations on an index bulkloaded with 3M keys (100% full); log scale.
Scans of 1K-1M keys: speedup over B+-Trees of a factor of ~3.5 from prefetching wider nodes, plus an additional factor of ~2 from jump-pointer arrays.
[Figure: scan time (cycles) vs. # of tupleIDs scanned in a single call, for B+tree, p8B+tree, p8iB+tree, and p8eB+tree.]

Index Range Scan Performance (continued)
For small scans (<1K keys), the overshooting cost of prefetching is noticeable; exploit jump-pointer prefetching only if the scan is expected to be large (e.g., by first searching for the end key).
[Figure: same experiment, 100 scans starting at random locations on an index bulkloaded with 3M keys (100% full); log scale.]

Update Performance
100K random insertions/deletions on a 3M-key bulkloaded index; warm caches.
pB+-Trees achieve at least a 1.24x speedup in all cases. Why?
[Figures: insertion and deletion times (M cycles) vs. percentage of entries used in leaf nodes, for B+tree, p8B+tree, p8eB+tree, and p8iB+tree.]

Update Performance (continued)
Reason #1: faster search times.
Reason #2: less frequent node splits, thanks to wider nodes.
[Figures: same insertion/deletion experiment as above.]

pB+-Trees: Other Results
Similar results for: varying bulkload factors, large segmented range scans, mature trees, and varying jump-pointer array parameters (prefetch distance, chunk size).
The optimal node width increases as memory bandwidth increases, matching the width predicted by our model in the paper.

Cache Performance Revisited
Search: eliminated 45% of the original data cache stalls, for a 1.47x speedup.
Scan: eliminated 97% of the original data cache stalls, for an 8-fold speedup.
[Figure: execution time broken into data cache stalls, other stalls, and busy time.]

Conclusions
Impact of Prefetching B+-Trees on performance:
Search: 1.27-1.47x speedup over B+-Trees; wider nodes reduce the height of the tree and the number of expensive misses; they outperform, and are complementary to, CSB+-Trees.
Updates: at least a 1.24x speedup over B+-Trees, from faster search and less frequent node splits; in contrast with significant slowdowns for CSB+-Trees.
Range scan: large speedups over B+-Trees; wider nodes give a factor of ~3.5 speedup, and jump-pointer arrays an additional factor of ~2.
Prefetching B+-Trees also reduce space overhead.
These benefits are likely to increase with future memory systems, and the approach is applicable to other levels of the memory hierarchy (e.g., disks).

Backup Slides

Revisiting the Optimal Node Width for Searches
The total cache misses for a search equal the misses per level times the number of levels in the tree:
  total misses = w · ⌈log_{wm} N⌉,
which is minimized when w = 1
(w = # of cache lines per node, m = # of child pointers per one-cache-line-wide node, N = # of tupleIDs in the index).

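As a hedged reconstruction of how prefetching changes this tradeoff, using T_1, T_next, and B = T_1/T_next from the terminology table (the paper's exact model may differ in detail):

    % Without prefetching, each level costs w full misses:
    %   cost(w) = w T_1 \lceil \log_{wm} N \rceil,  minimized at w = 1.
    % With prefetching, a node's w lines are fetched in parallel: one full
    % miss plus (w - 1) pipelined misses per level.
    \[
      \mathrm{cost}_{\mathrm{pf}}(w)
        = \bigl(T_1 + (w-1)\,T_{\mathrm{next}}\bigr)
          \bigl\lceil \log_{wm} N \bigr\rceil
        = T_{\mathrm{next}}\,(B + w - 1)
          \bigl\lceil \log_{wm} N \bigr\rceil .
    \]
    % For B > 1 the per-level cost grows slowly in w while the number of
    % levels shrinks, so the optimal w exceeds 1 and grows with B.

This is consistent with the earlier observation that the optimal node width increases as memory bandwidth increases.
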
Scheduling Prefetches Early Enough
Consider chasing pointers down a list of nodes n_i, n_{i+1}, n_{i+2}, n_{i+3}:

    p = &n0;
    while (p) {
        work(p->data);
        p = p->next;
    }

Let L be the latency of loading a node and W the time for work().
Our goal: fully hide the latency, achieving the fastest possible computation rate of 1/W. E.g., if L = 3W, we must prefetch 3 nodes ahead to achieve this.

Performance without Prefetching

    while (p) {
        work(p->data);
        p = p->next;
    }

Each iteration serializes the load L_i of node n_i with the work W_i on it, so the computation rate = 1/(L+W).
[Figure: timeline of alternating load and work phases.]

Prefetching One Node Ahead

    while (p) {
        pf(p->next);
        work(p->data);
        p = p->next;
    }

While visiting n_i, we prefetch n_{i+1}. Computation is now overlapped with memory accesses, but the data dependence through p->next limits the computation rate to 1/L.
[Figure: timeline with each load overlapped with the previous node's work.]

Prefetching Three Nodes Ahead

    while (p) {
        pf(p->next->next->next);
        work(p->data);
        p = p->next;
    }

The computation rate does not improve (still 1/L)!
The pointer-chasing problem [Luk & Mowry, ASPLOS '96]: any scheme that follows the pointer chain is limited to a rate of 1/L, because dereferencing p->next->next->next must itself wait for the chain of loads.
[Figure: timeline showing the loads still serialized despite the deeper prefetch.]

Our Goal: Fully Hide Latency

    while (p) {
        pf(&n_{i+3});    /* prefetch 3 nodes ahead, without chasing pointers */
        work(p->data);
        p = p->next;
    }

If we can obtain the address of n_{i+3} directly, i.e., without following the pointer chain, then prefetching three nodes ahead achieves the fastest possible computation rate of 1/W.
[Figure: timeline with all loads fully overlapped with work.]

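Tying these backup slides together, a minimal C sketch of prefetching K nodes ahead once the node addresses are available in an array (the jump-pointer idea); node_t, work(), and K are illustrative:

    #define K 3   /* prefetch distance, roughly ceil(L / W) */

    typedef struct node { struct node *next; int data; } node_t;
    void work(int data);                 /* assumed to take W cycles */

    /* jump[i] = &n_i; with the addresses in hand, prefetching n_{i+K}
     * needs no pointer chasing, so the loop runs at a rate of 1/W. */
    void traverse(node_t *jump[], int n)
    {
        for (int i = 0; i < n; i++) {
            if (i + K < n)
                __builtin_prefetch(jump[i + K], 0, 3);
            work(jump[i]->data);
        }
    }
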
Challenges in Supporting Efficient Updates
Conceptual view of the jump-pointer array: a single contiguous array of leaf addresses, with back pointers from the leaves.
What if we really implemented it this way?
Insertion could incur significant overheads: copying data within the array to create a new hole, and updating the back pointers.
Deletion is fine: just leave a hole.

Summary: Why We Expect Updates to Perform Well
Insertions: only a small number of jump pointers move (those between the insertion point and the nearest hole in the chunk); normally only the hint pointer of the inserted node is updated, which requires no significant overhead; significant overheads occur only on chunk splits, which are rare.
Deletions: no data is moved (just leave an empty hole), and no hints need updating.
In general, the jump-pointer array requires little concurrency control.

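A hedged sketch of the chunk insertion path, reusing the chunk_t layout sketched earlier; leaf_set_hint is a hypothetical helper, chunk splits are left to the caller, and for brevity we only look forward for a hole:

    void leaf_set_hint(struct leaf *l, chunk_t *c, int slot);  /* hypothetical */

    /* Insert leaf l's address after slot `pos` in chunk c, shifting jump
     * pointers only as far as the nearest empty slot. */
    int chunk_insert(chunk_t *c, int pos, struct leaf *l)
    {
        int hole = -1;
        for (int i = pos + 1; i < CHUNK_SLOTS; i++)     /* nearest hole */
            if (c->jump[i] == NULL) { hole = i; break; }
        if (hole < 0)
            return 0;                       /* chunk full: caller splits it */
        for (int i = hole; i > pos + 1; i--)
            c->jump[i] = c->jump[i - 1];    /* move only a few pointers */
        c->jump[pos + 1] = l;
        /* Shifted leaves keep valid hints: a hint only needs to identify
         * the correct chunk, not the exact slot. */
        leaf_set_hint(l, c, pos + 1);
        return 1;
    }
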
B+-Trees Modeled and Their Notation
B+-Trees: regular B+-Trees.
CSB+-Trees: cache-sensitive B+-Trees [Rao & Ross, SIGMOD 2000].
p_wB+-Trees: prefetching B+-Trees with node size = w cache lines and no jump-pointer arrays; we consider w = 2, 4, 8, and 16.
p8^eB+-Trees: prefetching B+-Trees with node size = 8 cache lines and external jump-pointer arrays.
p8^iB+-Trees: prefetching B+-Trees with node size = 8 cache lines and internal jump-pointer arrays.
p8CSB+-Trees: prefetching cache-sensitive B+-Trees with node size = 8 cache lines (and no jump-pointer arrays).
(Gory implementation details are in the paper.)

Searches with Varying Bulkload Factors
Similar trends with smaller bulkload factors as when 100% full.
The performance of pB+-Trees is somewhat less sensitive to the bulkload factor.
[Figures: search time (M cycles) vs. percentage of entries used in leaf nodes, for cold and warm caches, for B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p8CSB+.]

Range Scans with Varying Bulkload Factors
Prefetching B+-Trees offer larger speedups with smaller bulkload factors (more nodes to fetch), and less sensitivity of performance to the bulkload factor.

Large Segmented Range Scans
1M keys, scanned in 1000-key segments: similar performance gains as unsegmented scans.

Insertions with Cold Caches
Deletions with Cold Caches

Analysis of Node Splits upon Insertions
[Figures: insertions with node splits vs. percentage of entries used in leaf nodes, for B+tree, p8B+tree, p8eB+tree, and p8iB+tree; insertions categorized as no splits, one split, or at least 2 splits.]
With bulkload factors of 60-90%, pB+-Trees incur far fewer node splits; at a bulkload factor of 100%, they incur fewer node splits and fewer non-leaf node splits.

Mature Trees: Searches (Warm Caches)
[Figure: search time (M cycles) vs. number of searches (x 1000) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree.]

Mature Trees: Insertions (Warm Caches)
A CSB+-Tree can be 25% worse than a B+-Tree under the same mature-tree experiments (on a different h/w configuration), whereas pB+-Trees are significantly faster than the B+-Tree.
[Figure: insertion time (M cycles) vs. number of insertions (x 1000) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree.]

Mature Trees: Deletions (Warm Caches)
[Figure: deletion time (M cycles) vs. number of deletions (x 1000) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree.]
Mature Trees: Searches (Cold Caches)
[Figure: search time (M cycles) vs. number of searches (x 1000) for the same tree types.]
Mature Trees: Insertions (Cold Caches)
[Figure: insertion time (M cycles) vs. number of insertions (x 1000) for the same tree types.]
Mature Trees: Deletions (Cold Caches)
[Figure: deletion time (M cycles) vs. number of deletions (x 1000) for the same tree types.]
Mature Trees: Large Segmented Range Scans
[Figure: segmented scan times for B+tree, p8B+tree, p8eB+tree, and p8iB+tree.]

Search with Varying Memory Bandwidth (Warm Caches)
Even with a pessimistic bandwidth (B = 5), p8B+-Trees still achieve significant speedups: 1.2x for warm caches.
[Figure: normalized execution time vs. normalized bandwidth (B) for p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p19B+tree.]

Search with Varying Memory Bandwidth (Cold Caches)
Even when B = 5, there is a 1.3x speedup for cold caches.
The optimal value of w increases as B gets larger.
[Figure: normalized execution time vs. normalized bandwidth (B) for the same tree types.]

Scan with Varying Prefetching Distance (p8^eB+-Tree)
Performance is not sensitive to moderate increases in the prefetching distance, though the overshooting cost shows up when the number of entries to scan is small.

Scan with Varying Chunk Size (p8^eB+-Tree)
Performance is not sensitive to varying the chunk size.

Table 1: Terminology
w: # of cache lines in an index node
m: # of child pointers in a one-line-wide node
N: # of (key, tupleID) pairs in an index
d: # of child pointers in a non-leaf node (= w·m)
T_1: full latency of a cache miss
T_next: latency of an additional pipelined cache miss
B: normalized memory bandwidth (B = T_1/T_next)
K: # of nodes to prefetch ahead
C: # of cache lines in a jump-pointer array chunk
p_wB+-Tree: plain pB+-Tree with w-line-wide nodes
p_w^eB+-Tree: p_wB+-Tree with external jump-pointer arrays
p_w^iB+-Tree: p_wB+-Tree with internal jump-pointer arrays

Search with & without Jump-Pointer Arrays: Cold Caches
[Figure: search time (M cycles) vs. number of entries in leaf nodes for p8B+tree, p8eB+tree, and p8iB+tree; the curves step where the number of levels in the tree differs.]

Can We Do Even Better on Searches?
Hiding latency across levels is difficult given: the data dependence through the child pointer; the relatively large branching factor of tree nodes; and the equal likelihood of following any child (assuming uniformly distributed random search keys).
What if we prefetch a node's children in parallel with accessing the node itself? There is a duality between this and creating wider nodes. BUT this approach has the following relative disadvantages: storage overhead for the child (or grandchild) pointers, and the node size can only grow by multiples of the branching factor.