Fractal Prefetching B+-Trees: Optimizing Both Cache and Disk Performance
Shimin Chen (School of Computer Science, Carnegie Mellon University)
Phillip B. Gibbons (Bell Laboratories; current affiliation: Intel Research Pittsburgh)
Todd C. Mowry (School of Computer Science, Carnegie Mellon University)
Gary Valentin (DB2 UDB Development Team, IBM Toronto Lab)
B+-Tree Operations: Review
- Search: binary search in every node on the path
- Insertion/Deletion: a search followed by data movement
- Range Scan: locate the collection of tuples in a range by traversing the linked list of leaf nodes; different from search-like operations
Disk-optimized B+-Trees
(figure: memory hierarchy — CPU, L1, L2/L3 cache, main memory, disks)
Traditional focus: I/O performance
- minimize the number of disk accesses
- optimal tree nodes are disk pages, typically 4KB-64KB
Cache-optimized B+-Trees
Recent studies: cache performance
- e.g. [Rao & Ross, SIGMOD'00], [Bohannon, McIlroy & Rastogi, SIGMOD'01], [Chen, Gibbons & Mowry, SIGMOD'01]
- cache lines are on the order of 64B; optimal tree nodes are only a few cache lines wide
Large Difference in Node Sizes
(figure: memory hierarchy, contrasting disk-page-sized nodes with cache-line-sized nodes)
Cache-optimized B+-Trees: Poor I/O Performance
- may fetch a distinct disk page for every node on the path of a search
- similar penalty for range scans
Disk-optimized B+-Trees: Poor Cache Performance
- binary search in a large node suffers an excessive number of cache misses (explained later in the talk)
Optimizing for Both Cache and Disk Performance?
(memory hierarchy figure)
Our Approach
- Fractal Prefetching B+-Trees (fpB+-Trees)
- embed cache-optimized trees inside disk-optimized trees
Outline
- Overview
- Optimizing Searches and Updates
- Optimizing Range Scans
- Experimental Results
- Related Work
- Conclusion
Page Structure of Disk-optimized B+-Trees
- We focus on fixed-size keys (please see the full paper for a discussion of variable-size keys)
- A page consists of a header followed by a huge array of index entries
- An index entry is a key paired with a page ID (in nonleaf pages) or a tuple ID (in leaf pages); a struct sketch follows below
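To make the layout concrete, here is a minimal C sketch of such a page, assuming 4B keys, 4B pointers (8B index entries), an 8KB page, and an 8B header; the field names and sizes are illustrative, not the paper's exact layout.

/* Minimal sketch of a disk-optimized B+-tree page with fixed-size keys. */
#include <stdint.h>

#define PAGE_SIZE   8192            /* e.g. an 8KB disk page             */
#define HEADER_SIZE 8               /* e.g. an 8B page header            */

typedef struct {
    uint32_t key;                   /* fixed-size (4B) key               */
    uint32_t ptr;                   /* child page ID or tuple ID         */
} IndexEntry;                       /* 8B per entry                      */

typedef struct {
    uint16_t num_entries;           /* number of used slots              */
    uint16_t is_leaf;               /* leaf vs. nonleaf page             */
    uint32_t next_leaf;             /* page ID of next leaf (for scans)  */
    IndexEntry entries[(PAGE_SIZE - HEADER_SIZE) / sizeof(IndexEntry)];
} BTreePage;                        /* one large, contiguous entry array */

With these assumed sizes the entry array holds 1023 entries, matching the example on the next slide.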
Binary Search in a B+-Tree Page
- Example: 8KB page, 8B header, 8B index entries, 64B cache lines
- The index entry array holds 1023 entries; with 8 entries per cache line, the array occupies 128 cache lines
- (figure: searching for entry #71 across the 128 cache lines of the array)
Binary Search in a B+-Tree Page (cont'd)
- (figure: the active range shrinks with each probe, but the early probes each land on a different cache line)
- Poor cache performance because of poor spatial locality (a search sketch follows below)
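A sketch of the in-page binary search, reusing the BTreePage layout sketched earlier: with 1023 entries spread over 128 cache lines, roughly the first seven probes each touch a different 64B line, so most probes are cache misses.

/* Binary search over the page's contiguous entry array (sketch). */
static int page_search(const BTreePage *p, uint32_t key)
{
    int lo = 0, hi = p->num_entries - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;          /* probe: likely a new cache line */
        if (p->entries[mid].key == key) return mid;
        if (p->entries[mid].key < key)  lo = mid + 1;
        else                            hi = mid - 1;
    }
    return -(lo + 1);                          /* not found: encodes the insertion point */
}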
Fractal Prefetching B+-Trees (fpB+-Trees)
Embed cache-optimized trees inside disk pages:
- good search cache performance
  - binary search in cache-optimized nodes has much better locality
  - use cache prefetching (sketched below)
- good search disk performance
  - nodes are embedded in disk pages
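One way to realize the "use cache prefetching" point is to issue prefetches for every cache line of a small in-page node before searching it, so the lines are fetched in parallel rather than one miss at a time. This is a hedged sketch using the GCC __builtin_prefetch intrinsic, with an illustrative 512B node size.

#define CACHE_LINE 64
#define NODE_SIZE  512              /* a cache-optimized in-page node (illustrative) */

static void prefetch_node(const void *node)
{
    const char *p = (const char *)node;
    for (int off = 0; off < NODE_SIZE; off += CACHE_LINE)
        __builtin_prefetch(p + off, 0 /* read */, 3 /* keep in cache */);
}

Because the node is only a few cache lines wide, all of its lines can be outstanding at once, so the node is read in roughly one miss latency instead of several serialized misses.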
Node Size Mismatch Problem
- Disk page size and cache-optimized node size are determined by hardware parameters and key sizes
- Ideally, cache-optimized trees fit nicely in disk pages, but usually this is not true! (a concrete size check follows below)
- (figure: a 2-level tree overflows the page; or a 2-level tree underflows, leaving unused space, but adding one more level overflows)
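To see why the sizes rarely line up, here is a small, self-contained check under assumed sizes (16KB page, 8B entries, 512B candidate nodes); the specific numbers are illustrative only.

/* Sketch: does a two-level in-page tree of equally sized nodes fit in one page? */
#include <stdio.h>

int main(void)
{
    const int page_bytes  = 16 * 1024;
    const int entry_bytes = 8;
    const int node_bytes  = 512;                       /* candidate node size    */
    const int fanout      = node_bytes / entry_bytes;  /* 64 entries per node    */

    const int nodes  = 1 + fanout;                     /* one root + its children */
    const int needed = nodes * node_bytes;             /* 65 * 512B = 33280B      */

    printf("2-level tree needs %dB, page holds %dB -> %s\n",
           needed, page_bytes,
           needed <= page_bytes ? "fits" : "overflows");
    return 0;
}

With 512B nodes the two-level tree needs 33280B and overflows a 16KB page; halving the node size to 256B needs only 8448B and wastes nearly half the page, while adding a third level overflows again — exactly the over/underflow dilemma pictured on the slide.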
Two Solutions
- Solution 1: use different sizes for in-page leaf and nonleaf nodes (e.g. a smaller root when the tree overflows, a larger root when it underflows)
- Solution 2: overflowing nodes become roots of new pages
The Two Solutions from Another Point of View
Conceptually, we apply the disk and cache optimizations in different orders:
- Solution 1 is disk-first: first build the disk-optimized pages, then fit smaller trees into the disk pages by allowing different node sizes
- Solution 2 is cache-first: first build the cache-optimized trees, then group nodes together and place them into disk pages
Insertion and Deletion Cache Performance
- In disk-optimized B+-Trees, data movement is very expensive: because of the huge array structure in disk pages, on average we need to move half of the array (sketched below)
- In fpB+-Trees, the cost of data movement is much smaller thanks to the small cache-optimized nodes
- We show that, for fixed-size keys, fpB+-Trees have much better insertion/deletion performance than disk-optimized B+-Trees
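A sketch of where the data movement cost comes from: inserting into a sorted, contiguous entry array shifts everything after the insertion slot. With an ~8KB array that is roughly 4KB of memmove per insert on average; with a small fpB+-Tree node it is only a few cache lines. The helper below is illustrative and reuses the IndexEntry type from the earlier sketch.

#include <string.h>

/* Make room at 'slot' (found by search) and store the new entry there. */
static void insert_at(IndexEntry *entries, int *num, int slot, IndexEntry e)
{
    memmove(&entries[slot + 1], &entries[slot],
            (*num - slot) * sizeof(IndexEntry));   /* cost grows with array size */
    entries[slot] = e;
    (*num)++;
}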
Outline
- Overview
- Optimizing Searches and Updates
- Optimizing Range Scans
- Experimental Results
- Related Work
- Conclusion
Jump-pointer Array Prefetching for Range Scans
- Previous proposal for range scan cache performance (SIGMOD'01): build data structures that hold leaf node addresses, and prefetch leaf nodes during range scans
- (figure: internal jump-pointer array)
- Recall that range scans essentially traverse the linked list of leaf nodes
New Proposal: I/O Prefetching
- Employ jump-pointer array prefetching for I/O by linking leaf-parent pages together
- The jump-pointer arrays contain leaf page IDs
- Prefetch leaf pages to improve range scan I/O performance
- Very useful when leaf pages are not sequential on disk, e.g. a non-clustered index under frequent updates (when sequential prefetching is not applicable)
Both Cache and I/O Prefetching in fpB+-Trees
Two jump-pointer arrays in fpB+-Trees (a scan sketch follows below):
- one for range scan cache performance, containing leaf node addresses for cache prefetching
- one for range scan disk performance, containing leaf page IDs for I/O prefetching
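A sketch of how the I/O-side jump-pointer array could drive a range scan: the array holds leaf page IDs in key order, and the scan issues asynchronous page reads a fixed distance ahead. The buffer-manager hooks (prefetch_page, read_page) and scan_leaf are hypothetical stand-ins, not the paper's API; the cache-side array works analogously with leaf node addresses and cache-line prefetches.

#include <stdint.h>

/* hypothetical buffer-manager / scan hooks */
void       prefetch_page(uint32_t page_id);   /* issue an asynchronous read   */
BTreePage *read_page(uint32_t page_id);       /* blocking read (may hit cache) */
void       scan_leaf(const BTreePage *leaf);  /* consume tuples in the range   */

enum { IO_PREFETCH_DEPTH = 8 };               /* how far ahead to stay */

void range_scan(const uint32_t *jparray, int first, int last)
{
    /* prime the pipeline: issue reads for the first few leaf pages */
    for (int i = first; i <= last && i < first + IO_PREFETCH_DEPTH; i++)
        prefetch_page(jparray[i]);

    for (int i = first; i <= last; i++) {
        if (i + IO_PREFETCH_DEPTH <= last)
            prefetch_page(jparray[i + IO_PREFETCH_DEPTH]);  /* stay ahead       */
        BTreePage *leaf = read_page(jparray[i]);            /* likely in flight */
        scan_leaf(leaf);
    }
}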
More Details in Our Paper
- Computation of optimal node sizes
- Data structures
- Algorithms: bulkload, search, insertion, deletion, range scan
Outline
- Overview
- Optimizing Searches and Updates
- Optimizing Range Scans
- Experimental Results
- Related Work
- Conclusion
Implementation
We implemented a buffer manager and three index structures on top of it:
- Disk-optimized B+-Trees (the baseline)
- Disk-first fpB+-Trees
- Cache-first fpB+-Trees
Experiments and Methodology
Experiments:
- Search: (1) cache performance; (2) disk performance — improving cache performance while preserving good disk performance
- Update: (3) cache performance — solving the data movement problem
- Range Scan: (4) cache performance; (5) disk performance — jump-pointer array prefetching
Methodology:
- Cache performance: detailed cycle-by-cycle simulations, with memory system parameters of the near future and better prefetching support
- Range scan I/O performance: execution times on real machines
- Search I/O performance: counting the number of I/Os (I/O operations in a search do not overlap)
Search Cache Performance
- 2000 random searches after bulkload; 100% full except root; 16KB pages
- (graph: execution time in M cycles vs. total # of entries in all leaf pages, for disk-optimized B+-Trees and disk-first/cache-first fpB+-Trees)
- fpB+-Trees perform significantly better than disk-optimized B+-Trees, achieving speedups at all tree sizes and over 1.25 when trees contain at least 1M entries
- The performance of the two fpB+-Trees is similar
Search I/O Performance
- 2000 random searches after bulkloading 10M index entries; 100% full except root
- (graph: # of I/O reads (x 1000) for 4KB, 8KB, 16KB, and 32KB pages)
- Disk-first fpB+-Trees access < 3% more pages: a very small I/O performance impact
- Cache-first fpB+-Trees may access up to 25% more pages in our results
Insertion Cache Performance
- 2000 random insertions after bulkloading 3M keys, 70% full
- (graph: execution time in M cycles for 4KB-32KB pages)
- fpB+-Trees are significantly faster than disk-optimized B+-Trees, achieving up to 35-fold speedups
- Data movement costs dominate disk-optimized B+-Tree performance
Range Scan Cache Performance
- 100 scans starting at random locations in an index bulkloaded with 3M keys, 100% full; each range contains 1M keys; 16KB pages
- (graph: execution time in M cycles)
- Disk-first and cache-first fpB+-Trees achieve speedups of 4.2 and 3.5 over disk-optimized B+-Trees
- Jump-pointer array cache prefetching is effective
Range Scan I/O Performance: IBM DB2 Universal Database
- Setup: 8-processor machine (RS/6000 line), 2GB memory, 80 SSA disks; mature index on a 12.8GB table
- (graph: normalized execution time — no prefetch, with prefetch, in memory)
- Jump-pointer array I/O prefetching achieves substantial speedups for disk-optimized B+-Trees
Other Experiments
- We find similar benefits in deletion cache performance: up to 20-fold speedups
- We performed many cache performance experiments and got similar results for varying tree sizes, bulkload factors, and page sizes; mature trees; and varying key sizes (20B keys)
- We performed range scan I/O experiments with our own index implementation and saw up to 6.9-fold speedups
Related Work
- Micro-indexing (discussed briefly by Lomet, SIGMOD Record, Sep. 2001): a micro-index over the contiguous array of index entries in a page
- We are the first to quantitatively analyze the performance of micro-indexing: it improves search cache performance but still suffers from the data movement problem on updates, because of the contiguous array structure
- fpB+-Trees have much better update performance
Fractal Prefetching B+-Trees: Conclusion
- Search: combine cache-optimized and disk-optimized node sizes
  - better cache performance: speedups over disk-optimized B+-Trees
  - good disk performance for disk-first fpB+-Trees: they visit < 3% more disk pages
  - we recommend cache-first fpB+-Trees only with very large memory
- Update: solve the data movement problem by using smaller nodes
  - better cache performance: up to a 20-fold speedup over disk-optimized B+-Trees
- Range Scan: employ jump-pointer array prefetching
  - better cache performance and better disk performance: speedups on IBM DB2
Backup Slides
Previous Work: Prefetching B+-Trees (SIGMOD 2001)
- Studies B+-Trees in a main-memory environment
- For search: prefetch wider tree nodes
  - increase the node size to multiple cache lines wide
  - use prefetching to read all cache lines of a node in parallel
- (figure: a B+-Tree with one-line nodes vs. a Prefetching B+-Tree with four-line nodes)
Prefetching B+-Trees (cont'd)
- For range scan: jump-pointer array prefetching
  - build jump-pointer arrays that hold leaf node addresses
  - prefetch leaf nodes using the jump-pointer array
  - two implementations: external and internal jump-pointer arrays
Optimization in the Disk-first Approach
- Two conflicting goals: (1) optimize search cache performance; (2) maximize page fan-out to preserve good I/O performance
- Optimality criterion: maximize page fan-out while keeping the analytical search cost within 10% of optimal (a cost sketch follows below)
- Details in the paper
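A hedged sketch of the kind of analytical cost such a selection can use: a search visits roughly ceil(log_f N) node levels, and each node costs one full-latency miss plus cheaper, overlapped fetches for its remaining cache lines. The constants and the exact formula are illustrative, not the paper's model; the actual selection then takes, among the widths whose cost is within 10% of the cheapest, the one that maximizes page fan-out.

#include <math.h>

#define T_FULL 150.0   /* assumed cost of a full-latency cache miss (cycles)  */
#define T_NEXT  15.0   /* assumed cost of an extra, overlapped line fetch     */

/* Estimated cache cost of one search over n_entries keys, using in-page
   nodes that are lines_per_node cache lines wide. */
double search_cost(int lines_per_node, long n_entries, int entries_per_line)
{
    int    fan    = lines_per_node * entries_per_line;   /* node fan-out (header ignored) */
    double levels = ceil(log((double)n_entries) / log((double)fan));
    return levels * (T_FULL + (lines_per_node - 1) * T_NEXT);
}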
Cache-first fpB+-Tree Structure
- Group sibling leaf nodes into the same pages for range scans
- Group a parent and its children into the same page for searches
- Leaf-parent nodes may be placed in overflow pages
Simulation Parameters
Pipeline parameters:
- Clock rate: 1 GHz
- Issue width: 4 insts/cycle
- Functional units: 2 Int, 2 FP, 2 Mem, 1 Branch
- Reorder buffer size: 64 insts
- Integer multiply/divide: 12/76 cycles; all other integer: 1 cycle
- FP divide/square root: 15/20 cycles; all other FP: 2 cycles
- Branch prediction scheme: gshare
Memory parameters:
- Line size: 64 bytes
- Primary data cache: 64KB, 2-way set-assoc.
- Primary instruction cache: 64KB, 2-way set-assoc.
- Miss handlers: 32 for data, 2 for instructions
- Unified secondary cache: 2MB, direct-mapped
- Primary-to-secondary miss latency: 15 cycles (plus contention)
- Primary-to-memory miss latency: 150 cycles (plus contention)
- Main memory bandwidth: 1 access per 10 cycles
The simulator models all the gory details, including memory system contention.
Optimal Node Size Computation (key = 4B)
Disk-first fpB+-Trees (page size: nonleaf node / leaf node):
- 4KB: 64B / 384B
- 8KB: 192B / 256B
- 16KB: 192B / 512B
- 32KB: 256B / 832B
Cache-first fpB+-Trees (page size: node size):
- 4KB: 576B
- 8KB: 576B
- 16KB: 704B
- 32KB: 640B
(The full tables also give the resulting page fan-out and cost/optimal ratio for each configuration.)
Optimality criterion: maximize page fan-out while keeping the analytical search cost within 10% of optimal. We used these optimal values in our experiments.
Search Cache Performance
- 2000 random searches after bulkload; 100% full except root; 16KB pages
- (graph: execution time in M cycles vs. # of entries in leaf pages, for disk-optimized B+-Trees, micro-indexing, and disk-first/cache-first fpB+-Trees)
- The cache-sensitive schemes (fpB+-Trees and micro-indexing) all perform significantly better than disk-optimized B+-Trees
- The performance of the cache-sensitive schemes is similar
Search Cache Performance (Varying Page Sizes)
- Same experiments but with 4KB, 8KB, and 32KB pages
- (graphs: execution time in M cycles vs. # of entries in leaf pages for each page size)
- We see the same trends: the cache-sensitive schemes are better, achieving speedups at all tree sizes and larger speedups when trees contain at least 1M entries
Optimal Width Selection
- (graphs: disk-first and cache-first fpB+-Trees; 16KB pages, 4B keys)
- Our selected trees perform within 2% and 5% of the best, for disk-first and cache-first fpB+-Trees respectively
Search I/O Performance
- 2000 random searches, 4B keys
- (graphs: # of I/O reads (x 1000) vs. page size (4KB-32KB), after bulkload (100% full) and for mature trees)
- Disk-first fpB+-Trees access < 3% more pages: a very small I/O performance impact
- Cache-first fpB+-Trees may access up to 25% more pages in our results
Insertion Cache Performance
- 2000 random insertions after bulkloading 3M keys, 70% full
- (graph: execution time in M cycles vs. page size (4KB-32KB), for disk-optimized B+-Trees, micro-indexing, and disk-first/cache-first fpB+-Trees)
Insertion Cache Performance II
- 2000 random insertions after bulkloading 3M keys with varying bulkload factors; 16KB pages
- (graph: execution time in M cycles vs. bulkload factor)
- fpB+-Trees are significantly faster than both disk-optimized B+-Trees and micro-indexing
- fpB+-Trees achieve up to 35-fold speedups over disk-optimized B+-Trees across all page sizes
Insertion Cache Performance II (cont'd)
- Same setup: 2000 random insertions after bulkloading 3M keys; 16KB pages
- Two major costs: data movement and page splits
- Micro-indexing still suffers from data movement costs; fpB+-Trees avoid this problem with smaller nodes
Space Utilization
- (graphs: space overhead (percentage) vs. page size (4KB-32KB), after bulkload (100% full) and for mature trees; 4B keys)
- Disk-first fpB+-Trees incur < 9% space overhead
- Cache-first fpB+-Trees may use up to 36% more pages in our results
Range Scan Cache Performance (Varying Bulkload Factors)
- 100 scans starting at random locations in an index bulkloaded with 3M keys; each range spans 1M keys; 16KB pages
- (graph: execution time in M cycles vs. bulkload factor)
- Disk-first and cache-first fpB+-Trees achieve large speedups over disk-optimized B+-Trees
Range Scan I/O Performance
- Setup: SGI Origin 200 with four 180MHz R10000 processors, 128MB memory, 12 SCSI disks (10 used in the experiments); range scans on mature trees; 10M entries in the range
- (graph: execution time (s) vs. # of disks used, plain range scan vs. jump-pointer array prefetching)
- Jump-pointer array prefetching achieves up to a 6.9-fold speedup
Jump-pointer Array Prefetching on IBM DB2
- Setup: 8-processor machine (RS/6000 line), 2GB memory, 80 SSA disks; mature index on a 12.8GB table; query: SELECT COUNT(*) FROM data
- (graphs: normalized execution time vs. # of I/O processes and vs. SMP degree (# of parallel processes) — no prefetch, with prefetch, in memory)
- Jump-pointer array prefetching achieves substantial speedups