Slide 1: Fractal Prefetching B+-Trees: Optimizing Both Cache and Disk Performance
Joint work with:
Shimin Chen, School of Computer Science, Carnegie Mellon University
Phillip B. Gibbons, Bell Laboratories (current affiliation: Intel Research Pittsburgh)
Todd C. Mowry, School of Computer Science, Carnegie Mellon University
Gary Valentin, DB2 UDB Development Team, IBM Toronto Lab
Slide 2: B+-Tree Operations: Review
Search: binary search in every node on the path.
Insertion/Deletion: a search followed by data movement.
Range Scan: locate a collection of tuples in a range by traversing the linked list of leaf nodes; different from the search-like operations above. (A sketch of these operations follows below.)
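To make these operations concrete, here is a minimal C++ sketch of search and range scan; the Node layout and the keys/children/tuple_ids/next_leaf fields are illustrative assumptions, not the structures used in the talk. Each node on the path is located by a binary search, and a range scan walks the leaf-level linked list.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative node layout (not the paper's actual structure).
struct Node {
    bool is_leaf;
    std::vector<int64_t> keys;       // sorted keys
    std::vector<Node*>   children;   // children (internal nodes only)
    std::vector<int64_t> tuple_ids;  // payload (leaf nodes only)
    Node* next_leaf = nullptr;       // leaf-level linked list
};

// Search: binary search in every node on the root-to-leaf path.
int64_t* search(Node* root, int64_t key) {
    Node* n = root;
    while (!n->is_leaf) {
        // The first separator key greater than 'key' picks the child to follow.
        auto it = std::upper_bound(n->keys.begin(), n->keys.end(), key);
        n = n->children[it - n->keys.begin()];
    }
    auto it = std::lower_bound(n->keys.begin(), n->keys.end(), key);
    if (it != n->keys.end() && *it == key)
        return &n->tuple_ids[it - n->keys.begin()];
    return nullptr;  // not found
}

// Range scan: locate the starting leaf, then walk the leaf linked list.
void range_scan(Node* root, int64_t lo, int64_t hi, std::vector<int64_t>& out) {
    Node* n = root;
    while (!n->is_leaf) {
        auto it = std::upper_bound(n->keys.begin(), n->keys.end(), lo);
        n = n->children[it - n->keys.begin()];
    }
    for (; n != nullptr; n = n->next_leaf)
        for (std::size_t i = 0; i < n->keys.size(); ++i) {
            if (n->keys[i] > hi) return;
            if (n->keys[i] >= lo) out.push_back(n->tuple_ids[i]);
        }
}
```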
Slide 3: Disk-optimized B+-Trees
[Diagram: memory hierarchy — CPU, L1, L2/L3 cache, main memory, disks.]
Traditional focus: I/O performance.
Minimize the number of disk accesses; optimal tree nodes are disk pages, typically 4KB-64KB large.
Slide 4: Cache-optimized B+-Trees
Recent studies target cache performance, e.g. [Rao & Ross, SIGMOD'00], [Bohannon, McIlroy, Rastogi, SIGMOD'01], [Chen, Gibbons, Mowry, SIGMOD'01].
Cache line size is 32-128B; optimal tree nodes are only a few cache lines.
Slide 5: Large Difference in Node Sizes
[Memory hierarchy diagram: CPU, L1, L2/L3 cache, main memory, disks.]
Slide 6: Cache-optimized B+-Trees: Poor I/O Performance
A search may fetch a distinct disk page for every node on its path; range scans suffer a similar penalty.
Slide 7: Disk-optimized B+-Trees: Poor Cache Performance
Binary search in a large node suffers an excessive number of cache misses (explained later in the talk).
Slide 8: Optimizing for Both Cache and Disk Performance?
Slide 9: Our Approach
Fractal Prefetching B+-Trees (fpB+-Trees): embed cache-optimized trees inside disk-optimized trees.
Slide 10: Outline
Overview
Optimizing Searches and Updates
Optimizing Range Scans
Experimental Results
Related Work
Conclusion
Slide 11: Page Structure of Disk-optimized B+-Trees
We focus on fixed-size keys (please see our full paper for a discussion of variable-size keys).
A page consists of a header followed by one huge array of index entries; an index entry is a <key, pageID> pair (nonleaf pages) or a <key, tupleID> pair (leaf pages).
Slide 12: Binary Search in a B+-Tree Page
Running example: search for entry #71.
Suppose the index entry array has 1023 index entries, numbered 1-1023 (e.g. an 8KB page with an 8B header, 8B entries, and 64B cache lines).
With 8 index entries per cache line, the array occupies 128 cache lines.
[Figure: the entry array laid out across cache lines 1 through 128.]
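The slide's arithmetic can be checked in a few lines of C++; this is only the stated example (8KB page, 8B header, 64B cache lines) plus the assumption that an index entry occupies 8B.

```cpp
#include <cstdio>

int main() {
    const int page_size  = 8 * 1024;  // 8KB disk page
    const int header     = 8;         // 8B page header
    const int entry_size = 8;         // assumed 8B per index entry
    const int line_size  = 64;        // 64B cache line

    int entries_per_page = (page_size - header) / entry_size;  // 1023
    int entries_per_line = line_size / entry_size;             // 8
    int lines_per_page   = page_size / line_size;              // 128

    std::printf("%d entries, %d per cache line, %d cache lines per page\n",
                entries_per_page, entries_per_line, lines_per_page);
}
```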
Slide 13: Binary Search in a B+-Tree Page (continued)
Searching for entry #71 in the array of entries 1-1023, binary search probes positions 512, 256, 128, 64, 96, 80, 72, 68, 70, and finally 71, narrowing the active range at each step.
Poor cache performance because of poor spatial locality: almost every probe lands in a different cache line (see the sketch below).
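To see why the locality is poor, a small sketch (assuming 8 entries per 64B cache line, as on the previous slide) maps each probe of this search to its cache line; only the last few probes share a line, so the ten probes touch seven distinct cache lines.

```cpp
#include <cstdio>
#include <set>

int main() {
    // Probe positions from the slide's example search for entry #71
    // in an array of entries numbered 1..1023.
    int probes[] = {512, 256, 128, 64, 96, 80, 72, 68, 70, 71};
    const int entries_per_line = 8;  // 8B entries, 64B cache lines

    std::set<int> lines_touched;
    for (int p : probes)
        lines_touched.insert((p - 1) / entries_per_line);

    // Only the last probes (72, 68, 70, 71) share a cache line;
    // each earlier probe misses in a different line.
    std::printf("%zu probes touch %zu distinct cache lines\n",
                sizeof(probes) / sizeof(probes[0]), lines_touched.size());
}
```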
Slide 14: Fractal Prefetching B+-Trees (fpB+-Trees)
Embed cache-optimized trees inside disk pages:
Good search cache performance: binary search in cache-optimized nodes has much better locality, and cache prefetching can be used.
Good search disk performance: the nodes are embedded into disk pages.
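A hedged sketch of the embedding idea (all sizes and field names below are illustrative assumptions, not the layout chosen in the paper): a disk page holds a small tree of cache-line-aligned nodes, so the page remains the unit of disk I/O while binary search runs inside cache-sized nodes.

```cpp
#include <cstdint>

// Illustrative assumptions: 16KB page, 64B cache line, 8B keys.
constexpr int kPageSize    = 16 * 1024;
constexpr int kKeysPerNode = 48;   // keeps a node a few cache lines wide

// A small, cache-optimized node embedded inside a disk page.
struct alignas(64) InPageNode {
    uint16_t num_keys;
    int64_t  keys[kKeysPerNode];
    uint16_t children[kKeysPerNode + 1];  // slots of child nodes within this page,
                                          // or page IDs / tuple IDs at the leaf level
};

// A disk page is itself a small tree of in-page nodes (the "fractal" idea):
// binary search runs inside cache-sized nodes, so spatial locality is good,
// while the whole page remains the unit of disk I/O.
struct DiskPage {
    uint32_t   page_id;
    uint16_t   root_slot;   // slot index of the in-page root node
    InPageNode nodes[(kPageSize - 64) / sizeof(InPageNode)];  // reserve 64B header
};

static_assert(sizeof(InPageNode) % 64 == 0, "node is a whole number of cache lines");
static_assert(sizeof(DiskPage) <= kPageSize, "page fits in one disk page");
```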
Slide 15: Node Size Mismatch Problem
Disk page size and cache-optimized node size are determined by hardware parameters and key sizes, so ideally cache-optimized trees would fit nicely in disk pages. Usually this is not true: a 2-level tree of cache-optimized nodes may overflow the page, or it may underflow (leaving unused space) while adding one more level overflows.
Slide 16: Two Solutions
Solution 1: use different sizes for in-page leaf and nonleaf nodes, e.g. a smaller root when the tree overflows and a larger root when it underflows.
Solution 2: let overflowing nodes become the roots of new pages.
Slide 17: The Two Solutions from Another Point of View
Conceptually we apply the disk and cache optimizations in different orders.
Solution 1 is disk-first: first build the disk-optimized pages, then fit smaller trees into the disk pages by allowing different node sizes.
Solution 2 is cache-first: first build the cache-optimized trees, then group nodes together and place them into disk pages.
Slide 18: Insertion and Deletion Cache Performance
In disk-optimized B+-Trees, data movement is very expensive: because of the huge array structure in disk pages, on average we must move half the array per update.
In our fpB+-Trees, the cost of data movement is much smaller thanks to the small cache-optimized nodes.
We show that fpB+-Trees have much better insertion/deletion performance than disk-optimized B+-Trees with fixed-size keys. (An illustration of the data-movement cost follows below.)
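The illustration below (with assumed sizes) shows where the cost comes from: inserting into one large sorted entry array shifts, on average, half of the array with memmove, whereas a small cache-optimized node shifts only a handful of entries.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstring>

struct Entry { int64_t key; int64_t tuple_id; };

// Insert into a sorted entry array, shifting later entries up by one slot.
// On average about half of 'count' entries are moved.
void sorted_insert(Entry* entries, int& count, Entry e) {
    Entry* pos = std::lower_bound(entries, entries + count, e,
        [](const Entry& a, const Entry& b) { return a.key < b.key; });
    std::memmove(pos + 1, pos, (entries + count - pos) * sizeof(Entry));
    *pos = e;
    ++count;
}

int main() {
    // Illustrative sizes: a disk-optimized page array holds ~1023 entries,
    // a cache-optimized fpB+-Tree node only a few dozen.
    std::printf("avg entries moved per insert: ~%d (page array) vs ~%d (small node)\n",
                1023 / 2, 48 / 2);
}
```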
Slide 19: Outline
Overview
Optimizing Searches and Updates
Optimizing Range Scans
Experimental Results
Related Work
Conclusion
Slide 20: Jump-pointer Array Prefetching for Range Scan
Previous proposal for range scan cache performance (SIGMOD'01): build data structures that hold the leaf node addresses, and prefetch leaf nodes during range scans.
[Figure: internal jump-pointer array.]
Recall that range scans essentially traverse the linked list of leaf nodes. (A prefetching sketch follows below.)
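A minimal sketch of jump-pointer array cache prefetching, using the GCC/Clang __builtin_prefetch intrinsic; the leaf layout and the prefetch distance are assumptions. While one leaf is being scanned, the address of a leaf several slots ahead in the jump-pointer array is prefetched, so its memory latency overlaps with useful work.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative leaf node: a few cache lines of sorted entries.
struct LeafNode {
    int     num_keys = 0;
    int64_t keys[48] = {};
    int64_t tuple_ids[48] = {};
};

// jump_pointers holds the addresses of all leaf nodes in key order.
// 'distance' is how many leaves ahead we prefetch, chosen so that fetching
// a leaf from memory is hidden behind scanning the earlier ones.
int64_t prefetched_range_scan(const std::vector<const LeafNode*>& jump_pointers,
                              std::size_t first, std::size_t last,
                              std::size_t distance = 8) {
    int64_t sum = 0;  // stand-in for real per-tuple work
    for (std::size_t i = first; i <= last; ++i) {
        if (i + distance <= last) {
            // GCC/Clang intrinsic: prefetch an upcoming leaf for reading.
            __builtin_prefetch(jump_pointers[i + distance], /*rw=*/0, /*locality=*/3);
        }
        const LeafNode* leaf = jump_pointers[i];
        for (int k = 0; k < leaf->num_keys; ++k) sum += leaf->tuple_ids[k];
    }
    return sum;
}
```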
Slide 21: New Proposal: I/O Prefetching
Employ jump-pointer array prefetching for I/O: link the leaf parent pages together, keep leaf page IDs in the jump-pointer arrays, and prefetch leaf pages to improve range scan I/O performance.
Very useful when leaf pages are not sequential on disk, e.g. a non-clustered index under frequent updates (when sequential prefetching is not applicable).
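The talk gives no code for the I/O variant; as a hedged analogue on a POSIX system, the jump-pointer array of leaf page IDs could drive posix_fadvise(POSIX_FADV_WILLNEED) hints so the OS reads pages ahead of the scan. A real DBMS such as DB2 would use its own asynchronous prefetchers; the page size and prefetch distance below are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <unistd.h>
#include <vector>

constexpr std::size_t kPageSize = 16 * 1024;   // assumed disk page size

// leaf_page_ids is the jump-pointer array: leaf page IDs in key order.
// For each page we scan, advise the OS that a page 'distance' ahead will be
// needed soon, so its read overlaps with processing the current page.
void scan_with_io_prefetch(int fd, const std::vector<uint32_t>& leaf_page_ids,
                           std::size_t distance = 16) {
    std::vector<char> buf(kPageSize);
    for (std::size_t i = 0; i < leaf_page_ids.size(); ++i) {
        if (i + distance < leaf_page_ids.size()) {
            off_t ahead = static_cast<off_t>(leaf_page_ids[i + distance]) * kPageSize;
            posix_fadvise(fd, ahead, kPageSize, POSIX_FADV_WILLNEED);  // readahead hint
        }
        off_t offset = static_cast<off_t>(leaf_page_ids[i]) * kPageSize;
        pread(fd, buf.data(), kPageSize, offset);   // read this leaf page
        // ... scan the entries in buf ...
    }
}
```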
Slide 22: Both Cache and I/O Prefetching in fpB+-Trees
fpB+-Trees use two jump-pointer arrays:
One for range scan cache performance, containing leaf node addresses for cache prefetching.
One for range scan disk performance, containing leaf page IDs for I/O prefetching.
Slide 23: More Details in Our Paper
Computation of optimal node sizes
Data structures
Algorithms: bulkload, search, insertion, deletion, range scan
Slide 24: Outline
Overview
Optimizing Searches and Updates
Optimizing Range Scans
Experimental Results
Related Work
Conclusion
Slide 25: Implementation
We implemented a buffer manager and three index structures on top of it: disk-optimized B+-Trees (the baseline), disk-first fpB+-Trees, and cache-first fpB+-Trees.
Slide 26: Experiments and Methodology
Experiments:
Search: (1) cache performance and (2) disk performance — improving cache performance while preserving good disk performance.
Update: (3) cache performance — solving the data movement problem.
Range scan: (4) cache performance and (5) disk performance — jump-pointer array prefetching.
Methodology:
Cache performance: detailed cycle-by-cycle simulations, with memory system parameters of the near future and better prefetching support.
Range scan I/O performance: execution times on real machines.
Search I/O performance: counting the number of I/Os, since I/O operations in a search do not overlap.
Slide 27: Search Cache Performance
[Chart: execution time (M cycles, 1-3.5) vs. total number of entries in all leaf pages (10^5 to 10^7) for disk-optimized B+-Trees, disk-first fpB+-Trees, and cache-first fpB+-Trees; 2000 random searches after bulkload; 100% full except root; 16KB pages.]
fpB+-Trees perform significantly better than disk-optimized B+-Trees, achieving speedups of 1.09-1.77 at all sizes and over 1.25 when the trees contain at least 1M entries.
The performance of the two fpB+-Trees is similar.
Slide 28: Search I/O Performance
[Chart: number of I/O reads (x 1000, roughly 1.2-2.6) vs. page size (4KB-32KB) for the three trees; 2000 random searches after bulkloading 10M index entries; 100% full except root.]
Disk-first fpB+-Trees access < 3% more pages: a very small I/O performance impact.
Cache-first fpB+-Trees may access up to 25% more pages in our results.
Slide 29: Insertion Cache Performance
[Chart: execution time (M cycles, 0-100) vs. page size (4KB-32KB) for the three trees; 2000 random insertions after bulkloading 3M keys 70% full.]
fpB+-Trees are significantly faster than disk-optimized B+-Trees, achieving up to 35-fold speedups.
Data movement costs dominate disk-optimized B+-Tree performance.
Slide 30: Range Scan Cache Performance
[Chart: execution time (M cycles, 0-1200) for the three trees; 100 scans starting at random locations in an index bulkloaded with 3M keys, 100% full; each range contains 1M keys; 16KB pages.]
Disk-first and cache-first fpB+-Trees achieve speedups of 4.2 and 3.5, respectively, over disk-optimized B+-Trees.
Jump-pointer array cache prefetching is effective.
Slide 31: Range Scan I/O Performance (IBM DB2 Universal Database)
[Chart: normalized execution time (0-100) for no prefetch, with prefetch, and in-memory runs.]
Setup: 8-processor machine (RS/6000 line), 2GB memory, 80 SSA disks; mature index on a 12.8GB table.
Jump-pointer array I/O prefetching achieves speedups of 2.5-5.0 for disk-optimized B+-Trees.
Slide 32: Other Experiments
We find similar benefits for deletion cache performance: up to 20-fold speedups.
We performed many cache performance experiments and got similar results for varying tree sizes, bulkload factors, and page sizes; for mature trees; and for varying key sizes (20B keys).
We also performed range scan I/O experiments in our own index implementation and saw up to 6.9-fold speedups.
Slide 33: Related Work
Micro-indexing (discussed briefly by Lomet, SIGMOD Record, Sep. 2001): a page keeps its contiguous array of index entries plus a small micro-index over it.
We are the first to quantitatively analyze the performance of micro-indexing: it improves search cache performance, but it still suffers from the data movement problem on updates because of the contiguous array structure; fpB+-Trees have much better update performance. (A sketch of micro-indexing follows below.)
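For comparison, a minimal sketch of micro-indexing as described here (an illustrative layout, not necessarily Lomet's exact proposal): the page keeps one contiguous sorted entry array plus a small micro-index holding the first key of each cache line, so a search touches the micro-index and then a single line of the big array, but an insert must still shift the big contiguous array.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative micro-indexed page: 1023 entries of 8B keys (payload omitted),
// 8 entries per 64B cache line, so 128 cache lines and 128 micro-index slots.
constexpr int kEntries        = 1023;
constexpr int kEntriesPerLine = 8;
constexpr int kLines          = (kEntries + kEntriesPerLine - 1) / kEntriesPerLine;

struct MicroIndexedPage {
    int64_t micro_index[kLines];   // first key of each cache line of 'entries'
    int64_t entries[kEntries];     // one large contiguous sorted array
    int     count;
};

// Search: binary search the small micro-index (good locality), then binary
// search within a single cache line of the big array (one more cache miss).
int search(const MicroIndexedPage& p, int64_t key) {
    int lines = (p.count + kEntriesPerLine - 1) / kEntriesPerLine;
    const int64_t* mi_end = p.micro_index + lines;
    int line = int(std::upper_bound(p.micro_index, mi_end, key) - p.micro_index) - 1;
    if (line < 0) return -1;                      // key below the smallest entry
    int lo = line * kEntriesPerLine;
    int hi = std::min(lo + kEntriesPerLine, p.count);
    const int64_t* it = std::lower_bound(p.entries + lo, p.entries + hi, key);
    return (it != p.entries + hi && *it == key) ? int(it - p.entries) : -1;
}
```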
Slide 34: Fractal Prefetching B+-Trees: Conclusion
Search: combine cache-optimized and disk-optimized node sizes.
Better cache performance: 1.1-1.8 speedup over disk-optimized B+-Trees.
Good disk performance for disk-first fpB+-Trees: they visit < 3% more disk pages; we recommend cache-first fpB+-Trees only with very large memory.
Update: solve the data movement problem by using smaller nodes.
Better cache performance: up to a 20-fold speedup over disk-optimized B+-Trees.
Range scan: employ jump-pointer array prefetching.
Better cache performance and better disk performance: 2.5-5.0 speedup on IBM DB2.
Slide 35: Backup Slides
Slide 36: Previous Work: Prefetching B+-Trees (SIGMOD 2001)
Studied B+-Trees in a main-memory environment.
For search: prefetch wider tree nodes — increase the node size to multiple cache lines and use prefetching to read all cache lines of a node in parallel. (A sketch follows below.)
[Figures: a B+-Tree with one-line nodes vs. a prefetching B+-Tree with four-line nodes.]
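A hedged sketch of the wide-node prefetching idea; the node width and the use of the __builtin_prefetch intrinsic are assumptions. Before binary-searching a multi-line node, one prefetch is issued per cache line, so the misses overlap instead of being serialized by the search's dependent loads.

```cpp
#include <cstdint>

constexpr int kLineSize  = 64;
constexpr int kNodeLines = 4;    // e.g. a four-cache-line node

struct alignas(kLineSize) WideNode {
    int     num_keys;
    int64_t keys[kNodeLines * kLineSize / 8 - 2];  // roughly fills four lines
    // ... child pointers omitted for brevity ...
};

// Fetch every cache line of the node before searching it, so the misses
// are serviced in parallel rather than one at a time.
inline void prefetch_node(const WideNode* n) {
    const char* p = reinterpret_cast<const char*>(n);
    for (int i = 0; i < kNodeLines; ++i)
        __builtin_prefetch(p + i * kLineSize, /*rw=*/0, /*locality=*/3);
}
```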
Slide 37: Prefetching B+-Trees (cont'd)
For range scan: jump-pointer array prefetching — build jump-pointer arrays that hold the leaf node addresses and prefetch leaf nodes through them.
Two implementations: an external jump-pointer array and an internal jump-pointer array.
Slide 38: Optimization in the Disk-first Approach
Two conflicting goals: (1) optimize search cache performance, and (2) maximize page fan-out to preserve good I/O performance.
Optimality criterion: maximize page fan-out while keeping the analytical search cost within 10% of the optimal.
Details are in the paper. (A simplified sketch of the selection procedure follows below.)
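A simplified sketch of such a selection procedure; the cost model and constants below are assumptions for illustration, not the paper's formulas, and node width is used as a stand-in for page fan-out. It enumerates candidate node widths, estimates an analytical search cost, and keeps the widest candidate whose cost stays within 10% of the best.

```cpp
#include <cmath>
#include <cstdio>

// Assumed cost model: a node of w cache lines costs one full miss plus
// (w - 1) pipelined line transfers, and the in-page tree needs enough
// levels to cover the page's entries. The real model is in the paper.
int main() {
    const double full_miss = 150, extra_line = 10;        // cycles (assumed)
    const int page_entries = 2048, entries_per_line = 8;   // 16KB page, 8B entries

    auto cost_of = [&](int w) {
        int fanout = w * entries_per_line;
        int levels = (int)std::ceil(std::log(page_entries) / std::log(fanout));
        return levels * (full_miss + (w - 1) * extra_line);
    };

    double best_cost = 1e30;
    int best_width = 1, chosen_width = 1;
    for (int w = 1; w <= 16; ++w)                          // find the cheapest width
        if (cost_of(w) < best_cost) { best_cost = cost_of(w); best_width = w; }

    for (int w = 16; w >= 1; --w)                          // widest within 10% of best
        if (cost_of(w) <= 1.1 * best_cost) { chosen_width = w; break; }

    std::printf("cheapest width %d lines, chosen width %d lines\n",
                best_width, chosen_width);
}
```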
Slide 39: Cache-first fpB+-Trees Structure
Group sibling leaf nodes into the same pages for range scans.
Group a parent and its children into the same page for searches.
Leaf parent nodes may be placed in overflow pages.
Slide 40: Simulation Parameters
Pipeline parameters:
Clock Rate: 1 GHz
Issue Width: 4 insts/cycle
Functional Units: 2 Int, 2 FP, 2 Mem, 1 Branch
Reorder Buffer Size: 64 insts
Integer Multiply/Divide: 12/76 cycles
All Other Integer: 1 cycle
FP Divide/Square Root: 15/20 cycles
All Other FP: 2 cycles
Branch Prediction Scheme: gshare
Memory parameters:
Line Size: 64 bytes
Primary Data Cache: 64 KB, 2-way set-assoc.
Primary Instruction Cache: 64 KB, 2-way set-assoc.
Miss Handlers: 32 for data, 2 for inst
Unified Secondary Cache: 2 MB, direct-mapped
Primary-to-Secondary Miss Latency: 15 cycles (plus contention)
Primary-to-Memory Miss Latency: 150 cycles (plus contention)
Main Memory Bandwidth: 1 access per 10 cycles
The simulator models all the gory details, including memory system contention.
Slide 41: Optimal Node Sizes Computation (key = 4B)
Disk-first fpB+-Trees:
Page Size | Nonleaf Node | Leaf Node | Page Fan-out | Cost/Optimal
4KB  | 64B  | 384B | 470  | 1.06
8KB  | 192B | 256B | 961  | 1.00
16KB | 192B | 512B | 1953 | 1.03
32KB | 256B | 832B | 4017 | 1.07
Cache-first fpB+-Trees:
Page Size | Node Size | Page Fan-out | Cost/Optimal
4KB  | 576B | 497  | 1.03
8KB  | 576B | 994  | 1.03
16KB | 704B | 2001 | 1.07
32KB | 640B | 4029 | 1.05
Optimality criterion: maximize page fan-out while keeping the analytical search cost within 10% of the optimal.
We used these optimal values in our experiments.
Slide 42: Search Cache Performance (with Micro-indexing)
[Chart: execution time (M cycles, 1-3.5) vs. number of entries in leaf pages (10^5 to 10^7) for disk-optimized B+-Trees, micro-indexing, disk-first fpB+-Trees, and cache-first fpB+-Trees; 2000 random searches after bulkload; 100% full except root; 16KB pages.]
The cache-sensitive schemes (fpB+-Trees and micro-indexing) all perform significantly better than disk-optimized B+-Trees, and their performance is similar to one another.
Slide 43: Search Cache Performance (Varying Page Sizes)
[Charts: execution time (M cycles) vs. number of entries in leaf pages (10^5 to 10^7) for 4KB, 8KB, and 32KB pages; same four schemes.]
The same experiments with different page sizes show the same trends: the cache-sensitive schemes are better, achieving speedups of 1.09-1.77 at all sizes and 1.25-1.77 when the trees contain at least 1M entries.
Slide 44: Optimal Width Selection
[Charts: disk-first and cache-first fpB+-Trees; 16KB pages, 4B keys.]
Our selected trees perform within 2% and 5% of the best for disk-first and cache-first fpB+-Trees, respectively.
Slide 45: Search I/O Performance (Backup)
[Charts: number of I/O reads (x 1000, roughly 1.2-2.6) vs. page size (4KB-32KB); panels for mature trees and for trees just after bulkload (100% full); 2000 random searches, 4B keys; disk-optimized B+-Tree, disk-first fpB+-Tree, cache-first fpB+-Tree.]
Disk-first fpB+-Trees access < 3% more pages: a very small I/O performance impact.
Cache-first fpB+-Trees may access up to 25% more pages in our results.
Slide 46: Insertion Cache Performance (with Micro-indexing)
[Chart: execution time (M cycles, 0-100) vs. page size (4KB-32KB) for disk-optimized B+-Trees, micro-indexing, disk-first fpB+-Trees, and cache-first fpB+-Trees; 2000 random insertions after bulkloading 3M keys 70% full.]
Slide 47: Insertion Cache Performance II
[Chart: execution time (M cycles, 0-80) vs. bulkload factor (60-100%) for the four schemes; 2000 random insertions after bulkloading 3M keys; 16KB pages.]
fpB+-Trees are significantly faster than both disk-optimized B+-Trees and micro-indexing, achieving up to 35-fold speedups over disk-optimized B+-Trees across all page sizes.
Slide 48: Insertion Cache Performance II (continued)
[Same chart as the previous slide.]
The two major costs are data movement and page splits. Micro-indexing still suffers from the data movement costs; fpB+-Trees avoid this problem with their smaller nodes.
Slide 49: Space Utilization
[Charts: space overhead (percentage, 0-50) vs. page size (4KB-32KB) for disk-first and cache-first fpB+-Trees; panels for trees just after bulkload (100% full) and for mature trees; 4B keys.]
Disk-first fpB+-Trees incur < 9% space overhead.
Cache-first fpB+-Trees may use up to 36% more pages in our results.
Slide 50: Range Scan Cache Performance (Backup)
[Chart: execution time (M cycles, 0-1600) vs. bulkload factor (60-100%) for disk-optimized B+-Trees, disk-first fpB+-Trees, and cache-first fpB+-Trees; 100 scans starting at random locations on an index bulkloaded with 3M keys; each range spans 1M keys; 16KB pages.]
Disk-first and cache-first fpB+-Trees achieve speedups of 3.5-4.2 and 3.0-3.5, respectively, over disk-optimized B+-Trees.
Slide 51: Range Scan I/O Performance
[Chart: execution time (s, 0-100) vs. number of disks used (1-10) for a plain range scan and for jump-pointer array prefetching; 10M entries in the range.]
Setup: SGI Origin 200 with four 180MHz R10000 processors, 128MB memory, 12 SCSI disks (10 of them used in the experiments); range scans on mature trees.
Jump-pointer array prefetching achieves up to a 6.9-fold speedup.
Slide 52: Jump-pointer Array Prefetching on IBM DB2
[Charts: normalized execution time (0-100) vs. number of I/O processes (1-12) and vs. SMP degree (number of parallel processes, 1-10), for no prefetch, with prefetch, and in-memory runs.]
Setup: 8-processor machine (RS/6000 line), 2GB memory, 80 SSA disks; mature index on a 12.8GB table; query "SELECT COUNT(*) FROM data".
Jump-pointer array prefetching achieves speedups of 2.5-5.0.