Slide 1: Fractal Prefetching B+-Trees: Optimizing Both Cache and Disk Performance
Joint work with:
Shimin Chen, School of Computer Science, Carnegie Mellon University
Phillip B. Gibbons, Bell Laboratories (current affiliation: Intel Research Pittsburgh)
Todd C. Mowry, School of Computer Science, Carnegie Mellon University
Gary Valentin, DB2 UDB Development Team, IBM Toronto Lab
Slide 2: B+-Tree Operations: Review
Search: binary search in every node on the path.
Insertion/Deletion: a search followed by data movement.
Range Scan: locate a collection of tuples in a range by traversing the linked list of leaf nodes; different from the search-like operations above. (A sketch of these operations follows below.)
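To make these operations concrete, here is a minimal C++ sketch of search and range scan; the Node layout and the keys/children/tuple_ids/next_leaf fields are illustrative assumptions, not the structures used in the talk. Each node on the path is located by a binary search, and a range scan walks the leaf-level linked list.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative node layout (not the paper's actual structure).
struct Node {
    bool is_leaf;
    std::vector<int64_t> keys;       // sorted keys
    std::vector<Node*>   children;   // children (internal nodes only)
    std::vector<int64_t> tuple_ids;  // payload (leaf nodes only)
    Node* next_leaf = nullptr;       // leaf-level linked list
};

// Search: binary search in every node on the root-to-leaf path.
int64_t* search(Node* root, int64_t key) {
    Node* n = root;
    while (!n->is_leaf) {
        // The first separator key greater than 'key' picks the child to follow.
        auto it = std::upper_bound(n->keys.begin(), n->keys.end(), key);
        n = n->children[it - n->keys.begin()];
    }
    auto it = std::lower_bound(n->keys.begin(), n->keys.end(), key);
    if (it != n->keys.end() && *it == key)
        return &n->tuple_ids[it - n->keys.begin()];
    return nullptr;  // not found
}

// Range scan: locate the starting leaf, then walk the leaf linked list.
void range_scan(Node* root, int64_t lo, int64_t hi, std::vector<int64_t>& out) {
    Node* n = root;
    while (!n->is_leaf) {
        auto it = std::upper_bound(n->keys.begin(), n->keys.end(), lo);
        n = n->children[it - n->keys.begin()];
    }
    for (; n != nullptr; n = n->next_leaf)
        for (std::size_t i = 0; i < n->keys.size(); ++i) {
            if (n->keys[i] > hi) return;
            if (n->keys[i] >= lo) out.push_back(n->tuple_ids[i]);
        }
}
```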
Slide 3: Disk-optimized B+-Trees
[Diagram: memory hierarchy — CPU, L1, L2/L3 cache, main memory, disks.]
Traditional focus: I/O performance.
Minimize the number of disk accesses; optimal tree nodes are disk pages, typically 4KB-64KB large.
Slide 4: Cache-optimized B+-Trees
Recent studies target cache performance, e.g. [Rao & Ross, SIGMOD'00], [Bohannon, McIlroy, Rastogi, SIGMOD'01], [Chen, Gibbons, Mowry, SIGMOD'01].
Cache line size is 32-128B; optimal tree nodes are only a few cache lines.
Slide 5: Large Difference in Node Sizes
[Memory hierarchy diagram: CPU, L1, L2/L3 cache, main memory, disks.]
Slide 6: Cache-optimized B+-Trees: Poor I/O Performance
A search may fetch a distinct disk page for every node on its path; range scans suffer a similar penalty.
Slide 7: Disk-optimized B+-Trees: Poor Cache Performance
Binary search in a large node suffers an excessive number of cache misses (explained later in the talk).
Slide 8: Optimizing for Both Cache and Disk Performance?
Slide 9: Our Approach
Fractal Prefetching B+-Trees (fpB+-Trees): embed cache-optimized trees inside disk-optimized trees.
Slide 10: Outline
Overview
Optimizing Searches and Updates
Optimizing Range Scans
Experimental Results
Related Work
Conclusion
Slide 11: Page Structure of Disk-optimized B+-Trees
We focus on fixed-size keys (please see our full paper for a discussion of variable-size keys).
A page consists of a header followed by one huge array of index entries; an index entry is a <key, pageID> pair (nonleaf pages) or a <key, tupleID> pair (leaf pages).
Slide 12: Binary Search in a B+-Tree Page
Running example: search for entry #71.
Suppose the index entry array has 1023 index entries, numbered 1-1023 (e.g. an 8KB page with an 8B header, 8B entries, and 64B cache lines).
With 8 index entries per cache line, the array occupies 128 cache lines.
[Figure: the entry array laid out across cache lines 1 through 128.]
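The slide's arithmetic can be checked in a few lines of C++; this is only the stated example (8KB page, 8B header, 64B cache lines) plus the assumption that an index entry occupies 8B.

```cpp
#include <cstdio>

int main() {
    const int page_size  = 8 * 1024;  // 8KB disk page
    const int header     = 8;         // 8B page header
    const int entry_size = 8;         // assumed 8B per index entry
    const int line_size  = 64;        // 64B cache line

    int entries_per_page = (page_size - header) / entry_size;  // 1023
    int entries_per_line = line_size / entry_size;             // 8
    int lines_per_page   = page_size / line_size;              // 128

    std::printf("%d entries, %d per cache line, %d cache lines per page\n",
                entries_per_page, entries_per_line, lines_per_page);
}
```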
Slide 13: Binary Search in a B+-Tree Page (continued)
Searching for entry #71 in the array of entries 1-1023, binary search probes positions 512, 256, 128, 64, 96, 80, 72, 68, 70, and finally 71, narrowing the active range at each step.
Poor cache performance because of poor spatial locality: almost every probe lands in a different cache line (see the sketch below).
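To see why the locality is poor, a small sketch (assuming 8 entries per 64B cache line, as on the previous slide) maps each probe of this search to its cache line; only the last few probes share a line, so the ten probes touch seven distinct cache lines.

```cpp
#include <cstdio>
#include <set>

int main() {
    // Probe positions from the slide's example search for entry #71
    // in an array of entries numbered 1..1023.
    int probes[] = {512, 256, 128, 64, 96, 80, 72, 68, 70, 71};
    const int entries_per_line = 8;  // 8B entries, 64B cache lines

    std::set<int> lines_touched;
    for (int p : probes)
        lines_touched.insert((p - 1) / entries_per_line);

    // Only the last probes (72, 68, 70, 71) share a cache line;
    // each earlier probe misses in a different line.
    std::printf("%zu probes touch %zu distinct cache lines\n",
                sizeof(probes) / sizeof(probes[0]), lines_touched.size());
}
```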
Slide 14: Fractal Prefetching B+-Trees (fpB+-Trees)
Embed cache-optimized trees inside disk pages:
Good search cache performance: binary search in cache-optimized nodes has much better locality, and cache prefetching can be used.
Good search disk performance: the nodes are embedded into disk pages.
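A hedged sketch of the embedding idea (all sizes and field names below are illustrative assumptions, not the layout chosen in the paper): a disk page holds a small tree of cache-line-aligned nodes, so the page remains the unit of disk I/O while binary search runs inside cache-sized nodes.

```cpp
#include <cstdint>

// Illustrative assumptions: 16KB page, 64B cache line, 8B keys.
constexpr int kPageSize    = 16 * 1024;
constexpr int kKeysPerNode = 48;   // keeps a node a few cache lines wide

// A small, cache-optimized node embedded inside a disk page.
struct alignas(64) InPageNode {
    uint16_t num_keys;
    int64_t  keys[kKeysPerNode];
    uint16_t children[kKeysPerNode + 1];  // slots of child nodes within this page,
                                          // or page IDs / tuple IDs at the leaf level
};

// A disk page is itself a small tree of in-page nodes (the "fractal" idea):
// binary search runs inside cache-sized nodes, so spatial locality is good,
// while the whole page remains the unit of disk I/O.
struct DiskPage {
    uint32_t   page_id;
    uint16_t   root_slot;   // slot index of the in-page root node
    InPageNode nodes[(kPageSize - 64) / sizeof(InPageNode)];  // reserve 64B header
};

static_assert(sizeof(InPageNode) % 64 == 0, "node is a whole number of cache lines");
static_assert(sizeof(DiskPage) <= kPageSize, "page fits in one disk page");
```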
Slide 15: Node Size Mismatch Problem
Disk page size and cache-optimized node size are determined by hardware parameters and key sizes, so ideally cache-optimized trees would fit nicely in disk pages. Usually this is not true: a 2-level tree of cache-optimized nodes may overflow the page, or it may underflow (leaving unused space) while adding one more level overflows.
Slide 16: Two Solutions
Solution 1: use different sizes for in-page leaf and nonleaf nodes, e.g. a smaller root when the tree overflows and a larger root when it underflows.
Solution 2: let overflowing nodes become the roots of new pages.
Slide 17: The Two Solutions from Another Point of View
Conceptually we apply the disk and cache optimizations in different orders.
Solution 1 is disk-first: first build the disk-optimized pages, then fit smaller trees into the disk pages by allowing different node sizes.
Solution 2 is cache-first: first build the cache-optimized trees, then group nodes together and place them into disk pages.
Slide 18: Insertion and Deletion Cache Performance
In disk-optimized B+-Trees, data movement is very expensive: because of the huge array structure in disk pages, on average we must move half the array per update.
In our fpB+-Trees, the cost of data movement is much smaller thanks to the small cache-optimized nodes.
We show that fpB+-Trees have much better insertion/deletion performance than disk-optimized B+-Trees with fixed-size keys. (An illustration of the data-movement cost follows below.)
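The illustration below (with assumed sizes) shows where the cost comes from: inserting into one large sorted entry array shifts, on average, half of the array with memmove, whereas a small cache-optimized node shifts only a handful of entries.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstring>

struct Entry { int64_t key; int64_t tuple_id; };

// Insert into a sorted entry array, shifting later entries up by one slot.
// On average about half of 'count' entries are moved.
void sorted_insert(Entry* entries, int& count, Entry e) {
    Entry* pos = std::lower_bound(entries, entries + count, e,
        [](const Entry& a, const Entry& b) { return a.key < b.key; });
    std::memmove(pos + 1, pos, (entries + count - pos) * sizeof(Entry));
    *pos = e;
    ++count;
}

int main() {
    // Illustrative sizes: a disk-optimized page array holds ~1023 entries,
    // a cache-optimized fpB+-Tree node only a few dozen.
    std::printf("avg entries moved per insert: ~%d (page array) vs ~%d (small node)\n",
                1023 / 2, 48 / 2);
}
```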
Slide 19: Outline
Overview
Optimizing Searches and Updates
Optimizing Range Scans
Experimental Results
Related Work
Conclusion
Slide 20: Jump-pointer Array Prefetching for Range Scan
Previous proposal for range scan cache performance (SIGMOD'01): build data structures that hold the leaf node addresses, and prefetch leaf nodes during range scans.
[Figure: internal jump-pointer array.]
Recall that range scans essentially traverse the linked list of leaf nodes. (A prefetching sketch follows below.)
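A minimal sketch of jump-pointer array cache prefetching, using the GCC/Clang __builtin_prefetch intrinsic; the leaf layout and the prefetch distance are assumptions. While one leaf is being scanned, the address of a leaf several slots ahead in the jump-pointer array is prefetched, so its memory latency overlaps with useful work.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative leaf node: a few cache lines of sorted entries.
struct LeafNode {
    int     num_keys = 0;
    int64_t keys[48] = {};
    int64_t tuple_ids[48] = {};
};

// jump_pointers holds the addresses of all leaf nodes in key order.
// 'distance' is how many leaves ahead we prefetch, chosen so that fetching
// a leaf from memory is hidden behind scanning the earlier ones.
int64_t prefetched_range_scan(const std::vector<const LeafNode*>& jump_pointers,
                              std::size_t first, std::size_t last,
                              std::size_t distance = 8) {
    int64_t sum = 0;  // stand-in for real per-tuple work
    for (std::size_t i = first; i <= last; ++i) {
        if (i + distance <= last) {
            // GCC/Clang intrinsic: prefetch an upcoming leaf for reading.
            __builtin_prefetch(jump_pointers[i + distance], /*rw=*/0, /*locality=*/3);
        }
        const LeafNode* leaf = jump_pointers[i];
        for (int k = 0; k < leaf->num_keys; ++k) sum += leaf->tuple_ids[k];
    }
    return sum;
}
```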
Slide 21: New Proposal: I/O Prefetching
Employ jump-pointer array prefetching for I/O: link the leaf parent pages together, keep leaf page IDs in the jump-pointer arrays, and prefetch leaf pages to improve range scan I/O performance.
Very useful when leaf pages are not sequential on disk, e.g. a non-clustered index under frequent updates (when sequential prefetching is not applicable).
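The talk gives no code for the I/O variant; as a hedged analogue on a POSIX system, the jump-pointer array of leaf page IDs could drive posix_fadvise(POSIX_FADV_WILLNEED) hints so the OS reads pages ahead of the scan. A real DBMS such as DB2 would use its own asynchronous prefetchers; the page size and prefetch distance below are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <unistd.h>
#include <vector>

constexpr std::size_t kPageSize = 16 * 1024;   // assumed disk page size

// leaf_page_ids is the jump-pointer array: leaf page IDs in key order.
// For each page we scan, advise the OS that a page 'distance' ahead will be
// needed soon, so its read overlaps with processing the current page.
void scan_with_io_prefetch(int fd, const std::vector<uint32_t>& leaf_page_ids,
                           std::size_t distance = 16) {
    std::vector<char> buf(kPageSize);
    for (std::size_t i = 0; i < leaf_page_ids.size(); ++i) {
        if (i + distance < leaf_page_ids.size()) {
            off_t ahead = static_cast<off_t>(leaf_page_ids[i + distance]) * kPageSize;
            posix_fadvise(fd, ahead, kPageSize, POSIX_FADV_WILLNEED);  // readahead hint
        }
        off_t offset = static_cast<off_t>(leaf_page_ids[i]) * kPageSize;
        pread(fd, buf.data(), kPageSize, offset);   // read this leaf page
        // ... scan the entries in buf ...
    }
}
```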
Slide 22: Both Cache and I/O Prefetching in fpB+-Trees
fpB+-Trees use two jump-pointer arrays:
One for range scan cache performance, containing leaf node addresses for cache prefetching.
One for range scan disk performance, containing leaf page IDs for I/O prefetching.
Slide 23: More Details in Our Paper
Computation of optimal node sizes
Data structures
Algorithms: bulkload, search, insertion, deletion, range scan
Slide 24: Outline
Overview
Optimizing Searches and Updates
Optimizing Range Scans
Experimental Results
Related Work
Conclusion
Slide 25: Implementation
We implemented a buffer manager and three index structures on top of it: disk-optimized B+-Trees (the baseline), disk-first fpB+-Trees, and cache-first fpB+-Trees.
Slide 26: Experiments and Methodology
Experiments:
Search: (1) cache performance and (2) disk performance — improving cache performance while preserving good disk performance.
Update: (3) cache performance — solving the data movement problem.
Range scan: (4) cache performance and (5) disk performance — jump-pointer array prefetching.
Methodology:
Cache performance: detailed cycle-by-cycle simulations, with memory system parameters of the near future and better prefetching support.
Range scan I/O performance: execution times on real machines.
Search I/O performance: counting the number of I/Os, since I/O operations in a search do not overlap.
Slide 27: Search Cache Performance
[Chart: execution time (M cycles, 1-3.5) vs. total number of entries in all leaf pages (10^5 to 10^7) for disk-optimized B+-Trees, disk-first fpB+-Trees, and cache-first fpB+-Trees; 2000 random searches after bulkload; 100% full except root; 16KB pages.]
fpB+-Trees perform significantly better than disk-optimized B+-Trees, achieving speedups of 1.09-1.77 at all sizes and over 1.25 when the trees contain at least 1M entries.
The performance of the two fpB+-Trees is similar.
Slide 28: Search I/O Performance
[Chart: number of I/O reads (x 1000, roughly 1.2-2.6) vs. page size (4KB-32KB) for the three trees; 2000 random searches after bulkloading 10M index entries; 100% full except root.]
Disk-first fpB+-Trees access < 3% more pages: a very small I/O performance impact.
Cache-first fpB+-Trees may access up to 25% more pages in our results.
Slide 29: Insertion Cache Performance
[Chart: execution time (M cycles, 0-100) vs. page size (4KB-32KB) for the three trees; 2000 random insertions after bulkloading 3M keys 70% full.]
fpB+-Trees are significantly faster than disk-optimized B+-Trees, achieving up to 35-fold speedups.
Data movement costs dominate disk-optimized B+-Tree performance.
Slide 30: Range Scan Cache Performance
[Chart: execution time (M cycles, 0-1200) for the three trees; 100 scans starting at random locations in an index bulkloaded with 3M keys, 100% full; each range contains 1M keys; 16KB pages.]
Disk-first and cache-first fpB+-Trees achieve speedups of 4.2 and 3.5, respectively, over disk-optimized B+-Trees.
Jump-pointer array cache prefetching is effective.
Slide 31: Range Scan I/O Performance (IBM DB2 Universal Database)
[Chart: normalized execution time (0-100) for no prefetch, with prefetch, and in-memory runs.]
Setup: 8-processor machine (RS/6000 line), 2GB memory, 80 SSA disks; mature index on a 12.8GB table.
Jump-pointer array I/O prefetching achieves speedups of 2.5-5.0 for disk-optimized B+-Trees.
Slide 32: Other Experiments
We find similar benefits for deletion cache performance: up to 20-fold speedups.
We performed many cache performance experiments and got similar results for varying tree sizes, bulkload factors, and page sizes; for mature trees; and for varying key sizes (20B keys).
We also performed range scan I/O experiments in our own index implementation and saw up to 6.9-fold speedups.
Slide 33: Related Work
Micro-indexing (discussed briefly by Lomet, SIGMOD Record, Sep. 2001): a page keeps its contiguous array of index entries plus a small micro-index over it.
We are the first to quantitatively analyze the performance of micro-indexing: it improves search cache performance, but it still suffers from the data movement problem on updates because of the contiguous array structure; fpB+-Trees have much better update performance. (A sketch of micro-indexing follows below.)
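For comparison, a minimal sketch of micro-indexing as described here (an illustrative layout, not necessarily Lomet's exact proposal): the page keeps one contiguous sorted entry array plus a small micro-index holding the first key of each cache line, so a search touches the micro-index and then a single line of the big array, but an insert must still shift the big contiguous array.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative micro-indexed page: 1023 entries of 8B keys (payload omitted),
// 8 entries per 64B cache line, so 128 cache lines and 128 micro-index slots.
constexpr int kEntries        = 1023;
constexpr int kEntriesPerLine = 8;
constexpr int kLines          = (kEntries + kEntriesPerLine - 1) / kEntriesPerLine;

struct MicroIndexedPage {
    int64_t micro_index[kLines];   // first key of each cache line of 'entries'
    int64_t entries[kEntries];     // one large contiguous sorted array
    int     count;
};

// Search: binary search the small micro-index (good locality), then binary
// search within a single cache line of the big array (one more cache miss).
int search(const MicroIndexedPage& p, int64_t key) {
    int lines = (p.count + kEntriesPerLine - 1) / kEntriesPerLine;
    const int64_t* mi_end = p.micro_index + lines;
    int line = int(std::upper_bound(p.micro_index, mi_end, key) - p.micro_index) - 1;
    if (line < 0) return -1;                      // key below the smallest entry
    int lo = line * kEntriesPerLine;
    int hi = std::min(lo + kEntriesPerLine, p.count);
    const int64_t* it = std::lower_bound(p.entries + lo, p.entries + hi, key);
    return (it != p.entries + hi && *it == key) ? int(it - p.entries) : -1;
}
```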
Slide 34: Fractal Prefetching B+-Trees: Conclusion
Search: combine cache-optimized and disk-optimized node sizes.
Better cache performance: 1.1-1.8 speedup over disk-optimized B+-Trees.
Good disk performance for disk-first fpB+-Trees: they visit < 3% more disk pages; we recommend cache-first fpB+-Trees only with very large memory.
Update: solve the data movement problem by using smaller nodes.
Better cache performance: up to a 20-fold speedup over disk-optimized B+-Trees.
Range scan: employ jump-pointer array prefetching.
Better cache performance and better disk performance: 2.5-5.0 speedup on IBM DB2.
Slide 35: Backup Slides
Slide 36: Previous Work: Prefetching B+-Trees (SIGMOD 2001)
Studied B+-Trees in a main-memory environment.
For search: prefetch wider tree nodes — increase the node size to multiple cache lines and use prefetching to read all cache lines of a node in parallel. (A sketch follows below.)
[Figures: a B+-Tree with one-line nodes vs. a prefetching B+-Tree with four-line nodes.]
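A hedged sketch of the wide-node prefetching idea; the node width and the use of the __builtin_prefetch intrinsic are assumptions. Before binary-searching a multi-line node, one prefetch is issued per cache line, so the misses overlap instead of being serialized by the search's dependent loads.

```cpp
#include <cstdint>

constexpr int kLineSize  = 64;
constexpr int kNodeLines = 4;    // e.g. a four-cache-line node

struct alignas(kLineSize) WideNode {
    int     num_keys;
    int64_t keys[kNodeLines * kLineSize / 8 - 2];  // roughly fills four lines
    // ... child pointers omitted for brevity ...
};

// Fetch every cache line of the node before searching it, so the misses
// are serviced in parallel rather than one at a time.
inline void prefetch_node(const WideNode* n) {
    const char* p = reinterpret_cast<const char*>(n);
    for (int i = 0; i < kNodeLines; ++i)
        __builtin_prefetch(p + i * kLineSize, /*rw=*/0, /*locality=*/3);
}
```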
Slide 37: Prefetching B+-Trees (cont'd)
For range scan: jump-pointer array prefetching — build jump-pointer arrays that hold the leaf node addresses and prefetch leaf nodes through them.
Two implementations: an external jump-pointer array and an internal jump-pointer array.
Slide 38: Optimization in the Disk-first Approach
Two conflicting goals: (1) optimize search cache performance, and (2) maximize page fan-out to preserve good I/O performance.
Optimality criterion: maximize page fan-out while keeping the analytical search cost within 10% of the optimal.
Details are in the paper. (A simplified sketch of the selection procedure follows below.)
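A simplified sketch of such a selection procedure; the cost model and constants below are assumptions for illustration, not the paper's formulas, and node width is used as a stand-in for page fan-out. It enumerates candidate node widths, estimates an analytical search cost, and keeps the widest candidate whose cost stays within 10% of the best.

```cpp
#include <cmath>
#include <cstdio>

// Assumed cost model: a node of w cache lines costs one full miss plus
// (w - 1) pipelined line transfers, and the in-page tree needs enough
// levels to cover the page's entries. The real model is in the paper.
int main() {
    const double full_miss = 150, extra_line = 10;        // cycles (assumed)
    const int page_entries = 2048, entries_per_line = 8;   // 16KB page, 8B entries

    auto cost_of = [&](int w) {
        int fanout = w * entries_per_line;
        int levels = (int)std::ceil(std::log(page_entries) / std::log(fanout));
        return levels * (full_miss + (w - 1) * extra_line);
    };

    double best_cost = 1e30;
    int best_width = 1, chosen_width = 1;
    for (int w = 1; w <= 16; ++w)                          // find the cheapest width
        if (cost_of(w) < best_cost) { best_cost = cost_of(w); best_width = w; }

    for (int w = 16; w >= 1; --w)                          // widest within 10% of best
        if (cost_of(w) <= 1.1 * best_cost) { chosen_width = w; break; }

    std::printf("cheapest width %d lines, chosen width %d lines\n",
                best_width, chosen_width);
}
```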
Slide 39: Cache-first fpB+-Trees Structure
Group sibling leaf nodes into the same pages for range scans.
Group a parent and its children into the same page for searches.
Leaf parent nodes may be placed in overflow pages.
Slide 40: Simulation Parameters
Pipeline parameters:
Clock Rate: 1 GHz
Issue Width: 4 insts/cycle
Functional Units: 2 Int, 2 FP, 2 Mem, 1 Branch
Reorder Buffer Size: 64 insts
Integer Multiply/Divide: 12/76 cycles
All Other Integer: 1 cycle
FP Divide/Square Root: 15/20 cycles
All Other FP: 2 cycles
Branch Prediction Scheme: gshare
Memory parameters:
Line Size: 64 bytes
Primary Data Cache: 64 KB, 2-way set-assoc.
Primary Instruction Cache: 64 KB, 2-way set-assoc.
Miss Handlers: 32 for data, 2 for inst
Unified Secondary Cache: 2 MB, direct-mapped
Primary-to-Secondary Miss Latency: 15 cycles (plus contention)
Primary-to-Memory Miss Latency: 150 cycles (plus contention)
Main Memory Bandwidth: 1 access per 10 cycles
The simulator models all the gory details, including memory system contention.
Slide 41: Optimal Node Sizes Computation (key = 4B)
Disk-first fpB+-Trees:
Page Size | Nonleaf Node | Leaf Node | Page Fan-out | Cost/Optimal
4KB  | 64B  | 384B | 470  | 1.06
8KB  | 192B | 256B | 961  | 1.00
16KB | 192B | 512B | 1953 | 1.03
32KB | 256B | 832B | 4017 | 1.07
Cache-first fpB+-Trees:
Page Size | Node Size | Page Fan-out | Cost/Optimal
4KB  | 576B | 497  | 1.03
8KB  | 576B | 994  | 1.03
16KB | 704B | 2001 | 1.07
32KB | 640B | 4029 | 1.05
Optimality criterion: maximize page fan-out while keeping the analytical search cost within 10% of the optimal.
We used these optimal values in our experiments.
Slide 42: Search Cache Performance (with Micro-indexing)
[Chart: execution time (M cycles, 1-3.5) vs. number of entries in leaf pages (10^5 to 10^7) for disk-optimized B+-Trees, micro-indexing, disk-first fpB+-Trees, and cache-first fpB+-Trees; 2000 random searches after bulkload; 100% full except root; 16KB pages.]
The cache-sensitive schemes (fpB+-Trees and micro-indexing) all perform significantly better than disk-optimized B+-Trees, and their performance is similar to one another.
Slide 43: Search Cache Performance (Varying Page Sizes)
[Charts: execution time (M cycles) vs. number of entries in leaf pages (10^5 to 10^7) for 4KB, 8KB, and 32KB pages; same four schemes.]
The same experiments with different page sizes show the same trends: the cache-sensitive schemes are better, achieving speedups of 1.09-1.77 at all sizes and 1.25-1.77 when the trees contain at least 1M entries.
Slide 44: Optimal Width Selection
[Charts: disk-first and cache-first fpB+-Trees; 16KB pages, 4B keys.]
Our selected trees perform within 2% and 5% of the best for disk-first and cache-first fpB+-Trees, respectively.
Slide 45: Search I/O Performance (Backup)
[Charts: number of I/O reads (x 1000, roughly 1.2-2.6) vs. page size (4KB-32KB); panels for mature trees and for trees just after bulkload (100% full); 2000 random searches, 4B keys; disk-optimized B+-Tree, disk-first fpB+-Tree, cache-first fpB+-Tree.]
Disk-first fpB+-Trees access < 3% more pages: a very small I/O performance impact.
Cache-first fpB+-Trees may access up to 25% more pages in our results.
Slide 46: Insertion Cache Performance (with Micro-indexing)
[Chart: execution time (M cycles, 0-100) vs. page size (4KB-32KB) for disk-optimized B+-Trees, micro-indexing, disk-first fpB+-Trees, and cache-first fpB+-Trees; 2000 random insertions after bulkloading 3M keys 70% full.]
Slide 47: Insertion Cache Performance II
[Chart: execution time (M cycles, 0-80) vs. bulkload factor (60-100%) for the four schemes; 2000 random insertions after bulkloading 3M keys; 16KB pages.]
fpB+-Trees are significantly faster than both disk-optimized B+-Trees and micro-indexing, achieving up to 35-fold speedups over disk-optimized B+-Trees across all page sizes.
Slide 48: Insertion Cache Performance II (continued)
[Same chart as the previous slide.]
The two major costs are data movement and page splits. Micro-indexing still suffers from the data movement costs; fpB+-Trees avoid this problem with their smaller nodes.
Slide 49: Space Utilization
[Charts: space overhead (percentage, 0-50) vs. page size (4KB-32KB) for disk-first and cache-first fpB+-Trees; panels for trees just after bulkload (100% full) and for mature trees; 4B keys.]
Disk-first fpB+-Trees incur < 9% space overhead.
Cache-first fpB+-Trees may use up to 36% more pages in our results.
Slide 50: Range Scan Cache Performance (Backup)
[Chart: execution time (M cycles, 0-1600) vs. bulkload factor (60-100%) for disk-optimized B+-Trees, disk-first fpB+-Trees, and cache-first fpB+-Trees; 100 scans starting at random locations on an index bulkloaded with 3M keys; each range spans 1M keys; 16KB pages.]
Disk-first and cache-first fpB+-Trees achieve speedups of 3.5-4.2 and 3.0-3.5, respectively, over disk-optimized B+-Trees.
Slide 51: Range Scan I/O Performance
[Chart: execution time (s, 0-100) vs. number of disks used (1-10) for a plain range scan and for jump-pointer array prefetching; 10M entries in the range.]
Setup: SGI Origin 200 with four 180MHz R10000 processors, 128MB memory, 12 SCSI disks (10 of them used in the experiments); range scans on mature trees.
Jump-pointer array prefetching achieves up to a 6.9-fold speedup.
Slide 52: Jump-pointer Array Prefetching on IBM DB2
[Charts: normalized execution time (0-100) vs. number of I/O processes (1-12) and vs. SMP degree (number of parallel processes, 1-10), for no prefetch, with prefetch, and in-memory runs.]
Setup: 8-processor machine (RS/6000 line), 2GB memory, 80 SSA disks; mature index on a 12.8GB table; query "SELECT COUNT(*) FROM data".
Jump-pointer array prefetching achieves speedups of 2.5-5.0.