Improving Index Performance through Prefetching
Shimin Chen, Phillip B. Gibbons†, and Todd C. Mowry
School of Computer Science, Carnegie Mellon University
†Information Sciences Research Center, Bell Laboratories

Databases and the Memory Hierarchy
Traditional focus: buffer pool management (DRAM as a cache for disk).
Important focus today: processor cache performance (SRAM as a cache for DRAM), e.g., [Ailamaki et al., VLDB '99].
[Figure: the memory hierarchy, larger/slower/cheaper from L1 cache through L2/L3 cache and main memory to disk.]

Index Structures
Used extensively in databases to accelerate performance: selections, joins, etc.
Common implementation: B+-Trees, built from non-leaf nodes and leaf nodes.

B+-Tree Indices: Common Access Patterns
Search: locate a single tuple.
Range scan: locate a collection of tuples within a range.

Cache Performance of B+-Tree Indices
A main-memory B+-Tree containing 10M keys:
- Search: 100K random searches
- Scan: 100 range scans of 1M keys, starting at random keys
- Detailed simulations based on a Compaq ES40 system
Most of the execution time is wasted on data cache misses: 65% for searches, 84% for range scans.
[Chart: execution time broken into data cache stalls, other stalls, and busy time.]

B+-Trees: Optimizing Search for Cache vs. Disk
To minimize the number of data transfers (I/Os or cache misses): optimal node width = natural data transfer size.
- for disk: the disk page size (~8 KB)
- for cache: the cache line size (~64 bytes)
Cache-optimized trees therefore have much narrower nodes and greater height, and their search performance is more sensitive to changes in branching factor.
[Figure: a tree optimized for disk vs. a tree optimized for cache.]

Previous Work: Cache-Sensitive B+-Trees (CSB+-Trees)
Rao and Ross [SIGMOD 2000]. Key insight: nearly all child pointers can be eliminated by restricting the data layout, doubling the branching factor of cache-line-sized non-leaf nodes.
[Figure: a B+-Tree node interleaving keys and child pointers vs. CSB+-Tree nodes whose children are stored contiguously and reached through a single pointer.]
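
A minimal sketch of the layout contrast, in C (field names and sizes are illustrative, not from the paper):

    /* Conventional B+-Tree non-leaf node: one child pointer per key,
     * so pointers consume roughly half of each node. */
    struct btree_node {
        int keys[7];
        struct btree_node *child[8];
    };

    /* CSB+-Tree non-leaf node: all children of a node are stored
     * contiguously in one group, so a single pointer (plus the key's
     * index) locates any child -- freeing space for ~2x the keys. */
    struct csb_node {
        int keys[14];
        struct csb_node *first_child;   /* child i is first_child + i */
    };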

Impact of CSB+-Trees on Search Performance
Search is 15% faster due to the reduction in tree height. However:
- update performance is worse [Rao & Ross, SIGMOD '00]
- range scan performance does not improve
There is still significant room for improvement.
[Chart: data cache stalls, other stalls, and busy time for the B+-Tree vs. the CSB+-Tree.]

Latency Tolerance in Modern Memory Hierarchies
Modern processors overlap multiple simultaneous cache misses; e.g., the Compaq ES40 supports 8 off-chip misses per processor. Prefetch instructions allow software to fully exploit this parallelism.
What dictates performance is NOT simply the number of cache misses, but rather the amount of exposed miss latency.
[Figure: prefetch instructions (pref 0(r2), pref 4(r7), pref 0(r3), pref 8(r9)) in flight between the CPU, L1 cache, L2/L3 cache, and main memory.]
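
A minimal sketch of exploiting this in software with GCC's __builtin_prefetch (the function and its parameters are illustrative):

    /* Issue non-binding prefetches for several independent lines up
     * front, so their misses are serviced in parallel rather than
     * one after another; later accesses then hit in the cache. */
    void warm_lines(const char *base, int nlines, int line_size)
    {
        for (int i = 0; i < nlines; i++)
            __builtin_prefetch(base + (long)i * line_size, 0, 3);
    }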

Our Approach
New proposal: Prefetching B+-Trees (pB+-Trees): use prefetching to reduce the amount of exposed miss latency.
Key challenge: data dependences caused by chasing pointers.
Benefits:
- significant performance gains for searches, range scans, and updates (!)
- complementary to CSB+-Trees

Overview
- Prefetching Searches
- Prefetching Range Scans
- Experimental Results
- Conclusions

Example: Search Where Node Width = 1 Line
With 64B lines and 4B keys, pointers, and tupleIDs, the example B+-Tree has 4 levels (cold cache): we suffer one full cache miss at each level of the tree.
[Timeline: four serialized cache misses, one per level.]

Same Example Where Node Width = 2 Lines
The tree now has fewer levels, but the additional misses per node dominate the reduction in the number of levels.
[Timelines: two serialized misses per node vs. one miss per level before.]

How Things Change with Prefetching
Fetch all lines within a node in parallel: the number of misses is unchanged or higher, but the exposed miss latency shrinks.
[Timelines: the misses within each node overlapped instead of serialized.]

pB+-Trees: Using Prefetching to Improve Search
Basic idea: make nodes wider than the natural data transfer size (e.g., 8 cache lines wide), and prefetch all lines of a node before searching in the node.
Improved search performance: larger branching factors and shallower trees, while the cost to access each node increases only slightly.
Reduced space overhead: primarily due to fewer non-leaf nodes.
Update performance: ???
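
A minimal sketch of the resulting search loop, assuming 64-byte lines, 8-line nodes, and hypothetical helpers (is_leaf, find_child, child_at are illustrative, not the paper's code):

    #define LINE  64
    #define WIDTH 8                       /* node spans 8 cache lines */

    struct pnode;                         /* wide pB+-Tree node */
    int  is_leaf(const struct pnode *n);
    int  find_child(const struct pnode *n, int key);
    struct pnode *child_at(const struct pnode *n, int i);

    struct pnode *search(struct pnode *n, int key)
    {
        for (;;) {
            /* fetch all lines of the node in parallel before searching it */
            for (int i = 0; i < WIDTH; i++)
                __builtin_prefetch((const char *)n + i * LINE, 0, 3);
            if (is_leaf(n))
                return n;                 /* caller then searches the leaf */
            n = child_at(n, find_child(n, key));
        }
    }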

Overview
- Prefetching Searches
- Prefetching Range Scans
- Experimental Results
- Conclusions

Range Scan Cache Behavior: Normal B+-Trees
Steps in a range scan: (1) search for the starting leaf node; (2) traverse the leaves until the end is found.
We suffer a full cache miss for each leaf node!
[Timeline: one serialized miss per leaf.]

If Prefetching Wider Nodes
E.g., with node width = 2 lines, the exposed miss latency is reduced by up to a factor of the node width.
A definite improvement, but can we still do better?
[Timelines: the two misses per node overlapped.]

The Ideal Case
Overlap misses until all latency is hidden, or until we run out of bandwidth. How can we achieve this?
[Timeline: all leaf misses fully overlapped.]

The Pointer-Chasing Problem
A leaf's address is only known once its predecessor has been visited: if we prefetch by chasing pointers, we still experience the full latency at each node. The ideal case requires prefetching leaves directly.
[Figure: the node currently being visited vs. the node we want to prefetch.]

Our Solution: Jump-Pointer Arrays
- Put the leaf addresses in an array.
- Prefetch directly using the jump pointers.
- Back-pointers are needed to initialize prefetching.
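
A minimal sketch of a scan driven by jump pointers (a flat array of leaf addresses; names and the prefetch distance are illustrative):

    struct leaf;                          /* leaf node type (assumed) */
    void consume(const struct leaf *l);   /* process one leaf's entries */

    /* jump[] holds the address of every leaf in key order; the back-
     * pointer in the starting leaf tells us our index 'start'. */
    void range_scan(struct leaf * const *jump, int start, int n, int dist)
    {
        for (int i = start; i < start + n; i++) {
            if (i + dist < start + n)
                __builtin_prefetch(jump[i + dist], 0, 3); /* dist leaves ahead */
            consume(jump[i]);             /* largely cache-resident by now */
        }
    }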

Our Solution: Jump-Pointer Arrays (continued)
[Timeline: with jump pointers, the leaf misses overlap almost completely.]

External Jump-Pointer Arrays: Efficient Updates
The array is implemented as a chunked linked list, with hints from the leaves back into it:
- The impact of an insertion is limited to its chunk.
- Deletions leave empty slots.
- Empty slots are actively interleaved during bulkload and chunk splits.
- The back-pointer into the jump-pointer array is now a hint: it points to the correct chunk, but may require a local search within the chunk to initialize prefetching.
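
A minimal sketch of the chunked structure (chunk size and field names are illustrative):

    #define CHUNK_SLOTS 32

    struct leaf;                           /* leaf node type (assumed) */

    struct chunk {
        struct chunk *next;                /* chunked linked list */
        struct leaf  *slot[CHUNK_SLOTS];   /* leaf addresses; NULL = hole */
    };

    struct leaf_hdr {
        struct chunk *hint;  /* chunk that (probably) holds this leaf's
                              * address; a short local search inside the
                              * chunk pins down the exact slot */
        /* ... keys and tupleIDs ... */
    };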

Alternative Design: Internal Jump-Pointer Arrays
B+-Trees already contain structures that point to the leaf nodes: the parents of the leaves ("bottom non-leaf nodes"). By linking them together, we can use them as a jump-pointer array (see the sketch below).
Tradeoffs:
- no need for back-pointers, and simpler to maintain
- consumes less space, though even the external array's overhead is < 1%
- but less flexible: the chunk size is fixed by the B+-Tree structure
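
A minimal sketch of the internal variant (fields illustrative): the leaf parents double as the jump-pointer array, so no extra structure is allocated.

    /* Bottom non-leaf nodes (parents of leaves) are linked together;
     * their child[] arrays already hold the leaf addresses, so a range
     * scan prefetches through child[] while walking the next links. */
    struct bottom_node {
        int                 nkeys;
        struct leaf        *child[8];      /* leaf addresses = jump pointers */
        struct bottom_node *next;          /* link to the next leaf parent */
    };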

Overview
- Prefetching Searches
- Prefetching Range Scans
- Experimental Results: search performance, range scan performance, update performance
- Conclusions

Experimental Framework
Results are for a main-memory database environment (we are extending this work to disk-based environments).
Executables: we added prefetch instructions to the C source code by hand, then used gcc to generate optimized MIPS executables containing the prefetches.
Performance measurement: detailed, cycle-by-cycle simulations.
Machine model: based on a Compaq ES40 system, with slightly updated parameters.

Simulation Parameters
Pipeline parameters:
- Clock rate: 1 GHz
- Issue width: 4 insts/cycle
- Functional units: 2 Int, 2 FP, 2 Mem, 1 Branch
- Reorder buffer size: 64 insts
- Integer multiply/divide: 12/76 cycles; all other integer: 1 cycle
- FP divide/square root: 15/20 cycles; all other FP: 2 cycles
- Branch prediction scheme: gshare
Memory parameters:
- Line size: 64 bytes
- Primary data cache: 64 KB, 2-way set-assoc.
- Primary instruction cache: 64 KB, 2-way set-assoc.
- Miss handlers: 32 for data, 2 for instructions
- Unified secondary cache: 2 MB, direct-mapped
- Primary-to-secondary miss latency: 15 cycles (plus contention)
- Primary-to-memory miss latency: 150 cycles (plus contention)
- Main memory bandwidth: 1 access per 10 cycles
The simulator models all the gory details, including memory system contention.

Index Search Performance
100K random searches after bulkload; nodes 100% full (except the root); warm caches.
- pB+-Trees achieve a 27-47% speedup vs. B+-Trees and 14-34% vs. CSB+-Trees.
- The optimal node width is 8 cache lines.
- pB+-Trees and CSB+-Trees are complementary: p8CSB+-Trees are best.
[Chart: time (M cycles) vs. # of tupleIDs in the tree, for B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p8CSB+.]

Same Search Experiments with Cold Caches
100K random searches after bulkload; 100% full (except the root); cold caches (i.e., cleared after each search).
The curves show large discrete steps. What is happening here?
[Chart: time (M cycles) vs. # of tupleIDs in the trees, same tree variants.]

Analysis of Cold-Cache Search Behavior
The height of the tree dominates performance (the effect is blurred in the warm-cache case); among trees of the same height, the smaller the node size, the better.
[Chart as before, plus a table of the number of levels in each tree (B+, CSB+, p2B+, p4B+, p8B+, p16B+, p8CSB+) at 10K, 30K, 100K, 300K, 1M, 3M, and 10M keys.]

Overview
- Prefetching Searches
- Prefetching Range Scans
- Experimental Results: search performance, range scan performance, update performance
- Conclusions

Index Range Scan Performance
100 scans starting at random locations on an index bulkloaded with 3M keys (100% full); log scale.
Scans of 1K-1M keys show large speedups over B+-Trees:
- a factor of ~3.5 from prefetching wider nodes
- an additional factor of ~2 from jump-pointer arrays
[Chart: time (cycles) vs. # of tupleIDs scanned in a single call, for B+tree, p8B+tree, p8iB+tree, and p8eB+tree.]

Index Range Scan Performance (continued)
For small scans (< 1K keys), the cost of overshooting the end of the range is noticeable: exploit jump-pointer prefetching only if the scan is expected to be large (e.g., search for the end first).
[Same chart, log scale: 100 scans starting at random locations on an index bulkloaded with 3M keys (100% full).]

Overview
- Prefetching Searches
- Prefetching Range Scans
- Experimental Results: search performance, range scan performance, update performance
- Conclusions

Update Performance
100K random insertions/deletions on a 3M-key bulkloaded index; warm caches.
pB+-Trees achieve at least a 1.24 speedup in all cases. Why?
[Charts: time (M cycles) vs. percentage of entries used in leaf nodes, for insertions and deletions, comparing B+tree, p8B+tree, p8eB+tree, and p8iB+tree.]

Update Performance (continued)
Reason #1: faster search times.
Reason #2: less frequent node splits, thanks to wider nodes.
[Same charts.]

pB+-Trees: Other Results
Similar results for:
- varying bulkload factors of trees
- large segmented range scans
- mature trees
- varying jump-pointer array parameters: prefetch distance and chunk size
The optimal node width increases as memory bandwidth increases (matching the width predicted by our model in the paper).

Cache Performance Revisited
Search: eliminated 45% of the original data cache stalls, for a 1.47 speedup.
Scan: eliminated 97% of the original data cache stalls, for an 8-fold speedup.
[Chart: data cache stalls, other stalls, and busy time, before and after.]

Conclusions
Impact of Prefetching B+-Trees on performance:
- Search: 1.27-1.47 speedup over B+-Trees; wider nodes reduce the height of the tree and the number of expensive misses; they outperform and are complementary to CSB+-Trees.
- Updates: at least a 1.24 speedup over B+-Trees, from faster search and less frequent node splits; in contrast with significant slowdowns for CSB+-Trees.
- Range scans: up to an 8-fold speedup over B+-Trees; wider nodes contribute a factor of ~3.5, and jump-pointer arrays an additional factor of ~2.
Prefetching B+-Trees also reduce space overhead.
These benefits are likely to increase with future memory systems, and the approach is applicable to other levels of the memory hierarchy (e.g., disks).

Backup Slides

Revisiting the Optimal Node Width for Searches
The total number of cache misses for a search is

    misses(w) = w × ⌈log_{w·m}(N)⌉    (misses per level × # of levels in the tree)

where w = # of cache lines per node, m = # of child pointers per one-cache-line-wide node, and N = # of tupleIDs in the index. Without prefetching this count is minimized when w = 1: e.g., with m = 8, doubling w to 2 doubles the misses per level while shrinking the number of levels by only a factor of log 16 / log 8 ≈ 1.33.
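
A hedged reconstruction of why prefetching changes this tradeoff, in the paper's terminology (T_1 and T_next as in Table 1 below; the paper's exact model may include further terms):

    \underbrace{w\,T_1}_{\text{per level, no prefetch}} \cdot \lceil \log_{wm} N \rceil
    \quad\longrightarrow\quad
    \underbrace{\bigl(T_1 + (w-1)\,T_{next}\bigr)}_{\text{per level, prefetched}} \cdot \lceil \log_{wm} N \rceil

Since B = T_1/T_next is large (15 for the simulated machine: a 150-cycle memory latency against one access per 10 cycles), each extra line in a prefetched node is cheap, so widening nodes now reduces total search time until bandwidth runs out.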

Scheduling Prefetches Early Enough
While visiting node n_i, we want to prefetch n_{i+3}:

    p = &n0;
    while (p) {
        work(p->data);
        p = p->next;
    }

Let L be the time to load a node and W the time for work() on it. Our goal: fully hide the latency, achieving the fastest possible computation rate of 1/W. E.g., if L = 3W, we must prefetch 3 nodes ahead to achieve this.

Performance without Prefetching

    while (p) {
        work(p->data);
        p = p->next;
    }

Each iteration serializes: load n_i (L_i cycles), then work(n_i) (W_i cycles). Computation rate = 1/(L+W).
[Timeline: alternating L_i and W_i for n_i through n_{i+3}.]

Prefetching One Node Ahead

    while (p) {
        pf(p->next);        /* while visiting n_i, prefetch n_{i+1} */
        work(p->data);
        p = p->next;
    }

Computation is overlapped with memory accesses, but the data dependence on p->next limits the computation rate to 1/L.
[Timeline: each load overlapped with the previous node's work.]

Prefetching Three Nodes Ahead

    while (p) {
        pf(p->next->next->next);
        work(p->data);
        p = p->next;
    }

The computation rate does not improve (still 1/L)! This is the pointer-chasing problem [Luk & Mowry, ASPLOS '96]: any scheme that follows the pointer chain is limited to a rate of 1/L.
[Timeline: the prefetches are still serialized by the pointer dereferences.]

Our Goal: Fully Hide Latency

    while (p) {
        pf(&n_{i+3});       /* address obtained without chasing pointers */
        work(p->data);
        p = p->next;
    }

If each node's address is known without following the chain, we achieve the fastest possible computation rate of 1/W.
[Timeline: all loads fully overlapped with work.]

Challenges in Supporting Efficient Updates
Conceptual view of the jump-pointer array: one flat array of leaf addresses, with back-pointers from the leaves. What if we really implemented it this way?
- Insertion could incur significant overheads: copying data within the array to create a new hole, and updating back-pointers.
- Deletion is fine: just leave a hole.

Summary: Why We Expect Updates to Perform Well
Insertions (see the sketch below):
- only a small number of jump pointers move: those between the insertion point and the nearest hole in the chunk
- normally only the hint pointer for the inserted node is updated, which does not require any significant overhead
- significant overheads occur only on chunk splits, which are rare
Deletions:
- no data is moved (just leave an empty hole)
- no need to update any hints
In general, the jump-pointer array requires little concurrency control.
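
A minimal sketch of a chunk insertion under these rules, reusing the illustrative chunk struct from the earlier sketch (split handling omitted):

    /* Insert a new leaf's address at position pos by shifting entries
     * only as far as the nearest hole to the right of pos. */
    int chunk_insert(struct chunk *c, int pos, struct leaf *l)
    {
        int hole = pos;
        while (hole < CHUNK_SLOTS && c->slot[hole] != NULL)
            hole++;                        /* nearest interleaved hole */
        if (hole == CHUNK_SLOTS)
            return -1;                     /* chunk full: split (rare) */
        for (int i = hole; i > pos; i--)   /* move only a few pointers */
            c->slot[i] = c->slot[i - 1];
        c->slot[pos] = l;                  /* then set l's hint = c */
        return 0;
    }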

B+-Trees Modeled and Their Notations
- B+-Trees: regular B+-Trees
- CSB+-Trees: cache-sensitive B+-Trees [Rao & Ross, SIGMOD 2000]
- pwB+-Trees: prefetching B+-Trees with node size = w cache lines and no jump-pointer arrays; we consider w = 2, 4, 8, and 16
- p8eB+-Trees: prefetching B+-Trees with node size = 8 cache lines and external jump-pointer arrays
- p8iB+-Trees: prefetching B+-Trees with node size = 8 cache lines and internal jump-pointer arrays
- p8CSB+-Trees: prefetching cache-sensitive B+-Trees with node size = 8 cache lines (and no jump-pointer arrays)
(Gory implementation details are in the paper.)

Searches with Varying Bulkload Factors
Trends with smaller bulkload factors are similar to the 100%-full case; the performance of pB+-Trees is somewhat less sensitive to the bulkload factor.
[Charts: time (M cycles) vs. percentage of entries used in leaf nodes, cold and warm caches, for B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p8CSB+.]

Range Scans with Varying Bulkload Factors
Prefetching B+-Trees offer:
- larger speedups with smaller bulkload factors (more nodes to fetch)
- less sensitivity of performance to the bulkload factor

Large Segmented Range Scans
1M keys, scanned in 1000-key segments: similar performance gains as for unsegmented scans.

Insertions with Cold Caches

Deletions with Cold Caches

Analysis of Node Splits upon Insertions
- Bulkload factor = 60-90%: far fewer node splits.
- Bulkload factor = 100%: fewer node splits, and fewer non-leaf node splits.
[Chart: insertions with node splits (no splits / one split / at least 2 splits) vs. percentage of entries used in leaf nodes, for B+tree, p8B+tree, p8eB+tree, and p8iB+tree.]

Mature Trees: Searches (Warm Caches)
[Chart: time (M cycles) vs. number of searches (×1000), for B+tree, p8B+tree, p8eB+tree, and p8iB+tree.]

Mature Trees: Insertions (Warm Caches)
pB+-Trees are significantly faster than the B+-Tree; under the same mature-tree experiments (on a different hardware configuration), a CSB+-Tree can be 25% worse than a B+-Tree.
[Chart: time (M cycles) vs. number of insertions (×1000), same tree variants.]

Mature Trees: Deletions (Warm Caches)
[Chart: time (M cycles) vs. number of deletions (×1000), same tree variants.]

Mature Trees: Searches (Cold Caches)
[Chart: time (M cycles) vs. number of searches (×1000), same tree variants.]

Mature Trees: Insertions (Cold Caches)
[Chart: time (M cycles) vs. number of insertions (×1000), same tree variants.]

Mature Trees: Deletions (Cold Caches)
[Chart: time (M cycles) vs. number of deletions (×1000), same tree variants.]

Mature Trees: Large Segmented Range Scans
[Chart: B+tree, p8B+, p8eB+, and p8iB+.]

Search with Varying Memory Bandwidth (Warm Caches)
Even with pessimistic bandwidth (B = 5), the p8B+-Tree still achieves a significant speedup: 1.2 for warm caches.
[Chart: normalized execution time vs. normalized bandwidth (B), for p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p19B+tree.]

Search with Varying Memory Bandwidth (Cold Caches)
Even when B = 5, there is a 1.3 speedup for cold caches. The optimal value of w increases as B gets larger.
[Chart: normalized execution time vs. normalized bandwidth (B), same tree variants.]

Scans with Varying Prefetch Distance (p8eB+-Tree)
Performance is not sensitive to moderate increases in the prefetch distance, though the overshooting cost shows up when the number of entries to scan is small.

Scans with Varying Chunk Size (p8eB+-Tree)
Performance is not sensitive to varying the chunk size.

Table 1: Terminology
- w: # of cache lines in an index node
- m: # of child pointers in a one-line-wide node
- N: # of (key, tupleID) pairs in the index
- d: # of child pointers in a non-leaf node (= w × m)
- T_1: full latency of a cache miss
- T_next: latency of an additional pipelined cache miss
- B: normalized memory bandwidth (B = T_1 / T_next)
- K: # of nodes to prefetch ahead
- C: # of cache lines in a jump-pointer array chunk
- pwB+-Tree: plain pB+-Tree with w-line-wide nodes
- pweB+-Tree: pwB+-Tree with external jump-pointer arrays
- pwiB+-Tree: pwB+-Tree with internal jump-pointer arrays

Search With and Without Jump-Pointer Arrays: Cold Caches
The steps in the curves correspond to different numbers of levels in the tree.
[Chart: time (M cycles) vs. entries in leaf nodes, for p8B+tree, p8eB+tree, and p8iB+tree.]

Can We Do Even Better on Searches?
Hiding latency across levels is difficult given:
- the data dependence through the child pointer
- the relatively large branching factor of tree nodes
- the equal likelihood of following any child (assuming uniformly distributed random search keys)
What if we prefetch a node's children in parallel with accessing it? There is a duality between this and creating wider nodes, BUT this approach has relative disadvantages:
- storage overhead for the child (or grandchild) pointers
- the node size can only grow by multiples of the branching factor