
1 Improving Index Performance through Prefetching
Shimin Chen, Phillip B. Gibbons†, and Todd C. Mowry
School of Computer Science, Carnegie Mellon University
†Information Sciences Research Center, Bell Laboratories

2 Chen, Gibbons & Mowry, Carnegie Mellon: Improving Index Performance through Prefetching. Databases and the Memory Hierarchy
- Traditional focus: buffer pool management (DRAM as a cache for disk)
- Important focus today: processor cache performance (SRAM as a cache for DRAM), e.g., [Ailamaki et al., VLDB '99]
(Figure: the memory hierarchy: CPU, L1 cache, L2/L3 cache, main memory, disk; larger, slower, and cheaper toward the bottom.)

3 Index Structures
- Used extensively in databases to accelerate performance: selections, joins, etc.
- Common implementation: B+-Trees
(Figure: a B+-Tree with non-leaf nodes and leaf nodes.)

4 B+-Tree Indices: Common Access Patterns
- Search: locate a single tuple
- Range scan: locate a collection of tuples within a range

5 Cache Performance of B+-Tree Indices
- A main-memory B+-Tree containing 10M keys:
  - Search: 100K random searches
  - Scan: 100 range scans of 1M keys, starting at random keys
- Detailed simulations based on a Compaq ES40 system
- Most of the execution time is wasted on data cache misses: 65% for searches, 84% for range scans
(Bar chart: execution time split into data cache stalls, other stalls, and busy time.)

6 B+-Trees: Optimizing Search for Cache vs. Disk
- To minimize the number of data transfers (I/O or cache misses): optimal node width = natural data transfer size
  - for disk: the disk page size (~8 KB)
  - for cache: the cache line size (~64 bytes)
- Cache-optimized trees therefore have much narrower nodes and more levels
- Search performance becomes more sensitive to changes in branching factor
(Figure: a tree optimized for disk vs. one optimized for cache.)

7 Previous Work: "Cache-Sensitive B+-Trees" (CSB+-Trees), Rao and Ross [SIGMOD 2000]
- Key insight: nearly all child pointers can be eliminated by restricting the data layout
- This doubles the branching factor of cache-line-sized non-leaf nodes
(Figure: node layouts of B+-Trees vs. CSB+-Trees.)

8 Impact of CSB+-Trees on Search Performance
- Search is 15% faster due to the reduction in tree height
- However: update performance is worse [Rao & Ross, SIGMOD '00], and range scan performance does not improve
- There is still significant room for improvement
(Bar chart: data cache stalls, other stalls, and busy time for B+-Tree vs. CSB+-Tree.)

9 Latency Tolerance in Modern Memory Hierarchies
- Modern processors overlap multiple simultaneous cache misses; e.g., the Compaq ES40 supports 8 off-chip misses per processor
- Prefetch instructions allow software to fully exploit this parallelism
- What dictates performance is NOT simply the number of cache misses, but rather the amount of exposed miss latency
(Figure: CPU issuing prefetch instructions, e.g., pref 0(r2), past the L1 and L2/L3 caches to main memory.)

10 Our Approach
New proposal: "Prefetching B+-Trees" (pB+-Trees)
- use prefetching to reduce the amount of exposed miss latency
Key challenge:
- data dependences caused by chasing pointers
Benefits:
- significant performance gains for searches, range scans, and updates (!)
- complementary to CSB+-Trees

11 Overview
- Prefetching Searches
- Prefetching Range Scans
- Experimental Results
- Conclusions

12 Example: Search where Node Width = 1 Line
- 1000 keys; 64-byte lines; 4-byte keys, pointers, and tupleIDs; 4 levels in the B+-Tree (cold cache)
- We suffer one full cache miss at each level of the tree
(Timeline: one cache miss at 0, 150, 300, and 450 cycles.)

13 Same Example where Node Width = 2 Lines
- Only 3 levels in the tree, but two misses per node
- The additional misses per node dominate the reduction in the number of levels
(Timeline: six cache misses, 150 cycles apart, finishing later than the width-1 case at 900 cycles.)

14 How Things Change with Prefetching
- Fetch all lines within a node in parallel
- Wider nodes incur more misses, but prefetching overlaps them, so the exposed miss latency drops
(Timelines: width-1 nodes finish at 600 cycles; width-2 nodes without prefetching at 900; with prefetching, the overlapped misses finish at 480.)

15 pB+-Trees: Using Prefetching to Improve Search
Basic idea:
- make nodes wider than the natural data transfer size, e.g., 8 cache lines wide
- prefetch all lines of a node before searching in the node
Improved search performance:
- larger branching factors, shallower trees
- the cost to access each node increases only slightly
Reduced space overhead:
- primarily due to fewer non-leaf nodes
Update performance: ???

16 Overview
- Prefetching Searches
- Prefetching Range Scans
- Experimental Results
- Conclusions

17 Range Scan Cache Behavior: Normal B+-Trees
Steps in a range scan:
1. search for the starting leaf node
2. traverse the leaves until the end is found
- We suffer a full cache miss for each leaf node!
(Timeline: one miss per leaf, 150 cycles apart, from 0 to 900 cycles.)

18 If Prefetching Wider Nodes
- e.g., node width = 2 lines
- Exposed miss latency is reduced by up to a factor of the node width
- A definite improvement, but can we still do better?
(Timelines: misses every 150 cycles through 900, vs. overlapped pairs of misses at 0, 160, 320, and 480 cycles.)

19 The Ideal Case
- Overlap misses until all latency is hidden, or until we run out of bandwidth
- How can we achieve this?
(Timelines: the previous two cases vs. all misses overlapped, finishing by about 200 cycles.)

20 The Pointer-Chasing Problem
- If we prefetch by chasing pointers, we still experience the full latency at each node
- Ideally, we would prefetch each node directly
(Figure: currently visiting vs. want to prefetch; direct prefetch vs. the ideal case.)

21 Our Solution: Jump-Pointer Arrays
- Put the leaf addresses in an array
- Prefetch directly using the jump pointers
- Back-pointers are needed to initialize prefetching

22 Our Solution: Jump-Pointer Arrays
(Timeline: with jump pointers, the leaf misses overlap almost completely.)

23 External Jump-Pointer Arrays: Efficient Updates
- The impact of an insertion is limited to its chunk
- Deletions leave empty slots
- Actively interleave empty slots during bulkload and chunk splits
- The back-pointer to a position in the jump-pointer array is now a hint:
  - it points to the correct chunk
  - but may require a local search within the chunk to initialize prefetching
(Figure: leaves with hint back-pointers into a chunked linked list.)

24 Alternative Design: Internal Jump-Pointer Arrays
- B+-Trees already contain structures that point to the leaf nodes: the parents of the leaves ("bottom non-leaf nodes")
- By linking them together, we can use them as a jump-pointer array
Tradeoff:
- no need for back-pointers, and simpler to maintain
- consumes less space, though the external array's overhead is <1%
- but less flexible: the chunk size is fixed by the B+-Tree structure

25 Overview
- Prefetching Searches
- Prefetching Range Scans
- Experimental Results: search performance, range scan performance, update performance
- Conclusions

26 Experimental Framework
- Results are for a main-memory database environment (we are extending this work to disk-based environments)
Executables:
- we added prefetch instructions to C source code by hand
- used gcc to generate optimized MIPS executables with prefetch instructions
Performance measurement:
- detailed, cycle-by-cycle simulations
Machine model:
- based on a Compaq ES40 system, with slightly updated parameters

27 Simulation Parameters
Pipeline parameters:
- Clock rate: 1 GHz
- Issue width: 4 insts/cycle
- Functional units: 2 Int, 2 FP, 2 Mem, 1 Branch
- Reorder buffer size: 64 insts
- Integer multiply/divide: 12/76 cycles; all other integer: 1 cycle
- FP divide/square root: 15/20 cycles; all other FP: 2 cycles
- Branch prediction scheme: gshare
Memory parameters:
- Line size: 64 bytes
- Primary data cache: 64 KB, 2-way set-assoc.
- Primary instruction cache: 64 KB, 2-way set-assoc.
- Miss handlers: 32 for data, 2 for instructions
- Unified secondary cache: 2 MB, direct-mapped
- Primary-to-secondary miss latency: 15 cycles (plus contention)
- Primary-to-memory miss latency: 150 cycles (plus contention)
- Main memory bandwidth: 1 access per 10 cycles
The simulator models all the gory details, including memory system contention.

28 Index Search Performance
- 100K random searches after bulkload; nodes 100% full (except the root); warm caches
- pB+-Trees achieve a 27-47% speedup vs. B+-Trees and 14-34% vs. CSB+-Trees
- The optimal node width is 8 cache lines
- pB+-Trees and CSB+-Trees are complementary: p8CSB+-Trees are best
(Graph: time in M cycles vs. # of tupleIDs (10^4 to 10^7) for B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p8CSB+.)

29 Same Search Experiments with Cold Caches
- 100K random searches after bulkload; 100% full (except the root); cold caches (i.e., caches cleared after each search)
- Large discrete steps within each curve. What is happening here?
(Graph: same curves as the warm-cache experiment, now spanning 60-180 M cycles.)

30 Analysis of Cold Cache Search Behavior
- The height of the tree dominates performance (the effect is blurred in the warm-cache case)
- At the same height, the smaller the node size, the better

Number of levels in the trees, by number of keys:
Tree type   10K  30K  100K  300K  1M  3M  10M
B+           5    6    6     7    7   8    8
CSB+         4    5    5     5    6   6    7
p2B+         4    4    5     5    6   6    6
p4B+         3    3    4     4    4   5    5
p8B+         3    3    3     4    4   4    4
p16B+        2    3    3     3    3   4    4
p8CSB+       3    3    3     3    3   4    4

31 Overview
- Prefetching Searches
- Prefetching Range Scans
- Experimental Results: search performance, range scan performance, update performance
- Conclusions

32 Index Range Scan Performance
- 100 scans starting at random locations on an index bulkloaded with 3M keys (100% full)
- Scans of 1K-1M keys: 6.5-8.7x speedup over B+-Trees
  - a factor of 3.5-3.7 from prefetching wider nodes
  - an additional factor of ~2 from jump-pointer arrays
(Log-scale graph: time vs. # of tupleIDs scanned in a single call, for B+tree, p8B+tree, p8iB+tree, and p8eB+tree.)

33 Index Range Scan Performance
- Small scans (<1K keys): the overshooting cost is noticeable
- Exploit jump pointers only if the scan is expected to be large (e.g., search for the end first)
(Same log-scale graph as the previous slide.)

34 Overview
- Prefetching Searches
- Prefetching Range Scans
- Experimental Results: search performance, range scan performance, update performance
- Conclusions

35 Update Performance
- 100K random insertions/deletions on a 3M-key bulkloaded index; warm caches
- pB+-Trees achieve at least a 1.24x speedup in all cases. Why?
(Graphs: insertion and deletion times (M cycles) vs. percentage of entries used in leaf nodes, for B+tree, p8B+tree, p8eB+tree, and p8iB+tree.)

36 Update Performance
- Reason #1: faster search times
- Reason #2: less frequent node splits with wider nodes
(Same graphs as the previous slide.)

37 pB+-Trees: Other Results
Similar results for:
- varying bulkload factors of trees
- large segmented range scans
- mature trees
- varying jump-pointer array parameters: prefetch distance, chunk size
Optimal node width:
- increases as memory bandwidth increases
- (matches the width predicted by our model in the paper)

38 Cache Performance Revisited
- Search: eliminated 45% of the original data cache stalls, for a 1.47x speedup
- Scan: eliminated 97% of the original data cache stalls, for an 8-fold speedup
(Bar chart: data cache stalls, other stalls, and busy time.)

39 Conclusions
Impact of Prefetching B+-Trees on performance:
- Search: 1.27-1.55x speedup over B+-Trees
  - wider nodes reduce the height of the tree and the number of expensive misses
  - they outperform and are complementary to CSB+-Trees
- Updates: 1.24-1.52x speedup over B+-Trees
  - faster search and less frequent node splits
  - in contrast with significant slowdowns for CSB+-Trees
- Range scan: 6.5-8.7x speedup over B+-Trees
  - wider nodes: a factor of ~3.5 speedup
  - jump-pointer arrays: an additional factor of ~2 speedup
Prefetching B+-Trees also reduce space overhead.
These benefits are likely to increase with future memory systems.
The approach is applicable to other levels of the memory hierarchy (e.g., disks).

40 Backup Slides

41 Revisiting the Optimal Node Width for Searches
Let w = # of cache lines per node, m = # of child pointers per one-cache-line-wide node, and N = # of tupleIDs in the index. The total number of cache misses for a search is (misses per level) x (# of levels in the tree) = w x log_{wm}(N), which is minimized when w = 1.
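One way to verify the claim (our own derivation sketch, using the slide's notation):

```latex
\text{misses}(w) \;=\; w \cdot \log_{wm} N \;=\; \frac{w}{\ln(wm)}\,\ln N,
\qquad
\frac{d}{dw}\left[\frac{w}{\ln(wm)}\right]
  = \frac{\ln(wm) - 1}{\ln^2(wm)} \;>\; 0 \quad\text{whenever } wm > e.
```

So without prefetching, the miss count grows monotonically in w, and the minimum over integers w >= 1 is at w = 1: without latency overlap, one-line nodes really are optimal, which is exactly the assumption prefetching overturns.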

42 Scheduling Prefetches Early Enough
- The loop being prefetched: p = &n0; while (p) { work(p->data); p = p->next; }
- Let L = the time to load a node and W = the time for work()
- Our goal: fully hide the latency, achieving the fastest possible computation rate of 1/W
- e.g., if L = 3W, we must prefetch 3 nodes ahead to achieve this
(Figure: currently visiting n_i, want to prefetch n_{i+3}.)

43 Performance without Prefetching
- while (p) { work(p->data); p = p->next; }
- Each iteration loads n_k (L cycles) and then runs work(n_k) (W cycles), with no overlap
- Computation rate = 1/(L+W)
(Timeline: alternating load and work intervals.)

44 Prefetching One Node Ahead
- while (p) { pf(p->next); work(p->data); p = p->next; }
- While visiting n_i, prefetch n_{i+1}: computation is overlapped with memory accesses
- But the data dependence through p->next limits the computation rate to 1/L
(Timeline: work overlapped with loads, but each load can start only after the previous node arrives.)

45 Prefetching Three Nodes Ahead
- while (p) { pf(p->next->next->next); work(p->data); p = p->next; }
- The computation rate does not improve (still 1/L)!
- The pointer-chasing problem [Luk & Mowry, ASPLOS '96]: any scheme that follows the pointer chain is limited to a rate of 1/L
(Timeline: the prefetch for n_{i+3} still waits on the chain of loads.)

46 Our Goal: Fully Hide Latency
- while (p) { pf(&n_{i+3}); work(p->data); p = p->next; }
- While visiting n_i, prefetch n_{i+3} directly (via a jump pointer, not by chasing the chain)
- This achieves the fastest possible computation rate of 1/W
(Timeline: all loads overlapped; only work remains on the critical path.)

47 Challenges in Supporting Efficient Updates
- Conceptual view: a flat jump-pointer array with back-pointers from the leaves
- What if we really implemented it this way?
- Insertion could incur significant overheads: copying data within the array to create a new hole, and updating back-pointers
- Deletion is fine: just leave a hole

48 Summary: Why We Expect Updates to Perform Well
Insertions:
- only a small number of jump pointers move (between the insertion point and the nearest hole in the chunk)
- normally only the hint pointer for the inserted node is updated, which requires no significant overhead
- significant overheads occur only on chunk splits, which are rare
Deletions:
- no data is moved (just leave an empty hole)
- no need to update any hints
In general, the jump-pointer array requires little concurrency control.

49 B+-Trees Modeled and their Notations
- B+-Trees: regular B+-Trees
- CSB+-Trees: cache-sensitive B+-Trees [Rao & Ross, SIGMOD 2000]
- pwB+-Trees: prefetching B+-Trees with node size = w cache lines and no jump-pointer arrays; we consider w = 2, 4, 8, and 16
- p8eB+-Trees: prefetching B+-Trees with node size = 8 cache lines and external jump-pointer arrays
- p8iB+-Trees: prefetching B+-Trees with node size = 8 cache lines and internal jump-pointer arrays
- p8CSB+-Trees: prefetching cache-sensitive B+-Trees with node size = 8 cache lines (and no jump-pointer arrays)
(Gory implementation details are in the paper.)

50 Searches with Varying Bulkload Factors
- Similar trends with smaller bulkload factors as when 100% full
- The performance of pB+-Trees is somewhat less sensitive to the bulkload factor
(Graphs: warm-cache and cold-cache search time vs. percentage of entries used in leaf nodes.)

51 Range Scans with Varying Bulkload Factors
Prefetching B+-Trees offer:
- larger speedups with smaller bulkload factors (more nodes to fetch)
- less sensitivity of performance to the bulkload factor

52 Large Segmented Range Scans
- 1M keys, scanned in 1000-key segments
- Similar performance gains as unsegmented scans

53 Insertions with Cold Caches
(Graph.)

54 Deletions with Cold Caches
(Graph.)

55 Analysis of Node Splits upon Insertions
- Bulkload factor 60-90%: far fewer node splits
- Bulkload factor 100%: fewer node splits, and fewer non-leaf node splits
(Graph: insertions with node splits vs. percentage of entries used in leaf nodes, broken into "at least 2 splits", "one split", and "no splits", for B+tree, p8B+tree, p8eB+tree, and p8iB+tree.)

56 Mature Trees: Searches (Warm Caches)
(Graph: time (M cycles) vs. number of searches (x1000) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree.)

57 Mature Trees: Insertions (Warm Caches)
- CSB+-Trees could be 25% worse than B+-Trees under the same mature-tree experiments (on a different h/w configuration)
- pB+-Trees are significantly faster than B+-Trees
(Graph: time vs. number of insertions (x1000).)

58 Mature Trees: Deletions (Warm Caches)
(Graph: time vs. number of deletions (x1000).)

59 Mature Trees: Searches (Cold Caches)
(Graph.)

60 Mature Trees: Insertions (Cold Caches)
(Graph.)

61 Mature Trees: Deletions (Cold Caches)
(Graph.)

62 Mature Trees: Large Segmented Range Scans
(Bar chart: B+tree 3537, p8B+ 825, p8eB+ 479, p8iB+ 452.)

63 Search Varying Memory Bandwidth (Warm Cache)
- Even with a pessimistic bandwidth (B = 5), p8B+-Trees still achieve significant speedups: 1.2x for warm caches
(Graph: normalized execution time vs. normalized bandwidth B for p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p19B+tree.)

64 Search Varying Memory Bandwidth (Cold Cache)
- Even when B = 5, a 1.3x speedup for cold caches
- The optimal value of w increases as B gets larger
(Graph: normalized execution time vs. normalized bandwidth B.)

65 Scan Varying Prefetching Distance (p8eB+-Tree)
- Not sensitive to moderate increases in the prefetching distance
- Though the overshooting cost shows up when the number of entries to scan is small

66 Scan Varying Chunk Size (p8eB+-Tree)
- Not sensitive to varying chunk size

67 Table 1: Terminology
- w: # of cache lines in an index node
- m: # of child pointers in a one-line-wide node
- N: # of (key, tupleID) pairs in an index
- d: # of child pointers in a non-leaf node (= w x m)
- T_1: full latency of a cache miss
- T_next: latency of an additional pipelined cache miss
- B: normalized memory bandwidth (B = T_1 / T_next)
- K: # of nodes to prefetch ahead
- C: # of cache lines in a jump-pointer array chunk
- pwB+-Tree: plain pB+-Tree with w-line-wide nodes
- pweB+-Tree: pwB+-Tree with external jump-pointer arrays
- pwiB+-Tree: pwB+-Tree with internal jump-pointer arrays

68 Search with and without Jump-Pointer Arrays: Cold Cache
(Graph: time (M cycles) vs. # of entries in leaf nodes (10^4 to 10^7) for p8B+tree, p8eB+tree, and p8iB+tree; the steps correspond to different numbers of levels in the tree.)

69 Cache Performance Revisited
- Search: eliminated 45% of the original data cache stalls, for a 1.47x speedup
- Scan: eliminated 97% of the original data cache stalls, for an 8-fold speedup
(Bar chart: data cache stalls, other stalls, and busy time.)

70 Can We Do Even Better on Searches?
- Hiding latency across levels is difficult given:
  - the data dependence through the child pointer
  - the relatively large branching factor of tree nodes
  - the equal likelihood of following any child (assuming uniformly distributed random search keys)
- What if we prefetch a node's children in parallel with accessing it?
  - there is a duality between this and creating wider nodes
  - BUT this approach has relative disadvantages: storage overhead for the child (or grandchild) pointers, and the node size can only grow by multiples of the branching factor

