Slide 1: PUMA²: Bridging the CPU/Memory Gap through Prediction & Speculation
Babak Falsafi
Team members: Chi Chen, Chris Gniady, Jangwoo Kim, Tom Wenisch, Se-Hyun Yang
Alumni: Cem Fide, An-Chow Lai
Impetus Group, Computer Architecture Lab (CALCM), Carnegie Mellon University
http://www.ece.cmu.edu/~impetus
Slide 2: Our Group's Research Focus
Systems from handheld to server, connected by a network
Focus: memory design and processor design
Design issues: performance, power, reliability, programmability
Slide 3: Impetus Projects
Today's talk: PUMA²: Bridging the CPU/Memory Gap
Others:
1. PowerTap: power-aware computer systems
2. JITR: soft-error-tolerant microarchitecture
3. GigaTrans: beyond superscalar & ILP
Goals: impact products, research, and education in architecture
E.g., Reactive NUMA => Sun WildFire DSM
Slide 4: Outline
- Impetus Overview
- PUMA²:
  - Hitting the Memory Wall
  - Last-Touch Memory Access
  - Speculative Memory Ordering
Slide 5: Hitting the Memory Wall
Growing distance: processors keep getting faster, while memory only gets larger
Caching is less effective:
- simplistic (demand) fetch/replace policies
- deeper hierarchies mean higher worst-case access latencies
- multiple hierarchies in multiprocessors!
Only 50% processor utilization in servers [Ailamaki, VLDB'99] (commercial databases running on a Xeon server)
Slide 6: Conventional Data Demand Fetch
Fetch data upon CPU request:
- zero lookahead upon a miss
- crude guess for replacement
Works only when the working set fits in L1/L2 and changes infrequently
An out-of-order core can at best tolerate the L1-to-L2 latency
Hierarchy: CPU -> L1 (2 clk) -> L2 (10 clk) -> L3 (50 clk) -> Memory (500 clk)
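To make the gap concrete, here is a back-of-the-envelope average-memory-access-time (AMAT) calculation using the latencies on this slide; the per-level miss rates are illustrative assumptions, not numbers from the talk:

```c
#include <stdio.h>

int main(void) {
    /* Latencies from the slide (in clocks). */
    double l1 = 2, l2 = 10, l3 = 50, mem = 500;
    /* Local miss rates: illustrative assumptions only. */
    double m1 = 0.10, m2 = 0.50, m3 = 0.50;

    /* Under demand fetch, each level's latency is paid only
       when the level above it misses. */
    double amat = l1 + m1 * (l2 + m2 * (l3 + m3 * mem));
    printf("AMAT = %.1f clocks\n", amat);  /* 18.0 with these rates */
    return 0;
}
```

Even modest miss rates pull the average access far above the 2-clock L1 hit time, and every one of those miss cycles is exposed because demand fetch starts with zero lookahead.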
Slide 7: PUMA²
PUMA²: Proactively Uniform Memory Access Architecture
Goal: bridge the CPU/memory performance gap
How? Prediction and speculation:
- hide/tolerate memory latency
- hardware techniques transparent to software
Slide 8: This Talk
1. Last-touch memory access model
- predict the last processor reference to a block
- evict and fetch upon that last reference
+ significantly enhances fetch lookahead
2. Speculative memory ordering
- overlap accesses to tolerate latency
- but overlapping memory accesses affects memory order
- we show that hardware can both relax and enforce order
Slide 9: Outline
- Impetus Overview
- PUMA²:
  - Hitting the Memory Wall
  - Last-Touch Memory Access
  - Speculative Memory Ordering
Slide 10: Hiding the Memory Latency Using Prediction/Speculation
Mechanisms required:
1. Predict "what" memory address to fetch; goal: minimize traffic, avoid thrashing, etc.
2. Predict "when" to fetch; goal: maximize latency hiding
3. Decide "where" fetched data is placed; goal: avoid lookups in auxiliary structures
Slide 11: Current Proposals for Data Prefetching
Custom prefetchers: stride, stream, dependence-based, etc.
General-purpose prefetchers: precomputation/slipstream prefetchers, address-correlating prefetchers
Key shortcomings:
- insufficient lookahead (e.g., 10-100 cycles)
- low accuracy for general access patterns
- cannot place data directly in L1 (must use buffers)
Slide 12: Related Work: Address-Correlating Prefetchers
Markov prefetchers [Joseph & Grunwald, ISCA'97]:
- predict "what": correlate L1 miss addresses
- predict "when": consecutive L1 misses
High prediction coverage, but:
- clustered L1 misses => insufficient lookahead
- one-to-many predictions => low accuracy
- prefetch into a buffer => high prefetch hit time
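For reference, a minimal sketch of the Markov prefetcher idea, assuming a direct-mapped correlation table with one successor per miss address (the Joseph & Grunwald design keeps several prioritized successors per entry):

```c
#include <stdint.h>

#define TABLE_SIZE 4096

typedef struct {
    uint64_t miss_addr;  /* previously seen miss address (tag) */
    uint64_t successor;  /* miss address that followed it last time */
    int      valid;
} MarkovEntry;

static MarkovEntry table[TABLE_SIZE];
static uint64_t last_miss;  /* most recent L1 miss address */
static int have_last;

/* Called on every L1 miss; returns an address to prefetch, or 0. */
uint64_t markov_on_miss(uint64_t miss_addr) {
    uint64_t prefetch = 0;

    /* Train: record that last_miss was followed by miss_addr. */
    if (have_last) {
        MarkovEntry *t = &table[last_miss % TABLE_SIZE];
        t->miss_addr = last_miss;
        t->successor = miss_addr;
        t->valid = 1;
    }
    last_miss = miss_addr;
    have_last = 1;

    /* Predict: if this miss was seen before, prefetch its successor. */
    MarkovEntry *e = &table[miss_addr % TABLE_SIZE];
    if (e->valid && e->miss_addr == miss_addr)
        prefetch = e->successor;
    return prefetch;
}
```

Note that prediction is triggered only by a miss, which is exactly the lookahead limitation the next slide illustrates.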
Slide 13: Insufficient Lookahead in Correlating Prefetchers
Consecutive L1 misses are often clustered; this is exacerbated in out-of-order cores
Example access stream:
  load/store A1 (miss)
  load/store A1 (hit)
  load/store C3 (miss)
  ...
  load/store A3 (miss)
With fetch on miss, the prefetcher's lookahead is only the short gap between the C3 miss and the A3 miss
Slide 14: We Propose: Fetch on Last Touch
Predict and fetch on the last touch:
+ evicts the dead block
+ enhances fetch lookahead
+ fetches directly into L1
  load/store A1 (miss)
  load/store A1 (hit)     <- last touch to A1: fetch A3 here
  load/store C3 (miss)    <- fetch on miss would only start here
  ...
  load/store A3 (miss)
The lookahead now spans the whole interval from A1's last touch to the A3 miss, rather than the short miss-to-miss gap
Slide 15: Enhancing Fetch Lookahead
[Chart: cumulative distribution of processor cycles between the last touch and the next miss (our proposal) versus between two consecutive misses (Markov), on a log scale from 2 to >2048 cycles, with the L2 and memory latencies marked. The last-touch interval provides far more lookahead than the miss-to-miss interval.]
Slide 16: Dead-Block Prediction [ISCA'01]
Correlate the trace of memory accesses to a block; the trace uniquely identifies different dead times
Access stream to one block frame:
  PC0: load/store A0 (hit)
  PC1: load/store A1 (miss)   <- first touch
  PC3: load/store A1 (hit)
  PC3: load/store A1 (hit)    <- last touch
  PC5: load/store A3 (miss)
Trace = (PC1, PC3, PC3)
Slide 17: Miss-Address Prediction
Correlate the last 2 misses within a cache block frame: (A0, A1) -> (A3)
Combined with the PC trace: (A0, A1, PC1, PC3, PC3) -> (A3)
Slide 18: Dead-Block Correlating Prefetcher (DBCP)
Two-level prediction table:
- History table: tracks each frame's current history (e.g., A0, PC1, PC3) plus the latest access (PC3 to A1)
- Correlating prediction table: maps the encoded history (A0, A1, PC1, PC3, PC3) to the predicted next miss address, A3 -> prefetch A3
Encoding: truncated addition of the history
Confidence: a two-bit saturating counter per entry
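The sketch below illustrates the two-level structure described above. Table sizes, the hash, and the helper names are illustrative assumptions; only the overall flow (per-frame PC traces, truncated-addition encoding, a correlation table with two-bit counters) follows the slide:

```c
#include <stdint.h>

#define HIST_ENTRIES 4096
#define CPT_ENTRIES  65536

typedef struct {            /* history table: one entry per block frame */
    uint64_t prev_miss;     /* previous miss address (A0) */
    uint64_t cur_miss;      /* current block's miss address (A1) */
    uint32_t trace;         /* truncated sum of access PCs so far */
} HistEntry;

typedef struct {            /* correlating prediction table entry */
    uint32_t tag;           /* encoded history signature */
    uint64_t next_miss;     /* predicted next miss address (A3) */
    uint8_t  conf;          /* two-bit saturating counter */
} CptEntry;

static HistEntry hist[HIST_ENTRIES];
static CptEntry  cpt[CPT_ENTRIES];

/* Truncated addition: fold a value into the running signature. */
static uint32_t fold(uint32_t sig, uint64_t v) {
    return (uint32_t)(sig + v);
}

/* History = last two miss addresses + the PC trace. */
static uint32_t signature(const HistEntry *h) {
    return fold(fold((uint32_t)h->prev_miss, h->cur_miss), h->trace);
}

/* Every access to a block frame extends that frame's PC trace. */
void dbcp_access(uint64_t frame, uint64_t pc) {
    HistEntry *h = &hist[frame % HIST_ENTRIES];
    h->trace = fold(h->trace, pc);

    /* If the trace now matches a known last-touch signature with high
       confidence, the block is predicted dead: it can be evicted and
       the predicted next miss fetched directly into L1. */
    CptEntry *c = &cpt[signature(h) % CPT_ENTRIES];
    if (c->tag == signature(h) && c->conf >= 2) {
        /* prefetch_into_l1(c->next_miss); evict(frame); */
    }
}

/* On a miss, train: the history that ended here predicts miss_addr. */
void dbcp_miss(uint64_t frame, uint64_t miss_addr) {
    HistEntry *h = &hist[frame % HIST_ENTRIES];
    CptEntry *c = &cpt[signature(h) % CPT_ENTRIES];
    if (c->tag == signature(h)) {
        if (c->next_miss == miss_addr) { if (c->conf < 3) c->conf++; }
        else if (c->conf > 0)          { c->conf--; }
        else                           { c->next_miss = miss_addr; }
    } else {
        c->tag = signature(h);
        c->next_miss = miss_addr;
        c->conf = 1;
    }
    /* Start a fresh trace for the incoming block. */
    h->prev_miss = h->cur_miss;
    h->cur_miss  = miss_addr;
    h->trace     = 0;
}
```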
Slide 19: Methodology
Simulated using SimpleScalar 3.0:
- 2 GHz, 8-issue, 128-entry window
- 32 KB, direct-mapped, 1-cycle L1D
- 1 MB, 4-way, 12-cycle L2
- 70-cycle memory
- 2 MB, 8-way, 24-cycle prediction table
- 128-entry prefetch buffer (for Markov only)
14 memory-intensive integer, floating-point, and linked-data benchmarks: 5 Olden, 4 SPECint, 5 SPECfp
Slide 20: Dead-Block Coverage and Accuracy
[Bar chart: fraction of all misses that are predicted, still training, or mispredicted, for bh, em3d, health, mst, treeadd, compress, perl, gcc, mcf, ammp, art, equake, mgrid, swim]
DBP predicts 90% of last touches and mispredicts only 4%
Slide 21: Miss-Address Prediction
[Bar chart: fraction of all misses predicted/training/mispredicted for Markov (M) and DBCP (D) on the same 14 benchmarks; several Markov bars exceed 190%]
DBCP predicts 82% of misses, mispredicting only 3%; Markov (Joseph & Grunwald) predicts 81% but mispredicts 229%
Slide 22: Memory Stall Time Reduction
[Bar chart: fraction of memory stall time reduced by Markov vs. DBCP on the 14 benchmarks]
DBCP reduces memory stall time by 62% on average; Markov by only 30%
Slide 23: DBCP vs. Larger On-Chip L2
[Bar chart: fraction of memory stall time reduced on the 14 benchmarks for three configurations: an 18-cycle 3 MB L2, the 24-cycle 2 MB DBCP prediction table, and a 12-cycle 3 MB L2]
Slide 24: Conclusions
Dead-block predictors (DBP):
- predict when to evict a block
- enable timely prefetching, even directly into the L1 cache
- high coverage of 90%, mispredicting only 4%
Dead-block correlating prefetchers (DBCP):
- accurate and timely prefetches
- reduce memory stall time by 62%
Slide 25: Other Mechanisms in PUMA²
Self-invalidation predictors [ISCA'00]:
- predict when to self-invalidate in multiprocessors
- convert 3-hop latencies to 2-hop
Memory sharing predictors [ISCA'99]:
- predict subsequent sharers of a block
- a powerful mechanism to move data
Both exhibit high coverage and accuracy
Slide 26: Outline
- Impetus Overview
- PUMA²:
  - Hitting the Memory Wall
  - Last-Touch Memory Access
  - Speculative Memory Ordering
Slide 27: What Programmers Want: Sequential Consistency (SC) [Lamport]
Memory accesses should appear in program order and atomic
e.g., critical section: lock; modify data; unlock
+ intuitive programming
- extremely slow!
[Diagram: processors P, P, P, ... connected to a single shared memory]
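As a concrete illustration (not from the slides), the classic publish/consume idiom shows why programmers like SC. Plain variables are used for brevity; real code would need atomics or volatile:

```c
/* Producer on one processor, consumer on another. Under SC this
   works as written: the two stores appear in program order. */
int data = 0;
int flag = 0;

void producer(void) {
    data = 42;   /* modify data */
    flag = 1;    /* publish: must appear AFTER the data store */
}

void consumer(void) {
    while (flag == 0)
        ;        /* spin until published */
    /* Under SC, data is guaranteed to be 42 here. */
}
```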
Slide 28: What Machines Provide: Release Consistency (RC)
Overlap remote accesses; software enforces order when needed
- e.g., first the lock, then the data
- special "ordering" instructions
Release Consistency (RC) [Gharachorloo et al.] allows any (re-)ordering, e.g., in IA-64 and SPARC
+ high performance
- complicates programming
[Diagram: processors P, P, P, ... with overlapped accesses to shared memory]
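Under RC the same idiom needs explicit ordering instructions. The sketch below uses C11 acquire/release atomics, an anachronistic notation for a 2002 talk, but it expresses the kind of "ordering" instructions the slide refers to:

```c
#include <stdatomic.h>

/* Under RC the hardware may reorder the two stores, so software
   must insert ordering instructions around the publication. */
int data = 0;
atomic_int flag = 0;

void producer(void) {
    data = 42;
    /* Release: all prior accesses complete before flag becomes visible. */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

void consumer(void) {
    /* Acquire: no later access may be hoisted above the flag read. */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;
    /* data is guaranteed to be 42 here. */
}
```

Getting these annotations right on every racy access is exactly the programming burden the slide flags.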
Slide 29: Can We Have SC Programming with RC Performance?
Observation: SC must only *appear* to execute in program order; order is needed only when other processors race to access the same data
SC hardware can emulate RC if it:
- overlaps accesses speculatively
- keeps a log of the computation in program order
- rolls back in case of a race
+ no help from software: SC programming
+ with infrequent rollbacks, better than RC performance
Slide 30: Related Work: Hiding the Store Latency
A number of SC optimizations:
1. Multiple pending prefetches that commit to L1 in order [Gharachorloo et al.], e.g., the MIPS R10000's pending misses
2. Relaxing order within the ROB: speculative loads [Gharachorloo et al.], e.g., the MIPS R10000's speculative loads
3. Extensions to the ROB: speculative retirement [Ranganathan et al.]
All offer only limited speculation in small associative buffers!
Slide 31: Execution in an SC Memory System
WR X, RD Y, and RD Z access remote memory; X, Y, Z, A are unrelated and need not be ordered
[Diagram: the reorder buffer holds WR X, RD Y, RD Z, WR A, RD A in program order; only WR X (miss) occupies the memory queue while RD Y, RD A, WR A, and the ALU sit idle]
WR X blocks the pipeline for hundreds of cycles; RD Y and RD Z cannot be overlapped with WR X
Slide 32: Out-of-Order Execution in an RC Memory System
[Diagram: WR X (miss), RD Y (miss), and RD Z (miss) are all pending in the memory queue at once; WR A, RD A, and the ALU op complete and retire from the reorder buffer]
+ accesses to A complete while WR X is pending
+ the remote accesses to X, Y, and Z overlap
- software must guarantee that X, Y, Z, A are unrelated
Slide 33: Speculatively & Fully Relaxing Order
With Vijaykumar [ISCA'99]: hardware support for relaxing all order
- storage to tolerate long latencies: old processor state and old memory state
- fast lookup to detect possible order violations upon cache invalidations and replacements
- infrequent rollbacks, typical of well-behaved applications; rollbacks are due to false sharing or data races
Slide 34: SC++: A Design for Speculative SC
SHiQ (Speculative History Queue): backs up the computation in program order
BLT (Block Lookup Table): quick lookup to detect races from directory accesses
[Diagram: WR X, RD Y, RD Z are pending misses in the memory queue; the pipeline retires WR A, RD A, and the ALU op into the SHiQ, while the BLT tracks the block containing A and the block containing Y & Z]
Slide 35: Applications Beyond Memory Order
SC++ can serve as a generic speculation mechanism when:
- rollbacks are rare
- verifying the speculation takes far longer than the ROB can sustain
Examples: value speculation [Sorin et al.]; speculating beyond locks [Rajwar et al.]
Slide 36: Performance of SC, RC, and SC++
Data from the RSIM DSM simulator: 16 MIPS R10000 processors at 1 GHz
- up to a 70% gap between SC and RC
- SC++ fully matches RC's performance
Slide 37: Sensitivity to Queue Size
- the required queue size varies across applications (and systems)
- history is highly bursty
- history can spill to the L2
Slide 38: Conclusions
First to show SC + speculation = RC!
- identified the design requirements
- showed that current systems do not satisfy them
- proposed a design, SC++
Hardware can provide simple programming with high performance!
Slide 39: Other Ongoing Projects
Ultra-deep-submicron designs:
1. Power management [MICRO'01, HPCA'02, HPCA'01, ISLPED'00]:
- first architectural proposal to reduce leakage
- resizable caches
- way/bank-predicting caches
- power-aware snoopy coherence
2. Transient-error-tolerant superscalar [MICRO'01]: error-tolerant instruction scheduling
Beyond ILP & superscalar [ICS'01, PPoPP'01]:
- low-overhead mechanisms for thread-level speculation
- selective dependence tracking
Slide 40: For More Information
Please visit our web site:
Impetus Group, Computer Architecture Lab, Carnegie Mellon University
http://www.ece.cmu.edu/~impetus