PUMA 2: Bridging the CPU/Memory Gap through Prediction & Speculation. Babak Falsafi. Team Members: Chi Chen, Chris Gniady, Jangwoo Kim, Tom Wenisch, Se-Hyun Yang.


1 PUMA 2: Bridging the CPU/Memory Gap through Prediction & Speculation. Babak Falsafi. Team Members: Chi Chen, Chris Gniady, Jangwoo Kim, Tom Wenisch, Se-Hyun Yang. Alumni: Cem Fide, An-Chow Lai. Impetus Group, Computer Architecture Lab (CALCM), Carnegie Mellon University. http://www.ece.cmu.edu/~impetus

2 Copyright © 2002 by Babak Falsafi. Our Group’s Research Focus: Handheld to Server. Memory design; processor design. Design issues: performance, power, reliability, programmability. [Figure label: Network.]

3 Impetus Projects. Today’s talk: PUMA 2: Bridging the CPU/Memory Gap. Others: 1. PowerTap: Power-Aware Computer Systems; 2. JITR: Soft-Error Tolerant Microarchitecture; 3. GigaTrans: Beyond Superscalar & ILP. Goals: impact products, research, and education in architecture, e.g., Reactive NUMA => Sun WildFire DSM.

4 Outline. Impetus Overview. PUMA 2: Hitting the Memory Wall; Last-Touch Memory Access; Speculative Memory Ordering.

5 Hitting the Memory Wall. Growing distance: processors are getting faster; memory is only getting larger. Caching is less effective: simplistic (demand) fetch/replace; deeper hierarchies mean higher worst-case access latencies; multiple hierarchies in multiprocessors! 50% processor utilization in servers [Ailamaki, VLDB’99], measured for commercial databases running on a Xeon server.

6 Conventional Data Demand Fetch. Fetch data upon CPU request: zero lookahead upon a miss; crude guess for replacement. Works only when the working set fits in L1/L2 and changes infrequently. An out-of-order core can at best tolerate the L1-to-L2 latency. [Figure: hierarchy of CPU, L1, L2, L3, Memory, with latency labels 10 clk, 2 clk, 50 clk, 500 clk.]
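To make the cost of demand fetch concrete, the latencies in the figure can be folded into a rough average access time. This is only a sketch: the mapping of labels to levels (L1 = 2 clk, L2 = 10 clk, L3 = 50 clk, memory = 500 clk) is an assumption, since the transcript flattened the figure, and the hit rates in the usage note are hypothetical rather than taken from the talk.

```python
# Latencies in clock cycles. The level-to-latency mapping is an assumption
# recovered from the flattened figure, not something the slide states outright.
LATENCY = {"L1": 2, "L2": 10, "L3": 50, "MEM": 500}


def average_access_time(hit_rates):
    """Average latency when each access is served by the first level that hits.

    hit_rates maps "L1", "L2", "L3" to local hit rates (hits divided by the
    accesses that reach that level). Memory is assumed to always hit.
    """
    reach = 1.0   # fraction of accesses that get as far as the current level
    total = 0.0
    for level in ("L1", "L2", "L3"):
        total += reach * hit_rates[level] * LATENCY[level]
        reach *= 1.0 - hit_rates[level]
    total += reach * LATENCY["MEM"]
    return total
```

With hypothetical local hit rates of 0.9, 0.5, and 0.5, only 2.5% of accesses reach memory, yet they contribute 12.5 of the roughly 16 cycles of average latency, which is the kind of worst-case cost the slide is pointing at.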

7 PUMA 2: Proactively Uniform Memory Access Architecture. Goal: bridge the CPU/memory performance gap. How? Prediction and speculation; hide/tolerate memory latency; hardware techniques transparent to software.

8 This Talk. 1. Last-touch memory access model: predict the last processor reference to a block; evict and fetch upon the last reference; + significantly enhances fetch lookahead. 2. Speculative memory ordering: overlapping accesses tolerates latency, but overlapping memory accesses affects memory order; we show that hardware can both relax and enforce order.

9 Outline. Impetus Overview. PUMA 2: Hitting the Memory Wall; Last-Touch Memory Access; Speculative Memory Ordering.

10 Hiding the Memory Latency Using Prediction/Speculation. Mechanisms required: 1. Predict “what” memory address to fetch (goal: minimize traffic, avoid thrashing, etc.). 2. Predict “when” to fetch (goal: maximize the latency hidden). 3. Storage: “where” fetched data is placed (goal: avoid lookups in auxiliary structures).

11 Current Proposals for Data Prefetching. Custom prefetchers: stride, stream, dependence-based, etc. General-purpose prefetchers: precomputation/slipstream prefetchers; address-correlating prefetchers. Key shortcomings: insufficient lookahead (e.g., 10~100 cycles); low accuracy for general access patterns; cannot place data directly in L1 (use buffers).

12 Related Work: Address Correlating Prefetchers. Markov Prefetchers (Joseph & Grunwald, ISCA’97). Predict “what”: correlate L1 miss addresses. Predict “when”: consecutive L1 misses. High prediction coverage. Clustered L1 misses => insufficient lookahead. One-to-many predictions => low accuracy. Prefetch into a buffer => high prefetch hit time.
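The miss-address correlation described above can be sketched in a few lines. This is a toy model, not the design from the cited paper: the class name, the `max_successors` bound, and the most-recent-first ordering are assumptions made for illustration.

```python
from collections import defaultdict


class MarkovPrefetcher:
    """Toy address-correlating prefetcher keyed on consecutive L1 misses."""

    def __init__(self, max_successors=2):
        self.max_successors = max_successors   # hypothetical fan-out bound
        self.table = defaultdict(list)         # miss address -> successor misses
        self.prev_miss = None

    def on_miss(self, addr):
        """Train on (previous miss -> this miss), then return prefetch candidates."""
        if self.prev_miss is not None:
            successors = self.table[self.prev_miss]
            if addr in successors:
                successors.remove(addr)
            successors.insert(0, addr)             # most recent successor first
            del successors[self.max_successors:]   # bound the one-to-many fan-out
        self.prev_miss = addr
        # "When" is the miss itself, so lookahead is only the gap to the next miss.
        return list(self.table.get(addr, []))
```

Because every miss address can map to several successors, one miss may trigger multiple prefetches of which at most one is useful, which is the accuracy problem the slide notes.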

13 Insufficient Lookahead in Correlating Prefetchers. Consecutive L1 misses are often clustered; this is exacerbated in out-of-order cores. [Figure: access stream load/store A1 (miss), load/store A1 (hit), load/store C3 (miss), ..., load/store A3 (miss), with the fetch-on-miss lookahead marked.]

14 We Propose Fetch on Last Touch. Predict and fetch on last touch: + evict the dead block; + enhance fetch lookahead; + fetch directly into L1. [Figure: the access stream load/store A1 (miss), load/store A1 (hit), load/store C3 (miss), ..., load/store A3 (miss) shown twice, once with the fetch-on-last-touch lookahead marked and once with the fetch-on-miss lookahead marked.]

15 Enhancing Fetch Lookahead. [Figure: cumulative distribution (0 to 100) of lookahead in processor cycles (2, 4, 8, ..., 2048, > 2048) for two series: between last touch and next miss (our proposal) and between two misses (Markov). L2 latency and memory latency are marked on the plot.]

16 Dead-Block Prediction [ISCA’01]. Correlate a trace of memory accesses to a block; traces uniquely identify different dead times. [Figure: access stream to a block frame: PC0: load/store A0 (hit); PC1: load/store A1 (miss), first touch; PC3: load/store A1 (hit); PC3: load/store A1, last touch; PC5: load/store A3 (miss). Trace = (PC1, PC3, PC3).]
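A minimal sketch of trace-based last-touch prediction follows, assuming a direct-mapped cache and an unbounded set of learned traces. The class and its fields are hypothetical; the hardware described in the talk uses encoded signatures and finite tables rather than Python tuples.

```python
class DeadBlockPredictor:
    """Toy last-touch predictor for a direct-mapped cache (illustrative only)."""

    def __init__(self, num_frames, block_size=32):
        self.num_frames = num_frames
        self.block_size = block_size
        self.resident = {}         # frame index -> block currently cached there
        self.trace = {}            # frame index -> PCs that touched that block
        self.dead_traces = set()   # complete traces seen before past evictions

    def access(self, pc, addr):
        """Return True if this access is predicted to be the block's last touch."""
        block = addr // self.block_size
        frame = block % self.num_frames
        if self.resident.get(frame) != block:            # miss: block replaced
            if frame in self.resident:
                self.dead_traces.add(self.trace[frame])  # learn the full trace
            self.resident[frame] = block
            self.trace[frame] = ()
        self.trace[frame] += (pc,)
        return self.trace[frame] in self.dead_traces
```

Replaying the slide's stream twice through a one-frame cache, the second pass flags the second PC3 access to A1 as the last touch but not the first one, since the partial trace (PC1, PC3) never ended in an eviction. That is the sense in which a trace identifies a dead time more precisely than a single PC would.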

17 Miss-Address Prediction. Correlate the last 2 misses within a cache block frame: correlation = (A0, A1) => (A3). [Figure: access stream PC0: load/store A0 (hit); ...; PC1: load/store A1 (miss); PC3: load/store A1 (hit); PC3: load/store A1; PC5: load/store A3 (miss). Trace = (PC1, PC3, PC3), giving the combined correlation (A0, A1, PC1, PC3, PC3) => (A3).]

18 Dead-Block Correlating Prefetcher. Two-level prediction table: history table and correlating prediction table. Encoding: truncated addition. Two-bit saturating counter. [Figure: the current access (PC3) is encoded with the history-table entry (A0, PC1, PC3; latest address A1) into the signature (A0, A1, PC1, PC3, PC3); the matching correlating-prediction-table entry holds A3 with counter value 11 and triggers “Prefetch A3”.]
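The two-level organization can be sketched as below, assuming a direct-mapped cache, a 16-bit signature, and a prefetch threshold of 2. All of these, along with the names `DBCP`, `extend`, and `_train`, are assumptions for illustration; the slide only names truncated addition and a two-bit saturating counter.

```python
SIG_BITS = 16                      # hypothetical signature width
MASK = (1 << SIG_BITS) - 1


def extend(signature, value):
    """Truncated addition: fold a PC or block address into the signature."""
    return (signature + value) & MASK


class DBCP:
    """Toy dead-block correlating prefetcher (illustrative only)."""

    def __init__(self, num_frames, block_size=32, threshold=2):
        self.num_frames = num_frames
        self.block_size = block_size
        self.threshold = threshold
        self.resident = {}   # frame -> resident block
        self.sig = {}        # history table: frame -> current signature
        self.table = {}      # correlating table: signature -> [next block, counter]

    def access(self, pc, addr):
        """Return the block number to prefetch at a predicted last touch, else None."""
        block = addr // self.block_size
        frame = block % self.num_frames
        if self.resident.get(frame) != block:             # miss in this frame
            previous = self.resident.get(frame, 0)
            if frame in self.resident:
                self._train(self.sig[frame], block)       # old signature -> new block
            self.resident[frame] = block
            self.sig[frame] = extend(previous, block)     # start from the last 2 misses
        self.sig[frame] = extend(self.sig[frame], pc)     # append the PC trace
        entry = self.table.get(self.sig[frame])
        if entry is not None and entry[1] >= self.threshold:
            return entry[0]
        return None

    def _train(self, signature, next_block):
        entry = self.table.get(signature)
        if entry is None:
            self.table[signature] = [next_block, 1]
        elif entry[0] == next_block:
            entry[1] = min(3, entry[1] + 1)               # two-bit counter saturates at 3
        else:
            entry[1] -= 1
            if entry[1] == 0:
                self.table[signature] = [next_block, 1]
```

In this sketch the prediction fires on the access that completes a previously seen signature, so the prefetch for A3 is issued at the last touch of A1 rather than at the miss on A3. The sketch omits recovery when a block predicted dead is touched again.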

19 Methodology. Simulated using SimpleScalar 3.0: 2 GHz, 8-issue, 128-entry window; 32K, direct-mapped, 1-cycle L1D; 1M, 4-way, 12-cycle L2; 70-cycle memory; 2M, 8-way, 24-cycle prediction table; 128-entry prefetch buffer (for Markov only). Memory-intensive integer, floating-point, and linked-data applications: 14 benchmarks (5 Olden, 4 SpecINT, 5 SpecFP).

20 Dead-Block Coverage and Accuracy. DBP predicts 90% of misses and mispredicts only 4%. [Figure: stacked bars (predicted, training, mispredicted) as a fraction of all misses, 0% to 120%, for bh, em3d, health, mst, treeadd, compress, perl, gcc, mcf, ammp, art, equake, mgrid, swim.]

21 Miss-Address Prediction. DBCP predicts 82% of misses and mispredicts 3%. Markov (Joseph & Grunwald) predicts 81%, but its mispredictions amount to 229% of all misses. [Figure: stacked bars (predicted, training, mispredicted) as a fraction of all misses, 0% to 140%, with clipped bars marked > 190%, for Markov (M) and DBCP (D) on bh, em3d, health, mst, treeadd, compress, perl, gcc, mcf, ammp, art, equake, mgrid, swim.]

22 Memory Stall Time Reduction. DBCP reduces memory stall time by 62% on average; Markov reduces it by only 30%. [Figure: fraction of memory stall time reduced, 0% to 100%, for Markov and DBCP on bh, em3d, health, mst, treeadd, compress, perl, gcc, mcf, ammp, art, equake, mgrid, swim.]

23 DBCP vs. Larger On-Chip L2. [Figure: fraction reduced on bh, em3d, health, mst, treeadd, compress, perl, gcc, mcf, ammp, art, equake, mgrid, swim for three configurations: 18-cycle 3M L2, 24-cycle 2M DBCP, and 12-cycle 3M L2.]

24 Conclusions. Dead-Block Predictors (DBP): predict when to evict a block; enable timely prefetching; can prefetch into the L1 cache; high coverage of 90%, mispredicting only 4%. Dead-Block Correlating Prefetchers (DBCP): accurate and timely prefetch; reduce memory stall time by 62%.

25 Other Mechanisms in PUMA 2. Self-invalidation predictors [ISCA’00]: predict when to self-invalidate in multiprocessors; convert 3-hop latencies to 2-hop. Memory sharing predictors [ISCA’99]: predict subsequent sharers of a block; a powerful mechanism to move data. Both exhibit high coverage and accuracy.

26 Outline. Impetus Overview. PUMA 2: Hitting the Memory Wall; Last-Touch Memory Access; Speculative Memory Ordering.

27 What Programmers Want (SC). Sequential Consistency (SC) [Lamport]: memory accesses should appear to execute in program order and atomically, e.g., a critical section: lock, modify data, unlock. + Intuitive programming. - Extremely slow! [Figure: processors P, P, P, ... connected to a shared memory.]

28 What Machines Provide (RC). Overlap remote accesses: software enforces order when needed (e.g., first lock, then data) using special “ordering” instructions. Release Consistency (RC) [Gharachorloo et al.]: allows any (re-)ordering; found in, e.g., IA-64 and SPARC. + High performance. - Complicates programming. [Figure: processors P, P, P, ... overlapping accesses to a shared memory.]
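The difference between the two models shows up in a small message-passing test. The sketch below enumerates every interleaving of two threads, first with each thread kept in program order (SC) and then with each thread's accesses allowed in any order. Treating relaxed ordering as free permutation of unsynchronized accesses is a simplification assumed here, and the thread programs and names are hypothetical.

```python
from itertools import permutations

# Thread 0 publishes data and then sets a flag; thread 1 reads flag, then data.
T0 = [("st", "data", 1), ("st", "flag", 1)]
T1 = [("ld", "flag", "r_flag"), ("ld", "data", "r_data")]


def interleavings(a, b):
    """Yield every merge of a and b that preserves each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(t0_orders, t1_orders):
    """Collect (r_flag, r_data) over all interleavings of the given orders."""
    results = set()
    for t0 in t0_orders:
        for t1 in t1_orders:
            for schedule in interleavings(t0, t1):
                mem, regs = {"data": 0, "flag": 0}, {}
                for op, loc, arg in schedule:
                    if op == "st":
                        mem[loc] = arg          # store a value to memory
                    else:
                        regs[arg] = mem[loc]    # load memory into a register
                results.add((regs["r_flag"], regs["r_data"]))
    return results


# SC: only program order is allowed within each thread.
sc = outcomes([T0], [T1])
# Relaxed: any order within each thread, with no ordering instructions.
relaxed = outcomes([list(p) for p in permutations(T0)],
                   [list(p) for p in permutations(T1)])
```

Under SC the outcome (r_flag, r_data) = (1, 0) never appears, whereas the relaxed enumeration includes it. Ruling it out on a relaxed machine is the job of the ordering instructions the slide mentions.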

29 Can We Have SC Programming With RC Performance? Observation: SC must only appear to execute in program order; order is needed only when others race to access. SC hardware can emulate RC iff it overlaps accesses speculatively, keeps a log of computation in program order, and rolls back in case of a race. + No help from software => SC programming. + Infrequent rollback => better than RC performance.

30 Related Work: Hiding the Store Latency. A number of SC optimizations: 1. Multiple pending prefetches: commit to L1 in order [Gharachorloo et al.], e.g., MIPS R10000’s pending misses. 2. Relaxing order within the ROB: speculative loads [Gharachorloo et al.], e.g., MIPS R10000’s speculative loads. Extensions to the ROB: speculative retirement [Ranganathan et al.]. Limited speculation in small associative buffers!

31 Execution in SC Memory System. WR X, RD Y, and RD Z access remote memory. X, Y, Z, and A are unrelated, so they need not be ordered. WR X blocks the pipeline for hundreds of cycles. Cannot overlap RD Y and RD Z with WR X. [Figure: pipeline, reorder buffer (WR X, RD Y, RD Z, WR A, RD A, ALU; done), and memory queue (WR X miss; WR A, RD A, RD Y idle).]

32 Execution in RC Memory System. + Accesses to A complete while WR X is pending. + Overlaps remote accesses to X, Y, and Z. - Software must guarantee that X, Y, Z, and A are unrelated. [Figure: pipeline, reorder buffer (..., RD Y, RD Z; WR X, WR A, RD A, ALU done out of order), and memory queue (RD Z miss, WR X miss, RD Y miss, ...).]

33 Speculatively & Fully Relaxing Order. With Vijaykumar [ISCA’99]. H/W support for relaxing all order: storage to tolerate long latencies (old processor state, old memory state); fast lookup to detect possible order violations upon cache invalidations and replacements. Infrequent rollbacks: typical of well-behaved applications; rollbacks are due to false sharing or data races.

34 SC++: A Design for Speculative SC. SHiQ (Speculative History Queue): back up computation in a queue. BLT (Block Lookup Table): quick lookup to detect races from directory accesses. [Figure: pipeline, reorder buffer (..., RD Y, RD Z), speculative history queue (WR X, WR A, RD A, ALU; done), memory queue (RD Z miss, WR X miss, RD Y miss, ...), and block lookup table (block containing A; block containing Y & Z).]
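The interplay between the history queue and the lookup table can be sketched as a toy model. Everything here is an assumption for illustration: the class `SpeculativeCore`, the choice to log only register state, and the use of a per-block count for the lookup. The design in the talk also backs up memory state and re-executes after a rollback, which this sketch omits.

```python
class SpeculativeCore:
    """Toy model: log undo state for speculative loads, roll back on a race."""

    def __init__(self):
        self.regs = {}
        self.history = []   # SHiQ stand-in: (register, old value, block) entries
        self.blocks = {}    # BLT stand-in: block -> number of logged entries

    def speculative_load(self, reg, block, value):
        """Perform a load past a pending store, saving the overwritten value."""
        self.history.append((reg, self.regs.get(reg, 0), block))
        self.blocks[block] = self.blocks.get(block, 0) + 1
        self.regs[reg] = value

    def commit(self):
        """The pending store completed, so the logged work is no longer speculative."""
        self.history.clear()
        self.blocks.clear()

    def invalidate(self, block):
        """Another processor wrote `block`. Return True if a rollback occurred."""
        if block not in self.blocks:        # quick lookup: no race detected
            return False
        first = next(i for i, (_, _, b) in enumerate(self.history) if b == block)
        for reg, old, b in reversed(self.history[first:]):   # undo youngest first
            self.regs[reg] = old
            self.blocks[b] -= 1
            if self.blocks[b] == 0:
                del self.blocks[b]
        del self.history[first:]
        return True
```

An invalidation for a block that no speculative access touched costs only the lookup, which matches the slide's premise that rollbacks are rare in well-behaved applications.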

35 Applications Beyond Memory Order. SC++ can be used as a generic speculation mechanism when rollbacks are rare and verifying the speculation takes much longer than the ROB can sustain. Examples: value speculation [Sorin et al.]; speculating beyond locks [Rajwar et al.].

36 Performance of SC, RC and SC++. Data from the RSIM DSM simulator with 16 1-GHz MIPS R10000 processors. Up to 70% gap between SC and RC. SC++ can fully emulate RC.

37 Sensitivity to Queue Size. Queue size varies across applications (and systems). History is highly bursty. Can spill history to L2.

38 Conclusions. First to show SC + Speculation = RC! Identified design requirements; current systems do not satisfy the requirements; proposed a design, SC++. Hardware can provide simple programming with high performance!

39 Other Ongoing Projects. Ultra-Deep-Submicron Designs: 1. Power management [MICRO’01, HPCA’02, HPCA’01, ISLPED’00]: first architectural proposal to reduce leakage; resizable caches; way/bank-predicting caches; power-aware snoopy coherence. 2. Transient-error tolerant superscalar [MICRO’01]: error-tolerant instruction scheduling. Beyond ILP & Superscalar [ICS’01, PPoPP’01]: low-overhead mechanisms for thread-level speculation; selective dependence tracking.

40 For More Information. Please visit our web site: Impetus Group, Computer Architecture Lab, Carnegie Mellon University, http://www.ece.cmu.edu/~impetus

