Temporal Streaming of Shared Memory

1 Temporal Streaming of Shared Memory
Thomas F. Wenisch, Stephen Somogyi, Nikos Hardavellas, Jangwoo Kim, Anastassia Ailamaki, Babak Falsafi

2 The Distributed Shared Memory Wall
[Diagram: two DSM nodes, each with a CPU, cache ($), and memory (Mem), connected by a network]
Cache: 10's of cycles. Memory: 100's of cycles. Coherence: 1000's of cycles.
Coherence misses account for 25% of execution time in OLTP and increase with larger cache sizes [Barroso 98]

3 Prior Work
- Coherence optimizations [Stenström 93] [Lai 00] [Huh 04]: limited applicability, e.g., migratory sharing or one network hop
- Prefetching [Joseph 97] [Nesbit 04] [Gracia Pérez 04]: uses simple patterns or fetches many addresses, e.g., strides
- Large execution windows / runahead [Mutlu 03]: fetches dependent addresses serially, e.g., pointer chasing
Need a solution for arbitrary access patterns

4 Shared Memory Streaming
Record sequences of shared memory accesses; transfer data sequences ahead of requests
[Diagram: Baseline DSM: Miss A, Fill A, Miss B, Fill B (one round trip per miss). Streaming DSM: Miss A, Fill A, B, C, … (data streamed ahead of requests)]
Accelerates arbitrary access patterns; parallelizes the critical path of pointer chasing

5 Our Contributions
Temporal Streaming
- Recent coherence miss sequences recur
- Generates MLP from dependent misses
- 50% of misses closely follow a previous sequence
Temporal Streaming Engine
- Ordered streams allow practical HW
- Performance improvement: 7%-230% in scientific apps, 6%-21% in commercial Web & OLTP apps

6 Outline
- Introduction
- Temporal Streaming
- Temporal Streaming Engine
- Results
- Conclusion

7 Relationship Between Misses
Intuition: miss sequences repeat because code sequences repeat; observed for uniprocessors in [Chilimbi 02]
Temporal address correlation: the same miss addresses repeat in the same order; a correlated miss sequence = a stream
Example miss sequence: Q W A B C D E R T A B C D E Y (the stream A B C D E recurs)

8 Relationship Between Streams
Intuition: streams exhibit temporal locality because the working set exhibits temporal locality; for shared data, the repetition is often across nodes
Temporal stream locality: recent streams are likely to recur
Example: Node 1 misses Q W A B C D E R; Node 2 later misses T A B C D E Y
Address correlation + stream locality = temporal correlation

9 Identifying Temporal Streams
Use a history of recent miss sequences. Upon a miss to A: locate the successors of the most recent prior miss to A.
Example: Node 1 previously missed Q W A B C D E; when Node 2 misses A, the stream A, B, C, D, E is likely to recur at Node 2.
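A minimal behavioral sketch (not from the talk) of this successor lookup; the function name, the default stream length, and the flat list standing in for the recorded miss history are illustrative assumptions:

```python
# Behavioral sketch of stream identification (illustrative, not the hardware design).
# miss_history: addresses of recent coherence misses, oldest first.

def locate_stream(miss_history, miss_addr, stream_len=4):
    """On a miss to miss_addr, return the addresses that followed the most
    recent earlier miss to the same address (the predicted stream)."""
    # Search backwards for the most recent prior occurrence of miss_addr.
    for i in range(len(miss_history) - 1, -1, -1):
        if miss_history[i] == miss_addr:
            # The successors of that occurrence form the predicted stream.
            return miss_history[i + 1 : i + 1 + stream_len]
    return []  # No prior occurrence: nothing to stream.

# Example from the slide: Node 1 recorded Q W A B C D E; Node 2 now misses on A.
history_node1 = ["Q", "W", "A", "B", "C", "D", "E"]
print(locate_stream(history_node1, "A"))  # ['B', 'C', 'D', 'E']
```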

10 Memory Level Parallelism
Streams create MLP for dependent misses, which is not possible with larger instruction windows or runahead: temporal streaming breaks dependence chains.
[Diagram: Baseline: the CPU must wait to follow pointers, so misses to A, B, C are serialized. Temporal streaming: A, B, C are fetched in parallel.]
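A toy sketch of this point, assuming a simple linked structure (all names here are hypothetical): pointer chasing must resolve each block before it can name the next, while a recorded stream names all the addresses up front so they could be fetched concurrently.

```python
# Toy illustration (not from the talk) of why dependent misses serialize under
# pointer chasing but can be fetched in parallel once a stream names them.
linked = {"A": ("dataA", "B"), "B": ("dataB", "C"), "C": ("dataC", None)}

def pointer_chase(start):
    # Each step needs the previous block's contents, so the misses are serial.
    addr, out = start, []
    while addr is not None:
        data, nxt = linked[addr]   # long-latency miss on every iteration
        out.append(data)
        addr = nxt
    return out

def follow_stream(recorded):
    # The recorded stream names A, B, C up front, so the blocks are independent
    # lookups and could all be fetched concurrently.
    return [linked[a][0] for a in recorded]

print(pointer_chase("A"))               # ['dataA', 'dataB', 'dataC']
print(follow_stream(["A", "B", "C"]))   # same data, no serialized chain
```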

11 Temporal Streaming: (1) Record
[Diagram: Node 1, Directory, Node 2. Node 1 misses A, B, C, D; the miss order is recorded.]

12 Temporal Streaming: (1) Record
[Diagram: Node 1's misses A, B, C, D have been recorded. Node 2 misses A, sends Req. A to the directory, and receives Fill A.]

13 Temporal Streaming: (1) Record, (2) Locate
[Diagram: On Node 2's miss to A, the directory locates A in Node 1's recorded miss order; Node 2 receives Fill A and the stream B, C, D.]

14 Temporal Streaming: (1) Record, (2) Locate, (3) Stream
[Diagram: Node 2 fetches B, C, D named by the stream; its subsequent accesses to B and C hit locally.]

15 Temporal Streaming Engine
(1) Record: Coherence Miss Order Buffer (CMOB)
- ~1.5MB circular buffer per node, kept in local memory
- Stores addresses only
- Accesses are coalesced
[Diagram: on Fill E, the miss address is appended to the CMOB in local memory, which holds the sequence Q W A B C D E R T Y]
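A rough software model of the CMOB as the slide describes it: a fixed-capacity circular buffer of miss addresses in local memory. The capacity in entries, the class and method names, and the default stream length are assumptions for illustration:

```python
class CMOB:
    """Coherence Miss Order Buffer: per-node circular buffer of miss addresses
    kept in local memory (behavioral sketch; capacity in entries, not bytes)."""

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0          # next slot to write (wraps around)

    def append(self, miss_addr):
        """Record a coherence miss; returns the CMOB index it was stored at,
        which the node can send to the directory as a pointer (next slide)."""
        index = self.head
        self.buf[index] = miss_addr
        self.head = (self.head + 1) % len(self.buf)
        return index

    def read_stream(self, index, length=4):
        """Follow a stream: read the addresses recorded after `index`."""
        out = []
        for i in range(length):
            addr = self.buf[(index + 1 + i) % len(self.buf)]
            if addr is not None:
                out.append(addr)
        return out

cmob = CMOB(capacity=8)
ptrs = {a: cmob.append(a) for a in ["Q", "W", "A", "B", "C", "D", "E"]}
print(cmob.read_stream(ptrs["A"]))  # ['B', 'C', 'D', 'E']
```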

16 Temporal Streaming Engine
(2) Locate: annotate the directory
- The directory already has coherence info for every block
- On a CMOB append, send a pointer (node, CMOB index) to the directory
- On a coherence miss, forward a stream request to the recording node
[Example directory entries: A shared, CMOB pointer to entry 23 on the recording node; B modified, CMOB pointer to entry 401]
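A sketch of how this directory annotation could be modeled: each entry keeps its usual coherence state plus a (node, CMOB index) pointer installed on append, and a coherence miss triggers a forwarded stream request. The class names, fields, and message tuple are illustrative assumptions, not the paper's interface:

```python
from dataclasses import dataclass, field

@dataclass
class DirEntry:
    state: str = "invalid"              # usual coherence state (shared/modified/...)
    sharers: set = field(default_factory=set)
    cmob_ptr: tuple = None              # (recording node id, CMOB index), or None

class Directory:
    def __init__(self):
        self.entries = {}               # block address -> DirEntry

    def on_cmob_append(self, addr, node_id, cmob_index):
        """A node appended `addr` to its CMOB: remember where."""
        self.entries.setdefault(addr, DirEntry()).cmob_ptr = (node_id, cmob_index)

    def on_coherence_miss(self, addr, requester):
        """Besides the normal coherence actions (omitted here), forward a
        stream request to the node whose CMOB holds this address's successors."""
        entry = self.entries.get(addr)
        if entry and entry.cmob_ptr:
            node_id, index = entry.cmob_ptr
            return ("stream_request", node_id, index, requester)
        return None
```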

17 Temporal Streaming Engine
(3) Stream: fetch data to match the use rate
- Addresses wait in a FIFO stream queue
- Data is fetched into a streamed value buffer (~32 entries)
[Diagram: Node i streams {A, B, C, …}; addresses F E D C B wait in the stream queue while A's data is fetched into the streamed value buffer beside the CPU and L1 cache]
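A behavioral sketch of the stream queue and streamed value buffer on the consuming node, assuming a simple callback stands in for the remote block fetch; the sizes, names, and FIFO eviction shown are assumptions:

```python
from collections import deque, OrderedDict

class StreamedValueBuffer:
    """Small buffer (~32 entries) for streamed data; evicted entries are
    simply dropped, so useless fetches do not pollute the caches."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.data = OrderedDict()           # addr -> block data

    def insert(self, addr, block):
        if len(self.data) >= self.capacity:
            self.data.popitem(last=False)   # drop the oldest entry
        self.data[addr] = block

    def lookup(self, addr):
        return self.data.pop(addr, None)    # a hit hands the block to the core

class StreamEngine:
    def __init__(self, fetch_block):
        self.queue = deque()                # FIFO of addresses still to fetch
        self.svb = StreamedValueBuffer()
        self.fetch_block = fetch_block      # callback that reads a block remotely

    def enqueue_stream(self, addrs):
        self.queue.extend(addrs)

    def fetch_next(self, n=1):
        """Fetch up to n queued addresses, pacing fetches to the CPU's use rate."""
        for _ in range(min(n, len(self.queue))):
            addr = self.queue.popleft()
            self.svb.insert(addr, self.fetch_block(addr))

engine = StreamEngine(fetch_block=lambda addr: f"data@{addr}")
engine.enqueue_stream(["B", "C", "D"])
engine.fetch_next(2)
print(engine.svb.lookup("B"))   # 'data@B'; "C" is buffered, "D" still queued
```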

18 Practical HW Mechanisms
Streams are recorded and followed in order:
- FIFO stream queues
- ~32-entry streamed value buffer
- Coalesced, cache-block-sized CMOB appends
One request predicts many misses:
- More lookahead
- Allows off-chip stream storage
- Leverages the existing directory lookup

19 Evaluation Methodology
Flexus: full-system trace-based and OoO timing simulation; leverages SMARTS sampling
Benchmark applications:
- Scientific: em3d, moldyn, ocean
- OLTP (TPC-C WH): IBM DB2 7.2, Oracle 10g
- SPECweb99 (16K connections): Apache 2.0, Zeus 4.3
Model parameters: 16 4GHz SPARC CPUs; 8-wide OoO cores with an 8-stage pipeline; 256-entry ROB/LSQ; 64KB L1, 8MB L2; TSO with speculation

20 Temporal Correlation
- % of misses that closely follow a past miss sequence
- Allows for skips/reordering within an 8-address window
- 50% (commercial) to near-perfect (scientific) repetition
Significant opportunity for temporal streaming
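The sketch below is only a rough approximation of such a metric, not the paper's exact measurement: a miss is counted as correlated if it appears within an 8-address window after the most recent earlier occurrence of the miss that preceded it.

```python
def temporally_correlated_fraction(misses, window=8):
    """Approximate post-hoc measure of temporal correlation (illustrative only)."""
    correlated = 0
    for i in range(1, len(misses)):
        prev, cur = misses[i - 1], misses[i]
        # Most recent earlier occurrence of the previous miss address.
        for j in range(i - 2, -1, -1):
            if misses[j] == prev:
                # Count cur as correlated if it appears within the window
                # following that earlier occurrence (allows skips/reordering).
                if cur in misses[j + 1 : j + 1 + window]:
                    correlated += 1
                break
    return correlated / max(1, len(misses) - 1)

seq = ["Q", "W", "A", "B", "C", "D", "E", "R", "T", "A", "B", "D", "C", "E", "Y"]
print(round(temporally_correlated_fraction(seq), 2))  # about 0.36 for this toy sequence
```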

21 TSE Performance Impact
[Charts: time breakdown and speedup with 95% confidence intervals for em3d, moldyn, ocean, Apache, DB2, Oracle, Zeus]
TSE eliminates 25%-95% of coherent read stalls; 6% to 230% performance improvement

22 Conclusion
Temporal Streaming
- Intuition: recent coherence miss sequences recur
- Impact: eliminates a substantial fraction of coherence misses
Temporal Streaming Engine
- Intuition: in-order streams enable practical HW
- Impact: performance improvement of 7%-230% in scientific apps and 6%-21% in commercial Web & OLTP apps

23 Questions?
Impetus Group, Computer Architecture Lab (CALCM), Carnegie Mellon University
Flexus full-system OoO MP simulator now available

24 Backup Slides

25 Temporal Correlation
Correlated accesses: scientific >90%, commercial >40%
Majority of streams recur in perfect order

26 Temporal Streaming Effectiveness
Stream data into the streamed value buffer; discarded fetches do not pollute existing caches
Eliminates a substantial fraction of misses

27 Stream Lengths
Comm: short streams; low base MLP (1.2-1.3)
Sci: long streams; high base MLP
Temporal Streaming can improve in both cases

28 CMOB Capacity Requirements
Comm: smooth improvement up to 1.5MB
Sci: must capture the iteration's access pattern

29 Interconnect Bisection Bandwidth
HP GS1280 provides 49.6 GB/s [Cvetanovic 03] TSE overhead < 7% of currently available BW

30 TSE Comparison
TSE outperforms Stride and GHB for coherence misses

31 Coherence Stalls in Server Apps
Memory stalls in OLTP [Barroso 98]
The current trend is toward larger caches; the fraction of coherent read stalls is growing

32 Directory Overhead
A CMOB pointer contains:
- Node id: log2(# nodes) bits
- CMOB index: log2(# CMOB entries) bits
Roughly doubles the size of a full-bitmap directory
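A back-of-the-envelope check of the "roughly doubles" claim for the 16-node configuration, assuming 4-byte CMOB entries (the entry size is an assumption; the talk only gives the ~1.5MB capacity):

```python
import math

nodes = 16
cmob_bytes = int(1.5 * 2**20)       # ~1.5 MB per node
bytes_per_entry = 4                 # assumed size of one recorded address
cmob_entries = cmob_bytes // bytes_per_entry

node_id_bits = math.ceil(math.log2(nodes))            # 4 bits
cmob_index_bits = math.ceil(math.log2(cmob_entries))  # 19 bits
pointer_bits = node_id_bits + cmob_index_bits         # 23 bits per entry

full_bitmap_bits = nodes                               # 16-bit sharer vector
print(pointer_bits, full_bitmap_bits, pointer_bits / full_bitmap_bits)
# The extra pointer is on the same order as the sharer bitmap itself,
# consistent with "roughly doubles" the directory state.
```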

33 Streaming
✓ Targets a sequence of misses
What to stream? Use the relationship between misses; improves accuracy
When to stream? Match the fetch rate to the use rate; needs only small temporary storage; more lookahead gives more time to predict the sequence
[Example timeline: load A triggers stream B, C, D; load B hits and triggers stream E; load C hits and triggers stream F]
More lookahead allows more stream meta-info

