1 Exploiting eDRAM Bandwidth with Data Prefetching: Simulation and Measurements
Presented by David Wolinsky

2 Outline
eDRAM
Prefetching
Blue Gene/L Overview
Cache System Overview
System Analysis
Conclusion

3 eDRAM
Embedded DRAM: DRAM located right on the chip
  Wider buses
  Higher operation speeds
Benefits over SRAM
  Reduces costs
  Greater density (more capacity)
  Lower power
Hides latency through increased bandwidth

4 Prefetching (1)
Fetches and stores data before it's needed
Done via streaming
If a stream is valid, a new line is added for each subsequent data request
Two streaming methods
  N-Deep History
  Optimistic

5 Prefetching (2)
N-Deep History data streaming
  On an L1 miss, the L2 records the address in the stream detection buffer
  If the address already has an entry, a stream is established
  Subsequent accesses trigger prefetch requests
Optimistic
  Issues a fetch request for each new L2 cache request not in the prefetch buffer
  Prefetches the next line
Streams are maintained until displaced
Fully associative lines
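The two streaming methods on this slide can be sketched in a few lines of Python. This is a toy model, not the Blue Gene/L hardware logic: the class name, the FIFO displacement of line buffers, and the choice to record the expected next line in the history are all assumptions made for illustration. The defaults of 16 history entries and 15 line buffers are taken from the System Analysis slides.

```python
from collections import OrderedDict, deque

LINE_SIZE = 128  # bytes per prefetch line (from the cache overview slide)


class StreamPrefetcher:
    """Toy prefetcher; names and displacement policy are illustrative assumptions."""

    def __init__(self, history_depth=16, num_line_buffers=15, optimistic=False):
        # N-deep history: line numbers we expect to see next if a stream exists.
        self.history = deque(maxlen=history_depth)
        # Prefetched lines, oldest first (fully associative lookup by line number).
        self.buffers = OrderedDict()
        self.num_line_buffers = num_line_buffers
        self.optimistic = optimistic

    def _prefetch(self, line):
        if line in self.buffers:
            return
        if len(self.buffers) >= self.num_line_buffers:
            self.buffers.popitem(last=False)  # displace the oldest line (assumed FIFO)
        self.buffers[line] = True

    def access(self, addr):
        """Handle one L1 miss at byte address addr; return 'hit' or 'miss'."""
        line = addr // LINE_SIZE
        if line in self.buffers:
            self._prefetch(line + 1)  # valid stream: add one more line
            return "hit"
        if self.optimistic:
            self._prefetch(line + 1)  # optimistic: always fetch the next line
        elif line in self.history:
            self.history.remove(line)  # second sequential miss: stream established
            self._prefetch(line + 1)
        else:
            self.history.append(line + 1)  # remember where a stream would continue
        return "miss"


sequential = [i * LINE_SIZE for i in range(5)]
n_deep = StreamPrefetcher()
n_deep_results = [n_deep.access(a) for a in sequential]
optimistic = StreamPrefetcher(optimistic=True)
optimistic_results = [optimistic.access(a) for a in sequential]
```

On a purely sequential access pattern, the history-based detector in this sketch needs two misses before the stream starts hitting, while the optimistic variant starts hitting after one, at the cost of issuing a prefetch for every miss whether or not a stream exists.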

6 Blue Gene/L Overview
65,536 nodes
2 PPC440 cores per node
System on Chip
SIMD FPU on each chip

7 Cache System Overview (1)
32kB L1 cache with 32B cache line
Private 2kB prefetch SRAM with 128B cache line
4 MB L3 eDRAM with 128B cache line
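The line sizes above imply a few ratios that matter for prefetching. The short calculation below is only arithmetic on the figures from this slide, assuming that kB and MB mean 1024 and 1024 × 1024 bytes; the variable names are made up for the example.

```python
KIB = 1024  # assumption: kB on the slide means 1024 bytes

L1_SIZE, L1_LINE = 32 * KIB, 32
PREFETCH_LINE = 128
L3_SIZE, L3_LINE = 4 * 1024 * KIB, 128

l1_lines = L1_SIZE // L1_LINE                          # lines in the L1 cache
l1_lines_per_prefetch_line = PREFETCH_LINE // L1_LINE  # L1 lines covered by one prefetched line
l3_lines = L3_SIZE // L3_LINE                          # lines in the L3 eDRAM

print(l1_lines, l1_lines_per_prefetch_line, l3_lines)  # 1024 4 32768
```

Each 128B line brought into the prefetch SRAM therefore covers four consecutive L1 lines, so one successful prefetch can satisfy up to four sequential L1 misses.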

8 System Analysis (1)
Optimal stream detector size
NAS optimal is 16
Splash-2 optimal is 8
Actual count is 16

9 System Analysis (2)
Optimal number of line buffers
Chose 15 for the best performance / cost ratio

10 System Analysis (3)
Replacement policy
LRU is fastest but expensive to implement
RRMRU uses round robin but won't replace the 3 most recently used entries; BG/L uses this
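The RRMRU policy described above can be sketched as a round-robin pointer that skips a small set of protected entries. The class and method names are hypothetical, and the hardware bookkeeping may well differ; this only illustrates the rule "round robin, but never the 3 most recently used".

```python
from collections import deque


class RoundRobinSkipMRU:
    """Toy RRMRU victim selection; an illustration, not the BG/L implementation."""

    def __init__(self, num_entries=15, protected=3):
        if protected >= num_entries:
            raise ValueError("need at least one unprotected entry")
        self.num_entries = num_entries
        self.pointer = 0                        # next round-robin candidate
        self.recent = deque(maxlen=protected)   # indices of the most recently used entries

    def touch(self, index):
        """Record a use of entry index, making it the most recently used."""
        if index in self.recent:
            self.recent.remove(index)
        self.recent.append(index)  # the oldest protected entry drops out automatically

    def victim(self):
        """Return the next entry to replace, skipping protected entries."""
        while True:
            candidate = self.pointer
            self.pointer = (self.pointer + 1) % self.num_entries
            if candidate not in self.recent:
                return candidate


policy = RoundRobinSkipMRU(num_entries=5, protected=3)
for used in (0, 1, 2):
    policy.touch(used)
first_victim = policy.victim()   # entries 0, 1, 2 are protected
second_victim = policy.victim()
policy.touch(4)                  # entry 0 is no longer among the 3 most recent
third_victim = policy.victim()
```

Compared with true LRU, this only needs a pointer and a record of three indices rather than a full ordering of all entries, which is consistent with the slide's point that LRU is expensive to implement.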

11 System Analysis (4)
Optimal stream detector size
NAS optimal is 16
Splash-2 optimal is 8
Actual count is 16

12 System Analysis (5)
Does it matter if streams are bidirectional or unidirectional?
Results show that bidirectional streams help a little, but not enough to justify their cost

13 System Analysis (6)
Prefetches in simulator and hardware
Stream detectors are very competitive with optimal prefetching
Prefetching performs much better than no prefetching
Simulator results were within 10 to 20% of the hardware measurements

14 Conclusion
No direct comparison to L2 SRAM caches
Doesn't work well on Linux using standard 4kB pages
The simulator provides a somewhat accurate way to test different memory configurations before implementing them in hardware
eDRAM with prefetching is much faster than eDRAM without prefetching

