1
Exploiting eDRAM Bandwidth with Data Prefetching: Simulation and Measurements
Presented by David Wolinsky
2
Outline
  eDRAM
  Prefetching
  Blue Gene/L Overview
  Cache System Overview
  System Analysis
  Conclusion
3
eDRAM
  Embedded DRAM: DRAM placed right on the processor chip
  Wider buses
  Higher operation speeds
  Benefits over SRAM
    Reduces costs
    Greater density (more capacity)
    Lower power
  Hides latency through increased bandwidth
4
Prefetching (1)
  Stores data before it's needed
  Done via streaming
  If a stream is valid, a new line is added for each subsequent data request
  Two streaming methods
    N-Deep History
    Optimistic
5
Prefetching (2)
  N-Deep History
    On an L1 miss, L2 records the address in the stream detection buffer
    If the L2 address already has an entry, a stream is established
    Subsequent accesses trigger prefetch requests
  Optimistic
    Issues a fetch request for each new L2 cache request not in the prefetch buffer
    Prefetches the next line
  Streams are maintained until displaced
  Fully associative lines
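The two streaming methods can be sketched in a few lines of Python. This is a toy model, not the Blue Gene/L hardware logic: the class and function names are hypothetical, streams are assumed to run toward ascending addresses only, and the 128 B line size and history depth of 16 are taken from the other slides.

```python
from collections import deque

LINE_BYTES = 128  # prefetch line size, from the cache overview slide


class NDeepHistoryDetector:
    """Toy N-deep history stream detector (hypothetical illustration)."""

    def __init__(self, depth=16):
        # Line numbers we expect to miss on next if a stream is forming.
        self.history = deque(maxlen=depth)
        # Line numbers that have already been prefetched.
        self.prefetched = set()

    def on_l1_miss(self, addr):
        """Return the line numbers to prefetch in response to this miss."""
        line = addr // LINE_BYTES
        if line in self.prefetched:
            # Established stream: consuming a prefetched line fetches the next.
            self.prefetched.discard(line)
            self.prefetched.add(line + 1)
            return [line + 1]
        if line in self.history:
            # The miss matches a recorded entry: a stream is established.
            self.history.remove(line)
            self.prefetched.add(line + 1)
            return [line + 1]
        # No match: record the expected next line (oldest entry falls out).
        if line + 1 not in self.history:
            self.history.append(line + 1)
        return []


def optimistic_prefetch(addr):
    """Optimistic policy: always request the line after the one touched."""
    return [addr // LINE_BYTES + 1]
```

In this sketch the first miss of a sequential walk only records an entry, the second establishes the stream, and every later access keeps it one line ahead. The optimistic variant skips the detection step and pays for it with a wasted fetch on every non-sequential access.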
6
Blue Gene/L Overview
  65,536 nodes
  2 PPC440 per node
  System on Chip
  SIMD FPU on each chip
7
Cache System Overview (1)
  32 kB L1 cache with 32 B cache line
  Private 2 kB prefetch SRAM with 128 B cache line
  4 MB L3 eDRAM with 128 B cache line
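The line sizes above imply that one 128 B prefetch line covers four 32 B L1 lines. A small worked example of that address arithmetic, with hypothetical helper names chosen for illustration:

```python
L1_LINE_BYTES = 32         # L1 cache line size from the slide
PREFETCH_LINE_BYTES = 128  # prefetch SRAM and L3 eDRAM line size


def prefetch_line_base(addr):
    """Base address of the 128 B line that contains addr."""
    return addr - addr % PREFETCH_LINE_BYTES


def l1_lines_covered(addr):
    """Base addresses of the L1 lines inside the 128 B line holding addr."""
    base = prefetch_line_base(addr)
    count = PREFETCH_LINE_BYTES // L1_LINE_BYTES
    return [base + i * L1_LINE_BYTES for i in range(count)]
```

For instance, `l1_lines_covered(200)` gives the four L1 line addresses 128, 160, 192 and 224, so a single 128 B fetch can satisfy up to four consecutive L1 misses.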
8
System Analysis (1)
  Optimal stream detector size
  NAS optimal is 16
  Splash-2 optimal is 8
  Actual count is 16
9
System Analysis (2)
  Optimal number of line buffers
  15 chosen for the best performance/cost ratio
10
System Analysis (3)
  Replacement policy
  LRU is fastest but expensive to implement
  RRMRU uses round robin but won't replace the 3 most recently used entries
  BG/L uses RRMRU
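A minimal sketch of a round-robin policy that spares the three most recently used entries, as the slide describes RRMRU. The class name, method names, and bookkeeping are assumptions made for illustration; the real hardware implementation is not described in the slides.

```python
class RoundRobinSpareMRU:
    """Round-robin replacement that never evicts the most recent entries."""

    def __init__(self, n_entries=15, protected=3):
        self.n_entries = n_entries
        self.protected = protected
        self.pointer = 0   # next round-robin candidate
        self.recent = []   # protected entries, most recent last

    def touch(self, idx):
        """Record a use of entry idx."""
        if idx in self.recent:
            self.recent.remove(idx)
        self.recent.append(idx)
        # Keep only the last `protected` entries.
        while len(self.recent) > self.protected:
            self.recent.pop(0)

    def victim(self):
        """Pick the entry to replace and mark it as just used."""
        for _ in range(self.n_entries):
            idx = self.pointer
            self.pointer = (self.pointer + 1) % self.n_entries
            if idx not in self.recent:
                self.touch(idx)  # the refilled entry is now most recent
                return idx
        raise RuntimeError("every entry is protected")
```

Compared with full LRU, this only needs one pointer and a three-entry list rather than an ordering over all entries, which matches the cost argument on the slide.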
11
System Analysis (4)
  Optimal stream detector size
  NAS optimal is 16
  Splash-2 optimal is 8
  Actual count is 16
12
System Analysis (5)
  Does it matter if streams are bidirectional or unidirectional?
  Results show that bidirectional streams help a little but are not optimal in performance/cost terms
13
System Analysis (6)
  Prefetches in simulator and hardware
  Stream detectors are very competitive with optimal prefetching
  Prefetching performs much better than no prefetching
  Simulator results were within 10 to 20% of the hardware measurements
14
Conclusion
  No direct comparison to L2 SRAM caches
  Doesn't work well on Linux using standard 4 kB pages
  The simulator provides a somewhat accurate way to test different memory configurations prior to implementing them in hardware
  eDRAM with prefetch is much faster than with no prefetch