Exploiting eDRAM Bandwidth with Data Prefetching: Simulation and Measurements. Presented by David Wolinsky
Outline: eDRAM; Prefetching; Blue Gene/L Overview; Cache System Overview; System Analysis; Conclusion
eDRAM: Embedded DRAM sits directly on the chip, allowing wider buses and higher operating speeds. Benefits over SRAM: lower cost, greater density (more capacity), and lower power. Latency is hidden by increased bandwidth.
Prefetching (1): Stores data before it is needed. Done via streaming: while a stream is valid, a new line is added for each subsequent data request. Two streaming methods: N-Deep History and Optimistic.
Prefetching (2): N-Deep History data streaming: on an L1 miss, the L2 records the address in the stream detection buffer; if the L2 address already has an entry, a stream is established; subsequent accesses trigger prefetch requests. Optimistic: issues a fetch request for each new L2 cache request that is not already in the prefetch buffer, and prefetches the next line. Streams are maintained until displaced. Lines are fully associative.
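The two streaming methods above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Blue Gene/L logic: the class and function names are hypothetical, the 128 B line size is taken from the cache overview, and the detection rule (remember the line that would follow each miss, and declare a stream when a later miss matches) is one plausible reading of the slide.

```python
from collections import deque

LINE_BYTES = 128  # prefetch line size, taken from the cache overview slide


class NDeepHistoryDetector:
    """Toy N-deep history stream detector (illustrative only)."""

    def __init__(self, depth=16):
        # The oldest entry is dropped automatically once `depth` is reached.
        self.history = deque(maxlen=depth)
        self.prefetched = []  # line numbers for which a prefetch was issued

    def on_l1_miss(self, addr):
        """Return True if this miss establishes or continues a stream."""
        line = addr // LINE_BYTES
        if line in self.history:
            # An earlier miss predicted this line: treat it as a stream
            # and request the line that follows.
            self.history.remove(line)
            self.prefetched.append(line + 1)
            self.history.append(line + 1)
            return True
        # No match yet: remember the line a stream would touch next.
        self.history.append(line + 1)
        return False


def optimistic_prefetch(addr, prefetch_buffer):
    """Optimistic policy: a request that misses the buffer also fetches line + 1.

    `prefetch_buffer` is a set of line numbers (an assumption of this sketch).
    Returns the prefetched line number, or None if the request hit the buffer.
    """
    line = addr // LINE_BYTES
    if line in prefetch_buffer:
        return None
    prefetch_buffer.add(line)
    prefetch_buffer.add(line + 1)
    return line + 1
```

The history method only prefetches after it has seen evidence of a sequential pattern, while the optimistic method prefetches on every new request and so trades extra bandwidth for earlier coverage.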
Blue Gene/L Overview: 65,536 nodes; 2 PPC440 cores per node; system on chip; SIMD FPU on each chip.
Cache System Overview (1): 32 kB L1 cache with 32 B cache lines; private 2 kB prefetch SRAM with 128 B cache lines; 4 MB L3 eDRAM with 128 B cache lines.
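As a quick sanity check on these sizes, the raw line count of each level is its capacity divided by its line size. The sketch below assumes binary units (1 kB = 1024 B), which the slide does not state; the dictionary and function names are hypothetical.

```python
KB = 1024        # assumption: binary kilobytes
MB = 1024 * KB

# (capacity in bytes, line size in bytes) as listed on the slide
levels = {
    "L1": (32 * KB, 32),
    "prefetch SRAM": (2 * KB, 128),
    "L3 eDRAM": (4 * MB, 128),
}


def line_count(capacity_bytes, line_bytes):
    """Number of cache lines that fit in the given capacity."""
    return capacity_bytes // line_bytes


lines = {name: line_count(cap, line) for name, (cap, line) in levels.items()}

# One 128 B prefetch line covers this many 32 B L1 lines.
l1_lines_per_prefetch_line = 128 // 32
```

Under that assumption the L1 holds 1024 lines, the prefetch SRAM holds 16, and the L3 holds 32,768; each 128 B line spans four L1 lines.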
System Analysis (1): Optimal stream detector size. The NAS optimum is 16 and the Splash-2 optimum is 8; the actual count is 16.
System Analysis (2): Optimal number of line buffers. 15 were chosen for the best performance/cost ratio.
System Analysis (3): Replacement policy. LRU is fastest but expensive to implement. RRMRU uses round robin but will not replace the 3 most recently used entries; BG/L uses this policy.
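The RRMRU idea, round robin that never evicts the 3 most recently used entries, can be sketched as follows. This is a minimal illustration, not the BG/L hardware design: the class name, method names, and the list-based bookkeeping are all assumptions made for readability.

```python
class RRMRU:
    """Round robin that skips the most recently used entries (toy sketch)."""

    def __init__(self, num_entries, protected=3):
        if num_entries <= protected:
            raise ValueError("need more entries than protected slots")
        self.num_entries = num_entries
        self.protected = protected
        self.pointer = 0   # next round-robin candidate
        self.recent = []   # most recently used indices, newest last

    def touch(self, index):
        """Record a use of entry `index`."""
        if index in self.recent:
            self.recent.remove(index)
        self.recent.append(index)
        # Keep only the newest `protected` indices.
        while len(self.recent) > self.protected:
            self.recent.pop(0)

    def victim(self):
        """Advance past protected entries and return the entry to replace."""
        while self.pointer in self.recent:
            self.pointer = (self.pointer + 1) % self.num_entries
        chosen = self.pointer
        self.pointer = (self.pointer + 1) % self.num_entries
        return chosen
```

Compared with true LRU, this only needs a pointer and a short list of protected indices instead of a full recency ordering, which is consistent with the slide's point that LRU is expensive to implement.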
System Analysis (4): Optimal stream detector size. The NAS optimum is 16 and the Splash-2 optimum is 8; the actual count is 16.
System Analysis (5): Does it matter whether streams are bidirectional or unidirectional? Results show that bidirectional streams help a little but are not performance/cost optimal.
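The difference between the two options can be shown with a small sketch. It assumes a stream is recognized from two consecutive line numbers, which is a simplification chosen for illustration; the function names are hypothetical.

```python
def stream_direction(previous_line, current_line, bidirectional=True):
    """Return +1 for an ascending stream, -1 for descending, 0 for none."""
    step = current_line - previous_line
    if step == 1:
        return 1
    if step == -1 and bidirectional:
        return -1
    return 0


def next_prefetch_line(previous_line, current_line, bidirectional=True):
    """Line to prefetch next, or None if no stream is recognized."""
    direction = stream_direction(previous_line, current_line, bidirectional)
    return current_line + direction if direction else None
```

For example, `next_prefetch_line(11, 10)` yields line 9 with bidirectional detection, while a unidirectional detector (`bidirectional=False`) returns `None` for the same descending access pattern.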
System Analysis (6): Prefetching in the simulator and in hardware. Stream detectors are very competitive with the optimum. Prefetching performs very well compared with no prefetching. The simulator was within 10 to 20% of the hardware measurements.
Conclusion: No direct comparison to L2 SRAM caches was made. The approach does not work well on Linux with standard 4 kB pages. The simulator provides a somewhat accurate way to test different memory configurations before implementing them in hardware. eDRAM with prefetching is much faster than eDRAM without it.