Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access Niladrish Chatterjee Manjunath Shevgoor Rajeev Balasubramonian Al Davis Zhen Fang ‡† Ramesh Illikkal* Ravi Iyer* University of Utah, NVidia ‡ and Intel Labs* † Work done while at Intel
Memory Bottleneck DRAM power as high as 25% of total datacenter power Low-Power DRAM in place of DDR3. – BOOM from HP Labs – Energy Proportional Memory from Stanford 2 CPU DDR3 CPU LPDDR BASELINE Low Power Memory
Latency Wall Memory latency wall not going away – Emerging scale-out workloads e.g. Cloudsuite – Move towards energy-efficient in-order cores Reduced Latency DRAM offers very low latency – Row-cycle time (tRC) of 8-12ns (DDR3 tRC = 48.75ns, LPDDR2 tRC = 60ns) 3
No one memory works best 4
Heterogeneous Memory 5 PERFORMANCE OPTIMIZED DRAM CPU Combine high-performance and low-power dram to outperform DDR3 at a lower energy cost Large number of possible designs – Different DRAM device combinations – Channel Organization – Data Placement Granularity POWER OPTIMIZED DRAM
Critical Word Regularity 6 Most DRAM requests are for word-0 of the cache-line Frequency of accesses to individual words of a cache-line
Critical Word Acceleration 7 CPU LPDDR RLDRAM LPDDR RLDRAM Word 0 Words Critical Word fetched from RLDRAM to boost performance Rest of the cache-line placed & retrieved from LPDRAM for energy efficiency.
Results 8 Throughput improved by 12.9% System energy improved by 6%