CACTI 7: New Tools for Interconnect Exploration in Innovative Off-Chip Memories. Rajeev Balasubramonian, Andrew B. Kahng, Naveen Muralimanohar, Ali Shafiee, Vaishnav Srinivas
Main Memory Matters. Software: in-memory DBs, key-value stores, graph algorithms, deep learning. Architecture: commodity CPUs, accelerators; shift in bottlenecks; example innovations: NDP, DDR to GDDR5 (3x TOPS in TPU). Technology: DDR4, HMC, HBM, NVM. The innovation hub is moving to memory.
Two Silos: (1) CACTI 7 can be used out of the box to define memory parameters for traditional memory systems. (2) CACTI 7 primitives can be leveraged to model and evaluate new memory architectures.
Talk Outline: CACTI for the main memory (inputs/outputs; the nuts and bolts; modeling I/O power; design space exploration). Case studies of two novel architectures: Cascaded Channels and Narrow Channels.
CACTI for Memory: inputs and outputs. Inputs: capacity; number of channels; ECC vs. non-ECC; DRAM type (DDR3, DDR4); access pattern (bandwidth, row-buffer hits, read/write ratio). Internal tables: cost table, bandwidth table, power parameters. An exhaustive search over these produces the outputs: channel configurations and energy per access.
DIMM Cost: the relationship between cost and capacity is not linear. Cost factors: technology, capacity, support for ECC, max bandwidth, vendor. Costs are aggregated from online sources; cost is volatile and should be updated periodically.
Cost in dollars:
               4GB    8GB    16GB   32GB   64GB
DDR3 UDIMM      40     76      -      -      -
DDR3 RDIMM      42     64    122    304      -
DDR3 LRDIMM      -      -    211    287   1079
DDR4 UDIMM      26     46      -      -      -
DDR4 RDIMM      33     60    126    310      -
DDR4 LRDIMM      -      -    279    331   1474
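To make the non-linearity concrete, the sketch below divides each DDR3 price by its capacity. The prices are the ones on this slide; which capacity each price belongs to is an assumption inferred from the flattened table, and the names `DDR3_COST` and `cost_per_gb` are illustrative, not part of CACTI 7.

```python
# DDR3 DIMM prices in dollars, transcribed from the cost table on this slide.
# Assumption: the price-to-capacity alignment is inferred from the table layout.
DDR3_COST = {
    "UDIMM": {4: 40, 8: 76},
    "RDIMM": {4: 42, 8: 64, 16: 122, 32: 304},
    "LRDIMM": {16: 211, 32: 287, 64: 1079},
}


def cost_per_gb(table):
    """Return {dimm_type: {capacity_gb: dollars_per_gb}}."""
    return {
        dimm: {cap: price / cap for cap, price in prices.items()}
        for dimm, prices in table.items()
    }


per_gb = cost_per_gb(DDR3_COST)
for dimm, row in per_gb.items():
    for cap, value in row.items():
        print(f"{dimm:7s} {cap:3d} GB: ${value:.2f}/GB")
```

Under this reading, the price per GB of an RDIMM first falls and then rises again at 32 GB, so a linear cost model would misprice both ends of the range.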
Bandwidth: bandwidth depends on load, voltage, and DIMM type. 1DPC (MHz) DDR3 UDIMM-DR 533 667 RDIMM-DR 800 RDIMM-QR LRDIMM-QR 1.2V DDR4 1066 933
Power Modeling: extending CACTI-I/O. DDR4 and SerDes support added; SerDes parameters taken from the literature for different lengths and speeds. For parallel buses, support for more accurate termination power using HSPICE simulations. Different termination models for each bus type: different frequency and DIMMs per channel; on-DIMM and on-board; different range (short or long).
Interconnect Model API
Power Analysis (DDR3)
Power Analysis (DDR4)
Cost and Bandwidth Analysis: highest possible bandwidth for the demanded capacity; lowest possible cost for the demanded capacity.
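The "lowest possible cost for the demanded capacity" query can be sketched as a brute-force search. This is a minimal illustration, not CACTI 7's implementation. It assumes every DIMM in the system is identical, uses the DDR3 prices from the DIMM Cost slide (with the price-to-capacity alignment inferred), and treats the limits `max_channels=4` and `max_dpc=3` as hypothetical defaults. The bandwidth-maximizing query would need the frequency table as well, which is omitted here.

```python
from itertools import product

# DDR3 DIMM prices in dollars from the DIMM Cost slide.
# Assumption: the price-to-capacity alignment is inferred from the table layout.
DDR3_COST = {
    "UDIMM": {4: 40, 8: 76},
    "RDIMM": {4: 42, 8: 64, 16: 122, 32: 304},
    "LRDIMM": {16: 211, 32: 287, 64: 1079},
}


def cheapest_config(demand_gb, max_channels=4, max_dpc=3):
    """Exhaustively search homogeneous configurations.

    Returns (total_cost, dimm_type, dimm_gb, channels, dimms_per_channel)
    for the cheapest configuration with at least demand_gb of capacity,
    or None if no configuration within the limits is large enough.
    """
    best = None
    for dimm, prices in DDR3_COST.items():
        for (cap, price), channels, dpc in product(
            prices.items(),
            range(1, max_channels + 1),
            range(1, max_dpc + 1),
        ):
            n_dimms = channels * dpc
            if n_dimms * cap < demand_gb:
                continue  # not enough capacity
            candidate = (n_dimms * price, dimm, cap, channels, dpc)
            # Tuple comparison: cost first, remaining fields break ties.
            if best is None or candidate < best:
                best = candidate
    return best


print(cheapest_config(64))
```

Ties on cost are broken by the remaining tuple fields, so among equally priced options the search prefers fewer channels. A real tool would break ties on bandwidth or power instead.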
Two Case Studies. Key observations: high DPC means less bandwidth; more channels mean higher bandwidth and lower cost. New Idea I, Cascaded Segments: each segment has few DIMMs, giving higher bandwidth. New Idea II, Narrow Channels: partition the channel into many parallel channels; fewer DIMMs per data wire and new ECC give higher bandwidth; lower power on DIMM.
Cascaded Channels: a Relay on Board (RoB) chip sits between the CPU and the DIMMs, at the cost of a one-memory-cycle increase in latency. Same DPC, higher BW: 533 MHz in the baseline vs. 667 MHz on each segment with the RoB. Same BW, lower cost: 64 GB and 32 GB DIMMs behind the RoB, with all segments at 667 MHz.
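As a rough check on what the frequency increase buys, the sketch below computes peak channel bandwidth at the two clock rates on this slide. It assumes a 64-bit (8-byte) data bus, two transfers per clock (double data rate), and that the MHz figures are bus clock frequencies. None of these are stated on the slide, and `peak_bandwidth_gbps` is an illustrative helper.

```python
def peak_bandwidth_gbps(bus_mhz, bus_bytes=8, transfers_per_clock=2):
    """Peak bandwidth in GB/s for a DDR channel (assumed 64-bit data bus)."""
    return bus_mhz * 1e6 * transfers_per_clock * bus_bytes / 1e9


baseline = peak_bandwidth_gbps(533)   # channel limited to 533 MHz
cascaded = peak_bandwidth_gbps(667)   # segments running at 667 MHz
gain = cascaded / baseline - 1

print(f"baseline: {baseline:.3f} GB/s")
print(f"cascaded: {cascaded:.3f} GB/s")
print(f"gain:     {gain:.1%}")
```

Under these assumptions the gain is simply the frequency ratio, which appears consistent with the 25% higher bandwidth reported for DDR3 in the results, although the deck does not spell out that derivation.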
Hybrid Memory: NVM is slow, and software is optimized to access DRAM more. With one channel for DRAM and one channel for NVM, channel load is unbalanced. With frontend DRAM and backend NVM, channel load is balanced.
Narrow Channels: higher bandwidth but higher latency. The command/address bus is shared between channels. Lower frequency/power for DRAM chips. ECC on DIMM and CRC for the link to reduce bw.
Methodology: trace-based simulation; traces generated by Simics and fed to USIMM; memory-intensive benchmarks (NPB and SPEC2006); 8-core at 3.2 GHz; L1D = 32KB, L1I = 32KB, L2 = 8MB; power from CACTI 7.
Cascaded Channels: DDR3, 25% higher BW; DDR4, 13% higher BW; 22% higher IPC.
Cascaded Latency
Cascaded Power: DRAM Cartridge. Diagram labels: 533 MHz at 70% utilization; 667 MHz at 70% utilization; 667 MHz at 35% utilization.
            DIMM     BoB     I/O      Total    Power/BW
Baseline    23.2W    5.5W    9.4W     38.1W    7.9 nJ/B
Cascaded    22.6W    6.4W    12.2W    41.2W    6.7 nJ/B
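The trade-off in this table is easier to see as relative changes. The sketch below uses only the numbers on this slide. It checks that the DIMM, BoB, and I/O columns add up to the reported totals, then compares total power and energy per byte. The dictionary keys and the `relative_change` helper are illustrative names.

```python
# Values transcribed from the power table on this slide (watts and nJ/byte).
POWER = {
    "Baseline": {"dimm": 23.2, "bob": 5.5, "io": 9.4, "total": 38.1, "nj_per_byte": 7.9},
    "Cascaded": {"dimm": 22.6, "bob": 6.4, "io": 12.2, "total": 41.2, "nj_per_byte": 6.7},
}


def relative_change(new, old):
    """Fractional change from old to new (negative means a reduction)."""
    return (new - old) / old


# Sanity check: the three components should sum to the reported total.
for name, row in POWER.items():
    component_sum = row["dimm"] + row["bob"] + row["io"]
    assert abs(component_sum - row["total"]) < 1e-9, name

power_change = relative_change(POWER["Cascaded"]["total"], POWER["Baseline"]["total"])
energy_change = relative_change(
    POWER["Cascaded"]["nj_per_byte"], POWER["Baseline"]["nj_per_byte"]
)

print(f"total power:     {power_change:+.1%}")
print(f"energy per byte: {energy_change:+.1%}")
```

Total power rises because of the extra I/O and relay power, but energy per byte falls because the cascaded design moves more data for that power.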
Cascaded Cost
Cascaded Hybrid: Percentage of Load on DRAM
Narrow Channel: Performance. Performance improvement: 2-channel-x36, 18%; 3-channel-x24, 17%.
Narrow Channel: Power. 23% overall memory power reduction.
Conclusion: CACTI 7 models off-chip memories and I/O, with a detailed I/O power model and design space exploration. It analyzes trade-offs among capacity, power, bandwidth, and cost. Two novel architectures: cascaded channels and narrow channels.