Rajeev Balasubramonian

1 CACTI 7: New Tools for Interconnect Exploration in Innovative Off-Chip Memories
Rajeev Balasubramonian, Andrew B. Kahng, Naveen Muralimanohar, Ali Shafiee, Vaishnav Srinivas

2 Main Memory Matters
Software: In-Memory DBs, Key-Value Stores, Graph Algorithms, Deep Learning
Architecture: Commodity CPUs, Accelerators; shift in bottlenecks. Example innovations: NDP; DDR to GDDR5 → 3x TOPS in TPU
Technology: DDR4, HMC, HBM, NVM
The Innovation Hub is Moving to Memory

3 Two Silos
CACTI 7 can be used out-of-the-box when defining memory parameters for traditional memory systems
CACTI 7 primitives can be leveraged to model and evaluate new memory architectures

4 Talk Outline
CACTI for the main memory
  Inputs/outputs
  The nuts and bolts
  Modeling I/O power
  Design space exploration
Case studies: two novel architectures
  Cascaded Channels
  Narrow Channels

5 CACTI for Memory: Inputs and outputs
Inputs: capacity; #channels; ECC vs. not; DRAM type (DDR3, DDR4); access pattern (bandwidth, row buffer hits, Rd/Wr ratio)
Internal components: Cost Table, Bandwidth Table, Power Parameters, Exhaustive Search
Outputs: channel configs, energy per access

6 DIMM Cost
Cost factors: technology, capacity, support for ECC, max bandwidth, vendor
Aggregated costs from online sources
Cost is volatile and should be updated periodically
Cost in dollars:
               4GB    8GB   16GB   32GB   64GB
DDR3 UDIMM      40     76      -      -      -
DDR3 RDIMM      42     64    122    304      -
DDR3 LRDIMM      -      -    211    287   1079
DDR4 UDIMM      26     46      -      -      -
DDR4 RDIMM      33     60    126    310      -
DDR4 LRDIMM      -      -    279    331   1474
Cost and capacity relationship is not linear
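The non-linearity claim on this slide can be checked with a few lines of arithmetic. The sketch below computes dollars per GB from the slide's prices. It assumes that each price belongs to the capacity column implied by the order of the values, and that the three unlabeled DDR4 rows are UDIMM, RDIMM, and LRDIMM in the same order as DDR3; the names `DIMM_COST` and `cost_per_gb` are illustrative and are not part of CACTI 7.

```python
# DIMM prices in dollars, transcribed from the slide's cost table.
# Assumption: the capacity column for each price is inferred from the
# order of the values; the DDR4 row labels mirror the DDR3 rows.
DIMM_COST = {
    ("DDR3", "UDIMM"): {4: 40, 8: 76},
    ("DDR3", "RDIMM"): {4: 42, 8: 64, 16: 122, 32: 304},
    ("DDR3", "LRDIMM"): {16: 211, 32: 287, 64: 1079},
    ("DDR4", "UDIMM"): {4: 26, 8: 46},
    ("DDR4", "RDIMM"): {4: 33, 8: 60, 16: 126, 32: 310},
    ("DDR4", "LRDIMM"): {16: 279, 32: 331, 64: 1474},
}


def cost_per_gb(dram, dimm_type):
    """Return {capacity_gb: dollars per GB} for one DIMM family."""
    prices = DIMM_COST[(dram, dimm_type)]
    return {cap: price / cap for cap, price in sorted(prices.items())}


if __name__ == "__main__":
    for dram, dimm_type in DIMM_COST:
        per_gb = cost_per_gb(dram, dimm_type)
        row = ", ".join(f"{cap}GB: ${v:.2f}/GB" for cap, v in per_gb.items())
        print(f"{dram} {dimm_type}: {row}")
```

If cost scaled linearly with capacity, every entry in a row would show the same dollars-per-GB figure; under the assumed column mapping it does not.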

7 Bandwidth
Bandwidth depends on load, voltage, and DIMM type
Table values, 1DPC (MHz): DDR3 UDIMM-DR 533 667 RDIMM-DR 800 RDIMM-QR LRDIMM-QR 1.2V DDR4 1066 933

8 Power Modeling
Extending CACTI-I/O
DDR4 and SerDes support added
SerDes parameters from literature for different lengths/speeds
For parallel buses, support for more accurate termination power with HSPICE simulations
Different termination models for each bus type
  Different frequency, DIMMs per channel
  On-DIMM and on-board
  Different range (short or long)

9 Interconnect Model API

10 Power Analysis (DDR3)

11 Power Analysis (DDR4)

12 Cost and Bandwidth Analysis
Highest possible BW for the demanded capacity
Lowest possible cost for the demanded capacity
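The two objectives on this slide can be illustrated with a small exhaustive search over channel configurations. This is a minimal sketch under stated assumptions, not CACTI 7's actual implementation: the prices are the DDR3 RDIMM values from the cost slide (capacity columns inferred), the frequency-per-DPC table is a hypothetical placeholder, and peak bandwidth assumes a 64-bit data bus with two transfers per clock. All names (`Config`, `lowest_cost`, `highest_bw`) are illustrative.

```python
from dataclasses import dataclass
from itertools import product

# DDR3 RDIMM prices in dollars per DIMM (capacity in GB -> price),
# taken from the cost slide; the capacity mapping is an assumption.
DIMM_COST = {4: 42, 8: 64, 16: 122, 32: 304}

# HYPOTHETICAL bus clock (MHz) as a function of DIMMs per channel (DPC).
# Placeholder values chosen only to show that higher DPC lowers frequency.
FREQ_MHZ_BY_DPC = {1: 800, 2: 667, 3: 533}

BYTES_PER_TRANSFER = 8  # assumes a 64-bit data bus


@dataclass(frozen=True)
class Config:
    channels: int
    dpc: int
    dimm_gb: int

    @property
    def capacity_gb(self):
        return self.channels * self.dpc * self.dimm_gb

    @property
    def cost(self):
        return self.channels * self.dpc * DIMM_COST[self.dimm_gb]

    @property
    def peak_bw_mb_s(self):
        # Double data rate: two transfers per bus clock, per channel.
        return self.channels * 2 * FREQ_MHZ_BY_DPC[self.dpc] * BYTES_PER_TRANSFER


def feasible(demand_gb, max_channels):
    """Yield every configuration that meets the demanded capacity."""
    for ch, dpc, gb in product(range(1, max_channels + 1), FREQ_MHZ_BY_DPC, DIMM_COST):
        cfg = Config(ch, dpc, gb)
        if cfg.capacity_gb >= demand_gb:
            yield cfg


def lowest_cost(demand_gb, max_channels=4):
    """Cheapest feasible configuration; ties go to higher bandwidth."""
    return min(feasible(demand_gb, max_channels),
               key=lambda c: (c.cost, -c.peak_bw_mb_s), default=None)


def highest_bw(demand_gb, max_channels=4):
    """Fastest feasible configuration; ties go to lower cost."""
    return max(feasible(demand_gb, max_channels),
               key=lambda c: (c.peak_bw_mb_s, -c.cost), default=None)


if __name__ == "__main__":
    print("lowest cost:", lowest_cost(128))
    print("highest BW: ", highest_bw(128))
```

With these placeholder inputs and a 128 GB demand, the two objectives pick different configurations: the cheapest one populates two DIMMs per channel, while the fastest one uses one larger DIMM per channel at a higher price.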

13 Two Case Studies
Key Observations
  High DPC → less BW
  More channels → high BW and low cost
New Idea I: Cascaded Segments
  Each segment has few DIMMs → higher BW
New Idea II: Narrow Channels
  Partition the channel into many parallel channels
  Fewer DIMMs per data wire, new ECC → higher BW
  Lower power on DIMM

14 Cascaded Channels
RoB: Relay on Board chip
Same DPC, higher BW (diagram labels: CPU, RoB, DIMM; 533 MHz; 667 MHz, 667 MHz)
Same BW, lower cost (diagram labels: CPU, RoB; 64 GB; 32 GB; 667 MHz, 667 MHz, 667 MHz)
One memory cycle increase in latency

15 Hybrid Memory
NVM is slow → software optimized to access DRAM more
Unbalanced channel load: one channel DRAM, one channel NVM
Balanced channel load: frontend DRAM, backend NVM

16 Narrow Channels
Command/Address Bus is shared between channels
Higher Bandwidth but Higher Latency
Lower frequency/power for DRAM Chips!
ECC on DIMM and CRC for link to reduce BW

17 Methodology
Trace-based simulation
  Trace fed to USIMM
  Memory-intensive benchmarks (NPB and SPEC2006)
  Trace generated by Simics
8-core at 3.2 GHz
L1D = 32KB, L1I = 32KB, L2 = 8MB
Power: CACTI 7

18 Cascaded Channels
DDR3 DDR4 25% higher BW 13% higher BW 22% higher IPC

19 Cascaded Latency

20 Cascaded Power: DRAM Cartridge
Diagram labels: 533 MHz at 70% utilization; 667 MHz at 70% utilization; 667 MHz at 35% utilization
            DIMM     BoB      I/O      Total    Power/BW
Baseline    23.2W    5.5W     9.4W     38.1W    7.9 nJ/B
Cascaded    22.6W    6.4W     12.2W    41.2W    6.7 nJ/B
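The relationship between the columns of this table can be checked with a short calculation. The sketch below assumes that Power/BW is total power divided by delivered bandwidth, which the slide does not state explicitly; under that reading, watts divided by nJ/B gives GB/s directly. The names `ROWS`, `total_power_w`, and `implied_bandwidth_gb_s` are illustrative.

```python
# Power numbers (watts) and energy per byte (nJ/B) copied from the slide.
ROWS = {
    "Baseline": {"dimm_w": 23.2, "bob_w": 5.5, "io_w": 9.4, "nj_per_byte": 7.9},
    "Cascaded": {"dimm_w": 22.6, "bob_w": 6.4, "io_w": 12.2, "nj_per_byte": 6.7},
}


def total_power_w(row):
    """Sum of the DIMM, BoB, and I/O power components."""
    return row["dimm_w"] + row["bob_w"] + row["io_w"]


def implied_bandwidth_gb_s(row):
    """Bandwidth implied by total power and energy per byte.

    Assumption: Power/BW = total power / delivered bandwidth.
    W / (nJ/B) = 1e9 B/s, so the quotient is already in GB/s.
    """
    return total_power_w(row) / row["nj_per_byte"]


if __name__ == "__main__":
    for name, row in ROWS.items():
        print(f"{name}: total {total_power_w(row):.1f} W, "
              f"implied bandwidth {implied_bandwidth_gb_s(row):.2f} GB/s")
    saving = 1 - ROWS["Cascaded"]["nj_per_byte"] / ROWS["Baseline"]["nj_per_byte"]
    print(f"Energy per byte reduction: {saving:.1%}")
```

The component sums reproduce the Total column, and the lower energy per byte for the cascaded design follows from its total power rising less than its implied bandwidth.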

21 Cascaded Cost

22 Cascaded Hybrid
Figure axis: Percentage of Load on DRAM

23 Narrow Channel: Performance
Performance Improvement:
  2-channel-x36 → 18%
  3-channel-x24 → 17%

24 Narrow Channel: Power
23% overall memory power reduction

25 Conclusion
CACTI 7: models off-chip memories and I/O
  Detailed I/O power model
  Design space exploration
  Analyzes trade-offs: capacity, power, bandwidth, and cost
Two novel architectures
  Cascaded channels
  Narrow channels

