Published by Lydia Mathews. Modified over 9 years ago.
1
CACTI-IO: CACTI With Off-Chip Power-Area-Timing Models
Norman P. Jouppi¥, Andrew B. Kahng†‡, Naveen Muralimanohar¥, Vaishnav Srinivas†
November 6th, 2012
ECE† and CSE‡ Departments, University of California, San Diego
Hewlett-Packard Laboratories¥, Palo Alto
2
Agenda
Introduction
Need for off-chip power-area-timing models
CACTI-IO models
Case studies using CACTI-IO:
  High-capacity DDR3 configurations
  3-D stacking
  LPDDRx for servers
Summary
3
Memory Subsystem Performance
Latency/Access times: The Memory Wall
  Modern architectures try to hide the latency impact
Capacity: Need for large server main memory
Bandwidth: The Memory Bandwidth Limit
  Latency hiding techniques do not help
  Off-chip limits bandwidth
Source: Rogers et al., Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scaling
9
Memory Subsystem Power
Memory subsystem power a significant portion
  DRAM, Buffers, Caches, Interconnect/IO/PHY
Off-chip IO power is a key component
Source: Economou et al., Full-System Power Analysis and Modeling for Server Environments
16
Off-chip Performance
Memory bandwidth limited by off-chip interface
  Source-synchronous signaling
  Signal/power integrity: ISI, crosstalk, supply noise
  Pincount
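Since pin count and per-pin signaling rate together bound the bandwidth of the off-chip interface, a rough peak-bandwidth estimate can be sketched as below. The function name and the example values (64 data pins, 800 MHz, two transfers per clock) are illustrative assumptions, not figures from the talk.

```python
def peak_bandwidth_gbytes_per_s(num_dq_pins: int,
                                bus_freq_mhz: float,
                                transfers_per_clock: int = 2) -> float:
    """Peak data bandwidth in GB/s for a parallel memory interface.

    transfers_per_clock is 2 for double data rate and 1 for single data rate.
    """
    bits_per_second = num_dq_pins * bus_freq_mhz * 1e6 * transfers_per_clock
    return bits_per_second / 8 / 1e9


# Hypothetical example: 64 DQ pins at 800 MHz, double data rate.
example_gbps = peak_bandwidth_gbytes_per_s(64, 800.0)
print(f"Peak bandwidth: {example_gbps:.1f} GB/s")
```

This is an upper bound only: it ignores the signal-integrity and timing limits listed above, which decide whether a given frequency is reachable at all.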
20
Off-chip Power
Off-chip power is a significant portion of memory subsystem power
  Higher off-chip capacitance and voltages
  Terminations and Vref-biased receivers
  Clocking elements
21
Off-chip PAT Models For Architects
Off-chip models for full-system simulator
Simulators today do not account for IO/PHY power
Accurate off-chip power and performance numbers
Co-optimize off-chip & on-chip power/performance
Explore new off-chip topologies and technologies
22
CACTI-IO
CACTI well known for memory architects
CACTI-IO includes off-chip PAT models
CACTI-IO config file includes off-chip parameters
CACTI-IO Tech Report available

Config file excerpt:

# Memory State (R=Read, W=Write, I=Idle or S=Sleep)
//-iostate "R"
-iostate "W"
//-iostate "I"
//-iostate "S"

# Is ECC Enabled (Y=Yes, N=No)
-dram_ecc "N"

#Address bus timing
//-addr_timing 0.5 //DDR, for LPDDR2 and LPDDR3
-addr_timing 1.0 //SDR for DDR3, Wide-IO
//-addr_timing 2.0 //2T timing
//addr_timing 3.0 // 3T timing

# Bandwidth (Gbytes per second, this is the effective bandwidth)
-bus_bw GBps

# Memory Density (Gbit per memory/DRAM die)
-mem_density 2 Gb

# IO frequency (MHz) (frequency of the external memory interface).
-bus_freq 800 MHz

# Duty Cycle (fraction of time in the Memory State defined above)
-duty_cycle 1.0

# Activity factor for Data (0->1 transitions) per cycle (for DDR, need to account for the higher activity in this parameter. E.g. max. activity factor for DDR is 1.0, for SDR is 0.5)
-activity_dq 1.0

# Activity factor for Control/Address (0->1 transitions) per cycle (for DDR, need to account for the higher activity in this parameter. E.g. max. activity factor for DDR is 1.0, for SDR is 0.5)
-activity_ca 0

# Number of DQ pins
-num_dq 1

# Number of DQS pins
-num_dqs 0 //8 differential pairs

# Number of CA pins
-num_ca 0

# Number of CLK pins
-num_clk 2 //1 differential pair

# Number of Physical Ranks
-num_mem_dq 2 //Number of ranks (loads on DQ and DQS) per DIMM or buffer chip

# Width of the Memory Data Bus
-mem_data_width 1 //x4 or x8 or x16 or x32 memories
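The off-chip parameters in the config excerpt above appear to follow a simple pattern: `-name value` lines, with `#` and `//` marking comments. As a minimal sketch of how such a file could be read into a dictionary for scripting, the helper below assumes exactly that syntax. It is a hypothetical function, not CACTI's own parser, and the sample text is a shortened copy of lines from the slide.

```python
def parse_cacti_io_config(text: str) -> dict:
    """Collect active '-name value' settings, skipping comments.

    Assumed syntax (inferred from the slide, not from CACTI source):
    lines starting with '#' or '//' are comments, and '//' also starts
    a trailing comment on a parameter line.
    """
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("//"):
            continue
        line = line.split("//", 1)[0].strip()  # drop trailing comment
        if not line.startswith("-"):
            continue
        parts = line[1:].split(None, 1)        # name, then rest of line
        value = parts[1].strip() if len(parts) > 1 else ""
        params[parts[0]] = value.strip('"')
    return params


sample = '''
# Memory State (R=Read, W=Write, I=Idle or S=Sleep)
//-iostate "R"
-iostate "W"
# IO frequency (MHz)
-bus_freq 800 MHz
-num_clk 2 //1 differential pair
'''

params = parse_cacti_io_config(sample)
print(params)
```

Values are kept as strings so that units such as `MHz` survive; converting them to numbers is left to the caller.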
23
Agenda
Introduction
Need for off-chip power-area-timing models
CACTI-IO Models
Case Studies using CACTI-IO:
  High-capacity DDR3 configurations
  3-D Stacking
  BOOM: LPDDRx for servers
Summary
24
Dynamic Power
Dynamic Power (switching lumped caps)
Interconnect Power:
$$
\begin{cases}
t_L \, V_{SW} V_{dd} / Z_0 & \text{if } 2 t_L \le t_b \\
t_b \, V_{SW} V_{dd} / Z_0 & \text{if } 2 t_L > t_b
\end{cases}
$$
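The interconnect expression on this slide switches between the line delay tL and the bit period tb depending on whether 2·tL exceeds tb. The sketch below evaluates it, along with the textbook switching-power form for a lumped capacitance. The lumped-cap formula (activity × C × VSW × Vdd × f) is an assumption, since the slide's own equation did not survive in the transcript. Dimensionally the interconnect expression is an energy, so treating it as energy per transition is also an assumption. All names and numeric values are hypothetical.

```python
def lumped_dynamic_power_w(activity: float, c_total_f: float,
                           v_sw: float, v_dd: float, freq_hz: float) -> float:
    """Assumed textbook form: power spent switching a lumped capacitance."""
    return activity * c_total_f * v_sw * v_dd * freq_hz


def interconnect_energy_j(t_l_s: float, t_b_s: float,
                          v_sw: float, v_dd: float, z0_ohm: float) -> float:
    """Piecewise transmission-line term from the slide.

    Uses the line delay t_L when the round trip fits in one bit period
    (2*t_L <= t_b), otherwise the bit period t_b.
    """
    duration_s = t_l_s if 2 * t_l_s <= t_b_s else t_b_s
    return duration_s * v_sw * v_dd / z0_ohm


# Hypothetical values: 0.5 V swing, 1.5 V supply, 50 ohm line, 1.25 ns bit.
short_line = interconnect_energy_j(0.2e-9, 1.25e-9, 0.5, 1.5, 50.0)
long_line = interconnect_energy_j(1.0e-9, 1.25e-9, 0.5, 1.5, 50.0)
print(short_line, long_line)
```

In this sketch, once the round trip exceeds one bit period the term no longer grows with line length.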
25
Termination Power
DQ:
  Few termination types
  READ and WRITE
  Multi rank
  Assume 50% 0's, 1's
  Includes Rx, Tx
CA:
  Fly-by
  VDD/2 termination
26
PHY Power
Reference generators
Vref-biased receivers
Clock distribution
DLL/PLL
Phase Rotators
27
Performance: Eye Compliance
Timing Budget: Tx, Channel, and Rx (setup/hold)
Voltage Budget: Tx (VOL/VOH), Channel, Rx (VIL/VIH)
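One simple way to read the timing budget is as a check that the Tx, channel, and Rx contributions fit inside one bit period. The sketch below assumes the contributions add linearly, which the slide does not state, and every name and number in it is hypothetical.

```python
def timing_margin_ps(bit_period_ps: float, tx_ps: float,
                     channel_ps: float, rx_setup_hold_ps: float) -> float:
    """Remaining margin after subtracting each budget item (assumed additive)."""
    return bit_period_ps - (tx_ps + channel_ps + rx_setup_hold_ps)


def meets_timing_budget(bit_period_ps: float, tx_ps: float,
                        channel_ps: float, rx_setup_hold_ps: float) -> bool:
    """True when the eye still has non-negative timing margin."""
    return timing_margin_ps(bit_period_ps, tx_ps,
                            channel_ps, rx_setup_hold_ps) >= 0


# Hypothetical case: 1600 MT/s gives a 625 ps bit period.
margin = timing_margin_ps(625, tx_ps=150, channel_ps=200, rx_setup_hold_ps=200)
print(margin, meets_timing_budget(625, 150, 200, 200))
```

A voltage budget check would have the same shape, with the Tx output levels, channel loss, and Rx input thresholds taking the place of the timing terms.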
28
Channel Jitter
DOE for topology parameters
Ron/Rtt/Cdram are some of the key parameters
Linear interpolation of Taguchi array
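The slide does not spell out how the interpolation is carried out. As a rough sketch under stated assumptions, the code below interpolates linearly between the two tested levels of each parameter and adds the resulting main effects to a nominal jitter value. The additive main-effects model, the function names, and all numbers (levels, jitter values, units) are hypothetical and not taken from the talk.

```python
def interpolate(x, x_lo, x_hi, y_lo, y_hi):
    """Straight-line interpolation between two tested levels."""
    return y_lo + (y_hi - y_lo) * (x - x_lo) / (x_hi - x_lo)


def predict_jitter_ps(nominal_ps, effects, point):
    """Nominal jitter plus each parameter's interpolated deviation from nominal.

    effects maps a parameter name to (x_lo, x_hi, jitter_at_lo, jitter_at_hi).
    point maps the same names to the values being evaluated.
    """
    total = nominal_ps
    for name, (x_lo, x_hi, y_lo, y_hi) in effects.items():
        total += interpolate(point[name], x_lo, x_hi, y_lo, y_hi) - nominal_ps
    return total


# Hypothetical two-level results (ohms for Ron/Rtt, pF for Cdram, ps jitter).
effects = {
    "ron": (34.0, 48.0, 45.0, 55.0),
    "rtt": (60.0, 120.0, 48.0, 52.0),
    "cdram": (1.0, 2.0, 46.0, 54.0),
}
jitter = predict_jitter_ps(50.0, effects, {"ron": 48.0, "rtt": 60.0, "cdram": 1.5})
print(jitter)
```

The appeal of this kind of lookup is that a handful of simulated corners can stand in for a full channel simulation at every configuration an architect wants to try.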
29
Timing Budget
30
Voltage Budget
31
Area
Driver area depends on RON and RTT
Predriver stages fanout to driver
Fixed area for ESD and controls
32
Validation
CACTI-IO models account for off-chip power, area and timing
Validation against SPICE
Within 15% error across all the simulations
Lookup tables validated by construction
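A check of the "within 15%" kind can be expressed as a relative-error comparison between model output and a SPICE reference. The sketch below shows the arithmetic only. The value pairs are made up for illustration and are not the talk's validation data.

```python
def relative_error(model_value: float, reference_value: float) -> float:
    """Absolute error of the model relative to the reference."""
    return abs(model_value - reference_value) / abs(reference_value)


def within_tolerance(pairs, tolerance: float = 0.15) -> bool:
    """True when every (model, reference) pair is inside the tolerance."""
    return all(relative_error(m, r) <= tolerance for m, r in pairs)


# Hypothetical (model, SPICE) power pairs in mW.
pairs = [(10.5, 10.0), (4.3, 5.0)]
print(within_tolerance(pairs))
```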
33
Power for LPDDR2 DQ Single-Lane
Total IO Power
34
Power for DDR3 DQ Single-Lane
Total IO Power Termination Power
35
Agenda
Introduction
Need for off-chip power-area-timing models
CACTI-IO Models
Case Studies using CACTI-IO:
  High-capacity DDR3 configurations
  3-D Stacking
  BOOM: LPDDRx for servers
Summary
36
Case Studies Using CACTI-IO
We present three case studies:
  High-capacity DDR3 configurations
  3-D configurations
  BOOM (Buffered Output On Module): LPDDRx for servers
Compare the configurations for:
  Capacity
  Bandwidth
  IO Power Efficiency
BOOM case study with IO+DRAM power
40
Case Study 1: High-capacity DDR3
RDIMM, LRDIMM, BoB (Buffer on Board)
BoB uses serial bus to host
LRDIMM offers highest capacity
BoB offers best bandwidth and power efficiency per GB of capacity
41
Case Study 2: 3-D Stacking
TSS based
Peak bandwidth of 176 GB/s for Micron's Hybrid Memory Cube (HMC)
Power efficiency varies by around 2X
Source: Micron
42
BOOM: LPDDRx for servers
BOOM (Buffered Output On Module) architecture from Hewlett-Packard:
  Buffer chip on the board
  LPDDRx memories (lower speed, power)
  Wider bus from the buffer to the DRAMs
Achieves better power efficiency using LPDDRx memories
Still meets performance using buffer
43
BOOM Topology
44
Case Study 3: BOOM
50% increase in IO efficiency with LPDDRx
No terminations with wider, slower buses
Serial bus from the buffer offers more savings
45
BOOM: IO+DRAM Power
46
BOOM: IO+DRAM Power
IO power a significant portion of the combined power (DRAM+IO): 50-60%
IO Idle power a very significant contributor
LPDDR2 unterminated signaling reduces idle power
BOOM-N4-L-400 w/ serial bus to host provides a 3.4X energy savings (DRAM+IO) over the BOOM-N2-D-800
Combining IO+DRAM allows for correct optimizations
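The two headline quantities on this slide, the IO share of combined power and the energy savings factor between two configurations, are simple ratios. The sketch below shows how they would be computed. The inputs are placeholder numbers chosen for illustration and are not the measured BOOM results.

```python
def io_share(io_power_w: float, dram_power_w: float) -> float:
    """Fraction of combined (DRAM + IO) power spent in IO."""
    return io_power_w / (io_power_w + dram_power_w)


def energy_savings_factor(baseline_j: float, candidate_j: float) -> float:
    """How many times less energy the candidate uses than the baseline."""
    return baseline_j / candidate_j


# Placeholder inputs, not data from the talk.
share = io_share(io_power_w=5.5, dram_power_w=4.5)
factor = energy_savings_factor(baseline_j=8.0, candidate_j=4.0)
print(f"IO share: {share:.0%}, savings: {factor:.1f}X")
```

Computing both ratios from the combined DRAM+IO figures, rather than from IO alone, matches the slide's point that the optimization should be judged on the combined total.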
47
Optimizing Fanout
IO power vs. number of ranks while capacity and bandwidth are constant
Slower and wider provides better power
Die area and clock distribution go up as bus gets wider, so MHz seems like a sweet spot
48
Agenda
Introduction
Need for off-chip power-area-timing models
CACTI-IO Models
Case Studies using CACTI-IO:
  High-capacity DDR3 configurations
  3-D Stacking
  BOOM: LPDDRx for servers
Summary
49
Summary
Introduced CACTI-IO with off-chip models
CACTI-IO models include:
  IO/Interconnect dynamic and termination power
  PHY power
  Voltage/Timing budgets for eye compliance
  IO area
3 case studies show the capabilities of CACTI-IO:
  Calculate off-chip power/area/timing
  Combine on-chip and off-chip power
  Identify key configuration choices and optimizations
Ongoing work: Extend the models to other types of off-chip memory and off-chip configurations, including PCRAM
50
Thank You!