Massed Refresh: An Energy-Efficient Technique to Reduce Refresh Overhead in Hybrid Memory Cube Architectures
Ishan Thakkar, Sudeep Pasricha
Department of Electrical and Computer Engineering, Colorado State University, Fort Collins, CO, U.S.A.
VLSID 2016, Kolkata, India, January 4-8, 2016
Outline 1: Introduction, Background on DRAM Structure and Refresh Operation, Related Work, Contributions, Evaluation Setup, Evaluation Results, Conclusion
Introduction 3
Main memory is DRAM (Dynamic Random Access Memory), a critical component of all computing systems: server, desktop, mobile, embedded, sensor.
DRAM stores data in a cell capacitor: a fully charged cell capacitor → logic '1'; a fully discharged cell capacitor → logic '0'.
A DRAM cell loses its data over time because the cell capacitor leaks charge. For temperatures below 85°C, a DRAM cell loses its data within 64 ms; at higher temperatures it loses data even faster.
[Figure: DRAM cell — word line, bit line, cell capacitor, access transistor]
To preserve data integrity, the charge on each DRAM cell (cell capacitor) must be periodically restored, i.e., refreshed.
Outline 4: Introduction, Background on DRAM Structure and Refresh Operation, Related Work, Contributions, Evaluation Setup, Evaluation Results, Conclusion
Background on DRAM Structure 5
Based on their structure, DRAMs are classified into two categories:
1. 2D DRAMs: planar, single-layer DRAMs
2. 3D-Stacked DRAMs: multiple 2D DRAM layers stacked on one another using TSVs (Through-Silicon Vias)
[Figure: 2D DRAM structure hierarchy — Rank → Chip → Bank → Subarray → Bitcell]
2D DRAM: Rank and Chip Structure 6
[Figure: a DRAM rank composed of multiple DRAM chips; each chip contains multiple banks (4 in the figure) behind an output mux]
A 2D DRAM rank: multiple chips work in tandem.
3D-Stacked DRAM Structure 7
[Figure: Hybrid Memory Cube — DRAM layers stacked on a logic layer; HMC structure hierarchy: Vault → Bank → Subarray → Bitcell]
In this paper, we consider the Hybrid Memory Cube (HMC), a standard for 3D-Stacked DRAMs defined by an industry consortium.
DRAM Bank Structure 8
[Figure: bank core (rows × columns of subarrays with sense amplifiers) and bank peripherals — row address decoder, row buffer, column address decoder, column mux, data bits]
3D-Stacked and 2D DRAMs have similar bank structures.
DRAM Subarray Structure 9
[Figure: subarray — DRAM cells (cell capacitor + access transistor) at word-line/bit-line crossings, with sense amplifiers and the row-address input]
3D-Stacked and 2D DRAMs have similar subarray structures.
Basic DRAM Operations 10-16
[Figure: bank with global row decoder, subarray decoders (=ID? / EN), sense amplifiers, global address latch, global row buffer, column mux, and column address decoder]
PRECHARGE: all bitlines of the bank are pre-charged to 0.5 VDD.
ACTIVATION: the target row (row 4 of subarray 1 in the example) is opened, and its contents are captured by the sense amplifiers (SAs).
The SAs drive each bitline fully to VDD or 0 V, which restores the open row.
The open row is then stored in the global row buffer.
READ: the target data block (column 1 in the example) is selected and multiplexed out of the row buffer.
A PRECHARGE-ACTIVATION pair thus restores (refreshes) the target row; refresh is performed as dummy PRECHARGE-ACTIVATION operations on the rows.
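To make the PRECHARGE → ACTIVATION → READ sequence concrete, here is a minimal Python sketch of a single-bank state machine that enforces that ordering. The timing parameters (T_RP, T_RCD, T_CL) and their values are illustrative assumptions for this sketch, not figures from the slides.

```python
# Minimal sketch of the PRECHARGE -> ACTIVATE -> READ sequence described above.
# Timing values are illustrative assumptions (roughly DDR3-like), not from the slides.
T_RP, T_RCD, T_CL = 14, 14, 14   # precharge, activate, read latencies in ns (assumed)

class Bank:
    def __init__(self):
        self.open_row = None     # row currently latched in the row buffer
        self.time = 0            # running time in ns

    def precharge(self):
        # Bitlines are precharged to 0.5*VDD; the open row (if any) is closed.
        self.open_row = None
        self.time += T_RP

    def activate(self, row):
        # Sense amplifiers capture and restore the row, which is then latched
        # into the row buffer; this restore is also what a (dummy) refresh does.
        assert self.open_row is None, "must PRECHARGE before opening another row"
        self.open_row = row
        self.time += T_RCD

    def read(self, column):
        # The column mux selects the requested block from the row buffer.
        assert self.open_row is not None, "no open row to read from"
        self.time += T_CL
        return (self.open_row, column)

bank = Bank()
bank.precharge()
bank.activate(4)                  # open row 4 (subarray 1 in the slide's example)
print(bank.read(1), bank.time)    # -> (4, 1) 42  : data from row 4, column 1
```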
Refresh: 2D vs. 3D-Stacked DRAMs 17
3D-Stacked DRAMs have:
- Higher capacity/density → more rows need to be refreshed
- Higher power density → higher operating temperature (>85°C) → smaller retention period (time before DRAM cells lose data) of 32 ms, versus 64 ms for 2D DRAMs
Thus, the refresh problem is more critical for 3D-Stacked DRAMs. Therefore, in this study we target a standardized 3D-Stacked DRAM architecture: the HMC.
HMC Refresh: dummy ACTIVATION-PRECHARGE operations are performed on all rows every retention cycle (32 ms). To prevent long pauses, a JEDEC-standardized Distributed Refresh method is used.
Background: Refresh Operation 18
[Timing diagram: example Distributed Refresh operation on a 1 Gb HMC vault — refresh bundles RB1...RB8192 are issued one per tREFI = 3.9 µs across a retention cycle of 32 ms; each refresh bundle (RB) contains 16 rows, and each row takes tRC to refresh]
tREFI: Refresh Interval; tRFC: Refresh Cycle Time (time taken to refresh an entire RB); tRC: Row Cycle Time.
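The tREFI value on this slide follows directly from the retention cycle and the number of refresh bundles. The sketch below reproduces that arithmetic; the 1 KB row size used to derive the total row count is an assumption, not stated on the slide.

```python
# Distributed-refresh arithmetic for the 1 Gb HMC vault example above.
vault_bits   = 1 * 2**30               # 1 Gb vault
row_bits     = 1024 * 8                # assumed 1 KB rows
rows         = vault_bits // row_bits  # 131072 rows in the vault
rb_size      = 16                      # rows per refresh bundle (RB), from the slide
bundles      = rows // rb_size         # 8192 refresh bundles (RB1..RB8192)
retention_ms = 32                      # retention cycle of the hot 3D stack

tREFI_us = retention_ms * 1000 / bundles
print(rows, bundles, round(tREFI_us, 2))   # 131072 8192 3.91  (~3.9 us, as on the slide)
```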
Performance Overhead of Distributed Refresh 19
[Chart; source: J. Liu et al., ISCA 2012]
The performance overhead of refresh increases with device capacity.
Energy Overhead of Distributed Refresh 20-21
[Chart; source: J. Liu et al., ISCA 2012]
The energy overhead of refresh also increases with device capacity.
Refresh is a growing problem that must be addressed to realize low-latency, low-energy DRAMs.
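A simple way to see why the overhead grows with capacity: the fraction of time a device is unavailable because of refresh is roughly tRFC / tREFI, and tRFC grows with the number of rows covered by each refresh command. The numbers below are illustrative DDR3-like assumptions used only to show the trend from the charts above, not values from this paper.

```python
# First-order refresh-overhead model: time fraction spent refreshing ~ tRFC / tREFI.
# tRFC values are illustrative DDR3-like assumptions; tREFI is the common 7.8 us.
tREFI_ns = 7800
tRFC_ns  = {"1Gb": 110, "2Gb": 160, "4Gb": 260, "8Gb": 350}  # assumed, grows with capacity

for density, trfc in tRFC_ns.items():
    print(f"{density}: ~{trfc / tREFI_ns:.1%} of time unavailable due to refresh")
# The trend (not the exact numbers) is the point: larger devices refresh more
# rows per command, so tRFC and hence the refresh overhead keep growing.
```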
Outline 22: Introduction, Background on DRAM Structure and Refresh Operation, Related Work, Contributions, Evaluation Setup, Evaluation Results, Conclusion
Related Work 23
We improve upon Scattered Refresh; Scattered Refresh, in turn, improves upon Per-bank Refresh and All-bank Refresh.
All-Bank Refresh vs. Per-Bank Refresh 24
Distributed Refresh can be implemented at two different granularities.
All-bank Refresh: all banks are refreshed simultaneously, and no bank is allowed to serve any request until refresh is complete. Supported by all general-purpose DDRx DRAMs. DRAM operation is completely stalled → the number of available banks (#AB) is zero. Exploits bank-level parallelism (BLP) for refreshing → smaller tRFC.
Per-bank Refresh: only one bank is refreshed at a time, so all other banks can serve other requests. Supported by LPDDRx DRAMs. #AB > 0, but no BLP → larger tRFC.
tRFC: Refresh Cycle Time.
All-Bank Refresh vs. Per-Bank Refresh 25
[Timing diagrams: dummy ACTIVATION-PRECHARGE operations issued for a refresh command under All-bank and Per-bank Refresh; legend: L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID; tRC: Row Cycle Time, tRFC: Refresh Cycle Time]
All-bank Refresh: smaller tRFC, but #AB = 0 → DRAM operation is completely stalled.
Per-bank Refresh: #AB > 0, but no BLP → larger tRFC.
Both All-bank Refresh and Per-bank Refresh have drawbacks, and both can be improved.
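As a rough first-order model (an assumption of this write-up, not the slides' exact timing): per-bank refresh cycles a bundle's rows back-to-back inside one bank, while all-bank refresh spreads the same rows across banks that work in parallel. The tRAS/tRP values and the 8-banks-per-vault figure below are illustrative assumptions.

```python
# Toy tRFC model (illustrative assumptions, not the paper's exact equations).
# tRC = tRAS + tRP is the time to ACTIVATE + PRECHARGE (i.e., refresh) one row.
tRAS, tRP = 35, 14          # ns, illustrative DDR3-like values
tRC = tRAS + tRP

def ceil_div(a, b):
    return -(-a // b)

def trfc_per_bank(n_rows):
    # All n rows live in one bank (same subarray): strictly serial row cycles.
    return n_rows * tRC

def trfc_all_bank(n_rows, n_banks=8):
    # Rows are spread over banks that refresh in parallel (BLP), so only
    # ceil(n / n_banks) serial row cycles remain per bank.
    return ceil_div(n_rows, n_banks) * tRC

n = 16                       # refresh-bundle size from the 1 Gb vault example
print(trfc_per_bank(n), trfc_all_bank(n))   # 784 vs 98 ns in this toy model
```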
Scattered Refresh 26
[Timing diagram: example Scattered Refresh operation on an HMC vault with a refresh bundle size of 4; source: T. Kalyan et al., ISCA 2012; legend: L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID]
Improves upon Per-bank Refresh by using subarray-level parallelism (SLP) for refresh: each row of the RB is mapped to a different subarray. SLP gives the opportunity to overlap a PRECHARGE with the next ACTIVATE, which reduces tRFC.
How does Scattered Refresh compare to Per-bank Refresh and All-bank Refresh?
Scattered Refresh – tRFC Comparison 27
[Timing diagrams: example Scattered Refresh operation on an HMC vault with a refresh bundle size of 4, compared against All-bank and Per-bank Refresh]
tRFC for All-bank Refresh < tRFC for Scattered Refresh < tRFC for Per-bank Refresh → there is still room for improvement over Scattered Refresh.
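Continuing the same toy timing model (an assumption for illustration, not the paper's equations): because each row of the bundle sits in a different subarray, the PRECHARGE of one row can overlap the ACTIVATE of the next, so only the restore times accumulate plus one trailing precharge.

```python
# Toy tRFC model for Scattered Refresh (same illustrative tRAS/tRP as before).
tRAS, tRP = 35, 14
tRC = tRAS + tRP

def trfc_scattered(n_rows):
    # Rows are in different subarrays of ONE bank, so the PRECHARGE of row i
    # overlaps the ACTIVATE of row i+1; only the last precharge is exposed.
    return n_rows * tRAS + tRP

n = 16
print(n * tRC, trfc_scattered(n))   # per-bank 784 ns vs scattered 574 ns
# Consistent with the slide's ordering in this toy model:
# tRFC(all-bank) < tRFC(scattered) < tRFC(per-bank).
```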
Outline 28: Introduction, Background on DRAM Structure and Refresh Operation, Related Work, Contributions, Evaluation Setup, Evaluation Results, Conclusion
Contributions 29
Crammed Refresh (Per-bank Refresh + All-bank Refresh): 2 banks are refreshed in parallel, instead of 1 bank as in Per-bank Refresh or all banks as in All-bank Refresh.
Massed Refresh (Crammed Refresh + Scattered Refresh): 2 banks are refreshed in parallel, and SLP is used within both banks being refreshed.
#AB: number of banks available to serve other requests while the remaining banks are being refreshed; BLP: bank-level parallelism; SLP: subarray-level parallelism.
Only 2 banks are refreshed in parallel as a proof of concept; more than 2 banks can also be chosen. The idea is to keep a balance between #AB and the BLP used for refresh (see the layout sketch below).
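The layout sketch below shows where the rows of one refresh bundle could land under each scheme, following the slides' descriptions; the bank and subarray indices are purely illustrative assumptions, only the grouping pattern matters.

```python
# Sketch: (row, bank, subarray) targets for one refresh bundle under each scheme.
def layout(scheme, bundle_size=4, banks_in_parallel=2):
    targets = []
    for i in range(bundle_size):
        if scheme == "per-bank":      # all rows in one bank, one subarray
            bank, sub = 0, 0
        elif scheme == "scattered":   # one bank, a different subarray per row (SLP)
            bank, sub = 0, i
        elif scheme == "crammed":     # rows split across two banks (BLP), no SLP
            bank, sub = i % banks_in_parallel, 0
        else:                         # "massed": BLP across banks + SLP inside each
            bank, sub = i % banks_in_parallel, i // banks_in_parallel
        targets.append((i, bank, sub))
    return targets

for s in ("per-bank", "scattered", "crammed", "massed"):
    print(s, layout(s))
```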
Crammed Refresh – tRFC Timing 30
[Timing diagrams: example Crammed Refresh operation on an HMC vault with a refresh bundle size of 4, compared against Per-bank and Scattered Refresh; legend: L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID]
Bank-level parallelism (BLP) for refresh: only 2 banks are refreshed in parallel, so #AB > 0.
tRFC for Crammed Refresh < tRFC for Scattered Refresh.
Massed Refresh – tRFC Timing 31
[Timing diagrams: example Massed Refresh operation on an HMC vault with a refresh bundle size of 4, compared against Per-bank and Crammed Refresh; legend: L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID]
Bank-level parallelism (BLP) + subarray-level parallelism (SLP) for refresh.
tRFC for Massed Refresh < tRFC for Crammed Refresh.
How can BLP and SLP be implemented together?
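Extending the same toy timing model (illustrative assumptions throughout): crammed refresh halves the serial row count by refreshing two banks at once, and massed refresh additionally hides precharges inside each bank via SLP.

```python
# Toy tRFC model for Crammed and Massed Refresh (same illustrative tRAS/tRP as before).
tRAS, tRP = 35, 14
tRC = tRAS + tRP
n = 16                                   # refresh-bundle size in the vault example

def ceil_div(a, b):
    return -(-a // b)

def trfc_crammed(n_rows, banks=2):
    # Two banks refresh in parallel (BLP); rows within each bank stay serial.
    return ceil_div(n_rows, banks) * tRC

def trfc_massed(n_rows, banks=2):
    # BLP across two banks plus SLP inside each bank: precharges overlap the
    # next activate, so only one trailing tRP is exposed.
    return ceil_div(n_rows, banks) * tRAS + tRP

trfc_scattered = n * tRAS + tRP          # one bank, SLP only (from the earlier sketch)
print(trfc_scattered, trfc_crammed(n), trfc_massed(n))   # 574 392 294 (ns)
# Reproduces the slides' ordering: tRFC(massed) < tRFC(crammed) < tRFC(scattered).
```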
Subarray-level Parallelism (SLP) 32
[Figure; source: Y. Kim et al., ISCA 2012 — a single global row-address latch versus per-subarray row-address latches]
A single global row-address latch hinders SLP; per-subarray row-address latches enable it.
Bank-level Parallelism (BLP) 33
[Figure: vault controller on the Logic Base (LoB), connected to banks in memory dies 1-4 through TSV launch pads. The refresh controller contains a 17-bit address counter split into LayerAddr[2], RowAddr[14], and BankAddr[1], plus a refresh scheduler, address calculator, control logic, physical address decoder, and row-address latch; the LayerID (LID) and BankID (BID) pass through a Mask/EN stage]
BLP is implemented by masking the BankID during refresh.
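A sketch of the refresh-address handling the slide describes: a 17-bit counter split into layer, row, and bank fields, with the BankID bit masked during refresh so the same layer/row address is driven to two banks at once. The exact bit ordering and decode details are assumptions for illustration only.

```python
# Sketch of the 17-bit refresh address counter: LayerAddr[2] | RowAddr[14] | BankAddr[1].
# Field ordering and decode details are assumptions, not the paper's exact hardware.
LAYER_BITS, ROW_BITS, BANK_BITS = 2, 14, 1

def decode(counter):
    bank  =  counter                             & ((1 << BANK_BITS)  - 1)
    row   = (counter >> BANK_BITS)               & ((1 << ROW_BITS)   - 1)
    layer = (counter >> (BANK_BITS + ROW_BITS))  & ((1 << LAYER_BITS) - 1)
    return layer, bank, row

def refresh_targets(counter, mask_bank_id=True):
    layer, bank, row = decode(counter)
    if mask_bank_id:
        # Masking BankID: the same layer/row address is applied to BOTH banks
        # covered by the masked bit, giving bank-level parallelism for refresh.
        return [(layer, b, row) for b in range(1 << BANK_BITS)]
    return [(layer, bank, row)]

print(refresh_targets(0b01_00000000000011_0))
# -> [(1, 0, 3), (1, 1, 3)]: layer 1, row 3, refreshed in banks 0 and 1 together.
```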
Outline 34: Introduction, Background on DRAM Structure and Refresh Operation, Related Work, Contributions, Evaluation Setup, Evaluation Results, Conclusion
Evaluation Setup 35
Trace-driven simulation for PARSEC benchmarks: memory access traces were extracted from detailed cycle-accurate simulations using gem5, and then provided as inputs to the DRAM simulator DRAMSim2.
Energy, timing, and area analysis: CACTI-3DD based simulation, using a 4 Gb HMC quad model.
DRAMSim2 configuration: DRAMSim2 was configured using the CACTI-3DD results.
Outline 36: Introduction, Background on DRAM Structure and Refresh Operation, Related Work, Motivation, Massed Refresh Technique, Evaluation Setup, Evaluation Results, Conclusion
Results I – Energy, Timing, Area 37
Results II – Throughput 38
[Chart: throughput across PARSEC benchmarks]
Crammed Refresh achieves 7.1% and 2.9% higher throughput on average than distributed per-bank refresh and scattered refresh, respectively.
Massed Refresh achieves 8.4% and 4.3% higher throughput on average than distributed per-bank refresh and scattered refresh, respectively.
Results III – Energy Delay Product (EDP) 39
[Chart: EDP across PARSEC benchmarks]
Crammed Refresh achieves 6.4% and 2.7% lower EDP on average than distributed per-bank refresh and scattered refresh, respectively.
Massed Refresh achieves 7.5% and 3.9% lower EDP on average than distributed per-bank refresh and scattered refresh, respectively.
Outline 40: Introduction, Background on DRAM Structure and Refresh Operation, Related Work, Motivation, Massed Refresh Technique, Evaluation Setup, Evaluation Results, Conclusion
Conclusions 41
The proposed Massed Refresh technique exploits bank-level as well as subarray-level parallelism during refresh operations.
The proposed Crammed Refresh and Massed Refresh techniques improve the throughput and energy-efficiency of DRAM.
Crammed Refresh improves upon the state of the art: 7.1% and 6.4% improvements in throughput and EDP over distributed per-bank refresh, and 2.9% and 2.7% improvements in throughput and EDP over the scattered refresh scheme, respectively.
Massed Refresh improves upon the state of the art: 8.4% and 7.5% improvements in throughput and EDP over distributed per-bank refresh, and 4.3% and 3.9% improvements in throughput and EDP over the scattered refresh scheme, respectively.
Questions / Comments? Thank You 42