Massed Refresh: An Energy-Efficient Technique to Reduce Refresh Overhead in Hybrid Memory Cube Architectures
Ishan Thakkar, Sudeep Pasricha
Department of Electrical and Computer Engineering, Colorado State University, Fort Collins, CO, U.S.A.
VLSID 2016, Kolkata, India, January 4-8, 2016
Outline 1: Introduction, Background on DRAM Structure and Refresh Operation, Related Work, Contributions, Evaluation Setup, Evaluation Results, Conclusion
Introduction 3
Main memory is DRAM (Dynamic Random Access Memory), a critical component of all computing systems: server, desktop, mobile, embedded, sensor.
DRAM stores data in a cell capacitor: a fully charged cell capacitor → logic '1'; a fully discharged cell capacitor → logic '0'.
A DRAM cell loses its data over time because the cell capacitor leaks charge. For temperatures below 85°C, a DRAM cell loses its data within 64 ms; at higher temperatures it loses data even faster.
[Figure: DRAM cell — word line, bit line, cell capacitor, access transistor]
To preserve data integrity, the charge on each DRAM cell (cell capacitor) must be periodically restored, i.e., refreshed.
Outline 4: Introduction, Background on DRAM Structure and Refresh Operation, Related Work, Contributions, Evaluation Setup, Evaluation Results, Conclusion
Background on DRAM Structure 5
Based on their structure, DRAMs are classified into two categories:
1. 2D DRAMs: planar, single-layer DRAMs
2. 3D-Stacked DRAMs: multiple 2D DRAM layers stacked on one another using TSVs (Through-Silicon Vias)
[Figure: 2D DRAM structure hierarchy — Rank → Chip → Bank → Subarray → Bitcell]
2D DRAM: Rank and Chip Structure 6
[Figure: a DRAM rank composed of multiple DRAM chips; each chip contains multiple banks (4 in the figure) behind an output mux]
A 2D DRAM rank: multiple chips work in tandem.
3D-Stacked DRAM Structure 7
[Figure: Hybrid Memory Cube — DRAM layers stacked on a logic layer; HMC structure hierarchy: Vault → Bank → Subarray → Bitcell]
In this paper, we consider the Hybrid Memory Cube (HMC), a standard for 3D-Stacked DRAMs defined by an industry consortium.
DRAM Bank Structure 8
[Figure: bank core (rows × columns of subarrays with sense amplifiers) and bank peripherals — row address decoder, row buffer, column address decoder, column mux, data bits]
3D-Stacked and 2D DRAMs have similar bank structures.
DRAM Subarray Structure 9
[Figure: subarray — DRAM cells (cell capacitor + access transistor) at word-line/bit-line crossings, with sense amplifiers and the row-address input]
3D-Stacked and 2D DRAMs have similar subarray structures.
Basic DRAM Operations 10-16
[Figure: bank with global row decoder, subarray decoders (=ID? / EN), sense amplifiers, global address latch, global row buffer, column mux, and column address decoder]
PRECHARGE: all bitlines of the bank are pre-charged to 0.5 VDD.
ACTIVATION: the target row (row 4 of subarray 1 in the example) is opened, and its contents are captured by the sense amplifiers (SAs).
The SAs drive each bitline fully to VDD or 0 V, which restores the open row.
The open row is then stored in the global row buffer.
READ: the target data block (column 1 in the example) is selected and multiplexed out of the row buffer.
A PRECHARGE-ACTIVATION pair thus restores (refreshes) the target row; refresh is performed as dummy PRECHARGE-ACTIVATION operations on the rows.
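To make the PRECHARGE → ACTIVATION → READ sequence concrete, here is a minimal Python sketch of a single-bank state machine that enforces that ordering. The timing parameters (T_RP, T_RCD, T_CL) and their values are illustrative assumptions for this sketch, not figures from the slides.

```python
# Minimal sketch of the PRECHARGE -> ACTIVATE -> READ sequence described above.
# Timing values are illustrative assumptions (roughly DDR3-like), not from the slides.
T_RP, T_RCD, T_CL = 14, 14, 14   # precharge, activate, read latencies in ns (assumed)

class Bank:
    def __init__(self):
        self.open_row = None     # row currently latched in the row buffer
        self.time = 0            # running time in ns

    def precharge(self):
        # Bitlines are precharged to 0.5*VDD; the open row (if any) is closed.
        self.open_row = None
        self.time += T_RP

    def activate(self, row):
        # Sense amplifiers capture and restore the row, which is then latched
        # into the row buffer; this restore is also what a (dummy) refresh does.
        assert self.open_row is None, "must PRECHARGE before opening another row"
        self.open_row = row
        self.time += T_RCD

    def read(self, column):
        # The column mux selects the requested block from the row buffer.
        assert self.open_row is not None, "no open row to read from"
        self.time += T_CL
        return (self.open_row, column)

bank = Bank()
bank.precharge()
bank.activate(4)                  # open row 4 (subarray 1 in the slide's example)
print(bank.read(1), bank.time)    # -> (4, 1) 42  : data from row 4, column 1
```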
Refresh: 2D vs. 3D-Stacked DRAMs 17
3D-Stacked DRAMs have:
- Higher capacity/density → more rows need to be refreshed
- Higher power density → higher operating temperature (>85°C) → smaller retention period (time before DRAM cells lose data) of 32 ms, versus 64 ms for 2D DRAMs
Thus, the refresh problem is more critical for 3D-Stacked DRAMs. Therefore, in this study we target a standardized 3D-Stacked DRAM architecture: the HMC.
HMC Refresh: dummy ACTIVATION-PRECHARGE operations are performed on all rows every retention cycle (32 ms). To prevent long pauses, a JEDEC-standardized Distributed Refresh method is used.
Background: Refresh Operation 18
[Timing diagram: example Distributed Refresh operation on a 1 Gb HMC vault — refresh bundles RB1...RB8192 are issued one per tREFI = 3.9 µs across a retention cycle of 32 ms; each refresh bundle (RB) contains 16 rows, and each row takes tRC to refresh]
tREFI: Refresh Interval; tRFC: Refresh Cycle Time (time taken to refresh an entire RB); tRC: Row Cycle Time.
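The tREFI value on this slide follows directly from the retention cycle and the number of refresh bundles. The sketch below reproduces that arithmetic; the 1 KB row size used to derive the total row count is an assumption, not stated on the slide.

```python
# Distributed-refresh arithmetic for the 1 Gb HMC vault example above.
vault_bits   = 1 * 2**30               # 1 Gb vault
row_bits     = 1024 * 8                # assumed 1 KB rows
rows         = vault_bits // row_bits  # 131072 rows in the vault
rb_size      = 16                      # rows per refresh bundle (RB), from the slide
bundles      = rows // rb_size         # 8192 refresh bundles (RB1..RB8192)
retention_ms = 32                      # retention cycle of the hot 3D stack

tREFI_us = retention_ms * 1000 / bundles
print(rows, bundles, round(tREFI_us, 2))   # 131072 8192 3.91  (~3.9 us, as on the slide)
```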
Performance Overhead of Distributed Refresh 19
[Chart; source: J. Liu et al., ISCA 2012]
The performance overhead of refresh increases with device capacity.
Energy Overhead of Distributed Refresh 20-21
[Chart; source: J. Liu et al., ISCA 2012]
The energy overhead of refresh also increases with device capacity.
Refresh is a growing problem that must be addressed to realize low-latency, low-energy DRAMs.
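A simple way to see why the overhead grows with capacity: the fraction of time a device is unavailable because of refresh is roughly tRFC / tREFI, and tRFC grows with the number of rows covered by each refresh command. The numbers below are illustrative DDR3-like assumptions used only to show the trend from the charts above, not values from this paper.

```python
# First-order refresh-overhead model: time fraction spent refreshing ~ tRFC / tREFI.
# tRFC values are illustrative DDR3-like assumptions; tREFI is the common 7.8 us.
tREFI_ns = 7800
tRFC_ns  = {"1Gb": 110, "2Gb": 160, "4Gb": 260, "8Gb": 350}  # assumed, grows with capacity

for density, trfc in tRFC_ns.items():
    print(f"{density}: ~{trfc / tREFI_ns:.1%} of time unavailable due to refresh")
# The trend (not the exact numbers) is the point: larger devices refresh more
# rows per command, so tRFC and hence the refresh overhead keep growing.
```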
Outline 22: Introduction, Background on DRAM Structure and Refresh Operation, Related Work, Contributions, Evaluation Setup, Evaluation Results, Conclusion
Related Work 23
We improve upon Scattered Refresh; Scattered Refresh, in turn, improves upon Per-bank Refresh and All-bank Refresh.
All-Bank Refresh vs. Per-Bank Refresh 24
Distributed Refresh can be implemented at two different granularities.
All-bank Refresh: all banks are refreshed simultaneously, and no bank is allowed to serve any request until refresh is complete. Supported by all general-purpose DDRx DRAMs. DRAM operation is completely stalled → the number of available banks (#AB) is zero. Exploits bank-level parallelism (BLP) for refreshing → smaller tRFC.
Per-bank Refresh: only one bank is refreshed at a time, so all other banks can serve other requests. Supported by LPDDRx DRAMs. #AB > 0, but no BLP → larger tRFC.
tRFC: Refresh Cycle Time.
All-Bank Refresh vs. Per-Bank Refresh 25
[Timing diagrams: dummy ACTIVATION-PRECHARGE operations issued for a refresh command under All-bank and Per-bank Refresh; legend: L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID; tRC: Row Cycle Time, tRFC: Refresh Cycle Time]
All-bank Refresh: smaller tRFC, but #AB = 0 → DRAM operation is completely stalled.
Per-bank Refresh: #AB > 0, but no BLP → larger tRFC.
Both All-bank Refresh and Per-bank Refresh have drawbacks, and both can be improved.
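As a rough first-order model (an assumption of this write-up, not the slides' exact timing): per-bank refresh cycles a bundle's rows back-to-back inside one bank, while all-bank refresh spreads the same rows across banks that work in parallel. The tRAS/tRP values and the 8-banks-per-vault figure below are illustrative assumptions.

```python
# Toy tRFC model (illustrative assumptions, not the paper's exact equations).
# tRC = tRAS + tRP is the time to ACTIVATE + PRECHARGE (i.e., refresh) one row.
tRAS, tRP = 35, 14          # ns, illustrative DDR3-like values
tRC = tRAS + tRP

def ceil_div(a, b):
    return -(-a // b)

def trfc_per_bank(n_rows):
    # All n rows live in one bank (same subarray): strictly serial row cycles.
    return n_rows * tRC

def trfc_all_bank(n_rows, n_banks=8):
    # Rows are spread over banks that refresh in parallel (BLP), so only
    # ceil(n / n_banks) serial row cycles remain per bank.
    return ceil_div(n_rows, n_banks) * tRC

n = 16                       # refresh-bundle size from the 1 Gb vault example
print(trfc_per_bank(n), trfc_all_bank(n))   # 784 vs 98 ns in this toy model
```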
Scattered Refresh 26
[Timing diagram: example Scattered Refresh operation on an HMC vault with a refresh bundle size of 4; source: T. Kalyan et al., ISCA 2012; legend: L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID]
Improves upon Per-bank Refresh by using subarray-level parallelism (SLP) for refresh: each row of the RB is mapped to a different subarray. SLP gives the opportunity to overlap a PRECHARGE with the next ACTIVATE, which reduces tRFC.
How does Scattered Refresh compare to Per-bank Refresh and All-bank Refresh?
Scattered Refresh – tRFC Comparison 27
[Timing diagrams: example Scattered Refresh operation on an HMC vault with a refresh bundle size of 4, compared against All-bank and Per-bank Refresh]
tRFC for All-bank Refresh < tRFC for Scattered Refresh < tRFC for Per-bank Refresh → there is still room for improvement over Scattered Refresh.
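Continuing the same toy timing model (an assumption for illustration, not the paper's equations): because each row of the bundle sits in a different subarray, the PRECHARGE of one row can overlap the ACTIVATE of the next, so only the restore times accumulate plus one trailing precharge.

```python
# Toy tRFC model for Scattered Refresh (same illustrative tRAS/tRP as before).
tRAS, tRP = 35, 14
tRC = tRAS + tRP

def trfc_scattered(n_rows):
    # Rows are in different subarrays of ONE bank, so the PRECHARGE of row i
    # overlaps the ACTIVATE of row i+1; only the last precharge is exposed.
    return n_rows * tRAS + tRP

n = 16
print(n * tRC, trfc_scattered(n))   # per-bank 784 ns vs scattered 574 ns
# Consistent with the slide's ordering in this toy model:
# tRFC(all-bank) < tRFC(scattered) < tRFC(per-bank).
```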
Outline 28: Introduction, Background on DRAM Structure and Refresh Operation, Related Work, Contributions, Evaluation Setup, Evaluation Results, Conclusion
Contributions 29
Crammed Refresh (Per-bank Refresh + All-bank Refresh): 2 banks are refreshed in parallel, instead of 1 bank as in Per-bank Refresh or all banks as in All-bank Refresh.
Massed Refresh (Crammed Refresh + Scattered Refresh): 2 banks are refreshed in parallel, and SLP is used within both banks being refreshed.
#AB: number of banks available to serve other requests while the remaining banks are being refreshed; BLP: bank-level parallelism; SLP: subarray-level parallelism.
Only 2 banks are refreshed in parallel as a proof of concept; more than 2 banks can also be chosen. The idea is to keep a balance between #AB and the BLP used for refresh (see the layout sketch below).
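The layout sketch below shows where the rows of one refresh bundle could land under each scheme, following the slides' descriptions; the bank and subarray indices are purely illustrative assumptions, only the grouping pattern matters.

```python
# Sketch: (row, bank, subarray) targets for one refresh bundle under each scheme.
def layout(scheme, bundle_size=4, banks_in_parallel=2):
    targets = []
    for i in range(bundle_size):
        if scheme == "per-bank":      # all rows in one bank, one subarray
            bank, sub = 0, 0
        elif scheme == "scattered":   # one bank, a different subarray per row (SLP)
            bank, sub = 0, i
        elif scheme == "crammed":     # rows split across two banks (BLP), no SLP
            bank, sub = i % banks_in_parallel, 0
        else:                         # "massed": BLP across banks + SLP inside each
            bank, sub = i % banks_in_parallel, i // banks_in_parallel
        targets.append((i, bank, sub))
    return targets

for s in ("per-bank", "scattered", "crammed", "massed"):
    print(s, layout(s))
```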
Crammed Refresh – tRFC Timing 30
[Timing diagrams: example Crammed Refresh operation on an HMC vault with a refresh bundle size of 4, compared against Per-bank and Scattered Refresh; legend: L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID]
Bank-level parallelism (BLP) for refresh: only 2 banks are refreshed in parallel, so #AB > 0.
tRFC for Crammed Refresh < tRFC for Scattered Refresh.
Massed Refresh – tRFC Timing 31
[Timing diagrams: example Massed Refresh operation on an HMC vault with a refresh bundle size of 4, compared against Per-bank and Crammed Refresh; legend: L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID]
Bank-level parallelism (BLP) + subarray-level parallelism (SLP) for refresh.
tRFC for Massed Refresh < tRFC for Crammed Refresh.
How can BLP and SLP be implemented together?
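Extending the same toy timing model (illustrative assumptions throughout): crammed refresh halves the serial row count by refreshing two banks at once, and massed refresh additionally hides precharges inside each bank via SLP.

```python
# Toy tRFC model for Crammed and Massed Refresh (same illustrative tRAS/tRP as before).
tRAS, tRP = 35, 14
tRC = tRAS + tRP
n = 16                                   # refresh-bundle size in the vault example

def ceil_div(a, b):
    return -(-a // b)

def trfc_crammed(n_rows, banks=2):
    # Two banks refresh in parallel (BLP); rows within each bank stay serial.
    return ceil_div(n_rows, banks) * tRC

def trfc_massed(n_rows, banks=2):
    # BLP across two banks plus SLP inside each bank: precharges overlap the
    # next activate, so only one trailing tRP is exposed.
    return ceil_div(n_rows, banks) * tRAS + tRP

trfc_scattered = n * tRAS + tRP          # one bank, SLP only (from the earlier sketch)
print(trfc_scattered, trfc_crammed(n), trfc_massed(n))   # 574 392 294 (ns)
# Reproduces the slides' ordering: tRFC(massed) < tRFC(crammed) < tRFC(scattered).
```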
Subarray-level Parallelism (SLP) 32
[Figure; source: Y. Kim et al., ISCA 2012 — a single global row-address latch versus per-subarray row-address latches]
A single global row-address latch hinders SLP; per-subarray row-address latches enable it.
Bank-level Parallelism (BLP) 33
[Figure: vault controller on the Logic Base (LoB), connected to banks in memory dies 1-4 through TSV launch pads. The refresh controller contains a 17-bit address counter split into LayerAddr[2], RowAddr[14], and BankAddr[1], plus a refresh scheduler, address calculator, control logic, physical address decoder, and row-address latch; the LayerID (LID) and BankID (BID) pass through a Mask/EN stage]
BLP is implemented by masking the BankID during refresh.
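A sketch of the refresh-address handling the slide describes: a 17-bit counter split into layer, row, and bank fields, with the BankID bit masked during refresh so the same layer/row address is driven to two banks at once. The exact bit ordering and decode details are assumptions for illustration only.

```python
# Sketch of the 17-bit refresh address counter: LayerAddr[2] | RowAddr[14] | BankAddr[1].
# Field ordering and decode details are assumptions, not the paper's exact hardware.
LAYER_BITS, ROW_BITS, BANK_BITS = 2, 14, 1

def decode(counter):
    bank  =  counter                             & ((1 << BANK_BITS)  - 1)
    row   = (counter >> BANK_BITS)               & ((1 << ROW_BITS)   - 1)
    layer = (counter >> (BANK_BITS + ROW_BITS))  & ((1 << LAYER_BITS) - 1)
    return layer, bank, row

def refresh_targets(counter, mask_bank_id=True):
    layer, bank, row = decode(counter)
    if mask_bank_id:
        # Masking BankID: the same layer/row address is applied to BOTH banks
        # covered by the masked bit, giving bank-level parallelism for refresh.
        return [(layer, b, row) for b in range(1 << BANK_BITS)]
    return [(layer, bank, row)]

print(refresh_targets(0b01_00000000000011_0))
# -> [(1, 0, 3), (1, 1, 3)]: layer 1, row 3, refreshed in banks 0 and 1 together.
```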
Outline 34: Introduction, Background on DRAM Structure and Refresh Operation, Related Work, Contributions, Evaluation Setup, Evaluation Results, Conclusion
Evaluation Setup 35
Trace-driven simulation for PARSEC benchmarks: memory access traces were extracted from detailed cycle-accurate simulations using gem5, and then provided as inputs to the DRAM simulator DRAMSim2.
Energy, timing, and area analysis: CACTI-3DD based simulation, using a 4 Gb HMC quad model.
DRAMSim2 configuration: DRAMSim2 was configured using the CACTI-3DD results.
Outline 36: Introduction, Background on DRAM Structure and Refresh Operation, Related Work, Motivation, Massed Refresh Technique, Evaluation Setup, Evaluation Results, Conclusion
Results I – Energy, Timing, Area 37
Results II – Throughput 38
[Chart: throughput across PARSEC benchmarks]
Crammed Refresh achieves 7.1% and 2.9% higher throughput on average than distributed per-bank refresh and scattered refresh, respectively.
Massed Refresh achieves 8.4% and 4.3% higher throughput on average than distributed per-bank refresh and scattered refresh, respectively.
Results III – Energy Delay Product (EDP) 39
[Chart: EDP across PARSEC benchmarks]
Crammed Refresh achieves 6.4% and 2.7% lower EDP on average than distributed per-bank refresh and scattered refresh, respectively.
Massed Refresh achieves 7.5% and 3.9% lower EDP on average than distributed per-bank refresh and scattered refresh, respectively.
Outline 40: Introduction, Background on DRAM Structure and Refresh Operation, Related Work, Motivation, Massed Refresh Technique, Evaluation Setup, Evaluation Results, Conclusion
Conclusions 41
The proposed Massed Refresh technique exploits bank-level as well as subarray-level parallelism during refresh operations.
The proposed Crammed Refresh and Massed Refresh techniques improve the throughput and energy-efficiency of DRAM.
Crammed Refresh improves upon the state of the art: 7.1% and 6.4% improvements in throughput and EDP over distributed per-bank refresh, and 2.9% and 2.7% improvements in throughput and EDP over the scattered refresh scheme, respectively.
Massed Refresh improves upon the state of the art: 8.4% and 7.5% improvements in throughput and EDP over distributed per-bank refresh, and 4.3% and 3.9% improvements in throughput and EDP over the scattered refresh scheme, respectively.
Questions / Comments? Thank You 42