Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture
Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu

Presentation transcript:

1 Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu

2 Executive Summary
– Problem: DRAM latency is a critical performance bottleneck
– Our Goal: Reduce DRAM latency at low area cost
– Observation: Long bitlines in DRAM are the dominant source of DRAM latency
– Key Idea: Divide each long bitline into two shorter segments: a fast segment and a slow segment
– Tiered-Latency DRAM: enables latency heterogeneity in DRAM, which can be leveraged in many ways to improve performance and reduce power consumption
– Results: using the fast segment as a cache to the slow segment yields significant performance improvement (>12%) and power reduction (>23%) at low area cost (~3%)

3 Outline
– Motivation & Key Idea
– Tiered-Latency DRAM
– Leveraging Tiered-Latency DRAM
– Evaluation Results

4 Historical DRAM Trend
[Chart: over the years, DRAM capacity has grown ~16X while DRAM latency has improved by only ~20%]
DRAM latency continues to be a critical bottleneck.

5 What Causes the Long Latency?
A DRAM chip is composed of a cell array (organized as subarrays) and I/O circuitry, connected to the memory channel.
DRAM Latency = Subarray Latency + I/O Latency
The subarray latency is the dominant component.

6 Why is the Subarray So Slow?
A cell consists of a capacitor and an access transistor, connected to a wordline and a bitline; each bitline of 512 cells is driven by a sense amplifier that is extremely large (≈100X the cell size).
– Long bitline: amortizes the sense amplifier over many cells → small area
– Long bitline: large bitline capacitance → high latency
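
As a back-of-the-envelope sketch of why bitline length dominates latency (not a formula from the talk): sensing is an RC process, and the cell must share its charge with the bitline capacitance, which grows with the number of cells on the bitline:

\[
t_{sense} \propto R_{bitline} \, C_{bitline},
\qquad
\Delta V_{bitline} = \frac{V_{DD}}{2} \cdot \frac{C_{cell}}{C_{cell} + C_{bitline}}
\]

A shorter bitline lowers C_bitline, which both shortens the RC delay and enlarges the sense signal ΔV, so sensing completes faster on both counts.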

7 Trade-Off: Area (Die Size) vs. Latency
[Chart: shorter bitlines are faster; longer bitlines are smaller]
Bitline length creates a fundamental trade-off between area and latency.

8 Trade-Off: Area (Die Size) vs. Latency
[Chart: normalized die area vs. latency for different cells/bitline; commodity DRAM (long bitline) sits at the cheap-but-slow end, fancy DRAM (short bitline) at the fast-but-expensive end]
GOAL: a design that is both cheaper and faster.

9 Approximating the Best of Both Worlds
– Long bitline: small area, but high latency
– Short bitline: low latency, but large area
Our Proposal: start from a long bitline (small area); to make part of it as fast as a short bitline, we need isolation → add isolation transistors.

10 Approximating the Best of Both Worlds
– Long bitline: small area, high latency
– Short bitline: low latency, large area
– Tiered-Latency DRAM: low latency and small area, using a long bitline

11 Outline
– Motivation & Key Idea
– Tiered-Latency DRAM
– Leveraging Tiered-Latency DRAM
– Evaluation Results

12 Tiered-Latency DRAM
Divide each bitline into two segments with an isolation transistor: a near segment (the portion adjacent to the sense amplifier) and a far segment.

13 Near Segment Access
Turn off the isolation transistor, disconnecting the far segment.
Reduced bitline length → reduced bitline capacitance → low latency & low power

14 Far Segment Access
Turn on the isolation transistor, connecting the far segment to the sense amplifier through the near segment.
Long bitline length → large bitline capacitance, plus the additional resistance of the isolation transistor → high latency & high power

15 Latency, Power, and Area Evaluation
– Commodity DRAM: 512 cells/bitline
– TL-DRAM: 512 cells/bitline (near segment: 32 cells; far segment: 480 cells)
– Latency evaluation: SPICE simulation using a circuit-level DRAM model
– Power and area evaluation: DRAM area/power simulator from Rambus; DDR3 energy calculator from Micron

16 Commodity DRAM vs. TL-DRAM
– DRAM latency (tRC): relative to commodity DRAM (52.5 ns), near segment –56%, far segment +23%
– DRAM power: near segment –51%, far segment +49%
– DRAM area overhead: ~3%, mainly due to the isolation transistors
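
These per-segment numbers can be folded into a toy timing model (an illustrative sketch built only from the figures on this slide; the row-to-segment mapping is an assumption, and this is not the paper's simulator):

    # Toy tRC model for TL-DRAM, derived from the slide's numbers.
    BASELINE_TRC_NS = 52.5                      # commodity DRAM, 512 cells/bitline
    NEAR_TRC_NS = BASELINE_TRC_NS * (1 - 0.56)  # -56%  -> ~23.1 ns
    FAR_TRC_NS = BASELINE_TRC_NS * (1 + 0.23)   # +23%  -> ~64.6 ns

    NEAR_ROWS = 32                              # near-segment rows per subarray

    def row_trc_ns(row_in_subarray: int) -> float:
        """tRC for a row, depending on which bitline segment holds it."""
        if row_in_subarray < NEAR_ROWS:         # isolation transistor stays off
            return NEAR_TRC_NS
        return FAR_TRC_NS                       # isolation transistor turned on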

17 Latency vs. Near Segment Length
[Chart: near segment latency (ns) vs. near segment length (cells)]
A longer near segment leads to higher near segment latency.

18 Latency vs. Near Segment Length
[Chart: near and far segment latency (ns) vs. near segment length; far segment length = 512 – near segment length]
Far segment latency is higher than commodity DRAM latency.

19 Trade-Off: Area (Die Size) vs. Latency
[Chart: the area-vs-latency plot from slide 8, with the near and far segments added]
The near segment reaches the GOAL region: close to short-bitline latency at close to long-bitline cost.

20 Outline
– Motivation & Key Idea
– Tiered-Latency DRAM
– Leveraging Tiered-Latency DRAM
– Evaluation Results

21 Leveraging Tiered-Latency DRAM
TL-DRAM is a substrate that can be leveraged by the hardware and/or software. Many potential uses:
1. Use the near segment as a hardware-managed inclusive cache to the far segment
2. Use the near segment as a hardware-managed exclusive cache to the far segment
3. Profile-based page mapping by the operating system
4. Simply replace DRAM with TL-DRAM

22 Near Segment as Hardware-Managed Cache
In each subarray, the near segment serves as a cache to the far segment, which serves as main memory.
– Challenge 1: How to efficiently migrate a row between segments?
– Challenge 2: How to efficiently manage the cache?

23 Inter-Segment Migration
Goal: migrate the source row (far segment) into the destination row (near segment).
Naïve way: the memory controller reads the source row byte by byte and writes it to the destination row byte by byte → high latency.

24 Inter-Segment Migration
Our way:
– Source and destination cells share bitlines
– Transfer data from source to destination across the shared bitlines concurrently

25 Inter-Segment Migration
Our way: source and destination cells share bitlines, so data is transferred across the shared bitlines.
– Step 1: Activate the source row; the sense amplifiers latch its data onto the bitlines
– Step 2: Activate the destination row to connect its cells to the bitlines, which overwrites them with the latched values
Cost: only ~4 ns over the row access latency, since the migration is overlapped with the source row access.
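
The two activations can be written out as the command sequence the memory controller would issue (a hypothetical sketch: ACT/PRE are the standard DRAM activate/precharge commands, and the helper functions are illustrative stand-ins for a controller's command queue):

    def activate(bank: int, row: int) -> None:
        print(f"ACT  bank={bank} row={row}")    # stand-in for issuing the command

    def precharge(bank: int) -> None:
        print(f"PRE  bank={bank}")

    def migrate_row(bank: int, src_row: int, dst_row: int) -> None:
        # Step 1: the source row's data is latched by the sense amplifiers
        # and driven onto the shared bitlines (a normal row activation).
        activate(bank, src_row)
        # Step 2: raising the destination wordline connects its cells to the
        # same bitlines, so the latched values overwrite them in place
        # (~4 ns extra; no data ever crosses the memory channel).
        activate(bank, dst_row)
        # Close the row as usual once the (overlapped) access completes.
        precharge(bank)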

26 Near Segment as Hardware-Managed Cache
With inter-segment migration (Challenge 1) addressed, the remaining question is Challenge 2: How to efficiently manage the cache?

27 Three Caching Mechanisms
1. SC (Simple Caching): classic LRU cache. Benefit: reduced reuse latency.
2. WMC (Wait-Minimized Caching): identify and cache only wait-inducing rows. Benefit: reduced wait.
3. BBC (Benefit-Based Caching): BBC ≈ SC + WMC. Benefit: reduced reuse latency & reduced wait.
[Timeline illustration: in the baseline, the request for Row 2 must wait until the request for Row 1 finishes; once the wait-inducing Row 1 is cached, that wait is reduced]
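
A minimal sketch of SC and BBC for one subarray (hypothetical code: the talk gives no implementation, WMC's wait detection is omitted, and BBC's eviction rule beyond its 8-bit benefit counter is an assumption):

    from collections import OrderedDict

    NEAR_ROWS = 32   # near-segment rows per subarray = cache capacity

    class SimpleCache:
        """SC: classic LRU cache of far-segment rows in the near segment."""
        def __init__(self):
            self.cached = OrderedDict()          # far_row -> True, in LRU order

        def access(self, far_row):
            if far_row in self.cached:           # hit: served at near latency
                self.cached.move_to_end(far_row)
                return "near"
            if len(self.cached) >= NEAR_ROWS:    # miss: evict the LRU row
                self.cached.popitem(last=False)
            self.cached[far_row] = True          # migrate the row into the cache
            return "far"                         # this access pays far latency

    class BenefitBasedCache(SimpleCache):
        """BBC (sketch): evict the row with the least accumulated benefit,
        where benefit grows with reuse and with the wait the row removed."""
        def __init__(self):
            super().__init__()
            self.benefit = {}                    # 8-bit saturating counters

        def access(self, far_row, queued_behind=0):
            if far_row in self.cached:
                b = self.benefit[far_row] + 1 + queued_behind
                self.benefit[far_row] = min(b, 255)
                return "near"
            if len(self.cached) >= NEAR_ROWS:
                victim = min(self.cached, key=self.benefit.get)
                del self.cached[victim]
                del self.benefit[victim]
            self.cached[far_row] = True
            self.benefit[far_row] = 0
            return "far"

WMC would reuse the same structure but admit a row only when it has been observed to make other requests wait.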

28 Outline
– Motivation & Key Idea
– Tiered-Latency DRAM
– Leveraging Tiered-Latency DRAM
– Evaluation Results

29 Evaluation Methodology
– System simulator: instruction-trace-based x86 CPU model; cycle-accurate DDR3 DRAM model
– Workloads: 32 benchmarks from TPC, STREAM, and SPEC CPU2006
– Metrics: instructions per cycle (single-core); weighted speedup (multi-core)

30 Configurations
– System configuration
  – CPU: 5.3 GHz
  – LLC: 512 KB, private per core
  – Memory: DDR3, 1 rank/channel; 8 banks, 32 subarrays/bank, 512 cells/bitline; row-interleaved mapping & closed-row policy
– TL-DRAM configuration
  – Total bitline length: 512 cells/bitline
  – Near segment length: varied (see the sensitivity results below)
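
For reference, the same parameters written as a configuration record (an illustrative encoding; the simulator's actual input format is not shown in the talk, and the channel count is omitted here as it is on the slide):

    SYSTEM_CONFIG = {
        "cpu_freq_ghz": 5.3,
        "llc_kb_per_core": 512,            # private last-level cache
        "dram": {
            "standard": "DDR3",
            "ranks_per_channel": 1,
            "banks": 8,
            "subarrays_per_bank": 32,
            "cells_per_bitline": 512,
            "address_mapping": "row-interleaved",
            "row_policy": "closed-row",
        },
        "tl_dram": {
            "cells_per_bitline": 512,
            "near_segment_cells": None,    # swept in the sensitivity study
        },
    }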

31 Single-Core: Performance & Power
[Chart: IPC improvement of 12.7% and normalized power of –23% relative to commodity DRAM]
Using the near segment as a cache improves performance and reduces power consumption.

32 Single-Core: Varying Near Segment Length
[Chart: IPC improvement vs. near segment length; the maximum IPC improvement occurs at an intermediate length]
A longer near segment provides larger cache capacity but higher caching latency: by adjusting the near segment length, we can trade off cache capacity for cache latency.
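
Why the peak sits at an intermediate length can be seen with a toy model (every curve below is an illustrative stand-in, not the paper's data): lengthening the near segment raises the hit rate but also raises the near-segment latency, so the average latency is minimized somewhere in between.

    def near_trc_ns(n: int) -> float:
        return 20.0 + 0.09 * n               # near-segment tRC grows with length

    def far_trc_ns(n: int) -> float:
        return 55.0 + 0.02 * n               # far segment also slows slightly

    def hit_rate(n: int) -> float:
        return min(1.0, n / 128) ** 0.5      # concave: diminishing returns

    def avg_trc_ns(n: int) -> float:
        h = hit_rate(n)
        return h * near_trc_ns(n) + (1 - h) * far_trc_ns(n)

    best_len = min(range(1, 257), key=avg_trc_ns)   # interior optimum, not 1 or 256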

33 Dual-Core Evaluation
We categorize the single-core benchmarks into two categories:
1. Sens: benchmarks whose performance is sensitive to near segment capacity
2. Insens: benchmarks whose performance is insensitive to near segment capacity
Dual-core workload categories: Sens/Sens, Sens/Insens, Insens/Insens

34 Dual-Core: Sens/Sens
[Chart: performance improvement vs. near segment length (cells); BBC and WMC show the largest improvements]
Larger near segment capacity leads to higher performance improvement in capacity-sensitive workloads.

35 Dual-Core: Sens/Insens & Insens/Insens
[Chart: performance improvement for the Sens/Insens and Insens/Insens categories]
Using the near segment as a cache provides high performance improvement regardless of near segment capacity.

36 Other Mechanisms & Results in Paper
– More mechanisms for leveraging TL-DRAM: a hardware-managed exclusive caching mechanism, and profile-based page mapping to the near segment; TL-DRAM improves performance and reduces power consumption with these mechanisms as well
– More than two tiers: latency evaluation for three-tier TL-DRAM
– Detailed circuit evaluation of DRAM latency and power consumption: examination of tRC and tRCD
– Implementation details and storage cost analysis in the memory controller

37 Conclusion
– Problem: DRAM latency is a critical performance bottleneck
– Our Goal: Reduce DRAM latency at low area cost
– Observation: Long bitlines in DRAM are the dominant source of DRAM latency
– Key Idea: Divide each long bitline into two shorter segments: a fast segment and a slow segment
– Tiered-Latency DRAM: enables latency heterogeneity in DRAM, which can be leveraged in many ways to improve performance and reduce power consumption
– Results: using the fast segment as a cache to the slow segment yields significant performance improvement (>12%) and power reduction (>23%) at low area cost (~3%)

38 Thank You

39 Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu

40 Backup Slides

41 Storage Cost in Memory Controller
– Organization: 512 cells/bitline; near segment: 32 cells; far segment: 480 cells; inclusive caching
– Simple caching and wait-minimized caching: tag storage 9 KB; replacement information 5 KB
– Benefit-based caching: tag storage 9 KB; replacement information 8 KB (8-bit benefit field per near-segment row)
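
These totals are consistent with the configuration on slide 30 (8 banks × 32 subarrays/bank × 32 near-segment rows each); the per-row field widths below are a plausible reconstruction rather than figures stated in the talk:

\[
\begin{aligned}
\text{cached rows} &= 8 \times 32 \times 32 = 8192\\
\text{tag storage} &= 8192 \times \lceil \log_2 480 \rceil \text{ bits} = 8192 \times 9 \text{ bits} = 9\,\text{KB}\\
\text{LRU information} &= 8192 \times \lceil \log_2 32 \rceil \text{ bits} = 8192 \times 5 \text{ bits} = 5\,\text{KB}\\
\text{benefit fields} &= 8192 \times 8 \text{ bits} = 8\,\text{KB}
\end{aligned}
\]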

42 Hardware-Managed Exclusive Cache
– Both the near and the far segment are part of main memory
– Caching = swapping a near-segment row with a far-segment row; one dummy row is needed for the swap
[Chart: performance improvements ranging from 7.2% to 14.3% across configurations]
Performance improvement is lower than with inclusive caching due to the high swapping latency.

43 Profile-Based Page Mapping
The operating system profiles applications and maps frequently accessed rows to the near segment.
[Chart: performance improvements ranging from 7.2% to 24.8% across configurations]
Allocating frequently accessed rows in the near segment improves performance.

44 Three-Tier Analysis
– Three tiers: add two isolation transistors; near/mid/far segment lengths: 32/224/256 cells
– More tiers enable finer-grained caching and partitioning mechanisms
[Chart: latency relative to commodity DRAM: near –56%, mid –23%, far +57%]