A Case for Subarray-Level Parallelism (SALP) in DRAM
Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu


Executive Summary
Problem: Requests to the same DRAM bank are serialized
Our Goal: Parallelize requests to the same DRAM bank at a low cost
Observation: A bank consists of subarrays that occasionally share global structures
Solution: Increase independence of subarrays to enable parallel operation
Result: Significantly higher performance and energy-efficiency at low cost (+0.15% area)

Outline
– Motivation & Key Idea
– Background
– Mechanism
– Related Works
– Results

Introduction
(Figure labels: DRAM, Bank, Req.)
Bank conflict! 4x latency

Three Problems
Bank conflicts degrade performance
1. Requests are serialized
2. Serialization is worse after write requests
3. Thrashing in row-buffer
(Figure labels: Bank, Row, Row-Buffer, Req.) Thrashing: increases latency

Case Study: Timeline
Case #1. Different Banks: Wr and Rd are served in parallel
Case #2. Same Bank: Wr and Rd are delayed
1. Serialization
2. Write Penalty
3. Thrashing Row-Buffer

Our Goal
Goal: Mitigate the detrimental effects of bank conflicts in a cost-effective manner
Naïve solution: Add more banks
– Very expensive
We propose a cost-effective solution

Key Observation #1
A DRAM bank is divided into subarrays
Logical Bank: 32k rows, one row-buffer
– A single row-buffer cannot drive all rows
Physical Bank: Subarray 1 ··· Subarray 64, each with a local row-buffer, plus a global row-buffer
– Many local row-buffers, one at each subarray

Key Observation #2
Each subarray is mostly independent…
– except occasionally sharing global structures
(Figure labels: Bank, Global Decoder, Global Row-Buf, Local Row-Buf, Subarray 1 ··· Subarray 64.)

Key Idea: Reduce Sharing of Globals
1. Parallel access to subarrays
2. Utilize multiple local row-buffers
(Figure labels: Bank, Global Decoder, Global Row-Buf, Local Row-Buf.)

Overview of Our Mechanism
Requests (Req) to same bank... but diff. subarrays
1. Parallelize
2. Utilize multiple local row-buffers
(Figure labels: Subarray 1 ··· Subarray 64, Local Row-Buf, Global Row-Buf.)


Organization of DRAM System
(Figure labels: CPU, Bus, Channel, DRAM System, Rank, Bank.)

Naïve Solutions to Bank Conflicts
1. More channels: expensive
– Many CPU pins
2. More ranks: low performance
– Large load on the channel, low frequency
3. More banks: expensive
– Significantly increases DRAM die area

Logical Bank
(Figure labels: addr, Decoder, wordlines, bitlines, Row, Row-Buffer, data, VDD.)
Precharged State → ACTIVATE → Activated State (RD/WR) → PRECHARGE → Precharged State
Total latency: 50ns!
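
The precharged/activated cycle above can be sketched as a small state machine. This is a minimal illustration, not the simulator used in the talk: the class name `LogicalBank` and the three timing constants are hypothetical, chosen only so that row hits, accesses to a precharged bank, and row conflicts show different latencies.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative timings in ns. These are assumptions for the sketch,
# not values taken from the slides.
T_ACTIVATE = 15
T_COLUMN = 15
T_PRECHARGE = 15

@dataclass
class LogicalBank:
    open_row: Optional[int] = None  # None means the bank is precharged

    def access(self, row: int) -> int:
        """Return the latency (ns) of a RD/WR to `row` and update the bank state."""
        latency = 0
        if self.open_row is not None and self.open_row != row:
            latency += T_PRECHARGE  # row conflict: close the open row first
            self.open_row = None
        if self.open_row is None:
            latency += T_ACTIVATE   # raise the wordline and fill the row-buffer
            self.open_row = row
        return latency + T_COLUMN   # column access to the row-buffer

bank = LogicalBank()
first = bank.access(3)     # bank starts precharged
hit = bank.access(3)       # same row is still in the row-buffer
conflict = bank.access(7)  # different row in the same bank
```

With these assumed timings, the conflicting access pays for PRECHARGE, ACTIVATE, and the column access in sequence, which is the serialization the talk sets out to reduce.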

Physical Bank
Logical view: one row-buffer for 32k rows
– Very long bitlines: hard to drive
Physical view: Subarray 1 ··· Subarray 64, 512 rows each
– Local bitlines: short
– One local row-buffer per subarray, plus a global row-buffer
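
The numbers on this slide fit together as 32k rows / 512 rows per subarray = 64 subarrays. A short sketch of that arithmetic follows; it assumes, purely for illustration, that consecutive row indices are laid out contiguously within a subarray, and `locate` is a hypothetical helper name.

```python
ROWS_PER_BANK = 32 * 1024   # "32k rows" from the slide
ROWS_PER_SUBARRAY = 512     # "512 rows" from the slide
NUM_SUBARRAYS = ROWS_PER_BANK // ROWS_PER_SUBARRAY

def locate(row: int) -> tuple[int, int]:
    """Map a bank-level row index to (subarray index, row within subarray).

    Assumes rows are assigned to subarrays in contiguous blocks.
    """
    if not 0 <= row < ROWS_PER_BANK:
        raise ValueError("row index out of range")
    return divmod(row, ROWS_PER_SUBARRAY)

print(NUM_SUBARRAYS)   # 64
print(locate(513))     # (1, 1)
```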

Hynix 4Gb DDR3 (23nm), Lim et al., ISSCC’12
(Die photo labels: Bank0 to Bank7; magnified tile showing Subarray and Subarray Decoder.)

Bank: Full Picture
(Figure labels: Bank, Global Decoder, Latch, Subarray Decoder, Subarray 1 ··· Subarray 64, Local bitlines, Local Row-Buf, Global bitlines, Global Row-Buf.)


Problem Statement
Requests (Req) to different subarrays: Serialized!
(Figure labels: Local Row-Buf, Global Row-Buf.)

Overview: MASA (Multitude of Activated Subarrays)
(Figure labels: addr, Global Decoder, VDD, Local Row-Buf, Global Row-Buf, ACTIVATED, READ.)

Challenges: Global Structures
1. Global Address Latch
2. Global Bitlines

Challenge #1. Global Address Latch
(Figure labels: addr, Global Decoder, Latch, VDD, Local row-buffer, Global row-buffer, PRECHARGED, ACTIVATED.)

Solution #1. Subarray Address Latch
(Figure labels: Global Decoder, Latch, VDD, Local row-buffer, Global row-buffer, ACTIVATED.)

Challenges: Global Structures
1. Global Address Latch
– Problem: Only one raised wordline
– Solution: Subarray Address Latch
2. Global Bitlines

Challenge #2. Global Bitlines
(Figure labels: Local row-buffer, Switch, READ, Global bitlines, Global row-buffer.) Collision

Solution #2. Designated-Bit Latch
(Figure labels: Wire, D latches, Switch, READ, Local row-buffer, Global bitlines, Global row-buffer.)

Challenges: Global Structures
1. Global Address Latch
– Problem: Only one raised wordline
– Solution: Subarray Address Latch
2. Global Bitlines
– Problem: Collision during access
– Solution: Designated-Bit Latch
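
The two solutions can be modeled together in a few lines. This is a behavioral sketch only, not the circuit design from the talk: `MasaBank`, `sa_sel`, and the rule that a subarray must be explicitly selected before a read are assumptions made for illustration. The per-subarray list stands in for the subarray address latches, and the single `designated` field stands in for the designated-bit latches, of which only one may be set at a time.

```python
class MasaBank:
    """Toy model of a bank with per-subarray address latches and a designated bit."""

    def __init__(self, num_subarrays: int = 8):
        # One address latch per subarray: several wordlines can stay raised.
        self.latched_row = [None] * num_subarrays
        # At most one subarray drives the shared global bitlines.
        self.designated = None

    def activate(self, subarray: int, row: int) -> None:
        if self.latched_row[subarray] is not None:
            raise RuntimeError("subarray already activated; precharge it first")
        self.latched_row[subarray] = row

    def sa_sel(self, subarray: int) -> None:
        """Designate an activated subarray (hypothetical model of a select step)."""
        if self.latched_row[subarray] is None:
            raise RuntimeError("cannot designate a precharged subarray")
        self.designated = subarray

    def read(self, subarray: int) -> int:
        """Return the row currently served; only the designated subarray may respond."""
        if self.designated != subarray:
            raise RuntimeError("collision avoided: subarray is not designated")
        return self.latched_row[subarray]

    def precharge(self, subarray: int) -> None:
        self.latched_row[subarray] = None
        if self.designated == subarray:
            self.designated = None

bank = MasaBank()
bank.activate(0, row=10)
bank.activate(1, row=20)  # second subarray opens while the first stays activated
bank.sa_sel(0)
a = bank.read(0)
bank.sa_sel(1)
b = bank.read(1)
```

Switching between the two open rows needs only a select step in this model, with no PRECHARGE or ACTIVATE in between, which is how multiple local row-buffers are put to use.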

MASA: Advantages
(Timelines: Baseline (Subarray-Oblivious) shows 1. Serialization, 2. Write Penalty, 3. Thrashing; MASA shows the time Saved.)

MASA: Overhead
DRAM Die Size: Only 0.15% increase
– Subarray Address Latches
– Designated-Bit Latches & Wire
DRAM Static Energy: Small increase
– 0.56mW for each activated subarray
– But saves dynamic energy
Controller: Small additional storage
– Keep track of subarray status (< 256B)
– Keep track of new timing constraints

Cheaper Mechanisms
(Table: rows MASA, SALP-2, SALP-1; columns 1. Serialization, 2. Wr-Penalty, 3. Thrashing, Latches.)


Related Works
Randomized bank index [Rau ISCA’91, Zhang+ MICRO’00, …]
– Use XOR hashing to generate bank index
– Cannot parallelize bank conflicts
Rank-subsetting [Ware+ ICCD’06, Zheng+ MICRO’08, Ahn+ CAL’09, …]
– Partition rank and data-bus into multiple subsets
– Increases unloaded DRAM latency
Cached DRAM [Hidaka+ IEEE Micro’90, Hsu+ ISCA’93, …]
– Add SRAM cache inside of DRAM chip
– Increases DRAM die size (+38.8% for 64kB)
Hierarchical Bank [Yamauchi+ ARVLSI’97]
– Parallelize accesses to subarrays
– Adds complex logic to subarrays
– Does not utilize multiple local row-buffers
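
As a rough illustration of the XOR hashing mentioned under randomized bank index, the sketch below XORs the bank field of an address with the low-order row bits. The field width and the choice of row bits are assumptions for this example, and `xor_bank_index` is a hypothetical name, not an API from the cited works.

```python
BANK_BITS = 3  # assumed: 8 banks

def xor_bank_index(row: int, bank_field: int) -> int:
    """Hash the bank field with the low-order row bits to spread rows over banks."""
    mask = (1 << BANK_BITS) - 1
    return (bank_field ^ row) & mask

# Two rows that share the same bank field land in different banks after hashing.
print(xor_bank_index(row=0, bank_field=5))  # 5
print(xor_bank_index(row=1, bank_field=5))  # 4
```

Hashing only lowers the chance that two requests pick the same bank. Requests that still map to one bank are served one after another, which is the limitation noted on the slide.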


Methodology
DRAM Area/Power
– Micron DDR3 SDRAM System-Power Calculator
– DRAM Area/Power Model [Vogelsang, MICRO’10]
– CACTI-D [Thoziyoor+, ISCA’08]
Simulator
– CPU: Pin-based, in-house x86 simulator
– Memory: Validated cycle-accurate DDR3 DRAM simulator
Workloads
– 32 single-core benchmarks: SPEC CPU2006, TPC, STREAM, random-access; representative 100 million instructions
– 16 multi-core workloads: random mix of single-thread benchmarks

Configuration
System Configuration
– CPU: 5.3GHz, 128 ROB, 8 MSHR
– LLC: 512kB per-core slice
Memory Configuration
– DDR
– (default) 1 channel, 1 rank, 8 banks, 8 subarrays-per-bank
– (sensitivity) 1-8 chans, 1-8 ranks, 8-64 banks, subarrays
Mapping & Row-Policy
– (default) Line-interleaved & Closed-row
– (sensitivity) Row-interleaved & Open-row
DRAM Controller Configuration
– 64-/64-entry read/write queues per-channel
– FR-FCFS, batch scheduling for writes
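
FR-FCFS, named in the controller configuration, prefers requests that hit an open row and breaks ties by age. A minimal sketch of that selection rule follows. The request tuple layout and the function name `fr_fcfs_pick` are assumptions for illustration, and the batch scheduling of writes is left out.

```python
def fr_fcfs_pick(queue, open_rows):
    """Pick the next request: row hits first, then the oldest request.

    queue: list of (arrival_time, bank, row) tuples (hypothetical layout).
    open_rows: dict mapping bank -> currently open row.
    """
    # The sort key is (is_miss, arrival_time). False sorts before True,
    # so row hits win, and older requests win among equals.
    return min(queue, key=lambda req: (open_rows.get(req[1]) != req[2], req[0]))

queue = [(0, 0, 5), (1, 0, 9), (2, 1, 4)]
hit_first = fr_fcfs_pick(queue, open_rows={0: 9})  # (1, 0, 9): the only row hit
oldest = fr_fcfs_pick(queue, open_rows={})         # (0, 0, 5): no hits, so oldest
```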

Single-Core: Instruction Throughput
(Chart annotations: 17%, 20%.)
MASA achieves most of the benefit of having more banks (“Ideal”)

Single-Core: Instruction Throughput
SALP-1, SALP-2, MASA improve performance at low cost
(Chart annotations: 20%, 17%, 13%, 7%. DRAM Die Area: < 0.15%, 0.15%, 36.3%.)

Single-Core: Sensitivity to Subarrays
You do not need many subarrays for high performance

Single-Core: Row-Interleaved, Open-Row
(Chart annotations: 15%, 12%.)
MASA’s performance benefit is robust to mapping and page-policy

Single-Core: Row-Interleaved, Open-Row
MASA increases energy-efficiency
(Chart annotations: -19%, +13%.)

Other Results/Discussion in Paper
Multi-core results
Sensitivity to number of channels & ranks
DRAM die area overhead of:
– Naively adding more banks
– Naively adding SRAM caches
Survey of alternative DRAM organizations
– Qualitative comparison

Conclusion
Problem: Requests to the same DRAM bank are serialized
Our Goal: Parallelize requests to the same DRAM bank at a low cost
Observation: A bank consists of subarrays that occasionally share global structures
MASA: Reduces sharing to enable parallel access and to utilize multiple row-buffers
Result: Significantly higher performance and energy-efficiency at low cost (+0.15% area)


Exposing Subarrays to Controller
Every DIMM has an SPD (Serial Presence Detect)
– 256-byte EEPROM
– Contains information about DIMM and DRAM devices
– Read by BIOS during system-boot
SPD reserves 100+ bytes for manufacturer and user
– Sufficient for subarray-related information
1. Whether SALP-1, SALP-2, MASA are supported
2. Physical address bit positions for subarray index
3. Values of timing constraints: tRA, tWA
(Image: JEDEC)
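
Item 2 above implies that the controller assembles the subarray index from specific physical address bits. The sketch below shows one way that could look. The bit positions used here are made up for the example, not values from any SPD layout, and `subarray_index` is a hypothetical helper.

```python
def subarray_index(phys_addr: int, bit_positions: list[int]) -> int:
    """Gather the given address bits (least-significant first) into an index."""
    index = 0
    for i, pos in enumerate(bit_positions):
        index |= ((phys_addr >> pos) & 1) << i
    return index

# Assumed for illustration: 8 subarrays-per-bank, index held in address bits 13..15.
ASSUMED_BITS = [13, 14, 15]
addr = (0b101 << 13) | 0x1A4
print(subarray_index(addr, ASSUMED_BITS))  # 5
```

Taking a list of positions rather than a contiguous field lets the same code handle address mappings where the subarray bits are scattered.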

Multi-Core: Memory Scheduling
Configuration: 8-16 cores, 2 chan, 2 ranks-per-chan
Our mechanisms further improve performance when employed with application-aware schedulers
We believe the improvement can be even greater with subarray-aware schedulers

Number of Subarrays-Per-Bank
As DRAM chips grow in capacity…
– More rows-per-bank → More subarrays-per-bank
Not all subarrays may be accessed in parallel
– Faulty rows are remapped to spare rows
– If remapping occurs between two subarrays, they can no longer be accessed in parallel
Subarray group
– Restrict remapping: only within a group of subarrays
– Each subarray group can be accessed in parallel
– We refer to a subarray group as a “subarray”
We assume 8 subarrays-per-bank
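
The grouping rule can be expressed as a small check. This sketch assumes 64 physical subarrays split into 8 contiguous groups of equal size. The contiguous layout and the helper names are illustrative choices, not details given in the talk.

```python
PHYSICAL_SUBARRAYS = 64  # assumed physical subarray count per bank
GROUPS_PER_BANK = 8      # "8 subarrays-per-bank" as seen by the controller
SUBARRAYS_PER_GROUP = PHYSICAL_SUBARRAYS // GROUPS_PER_BANK

def group_of(physical_subarray: int) -> int:
    """Group index, assuming groups are contiguous blocks of physical subarrays."""
    return physical_subarray // SUBARRAYS_PER_GROUP

def remap_allowed(faulty: int, spare: int) -> bool:
    """A faulty row may only be remapped to a spare row in the same group."""
    return group_of(faulty) == group_of(spare)

def parallel_accessible(a: int, b: int) -> bool:
    """Only subarrays in different groups are treated as independent."""
    return group_of(a) != group_of(b)

print(remap_allowed(3, 6))         # True: both fall in group 0
print(parallel_accessible(7, 8))   # True: groups 0 and 1
```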

Area & Power Overhead
Latches: Per-Subarray Row-Address, Designated-Bit
– Storage: 41 bits per subarray
– Area: 0.15% in die area (assuming 8 subarrays-per-bank)
– Power: 72.2uW (negligible)
Multiple Activated Subarrays
– Power: 0.56mW static power for each additional activated subarray
– Small compared to 48mW baseline static power
SA-SEL Wire/Command
– Area: One extra wire (negligible)
– Power: SA-SEL consumes 49.6% of the power of ACT
Memory Controller: Tracking the status of subarrays
– Storage: Less than 256 bytes
– Tracked state: Activated? Which wordline is raised? Designated?
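
A quick back-of-the-envelope check of the figures above, assuming the default configuration of 8 banks and 8 subarrays-per-bank and the worst case where every subarray of one bank is activated at once. The worst-case scenario is an assumption of this sketch, not a measurement from the talk.

```python
BANKS = 8
SUBARRAYS_PER_BANK = 8
LATCH_BITS_PER_SUBARRAY = 41
EXTRA_STATIC_MW = 0.56     # per additional activated subarray
BASELINE_STATIC_MW = 48.0

# Latch storage across the chip.
latch_bits = BANKS * SUBARRAYS_PER_BANK * LATCH_BITS_PER_SUBARRAY
latch_bytes = latch_bits // 8

# Worst case for one bank: all 8 subarrays activated, i.e. 7 additional ones.
extra_mw = (SUBARRAYS_PER_BANK - 1) * EXTRA_STATIC_MW
extra_fraction = extra_mw / BASELINE_STATIC_MW

print(latch_bits, latch_bytes)       # 2624 328
print(round(extra_mw, 2))            # 3.92
print(round(100 * extra_fraction, 1))  # 8.2
```

Under these assumptions the added static power stays below a tenth of the baseline, in line with the slide's "small" characterization.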