High-Performance DRAM System Design Constraints and Considerations
Joseph Gross
August 2, 2010

Table of Contents
Background
◦ Devices and organizations
DRAM Protocol
◦ Operations and timing constraints
Power Analysis
Experimental Setup
◦ Policies and Algorithms
Results
Conclusions
Appendix

What is the Problem?
Controller performance is sensitive to policies and parameters
Real simulations show surprising behaviors
Policies interact in non-trivial and non-linear ways

DRAM Devices – 1T1C Cell
The row address is decoded and chooses the wordline
Values are sent across the bitline to the sense amps
Very space-efficient, but must be refreshed

Organization – Rows and Columns
Reads and writes can only target an active row
A row can be accessed once it is sensed, even before the data is fully restored
Any column within the active row can be read or written
Row reuse avoids having to sense and restore new rows

DRAM Operation

Organization
One memory controller per channel
1–4 ranks per DIMM in a JEDEC system
Registered DIMMs at slower speeds may have more DIMMs per channel

A Read Cycle
Activate the row and wait for it to be sensed before issuing the read
Data begins to be sent after tCAS
Precharge once the row is restored
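To make the timing concrete, here is a minimal C++ sketch of the read cycle above. The Timing values, the tBurst field, and folding the read-to-precharge spacing into the burst time are illustrative assumptions, not parameters from the talk.

    #include <algorithm>
    #include <cstdint>
    #include <iostream>

    struct Timing {
        uint32_t tRCD   = 4;  // ACT -> READ: row-to-column delay (assumed)
        uint32_t tCAS   = 4;  // READ -> first data beat (CL, assumed)
        uint32_t tRAS   = 12; // ACT -> PRE: row must be restored first
        uint32_t tBurst = 4;  // cycles to transfer one burst (assumed)
    };

    int main() {
        Timing t;
        uint32_t act  = 0;                         // cycle the ACT issues
        uint32_t rd   = act + t.tRCD;              // earliest READ: row sensed
        uint32_t data = rd + t.tCAS;               // data begins after tCAS
        uint32_t pre  = std::max(act + t.tRAS,     // precharge only once the
                                 data + t.tBurst); // row is restored and the
                                                   // burst has completed
        std::cout << "ACT@" << act << " READ@" << rd
                  << " DATA@" << data << " PRE@" << pre << "\n";
    }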

Command Interactions
Commands must wait for resources to be available
The data, address, and command buses must be available
Other banks and ranks can affect timing (tRTRS, tFAW)
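The tFAW constraint limits a rank to four row activations in any rolling tFAW window. Below is a hedged sketch of one way a controller might track this; the four-entry shift window and the FawTracker name are assumptions, not the talk's implementation.

    #include <algorithm>
    #include <array>
    #include <cstddef>
    #include <cstdint>

    class FawTracker {
        std::array<uint64_t, 4> lastFourActs{}; // ACT issue cycles, oldest first
        uint64_t tFAW;
    public:
        explicit FawTracker(uint64_t faw) : tFAW(faw) {}

        // Earliest cycle at which the next ACT to this rank may issue:
        // the oldest of the last four ACTs must be at least tFAW ago.
        uint64_t earliestAct(uint64_t now) const {
            return std::max(now, lastFourActs[0] + tFAW);
        }

        void recordAct(uint64_t cycle) {
            // Slide the window: drop the oldest ACT, append the newest.
            for (std::size_t i = 0; i + 1 < lastFourActs.size(); ++i)
                lastFourActs[i] = lastFourActs[i + 1];
            lastFourActs.back() = cycle;
        }
    };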

Power Modeling
Based on Micron guidelines (TN-41-01)
Calculates background and event power
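For background: TN-41-01 derives power from datasheet IDD currents, combining a standby (background) term with per-event terms for activate/precharge, read, and write. The sketch below shows the shape of two of those terms; the current values are placeholders from no particular datasheet.

    struct DramCurrents {            // datasheet IDD values, in mA (placeholders)
        double IDD0  = 90.0;         // one ACT-PRE cycle
        double IDD2N = 60.0;         // precharge standby
        double IDD3N = 65.0;         // active standby
        double IDD4R = 180.0;        // burst read
    };

    // Background power: a weighted mix of active and precharged standby.
    double backgroundPower(const DramCurrents& c, double vdd,
                           double fracBanksActive) {
        return vdd * (fracBanksActive * c.IDD3N +
                      (1.0 - fracBanksActive) * c.IDD2N) / 1000.0; // W
    }

    // ACT/PRE event power at the maximum activate rate (one ACT per tRC):
    // IDD0 minus the standby current already counted as background.
    double activatePower(const DramCurrents& c, double vdd,
                         double tRAS, double tRC) {
        double bg = (c.IDD3N * tRAS + c.IDD2N * (tRC - tRAS)) / tRC;
        return vdd * (c.IDD0 - bg) / 1000.0; // W
    }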

Controller Design
Address Mapping Policy
Row Buffer Management Policy
Command Ordering Policy
Pipelined operation with reordering

Transaction Queue
Not varied in this simulation
Policies (sketched in code below)
◦ Reads go before writes
◦ Fetches go before reads
◦ A variable number of transactions may be decoded
Optimized to avoid bottlenecks
Request reordering
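A minimal sketch of the resulting priority order, assuming a priority-queue encoding; the TxnType enum and ByPriority comparator are illustrative, not the controller's actual data structures.

    #include <cstdint>
    #include <queue>
    #include <vector>

    enum class TxnType { Write = 0, Read = 1, Fetch = 2 }; // higher issues sooner

    struct Transaction {
        TxnType  type;
        uint64_t arrival; // cycle the request entered the queue
        uint64_t address;
    };

    struct ByPriority {
        bool operator()(const Transaction& a, const Transaction& b) const {
            if (a.type != b.type)
                return a.type < b.type;   // fetch > read > write
            return a.arrival > b.arrival; // ties broken oldest-first
        }
    };

    // Pop order: all fetches (oldest first), then reads, then writes.
    using TransactionQueue =
        std::priority_queue<Transaction, std::vector<Transaction>, ByPriority>;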

Row Buffer Management Policy

Address Mapping Policy
Chosen to work with the row buffer management policy
Can either improve row locality or bank distribution
Performance depends on workload
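As an illustration of the locality-versus-distribution trade-off, the two hypothetical mappings below differ only in where the bank bits sit in the physical address; the field widths and bit positions are assumptions, not the talk's named policies.

    #include <cstdint>

    struct DramAddress { uint32_t row, bank, col; };

    // "Locality" mapping: column bits lowest, so a streaming pattern
    // keeps hitting the open row before moving to another bank.
    // Layout: [offset 0..5][col 6..16][bank 17..19][row 20..33]
    DramAddress mapForRowLocality(uint64_t addr) {
        return { uint32_t(addr >> 20) & 0x3FFF,
                 uint32_t(addr >> 17) & 0x7,
                 uint32_t(addr >> 6)  & 0x7FF };
    }

    // "Distribution" mapping: bank bits just above the line offset, so
    // consecutive cache lines land in different banks and can be
    // serviced in parallel.
    // Layout: [offset 0..5][bank 6..8][col 9..19][row 20..33]
    DramAddress mapForBankSpread(uint64_t addr) {
        return { uint32_t(addr >> 20) & 0x3FFF,
                 uint32_t(addr >> 6)  & 0x7,
                 uint32_t(addr >> 9)  & 0x7FF };
    }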

Address Mapping Policy – 433.calculix
Low Locality (~5 s) – irregular distribution
SDRAM Baseline (~3.5 s) – more regular distribution

Command Ordering Algorithm
Second Level of Command Scheduling (Bank Round Robin is sketched below)
◦ FCFS (FIFO)
◦ Bank Round Robin
◦ Rank Round Robin
◦ Command Pair Rank Hop
◦ First Available (Age)
◦ First Available (Queue)
◦ First Available (RIFF)
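From its name, Bank Round Robin visits the per-bank command queues in a fixed rotation and issues the head command of the next non-empty queue; everything in this sketch beyond the rotation itself is an assumption.

    #include <cstddef>
    #include <deque>
    #include <optional>
    #include <vector>

    struct Command { /* ACT / READ / WRITE / PRE plus target address */ };

    class BankRoundRobin {
        std::vector<std::deque<Command>> perBank;
        std::size_t next = 0;
    public:
        explicit BankRoundRobin(std::size_t banks) : perBank(banks) {}

        void enqueue(std::size_t bank, Command c) {
            perBank[bank].push_back(c);
        }

        // Issue the head of the next non-empty queue in rotation order.
        std::optional<Command> select() {
            for (std::size_t i = 0; i < perBank.size(); ++i) {
                std::size_t b = (next + i) % perBank.size();
                if (!perBank[b].empty()) {
                    Command c = perBank[b].front();
                    perBank[b].pop_front();
                    next = (b + 1) % perBank.size(); // resume after this bank
                    return c;
                }
            }
            return std::nullopt; // nothing queued this cycle
        }
    };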

Command Ordering Algorithm – First Available
Requires tracking when rank/bank resources become available
Evaluates every potential command choice (see the selection loop below)
◦ Age, Queue, RIFF – secondary tie-breaking criteria
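A sketch of that selection loop, using age as the secondary criterion; how readyCycle is derived from the timing constraints is elided, and the Candidate fields are assumptions.

    #include <cstdint>
    #include <limits>
    #include <vector>

    struct Candidate {
        uint64_t readyCycle; // when the target rank/bank can accept it
        uint64_t age;        // cycles this command has waited
        int      id;
    };

    // Pick the command whose resources free up first; among equals,
    // prefer the oldest (the talk also lists Queue and RIFF here).
    int firstAvailable(const std::vector<Candidate>& cands) {
        int best = -1;
        uint64_t bestReady = std::numeric_limits<uint64_t>::max();
        uint64_t bestAge = 0;
        for (const auto& c : cands) {
            if (c.readyCycle < bestReady ||
                (c.readyCycle == bestReady && c.age > bestAge)) {
                best = c.id;
                bestReady = c.readyCycle;
                bestAge = c.age;
            }
        }
        return best; // -1 if the queue was empty
    }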

Results – Bandwidth

Results – Latency

Results – Execution Time

Results – Energy

Command Ordering Algorithms

Conclusions
The right combination of policies can achieve good latency/bandwidth for a given benchmark
◦ Address mapping policies and row buffer management policies should be chosen together
◦ Command ordering algorithms become important as the memory system becomes heavily loaded
Open Page policies require more energy than Close Page policies in most conditions
The extra logic for more complex schemes helps improve bandwidth but may not be necessary
Address mapping policies should balance row reuse and bank distribution to reuse open rows and use available resources in parallel

Appendix

Bandwidth (cont.)

Row Reuse Rate (cont.)

Bandwidth (cont.)

Results – Execution Time

Results – Row Reuse Rate
Open Page/Open Page Aggressive have the greatest reuse rate
Close Page Aggressive rarely exceeds 10% reuse
SDRAM Baseline and SDRAM High Performance work well with open page
429.mcf has very little ability to reuse rows, 35% at the most
458.sjeng can reuse 80% with SDRAM Baseline or SDRAM High Performance; otherwise the rate is very low

Execution Time (cont.)

Row Reuse Rate (cont.)

Average Latency (cont.)

Average Latency (cont.)

Results – Bandwidth
High Locality is consistently worse than the other mappings
Close Page Baseline (Opt) works better with Close Page (Aggressive)
SDRAM Baseline/High Performance work better with Open Page (Aggressive)
Greater bandwidth correlates inversely with execution time – configurations that gave benchmarks more bandwidth finished sooner
470.lbm (1783%): (1.5 s, 5.1 GB/s) – (26.8 s, 823 MB/s)
458.sjeng (120%): (5.18 s, 357 MB/s) – (6.24 s, 285 MB/s)

Results – Energy
Close Page (Aggressive) generally takes less energy than Open Page (Aggressive)
The disparity is smaller for bandwidth-heavy applications like 470.lbm
◦ Banks are mostly in standby mode
Doubling the number of ranks
◦ Approximately doubles the energy for Open Page (Aggressive)
◦ Increases Close Page (Aggressive) energy by about 50%
Close Page Aggressive can use less energy when row reuse rates are significant
470.lbm (424%): (1.5 s, 12350 mJ) – (26.8 s, 52410 mJ)
458.sjeng (670%): (5.18 s, 14013 mJ) – (6.24 s, 93924 mJ)

Bandwidth (cont.)

Bandwidth (cont.)

Results – Average Latency

Energy (cont.)

Energy (cont.)

Average Latency (cont.)

Memory System Organization

Transaction Queue
RIFF (Read or Instruction Fetch First) or FIFO
Prioritizes reads or fetches
Allows reordering
Increases controller complexity
Avoids hazards

Transaction Queue – Decode Window
Out-of-order decoding (sketched below)
Avoids queuing delays
Helps to keep per-bank queues full
Increases controller complexity
Allows reordering
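A sketch of the decode window as described: scan a bounded number of entries from the head of the transaction queue and decode the first whose target bank queue has room, skipping older blocked entries. The window size and the bankQueueHasRoom predicate are assumptions.

    #include <algorithm>
    #include <cstddef>
    #include <deque>
    #include <optional>

    // Returns the decoded transaction, removed out of order from the
    // queue, or nullopt if everything in the window is blocked.
    template <typename Txn, typename HasRoomFn>
    std::optional<Txn> decodeOne(std::deque<Txn>& q, std::size_t window,
                                 HasRoomFn bankQueueHasRoom) {
        std::size_t limit = std::min(window, q.size());
        for (std::size_t i = 0; i < limit; ++i) {
            if (bankQueueHasRoom(q[i])) {
                Txn t = q[i];
                q.erase(q.begin() + i); // out-of-order decode
                return t;
            }
        }
        return std::nullopt; // head-of-line blocked this cycle
    }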

Row Buffer Management Policy
Close Page / Close Page Aggressive

Row Buffer Management Policy
Open Page / Open Page Aggressive
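In code form, the core difference between the two policy families on these last slides reduces to one decision after each column access; the Aggressive variants add speculative behavior on top (an assumption based on their naming) that this sketch omits.

    enum class Policy { OpenPage, ClosePage };

    // After a column access completes: close page precharges immediately
    // (typically via read/write with auto-precharge); open page keeps the
    // row active and only closes it when a queued request targets a
    // different row in the same bank.
    bool shouldPrecharge(Policy p, bool pendingConflictToThisBank) {
        switch (p) {
        case Policy::ClosePage:
            return true;
        case Policy::OpenPage:
            return pendingConflictToThisBank;
        }
        return false; // unreachable; silences compiler warnings
    }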