DRAM background: Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling, Ganesh et al., HPCA'07. CS 8501, Mario D. Marino, 02/08.

DRAM Background

Typical memory buses: address, command, data, and DIMM (Dual In-Line Memory Module) selection

DRAM cell

DRAM array

DRAM device or chip

Command/data movement: DRAM chip

Operations (commands): protocol and timing

Examples of DRAM operations (commands)

A row access (activate) command moves data from the DRAM arrays to the sense amplifiers. Governing timing parameters: tRCD and tRAS.

A column read command moves data from the sense amplifiers of a given bank to the memory controller. Governing timing parameters: tCAS and tBurst.

Precharge: a separate phase that is a prerequisite for a subsequent row access (bitlines are set to Vcc/2 or Vcc).
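To make the command sequence concrete, here is a minimal C sketch of one closed-page read (ACTIVATE, READ, PRECHARGE) under the timing rules above. The parameter values are illustrative DDR2-style numbers, and tRP is added for completeness; none of these figures come from the slides or the paper.

    /* Minimal sketch of the DRAM command timing rules described above.
       All parameter values are illustrative assumptions, in memory-clock cycles. */
    #include <stdio.h>

    enum {
        tRCD   = 4,   /* ACTIVATE -> first READ/WRITE to the open row    */
        tRAS   = 12,  /* ACTIVATE -> earliest PRECHARGE of the same bank */
        tCAS   = 4,   /* READ -> first data beat on the data bus         */
        tBurst = 4,   /* data-bus cycles to transfer one burst           */
        tRP    = 4    /* PRECHARGE -> next ACTIVATE of the same bank     */
    };

    int main(void) {
        /* Closed-page access: ACTIVATE, READ, PRECHARGE on one bank. */
        int activate  = 0;                    /* row moved to sense amps  */
        int read      = activate + tRCD;      /* column read may issue    */
        int data_done = read + tCAS + tBurst; /* burst delivered to MC    */
        int precharge = (data_done > activate + tRAS)
                      ? data_done : activate + tRAS;  /* honor tRAS       */
        int next_act  = precharge + tRP;      /* bank ready for a new row */

        printf("READ at cycle %d, data complete at %d, next ACTIVATE at %d\n",
               read, data_done, next_act);
        return 0;
    }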

Organization, access, protocols

Logical Channels: set of physical channels connected to the same memory controller

Examples of Logical Channels

Rank = set of banks

Row = DRAM page

Width: aggregating DRAM chips

Scheduling: banks

Scheduling: banks (cont.)

Scheduling: ranks

Open vs. close page
Open-page: data access to and from the cells requires separate row and column commands
– Favors accesses to the same row (sense amps stay open)
– Typical of general-purpose computers (desktop/laptop)
Close-page:
– Suits intense request streams; favors random accesses
– Typical of large multiprocessor/multicore systems
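Below is a minimal C sketch of how the two policies change the per-bank command sequence. The function names, the single open-row register, and the request stream are illustrative assumptions, not from the paper.

    /* Sketch: open-page vs. close-page command sequences for one bank. */
    #include <stdio.h>

    #define NO_OPEN_ROW (-1)

    /* Open-page: keep the last row in the sense amps between requests. */
    static void open_page_access(int *open_row, int row) {
        if (*open_row == row) {
            printf("row hit:      READ row %d\n", row);   /* row already open */
        } else {
            if (*open_row != NO_OPEN_ROW)
                printf("row conflict: PRECHARGE, ACTIVATE %d, READ\n", row);
            else
                printf("row empty:    ACTIVATE %d, READ\n", row);
            *open_row = row;
        }
    }

    /* Close-page: every access pays ACTIVATE and precharges afterwards. */
    static void close_page_access(int row) {
        printf("close-page:   ACTIVATE %d, READ, PRECHARGE\n", row);
    }

    int main(void) {
        int open_row = NO_OPEN_ROW;
        int stream[] = {7, 7, 3};  /* two same-row accesses, then a conflict */
        printf("open-page policy:\n");
        for (int i = 0; i < 3; i++) open_page_access(&open_row, stream[i]);
        printf("close-page policy:\n");
        for (int i = 0; i < 3; i++) close_page_access(stream[i]);
        return 0;
    }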

Available Parallelism in DRAM System Organization: Channel
Pros:
– Performance: different logical channels with independent memory controllers and scheduling strategies
Cons:
– Number of pins and power to deliver
– Smart but not adaptive firmware

Available Parallelism in DRAM System Organization: Rank
Pros:
– Accesses can proceed in parallel in different ranks (subject to bus availability)
Cons:
– Rank-to-rank switching penalties at high frequency
– Globally synchronous DRAM (global clock)

Available Parallelism in DRAM System Organization: Bank, Row, Column
– Bank: different banks can proceed in parallel (subject to bus availability)
– Row: only one row per bank can be active at any time
– Column: depends on management policy (close-page / open-page)
(See the address-mapping sketch below.)
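As a concrete illustration of where this parallelism comes from, the following C sketch splits a physical address into channel/rank/bank/row/column fields. The geometry (2 channels, 2 ranks, 8 banks, 2^14 rows, 2^11 column addresses over a 64-bit bus) and the bit layout are assumptions for illustration only.

    /* Illustrative address mapping exposing channel/rank/bank parallelism. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { unsigned chan, rank, bank, row, col; } dram_addr_t;

    static dram_addr_t map(uint64_t paddr) {
        dram_addr_t a;
        paddr >>= 3;                  /* byte offset within a 64-bit beat */
        a.col  = (unsigned)(paddr & 0x7FF);   paddr >>= 11;
        a.chan = (unsigned)(paddr & 0x1);     paddr >>= 1;  /* interleave channels low */
        a.bank = (unsigned)(paddr & 0x7);     paddr >>= 3;
        a.rank = (unsigned)(paddr & 0x1);     paddr >>= 1;
        a.row  = (unsigned)(paddr & 0x3FFF);
        return a;
    }

    int main(void) {
        dram_addr_t a = map(0x12345678ULL);
        printf("chan %u rank %u bank %u row %u col %u\n",
               a.chan, a.rank, a.bank, a.row, a.col);
        return 0;
    }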

Paper: Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling, Ganesh et al., HPCA'07

Issues
– Parallel bus scaling: frequency, width, length, depth (many hops => latency)
– Number of memory controllers has increased (CPUs, GPUs)
– #DIMMs/channel (depth) decreases: 4 DIMMs/channel in DDR, 2 DIMMs/channel in DDR2, 1 DIMM/channel in DDR3
– Scheduling

Contributions
– Applied DDR-based memory controller policies to FBDIMM memory
– Performance evaluation exploiting FBDIMM depth: rank (DIMM) parallelism
– Compared latency and bandwidth for FBDIMM and DDR:
– At high channel utilization, FBDIMM differs by about 7% in latency and 10% in bandwidth
– At low channel utilization, by about 25% in latency and 10% in bandwidth

Northbound channel: reads; southbound channel: writes. AMB (Advanced Memory Buffer): pass-through switch, buffer, and serial/parallel converter.
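A back-of-envelope C model of the daisy-chain cost: each intervening AMB adds a pass-through delay on the southbound path (commands and writes) and again on the northbound path (read data), so idle latency grows with channel depth. The delay values are assumptions, not measurements from the paper.

    /* Toy model of FB-DIMM daisy-chain latency vs. channel depth. */
    #include <stdio.h>

    int main(void) {
        double t_pass = 2.0;   /* ns per AMB pass-through hop (assumed)   */
        double t_dram = 15.0;  /* ns for the DRAM access itself (assumed) */
        for (int dimm = 1; dimm <= 8; dimm++) {
            /* Request crosses (dimm-1) AMBs southbound, reply (dimm-1) northbound. */
            double latency = 2.0 * (dimm - 1) * t_pass + t_dram;
            printf("DIMM %d: idle read latency ~ %.1f ns\n", dimm, latency);
        }
        return 0;
    }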

Methodology
– DRAMsim: execution-driven simulator with detailed models of FBDIMM and DDR2 based on real standard configurations; used standalone or coupled with M5/SimpleScalar/Sesc
– Benchmarks (bandwidth-bound):
– SVM from BioParallel (reads: 90%)
– SPEC-mixed: 16 independent workloads (r:w = 2:1)
– UA from NAS (r:w = 3:2)
– ART (SPEC 2000, OpenMP) (r:w = 2:1)

Methodology (cont.)
– Scheduling policies: greedy, OBF, most/least pending, and RIFF
– 16-way CMP with 8 MB L2; multi-threaded traces gathered with CMP$im
– SPEC traces gathered with SimpleScalar (1 MB L2, in-order core)
– 1 rank/DIMM

High bandwidth utilization: FBDIMM delivers better bandwidth, but at larger latency.

ART and UA: latency reduction

Low utilization: serialization cost dominates. Depth: the FBDIMM scheduler can exploit it to offset serialization.

Overheads: queueing, southbound channel availability, and rank availability. Single-rank configurations see higher latency.

Scheduling – best policy: RIFF, which prioritizes reads over writes.
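A minimal C sketch of the read-first idea behind RIFF: among queued requests, issue the oldest read before any write. The queue structure and tie-breaking are illustrative assumptions, not the paper's implementation.

    /* Sketch of a read-priority (RIFF-style) pick from the request queue. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { bool is_read; int arrival; } req_t;

    /* Return the index of the request to issue next, or -1 if the queue is empty. */
    static int riff_pick(const req_t *q, int n) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (best < 0) { best = i; continue; }
            /* Reads beat writes; ties broken by arrival order (oldest first). */
            if (q[i].is_read != q[best].is_read) {
                if (q[i].is_read) best = i;
            } else if (q[i].arrival < q[best].arrival) {
                best = i;
            }
        }
        return best;
    }

    int main(void) {
        req_t q[] = { {false, 0}, {true, 2}, {true, 1}, {false, 3} };
        int pick = riff_pick(q, 4);
        printf("issue request arriving at t=%d (read=%d)\n",
               q[pick].arrival, (int)q[pick].is_read);
        return 0;
    }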

Bandwidth is less sensitive to the scheduling policy than latency. Open-page mode shows higher latency. More channels => lower per-channel utilization.