
COMPUTER ARCHITECTURE CS 6354: Main Memory II
Samira Khan, University of Virginia, Nov 14, 2018
The content and concept of this course are adapted from CMU ECE 740

AGENDA
- Review from the last class
- Main Memory

DRAM SUBSYSTEM ORGANIZATION
Channel → DIMM → Rank → Chip → Bank → Row/Column
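As an illustration of this hierarchy, the sketch below decodes a physical address into channel/rank/bank/row/column indices. The bit widths and field order are hypothetical, not from the slides; real controllers choose (and often hash) the mapping to spread traffic across channels and banks.

```python
# Hypothetical address mapping, purely for illustration: byte offset in
# bits 0-2, then column, channel, bank, rank, row. Real controllers pick
# (and often hash) these bit fields to spread traffic across channels/banks.
def decode_address(phys_addr):
    col     = (phys_addr >> 3)  & 0x3FF   # 10 bits: column
    channel = (phys_addr >> 13) & 0x1     #  1 bit : channel
    bank    = (phys_addr >> 14) & 0x7     #  3 bits: bank
    rank    = (phys_addr >> 17) & 0x1     #  1 bit : rank
    row     = (phys_addr >> 18) & 0xFFFF  # 16 bits: row
    return {"channel": channel, "rank": rank, "bank": bank,
            "row": row, "col": col}

print(decode_address(0x12345678))
```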

MEMORY CONTROLLERS

DRAM VERSUS OTHER TYPES OF MEMORIES
Long-latency memories have similar characteristics that need to be controlled. The following discussion will use DRAM as an example, but many issues are similar in the design of controllers for other types of memories:
- Flash memory
- Other emerging memory technologies: Phase Change Memory, Spin-Transfer Torque Magnetic Memory

DRAM CONTROLLER: FUNCTIONS
- Ensure correct operation of DRAM (refresh and timing)
- Service DRAM requests while obeying the timing constraints of DRAM chips
  - Constraints: resource conflicts (bank, bus, channel), minimum write-to-read delays
- Translate requests to DRAM command sequences (see the sketch below)
- Buffer and schedule requests to improve performance
  - Reordering; row-buffer, bank, rank, and bus management
- Manage power consumption and thermals in DRAM
  - Turn DRAM chips on/off, manage power modes
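To make the request-to-command translation concrete, here is a minimal sketch of how a controller might expand one read request based on the row-buffer state of the target bank. The function and state names are hypothetical; a real controller also interleaves commands from many requests and checks timing before issuing each one.

```python
# Minimal sketch: translate one read request into DRAM commands based on
# the target bank's row-buffer state. Names are hypothetical.
def commands_for_read(bank_state, row):
    if bank_state["open_row"] == row:          # row hit: column access only
        return [("RD", row)]
    if bank_state["open_row"] is None:         # bank closed: open the row
        return [("ACT", row), ("RD", row)]
    # row conflict: close the currently open row first
    return [("PRE", bank_state["open_row"]), ("ACT", row), ("RD", row)]

bank = {"open_row": 7}
print(commands_for_read(bank, 42))  # [('PRE', 7), ('ACT', 42), ('RD', 42)]
```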

DRAM SCHEDULING POLICIES (I)
- FCFS (first come first served): oldest request first
- FR-FCFS (first ready, first come first served):
  1. Row-hit first
  2. Oldest first
  Goal: maximize row-buffer hit rate → maximize DRAM throughput
- Actually, scheduling is done at the command level
  - Column commands (read/write) prioritized over row commands (activate/precharge)
  - Within each group, older commands prioritized over younger ones
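A minimal sketch of FR-FCFS request selection, assuming each request carries an arrival timestamp and a target bank/row; the data structures are hypothetical. Sorting by the key (not row-hit, age) implements "row-hit first, then oldest first":

```python
# Minimal FR-FCFS sketch: pick the next request to service from a queue.
# Row hits win; ties are broken by age (earlier arrival first).
def fr_fcfs_pick(queue, open_rows):
    def priority(req):
        row_hit = open_rows.get(req["bank"]) == req["row"]
        return (not row_hit, req["arrival"])   # False sorts before True
    return min(queue, key=priority)

queue = [
    {"arrival": 1, "bank": 0, "row": 5},   # older, but a row miss
    {"arrival": 2, "bank": 0, "row": 9},   # younger, but a row hit
]
open_rows = {0: 9}                          # bank 0 has row 9 open
print(fr_fcfs_pick(queue, open_rows))       # picks the row-hit request
```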

DRAM SCHEDULING POLICIES (II)
- A scheduling policy is essentially a prioritization order
- Prioritization can be based on:
  - Request age
  - Row-buffer hit/miss status
  - Request type (prefetch, read, write)
  - Requestor type (load miss or store miss)
  - Request criticality: is it the oldest miss in the core? How many instructions in the core depend on it?

WHY ARE DRAM CONTROLLERS DIFFICULT TO DESIGN?
- Need to obey DRAM timing constraints for correctness
  - There are many (50+) timing constraints in DRAM
  - tWTR: minimum number of cycles to wait before issuing a read command after a write command is issued
  - tRC: minimum number of cycles between the issuing of two consecutive activate commands to the same bank
  - …
- Need to keep track of many resources to prevent conflicts
  - Channels, banks, ranks, data bus, address bus, row buffers
- Need to handle DRAM refresh
- Need to optimize for performance (in the presence of constraints)
  - Reordering is not simple
  - Predicting the future?
A sketch of the timing-constraint bookkeeping follows this list.
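One hedged way that bookkeeping might look: the controller records when each command was issued and refuses to issue a new command until every relevant constraint has elapsed. The two constraints and the cycle counts below are illustrative values, not datasheet numbers, and the per-bank simplification is mine.

```python
# Hedged sketch of timing-constraint bookkeeping. Cycle counts are
# illustrative, not from a datasheet; tWTR is simplified to per-bank here.
T_WTR = 6   # write -> read gap
T_RC  = 39  # activate -> activate gap, same bank

last_issue = {}  # (command, bank) -> cycle at which it was last issued

def can_issue(cmd, bank, now):
    if cmd == "RD":
        last_wr = last_issue.get(("WR", bank), -10**9)
        if now - last_wr < T_WTR:
            return False
    if cmd == "ACT":
        last_act = last_issue.get(("ACT", bank), -10**9)
        if now - last_act < T_RC:
            return False
    return True

def issue(cmd, bank, now):
    assert can_issue(cmd, bank, now)
    last_issue[(cmd, bank)] = now

issue("WR", 0, now=100)
print(can_issue("RD", 0, now=103))  # False: tWTR not yet satisfied
print(can_issue("RD", 0, now=106))  # True
```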

MANY DRAM TIMING CONSTRAINTS From Lee et al., “DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems,” HPS Technical Report, April 2010.

Reducing DRAM Timing
Why can we reduce DRAM timing parameters without any errors?
I will mostly talk about reducing DRAM timing parameters. Before explaining the details of our work, I will pose some high-level questions and our answers. The first question is why reducing DRAM timing parameters works. Our observation is that DRAM timing parameters are overly conservative, dictated by the slowest DRAM chip at the highest operating temperature. The second question is by how much we can reduce DRAM timing parameters. Our DRAM latency characterization shows that it depends on the operating temperature and the DRAM chip. The third question is whether this approach is safe. Our answer is yes: we exploit only the extra margin in the timing parameters. We performed a 33-day-long evaluation with reduced timing parameters and observed no errors. Let me explain the details of our work.

[Figure: an x86 CPU connected through the memory controller to a DRAM module. With the standard timing parameters (11 – 11 – 28), SPEC mcf runs in 527 min; with the reduced timing parameters (8 – 8 – 19), it runs in 477 min, a 10.5% improvement with no errors. Other workloads shown: Apache, GUPS, Memcached, Parsec.]
A system consists of a processor and a DRAM-based memory system connected by a memory channel. DRAM has many timing parameters, defined as minimum numbers of clock cycles between commands, and the memory controller follows these timing parameters when accessing DRAM. We execute a benchmark using the standard timing parameters; then we intentionally reduce the timing parameters and execute the same benchmark again. Comparing the two cases, we observe a significant performance improvement from reducing the timing parameters, for example about 10% for the SPEC mcf benchmark, without any errors. (Standard: DDR3 1600 MT/s, 11-11-28.)
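As a quick sanity check on the reported figure, the improvement follows directly from the two runtimes, computing speedup relative to the reduced-latency runtime:

```latex
\text{speedup} \;=\; \frac{527\ \text{min}}{477\ \text{min}} \;\approx\; 1.105,
\quad \text{i.e., about a } 10.5\% \text{ performance improvement.}
```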

DRAM Stores Data as Charge
Three steps of charge movement: 1. Sensing, 2. Restore, 3. Precharge.
[Figure: a DRAM cell connected to a sense amplifier.]
The figure shows the DRAM organization, which consists of DRAM cells and sense amplifiers. There are three steps in accessing DRAM. First, when DRAM selects one of the rows in its cell array, the selected cells drive their charge into the sense amplifiers, which detect the charge and thereby recognize the data in the cells. We call this operation sensing. Second, after sensing the data, the sense amplifiers drive charge back into the cells to reconstruct the original data. We call this operation restore. Third, after restoring the cell data, DRAM deselects the accessed row and returns the sense amplifiers to their original state for the next access. We call this operation precharge. As these three steps show, a DRAM operation is charge movement between a cell and a sense amplifier.

DRAM Charge over Time
Why does DRAM need the extra timing margin?
[Figure: charge in the cell and sense amplifier over time, in theory versus in practice, showing the sensing and restore phases and the margin built into the timing parameters.]
Timing parameters exist to ensure sufficient movement of charge. The figure shows the timeline of a DRAM access on the x-axis; the y-axis shows the amount of charge in the DRAM cell and the sense amplifier. When a row of cells is selected, each cell drives its charge toward the corresponding sense amplifier. Once the driven charge is more than enough to detect the data, the sense amplifier starts to restore the data by driving charge back toward the cell; this restore operation finishes when the cell is fully charged. Ideally, the DRAM timing parameters would be just long enough for this charge movement. In practice, however, there is a large margin in the DRAM timing parameters. Why is this margin required?
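To make the "charge over time" picture concrete, here is a toy model of the restore phase as exponential RC charging; the time constant and the "fully restored" threshold are made-up values purely for illustration. It shows why a cell that starts with more charge finishes restoring sooner, which is exactly the margin exploited later in this lecture:

```python
import math

# Toy RC-charging model of the restore phase. tau and the 95% "restored"
# threshold are illustrative, not device values.
def restore_time(initial_charge, tau=10.0, threshold=0.95):
    """Time until cell charge reaches the threshold, assuming
    charge(t) = 1 - (1 - initial_charge) * exp(-t / tau)."""
    return tau * math.log((1 - initial_charge) / (1 - threshold))

print(restore_time(0.0))   # worst-case empty cell: ~30 time units
print(restore_time(0.5))   # cell with extra charge: ~23, i.e. ~7 sooner
```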

Two Reasons for Timing Margin
1. Process Variation: DRAM cells are not equal. This leads to extra timing margin for a cell that can store a large amount of charge.
2. Temperature Dependence: DRAM leaks more charge at higher temperature. This leads to extra timing margin when operating at low temperature.
There are two reasons for the margin in DRAM timing parameters: process variation and temperature dependence. Let me introduce the first factor, process variation.

DRAM Cells are Not Equal
Ideal: same cell size → same charge → same latency. Real: different cell sizes (smallest vs. largest) → different charge → different latency, i.e., large variation in cell size, stored charge, and access latency.
Ideally, DRAM would have uniform cells of the same size, and hence the same access latency. Unfortunately, due to process variation, each cell has a different size and therefore a different access latency. As a result, DRAM shows large variation both in the amount of charge in its cells and in the access latency to those cells.

Process Variation
A small cell can store only a small amount of charge. Process variation affects ❶ cell capacitance (the capacitor), ❷ contact resistance (the bitline contact), and ❸ transistor performance (the access transistor). Small cell capacitance, high contact resistance, and a slow access transistor all lead to high access latency.

Two Reasons for Timing Margin
1. Process Variation: DRAM cells are not equal. This leads to extra timing margin for a cell that can store a large amount of charge.
2. Temperature Dependence: DRAM leaks more charge at higher temperature. This leads to extra timing margin for cells that operate at low temperature.
Let me introduce the second factor, temperature dependence.

Charge Leakage ∝ Temperature
[Figure: a cell at room temperature (small leakage) vs. at hot temperature, 85°C (large leakage).]
DRAM leakage increases the variation. DRAM cells lose their charge over time, and they lose more charge at higher temperature. Therefore, a DRAM cell has the smallest charge when operating at a hot temperature, for example 85°C, the highest temperature guaranteed by the DRAM standard. This temperature dependence increases the variation in DRAM cell charge: cells store a small charge at high temperature and a large charge at low temperature → large variation in access latency.
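A hedged way to quantify this: DRAM leakage is commonly modeled as growing roughly exponentially with temperature, with retention time roughly halving per ~10°C increase as a rule of thumb; the exact factor is device-dependent. The toy model below is only illustrative:

```python
# Toy leakage model: leakage roughly doubles per +10 degrees C (a common
# rule of thumb for DRAM retention; the exact factor is device-dependent).
def relative_leakage(temp_c, ref_temp_c=85.0):
    return 2.0 ** ((temp_c - ref_temp_c) / 10.0)

for t in (55, 65, 75, 85):
    print(f"{t} C: leakage = {relative_leakage(t):.3f}x of 85 C")
# At 55 C a cell leaks ~1/8 as fast as at 85 C, so it retains far more
# charge at the moment it is accessed -> extra timing margin.
```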

DRAM Timing Parameters
DRAM timing parameters are dictated by the worst case: the smallest cell with the smallest charge across all DRAM products, operating at the highest temperature. The result is a large timing margin for the common case.

Key Approach
Optimize DRAM timing parameters for the common case: the smallest cell with the smallest charge in a given DRAM module, operating at the current temperature. The common-case cell has more charge than the worst-case cell → we can lower latency for the common case.

Key Observations
1. Sensing: cells with extra charge can be sensed faster → lower sensing latency.
2. Restore: no need to fully restore cells with extra charge → lower restore latency.
3. Precharge: no need to fully precharge bitlines for cells with extra charge → lower precharge latency.
Remember that there are three steps in a DRAM operation: sensing, restore, and precharge. In all three steps, the extra charge enables a potential timing parameter reduction. Let me introduce these three scenarios in order.

Adaptive-Latency DRAM
Key idea: optimize DRAM timing parameters online.
Two components:
1. The DRAM manufacturer profiles multiple sets of reliable DRAM timing parameters at different temperatures for each DIMM.
2. The system monitors the DRAM temperature and uses the appropriate DRAM timing parameters.
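The selection logic of the second component could look like the sketch below: a per-DIMM table of profiled timing sets, indexed by the lowest profiled temperature at or above the current reading. The (tRCD, tRAS, tRP) values are hypothetical placeholders, not the paper's profiled numbers:

```python
# Hedged sketch of AL-DRAM's online selection. The timing values per
# temperature are hypothetical placeholders, not profiled data.
PROFILED_TIMINGS = {   # max covered temperature (C) -> reliable timing set
    55: {"tRCD": 8,  "tRAS": 19, "tRP": 8},
    70: {"tRCD": 9,  "tRAS": 24, "tRP": 9},
    85: {"tRCD": 11, "tRAS": 28, "tRP": 11},  # standard worst-case set
}

def select_timings(current_temp_c):
    # Use the most aggressive set whose profiled temperature covers the
    # current reading; fall back to the worst-case set otherwise.
    for max_temp in sorted(PROFILED_TIMINGS):
        if current_temp_c <= max_temp:
            return PROFILED_TIMINGS[max_temp]
    return PROFILED_TIMINGS[85]

print(select_timings(50))  # cool DIMM -> aggressive timings
print(select_timings(80))  # hot DIMM  -> near worst-case timings
```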

DRAM Testing Infrastructure
[Photo: temperature controller, FPGAs, heater, and a host PC.]
This is a photo of our infrastructure, which we built mostly from scratch. For higher throughput, we employ eight FPGAs, all enclosed in a thermally regulated environment.

Control Factors
- Timing parameters — Sensing: tRCD; Restore: tRAS (read), tWR (write); Precharge: tRP
- Temperature: 55–85°C
- Refresh interval: 64–512 ms (standard: 64 ms); a longer refresh interval leads to smaller charge
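The characterization can be pictured as a sweep over these factors, testing for errors at each combination; the harness function below is a toy stand-in for the FPGA infrastructure, and its error threshold is invented for illustration:

```python
import itertools

# Hypothetical sketch of the characterization sweep over the control
# factors; run_test is a toy stand-in for the FPGA test harness.
TEMPS_C    = [55, 65, 75, 85]
REFRESH_MS = [64, 128, 256, 512]    # longer interval -> less cell charge
TRCD_SWEEP = range(11, 5, -1)       # reduce sensing latency step by step

def run_test(temp_c, refresh_ms, trcd):
    # Toy model: pretend errors appear once tRCD drops below a threshold
    # that worsens with temperature and refresh interval.
    threshold = 6 + (temp_c - 55) // 15 + refresh_ms // 256
    return 0 if trcd >= threshold else 1   # number of bit errors

for temp, refresh in itertools.product(TEMPS_C, REFRESH_MS):
    lowest_ok = None
    for trcd in TRCD_SWEEP:                # from the standard value down
        if run_test(temp, refresh, trcd) != 0:
            break                          # first failing setting
        lowest_ok = trcd
    print(f"{temp} C, {refresh} ms refresh: lowest reliable tRCD = {lowest_ok}")
```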

1. Timings ↔ Charge
[Figure: errors (log scale, 10 to 10⁵) vs. reduced sensing, restore (read/write), and precharge timings at 85°C, for refresh intervals of 64, 128, 256, and 512 ms.]
More charge enables more timing parameter reduction.

2. Timings ↔ Temperature
[Figure: errors (log scale, 10 to 10⁵) vs. reduced sensing, restore (read/write), and precharge timings at a 512 ms refresh interval, for temperatures of 55, 65, 75, and 85°C.]
Lower temperature enables more timing parameter reduction.

3. Summary of 115 DIMMs
Latency reduction for read and write (55°C): read latency 32.7%, write latency 55.1%.
Latency reduction for each timing parameter (55°C): sensing 17.3%; restore 37.3% (read), 54.8% (write); precharge 35.2%.

Real System Evaluation Method
CPU: AMD 4386 (8 cores, 3.1 GHz, 8 MB LLC); DRAM: 4 GB DDR3-1600 (800 MHz clock); OS: Linux; Storage: 128 GB SSD.
Workloads: 35 applications from SPEC, STREAM, Parsec, Memcached, Apache, GUPS.

Single-Core Evaluation
[Figure: performance improvement per workload; average improvements of 1.4%, 5.0%, and 6.7% across workload groups, including the all-35-workload average.]
AL-DRAM improves performance on a real system.

Multi-Core Evaluation
[Figure: performance improvement for multi-programmed and multi-threaded workloads; average improvements of 2.9%, 10.4%, and 14.0% across workload groups, including the all-35-workload average.]
AL-DRAM provides higher performance for multi-programmed and multi-threaded workloads.

Conclusion
Observations: DRAM timing parameters are dictated by the worst-case cell (the smallest cell across all products at the highest temperature), yet DRAM typically operates at lower temperatures than the worst case.
Idea: Adaptive-Latency DRAM optimizes DRAM timing parameters for the common case (a typical DIMM operating at low temperatures).
Analysis: a characterization of 115 DIMMs shows great potential to lower DRAM timing parameters (17–54%) without any errors.
Real-system performance evaluation: significant performance improvement (14% for memory-intensive workloads) without errors (33 days).
Let me conclude. DRAM timing parameters are dictated by the worst-case cell. We propose Adaptive-Latency DRAM, which optimizes DRAM timing parameters for the common case. Our characterization shows significant potential to lower DRAM timing parameters, and our real-system evaluation shows that our mechanism provides a significant performance improvement without introducing errors.

Adaptive-Latency DRAM: Pros and Cons
Pros:
- Optimizes DRAM timing parameters for the common case (a typical DIMM operating at low temperatures)
- Gets a performance improvement with essentially no extra hardware cost
- Can be implemented now in real systems
Cons:
- How do we determine the "safe" latency?
- The timing parameters must change as the server temperature changes

COMPUTER ARCHITECTURE CS 6354: Main Memory II
Samira Khan, University of Virginia, Nov 14, 2018
The content and concept of this course are adapted from CMU ECE 740