Staged-Reads : Mitigating the Impact of DRAM Writes on DRAM Reads

Presentation transcript:

Staged-Reads: Mitigating the Impact of DRAM Writes on DRAM Reads. Niladrish Chatterjee, Rajeev Balasubramonian, Al Davis, Naveen Muralimanohar*, Norm Jouppi*. University of Utah and *HP Labs.

Memory Trends. DRAM bandwidth is a growing bottleneck: multi-socket, multi-core, multi-threaded systems are expected to demand on the order of 1 TB/s by 2017, yet processors are pin-constrained. Bandwidth is therefore precious, and efficient utilization is important. Write handling can hurt that utilization, and the problem is expected to get worse in the future with chipkill support in DRAM and PCM cells that have longer write latencies. It is well recognized that for current and future systems, main memory will be the main performance bottleneck. (Slide figure: DRAM bandwidth trend; source: Tom's Hardware.)

DRAM Writes. Writes receive low priority in the DRAM world: they are buffered in the memory controller's write queue and drained only when absolutely necessary, i.e., when the queue occupancy reaches a high-water mark. Writes are drained in batches; at the end of the write burst, the data bus is "turned around" and reads are performed. The turn-around penalty (tWTR) has stayed constant across DRAM generations (7.5 ns). Reads are not interleaved with the write burst, to avoid the bus underutilization that frequent turn-arounds would cause. In other words, while a write burst drains, no reads are scheduled to the DRAM rank, because the bus would have to be turned around between a DRAM write and a DRAM read; the delay is needed to reverse the direction of the DRAM I/O bus.
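As a rough illustration of the drain policy described above, a write queue governed by high/low watermarks might be managed as in the minimal C++ sketch below. The names (WriteQueue, HI_WATERMARK, LO_WATERMARK) and the structure are assumptions for illustration, not the controller implementation from the paper.

#include <cstddef>

// Minimal sketch of watermark-based write draining in a memory controller.
struct WriteQueue {
    static constexpr std::size_t HI_WATERMARK = 32;  // start draining when occupancy reaches this
    static constexpr std::size_t LO_WATERMARK = 16;  // stop draining once occupancy falls to this
    std::size_t pending = 0;
    bool draining = false;

    // Called every scheduling cycle: returns true if the controller should
    // issue a burst of writes this cycle instead of reads.
    bool should_drain() {
        if (!draining && pending >= HI_WATERMARK) draining = true;   // turn the bus around to writes
        if (draining && pending <= LO_WATERMARK)  draining = false;  // turn back to reads (pay tWTR)
        return draining;
    }
};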

Baseline Write Drain - Timing. (Timing diagram: writes 1-4 drain to banks 1, 2, and 4 while reads 5-11, destined for banks 1, 3, and 4, sit waiting; only after the last write and the tWTR turn-around does the data bus begin transferring read data.) Even though bank 3 was not busy, the many pending reads to that bank could not be serviced, because doing so would require multiple turn-arounds. The idle gaps in the diagram represent lost scheduling opportunity: high queuing delay for reads and low bus utilization.

Write Induced Slowdown. Write imbalance causes long bank-idle periods: some banks sit idle while other banks are busy servicing writes. Reads pending on the idle banks cannot start their bank access until the last write to the other banks has completed and the bus has been turned around, so these reads see high queuing delay. Our methods seek to address this issue.

Motivational Results. If there were no writes at all (RDONLY), throughput could be boosted by 35%. If all the pending reads could finish their bank access in parallel with the write drain (IDEAL), throughput could be boosted by 14%. IDEAL itself is unattainable, because some banks are busy servicing writes and the reads also suffer bank conflicts, but it provides an absolute ceiling on what can be achieved by parallelizing reads with writes. The scheme described next tries to get as close to IDEAL as possible.

Staged Reads - Overview. A mechanism to perform useful read operations during a write drain: a read stalled by the drain is decoupled into two stages. Stage 1: the read accesses its idle bank in parallel with the writes, and the read data is buffered internally in the chip. Stage 2: after all writes have completed and the bus has been turned around, the buffered data is streamed out over the chip's I/O pins.

Staged Reads - Timing. (Timing diagram: while writes 1-4 drain, Staged Reads are issued to the free banks; once the drain ends and the bus is turned around, the Staged Read registers are drained back-to-back and regular reads resume, giving lower queuing delay and higher bus utilization.) We are able to utilize the idle banks by decoupling the read operations.

Staged Read Registers. A small pool of cache-line-sized (64 B) registers: 16 or 32 SR registers, i.e., 256 B of storage per chip, placed near the I/O pads. During the first stage of a Staged Read, data from a bank's row buffer is routed to the SR pool by a simple DEMUX setting. The output port of the SR register pool connects to the global I/O network to stream out the latched data.
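To make the data path concrete, the sketch below models the per-chip SR register pool as it might appear in a simulator. The field names and sizes are illustrative assumptions (32 registers, each latching one chip's 8-byte slice of a 64 B cache line, giving the 256 B per chip quoted on the slide); this is a sketch of the idea, not the actual circuit.

#include <array>
#include <cstdint>
#include <optional>

// Sketch of the per-chip Staged Read register pool (sizes assumed).
struct StagedReadPool {
    struct Entry {
        std::array<uint8_t, 8> data{};  // this chip's slice of the cache line
        int read_id = -1;               // which pending read this slice belongs to
        bool valid = false;
    };
    std::array<Entry, 32> regs{};

    // Stage 1: the DEMUX routes row-buffer data into a free register.
    bool latch(int read_id, const std::array<uint8_t, 8>& slice) {
        for (auto& e : regs)
            if (!e.valid) { e = Entry{slice, read_id, true}; return true; }
        return false;  // pool full: the read must wait for the normal path
    }

    // Stage 2: after the bus turn-around, stream a latched slice to the I/O pins.
    std::optional<std::array<uint8_t, 8>> drain(int read_id) {
        for (auto& e : regs)
            if (e.valid && e.read_id == read_id) { e.valid = false; return e.data; }
        return std::nullopt;
    }
};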

Implementation - Logical Organization. (Diagram: the write path, the regular read path, and the Staged Read path through the chip, highlighted in different colors.)

Implementability. (DRAM chip layout: DRAM arrays, row logic, column logic, I/O gating, and the center stripe; the SR registers are added in the center stripe.) We restrict our changes to the least cost-sensitive region of the chip.

Implementation. The Staged Read (SR) registers are shared by all banks, have low area overhead (<0.25% of the DRAM chip, estimated with CACTI), and have no effect on regular reads; only the new staged reads see an extra gate delay. Two new DRAM commands are added: CAS-SR, which is very similar to a regular CAS but enables the DEMUX so that data from the sense amplifiers is latched into an SR register, and SR-Read, a low-latency operation that moves data from an SR register to the DRAM data pins.
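The controller-side flow might look like the simplified sketch below: during the write drain, CAS-SR is issued for pending reads whose banks are idle, and after the drain and the tWTR turn-around, SR-Read commands stream the latched data out. Only the two command names come from the slides; the queue structures and helper functions (issue, bank_is_idle, ReadReq) are assumptions made for illustration.

#include <vector>

// Simplified sketch of interleaving the new commands during a write drain.
enum class Cmd { WRITE, CAS, CAS_SR, SR_READ };

struct ReadReq { int id; int bank; bool staged = false; };

void drain_writes_with_staged_reads(const std::vector<int>& write_banks,
                                    std::vector<ReadReq>& read_queue,
                                    void (*issue)(Cmd, int /*bank or read id*/),
                                    bool (*bank_is_idle)(int)) {
    // Phase 1: drain the write burst; opportunistically start stage 1 of reads
    // to banks that are not servicing writes.
    for (int wbank : write_banks) {
        issue(Cmd::WRITE, wbank);
        for (auto& r : read_queue)
            if (!r.staged && bank_is_idle(r.bank)) {
                issue(Cmd::CAS_SR, r.bank);  // latch the read data into an SR register
                r.staged = true;
            }
    }
    // Phase 2: after the last write and the tWTR turn-around, stream out the
    // buffered data back-to-back, then fall through to regular reads.
    for (const auto& r : read_queue)
        if (r.staged) issue(Cmd::SR_READ, r.id);
}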

Exploiting SR: WIMB Scheduler. The SR mechanism works well when there are writes to some banks and reads to others (write imbalance). The WIMB scheduler artificially increases this imbalance: banks are ordered by the metric M = (pending writes - pending reads), and writes are preferentially drained to banks with higher M values, leaving more opportunities to schedule Staged Reads to the low-M banks.
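A minimal sketch of the bank-ordering step follows; only the metric itself is from the slide, and the struct and function names are illustrative.

#include <algorithm>
#include <vector>

// Order banks by M = pending writes - pending reads; drain writes to high-M
// banks first so that low-M banks stay free for Staged Reads.
struct BankState { int bank; int pending_writes; int pending_reads; };

std::vector<int> wimb_drain_order(std::vector<BankState> banks) {
    std::sort(banks.begin(), banks.end(),
              [](const BankState& a, const BankState& b) {
                  return (a.pending_writes - a.pending_reads) >
                         (b.pending_writes - b.pending_reads);
              });
    std::vector<int> order;
    for (const auto& b : banks) order.push_back(b.bank);
    return order;
}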

Evaluation Methodology.
Simulator: SIMICS with a cycle-accurate DRAM simulator.
Workloads: SPEC-CPU 2006mp, PARSECmp, BIOBENCHmt, and STREAMmt.
Evaluated configurations: Baseline, SR_16, SR_32, SR_Inf, WIMB+SR_32.
CPU: 16-core out-of-order CMP, 3.2 GHz.
L2 unified cache: shared, 4 MB / 8-way, 10-cycle access.
Total DRAM capacity: 8 GB.
Memory configuration: 2 channels, 2 ranks/channel, 8 banks/rank.
DRAM chip: Micron DDR3-1600 (800 MHz).
Memory controller: FR-FCFS, 48-entry write queue (HI/LO watermarks 32/16).

Results. We simulated SR with the baseline write scheduler using 16 and 32 registers; to estimate the upper bound of staged-read opportunities, we also simulated a configuration with no restriction on the number of SR registers (SR_Inf), as well as a configuration combining the WIMB scheduler with 32 registers. 32 SR registers are sufficient: SR_32 performs as well as, or slightly better than, SR_Inf. By artificially creating write imbalance, SR_32+WIMB improves throughput by 7.2% on average (up to 33%).

Results - II. Workloads with high MPKI and writes concentrated on a few banks see the largest gains from SR. By actively creating bank imbalance, SR_32+WIMB performs better than SR_32.

Future Memory Systems: Chipkill. ECC is stored in a separate chip on each rank and, in RAID-like fashion, parity is maintained across ranks. Each cache-line write now requires two reads and two writes (read the old data and old parity, write the new data and new parity), resulting in higher write traffic. SR_32 achieves a 9% speedup over a RAID-5-style baseline.
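The read-modify-write that inflates write traffic can be seen in a standard RAID-5-style parity update, sketched below. This is a generic illustration of the parity arithmetic, not the paper's exact chipkill organization.

#include <array>
#include <cstdint>

using CacheLine = std::array<uint8_t, 64>;

// Standard RAID-5-style parity update: writing one cache line requires
// reading the old data and old parity, then writing new data and new parity.
void write_cacheline(CacheLine& data_loc, CacheLine& parity_loc, const CacheLine& new_data) {
    CacheLine old_data   = data_loc;    // read 1
    CacheLine old_parity = parity_loc;  // read 2
    CacheLine new_parity;
    for (std::size_t i = 0; i < new_data.size(); ++i)
        new_parity[i] = old_parity[i] ^ old_data[i] ^ new_data[i];
    data_loc   = new_data;              // write 1
    parity_loc = new_parity;            // write 2
}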

Future Memory Systems: PCM. Phase-change memory has been suggested as a scalable alternative to DRAM for main memory, but it has extremely long write latencies (~4x those of DRAM). SR_32 can alleviate the resulting write-induced stalls (~12% improvement). Here, SR_32 performs better than SR_32+WIMB: the artificial write imbalance introduced by WIMB increases bank conflicts and reduces the benefits of SR.

Conclusions: Staged Reads. A simple technique to prevent write-induced stalls for DRAM reads. Low-cost implementation, suited for niche high-performance markets. Higher benefits for future write-intensive systems.

Back-up Slides

Impact of Write Drain. With Staged Reads, we approximate the IDEAL behavior and reduce the queuing delays of stalled reads.

Threshold Sensitivity: HI/LO = 16/8

Threshold Sensitivity: HI/LO = 128/64

More Banks, Fewer Channels