Memory Controller Innovations for High-Performance Systems Rajeev Balasubramonian School of Computing University of Utah Sep 25th 2013
Micron Road Trip [map: Boise to Salt Lake City]
DRAM Chip Innovations
Feedback - I Don’t bother modifying the DRAM chip.
Feedback - II We love what you’re doing with the memory controller and OS.
Academic Research Agendas Not giving up on memory device innovations Several examples of academic papers resonating with commercial innovations Greater focus on memory controller improvements
This Talk’s Focus: The Memory Controller More relevant to Intel, Micron Cores are being commoditized, but memory controller features are still evolving – new devices (buffer chips, HMC), chipkill, compression Lots of room for improvement – MCs haven’t seen the same innovation frenzy as the cores
Example IBM Server Source: P. Bose, WETI Workshop, 2012
Power Contributions [charts: processor vs. memory share of total server power]
Memory Basics [diagram: multi-core host processor with multiple MCs driving x8 DRAM chips]
Outline Background Focusing on the memory controller Memory basics Implementing memory compression (MemZip) Implementing chipkill (LESS-ECC) Voltage and current aware scheduling (MICRO 2013)
Making a Case for Compression Prior work: IBM MXT, Ekman and Stenstrom, LCP, Alameldeen and Wood, etc. Can improve several metrics: primarily memory capacity secondary benefit in apps with locality: bandwidth, energy Typically worsens access complexity and introduces data copies The MemZip approach: focus on other metrics no change in memory capacity improvements in energy, bandwidth, reliability, complexity
MemZip [diagram: multi-core host processor, MCs with rank subsetting, x8 chips] Rank subsetting Data fetch in 8-byte increments Need metadata (MDC) Modified data layout
Cache Line Format [examples: Base-Delta-Immediate and Frequent Pattern Compression]
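To make the Base-Delta-Immediate idea concrete, here is a minimal sketch: a cache line of words is stored as one base value plus narrow signed deltas whenever all words are close to the base. The word count, delta width, and function names are illustrative, not the exact encoding used by BDI hardware.

```python
def bdi_compress(words, delta_bytes=1):
    # Base-Delta-Immediate sketch: keep words[0] as the base and store
    # each word as a narrow signed delta from it.
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)   # signed range of a delta
    deltas = [w - base for w in words]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas              # compressible line
    return None                          # line stays uncompressed

def bdi_decompress(base, deltas):
    return [base + d for d in deltas]

# Eight 8-byte words that are nearby addresses compress well:
# one 8-byte base + eight 1-byte deltas = 16 bytes instead of 64.
line = [0x7FFF00001000 + i * 8 for i in range(8)]
packed = bdi_compress(line)
assert packed is not None and bdi_decompress(*packed) == line
```

Lines whose words span more than the delta range (e.g. unrelated values) fail the check and are left uncompressed, which is why such schemes need per-line metadata.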
Using Spare Space for Energy and Reliability [diagram: a cache line compressed to 26 bytes leaves room for ECC and DBI codes]
Making the ECC Access More Efficient Baseline ECC: ECC code is fetched in parallel from 9th chip Subranking with embedded-ECC: no extra chip; ECC is located in the same row as data; need extra COL-RDs to fetch ECC codes MemZip with embedded-ECC: in many cases, the ECC is fetched with no additional COL-RD
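The burst arithmetic behind this can be sketched as follows. The 8-byte fetch granularity comes from the rank-subsetting design above; the 8-byte-per-line ECC size is an assumption for illustration.

```python
import math

BURST_BYTES = 8    # fetch granularity with rank subsetting (from the talk)
ECC_BYTES = 8      # assumed per-line ECC size (illustrative)

def col_rds(data_bytes):
    # Bursts (COL-RDs) needed to pull the line plus its embedded ECC
    # out of the same DRAM row.
    return math.ceil((data_bytes + ECC_BYTES) / BURST_BYTES)

# Uncompressed 64-byte line: the embedded ECC costs one extra COL-RD.
assert col_rds(64) == 9
# Line compressed to 26 bytes (slide example): data + ECC need only
# 5 bursts, so the ECC effectively rides along for free.
assert col_rds(26) == 5
```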
DBI for Energy Efficiency Data Bus Inversion: to save energy, send either the data or its inverse Break the cache line into small words; each word gets an inversion bit; the inversion bits make up the DBI code We use either 0, 1, 2, or 3 bytes of DBI codes (2 bits encode the DBI code size) Example -- original data: Transfer 1: 11111111, Transfer 2: 00000000; with DBI encoding: Transfer 1: 11111111 + inversion bit 0, Transfer 2: 11111111 + inversion bit 1
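The slide's example can be reproduced with a small transition-minimizing DBI encoder (one common reading of DBI; DDR4's DBI-dc variant instead minimizes zeros on the bus, and gives the same answer for this example). All names here are illustrative.

```python
def dbi_encode(transfers, width=8):
    # Transition-minimizing DBI: invert a transfer whenever more than
    # half the wires would toggle relative to the last driven value.
    mask = (1 << width) - 1
    out, prev = [], None
    for t in transfers:
        if prev is not None and bin((t ^ prev) & mask).count("1") > width // 2:
            t = (~t) & mask
            out.append((t, 1))       # send the inverse, inversion bit = 1
        else:
            out.append((t, 0))       # send as-is, inversion bit = 0
        prev = t                     # bus now holds what was driven
    return out

# Slide example: 11111111 then 00000000 -> second transfer is inverted,
# so the bus never toggles.
assert dbi_encode([0b11111111, 0b00000000]) == [(0b11111111, 0), (0b11111111, 1)]
```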
Methodology Simics (8 out-of-order cores) and USIMM memory system timing Micron power calculator for DRAM power estimates Collection of workloads from SPEC2k6, NASPB, Parsec, CloudSuite; multi-programmed and multi-threaded
Effect of Sub-Ranking 2-way sub-ranking has best performance 8-way sub-ranking is worse than baseline
Effect of Compression on Performance 20% performance improvement With compression, 4-way is the best, but only slightly better than 8x2-way
Effect on Memory Energy 8x2-way has lowest traffic and energy Additional 17% reduction in activity with DBI
Outline Background Focusing on the memory controller Memory basics Implementing memory compression (MemZip) Implementing chipkill (LESS-ECC) Voltage and current aware scheduling (MICRO 2013)
Chipkill Overview Chipkill: the ability to recover from an entire memory chip failure Commercial symbol-based chipkill: 4 check symbols are required to recover from 1 data symbol corruption; hence needs 32+4 x4 DRAM chips per access (two channels)
LOT-ECC [diagram: data chips with per-chip checksums (PA) and parity (PPA)] 1st level: checksums for error detection and location 2nd and 3rd levels: parity for error recovery
LESS-ECC [diagram: data chips CA0–CA7 with parity CXA] 1st level: parity for error detection and recovery 2nd level: checksums for error location
Reducing Storage Checksums can be made large or small Small checksums can also be effectively cached on chip In LESS-ECC, the checksum granularity is configurable: basic: 8-bit checksum for 64 bits of data ES1: 8-bit checksum for 512 bits of data ES2: 64-bit checksum for 8 Kb of data ES3: 64-bit checksum for 4 Gb of data Storage overhead: x8 -- LOT-ECC 26%, LESS-ECC 13%; x16 -- LOT-ECC 52%, LESS-ECC 26%
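A toy checksum shows both why small checksums are cheap and why they can fail to detect an error. The additive byte checksum below is a stand-in for illustration, not necessarily the checksum function the scheme actually uses.

```python
def checksum8(data):
    # Stand-in checksum: sum of bytes mod 256.
    return sum(data) & 0xFF

data = bytes(range(8))                      # 64 bits of data ("basic" config)
cs = checksum8(data)

# A single flipped bit changes the checksum: the error is detected.
corrupted = bytes([data[0] ^ 0x01]) + data[1:]
assert checksum8(corrupted) != cs

# But errors can alias: two byte changes that cancel slip through,
# which is why checksums have a small miss probability (~1/256 here).
aliased = bytes([(data[0] + 1) & 0xFF, (data[1] - 1) & 0xFF]) + data[2:]
assert checksum8(aliased) == cs
```

Larger checksums (ES2/ES3) shrink this aliasing probability at the cost of coarser granularity per checksum.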
Error Rates Checksums have a small probability of failing to detect an error LOT-ECC uses checksums in the 1st level, so a missed detection causes SDC (silent data corruption) LESS-ECC uses checksums in the 2nd level, so a missed detection causes a DUE (detected but unrecoverable error); DUEs are the more favorable outcome
LESS-ECC Summary Benefits: energy, parallelism, storage, no SDC (misses become DUEs instead) Disadvantage: checksum cache and more logic at the memory controller
LESS-ECC Performance
LESS-ECC Memory Energy
LESS-ECC Energy Efficiency LESS-ECC-x8 has 0.5% lower energy than LOT-ECC-x8 but 15% less energy per usable byte LESS-ECC-x16 has 26% lower energy than LOT-ECC-x8 (both have similar storage overhead of 26%)
Outline Background Focusing on the memory controller Memory basics Implementing memory compression (MemZip) Implementing chipkill (LESS-ECC) Voltage and current aware scheduling (MICRO 2013)
Current Constraints and IR-Drop MC ensures that requests are scheduled appropriately; many timing constraints, such as tFAW A new constraint emerges in future 3D-stacked devices: IR-drop
Many Possible Solutions Note that charge pumps and decaps scale with dies, but IR-drop gets worse Provide higher voltage (power increase!) Provide more pins and TSVs (cost increase!) Alternative: Use an architectural solution; the MC schedules requests in a manner that does not violate IR-drop limits Similar to the power tokens used in some PCM papers, but we have to be aware of where activity is happening Place data such that IR-drop-prone regions are avoided
Example Voltage Map [heatmap: on-die voltage across X/Y coordinates]
IR-Drop Regions
Scheduling Constraints For a given part of the device, identify the worst-case set of requests that will cause an IR-drop violation For example, for region A-Top, if you issue one COL-RD to the furthest bank, that region cannot handle any other request Continue to widen the list of constraints by considering larger regions on the device
Scheduling Constraints 1 COL-RD = 1 COL-WR = 2 ACTs = 6 PREs
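The equivalence above can be enforced with a simple token budget at the MC: normalize every command type to PRE-units and admit a command only if the target region's budget holds. The budget value and scheduling window below are illustrative assumptions, not the paper's exact parameters.

```python
# Normalize "1 COL-RD = 1 COL-WR = 2 ACTs = 6 PREs" to PRE-units;
# a region's budget per scheduling window is then 6 units.
COST = {"COL-RD": 6, "COL-WR": 6, "ACT": 3, "PRE": 1}
BUDGET = 6

def can_issue(in_flight_cost, cmd):
    # MC-side admission check: issue cmd only if the region's
    # IR-drop budget is not exceeded.
    return in_flight_cost + COST[cmd] <= BUDGET

assert can_issue(COST["ACT"], "ACT")          # 2 ACTs fit in one window
assert not can_issue(COST["ACT"], "COL-RD")   # an ACT + COL-RD would violate
```

This is the same flavor of bookkeeping as the power tokens in PCM schedulers, except the budget here is per physical region of the 3D stack.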
Overcoming Scheduling Limitations Starvation: If B-Top always has 2 accesses, A-Top requests will be starved; prioritize requests that have much longer than average wait times Page placement: dynamically identify frequently accessed pages and move them to favorable regions
Performance Impact With All Constraints, (Real PDN) performance falls by 4.6X With Starvation management, gap is reduced to 1.47X Profiled Page Placement with Starvation Control is within 15% of unrealistic Ideal PDN
Summary Many features expected of future memory controllers: handling compression, errors, new devices Lots of low-hanging fruit Significant energy/performance benefits from compression Energy-efficient and storage-efficient chipkill is possible, but requires some effort in the MC More scheduling constraints are being imposed as technology evolves; our IR-drop case study for 3D-stacked devices shows that the performance impact can be large
Acks Students in the Utah Arch Lab (Amirali Boroumand, Nil Chatterjee, Seth Pugsley, Ali Shafiee, Manju Shevgoor, Meysam Taassori) Other collaborators from Samsung (Jung-Sik Kim), HP Labs (Naveen Muralimanohar), ARM (Ani Udipi), U. Nebrija (Pedro Reviriego), Utah (Al Davis) Funding sources: NSF, Samsung, HP, IBM