1 Towards Scalable and Energy-Efficient Memory System Architectures
Rajeev Balasubramonian, School of Computing, University of Utah

2 Main Memory Problems
[Figure: processor connected to a DIMM]
1. Energy
2. High capacity at high bandwidth
3. Reliability

3 Motivation: Memory Energy
Contributions of memory to overall system energy:
- 25-40% of server energy: IBM, Sun, and Google server data summarized by Meisner et al., ASPLOS'09
- HP servers: 175 W out of ~785 W for 256 GB of memory (HP power calculator)
- Intel SCC: the memory controller contributes 19-69% of chip power, ISSCC'10

4 Motivation: Reliability
DRAM data from Schroeder et al., SIGMETRICS'09:
- 25K-70K errors per billion device hours per Mbit
- 8% of DRAM DIMMs affected by errors every year
DRAM error rates may get worse as scaling limits are reached; PCM error rates (hard and soft) are expected to be high as well
Primary concern: storage and energy overheads for error detection and correction
ECC support is not too onerous; chipkill is much worse

7 Motivation: Capacity, Bandwidth
[Figure: processor connected to a DIMM]
- Cores are increasing, but pins are not
- High channel frequency → fewer DIMMs per channel
- Will eventually need disruptive shifts: NVM, optics
- Can't have high capacity, high bandwidth, and low energy: pick 2 of the 3!

8 Memory System Basics
[Figure: processor with multiple on-chip memory controllers (M), each driving a DIMM]
Multiple on-chip memory controllers handle multiple 64-bit channels

9 Memory System Basics: FB-DIMM
[Figure: processor with memory controllers driving daisy-chained DIMMs]
FB-DIMM can boost capacity with narrow channels and buffering at each DIMM

10 What's a Rank?
[Figure: memory controller driving a DIMM of x8 DRAM chips over a 64b channel]
Rank: the set of DRAM chips required to provide the 64b output expected by a JEDEC standard bus; for example, 8 x8 DRAM chips

11 What's a Bank?
[Figure: one bank spanning all the x8 chips of a rank]
Bank: a portion of a rank that is tied up when servicing a request; multiple banks in a rank enable parallel handling of multiple requests

12 What's an Array?
[Figure: arrays within a bank on each DRAM chip]
Array: a matrix of cells
- One array provides 1 bit/cycle
- Each array reads out an entire row
- Large arrays → high density
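
To make the channel/rank/bank/row hierarchy concrete, here is a minimal sketch in Python of how a physical address might be decomposed into those fields. The field widths are illustrative assumptions for this talk's example geometry (64 B lines, 8 KB rows), not a specific JEDEC mapping:

    # Minimal sketch of DRAM address decomposition (illustrative field
    # widths; real controllers use many different interleaving schemes).
    def decompose(addr):
        col     = (addr >> 6)  & 0x7F   # 7 bits: which 64B line in an 8KB row
        bank    = (addr >> 13) & 0x7    # 3 bits: 8 banks per rank (assumed)
        rank    = (addr >> 16) & 0x1    # 1 bit : 2 ranks per channel (assumed)
        channel = (addr >> 17) & 0x3    # 2 bits: 4 channels (assumed)
        row     = addr >> 19            # remaining bits select the row
        return channel, rank, bank, row, col

    # With this mapping, consecutive cache lines land in the same row,
    # which is what makes row buffer hits possible.
    print(decompose(0x12345680))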

13 What's a Row Buffer?
[Figure: a DRAM array with its wordline, bitlines, and row buffer; RAS selects the row, CAS selects the bits sent to the output pin]

14 Row Buffer Management
- Row buffer: the collection of rows read out by the arrays in a bank
- Row buffer hits incur low latency and low energy
- Bitlines must be precharged before a new row can be read
- Open-page policy: delays the precharge until a different row is encountered
- Close-page policy: issues the precharge immediately
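
As an illustration of the two policies, here is a minimal Python sketch of how the row buffer state of a bank determines whether a request is a hit, a miss, or a conflict. The latency constants are placeholder assumptions, not real DRAM timings:

    # Minimal sketch of open- vs. close-page row buffer management.
    T_CAS, T_RAS, T_PRE = 15, 15, 15  # column access, row activate, precharge

    class Bank:
        def __init__(self, open_page=True):
            self.open_row = None          # row currently latched in the row buffer
            self.open_page = open_page

        def access(self, row):
            if self.open_row == row:                  # row buffer hit
                latency = T_CAS
            elif self.open_row is None:               # row buffer empty
                latency = T_RAS + T_CAS
            else:                                     # conflict: close the old row first
                latency = T_PRE + T_RAS + T_CAS
            # Open-page keeps the row latched; close-page precharges right away.
            self.open_row = row if self.open_page else None
            return latency

    bank = Bank(open_page=True)
    print([bank.access(r) for r in (5, 5, 9)])  # [30, 15, 45]: hit on the 2nd access

Under close-page, every access would cost 30 cycles here: the policy gives up hits in exchange for never paying the 45-cycle conflict penalty, which is the better bet when locality is poor.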

15 Primary Sources of Energy Inefficiency
- Overfetch: 8 KB of data read out for each cache line request
- Poor row buffer hit rates: diminished locality in multi-cores
- Electrical medium: bus speeds have been increasing
- Reliability measures: overhead in building a reliable system from inherently unreliable parts
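
The overfetch arithmetic is worth spelling out, using the 8 KB row and a 64 B cache line from the slides:

    # Overfetch: fraction of an activated row actually used when a
    # single cache line is requested (sizes from the slides).
    ROW_BYTES, LINE_BYTES = 8 * 1024, 64
    print(f"Lines per row: {ROW_BYTES // LINE_BYTES}")                # 128
    print(f"Useful fraction per miss: {LINE_BYTES / ROW_BYTES:.2%}")  # 0.78%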

16 SECDED Support
[Figure: 64-bit data word + 8-bit ECC]
- One extra x8 chip per rank
- Storage and energy overhead of 12.5%
- Cannot handle the complete failure of one chip
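
For reference, SECDED over a 64-bit word is typically a (72,64) code: a shortened Hamming code with 7 check bits for single-error correction plus one overall parity bit for double-error detection, which accounts for exactly the 8 extra bits (12.5%) on the slide. A minimal Python sketch follows; a real controller computes this with parallel XOR trees, not loops:

    # Minimal (72,64) SECDED sketch: shortened Hamming(71,64) + overall parity.
    PARITY_POS = (1, 2, 4, 8, 16, 32, 64)

    def encode(data):                      # data: list of 64 bits (0/1)
        code = [0] * 72                    # code[1..71] = Hamming word, code[0] = overall parity
        it = iter(data)
        for pos in range(1, 72):
            if pos not in PARITY_POS:
                code[pos] = next(it)
        for p in PARITY_POS:               # check bit p covers positions with bit p set
            code[p] = sum(code[i] for i in range(1, 72) if i & p) % 2
        code[0] = sum(code[1:]) % 2        # overall parity enables double-error detection
        return code

    def correct(code):
        syndrome = 0
        for p in PARITY_POS:
            if sum(code[i] for i in range(1, 72) if i & p) % 2:
                syndrome |= p
        overall = sum(code) % 2
        if syndrome and overall:           # single error: syndrome names the bad position
            code[syndrome] ^= 1
            return code, "corrected"
        if syndrome:                       # two errors flipped parity back to even
            return code, "double error detected"
        return code, "ok" if not overall else "overall parity bit corrected"

    word = encode([1, 0] * 32)
    word[17] ^= 1                          # inject a single-bit error
    _, status = correct(word)
    print(status)                          # "corrected"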

17 Chipkill Support I
[Figure: 64-bit data word + 8-bit ECC, at most one bit from each DRAM chip]
- Use 72 DRAM chips to read out 72 bits, at most one bit from each chip
- A complete chip failure then corrupts only one bit per code word, which SECDED can correct
- Dramatic increase in activation energy and overfetch
- Storage overhead is still 12.5%

18 Chipkill Support II
[Figure: 8-bit data word + 5-bit ECC, at most one bit from each DRAM chip]
- Use 13 DRAM chips to read out 13 bits
- Storage and energy overhead: 62.5%
- Other options exist; the trade-off is between energy and storage
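
The overheads on the last three slides follow directly from the code-word geometry; a quick check using only the numbers given above:

    # ECC overhead = check bits / data bits for each scheme on the slides.
    schemes = {
        "SECDED (64d + 8c, 9 chips)":      (64, 8),
        "Chipkill I (64d + 8c, 72 chips)": (64, 8),
        "Chipkill II (8d + 5c, 13 chips)": (8, 5),
    }
    for name, (data, check) in schemes.items():
        print(f"{name}: {check / data:.1%} overhead")   # 12.5%, 12.5%, 62.5%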

19 Summary So Far
We now understand…
- why memory energy is a problem: overfetch, row buffer miss rates
- why reliability incurs high energy overheads: chipkill support requires high activation energy per useful bit
- why capacity and bandwidth increases cost energy: they need high frequency and buffering per hop

20 Crucial Timing
Disruptive changes may be compelling today…
- Increasing role of memory energy
- Increasing role of memory errors
- Impact of multi-core: high bandwidth needs, loss of locality
- Emerging technologies (NVM, optics): they will require a revamp of the memory architecture, the ideas here can be easily applied to NVM, and the role of DRAM may change

21 Attacking the Problem
- Find ways to maximize row buffer utility
- Find ways to reduce overfetch
- Treat reliability as a first-class design constraint
- Use photonics and 3D stacking to boost capacity and bandwidth
- Solutions must be very cost-sensitive

22 Maximizing Row Buffer Locality
- Micro-pages (ASPLOS'10)
- Handling multiple memory controllers (PACT'10)
- Ongoing work: better write scheduling, better bank management (data mapping, row closure)

23 Micro-Pages
Key observation: most accesses to a page are localized to a small region (a micro-page)

24 Solution
- Identify hot micro-pages
- Co-locate hot micro-pages in reserved DRAM rows
- The memory controller keeps track of the redirection
- Overheads are low if applications have a few hot micro-pages that account for most memory accesses
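
A minimal Python sketch of the redirection idea follows. The micro-page size, hotness threshold, and table structure are illustrative assumptions, not the ASPLOS'10 design:

    # Sketch of micro-page redirection: count accesses per micro-page and
    # remap hot ones into reserved rows. Sizes/thresholds are illustrative.
    from collections import defaultdict

    MICRO_PAGE = 1024          # assumed micro-page size in bytes
    HOT_THRESHOLD = 64         # accesses before a micro-page is deemed hot

    counts = defaultdict(int)  # micro-page number -> access count
    remap = {}                 # micro-page number -> slot in reserved rows
    next_slot = 0

    def translate(addr):
        """Redirect hot micro-pages; pass everything else through."""
        global next_slot
        mpn, offset = divmod(addr, MICRO_PAGE)
        counts[mpn] += 1
        if mpn not in remap and counts[mpn] >= HOT_THRESHOLD:
            remap[mpn] = next_slot         # migrate into a reserved DRAM row
            next_slot += 1
        if mpn in remap:
            return ("reserved", remap[mpn] * MICRO_PAGE + offset)
        return ("normal", addr)

    for _ in range(HOT_THRESHOLD):
        translate(0x5000)                  # hammer one micro-page until it is hot
    print(translate(0x5010))               # ('reserved', 16): now redirected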

25 Results
Overall: a 9% improvement in performance and a 15% reduction in energy

26 Handling Multiple Memory Controllers
Data mapping across multiple memory controllers is key:
- Must equalize load and queuing delays
- Must minimize "distance"
- Must maximize row buffer hit rates

27 Solution
- A cost function guides initial page placement
- A similar cost function guides page migration
- Initial page placement improves performance by 7%; page migration by 9%
- Row buffer hit rates can be doubled
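
As an illustration of such a cost function, the Python sketch below scores each controller on load, distance, and expected row buffer locality, and places the page at the cheapest one. The terms and weights are illustrative assumptions, not the exact function from the PACT'10 paper:

    # Sketch of cost-driven page placement across memory controllers.
    def cost(mc, w_load=1.0, w_dist=0.5, w_rb=2.0):
        return (w_load * mc["queue_len"]          # queuing delay proxy
                + w_dist * mc["hops"]             # on-chip distance to the MC
                - w_rb * mc["row_hit_rate"])      # reward expected locality

    controllers = [
        {"id": 0, "queue_len": 12, "hops": 1, "row_hit_rate": 0.3},
        {"id": 1, "queue_len": 4,  "hops": 3, "row_hit_rate": 0.6},
    ]
    best = min(controllers, key=cost)
    print(f"Place page at MC {best['id']}")   # MC 1: lightly loaded, good locality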

28 Reducing Overfetch
Key idea: eliminate overfetch by employing smaller arrays and activating a single array in a single chip: Single Subarray Access (SSA), ISCA'10
Positive effects:
- Minimizes activation energy
- Small activation footprint: more arrays can sleep for longer
- Enables higher parallelism and reduces queuing delays
Negative effects:
- Longer transfer time
- Drop in density
- No row buffer hits
- Vulnerable to chip failure
- Requires a change to standards
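
A minimal sketch of the contrast, counting bits activated per request rather than energy; the geometry numbers are illustrative placeholders:

    # Bits activated per 64B cache line request, conventional vs. SSA.
    CHIPS_PER_RANK = 8
    ROW_BITS_PER_CHIP = 8 * 1024       # 8 Kb row per chip -> 8 KB per rank
    LINE_BITS = 64 * 8

    conventional = CHIPS_PER_RANK * ROW_BITS_PER_CHIP  # rank-wide row activation
    ssa = LINE_BITS                    # one line-sized subarray row in one chip
    print(f"Conventional: {conventional} bits activated per request")
    print(f"SSA:          {ssa} bits activated per request")
    print(f"Activation reduction: {conventional // ssa}x")

Activation is only one component of dynamic energy, which is why the measured savings on the next slide are smaller than this raw activation ratio.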

29 Energy Results
- Dynamic energy reduction of 6x
- In some cases, a 3x reduction in leakage

30 Performance Results
SSA performs better on half the programs (the memory-intensive ones)

31 Support for Reliability
[Figure: data rows with a checksum per row in each DRAM chip, plus a dedicated parity DRAM chip]
- A checksum per row allows low-cost error detection
- A second-tier error-correction scheme, based on RAID, can be built on top
- Reads: a single array read
- Writes: two array reads and two array writes (to update both data and parity)
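
The two-reads-two-writes cost on writes is the classic RAID-style read-modify-write of parity. A minimal Python sketch, assuming simple XOR parity across chips (an assumption about the exact scheme):

    # RAID-style parity update across DRAM chips (XOR parity assumed).
    # A write costs 2 reads + 2 writes: old data, old parity, new data, new parity.
    data_chips = [0b1010, 0b0110, 0b1111]          # same row in three data chips
    parity = data_chips[0] ^ data_chips[1] ^ data_chips[2]

    def write(chip, new_value):
        global parity
        old = data_chips[chip]                     # read 1: old data
        old_parity = parity                        # read 2: old parity
        data_chips[chip] = new_value               # write 1: new data
        parity = old_parity ^ old ^ new_value      # write 2: new parity

    write(1, 0b0001)
    # Recover chip 1 from the others plus parity, as after a chip failure:
    recovered = data_chips[0] ^ data_chips[2] ^ parity
    print(recovered == data_chips[1])              # True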

32 Capacity and Bandwidth
Silicon photonics can break the pin barrier at the processor, but several concerns arise at the DIMM:
- Breaking the DRAM pin barrier will impact cost!
- High capacity → daisy-chaining and loss of power
- Photonics has high static power; it needs high utilization
- Scheduling for large capacities

33 Exploiting 3D Stacks (ISCA'11)
[Figure: processor with memory controller connected by a waveguide to DIMMs of 3D-stacked DRAM chips, each stack sitting on an interface die with a stack controller]
- An interface die handles photonic penetration
- Does not impact DRAM die design
- Few photonic hops; high utilization
- The interface die schedules low-level operations

34 Packet-Based Scheduling Protocol
High capacity → high scheduling complexity, so move to a packet-based interface:
- The processor issues an address request
- The processor reserves a slot for the data return
- Scheduling minutiae are handled by the stack controller
- Data is returned at the correct time
- A back-up slot covers the case where the deadline is not met
Benefits: better plug'n'play, reduced complexity at the processor, and support for heterogeneity
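
A minimal Python sketch of the slot-reservation idea; the timing constants and the back-up-slot policy below are illustrative assumptions, not the ISCA'11 protocol:

    # Packet-based scheduling with a reserved return slot and a back-up slot.
    NOMINAL_LATENCY = 40      # cycles the processor allows for a stack access (assumed)
    BACKUP_OFFSET = 24        # extra cycles until the back-up return slot (assumed)

    def issue_request(now, stack_latency):
        """Processor reserves data-return slots; the stack controller picks one."""
        primary = now + NOMINAL_LATENCY
        backup = primary + BACKUP_OFFSET
        done = now + stack_latency         # when the stack actually has the data
        return primary if done <= primary else backup

    print(issue_request(now=0, stack_latency=35))   # 40: meets the primary slot
    print(issue_request(now=0, stack_latency=55))   # 64: falls back to the back-up slot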

35 Summary
- Treat reliability as a first-order constraint
- Photonics can break the pin barrier without disrupting memory chip design: it boosts bandwidth and capacity!
- Memory chip energy can be reduced by cutting overfetch and with better row buffer management

36 Acknowledgments
- The terrific students in the Utah Arch group
- Prof. Al Davis (Utah) and collaborators at HP, Intel, and IBM
- Funding from NSF, Intel, HP, and the University of Utah