1 Efficient Data Access in Future Memory Hierarchies
Rajeev Balasubramonian
School of Computing
Research Buffet, Fall 2010

2 Acks
• Terrific students in the Utah Arch group
• Collaborators at HP, Intel, IBM
• Prof. Al Davis, who re-introduced us to memory systems

3 Current Trends
• Continued device scaling
• Multi-core processors
• The power wall
• Pin limitations
• Problematic interconnects
• Need for high reliability

4 Anatomy of Future High-Perf Processors
[Figure: a hi-perf core and a lo-perf core]
• Core designs are well understood
• Combo of hi- and lo-perf: risky Ph.D.!!

5 Anatomy of Future High-Perf Processors
[Figure: a core with its L2/L3 cache bank]
• Large shared L3, partitioned into many banks
• Assume one bank per core

6 Anatomy of Future High-Perf Processors
[Figure: a 4x4 grid of cores (C) and cache banks ($)]
• Many cores!
• Large distributed L3 cache

7 Anatomy of Future High-Perf Processors
[Figure: the grid of cores and cache banks joined by an on-chip network]
• On-chip network includes routers and long wires
• Used for cache coherence
• Used for off-chip requests/responses

8 Anatomy of Future High-Perf Processors
[Figure: the core/cache grid with a memory controller attached to a DIMM]
• Memory controller handles off-chip requests to memory

9 Anatomy of Future High-Perf Processors
[Figure: multi-socket motherboard; each CPU has its own DIMMs]
• Multi-socket motherboard
• Lots of cores, lots of memory, all connected

10 Anatomy of Future High-Perf Processors
[Figure: CPUs and DIMMs backed by non-volatile PCM, which is backed by disk]
• DRAM backed up by slower, higher-capacity emerging non-volatile memories
• Eventually backed up by disk… maybe

11 Life of a Cache Miss
[Figure: the path of a miss through the hierarchy]
Miss in L1 → on-chip network → look up L2/L3 bank → on- and off-chip network → wait in MC queue → access DRAM → access PCM → access disk
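To make the walk concrete, here is a minimal back-of-the-envelope sketch in Python; every latency number is an illustrative assumption, not a figure from the talk.

```python
# Back-of-the-envelope walk of a cache miss through the hierarchy.
# All latencies (in ns) are illustrative assumptions.
STAGES = [
    ("L1 lookup (miss)",      1),
    ("on-chip network",       5),
    ("L2/L3 bank lookup",    10),
    ("on-/off-chip network", 10),
    ("MC queueing",          30),
    ("DRAM access",          50),
]

def miss_latency(stages=STAGES):
    """Accumulate latency along the path, assuming DRAM services the miss."""
    total = 0
    for name, ns in stages:
        total += ns
        print(f"{name:22s} +{ns:3d} ns  (cumulative {total} ns)")
    return total

miss_latency()   # queueing and DRAM dominate the 106 ns total
```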

12 Research Topics
[Figure: the cache-miss path from the previous slide, with some stages labeled "Not very hot topics!"]

13 Research Topics
[Figure: the cache-miss path repeated, highlighting the stages studied in this talk]

14 Problems with DRAM
• DRAM main memory contributes one-third of total energy in datacenters
• Long latencies; high bandwidth needs
• Error resilience is very expensive
• DRAM is a commodity; chips have to be compliant with standards
  • Initial designs instituted in the 1990s
  • Innovations are evolutionary
  • Traditional focus on density

15 Time for a Revolutionary Change?
• Energy is far more critical today
• Cost-per-bit perhaps not as relevant today
• Memory reliability is increasing in importance
• Multi-core access streams have poor locality
• Queuing delays are starting to dominate
• Potential migration to new memory technologies and interconnects

16 Key Idea
[Figure: incandescent vs. energy-efficient light bulbs, trading purchase cost ($) against operating cost (W)]
• Incandescent light bulb: low purchase cost, high operating cost (commodity)
• Energy-efficient light bulb: higher purchase cost, much lower operating cost (value-addition)
• It's worth a small increase in capital costs to gain large reductions in operating costs
• And not 10X, just 15-20%!
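The same arithmetic, sketched in Python; every dollar figure is hypothetical, chosen only to show the shape of the trade-off.

```python
# Illustrative total-cost-of-ownership arithmetic for the analogy.
# All dollar figures are hypothetical.
def tco(purchase, operating_per_year, years=4):
    return purchase + operating_per_year * years

commodity   = tco(purchase=100, operating_per_year=50)  # cheap to buy, costly to run
value_added = tco(purchase=118, operating_per_year=25)  # ~18% more capital cost

print(f"commodity TCO:   ${commodity}")    # $300
print(f"value-added TCO: ${value_added}")  # $218
```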

17 DRAM Basics
[Figure: on-chip memory controller → memory bus or channel → DIMM → rank → DRAM chip or device → bank → array; each chip supplies 1/8th of the row buffer and one word of data output]

18 DRAM Operation
[Figure: one bank shown in each DRAM chip; RAS brings a row into the row buffer, CAS selects the cache line from it]
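A sketch of the address split that drives RAS and CAS; the bit-field widths below are assumptions for one plausible mapping, since the real split depends on the DIMM geometry.

```python
# How a physical address might split into DRAM coordinates.
# Field widths are illustrative; actual mappings vary by DIMM.
ROW_BITS, BANK_BITS, COL_BITS, OFFSET_BITS = 15, 3, 10, 6

def decode(addr):
    """Split a physical address into (row, bank, column) within one rank."""
    addr >>= OFFSET_BITS                        # drop the byte-in-line offset
    col = addr & ((1 << COL_BITS) - 1); addr >>= COL_BITS
    bank = addr & ((1 << BANK_BITS) - 1); addr >>= BANK_BITS
    row = addr & ((1 << ROW_BITS) - 1)
    return row, bank, col

row, bank, col = decode(0x3F2A1C40)
print(f"RAS -> row {row} of bank {bank}; CAS -> column {col}")
```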

19 New Design Philosophy
• Eliminate overfetch; activate a single chip and a single small array → much lower energy, slightly higher cost
• Provide higher parallelism
• Add features for error detection
[ Appears in ISCA'10 paper, Udipi et al. ]

20 Single Subarray Access (SSA)
[Figure: the memory controller drives an addr/cmd bus and a data bus to the DIMM; inside one DRAM chip, a bank is divided into subarrays, each with its own bitlines and row buffer, and one subarray delivers the full 64-byte line over the global interconnect to the chip's I/O]

21 SSA Operation
[Figure: the address selects one subarray in a single DRAM chip, which supplies the entire cache line; the remaining chips stay in sleep mode or serve other parallel accesses]
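A minimal sketch of the contrast with a conventional access; the chip count and the line-to-chip mapping are illustrative assumptions, not the paper's exact parameters.

```python
# Conventional access vs. Single Subarray Access (SSA), schematically.
CHIPS_PER_RANK = 8

def conventional_access(line_addr):
    """All 8 chips activate a large row; each supplies 1/8th of the line."""
    return [(chip, "activate row, read 1/8 of line")
            for chip in range(CHIPS_PER_RANK)]

def ssa_access(line_addr):
    """One chip activates one small subarray and supplies the whole line;
    the other chips can sleep or serve other requests in parallel."""
    chip = line_addr % CHIPS_PER_RANK   # assumed line-to-chip mapping
    return [(chip, "activate subarray, read full 64B line")]

print(len(conventional_access(42)), "chips busy per conventional access")
print(len(ssa_access(42)), "chip busy per SSA access")
```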

22 Consequences
Why is this good?
  • Minimal activation energy for a line
  • More circuits can be placed in low-power sleep
  • Can perform multiple operations in parallel
Why is this bad?
  • Higher area and cost (roughly 5 – 10%)
  • Longer data transfer time
  • Not compatible with today's standards
  • No opportunity for row buffer hits

23 Narrow vs. Wide Buses?
• What provides higher utilization: 1 wide bus or 8 narrow buses?
• Must worry about load imbalance and long data transfer time in the latter
• Must worry about many bubbles in the former because of dependences
[Figure: timelines of back-to-back requests to the same bank; during each bank access, the single 64-bit bus sits with 64 bits idle, while the eight 8-bit buses leave only 8 bits idle]
[ Ongoing work, Chatterjee et al. ]
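A toy model of the trade-off; it captures only transfer serialization and load balance (no bank conflicts or dependence bubbles), and every cycle count is an assumption.

```python
# One 64-bit bus vs. eight 8-bit buses, schematically.
ACCESS = 30   # bank access time (cycles), assumed
LINE = 64     # bytes per cache line

def finish_time(n_reqs, n_buses, bytes_per_cycle):
    """Round-robin requests across buses; each needs a bank access,
    then a transfer whose length depends on the bus width."""
    dt = LINE // bytes_per_cycle
    free = [0] * n_buses                  # cycle at which each bus frees up
    for i in range(n_reqs):
        b = i % n_buses
        free[b] = max(free[b], ACCESS) + dt
    return max(free)

for n in (16, 9):
    wide = finish_time(n, 1, 8)     # one 64-bit bus: 8 bytes/cycle
    narrow = finish_time(n, 8, 1)   # eight 8-bit buses: 1 byte/cycle each
    print(f"{n} requests: wide bus {wide} cycles, narrow buses {narrow} cycles")
```

With a perfectly balanced load (16 requests) the aggregate bandwidth is identical and the two organizations tie; with an unbalanced load (9 requests) the narrow buses lose because each line's transfer takes 8x longer.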

24 Methodology
• Tested with simulators (Simics) and multi-threaded benchmark suites
• 8-core simulations
• DRAM energy and latency from Micron datasheets

25 Results – Energy
• Moving to a close-page policy: 73% energy increase on average
• Compared to open page: 3X reduction with SBA, 6.4X with SSA

26 Results – Energy – Breakdown

27 Results – Performance
• Serialization/queuing delay balance in SSA: 30% decrease (6/12) or 40% increase (6/12)

28 Results – Performance – Breakdown

29 Error Resilience in DRAM
• Important to withstand not only a single error, but also an entire chip failure – referred to as chipkill
• DRAM chips do not include error correction features – error tolerance must be built on top
• Example: 8-bit ECC code for a 64-bit word; for chipkill correctness, each of the 72 bits must be read from a separate DRAM chip → significant overfetch!
[Figure: a 72-bit word (64 data + 8 ECC bits) on every bus transfer, one bit per chip]
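The overfetch can be quantified with a few lines of arithmetic; the 8KB row size per chip is an illustrative assumption.

```python
# Overfetch arithmetic for classic one-bit-per-chip chipkill.
LINE_BITS = 64 * 8               # 64-byte cache line
WORD_DATA, WORD_ECC = 64, 8      # bits per bus transfer
CHIPS = WORD_DATA + WORD_ECC     # one bit per chip -> 72 chips
ROW_BYTES = 8 * 1024             # assumed row size per chip

beats = LINE_BITS // WORD_DATA          # 8 transfers per line
useful = 64 + beats * WORD_ECC // 8     # 64B data + 8B of ECC
activated = CHIPS * ROW_BYTES           # every chip opens a full row

print(f"{activated} bytes activated to deliver {useful} useful bytes")
print(f"-> {activated // useful}x overfetch")
```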

30 Two-Tiered Solution
• Add a small (8-bit) checksum for every cache line
• Maintain one extra DRAM chip for parity across 8 DRAM chips
• When the checksum flags an error, use the parity to re-construct the corrupted cache line
• Writes will require updates to parity as well
[Figure: data chips 0 through 7 plus one parity chip]
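A minimal sketch of the two tiers, assuming a toy 8-bit additive checksum (the real scheme's checksum is not specified here) and a line striped one 8-byte slice per chip.

```python
from functools import reduce

def checksum(line: bytes) -> int:
    """Toy 8-bit checksum stored alongside each cache line (tier 1)."""
    return sum(line) & 0xFF

def parity(lines):
    """Byte-wise XOR across chips, stored on the 9th chip (tier 2)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*lines))

chips = [bytes([i] * 8) for i in range(8)]  # 8 data chips, one slice each
p = parity(chips)

stored = checksum(chips[3])
chips[3] = bytes([0xFF] * 8)                # chip 3's data gets corrupted
assert checksum(chips[3]) != stored         # tier 1: checksum flags the error

# Tier 2: rebuild chip 3 from the parity chip and the surviving chips.
rebuilt = parity([p] + [c for i, c in enumerate(chips) if i != 3])
print(rebuilt == bytes([3] * 8))            # True
```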

31 Research Topics
[Figure: the cache-miss path repeated, moving on to the remaining topics]

32 Topic 2 – On-chip Networks
• Conventional wisdom: buses are not scalable; need routed packet-switched networks
• But routed networks require bulky, energy-hungry routers
• Results:
  • Buses can be made scalable by having a hierarchy of buses and Bloom filters to stifle broadcasts (see the sketch below)
  • Low-swing buses can yield low energy, simpler coherence, and scalable operation
[ Appears in HPCA'10 paper, Udipi et al. ]
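A minimal sketch of the broadcast-stifling idea: each bus segment keeps a Bloom filter of the line addresses it might cache, so a snoop is forwarded only to segments whose filter says "maybe". Filter sizes and hash choices are illustrative, not those of the HPCA'10 design.

```python
class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = [False] * bits

    def _positions(self, addr):
        return [hash((addr, i)) % self.bits for i in range(self.hashes)]

    def insert(self, addr):
        for pos in self._positions(addr):
            self.array[pos] = True

    def may_contain(self, addr):
        return all(self.array[pos] for pos in self._positions(addr))

segments = [BloomFilter() for _ in range(4)]  # four bus segments
segments[2].insert(0xABCD40)                  # line cached in segment 2

snoop = 0xABCD40
targets = [i for i, f in enumerate(segments) if f.may_contain(snoop)]
print("forward snoop to segments:", targets)  # includes 2; others stifled
```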

33 Topic 3: Data Placement in Caches, Memory
In a large distributed NUCA cache, or in a large distributed NUMA memory, data placement must be controlled with heuristics that are aware of (a cost-function sketch follows the list):
  • capacity pressure in the cache bank
  • distance between CPU and cache bank and DIMM
  • queuing delays at memory controller
  • potential for row buffer conflicts at each DIMM
[ Appears in HPCA'09, Awasthi et al. and in PACT'10, Awasthi et al. (Best paper!) ]
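A hypothetical cost function combining the four criteria; the linear form, the weights, and the Bank fields are illustrative assumptions, not the papers' actual heuristics.

```python
from dataclasses import dataclass

@dataclass
class Bank:
    capacity_pressure: float     # 0..1: how full the bank is
    hops: int                    # network distance from the requesting core
    mc_queue_delay: float        # queueing delay at its memory controller
    row_buffer_conflicts: float  # 0..1: likelihood of row-buffer conflicts

def placement_cost(b, w=(1.0, 0.5, 0.8, 0.6)):
    """Weighted sum of the four placement criteria (weights assumed)."""
    return (w[0] * b.capacity_pressure + w[1] * b.hops
            + w[2] * b.mc_queue_delay + w[3] * b.row_buffer_conflicts)

banks = [Bank(0.9, 1, 2.0, 0.7),   # nearby but congested
         Bank(0.3, 3, 0.5, 0.2)]   # farther but lightly loaded
best = min(range(len(banks)), key=lambda i: placement_cost(banks[i]))
print("place page in bank", best)  # 1: distance loses to congestion
```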

34 Topic 4: Silicon Photonic Interconnects
• Silicon photonics can provide abundant bandwidth and makes sense for off-chip communication
• Can help multi-cores overcome the bandwidth problem
• Problems: DRAM design that best interfaces with silicon photonics, protocols that allow scalable operation
[ On-going work, Udipi et al. ]

35 Topic 5: Memory Controller Design
• Problem: abysmal row buffer hit rates, quality of service
• Solutions (a scheduler sketch follows the list):
  • Co-locate hot cache lines in the same page
  • Predict row buffer usage and “prefetch” row closure
  • QoS policies that leverage multiple memory “knobs”
[ Appears in ASPLOS'10 paper, Sudan et al. and on-going work, Awasthi et al., Sudan et al. ]
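For context, here is a sketch of the first-ready FCFS (FR-FCFS) baseline that row-buffer-aware schedulers build on: among queued requests, prefer one that hits the currently open row, else issue the oldest. The queue contents are hypothetical.

```python
from collections import deque

def schedule(queue, open_row):
    """queue: (bank, row) requests in arrival order.
    open_row: bank -> currently open row."""
    for req in queue:                    # oldest-first scan
        bank, row = req
        if open_row.get(bank) == row:    # row-buffer hit: issue it first
            queue.remove(req)
            return req
    return queue.popleft()               # no hit: oldest request wins

q = deque([(0, 7), (1, 3), (0, 4)])
print(schedule(q, {0: 4}))               # (0, 4): the row hit jumps the queue
```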

36 Topic 6: Non-Volatile Memories
• Emerging memories (PCM):
  • can provide higher densities at smaller feature sizes
  • are based on resistance, not charge (hence non-volatile)
  • can serve as replacements for DRAM/Flash/disk
• Problem: when a cell is programmed to a given resistance, the resistance tends to drift over time → may require efficient refresh or error correction (see the sketch below)
[ On-going work, Awasthi et al. ]
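Drift is commonly modeled as a power law, R(t) = R0 · (t/t0)^ν; the exponent, the initial resistance, and the level boundary below are illustrative values, not measurements.

```python
# Power-law model of PCM resistance drift.
def resistance(r0, t, t0=1.0, nu=0.1):
    """R(t) = R0 * (t/t0)**nu; nu is an assumed drift exponent."""
    return r0 * (t / t0) ** nu

LEVEL_BOUNDARY = 2.0e4   # hypothetical threshold between two stored levels
r0 = 1.5e4               # cell programmed just below the boundary

for t in (1.0, 1e3, 1e6):          # seconds since programming
    r = resistance(r0, t)
    print(f"t={t:8.0e}s  R={r:9.0f} ohm  misread={r > LEVEL_BOUNDARY}")
```

Without periodic scrubbing (re-programming the cell) or drift-tolerant error correction, the resistance eventually crosses the level boundary and the stored value is misread.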
