Combining Memory and a Controller with Photonics through 3D-Stacking to Enable Scalable and Energy-Efficient Systems
Aniruddha N. Udipi, Naveen Muralimanohar*, Rajeev Balasubramonian, Al Davis, Norm Jouppi*
University of Utah and *HP Labs

Memory Trends - I
- Multi-socket, multi-core, multi-thread
- High bandwidth requirement: 1 TB/s by 2017
- Edge-bandwidth bottleneck: pin count, per-pin bandwidth, signal integrity, and off-chip power
- Limited number of DIMMs, without melting the system (or setting it up in the tundra!)

Memory Trends - II
- The job of the memory controller is hard: 18+ timing parameters for DRAM!
- Maintenance operations: refresh, scrub, power-down, etc.
- Several DIMM and controller variants make interoperability hard to provide
- Processor-side support is needed for every new memory feature
- Now throw in heterogeneity: memristors, PCM, STT-RAM, etc.

Improving the Interface
The memory interface is under severe pressure. Two proposals:
1. Memory interconnect – efficient application of silicon photonics, without modifying DRAM dies
2. Communication protocol – a streamlined slot-based interface

PART 1 – Memory Interconnect

Silicon Photonic Interconnects
- We need something that can break the edge-bandwidth bottleneck
- Ring-modulator-based photonics: off-chip laser source; indirect modulation using resonant rings; relatively cheap coupling on- and off-chip (Source: Xu et al., Optics Express 16(6), 2008)
- DWDM for high bandwidth density: as many as 67 wavelengths possible, limited by the free spectral range and coupling losses between rings
- DWDM gives 64 λ × 10 Gbps/λ = 80 GB/s per waveguide (checked in the sketch below)
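The per-waveguide figure follows directly from the DWDM parameters above; a minimal arithmetic check, using only the numbers on this slide:

```python
# DWDM bandwidth per waveguide, using the numbers from this slide.
wavelengths = 64        # DWDM channels per waveguide
rate_gbps = 10          # signaling rate per wavelength, in Gbps

total_gbps = wavelengths * rate_gbps    # 640 Gbps of raw bandwidth
total_gbytes = total_gbps / 8           # 80 GB/s per waveguide
print(f"{total_gbps} Gbps = {total_gbytes:.0f} GB/s per waveguide")
```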

Static Photonic Energy
- Photonic interconnects: large static power dissipation (ring tuning), but much lower dynamic energy consumption that is relatively independent of distance
- Electrical interconnects: relatively small static power dissipation, but large dynamic energy consumption
- Takeaway: do not over-provision photonic bandwidth; use it only where necessary (see the model below)
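To make the trade-off concrete, here is a minimal energy-per-bit model. All power and energy numbers below are illustrative assumptions, not values from the paper; the point is only the shape of the trade-off: photonics wins at high utilization, electrical wins at low utilization.

```python
# Illustrative static-vs-dynamic energy model (all numbers are assumed).
def energy_per_bit(static_power_w, used_bits_per_s, dynamic_j_per_bit):
    """Static power amortized over actual traffic + per-bit dynamic energy."""
    return static_power_w / used_bits_per_s + dynamic_j_per_bit

LINK_BPS = 640e9  # the 80 GB/s link from the previous slide

def photonic(util):    # large static (ring tuning), tiny dynamic
    return energy_per_bit(0.5, util * LINK_BPS, 0.1e-12)

def electrical(util):  # small static, large dynamic (full-swing)
    return energy_per_bit(0.05, util * LINK_BPS, 2e-12)

for util in (0.05, 0.25, 1.0):
    print(f"util={util:4.0%}: photonic={photonic(util)*1e12:6.2f} pJ/bit, "
          f"electrical={electrical(util)*1e12:5.2f} pJ/bit")
```

At 5% utilization the photonic link's tuning power dominates and electrical wins; at full utilization the relationship flips, which is exactly why the slide warns against over-provisioning photonic bandwidth.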

The Questions We’re Trying to Answer
- What should the role of electrical signaling be?
- How do we make photonics less invasive to memory die design?
- Should we replace all interconnects with photonics? On-chip too?
- What should the role of 3D be in an optically connected memory?
- Should we be designing photonic DRAM dies? Stacks? Channels?

Contributions Beyond Prior Work
- Beamer et al. (ISCA 2010): the first paper on fully integrated optical memory; studied the electrical-optical balance point; focused on losses and proposed photonic power guiding
- We build upon this, focusing on ring-tuning power constraints, the effect of low-swing wires, and the effect of 3D stacking and daisy-chaining

Energy Balance Within a DRAM Chip
(Figure: photonic energy vs. electrical energy within the chip.)

Single Die Design
- One photonic DRAM die, evaluated with full-swing and with low-swing on-chip wires
- 46% energy reduction going from the best full-swing configuration (4 stops, similar to state-of-the-art designs in prior work) to the best low-swing configuration (1 stop)
- This argues for a specially designed photonic DRAM
- More efficient on-chip electrical communication provides the added benefit of allowing fewer photonic resources

3D Stacking Imminent for Capacity
- Simply stack photonic dies? Vertical coupling and hierarchical power guiding were suggested by prior work; this is our baseline design
- But there are now more photonic rings in the channel, with exactly the same number active as before
- The energy-optimal point shifts towards fewer "stops": a single set of rings becomes optimal
- Result: 2.4x the energy consumption for 8x the capacity
- (Configuration: 8 optimally designed photonic DRAM dies, modeled on published papers from memory manufacturers)

Key Idea – Exploiting TSVs
- Move all photonic components to a separate interface die, shared by several memory dies
- Photonics off-chip only; TSVs for inter-die communication; efficient low-swing wires on-die
- Best of both worlds: high bandwidth and low static energy
- (Replaces 8 optimally designed photonic DRAM dies with 8 commodity DRAM dies plus a single photonic interface die)

Proposed Design – Photonic Interface Die
- ADVANTAGE 1: increased activity factor, more efficient use of photonics
- ADVANTAGE 2: rings are co-located, so they are easier to isolate or tune thermally
- ADVANTAGE 3: not disruptive to the design of commodity memory dies
(Figure: processor and memory controller connected by a waveguide to DIMMs of DRAM chips stacked on a photonic interface die.)

Energy Characteristics
(Figures: a single die on the channel, and four 8-die stacks on the channel.)
- Static energy trumps distance-independent dynamic energy

Final System
- 23% reduced energy consumption
- 4x capacity per channel
- Potential for performance improvements due to the increased bank count
- Less disruptive to memory die design
- But it makes the job of the memory controller difficult! (addressed in Part 2)

PART 2 – Communication Protocol

The Scalability Problem
- Large capacity, high bandwidth, and evolving technology trends will increase pressure on the memory interface
- Processor-side support is required for every memory innovation
- Current micro-management requires several signals, putting heavy pressure on the address/command bus
- This gets worse with several independent banks and large amounts of state

Proposed Solution
- Release the memory controller’s tight control; make the memory stack more autonomous
- Move mundane tasks to the interface die:
  - Maintenance operations (refresh, scrub, etc.)
  - Routine operations (DRAM precharge, NVM wear leveling)
  - Timing control (18+ constraints for DRAM alone)
  - Coding and any other special requirements

What Would It Take to Do This?
- "Back-pressure" from the memory
- But a "free-for-all" would be inefficient; explicit arbitration is needed
- Novel slot-based interface: the memory controller retains control over the data bus; the memory module only needs an address and returns data

Memory Access Operation
(Timeline figure: a request arrives and is issued; after the memory latency ML elapses, the controller starts looking for the first free slot S1 and also reserves a backup slot S2; 'x' marks slots that are already reserved.)
- Slot: cache-line occupancy of the data bus
- X: a reserved slot
- ML: memory latency = address latency + bank access + data bus latency
A sketch of this reservation scheme follows.
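A minimal sketch of the slot reservation just described. The latency values, the backup-slot gap, and all names are illustrative assumptions; the paper’s actual policy may differ.

```python
# Sketch of slot reservation on the shared data bus (values assumed).
ADDR_LATENCY, BANK_ACCESS, DATA_BUS_LATENCY = 2, 20, 4   # cycles
ML = ADDR_LATENCY + BANK_ACCESS + DATA_BUS_LATENCY       # memory latency

def reserve_slots(issue_cycle, reserved, backup_gap=8):
    """Reserve the first free slot S1 no earlier than issue + ML, plus a
    backup slot S2 far enough out to be reusable if it goes unneeded."""
    s1 = issue_cycle + ML
    while s1 in reserved:        # skip 'x' marks: slots already claimed
        s1 += 1
    reserved.add(s1)
    s2 = s1 + backup_gap
    while s2 in reserved:
        s2 += 1
    reserved.add(s2)
    return s1, s2

reserved = {28, 29, 31}                                  # prior reservations
print(reserve_slots(issue_cycle=2, reserved=reserved))   # -> (30, 38)
```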

Advantages
- Plug and play: everything is interchangeable and interoperable; only interface-die support is required (it communicates ML)
- Better support for heterogeneous systems: easier DRAM-NVM data movement on the same channel
- More innovation in the memory system, free of processor-side support constraints
- Fewer commands between processor and memory: energy and performance advantages

Target System and Methodology
- Terascale memory node in an exascale system: 1 TB of memory and 1 TB/s of bandwidth (a very aggressive design target)
- Assuming 80 GB/s per channel, we need 16 channels with 64 GB per channel: 2 GB dies × 8 dies per stack × 4 stacks per channel (the arithmetic is checked below)
- We focus on the design of a single channel
- In-house DRAM simulator + SIMICS; PARSEC, STREAM, and synthetic random traffic
- Maximum traffic load used, just below channel saturation
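A quick check of the sizing, using only the numbers on this slide:

```python
# Channel sizing for the 1 TB / 1 TB/s node (numbers from this slide).
target_capacity_gb = 1024                   # 1 TB of memory
per_die_gb, dies_per_stack, stacks = 2, 8, 4
per_channel_bw_gbs = 80                     # GB/s per photonic channel

per_channel_gb = per_die_gb * dies_per_stack * stacks      # 64 GB
channels = target_capacity_gb // per_channel_gb            # 16 channels
aggregate_bw = channels * per_channel_bw_gbs               # 1280 GB/s
print(f"{channels} channels x {per_channel_gb} GB -> "
      f"{aggregate_bw} GB/s aggregate (>= 1 TB/s target)")
```

Note that the channel count is set by the capacity target (1 TB / 64 GB = 16); the resulting 1.28 TB/s of aggregate bandwidth then comfortably covers the 1 TB/s bandwidth target.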

Performance Impact – Synthetic Traffic
- Less than 9% latency impact, even at maximum load
- Virtually no impact on achieved bandwidth

Performance Impact – PARSEC/STREAM
- These applications have very low bandwidth requirements
- A scaled-down system shows similar trends

Tying it together – The Interface Die

Summary of Design
The proposed 3D-stacked interface die has two major functions:
- It holds the photonic devices for electrical-optical-electrical conversion
  - Photonics only on the busy shared bus between this die and the processor
  - Intra-memory communication is all-electrical, exploiting TSVs and low-swing wires
- It holds the device controller logic
  - Handles all mundane/routine tasks for the memory devices: refresh, scrub, coding, timing constraints, sleep modes, etc.
  - The processor-side controller deals with more important functions such as scheduling and channel arbitration
  - Simple speculative slot-based interface

Key Contributions
- Efficient application of photonics: 23% lower energy; 4x capacity with potential for performance improvements; minimally disruptive to memory die design; a single memory die design works for both photonic and electrical systems
- Streamlined memory interface: more interoperability and flexibility; innovation without processor-side changes; support for heterogeneous memory

Backup Slides

Laser Power Calculation
- The detectors need to receive some minimum photonic power to reliably determine 0/1; this depends on their "sensitivity"
- Going from source to destination, there are several points of power loss: the waveguide, the rings, splitters, couplers, etc.
- Work backwards to determine the total input laser power required (a sketch follows)
- There are also concerns about "non-linearity" when the total path loss exceeds a certain amount; rule of thumb: ~20 dB
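A minimal sketch of the backwards calculation. The sensitivity and per-component loss values are illustrative placeholders, not the paper’s numbers:

```python
# Work backwards from detector sensitivity to required laser power.
detector_sensitivity_dbm = -20.0     # minimum power to resolve 0/1 (assumed)
losses_db = {                        # per-component path losses (assumed)
    "coupler": 1.0,
    "waveguide": 3.0,
    "ring pass-bys": 2.5,
    "splitter": 1.5,
}
path_loss_db = sum(losses_db.values())

# In dB arithmetic, required laser power = sensitivity + total path loss.
laser_power_dbm = detector_sensitivity_dbm + path_loss_db
print(f"path loss {path_loss_db:.1f} dB -> laser output {laser_power_dbm:.1f} dBm")

# Rule-of-thumb non-linearity check from the slide: keep path loss under ~20 dB.
assert path_loss_db < 20, "total path loss exceeds the ~20 dB rule of thumb"
```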

What if trimming only costs 50 µW/ring?
(Figures: full-swing vs. low-swing on-chip wires.)

Concentrated vs. Distributed
- Energy considerations: more electrical traversal if rings are concentrated in one location; photonic energy is the same either way
- Bandwidth considerations: the entire bandwidth is used for a single cache line in the concentrated design, giving a smaller serialization delay, lower queuing delay, and reduced overall memory access latency
- Going from 1 to 8 stops: a 67% latency increase for a 12% energy reduction; not worth it!
- Distributing the rings but striping cache lines across several arrays is one option, but it would increase overfetch

Thermal Impact
- Unlike prior work, this is not memory stacked on a processor: there is only relatively cool DRAM, with no super-hot layer at the bottom, so thermal issues are not a big concern
- Thermal simulations with HotSpot 5.0 show less than a 0.1 K temperature rise due to the additional activity on the interface die

Intermediate Design – Single Stack
- The decision on the extent of photonic penetration is very similar to the single-die system, except for the addition of TSVs (which came virtually for free in the fully photonic 3D design)
- But absolute energy comes down dramatically due to the elimination of idling rings: 48% less energy than simply stacking 8 photonic dies together
- (Configuration: 8 commodity DRAM dies plus a single photonic interface die)

Handling the Uncommon Case
- The memory controller (by design) no longer tracks the minutiae of per-bank state, so data may not be available at the reserved "slot" due to bank conflicts, refresh, or low-power wakeup
- Speculatively reserve a second slot farther away; Slot-2 is far enough out that it can be reused by a subsequent request in the common case
- The uncommon case suffers a latency penalty based on the location of Slot-2
- Beyond Slot-2, requests simply have to be retried (see the sketch below)
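Continuing the illustrative slot sketch from the Memory Access Operation slide, the uncommon-case handling might look as follows; the exact retry policy is an assumption:

```python
# Sketch of delivery with a speculative backup slot (policy assumed).
def deliver(ready_cycle, s1, s2, reserved):
    """Return the slot the data actually uses, or None to force a retry."""
    if ready_cycle <= s1:
        reserved.discard(s2)     # common case: release S2 for reuse
        return s1
    if ready_cycle <= s2:
        return s2                # uncommon case: latency penalty up to S2
    reserved.discard(s2)         # missed both slots: request is retried
    return None

reserved = {30, 38}              # S1=30, S2=38 from the earlier sketch
print(deliver(ready_cycle=35, s1=30, s2=38, reserved=reserved))  # -> 38
```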

Optically Connected Memory
- Photonics is clearly useful off-chip: it breaks the pin barrier, improves socket-edge bandwidth, reduces energy consumption, and allows larger capacity
- Beyond this, what? How can we best apply photonics to the rest of the memory system?